In [12]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


from sklearn.datasets import fetch_california_housing
X,y = fetch_california_housing(return_X_y=True)


# Keep only the first 700 samples to shorten the run time of this example
X = X[:700]
y = y[:700]


In [2]:
X.shape

(700, 8)

In [3]:
y.shape

(700,)

In [4]:
X

array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
          37.88      , -122.23      ],
       [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
          37.86      , -122.22      ],
       [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
          37.85      , -122.24      ],
       ...,
       [   2.6464    ,   37.        ,    3.95469256, ...,    2.52750809,
          37.69      , -122.11      ],
       [   3.4643    ,   42.        ,    4.76375405, ...,    2.48543689,
          37.69      , -122.11      ],
       [   2.3973    ,   10.        ,    4.71822034, ...,    2.41525424,
          37.69      , -122.12      ]])

In [5]:
# Importing necessary libraries
from sklearn.linear_model import LinearRegression  # Linear Regression for regression tasks
from sklearn.tree import DecisionTreeRegressor  # Decision Tree Regressor for regression tasks
from sklearn.svm import SVR  # Support Vector Regressor for regression tasks
from sklearn.model_selection import cross_val_score  # For performing cross-validation

# Initializing the individual regression models
lr = LinearRegression()  # Linear Regression model (simple regression)
dt = DecisionTreeRegressor()  # Decision Tree Regressor model
svr = SVR()  # Support Vector Regressor model

# Creating a list of estimators, associating each model with a name
estimators = [('lr', lr), ('dt', dt), ('svr', svr)]  # List of tuples (model name, model)



### **Explanation**
1. **`from sklearn.linear_model import LinearRegression`**
   - Imports the **LinearRegression** model from the `sklearn` library.
   - **Linear Regression** is used for predicting a continuous target variable based on one or more independent variables (predictors).

2. **`from sklearn.tree import DecisionTreeRegressor`**
   - Imports the **DecisionTreeRegressor** model from the `sklearn` library.
   - **Decision Trees** are used for regression tasks where the model splits data into subsets to predict the target variable.

3. **`from sklearn.svm import SVR`**
   - Imports the **SVR (Support Vector Regressor)** from the `sklearn` library.
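   - **SVR** is used for regression tasks. It fits a function that keeps as many training points as possible within a tolerance band (`epsilon`) around its predictions and penalizes only the points that fall outside that band. With the default RBF kernel it can also model non-linear relationships.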

4. **`from sklearn.model_selection import cross_val_score`**
   - Imports the **cross_val_score** function, which evaluates a model by **cross-validation**: the data is split into folds, and the model is repeatedly trained on all but one fold and scored on the held-out fold.

5. **Model Initialization (`lr`, `dt`, `svr`)**
   - **`lr`**: Initializes a **Linear Regression** model.
   - **`dt`**: Initializes a **Decision Tree Regressor** model.
   - **`svr`**: Initializes a **Support Vector Regressor** model.

6. **Creating the `estimators` List**
   - The `estimators` list is a collection of tuples, each containing:
     - The model's name (as a string), e.g., `'lr'` for Linear Regression.
     - The model object itself, e.g., `lr` for the Linear Regression instance.
   - This allows you to loop through different models in a consistent way and evaluate their performance.

---

### **Purpose**
- **Sets up different regression models** (Linear Regression, Decision Tree, and Support Vector Regressor) to compare their performance on a regression task.
- Uses the **`estimators` list** to store these models with their names for easy iteration and evaluation.

---

### **Significance**
- **Multiple models** are initialized to explore which one works best for the given data and task.
- Using **cross-validation** in the next cell allows you to compare the performance of each model more reliably than a single train-test split would.
- **Different models** (Linear, Tree-based, and SVM) have distinct strengths, and testing them together can reveal which one is most suitable for the data.

In [6]:
# Looping through each estimator in the 'estimators' list
for estimator in estimators:
    
    # Performing 10-fold cross-validation using R² as the scoring metric
    scores = cross_val_score(estimator[1], X, y, scoring='r2', cv=10)  
    
    # Printing the name of the model and the mean R² score (rounded to 2 decimal places)
    print(estimator[0], np.round(np.mean(scores), 2))  


lr 0.35
dt 0.09
svr -0.62



### **Explanation**
1. **`for estimator in estimators:`**
   - Loops through each model in the `estimators` list, where each element is a tuple.
   - `estimator[0]`: The model's name (e.g., `'lr'`, `'dt'`, `'svr'`).
   - `estimator[1]`: The actual model object (e.g., `LinearRegression()`, `DecisionTreeRegressor()`, `SVR()`).

2. **`scores = cross_val_score(estimator[1], X, y, scoring='r2', cv=10)`**
   - **`cross_val_score`** performs **10-fold cross-validation** on the model (`estimator[1]`).
   - **`X`**: The features (input data).
   - **`y`**: The target (output variable).
   - **`scoring='r2'`**: The performance is evaluated using the **R² (coefficient of determination)** metric, which measures how well the model fits the data (the higher, the better).
   - **`cv=10`**: Splits the data into 10 folds. The model is trained 10 times, each time on 9 folds, and scored on the remaining held-out fold.

3. **`print(estimator[0], np.round(np.mean(scores), 2))`**
   - Prints the model name (`estimator[0]`) and the **mean R² score** (calculated as the average of the R² scores across all 10 folds).
   - **`np.round(np.mean(scores), 2)`**: Rounds the mean R² score to 2 decimal places for readability.
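
To make the steps above concrete, here is a minimal sketch of what `cross_val_score` does for a regressor, written as an explicit loop. It assumes synthetic data from `make_regression` as a stand-in for the housing data, and the names `X_demo`, `y_demo`, `manual_scores`, and `library_scores` are illustrative, not part of the notebook. For a regressor, an integer `cv` corresponds to `KFold` without shuffling, so the loop uses `KFold(n_splits=10)`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; the notebook uses California housing)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = LinearRegression()

# Split into 10 consecutive folds, as cv=10 does for a regressor
kf = KFold(n_splits=10)
manual_scores = []
for train_idx, test_idx in kf.split(X_demo):
    # Train on 9 folds, score on the held-out fold
    model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = model.predict(X_demo[test_idx])
    manual_scores.append(r2_score(y_demo[test_idx], preds))
manual_scores = np.array(manual_scores)

# The library call that the loop above imitates
library_scores = cross_val_score(LinearRegression(), X_demo, y_demo, scoring='r2', cv=10)

print(np.round(manual_scores.mean(), 2), np.round(library_scores.mean(), 2))
```

Both arrays hold one R² value per fold, and the printed means should agree.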

---

### **Purpose**
- **Evaluates the performance** of different regression models (Linear Regression, Decision Tree Regressor, and Support Vector Regressor) using **cross-validation**.
- **Uses R²** as the scoring metric, which is commonly used for regression tasks to measure goodness of fit.
- **Compares the models** based on their R² values to determine which performs best.

---

### **Significance**
- **R² score** indicates how well the model explains the variability of the target variable:
  - **R² = 1** means perfect fit.
  - **R² = 0** means the model performs no better than simply predicting the mean of the target variable.
  - **Negative R²** indicates that the model is worse than simply predicting the mean.
- **Cross-validation** averages the score over several train/test splits, which gives a more reliable performance estimate than a single split.
- This process helps **identify the most suitable model** for the regression task. In this run, Linear Regression (0.35) outperforms the Decision Tree (0.09) and the SVR (-0.62), so the more complex models do not help on this 700-sample subset with default settings.
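
The three R² cases listed above follow from the definition R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the target from its mean. The sketch below checks each case on a tiny hand-made target; the helper `r2_manual` and the toy arrays are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

def r2_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot (illustrative helper)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

perfect = y_true.copy()                         # exact predictions
mean_only = np.full_like(y_true, y_true.mean()) # always predict the mean
reversed_pred = y_true[::-1].copy()             # worse than predicting the mean

for name, pred in [('perfect', perfect), ('mean_only', mean_only), ('reversed', reversed_pred)]:
    print(name, r2_manual(y_true, pred), r2_score(y_true, pred))
```

The hand-written helper and `r2_score` should agree on all three cases: 1 for the perfect predictions, 0 for the mean predictor, and a negative value for the reversed predictions.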



In [8]:
from sklearn.ensemble import VotingRegressor

# Creating a Voting Regressor using the individual estimators
vr = VotingRegressor(estimators)  

# Performing 10-fold cross-validation on the Voting Regressor using R² as the scoring metric
scores = cross_val_score(vr, X, y, scoring='r2', cv=10)  

# Printing the mean R² score of the Voting Regressor (rounded to 2 decimal places)
print("Voting Regressor", np.round(np.mean(scores), 2))  


Voting Regressor 0.32



### **Explanation**
1. **`vr = VotingRegressor(estimators)`**
   - Creates a **Voting Regressor (`vr`)** that combines the predictions of multiple regression models (the estimators in the `estimators` list).
   - The Voting Regressor fits every model in the list and **averages** their predicted values to produce the final prediction.

2. **`scores = cross_val_score(vr, X, y, scoring='r2', cv=10)`**
   - Performs **10-fold cross-validation** on the **Voting Regressor**.
   - **R²** is used as the evaluation metric (`scoring='r2'`), which will measure the goodness of fit for the model on each fold of the dataset.
   - The performance scores (R² values) across the 10 folds are stored in the `scores` array.

3. **`print("Voting Regressor", np.round(np.mean(scores), 2))`**
   - Prints the name of the model (`Voting Regressor`) along with its **mean R² score** (calculated by averaging the scores from the 10 folds).
   - The R² score is rounded to 2 decimal places for better presentation.
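
As a check on the averaging described above, the following sketch fits a `VotingRegressor` and compares its prediction with the plain mean of the individual fitted models' predictions. It assumes synthetic data from `make_regression`, and the names `X_demo`, `y_demo`, `demo_estimators`, and `vr_demo` are illustrative. A fixed `random_state` is set on the tree only to make the demo repeatable.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; the notebook uses California housing)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

demo_estimators = [
    ('lr', LinearRegression()),
    ('dt', DecisionTreeRegressor(random_state=0)),
    ('svr', SVR()),
]
vr_demo = VotingRegressor(demo_estimators).fit(X_demo, y_demo)

# estimators_ holds the fitted copies of the three models
individual = np.array([est.predict(X_demo) for est in vr_demo.estimators_])
manual_average = individual.mean(axis=0)

ensemble_pred = vr_demo.predict(X_demo)
print(np.allclose(manual_average, ensemble_pred))
```

Without `weights`, every model contributes equally, so a weak model pulls the average down as much as a strong one pulls it up.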

---

### **Purpose**
- **Evaluates the performance** of the **Voting Regressor**, which combines multiple regression models, by performing **cross-validation**.
- **Uses R²** as the metric to assess how well the combined predictions fit the data, providing insight into the effectiveness of the ensemble model.

---

### **Significance**
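- **Voting Regressor** combines multiple individual regression models. Averaging mainly reduces variance, but it does not guarantee an improvement: in this run the equally weighted ensemble scores a mean R² of 0.32, slightly below Linear Regression alone (0.35), likely because the poorly performing SVR (-0.62) receives the same weight as the other models.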
- **Cross-validation** provides a more reliable measure of the model's generalization ability by testing it on different subsets of the data.
- **R² score** offers a clear indication of how well the Voting Regressor is able to explain the variation in the target variable, with a higher score indicating better model performance.



In [9]:
# Iterating through possible weight combinations (1, 2, and 3 for each estimator)
for i in range(1, 4):  
    for j in range(1, 4):  
        for k in range(1, 4):  
            
            # Creating a Voting Regressor with specific weights for each estimator
            vr = VotingRegressor(estimators, weights=[i, j, k])  
            
            # Performing 10-fold cross-validation on the Voting Regressor with R² as the scoring metric
            scores = cross_val_score(vr, X, y, scoring='r2', cv=10)  
            
            # Printing the combination of weights (i, j, k) and the mean R² score (rounded to 2 decimal places)
            print("For i={},j={},k={}".format(i, j, k), np.round(np.mean(scores), 2))  


For i=1,j=1,k=1 0.31
For i=1,j=1,k=2 0.17
For i=1,j=1,k=3 0.07
For i=1,j=2,k=1 0.36
For i=1,j=2,k=2 0.27
For i=1,j=2,k=3 0.19
For i=1,j=3,k=1 0.32
For i=1,j=3,k=2 0.3
For i=1,j=3,k=3 0.26
For i=2,j=1,k=1 0.36
For i=2,j=1,k=2 0.26
For i=2,j=1,k=3 0.17
For i=2,j=2,k=1 0.36
For i=2,j=2,k=2 0.3
For i=2,j=2,k=3 0.23
For i=2,j=3,k=1 0.38
For i=2,j=3,k=2 0.34
For i=2,j=3,k=3 0.27
For i=3,j=1,k=1 0.37
For i=3,j=1,k=2 0.3
For i=3,j=1,k=3 0.23
For i=3,j=2,k=1 0.42
For i=3,j=2,k=2 0.34
For i=3,j=2,k=3 0.27
For i=3,j=3,k=1 0.4
For i=3,j=3,k=2 0.36
For i=3,j=3,k=3 0.31




### **Explanation**
1. **`for i in range(1, 4):`**
   - This loop iterates over values for the first weight (`i`), ranging from 1 to 3. Each value represents a weight for the first estimator in the `estimators` list (e.g., Linear Regression).

2. **`for j in range(1, 4):`**
   - Similarly, this loop iterates over values for the second weight (`j`), representing a weight for the second estimator in the `estimators` list (e.g., Decision Tree Regressor).

3. **`for k in range(1, 4):`**
   - This loop iterates over values for the third weight (`k`), representing a weight for the third estimator in the `estimators` list (e.g., Support Vector Regressor).

4. **`vr = VotingRegressor(estimators, weights=[i, j, k])`**
   - The `Voting Regressor` is created with **custom weights** for each model (based on `i`, `j`, and `k`).
   - The `weights` list determines how much influence each model's prediction has on the final output of the ensemble. Higher weights give more influence to a specific model.

5. **`scores = cross_val_score(vr, X, y, scoring='r2', cv=10)`**
   - **10-fold cross-validation** is performed on the **Voting Regressor** (`vr`).
   - **R²** is used as the evaluation metric to measure how well the model explains the variance in the target variable (`y`).

6. **`print("For i={},j={},k={}".format(i, j, k), np.round(np.mean(scores), 2))`**
   - Prints the current weight combination (`i`, `j`, `k`) and the **mean R² score** obtained from the 10 folds.
   - The R² score is rounded to 2 decimal places for readability.
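
The sketch below illustrates how `weights` enter the prediction: the ensemble output is a weighted average of the individual predictions, so only the ratio between the weights matters. It assumes synthetic data from `make_regression`; `make_estimators`, `vr_a`, and `vr_b` are illustrative names, and the tree gets a fixed `random_state` so that both ensembles contain identical trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; the notebook uses California housing)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

def make_estimators():
    """Fresh, deterministic copies of the three models (illustrative helper)."""
    return [
        ('lr', LinearRegression()),
        ('dt', DecisionTreeRegressor(random_state=0)),
        ('svr', SVR()),
    ]

vr_a = VotingRegressor(make_estimators(), weights=[1, 2, 1]).fit(X_demo, y_demo)
vr_b = VotingRegressor(make_estimators(), weights=[2, 4, 2]).fit(X_demo, y_demo)

# Weighted average of the individual predictions, computed by hand
individual = np.array([est.predict(X_demo) for est in vr_a.estimators_])
manual_weighted = np.average(individual, axis=0, weights=[1, 2, 1])

same_as_manual = np.allclose(vr_a.predict(X_demo), manual_weighted)
scale_invariant = np.allclose(vr_a.predict(X_demo), vr_b.predict(X_demo))
print(same_as_manual, scale_invariant)
```

One consequence for the grid above: `(1, 1, 1)`, `(2, 2, 2)`, and `(3, 3, 3)` describe the same ensemble. The small differences between their printed scores (0.31, 0.3, 0.31) most likely come from the Decision Tree, which is created without a fixed `random_state`.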

---

### **Purpose**
- **Explores different weight combinations** for the estimators in the **Voting Regressor** to determine which weighting scheme provides the best performance.
- **Evaluates each weight combination** using **cross-validation** to ensure a reliable measure of the model’s generalization performance.
- **R²** is used as the metric to assess the fit of the model.

---

### **Significance**
- **Custom weights** allow each estimator in the **Voting Regressor** to contribute differently to the final prediction, enabling you to experiment and potentially improve the model's performance.
- **Cross-validation** ensures that the evaluation is robust and not dependent on a single train-test split, providing a more reliable performance estimate.
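- The goal is to find the combination of weights that maximizes the mean **R² score** of the **Voting Regressor**. In this run the best combination is `i=3, j=2, k=1` (0.42), which favors Linear Regression and gives the SVR the lowest weight. Because the weights are chosen using the same cross-validation scores that are reported, the best score is an optimistic estimate.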



In [10]:
# Defining five Decision Tree Regressors with different maximum depths
dt1 = DecisionTreeRegressor(max_depth=1)  
dt2 = DecisionTreeRegressor(max_depth=3)  
dt3 = DecisionTreeRegressor(max_depth=5)  
dt4 = DecisionTreeRegressor(max_depth=7)  
dt5 = DecisionTreeRegressor(max_depth=None)  

# Creating a list of tuples where each tuple contains the model name and the model itself
estimators = [('dt1', dt1), ('dt2', dt2), ('dt3', dt3), ('dt4', dt4), ('dt5', dt5)]  

# Looping through each estimator in the 'estimators' list
for estimator in estimators:
    
    # Performing 10-fold cross-validation using R² as the scoring metric
    scores = cross_val_score(estimator[1], X, y, scoring='r2', cv=10)  
    
    # Printing the name of the model and the mean R² score (rounded to 2 decimal places)
    print(estimator[0], np.round(np.mean(scores), 2))  


dt1 -0.01
dt2 0.21
dt3 0.22
dt4 0.14
dt5 0.15



### **Explanation**
1. **`dt1 = DecisionTreeRegressor(max_depth=1)`** through **`dt5 = DecisionTreeRegressor(max_depth=None)`**
   - Defines five **Decision Tree Regressor** models with different maximum depths:
     - **`max_depth=1`**: A single split (a very simple model that tends to underfit).
     - **`max_depth=3`** to **`max_depth=7`**: Increasing depths, which make the tree more complex and more prone to overfitting.
     - **`max_depth=None`**: The tree grows until all leaves are pure or contain fewer than `min_samples_split` samples (maximal complexity).

2. **`estimators = [('dt1', dt1), ('dt2', dt2), ('dt3', dt3), ('dt4', dt4), ('dt5', dt5)]`**
   - This creates a list of tuples, where each tuple contains the model’s name and the corresponding model object. This structure is useful for iterating over the models.

3. **`for estimator in estimators:`**
   - Iterates over each estimator in the `estimators` list.
   - For each estimator, the loop will perform cross-validation and print the results.

4. **`scores = cross_val_score(estimator[1], X, y, scoring='r2', cv=10)`**
   - Performs **10-fold cross-validation** on the current model (`estimator[1]`).
   - **R²** is used as the scoring metric to evaluate the model’s performance. R² tells us how well the model explains the variance of the target variable.
   - **`X`** is the input features, and **`y`** is the target variable.

5. **`print(estimator[0], np.round(np.mean(scores), 2))`**
   - Prints the name of the model (`estimator[0]`) and its **mean R² score** (calculated by averaging the R² scores across all 10 folds).
   - **`np.round(np.mean(scores), 2)`** rounds the mean R² score to 2 decimal places for better readability.

---

### **Purpose**
- **Evaluates the performance** of different **Decision Tree Regressors** with varying maximum depths using **cross-validation**.
- The **R² score** is used as the evaluation metric to measure the fit of each model, providing an indicator of how well each tree explains the variance in the target variable.

---

### **Significance**
- **Varying the depth of the Decision Tree** can have a significant effect on its performance:
  - **Shallow trees (max_depth=1)** tend to underfit the data, leading to poor performance.
  - **Deeper trees (max_depth=7 or None)** may overfit the data, capturing noise and generalizing worse (0.14 and 0.15 in this run).
  - **Intermediate depths (max_depth=3 or 5)** might strike a better trade-off between bias and variance; they score highest in this run (0.21 and 0.22).
- **Cross-validation** ensures that the model's performance is not overly dependent on a single train-test split, giving a more reliable estimate of how well the model will perform on unseen data.
- **R² score** helps in determining which depth of the tree best explains the variability in the target variable. A higher R² score indicates a better model fit.
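
One way to see underfitting and overfitting directly is to compare training and validation R² for each depth. The sketch below uses `cross_validate` with `return_train_score=True` on synthetic data from `make_regression` (an assumption, chosen so that the example runs without downloading the housing data); `X_demo`, `y_demo`, and `results` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy stand-in data (an assumption; the notebook uses California housing)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)

results = {}
for depth in [1, 3, 5, 7, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    cv_out = cross_validate(tree, X_demo, y_demo, scoring='r2', cv=10,
                            return_train_score=True)
    # Store (mean training R², mean validation R²) for this depth
    results[depth] = (cv_out['train_score'].mean(), cv_out['test_score'].mean())
    print(depth, np.round(results[depth][0], 2), np.round(results[depth][1], 2))
```

A large gap between the training and validation score, as expected for `max_depth=None`, is the usual sign of overfitting, while a low score on both sides points to underfitting.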



In [11]:
# Creating a Voting Regressor with a list of estimators
vr = VotingRegressor(estimators)

# Performing 10-fold cross-validation using R² as the scoring metric
scores = cross_val_score(vr, X, y, scoring='r2', cv=10)

# Printing the mean R² score of the Voting Regressor (rounded to 2 decimal places)
print("Voting Regressor", np.round(np.mean(scores), 2))


Voting Regressor 0.39




### **Purpose**
- Evaluate the **Voting Regressor** model using **10-fold cross-validation** and the **R² score** to assess its performance.

---

### **Significance**
- **Voting Regressor** combines multiple models to make predictions. Here the models are five decision trees of different depths rather than different algorithms, and the ensemble's mean R² (0.39) is higher than that of the best single tree (0.22).
- **R²** measures how well the ensemble model explains the variance in the target variable.
- **Cross-validation** ensures a robust evaluation by testing the model on different data splits.

---

