<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Courses</font></h1>
</center>

---

<center>
    <h1><font color="red">Regression with Scikit-Learn</font></h1>
</center>

# <font color="red">Objective</font>

Use a dataset of home sales in a city to:
- Preprocess the data and perform EDA.
- Build a predictive regression model to estimate housing prices based on various features of the houses.
- Perform k-fold cross-validation on various models to identify the one with the best score.

## Package Requirements

- NumPy
- scipy
- matplotlib
- pandas
- scikit-learn
- seaborn

In [None]:
try:
    import google.colab
    print("Running in Google Colab")
except ImportError:
    print("Not running in Google Colab")
else:
    print("Installing modules in Google Colab")
    !pip install seaborn
    !pip install -U scikit-learn

In [None]:
import warnings
from warnings import simplefilter
warnings.filterwarnings("ignore")
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

In [None]:
%matplotlib inline
import numpy as np
import scipy.stats as stats

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd

In [None]:
import sklearn
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Ridge
from sklearn.linear_model import BayesianRidge

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
from sklearn.svm import SVR

In [None]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor

In [None]:
print(f"Numpy version:        {np.__version__}")
print(f"Pandas version:       {pd.__version__}")
print(f"Seaborn version:      {sns.__version__}")
print(f"Scikit-Learn version: {sklearn.__version__}")

# <font color="red">City housing dataset</font>

- Contains information about different aspects of residential homes in Ames, Iowa.
- There are 1460 observations and 79 feature variables in this dataset.
- [Information on the dataset can be found here](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).

We want to predict the sale price of a house from the given features.

__We use this particular dataset in our EDA presentation.__

## <font color="blue"> Obtain the Dataset</font>

In [None]:
ames_url = "https://raw.githubusercontent.com/astg606/py_materials/refs/heads/master/machine_learning/data/housing_data.csv"

ames_df = pd.read_csv(ames_url)

In [None]:
ames_df

## <font color="blue"> Features of the dataset and first data cleaning</font>

In [None]:
ames_df.info()

- The target is the `SalePrice` represented in the last column
- 37 columns have numerical values
- 43 columns have `object` as data type. Are we going to use them for our analysis?
- There are many missing values. How are we going to treat them?
- From the data, the following columns have far fewer non-null values than the others and may not be relevant for the model we want to build:
   - `MiscFeature` (54)
   - `Fence` (281)
   - `PoolQC` (7)
   - `Alley` (91) 
   
We can drop the four columns with a lot of missing values. We also drop the `Id` column.

In [None]:
dropped_cols = ['Id', 'MiscFeature', 'Fence', 'PoolQC', 'Alley']
ames_df.drop(dropped_cols, axis=1, inplace=True)
ames_df

**To facilitate the analysis, we are only going to consider columns with numerical values:**

In [None]:
ames_df_num = ames_df.select_dtypes(include=['float64', 'int64'])
ames_df_num

In [None]:
feature_names = list(ames_df_num.columns)
feature_names.pop(-1)
feature_names

# <font color="red">Exploratory Data Analysis</font>

- Important step before training the model. 
- We use statistical analysis and visualizations to understand the relationship of the target variable with other features.

## <font color="blue"> Obtain basic statistics on the data</font>

In [None]:
ames_df_num

In [None]:
ames_df_num.describe().transpose()

- The average sale price of a house in our dataset is close to $\$180,921$, with the middle 50% of the values (the interquartile range) falling between $\$129,975$ and $\$214,000$.
- The sale price standard deviation of $\$79,442$ indicates a large spread of the sale price and suggests the existence of outliers.
- There might be many missing values in `LotFrontage` (linear feet of street connected to property). Do we need to keep this column?

## <font color="blue"> Check Missing Values</font>
It is a good practice to see if there are any missing values in the data. 

Count the number of missing values for each feature

In [None]:
ames_df_num.isnull().sum()

We can also determine the percentage of missing values in each column:

In [None]:
total = ames_df_num.isnull().sum().sort_values(ascending=False)
percent = (ames_df_num.isnull().sum()/ames_df_num.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, 
                         keys=['Total', 'Percent'])
missing_data.head(ames_df_num.shape[1])

What are we going to do with the missing values in the following columns?
- `LotFrontage` (259): Linear feet of street connected to property
- `GarageYrBlt` (81): Year garage was built
- `MasVnrArea` (8): Masonry veneer area in square feet

**We choose to drop the rows with missing values.**

In [None]:
ames_df_num.shape

In [None]:
ames_df_nonan = ames_df_num.dropna()

In [None]:
ames_df_nonan.shape
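
The effect of `dropna()` can be checked on a small example. The sketch below uses a made-up toy frame (the values are illustrative, not taken from the Ames data) to show how the missing fraction per column is computed and how many rows survive. Note that `isnull().mean()` gives the same fraction as dividing `isnull().sum()` by `isnull().count()`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for ames_df_num (values are made up for illustration).
toy_df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan, 84.0],
    "GarageYrBlt": [2003.0, 1976.0, np.nan, 1998.0, 2000.0],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Fraction of missing entries per column.
missing_fraction = toy_df.isnull().mean()
print(missing_fraction)

# dropna() removes every row that contains at least one NaN.
toy_nonan = toy_df.dropna()
print(f"Rows before: {len(toy_df)}, rows after: {len(toy_nonan)}")
```

Dropping rows is simple, but it discards the valid entries in those rows as well, so it is worth comparing the shapes before and after.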

## <font color="blue"> Distribution of the target variable</font>

In [None]:
plt.figure(figsize=(8, 6));
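# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(ames_df_nonan['SalePrice'], kde=True);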

From the above output we can see that the values of `SalePrice` are skewed to the right (a long tail of expensive homes) and include some outliers.
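
Skewness can also be quantified rather than read off the plot. The following is a minimal sketch on synthetic, log-normally distributed values (an assumption made for illustration, not the Ames data): a positive sample skewness and a mean above the median both point to a long right tail.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic "prices" with a long right tail (not the Ames data).
prices = rng.lognormal(mean=12.0, sigma=0.4, size=5000)

skewness = stats.skew(prices)
print(f"Sample skewness: {skewness:.2f}")
print(f"Mean:   {prices.mean():,.0f}")
print(f"Median: {np.median(prices):,.0f}")
```

The same check can be run on `ames_df_nonan['SalePrice']` to confirm what the histogram suggests.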

## <font color="blue"> Heatmap: two-dimensional graphical representation</font>
- Represent the individual values that are contained in a matrix as colors.
- Create a correlation matrix that measures the linear relationships between the variables.
- We want to identify strong linear correlations.

__You may choose to display only the correlations that satisfy specific conditions:__

In [None]:
plt.figure(figsize=(22, 11));
correlation_matrix = ames_df_nonan.corr().round(2);
sns.heatmap(correlation_matrix[(correlation_matrix >= 0.7) | 
                               (correlation_matrix <= -0.7)], 
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 7}, square=True);
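
To see what the boolean mask does, here is a small sketch on synthetic columns (the names `area`, `rooms`, and `noise` are hypothetical). Entries that fail the threshold become `NaN`, which is why the heatmap leaves those cells blank.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
area = rng.normal(1500, 400, size=200)
toy = pd.DataFrame({
    "area": area,
    "rooms": area / 250 + rng.normal(0, 0.5, size=200),  # strongly tied to area
    "noise": rng.normal(0, 1, size=200),                  # unrelated column
})

corr = toy.corr().round(2)

# Keep only strong correlations; everything else becomes NaN.
strong = corr[(corr >= 0.7) | (corr <= -0.7)]
print(strong)

# List each strongly correlated pair once (upper triangle, diagonal excluded).
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = strong.where(upper).stack().dropna()
print(pairs)
```

Listing the pairs this way can be easier to scan than a large annotated heatmap when there are dozens of features.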

- **OverallQual** and **GrLivArea** have a strong positive correlation with **SalePrice** (0.8 and 0.71 respectively).
- The features **GrLivArea** & **TotRmsAbvGrd**, **GarageCars** & **GarageArea** and **TotalBsmtSF** & **1stFlrSF** have a correlation of at least 0.7. These feature pairs are strongly correlated to each other. This can affect the model. 

In [None]:
subset_feature_names = ['OverallQual', 'GrLivArea', 'TotRmsAbvGrd', 
                        'GarageCars', 'GarageArea', 'TotalBsmtSF', '1stFlrSF']
fig,axs= plt.subplots(4, 2, figsize=(20, 30))

# adjust horizontal space between plots 
fig.subplots_adjust(hspace=0.6)

# Flatten the 4x2 grid of axes into a vector of 8 axes so that we can iterate over it.
for i, ax in zip(subset_feature_names, axs.flatten()):
    sns.scatterplot(x=i, y='SalePrice', hue='SalePrice',data=ames_df_nonan, ax=ax, palette='viridis_r')
    # Label the current subplot (plt.xlabel would only affect the last axes created)
    ax.set_xlabel(i, fontsize=12)
    ax.set_ylabel('SalePrice', fontsize=12)

    # ax.set_yticks(np.arange(0,900001,100000))
    ax.set_title(f'SalePrice - {i}', fontweight='bold',size=20)

- The sale prices increase roughly linearly as the value of `GrLivArea` increases.
- There are a few outliers.

Based on the above observations we will plot an `lmplot` between **GrLivArea** and **SalePrice** to see the relationship between the two more clearly.

In [None]:
sns.lmplot(x = 'GrLivArea', y = 'SalePrice', data = ames_df_nonan);

# <font color="red">Model selection process</font>

- A ML algorithm needs to be trained on a set of data to learn the relationships between different features and how these features affect the target variable. 
- We need to divide the entire data set into two sets:
    + Training set on which we are going to train our algorithm to build a model. 
    + Testing set on which we will test our model to see how accurate its predictions are.
- Before we create the two sets, we need to identify the algorithm (estimator) we will use for our model.
- We use the `machine_learning_map` map (shown below) as a cheat sheet to shortlist the algorithms that we can try out to build our prediction model. 

![fig_estimators](https://scikit-learn.org/stable/_downloads/b82bf6cd7438a351f19fac60fbc0d927/ml_map.svg)    

Using the checklist, let’s see which category our current dataset falls into:
- Do we have more than 50 samples (**1121** samples) ? (**Yes**)
- Are we predicting a category? (**No**)
- Are we predicting a quantity? (**Yes**)

Based on the checklist that we prepared above and going by the `machine_learning_map` we can try out **regression methods** such as:

- Linear Regression 
- Lasso
- ElasticNet Regression
- Ridge Regression
- K Neighbors Regressor
- Decision Tree Regressor
- Support Vector Regression (SVR)
- Ada Boost Regressor
- Gradient Boosting Regressor
- Random Forest Regression
- Extra Trees Regressor

__Check the following documents on regression__: 

- <a href="https://scikit-learn.org/stable/supervised_learning.html">Supervised learning--scikit-learn</a>
- <a href="https://developer.ibm.com/technologies/data-science/tutorials/learn-regression-algorithms-using-python-and-scikit-learn/">Learn regression algorithms using Python and scikit-learn</a>
- <a href="https://www.pluralsight.com/guides/non-linear-">Non-Linear Regression Trees with scikit-learn</a>.

# <font color="red">Simple Linear Model</font>
- It is difficult to visualize the multiple features.
- We want to predict the house price with just one variable and then move to the regression with all features.
- Because **GrLivArea** shows positive correlation with **SalePrice**, we will use **GrLivArea** for the model.

In [None]:
# Note: despite its name, X_garage holds the above-ground living area (GrLivArea)
X_garage = ames_df_nonan.GrLivArea
y_price = ames_df_nonan.SalePrice


X_garage = np.array(X_garage).reshape(-1,1)
y_price = np.array(y_price).reshape(-1,1)

print(X_garage.shape)
print(y_price.shape)

## <font color="blue"> Splitting the data into training and testing sets</font>
- We use the `train_test_split()` function to split the data into training and testing sets. 
- We train the model with 80% of the samples and test with the remaining 20%. 
- We do this to assess the model’s performance on unseen data.

In [None]:
X_train_1, X_test_1, Y_train_1, Y_test_1 = \
             train_test_split(X_garage, y_price, 
                              test_size = 0.2, random_state=5)

In [None]:
print(f"Shape of the training features: {X_train_1.shape}")
print(f"Shape of the training target: {Y_train_1.shape}")
print(f"Shape of the test features: {X_test_1.shape}")
print(f"Shape of the test target: {Y_test_1.shape}")

## <font color="blue"> Training and testing the model</font>
- We use scikit-learn’s `LinearRegression` to train our model on the training set and then check it on the test set.
- We first check the model performance on the training set.

__Create the model with a `sklearn` estimator__

In [None]:
reg_1 = LinearRegression()

__Train the model__

- We use the `fit()` method.

In [None]:
reg_1.fit(X_train_1, Y_train_1)

__Make a prediction on the training set__

- We use the `predict()` method.

In [None]:
y_train_predict_1 = reg_1.predict(X_train_1)

__Determine the performance of the model on the training set__

- We use the root mean squared error (RMSE) and the R2 score. The best possible R2 score is 1.0, and it can be negative because a model can be arbitrarily worse than always predicting the mean of the target.
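
Both metrics can be written out by hand, which makes their meaning concrete. The sketch below uses a handful of made-up values (not model output from this notebook) and compares the manual formulas against `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only.
y_true = np.array([200.0, 150.0, 320.0, 275.0, 180.0])
y_hat = np.array([210.0, 140.0, 300.0, 290.0, 185.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R2: 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"RMSE (manual):  {rmse_manual:.3f}")
print(f"RMSE (sklearn): {np.sqrt(mean_squared_error(y_true, y_hat)):.3f}")
print(f"R2 (manual):    {r2_manual:.3f}")
print(f"R2 (sklearn):   {r2_score(y_true, y_hat):.3f}")

# A constant prediction far from the data is worse than predicting the mean,
# so its R2 is negative.
y_bad = np.full_like(y_true, 500.0)
r2_bad = r2_score(y_true, y_bad)
print(f"R2 of a poor constant prediction: {r2_bad:.3f}")
```

RMSE is expressed in the units of the target (dollars here), while R2 is unitless, which is why the two are usually reported together.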

In [None]:
rmse_train = (np.sqrt(metrics.mean_squared_error(Y_train_1, y_train_predict_1)))

In [None]:
r2_train = round(reg_1.score(X_train_1, Y_train_1),2)

In [None]:
print(f"The model performance for training set")
print(f"--------------------------------------")
print(f'RMSE is {rmse_train}')
print(f'R2 score is {r2_train}')

__Model evaluation on the test set__

In [None]:
y_pred_1 = reg_1.predict(X_test_1)

In [None]:
rmse_test = (np.sqrt(metrics.mean_squared_error(Y_test_1, y_pred_1)))

In [None]:
r2_test = round(reg_1.score(X_test_1, Y_test_1),2)

In [None]:
print(f"The model performance for test set")
print(f"--------------------------------------")
print(f"Root Mean Squared Error: {rmse_test}")
print(f"R^2: {r2_test}")

The coefficient of determination: 1 is perfect prediction

In [None]:
print(f'Coefficient of determination: {metrics.r2_score(Y_test_1, y_pred_1) :.4f}')

### <font color="green">45-Degree plot</font>

- A scatter plot comparing a model's predicted values on one axis to the actual, true values on the other.
- A perfectly accurate model will have all its data points falling directly on the 45-degree line ($y=x$), indicating that the predictions perfectly match the actual values.
- This plot is a key diagnostic tool to quickly assess a regression model's performance.
- __Interpreting the plot__:
   - _Points on the line_: If the data points cluster tightly around the 45-degree line, it indicates a good model with high accuracy.
   - _Points deviating from the line_: If the points are widely scattered or form a pattern, it suggests that the model is not predicting accurately for some data points. The spread and pattern of these deviations can help diagnose problems with the model.
   - _Mean error approaching zero_: When the points are closer to the 45-degree line, the model's average error is approaching zero. 

In [None]:
plt.figure(figsize=(8, 5));
plt.scatter(Y_test_1, y_pred_1);
plt.plot(y_price, y_price, '--k');
plt.axis('tight');
plt.xlabel("Actual Sale Prices");
plt.ylabel("Predicted House Prices");
plt.title("Actual Prices vs Predicted prices");
plt.tight_layout();

# <font color="red">Linear regression model with all features</font>
- In the previous example, we used one feature (__GrLivArea__) to create a model for predicting the sale price (__SalePrice__).
   - We observed that the accuracy was not good.
- Now, we create a model using all the features in the dataset.

__Create the training and testing sets__

In [None]:
X = ames_df_nonan.drop('SalePrice', axis = 1)
y = ames_df_nonan['SalePrice']

- Use the `train_test_split` to split the data into random train and test subsets.
- Every time you run it without specifying `random_state`, you will get a different result.
- If you use `random_state=some_number`, then you can guarantee the split will be always the same.
- It doesn't matter what the value of `random_state` is:  42, 0, 21, ...
- This is useful if you want reproducible results.
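
The reproducibility claim is easy to verify. This sketch splits a small synthetic array (a stand-in for the housing features) several times and compares the resulting test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny synthetic dataset: 20 samples, one feature.
X_demo = np.arange(20).reshape(-1, 1)
y_demo = np.arange(20)

_, X_test_a, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
_, X_test_c, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

print("Same seed gives the same test rows:     ", np.array_equal(X_test_a, X_test_b))
print("Different seed gives the same test rows:", np.array_equal(X_test_a, X_test_c))
print("Test set size:", len(X_test_a))
```

Because the reported scores depend on which rows land in the test set, fixing the seed makes runs comparable, but a single split is still only one estimate of performance.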

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=42)

__Create a linear regression model__

In [None]:
reg_all = LinearRegression()

__Train the model__

In [None]:
reg_all.fit(X_train, y_train)

__Model evaluation on the training Set__

In [None]:
y_train_predict = reg_all.predict(X_train)

In [None]:
rmse = (np.sqrt(metrics.mean_squared_error(y_train, y_train_predict)))
r2 = round(reg_all.score(X_train, y_train),2)

In [None]:
print(f"The model performance for training set")
print(f"--------------------------------------")
print(f'RMSE is {rmse}')
print(f'R2 score is {r2}')

__Model evaluation on the test set__

In [None]:
y_pred = reg_all.predict(X_test)

In [None]:
rmse = (np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
r2 = round(reg_all.score(X_test, y_test),2)

In [None]:
print(f"The model performance for test set")
print(f"--------------------------------------")
print(f"Root Mean Squared Error: {rmse}")
print(f"R^2: {r2}")

The coefficient of determination: 1 is perfect prediction

In [None]:
print(f'Coefficient of determination: {metrics.r2_score(y_test, y_pred) :.4f}')

__Error distribution on the test set__

In [None]:
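# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(y_test - y_pred, kde=True);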

__45-Degree plot__

In [None]:
plt.figure(figsize=(8, 5));
plt.scatter(y_test, y_pred);
plt.plot(y, y, '--k');
plt.axis('tight');
plt.xlabel("Actual House Prices");
plt.ylabel("Predicted House Prices");
plt.title("Actual Prices vs Predicted Prices");
plt.tight_layout();

In [None]:
print("RMS: %r " % np.sqrt(np.mean((y_test - y_pred) ** 2)))

In [None]:
df1 = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df2 = df1.head(10)
df2

In [None]:
df2.plot(kind='bar');

# <font color="red">Choosing the Best Model:</font> k-Fold Cross-Validation

- Cross-validation is a resampling procedure used to evaluate ML models on a limited data sample.
- It is primarily used in applied ML to estimate the skill of a machine learning model on unseen data.
- We use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.
- The biggest advantage of this method is that every data point is used for validation exactly once and for training `k-1` times.
- To choose the final model to use, **we select the one that has the lowest validation error**.

The general procedure is as follows:

1. Shuffle the dataset randomly.
2. Split the dataset into `k` groups
3. For each unique group:
   
   3.1 Take the group as a hold out or test data set
   
   3.2 Take the remaining `k-1` groups as a training data set
   
   3.3 Fit a model on the training set and evaluate it on the test set
   
   3.4 Retain the evaluation score and discard the model
   
4. Summarize the skill of the model using the sample of model evaluation scores
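
The procedure above can be sketched as an explicit loop. The example uses synthetic data from `make_regression` (an assumption for illustration) and checks that the hand-written loop reproduces what `cross_val_score` returns for the same `KFold` object.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the housing features.
X_demo, y_demo = make_regression(n_samples=100, n_features=3,
                                 noise=10.0, random_state=0)

# Steps 1-2: shuffle and split into k groups.
k_fold = KFold(n_splits=5, shuffle=True, random_state=9)

# Step 3: hold out each group in turn, fit on the rest, and keep the score.
manual_scores = []
for train_idx, test_idx in k_fold.split(X_demo):
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# Step 4: summarize the scores.
print(f"Manual loop:     mean R2 = {manual_scores.mean():.4f}, std = {manual_scores.std():.4f}")

auto_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                              cv=k_fold, scoring="r2")
print(f"cross_val_score: mean R2 = {auto_scores.mean():.4f}, std = {auto_scores.std():.4f}")
```

A fresh `LinearRegression()` is created in every iteration so that no information leaks from one fold to the next.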

__How to choose the value of `k`?__
- A poorly chosen value for `k` may result in a mis-representative idea of the skill of the model, such as a score with a high variance, or a high bias.
- The choice of `k` is usually 5 or 10, but there is no formal rule. As `k` gets larger, the difference in size between the training set and the resampling subsets gets smaller. As this difference decreases, the bias of the technique becomes smaller.
- A value of `k=10` is very common in the field of applied machine learning, and is recommended if you are struggling to choose a value for your dataset.
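
To make the effect of `k` on the fold sizes concrete, the sketch below counts the training and validation rows for a few values of `k`, assuming a hypothetical dataset of 1000 samples.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 1000  # hypothetical dataset size
X_demo = np.zeros((n_samples, 1))

fold_sizes = {}
for k in (2, 5, 10):
    # Sizes are the same for every fold here, so inspecting the first is enough.
    train_idx, test_idx = next(KFold(n_splits=k).split(X_demo))
    fold_sizes[k] = (len(train_idx), len(test_idx))
    print(f"k={k:>2}: train on {len(train_idx)} rows, validate on {len(test_idx)} rows")
```

With larger `k`, each model is trained on a share of the data closer to the full dataset, at the cost of fitting more models.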

Below is the visualization of a k-fold validation when k=5.
![FIG_kFold](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)
Image Source: https://scikit-learn.org/

__Set parameters__

In [None]:
seed    = 9
folds   = 10

__Set the scoring metric__

In [None]:
metric  = "r2" # "neg_mean_squared_error"

__Hold different regression models in a single dictionary__

In [None]:
models = dict()
models["Linear"]        = LinearRegression()
models["Lasso"]         = Lasso()
models["ElasticNet"]    = ElasticNet()
models["Ridge"]         = Ridge()
models["BayesianRidge"] = BayesianRidge()
models["KNN"]           = KNeighborsRegressor()
models["DecisionTree"]  = DecisionTreeRegressor()
models["SVR"]           = SVR()
models["AdaBoost"]      = AdaBoostRegressor()
models["GradientBoost"] = GradientBoostingRegressor()
models["RandomForest"]  = RandomForestRegressor()

__Loop over all the models to perform a 10-fold cross validation__

- The scoring parameter in `sklearn.model_selection.cross_val_score` determines the metric used to evaluate the performance of an estimator during cross-validation.
- We define the scoring metric using the `scoring` parameter. `sklearn` provides a wide range of predefined scoring metrics that can be passed as strings. Examples include:
   - _Classification_: `'accuracy'`, `'precision'`, `'recall'`, `'f1'`, `'roc_auc'`, `'neg_log_loss'`, etc.
   - _Regression_: `'r2'`, `'neg_mean_squared_error'`, `'neg_mean_absolute_error'`, etc.
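
Error metrics are exposed with a `neg_` prefix because scikit-learn scorers follow a "higher is better" convention. A short sketch on synthetic data (not the housing data) shows how to turn `'neg_mean_squared_error'` scores back into an RMSE per fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = make_regression(n_samples=120, n_features=4,
                                 noise=15.0, random_state=1)
k_fold = KFold(n_splits=5, shuffle=True, random_state=9)

# The scores are negated MSE values, so they are all <= 0.
neg_mse = cross_val_score(Ridge(), X_demo, y_demo, cv=k_fold,
                          scoring="neg_mean_squared_error")
print("Negated MSE per fold:", np.round(neg_mse, 2))

# Flip the sign and take the square root to recover the RMSE.
rmse_per_fold = np.sqrt(-neg_mse)
print("RMSE per fold:       ", np.round(rmse_per_fold, 2))
```

With this convention, "pick the model with the highest mean score" works unchanged whether the metric is `'r2'` or a negated error.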

In [None]:
model_results = list()
model_names   = list()

print(f"{'Model name':>20} {'Metric mean':>16} {'Metric std':>16}")

for model_name in models:
    model   = models[model_name]
    k_fold  = KFold(n_splits=folds, random_state=seed, shuffle=True)
    results = cross_val_score(model, X_train, y_train, cv=k_fold, scoring=metric)
    
    model_results.append(results)
    model_names.append(model_name)
    print(f"{model_name:>20} {results.mean():16.3f} {results.std():16.3f}")

__Create a box-whisker plot to compare regression models__

In [None]:
figure = plt.figure(figsize=(12, 9));
figure.suptitle('Regression models comparison');
ax = figure.add_subplot(111);
plt.boxplot(model_results);
ax.set_xticklabels(model_names, rotation = 45, ha="right");
ax.set_ylabel(f"Cross-validation score ({metric})");
plt.margins(0.05, 0.1);
#plt.savefig("model_mse_scores.png")
plt.show();

__Observations__

- The best model is the one with the highest r2 score.
- **Based on the above comparison, we can see that the `GradientBoostingRegressor` model outperforms all the other regression models:** it has the largest mean r2.

# <font color="red">Model with Gradient Boosting Regression</font>

```python
GradientBoostingRegressor(*, loss='squared_error', learning_rate=0.1, 
                          n_estimators=100, subsample=1.0, criterion='friedman_mse', 
                          min_samples_split=2, min_samples_leaf=1, 
                          min_weight_fraction_leaf=0.0, max_depth=3, 
                          min_impurity_decrease=0.0, init=None, 
                          random_state=None, max_features=None, 
                          alpha=0.9, verbose=0, max_leaf_nodes=None, 
                          warm_start=False, validation_fraction=0.1, 
                          n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)
```

__Define the `GradientBoostingRegressor` estimator__

In [None]:
gbr = GradientBoostingRegressor(random_state=42)

__Specify the parameter grid for `GridSearchCV`__

Create a dictionary where keys are the hyperparameter names as strings and values are lists of the values to try for each hyperparameter. Common parameters for `GradientBoostingRegressor` to tune include: 

- `n_estimators`: Number of boosting stages.
- `learning_rate`: Shrinks the contribution of each tree.
- `max_depth`: Maximum depth of the individual regression estimators.
- `subsample`: Fraction of samples to be used for fitting the individual base learners.
- `loss`: The loss function to be optimized.

In [None]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'subsample': [0.8, 0.9, 1.0]
}
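
Before launching a grid search it helps to know how many fits it will trigger. The sketch below uses `ParameterGrid` to count the combinations in a grid with the same values as above (redefined here so the snippet is self-contained) and multiplies by a 5-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid above, repeated so this snippet runs on its own.
param_grid_demo = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'subsample': [0.8, 0.9, 1.0],
}

n_candidates = len(ParameterGrid(param_grid_demo))
cv_folds = 5
n_fits = n_candidates * cv_folds

print(f"Candidate settings: {n_candidates}")
print(f"Cross-validation fits with {cv_folds} folds: {n_fits}")
```

The count grows multiplicatively with every added hyperparameter, so it is worth keeping each list short or narrowing the grid in stages.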

__Initialize `GridSearchCV`__

- `estimator`: The `GradientBoostingRegressor` object.
- `param_grid`: The dictionary of hyperparameters and their values.
- `cv`: Number of folds for cross-validation.
- `scoring`: The metric to evaluate the models.
- `n_jobs`: Number of CPU cores to use (-1 means all available cores).
- `verbose`: Controls the verbosity of the output.

In [None]:
grid_search = GridSearchCV(
    estimator=gbr, 
    param_grid=param_grid, 
    cv=5, 
    scoring=metric, #'neg_mean_squared_error', 
    n_jobs=-1, 
    verbose=1)

__Fit `GridSearchCV` to your training data__

In [None]:
grid_search.fit(X_train, y_train)

__Retrieve the best parameters and best score__

In [None]:
print(f"Best parameters found: \n\t {grid_search.best_params_}")
print(f"Best score found: \n\t {grid_search.best_score_}")

__Retrieve the best estimator__

In [None]:
best_gbr = grid_search.best_estimator_

__Evaluate the best estimator on the test set__

In [None]:
gbr_predicted = best_gbr.predict(X_test)

In [None]:
test_score = best_gbr.score(X_test, y_test)
print(f"R-squared on test set: {test_score}")

__Visualizing predictions__

- Visualizing predictions helps us understand how well our model is performing and identify any patterns or discrepancies between the actual and predicted values.
- By plotting the actual values against the predicted values, we can visually assess the model's accuracy and spot areas where the predictions may be off.
- This is crucial for interpreting the effectiveness of our hyperparameter tuning and understanding the model's behavior.

__The closer these points are together, the better the model's predictive performance.__

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(range(len(y_test)), y_test, label='Actual', alpha=0.7)
plt.scatter(range(len(y_test)), gbr_predicted, label='Predicted', alpha=0.7)
plt.title('Actual vs Predicted Values with Tuned Hyperparameters')
plt.xlabel('Sample Index')
plt.ylabel('Value')
plt.legend()
plt.show()

__Error distribution__

In [None]:
gbr_expected = y_test

In [None]:
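# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(gbr_expected - gbr_predicted, kde=True);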

#### 45-Degree Plot

In [None]:
plt.figure(figsize=(8, 5));
plt.scatter(gbr_expected, gbr_predicted)
plt.plot(y, y, '--k');
plt.axis('tight');
plt.xlabel('True price ($)');
plt.ylabel('Predicted price ($)');
plt.tight_layout();

__Feature importance__
- Once we have a trained model, we can inspect its feature importances (also called variable importances), which indicate how much each feature contributes to predicting the target.

In [None]:
plt.figure(figsize=(20, 11));

feature_importance = best_gbr.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())

sorted_idx = np.argsort(feature_importance)
pos        = np.arange(sorted_idx.shape[0]) + .5

np_feature_names = np.array(feature_names)
plt.barh(pos, feature_importance[sorted_idx], align='center');
plt.yticks(pos, np_feature_names[sorted_idx]);
plt.xlabel('Relative Importance');
plt.title('Variable Importance');
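
As a sanity check on how `feature_importances_` behaves, the sketch below fits a gradient boosting model on synthetic data in which only the first of three columns drives the target (an assumption made for illustration). The raw importances sum to 1, which is why the cell above rescales them relative to the maximum.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))

# Only the first column drives the target; the other two are pure noise.
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

model = GradientBoostingRegressor(random_state=42).fit(X_demo, y_demo)
importances = model.feature_importances_

print("Raw importances:", np.round(importances, 3))
print("Sum of importances:", round(float(importances.sum()), 3))
print("Most important column index:", int(np.argmax(importances)))
```

These impurity-based importances describe what the fitted trees used, so strongly correlated features can share or trade importance between them.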

**Plot training deviance:**

In [None]:
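n_estimators = gbr.n_estimators  # 100 for the default (untuned) estimator
# compute the test set deviance (MSE) after each boosting stage
test_score = np.zeros((n_estimators,), dtype=np.float64)

# Note: this refits the untuned `gbr`, not the tuned `best_gbr`
gbr.fit(X_train, y_train)

for i, y_stage_pred in enumerate(gbr.staged_predict(X_test)):
    test_score[i] = mean_squared_error(gbr_expected, y_stage_pred)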

plt.figure(figsize=(12, 6));
plt.subplot(1, 1, 1);
plt.title('Deviance');
plt.plot(np.arange(n_estimators) + 1,  gbr.train_score_, 'b-',
         label='Training Set Deviance');
plt.plot(np.arange(n_estimators) + 1, test_score, 'r-',
         label='Test Set Deviance');
plt.legend(loc='upper right');
plt.xlabel('Boosting Iterations');
plt.ylabel('Deviance');
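
The staged predictions can also be used to find the boosting iteration with the lowest test error. The following sketch runs on synthetic data, and the choice of 150 estimators is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

model = GradientBoostingRegressor(n_estimators=150, random_state=42)
model.fit(X_tr, y_tr)

# staged_predict yields the ensemble prediction after each boosting stage.
test_mse = np.array([mean_squared_error(y_te, y_stage)
                     for y_stage in model.staged_predict(X_te)])

best_iteration = int(np.argmin(test_mse)) + 1
print(f"Lowest test MSE at iteration {best_iteration} of {len(test_mse)}")
print(f"Test MSE at the final iteration: {test_mse[-1]:.2f}")
```

If the test curve flattens or rises while the training curve keeps falling, the extra iterations are mostly fitting noise.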

# <font color="red">Useful links</font>

- <a href="https://medium.com/towards-artificial-intelligence/calculating-simple-linear-regression-and-linear-best-fit-an-in-depth-tutorial-with-math-and-python-804a0cb23660">Calculating Simple Linear Regression and Linear Best Fit an In-depth Tutorial with Math and Python</a>
- <a href="https://scikit-learn.org/stable/tutorial/index.html">scikit-learn Tutorials</a>
- <a href="https://medium.com/@amitg0161/sklearn-linear-regression-tutorial-with-boston-house-dataset-cde74afd460a">Sklearn Linear Regression Tutorial with Boston House Dataset</a>
- <a href="https://www.dataquest.io/blog/sci-kit-learn-tutorial/">Scikit-learn Tutorial: Machine Learning in Python</a>
- <a href="https://davidburn.github.io/notebooks/mnist-numbers/MNIST%20Handwrititten%20numbers/">MNIST handwritten number identification</a>
- [K-Fold Cross-Validation in Python Using SKLearn](https://www.askpython.com/python/examples/k-fold-cross-validation)
- [Ames Housing Price Prediction Project](https://github.com/sinhasagar507/Ames-house-price-prediction) by Sagar Sinha.
- [Ames Housing Prediction](https://deepnote.com/app/suh-sean-8d86/Ames-Housing-Prediction-ca5b5a44-e02e-4bb5-9a81-8e2b89593d92) by Suh Sean.