<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Courses</font></h1>
</center>

---

<center>
    <h1><font color="red">Regression with Scikit-Learn</font></h1>
</center>

## Useful Links

- <a href="https://medium.com/towards-artificial-intelligence/calculating-simple-linear-regression-and-linear-best-fit-an-in-depth-tutorial-with-math-and-python-804a0cb23660">Calculating Simple Linear Regression and Linear Best Fit an In-depth Tutorial with Math and Python</a>
- <a href="https://scikit-learn.org/stable/tutorial/index.html">scikit-learn Tutorials</a>
- <a href="https://medium.com/@amitg0161/sklearn-linear-regression-tutorial-with-boston-house-dataset-cde74afd460a">Sklearn Linear Regression Tutorial with Boston House Dataset</a>
- <a href="https://www.dataquest.io/blog/sci-kit-learn-tutorial/">Scikit-learn Tutorial: Machine Learning in Python</a>
- <a href="https://debuggercafe.com/image-classification-with-mnist-dataset/">Image Classification with MNIST Dataset</a>
- <a href="https://davidburn.github.io/notebooks/mnist-numbers/MNIST%20Handwrititten%20numbers/">MNIST handwritten number identification</a>
- [K-Fold Cross-Validation in Python Using SKLearn](https://www.askpython.com/python/examples/k-fold-cross-validation)

# <font color="red">Scikit-Learn</font>

- Scikit-learn is a free machine learning library for Python. 
- Provides a selection of efficient tools for machine learning and statistical modeling including: 
     - **Classification:** Identifying which category an object belongs to. Example: Spam detection
     - **Regression:** Predicting a continuous variable based on relevant independent variables. Example: Stock price predictions
     - **Clustering:** Automatic grouping of similar objects into different clusters. Example: Customer segmentation 
     - **Dimensionality Reduction:** Reducing the number of input variables in the training data while preserving the salient relationships in the data. Example: Principal component analysis
- Features various algorithms such as support vector machines, random forests, and k-nearest neighbors.
- Is built on, and works with, Python numerical and scientific libraries like NumPy and SciPy.


Some popular groups of models provided by scikit-learn include:

- **Clustering:** Group unlabeled data, for example with KMeans.
- **Cross Validation:** Estimate the performance of supervised models on unseen data.
- **Datasets:** Provide test datasets and generate datasets with specific properties for investigating model behavior.
- **Dimensionality Reduction:** Reduce the number of attributes in data for summarization, visualization and feature selection, for example with Principal Component Analysis.
- **Ensemble Methods:** Combine the predictions of multiple supervised models.
- **Feature Extraction:** Define attributes in image and text data.
- **Feature Selection:** Identify meaningful attributes from which to create supervised models.
- **Parameter Tuning:** Get the most out of supervised models.
- **Manifold Learning:** Summarize and depict complex multi-dimensional data.
- **Supervised Models:** A vast array including, but not limited to, generalized linear models, discriminant analysis, naive Bayes, lazy methods, neural networks, support vector machines and decision trees.
- **Unsupervised Learning Algorithms:** They include clustering, factor analysis, PCA (Principal Component Analysis) and unsupervised neural networks.


![fig_sckl](https://ulhpc-tutorials.readthedocs.io/en/latest/python/advanced/scikit-learn/images/scikit.png)
Image Source: ulhpc-tutorials.readthedocs.io

## Package Requirements

- Numpy
- scipy
- matplotlib
- pandas
- scikit-learn
- seaborn

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
%matplotlib inline
import numpy as np
import scipy.stats as stats

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd

In [None]:
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn import metrics

In [None]:
print(f"Numpy version:        {np.__version__}")
print(f"Pandas version:       {pd.__version__}")
print(f"Seaborn version:      {sns.__version__}")
print(f"Scikit-Learn version: {sklearn.__version__}")

# <font color="blue">Numerical Data</font>

## <font color="red">Ames, Iowa Dataset</font>
- Contains information about different aspects of residential homes in Ames, Iowa.
- There are 1460 observations and 79 feature variables in this dataset.
- [Information on the dataset can be found here](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).

We want to predict the sale price of a house using the given features.

### Obtain the Dataset

In [None]:
ames_df = pd.read_csv('data/housing_data.csv')
ames_df

### Features of the Dataset and First Data Cleaning

In [None]:
ames_df.info()

- The target is the `SalePrice` represented in the last column
- 37 columns have numerical values
- 43 columns have `object` as data type. Are we going to use them for our analysis?
- There are many missing values. How are we going to treat them?
- From the data, the following columns have far fewer non-null values than the others (counts in parentheses) and may not be relevant for the model we want to build:
   - `MiscFeature` (54)
   - `Fence` (281)
   - `PoolQC` (7)
   - `Alley` (91) 
   
We can drop the four columns with a lot of missing values. We also drop the `Id` column.

In [None]:
dropped_cols = ['Id', 'MiscFeature', 'Fence', 'PoolQC', 'Alley']
ames_df.drop(dropped_cols, axis=1, inplace=True)
ames_df

**To facilitate the analysis, we are only going to consider columns with numerical values:**

In [None]:
ames_df_num = ames_df.select_dtypes(include=['float64', 'int64'])
ames_df_num

In [None]:
feature_names = list(ames_df_num.columns)
feature_names.pop(-1)
feature_names

## <font color="red">Exploratory Data Analysis</font>

- Important step before training the model. 
- We use statistical analysis and visualizations to understand the relationship of the target variable with other features.

#### Obtain basic statistics on the data

In [None]:
ames_df_num

In [None]:
ames_df_num.describe().transpose()

- The average sale price of a house in our dataset is close to $\$180,921$, with the middle 50% of the values (25th to 75th percentile) falling within the $\$129,975$ to $\$214,000$ range.
- The sale price standard deviation of $\$79,442$ indicates a large spread of the sale price and suggests the existence of outliers.
- There might be many missing values in `LotFrontage` (Linear feet of street connected to property). Do we need to keep this column?

#### Check Missing Values
It is a good practice to see if there are any missing values in the data. 

Count the number of missing values for each feature

In [None]:
ames_df_num.isnull().sum()

We can also determine the percentage of missing values in each column:

In [None]:
total = ames_df_num.isnull().sum().sort_values(ascending=False)
# Fraction of missing entries per column, scaled to a percentage
percent = (ames_df_num.isnull().mean() * 100).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, 
                         keys=['Total', 'Percent'])
missing_data.head(ames_df_num.shape[1])

What are we going to do with the missing values in the following columns?
- `LotFrontage` (259): Linear feet of street connected to property
- `GarageYrBlt` (81): Year garage was built
- `MasVnrArea` (8): Masonry veneer area in square feet

**We choose to drop the rows with missing values.**

In [None]:
ames_df_num.shape

In [None]:
ames_df_nonan = ames_df_num.dropna()

In [None]:
ames_df_nonan.shape
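
Dropping rows is the simplest option, but it discards every house that lacks even one value. As a hedged alternative sketch (not what this notebook does), the toy frame below contrasts `dropna` with filling each gap by the column median. The column names mirror the dataset, but the values and the `toy_*` names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: values are invented for illustration only (NOT the Ames data).
toy_df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "MasVnrArea":  [196.0, 0.0, np.nan, 162.0],
    "SalePrice":   [208500, 181500, 223500, 140000],
})

# Option 1 (used in this notebook): drop any row with a missing value.
toy_dropped = toy_df.dropna()

# Option 2 (alternative): replace each missing value by its column median.
toy_filled = toy_df.fillna(toy_df.median())

print(toy_dropped.shape)  # rows containing NaN are gone
print(toy_filled.shape)   # all rows are kept
```

Which option is preferable depends on how many rows would be lost and on whether the missing values carry meaning of their own (for instance, a house without a garage has no garage construction year).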

#### Distribution of the target variable

In [None]:
plt.figure(figsize=(8, 6));
plt.hist(ames_df_nonan['SalePrice']);
plt.title('Ames Housing Prices and Count Histogram');
plt.xlabel('Sale Price');
plt.ylabel('Count');

In [None]:
plt.figure(figsize=(8, 6));
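# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(ames_df_nonan['SalePrice'], kde=True);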

From the above output we can see that the values of `SalePrice` are skewed to the right (there is a long tail of expensive houses) and have some outliers.
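
The direction of the skew can also be checked numerically rather than read off the plot. The sketch below is an illustration on synthetic log-normal values, not the Ames data, and `sample_skewness` is a helper introduced here. A positive value means a longer right tail, and taking the logarithm is one common way to reduce such skew.

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment: a value > 0 means a longer right tail."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

# Synthetic, log-normally distributed "prices" (NOT the Ames data).
rng = np.random.default_rng(0)
synthetic_prices = rng.lognormal(mean=12.0, sigma=0.4, size=5000)

skew_raw = sample_skewness(synthetic_prices)
skew_log = sample_skewness(np.log(synthetic_prices))

print(f"skewness of raw prices: {skew_raw:.2f}")
print(f"skewness of log prices: {skew_log:.2f}")
```

On the real data one would pass `ames_df_nonan['SalePrice']` to the helper instead of the synthetic sample.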

#### Heatmap: Two-Dimensional Graphical Representation
- Represent the individual values that are contained in a matrix as colors.
- Create a correlation matrix that measures the linear relationships between the variables.

In [None]:
plt.figure(figsize=(22, 11));
correlation_matrix = ames_df_nonan.corr().round(1);
sns.heatmap(correlation_matrix, cmap="YlGnBu", annot=True);

You may choose to display only the correlations that satisfy specific conditions:

In [None]:
plt.figure(figsize=(22, 11));
sns.heatmap(correlation_matrix[(correlation_matrix >= 0.5) | 
                               (correlation_matrix <= -0.4)], 
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 8}, square=True);

- **OverallQual** and **GrLivArea** have a strong positive correlation with **SalePrice** (0.8 and 0.7 respectively).
- The features **GrLivArea** & **2ndFlrSF**, **BsmtFullBath** & **BsmtFinSF1** and **TotRmsAbvGrd** & **BedroomAbvGrd** have a correlation of at least 0.7. These feature pairs are strongly correlated with each other (multicollinearity), which can make the coefficients of a linear model unstable.
- The predictor variables such as **1stFlrSF**, **TotalBsmtSF**, **GarageArea**, **GarageCars**, etc., have a fairly strong positive correlation with the target: houses with larger values of these features tend to sell for more.
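
Reading strongly correlated pairs off a large heatmap is error-prone, so they can also be listed programmatically. The following is a minimal sketch on synthetic columns (`a`, `b`, `c` are invented, and the `0.7` cut-off simply mirrors the value mentioned in the list above); with the real data one would start from `ames_df_nonan.corr()`.

```python
import numpy as np
import pandas as pd

# Synthetic features (NOT the Ames data): b is nearly a copy of a, c is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()
threshold = 0.7  # assumed cut-off for a "strong" correlation

# Look only above the diagonal so that each pair is reported once.
strong_pairs = [
    (corr.index[i], corr.columns[j], round(corr.iloc[i, j], 2))
    for i in range(len(corr))
    for j in range(i + 1, len(corr))
    if abs(corr.iloc[i, j]) >= threshold
]
print(strong_pairs)
```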

In [None]:
for feature_name in feature_names:
    plt.figure(figsize=(5, 4));
    plt.scatter(ames_df_nonan[feature_name], ames_df_nonan['SalePrice']);
    plt.ylabel('Sale Price', size=12);
    plt.xlabel(feature_name, size=12);
plt.show();

- The sale price increases roughly linearly as the value of **GrLivArea** increases.
- There are a few outliers.

Based on the above observations we will plot an `lmplot` between **GrLivArea** and **SalePrice** to see the relationship between the two more clearly.

In [None]:
sns.lmplot(x = 'GrLivArea', y = 'SalePrice', data = ames_df_nonan);

## <font color="blue">Model Selection Process</font>

![fig_skl](https://miro.medium.com/max/1400/1*LixatBxkewppAhv1Mm5H2w.jpeg)
Image Source: Christophe Bourguignat

- A Machine Learning algorithm needs to be trained on a set of data to learn the relationships between different features and how these features affect the target variable. 
- We need to divide the entire data set into two sets:
    + Training set on which we are going to train our algorithm to build a model. 
    + Testing set on which we will test our model to see how accurate its predictions are.
    
Before we create the two sets, we need to identify the algorithm we will use for our model.
We can use the `machine_learning_map` (shown at the top of this section) as a cheat sheet to shortlist the algorithms that we can try out to build our prediction model. Using the checklist, let’s see which category our current dataset falls into:
- We have **1121** samples: >50? (**Yes**)
- Are we predicting a category? (**No**)
- Are we predicting a quantity? (**Yes**)

Based on the checklist that we prepared above and going by the `machine_learning_map` we can try out **regression methods** such as:

- Linear Regression 
- Lasso
- ElasticNet Regression
- Ridge Regression
- K Neighbors Regressor
- Decision Tree Regressor
- Support Vector Regression (SVR)
- Ada Boost Regressor
- Gradient Boosting Regressor
- Random Forest Regression
- Extra Trees Regressor

Check the following documents on regression: 
<a href="https://scikit-learn.org/stable/supervised_learning.html">Supervised learning--scikit-learn</a>,
<a href="https://developer.ibm.com/technologies/data-science/tutorials/learn-regression-algorithms-using-python-and-scikit-learn/">Learn regression algorithms using Python and scikit-learn</a>,
<a href="https://www.pluralsight.com/guides/non-linear-">Non-Linear Regression Trees with scikit-learn</a>.

## <font color="red">Simple Linear Model</font>
- It is difficult to visualize the multiple features.
- We want to predict the house price with just one variable and then move to the regression with all features.
- Because **GrLivArea** shows positive correlation with **SalePrice**, we will use **GrLivArea** for the model.
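
With a single feature, the line that `LinearRegression` fits has a closed-form least-squares solution. The sketch below computes it by hand on noise-free toy data (the `toy_x` and `toy_y` arrays are invented here and lie exactly on $y = 1 + 2x$), so the recovered slope and intercept are known in advance.

```python
import numpy as np

# Noise-free toy data lying exactly on y = 1 + 2x (NOT the Ames data).
toy_x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
toy_y = 1.0 + 2.0 * toy_x

# Ordinary least squares for one feature:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
x_centered = toy_x - toy_x.mean()
y_centered = toy_y - toy_y.mean()
slope = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)
intercept = toy_y.mean() - slope * toy_x.mean()

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

For a fitted scikit-learn model the same two quantities are exposed as `coef_` and `intercept_`.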

In [None]:
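# Note: despite the "garage" in its name, X_garage holds GrLivArea,
# the above-ground living area in square feet.
X_garage = ames_df_nonan.GrLivArea
y_price = ames_df_nonan.SalePrice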


X_garage = np.array(X_garage).reshape(-1,1)
y_price = np.array(y_price).reshape(-1,1)

print(X_garage.shape)
print(y_price.shape)

#### Splitting the data into training and testing sets
- We split the data into training and testing sets. 
- We train the model with 80% of the samples and test with the remaining 20%. 
- We do this to assess the model’s performance on unseen data.

In [None]:
X_train_1, X_test_1, Y_train_1, Y_test_1 = \
             train_test_split(X_garage, y_price, 
                              test_size = 0.2, random_state=5)

print(X_train_1.shape)
print(Y_train_1.shape)
print(X_test_1.shape)
print(Y_test_1.shape)

#### Training and testing the model
- We use scikit-learn’s `LinearRegression` to train our model on the training set and then check it on the test set.
- We first check the model performance on the training set.

In [None]:
reg_1 = LinearRegression()
reg_1.fit(X_train_1, Y_train_1)

y_train_predict_1 = reg_1.predict(X_train_1)
rmse = (np.sqrt(metrics.mean_squared_error(Y_train_1, y_train_predict_1)))
r2 = round(reg_1.score(X_train_1, Y_train_1),2)

print(f"The model performance for training set")
print(f"--------------------------------------")
print(f'RMSE is {rmse}')
print(f'R2 score is {r2}')

#### Model Evaluation for Test Set

In [None]:
y_pred_1 = reg_1.predict(X_test_1)
rmse = (np.sqrt(metrics.mean_squared_error(Y_test_1, y_pred_1)))
r2 = round(reg_1.score(X_test_1, Y_test_1),2)

print(f"The model performance for test set")
print(f"--------------------------------------")
print(f"Root Mean Squared Error: {rmse}")
print(f"R^2: {r2}")

The coefficient of determination: 1 is perfect prediction

In [None]:
print(f'Coefficient of determination: {metrics.r2_score(Y_test_1, y_pred_1) :.4f}')
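
Both metrics reported above are simple functions of the residuals. As a hedged illustration on four invented values (not the model's actual predictions), RMSE is the square root of the mean squared residual, and $R^2 = 1 - SS_{res}/SS_{tot}$ compares the model with always predicting the mean.

```python
import numpy as np

# Toy values, invented for illustration only.
actual    = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.0, 5.0, 8.0, 9.0])

residuals = actual - predicted

# Root mean squared error: typical size of a prediction error, in target units.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.4f}")  # RMSE: 0.7071
print(f"R^2:  {r2_manual:.4f}")    # R^2:  0.9000
```

An $R^2$ of 1 means perfect prediction, 0 means no better than predicting the mean, and negative values are possible on a test set when the model does worse than that baseline.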

#### 45-Degree Plot

In [None]:
plt.figure(figsize=(8, 5));
plt.scatter(Y_test_1, y_pred_1);
plt.plot(y_price, y_price, '--k');
plt.axis('tight');
plt.xlabel("Actual Sale Prices");
plt.ylabel("Predicted House Prices");
#plt.xticks(range(0, int(max(y_test)),2));
#plt.yticks(range(0, int(max(y_test)),2));
plt.title("Actual Prices vs Predicted prices");
plt.tight_layout();

## <font color="red">Linear Regression Model with All Variables</font>
- We want to create a model considering all the features in the dataset.

#### Create the Model

In [None]:
X = ames_df_nonan.drop('SalePrice', axis = 1)
y = ames_df_nonan['SalePrice']

- Use the `train_test_split` to split the data into random train and test subsets.
- Every time you run it without specifying `random_state`, you will get a different split.
- If you use `random_state=some_number`, the split will always be the same.
- The particular value of `random_state` does not matter (42, 0, 21, ...); what matters is that it is fixed.
- This is useful if you want reproducible results.
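
The effect of `random_state` is easy to verify on a small example. The sketch below uses invented arrays (`toy_X`, `toy_y`) and splits them twice with the same seed; the two test sets should then be identical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each (NOT the Ames data).
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

# Splitting twice with the same seed yields identical partitions.
_, test_a, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)
_, test_b, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

same_split = np.array_equal(test_a, test_b)
print(same_split, test_a.shape)
```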

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=42)

The linear regression model:

In [None]:
reg_all = LinearRegression()
reg_all.fit(X_train, y_train)

#### Model Evaluation for Training Set

In [None]:
y_train_predict = reg_all.predict(X_train)

In [None]:
rmse = (np.sqrt(metrics.mean_squared_error(y_train, y_train_predict)))
r2 = round(reg_all.score(X_train, y_train),2)

print(f"The model performance for training set")
print(f"--------------------------------------")
print(f'RMSE is {rmse}')
print(f'R2 score is {r2}')

#### Model Evaluation for Test Set

In [None]:
y_pred = reg_all.predict(X_test)

In [None]:
rmse = (np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
r2 = round(reg_all.score(X_test, y_test),2)

print(f"The model performance for test set")
print(f"--------------------------------------")
print(f"Root Mean Squared Error: {rmse}")
print(f"R^2: {r2}")

The coefficient of determination: 1 is perfect prediction

In [None]:
print(f'Coefficient of determination: {metrics.r2_score(y_test, y_pred) :.4f}')

#### Error Distribution

In [None]:
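# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(y_test - y_pred, kde=True);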

#### 45-Degree Plot

In [None]:
plt.figure(figsize=(8, 5));
plt.scatter(y_test, y_pred);
plt.plot(y, y, '--k');
plt.axis('tight');
plt.xlabel("Actual House Prices (USD)");
plt.ylabel("Predicted House Prices (USD)");
#plt.xticks(range(0, int(max(y_test)),2));
#plt.yticks(range(0, int(max(y_test)),2));
plt.title("Actual Prices vs Predicted prices");
plt.tight_layout();

In [None]:
print("RMS: %r " % np.sqrt(np.mean((y_test - y_pred) ** 2)))

In [None]:
df1 = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df2 = df1.head(10)
df2

In [None]:
df2.plot(kind='bar');

## <font color="red">Choosing the Best Model:</font> k-Fold Cross-Validation

- Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
- It is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data.
- We use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.
- The biggest advantage of this method is that every data point is used for validation exactly once and for training `k-1` times.
- To choose the final model to use, **we select the one that has the lowest validation error**.

The general procedure is as follows:

1. Shuffle the dataset randomly.
2. Split the dataset into `k` groups
3. For each unique group:
    - 3.1 Take the group as a hold-out (test) data set.
    - 3.2 Take the remaining `k-1` groups as a training data set.
    - 3.3 Fit a model on the training set and evaluate it on the test set.
    - 3.4 Retain the evaluation score and discard the model.
4. Summarize the skill of the model using the sample of model evaluation scores
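
The procedure above can be sketched by hand in a few lines. This is an illustrative implementation on synthetic data, not the code scikit-learn runs internally; `k_fold_indices` and the `toy_*` arrays are names introduced here, and a straight-line fit via `np.polyfit` stands in for the model.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold splitting."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)   # shuffle the sample indices
    folds = np.array_split(shuffled, k)     # split them into k groups
    for i in range(k):                      # each group is held out once
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Synthetic linear data (NOT the Ames data): y = 3x + small noise.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(50, 1))
toy_y = 3.0 * toy_X[:, 0] + rng.normal(scale=0.1, size=50)

fold_mse = []
for train_idx, test_idx in k_fold_indices(len(toy_y), k=5):
    # Fit a straight line on the training groups ...
    slope, intercept = np.polyfit(toy_X[train_idx, 0], toy_y[train_idx], deg=1)
    # ... evaluate it on the held-out group, keep the score, discard the fit.
    pred = slope * toy_X[test_idx, 0] + intercept
    fold_mse.append(np.mean((toy_y[test_idx] - pred) ** 2))

# Summarize the skill of the model over all folds.
print(f"MSE: {np.mean(fold_mse):.4f} +/- {np.std(fold_mse):.4f}")
```

In practice `KFold` together with `cross_val_score` performs these steps, so the manual loop is mainly useful for understanding what is being averaged.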

How to choose **k**?
- A poorly chosen value for **k** may give a misleading idea of the skill of the model, such as a score with a high variance or a high bias.
- The choice of **k** is usually 5 or 10, but there is no formal rule. As **k** gets larger, the difference in size between the training set and the resampling subsets gets smaller. As this difference decreases, the bias of the technique becomes smaller.
- A value of **k=10** is very common in the field of applied machine learning, and is recommended if you are struggling to choose a value for your dataset.

Below is the visualization of a k-fold validation when k=5.
![FIG_kFold](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)
Image Source: https://scikit-learn.org/



In [None]:
# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Ridge
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor

# user variables to tune
seed    = 9
folds   = 10
metric  = "neg_mean_squared_error"

# hold different regression models in a single dictionary
models = dict()
models["Linear"]        = LinearRegression()
models["Lasso"]         = Lasso()
models["ElasticNet"]    = ElasticNet()
models["Ridge"]         = Ridge()
models["BayesianRidge"] = BayesianRidge()
models["KNN"]           = KNeighborsRegressor()
models["DecisionTree"]  = DecisionTreeRegressor()
models["SVR"]           = SVR()
models["AdaBoost"]      = AdaBoostRegressor()
models["GradientBoost"] = GradientBoostingRegressor()
models["RandomForest"]  = RandomForestRegressor()

# 10-fold cross validation for each model
model_results = list()
model_names   = list()
for model_name in models:
    model   = models[model_name]
    k_fold  = KFold(n_splits=folds, random_state=seed, shuffle=True)
    results = cross_val_score(model, X_train, y_train, cv=k_fold, scoring=metric)
    
    model_results.append(results)
    model_names.append(model_name)
    print("{:>20}: {:16.2f}, {:15.2f}".format(model_name, round(results.mean(), 3), 
                                  round(results.std(), 3)))

# box-whisker plot to compare regression models
figure = plt.figure(figsize=(12, 9));
figure.suptitle('Regression models comparison');
ax = figure.add_subplot(111);
plt.boxplot(model_results);
ax.set_xticklabels(model_names, rotation = 45, ha="right");
ax.set_ylabel("Negative Mean Squared Error (closer to 0 is better)");
plt.margins(0.05, 0.1);
#plt.savefig("model_mse_scores.png")
plt.show();

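**Based on the above comparison, we can see that the `Gradient Boosting Regression` model outperforms all the other regression models:** its mean score is the closest to zero, i.e. it has the smallest mean squared error (recall that `neg_mean_squared_error` reports the MSE with a negative sign).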
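
Because the scoring metric is `neg_mean_squared_error`, every value returned by `cross_val_score` is a negated MSE in squared dollars, which is hard to interpret. A small sketch of the conversion back to an RMSE in dollars follows; the three fold scores are hypothetical numbers chosen for illustration, not results from this notebook.

```python
import numpy as np

# Hypothetical fold scores, as returned with scoring="neg_mean_squared_error".
neg_mse_scores = np.array([-4.0e8, -9.0e8, -6.25e8])

mse_per_fold = -neg_mse_scores         # flip the sign to recover the ordinary MSE
rmse_per_fold = np.sqrt(mse_per_fold)  # square root brings the unit back to dollars

print(rmse_per_fold)
print(f"mean RMSE: {rmse_per_fold.mean():.0f}")
```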

## <font color="red">Model with Gradient Boosting Regression</font>


In [None]:
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)

gbr_predicted = gbr.predict(X_test)
gbr_expected = y_test

**Root Mean Square Error:**

In [None]:
print("RMS: %r " % np.sqrt(np.mean((gbr_predicted - gbr_expected) ** 2)))

**The coefficient of determination**: (1 is perfect prediction)

In [None]:
print('Coeff of determination: {:.4f}'.format(metrics.r2_score(gbr_expected, gbr_predicted)))

#### Error Distribution

In [None]:
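# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(gbr_expected - gbr_predicted, kde=True);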

#### 45-Degree Plot

In [None]:
plt.figure(figsize=(8, 5));
plt.scatter(gbr_expected, gbr_predicted)
plt.plot(y, y, '--k');
plt.axis('tight');
plt.xlabel('True price (USD)');
plt.ylabel('Predicted price (USD)');
plt.tight_layout();

**Zoom in:**

In [None]:
df1 = pd.DataFrame({'Actual': gbr_expected, 'Predicted': gbr_predicted})
df2 = df1.head(10)
df2

In [None]:
df2.plot(kind='bar');

#### Feature Importance
- Once we have a trained model, we can inspect its feature importance (or variable importance), which tells us how much each feature contributes to predicting the target.

In [None]:
plt.figure(figsize=(20, 11));

feature_importance = gbr.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())

sorted_idx = np.argsort(feature_importance)
pos        = np.arange(sorted_idx.shape[0]) + .5

np_feature_names = np.array(feature_names)
plt.barh(pos, feature_importance[sorted_idx], align='center');
plt.yticks(pos, np_feature_names[sorted_idx]);
plt.xlabel('Relative Importance');
plt.title('Variable Importance');
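
To make the meaning of these numbers concrete, the sketch below fits a gradient boosting model on synthetic data in which only the first feature matters strongly. The `toy_*` names and the data-generating formula are assumptions made for this illustration. The impurity-based `feature_importances_` sum to 1, and rescaling by the maximum gives the 0 to 100 scale used in the plot above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (NOT the Ames data): feature 0 dominates, feature 2 is pure noise.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(300, 3))
toy_y = 5.0 * toy_X[:, 0] + 0.5 * toy_X[:, 1] + rng.normal(scale=0.1, size=300)

toy_gbr = GradientBoostingRegressor(random_state=0).fit(toy_X, toy_y)

importances = toy_gbr.feature_importances_          # normalized to sum to 1
relative = 100.0 * importances / importances.max()  # most important feature = 100
ranking = np.argsort(importances)[::-1]             # feature indices, most important first

print(np.round(relative, 1))
print(ranking)
```

Impurity-based importances can be misleading for strongly correlated features, so they are best read as a rough ranking rather than an exact measure.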

**Plot training deviance:**

In [None]:
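# Number of boosting stages actually fitted (100 with the default settings).
n_estimators = len(gbr.train_score_)
# Compute the test-set deviance at every boosting stage.
# `gbr.loss_` was removed in recent scikit-learn releases, so we evaluate the
# mean squared error directly; it corresponds to the default squared-error loss
# (possibly up to a constant factor, depending on the scikit-learn version).
test_score = np.zeros((n_estimators,), dtype=np.float64)

for i, y_staged in enumerate(gbr.staged_predict(X_test)):
    test_score[i] = metrics.mean_squared_error(gbr_expected, y_staged)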

plt.figure(figsize=(12, 6));
plt.subplot(1, 1, 1);
plt.title('Deviance');
plt.plot(np.arange(n_estimators) + 1, 
         gbr.train_score_, 'b-',
         label='Training Set Deviance');
plt.plot(np.arange(n_estimators) + 1, 
         test_score, 'r-',
         label='Test Set Deviance');
plt.legend(loc='upper right');
plt.xlabel('Boosting Iterations');
plt.ylabel('Deviance');
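
A deviance curve like the one above can also be used to decide how many boosting stages are worth keeping. The following sketch, on synthetic data with invented names (`toy_*`, `X_tr`, `X_te`, ...), evaluates the test error after every stage with `staged_predict` and reports the stage at which it is lowest.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic nonlinear data (NOT the Ames data).
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(400, 4))
toy_y = toy_X[:, 0] ** 2 + toy_X[:, 1] + rng.normal(scale=0.3, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.25, random_state=0)

toy_gbr = GradientBoostingRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# staged_predict yields the ensemble prediction after 1, 2, ..., n_estimators trees.
staged_mse = np.array([mean_squared_error(y_te, y_stage)
                       for y_stage in toy_gbr.staged_predict(X_te)])

best_n = int(np.argmin(staged_mse)) + 1  # stages are counted from 1
print(f"lowest test MSE {staged_mse[best_n - 1]:.3f} after {best_n} stages")
```

If the test curve starts rising while the training curve keeps falling, the model is overfitting; in that case a smaller `n_estimators` (or a lower `learning_rate`) is worth trying.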