<font color="white">.</font> | <font color="white">.</font> | <font color="white">.</font>
-- | -- | --
![NASA](http://www.nasa.gov/sites/all/themes/custom/nasatwo/images/nasa-logo.svg) | <h1><font size="+3">ASTG Python Courses</font></h1> | ![NASA](https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png)

---

<center>
    <h1><font color="red">Machine Learning with Scikit-Learn</font></h1>
</center>


## Useful Links

- <a href="https://medium.com/towards-artificial-intelligence/calculating-simple-linear-regression-and-linear-best-fit-an-in-depth-tutorial-with-math-and-python-804a0cb23660">Calculating Simple Linear Regression and Linear Best Fit an In-depth Tutorial with Math and Python</a>
- <a href="https://scikit-learn.org/stable/tutorial/index.html">scikit-learn Tutorials</a>
- <a href="https://medium.com/@amitg0161/sklearn-linear-regression-tutorial-with-boston-house-dataset-cde74afd460a">Sklearn Linear Regression Tutorial with Boston House Dataset</a>
- <a href="https://www.dataquest.io/blog/sci-kit-learn-tutorial/">Scikit-learn Tutorial: Machine Learning in Python</a>
- <a href="https://debuggercafe.com/image-classification-with-mnist-dataset/">Image Classification with MNIST Dataset</a>
- <a href="https://davidburn.github.io/notebooks/mnist-numbers/MNIST%20Handwrititten%20numbers/">MNIST handwritten number identification</a>

# <font color="red">Scikit-Learn</font>

- Scikit-learn is a free, open-source machine learning library for Python.
- It provides efficient tools for machine learning and statistical modeling, including:
     - **Classification:** Identifying which category an object belongs to. Example: spam detection.
     - **Regression:** Predicting a continuous variable from relevant independent variables. Example: stock price prediction.
     - **Clustering:** Automatically grouping similar objects into clusters. Example: customer segmentation.
     - **Dimensionality Reduction:** Reducing the number of input variables in the training data while preserving its salient relationships.
- It features algorithms such as support vector machines, random forests, and k-nearest neighbors.
- It is built on, and interoperates with, Python's numerical and scientific libraries such as NumPy and SciPy.


Some popular groups of tools provided by scikit-learn include:

- **Clustering:** Group unlabeled data, for example with KMeans.
- **Cross Validation:** Estimate the performance of supervised models on unseen data.
- **Datasets:** Provide test datasets and generate datasets with specific properties for investigating model behavior.
- **Dimensionality Reduction:** Reduce the number of attributes in data for summarization, visualization, and feature selection, for example with Principal Component Analysis (PCA).
- **Ensemble Methods:** Combine the predictions of multiple supervised models.
- **Feature Extraction:** Define attributes in image and text data.
- **Feature Selection:** Identify meaningful attributes from which to create supervised models.
- **Parameter Tuning:** Get the most out of supervised models.
- **Manifold Learning:** Summarize and depict complex multi-dimensional data.
- **Supervised Models:** A vast array including, but not limited to, generalized linear models, discriminant analysis, naive Bayes, lazy methods, neural networks, support vector machines, and decision trees.
- **Unsupervised Learning Algorithms:** These include clustering, factor analysis, PCA, and unsupervised neural networks.


![fig_sckl](https://ulhpc-tutorials.readthedocs.io/en/latest/python/advanced/scikit-learn/images/scikit.png)
Image Source: ulhpc-tutorials.readthedocs.io

## Package Requirements

- NumPy
- SciPy
- matplotlib
- pandas
- scikit-learn
- seaborn

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
%matplotlib inline
import numpy as np
import scipy.stats as stats

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd

In [None]:
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn import metrics

In [None]:
print(f"Numpy version:        {np.__version__}")
print(f"Pandas version:       {pd.__version__}")
print(f"Seaborn version:      {sns.__version__}")
print(f"Scikit-Learn version: {sklearn.__version__}")

# <font color="blue">Numerical Data</font>

## <font color="red">Diabetes Dataset</font>
- Contains information about the health data of diabetes patients.
- There are 442 samples and 10 feature variables in this dataset. 
- The data have been modified (so that they contain only numerical values) and normalized (to have values between -0.2 and 0.2).

We want to predict the diabetes progression (values between 25 and 346) one year after the baseline, given the features. 
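
As a quick sanity check of the figures quoted above, the following sketch loads the dataset and prints its shape and value ranges. The names `X_check` and `y_check` are illustrative and are not used elsewhere in this notebook.

```python
from sklearn.datasets import load_diabetes

# return_X_y=True returns the feature matrix and the target vector directly
X_check, y_check = load_diabetes(return_X_y=True)

print(f"Samples, features: {X_check.shape}")
print(f"Feature range:     [{X_check.min():.3f}, {X_check.max():.3f}]")
print(f"Target range:      [{y_check.min():.0f}, {y_check.max():.0f}]")
```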

### Obtain the Dataset

In [None]:
from sklearn.datasets import load_diabetes
diabetes_data = load_diabetes()

In [None]:
print(diabetes_data.DESCR)

### Features of the Dataset

In [None]:
print(f"Keys: {diabetes_data.keys()}")

In [None]:
print(f"Shape: {diabetes_data.data.shape}")

In [None]:
print(f"Feature Names: {diabetes_data.feature_names}")

#### Attribute Information:

| Acronym | Description |
| --- | --- |
| **age** | age in years |
| **sex** | male or female |
| **bmi** | body mass index |
| **bp** | average blood pressure |
| **s1** | tc, total serum cholesterol |
| **s2** | ldl, low-density lipoproteins |
| **s3** | hdl, high-density lipoproteins |
| **s4** | tch, total cholesterol / HDL |
| **s5** | ltg, possibly log of serum triglycerides level |
| **s6** | glu, blood sugar level |



## <font color="red">Extract Data</font>

**Pass the data into a Pandas dataframe**

In [None]:
diabetes_df = pd.DataFrame(diabetes_data.data)
diabetes_df.head()

#### Relabel the columns using the diabetes dataset feature names

In [None]:
diabetes_df.columns = diabetes_data.feature_names
diabetes_df.head()

#### Add Diabetes `progression` data  to the Pandas DataFrame

In [None]:
diabetes_data.target[:5]

In [None]:
print(f"Shape of the target data:  {diabetes_data.target.shape}")

In [None]:
diabetes_df['diabetes_progression']=diabetes_data.target
diabetes_df.head()

Check the types of features:

In [None]:
diabetes_df.dtypes

## <font color="red">Exploratory Data Analysis</font>

- Important step before training the model. 
- We use statistical analysis and visualizations to understand the relationship of the target variable with other features.

#### Check Missing Values
It is a good practice to see if there are any missing values in the data. 

Count the number of missing values for each feature

In [None]:
diabetes_df.isnull().sum()

#### Number of unique elements in each column

In [None]:
diabetes_df.nunique()

#### Obtain basic statistics on the data

In [None]:
diabetes_df

In [None]:
diabetes_df.describe().transpose()

#### Distribution of the target variable

In [None]:
plt.figure(figsize=(8, 6));
plt.hist(diabetes_df['diabetes_progression']);
plt.title('Diabetes Progression and Count Histogram');
plt.xlabel('Diabetes Progression');
plt.ylabel('count');
plt.show();

In [None]:
plt.figure(figsize=(8, 6));
# sns.distplot is deprecated; histplot with kde=True draws a comparable plot
sns.histplot(diabetes_df['diabetes_progression'], kde=True);

#### Heatmap: Two-Dimensional Graphical Representation
- Represent the individual values that are contained in a matrix as colors.
- Create a correlation matrix that measures the linear relationships between the variables.

In [None]:
plt.figure(figsize=(12, 9));
correlation_matrix = diabetes_df.corr().round(2);
sns.heatmap(correlation_matrix, cmap="YlGnBu", annot=True);

In [None]:
sns.heatmap(correlation_matrix[(correlation_matrix >= 0.5) | (correlation_matrix <= -0.4)], 
            cmap='YlGnBu', annot=True) 

- **bmi** (0.59) and **s5** (0.57) have a moderate positive correlation with **diabetes_progression**.
- The features **s1** and **s2** have a correlation of 0.9, so they are strongly correlated with each other. Such collinearity can affect the model. The same goes for **s3** and **s4**, which have a correlation of -0.74.
- The predictor variable **s3** is negatively correlated with the target: an increase in **s3** is associated with a decrease in the progression of the disease.
- Predictor variables such as **bmi**, **bp**, **s4**, and **s5** have a good positive correlation with the target: an increase in any of them is associated with an increase in the progression of the disease.
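
The entries of the correlation matrix are Pearson correlation coefficients, $r = \mathrm{cov}(x, y) / (\sigma_x \sigma_y)$. As a minimal sketch, the coefficient between **bmi** and the target can be recomputed by hand and compared with NumPy's `np.corrcoef`. The names `data`, `bmi_col`, `r_manual`, and `r_numpy` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()
bmi_col = data.data[:, data.feature_names.index("bmi")]
target = data.target

# Pearson r = covariance / (std_x * std_y), using population statistics
cov = np.mean((bmi_col - bmi_col.mean()) * (target - target.mean()))
r_manual = cov / (bmi_col.std() * target.std())

# Reference value from NumPy
r_numpy = np.corrcoef(bmi_col, target)[0, 1]
print(f"manual r = {r_manual:.2f}, np.corrcoef r = {r_numpy:.2f}")
```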

In [None]:
for feature_name in diabetes_data.feature_names:
    plt.figure(figsize=(5, 4));
    plt.scatter(diabetes_df[feature_name], 
                diabetes_df['diabetes_progression']);
    plt.ylabel('Diabetes Progression', size=12);
    plt.xlabel(feature_name, size=12);
plt.show();

- The progression of the disease appears to increase as the value of **bmi** increases. 
- The same can be said with **bp**.

Based on the above observations we will plot an `lmplot` between **bmi** and **diabetes_progression** to see the relationship between the two more clearly.

In [None]:
sns.lmplot(x = 'bmi', y = 'diabetes_progression', data = diabetes_df);

## <font color="blue">Model Selection Process</font>

![fig_skl](https://miro.medium.com/max/1400/1*LixatBxkewppAhv1Mm5H2w.jpeg)
Image Source: Christophe Bourguignat

- A Machine Learning algorithm needs to be trained on a set of data to learn the relationships between different features and how these features affect the target variable. 
- We need to divide the entire data set into two sets:
    + Training set on which we are going to train our algorithm to build a model. 
    + Testing set on which we will test our model to see how accurate its predictions are.
    
Before we create the two sets, we need to identify the algorithm we will use for our model.
We can use the scikit-learn `machine_learning_map` (the flowchart shown near the top of this page) as a cheat sheet to shortlist the algorithms that we can try out to build our prediction model. Working through its checklist, let's see which category our current dataset falls into:
- We have 442 samples: >50? (**Yes**)
- Are we predicting a category? (**No**)
- Are we predicting a quantity? (**Yes**)

Based on the checklist that we prepared above and going by the `machine_learning_map` we can try out **regression methods** such as:

- Linear Regression 
- Lasso
- ElasticNet Regression
- Ridge Regression
- K Neighbors Regressor
- Decision Tree Regressor
- Support Vector Regression (SVR)
- Ada Boost Regressor
- Gradient Boosting Regressor
- Random Forest Regression
- Extra Trees Regressor

Check the following documents on regression: 
<a href="https://scikit-learn.org/stable/supervised_learning.html">Supervised learning--scikit-learn</a>,
<a href="https://developer.ibm.com/technologies/data-science/tutorials/learn-regression-algorithms-using-python-and-scikit-learn/">Learn regression algorithms using Python and scikit-learn</a>,
<a href="https://www.pluralsight.com/guides/non-linear-">Non-Linear Regression Trees with scikit-learn</a>.

## <font color="red">Simple Linear Model</font>
- It is difficult to visualize the multiple features.
- We want to predict the diabetes progression with just one variable and then move to the regression with all features.
- Because **bmi** shows positive correlation with the **diabetes_progression**, we will use **bmi** for the model.

In [None]:
X_bmi = diabetes_df.bmi
y_dis = diabetes_df.diabetes_progression

X_bmi = np.array(X_bmi).reshape(-1,1)
y_dis = np.array(y_dis).reshape(-1,1)

print(X_bmi.shape)
print(y_dis.shape)

#### Splitting the data into training and testing sets
- We split the data into training and testing sets. 
- We train the model with 80% of the samples and test with the remaining 20%. 
- We do this to assess the model’s performance on unseen data.

In [None]:
X_train_1, X_test_1, Y_train_1, Y_test_1 = \
        train_test_split(X_bmi, y_dis, test_size = 0.2, random_state=5)

print(X_train_1.shape)
print(Y_train_1.shape)
print(X_test_1.shape)
print(Y_test_1.shape)

#### Training and testing the model
- We use scikit-learn's `LinearRegression` to train our model on the training set and then check it on the test set.
- We first check the model performance on the training set.

In [None]:
reg_1 = LinearRegression()
reg_1.fit(X_train_1, Y_train_1)

y_train_predict_1 = reg_1.predict(X_train_1)
rmse = (np.sqrt(metrics.mean_squared_error(Y_train_1, y_train_predict_1)))
r2 = round(reg_1.score(X_train_1, Y_train_1),2)

print(f"The model performance for training set")
print(f"--------------------------------------")
print(f'RMSE is {rmse}')
print(f'R2 score is {r2}')

#### Model Evaluation for Test Set

In [None]:
y_pred_1 = reg_1.predict(X_test_1)
rmse = (np.sqrt(metrics.mean_squared_error(Y_test_1, y_pred_1)))
r2 = round(reg_1.score(X_test_1, Y_test_1),2)

print(f"The model performance for test set")
print(f"----------------------------------")
print(f"Root Mean Squared Error: {rmse}")
print(f"R^2: {r2}")

The coefficient of determination: 1 is perfect prediction

In [None]:
print(f'Coefficient of determination: {metrics.r2_score(Y_test_1, y_pred_1) :.4f}')
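
Both metrics can be written out directly: $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$ and $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. The sketch below computes them by hand on a few toy values (illustrative only, not taken from the diabetes data) and compares the result with `metrics.r2_score`.

```python
import numpy as np
from sklearn import metrics

# Toy values (illustrative only)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R^2: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"RMSE:        {rmse_manual:.4f}")
print(f"R^2:         {r2_manual:.4f}")
print(f"sklearn R^2: {metrics.r2_score(y_true, y_hat):.4f}")
```

An $R^2$ of 1 means the residual sum of squares is zero, while a model that always predicts the mean of the targets scores 0.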

#### 45-Degree Plot

In [None]:
plt.figure(figsize=(8, 5));
plt.scatter(Y_test_1, y_pred_1);
plt.plot([np.min(diabetes_data.target), np.max(diabetes_data.target)], 
         [np.min(diabetes_data.target), np.max(diabetes_data.target)], '--k');
plt.axis('tight');
plt.xlabel("Actual diabetes progression");
plt.ylabel("Predicted diabetes progression");
#plt.xticks(range(0, int(max(y_test)),2));
#plt.yticks(range(0, int(max(y_test)),2));
plt.title("Actual Progression vs Predicted Progression");
plt.tight_layout();

## <font color="red">Linear Regression Model with All Variables</font>
- We want to create a model considering all the features in the dataset.

#### Create the Model

In [None]:
X = diabetes_df.drop('diabetes_progression', axis = 1)
y = diabetes_df['diabetes_progression']

- Use the `train_test_split` to split the data into random train and test subsets.
- Every time you run it without specifying `random_state`, you will get a different result.
- If you use `random_state=some_number`, then you can guarantee the split will always be the same.
- It doesn't matter what the value of `random_state` is:  42, 0, 21, ...
- This is useful if you want reproducible results.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=42)
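
The effect of `random_state` can be checked on a small stand-in array. This is a minimal sketch; `values` and the `a_`, `b_`, `c_` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

values = np.arange(20)  # stand-in data (illustrative)

# Same seed: the two splits are identical
a_train, a_test = train_test_split(values, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(values, test_size=0.2, random_state=42)
print("Same seed, same test set:", np.array_equal(a_test, b_test))

# A different seed will generally select different samples
c_train, c_test = train_test_split(values, test_size=0.2, random_state=0)
print("Seed 42 test set:", a_test)
print("Seed 0 test set: ", c_test)
```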

The linear regression model:

In [None]:
reg_all = LinearRegression()
reg_all.fit(X_train, y_train)

#### Model Evaluation for Training Set

In [None]:
y_train_predict = reg_all.predict(X_train)

In [None]:
rmse = (np.sqrt(metrics.mean_squared_error(y_train, y_train_predict)))
r2 = round(reg_all.score(X_train, y_train),2)

print(f"The model performance for training set")
print(f"--------------------------------------")
print(f'RMSE is {rmse}')
print(f'R2 score is {r2}')

#### Model Evaluation for Test Set

In [None]:
y_pred = reg_all.predict(X_test)

In [None]:
rmse = (np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
r2 = round(reg_all.score(X_test, y_test),2)

print(f"The model performance for test set")
print(f"----------------------------------")
print(f"Root Mean Squared Error: {rmse}")
print(f"R^2: {r2}")

The coefficient of determination: 1 is perfect prediction

In [None]:
print(f'Coefficient of determination: {metrics.r2_score(y_test, y_pred) :.4f}')

#### Error Distribution

In [None]:
# sns.distplot is deprecated; histplot with kde=True draws a comparable plot
sns.histplot(y_test - y_pred, kde=True);

#### 45-Degree Plot

In [None]:
plt.figure(figsize=(8, 5));
plt.scatter(y_test, y_pred);
plt.plot([np.min(diabetes_data.target), np.max(diabetes_data.target)], 
         [np.min(diabetes_data.target), np.max(diabetes_data.target)], '--k');
plt.axis('tight');
plt.xlabel("Actual progression");
plt.ylabel("Predicted progression");
#plt.xticks(range(0, int(max(y_test)),2));
#plt.yticks(range(0, int(max(y_test)),2));
plt.title("Actual progression vs Predicted progression");
plt.tight_layout();

In [None]:
print("RMS: %r " % np.sqrt(np.mean((y_test - y_pred) ** 2)))

In [None]:
df1 = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df2 = df1.head(10)
df2

In [None]:
df2.plot(kind='bar');

## <font color="red">Choosing the Best Model:</font> k-Fold Cross-Validation

- Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
- It is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data.
- We use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.
- The biggest advantage of this method is that every data point is used for validation exactly once and for training `k-1` times.
- To choose the final model to use, we select the one that has the lowest validation error.

The general procedure is as follows:

1. Shuffle the dataset randomly.
2. Split the dataset into `k` groups
3. For each unique group:
       3.1 Take the group as a hold out or test data set
       3.2 Take the remaining k-1 groups as a training data set
       3.3 Fit a model on the training set and evaluate it on the test set
       3.4 Retain the evaluation score and discard the model
4. Summarize the skill of the model using the sample of model evaluation scores
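
The steps above can be sketched as an explicit loop. This is a minimal illustration that assumes a plain `LinearRegression` and `k=5` on the diabetes data; the names `X_cv`, `y_cv`, `manual_scores`, and `auto_scores` are illustrative. The last lines show that `cross_val_score` wraps the same procedure.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

X_cv, y_cv = load_diabetes(return_X_y=True)

# Steps 1-2: shuffle the data and split it into k groups
kf = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, test_idx in kf.split(X_cv):
    # Steps 3.1-3.2: one group is held out, the rest form the training set
    fold_model = LinearRegression()
    # Step 3.3: fit on the training groups, evaluate on the held-out group
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])
    mse = mean_squared_error(y_cv[test_idx], fold_model.predict(X_cv[test_idx]))
    # Step 3.4: keep the score (negated, as scikit-learn does), drop the model
    manual_scores.append(-mse)

# Step 4: summarize the scores
manual_scores = np.array(manual_scores)
print(f"manual loop:     {manual_scores.mean():.2f} +/- {manual_scores.std():.2f}")

# cross_val_score performs the same loop internally
auto_scores = cross_val_score(LinearRegression(), X_cv, y_cv, cv=kf,
                              scoring="neg_mean_squared_error")
print(f"cross_val_score: {auto_scores.mean():.2f} +/- {auto_scores.std():.2f}")
```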

How to choose **k**?
- A poorly chosen value for **k** may give a misleading idea of the skill of the model, such as a score with a high variance or a high bias.
- The choice of **k** is usually 5 or 10, but there is no formal rule. As **k** gets larger, the difference in size between the training set and the resampling subsets gets smaller. As this difference decreases, the bias of the technique becomes smaller.
- A value of **k=10** is very common in the field of applied machine learning, and is recommended if you are struggling to choose a value for your dataset.

Below is the visualization of a k-fold validation when k=5.
![FIG_kFold](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)
Image Source: https://scikit-learn.org/



In [None]:
# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Ridge
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
# LogisticRegression is a classifier, so it is left out of this regression comparison

# user variables to tune
seed    = 9
folds   = 10
metric  = "neg_mean_squared_error"

# hold different regression models in a single dictionary

models = dict()
models["SVR"] = SVR()
models["KNN"] = KNeighborsRegressor()
models["Lasso"] = Lasso()
models["Ridge"] = Ridge()
models["LassoCV"] = LassoCV()
models["Linear"] = LinearRegression()
models["AdaBoost"] = AdaBoostRegressor()
models["ElasticNet"] = ElasticNet()
models["DecisionTree"] = DecisionTreeRegressor()
models["RandomForest"] = RandomForestRegressor()
models["BayesianRidge"] = BayesianRidge()
models["GradientBoost"] = GradientBoostingRegressor()

# 10-fold cross validation for each model
model_results = list()
model_names   = list()
for model_name in models:
    model   = models[model_name]
    k_fold  = KFold(n_splits=folds, shuffle=True, random_state=seed)
    #results = cross_val_score(model, X_train, y_train, 
    results = cross_val_score(model, X, y, 
                              cv=k_fold, scoring=metric)
    
    model_results.append(results)
    model_names.append(model_name)
    print(f"{model_name :>20}: {round(results.mean(), 3):8.2f} {round(results.std(), 3):8.2f}")

# box-whisker plot to compare regression models
figure = plt.figure();
figure.suptitle('Regression models comparison');
ax = figure.add_subplot(111);
plt.boxplot(model_results);
ax.set_xticklabels(model_names, rotation = 45, ha="right");
ax.set_ylabel("Negative Mean Squared Error");
plt.margins(0.05, 0.1);
plt.show();

**Based on the above comparison, the `Bayesian Ridge Regression` model appears to show the best performance.**

## <font color="red">Model with Bayesian Ridge</font>


In [None]:
brg = BayesianRidge()
brg.fit(X_train, y_train)

brg_predicted = brg.predict(X_test)
brg_expected = y_test

**Root Mean Square Error:**

In [None]:
print("RMS: %r " % np.sqrt(np.mean((brg_predicted - brg_expected) ** 2)))

**The coefficient of determination**: (1 is perfect prediction)

In [None]:
print('Coeff of determination: {:.4f}'.format(metrics.r2_score(brg_expected, brg_predicted)))

#### Error Distribution

In [None]:
# sns.distplot is deprecated; histplot with kde=True draws a comparable plot
sns.histplot(brg_expected - brg_predicted, kde=True);

#### 45-Degree Plot

In [None]:
plt.figure(figsize=(8, 5));
plt.scatter(brg_expected, brg_predicted)
plt.plot([np.min(diabetes_data.target), np.max(diabetes_data.target)], 
         [np.min(diabetes_data.target), np.max(diabetes_data.target)], '--k');
plt.axis('tight');
plt.xlabel('True diabetes progression');
plt.ylabel('Predicted diabetes progression');
plt.tight_layout();

**Zoom in:**

In [None]:
df1 = pd.DataFrame({'Actual': brg_expected, 'Predicted': brg_predicted})
df2 = df1.head(10)
df2

In [None]:
df2.plot(kind='bar');

In [None]:
brg.coef_

#### Feature Importance
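- Once we have a trained model, we can examine feature importance (or variable importance), which tells us how much each feature contributes to predicting the target.
- `BayesianRidge` is a linear model and has no `feature_importances_` attribute, so we use the magnitude of its coefficients as a rough proxy. This is reasonable here because all features share a common scale.

In [None]:
# absolute coefficient values, rescaled so that the largest equals 100
feature_importance = np.abs(brg.coef_)
feature_importance = 100.0 * (feature_importance / feature_importance.max())

sorted_idx = np.argsort(feature_importance)
pos        = np.arange(sorted_idx.shape[0]) + .5

plt.barh(pos, feature_importance[sorted_idx], align='center');
# feature_names is a list, so convert it to an array before fancy indexing
plt.yticks(pos, np.array(diabetes_data.feature_names)[sorted_idx]);
plt.xlabel('Relative Importance');
plt.title('Variable Importance');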

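**Plot training deviance:**

Staged predictions are only available for boosted ensembles, not for `BayesianRidge`, so this plot uses a `GradientBoostingRegressor` fitted on the same training data.

In [None]:
n_estimators = 100
gbr = GradientBoostingRegressor(n_estimators=n_estimators, random_state=42)
gbr.fit(X_train, y_train)

# compute the training and test set deviance (MSE) after each boosting stage
train_score = np.zeros((n_estimators,), dtype=np.float64)
test_score = np.zeros((n_estimators,), dtype=np.float64)

for i, y_stage in enumerate(gbr.staged_predict(X_train)):
    train_score[i] = metrics.mean_squared_error(y_train, y_stage)
for i, y_stage in enumerate(gbr.staged_predict(X_test)):
    test_score[i] = metrics.mean_squared_error(y_test, y_stage)

plt.figure(figsize=(12, 6));
plt.title('Deviance');
plt.plot(np.arange(n_estimators) + 1, train_score, 'b-',
         label='Training Set Deviance');
plt.plot(np.arange(n_estimators) + 1, test_score, 'r-',
         label='Test Set Deviance');
plt.legend(loc='upper right');
plt.xlabel('Boosting Iterations');
plt.ylabel('Deviance');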

# <font color="blue">Image Classification</font>

## <font color="red"> MNIST Dataset</font>

- The <A HREF="https://en.wikipedia.org/wiki/MNIST_database"> MNIST database</A> (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems.
- The database is also widely used for training and testing in the field of machine learning.
- The dataset we will be using contains 70000 images of handwritten digits among which 10000 are reserved for testing.
- This dataset is  suitable for anyone who wants to get started with image classification using Scikit-Learn. 

### Obtain the Dataset

In [None]:
from sklearn.datasets import fetch_openml
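# as_frame=True returns pandas objects, which the cells below rely on (.values, .value_counts())
mnist_data = fetch_openml('mnist_784', version=1, as_frame=True)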

In [None]:
print(mnist_data.DESCR)

### Features of the Dataset

In [None]:
print("Keys: ", mnist_data.keys())

**Note that the `data` and `target` are already separated.**

In [None]:
print(f"Shape of Data: {mnist_data.data.shape}")

In [None]:
print(f"Datatype of Data: {type(mnist_data.data)}")

In [None]:
print(f"Shape of the Target Data: {mnist_data.target.shape}")

In [None]:
print(f"Datatype of Target Data: {type(mnist_data.target)}")

In [None]:
print(f"Feature Names: {mnist_data.feature_names}")

In [None]:
print(f"Url: {mnist_data.url}")

In [None]:
np_data, np_target = mnist_data['data'], mnist_data['target']

print(f' Shape of data:   {np_data.shape} \n Shape of target: {np_target.shape}')

**Checking the Data**

In [None]:
len(np.unique(np_data))

In [None]:
np_data.values[0]

In [None]:
len(np.unique(np_data.values[0]))

**Checking the Target**

In [None]:
print(f"Datatype of the target values: {np_target.dtype}")

In [None]:
type(np_target[0])

Print a few values:

In [None]:
print(np_target[0:5])

Change the labels from strings to integers:

In [None]:
np_target = np_target.astype(np.uint8)

In [None]:
print(np_target[0:5])

Print the number of unique labels:

In [None]:
np.unique(np_target)

In [None]:
np_target.value_counts()

In [None]:
np_target.value_counts().sum()

<font color="blue"> 
There are 70000 images, each stored as an array of 784 numbers giving the intensity of each pixel. An image can be displayed by reshaping its data into a 28x28 array and plotting it with matplotlib.
</font>
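
The reshape is row-major: the first 28 values of the flat array become the first image row, the next 28 the second row, and so on. A minimal sketch with a stand-in array (the names `flat_image` and `image` are illustrative):

```python
import numpy as np

# Stand-in for one MNIST row: 784 values, here simply 0..783 (illustrative)
flat_image = np.arange(784)

# Row-major reshape into a 28x28 grid
image = flat_image.reshape(28, 28)

# Pixel (row, col) of the image corresponds to flat index row * 28 + col
row, col = 3, 5
print(image.shape)
print(image[row, col] == flat_image[row * 28 + col])
```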

In [None]:
some_index = 15657
some_digit = np_data.values[some_index]
some_digit_image = some_digit.reshape(28,28)

plt.imshow(some_digit_image, 
           cmap = matplotlib.cm.binary, 
           interpolation='nearest')
plt.axis('off')

Let us find the target for row `some_index`:

In [None]:
np_target[some_index]

In [None]:
np_data.values.shape

**Display a few images**

In [None]:
import random

def display_digits(X, y):
    """
      Given an array of images of digits X and 
      the corresponding values of the digit y,
      this function plots 96 unique randomly selected images 
      and their values.
    """
    # Figure size (width, height) in inches
    fig = plt.figure(figsize=(8, 6))

    # Adjust the subplots 
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1, 
                        hspace=0.05, wspace=0.05)

    num_images = X.shape[0]
    
    num_selected_images = 96
    row_indices = random.sample(range(num_images), num_selected_images)
    
    i = 0
    for idx in row_indices:
        # Initialize the subplots: 
        # Add a subplot in the grid of 8 by 12, at the i+1-th position
        ax = fig.add_subplot(8, 12, i + 1, xticks=[], yticks=[])
        
        # Display an image at the i-th position
        ax.imshow(X[idx].reshape(28, 28), cmap=plt.cm.binary, 
                  interpolation='nearest')
       
        # label the image with the target value
        ax.text(0, 7, str(y[idx]))
        i += 1

    # Show the plot
    plt.show()

In [None]:
display_digits(np_data.values, np_target)

### <font color="red">Separating the Training and Testing Set</font>

- The first 60000 (among the 70000) images are used for training.
- The remaining 10000 images are used for testing.

In [None]:
num_train = 60000

X_train = np_data.values[:num_train]
X_test  = np_data.values[num_train:]
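# Use NumPy arrays for the labels so that positional indexing below
# (shuffling, cross-validation folds) stays aligned with X_train
y_train = np_target.to_numpy()[:num_train]
y_test  = np_target.to_numpy()[num_train:]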

In [None]:
print(f' Train Data:  {X_train.shape} \n Test Data:   {X_test.shape} \n \
Train label: {y_train.shape} \n Test Label:  {y_test.shape}')

**Shuffle the training set:**

In [None]:
nn = X_train.shape[0]
shuffle_index = np.random.permutation(nn)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

###  <font color="red">Training a Binary Classifier</font>

- Binary classification means there are two classes to work with that relate to one another as `true` and `false`.
- Here, we want to identify a single digit: looking at `9`s.
- The classification will tell us if we have a `9` (true) or not (false).

**Set the target arrays as boolean arrays:** true if 9 otherwise false.

In [None]:
y_train_9 = (y_train == 9)
y_test_9 = (y_test == 9)

**Create and train the model:**

- We use `SGDClassifier`, which fits regularized linear models using Stochastic Gradient Descent (SGD) learning.
- SGD can be used to build estimators for both classification and regression problems.

In [None]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
sgd_clf.fit(X_train, y_train_9)

**Make an initial prediction:**

In [None]:
some_index = 15657
some_digit = np_data.values[some_index]
print(np_target[some_index])

In [None]:
some_digit_predict = sgd_clf.predict([some_digit])

In [None]:
some_digit_predict

**Measuring accuracy using cross validation**

- The `StratifiedKFold` class performs stratified sampling to produce folds that contain a representative ratio of each class. 
- At each iteration the code creates a clone of the classifier, trains that clone on the training fold and then makes predictions on the test fold. 
- It then counts the number of correct predictions and outputs the ratio of correct predictions.

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train_9):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_9[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_9[test_index]
    
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct/len(y_pred))
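
To see what stratification means in practice, the sketch below splits synthetic labels with 10% positives (an illustrative stand-in for the share of 9s) and prints the fraction of positives in each test fold. The names `y_demo`, `X_demo`, and `fold_ratios` are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels (illustrative): 30 positives among 300 samples, i.e. 10%
y_demo = np.array([True] * 30 + [False] * 270)
X_demo = np.zeros((300, 1))  # the features play no role in the split itself

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Fraction of positives in each held-out fold
fold_ratios = [y_demo[test_idx].mean() for _, test_idx in skf.split(X_demo, y_demo)]
print(fold_ratios)
```

Each fold keeps the class ratio of the full label set, which matters when one class is rare.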

In [None]:
from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_9, cv=3, scoring='accuracy')

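The scikit-learn `cross_val_score` function automates the same procedure and returns comparable accuracies (its folds are stratified but not shuffled, so the values can differ slightly).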

In [None]:
pct_nines = (sum(np_target==9)/len(np_target))*100
print(f'94-96% accuracy might not be as impressive as it sounds \n when only {pct_nines :.2f}% of the images in the dataset are 9s')
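
A baseline makes this concrete: a classifier that always answers "not a 9" is already right about 90% of the time. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels with 10% positives; the data and the names `y_imbalanced`, `X_dummy`, and `baseline` are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels (illustrative): 10% positives, mimicking the share of 9s
y_imbalanced = np.array([True] * 100 + [False] * 900)
X_dummy = np.zeros((1000, 1))  # DummyClassifier ignores the features

# Always predicts the most frequent class, here "not a 9"
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_imbalanced)

baseline_accuracy = accuracy_score(y_imbalanced, baseline.predict(X_dummy))
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```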

**Confusion matrix**

- A confusion matrix is a tabular summary of the number of correct and incorrect predictions made by a classifier. 
- It can be used to evaluate the performance of a classification model through the calculation of performance metrics like accuracy, precision, recall, and F1-score.
- The confusion matrix is a much better way to evaluate the performance of a classifier, especially when there is a skewed dataset as we have here with only 10% of the dataset being the target.
- Each row represents a class, each column a prediction:
   * The first row holds the negative class (non-9s): the top left contains the correctly classified non-9s (True Negatives), the top right the non-9s incorrectly classified as 9s (False Positives).
   * The second row holds the positive class (9s): the bottom left contains the 9s incorrectly classified as non-9s (False Negatives), the bottom right the correctly classified 9s (True Positives).

| | Predicted non-9 | Predicted 9 |
| --- | :--- | :--- |
| **Actual non-9** | True Negative | False Positive |
| **Actual 9** | False Negative | True Positive |
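
In scikit-learn's `confusion_matrix`, rows correspond to the actual classes and columns to the predicted classes, so for a binary problem `ravel()` yields the counts in the order TN, FP, FN, TP. A minimal sketch on toy labels (illustrative, with 1 meaning "is a 9"):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (illustrative): 0 = not a 9, 1 = is a 9
actual = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
predicted = np.array([0, 0, 0, 0, 0, 1,   # one non-9 flagged as a 9
                      1, 1, 1, 0])        # one 9 missed

cm_toy = confusion_matrix(actual, predicted)
tn, fp, fn, tp = cm_toy.ravel()

print(cm_toy)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```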

We first need a set of predictions to compare to the actual targets:

In [None]:
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_9, cv=3)

In [None]:
cf_matrix = metrics.confusion_matrix(y_train_9, y_train_pred)

print(f"Confusion Matrix: \n{cf_matrix}")
print(f"\n Number of images: {np.sum(cf_matrix)}")

We can visualize the confusion matrix:

In [None]:
cm = metrics.confusion_matrix(y_train_9, y_train_pred, 
                              labels=sgd_clf.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, 
                                      display_labels=sgd_clf.classes_)
disp.plot() 

**Precision/Recall**

- Precision measures the number of true positives (correctly classified 9s) as a ratio of the total samples classified as a 9: $\frac{TP}{TP + FP}$
- Recall measures the number of true positives as a ratio of the total number of actual positives: $\frac{TP}{TP + FN}$
- Depending on the scenario, the model may be tuned to favor one or the other. Favoring recall catches nearly all positive instances at the expense of more false positives; favoring precision makes sure a negative instance is rarely flagged as positive, at the expense of missing some positive instances.
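
The trade-off can be seen by moving the decision threshold over a few scores. This is a small sketch on made-up scores (the arrays and the helper `precision_recall_at` are hypothetical, not part of the notebook); in this toy data, raising the threshold increases precision and lowers recall.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth (1 = "is a 9") and decision scores for ten images.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores_demo = np.array([-3.0, -2.0, -1.5, -0.5, 0.4, 1.2, 0.8, 1.5, 2.5, -0.2])

def precision_recall_at(threshold):
    """Classify as positive when the score exceeds the threshold."""
    y_pred = (scores_demo > threshold).astype(int)
    return (precision_score(y_true_demo, y_pred),
            recall_score(y_true_demo, y_pred))

results = {t: precision_recall_at(t) for t in (0.0, 1.0, 2.0)}
for t, (p, r) in results.items():
    print(f"threshold={t:>4}: precision={p:.2f}, recall={r:.2f}")
```

On real data precision does not have to rise smoothly with the threshold, so it is worth inspecting the whole curve rather than a few points.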

In [None]:
y_scores = cross_val_predict(sgd_clf, X_train, y_train_9, cv=3, method='decision_function')

In [None]:
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_9, y_scores)

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.figure(figsize=(12,8))
    plt.title('Precision and recall vs decision threshold')
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0,1])

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()
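
Once the curve is available, one possible next step is to pick the lowest threshold that reaches a target precision. The sketch below does this on made-up scores; the 0.9 target and all variable names are assumptions for illustration, not values from this notebook.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical ground truth (1 = "is a 9") and decision scores.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores_demo = np.array([-3.0, -2.0, -1.5, -0.5, 0.4, 1.2, 0.8, 1.5, 2.5, -0.2])

prec, rec, thr = precision_recall_curve(y_true_demo, scores_demo)

target_precision = 0.9  # assumed target, chosen only for this example

# prec and rec have one more entry than thr, so drop the last entry.
# np.argmax returns the first index where the condition holds
# (and 0 if it never holds, so check the result on real data).
idx = np.argmax(prec[:-1] >= target_precision)
chosen_threshold = float(thr[idx])
recall_at_threshold = float(rec[idx])

print(f"threshold={chosen_threshold}, "
      f"precision={prec[idx]:.2f}, recall={recall_at_threshold:.2f}")
```

With the notebook's own `precisions`, `recalls`, and `thresholds` arrays the same indexing applies, and predictions at the chosen threshold would be `y_scores >= chosen_threshold`.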

### <font color="red">Training and Prediction on the Entire Dataset</font>

- We will use the Stochastic Gradient Descent classifier (SGD). 
- Scikit-Learn’s SGDClassifier is a good starting point for linear classifiers. 
- Using the loss parameter we will see how Support Vector Machine (Linear SVM) and Logistic Regression perform for the same dataset.


#### Using Linear Support Vector Machine (SVM)
- We use linear SVM with stochastic gradient descent (SGD) learning.
- The gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing step size (the learning rate).
- To use the Linear SVM Classifier, we need to set the loss parameter to `hinge`. 
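
The per-sample update described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions (toy 2-D data, labels in {-1, +1}, a made-up learning-rate schedule and regularization strength); it is not the exact algorithm implemented inside `SGDClassifier`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy, well-separated 2-D data with labels in {-1, +1} (made up for illustration).
X_toy = np.vstack([rng.normal(loc=-2.0, size=(50, 2)),
                   rng.normal(loc=+2.0, size=(50, 2))])
y_toy = np.array([-1] * 50 + [+1] * 50)

w = np.zeros(2)   # weights
b = 0.0           # intercept
alpha = 1e-4      # L2 regularization strength (hypothetical value)
eta0 = 0.1        # initial learning rate (hypothetical value)

t = 0
for epoch in range(5):
    for i in rng.permutation(len(X_toy)):   # visit samples in random order
        t += 1
        eta = eta0 / (1.0 + 0.01 * t)       # decreasing step size (illustrative schedule)
        margin = y_toy[i] * (X_toy[i] @ w + b)
        # Subgradient of the hinge loss max(0, 1 - margin) plus the L2 penalty.
        grad_w = alpha * w
        grad_b = 0.0
        if margin < 1:
            grad_w = grad_w - y_toy[i] * X_toy[i]
            grad_b = -y_toy[i]
        w -= eta * grad_w
        b -= eta * grad_b

train_accuracy = np.mean(np.sign(X_toy @ w + b) == y_toy)
print(f"weights={w}, intercept={b:.3f}, training accuracy={train_accuracy:.2%}")
```

Only samples that fall inside the margin (`margin < 1`) pull on the weights, which is what makes the hinge loss behave like a linear SVM.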

In [None]:
from sklearn.linear_model import SGDClassifier
 
sgd_clf = SGDClassifier(loss='hinge', random_state=42)
sgd_clf.fit(X_train, y_train)

- Before testing the model, it is good practice to first look at the cross-validation scores on the training data.
- They give a good estimate of how the model will perform on unseen data.

In [None]:
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring='accuracy')

* With three-fold cross-validation, the accuracy is around 87% to 88%.
* Not too bad, but not great either.

We can now compute the actual test scores:

In [None]:
scoreSVM = sgd_clf.score(X_test, y_test)
print("Test score of the Linear SVM: ", scoreSVM)

In [None]:
y_predict = sgd_clf.predict(X_test)

cm = metrics.confusion_matrix(y_test, y_predict, 
                              labels=sgd_clf.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, 
                                      display_labels=sgd_clf.classes_)
disp.plot()

#### Using Logistic Regression

In [None]:
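# loss='log_loss' selects logistic regression (scikit-learn >= 1.1);
# older releases used loss='log', which was later removed.
sgd_clf = SGDClassifier(loss='log_loss', random_state=42)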
sgd_clf.fit(X_train, y_train)

In [None]:
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring='accuracy')

In [None]:
scoreLR = sgd_clf.score(X_test, y_test)
print("Test score of the Logistic Regression: ", scoreLR)

In [None]:
y_predict = sgd_clf.predict(X_test)

cm = metrics.confusion_matrix(y_test, y_predict, 
                              labels=sgd_clf.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, 
                                      display_labels=sgd_clf.classes_)
disp.plot();

### Random Forest Classifier

- Random forest is a supervised learning algorithm.
- A forest is made up of decision trees.
- In general, the more trees it has, the more robust the forest is, at the cost of longer training time.
- A random forest builds decision trees on randomly selected data samples, gets a prediction from each tree, and selects the final answer by voting.
- It also provides a useful indicator of feature importance.
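
As a small illustration of the feature-importance point, the sketch below trains a forest on the 8x8 digits set bundled with scikit-learn (an assumption made to keep the example fast; the notebook uses its own data) and reshapes the importances into the image grid. Note that scikit-learn's implementation combines trees by averaging their predicted class probabilities rather than by a hard vote.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 8x8 digit images, 64 pixel features each.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

forest_demo = RandomForestClassifier(n_estimators=100, random_state=42)
forest_demo.fit(X_tr, y_tr)
forest_accuracy = forest_demo.score(X_te, y_te)

# One importance value per pixel; the values sum to 1.
importances = forest_demo.feature_importances_.reshape(8, 8)

print(f"Trees in the forest: {len(forest_demo.estimators_)}")
print(f"Test accuracy:       {forest_accuracy:.3f}")
print("Most important pixel (row, col):",
      divmod(int(importances.argmax()), 8))
```

Pixels near the image border tend to get very low importance because they are almost always blank, while pixels in the center carry most of the signal.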

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=500)
forest.fit(X_train, y_train)

In [None]:
forest_output = forest.predict(X_test)

Calculate accuracy on the prediction:

In [None]:
print("Random Forest with n_estimators:500")
print(accuracy_score(y_test, forest_output))

Display a few test images against their predictions:

In [None]:
display_digits(X_test, forest_output)

In [None]:
cm = metrics.confusion_matrix(y_test, forest_output, 
                              labels=forest.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, 
                                      display_labels=forest.classes_)
disp.plot()

### Gradient Boosting Classifier



In [None]:
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=10, learning_rate=1.0, 
                                 max_depth=1, random_state=0).fit(X_train,y_train)

In [None]:
gradient_output = clf.predict(X_test) 

Calculate accuracy on the prediction:

In [None]:
print(accuracy_score(y_test, gradient_output))

Display a few test images against their predictions:

In [None]:
display_digits(X_test, gradient_output)

In [None]:
cm = metrics.confusion_matrix(y_test, gradient_output, 
                              labels=clf.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, 
                                      display_labels=clf.classes_)
disp.plot()

### MLP Classifier

- The Multi-layer Perceptron classifier relies on an underlying Neural Network to perform the task of classification.
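
To see what the underlying network looks like, the sketch below fits a small MLP and inspects the shapes of its weight matrices. It assumes the 8x8 digits set bundled with scikit-learn (64 input pixels, 10 classes) and a short training run; `mlp_demo` and `layer_shapes` are illustrative names.

```python
import warnings

from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Stand-in data (assumption): 8x8 digit images with pixel values from 0 to 16.
X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo / 16.0  # scale pixels to [0, 1]

# One hidden layer with 10 units, as in the cells that follow.
mlp_demo = MLPClassifier(hidden_layer_sizes=(10,), max_iter=50, random_state=1)
with warnings.catch_warnings():
    # The short run may stop before convergence; silence that warning here.
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    mlp_demo.fit(X_demo, y_demo)

# coefs_[i] holds the weights between layer i and layer i + 1.
layer_shapes = [w.shape for w in mlp_demo.coefs_]
print("Weight matrix shapes:", layer_shapes)
print("Number of layers (input + hidden + output):", mlp_demo.n_layers_)
```

The first matrix maps the 64 pixels to the 10 hidden units, and the second maps the hidden units to the 10 output classes.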

In [None]:
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(solver='sgd', hidden_layer_sizes=(10,), random_state=1)
clf.fit(X_train, y_train)   
neural_output = clf.predict(X_test)

Calculate accuracy on the prediction:

In [None]:
print("sgd: ", accuracy_score(y_test, neural_output))

Display a few test images against their predictions:

In [None]:
display_digits(X_test, neural_output)

In [None]:
clf = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(10,), random_state=1)
clf.fit(X_train, y_train)   
neural_output = clf.predict(X_test)

Calculate accuracy on the prediction:

In [None]:
print("lbfgs: ", accuracy_score(y_test, neural_output))

Display a few test images against their predictions:

In [None]:
display_digits(X_test, neural_output)

In [None]:
cm = metrics.confusion_matrix(y_test, neural_output, 
                              labels=clf.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, 
                                      display_labels=clf.classes_)
disp.plot()