
# Outline for Part 1

- Exploratory Data Analysis
- Data Pre-Processing
- Model Training and Evaluation
- Pipelines
- Cross Validation and Hyperparameter Search

We will use a logistic regression model on the Iris flower dataset as a running example.

Libraries used:
* [Pandas](https://pandas.pydata.org/) for data analysis and manipulation
* [Seaborn](https://seaborn.pydata.org/) and [Matplotlib](https://matplotlib.org/) for data visualization
* [Scikit-Learn](https://scikit-learn.org/stable/) for model construction, training, evaluation and hyperparameter selection
* [NumPy](https://numpy.org/) for numerical computation; it makes computation with vectors and matrices (represented as NumPy arrays) fast and easy


# Exploratory Data Analysis

In this section, we'll load and visualize the data.

We will use a small toy dataset called the Iris plants dataset. It contains 150 examples of Iris flowers. The input variables are attributes of the Iris (e.g., sepal length, petal width). The prediction targets consist of three categories (classes) of Irises: Setosa, Versicolour, and Virginica.

In [None]:
from sklearn import datasets
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

np.random.seed(0) # set random seed to make results deterministic

In [None]:
# load the example dataset
dataset = datasets.load_iris(as_frame=True)

We will store the data as a table using the [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) data structure. Each example in the dataset corresponds to a row, and the attributes (X and y values) correspond to columns.

In [None]:
dataset_df = dataset.frame

In [None]:
# use pandas .head() to visualize the first 5 rows in the dataframe
dataset_df.head()

In [None]:
# we can use the .info() method to print a concise summary of the dataframe
dataset_df.info()

In [None]:
# show target classes
dataset_df['target'].unique()
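
The integer codes shown above are indices into the list of class names. The following is a minimal sketch of that mapping, assuming the object returned by `load_iris` exposes a `target_names` attribute; the names `iris`, `df`, and the `target_name` column are introduced here for illustration only.

```python
from sklearn import datasets

iris = datasets.load_iris(as_frame=True)
df = iris.frame.copy()

# target_names[i] is the species name for integer class i
print(list(iris.target_names))

# illustrative helper column (not part of the original dataset) with readable labels
df["target_name"] = df["target"].map(dict(enumerate(iris.target_names)))
print(df["target_name"].value_counts())
```

Each of the three classes should appear 50 times, which confirms that the dataset is balanced.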

In [None]:
# we can use the .describe() method to print summary statistics of the dataframe
dataset_df.describe().round(2)

In [None]:
# we compute the pairwise correlation of columns in the matrix using .corr()
# note: we do this only for the features, since the targets are categorical variables
feature_cols = dataset_df.columns[:-1]  # exclude the last column
correlation_matrix = dataset_df[feature_cols].corr().round(2)
correlation_matrix

We can use the Seaborn library to visualize the data.

In [None]:
# get a color palette that shows positive correlations in red and negative correlations in blue
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(data=correlation_matrix, annot=True, cmap=cmap)

In [None]:
# plot pairwise relationships in the data
sns.pairplot(dataset_df)

# Data Pre-Processing

In this section, we will split the data into train and test sets and apply min-max normalization. Validation data is handled later through cross validation on the training set.

Libraries used:
* [Scikit-Learn](https://scikit-learn.org/stable/)

Useful links:
* [Scikit-Learn transformations](https://scikit-learn.org/stable/data_transforms.html)

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [None]:
# separate out features and targets
X, y = dataset_df[feature_cols].to_numpy(), dataset_df['target'].to_numpy()  # .to_numpy() converts from dataframe to numpy matrix

In [None]:
# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# apply min-max normalization: scale data to be in 0-1 range

# find normalization parameters (i.e., max and min value of each feature) using TRAIN data only
mms = MinMaxScaler()
X_train = mms.fit_transform(X_train)

In [None]:
# summary stats of scaled X_train
pd.DataFrame(X_train).describe().round(2)

In [None]:
# transform test data
X_test = mms.transform(X_test)
pd.DataFrame(X_test).describe().round(2)
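
To make the scaling step concrete, here is a small sketch of min-max normalization written by hand with NumPy and checked against `MinMaxScaler`. The toy arrays `X_tr` and `X_te` are made up for illustration. Each feature is mapped to `(x - min) / (max - min)`, where `min` and `max` come from the training data only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy data (made up): 3 training rows, 1 test row, 2 features
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
X_te = np.array([[4.0, 25.0]])

# statistics are computed from the TRAINING data only
col_min = X_tr.min(axis=0)
col_max = X_tr.max(axis=0)

X_tr_scaled = (X_tr - col_min) / (col_max - col_min)
X_te_scaled = (X_te - col_min) / (col_max - col_min)

# same result as a MinMaxScaler fitted on the training data
scaler = MinMaxScaler().fit(X_tr)
assert np.allclose(X_tr_scaled, scaler.transform(X_tr))
assert np.allclose(X_te_scaled, scaler.transform(X_te))

print(X_te_scaled)  # first feature is 1.5: the test value 4.0 exceeds the training max of 3.0
```

This is why the summary statistics of the scaled test set are not guaranteed to lie exactly in the 0-1 range: test values outside the training range map outside it.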

# Train and Evaluate Logistic Regression Model with Scikit-Learn

In this section, we will
- Use Scikit-Learn to train and evaluate a Logistic Regression model.
- Bundle processing steps with sklearn pipelines.

Useful links:
- [Scikit-Learn Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- [Scikit-Learn metrics](https://scikit-learn.org/stable/modules/model_evaluation.html) for evaluation metrics


In [None]:
from sklearn.linear_model import LogisticRegression

First, a quick example of how to train and evaluate a model without hyperparameter sweeping/selection.

In [None]:
# model construction
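model = LogisticRegression(
    C=1.0,
    fit_intercept=True,
    solver='lbfgs',
    # note: `multi_class` is deprecated in recent scikit-learn releases (1.5+);
    # with 'lbfgs' the multinomial loss is already used for multiclass targets
    multi_class='multinomial',
)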

In [None]:
# we can fit the parameters by calling the .fit() method
model.fit(X_train, y_train)

# predict on new data
y_pred = model.predict(X_test)

In [None]:
# get fitted parameters
model.coef_, model.intercept_
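
As a rough illustration of what the fitted parameters mean, the sketch below recomputes class probabilities by hand from `coef_` and `intercept_` and compares them with `predict_proba`. It assumes the default `lbfgs` solver, which fits a multinomial (softmax) model for multiclass targets; the names `X_demo`, `y_demo`, and `clf` are introduced here for illustration and the model is fitted on the full unscaled dataset to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# one row of weights and one intercept per class
print(clf.coef_.shape, clf.intercept_.shape)  # (3, 4) (3,)

# class scores (logits), then a numerically stable softmax over the classes
logits = X_demo @ clf.coef_.T + clf.intercept_
logits -= logits.max(axis=1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# the hand-computed probabilities should agree with predict_proba
print(np.allclose(probs, clf.predict_proba(X_demo)))
```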

### Model evaluation

In [None]:
# evaluate model
y_pred = model.predict(X_test)

# display true vs. predicted classes
print(f'true:\t\t{y_test}\nprediction:\t{y_pred}')

In [None]:
# we can use accuracy_score() to score predictions against true data
from sklearn.metrics import accuracy_score
print(f'accuracy:\t{accuracy_score(y_test, y_pred)}')
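
Accuracy is simply the fraction of predictions that match the true labels. A tiny sketch with made-up labels (`y_true_demo` and `y_pred_demo` are illustrative, not taken from the Iris data) shows the equivalence:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# toy labels, made up for illustration
y_true_demo = np.array([0, 1, 2, 2, 1])
y_pred_demo = np.array([0, 2, 2, 2, 1])

# 4 of the 5 predictions are correct
manual_acc = np.mean(y_true_demo == y_pred_demo)
print(manual_acc, accuracy_score(y_true_demo, y_pred_demo))  # 0.8 0.8
```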

For later use, we can also make a **scorer object** which scores a model against a dataset.

In [None]:
# use a scorer to calculate test set accuracy
from sklearn.metrics import make_scorer
scorer = make_scorer(accuracy_score)
scorer(model, X_test, y_test)
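
A scorer wraps the "predict, then compare" steps into a single callable with the signature `scorer(estimator, X, y)`. The sketch below, which uses illustrative names (`clf`, `acc_scorer`) and a model fitted on the full dataset for brevity, checks that three routes give the same number: the custom scorer, calling `accuracy_score` on predictions, and scikit-learn's built-in `"accuracy"` scorer.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, get_scorer, make_scorer

X_demo, y_demo = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

acc_scorer = make_scorer(accuracy_score)

via_scorer = acc_scorer(clf, X_demo, y_demo)                 # scorer(estimator, X, y)
via_predict = accuracy_score(y_demo, clf.predict(X_demo))    # metric(y_true, y_pred)
via_builtin = get_scorer("accuracy")(clf, X_demo, y_demo)    # built-in scorer by name
print(via_scorer == via_predict == via_builtin)
```

Functions that accept a `scoring` argument take either a scorer object like this one or a string such as `"accuracy"`.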

In [None]:
# visualize frequencies of true vs. predicted classes
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred)).plot(cmap='Blues')
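
When reading the plot, keep in mind that scikit-learn puts the true classes on the rows and the predicted classes on the columns. A small sketch with made-up labels (illustrative only) makes the layout explicit:

```python
from sklearn.metrics import confusion_matrix

# toy labels, made up for illustration
y_true_demo = [0, 1, 2, 2, 1]
y_pred_demo = [0, 2, 2, 2, 1]

# entry [i, j] counts examples whose true class is i and predicted class is j
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
# [[1 0 0]
#  [0 1 1]
#  [0 0 2]]

# correct predictions lie on the diagonal, so accuracy = trace / total
print(cm.trace() / cm.sum())  # 0.8
```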

# Pipelines

In this section, we will use [sklearn pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to bundle processing steps into a single model.

### Pipeline and cross validation

In [None]:
# create a pipeline model to bundle together processing steps
from sklearn.pipeline import Pipeline
pipeline = Pipeline([("MinMaxScaler", MinMaxScaler()), ("model", model)])

In [None]:
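# fit the pipeline with unscaled training data
# (X_train and X_test were overwritten with scaled values earlier, so we redo the split;
# the same random_state reproduces the same train/test partition)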
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
pipeline.fit(X_train, y_train)

In [None]:
# compute prediction score for fitted pipeline
scorer(pipeline, X_test, y_test)
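
To see what the pipeline does under the hood, the sketch below compares it with the manual "fit scaler, transform, fit model" sequence used earlier. The variable names (`pipe`, `scaler`, `clf`, `X_tr`, `X_te`) are introduced here for illustration, and fresh default estimators are used rather than the objects defined above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# manual version: fit the scaler on the training data, then fit the model on scaled data
scaler = MinMaxScaler().fit(X_tr)
clf = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
manual_pred = clf.predict(scaler.transform(X_te))

# pipeline version: the same two steps behind a single fit/predict interface
pipe = Pipeline([("MinMaxScaler", MinMaxScaler()), ("model", LogisticRegression())])
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

print(np.array_equal(manual_pred, pipe_pred))

# fitted steps remain accessible by name
print(pipe.named_steps["model"].coef_.shape)
```

Because the scaler is fitted inside `pipe.fit`, it only ever sees the data passed to `fit`, which is what makes pipelines safe to use inside cross validation.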

# Hyperparameter Search via Cross Validation
In this section, we will:
- Perform k-fold cross validation on the pipelined model.
- Use [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to perform a grid search over possible hyperparameter settings.

In [None]:
# cross validation score for a single hyperparameter setting
from sklearn.model_selection import cross_val_score
cross_val_score(pipeline, X_train, y_train, cv=5, scoring=scorer)
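
The following sketch spells out roughly what `cross_val_score` does for a classifier with `cv=5`: split the data into 5 stratified folds, fit a fresh copy of the pipeline on 4 of them, and score it on the held-out fold. It assumes the default behaviour of using unshuffled `StratifiedKFold` for integer `cv` with a classifier; the names `pipe` and `fold_scores` are illustrative, and the full dataset is used for brevity.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_iris(return_X_y=True)
pipe = Pipeline([("MinMaxScaler", MinMaxScaler()), ("model", LogisticRegression())])

fold_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(pipe)  # unfitted copy, so folds do not share state
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    val_pred = fold_model.predict(X_demo[val_idx])
    fold_scores.append(accuracy_score(y_demo[val_idx], val_pred))

print(np.round(fold_scores, 3))

# should match the one-line version
reference = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
print(np.allclose(fold_scores, reference))
```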

### Grid Search CV

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = [
    # hyperparameter grid 1
    {
        'model__fit_intercept': [True, False],
        'model__C': [.5, 1, 2],
        'model__penalty': ['l2'],
        'model__solver': ['lbfgs']
    },
    # hyperparameter grid 2
    {
        'model__fit_intercept': [True, False],
        'model__C': [.5, 1, 2],
        'model__penalty': ['l1'],
        'model__solver': ['saga'],
        'model__max_iter': [5000]
    }]

search = GridSearchCV(estimator=pipeline, param_grid=param_grid, n_jobs=-1, cv=5, scoring=scorer)

# Use `pipeline.get_params()` to show available pipeline hyperparameters
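
The keys in `param_grid` follow the pattern `<step name>__<parameter name>`, where the step name is the one given when the pipeline was built. A short sketch, using an illustrative pipeline `pipe` with default estimators, lists the available names:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = Pipeline([("MinMaxScaler", MinMaxScaler()), ("model", LogisticRegression())])

# keep only the parameters that belong to a named step ("<step>__<param>")
step_params = sorted(name for name in pipe.get_params() if "__" in name)
print(step_params)

# the same naming scheme works with set_params
pipe.set_params(model__C=0.5)
print(pipe.get_params()["model__C"])  # 0.5
```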

In [None]:
# sweep over all hyperparameter grids, fit model for each setting
search.fit(X_train, y_train)

In [None]:
# display best model hyperparameters
print(f"Best parameters: {search.best_params_}")
print(f"Best parameter CV Score: {search.best_score_}")

In [None]:
# get all hyperparameter setting results
results = pd.DataFrame(search.cv_results_)
results
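
The full `cv_results_` table has many columns. One possible way to make it easier to read is to keep a few key columns and sort by rank, as sketched below on a deliberately small grid. The names `demo_search`, `cols`, and `summary` are illustrative, and the single-parameter grid is a simplification of the one used above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_iris(return_X_y=True)
pipe = Pipeline([("MinMaxScaler", MinMaxScaler()), ("model", LogisticRegression(max_iter=1000))])

# small illustrative grid over the regularization strength only
demo_search = GridSearchCV(pipe, {"model__C": [0.5, 1, 2]}, cv=5, scoring="accuracy")
demo_search.fit(X_demo, y_demo)

# each hyperparameter gets a "param_<name>" column; scores are aggregated over folds
cols = ["param_model__C", "mean_test_score", "std_test_score", "rank_test_score"]
summary = pd.DataFrame(demo_search.cv_results_)[cols].sort_values("rank_test_score")
print(summary)
```

The top row of `summary` corresponds to `best_params_` and `best_score_`.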

In [None]:
# Evaluate the best model on test data
selected_model = search.best_estimator_
print(f'accuracy: {scorer(selected_model, X_test, y_test)}\n')
print(f'confusion matrix:\n{confusion_matrix(y_test, selected_model.predict(X_test))}')