# Lecture 9

This lecture will provide you with an overview of the [scikit-learn](https://scikit-learn.org/stable/index.html) machine learning library in Python.  You will gain an understanding and experience working with a classification model using two input features along with their labels.

An artificial dataset will be used to give you experience changing the input data and seeing how that affects the effectiveness of the classification model.

This lecture notebook walks you through:
1. Creating two synthetic datasets, each with two clusters, similar to the final Rain/NoRain dataset for Assignment 2
1. Using plots to visualize the synthetic datasets
    1. change `class_sep` and `random_state` to generate different versions of synthetic data
1. Building a model by 
    1. selecting columns from cleaned data to be features and a column to be the target label.  
    1. splitting the features and target label into train and test datasets

For the synthetic dataset:
1. Use `temperature` and `pressure` as features. Use `weather_label` as target label
1. Separate data into training and test datasets
1. Build a LogisticRegression classification model to predict `weather_label` based on input features `temperature` and `pressure`
1. Calculate model accuracy on the training and test datasets

Data visualization is an important tool to show what is happening as you perform operations on data.  You will use plotting functions from the `seaborn` [graphing library](https://seaborn.pydata.org/tutorial.html) to show:
1. What the two synthetic datasets look like
1. What data was selected to be in the training and test subsets
1. What decision boundary the model uses to decide if a datapoint belongs in class 0 or class 1
1. Where the model made a prediction mistake in the test dataset
1. What happens when the model is asked to make predictions on new data that is very different from the data it was trained on

### Helpful links

[Scikit-Learn Tutorial](https://scikit-learn.org/stable/tutorial/basic/tutorial.html#machine-learning-the-problem-setting)

[Scikit-Learn Glossary of Common Terms and API Elements](https://scikit-learn.org/stable/glossary.html)

[Seaborn graphing library](https://seaborn.pydata.org/tutorial.html)

[Seaborn example gallery](https://seaborn.pydata.org/examples/index.html)


# Jupyter Settings

### Change Jupyter gui for wider cells

In [29]:
# increase Jupyter cell width
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Import libraries needed for notebook

In [30]:
import pandas as pd
import matplotlib.pyplot as plt

# set pandas options 
pd.set_option('display.max_columns', None) # show all columns
pd.set_option('display.max_rows', None) # show all rows

In [31]:
from sklearn.datasets import make_classification # import function to generate synthetic (ie fake) test data
from sklearn.model_selection import train_test_split # import function to split cleaned data into train and test subsets
from sklearn.linear_model import LogisticRegression # import LogisticRegression model for classification
from sklearn.inspection import DecisionBoundaryDisplay # import DecisionBoundaryDisplay to see model decision boundary

import seaborn as sns # import plotting library
sns.set_palette('tab10') # set to the bright 10-colour palette; see https://seaborn.pydata.org/tutorial/color_palettes.html

# Generating Synthetic Data

Use `make_classification` to create two clusters from two input features


In [32]:
from sklearn.datasets import make_classification # import function to generate synthetic (ie fake) test data

# create first dataset that has clear separation between the two classes using class_sep=1.5
# X1, y1 are numpy arrays --> X1 are the features, y1 is the label 
X1,y1 = make_classification(n_samples=1000,                   # 1000 sample rows
                            n_features=2, n_classes=2,        # each row has two input features and belong to one of two classes
                            class_sep=1.5,                    # how far the classes are from each other <<< go ahead and change this value to see how this changes model accuracy below
                            n_clusters_per_class=1, n_redundant=0, n_repeated=0, n_informative=2, flip_y=0, # keep these values as-is
                            weights=[0.5,0.5],                # set to 50%/50% for same number of samples row in each class
                            random_state=93                   # set for repeatability (95 and 99 are other seeds to try) <<< go ahead and change this to see different data being created
                           )

# create second dataset that has overlap between the two classes --> class_sep=0.25
# X2, y2 are numpy arrays --> X2 are the features, y2 is the label 
X2,y2 = make_classification(n_samples=1000,                   # 1000 sample rows
                            n_features=2, n_classes=2,        # each row has two input features and belong to one of two classes
                            class_sep=0.25,                    # how far the classes are from each other <<< go ahead and change this value to see how this changes model accuracy below
                            n_clusters_per_class=1, n_redundant=0, n_repeated=0, n_informative=2, flip_y=0, # keep these values as-is
                            weights=[0.5,0.5],                # set to 50%/50% for two classes that are balanced
                            random_state=43                   # set for repeatability <<< go ahead and change this to see different data being created
                           )
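As a quick sanity check on what `make_classification` returns, the following minimal sketch regenerates a dataset with the same settings as the first dataset and inspects the array shapes and the number of rows per class. The names `X_demo`, `y_demo` and `class_counts` are illustrative only and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same settings as the first dataset above; names here are illustrative only
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=2, n_classes=2, class_sep=1.5,
    n_clusters_per_class=1, n_redundant=0, n_repeated=0, n_informative=2,
    flip_y=0, weights=[0.5, 0.5], random_state=93,
)

print("Feature array shape:", X_demo.shape)  # (rows, features)
print("Label array shape:", y_demo.shape)    # one label per row

class_counts = np.bincount(y_demo)           # number of rows for each class label
print("Rows per class:", class_counts)
```

With `weights=[0.5, 0.5]` the two counts should come out roughly equal, which is what "balanced classes" means here.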

### Create dataframe for dataset 1

In [33]:
clean_data1_df = pd.DataFrame({'temperature': X1[:,0], 'pressure': X1[:,1], 'weather_label':y1})
print("Column names:", clean_data1_df.columns.tolist())

print("Average Temp, Press:", clean_data1_df[['temperature', 'pressure']].mean().tolist())

clean_data1_df.head(3)

Column names: ['temperature', 'pressure', 'weather_label']
Average Temp, Press: [1.536482639515635, 0.009694903615141613]


| | temperature | pressure | weather_label |
|---|---|---|---|
| 0 | 0.3688 | -3.017312 | 1 |
| 1 | 3.332815 | 1.187653 | 0 |
| 2 | 1.79364 | -1.652942 | 1 |


### Create dataframe for dataset 2

In [34]:
clean_data2_df = pd.DataFrame({'temperature': X2[:,0], 'pressure': X2[:,1], 'weather_label':y2})
print("Column names:", clean_data2_df.columns.tolist())

print("Average Temp, Press:", clean_data2_df[['temperature', 'pressure']].mean().tolist())

clean_data2_df.head(3)

Column names: ['temperature', 'pressure', 'weather_label']
Average Temp, Press: [0.04112007358369708, 0.030806103519856295]


| | temperature | pressure | weather_label |
|---|---|---|---|
| 0 | 0.886261 | 0.837042 | 1 |
| 1 | -0.300718 | -0.269308 | 0 |
| 2 | 0.602021 | 0.166823 | 0 |


### Show the two datasets side-by-side

1. Notice how the two classes are somewhat separated for Dataset 1.  This is because `class_sep` was set to 1.5
1. For Dataset 2, the two classes show quite a bit of overlap.  This is because `class_sep` was set to a smaller value of 0.25 (basically less separation)

With Dataset 1, it's not too hard to use a straight line to separate the two classes.  Because of how the classes overlap for Dataset 2, it will be more challenging to determine where to draw the separation line.  

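Go ahead and try changing `class_sep` above, then re-run the cells above and the one below to see the effect yourself.

The machine learning model will find the boundary that best fits the training data. Logistic regression does this by minimizing a loss on the training data rather than by maximizing accuracy directly, although the two usually go hand in hand.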

In practical applications, you're more likely to encounter situations similar to Dataset 2
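To see the effect of `class_sep` numerically rather than visually, here is a minimal sketch that generates a dataset for several `class_sep` values, trains a `LogisticRegression` model on each, and reports the test accuracy. The helper `accuracy_for_class_sep` and the dictionary `sep_results` are hypothetical names introduced for this illustration, and the chosen values and seed are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def accuracy_for_class_sep(class_sep, seed=0):
    """Generate a two-cluster dataset and return a LogisticRegression test accuracy."""
    X, y = make_classification(
        n_samples=1000, n_features=2, n_classes=2, class_sep=class_sep,
        n_clusters_per_class=1, n_redundant=0, n_repeated=0, n_informative=2,
        flip_y=0, weights=[0.5, 0.5], random_state=seed,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(random_state=0).fit(X_train, y_train)
    return model.score(X_test, y_test)  # fraction of correct test predictions

# Try a few separations, from heavy overlap to clear separation
sep_results = {sep: accuracy_for_class_sep(sep) for sep in (0.25, 0.5, 1.0, 1.5)}
for sep, acc in sep_results.items():
    print(f"class_sep={sep}: test accuracy={acc:.3f}")
```

The exact numbers depend on the seed, so try a few seeds before drawing conclusions about any single value.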


In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8,4))
g1 = sns.scatterplot(x='temperature', y='pressure', hue='weather_label', data=clean_data1_df, ax=ax[0])
g1.set(title='Dataset 1') 
g2 = sns.scatterplot(x='temperature', y='pressure', hue='weather_label', data=clean_data2_df, ax=ax[1])
g2.set(title='Dataset 2')



# Build Model

From the synthetic data, features and target columns are selected to train the weather prediction model.

- create a dataframe `df_features` containing these fields `['temperature','pressure']`
- create a dataframe `df_target` containing labels used for classification `['weather_label']`
- split the features and target data into training and test sets using `sklearn.model_selection.train_test_split()`
- instantiate a `sklearn.linear_model.LogisticRegression()` model and train it using the training subsets of `df_features` and `df_target`
- check model accuracy using `.score()` function


### Create feature and target datasets

In [None]:
# create features and target dataframes
_features = ['temperature', 'pressure']
_target = 'weather_label'

df_features = clean_data1_df[_features]
df_target = clean_data1_df[_target]

### Split into train and test data subsets

The source data used to build a machine learning model is typically split into 2 subsets: training and test.

1. training data
    1. training data is used to train the model
1. test data
    1. test data is used to evaluate how well the model performs on new data.
    
Good accuracy on the test data means the model generalized well, i.e. it is able to make good predictions on data it has never seen before.

In [None]:
from sklearn.model_selection import train_test_split # import function to split cleaned data into train and test subsets

# split data into 80% train and 20% test datasets
percentage_for_testing = 0.2 # <<<<<<<<< set to 0.2 for 20% data for testing, 80% for training

df_features_train, df_features_test, df_target_train, df_target_test = train_test_split(
    df_features
    , df_target
    , test_size=percentage_for_testing  # set percentage of data reserved for testing
    , random_state=42 # set to any number you want for repeatability
)

### Visualize train and test datasets

Note that 80% of the data is in the left chart (training) and 20% is in the right chart (test).

This is controlled by `percentage_for_testing` in the above cell
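The 80%/20% proportion can also be checked by counting rows. The sketch below uses small stand-in arrays (`X_all`, `y_all`, both hypothetical) instead of the notebook's dataframes so that it runs on its own; the `train_test_split` call uses the same `test_size` and `random_state` as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_all = np.arange(2000).reshape(1000, 2)  # stand-in features: 1000 rows, 2 columns
y_all = np.arange(1000) % 2               # stand-in labels alternating 0, 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)

print("Training rows:", len(X_tr))
print("Test rows:", len(X_te))
print("Test fraction:", len(X_te) / len(X_all))

# Every row lands in exactly one subset: no row is shared between train and test
shared_rows = set(X_tr[:, 0].tolist()) & set(X_te[:, 0].tolist())
print("Rows shared between train and test:", len(shared_rows))
```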

In [None]:
# combine df_features_train, df_target_train into one dataframe for seaborn to plot
_plot_train_df = pd.concat([df_features_train, df_target_train], ignore_index=True, sort=False, axis=1) # axis=1 means put columns side-by-side
_plot_train_df.columns =  ['temperature', 'pressure', 'weather_label'] # rename columns so proper labels appear in graph

# combine df_features_test, df_target_test into one dataframe for seaborn to plot
_plot_test_df = pd.concat([df_features_test, df_target_test], ignore_index=True, sort=False, axis=1) # axis=1 means put columns side-by-side
_plot_test_df.columns =  ['temperature', 'pressure', 'weather_label'] # rename columns so proper labels appear in graph

In [None]:
_plot_train_df.head(3) # show the data frame used for plotting

In [None]:
_plot_test_df.head(3)  # show the data frame used for plotting

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8,4)) # create a figure with two subplots (1 row, 2 columns)

sns.scatterplot(x='temperature', y='pressure', hue='weather_label', data=_plot_train_df, ax=ax[0]).set(title='Training data subset')
sns.scatterplot(x='temperature', y='pressure', hue='weather_label', data=_plot_test_df, ax=ax[1]).set(title='Test data subset')

### Instantiate and train logistic regression classifier model

A `LogisticRegression` classifier is instantiated and assigned to variable `clf`

To train the classifier, the `fit()` method is called with the feature (`X=`) and target (`y=`) variables as parameters.

In [None]:
from sklearn.linear_model import LogisticRegression # import LogisticRegression model

# instantiate and train logistic regression classifier model
clf = LogisticRegression(random_state=0)  # again, setting random_state to a known value for repeatability

clf.fit(X=df_features_train.values, y=df_target_train.values) 
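To make it more concrete what `fit()` learns, here is a minimal sketch on a small stand-in dataset. For a two-class problem, the fitted model stores one weight per feature in `coef_` and a bias in `intercept_`; the predicted probability of class 1 is the sigmoid of the weighted sum. The names `X_demo`, `y_demo` and `demo_clf` are illustrative and separate from the notebook's `clf`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small stand-in dataset with two features and two classes
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)
demo_clf = LogisticRegression(random_state=0).fit(X_demo, y_demo)

weights = demo_clf.coef_[0]     # one learned weight per feature
bias = demo_clf.intercept_[0]   # learned bias term
print("weights:", weights, "bias:", bias)

# Reproduce the model's output by hand
linear_score = X_demo @ weights + bias                # weighted sum for each row
manual_prob = 1.0 / (1.0 + np.exp(-linear_score))     # sigmoid -> probability of class 1
manual_label = (linear_score > 0).astype(int)         # positive score -> class 1

print("Matches predict_proba:", np.allclose(manual_prob, demo_clf.predict_proba(X_demo)[:, 1]))
print("Matches predict:", np.array_equal(manual_label, demo_clf.predict(X_demo)))
```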

### Calculate mean model accuracy on training and test data

To evaluate how well the model performs, its accuracy is measured on the training data as well as on the test data.

Prediction accuracy is typically higher for training data than for test data. Why?
- During the `fit()` process, the model was optimized to fit the training data.
- As test data has not been seen by the model, imperfect model generalization produces worse accuracy

In [None]:
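# pass NumPy arrays (.values) to match how the model was fitted; this avoids a feature-name mismatch warning
score_with_train_data = clf.score(df_features_train.values, df_target_train.values)
score_wtih_test_data = clf.score(df_features_test.values, df_target_test.values)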
print("Average model accuracy(training data): {0}\nAverage model accuracy(test data): {1}".format(
    score_with_train_data,score_wtih_test_data )
     )
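For a classifier, `.score()` returns the mean accuracy, i.e. the fraction of rows where the predicted label equals the actual label. The sketch below checks this by hand on a stand-in dataset; all variable names in it are illustrative and independent of the notebook's variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
demo_clf = LogisticRegression(random_state=0).fit(X_train, y_train)

y_pred = demo_clf.predict(X_test)

score_builtin = demo_clf.score(X_test, y_test)    # accuracy via the model's own method
score_manual = np.mean(y_pred == y_test)          # fraction of matching labels
score_metric = accuracy_score(y_test, y_pred)     # same value via sklearn.metrics

print(score_builtin, score_manual, score_metric)
```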

## Show decision Boundary of model

This section shows how to visualize the decision boundary the LogisticRegression classifier found when using the two features (`temperature`, `pressure`) to predict `weather_label`.

`DecisionBoundaryDisplay` will show the boundary the model uses to separate the two classes.  
- Points on one side of the boundary are assigned to one class (yellow)
- Points on the other side of the boundary are assigned to the second class (purple)

The chart below also overlays the data points along with the actual class in bright yellow and dark purple.

For bright yellow points lying in the purple region, the model will give an incorrect prediction.

For dark purple points lying in the yellow region, the model will also give an incorrect prediction.

In [None]:
clean_data1_df.head(2) # show data frame as reminder of column names

In [None]:
import matplotlib.pyplot as plt # import matplotlib.pyplot for plotting
from sklearn.inspection import DecisionBoundaryDisplay # import DecisionBoundaryDisplay to see model decision boundary

# this will produce the 2-coloured background 
disp = DecisionBoundaryDisplay.from_estimator(
    estimator=clf,                          # clf is the classifier trained above
    X=clean_data1_df[_features],            # give clf the input features to visualize how it predicts (all rows in clean_data1_df)
    response_method="predict",
    alpha=0.4 # use 40% transparency
)

# this will overlay the data points, coloured by their actual labels, on top of the predicted regions
# DecisionBoundaryDisplay is based on matplotlib, so this uses matplotlib's version of scatter()
disp.ax_.scatter(x=clean_data1_df['temperature'], # x-axis for scatter plot
                 y=clean_data1_df['pressure'], # y-axis for scatter plot
                 c=clean_data1_df['weather_label'], # use target label to colour data points
                 edgecolor="k" # draw a black outline around each data point
                )

plt.show()
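With two features, the decision boundary of a logistic regression model is a straight line: the set of points where the weighted sum `w0*x0 + w1*x1 + b` equals zero. The following sketch, using a stand-in dataset and illustrative names, solves that equation for the second feature and confirms that points on the line get a decision score of (numerically) zero. It assumes the weight `w1` is not zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, class_sep=1.5, random_state=0,
)
demo_clf = LogisticRegression(random_state=0).fit(X_demo, y_demo)

w0, w1 = demo_clf.coef_[0]     # weights for the first and second feature
b = demo_clf.intercept_[0]     # bias term

# Boundary: w0*x0 + w1*x1 + b = 0  ->  x1 = -(w0*x0 + b) / w1   (assumes w1 != 0)
x0_line = np.linspace(X_demo[:, 0].min(), X_demo[:, 0].max(), 5)
x1_line = -(w0 * x0_line + b) / w1
line_points = np.column_stack([x0_line, x1_line])

# decision_function returns the weighted sum; it should be ~0 on the boundary
scores_on_line = demo_clf.decision_function(line_points)
print("Scores on the boundary line:", scores_on_line)
```

Points with a positive score fall on the class 1 side of the line, and points with a negative score fall on the class 0 side.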

In [None]:
score_wtih_test_data

## Visualize model performance on test data

The model's score on test data was 98.5%.  Let's visualize where the model made its mistakes.

In [None]:
# first, perform prediction on test set and save prediction to variable `predicted_labels`
predicted_labels = clf.predict(X=df_features_test.values)  # NumPy array input matches how clf was fitted

### Show accuracy of model on test data

Create a new dataframe from `df_features_test` to show 'predicted' and 'actual' labels along with a column to show when `predicted!=actual`

Call this new dataframe `df_features_test_with_predictions`

In [None]:
actual_labels = df_target_test

prediction_df = pd.DataFrame({'predicted': predicted_labels, 'actual': actual_labels})

df_features_test_with_predictions = pd.concat([df_features_test, prediction_df], axis=1)

df_features_test_with_predictions['difference'] = df_features_test_with_predictions['predicted']!=df_features_test_with_predictions['actual'] # column of boolean
df_features_test_with_predictions['difference'] = df_features_test_with_predictions['difference'].astype('int') # cast boolean to integer

# show rows where incorrect prediction was made (ie where difference == 1)
df_features_test_with_predictions.loc[df_features_test_with_predictions['difference']==1].head(10)
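Another common way to summarize prediction mistakes is a confusion matrix, which counts each combination of actual and predicted label. This is a minimal sketch on a stand-in dataset using `sklearn.metrics.confusion_matrix`; it is an alternative view and not part of the original notebook flow, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, class_sep=1.0, random_state=3,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
demo_clf = LogisticRegression(random_state=0).fit(X_train, y_train)
y_pred = demo_clf.predict(X_test)

# Rows are actual labels, columns are predicted labels
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)

n_wrong = cm[0, 1] + cm[1, 0]   # off-diagonal entries are the mistakes
print("Incorrect predictions:", n_wrong, "out of", len(y_test))
```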

### Show where the model made prediction mistakes on the test dataset

Yellow dots show where the model made an incorrect prediction (you have to squint to see the third yellow dot).

In [None]:
disp = DecisionBoundaryDisplay.from_estimator(
    estimator=clf, 
    X=df_features_test,  # use features from test dataset
    response_method="predict",
    alpha=0.4 # use 40% transparency
)
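disp.ax_.scatter(
    df_features_test_with_predictions['temperature'],   # x-axis
    df_features_test_with_predictions['pressure'],      # y-axis
    s=20,                                               # marker size
    c=df_features_test_with_predictions['difference'],  # colour by 1 = incorrect prediction, 0 = correct
    edgecolor="k"                                       # black outline around each marker
)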

plt.show()

# Saving Model for Future Use

Save this model for future prediction without having to go through the above training steps

In [None]:
from joblib import dump, load

model_filename = 'lecture9_logistic_regression_classifier.joblib'
dump(clf, model_filename) 

# Loading Existing Model and Running Prediction

In [None]:
clf = load(model_filename)
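One way to confirm that a saved model was restored correctly is to compare its predictions with those of the original. The sketch below does a round trip through a temporary directory so that no file is left behind; the file name `demo_classifier.joblib` and the variable names are placeholders chosen for this example.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)
demo_clf = LogisticRegression(random_state=0).fit(X_demo, y_demo)

# Save to a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_classifier.joblib")  # placeholder file name
    dump(demo_clf, path)
    restored_clf = load(path)

# The restored model should predict exactly what the original predicts
same_predictions = np.array_equal(demo_clf.predict(X_demo), restored_clf.predict(X_demo))
print("Restored model gives identical predictions:", same_predictions)
```

Only load model files from sources you trust, since joblib files are based on pickle and loading one can execute arbitrary code.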

## Use model for prediction with new data

What happens if we use the model trained on dataset1 with dataset2?

In [None]:
_features = ['temperature', 'pressure']
_target = 'weather_label'

df_features = clean_data2_df[_features]
df_target = clean_data2_df[_target]

### Calculate mean model accuracy on the new dataset

Why is accuracy so bad?

In [None]:
_features = ['temperature', 'pressure']
_target = 'weather_label'

# pass NumPy arrays (.values) to match how the model was fitted
print("Average model accuracy(new dataset): {0}".format(clf.score(clean_data2_df[_features].values, clean_data2_df[_target].values)))
predict_data2 = clf.predict(X=clean_data2_df[_features].values)

### Show what the model is doing

When the model is asked to make predictions for new data that is very different from training data, you can see how the model does not generalize well

- The left chart shows data that the model was trained on.
- The middle chart shows that data points are divided according to the decision boundary for Dataset 1.  This boundary does not work well with Dataset 2
- The right chart shows the actual class labels for Dataset 2

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(12,4))

_plot_new_data_df = clean_data2_df[_features].copy()
_plot_new_data_df['weather_label'] = predict_data2

g1 = sns.scatterplot(x='temperature', y='pressure', hue='weather_label', data=clean_data1_df, ax=ax[0])
g1.set(title='Dataset 1')
g2 = sns.scatterplot(x='temperature', y='pressure', hue='weather_label', data=_plot_new_data_df, ax=ax[1])
g2.set(title='Prediction on Dataset 2\nwith model trained on Dataset 1')
g3 = sns.scatterplot(x='temperature', y='pressure', hue='weather_label', data=clean_data2_df, ax=ax[2])
g3.set(title='Actual class separation for Dataset 2')
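To put a number on the mismatch, here is a minimal sketch that scores a model trained on one dataset against a different dataset, and compares that with a model trained on the second dataset's own training split. It regenerates two datasets with the same settings as this notebook. The helper `make_two_cluster_data` and the variables `model_a`, `model_b`, `acc_a_on_b` and `acc_b_on_b` are hypothetical names used only here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def make_two_cluster_data(class_sep, seed):
    """Generate a balanced two-feature, two-class dataset."""
    return make_classification(
        n_samples=1000, n_features=2, n_classes=2, class_sep=class_sep,
        n_clusters_per_class=1, n_redundant=0, n_repeated=0, n_informative=2,
        flip_y=0, weights=[0.5, 0.5], random_state=seed,
    )

X_a, y_a = make_two_cluster_data(class_sep=1.5, seed=93)    # well separated
X_b, y_b = make_two_cluster_data(class_sep=0.25, seed=43)   # heavily overlapping

# Model trained on dataset A, evaluated on dataset B
model_a = LogisticRegression(random_state=0).fit(X_a, y_a)
acc_a_on_b = model_a.score(X_b, y_b)

# Model trained on dataset B's own training split, evaluated on B's test split
X_b_train, X_b_test, y_b_train, y_b_test = train_test_split(
    X_b, y_b, test_size=0.2, random_state=42
)
model_b = LogisticRegression(random_state=0).fit(X_b_train, y_b_train)
acc_b_on_b = model_b.score(X_b_test, y_b_test)

print("Feature means, dataset A:", X_a.mean(axis=0))
print("Feature means, dataset B:", X_b.mean(axis=0))
print(f"Model A scored on dataset B: {acc_a_on_b:.3f}")
print(f"Model B scored on dataset B test split: {acc_b_on_b:.3f}")
```

Comparing the feature means of the two datasets is a simple first check for whether new data resembles the training data.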