<a href="https://colab.research.google.com/github/ella13162/DataScience/blob/main/ds_week_9_classification_seminar_student_version.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Lecture Content Flow**

#### **This lecture will classify the "survived" target variable of the "titanic" dataset and implement the following classification algorithms:**

1.   Logistic Regression
2.   Decision Tree for Classification Problems
3.   Random Forests for Classification Problems
4.   K-Nearest Neighbor for Classification Problems

including the following concept: *GridSearchCV for Hyperparameter Tuning*

and measuring the accuracy and performance of the models with the following metrics:

1.   Confusion Matrix
2.   Accuracy Score
3.   Precision
4.   Recall
5.   F1 Score

And the lecture will end with the following concept: *Automating Workflows with Pipelines*

#### **Importing necessary libraries**



In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
pd.set_option("display.max_rows",None)
from sklearn import preprocessing
import matplotlib
matplotlib.style.use('ggplot')
from sklearn.utils import shuffle
pd.options.display.float_format = '{:.2f}'.format

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")


%matplotlib inline

#### **Reading & Shuffling Data**

In [None]:
df = pd.read_csv('titanic.csv')
df = df.drop(columns = ["PassengerId", "Name", "Ticket"], axis = 1)

df = shuffle(df, random_state=42)

print('Data shape: ', df.shape)
df.head()

Data shape:  (891, 9)


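|     | Survived | Pclass | Sex    | Age  | SibSp | Parch | Fare  | Cabin | Embarked |
|-----|----------|--------|--------|------|-------|-------|-------|-------|----------|
| 709 | 1        | 3      | male   | NaN  | 1     | 1     | 15.25 | NaN   | C        |
| 439 | 0        | 2      | male   | 31.0 | 0     | 0     | 10.5  | NaN   | S        |
| 840 | 0        | 3      | male   | 20.0 | 0     | 0     | 7.92  | NaN   | S        |
| 720 | 1        | 2      | female | 6.0  | 0     | 1     | 33.0  | NaN   | S        |
| 39  | 1        | 3      | female | 14.0 | 1     | 0     | 11.24 | NaN   | C        |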


#### **Checking class balance for the "Survived" feature**

In [None]:
print(df['Survived'].value_counts())  # returns raw number
print(df['Survived'].value_counts(normalize=True))  # returns percentages

# the dataset is moderately imbalanced: 62% did not survive vs. 38% survived.

# Because of this imbalance, once we model the data we will also evaluate it with the following metrics:
    # precision
    # recall
    # f1 score

0    549
1    342
Name: Survived, dtype: int64
0   0.62
1   0.38
Name: Survived, dtype: float64


#### **Implementing Pre-Processing Steps**

1.   Listing Missing Values
2.   "Cabin" field is dropped
3.   "Embarked" field is imputed with the mode value of the field
4.   Parch and SibSp columns are combined under the "Family" field
5.   Age column is imputed and converted into categorical data:

In [None]:
def text_line(number, text):
  print(f"\n{number}: {text}")
  print(40*"-")

# 1. Listing Missing Values
text_line(1, "Missing Values")
print(df.isnull().sum())

# 2. Cabin field is dropped
df.drop("Cabin",inplace=True,axis=1)
text_line(2, "Cabin field is dropped")

# 3. Embarked field is imputed with the mode value of the field
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])  # assign back instead of a chained inplace fill
text_line(3, "Embarked field is imputed with the mode value")

# 4. Parch and SibSp columns are combined under the "Family" field
def combine(df,col1,col2):
    df["Family"] = df[col1]+df[col2]
    df.drop([col1,col2],inplace=True,axis=1)
    return df
df = combine(df,'SibSp','Parch')
text_line(4, "Parch and SibSp columns are combined under the Family field")

# 5. Age column is imputed and converted into categorical data, then Age field is dropped:
    # -1 to 0       => Missing
    #  0 to 5       => Infant
    #  5 to 12      => Child
    # 12 to 18      => Teenager
    # 18 to 35      => Young Adult
    # 35 to 60      => Adult
    # 60 to 100     => Senior

df["Age"] = df["Age"].fillna(-0.5)
def process_age(df,cut_points,label_names):
    df["Age_categories"] = pd.cut(df["Age"],cut_points,labels=label_names)
    return df
cut_points = [-1,0,5,12,18,35,60,100]
label_names = ["Missing","Infant","Child","Teenager","Young Adult","Adult","Senior"]
df = process_age(df,cut_points,label_names)
text_line(5, "Age column is imputed and converted into categorical data")

df.drop("Age",inplace=True,axis=1)
text_line(6, "Age column is dropped")

text_line(7, "Null in Training set")
print(df.isnull().sum())


1: Missing Values
----------------------------------------
Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
Embarked      2
dtype: int64

2: Cabin field is dropped
----------------------------------------

3: Embarked field is imputed with the mode value
----------------------------------------

4: Parch and SibSp columns are combined under the Family field
----------------------------------------

5: Age column is imputed and converted into categorical data
----------------------------------------

6: Null in Training set
----------------------------------------
Survived          0
Pclass            0
Sex               0
Age               0
Fare              0
Embarked          0
Family            0
Age_categories    0
dtype: int64
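
The age binning in step 5 relies on how `pd.cut` treats bin edges. The sketch below uses a handful of made-up ages (not rows from the Titanic data) to show that the intervals are right-closed by default, so an age of exactly 5 lands in "Infant" rather than "Child", and that the `-0.5` sentinel falls into the "Missing" bin.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or near the bin edges
ages = pd.Series([float("nan"), 0.5, 5, 5.1, 18, 35, 61])

cut_points = [-1, 0, 5, 12, 18, 35, 60, 100]
label_names = ["Missing", "Infant", "Child", "Teenager",
               "Young Adult", "Adult", "Senior"]

# Missing ages get a sentinel value that falls inside the (-1, 0] bin
filled = ages.fillna(-0.5)

# pd.cut builds right-closed intervals by default: (-1, 0], (0, 5], (5, 12], ...
age_categories = pd.cut(filled, cut_points, labels=label_names)

print(pd.DataFrame({"age": ages, "category": age_categories}))
```

If left-closed bins are preferred, `pd.cut` accepts `right=False`, which would move the boundary ages into the next category.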


#### **Getting the Numerical and Categorical Feature groups ready for Feature Scaling**

In [None]:
# Numerical features: every remaining column that is neither categorical nor the target
# The target (Survived) is excluded because scaling the target is generally not required

col = list(df.columns)
categorical_features = df.select_dtypes(include = ["object"]).columns
# Pclass and Age_categories are redefined as categorical data
cat_features = categorical_features.append(pd.Index(["Pclass", "Age_categories"]))

numerical_features_df = df.drop(columns = cat_features)
numerical_features_df.drop(columns = ["Survived"], inplace=True)

print('Categorical Features :',*cat_features)
print('Numerical Features :',*numerical_features_df)

Categorical Features : Sex Embarked Pclass Age_categories
Numerical Features : Age Fare Family


#### **Feature Scaling**


*   Robust Scaler
*   MinMax Scaler
*   Standard Scaler

Note: implemented only for numerical fields
<hr>

##### Extra Information about ***Robust Scaler***

When the data contain outliers, Robust Scaling is a good choice because it scales features using statistics that are robust to outliers.
It subtracts the median from each value and divides by the interquartile range (IQR), i.e. the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
During `fit`, the median and the IQR are stored so that the same scaling can be applied to future data with the `transform` method.
Because a few extreme values barely move the median and the IQR, these statistics usually describe data with outliers better than the sample mean and variance do.
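
As a minimal sketch of the description above, the following compares a manual `(x - median) / IQR` computation with `RobustScaler` on a small, made-up fare column containing one extreme value. The numbers are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical fares with one extreme outlier
fares = np.array([[7.25], [8.05], [13.0], [26.0], [31.0], [512.33]])

# Manual robust scaling: subtract the median, divide by the interquartile range
median = np.median(fares, axis=0)
q1, q3 = np.percentile(fares, [25, 75], axis=0)
manual = (fares - median) / (q3 - q1)

# RobustScaler learns the same statistics during fit
scaler = RobustScaler()
scaled = scaler.fit_transform(fares)

print("stored median:", scaler.center_, "stored IQR:", scaler.scale_)
print("manual and RobustScaler agree:", np.allclose(manual, scaled))
```

The fitted `center_` and `scale_` attributes are what `transform` reuses on future data.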

In [None]:
# Data Scaling is implemented only for numerical fields

from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

scaler = preprocessing.RobustScaler()
robust_df = scaler.fit_transform(numerical_features_df)
robust_df = pd.DataFrame(robust_df, columns = numerical_features_df.columns)

scaler = preprocessing.StandardScaler()
standard_df = scaler.fit_transform(numerical_features_df)
standard_df = pd.DataFrame(standard_df, columns = numerical_features_df.columns)

scaler = preprocessing.MinMaxScaler()
minmax_df = scaler.fit_transform(numerical_features_df)
minmax_df = pd.DataFrame(minmax_df, columns = numerical_features_df.columns)

fig, (ax1, ax2, ax3, ax4)  = plt.subplots(ncols = 4, figsize =(20, 5))

ax1.set_title('Before Scaling')
# sns.kdeplot(df['Age'], ax = ax1, color ='black')
sns.kdeplot(df['Fare'], ax = ax1, color ='green')
sns.kdeplot(df['Family'], ax = ax1, color ='red')

ax2.set_title('After Robust Scaling')
# sns.kdeplot(robust_df['Age'], ax = ax2, color ='black')
sns.kdeplot(robust_df['Fare'], ax = ax2, color ='green')
sns.kdeplot(robust_df['Family'], ax = ax2, color ='red')

ax3.set_title('After Standard Scaling')
# sns.kdeplot(standard_df['Age'], ax = ax3, color ='black')
sns.kdeplot(standard_df['Fare'], ax = ax3, color ='green')
sns.kdeplot(standard_df['Family'], ax = ax3, color ='red')

ax4.set_title('After Min-Max Scaling')
# sns.kdeplot(minmax_df['Age'], ax = ax4, color ='black')
sns.kdeplot(minmax_df['Fare'], ax = ax4, color ='green')
sns.kdeplot(minmax_df['Family'], ax = ax4, color ='red')
plt.show()


In [None]:
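# robust_df has a fresh 0..n-1 index while df keeps its shuffled index,
# so assign the raw values to avoid index-based misalignment
df[robust_df.columns] = robust_df.to_numpy()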

#### **Encoding Categorical Variables**



*   OneHotEncoder (this is what we are going to implement in the project)
*   LabelEncoder

Note: implemented only for categorical fields

In [None]:
# Encoding is only implemented for ... fields

...
encoder = OneHotEncoder(sparse_output = False, drop="first") # sparse_output replaced the sparse argument in scikit-learn 1.2
encoded_categories = encoder.fit_transform(df[cat_features]) # data
encoded_feature_names = encoder.get_feature_names_out(cat_features) # feature names
encoded_df = pd.DataFrame(encoded_categories, columns = encoded_feature_names) # creating a dataframe from the data with the feature names
df = pd.concat([df.reset_index(drop=True), encoded_df.reset_index(drop=True)], axis = ...)
df.drop(columns = cat_features, inplace=True) # drop the original categorical features

#### **Splitting Data**

In [None]:
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

# split data into input objects and output object and perform the train test split process

X = df.drop(["Survived"], axis = 1)
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify =y)

#### **Grid Search for Hyperparameter Tuning**

Grid Search is a technique for finding the hyperparameter values that give a model its best cross-validated score.

In short, it is used to improve the performance of our model.<br>
**What exactly are hyperparameters?**

They are the model settings we choose before training our model.


> **k** in KNN

> **max depth** in Decision Trees

> **Number of Trees** in Random Forest

As data scientists, we usually do not know in advance what these values should be. Without a systematic search, we are essentially guessing, mostly based on previous experience.
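
A minimal sketch of the idea, assuming synthetic data from `make_classification` and an illustrative two-parameter grid (both are assumptions made for this example): `GridSearchCV` trains and cross-validates one model per parameter combination and reports the best one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to illustrate the mechanics
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# Illustrative grid: 3 depths x 2 leaf sizes = 6 combinations
param_grid_demo = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid_demo,
    cv=5,                 # every combination is scored with 5-fold cross validation
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

n_combinations = len(search.cv_results_["params"])
print("combinations tried:", n_combinations)
print("best parameters:", search.best_params_)
print("best cross-validated score:", round(search.best_score_, 3))
```

With 6 combinations and 5 folds, this small search already fits 30 models, which is why larger grids take noticeably longer.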

In [None]:
# import the functionalities for Grid Search
...

#### **Best Evaluation Metric**

*   Accuracy (for balanced data only)
*   Precision (for imbalanced data)
*   Recall (for imbalanced data)
*   F1 Score (for imbalanced data)


If we talk about classification problems, the most common metrics are:

1.   **Balanced Data (a class split of up to about 60%-40%; a wider gap indicates imbalanced data)**

  Consider metrics:

  - Accuracy
  - Area under the ROC (Receiver Operating Characteristic) curve, or simply AUC
  
  ROC curves are best used when the dataset is well balanced, in other words, when the proportions of the positive and negative classes are similar.
  
  When the data is imbalanced, the ROC curve as well as the accuracy score can be misleading.
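
The following sketch shows why accuracy can mislead on imbalanced data. It assumes a made-up label vector with 90 negatives and 10 positives and a baseline that always predicts the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y_true = np.array([0] * 90 + [1] * 10)
X_dummy = np.zeros((100, 1))  # features are irrelevant for this baseline

# A baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_hat = baseline.predict(X_dummy)

acc = accuracy_score(y_true, y_hat)
rec = recall_score(y_true, y_hat)

print("accuracy:", acc)  # high, although the model never finds a positive case
print("recall:", rec)
```

The baseline reaches 90% accuracy without identifying a single positive observation, which is exactly the situation precision, recall and F1 are meant to expose.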

---

2.   **Imbalanced Data**

  Consider metrics:

  - Precision (P) (use Precision when the cost of False Positives is important)
  - Recall (R) (use Recall when the cost of False Negatives is important.)
  - F1 score (F1), the harmonic mean of precision and recall:  
  F1 = (2 × Precision × Recall) / (Precision + Recall)

  For imbalanced data, we can consider precision, recall and F1 together and look for the probability threshold at which F1 is maximized.
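
As a worked sketch of these metrics, the example below computes precision, recall and F1 by hand at the default 0.5 threshold and then scans candidate thresholds for the one with the highest F1. The labels and probabilities are made up for illustration; in practice the probabilities would come from a model's `predict_proba`, and the threshold should be chosen on validation data rather than on the test set.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.6, 0.45, 0.4, 0.55, 0.8, 0.9])

# 1) Precision, recall and F1 by hand at the default 0.5 threshold
y_hat = (y_prob >= 0.5).astype(int)
tp = int(np.sum((y_hat == 1) & (y_true == 1)))  # true positives
fp = int(np.sum((y_hat == 1) & (y_true == 0)))  # false positives
fn = int(np.sum((y_hat == 0) & (y_true == 1)))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)
print(precision, recall, f1_manual, f1_score(y_true, y_hat))

# 2) Try every observed probability as a threshold and keep the best F1
thresholds = np.unique(y_prob)
f1_values = [
    f1_score(y_true, (y_prob >= t).astype(int), zero_division=0)
    for t in thresholds
]
best_threshold = thresholds[int(np.argmax(f1_values))]
print("best threshold:", best_threshold, "F1:", max(f1_values))
```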

#### **Hyperparameter Tuning and Modelling Random Forest Classification Algorithm**

Random forest is an ensemble model consisting of many decision trees working together across different randomly selected subsets of the data, facilitating improved accuracy and stability.
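
To illustrate "many decision trees working together", the sketch below fits a small forest on synthetic data (an assumption for this example) and checks that the forest's predicted probabilities equal the average of the probabilities predicted by its individual trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_demo, y_demo)

# Each fitted tree is available in forest.estimators_
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Averaging the per-tree probabilities reproduces the forest's own output
mean_proba = tree_probas.mean(axis=0)
forest_proba = forest.predict_proba(X_demo)

print("number of trees:", len(forest.estimators_))
print("forest equals the tree average:", np.allclose(mean_proba, forest_proba))
```

Because each tree sees a different bootstrap sample and a random subset of features at each split, the individual trees disagree, and averaging them reduces the variance of the final prediction.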

In [None]:
from sklearn.ensemble import RandomForestClassifier
param_grid_cv = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],   # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
    'bootstrap': [True, False]
    }


# Explanations
# n_estimators: The number of decision trees in the forest.
# Increasing this value generally improves performance but also increases computation time.

# max_depth: The maximum depth of each decision tree.
# This controls how deeply each tree is allowed to grow during training and helps prevent overfitting.

# min_samples_split: The minimum number of samples required to split an internal node.
# Increasing this value helps prevent overfitting.

# min_samples_leaf: The minimum number of samples required to be at a leaf node.
# Increasing this value helps prevent overfitting by smoothing the model.

# max_features: The number of features to consider when looking for the best split.
# It can be a fixed value or a percentage of the total number of features.

# bootstrap: Whether to bootstrap samples when building trees.
# If set to True, each tree is built on a random subset of the training data with replacement.

# remember that the more parameters you specify, the more parameter combinations it will need to build, train and test
# this can lead to a much longer run time

# scoring = 'accuracy'
#           'precision'
#           'recall'
#           'f1'
#           'roc_auc'

gs_cv = GridSearchCV(
    estimator= RandomForestClassifier(random_state = 42),   # random_state=42 to get consistent result
    param_grid = param_grid_cv,
    cv = 5,                                                # cv is the number of the partitions that GridSearchCV will use for cross validation
    scoring = "precision",                                 # the scoring metric used to validate and rank the models
    n_jobs = -1,                                           # this can be computationally intensive; n_jobs = -1 uses all available CPU cores to speed things up
    refit = True)                                          # refit the best parameter combination on the whole training set


# it might take a little bit of time as it will build, train and test the different combinations of the model
gs_cv.fit(X_train, y_train)

In [None]:
# to obtain the best/optimal parameters

print(gs_cv.best_params_)

# to get the best model from the grid search object without any further re-modelling
best_model = gs_cv.best_estimator_
best_model

# remember that any parameter not included in the param_grid_cv dictionary
# keeps its default value

# Predicting output
y_pred = best_model.predict(X_test)

#### **Performance Measurement for Random Forest Classification Algorithm**

##### **Confusion Matrix**

In [None]:
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize = (10, 10))
plt.style.use("seaborn-v0_8-poster")  # named "seaborn-poster" before matplotlib 3.6
plt.matshow(conf_matrix, cmap="coolwarm")
plt.gca().xaxis.tick_bottom()
plt.title("Confusion Matrix")
plt.ylabel("Actual Class")
plt.xlabel("Predicted Class")
print(f"\nConfusion Matrix: \n {20*'-'}\n{conf_matrix}")

for (i, j), corr_value in np.ndenumerate(conf_matrix):
    plt.text(j, i, corr_value, ha="center", va="center", fontsize = 20) # ha:horizontal alignment, va:vertical alignment

plt.show()

##### **Accuracy Score**

In [None]:
# The proportion of correct classifications out of all attempted classifications.
print(f"\nAccuracy score of the model: {accuracy_score(y_test, y_pred)}\n")

##### **Precision**

In [None]:
# of all observations that were predicted as positive, how many were actually positive?
print(f"\nPrecision score of the model: {round(precision_score(y_test, y_pred), 2)}\n")

##### **Recall**

In [None]:
# of all actually positive observations, how many did we predict as positive?
print(f"\nRecall (sensitivity) score of the model: {round(recall_score(y_test, y_pred), 2)}\n")

##### **F1 Score**

In [None]:
print(f"\nF1 score of the model: {round(f1_score(y_test, y_pred, zero_division=0), 2)}\n")

#### **Hyperparameter Tuning and Modelling Logistic Regression Classification Algorithm**

In [None]:
from sklearn.linear_model import LogisticRegression

# Define the hyperparameters to search
param_grid_cv = {
    'penalty': ['l1', 'l2'],
    'C': [0.01, 0.1, 1.0, 10.0],
    'solver': ['liblinear', 'lbfgs', 'saga'],   # note: 'lbfgs' does not support the 'l1' penalty, so those combinations will fail to fit
    'max_iter': [100, 200, 300],
    'class_weight': [None, 'balanced'],
    'random_state': [42]
}

# Explanations
# penalty: Determines the regularization term to apply.
# It can be either 'l1' for L1 regularization (Lasso) or 'l2' for L2 regularization (Ridge).
# Regularization helps prevent overfitting by penalizing large coefficients.

# C: Inverse of regularization strength.
# Smaller values specify stronger regularization.
# It's the inverse of lambda in regularization equations.
# Higher values of C specify less regularization, which can lead to overfitting if the model complexity is not controlled.

# solver: Algorithm to use in optimization problem.
# The choice of solver depends on the size and structure of the data.
# For smaller datasets, 'liblinear' is usually a good choice.
# For larger datasets, 'lbfgs', 'sag', or 'saga' might be more appropriate.

# max_iter: Maximum number of iterations taken for the solvers to converge.

# tol: Tolerance for stopping criteria.

# class_weight: Weights associated with classes.
# Useful for handling imbalanced datasets.
# Options include 'balanced', which automatically adjusts weights inversely proportional to class frequencies, or you can specify your own weights.

# random_state: Seed for random number generation. It ensures reproducibility of results.

# remember that the more parameters you specify, the more parameter combinations it will need to build, train and test
# this can lead to a much longer run time

# scoring = 'accuracy'
#           'precision'
#           'recall'
#           'f1'
#           'roc_auc'


gs_cv = GridSearchCV(
    estimator= LogisticRegression(random_state = 42),   # random_state=42 to get consistent result
    param_grid = param_grid_cv,
    cv = 5,                                                # cv is the number of the partitions that GridSearchCV will use for cross validation
    scoring = "recall",                                    # the scoring metric used to validate and rank the models
    n_jobs = -1,                                           # this can be computationally intensive; n_jobs = -1 uses all available CPU cores to speed things up
    refit = True)                                          # refit the best parameter combination on the whole training set


# it might take a little bit of time as it will build, train and test the different combinations of the model
gs_cv.fit(X_train, y_train)

In [None]:
# to obtain the best/optimal parameters

print(gs_cv.best_params_)

# to get the best model from the grid search object without any further re-modelling
best_model = gs_cv.best_estimator_
best_model

# remember that any parameter not included in the param_grid_cv dictionary
# keeps its default value

# Predicting output
y_pred = best_model.predict(X_test)

#### **Performance Measurement for Logistic Regression Classification Algorithm**

##### **Confusion Matrix**



In [None]:
conf_matrix = confusion_matrix(y_test, y_pred)

plt.style.use("seaborn-v0_8-poster")  # named "seaborn-poster" before matplotlib 3.6
plt.matshow(conf_matrix, cmap="coolwarm")
plt.gca().xaxis.tick_bottom()
plt.title("Confusion Matrix")
plt.ylabel("Actual Class")
plt.xlabel("Predicted Class")
print(f"\nConfusion Matrix: \n {20*'-'}\n{conf_matrix}")

for (i, j), corr_value in np.ndenumerate(conf_matrix):
    print(f"column-row: {j} {i}, value: {corr_value}")
    plt.text(j, i, corr_value, ha="center", va="center", fontsize = 20) # ha:horizontal alignment, va:vertical alignment

plt.show()

##### **Accuracy**

In [None]:
# The proportion of correct classifications out of all attempted classifications.
print(f"\nAccuracy score of the model: {accuracy_score(y_test, y_pred)}\n")

##### **Precision**

In [None]:
# of all observations that were predicted as positive, how many were actually positive?
print(f"\nPrecision score of the model: {round(precision_score(y_test, y_pred), 2)}\n")

# each time we predicted the positive class, we were correct 75% of the time

##### **Recall**

In [None]:
# of all actually positive observations, how many did we predict as positive?
print(f"\nRecall score of the model: {round(recall_score(y_test, y_pred), 2)}\n")

##### **F1 Score**

In [None]:
print(f"\nF1 score of the model: {round(f1_score(y_test, y_pred, zero_division=0), 2)}\n")


#### **Hyperparameter Tuning and Modelling Decision Tree Classification Algorithm**

In [None]:
# Your turn
...

# Explanations
# Criterion: This hyperparameter measures the quality of a split.
# Typical options are "gini" for the Gini impurity and "entropy" for information gain.

# Max Depth: It specifies the maximum depth of the tree.
# A deeper tree can capture more complex relationships in the data but may also lead to overfitting.

# Min Samples Split: The minimum number of samples required to split an internal node.
# Increasing this value can prevent overfitting by ensuring that each node has enough samples to split.

# Min Samples Leaf: The minimum number of samples required to be at a leaf node.
# It prevents the tree from creating nodes with very few samples, which can lead to overfitting.

# Max Features: The number of features to consider when looking for the best split.
# It helps to reduce the number of features considered, which can speed up the training process and reduce overfitting.

...

In [None]:
# to obtain the best/optimal parameters

print(gs_cv.best_params_)

# to get the best model from the grid search object without any further re-modelling
best_model = gs_cv.best_estimator_
best_model

# remember that any parameter not included in the param_grid_cv dictionary
# keeps its default value

# Predicting output
y_pred = best_model.predict(X_test)

#### **Performance Measurement for Decision Tree Classification Algorithm**

##### **Confusion Matrix**

In [None]:
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize = (10, 10))
plt.style.use("seaborn-v0_8-poster")  # named "seaborn-poster" before matplotlib 3.6
plt.matshow(conf_matrix, cmap="coolwarm")
plt.gca().xaxis.tick_bottom()
plt.title("Confusion Matrix")
plt.ylabel("Actual Class")
plt.xlabel("Predicted Class")
print(f"\nConfusion Matrix: \n {20*'-'}\n{conf_matrix}")

for (i, j), corr_value in np.ndenumerate(conf_matrix):
    plt.text(j, i, corr_value, ha="center", va="center", fontsize = 20) # ha:horizontal alignment, va:vertical alignment

plt.show()

##### **Accuracy Score**

In [None]:
# The proportion of correct classifications out of all attempted classifications.
print(f"\nAccuracy score of the model: {accuracy_score(y_test, y_pred)}\n")

##### **Precision**

In [None]:
# of all observations that were predicted as positive, how many were actually positive?
print(f"\nPrecision score of the model: {round(precision_score(y_test, y_pred), 2)}\n")

# each time we predicted the positive class, we were correct 93% of the time

##### **Recall**

In [None]:
# of all actually positive observations, how many did we predict as positive?
print(f"\nRecall score of the model: {round(recall_score(y_test, y_pred), 2)}\n")

##### **F1 Score**

In [None]:
f1_score(y_test, y_pred, zero_division=0)

#### **Hyperparameter Tuning and Modelling KNN Classification Algorithm**

The KNN algorithm predicts a class for an unknown data point using the most popular class of a number of nearby known data points.

The number of nearby data points used to form the prediction is denoted by k.

Make sure you handle outliers when performing this algorithm as this is a distance-based model. Outliers can cause issues particularly when scaling the data.
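
The voting rule described above can be sketched in a few lines of NumPy and compared with `KNeighborsClassifier`. The data is synthetic and the helper `knn_predict` is a hypothetical name introduced only for this illustration; it assumes Euclidean distance and equally weighted neighbors, which match the scikit-learn defaults.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 100 known points and 20 new points to classify
X_demo, y_demo = make_classification(n_samples=120, n_features=4, random_state=42)
X_known, y_known = X_demo[:100], y_demo[:100]
X_new = X_demo[100:]

k = 5  # an odd k avoids tied votes when there are two classes

def knn_predict(x):
    """Predict the majority class among the k nearest known points (hypothetical helper)."""
    distances = np.linalg.norm(X_known - x, axis=1)  # Euclidean distance to every known point
    nearest = np.argsort(distances)[:k]              # indices of the k closest points
    return np.bincount(y_known[nearest]).argmax()    # most frequent class among them

manual_pred = np.array([knn_predict(x) for x in X_new])
sk_pred = KNeighborsClassifier(n_neighbors=k).fit(X_known, y_known).predict(X_new)

print("manual vote matches scikit-learn:", bool((manual_pred == sk_pred).all()))
```

Because every prediction depends directly on distances, a feature with a much larger scale (or a few extreme values) can dominate the vote, which is why scaling and outlier handling matter for this model.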

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Define the hyperparameters to search
param_grid_cv = {
    ...
}

# Explanations
# n_neighbors: Number of neighbors to consider.
# It's a crucial hyperparameter that controls the model complexity.
# Higher values lead to smoother decision boundaries but can result in underfitting, while lower values can lead to overfitting.

# weights: Determines how the neighbors are weighted.
# Possible options are 'uniform' (all neighbors are weighted equally) and 'distance' (closer neighbors have more influence).

# algorithm: Specifies the algorithm used to compute the nearest neighbors.
# Options include 'auto', 'ball_tree', 'kd_tree', and 'brute'.
# The 'auto' option automatically selects the most appropriate algorithm based on the input data.

# metric: The distance metric used to measure the distance between instances.
# Common options include 'euclidean', 'manhattan', and 'minkowski'.

# leaf_size: Leaf size passed to BallTree or KDTree.
# It affects the speed of the construction and query but may not have a significant impact on the quality of the model.

# remember that the more parameters you specify, the more parameter combinations it will need to build, train and test
# this can lead to a much longer run time

# scoring = 'accuracy'
#           'precision'
#           'recall'
#           'f1'
#           'roc_auc'

gs_cv = GridSearchCV(
    ...
    )


# it might take a little bit of time as it will build, train and test the different combinations of the model
gs_cv.fit(X_train, y_train)

In [None]:
# to obtain the best/optimal parameters

print(gs_cv.best_params_)

# to get the best model from the grid search object without any further re-modelling
best_model = gs_cv.best_estimator_
best_model

# remember that any parameter not included in the param_grid_cv dictionary
# keeps its default value

# Predicting output
y_pred = best_model.predict(X_test)

#### **Performance Measurement for KNN Classification Algorithm**

##### **Confusion Matrix**

In [None]:
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize = (10, 10))
plt.style.use("seaborn-v0_8-poster")  # named "seaborn-poster" before matplotlib 3.6
plt.matshow(conf_matrix, cmap="coolwarm")
plt.gca().xaxis.tick_bottom()
plt.title("Confusion Matrix")
plt.ylabel("Actual Class")
plt.xlabel("Predicted Class")
print(f"\nConfusion Matrix: \n {20*'-'}\n{conf_matrix}")

for (i, j), corr_value in np.ndenumerate(conf_matrix):
    print(f"column-row: {j} {i}, value: {corr_value}")
    plt.text(j, i, corr_value, ha="center", va="center", fontsize = 20) # ha:horizontal alignment, va:vertical alignment

plt.show()

##### **Accuracy**

In [None]:
# The proportion of correct classifications out of all attempted classifications.
print(f"\nAccuracy score of the model: {accuracy_score(y_test, y_pred)}\n")

##### **Precision**

In [None]:
# of all observations that were predicted as positive, how many were actually positive?
print(f"\nPrecision score of the model: {round(precision_score(y_test, y_pred), 2)}\n")

##### **Recall**

In [None]:
# of all actually positive observations, how many did we predict as positive?
print(f"\nRecall score of the model: {round(recall_score(y_test, y_pred), 2)}\n")

##### **F1 Score**

In [None]:
f1_score(y_test, y_pred, zero_division=0)

#### **Automating Workflows with Pipelines**

In [None]:
# pipelines are great for keeping the workflow clean and effective from start to finish.
# so far, we have kept each step in the process separate, including the data preparation step where
# we handled tasks such as imputing missing values, encoding and so on.

# pipelines tie all these steps together.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

**Setup Pipelines**

We will go through a very small dataset to which we can apply several column transformations. This dataset is for demonstration purposes only.

Once you are comfortable with the Pipeline concept, we will apply Pipeline objects to the datasets that we have seen so far.
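
Before working with the real file, here is a minimal sketch of what a preprocessing pipeline produces. It uses a four-row, made-up frame and step names chosen for this example only (all assumptions): missing values are imputed, the numeric column is standardized, and the categorical column is one-hot encoded in a single `fit_transform` call.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "gender": pd.Series(["M", "F", np.nan, "F"], dtype=object),
})

# Numeric branch: fill gaps with the mean, then standardize
numeric = Pipeline(steps=[("imputer", SimpleImputer(strategy="mean")),
                          ("scaler", StandardScaler())])

# Categorical branch: fill gaps with the most frequent value, then one-hot encode
categorical = Pipeline(steps=[("imputer", SimpleImputer(strategy="most_frequent")),
                              ("ohe", OneHotEncoder(handle_unknown="ignore"))])

# sparse_threshold=0 forces a dense array so the result is easy to inspect
preprocess = ColumnTransformer(
    transformers=[("num", numeric, ["age"]), ("cat", categorical, ["gender"])],
    sparse_threshold=0,
)

transformed = preprocess.fit_transform(toy)
print(transformed.shape)  # 1 scaled numeric column + 2 one-hot columns (F, M)
print(transformed)
```

The same fitted object can later call `transform` on new rows, so the imputation values and scaling statistics learned here are reused rather than recomputed.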

In [None]:
# Reading Data

df = pd.read_csv('pipeline_data.csv')

df = shuffle(df, random_state=42)

print('Data shape: ', df.shape)

print(df.isna().sum())
df.head()

Data shape:  (100, 4)
purchase        0
age             3
gender          3
credit_score    5
dtype: int64


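|    | purchase | age  | gender | credit_score |
|----|----------|------|--------|--------------|
| 83 | 1        | 34.0 | F      | 161.0        |
| 53 | 1        | 35.0 | M      | 318.0        |
| 70 | 1        | 26.0 | M      | 123.0        |
| 45 | 0        | 41.0 | F      | 517.0        |
| 44 | 1        | 26.0 | M      | 250.0        |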


In [None]:
df.info()

# point out the data type of the columns

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 83 to 51
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   purchase      100 non-null    int64  
 1   age           97 non-null     float64
 2   gender        97 non-null     object 
 3   credit_score  95 non-null     float64
dtypes: float64(2), int64(1), object(1)
memory usage: 3.9+ KB


In [None]:
# the dataset has both numerical and categorical columns
# we will apply null value imputation with the SimpleImputer for both categorical and numerical variables,
# one hot encoding with the OneHotEncoder, scaling for the numerical fields with StandardScaler

# lets first split the data:
X=df.drop(["purchase"], axis = 1)
y = df["purchase"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
# we will specify the numerical and categorical features in a separate list each

numeric_features = ["age", "credit_score"]
categorical_features = ["gender"]

In [None]:
# we are now ready to setup some preprocessing pipelines

In [None]:
# numeric feature pipeline - scaling

numeric_transformer = Pipeline(steps = [("imputer", SimpleImputer()),
                                        ("scaler", StandardScaler())])

In [None]:
# Categorical feature transformer

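# since the column is categorical, we need to specify the imputation strategy
# (the default strategy, the mean, only works for numeric data)
categorical_transformer = Pipeline(steps = [("imputer", SimpleImputer(strategy = "constant", fill_value = "U")),
                                            ("ohe", OneHotEncoder(handle_unknown = "ignore"))])  # ignore categories not seen during fit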

In [None]:
# once we run the pipelines, we are ready to pass those objects into one, overall object
# we will use the column transformer functionality

preprocessing_pipeline = ColumnTransformer(transformers = [("numeric", numeric_transformer, numeric_features),
                                                           ("categorical", categorical_transformer, categorical_features)])

In [None]:
# we have so far identified the pipelines, but we have not applied them yet
# now, we are ready to apply them
# we will start with the logistic regression

In [None]:
#lets now create a pipeline where we will instantiate the Logistic Regression classifier

# run it
classifier = Pipeline(steps = [("preprocessing_pipeline", preprocessing_pipeline),
                                   ("classifier", LogisticRegression(random_state = 42))])


# and now that we have created the classifier object, we can fit the model

classifier.fit(X_train, y_train)

# the pipeline applies the same fitted preprocessing steps to X_test before predicting
y_pred = classifier.predict(X_test)

accuracy_score(y_test, y_pred)

In [None]:
#lets now create a pipeline where we will instantiate the Random Forest classifier

# run it
classifier = Pipeline(steps = [("preprocessing_pipeline", preprocessing_pipeline),
                                   ("classifier", RandomForestClassifier(random_state = 42))])

# and now that we have created the classifier object, we can fit the model

classifier.fit(X_train, y_train)

# the pipeline applies the same fitted preprocessing steps to X_test before predicting
y_pred = classifier.predict(X_test)

accuracy_score(y_test, y_pred)

0.8

In [None]:
# as you can see, we are passing unprocessed data
# we could pass any new and unprocessed data as well

In [None]:
# we can save the pipeline for future use

import joblib

joblib.dump(classifier, "model.joblib")


In [None]:
# restart kernel

In [None]:
import joblib
import pandas as pd
import numpy as np

In [None]:
# lets import the pipeline as an object to the environment

model = joblib.load("model.joblib")

In [None]:
# once the model is loaded, we can create a brand new dataset and apply the pipeline on this unprocessed dataset

new_df = pd.DataFrame({"age": [25, np.nan, 50],
                       "gender": ["M", "F", np.nan],
                       "credit_score": [200, 100, 500]})

In [None]:
# now, lets pass the data in and receive predictions

model.predict(new_df)