<a href="https://colab.research.google.com/github/astrapi69/DroidBallet/blob/master/MLG_D3_LC1_Building_Classification_Pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><a target="_blank" href="https://academy.constructor.org/"><img src="https://jobtracker.ai/static/media/constructor_academy_colour.b86fa87f.png" width="200" style="background:none; border:none; box-shadow:none;" /></a> </center>

_____

<center> <h1> Building Classification Pipelines (Live coding) </h1> </center>

<p style="margin-bottom:1cm;"></p>

_____

<center>Constructor Academy, 2024</center>

__Topics covered__

- Handling numeric and categorical data
- Data preprocessing
- Feature Encoding and Scaling
- Logistic Regression
- Classification Metrics
- ML Pipelines

## Load Dependencies

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import confusion_matrix, classification_report

import matplotlib.pyplot as plt
import seaborn as sns

## Loading Dataset


The raw file marks missing values with `?`. We first load it as-is to see this, then reload it while telling pandas which values to treat as missing (NaN).
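
As a small illustration of the idea, the sketch below parses a hypothetical three-row CSV (not taken from `adult.csv`) and shows how `na_values` turns `?` entries into NaN. The sample rows and variable names are assumptions made for this example only.

```python
import io

import pandas as pd

# Hypothetical sample rows (assumption for illustration, not real dataset records)
raw_csv = io.StringIO(
    "age,workclass,occupation\n"
    "25,Private,Sales\n"
    "18,?,?\n"
    "29,?,Craft-repair\n"
)

# Treat '?' and empty strings as missing values while parsing
sample = pd.read_csv(raw_csv, na_values=["?", ""])

# Count the missing values per column
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

Without `na_values`, the `?` entries would be read as ordinary strings and `isnull()` would report no missing values.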



In [None]:
# adult.csv
gdrive_data_url = "https://drive.google.com/file/d/1DcH68AtzKIcNc5KWpZV7pV1mR65jpA0y/view?usp=share_link"
file_id = gdrive_data_url.split('/')[-2]
data_file='https://drive.google.com/uc?export=download&id=' + file_id
df = pd.read_csv(data_file)
print(df.shape)
df.head()

(48842, 15)


| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


In [None]:
# reload, treating '?' and empty strings as missing values (NaN)
df = pd.read_csv(data_file, na_values=['?', ''])
# dropping a column we do not use as a feature
df = df.drop(columns='fnlwgt')
df.head(10)

| | age | workclass | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | NaN | Some-college | 10 | Never-married | NaN | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |
| 5 | 34 | Private | 10th | 6 | Never-married | Other-service | Not-in-family | White | Male | 0 | 0 | 30 | United-States | <=50K |
| 6 | 29 | NaN | HS-grad | 9 | Never-married | NaN | Unmarried | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 7 | 63 | Self-emp-not-inc | Prof-school | 15 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 3103 | 0 | 32 | United-States | >50K |
| 8 | 24 | Private | Some-college | 10 | Never-married | Other-service | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 9 | 55 | Private | 7th-8th | 4 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 10 | United-States | <=50K |


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        46043 non-null  object
 2   education        48842 non-null  object
 3   educational-num  48842 non-null  int64 
 4   marital-status   48842 non-null  object
 5   occupation       46033 non-null  object
 6   relationship     48842 non-null  object
 7   race             48842 non-null  object
 8   gender           48842 non-null  object
 9   capital-gain     48842 non-null  int64 
 10  capital-loss     48842 non-null  int64 
 11  hours-per-week   48842 non-null  int64 
 12  native-country   47985 non-null  object
 13  income           48842 non-null  object
dtypes: int64(5), object(9)
memory usage: 5.2+ MB


In [None]:
df.isnull().sum()

age                   0
workclass          2799
education             0
educational-num       0
marital-status        0
occupation         2809
relationship          0
race                  0
gender                0
capital-gain          0
capital-loss          0
hours-per-week        0
native-country      857
income                0
dtype: int64

## Training a Logistic Regression Model with Pipelines

We will walk through the major steps needed to train a first classification model using scikit-learn pipelines, which often make the code more concise and require fewer lines.

## Split Dataset into Train and Test Datasets

We split the dataset 70:30 into train and test sets, using stratified sampling to keep the class distributions similar in both.

In [None]:
X = df.drop(columns=['income'])
y = df['income']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
X_train.shape, X_test.shape

((34189, 13), (14653, 13))

In [None]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34189 entries, 38865 to 4610
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              34189 non-null  int64 
 1   workclass        32263 non-null  object
 2   education        34189 non-null  object
 3   educational-num  34189 non-null  int64 
 4   marital-status   34189 non-null  object
 5   occupation       32253 non-null  object
 6   relationship     34189 non-null  object
 7   race             34189 non-null  object
 8   gender           34189 non-null  object
 9   capital-gain     34189 non-null  int64 
 10  capital-loss     34189 non-null  int64 
 11  hours-per-week   34189 non-null  int64 
 12  native-country   33599 non-null  object
dtypes: int64(5), object(8)
memory usage: 3.7+ MB


<span style="color:orange"> **We first want to check how balanced our data is. We can use the built-in pandas method 'value_counts' to do this.**</span>

In [None]:
y_train.value_counts()

<=50K    26008
>50K      8181
Name: income, dtype: int64

In [None]:
y_test.value_counts()

<=50K    11147
>50K      3506
Name: income, dtype: int64

In [None]:
y_train.value_counts(normalize=True)

<=50K    0.760713
>50K     0.239287
Name: income, dtype: float64

In [None]:
y_test.value_counts(normalize=True)

<=50K    0.760732
>50K     0.239268
Name: income, dtype: float64
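
The near-identical class shares in the train and test sets come from `stratify=y`. A minimal sketch on a hypothetical 100-row toy dataset (the data and the `toy_` names are assumptions for illustration) shows the same effect at a smaller scale:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 80% '<=50K', 20% '>50K'
toy_y = pd.Series(["<=50K"] * 80 + [">50K"] * 20)
toy_X = pd.DataFrame({"feature": np.arange(100)})

# Same split settings as in the notebook, applied to the toy data
toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y
)

# Share of the minority class in each split
train_share = (toy_y_tr == ">50K").mean()
test_share = (toy_y_te == ">50K").mean()
print(train_share, test_share)
```

Both shares stay at the original 20% minority rate, whereas a purely random split could drift away from it, especially on small datasets.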

## Missing Values Imputation and Feature Encoding


Here we will use the following strategy for missing value imputation and feature encoding:

- Separate the features into numeric and categorical, based on the training data
- Fill missing values in categorical data with 'Not Available', since they can't be guessed
- One-hot encode the categorical data to get dummy-encoded features
- Fill missing values in numeric data using a K-nearest neighbors model
- Scale numeric data using a standard scaler
- Combine the one-hot encoded categorical features and the numeric features to form the final feature set
- Apply the same transformations, fitted on the training data, to the test data

## Separate categorical and numeric columns

We need to treat these features separately, just like before, because each type gets its own preprocessing.

In [None]:
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()
numeric_features = X_train.select_dtypes(exclude=['object']).columns.tolist()

categorical_features, numeric_features

(['workclass',
  'education',
  'marital-status',
  'occupation',
  'relationship',
  'race',
  'gender',
  'native-country'],
 ['age', 'educational-num', 'capital-gain', 'capital-loss', 'hours-per-week'])

## Define Categorical Transformer Pipeline

Consists of the series of steps needed to transform the categorical features. This includes:

- Constant imputer to fill missing values
- One-hot encoder to get dummy variables
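
To see what these two steps do, here is a minimal sketch on a hypothetical one-column toy dataset (the arrays and `toy_` names are assumptions for illustration). Unlike the notebook's transformer, it keeps the encoder's default sparse output and converts it with `.toarray()`, so it does not rely on `sparse_output` or `set_output`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column 'workclass' with one missing value in the training data
toy_cat_train = np.array([["Private"], [np.nan], ["Local-gov"]], dtype=object)
# The second test row holds a category that was never seen during fit
toy_cat_test = np.array([["Private"], ["Never-worked"]], dtype=object)

toy_cat_pipeline = Pipeline(steps=[
    ("cat_imputer", SimpleImputer(strategy="constant", fill_value="Not Available")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
toy_cat_pipeline.fit(toy_cat_train)

# Categories learned from the imputed training column
learned_categories = list(toy_cat_pipeline["onehot"].categories_[0])
# One-hot rows for the test data, converted from sparse to dense
encoded_test = toy_cat_pipeline.transform(toy_cat_test).toarray()
print(learned_categories)
print(encoded_test)
```

The missing value becomes its own 'Not Available' category, and the unseen test category is encoded as a row of zeros instead of raising an error, which is what `handle_unknown="ignore"` is for.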

In [None]:
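# Fill missing categories with a constant label, then one-hot encode.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
categorical_transformer = Pipeline(steps=[
    ("cat_imputer", SimpleImputer(strategy='constant',
                                  fill_value='Not Available').set_output(transform="pandas")),
    ("onehot", OneHotEncoder(sparse_output=False,
                             handle_unknown="ignore").set_output(transform="pandas"))
])
categorical_transformer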

## Define Numeric Transformer Pipeline

Consists of the series of steps needed to transform the numeric features. This includes:

- K-nearest neighbor imputer to fill missing values
- Standard Scaler to scale the numeric features
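
The following sketch illustrates both steps on a hypothetical 4x2 array (the values, `n_neighbors=2` and the variable names are assumptions chosen to keep the arithmetic easy to follow, not the notebook's settings):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: the second row is missing its second feature
toy_numeric = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [100.0, 40.0],
])

# The two nearest rows (by the observed first feature) are rows 0 and 2,
# so the missing value is filled with the mean of 10 and 30
imputed = KNNImputer(n_neighbors=2).fit_transform(toy_numeric)

# Standardize each column to zero mean and unit variance
scaled = StandardScaler().fit_transform(imputed)

print(imputed[1, 1])
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Imputing before scaling means the scaler computes its mean and standard deviation on complete columns.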

In [None]:
# Impute missing numeric values from the 5 nearest neighbors, then standardize
numeric_transformer = Pipeline(steps=[
    ("knn_imputer", KNNImputer(n_neighbors=5).set_output(transform="pandas")),
    ("scaler", StandardScaler().set_output(transform="pandas"))
])

numeric_transformer

## Define Column Transformer Pipeline for preprocessing

Consists of the transformers applied to each group of columns. Each transformer runs only on its own columns, and the outputs are concatenated into a single feature set:

- Numeric Transformer defined earlier
- Categorical Transformer defined earlier

In [None]:
# Route each group of columns to its own transformer; outputs are concatenated column-wise
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
]).set_output(transform="pandas")
preprocessor

## Initialize Logistic Regression Model

In [None]:
lr_model = LogisticRegression(random_state=42, solver='liblinear')
lr_model

## Build Modeling Pipeline

Chains the following steps:

- Preprocessing pipeline steps defined earlier
- Logistic Regression model defined earlier
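
A practical benefit of chaining the steps is that preprocessing is fitted only on the data passed to `fit` and is then reused, unchanged, at prediction time. The sketch below shows this on hypothetical toy data (the arrays and `toy_` names are assumptions for illustration), with a scaler standing in for the full preprocessor:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one numeric feature and two classes
toy_X_train = np.array([[20.0], [25.0], [30.0], [45.0], [50.0], [55.0]])
toy_y_train = np.array(["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"])
toy_X_test = np.array([[22.0], [52.0]])

toy_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(random_state=42, solver="liblinear")),
])

# fit() learns the scaling statistics and the model from the training data only
toy_pipeline.fit(toy_X_train, toy_y_train)

# predict() applies the already-fitted scaler before calling the model
toy_pred = toy_pipeline.predict(toy_X_test)
scaler_mean = toy_pipeline["scaler"].mean_
print(toy_pred, scaler_mean)
```

The scaler's mean equals the training mean, so no information from the test rows leaks into preprocessing.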

In [None]:
pipeline_lr = Pipeline(steps=[
                              ("pre_process", preprocessor),
                              ("model", lr_model)
                              ])
pipeline_lr

## Train and Evaluate Logistic Regression ML Pipeline

In [None]:
pipeline_lr.fit(X_train, y_train)

y_pred = pipeline_lr.predict(X_test)

In [None]:
pipeline_lr['model'].classes_

array(['<=50K', '>50K'], dtype=object)

In [None]:
confusion_matrix(y_test, y_pred)

array([[10428,   719],
       [ 1408,  2098]])

In [None]:
class_labels = pipeline_lr['model'].classes_
pd.DataFrame(confusion_matrix(y_test, y_pred),
             columns=class_labels, index=class_labels)

| | <=50K | >50K |
|---|---|---|
| <=50K | 10428 | 719 |
| >50K | 1408 | 2098 |


In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       <=50K       0.88      0.94      0.91     11147
        >50K       0.74      0.60      0.66      3506

    accuracy                           0.85     14653
   macro avg       0.81      0.77      0.79     14653
weighted avg       0.85      0.85      0.85     14653
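
The figures reported for the `>50K` class can be re-derived from the confusion matrix shown earlier. The sketch below hard-codes that matrix and treats `>50K` as the positive class, which is an assumption made for this example:

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes,
# columns are predicted classes, both ordered as ['<=50K', '>50K']
cm = np.array([[10428, 719],
               [1408, 2098]])

# With '>50K' as the positive class
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of predicted '>50K' that are correct
recall = tp / (tp + fn)      # share of actual '>50K' that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.74 recall=0.60 f1=0.66 accuracy=0.85
```

The recall of 0.60 means the model misses about 40% of the `>50K` cases, which the overall accuracy of 0.85 hides because the classes are imbalanced.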



### Debugging and Inspecting Pipeline steps (Optional)

In [None]:
pipeline_lr

In [None]:
pipeline_lr['pre_process']

In [None]:
pipeline_lr['pre_process'].transformers_[0]

('num',
 Pipeline(steps=[('knn_imputer', KNNImputer()), ('scaler', StandardScaler())]),
 ['age', 'educational-num', 'capital-gain', 'capital-loss', 'hours-per-week'])

In [None]:
pipeline_lr['pre_process'].transformers_[0][1]

In [None]:
num_cols = pipeline_lr['pre_process'].transformers_[0][2]
num_cols

['age', 'educational-num', 'capital-gain', 'capital-loss', 'hours-per-week']

In [None]:
pipeline_lr['pre_process'].transformers_[0][1].transform(X_train[num_cols]).head()

| | age | educational-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|
| 38865 | -1.291271 | -0.418928 | -0.144333 | -0.219417 | 1.569855 |
| 17212 | -1.437112 | -0.030135 | -0.144333 | -0.219417 | -2.451436 |
| 9312 | -0.489145 | -0.418928 | -0.144333 | -0.219417 | 0.363468 |
| 15512 | -0.270383 | 1.136241 | -0.144333 | -0.219417 | -0.038662 |
| 23576 | -1.291271 | 0.747449 | -0.144333 | -0.219417 | -0.440791 |


In [None]:
pipeline_lr['pre_process'].transformers_[1][1]

In [None]:
cat_cols = pipeline_lr['pre_process'].transformers_[1][2]
cat_cols

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'gender',
 'native-country']

In [None]:
pipeline_lr['pre_process'].transformers_[1][1].transform(X_train[cat_cols]).head()

| | workclass_Federal-gov | workclass_Local-gov | workclass_Never-worked | workclass_Not Available | workclass_Private | workclass_Self-emp-inc | workclass_Self-emp-not-inc | workclass_State-gov | workclass_Without-pay | education_10th | ... | native-country_Portugal | native-country_Puerto-Rico | native-country_Scotland | native-country_South | native-country_Taiwan | native-country_Thailand | native-country_Trinadad&Tobago | native-country_United-States | native-country_Vietnam | native-country_Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38865 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 17212 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 9312 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 15512 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 23576 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |


In [None]:
pipeline_lr['pre_process'].transform(X_train).head()

| | num__age | num__educational-num | num__capital-gain | num__capital-loss | num__hours-per-week | cat__workclass_Federal-gov | cat__workclass_Local-gov | cat__workclass_Never-worked | cat__workclass_Not Available | cat__workclass_Private | ... | cat__native-country_Portugal | cat__native-country_Puerto-Rico | cat__native-country_Scotland | cat__native-country_South | cat__native-country_Taiwan | cat__native-country_Thailand | cat__native-country_Trinadad&Tobago | cat__native-country_United-States | cat__native-country_Vietnam | cat__native-country_Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38865 | -1.291271 | -0.418928 | -0.144333 | -0.219417 | 1.569855 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 17212 | -1.437112 | -0.030135 | -0.144333 | -0.219417 | -2.451436 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 9312 | -0.489145 | -0.418928 | -0.144333 | -0.219417 | 0.363468 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 15512 | -0.270383 | 1.136241 | -0.144333 | -0.219417 | -0.038662 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 23576 | -1.291271 | 0.747449 | -0.144333 | -0.219417 | -0.440791 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
