## Classification

In this lecture, we bring together the feature engineering techniques covered in previous sections to tackle a classification problem. This should give you an idea of the end-to-end pipeline for building a machine learning classifier.

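Your task is simply to beat the baseline: the gradient boosting model fitted below, which reaches a ROC AUC of about 0.773 and an F1 score of about 0.710 on the test set.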

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.preprocessing import LabelEncoder

In [2]:
# load dataset

# DO NOT CHANGE HERE

cols = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare','embarked', 'survived']

data = pd.read_csv('../datasets/titanic.csv', usecols=cols)

data.head(10)

,pclass,survived,sex,age,sibsp,parch,fare,embarked
0,1,1,female,29.0,0,0,211.3375,S
1,1,1,male,0.9167,1,2,151.55,S
2,1,0,female,2.0,1,2,151.55,S
3,1,0,male,30.0,1,2,151.55,S
4,1,0,female,25.0,1,2,151.55,S
5,1,1,male,48.0,0,0,26.55,S
6,1,1,female,63.0,1,0,77.9583,S
7,1,0,male,39.0,0,0,0.0,S
8,1,1,female,53.0,2,0,51.4792,S
9,1,0,male,71.0,0,0,49.5042,C


In [3]:
# encode sex as integers: LabelEncoder sorts the labels, so female -> 0 and male -> 1
le = LabelEncoder()
data['sex'] = le.fit_transform(data.sex.values)

In [4]:
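# encode embarked as integers (C -> 0, S -> 2 in the output below)
# note: cell 7 shows 4 distinct codes, so missing values end up with a code of their own
le = LabelEncoder()
data['embarked'] = le.fit_transform(data.embarked.values)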
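
To see which integer each category receives, the fitted encoder can be inspected. The following is a minimal sketch on a toy column (the series `embarked` here is made up for illustration, not the real dataset). It also shows one-hot encoding with `pd.get_dummies` as an alternative that does not impose an order on the categories.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for data['embarked'] (illustrative values only).
embarked = pd.Series(["S", "C", "Q", "S", "C"])

le = LabelEncoder()
codes = le.fit_transform(embarked)

# classes_ stores the labels in sorted order; each label's code is its position.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)
print(codes.tolist())

# Alternative: one binary column per category, no artificial ordering.
one_hot = pd.get_dummies(embarked, prefix="embarked")
print(one_hot.columns.tolist())
```

Integer codes are usually acceptable for tree-based models such as the GBM used here, while linear models tend to benefit from the one-hot version.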

In [5]:
data.head(10)

,pclass,survived,sex,age,sibsp,parch,fare,embarked
0,1,1,0,29.0,0,0,211.3375,2
1,1,1,1,0.9167,1,2,151.55,2
2,1,0,0,2.0,1,2,151.55,2
3,1,0,1,30.0,1,2,151.55,2
4,1,0,0,25.0,1,2,151.55,2
5,1,1,1,48.0,0,0,26.55,2
6,1,1,0,63.0,1,0,77.9583,2
7,1,0,1,39.0,0,0,0.0,2
8,1,1,0,53.0,2,0,51.4792,2
9,1,0,1,71.0,0,0,49.5042,0


### Types of variables

Let's find out what types of variables there are in this dataset.

In [6]:
# let's inspect the type of variables in pandas

data.dtypes

pclass        int64
survived      int64
sex           int32
age         float64
sibsp         int64
parch         int64
fare        float64
embarked      int32
dtype: object

In [7]:
# let's inspect the variable values

for var in data.columns:
    print(var, data[var].unique()[0:20], '\n')

pclass [1 2 3] 

survived [1 0] 

sex [0 1] 

age [29.      0.9167  2.     30.     25.     48.     63.     39.     53.
 71.     47.     18.     24.     26.     80.         nan 50.     32.
 36.     37.    ] 

sibsp [0 1 2 3 4 5 8] 

parch [0 2 1 4 3 5 6 9] 

fare [211.3375 151.55    26.55    77.9583   0.      51.4792  49.5042 227.525
  69.3     78.85    30.      25.925  247.5208  76.2917  75.2417  52.5542
 221.7792  26.      91.0792 135.6333] 

embarked [2 0 3 1] 



### Variable characteristics

In [8]:
# check missing data

data.isnull().mean()

pclass      0.000000
survived    0.000000
sex         0.000000
age         0.200917
sibsp       0.000000
parch       0.000000
fare        0.000764
embarked    0.000000
dtype: float64

In [9]:
# replace the remaining missing values (age and fare) with 0
data = data.fillna(0)

In [10]:
# check missing data, again

data.isnull().mean()

pclass      0.0
survived    0.0
sex         0.0
age         0.0
sibsp       0.0
parch       0.0
fare        0.0
embarked    0.0
dtype: float64
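
Filling with 0 is only one option, and for age it creates a value that no passenger really has. As a hedged alternative, the sketch below uses scikit-learn's `SimpleImputer` with the median and a missing indicator, learning the medians from the training rows only. The frames `train` and `test` are toy placeholders, not the split used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train/test frames with gaps in both columns (illustrative values).
train = pd.DataFrame({"age": [20.0, np.nan, 40.0, 30.0],
                      "fare": [7.0, 8.0, np.nan, 9.0]})
test = pd.DataFrame({"age": [np.nan, 50.0],
                     "fare": [10.0, np.nan]})

# Median imputation; add_indicator appends one 0/1 column per variable
# that had missing values when the imputer was fitted.
imputer = SimpleImputer(strategy="median", add_indicator=True)
train_imp = imputer.fit_transform(train)  # medians learned here
test_imp = imputer.transform(test)        # and reused on the test rows

# Output columns: age, fare, age_was_missing, fare_was_missing
print(imputer.statistics_)
print(test_imp)
```

The indicator columns let the model distinguish imputed values from observed ones, which can matter when about 20% of the ages are missing.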

In [11]:
# feature magnitude

data.describe()

,pclass,survived,sex,age,sibsp,parch,fare,embarked
count,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0
mean,2.294882,0.381971,0.644003,23.877515,0.498854,0.385027,33.270043,1.495034
std,0.837836,0.486055,0.478997,17.590848,1.041658,0.86556,51.747063,0.81613
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,0.0,0.0,7.0,0.0,0.0,7.8958,1.0
50%,3.0,0.0,1.0,24.0,0.0,0.0,14.4542,2.0
75%,3.0,1.0,1.0,35.0,1.0,0.0,31.275,2.0
max,3.0,1.0,1.0,80.0,8.0,9.0,512.3292,3.0


In [12]:
# separate into training and testing set

# DO NOT CHANGE HERE

X_train, X_test, y_train, y_test = train_test_split(
    data.drop('survived', axis=1),  # predictors
    data['survived'],  # target
    test_size=0.25,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((981, 7), (328, 7))

In [13]:
# let's fit the model and make predictions

# DO NOT CHANGE HERE

gbm = GradientBoostingClassifier(random_state=0)

gbm.fit(X_train, y_train)

X_train_preds = gbm.predict(X_train)
X_test_preds = gbm.predict(X_test)

In [14]:
# DO NOT CHANGE HERE

print('Train set')
print('GBM roc-auc: {}'.format(roc_auc_score(y_train, X_train_preds)))

print('Test set')
print('GBM roc-auc: {}'.format(roc_auc_score(y_test, X_test_preds)))

# Train set
# GBM roc-auc: 0.8606270251335494
# Test set
# GBM roc-auc: 0.7733974358974359

Train set
GBM roc-auc: 0.8606270251335494
Test set
GBM roc-auc: 0.7733974358974359
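
Note that the baseline passes hard 0/1 predictions to `roc_auc_score`. For a binary target this reduces the ROC curve to a single point, and the resulting value equals the balanced accuracy. ROC AUC is more commonly computed from predicted probabilities. The sketch below contrasts the two on synthetic data from `make_classification` (an assumption for illustration, so the numbers will not match the Titanic baseline).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem with 7 features, standing in for the Titanic data.
X, y = make_classification(n_samples=500, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

hard_preds = gbm.predict(X_te)             # 0/1 labels
proba = gbm.predict_proba(X_te)[:, 1]      # probability of class 1

auc_labels = roc_auc_score(y_te, hard_preds)
auc_proba = roc_auc_score(y_te, proba)
bal_acc = balanced_accuracy_score(y_te, hard_preds)

print("ROC AUC from labels:       ", auc_labels)
print("Balanced accuracy:         ", bal_acc)
print("ROC AUC from probabilities:", auc_proba)
```

When comparing your own model against the baseline, keep the metric computed the same way on both sides.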


In [15]:
# DO NOT CHANGE HERE

print('Train set')
print('GBM f1_score: {}'.format(f1_score(y_train, X_train_preds)))

print('Test set')
print('GBM f1_score: {}'.format(f1_score(y_test, X_test_preds)))

#Train set
#GBM f1_score: 0.8321775312066575
#Test set
#GBM f1_score: 0.7102803738317757

Train set
GBM f1_score: 0.8321775312066575
Test set
GBM f1_score: 0.7102803738317757


In [16]:
from lazypredict.Supervised import LazyClassifier

In [17]:
# fit a range of classifiers with default settings and score each on the test set
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)

print(models)

100%|██████████████████████████████████████████████████████████████████████████████████| 29/29 [00:01<00:00, 16.55it/s]

                               Accuracy  Balanced Accuracy  ROC AUC  F1 Score  \
Model                                                                           
LGBMClassifier                     0.82               0.79     0.79      0.81   
SVC                                0.81               0.78     0.78      0.81   
LogisticRegression                 0.80               0.77     0.77      0.79   
NuSVC                              0.80               0.77     0.77      0.79   
XGBClassifier                      0.79               0.76     0.76      0.79   
AdaBoostClassifier                 0.79               0.76     0.76      0.79   
Perceptron                         0.77               0.76     0.76      0.77   
LinearSVC                          0.78               0.76     0.76      0.78   
CalibratedClassifierCV             0.78               0.76     0.76      0.78   
NearestCentroid                    0.76               0.75     0.75      0.76   
KNeighborsClassifier        




In [19]:
predictions

Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
LGBMClassifier,0.82,0.79,0.79,0.81,0.06
SVC,0.81,0.78,0.78,0.81,0.07
LogisticRegression,0.8,0.77,0.77,0.79,0.02
NuSVC,0.8,0.77,0.77,0.79,0.08
XGBClassifier,0.79,0.76,0.76,0.79,0.57
AdaBoostClassifier,0.79,0.76,0.76,0.79,0.1
Perceptron,0.77,0.76,0.76,0.77,0.01
LinearSVC,0.78,0.76,0.76,0.78,0.04
CalibratedClassifierCV,0.78,0.76,0.76,0.78,0.13
NearestCentroid,0.76,0.75,0.75,0.76,0.01
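
The scores in this table come from a single split with 328 test rows, so gaps of 0.01 to 0.02 between models may not be meaningful. One way to get a sense of the spread is cross-validation. The sketch below uses plain scikit-learn on synthetic data; the two candidate models and the 5 folds are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the predictors and target.
X, y = make_classification(n_samples=400, n_features=7, random_state=0)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
}

cv_scores = {}
for name, model in candidates.items():
    # scoring="roc_auc" uses predicted scores/probabilities, not hard labels
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    cv_scores[name] = scores
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the standard deviation across folds is larger than the gap between two models, the ranking between them is not reliable.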


## Extra

In [20]:
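# NOTE: cell 9 already replaced every NaN with 0, so these two calls find nothing to fill.
# To actually test mean imputation, run them instead of data.fillna(0), not after it.
data['age'] = data['age'].fillna(data['age'].mean())
data['fare'] = data['fare'].fillna(data['fare'].mean())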

In [21]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('survived', axis=1),  # predictors
    data['survived'],  # target
    test_size=0.25,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility

In [22]:
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
print(predictions)

100%|██████████████████████████████████████████████████████████████████████████████████| 29/29 [00:01<00:00, 24.94it/s]

                               Accuracy  Balanced Accuracy  ROC AUC  F1 Score  \
Model                                                                           
LGBMClassifier                     0.82               0.79     0.79      0.81   
SVC                                0.81               0.78     0.78      0.81   
LogisticRegression                 0.80               0.77     0.77      0.79   
NuSVC                              0.80               0.77     0.77      0.79   
XGBClassifier                      0.79               0.76     0.76      0.79   
AdaBoostClassifier                 0.79               0.76     0.76      0.79   
Perceptron                         0.77               0.76     0.76      0.77   
LinearSVC                          0.78               0.76     0.76      0.78   
CalibratedClassifierCV             0.78               0.76     0.76      0.78   
NearestCentroid                    0.76               0.75     0.75      0.76   
KNeighborsClassifier        
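
These scores are identical to the earlier run, which is expected: the missing values had already been replaced with 0 in cell 9, so the mean imputation in this section had nothing left to fill. The order of the two operations can be checked on a toy series (the values below are illustrative):

```python
import numpy as np
import pandas as pd

age = pd.Series([20.0, np.nan, 40.0])

# Order used in the notebook: zeros first, then the mean of the filled column.
zero_then_mean = age.fillna(0)
zero_then_mean = zero_then_mean.fillna(zero_then_mean.mean())  # nothing to fill

# Mean imputation applied while the column still contains NaN.
mean_only = age.fillna(age.mean())

print(zero_then_mean.tolist())  # [20.0, 0.0, 40.0]
print(mean_only.tolist())       # [20.0, 30.0, 40.0]
```

To compare imputation strategies fairly, reload the data and apply the alternative strategy in place of `fillna(0)` before splitting.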


