<a href="https://colab.research.google.com/github/dajebbar/ML/blob/main/ML1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# Classification Walkthrough: Titanic Dataset
---

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

from sklearn import (
    ensemble,
    preprocessing,
    tree,
    model_selection
)

from sklearn.metrics import (
    auc,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)

from sklearn.model_selection import (
    train_test_split,
    StratifiedKFold,
)

from yellowbrick.classifier import (
    ConfusionMatrix,
    ROCAUC,
)

from yellowbrick.model_selection import (
    LearningCurve,
)

from sklearn.experimental import (
    enable_iterative_imputer,
)

from sklearn import impute

## Ask a question
The question is who survived and who died. To answer it, we want to create a **predictive model**. Because the outcome falls into one of two classes, this is a classification question.

## Gather Data

In [None]:
!pip install opendatasets --upgrade --quiet

In [None]:
import opendatasets as od 
url = 'https://www.kaggle.com/c/titanic/data'
data_dir = od.download(url)

Skipping, found downloaded files in "./titanic" (use force=True to force download)


In [None]:
train_df = pd.read_csv('./titanic/train.csv')
test_df = pd.read_csv('./titanic/test.csv')
origin_df = pd.concat([train_df, test_df])
origin_df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Clean Data

In [None]:
origin_df.dtypes

PassengerId      int64
Survived       float64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [None]:
origin_df.describe().T

,count,mean,std,min,25%,50%,75%,max
PassengerId,1309.0,655.0,378.020061,1.0,328.0,655.0,982.0,1309.0
Survived,891.0,0.383838,0.486592,0.0,0.0,0.0,1.0,1.0
Pclass,1309.0,2.294882,0.837836,1.0,2.0,3.0,3.0,3.0
Age,1046.0,29.881138,14.413493,0.17,21.0,28.0,39.0,80.0
SibSp,1309.0,0.498854,1.041658,0.0,0.0,0.0,1.0,8.0
Parch,1309.0,0.385027,0.86556,0.0,0.0,0.0,0.0,9.0
Fare,1308.0,33.295479,51.758668,0.0,7.8958,14.4542,31.275,512.3292


The count statistic only includes values that are not NaN, so it is useful for checking whether a column is missing data. It is also a good idea to spot-check the minimum and maximum values to see if there are outliers. Summary statistics are one way to do this.
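
As a minimal sketch of that check, the snippet below compares the `count` row of `describe()` with the number of rows in the frame. The `toy` frame and its values are made up for illustration and are not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the Titanic data) with one missing age.
toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 35.0, 80.0],
        "Fare": [7.25, 71.28, 8.05, 512.33],
    }
)

summary = toy.describe().T
# Rows in the frame minus the non-NaN count = number of missing values.
summary["missing"] = len(toy) - summary["count"]
print(summary[["count", "missing", "min", "max"]])
```

Reading `min` and `max` next to `missing` gives a quick first look at both gaps and extreme values in one table.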

In [None]:
origin_df.isnull().mean().mul(100)

PassengerId     0.000000
Survived       31.932773
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            20.091673
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.076394
Cabin          77.463713
Embarked        0.152788
dtype: float64

In [None]:
origin_df.isnull().mean(axis=1).mul(100)

0       8.333333
1       0.000000
2       8.333333
3       0.000000
4       8.333333
         ...    
413    25.000000
414     8.333333
415    16.666667
416    25.000000
417    25.000000
Length: 1309, dtype: float64

In [None]:
mask = origin_df.isnull().any(axis=1)

origin_df[mask].Parch.head()

0    0
2    0
4    0
5    0
7    1
Name: Parch, dtype: int64

In [None]:
origin_df.Sex.value_counts(dropna=False)

male      843
female    466
Name: Sex, dtype: int64

In [None]:
origin_df.Embarked.value_counts(dropna=False)

S      914
C      270
Q      123
NaN      2
Name: Embarked, dtype: int64

## Create Features

In [None]:
origin_df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [None]:
origin_df = origin_df.drop(columns =[
                          'PassengerId',
                          'Name',
                          'Ticket',
                          'Cabin',
])

We need to create dummy columns from string columns. This will create new columns for `Sex` and `Embarked`.

In [None]:
origin_df = pd.get_dummies(origin_df)

In [None]:
origin_df.columns

Index(['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_female',
       'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S'],
      dtype='object')

At this point the `Sex_male` and `Sex_female` columns are perfectly inversely correlated. Typically we remove any columns with perfect or very high positive or negative correlation. Multicollinearity can impact interpretation of feature importance and coefficients in some models.
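
A small sketch of the point above, on a made-up five-row frame (the `toy` data is hypothetical): the two dummy columns for a binary feature have a correlation of -1, and `drop_first=True` keeps only one of them.

```python
import pandas as pd

# Hypothetical toy column, not the Titanic data.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male", "male"]})

# dtype=int gives 0/1 columns regardless of the pandas default dtype.
both = pd.get_dummies(toy, dtype=int)
corr = both["Sex_female"].corr(both["Sex_male"])  # perfectly inverse

# drop_first=True keeps k-1 dummy columns per categorical feature.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)
print(round(corr, 6), list(reduced.columns))
```

Note that `drop_first=True` applies to every categorical column, so on the real data it would also drop the first `Embarked` level.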

In [None]:
origin_df = origin_df.drop(columns="Sex_female")  # or: pd.get_dummies(origin_df, drop_first=True), which also drops Embarked_C

In [None]:
origin_df.columns

Index(['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_male',
       'Embarked_C', 'Embarked_Q', 'Embarked_S'],
      dtype='object')

In [None]:
origin_df.head()

,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0.0,3,22.0,1,0,7.25,1,0,0,1
1,1.0,1,38.0,1,0,71.2833,0,1,0,0
2,1.0,3,26.0,0,0,7.925,0,0,0,1
3,1.0,1,35.0,1,0,53.1,0,0,0,1
4,0.0,3,35.0,0,0,8.05,1,0,0,1


Create a DataFrame (X) with the features and a series (y) with
the labels.

In [None]:
# Rows from test.csv have no Survived label; fillna assigns them the median (0.0).
y = origin_df.Survived.fillna(origin_df.Survived.median())
X = origin_df.drop(columns='Survived')

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=42
)

In [None]:
X.columns

Index(['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_male', 'Embarked_C',
       'Embarked_Q', 'Embarked_S'],
      dtype='object')

## Impute Data

The `Age` column has missing values. We need to impute age from the numeric values. We only want to fit the imputer on the training set and then use that fitted imputer to fill in the data for the test set. Otherwise we are leaking data (cheating by giving future information to the model).

In [None]:
num_cols = [
          'Pclass',
          'Age',
          'SibSp',
          'Parch',
          'Fare',
          'Sex_male',
]

In [None]:
imputer = impute.IterativeImputer()
imputed = imputer.fit_transform(
    X_train[num_cols]
)
X_train.loc[:, num_cols] = imputed
imputed = imputer.transform(X_test[num_cols])
X_test.loc[:, num_cols] = imputed

As a simpler alternative, impute with the median of the training set:

In [None]:
meds = X_train.median()
X_train = X_train.fillna(meds)
X_test = X_test.fillna(meds)
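
The fit-on-train, transform-on-test pattern can be sketched in isolation. This example uses `SimpleImputer` with the median as a deterministic stand-in for `IterativeImputer`; the `train` and `test` frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy split; the values are made up for illustration.
train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"Age": [np.nan, 90.0]})

imp = SimpleImputer(strategy="median")
train_filled = imp.fit_transform(train)  # statistics learned from train only
test_filled = imp.transform(test)        # reuse them; never refit on test

print(imp.statistics_, test_filled.ravel())
```

The missing test value is filled with the training median, so the extreme test value of 90 has no influence on what gets imputed.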

## Normalize
Normalizing or otherwise preprocessing the data helps many models perform better, particularly those that depend on a distance metric to determine similarity. (Note that tree models, which treat each feature on its own, don’t have this requirement.)

We are going to standardize the data for the preprocessing. Standardizing means shifting and rescaling each column so that it has a mean of zero and a standard deviation of one. This way models don’t treat variables with larger scales as more important than variables with smaller scales. I’m going to stick the result (a numpy array) back into a pandas DataFrame for easier manipulation (and to keep the column names). I also normally don’t standardize dummy columns, so I will ignore those:

In [None]:
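# Column names in the same order as X_train, so the rebuilt DataFrame is labeled correctly.
cols = 'Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_C,Embarked_Q,Embarked_S'.split(',')
sca = preprocessing.StandardScaler()
# Fit the scaler on the training set only, then reuse it for the test set.
X_train = sca.fit_transform(X_train)
X_train = pd.DataFrame(X_train, columns=cols)
X_test = sca.transform(X_test)
X_test = pd.DataFrame(X_test, columns=cols)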
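
One way to leave the dummy columns unscaled, as the text describes, is to pass only the continuous columns to the scaler. This is a sketch on hypothetical toy frames; `scale_cols` is an illustrative name, not something defined elsewhere in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frames; Sex_male is a 0/1 dummy that stays unscaled.
train = pd.DataFrame(
    {"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 20.0, 60.0], "Sex_male": [1, 0, 1]}
)
test = pd.DataFrame({"Age": [30.0], "Fare": [30.0], "Sex_male": [0]})
scale_cols = ["Age", "Fare"]

scaler = StandardScaler()
train_std = train.copy()
test_std = test.copy()
# Fit on the training rows only, then apply the same shift/scale to test.
train_std[scale_cols] = scaler.fit_transform(train[scale_cols])
test_std[scale_cols] = scaler.transform(test[scale_cols])

print(train_std.round(3))
```

Assigning back into a copy of the DataFrame keeps the column names and the untouched dummy column in place.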

## Refactor


In [None]:
def tweak_titanic(df):
    df = df.drop(
        columns=[
            # "PassengerId",     
            "Name",
            "Ticket",
            "Cabin",
        ]
    ).pipe(pd.get_dummies, drop_first=True)
    return df

In [None]:
def get_train_test_X_y(
    df, y_col, size=0.3, std_cols=None
):
    y = df[y_col]
    X = df.drop(columns=y_col)
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=size, random_state=42
    )
    cols = X.columns
    num_cols = [
        "Pclass",
        "Age",
        "SibSp",
        "Parch",
        "Fare",
    ]
    fi = impute.IterativeImputer()
    fitted = fi.fit_transform(X_train[num_cols])
    X_train = X_train.assign(**{c:fitted[:,i] for i, c in enumerate(num_cols)})
    test_fit = fi.transform(X_test[num_cols])
    X_test = X_test.assign(**{c:test_fit[:,i] for i, c in enumerate(num_cols)})
    if std_cols:
        std = preprocessing.StandardScaler()
        fitted = std.fit_transform(X_train[std_cols])
        X_train = X_train.assign(**{c:fitted[:,i] for i, c in enumerate(std_cols)})
        test_fit = std.transform(X_test[std_cols])
        X_test = X_test.assign(**{c:test_fit[:,i] for i, c in enumerate(std_cols)})

    return X_train, X_test, y_train, y_test

In [None]:
origin_df.head(1)

,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0.0,3,22.0,1,0,7.25,1,0,0,1


In [None]:
# ti_df = tweak_titanic(origin_df)
std_cols = "Pclass,Age,SibSp,Fare".split(",")
X_train, X_test, y_train, y_test = get_train_test_X_y(
    origin_df, "Survived", std_cols=std_cols
)

## Baseline Model
Creating a baseline model that does something really simple gives us something to compare our model to. Note that the default `.score` result is the accuracy, which can be misleading. A problem where a positive case is 1 in 10,000 can easily get over 99% accuracy by always predicting negative.
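
To make the accuracy caveat concrete, here is a sketch on made-up labels with 10 positives in 1,000 rows (a milder imbalance than the 1 in 10,000 mentioned above). The arrays are hypothetical and unrelated to the Titanic data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 10 positives out of 1,000 rows.
y = np.array([1] * 10 + [0] * 990)
X = np.zeros((len(y), 1))  # features are irrelevant to a dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

acc = accuracy_score(y, pred)  # high, because negatives dominate
rec = recall_score(y, pred)    # zero: no positive case is ever found
print(f"accuracy={acc:.3f} recall={rec:.3f}")
```

The baseline looks strong on accuracy while finding none of the positive cases, which is why a second metric is worth checking.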

In [None]:
from sklearn.dummy import DummyClassifier
bm = DummyClassifier(strategy='most_frequent')
bm.fit(X_train, y_train)
bm.score(X_test, y_test) # accuracy

0.40458015267175573

In [None]:
from sklearn import metrics
metrics.precision_score(y_test, bm.predict(X_test), average='weighted')

  _warn_prf(average, modifier, msg_start, len(result))


0.1636850999359012

In [None]:
X = pd.concat([X_train, X_test])
y = pd.concat([y_train, y_test])
from sklearn import model_selection
from sklearn.dummy import DummyClassifier

from sklearn.linear_model import (
    LogisticRegression,
)

from sklearn.tree import DecisionTreeClassifier

from sklearn.neighbors import (
    KNeighborsClassifier,
)

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

from sklearn.ensemble import (
    RandomForestClassifier,
)

import xgboost

In [None]:
for model in [
              DummyClassifier,
              LogisticRegression,
              DecisionTreeClassifier,
              KNeighborsClassifier,
              GaussianNB,
              SVC,
              RandomForestClassifier,
              xgboost.XGBClassifier,
]:
    cls = model()
    kfold = model_selection.KFold(
        n_splits=10, random_state=42, shuffle=True
    )
    s = model_selection.cross_val_score(
        cls, X, y, scoring="roc_auc", cv=kfold
    )
    print(
        f"{model.__name__:22} AUC: "
        f"{s.mean():.3f} STD: {s.std():.2f}"
    )

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_scorer.py", line 103, in __call__
    score = scorer._score(cached_call, estimator, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_scorer.py", line 347, in _score
    y_type = type_of_target(y)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/multiclass.py", line 324, in type_of_target
    _assert_all_finite(y)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 116, in _assert_all_finite
    type_err, msg_dtype if msg_dtype is not None else X.dtype
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 761

DummyClassifier        AUC: nan STD: nan
LogisticRegression     AUC: nan STD: nan
DecisionTreeClassifier AUC: nan STD: nan
KNeighborsClassifier   AUC: nan STD: nan
GaussianNB             AUC: nan STD: nan
SVC                    AUC: nan STD: nan
RandomForestClassifier AUC: nan STD: nan


KeyboardInterrupt: ignored
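
The `ValueError` in the traceback says the scorer received NaN in `y`. A likely cause, judging from the `describe()` output earlier (`Survived` has a count of 891 out of 1309 rows), is that the rows from `test.csv` carry no label and reach `cross_val_score` unchanged. The sketch below shows one possible remedy on a hypothetical stand-in frame: keep only the labeled rows before cross-validating. `StratifiedKFold` is used here instead of `KFold` so that every fold contains both classes; that swap is an assumption of this example, not part of the loop above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-in for the combined train/test frame: the last
# four rows mimic test.csv, which has no Survived label.
n = 60
df = pd.DataFrame(
    {"Fare": np.linspace(5.0, 80.0, n), "Sex_male": np.tile([0, 1], n // 2)}
)
df["Survived"] = (df["Sex_male"] == 0).astype(float)
df.loc[n - 4:, "Survived"] = np.nan

# Keep only the rows that actually have a label.
labeled = df.dropna(subset=["Survived"])
X = labeled.drop(columns="Survived")
y = labeled["Survived"].astype(int)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(), X, y, scoring="roc_auc", cv=kfold
)
print(f"AUC: {scores.mean():.3f} STD: {scores.std():.2f}")
```

In this notebook the equivalent step would be to filter `origin_df` on non-null `Survived` before calling `get_train_test_X_y`, and to keep the unlabeled rows aside for the final predictions.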