## Import Data and Gain Insight

In [5]:
import pandas as pd

train_data = pd.read_csv("../../datasets/titanic/train.csv")

# The target column is separated from the rest of the dataset
X_train = train_data.drop("Survived", axis="columns")
y_train = train_data["Survived"].copy()

In [6]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Pclass       891 non-null    int64  
 2   Name         891 non-null    object 
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object 
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB


As you can see, most columns are complete, but `Age`, `Cabin`, and `Embarked` contain missing (NaN) values: `Age` has 714 non-null entries out of 891, `Cabin` has only 204, and `Embarked` has 889. These columns must either be filled in or dropped.
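The gaps reported by `info()` can also be counted directly. The following is a minimal sketch on a small made-up frame (the column values are illustrative and not taken from the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X_train; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# Number and fraction of missing values per column.
missing_count = toy.isna().sum()
missing_fraction = toy.isna().mean()

print(missing_count)
print(missing_fraction)
```

Running the same two expressions on `X_train` would list the columns by how much imputation they need.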

## Preparing Data for Training

In [7]:
X_train = X_train.drop(["Cabin", "Ticket", "Name"], axis="columns")

Because almost the entire __Cabin__ feature is NaN, it is dropped: filling in that many missing values would distort the dataset. `Ticket` and `Name` are dropped for a different reason. The features used here must be numeric or convertible to a scalable numeric form, and these two are not, since their values are almost unique (e.g. all names in the dataset might be unique).
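One way to check how close a column is to being unique is the ratio of distinct values to rows. This is a rough sketch on made-up values, not a rule the notebook itself applies:

```python
import pandas as pd

# Toy frame; names and values are invented for illustration.
toy = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
    "Sex": ["male", "female", "female", "female"],
})

# Ratio of distinct values to rows: 1.0 means every value is unique.
unique_ratio = toy.nunique() / len(toy)
print(unique_ratio)
```

A ratio near 1.0 for a text column suggests it behaves like an identifier and will not one-hot encode usefully.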

In [8]:
# This transformer selects columns according to the "columns" parameter.
# It is used to separate numeric features from categorical (non-numeric) features, or vice versa.

from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelection(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X[self.columns]

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("selected_num", ColumnSelection(["Age", "SibSp", "Parch", "Fare"])),
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler())
])

__*num_pipeline*__ transforms the numeric features in the dataset in three steps:

<ol>
    <li>Select numerical features (i.e. columns).</li>
    <dd>-> ColumnSelection</dd>
    <li>Fill empty values.</li>
    <dd>-> SimpleImputer (fills empty values with the median of the non-empty values in each column)</dd>
    <li>Scale numerical features.</li>
    <dd>-> StandardScaler</dd>
</ol>
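The imputing and scaling steps above can be illustrated on a tiny made-up column (the numbers are invented for this sketch):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# One toy numeric column with a missing entry.
ages = np.array([[20.0], [30.0], [np.nan], [40.0]])

# The median of 20, 30 and 40 is 30, so the NaN is replaced by 30.
filled = SimpleImputer(strategy="median").fit_transform(ages)

# StandardScaler subtracts the column mean and divides by the column
# standard deviation, giving zero mean and unit variance.
scaled = StandardScaler().fit_transform(filled)

print(filled.ravel())
print(scaled.ravel())
```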

In [10]:
num_pipeline.fit_transform(X_train)

array([[-0.56573646,  0.43279337, -0.47367361, -0.50244517],
       [ 0.66386103,  0.43279337, -0.47367361,  0.78684529],
       [-0.25833709, -0.4745452 , -0.47367361, -0.48885426],
       ...,
       [-0.1046374 ,  0.43279337,  2.00893337, -0.17626324],
       [-0.25833709, -0.4745452 , -0.47367361, -0.04438104],
       [ 0.20276197, -0.4745452 , -0.47367361, -0.49237783]])

In [11]:
# This transformer finds the most frequent value in each column and fills that column's empty values with it.

class MaxFrequency(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.frequency_ = pd.Series([X[column].value_counts().keys()[0] for column in X], index=X.columns)
        
        return self
    
    def transform(self, X):
        return X.fillna(self.frequency_)

In [12]:
from sklearn.preprocessing import OneHotEncoder

In [14]:
cat_pipeline = Pipeline([
    ("selected_cat", ColumnSelection(["Pclass", "Sex", "Embarked"])),
    ("imputer", MaxFrequency()),
    # Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`, and `sparse` was removed in 1.4.
    ("encoder", OneHotEncoder(sparse=False)),
    ("std_scaler_cat", StandardScaler())
])

__cat_pipeline__ transforms the categorical (i.e. non-numerical) features in the dataset in four steps:

<ol>
    <li>Select categorical features</li>
    <dd>-> ColumnSelection</dd>
    <li>Fill empty values</li>
    <dd>-> MaxFrequency</dd>
    <li>One-hot encode the values in the columns</li>
    <dd>-> OneHotEncoder</dd>
    <li>Scale categorical features</li>
    <dd>-> StandardScaler</dd>
</ol>
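The filling and encoding steps can be sketched on a toy column. Note that this sketch swaps in scikit-learn's built-in `SimpleImputer(strategy="most_frequent")` in place of the notebook's custom `MaxFrequency` transformer, and the values are invented:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with one missing entry.
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S"]})

# "S" is the most frequent value, so it replaces the NaN.
filled = SimpleImputer(strategy="most_frequent").fit_transform(toy)

# One-hot encoding creates one 0/1 column per category ("C" and "S").
# The default output is sparse; toarray() makes it dense in any version.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(filled).toarray()

print(encoder.categories_)
print(encoded)
```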

In [15]:
cat_pipeline.fit_transform(X_train)

array([[-0.56568542, -0.51015154,  0.90258736, ..., -0.48204268,
        -0.30756234,  0.61583843],
       [ 1.76776695, -0.51015154, -1.10792599, ...,  2.0745051 ,
        -0.30756234, -1.62380254],
       [-0.56568542, -0.51015154,  0.90258736, ..., -0.48204268,
        -0.30756234,  0.61583843],
       ...,
       [-0.56568542, -0.51015154,  0.90258736, ..., -0.48204268,
        -0.30756234,  0.61583843],
       [ 1.76776695, -0.51015154, -1.10792599, ...,  2.0745051 ,
        -0.30756234, -1.62380254],
       [-0.56568542, -0.51015154,  0.90258736, ..., -0.48204268,
         3.25137334, -1.62380254]])

In [16]:
from sklearn.pipeline import FeatureUnion

In [17]:
# This pipeline combines the two pipelines defined earlier.

full_pipeline = FeatureUnion(transformer_list=[
    ("num", num_pipeline),
    ("cat", cat_pipeline)
])

In [18]:
X_train_prepared = full_pipeline.fit_transform(X_train)
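As an aside, scikit-learn's `ColumnTransformer` can route columns to different sub-pipelines without a custom selection transformer. This is an alternative to the `FeatureUnion` approach used here, not what the notebook does; the toy frame and step names below are assumptions made for the sketch:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with two numeric columns and one categorical column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.0, 70.0, 8.0, 50.0],
    "Sex": ["male", "female", "female", "female"],
})

num_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder()),
])

# Each entry is (name, transformer, columns it applies to).
# sparse_threshold=0.0 forces a dense result.
preprocess = ColumnTransformer(
    [("num", num_steps, ["Age", "Fare"]), ("cat", cat_steps, ["Sex"])],
    sparse_threshold=0.0,
)
prepared = preprocess.fit_transform(toy)
print(prepared.shape)
```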

## Selecting and Training Models

In [19]:
from sklearn.ensemble import RandomForestClassifier

In [20]:
# RandomForestClassifier is used as the classifier, with random_state=42.
# Any integer can be chosen as the random state.
# Fixing it makes the RandomForestClassifier results reproducible across runs.

forest_clf = RandomForestClassifier(random_state=42)

In [21]:
forest_clf.fit(X_train_prepared, y_train)

RandomForestClassifier(random_state=42)

In [22]:
y_pred = forest_clf.predict(X_train_prepared)

In [23]:
from sklearn.metrics import f1_score, accuracy_score

In [24]:
f1_score(y_train, y_pred)

0.9732142857142858

In [25]:
accuracy_score(y_train, y_pred)

0.9797979797979798
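The two scores above are computed on the same data the model was trained on, so they tend to be optimistic. The sketch below contrasts a training-set score with a cross-validated one on synthetic data (the dataset from `make_classification` is a stand-in, not the Titanic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# Synthetic, noisy binary classification problem.
X, y = make_classification(n_samples=300, n_features=8, flip_y=0.2, random_state=42)

clf = RandomForestClassifier(random_state=42)
clf.fit(X, y)

# Score on the data the model has already seen.
train_accuracy = accuracy_score(y, clf.predict(X))

# Mean score over 5 folds, each evaluated on data held out from training.
cv_accuracy = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

print(train_accuracy, cv_accuracy)
```

The gap between the two numbers gives a rough sense of how much the training-set score overstates performance.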

In [26]:
from sklearn.model_selection import GridSearchCV

In [27]:
# Search for the best model parameters

param_grid = [
    {"n_estimators":[50, 100, 200], "bootstrap": [True, False], "ccp_alpha":[0.5, 1.0]}
]

forest_clf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(forest_clf, param_grid, cv=5)

In [28]:
grid_search.fit(X_train_prepared, y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=42),
             param_grid=[{'bootstrap': [True, False], 'ccp_alpha': [0.5, 1.0],
                          'n_estimators': [50, 100, 200]}])

In [29]:
grid_search.best_params_

{'bootstrap': True, 'ccp_alpha': 0.5, 'n_estimators': 50}
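The grid above only offers `ccp_alpha` values of 0.5 and 1.0, which are very strong pruning settings for trees that split on Gini impurity (the default). The sketch below, on synthetic data that is an assumption of this example, shows how to inspect what such a setting does to the fitted trees:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, mildly imbalanced binary problem (stand-in for the real data).
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.62, 0.38], random_state=42
)

pruned = RandomForestClassifier(n_estimators=50, ccp_alpha=0.5, random_state=42)
pruned.fit(X, y)

# Number of nodes left in each tree after cost-complexity pruning.
node_counts = [tree.tree_.node_count for tree in pruned.estimators_]

# Distinct classes the forest predicts on the training data.
predicted_classes = set(pruned.predict(X))

print(max(node_counts), predicted_classes)
```

If every tree is pruned down to a single node, the forest can only predict the majority class, so much smaller `ccp_alpha` values (including 0.0) are worth adding to the grid.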

In [55]:
from sklearn.model_selection import cross_val_predict

In [106]:
# Create a new model from scratch with the best parameters

forest_clf = RandomForestClassifier(random_state=42, bootstrap=True, ccp_alpha=0.5, n_estimators=50)

In [107]:
y_cv_pred = cross_val_predict(forest_clf, X_train_prepared, y_train, cv=5)

In [108]:
# An F1 score of 0.0 means the model produced no true positives (no survivor was predicted correctly)

f1_score(y_train, y_cv_pred)

0.0

In [109]:
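# An accuracy of about 0.62 together with an F1 score of 0.0 is consistent with predicting
# class 0 for every passenger. This points to underfitting, likely because ccp_alpha=0.5
# prunes the trees very heavily, rather than to overfitting.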

accuracy_score(y_train, y_cv_pred)

0.6161616161616161
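The pair of scores above (F1 of 0.0, accuracy of about 0.616) is what a constant "always 0" prediction produces. The sketch below reproduces the arithmetic, assuming the 891 training labels split into 549 zeros and 342 ones; that split is an assumption chosen because it matches the reported accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Assumed label split: 549 negatives and 342 positives (891 rows in total).
y_true = np.array([0] * 549 + [1] * 342)

# A "model" that predicts the majority class for everyone.
y_majority = np.zeros_like(y_true)

# Accuracy is the share of negatives: 549 / 891.
acc = accuracy_score(y_true, y_majority)

# With no predicted positives there are no true positives, so F1 is 0.
f1 = f1_score(y_true, y_majority, zero_division=0)

print(acc, f1)
```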

In [19]:
from sklearn.linear_model import LogisticRegression

In [21]:
# A second model is trained here for comparison with the random forest

log_clf = LogisticRegression(random_state=42)

In [48]:
# Search for the best model parameters

param_grid_log = [
    {"C": [1, 2, 3, 4, 5], "max_iter":[100, 500, 1000, 3000]}
]

In [49]:
grid_search_log = GridSearchCV(log_clf, param_grid_log, cv=5)

In [50]:
grid_search_log.fit(X_train_prepared, y_train)

GridSearchCV(cv=5, estimator=LogisticRegression(random_state=42),
             param_grid=[{'C': [1, 2, 3, 4, 5],
                          'max_iter': [100, 500, 1000, 3000]}])

In [52]:
grid_search_log.best_params_

{'C': 1, 'max_iter': 100}

In [53]:
# Instead of creating a model with the best parameters from scratch,
# the best_estimator_ attribute of GridSearchCV is used here.
# It holds the model refit on the whole training set with the best parameters found.

log_estimator = grid_search_log.best_estimator_
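With the default `refit=True`, `GridSearchCV` refits the best parameter combination on the full training data, so `best_estimator_` is ready to predict. A minimal sketch on synthetic data (the dataset and the `C` grid are assumptions of this example):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary problem standing in for the prepared training data.
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

search = GridSearchCV(LogisticRegression(random_state=42), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X, y)

# best_estimator_ is already fitted with the winning parameters.
best = search.best_estimator_
predictions = best.predict(X[:5])

print(search.best_params_, predictions)
```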

In [56]:
log_pred = cross_val_predict(log_estimator, X_train_prepared, y_train, cv=5)

In [59]:
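# These cross-validated scores are lower than the random forest's training-set scores,
# but those were measured on the training data itself; they are higher than the
# random forest's cross-validated scores (F1 0.0, accuracy 0.616).
# This solution nevertheless uses RandomForestClassifier for the submission.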

f1_score(y_train, log_pred)

0.7192716236722307

In [61]:
accuracy_score(y_train, log_pred)

0.792368125701459

## Test Results

In [62]:
# Import test data

test_set = pd.read_csv("../../datasets/titanic/test.csv")

In [121]:
# Transform test data

test_set_prepared = full_pipeline.transform(test_set)

In [122]:
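# Predict
# forest_clf was re-created above and cross_val_predict only fits clones of it,
# so it is fitted on the full training set before predicting.

forest_clf.fit(X_train_prepared, y_train)
y_test_pred = forest_clf.predict(test_set_prepared)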

In [133]:
# Build the submission DataFrame, indexed by PassengerId

submission = pd.DataFrame(y_test_pred, index=test_set["PassengerId"], columns=["Survived"])

In [134]:
# Write it to the submission file

submission.to_csv("submission.csv")
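Before uploading, it can help to check that the file has the expected two-column layout. The sketch below builds a toy submission the same way as above (the passenger IDs and predictions are made up) and inspects the CSV text instead of writing a file:

```python
import numpy as np
import pandas as pd

# Made-up IDs and predictions standing in for test_set and y_test_pred.
toy_test = pd.DataFrame({"PassengerId": [1001, 1002, 1003]})
toy_pred = np.array([0, 1, 0])

# Using the PassengerId Series as the index keeps its name as the CSV header.
submission = pd.DataFrame(toy_pred, index=toy_test["PassengerId"], columns=["Survived"])

# With no path, to_csv returns the CSV content as a string.
csv_text = submission.to_csv()
print(csv_text)
```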