### Submitters:
* Dorit Lyakhovitsky (ID: )
* Haim Michalashvili (ID: )


[Kaggle Account](https://www.kaggle.com/doritlyakhovitsky)

### TL;DR
The Titanic competition asks us to build a machine learning model that predicts which passengers survived the Titanic shipwreck. The dataset includes passenger information such as name, sex, age, and number of family members on board.
This is a **classification** problem: each passenger belongs either to the survivors' group or to the non-survivors' group. The model we use is **Logistic Regression**, implemented with scikit-learn (`SGDClassifier` with `loss='log_loss'`).

### EDA - Exploratory Data Analysis

### Imports

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import OneHotEncoder


## Data Set Exploration

In [None]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
test_passenger_ids = test_df["PassengerId"]
display(train_df)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [None]:
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [None]:
train_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [None]:
test_df.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

## Preprocessing

In [None]:
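def clean_titanic(df):
    # drop identifier-like columns and Cabin (missing in 687 of the 891 training rows)
    df = df.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)
    # mark missing embarkation ports with the placeholder category "U"
    df["Embarked"] = df["Embarked"].fillna("U")

    # fill missing numeric values with the median of the same column in this DataFrame
    numericCols = ["SibSp", "Parch", "Fare", "Age"]
    for col in numericCols:
        df[col] = df[col].fillna(df[col].median())

    return df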

train_df = clean_titanic(train_df)
test_df = clean_titanic(test_df)
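The cleaning step above fills each DataFrame with its own medians. A common alternative, sketched below under our own assumptions, is to compute the medians on the training data only and reuse them for the test data, so both sets are imputed with the same values. The helper names `fit_medians` and `apply_medians` and the toy frames are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Toy frames standing in for the train/test data (values are made up)
toy_train = pd.DataFrame({"Age": [22.0, None, 35.0, 58.0],
                          "Fare": [7.25, 71.28, 8.05, 26.55]})
toy_test = pd.DataFrame({"Age": [None, 40.0],
                         "Fare": [None, 12.0]})

def fit_medians(df, cols):
    """Compute per-column medians on the training frame only."""
    return df[cols].median()

def apply_medians(df, medians):
    """Return a copy of df with missing values filled from the given medians."""
    return df.fillna(medians)

medians = fit_medians(toy_train, ["Age", "Fare"])
toy_train_filled = apply_medians(toy_train, medians)
toy_test_filled = apply_medians(toy_test, medians)  # test reuses the train medians

print(toy_test_filled)
```

Whether this changes the leaderboard score for this notebook is not something we measured; the sketch only shows the mechanics.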

In [None]:
# encode the categorical variables in the df
def hot_encode(p_train_df, p_test_df, columns_names_list):
    enc = OneHotEncoder(drop='first', sparse_output=False)

    train_df_enc = p_train_df.drop(columns_names_list, axis=1).join(pd.DataFrame(enc.fit_transform(p_train_df[columns_names_list]), columns=enc.get_feature_names_out(columns_names_list)))
    test_df_enc  = p_test_df.drop(columns_names_list, axis=1).join(pd.DataFrame(enc.transform(p_test_df[columns_names_list]), columns=enc.get_feature_names_out(columns_names_list)))

    return train_df_enc, test_df_enc

train_df, test_df = hot_encode(train_df, test_df, ['Sex', 'Embarked'])
display(train_df)
display(test_df)

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S,Embarked_U
0,0,3,22.0,1,0,7.2500,1.0,0.0,1.0,0.0
1,1,1,38.0,1,0,71.2833,0.0,0.0,0.0,0.0
2,1,3,26.0,0,0,7.9250,0.0,0.0,1.0,0.0
3,1,1,35.0,1,0,53.1000,0.0,0.0,1.0,0.0
4,0,3,35.0,0,0,8.0500,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...
886,0,2,27.0,0,0,13.0000,1.0,0.0,1.0,0.0
887,1,1,19.0,0,0,30.0000,0.0,0.0,1.0,0.0
888,0,3,28.0,1,2,23.4500,0.0,0.0,1.0,0.0
889,1,1,26.0,0,0,30.0000,1.0,0.0,0.0,0.0


Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S,Embarked_U
0,3,34.5,0,0,7.8292,1.0,1.0,0.0,0.0
1,3,47.0,1,0,7.0000,0.0,0.0,1.0,0.0
2,2,62.0,0,0,9.6875,1.0,1.0,0.0,0.0
3,3,27.0,0,0,8.6625,1.0,0.0,1.0,0.0
4,3,22.0,1,1,12.2875,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...
413,3,27.0,0,0,8.0500,1.0,0.0,1.0,0.0
414,1,39.0,0,0,108.9000,0.0,0.0,0.0,0.0
415,3,38.5,0,0,7.2500,1.0,0.0,1.0,0.0
416,3,27.0,0,0,8.0500,1.0,0.0,1.0,0.0


## X-t Split

In [20]:
# detach the target value from the input data
train_df_t = train_df['Survived']
train_df_X = train_df.drop('Survived', axis=1)
print('X')
display(train_df_X)
print()
print('t')
display(train_df_t)

X


Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S,Embarked_U
0,3,22.0,1,0,7.2500,1.0,0.0,1.0,0.0
1,1,38.0,1,0,71.2833,0.0,0.0,0.0,0.0
2,3,26.0,0,0,7.9250,0.0,0.0,1.0,0.0
3,1,35.0,1,0,53.1000,0.0,0.0,1.0,0.0
4,3,35.0,0,0,8.0500,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...
886,2,27.0,0,0,13.0000,1.0,0.0,1.0,0.0
887,1,19.0,0,0,30.0000,0.0,0.0,1.0,0.0
888,3,28.0,1,2,23.4500,0.0,0.0,1.0,0.0
889,1,26.0,0,0,30.0000,1.0,0.0,0.0,0.0



t


0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

In [21]:
# sklearn imports
import sklearn
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn import datasets
from sklearn import pipeline
from sklearn import linear_model
from sklearn import preprocessing
from sklearn import model_selection


## Experiments

In [22]:
import plotly.express as px

# This function is used to print the accuracy/log loss graphs
def print_graphs(graph_points):
    for k, v in graph_points.items():
        xs, ys = list(v.keys()), list(v.values())
        # accuracy is better when higher, log loss is better when lower
        best_index = np.argmax(ys) if 'Accuracy' in k else np.argmin(ys)
        color = 'red' if 'train' in k else 'blue'
        # report the actual x value of the best point, not its position in the list
        fig = px.scatter(x=xs, y=ys, title=f'{k}, best value: x={xs[best_index]}, y={ys[best_index]}', color_discrete_sequence=[color])
        fig.data[0].update(mode='markers+lines')
        fig.show()

### Experiments on train-validation split size

In [24]:
# plot the accuracy and the log loss for each validation split size
def plot_score_and_loss_by_split(X, t):
    graph_points = {
                    'train_LogLoss':{},
                    'val_LogLoss': {},
                    'train_Accuracy': {},
                    'val_Accuracy': {}
                    }
    for size in range(10, 100, 10):

        # split the data passed in as arguments, holding out size% for validation
        X_train, X_val, t_train, t_val = model_selection.train_test_split(
            X, t, test_size=size/100, random_state=42)
        SGD_cls = pipeline.make_pipeline(
            preprocessing.StandardScaler(),
            linear_model.SGDClassifier(
                loss='log_loss', alpha=0, learning_rate='constant', eta0=0.01)
            ).fit(X_train, t_train)

        # predict() returns hard 0/1 labels, so the log loss below is computed
        # on labels rather than on predicted probabilities
        y_train = SGD_cls.predict(X_train)
        y_val = SGD_cls.predict(X_val)

        graph_points['train_LogLoss'][size/100] = metrics.log_loss(t_train, y_train)
        graph_points['val_LogLoss'][size/100] = metrics.log_loss(t_val, y_val)
        graph_points['train_Accuracy'][size/100] = accuracy_score(t_train, y_train)
        graph_points['val_Accuracy'][size/100] = accuracy_score(t_val, y_val)
    print_graphs(graph_points)

plot_score_and_loss_by_split(train_df_X, train_df_t)

These graphs show us that with a validation split of 20% we get both a low log loss and a high accuracy.
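One caveat about the log loss values in these experiments: they are computed from the hard 0/1 labels returned by `predict`. The sketch below, on four made-up samples (the arrays are illustrative, not notebook data), shows how the same metric behaves when it is given predicted probabilities instead. In the notebook, the probabilities could come from something like `SGD_cls.predict_proba(X_val)[:, 1]`, which we did not run here.

```python
import numpy as np
from sklearn.metrics import log_loss

# Four made-up samples: true labels and predicted probabilities of class 1
t_true = np.array([0, 1, 1, 0])
p_survived = np.array([0.2, 0.7, 0.4, 0.1])

# Thresholding at 0.5 gives the hard labels [0, 1, 0, 0] (one mistake)
hard_labels = (p_survived >= 0.5).astype(int)

# Log loss on probabilities penalizes each sample by how confident it was
loss_from_proba = log_loss(t_true, p_survived)

# Log loss on hard labels treats every mistake as a fully confident one,
# so the value is essentially a rescaled error rate
loss_from_labels = log_loss(t_true, hard_labels)

print(loss_from_proba, loss_from_labels)
```

With hard labels, every misclassified sample contributes the same very large penalty, so the curve carries the same information as the accuracy curve. With probabilities, the metric also reflects how confident the model was.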

### Experiments on 'alpha' hyperparameter

In [25]:
# plot the accuracy and the log loss for each value of alpha (regularization strength)
def plot_score_and_loss_by_alpha(X, t):
    graph_points = {
                    'train_LogLoss':{},
                    'val_LogLoss': {},
                    'train_Accuracy': {},
                    'val_Accuracy': {}
                    }
    # alpha takes the values 0.001, 0.006, ..., 0.096
    for step in range(1, 100, 5):
        alpha = step/1000

        X_train, X_val, t_train, t_val = model_selection.train_test_split(
            X, t, test_size=0.2, random_state=42)
        # with learning_rate='optimal' the eta0 value is not used by the schedule
        SGD_cls = pipeline.make_pipeline(
            preprocessing.StandardScaler(),
            linear_model.SGDClassifier(
                loss='log_loss', alpha=alpha, learning_rate='optimal', eta0=0.01)
            ).fit(X_train, t_train)

        y_train = SGD_cls.predict(X_train)
        y_val = SGD_cls.predict(X_val)

        graph_points['train_LogLoss'][alpha] = metrics.log_loss(t_train, y_train)
        graph_points['val_LogLoss'][alpha] = metrics.log_loss(t_val, y_val)
        graph_points['train_Accuracy'][alpha] = accuracy_score(t_train, y_train)
        graph_points['val_Accuracy'][alpha] = accuracy_score(t_val, y_val)
    print_graphs(graph_points)

plot_score_and_loss_by_alpha(train_df_X, train_df_t)

These graphs show us that when alpha is about 0.06, we get both a low log loss and the highest accuracy (~80%).

### Experiments on 'eta0' hyperparameter

In [26]:
# plot the accuracy and the log loss for each value of eta0 (constant learning rate)
def plot_score_and_loss_by_eta(X, t):
    graph_points = {
                    'train_LogLoss':{},
                    'val_LogLoss': {},
                    'train_Accuracy': {},
                    'val_Accuracy': {}
                    }
    # eta0 takes the values 0.001, 0.006, ..., 0.096
    for step in range(1, 100, 5):
        eta0 = step/1000

        X_train, X_val, t_train, t_val = model_selection.train_test_split(
            X, t, test_size=0.2, random_state=42)
        SGD_cls = pipeline.make_pipeline(
            preprocessing.StandardScaler(),
            linear_model.SGDClassifier(
                loss='log_loss', alpha=0.06, learning_rate='constant', eta0=eta0)
            ).fit(X_train, t_train)

        y_train = SGD_cls.predict(X_train)
        y_val = SGD_cls.predict(X_val)

        graph_points['train_LogLoss'][eta0] = metrics.log_loss(t_train, y_train)
        graph_points['val_LogLoss'][eta0] = metrics.log_loss(t_val, y_val)
        graph_points['train_Accuracy'][eta0] = accuracy_score(t_train, y_train)
        graph_points['val_Accuracy'][eta0] = accuracy_score(t_val, y_val)
    print_graphs(graph_points)

plot_score_and_loss_by_eta(train_df_X, train_df_t)

These graphs show us that eta0=0.041 gives us the lowest log loss and the highest accuracy (~82%).

### Experiments on different feature groups

In [27]:
# plot the accuracy and the log loss for each feature group (x axis = group index, starting at 0)
def plot_score_and_loss_by_feature_groups(X, t):
    graph_points = {
                    'train_LogLoss':{},
                    'val_LogLoss': {},
                    'train_Accuracy': {},
                    'val_Accuracy': {}
                    }
    feature_groups = [["Age", "SibSp", "Parch"],
                      ["Age", "Sex_male"],
                      ["Pclass", "Fare"],
                      ["Pclass", "Fare", "Embarked_Q", "Embarked_S", "Embarked_U"],
                      ["Pclass", "Age", "SibSp", "Parch", "Fare", "Sex_male", "Embarked_Q", "Embarked_S", "Embarked_U"]]

    for index, features in enumerate(feature_groups):

        X_train, X_val, t_train, t_val = model_selection.train_test_split(
            X, t, test_size=0.2, random_state=42)

        # keep only the columns of the current feature group
        X_train = X_train[features]
        X_val = X_val[features]

        SGD_cls = pipeline.make_pipeline(
            preprocessing.StandardScaler(),
            linear_model.SGDClassifier(
                loss='log_loss', alpha=0.06, learning_rate='constant', eta0=0.041)
            ).fit(X_train, t_train)

        y_train = SGD_cls.predict(X_train)
        y_val = SGD_cls.predict(X_val)

        graph_points['train_LogLoss'][index] = metrics.log_loss(t_train, y_train)
        graph_points['val_LogLoss'][index] = metrics.log_loss(t_val, y_val)
        graph_points['train_Accuracy'][index] = accuracy_score(t_train, y_train)
        graph_points['val_Accuracy'][index] = accuracy_score(t_val, y_val)
    print_graphs(graph_points)

plot_score_and_loss_by_feature_groups(train_df_X, train_df_t)

These graphs show us that group 4 (the last group, counting from 0, which includes all the numerical and categorical features) gives us the lowest log loss and the highest accuracy.

### Final prediction on the test dataset, based on the experiments we conducted



In [28]:
# hold out 10% of the training data for validation (the experiments above used 20%)
X_train, X_val, t_train, t_val = model_selection.train_test_split(
    train_df_X, train_df_t, test_size=0.1, random_state=42)

# create the SGDClassifier with the hyperparameters chosen above and fit it
SGD_cls = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.SGDClassifier(loss='log_loss', alpha=0.06, learning_rate='constant', eta0=0.041)).fit(X_train, t_train)

y_train = SGD_cls.predict(X_train)
# y_val = SGD_cls.predict(X_val)
# predict survival for the Kaggle test set
y_test = SGD_cls.predict(test_df)

# build the submission file with the two columns PassengerId and Survived
submission_df = pd.DataFrame({"PassengerId": test_passenger_ids.values, "Survived": y_test})
submission_df.to_csv("submission.csv", index=False)
submission_df.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1


## Screenshots

![our leaderboard score](https://github.com/doritlya1997/ml-10244-submissions/blob/main/1_titanic_leaderboard_updated.png?raw=true)

![best scores of models](https://github.com/doritlya1997/ml-10244-submissions/blob/main/1_titanic_subs_and_scores_updated.png?raw=true)

## Summary


## References
* [The 7 ways to handle missing values in ML](https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326af79e)
* [Column Transformer with mixed types](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html)
* The class notebooks Shira has shared with us, especially notebooks #03 & #04
