# Kaggle Task - Titanic
### Dale Sandbox
A sandbox notebook for testing the import and exploration of the Kaggle Titanic data.

**Coding Steps**
1. Exploratory data analysis
    1. Summary statistics for each numeric and categorical independent variable
    2. Overall summary statistics for the dependent variable
    3. Cross-tab analysis of independent vs. dependent
    4. Plot some charts / band up variables as appropriate
2. Prep data for model
    1. Band up variables that require banding
    2. Filter to include only the independent and dependent variables
3. Train model
    1. Simple logistic regression with variables proposed
    2. Generate predictions on the training dataset
    3. Calculate accuracy

**Additional Ideas**
1. Using fancier plotting for the exploratory analysis
2. Missing-value handling
3. Enhancement of the modelling functions
4. Feature engineering
    1. Turn all variables into numeric features
    2. Enhanced banding
    3. Normalise all the variables
    4. Combining variables (can we work out who is in the family)
5. Additional modelling techniques

**0. Packages & Data Import**

In [5]:
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
import numpy as np
import statsmodels.api as sm
import plotly.express as px

pd.set_option('display.max_rows', 1000)

In [10]:
input_train = pd.read_csv('Data/train.csv')
input_train.head(100)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [11]:
input_train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
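
The `count` row shows 714 non-null `Age` values against 891 rows, so roughly a fifth of the ages are missing. Below is a minimal sketch of how that gap could be quantified and filled, assuming a median fill is acceptable for a first pass. The small `demo` frame and the `Age_missing` / `Age_filled` column names are hypothetical and not part of the notebook.

```python
import pandas as pd

# Share of missing ages implied by the describe() output above
missing_share = 1 - 714 / 891
print(f"Missing Age share: {missing_share:.1%}")

# Hypothetical mini-frame standing in for input_train
demo = pd.DataFrame({"Age": [22.0, 38.0, None, 35.0, None, 54.0]})

# Keep a flag so a model can still tell which ages were imputed
demo["Age_missing"] = demo["Age"].isna().astype(int)

# Median fill: less sensitive to extreme ages than a mean fill
demo["Age_filled"] = demo["Age"].fillna(demo["Age"].median())
print(demo)
```

Keeping the missing flag alongside the filled value means the choice of fill can be revisited later without losing track of which rows were affected.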


In [12]:
input_test = pd.read_csv('Data/test.csv')
input_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


**1. Exploratory Data Analysis**

Work out which variables are numeric or strings, and which are categorical or continuous

In [13]:
# Count number of unique values
input_uniqvals = pd.DataFrame(input_train.nunique(axis=0),columns=["Unique Values"])
input_uniqvals

Unnamed: 0,Unique Values
PassengerId,891
Survived,2
Pclass,3
Name,891
Sex,2
Age,88
SibSp,7
Parch,7
Ticket,681
Fare,248


In [14]:
input_train.value_counts()

PassengerId  Survived  Pclass  Name                                                                                Sex     Age    SibSp  Parch  Ticket             Fare      Cabin            Embarked
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Thayer)                                 female  38.00  1      0      PC 17599           71.2833   C85              C           1
572          1         1       Appleton, Mrs. Edward Dale (Charlotte Lamson)                                       female  53.00  2      0      11769              51.4792   C101             S           1
578          1         1       Silvey, Mrs. William Baird (Alice Munger)                                           female  39.00  1      0      13507              55.9000   E44              S           1
582          1         1       Thayer, Mrs. John Borland (Marian Longstreth Morris)                                female  39.00  1      1      17421              110.8833  C68             

In [15]:
# Get the data types of each column
input_dtypes = pd.DataFrame(input_train.dtypes, columns=["Data Type"])
input_dtypes

Unnamed: 0,Data Type
PassengerId,int64
Survived,int64
Pclass,int64
Name,object
Sex,object
Age,float64
SibSp,int64
Parch,int64
Ticket,object
Fare,float64


In [16]:
# Merge to create one dataframe
pd.merge(left=input_uniqvals,right=input_dtypes,left_index=True,right_index=True)

Unnamed: 0,Unique Values,Data Type
PassengerId,891,int64
Survived,2,int64
Pclass,3,int64
Name,891,object
Sex,2,object
Age,88,float64
SibSp,7,int64
Parch,7,int64
Ticket,681,object
Fare,248,float64
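
The unique-value and data-type frames can also be assembled in one step instead of being merged afterwards. The sketch below does that and adds a missing-value count; the `summarise_columns` helper, the `Missing` column, and the `demo` frame are illustrative assumptions rather than code from the notebook.

```python
import pandas as pd

def summarise_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: unique values, dtype and missing count."""
    return pd.DataFrame({
        "Unique Values": df.nunique(),   # NaN is not counted as a value
        "Data Type": df.dtypes,
        "Missing": df.isna().sum(),
    })

# Hypothetical mini-frame standing in for input_train
demo = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 26.0],
})

summary = summarise_columns(demo)
print(summary)
```

On the real data this would be called as `summarise_columns(input_train)`.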


In [158]:
input_train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [23]:
# Flag whether a cabin is recorded for the passenger (1) or missing (0)
input_train["BookedCabin"] = input_train["Cabin"].notnull().astype('int')
input_train.head(50)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,BookedCabin
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,0
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,1
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,0


In [32]:
# We know that survived is the dependent variable
dep_vars = ['Survived']
# Assume that Name, Ticket, and Cabin are non-continuous and/or have no effect on the outcome
indep_vars = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','BookedCabin']
# Continuous variables
cont_vars = ['Age','Fare']

In [33]:
dep_vars + indep_vars

['Survived',
 'Pclass',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Fare',
 'Embarked',
 'BookedCabin']

Process data

In [35]:
# Filter for columns we want
build_train = input_train[dep_vars+indep_vars]
#build_test = input_test[['PassengerId']+indep_vars]
build_train.head(30)


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,BookedCabin
0,0,3,male,22.0,1,0,7.25,S,0
1,1,1,female,38.0,1,0,71.2833,C,1
2,1,3,female,26.0,0,0,7.925,S,0
3,1,1,female,35.0,1,0,53.1,S,1
4,0,3,male,35.0,0,0,8.05,S,0
5,0,3,male,,0,0,8.4583,Q,0
6,0,1,male,54.0,0,0,51.8625,S,1
7,0,3,male,2.0,3,1,21.075,S,0
8,1,3,female,27.0,0,2,11.1333,S,0
9,1,2,female,14.0,1,0,30.0708,C,0


In [36]:
# Mappings for categorical
sex_map = {'male':0,'female':1}


In [38]:
# Loop through each dataset and process variables
# for df in [build_train,build_test]:
for df in [build_train]:
    
    # Calculate Quartiles
    age_qs = pd.qcut(df['Age'],q=4, labels=[1,2,3,4]).copy()
    fare_qs = pd.qcut(df['Fare'],q=4, labels=[1,2,3,4]).copy()

    # Band continuous variables using quantiles
    df['Age_Quartile'] = age_qs
    df['Fare_Quartile'] = fare_qs
    df['Sex'] = df['Sex'].map(sex_map)
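
Because `pd.qcut` derives its cut points from whichever frame it is given, running this loop over a test set as well would band train and test on different edges. One possible approach, sketched below, is to learn the edges on the training data with `retbins=True` and reuse them with `pd.cut`. The `train` and `test` frames here are hypothetical, and widening the outer edges to infinity is an assumption made so that out-of-range test values still receive a band.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for build_train and build_test
train = pd.DataFrame({"Age": [2.0, 14.0, 22.0, 26.0, 35.0, 38.0, 54.0, 80.0]})
test = pd.DataFrame({"Age": [1.0, 30.0, 90.0]})

labels = [1, 2, 3, 4]

# Learn the quartile edges on the training data only
train["Age_Quartile"], edges = pd.qcut(
    train["Age"], q=4, labels=labels, retbins=True
)

# Widen the outer edges so unseen extremes still fall into a band
edges = edges.copy()
edges[0], edges[-1] = -np.inf, np.inf

# Apply the training edges to the test data
test["Age_Quartile"] = pd.cut(test["Age"], bins=edges, labels=labels)
print(test)
```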


In [39]:
build_train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,BookedCabin,Age_Quartile,Fare_Quartile
0,0,3,0,22.0,1,0,7.25,S,0,2,1
1,1,1,1,38.0,1,0,71.2833,C,1,3,4
2,1,3,1,26.0,0,0,7.925,S,0,2,2
3,1,1,1,35.0,1,0,53.1,S,1,3,4
4,0,3,0,35.0,0,0,8.05,S,0,3,2


In [42]:
build_train['fare_band'] = pd.qcut(build_train["Fare"],
                                    q=20,
                                    precision=2).copy()

# Mean survival rate within each fare band
char_table=build_train.groupby(['fare_band'])['Survived'].mean().sort_index()

char_table

fare_band
(-0.01, 7.22]       0.109091
(7.22, 7.55]        0.189189
(7.55, 7.75]        0.333333
(7.75, 7.85]        0.256410
(7.85, 7.91]        0.113636
(7.91, 8.05]        0.225806
(8.05, 9.0]         0.142857
(9.0, 10.5]         0.280000
(10.5, 13.0]        0.461538
(13.0, 14.45]       0.315789
(14.45, 16.1]       0.382979
(16.1, 21.68]       0.463415
(21.68, 26.0]       0.459016
(26.0, 27.0]        0.642857
(27.0, 31.0]        0.400000
(31.0, 39.69]       0.347826
(39.69, 56.5]       0.510638
(56.5, 77.96]       0.547619
(77.96, 112.08]     0.761905
(112.08, 512.33]    0.755556
Name: Survived, dtype: float64

In [70]:
build_train.groupby("fare_band")["Survived"].agg(['count', 'mean','sem'])

fare_band,count,mean,sem
"(0, 10]",321,0.205607,0.022592
"(10, 25]",221,0.420814,0.033285
"(25, 1000]",334,0.54491,0.027289


In [44]:
build_train['fare_band'] = pd.cut(build_train["Fare"],
                                    bins=[0,10,25,1000],
                                    precision=2)

build_train['fare_band'].value_counts().sort_index()


fare_band
(0, 10]       321
(10, 25]      221
(25, 1000]    334
Name: count, dtype: int64
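
The three counts sum to 876 rather than 891. `pd.cut` intervals are right-closed by default, so a fare of exactly 0 falls outside `(0, 10]` and is left as NaN. That is a likely explanation for the 15 unbanded rows, although the notebook does not confirm it. The sketch below shows the behaviour on a hypothetical series and how `include_lowest=True` changes it.

```python
import pandas as pd

# Hypothetical fares, including one of exactly zero
fares = pd.Series([0.0, 7.25, 10.0, 21.075, 71.2833])
bins = [0, 10, 25, 1000]

# Default: the left edge of the first bin is excluded
default_bands = pd.cut(fares, bins=bins)

# include_lowest=True pulls the left edge into the first bin
inclusive_bands = pd.cut(fares, bins=bins, include_lowest=True)

print(default_bands.isna().sum())    # rows left without a band
print(inclusive_bands.isna().sum())
```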

In [69]:
build_train.groupby("fare_band")["Survived"].agg(['count', 'mean','sem'])

fare_band,count,mean,sem
"(0, 10]",321,0.205607,0.022592
"(10, 25]",221,0.420814,0.033285
"(25, 1000]",334,0.54491,0.027289


In [71]:
build_train.head(25)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,BookedCabin,Age_Quartile,Fare_Quartile,fare_band,ParchBin
0,0,3,0,22.0,1,0,7.25,S,0,2.0,1,"(0, 10]",0
1,1,1,1,38.0,1,0,71.2833,C,1,3.0,4,"(25, 1000]",0
2,1,3,1,26.0,0,0,7.925,S,0,2.0,2,"(0, 10]",0
3,1,1,1,35.0,1,0,53.1,S,1,3.0,4,"(25, 1000]",0
4,0,3,0,35.0,0,0,8.05,S,0,3.0,2,"(0, 10]",0
5,0,3,0,,0,0,8.4583,Q,0,,2,"(0, 10]",0
6,0,1,0,54.0,0,0,51.8625,S,1,4.0,4,"(25, 1000]",0
7,0,3,0,2.0,3,1,21.075,S,0,1.0,3,"(10, 25]",1
8,1,3,1,27.0,0,2,11.1333,S,0,2.0,2,"(10, 25]",1
9,1,2,1,14.0,1,0,30.0708,C,0,1.0,3,"(25, 1000]",0


In [68]:
build_train.groupby("Embarked")["Survived"].agg(['count', 'mean','sem'])

Embarked,count,mean,sem
C,168,0.553571,0.038468
Q,77,0.38961,0.055939
S,644,0.336957,0.01864


In [67]:
build_train.groupby("BookedCabin")["Survived"].agg(['count', 'mean','sem'])

BookedCabin,count,mean,sem
0,687,0.299854,0.017494
1,204,0.666667,0.033086
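
The `sem` column is the standard error of the mean, which for a 0/1 outcome reduces to sqrt(p(1-p)/(n-1)) under pandas' default `ddof=1`. As a sanity check, the sketch below rebuilds the `BookedCabin == 1` row from its count and mean and derives an approximate 95% interval. The 1.96 multiplier assumes a normal approximation, which is an addition here and not something the notebook uses.

```python
import numpy as np
import pandas as pd

n, p = 204, 0.666667                # count and mean for BookedCabin == 1
survivors = round(n * p)            # number of 1s implied by the mean
outcomes = pd.Series([1] * survivors + [0] * (n - survivors))

# Standard error as pandas computes it (sample std / sqrt(n))
sem_pandas = outcomes.sem()

# Closed form for a binary outcome
p_hat = outcomes.mean()
sem_formula = np.sqrt(p_hat * (1 - p_hat) / (n - 1))

# Approximate 95% interval for the survival rate (normal approximation)
low, high = p_hat - 1.96 * sem_formula, p_hat + 1.96 * sem_formula
print(sem_pandas, sem_formula, (low, high))
```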


In [79]:

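# Survival rate by number of parents/children aboard (not displayed: only the last expression is shown)
build_train.groupby("Parch")["Survived"].agg(['count', 'mean'])
# Collapse Parch into a binary flag: 1 if travelling with any parent or child, else 0
build_train["ParchBin"]=np.where((build_train.Parch>0),1,0)
build_train.groupby("ParchBin")["Survived"].agg(['count', 'mean'])
#build_train.head(100)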

ParchBin,count,mean
0,678,0.343658
1,213,0.511737


In [64]:
build_train.groupby("SibSp")["Survived"].agg(['count', 'mean','sem'])

SibSp,count,mean,sem
0,608,0.345395,0.0193
1,209,0.535885,0.034579
2,28,0.464286,0.095979
3,16,0.25,0.111803
4,18,0.166667,0.090388
5,5,0.0,0.0
8,7,0.0,0.0
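
`SibSp` and `Parch` both count relatives aboard, and the larger `SibSp` groups are too small to estimate reliably on their own. One option for the "combining variables" idea is a single family-size feature. The sketch below uses a hypothetical frame, and the `FamilySize` and `IsAlone` column names are assumptions; it does not identify which passengers belong to the same family.

```python
import pandas as pd

# Hypothetical stand-in for build_train
demo = pd.DataFrame({"SibSp": [1, 0, 3, 0], "Parch": [0, 0, 1, 2]})

# Relatives aboard plus the passenger themselves
demo["FamilySize"] = demo["SibSp"] + demo["Parch"] + 1

# Binary flag for passengers travelling without relatives
demo["IsAlone"] = (demo["FamilySize"] == 1).astype(int)
print(demo)
```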


In [76]:
build_train['Age_band'] = pd.qcut(build_train["Age"],
                                    q=10,
                                    precision=2).copy()

# Mean survival rate within each age band
char_table=build_train.groupby(['Age_band'])['Survived'].mean().sort_index()

In [77]:
build_train.groupby("Age_band")["Survived"].agg(['count', 'mean','sem'])

Age_band,count,mean,sem
"(0.41, 14.0]",77,0.584416,0.056531
"(14.0, 19.0]",87,0.390805,0.052615
"(19.0, 22.0]",67,0.283582,0.055482
"(22.0, 25.0]",70,0.371429,0.058169
"(25.0, 28.0]",61,0.393443,0.063067
"(28.0, 31.8]",66,0.393939,0.060606
"(31.8, 36.0]",91,0.483516,0.052676
"(36.0, 41.0]",53,0.358491,0.066503
"(41.0, 50.0]",78,0.397436,0.055769
"(50.0, 80.0]",64,0.34375,0.059839


In [48]:
# Fit a logistic regression of the dependent variable on one or more numeric independent variables (rows with missing values are dropped)
def build_toy_model(dependent, independent):
    toy_model_train = build_train[dependent+independent].copy()

    toy_model_train.dropna(inplace=True)
    toy_model_train['Int'] = 1
    log_reg = sm.Logit(toy_model_train[dependent],toy_model_train[independent+['Int']]).fit()
    return log_reg



In [49]:
age_q_model = build_toy_model(dep_vars, ['Age_Quartile'])
sex_model = build_toy_model(dep_vars, ['Sex'])
pclass_model = build_toy_model(dep_vars, ['Pclass'])


Optimization terminated successfully.
         Current function value: 0.674580
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.515041
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.608531
         Iterations 5
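
Each "Current function value" is the average negative log-likelihood per row. For example, the age model's summary later in the notebook reports a log-likelihood of -481.65 over 714 observations, and 481.65 / 714 ≈ 0.6746. To illustrate what the fit is doing, here is a minimal NumPy sketch of maximum-likelihood logistic regression by Newton-Raphson (the documented default method for `Logit.fit`). It is not the statsmodels implementation, and the toy data and the `fit_logit_newton` name are hypothetical.

```python
import numpy as np

def fit_logit_newton(X, y, max_iter=25, tol=1e-10):
    """Maximum-likelihood logistic regression via Newton-Raphson.

    Returns the coefficients and the mean negative log-likelihood.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))              # predicted probabilities
        gradient = X.T @ (y - p)                         # score vector
        information = (X * (p * (1 - p))[:, None]).T @ X  # negative Hessian
        step = np.linalg.solve(information, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    mean_nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return beta, mean_nll

# Hypothetical toy data: one binary feature plus an intercept column ("Int")
sex = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
survived = np.array([0, 0, 0, 1, 1, 1, 1, 0], dtype=float)
X = np.column_stack([sex, np.ones_like(sex)])

beta, mean_nll = fit_logit_newton(X, survived)
print(beta, mean_nll)
```

With a single binary feature the fitted intercept is the log-odds of survival in the `sex == 0` group and the slope is the difference in log-odds between the two groups, which gives a quick way to check the result by hand.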


In [50]:
age_sex_model = build_toy_model(dep_vars, ['Age_Quartile','Sex'])

Optimization terminated successfully.
         Current function value: 0.525659
         Iterations 5


In [51]:
print(age_q_model.summary())
print(sex_model.summary())
print(pclass_model.summary())

                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  714
Model:                          Logit   Df Residuals:                      712
Method:                           MLE   Df Model:                            1
Date:                Fri, 10 May 2024   Pseudo R-squ.:                0.001261
Time:                        11:27:08   Log-Likelihood:                -481.65
converged:                       True   LL-Null:                       -482.26
Covariance Type:            nonrobust   LLR p-value:                    0.2701
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Age_Quartile    -0.0753      0.068     -1.102      0.270      -0.209       0.059
Int             -0.1930      0.185     -1.041      0.298      -0.556       0.170
                           Logit Regression 

In [52]:
print(age_sex_model.summary())

                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  714
Model:                          Logit   Df Residuals:                      711
Method:                           MLE   Df Model:                            2
Date:                Fri, 10 May 2024   Pseudo R-squ.:                  0.2217
Time:                        11:27:08   Log-Likelihood:                -375.32
converged:                       True   LL-Null:                       -482.26
Covariance Type:            nonrobust   LLR p-value:                 3.611e-47
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Age_Quartile    -0.0197      0.081     -0.243      0.808      -0.179       0.139
Sex              2.4752      0.185     13.360      0.000       2.112       2.838
Int             -1.3035      0.236     -5.53

In [53]:
# Generate survival probabilities from a fitted model for the chosen independent variables.
# Assumes build_test has been prepared with the same transformations as build_train.
def predict_toy_model(independent, model_name):
    toy_model_test = build_test[independent].copy()

    toy_model_test.dropna(inplace=True)
    toy_model_test['Int'] = 1
    predictions = model_name.predict(toy_model_test)
    predictions.name = 'SurvivalProb'
    return predictions

In [54]:
age_q_predictions = predict_toy_model(['Age_Quartile'],age_q_model)
sex_predictions = predict_toy_model(['Sex'],sex_model)
pclass_predictions = predict_toy_model(['Pclass'],pclass_model)

In [55]:
test_results = pd.DataFrame(build_test[['PassengerId','Sex']])
test_results['PredProb'] = sex_predictions
test_results['Survived'] = test_results['PredProb'].round().astype(int)
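
`.round().astype(int)` raises an error if any probability is NaN, which can happen when `predict_toy_model` drops rows with missing inputs and the assignment realigns on the index. The sketch below uses an explicit threshold and a default class for missing probabilities. The `to_class` helper and the fallback of 0 (did not survive) are assumptions, not the notebook's method.

```python
import pandas as pd

def to_class(prob: pd.Series, threshold: float = 0.5, fallback: int = 0) -> pd.Series:
    """Convert probabilities to 0/1 labels, using `fallback` where prob is NaN."""
    labels = (prob >= threshold).astype(int)
    return labels.where(prob.notna(), fallback)

# Hypothetical probabilities, one of them missing
probs = pd.Series([0.1, 0.5, 0.9, None])
labels_default = to_class(probs)
labels_survive = to_class(probs, fallback=1)
print(labels_default.tolist(), labels_survive.tolist())
```

An explicit `>=` also makes the treatment of a probability of exactly 0.5 visible, whereas `round()` sends 0.5 to 0 because NumPy rounds halves to the nearest even number.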

In [56]:
test_results.dtypes

PassengerId      int64
Sex              int64
PredProb       float64
Survived         int32
dtype: object

In [57]:
final_result = test_results.drop(columns=['PredProb'],errors='ignore')
final_result.to_csv('Outputs/simple_logit.csv',index=False)

OSError: Cannot save file into a non-existent directory: 'Outputs'
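
The `OSError` above shows that `to_csv` does not create missing folders. A minimal sketch of one way around it is to create the parent directory before writing. The `save_submission` helper is hypothetical, and the temporary directory is used only so the example leaves no files behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_submission(df: pd.DataFrame, path) -> Path:
    """Write df to CSV, creating the parent folder if it does not exist."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path

# Hypothetical two-row submission written under a temporary folder
demo = pd.DataFrame({"PassengerId": [892, 893], "Survived": [0, 1]})
out_path = save_submission(demo, Path(tempfile.mkdtemp()) / "Outputs" / "simple_logit.csv")
print(out_path)
```

In the notebook the call would be `save_submission(final_result, 'Outputs/simple_logit.csv')`.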

## Second attempt with Sex and Pclass

In [None]:
toy_model_train = build_train[['Survived','Pclass','Sex']].copy()

toy_model_train.dropna(inplace=True)
toy_model_train['Int'] = 1
model_2 = sm.Logit(toy_model_train['Survived'],toy_model_train[['Pclass','Sex','Int']]).fit()
model_2.summary()

Optimization terminated successfully.
         Current function value: 0.464195
         Iterations 6


Dep. Variable:,Survived,No. Observations:,891.0
Model:,Logit,Df Residuals:,888.0
Method:,MLE,Df Model:,2.0
Date:,"Thu, 09 May 2024",Pseudo R-squ.:,0.3029
Time:,17:42:02,Log-Likelihood:,-413.6
converged:,True,LL-Null:,-593.33
Covariance Type:,nonrobust,LLR p-value:,8.798e-79

,coef,std err,z,P>|z|,[0.025,0.975]
Pclass,-0.9606,0.106,-9.057,0.000,-1.168,-0.753
Sex,2.6434,0.184,14.380,0.000,2.283,3.004
Int,0.6512,0.241,2.703,0.007,0.179,1.124
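
The coefficients are on the log-odds scale, so they need to pass through the logistic function before they read as probabilities. The sketch below plugs the fitted values from the table above into that function for two example passengers, using the notebook's encoding of `Sex` (0 = male, 1 = female). The `survival_prob` helper is hypothetical.

```python
import math

# Coefficients copied from the model_2 summary above
coef = {"Pclass": -0.9606, "Sex": 2.6434, "Int": 0.6512}

def survival_prob(pclass: int, sex: int) -> float:
    """Predicted survival probability for a passenger class and sex code."""
    log_odds = coef["Pclass"] * pclass + coef["Sex"] * sex + coef["Int"]
    return 1.0 / (1.0 + math.exp(-log_odds))

p_female_first = survival_prob(pclass=1, sex=1)
p_male_third = survival_prob(pclass=3, sex=0)

# Multiplicative change in the odds of survival for female vs male
odds_ratio_sex = math.exp(coef["Sex"])

print(round(p_female_first, 3), round(p_male_third, 3), round(odds_ratio_sex, 1))
```

Under this model a first-class woman has a predicted survival probability above 0.9 and a third-class man below 0.1, so rounding at 0.5 largely reduces to a rule based on sex and class.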


In [None]:
model_2_predictions = predict_toy_model(['Pclass','Sex'],model_2)

In [None]:
test_results2 = pd.DataFrame(build_test[['PassengerId','Sex']])
test_results2['PredProb'] = model_2_predictions
test_results2['Survived'] = test_results2['PredProb'].round().astype(int)

In [None]:
final_result2 = test_results2.drop(columns=['PredProb','Sex'],errors='ignore')
final_result2.to_csv('Outputs/sex_pclass_logit.csv',index=False)