The goal is to correctly predict whether a passenger survived the Titanic shipwreck.

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt

training = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

training['train_test'] = 1
test['train_test'] = 0
test['Survived'] = np.nan  # the np.NaN alias was removed in NumPy 2.0
all_data = pd.concat([training,test])

%matplotlib inline
# Let's see the columns
all_data.columns

## Light Data Exploration

1) For numeric data

* Make histograms to understand distributions
* Corrplot
* Pivot table comparing survival rate across numeric variables

2) For Categorical Data
* Make bar charts to understand balance of classes
* Make pivot tables to understand relationship with survival

In [None]:
# Quick look at our data types & null counts 
training.info()

"""
Key Columns:
PassengerId: Unique identifier for each passenger.
Survived: Binary variable indicating survival (1 = Yes, 0 = No).
Pclass: Ticket class (1 = First, 2 = Second, 3 = Third).
Name: Passenger's name.
Sex: Passenger's gender (Male or Female).
Age: Passenger's age (some values are missing).
SibSp: Number of siblings or spouses aboard.
Parch: Number of parents or children aboard.
Ticket: Ticket number.
Fare: Fare paid by the passenger.
Cabin: Cabin number (some values are missing).
Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
"""

In [None]:
"""
To better understand the numeric data, we want to use the .describe() method. 
This gives us an understanding of the central tendencies of the data.
"""
training.describe()

In [None]:
# Quick way to separate numeric columns
training.describe().columns

In [None]:
# Look at numeric and categorical values separately 
df_num = training[['Age','SibSp','Parch','Fare']]
df_cat = training[['Survived','Pclass','Sex','Ticket','Cabin','Embarked']]

In [None]:
# Distributions for all numeric variables 
for i in df_num.columns:
    plt.hist(df_num[i])
    plt.title(i)
    plt.show()

In [None]:
"""
Let's correlate the numeric features.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html

The correlation coefficient ranges from −1 to 1.
An absolute value of exactly 1 implies that a linear equation describes the relationship between X and Y 
perfectly, with all data points lying on a line. 
The correlation sign is determined by the regression slope: a value of +1 implies that all data points 
lie on a line for which Y increases as X increases, whereas a value of -1 implies a line where Y increases 
while X decreases. A value of 0 implies that there is no linear dependency between the variables.
"""
print(df_num.corr())
sns.heatmap(df_num.corr())
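To make the +1, -1 and 0 cases concrete, here is a minimal sketch on made-up data. The frame `toy` and its column names are illustrative assumptions, not part of the Titanic dataset.

```python
import numpy as np
import pandas as pd

# Toy data (not from the Titanic set): three columns derived from x.
x = np.arange(-5, 6)
toy = pd.DataFrame({
    "x": x,
    "up": 2 * x + 1,      # perfectly linear, positive slope
    "down": -3 * x + 4,   # perfectly linear, negative slope
    "parabola": x ** 2,   # fully determined by x, but not linearly
})

corr = toy.corr()  # Pearson correlation is the default method
print(corr["x"].round(3))
```

The `parabola` column shows why a coefficient near 0 only rules out a linear relationship: the column depends entirely on `x`, yet its correlation with `x` is zero.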

In [None]:
"""
Compare survival rate across Age, SibSp, Parch, and Fare.
https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html
This gives us a spreadsheet-style table indicating means of each of the numerical features.

df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large"],
                   "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                   "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
df
     A    B      C  D  E
0  foo  one  small  1  2
1  foo  one  large  2  4
2  foo  one  large  2  5
3  foo  two  small  3  5
4  foo  two  small  3  6
5  bar  one  large  4  6
6  bar  one  small  5  8
7  bar  two  small  6  9
8  bar  two  large  7  9

>>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
...                        columns=['C'], aggfunc="sum")
>>> table
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0
    
From the above example, we are summing all the values from D that have 'foo', 'two' and 'small' in the same 
row giving 6.0
"""
pd.pivot_table(training, index = 'Survived', values = ['Age','SibSp','Parch','Fare'], aggfunc="mean")
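When only `index` and `values` are given, the pivot table with `aggfunc="mean"` is a per-group mean. The sketch below checks this against `groupby` on a few made-up rows; the numbers are illustrative, not taken from train.csv.

```python
import pandas as pd

# Made-up rows, only to illustrate the shape of the result.
toy = pd.DataFrame({
    "Survived": [0, 0, 1, 1],
    "Age": [30.0, 40.0, 20.0, 30.0],
    "Fare": [10.0, 20.0, 50.0, 70.0],
})

# One row per Survived value, one column per numeric feature.
by_pivot = pd.pivot_table(toy, index="Survived", values=["Age", "Fare"], aggfunc="mean")

# The same numbers computed with groupby.
by_group = toy.groupby("Survived")[["Age", "Fare"]].mean()

print(by_pivot)
print((by_pivot.values == by_group.values).all())
```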

In [None]:
"""
https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html
Return a Series containing counts of unique values.

The resulting object will be in descending order so that the first element is the most 
frequently-occurring element. Excludes NA values by default.

>>> s = pd.Series([3, 1, 2, 3, 4, np.nan])
>>> s.value_counts(normalize=True)
3.0    0.4
1.0    0.2
2.0    0.2
4.0    0.2
Name: proportion, dtype: float64

"""
for i in df_cat.columns:
    sns.barplot(x=df_cat[i].value_counts().index, y=df_cat[i].value_counts()).set_title(i)
    plt.show()

Cabin and ticket graphs are very messy. This is an area where we may want to do some feature engineering!

In [None]:
# Comparing survival and each of these categorical variables 
# For example, count the passengers who survived (or not) in each Pclass (counting the Ticket column, one entry per passenger)
print(pd.pivot_table(training, index = 'Survived', columns = 'Pclass', values = 'Ticket', aggfunc ='count'))
print()
print(pd.pivot_table(training, index = 'Survived', columns = 'Sex', values = 'Ticket', aggfunc ='count'))
print()
print(pd.pivot_table(training, index = 'Survived', columns = 'Embarked', values = 'Ticket', aggfunc ='count'))

## Feature Engineering

1) Cabin - Simplify cabins
2) Tickets - Do different ticket types impact survival rates?
3) Does a person's title relate to survival rates?

In [None]:
print(df_cat.Cabin)
"""
If cabin value is NaN, assign it a zero. For non-NaN, count the number of cabins associated for each passenger.
"""
training['cabin_multiple'] = training.Cabin.apply(lambda x: 0 if pd.isna(x) else len(x.split(' ')))
training['cabin_multiple'].value_counts()

In [None]:
# Count the passengers who survived (or not) for each number of cabins (counting the Ticket column, one entry per passenger)
pd.pivot_table(training, index = 'Survived', columns = 'cabin_multiple', values = 'Ticket', aggfunc ='count')

In [None]:
"""
Create categories based on the cabin letter ('n' stands for null, because str(NaN) is 'nan'). We treat null values as their own category.
"""
training['cabin_adv'] = training.Cabin.apply(lambda x: str(x)[0])
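The two cabin features can be checked on a few hand-written values. This is a small sketch; the series `cabins` is an assumption written in the style of the Cabin column, not data read from train.csv.

```python
import numpy as np
import pandas as pd

# Hand-written sample values in the style of the Cabin column.
cabins = pd.Series(["C85", np.nan, "C23 C25 C27", "B57 B59"], dtype=object)

# Number of cabins listed per passenger; a missing cabin counts as 0.
cabin_multiple = cabins.apply(lambda x: 0 if pd.isna(x) else len(x.split(' ')))

# First character of the cabin string; str(nan) is 'nan', so missing values become 'n'.
cabin_adv = cabins.apply(lambda x: str(x)[0])

print(cabin_multiple.tolist())  # [1, 0, 3, 2]
print(cabin_adv.tolist())       # ['C', 'n', 'C', 'B']
```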

In [None]:
# Comparing survival rate by cabin
print(training.cabin_adv.value_counts())
# Can also do values = 'Ticket'
pd.pivot_table(training,index='Survived', columns='cabin_adv', values = 'Name', aggfunc='count') 

In [None]:
# Understand ticket values better. numeric vs non numeric 
training['numeric_ticket'] = training.Ticket.apply(lambda x: 1 if x.isnumeric() else 0)
training['ticket_letters'] = training.Ticket.apply(lambda x: ''.join(x.split(' ')[:-1]).replace('.','').replace('/','').lower() if len(x.split(' ')[:-1]) >0 else 0)

print(training['numeric_ticket'].value_counts())
print(training['ticket_letters'].value_counts())
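The one-line lambda for `ticket_letters` is dense, so here is the same logic unrolled into a function and applied to a few sample strings. The helper name `ticket_prefix` and the sample tickets are illustrative assumptions.

```python
import pandas as pd

# Sample strings in the style of the Ticket column.
tickets = pd.Series(["A/5 21171", "PC 17599", "113803", "STON/O2. 3101282"], dtype=object)

def ticket_prefix(ticket):
    """Return the cleaned, lower-cased prefix before the final token, or 0 if there is none."""
    parts = ticket.split(' ')[:-1]   # everything before the last space-separated token
    if len(parts) == 0:
        return 0                     # purely numeric tickets have no prefix
    return ''.join(parts).replace('.', '').replace('/', '').lower()

numeric_ticket = tickets.apply(lambda x: 1 if x.isnumeric() else 0)
ticket_letters = tickets.apply(ticket_prefix)

print(numeric_ticket.tolist())  # [0, 0, 1, 0]
print(ticket_letters.tolist())  # ['a5', 'pc', 0, 'stono2']
```

Note that the result mixes the integer 0 with strings, exactly as the original lambda does; a string placeholder such as 'none' would be one way to keep the column a single type.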

In [None]:
pd.pivot_table(training,index='Survived',columns='numeric_ticket', values = 'Ticket', aggfunc='count')

In [None]:
pd.pivot_table(training,index='Survived',columns='ticket_letters', values = 'Ticket', aggfunc='count')

In [None]:
# Feature engineering on person's title. mr., ms., master. etc
print(training.Name.head(10))
training['name_title'] = training.Name.apply(lambda x: x.split(',')[1].split('.')[0].strip())
training['name_title'].value_counts()
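The title extraction relies on names following a "Surname, Title. Given names" pattern. The sketch below applies the same expression to hypothetical names (not taken from the dataset) to show what it returns, including for a multi-word title.

```python
import pandas as pd

# Hypothetical names in the "Surname, Title. Given names" format.
names = pd.Series([
    "Doe, Mr. John",
    "Roe, Mrs. Jane (Mary Smith)",
    "Poe, the Countess. of (Ann Lee)",
], dtype=object)

# Take the text after the first comma, keep what precedes the first period, trim spaces.
titles = names.apply(lambda x: x.split(',')[1].split('.')[0].strip())

print(titles.tolist())  # ['Mr', 'Mrs', 'the Countess']
```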

In [None]:
pd.pivot_table(training,index='Survived',columns='name_title', values = 'Ticket', aggfunc='count')

## Data Preprocessing for Model

1) Drop null values from Embarked (only 2)
2) Include only relevant variables (we exclude things like name and passenger ID so that our models have a reasonable number of features to deal with)
Variables: 'Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','cabin_adv','cabin_multiple','numeric_ticket','name_title'
3) Ensure that our training and test data have the same columns.
4) Impute missing fare and age values with the median (the mean is kept as a commented-out alternative in the code)
5) Normalize fare using a logarithm so that its distribution looks closer to normal

In [None]:
# Create all categorical variables that we did above for both training and test sets 
all_data['cabin_multiple'] = all_data.Cabin.apply(lambda x: 0 if pd.isna(x) else len(x.split(' ')))
all_data['cabin_adv'] = all_data.Cabin.apply(lambda x: str(x)[0])
all_data['numeric_ticket'] = all_data.Ticket.apply(lambda x: 1 if x.isnumeric() else 0)
all_data['ticket_letters'] = all_data.Ticket.apply(lambda x: ''.join(x.split(' ')[:-1]).replace('.','').replace('/','').lower() if len(x.split(' ')[:-1]) >0 else 0)
all_data['name_title'] = all_data.Name.apply(lambda x: x.split(',')[1].split('.')[0].strip())

# Impute nulls for continuous data 
#all_data.Age = all_data.Age.fillna(training.Age.mean())
all_data.Age = all_data.Age.fillna(training.Age.median())
#all_data.Fare = all_data.Fare.fillna(training.Fare.mean())
all_data.Fare = all_data.Fare.fillna(training.Fare.median())

# Drop null 'embarked' rows. Only 2 instances of this in training and 0 in test 
all_data.dropna(subset=['Embarked'],inplace = True)

# Tried log norm of sibsp (not used)
all_data['norm_sibsp'] = np.log(all_data.SibSp+1)
all_data['norm_sibsp'].hist()

# log norm of fare (used)
all_data['norm_fare'] = np.log(all_data.Fare+1)
all_data['norm_fare'].hist()

# Explicitly convert Pclass into str type
all_data.Pclass = all_data.Pclass.astype(str)

# Create dummy variables from the categorical columns (one-hot encoding via pd.get_dummies)
all_dummies = pd.get_dummies(all_data[['Pclass','Sex','Age','SibSp','Parch','norm_fare','Embarked','cabin_adv','cabin_multiple','numeric_ticket','name_title','train_test']])

# Peek at first 5 items of the dataset
print(all_dummies.head(5))

# Split to train test again
X_train = all_dummies[all_dummies.train_test == 1].drop(['train_test'], axis =1)
X_test = all_dummies[all_dummies.train_test == 0].drop(['train_test'], axis =1)


y_train = all_data[all_data.train_test==1].Survived
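Concatenating the training and test rows before calling `pd.get_dummies` is what keeps the two feature matrices aligned. The sketch below illustrates this on toy frames; `toy_train` and `toy_test` are made-up assumptions, not the real data.

```python
import pandas as pd

# Toy frames: the test frame happens to contain only one port.
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q"], "train_test": 1})
toy_test = pd.DataFrame({"Embarked": ["S", "S"], "train_test": 0})

# Encoding each frame separately: the test frame never sees 'C' or 'Q'.
sep_train = pd.get_dummies(toy_train[["Embarked"]])
sep_test = pd.get_dummies(toy_test[["Embarked"]])
print(list(sep_train.columns))  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
print(list(sep_test.columns))   # ['Embarked_S']

# Encoding the concatenation, then splitting on the marker column.
both = pd.concat([toy_train, toy_test], ignore_index=True)
dummies = pd.get_dummies(both[["Embarked", "train_test"]])
X_tr = dummies[dummies.train_test == 1].drop(columns="train_test")
X_te = dummies[dummies.train_test == 0].drop(columns="train_test")
print(list(X_tr.columns) == list(X_te.columns))  # True
```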

In [None]:
# Scale data 
"""
fit() computes the mean and stdev to be used for later scaling, note it's just a computation with no scaling done.

transform() uses the previously computed mean and stdev to scale the data 
(subtract mean from all values and then divide it by stdev).

fit_transform() does both at the same time. So you can do it with just 1 line of code.
"""
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
all_dummies_scaled = all_dummies.copy()
all_dummies_scaled[['Age','SibSp','Parch','norm_fare']]= scale.fit_transform(all_dummies_scaled[['Age','SibSp','Parch','norm_fare']])

# Peek at first 5 items of the transformed dataset
print(all_dummies_scaled.head(5))

X_train_scaled = all_dummies_scaled[all_dummies_scaled.train_test == 1].drop(['train_test'], axis =1)
X_test_scaled = all_dummies_scaled[all_dummies_scaled.train_test == 0].drop(['train_test'], axis =1)

y_train = all_data[all_data.train_test==1].Survived
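The difference between `fit()`, `transform()` and `fit_transform()` is easier to see on a tiny array. This is a minimal sketch with made-up numbers. Note that the cell above fits the scaler on the combined training and test rows; fitting on the training rows only and reusing those statistics, as done here, is a common alternative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
train = np.array([[1.0], [2.0], [3.0]])
new = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(train)                      # only computes the statistics, no scaling yet
print(scaler.mean_, scaler.scale_)     # mean and (population) standard deviation of `train`

scaled_train = scaler.transform(train) # (x - mean) / stdev using the stored statistics
scaled_new = scaler.transform(new)     # new data is scaled with the statistics of `train`

# fit_transform() is fit() followed by transform() on the same data.
same = StandardScaler().fit_transform(train)
print(np.allclose(scaled_train, same))
```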

## Model Building (Baseline Validation Performance)

Let's see how various models perform with default parameters, using 5-fold cross-validation to get a baseline. A high baseline score doesn't guarantee that a model will actually do better on the eventual test set.

![image.png](attachment:image.png)
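As a rough sketch of what `cv=5` does for a classifier, the example below runs `cross_val_score` on synthetic data (generated with `make_classification`, an assumption standing in for the Titanic features) and compares it with an explicit `StratifiedKFold` splitter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data, standing in for X_train_scaled / y_train.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

# For a classifier, an integer cv uses stratified k-fold splits without shuffling.
default_scores = cross_val_score(model, X, y, cv=5)
explicit_scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5))

print(default_scores)          # one accuracy value per fold
print(default_scores.mean())   # the baseline figure reported in the cells below
print(np.allclose(default_scores, explicit_scores))
```

Each fold is held out once while the model is trained on the other four, so every sample is used for validation exactly once.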

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
Gaussian Naive Bayes (GNB) is a probabilistic classification technique based on Bayes' theorem.
It assumes that the features (also called predictors) are conditionally independent of each other given the class,
and that each continuous feature follows a Gaussian distribution within each class.

StatQuest: https://www.youtube.com/watch?v=H3EjCKtlVog

In [None]:
gnb = GaussianNB()
cv = cross_val_score(gnb,X_train_scaled,y_train,cv=5)
print(cv)
print(cv.mean())

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
In statistics, a logistic model (or logit model) is a statistical model that models the log-odds of an 
event as a linear combination of one or more independent variables. 
![image.png](attachment:image.png)

StatQuest: https://www.youtube.com/watch?v=yIYKR4sgzI8
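The "log-odds as a linear combination" statement can be verified numerically. The sketch below fits a model on synthetic data (an assumption, not the Titanic features) and checks that log(p / (1 - p)) equals the intercept plus the weighted sum of the features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Predicted probability of class 1 and the corresponding log-odds.
p = clf.predict_proba(X)[:, 1]
log_odds = np.log(p / (1 - p))

# Linear combination of the features with the fitted coefficients.
linear = X @ clf.coef_.ravel() + clf.intercept_[0]

print(np.allclose(log_odds, linear))
```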

In [None]:
lr = LogisticRegression(max_iter = 2000)
cv = cross_val_score(lr,X_train_scaled,y_train,cv=5)
print(cv)
print(cv.mean())

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
A decision tree is intuitive and easy to understand: it is basically a set of yes/no (if/else) questions asked about the features.

StatQuest: https://www.youtube.com/watch?v=_L39rN6gz7Y

In [None]:
dt = tree.DecisionTreeClassifier(random_state = 1)
cv = cross_val_score(dt,X_train_scaled,y_train,cv=5)
print(cv)
print(cv.mean())

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
The concept of neighborhood depends on the idea that those close to us tend to be more like us.
Following this notion, KNN classifies a new sample by finding the k training samples closest to it and assigning the most common class among those neighbors (scikit-learn's default is k = 5).

StatQuest: https://www.youtube.com/watch?v=HVXime0nQeI

In [None]:
knn = KNeighborsClassifier()
cv = cross_val_score(knn,X_train_scaled,y_train,cv=5)
print(cv)
print(cv.mean())

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. Trees in the forest use the best split strategy, i.e. equivalent to passing splitter="best" to the underlying DecisionTreeClassifier.

StatQuests: https://www.youtube.com/watch?v=J4Wdy0Wc_xQ and https://www.youtube.com/watch?v=sQ870aTKqiM

In [None]:
rf = RandomForestClassifier(random_state = 1)
cv = cross_val_score(rf,X_train_scaled,y_train,cv=5)
print(cv)
print(cv.mean())

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.

The advantages of support vector machines are:

* Effective in high dimensional spaces.
* Still effective in cases where the number of dimensions is greater than the number of samples.
* Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
* Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

* If the number of features is much greater than the number of samples, choosing the kernel function and the regularization term carefully is crucial to avoid over-fitting.
* SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (enabled in scikit-learn's SVC with probability=True).

SVC, NuSVC and LinearSVC are classes capable of performing binary and multi-class classification on a dataset.

![image.png](attachment:image.png)

StatQuests: https://www.youtube.com/watch?v=efR1C6CvhmE, https://www.youtube.com/watch?v=Toet3EiSFcM and https://www.youtube.com/watch?v=Qc5IyLW_hns

In [None]:
svc = SVC(probability = True)
cv = cross_val_score(svc,X_train_scaled,y_train,cv=5)
print(cv)
print(cv.mean())

https://xgboost.readthedocs.io/en/stable/get_started.html 
https://www.nvidia.com/en-us/glossary/xgboost/
XGBoost and Random Forest are both ensemble methods for machine learning, but they differ in how they combine base models. XGBoost, a gradient boosting algorithm, builds trees sequentially, with each new tree focusing on the errors of the previous ones, while Random Forest uses bagging, where trees are trained independently of each other (so they can be trained in parallel) and combined via averaging. XGBoost often achieves higher accuracy on tabular data but usually requires more tuning and can overfit if not handled carefully. Random Forest has fewer parameters to tune and is typically less prone to overfitting with default settings.

StatQuests: https://www.youtube.com/watch?v=OtD8wVaFm6E, https://www.youtube.com/watch?v=8b1JEDvenQU, https://www.youtube.com/watch?v=ZVFeW798-2I and https://www.youtube.com/watch?v=oRrKeUCEbq8

In [None]:
from xgboost import XGBClassifier
xgb = XGBClassifier(random_state =1)
cv = cross_val_score(xgb,X_train_scaled,y_train,cv=5)
print(cv)
print(cv.mean())

In [None]:
"""
A Voting Classifier combines the predictions of several models.
With "hard" voting each classifier gets 1 vote, "yes" or "no", and the result is just a majority vote.
For this, you generally want an odd number of classifiers to avoid ties.
With "soft" voting the predicted probabilities of the models are averaged.
If the average predicted probability of class 1 is > 50%, the sample is counted as a 1.
"""
from sklearn.ensemble import VotingClassifier
voting_clf = VotingClassifier(estimators = [('lr',lr),('knn',knn),('rf',rf),('gnb',gnb),('svc',svc),('xgb',xgb)], voting = 'soft') 
cv = cross_val_score(voting_clf,X_train_scaled,y_train,cv=5)
print(cv)
print(cv.mean())
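Hard and soft voting can disagree when one model is very confident. The sketch below works through this with plain NumPy on hypothetical probabilities; the three "models" and their outputs are made up for illustration and are not results from the classifiers above.

```python
import numpy as np

# Hypothetical P(Survived = 1) from three models (rows) for two passengers (columns).
probs = np.array([
    [0.90, 0.20],   # model A
    [0.40, 0.45],   # model B
    [0.40, 0.60],   # model C
])

# Hard voting: each model casts one vote, and the majority wins.
votes = (probs > 0.5).astype(int)
hard = (votes.sum(axis=0) > probs.shape[0] / 2).astype(int)

# Soft voting: average the probabilities, then apply the 50% threshold.
soft = (probs.mean(axis=0) > 0.5).astype(int)

print(hard.tolist())  # [0, 0]
print(soft.tolist())  # [1, 0]
```

For the first passenger two of three models vote "no", so hard voting predicts 0, while the confident 0.90 pulls the average to about 0.57 and soft voting predicts 1.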

In [None]:
voting_clf.fit(X_train_scaled,y_train) # train
y_hat_base_vc = voting_clf.predict(X_test_scaled).astype(int) # test
basic_prediction = {'PassengerId': test.PassengerId, 'Survived': y_hat_base_vc}
base_prediction = pd.DataFrame(data=basic_prediction)
base_prediction.to_csv('prediction.csv', index=False)