# 1. Import libraries



In [None]:
import pandas as pd
import numpy as np
import missingno
from collections import Counter

# Data visualisation
import seaborn as sns
import matplotlib.pyplot as plt

# Machine learning models
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from catboost import CatBoostClassifier

from sklearn.ensemble import VotingClassifier

# Model evaluation
from sklearn.model_selection import cross_val_score

# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# Remove warnings
import warnings
warnings.filterwarnings('ignore')

# 2. Import and read data

Now import and read the two datasets (train.csv and test.csv) as outlined in the introduction.

In [None]:
#Gia
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")


In [None]:
train.head()

In [None]:
test.head()

# 3. Data description

Gia
The data provided is split into two groups:
1) The training set (train.csv)
2) The testing set (test.csv)

The training set includes a survival column which indicates whether or not the passenger survived. This data set is used to create the machine learning model.
The testing set is used to determine how well the model (generated from the training data set) performs on new unseen data. The testing data set does not provide the passengers' survival status. The model generated predicts the passengers' survival status.

The table below provides all the relevant information about the columns in the data sets:

| Column Name          | Description                                                | Key                    |
| ---------------------| ---------------------------------------------------------- | ---------------------- |
| __PassengerId__      | Passenger Identity                                         |                        | 
| __Survived__         | Passenger survival status                                  | 0 = No, 1 = Yes        | 
| __Pclass__           | Ticket class, a representation of socio-economic status (SES)| 1 = 1st class, 2 = 2nd class, 3 = 3rd class | 
| __Name__             | Passenger's name                                           |                        | 
| __Sex__              | Passenger's sex                                            |                        |
| __Age__              | Passenger's age (in years)                                 |                        |
| __SibSp__            | Number of siblings and/or spouses travelling with the passenger |                        |
| __Parch__            | Number of parents and/or children travelling with the passenger |                        |
| __Ticket__           | Ticket number                                              |                        |
| __Fare__             | Price of the ticket                                        |                        |
| __Cabin__            | Cabin number                                               |                        |
| __Embarked__         | Point of embarkation                                       | C = Cherbourg, Q = Queenstown, S = Southampton |

More information can be found under the [data](https://www.kaggle.com/c/titanic/

# 4. Exploratory Data Analysis (EDA)

Gia
Exploratory data analysis (EDA) is used to gain insight into the data provided, using visualisation tools such as graphs and tables. It allows us to understand the data and draw preliminary conclusions. It also summarises important trends, characteristics and abnormalities in the dataset, which ultimately helps when training the model.

The following is explored and analysed:
- Data Types
- The shape of the data
- Missing values in the data
- Statistics derived from the data

## 4.1 Data types, data shapes, missing data and summary statistics

### 4.1.1 Data Types

In [None]:
# Non-null count and data types of the training
train.info()

Gia <font color='pink'>Observation:</font> The training set has 891 rows and 12 columns: 11 features plus the __target variable (Survived)__. Of the 12 columns, 2 are floats, 5 are integers (including the target) and 5 are objects.

### 4.1.2 Data Shape

In [None]:
#Gia
print("The shape of the training data set: ", train.shape)
print("The shape of the testing data set: ", test.shape)

Gia <font color='pink'>Observation:</font> The testing data set has one fewer column than the training data set: the Survived column. As discussed in section 3, Survived is our response (target) variable and will therefore be predicted by the model derived from the training data.

### 4.1.3 Missing Values

In [None]:
#Gia
#Determine what percentage of data is missing values in each column of the training dataset
totalNumberOfDataPoints = train.isnull().sum().sort_values(ascending=False)
percentMissing = train.isnull().sum()/train.isnull().count()*100
percentMissingRounded = (round(percentMissing, 1)).sort_values(ascending=False)
missingData = pd.concat([totalNumberOfDataPoints, percentMissingRounded], axis=1, keys=['Total missing', '%'])
missingData.head(13)

In [None]:
#Gia
#Determine what percentage of data is missing values in each column of the testing dataset
totalNumberOfDataPoints = test.isnull().sum().sort_values(ascending=False)
percentMissing = test.isnull().sum()/test.isnull().count()*100
percentMissingRounded = (round(percentMissing, 1)).sort_values(ascending=False)
missingData = pd.concat([totalNumberOfDataPoints, percentMissingRounded], axis=1, keys=['Total missing', '%'])
missingData.head(13)

Gia <font color='pink'>Observation:</font> The two tables above show that the training set has missing values in the Cabin, Age and Embarked columns, while the testing set has missing values in the Cabin, Age and Fare columns.
For the training dataset, the Embarked column contains only two missing values, which can easily be dropped or filled. The Age column, on the other hand, has 177 missing values. Dropping those rows would eliminate about 20% of the training data, so the missing ages need to be filled in instead. The approach taken is discussed in section 5.2. Since the Cabin column is missing 77% of its data points, we have decided to drop this column.
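The two cells above repeat the same calculation, so it could be wrapped in a small helper. The sketch below is one possible refactoring rather than the notebook's own code; `missing_summary` and the toy DataFrame are illustrative names and made-up data.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    total_missing = df.isnull().sum()
    percent_missing = (df.isnull().mean() * 100).round(1)
    summary = pd.concat([total_missing, percent_missing], axis=1,
                        keys=['Total missing', '%'])
    return summary.sort_values(by='Total missing', ascending=False)

# Toy data standing in for train.csv (values are made up for illustration)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})
summary = missing_summary(toy)
print(summary)
```

With the real data, the same helper could be called as `missing_summary(train)` and `missing_summary(test)`.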

### 4.1.4 Statistics

In [None]:
#Gia
# Summary of the statistics for the training data set 
train.describe()

Gia
The table above gives an overview of the central tendencies of the numeric data in the training dataset. <br /> <font color='pink'>Observations:</font> 
- 38% of the passengers in the training dataset survived.
- Passenger ages range from 0.4 to 80 years.
- The Fare column appears to contain outliers: its maximum value (512) is far above the 75th percentile and several standard deviations above the mean. We will decide how to deal with these outliers, either by dropping the corresponding rows or by replacing the values with something more appropriate.

## 4.2 Feature analysis

Gia
For feature analysis the training dataset will be split into two categories:
1) Categorical variables
2) Numerical variables

Categorical variables take values belonging to one of two or more categories. Numerical variables take numeric values, which may be discrete counts (such as SibSp) or continuous measurements (such as Fare).
Identifying which variables are categorical and which are numerical helps to structure the data analysis properly. For example, it makes no sense to compute the average of a categorical variable such as sex or class. Note that Sex and Embarked have no intrinsic ordering, whereas Pclass is ordinal (1st, 2nd, 3rd).

### 4.2.1 Categorical variables

Gia
In this data set the categorical variables are:
1) Sex
2) Pclass 
3) Embarked.

#### 4.2.1.1 Categorical variable: Sex

In [None]:
#Gia
# Value counts of the sex column
train['Sex'].value_counts(dropna = False)

Gia <font color='pink'>Observation:</font> There are 263 more male passengers than female passengers in the training dataset. It is reasonable to expect a similar imbalance in the testing dataset.

In [None]:
#Gia
# Mean of survival according to sex
train[['Sex', 'Survived']].groupby('Sex', as_index = False).mean().sort_values(by = 'Survived', ascending = False)

In [None]:
#Gia
# visualisation for the probability of survival according to sex
sns.barplot(x = 'Sex', y ='Survived', data = train)
plt.ylabel('Probability of survival')
plt.title('Survival Probability by Sex')

Gia <font color='pink'>Observation:</font> In the training dataset, female passengers were considerably more likely to survive than male passengers.

#### 4.2.1.2 Categorical variable: Pclass

In [None]:
#Gia
# Value counts of the Pclass column in the training dataset

train['Pclass'].value_counts(dropna = False)

In [None]:
#Gia
# Mean of survival by passenger class in the training dataset

train[['Pclass', 'Survived']].groupby(['Pclass'], as_index = False).mean().sort_values(by = 'Survived', ascending = False)

In [None]:
#Gia 
#Pclass distributions for survived and not survived
ax=sns.kdeplot(train.loc[(train['Survived'] == 0),'Pclass'],shade=True,color='r',label='Not Survived')
ax.legend()
ax=sns.kdeplot(train.loc[(train['Survived'] == 1),'Pclass'],shade=True,color='b',label='Survived')
ax.legend()

plt.title("Passenger Class Distribution - Survived vs Non-Survived", fontsize = 25)
labels = ['First', 'Second', 'Third']
plt.xticks(sorted(train.Pclass.unique()),labels);

In [None]:
#Gia
sns.barplot(x = 'Pclass', y ='Survived', data = train)
plt.ylabel('Survival Probability')
plt.title('Survival Probability by Passenger Class')

<font color='pink'>Observation:</font> The probability of survival decreases from first class to third class. This suggests that first class passengers may have been prioritised during the evacuation. The two graphs above show that Pclass plays an important role in determining whether or not a passenger survived. According to the training dataset, 63% of the 1st class passengers survived, 47% of the 2nd class passengers survived and only 24% of the 3rd class passengers survived.

#### 4.2.1.3 Categorical variables combined: Sex and Pclass

In [None]:
#Gia
# Survival by gender and passenger class
# factorplot was renamed to catplot in seaborn 0.9
sns.catplot(x = 'Pclass', y = 'Survived', hue = 'Sex', data = train, kind = 'bar').despine(left = True)
plt.ylabel('Survival Probability')
plt.title('Survival Probability by Sex and Passenger Class')

#Aiden did this
Gia <font color='pink'>Observation:</font> The graph above indicates that in every class, females were more likely to survive than males.

#### 4.2.1.4 Categorical variable: Embarked

In [None]:
# Value counts of the Embarked column 
# NaN marks the missing values in Embarked
train['Embarked'].value_counts(dropna = False)

In [None]:
#Gia
# Mean of survival by point of embarkation
train[['Embarked', 'Survived']].groupby(['Embarked'], as_index = False).mean().sort_values(by = 'Survived', ascending = False)

In [None]:
#Gia
#Visualisation for the probability of survival according to point of embarkation
sns.barplot(x = 'Embarked', y ='Survived', data = train)
plt.ylabel('Probability of Survival')
plt.title('Survival Probability by Point of Embarkation')

Gia <font color='pink'>Observation:</font> The probability of survival is highest for location C and lowest for location S.
Perhaps many first class passengers embarked at location C; since first class passengers had a higher chance of survival, location C would then also show the highest chance of survival. Alternatively, perhaps many third class passengers embarked at location S; since third class passengers had the lowest chance of survival, location S would then also show the lowest survival probability. This hypothesis is tested in section 4.2.1.5 below.

#### 4.2.1.5 Categorical variables combined: Embarked and Pclass

In [None]:
#Gia
# Visualisation for the relationship between class and embark 
sns.catplot(x = 'Pclass', col = 'Embarked', data = train, kind = 'count')

<font color='pink'>Observation:</font> The hypothesis discussed in section 4.2.1.4 appears to be supported: the majority of the third class passengers embarked at location S.
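The count plots can be complemented with a table of proportions, which makes the class mix at each port easier to compare. The sketch below uses `pd.crosstab` on a small made-up DataFrame (the values are illustrative and are not taken from train.csv); with the real data, the same call could be applied to `train['Embarked']` and `train['Pclass']`.

```python
import pandas as pd

# Toy data standing in for the Embarked and Pclass columns (made-up values)
toy = pd.DataFrame({
    'Embarked': ['C', 'C', 'C', 'S', 'S', 'S', 'S', 'Q'],
    'Pclass':   [1,   1,   3,   3,   3,   2,   1,   3],
})

# normalize='index' turns each row into proportions that sum to 1,
# i.e. the share of each passenger class among those who embarked at that port
class_share = pd.crosstab(toy['Embarked'], toy['Pclass'], normalize='index')
print(class_share.round(2))
```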

### 4.2.2 Numerical variables

Gia
In this dataset, the numerical variables are:
1) SibSp
2) Parch
3) Age
4) Fare

#### 4.2.2.1 Detect outliers in numerical variables

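Gia Outliers are points in the dataset that don't conform with the majority of the data (they are extreme values). Outliers need to be addressed because they tend to skew the data and can cause inaccurate model predictions. The Tukey method is used to detect them: a value is flagged if it lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.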
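As a small worked example of the Tukey fences, the sketch below applies the rule to six made-up values. It also illustrates a detail that is easy to miss: `np.percentile` returns NaN when the input contains missing values, whereas `np.nanpercentile` ignores them. The arrays are illustrative only.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 100], dtype=float)

q1 = np.percentile(values, 25)
q3 = np.percentile(values, 75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the two fences is flagged as an outlier
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, q3, lower_fence, upper_fence, outliers)

# With a missing value present, np.percentile yields NaN,
# while np.nanpercentile computes the quartile from the known values
with_nan = np.append(values, np.nan)
plain_q1 = np.percentile(with_nan, 25)
nan_safe_q1 = np.nanpercentile(with_nan, 25)
print(plain_q1, nan_safe_q1)
```

Here the quartiles are 2.25 and 4.75, so the fences sit at -1.5 and 8.5 and only the value 100 is flagged.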

In [None]:
#Gia
#Function to detect outliers
def detect_outliers(df, n, features):
    """
    Loop through the list of features and detect the outliers in each one. A data point is considered to be
    an outlier if it is less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR. The row indices of the outliers
    found for each feature are collected in a list. Finally, the number of times each index appears in that
    list is counted, and the indices of rows that are outliers in more than n features are returned.
    """
    outlierIndices = []
    for col in features:
        # Note: np.percentile returns NaN for a column with missing values (e.g. Age),
        # so no outliers are flagged for such a column
        Q1 = np.percentile(df[col], 25)
        Q3 = np.percentile(df[col], 75)
        IQR = Q3 - Q1
        outlierStep = 1.5 * IQR
        outlierList = df[(df[col] < Q1 - outlierStep) | (df[col] > Q3 + outlierStep)].index
        outlierIndices.extend(outlierList)
    # Count how many features flag each row as an outlier
    outlierIndices = Counter(outlierIndices)
    multipleOutliers = list(key for key, value in outlierIndices.items() if value > n)
    return multipleOutliers

outliers_to_drop = detect_outliers(train, 2, ['Age', 'SibSp', 'Parch', 'Fare'])
print("Found {} rows that are outliers in more than 2 features: ".format(len(outliers_to_drop)), outliers_to_drop)

In [None]:
#Gia
# Outliers in numerical variables
#Visualise the 10 rows identified above as rows containing outliers
train.loc[outliers_to_drop, :]

#### 4.2.2.2 Correlation among numerical variables

In [None]:
#Gia
#Heatmap of numerical variables
df_num = train[['Age','SibSp','Parch','Fare']]
sns.heatmap(df_num.corr(), annot=True,cmap="RdBu")
plt.title("Correlations Among Numeric Features", fontsize = 18);

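Gia <font color='pink'>Observation:</font> The heatmap above shows that SibSp and Parch are positively correlated, i.e. passengers travelling with siblings or spouses often also travel with parents or children. It will therefore be useful to create a family size feature and an IsAlone feature.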

#### 4.2.2.3 Numerical variable: SibSp

In [None]:
#Gia
# Value counts of the SibSp column 
train['SibSp'].value_counts(dropna = False)

In [None]:
#Gia
# Mean of survival by SibSp
train[['SibSp', 'Survived']].groupby('SibSp', as_index = False).mean().sort_values(by = 'Survived', ascending = False)

In [None]:
#Gia
#Visualisation for probability of survival according to SibSp
sns.barplot(x = 'SibSp', y ='Survived', data = train)
plt.ylabel('Survival Probability')
plt.title('Survival Probability by SibSp')

#### 4.2.2.4 Numerical variable: Parch

In [None]:
#Gia
# Value counts of the Parch column 
train['Parch'].value_counts(dropna = False)

In [None]:
#Gia
# Mean of survival by Parch
train[['Parch', 'Survived']].groupby('Parch', as_index = False).mean().sort_values(by = 'Survived', ascending = False)

In [None]:
#Gia
#Visualisation for probability of survival according to Parch
sns.barplot(x = 'Parch', y ='Survived', data = train)
plt.ylabel('Survival Probability')
plt.title('Survival Probability by Parch')

#### 4.2.2.5 Numerical variable: Age

In [None]:
#Gia
# Passenger age distribution

sns.distplot(train['Age'], label = 'Skewness: %.2f'%(train['Age'].skew()))
plt.legend(loc = 'best')
plt.title('Passenger Age Distribution')

In [None]:
#Gia
# Age distribution by survival
sns.FacetGrid(train, col = 'Survived').map(sns.distplot, 'Age')

In [None]:
sns.kdeplot(train['Age'][train['Survived'] == 0], label = 'Did not survive')
sns.kdeplot(train['Age'][train['Survived'] == 1], label = 'Survived')
plt.xlabel('Age')
plt.legend()
plt.title('Passenger Age Distribution by Survival')

#### 4.2.2.6 Numerical variable: Fare

In [None]:
# Passenger fare distribution
sns.distplot(train['Fare'], label = 'Skewness: %.2f'%(train['Fare'].skew()))
plt.legend(loc = 'best')
plt.title('Passenger Fare Distribution')

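Gia <font color='pink'>Observation:</font> The fare distribution has a strong positive skew. This is addressed with a log transformation in section 5.3.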

### 4.2.3 Correlation between categorical and numerical

#### 4.2.3.1 All variables

In [None]:
sns.heatmap(train[['Survived', 'SibSp', 'Parch', 'Age', 'Fare','Pclass']].corr(), annot = True, fmt = '.2f', cmap='RdBu')

Gia <font color='pink'>Observation:</font> Of the numerical features, Fare shows the strongest positive correlation with survival, although the correlation is only moderate.

# 5. Data preprocessing

This section gets the dataset into a form that can be used to train models. This includes:
- Dealing with outliers
- Dropping and filling missing values
- Data transformation 
- Feature engineering
- Feature encoding

## 5.1 Remove Outliers

In [None]:
# Drop outliers (the drop calls below are commented out, so the row counts do not change)

print("Train Set Before: {} rows".format(len(train)))
#train = train.drop(outliers_to_drop, axis = 0).reset_index(drop = True)
print("Train Set After: {} rows".format(len(train)))
print("Test Set Before: {} rows".format(len(test)))
# test = test.drop(outliers_to_drop_test, axis = 0).reset_index(drop = True)
print("Test Set After: {} rows".format(len(test)))

## 5.2 Drop and fill missing values

In [None]:
# Drop ticket and cabin features from training and test set as they are unique or missing many values
train = train.drop(['Ticket', 'Cabin'], axis = 1)
test = test.drop(['Ticket', 'Cabin'], axis = 1)

Both Ticket and Cabin are dropped here to keep this tutorial simple. If you have the time, it is worth exploring whether they can help improve your model.

In [None]:
train.isnull().sum().sort_values(ascending = False)

In [None]:
# Fill the missing Embarked values with the mode (Embarked has only 3 categories)
mode = train['Embarked'].dropna().mode()[0]
train['Embarked'] = train['Embarked'].fillna(mode)

In [None]:
test.isnull().sum().sort_values(ascending = False)

In [None]:
# Fill missing values for Fare with the median fare
median = test['Fare'].dropna().median()
test['Fare'] = test['Fare'].fillna(median)

In [None]:
# Find the indices of the rows with missing ages
age_nan_indices_train = list(train[train['Age'].isnull()].index)
len(age_nan_indices_train)
age_nan_indices_test = list(test[test['Age'].isnull()].index)


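Age is negatively correlated with SibSp, Parch and Pclass, as shown in section 4. For each training row with a missing age, fill it with the median age of the passengers who share the same SibSp, Parch and Pclass values. If that median is undefined (no similar passenger has a known age), fill it with the overall median age instead. Missing ages in the test set are filled with the median age of the combined training and test data.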
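The same idea can also be expressed without an explicit loop. The sketch below is an alternative formulation on made-up data, not the notebook's own method: it computes each group's median once from the known ages using `groupby(...).transform('median')`, so its results can differ slightly from a loop that recomputes medians after every fill.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the training set (made-up values)
toy = pd.DataFrame({
    'SibSp':  [0, 0, 0, 1, 1, 2],
    'Parch':  [0, 0, 0, 0, 0, 1],
    'Pclass': [3, 3, 3, 1, 1, 2],
    'Age':    [20.0, 30.0, np.nan, 40.0, np.nan, np.nan],
})

# Median age of each (SibSp, Parch, Pclass) group, aligned row by row
group_median = toy.groupby(['SibSp', 'Parch', 'Pclass'])['Age'].transform('median')
overall_median = toy['Age'].median()

# Use the group median where it exists, otherwise fall back to the overall median
toy['Age'] = toy['Age'].fillna(group_median).fillna(overall_median)
print(toy)
```

In this toy example the third row receives 25 (the median of 20 and 30), the fifth row receives 40, and the last row, whose group has no known age, falls back to the overall median of 30.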

In [None]:
for index in age_nan_indices_train:
    median_age = train['Age'].median()
    # Median age of the passengers sharing this row's SibSp, Parch and Pclass
    predict_age = train['Age'][(train['SibSp'] == train.loc[index, 'SibSp'])
                                 & (train['Parch'] == train.loc[index, 'Parch'])
                                 & (train['Pclass'] == train.loc[index, 'Pclass'])].median()
    # Fall back to the overall median when no similar passenger has a known age;
    # .loc avoids the chained assignment train['Age'].iloc[index] = ...
    if np.isnan(predict_age):
        train.loc[index, 'Age'] = median_age
    else:
        train.loc[index, 'Age'] = predict_age

# Use the larger combined sample to fill the missing ages in the test data
combine = pd.concat([train, test], axis = 0).reset_index(drop = True)
median_age = combine['Age'].median()
test.loc[age_nan_indices_test, 'Age'] = median_age

In [None]:
# Make sure there are no more missing ages 
print(train['Age'].isnull().sum())
test['Age'].isnull().sum()

## 5.3 Data transformation

Recall that our passenger fare column has a very high positive skewness. Therefore, we will apply a log transformation to address this issue.
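To see why a log transformation helps, the sketch below applies it to a handful of made-up fares and compares the skewness before and after. The helper name `log_fare` is hypothetical; it mirrors the approach of taking the natural log of positive fares and leaving zero fares at 0, since log(0) is undefined.

```python
import numpy as np
import pandas as pd

def log_fare(fare: float) -> float:
    """Natural log of a positive fare; zero fares are left at 0."""
    return np.log(fare) if fare > 0 else 0.0

# Made-up fares with a long right tail, including one zero fare
fares = pd.Series([0.0, 7.25, 8.05, 13.0, 26.0, 31.0, 71.28, 263.0, 512.33])
logged = fares.map(log_fare)

skew_before = fares.skew()
skew_after = logged.skew()
print('Skewness before: %.2f, after: %.2f' % (skew_before, skew_after))
```

The few very large fares dominate the raw distribution; after the transformation they are pulled much closer to the rest, so the skewness moves towards zero.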

In [None]:
#  fare distribution

sns.distplot(train['Fare'], label = 'Skewness: %.2f'%(train['Fare'].skew()))
plt.legend(loc = 'best')
plt.title('Passenger Fare Distribution')

In [None]:
# In order to reduce skewness in fare, apply log transformation 
train['Fare'] = train['Fare'].map(lambda x: np.log(x) if x > 0 else 0)
test['Fare'] = test['Fare'].map(lambda x: np.log(x) if x > 0 else 0)

In [None]:
# After log transformation

sns.distplot(train['Fare'], label = 'Skewness: %.2f'%(train['Fare'].skew()))
plt.legend(loc = 'best')
plt.title('Fare Distribution After Log Transformation')

## 5.4 Feature engineering

We create new features from existing features to obtain an improved model.

### 5.4.1 Title

In [None]:
train.head()

In [None]:
#Title from name column
train['Title'] = [name.split(',')[1].split('.')[0].strip() for name in train['Name']]
train[['Name', 'Title']].head()
test['Title'] = [name.split(',')[1].split('.')[0].strip() for name in test['Name']]
test[['Name', 'Title']].head()

In [None]:
# Value counts of Title
train['Title'].value_counts()

In [None]:
# visualise the testing titles
test['Title'].value_counts()

In [None]:
# Simplify Title as there are several unique titles that do not necessarily have a trend

train['Title'] = train['Title'].replace(['Dr', 'Rev', 'Col', 'Major', 'Lady', 'Jonkheer', 'Don', 'Capt', 'the Countess',
                                             'Sir'], 'Rare')
train['Title'] = train['Title'].replace(['Mlle', 'Ms'], 'Miss')
train['Title'] = train['Title'].replace('Mme', 'Mrs')

test['Title'] = test['Title'].replace(['Dr', 'Rev', 'Col',  'Capt', 'Dona'], 'Rare')
test['Title'] = test['Title'].replace(['Ms'], 'Miss')


In [None]:
# Drop name column as title has been extracted


train = train.drop('Name', axis = 1)
train.head()

test = test.drop('Name', axis = 1)
test.head()

### 5.4.2 IsAlone

In [None]:
train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
train[['SibSp', 'Parch', 'FamilySize']].head()

test['FamilySize'] = test['SibSp'] + test['Parch'] + 1
test[['SibSp', 'Parch', 'FamilySize']].head()

In [None]:
# Create IsAlone feature as familySize may have more information than we need, leading to overfitting

train['IsAlone'] = 0
train.loc[train['FamilySize'] == 1, 'IsAlone'] = 1

test['IsAlone'] = 0
test.loc[test['FamilySize'] == 1, 'IsAlone'] = 1

In [None]:
# Drop SibSp, Parch and FamilySize as this information is captured by IsAlone

train = train.drop(['SibSp', 'Parch','FamilySize'], axis = 1)
test = test.drop(['SibSp', 'Parch','FamilySize'], axis = 1)
train.head()

### 5.4.3 Age*Class

First convert Age into an ordinal variable. Group Ages into 5 age bands 
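One way to guarantee that the training and test sets are banded identically is to pass explicit bin edges to `pd.cut`. The sketch below is an illustrative alternative on made-up ages; the edges are the band thresholds used in this notebook, with open-ended first and last bands so that any age is covered.

```python
import numpy as np
import pandas as pd

# Band edges: the thresholds used in this notebook, open-ended at both extremes
age_edges = [-np.inf, 16.136, 32.102, 48.068, 64.034, np.inf]

# Made-up ages for illustration
toy_ages = pd.Series([5.0, 16.136, 20.0, 40.0, 50.0, 70.0])

# labels=False returns the integer band number (0 to 4) instead of interval labels;
# intervals are closed on the right, so 16.136 falls in band 0
age_codes = pd.cut(toy_ages, bins=age_edges, labels=False)
print(age_codes.tolist())
```

Applying the same `age_edges` to both data sets means a given age always maps to the same band.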

In [None]:

train['AgeBand'] = pd.cut(train['Age'], 5)
test['AgeBand'] = pd.cut(test['Age'], 5)

In [None]:
train.loc[train['Age'] <= 16.136, 'Age'] = 0
train.loc[(train['Age'] > 16.136) & (train['Age'] <= 32.102), 'Age'] = 1
train.loc[(train['Age'] > 32.102) & (train['Age'] <= 48.068), 'Age'] = 2
train.loc[(train['Age'] > 48.068) & (train['Age'] <= 64.034), 'Age'] = 3
train.loc[train['Age'] > 64.034 , 'Age'] = 4

test.loc[test['Age'] <= 16.136, 'Age'] = 0
test.loc[(test['Age'] > 16.136) & (test['Age'] <= 32.102), 'Age'] = 1
test.loc[(test['Age'] > 32.102) & (test['Age'] <= 48.068), 'Age'] = 2
test.loc[(test['Age'] > 48.068) & (test['Age'] <= 64.034), 'Age'] = 3
test.loc[test['Age'] > 64.034 , 'Age'] = 4

# Drop age band feature
train = train.drop('AgeBand', axis = 1)
test = test.drop('AgeBand', axis = 1)

In [None]:
# Convert ordinal Age into integer
train['Age'] = train['Age'].astype('int')
test['Age'] = test['Age'].astype('int')
train['Age'].dtype

In [None]:
# Create Age*Class

train['Age*Class'] = train['Age'] * train['Pclass']
test['Age*Class'] = test['Age'] * test['Pclass']
train[['Age', 'Pclass', 'Age*Class']].head()

In [None]:
# Bin Fare 
train['FareBand'] = pd.qcut(train['Fare'], 4)
test['FareBand'] = pd.qcut(test['Fare'], 4)
train['FareBand'].head(10)


In [None]:
# Ordinal encoding, similar to Age
train.loc[train['Fare'] <= 2.066, 'Fare'] = 0
train.loc[(train['Fare'] > 2.066) & (train['Fare'] <= 2.671), 'Fare'] = 1
train.loc[(train['Fare'] > 2.671) & (train['Fare'] <= 3.418), 'Fare'] = 2
train.loc[train['Fare'] > 3.418, 'Fare'] = 3

test.loc[test['Fare'] <= 2.066, 'Fare'] = 0
test.loc[(test['Fare'] > 2.066) & (test['Fare'] <= 2.671), 'Fare'] = 1
test.loc[(test['Fare'] > 2.671) & (test['Fare'] <= 3.418), 'Fare'] = 2
test.loc[test['Fare'] > 3.418, 'Fare'] = 3

In [None]:
train = train.drop([ 'FareBand'], axis = 1)
test = test.drop(['FareBand'], axis = 1)

In [None]:
# Convert Fare into integer
train['Fare'] = train['Fare'].astype('int')
test['Fare'] = test['Fare'].astype('int')

## 5.5 Feature encoding 

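Variables must be numeric before they can be used by the machine learning models. Age and Fare were already converted to numeric codes during binning, so the remaining categorical columns (Embarked, Title and Sex) are encoded here.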
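A point worth keeping in mind with label encoding: `LabelEncoder` assigns codes in sorted order of the categories it sees, so fitting separate encoders on two data sets can give the same category different codes. The sketch below demonstrates this on made-up title lists (illustrative data only) and shows the usual remedy of fitting once on the training data and reusing that mapping.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up titles; 'Miss' happens not to occur in the second list
train_titles = ['Mr', 'Mrs', 'Miss', 'Mr']
test_titles = ['Mrs', 'Mr']

# Fitting a separate encoder on the second list builds a different mapping
separate = LabelEncoder().fit_transform(test_titles)

# Fitting once on the first list and reusing the encoder keeps the codes consistent
encoder = LabelEncoder().fit(train_titles)
shared = encoder.transform(test_titles)

print(list(encoder.classes_))
print(list(separate), list(shared))
```

Here 'Mrs' is encoded as 1 by the separately fitted encoder but as 2 by the shared one, which would silently mislead a model trained on the first mapping. Note that `transform` raises an error for categories that were not seen during fitting.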

In [None]:
train.head()

In [None]:
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
# Fit the encoder on the training data and reuse the same mapping for the test data
train['Embarked'] = label.fit_transform(train['Embarked'])
test['Embarked'] = label.transform(test['Embarked'])
train['Title'] = label.fit_transform(train['Title'])
test['Title'] = label.transform(test['Title'])
train['Sex'] = train['Sex'].map({'male': 0, 'female': 1})
test['Sex'] = test['Sex'].map({'male': 0, 'female': 1})

train.head()

In [None]:
train = train.drop('PassengerId', axis = 1)
train.head()

In [None]:
train['Survived'] = train['Survived'].astype('int')
train.head()

In [None]:
test.head()

# 6. Modelling

Scikit-learn is one of the most popular libraries for machine learning in Python, and it is what we will use in the modelling part of this project.

Since Titanic is a classification problem, we need classification models, also known as classifiers, which are trained on our data to make predictions. I highly recommend checking out the scikit-learn [documentation](https://scikit-learn.org/stable/index.html) for more information on the different machine learning models available in the library. I have chosen the following classifiers for the job:

- Logistic regression
- Support vector machines
- K-nearest neighbours
- Gaussian naive bayes
- Perceptron
- Linear SVC
- Stochastic gradient descent
- Decision tree
- Random forest
- CatBoost
- Multi-layer perceptron (MLP)

In this section of the notebook, I will fit the models to the training set as outlined above and evaluate their accuracy at making predictions. Once the best model is determined, I will also do hyperparameter tuning to further boost the performance of the best model.

## 6.1 Split training data

We first need to split our training data into the independent (predictor) variables, represented by X, and the dependent (response) variable, represented by Y.

Y_train is the Survived column of our training set and X_train is every other column in the training set. Our models will learn to classify survival, Y_train, based on X_train and will then make predictions on X_test.

In [None]:
X_train = train.drop('Survived', axis = 1)
Y_train = train['Survived']
X_test = test.drop('PassengerId', axis = 1).copy()  # PassengerId is an identifier, not a predictor
print("X_train shape: ", X_train.shape)
print("Y_train shape: ", Y_train.shape)
print("X_test shape: ", X_test.shape)

## 6.2 Fit model to data and make predictions

This requires 3 simple steps: instantiate the model, fit the model to the training set and predict the data in test set. 

### 6.2.1 Logistic regression

Explanation (not to be included in final submission): In section 6.2, we train our models using the ENTIRE training set (every row that has a Survived value). The models are UNTUNED. We then calculate the accuracy of each model on the TRAINING set data. In other words, we determine how accurate each model is when it is asked to predict the outcome (survival) for the passengers on which it was trained. High scores might be an indication of which algorithms are likely to work well for predicting survival for passengers in the test set (this is the ultimate goal), although high scores could also indicate overfitting, which is undesirable. These scores are summarised in the next section.
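To illustrate why a high training accuracy can be misleading, the sketch below fits a decision tree to purely random data, where there is nothing real to learn. The data is synthetic and the variable names (`X_noise`, `y_noise`) are illustrative; this is a demonstration of overfitting, not part of the Titanic pipeline.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_noise = rng.normal(size=(200, 5))      # features that carry no real signal
y_noise = rng.randint(0, 2, size=200)    # labels assigned at random

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_noise, y_noise)

# Accuracy on the same rows the tree was trained on
train_accuracy = tree.score(X_noise, y_noise)
# Accuracy estimated on held-out folds
cv_accuracy = cross_val_score(tree, X_noise, y_noise, cv=5, scoring='accuracy').mean()

print('Training accuracy: %.2f' % train_accuracy)
print('Cross-validated accuracy: %.2f' % cv_accuracy)
```

An unrestricted tree can memorise the training rows and score perfectly on them, while its accuracy on held-out rows stays close to chance.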

In [None]:
#rael
logreg = LogisticRegression()
LGtrained=logreg.fit(X_train, Y_train)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)


### 6.2.2 Support vector machines

In [None]:
#rael
svc = SVC()
svc.fit(X_train, Y_train)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)


### 6.2.3 K-nearest neighbours (KNN)

In [None]:
#rael

knn = KNeighborsClassifier(n_neighbors = 5)
KNNtrained=knn.fit(X_train, Y_train)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)

### 6.2.4 Gaussian naive bayes

In [None]:
#rael
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)

### 6.2.5 Perceptron

In [None]:
#rael
perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)

### 6.2.6 Linear SVC

In [None]:
#rael
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)

### 6.2.7 Stochastic gradient descent

In [None]:
#rael
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)

### 6.2.8 Decision tree

In [None]:
#rael
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)

### 6.2.9 Random forest

In [None]:
#rael
random_forest = RandomForestClassifier(n_estimators = 100)
random_forest.fit(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)

### 6.2.10 CatBoost

In [None]:
#rael
catboost = CatBoostClassifier()
catboost.fit(X_train, Y_train)
acc_catboost = round(catboost.score(X_train, Y_train) * 100, 2)

### 6.2.11 Multi-layer perceptron (MLP)

In [None]:
#rael
# Multi-layer perceptron (MLP)
mlp = MLPClassifier()
mlp.fit(X_train, Y_train)
acc_mlp = round(mlp.score(X_train, Y_train) * 100, 2)

In [None]:
#rael
#acc_catboost

## 6.3 Model evaluation and hyperparameter tuning

Once all our models have been trained, the next step is to assess the performance of these models and select the one which has the highest prediction accuracy. 

### 6.3.1 Training accuracy

Training accuracy shows how well our model has learned from the training set. 

Internal comment: Viewing and summarising the scores calculated above for each algorithm. These models have not yet been tuned.

In [None]:
models = pd.DataFrame({'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
                                 'Random Forest', 'Naive Bayes', 'Perceptron', 'Stochastic Gradient Descent', 
                                 'Linear SVC', 'Decision Tree', 'CatBoost','MLP'],
                       'Score': [acc_svc, acc_knn, acc_log, acc_random_forest, acc_gaussian, acc_perceptron,
                                 acc_sgd, acc_linear_svc, acc_decision_tree, acc_catboost, acc_mlp]})

models.sort_values(by = 'Score', ascending = False, ignore_index = True)

### 6.3.2 K-fold cross validation

It is important not to get carried away by models with impressive training accuracy. What we should focus on instead is the model's ability to predict out-of-sample data, in other words, data our model has not seen before.

This is where k-fold cross validation comes in. In k-fold cross validation the training set is split into k equally sized subsets (folds). Each fold is held out once for testing while the model is trained on the remaining k-1 folds, and the k resulting scores are averaged. Here is a great [video](https://www.youtube.com/watch?v=fSytzGwwBVw) explaining the concept in more detail. 
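To make the procedure concrete, the sketch below performs 5-fold cross validation by hand and compares the result with `cross_val_score`. It runs on synthetic data from `make_classification`, and names such as `X_demo` and `manual_scores` are illustrative. `StratifiedKFold` is used because that is the splitter scikit-learn applies to classifiers when `cv` is given as an integer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data standing in for X_train and Y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, holdout_idx in folds.split(X_demo, y_demo):
    # Train on four folds, then score on the fold that was held out
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[holdout_idx], y_demo[holdout_idx]))

# The library helper performs the same loop internally
library_scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo,
                                 scoring='accuracy', cv=folds)

print(np.round(manual_scores, 3), np.mean(manual_scores))
print(np.round(library_scores, 3), library_scores.mean())
```

Each row is used for testing exactly once, so the mean of the five scores is an estimate of out-of-sample accuracy that does not depend on a single lucky or unlucky split.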

In [None]:
# Create a list which contains classifiers 

classifiers = []
classifiers.append(LogisticRegression())
classifiers.append(SVC())
classifiers.append(KNeighborsClassifier(n_neighbors = 5))
classifiers.append(GaussianNB())
classifiers.append(Perceptron())
classifiers.append(LinearSVC())
classifiers.append(SGDClassifier())
classifiers.append(DecisionTreeClassifier())
classifiers.append(RandomForestClassifier())
classifiers.append(CatBoostClassifier())
classifiers.append(MLPClassifier())


len(classifiers)

In [None]:
# Create a list which contains cross validation results for each classifier

cv_results = []
for classifier in classifiers:  # each result holds one accuracy score per fold (5 scores when cv = 5)
    cv_results.append(cross_val_score(classifier, X_train, Y_train, scoring = 'accuracy', cv = 5))  # other values of cv can be tried

In [None]:
# Mean and standard deviation of cross validation results for each classifier  

cv_mean = []
cv_std = []
for cv_result in cv_results:
    cv_mean.append(cv_result.mean())
    cv_std.append(cv_result.std())

In [None]:
# The algorithm names must be listed in the same order as the classifiers list above
cv_res = pd.DataFrame({'Cross Validation Mean': cv_mean, 'Cross Validation Std': cv_std,
                       'Algorithm': ['Logistic Regression', 'Support Vector Machines', 'KNN', 'Gaussian Naive Bayes',
                                     'Perceptron', 'Linear SVC', 'Stochastic Gradient Descent', 'Decision Tree',
                                     'Random Forest', 'CatBoost', 'MLP']})
cv_res.sort_values(by = 'Cross Validation Mean', ascending = False, ignore_index = True)

In [None]:
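# Sort once so that the bars and their error bars (xerr) stay aligned
cv_sorted = cv_res.sort_values(by = 'Cross Validation Mean', ascending = False)
sns.barplot(x = 'Cross Validation Mean', y = 'Algorithm', data = cv_sorted, palette = 'Set3', xerr = cv_sorted['Cross Validation Std'].values)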
plt.ylabel('Algorithm')
plt.title('Cross Validation Scores')

The table and bar plot above rank the models by their cross validation mean. The cells below proceed with the MLP classifier, which is tuned here and later combined with KNN and logistic regression in an ensemble.

### 6.3.3 Hyperparameter tuning for MLP

Hyperparameter tuning is the process of searching for the model settings that give the best cross validation score. Here I will tune the `alpha` parameter of the MLP classifier using GridSearchCV.

In [None]:
# param_grid = {'n_neighbors': [1,2,3,4,5,6],  
#              # 'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 
#              # 'kernel': ['rbf']}  
# }
# grid = GridSearchCV(KNeighborsClassifier(), param_grid, refit = True, verbose = 3) 

# grid.fit(X_train, Y_train) 

In [None]:
# alpha is the strength of the L2 regularisation term in MLPClassifier
param_grid = {'alpha': [0, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1]}

grid = GridSearchCV(MLPClassifier(), param_grid, refit = True, verbose = 3) 

grid.fit(X_train, Y_train) 

In [None]:
print("Best parameters: ", grid.best_params_) 
print("Best estimator: ", grid.best_estimator_)  # the model refit on the full training set with the best parameters (refit = True)
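A short sketch may help with what `best_estimator_` holds. With `refit = True`, `GridSearchCV` refits a copy of the estimator on the whole training set using the best parameter combination and stores it as `best_estimator_`, so it can be used for prediction directly. The example below uses synthetic data and `LogisticRegression` purely to keep it fast; the data, the `C` grid and the `demo` names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for X_train / Y_train (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_grid = GridSearchCV(LogisticRegression(max_iter=1000),
                         {'C': [0.01, 0.1, 1, 10]}, cv=5, refit=True)
demo_grid.fit(X_demo, y_demo)

# best_estimator_ is a fitted model that carries the winning parameters
best_model = demo_grid.best_estimator_
print('Best parameters:', demo_grid.best_params_)
print('C on best estimator:', best_model.get_params()['C'])
print('Mean CV score of best parameters:', demo_grid.best_score_)

# It can predict straight away; demo_grid.predict delegates to the same model
demo_pred = best_model.predict(X_demo)
```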

In [None]:
# Training accuracy

# #svc = KNeighborsClassifier(C = 100, gamma = 0.01, kernel = 'rbf')
# knn = KNeighborsClassifier(n_neighbors=6)
# knn.fit(X_train,Y_train)
# #svc.fit(X_train, Y_train)
# Y_pred = knn.predict(X_test)
# acc_svc = round(knn.score(X_train, Y_train) * 100, 2)
# acc_svc

In [None]:
# Tuned MLP: best_estimator_ has already been refit on the full training set by GridSearchCV
mlp = grid.best_estimator_
trainedAndTunedlMLP = mlp.fit(X_train, Y_train)  # fit returns the estimator itself; reused in the ensemble in 6.3.4
Y_pred = mlp.predict(X_test)

# Training accuracy of the tuned MLP
acc_mlp_tuned = round(mlp.score(X_train, Y_train) * 100, 2)
acc_mlp_tuned

In [None]:
# Mean cross validation score

cross_val_score(mlp, X_train, Y_train, scoring = 'accuracy', cv = 5).mean()

Our mean cross validation score improved slightly.

In [None]:
# Survival predictions by the tuned MLP classifier

Y_pred

In [None]:
len(Y_pred)

### 6.3.4 Ensembles

In [None]:
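# Base models for the ensemble: the tuned MLP from 6.3.3 plus the KNN and logistic regression models defined earlier
best_trained_MLP = trainedAndTunedlMLP  # grid.best_estimator_, i.e. the MLP with the best alpha from the grid search
trained_knn = KNNtrained
trained_lg = LGtrained

# Soft voting averages the predicted class probabilities of the base models
voting_clf_soft = VotingClassifier(estimators = [('mlp', best_trained_MLP), ('knn', trained_knn), ('lg', trained_lg)], voting = 'soft')

# Compute the cross validation scores once and reuse them
voting_scores = cross_val_score(voting_clf_soft, X_train, Y_train, cv = 5)
print('voting_clf_soft :', voting_scores)
print('voting_clf_soft mean :', voting_scores.mean())

# Grid search over the relative weights given to each base model
params = {'weights' : [[1,1,1],[1,2,1],[1,1,2],[2,1,1],[2,2,1],[1,2,2],[2,1,2]]}

vote_weight = GridSearchCV(voting_clf_soft, param_grid = params, cv = 5, verbose = True, n_jobs = -1)
best_clf_weight = vote_weight.fit(X_train, Y_train)
voting_clf_sub = best_clf_weight.best_estimator_.predict(X_test)  # predictions from the best-weighted ensemble

# The predictions submitted in section 7 come from the equally weighted ensemble
voting_clf_soft.fit(X_train, Y_train)
Y_pred = voting_clf_soft.predict(X_test).astype(int)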
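The soft-voting ensemble above predicts the class with the highest average predicted probability across its base models. The sketch below illustrates this on synthetic data by averaging `predict_proba` by hand and comparing the result with `VotingClassifier`. The data, the three base models and the `demo` names are assumptions chosen to keep the example fast and deterministic; they are not the models used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data split into rows to fit on and rows to predict (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_fit, y_fit, X_new = X_demo[:200], y_demo[:200], X_demo[200:]

demo_vote = VotingClassifier(
    estimators=[('lg', LogisticRegression(max_iter=1000)),
                ('knn', KNeighborsClassifier(n_neighbors=5)),
                ('nb', GaussianNB())],
    voting='soft')
demo_vote.fit(X_fit, y_fit)

# Average the class probabilities of the fitted base models by hand
avg_proba = np.mean([est.predict_proba(X_new) for est in demo_vote.estimators_], axis=0)
manual_pred = demo_vote.classes_[avg_proba.argmax(axis=1)]

print('Probabilities match:', np.allclose(avg_proba, demo_vote.predict_proba(X_new)))
print('Predictions match:', (manual_pred == demo_vote.predict(X_new)).all())
```

`VotingClassifier.fit` clones and refits every estimator it is given, so models that were trained beforehand are trained again inside the ensemble.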

# 7. Preparing data for submission

In [None]:
ss.head()

In [None]:
ss.shape

We want our submission dataframe to have 418 rows and 2 columns, PassengerId and Survived. 

In [None]:
# Create submission dataframe

submit = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': Y_pred})
submit.head()

In [None]:
submit.shape
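The shape requirement can also be checked programmatically before saving. The helper below is a hypothetical sketch: `check_submission` is not part of pandas or Kaggle's tooling, and the toy frame only stands in for `submit`.

```python
import pandas as pd

def check_submission(df, expected_rows=418):
    """Hypothetical helper: raise AssertionError if the frame is not in the expected format."""
    assert list(df.columns) == ['PassengerId', 'Survived'], 'unexpected columns'
    assert df.shape == (expected_rows, 2), 'unexpected shape'
    assert df['PassengerId'].is_unique, 'duplicate passenger ids'
    assert df['Survived'].isin([0, 1]).all(), 'Survived must be 0 or 1'
    return True

# Toy frame standing in for `submit`: 418 made-up rows with alternating labels
demo_submit = pd.DataFrame({'PassengerId': range(892, 1310), 'Survived': [0, 1] * 209})
print(check_submission(demo_submit))
```

In the notebook itself the call would be `check_submission(submit)`.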

Our dataframe is ready for submission!

In [None]:
# Create and save csv file 

submit.to_csv("submission.csv", index = False)

# 8. Possible extensions to improve model accuracy

1. Analyse ticket and cabin features
    - Do these features help predict passenger survival?
    - If yes, consider including them in the training set instead of dropping
2. Come up with alternative features in feature engineering
    - Are there any other features you can create from the existing features in the dataset?
3. Remove features that are less important
    - Does removing features help reduce overfitting in the model?
4. Ensemble modelling
    - This is a more advanced technique whereby you combine prediction results from multiple machine learning models
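As a starting point for the third idea, here is a minimal sketch of ranking features by importance and comparing cross validation scores with and without the weaker ones. It runs on synthetic data; the feature names `f0` to `f7`, the random forest settings and the choice of keeping the top four features are all assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: 3 informative features and 5 noise features (illustrative only)
X_arr, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3,
                                    n_redundant=0, random_state=0)
feature_names = [f'f{i}' for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

# Rank features by the forest's impurity-based importances
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)
importances = pd.Series(forest.feature_importances_, index=feature_names).sort_values(ascending=False)
top_features = importances.index[:4].tolist()

# Compare cross validation accuracy with all features and with the top four only
score_all = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                            X_demo, y_demo, cv=5).mean()
score_top = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                            X_demo[top_features], y_demo, cv=5).mean()

print(importances)
print('All features:', score_all, '| Top 4 features:', score_top)
```

Because the features here are selected using the full dataset, the second score is somewhat optimistic; a stricter comparison would perform the selection inside each fold.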

# 9. Conclusion

You should achieve a submission score of 0.77511 if you follow exactly what I have done in this notebook. In other words, I have correctly predicted 77.5% of the test set. I highly encourage you to work through this project again and see if you can improve on this result.

If you find any mistakes in the notebook or places where I can improve, feel free to reach out to me. Let's help each other get better - happy learning!

My platforms: 
- [Facebook](https://www.facebook.com/chongjason914)
- [Instagram](https://www.instagram.com/chongjason914)
- [Twitter](https://www.twitter.com/chongjason914)
- [LinkedIn](https://www.linkedin.com/in/chongjason914)
- [YouTube](https://www.youtube.com/channel/UCQXiCnjatxiAKgWjoUlM-Xg?view_as=subscriber)
- [Medium](https://www.medium.com/@chongjason)

## References
https://github.com/chongjason914/kaggle-titanic 

https://www.kaggle.com/code/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy

https://github.com/murilogustineli/Titanic-Classification

https://www.kaggle.com/code/kenjee/titanic-project-example/notebook

