# Problem Definition

**Titanic Survival Prediction:**

Use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

**Variables and Their Types:**

Survived: Survival -> 0 = No, 1 = Yes

Pclass: Ticket class -> 1 = 1st, 2 = 2nd, 3 = 3rd

Sex: Sex

Age: Age in years

SibSp: # of siblings / spouses aboard the Titanic

Parch: # of parents / children aboard the Titanic

Ticket: Ticket number

Fare: Passenger fare

Cabin: Cabin number

Embarked: Port of Embarkation -> C = Cherbourg, Q = Queenstown, S = Southampton

**Variable Notes:**

Pclass: A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower

Age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

SibSp: The dataset defines family relations in this way...
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

Parch: The dataset defines family relations in this way...
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

# Data Understanding (Exploratory Data Analysis)

## Importing Libraries

**numpy:** A fundamental package for scientific computing with Python

**pandas:** An open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for Python

**matplotlib:** A Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms

**seaborn:** A Python data visualization library based on matplotlib; it provides a high-level interface for drawing attractive and informative statistical graphics

In [None]:
# data analysis libraries:
import numpy as np
import pandas as pd

# data visualization libraries:
import matplotlib.pyplot as plt
import seaborn as sns

# to ignore warnings:
import warnings
warnings.filterwarnings('ignore')

# to display all columns:
pd.set_option('display.max_columns', None)

from sklearn.model_selection import train_test_split, GridSearchCV

## Loading Data

In [None]:
# Read train and test data with pd.read_csv():

train_data = pd.read_csv("./data/titanic/train.csv")
test_data = pd.read_csv("./data/titanic/test.csv")

In [None]:
# copy data in order to avoid any change in the original:

train = train_data.copy()
test = test_data.copy()

## First Look at the Data

In [None]:
# Look at first few lines with head():

train.head()

In [None]:
test.head()

In [None]:
train.dtypes

In [None]:
# Convert some data types into categorical:

train.Pclass = pd.Categorical(train.Pclass)
train.Name = pd.Categorical(train.Name)
train.Sex = pd.Categorical(train.Sex)
train.SibSp = pd.Categorical(train.SibSp)
train.Parch = pd.Categorical(train.Parch)
train.Ticket = pd.Categorical(train.Ticket)
train.Cabin = pd.Categorical(train.Cabin)
train.Embarked = pd.Categorical(train.Embarked)

test.Pclass = pd.Categorical(test.Pclass)
test.Name = pd.Categorical(test.Name)
test.Sex = pd.Categorical(test.Sex)
test.SibSp = pd.Categorical(test.SibSp)
test.Parch = pd.Categorical(test.Parch)
test.Ticket = pd.Categorical(test.Ticket)
test.Cabin = pd.Categorical(test.Cabin)
test.Embarked = pd.Categorical(test.Embarked)

In [None]:
train.dtypes

In [None]:
test.dtypes

**Numerical features:** Age (Continuous), Fare (Continuous), SibSp (Discrete), Parch (Discrete)

**Categorical features:** Survived, Sex, Embarked, Pclass

**Alphanumeric features:** Ticket, Cabin

In [None]:
# Structural information about the data:

train.info()

**Comments:**

The training set contains 891 passengers in total.

Approximately 19.8% of the Age values are missing.

Approximately 77.1% of the Cabin values are missing.

About 0.22% of the Embarked values are missing.
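
Percentages like these can be computed directly rather than by hand. The following is a minimal sketch on a toy frame; the `toy` data and the `missing_pct` name are illustrative assumptions, not values from the Titanic files.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are illustrative only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S", "Q"],
})

# isnull() yields booleans; their column-wise mean is the missing fraction.
missing_pct = (toy.isnull().mean() * 100).round(2)
print(missing_pct)
```

Applied to the real data, the same expression would be `train.isnull().mean() * 100`.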

## Checking of Missing Values and Basic Treatments

In [None]:
# Count the missing values in each column:

train.isnull().sum()

In [None]:
test.isnull().sum()

## Analysis and Visualization of Numeric and Categorical Variables

### Basic summary statistics about the numerical data

In [None]:
train.describe().T

### Classes of some categorical variables

In [None]:
train['Pclass'].value_counts()

In [None]:
train['Sex'].value_counts()

In [None]:
train['SibSp'].value_counts()

In [None]:
train['Parch'].value_counts()

In [None]:
train['Ticket'].value_counts()

In [None]:
train['Cabin'].value_counts()

In [None]:
train['Embarked'].value_counts()

### Visualization

In general, bar plots are used for categorical variables, while histograms, density plots and box plots are used for numerical data.
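
By default, `sns.barplot` draws the mean of `y` within each category, so for a 0/1 target such as Survived the bar height is the survival rate. A small sketch of the same numbers computed with `groupby`, using a toy frame (the data and the `survival_rate` name are assumptions for illustration):

```python
import pandas as pd

# Toy data: eight passengers across three ticket classes (illustrative only).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
survival_rate = toy.groupby("Pclass")["Survived"].mean() * 100
print(survival_rate)
```

This also gives all per-class percentages in one call instead of one filtered `value_counts` per class.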

#### Pclass vs survived:

In [None]:
sns.barplot(x = 'Pclass', y = 'Survived', data = train);

In [None]:
print("Percentage of Pclass1 survived:",  train["Survived"][train["Pclass"] == 1].value_counts(normalize=True)[1]*100)
print("Percentage of Pclass2 survived:",  train["Survived"][train["Pclass"] == 2].value_counts(normalize=True)[1]*100)
print("Percentage of Pclass3 survived:",  train["Survived"][train["Pclass"] == 3].value_counts(normalize=True)[1]*100)

#### Age vs survived:

In [None]:
#sort the ages into logical categories

train["Age_new"] = train["Age"].fillna(-0.5)
test["Age_new"] = test["Age"].fillna(-0.5)
bins = [-1, 0, 5, 12, 18, 24, 35, 60, np.inf]
mylabels = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
train['AgeGroup'] = pd.cut(train["Age_new"], bins, labels = mylabels)
test['AgeGroup'] = pd.cut(test["Age_new"], bins, labels = mylabels)

train.AgeGroup = pd.Categorical(train.AgeGroup)
test.AgeGroup = pd.Categorical(test.AgeGroup)

#draw a bar plot of Age vs. survival
sns.barplot(x="AgeGroup", y="Survived", data=train);

In [None]:
train.pivot_table('Survived', index = 'Sex', columns = 'AgeGroup')

#### SibSp vs survived:

In [None]:
sns.barplot(x = 'SibSp', y = 'Survived', data = train);

In [None]:
print("Percentage of SibSp = 0 who survived:", train["Survived"][train["SibSp"] == 0].value_counts(normalize = True)[1]*100)

print("Percentage of SibSp = 1 who survived:", train["Survived"][train["SibSp"] == 1].value_counts(normalize = True)[1]*100)

print("Percentage of SibSp = 2 who survived:", train["Survived"][train["SibSp"] == 2].value_counts(normalize = True)[1]*100)

print("Percentage of SibSp = 3 who survived:", train["Survived"][train["SibSp"] == 3].value_counts(normalize = True)[1]*100)

print("Percentage of SibSp = 4 who survived:", train["Survived"][train["SibSp"] == 4].value_counts(normalize = True)[1]*100)

#### Parch vs survived:

In [None]:
sns.barplot(x = 'Parch', y = 'Survived', data = train);

#### Sex vs survived:

In [None]:
sns.barplot(x = 'Sex', y = 'Survived', data = train);

In [None]:
print("Percentage of female survived:",  train["Survived"][train["Sex"] == "female"].value_counts(normalize=True)[1]*100)
print("Percentage of male survived:",  train["Survived"][train["Sex"] == "male"].value_counts(normalize=True)[1]*100)

Look at age groups:

In [None]:
sns.barplot(x = 'Sex', y = 'Survived', hue = 'AgeGroup', data = train);

## Report based on visual data

* People with higher socioeconomic class had a higher rate of survival.

* Babies were more likely to survive than any other age group.

* People with more siblings or spouses aboard were less likely to survive. However, contrary to expectations, people with no siblings or spouses were less likely to survive than those with one or two.

* People with fewer than four parents or children aboard were more likely to survive than those with four or more. People traveling alone were less likely to survive than those with 1-3 parents or children.

* Females had a much higher chance of survival than males.

# Data Preparation

## Deleting Unnecessary Variables

In [None]:
train.head()

In [None]:
train.dtypes

**transform sex into numerical data**

**drop cabin, ticket, name, age variables**

### Ticket

In [None]:
# We can drop the Ticket feature since it is unlikely to have useful information

train = train.drop(['Ticket'], axis = 1)
test = test.drop(['Ticket'], axis = 1)

train.head()

### Age_new

In [None]:
# Age_new was only needed to build AgeGroup (unknown ages were set to -0.5), so it can be dropped now.

train = train.drop(['Age_new'], axis = 1)
test = test.drop(['Age_new'], axis = 1)

train.head()

## Outlier Treatment

In [None]:
train.describe().T

The maximum Fare value looks anomalous. Visualize this numerical variable with a boxplot.
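
The whiskers of a boxplot follow the 1.5 * IQR rule used in this section. As a rough sketch of that rule, the hypothetical helper `iqr_fences` below computes both fences on a toy fare series and caps values with `clip`. Note that clipping every outlier is a different choice from the one made in this notebook, which only replaces the single largest fare.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy fares with one extreme value (illustrative only).
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33])
lower, upper = iqr_fences(fares)

# Values outside the fences are flagged as outliers.
outliers = fares[(fares < lower) | (fares > upper)]

# One possible treatment: cap every value at the fences.
capped = fares.clip(lower=lower, upper=upper)
print(lower, upper, outliers.tolist())
```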

In [None]:
# It looks like there is a problem in Fare max data. Visualize with boxplot.

sns.boxplot(x = train['Fare']);

In [None]:
Q1 = train['Fare'].quantile(0.25)
Q3 = train['Fare'].quantile(0.75)
IQR = Q3 - Q1

lower_limit = Q1- 1.5*IQR
lower_limit

upper_limit = Q3 + 1.5*IQR
upper_limit

In [None]:
# observations with Fare data higher than the upper limit:

train['Fare'] > (upper_limit)

In [None]:
outlier_tf = train['Fare'] > (upper_limit)

In [None]:
train["Fare"][train['Fare'] > (upper_limit)]

In [None]:
outliers = train['Fare'][outlier_tf]
outliers.index


In [None]:
train.sort_values("Fare", ascending=False).head()

In [None]:
# The boxplot shows many values above the upper limit, so we cannot change them all. Only cap the most extreme value (512.3292):

train['Fare'] = train['Fare'].replace(512.3292, 300)

In [None]:
train.sort_values("Fare", ascending=False).head()

In [None]:
test.sort_values("Fare", ascending=False)

In [None]:
test['Fare'] = test['Fare'].replace(512.3292, 300)

In [None]:
test.sort_values("Fare", ascending=False)

## Missing Value Treatment

### Age

Titles will be used to fill in the missing Age values. To have more samples per title, combine the train and test data:

In [None]:
# Use titles to fill missing Age value. Combine train and test data:

combine = [train, test]
combine = pd.concat(combine, ignore_index = True)
combine.head()

In [None]:
# Missing values in combine:

combine.isnull().sum()

In [None]:
# Create Title variable in combine; take titles from Name data:

combine["Title"] = combine["Name"].str.extract(r' ([A-Za-z]+)\.', expand=False)

combine.head()
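
To see what the regular expression captures, here is a small sketch on a few names written in the dataset's "Surname, Title. Given names" layout. The `names` series is illustrative input, not a sample drawn from the files.

```python
import pandas as pd

# Names in the "Surname, Title. Given names" layout (illustrative input).
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# Capture the first run of letters that follows a space and ends with a dot.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

The raw-string prefix `r` keeps `\.` as a literal backslash-dot for the regex engine.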

Select the Title, Age and Survived variables from combine, group them by Title with groupby, and apply different functions to different variables with aggregate:

In [None]:
combine[["Title","Age","Survived"]].groupby("Title").aggregate({"Age":["count","mean","median","std"], 
                                                                "Survived": "mean"})

In [None]:
combine[(combine["Age"].isnull()) & (combine["Title"] == "Master")]

In [None]:
combine[["Title","Age","Survived"]].groupby("Title").agg({"Age":["count","mean","median","std", lambda x: x.isnull().sum()], 
                                                                "Survived": "mean"})

**Note:** Both agg and apply can be used to count missing values, but agg makes it easy to apply different functions to different columns.
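
One way to make such a mixed aggregation easier to read is pandas named aggregation, which gives every output column an explicit name. This is a sketch on toy data; the column names `age_mean`, `age_missing` and `survival_rate` are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy data with missing ages (illustrative only).
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss"],
    "Age": [30.0, np.nan, 40.0, 20.0, np.nan],
    "Survived": [0, 0, 1, 1, 1],
})

# Each keyword is an output column: (input column, aggregation function).
summary = toy.groupby("Title").agg(
    age_mean=("Age", "mean"),
    age_missing=("Age", lambda s: s.isnull().sum()),
    survival_rate=("Survived", "mean"),
)
print(summary)
```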

In [None]:
combine[["Title","Age"]].groupby('Title').agg({'Age': lambda x: x.isnull().sum()})

In [None]:
combine[["Title","Age"]].groupby('Title').apply(lambda x: x.isnull().sum())

In [None]:
combine["Age"].isnull().sum()

In [None]:
# Fill each missing Age with the mean age of the passengers who share the same title:

combine["Age"] = combine.groupby("Title")["Age"].transform(lambda x: x.fillna(x.mean()))

combine.head(10)
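
The group-wise fill can be checked on a tiny example. The sketch below uses a toy frame and an equivalent formulation that first broadcasts each group's mean to its rows and then fills only the gaps; the `group_mean` name is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: one missing age per title (illustrative only).
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss"],
    "Age": [30.0, np.nan, 40.0, 20.0, np.nan],
})

# transform("mean") returns one value per row: the mean of that row's group.
group_mean = toy.groupby("Title")["Age"].transform("mean")

# Only the missing entries are replaced; known ages stay untouched.
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy)
```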

In [None]:
combine["Age"].isnull().sum()

In [None]:
# First 891 rows according to PassengerId were in the train data while others were in the test data

train["Age"] = combine["Age"].iloc[:len(train)].values

In [None]:
train.head(10)

In [None]:
test["Age"] = combine["Age"].iloc[len(train):].values

In [None]:
test.tail()

In [None]:
train["Age"].isnull().sum()

In [None]:
test["Age"].isnull().sum()

In [None]:
# Assign again to AgeGroup:

bins = [0, 5, 12, 18, 24, 35, 60, np.inf]
mylabels = ['Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
train['AgeGroup'] = pd.cut(train["Age"], bins, labels = mylabels)
test['AgeGroup'] = pd.cut(test["Age"], bins, labels = mylabels)

#train.AgeGroup = pd.Categorical(train.AgeGroup)
#test.AgeGroup = pd.Categorical(test.AgeGroup)
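
As a quick check of how `pd.cut` assigns ages to these bins, the sketch below runs the same bin edges and labels on a handful of sample ages (the `ages` values are illustrative). Intervals are closed on the right by default, so an age of exactly 5 still falls into the first group.

```python
import numpy as np
import pandas as pd

bins = [0, 5, 12, 18, 24, 35, 60, np.inf]
labels = ['Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']

# Sample ages, including values on and just above a bin edge.
ages = pd.Series([0.42, 5.0, 5.1, 30.0, 80.0])

# Default right=True gives intervals (0, 5], (5, 12], ..., (60, inf].
groups = pd.cut(ages, bins, labels=labels)
print(groups.tolist())
```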

In [None]:
train["Age"].isnull().sum()

In [None]:
test["Age"].isnull().sum()

In [None]:
train["AgeGroup"].isnull().sum()

In [None]:
test["AgeGroup"].isnull().sum()

### Embarked

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

In [None]:
# Fill NA with the most frequent value:

train = train.fillna({"Embarked": "S"})

### Fare

In [None]:
test[test["Fare"].isnull()]

In [None]:
test[["Pclass","Fare"]].groupby("Pclass").mean()

In [None]:
# Fill the missing Fare with a constant chosen from the class means computed above:
test["Fare"] = test["Fare"].fillna(12)

In [None]:
test["Fare"].isnull().sum()

### Cabin

In [None]:
# Create a CabinBool variable that indicates whether a passenger has Cabin data:

train["CabinBool"] = (train["Cabin"].notnull().astype('int'))
test["CabinBool"] = (test["Cabin"].notnull().astype('int'))

train = train.drop(['Cabin'], axis = 1)
test = test.drop(['Cabin'], axis = 1)

train.head()

## Variable Transformation

### Embarked

In [None]:
# Map each Embarked value to a numerical value:

embarked_mapping = {"S": 1, "C": 2, "Q": 3}

train['Embarked'] = train['Embarked'].map(embarked_mapping)
test['Embarked'] = test['Embarked'].map(embarked_mapping)

In [None]:
train.head()

### Sex

In [None]:
# Convert Sex values into 1-0:

from sklearn import preprocessing

lbe = preprocessing.LabelEncoder()
train["Sex"] = lbe.fit_transform(train["Sex"])
# Reuse the encoder fitted on train so both sets share the same mapping:
test["Sex"] = lbe.transform(test["Sex"])

In [None]:
train.head()

### Name - Title

In [None]:
train["Title"] = train["Name"].str.extract(r' ([A-Za-z]+)\.', expand=False)
test["Title"] = test["Name"].str.extract(r' ([A-Za-z]+)\.', expand=False)

In [None]:
train.head()

In [None]:
train['Title'] = train['Title'].replace(['Lady', 'Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Jonkheer', 'Dona'], 'Rare')
# 'Lady' is already mapped to 'Rare' by the line above, so listing it again here would have no effect:
train['Title'] = train['Title'].replace(['Countess', 'Sir'], 'Royal')
train['Title'] = train['Title'].replace('Mlle', 'Miss')
train['Title'] = train['Title'].replace('Ms', 'Miss')
train['Title'] = train['Title'].replace('Mme', 'Mrs')

In [None]:
test['Title'] = test['Title'].replace(['Lady', 'Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Jonkheer', 'Dona'], 'Rare')
# 'Lady' is already mapped to 'Rare' by the line above, so listing it again here would have no effect:
test['Title'] = test['Title'].replace(['Countess', 'Sir'], 'Royal')
test['Title'] = test['Title'].replace('Mlle', 'Miss')
test['Title'] = test['Title'].replace('Ms', 'Miss')
test['Title'] = test['Title'].replace('Mme', 'Mrs')

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train[["Title","PassengerId"]].groupby("Title").count()

In [None]:
train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

In [None]:
# Map each of the title groups to a numerical value

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Royal": 5, "Rare": 6}

train['Title'] = train['Title'].map(title_mapping)

In [None]:
train.isnull().sum()

In [None]:
test['Title'] = test['Title'].map(title_mapping)

In [None]:
test.head()

In [None]:
train = train.drop(['Name'], axis = 1)
test = test.drop(['Name'], axis = 1)

In [None]:
train.head()

### AgeGroup

In [None]:
# Map each Age value to a numerical value:

age_mapping = {'Baby': 1, 'Child': 2, 'Teenager': 3, 'Student': 4, 'Young Adult': 5, 'Adult': 6, 'Senior': 7}
train['AgeGroup'] = train['AgeGroup'].map(age_mapping)
test['AgeGroup'] = test['AgeGroup'].map(age_mapping)

In [None]:
train.head()

In [None]:
#dropping the Age feature for now, might change:

train = train.drop(['Age'], axis = 1)
test = test.drop(['Age'], axis = 1)

In [None]:
train.head()

### Fare

In [None]:
# Map Fare values into groups of numerical values:

train['FareBand'] = pd.qcut(train['Fare'], 4, labels = [1, 2, 3, 4])
test['FareBand'] = pd.qcut(test['Fare'], 4, labels = [1, 2, 3, 4])
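
Calling `pd.qcut` separately on train and test produces different bin edges for the two sets, so the same band number can mean different fare ranges. One possible alternative, sketched here on toy fares, is to learn the edges on the training data and reuse them for the test data. The series and variable names below are assumptions for illustration, and this is not what the notebook itself does.

```python
import numpy as np
import pandas as pd

# Toy fares (illustrative only).
train_fare = pd.Series([7.0, 8.0, 10.0, 14.0, 20.0, 30.0, 60.0, 100.0])
test_fare = pd.Series([5.0, 9.0, 25.0, 300.0])
labels = [1, 2, 3, 4]

# retbins=True also returns the quartile edges learned from the training fares.
train_band, edges = pd.qcut(train_fare, 4, labels=labels, retbins=True)

# Open the outer edges so test fares outside the training range still get a band.
edges = edges.copy()
edges[0], edges[-1] = -np.inf, np.inf

# Apply the training edges to the test fares.
test_band = pd.cut(test_fare, bins=edges, labels=labels)
print(train_band.tolist(), test_band.tolist())
```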

In [None]:
# Drop Fare values:

train = train.drop(['Fare'], axis = 1)
test = test.drop(['Fare'], axis = 1)

In [None]:
train.head()

## Feature Engineering

### Family Size

In [None]:
train.head()

In [None]:
train["FamilySize"] = train_data["SibSp"] + train_data["Parch"] + 1

In [None]:
test["FamilySize"] = test_data["SibSp"] + test_data["Parch"] + 1

In [None]:
# Create indicator features for family-size buckets:

train['Single'] = train['FamilySize'].map(lambda s: 1 if s == 1 else 0)
train['SmallFam'] = train['FamilySize'].map(lambda s: 1 if  s == 2  else 0)
train['MedFam'] = train['FamilySize'].map(lambda s: 1 if 3 <= s <= 4 else 0)
train['LargeFam'] = train['FamilySize'].map(lambda s: 1 if s >= 5 else 0)

In [None]:
train.head()

In [None]:
# Create the same family-size indicator features for the test data:

test['Single'] = test['FamilySize'].map(lambda s: 1 if s == 1 else 0)
test['SmallFam'] = test['FamilySize'].map(lambda s: 1 if  s == 2  else 0)
test['MedFam'] = test['FamilySize'].map(lambda s: 1 if 3 <= s <= 4 else 0)
test['LargeFam'] = test['FamilySize'].map(lambda s: 1 if s >= 5 else 0)

In [None]:
test.head()

### Embarked & Title

In [None]:
# Convert Title and Embarked into indicator values:

train = pd.get_dummies(train, columns = ["Title"])
train = pd.get_dummies(train, columns = ["Embarked"], prefix="Em")

In [None]:
train.head()

In [None]:
test = pd.get_dummies(test, columns = ["Title"])
test = pd.get_dummies(test, columns = ["Embarked"], prefix="Em")

In [None]:
test.head()

### Pclass

In [None]:
# Create categorical values for Pclass:

train["Pclass"] = train["Pclass"].astype("category")
train = pd.get_dummies(train, columns = ["Pclass"],prefix="Pc")

In [None]:
test["Pclass"] = test["Pclass"].astype("category")
test = pd.get_dummies(test, columns = ["Pclass"],prefix="Pc")

In [None]:
train.head()

In [None]:
test.head()

### Ticket

In [None]:
train_data[["Ticket","PassengerId"]].groupby("Ticket").count()

In [None]:
TicketPre = []

for i in list(train_data.Ticket):
    if not i.isdigit() :
        TicketPre.append(i.replace(".","").replace("/","").strip().split(' ')[0]) #Take prefix
    else:
        TicketPre.append("X")
        
train["TicketPre"] = TicketPre
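
The prefix rule in the loop above can be wrapped in a small function so the same logic serves both data sets. `ticket_prefix` is a hypothetical helper name, and the example strings are written in the style of ticket values purely for illustration.

```python
def ticket_prefix(ticket: str) -> str:
    """Return the cleaned ticket prefix, or "X" for purely numeric tickets."""
    if ticket.isdigit():
        return "X"
    # Drop dots and slashes, then keep the part before the first space.
    return ticket.replace(".", "").replace("/", "").strip().split(" ")[0]

examples = ["A/5 21171", "PC 17599", "113803", "STON/O2. 3101282"]
prefixes = [ticket_prefix(t) for t in examples]
print(prefixes)
```

With such a helper, each frame could be handled in one line, for example `train_data["Ticket"].map(ticket_prefix)`.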


In [None]:
train["TicketPre"].head()

In [None]:
train[["TicketPre","Survived"]].groupby("TicketPre").agg({"TicketPre": "count", "Survived": "mean"}).sort_values("Survived", ascending = False)

In [None]:
TicketPre = []
for i in list(test_data.Ticket):
    if not i.isdigit() :
        TicketPre.append(i.replace(".","").replace("/","").strip().split(' ')[0]) #Take prefix
    else:
        TicketPre.append("X")
        
test["TicketPre"] = TicketPre


In [None]:
test["TicketPre"].head()

In [None]:
train = pd.get_dummies(train, columns = ["TicketPre"], prefix="T")

In [None]:
train.head()

In [None]:
test = pd.get_dummies(test, columns = ["TicketPre"], prefix="T")

In [None]:
test.head()

**Check and compare the columns in the train and the test data:**

In [None]:
train.columns

In [None]:
test.columns

In [None]:
set(train.columns) == set(test.columns)

In [None]:
# Columns in both data:

train.columns.intersection(test.columns)

In [None]:
# Columns in train but not in test:

train.columns.difference(test.columns)

In [None]:
test["T_AS"] = 0
test["T_CASOTON"] = 0
test["T_Fa"] = 0
test["T_LINE"] = 0
test["T_PPP"] = 0
test["T_SCOW"] = 0
test["T_SOP"] = 0
test["T_SP"] = 0
test["T_SWPP"] = 0
test["Title_5"] = 0

In [None]:
# Columns in test but not in train:

test.columns.difference(train.columns)

In [None]:
train["T_A"] = 0
train["T_AQ3"] = 0
train["T_AQ4"] = 0
train["T_LP"] = 0
train["T_SCA3"] = 0
train["T_STONOQ"] = 0

# Modeling, Evaluation and Model Tuning

## Splitting the train data

In [None]:
from sklearn.model_selection import train_test_split

predictors = train.drop(['Survived', 'PassengerId'], axis=1)
target = train["Survived"]
x_train, x_val, y_train, y_val = train_test_split(predictors, target, test_size = 0.20, random_state = 0)

In [None]:
x_train.shape

In [None]:
x_val.shape

## Testing with Different Models
I will evaluate the following models on the training data:

* Gaussian Naive Bayes
* Logistic Regression
* Support Vector Machines
* Linear SVC
* Perceptron
* Decision Tree Classifier
* Random Forest Classifier
* KNN or k-Nearest Neighbors
* Stochastic Gradient Descent
* Gradient Boosting Classifier

For each model, we instantiate it, fit it on 80% of the training data, predict on the remaining 20% and check the accuracy.
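
The fit, predict and score routine described above can also be written as a single loop. The sketch below uses synthetic data from `make_classification` and only three of the listed models; the names `X_tr`, `X_va`, `models` and `scores` are illustrative and not taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared Titanic features (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.20, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Fit each model on the 80% split and score it on the held-out 20%.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = round(accuracy_score(y_va, model.predict(X_va)) * 100, 2)

print(scores)
```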

## Model1: Gaussian Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

gaussian = GaussianNB()
gaussian.fit(x_train, y_train)
y_pred = gaussian.predict(x_val)
acc_gaussian = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_gaussian)

## Model2: Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_val)
acc_logreg = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_logreg)

## Model3: Support Vector Machines

In [None]:
from sklearn.svm import SVC

svc = SVC()
svc.fit(x_train, y_train)
y_pred = svc.predict(x_val)
acc_svc = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_svc)

## Model4: Linear SVC

In [None]:
from sklearn.svm import LinearSVC

linear_svc = LinearSVC()
linear_svc.fit(x_train, y_train)
y_pred = linear_svc.predict(x_val)
acc_linear_svc = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_linear_svc)

## Model5: Perceptron

In [None]:
from sklearn.linear_model import Perceptron

perceptron = Perceptron()
perceptron.fit(x_train, y_train)
y_pred = perceptron.predict(x_val)
acc_perceptron = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_perceptron)

## Model6: Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

decisiontree = DecisionTreeClassifier()
decisiontree.fit(x_train, y_train)
y_pred = decisiontree.predict(x_val)
acc_decisiontree = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_decisiontree)

## Model7: Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

randomforest = RandomForestClassifier()
randomforest.fit(x_train, y_train)
y_pred = randomforest.predict(x_val)
acc_randomforest = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_randomforest)

## Model8: KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
y_pred = knn.predict(x_val)
acc_knn = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_knn)

In [None]:
knn_params = {"n_neighbors": np.arange(1,50)}
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, knn_params, cv=10)
knn_cv.fit(x_train, y_train)
print("The best score:" + str(knn_cv.best_score_))
print("The best parameters: " + str(knn_cv.best_params_))

In [None]:
knn = KNeighborsClassifier(3)
knn_tuned = knn.fit(x_train, y_train)
y_pred = knn_tuned.predict(x_val)
acc_knn = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_knn)
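
Instead of copying the best `n_neighbors` by hand into a new classifier, the fitted search object exposes the refitted winner as `best_estimator_`. A minimal sketch on synthetic data follows; the data, the smaller parameter range and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared Titanic features (assumption).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.20, random_state=0)

search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": np.arange(1, 15)}, cv=5)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refitted on all of
# X_tr using the best parameters found by cross-validation.
knn_best = search.best_estimator_
acc = round(accuracy_score(y_va, knn_best.predict(X_va)) * 100, 2)
print(search.best_params_, acc)
```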

## Model9: Stochastic Gradient Descent

In [None]:
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier()
sgd.fit(x_train, y_train)
y_pred = sgd.predict(x_val)
acc_sgd = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_sgd)

## Model10: Gradient Boosting Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gbk = GradientBoostingClassifier()
gbk.fit(x_train, y_train)
y_pred = gbk.predict(x_val)
acc_gbk = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_gbk)

In [None]:
xgb_params = {
        'n_estimators': [100, 500, 1000, 2000],
        'subsample': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5,6],
        'learning_rate': [0.1,0.01,0.02,0.05],
        "min_samples_split": [2,5,10]}

In [None]:
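# Note: despite the variable name, this is scikit-learn's GradientBoostingClassifier, not the XGBoost library.
xgb = GradientBoostingClassifier()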

xgb_cv_model = GridSearchCV(xgb, xgb_params, cv = 10, n_jobs = -1, verbose = 2)

In [None]:
xgb_cv_model.fit(x_train, y_train)

In [None]:
xgb_cv_model.best_params_

In [None]:
xgb = GradientBoostingClassifier(learning_rate = 0.01, 
                    max_depth = 5,
                    min_samples_split = 5,
                    n_estimators = 100,
                    subsample = 0.6)

In [None]:
xgb_tuned =  xgb.fit(x_train,y_train)

In [None]:
y_pred = xgb_tuned.predict(x_val)
acc_gbk = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_gbk)

## Choosing the Best Model

In [None]:
train.head()

In [None]:
test.head()

In [None]:
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 'Linear SVC', 
              'Decision Tree', 'Stochastic Gradient Descent', 'Gradient Boosting Classifier'],
    'Score': [acc_svc, acc_knn, acc_logreg, 
              acc_randomforest, acc_gaussian, acc_perceptron,acc_linear_svc, acc_decisiontree,
              acc_sgd, acc_gbk]})
models.sort_values(by='Score', ascending=False)

# Deployment

In [None]:
test

In [None]:
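# set ids as PassengerId and predict survival;
# select the columns in the same order as the training predictors, because the
# alignment cells above appended the missing columns in a different order:
ids = test['PassengerId']
predictions = xgb_tuned.predict(test[predictors.columns])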

#set the output as a dataframe and convert to csv file named submission.csv
output = pd.DataFrame({ 'PassengerId' : ids, 'Survived': predictions })
output.to_csv('submission.csv', index=False)

In [None]:
output.head()

# Report

# Resources

https://numpy.org/

https://pandas.pydata.org/

https://matplotlib.org/#

http://seaborn.pydata.org/

https://www.kaggle.com/nadintamer/titanic-survival-predictions-beginner

https://www.kaggle.com/startupsci/titanic-data-science-solutions

https://www.kaggle.com/jeffd23/scikit-learn-ml-from-start-to-finish?scriptVersionId=320209