# Titanic

## Dataset
Here is the dataset with the information about the Titanic passengers: https://raw.githubusercontent.com/horoshenkih/harbour-space-ds210/master/datasets/titanic/train.csv

- The column `Survived` indicates whether the passenger survived.
- The column `PassengerId` is technical, and you may ignore it.

Your task is to predict which passengers survived the Titanic shipwreck.
A more detailed description of the dataset: https://www.kaggle.com/c/titanic/data

## Your task
Train a `RandomForestClassifier` that predicts the value in the column `Survived`.
Select the best parameters using cross-validation and evaluate the quality on a held-out 20% test set.
Use accuracy as the metric.
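
The workflow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference solution: the data is synthetic (`make_classification` stands in for the prepared Titanic features), and the parameter grid values are hypothetical and untuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared Titanic feature matrix (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out 20% of the rows; they are not used during the parameter search.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid: the values are illustrative, not tuned.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# 5-fold cross-validation on the training part only, scored by accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)

# The final quality estimate comes from the untouched 20%.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print("Best params:", search.best_params_)
print("CV accuracy:", search.best_score_)
print("Test accuracy:", test_accuracy)
```

Searching over more than one candidate value per parameter is what makes the cross-validation step meaningful; with a single value there is nothing to select.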

## Keep in mind
- Some columns may contain missing data. Use a combination of `DataFrame.fillna()` and `DataFrame.mean()` to handle missing values.
- Some columns may be categorical. Use `pandas.get_dummies()` to one-hot encode them.
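
Both points can be illustrated on a tiny hand-made frame. The frame `toy` and its values are hypothetical; only the column names echo the dataset. The sketch applies `fillna()` and `mean()` to a single column, which is the per-column form of the same idea.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing age.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Sex": ["male", "female", "female"],
})

# Replace the missing age with the mean of the observed ages: (22 + 26) / 2 = 24.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# One-hot encode the categorical column; each category becomes its own column.
toy = pd.get_dummies(toy, columns=["Sex"])
print(toy)
```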

## Optional task
Submit your solution to Kaggle.
You will need to create a submission file with one `PassengerId,Prediction` pair per line.
Follow the link https://www.kaggle.com/c/titanic/data for further instructions.
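
A minimal sketch of producing such a file with pandas is shown here. The ids and predictions are hypothetical placeholders for real model output, and the header names `PassengerId` and `Survived` are an assumption based on the Kaggle competition page. The CSV is written to an in-memory buffer only so that the result can be printed; passing a file name to `to_csv` writes it to disk instead.

```python
import io

import pandas as pd

# Hypothetical ids and predictions standing in for the real model output.
submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# index=False keeps the row index out of the file, leaving exactly two columns.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```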

**Hint**: you may need to extract categorical features for both train and test using `pandas.get_dummies`.
The standard trick is to concatenate train and test vertically, extract the features, and then split the result back into train and test.
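
The concatenation trick from the hint could look like this on toy data. The frames `train` and `test` and their values are hypothetical; using `keys=` to mark the origin of each row is one possible way to split the result back, not the only one.

```python
import pandas as pd

# Hypothetical miniature train and test frames; "Q" occurs only in test.
train = pd.DataFrame({"Embarked": ["S", "C", "S"], "Survived": [0, 1, 1]})
test = pd.DataFrame({"Embarked": ["Q", "S"]})

# Stack the rows of both frames; keys= labels each row with its origin.
combined = pd.concat(
    [train.drop(columns="Survived"), test], keys=["train", "test"]
)

# Encode once so that both parts get the same set of columns.
encoded = pd.get_dummies(combined, columns=["Embarked"])

# Split back using the outer index level added by keys=.
train_X = encoded.loc["train"]
test_X = encoded.loc["test"]
print(list(train_X.columns))
```

Encoding train and test separately would give the train part no `Embarked_Q` column, and the model would then see differently shaped inputs at prediction time.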

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('https://raw.githubusercontent.com/horoshenkih/harbour-space-ds210/master/datasets/titanic/train.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
df['Age'] = df['Age'].fillna(df['Age'].mean())
dfd = pd.get_dummies(df['Sex'])
df = pd.concat([df, dfd], axis=1)
dfE = pd.get_dummies(df['Embarked'])
df = pd.concat([df, dfE], axis=1)
df['family_members'] = df['SibSp'] + df['Parch']
# Raw string so that \. is passed to the regex engine unchanged
df['Title'] = df.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
dfT = pd.get_dummies(df['Title'])
df = pd.concat([df, dfT], axis=1)


cabin_only = df[["Cabin"]].copy()
cabin_only["Cabin_Data"] = cabin_only["Cabin"].isnull().apply(lambda x: not x)
cabin_only["Deck"] = cabin_only["Cabin"].str.slice(0,1)
cabin_only["Room"] = cabin_only["Cabin"].str.slice(1,5).str.extract("([0-9]+)", expand=False).astype("float")
cabin_only.drop(["Cabin", "Cabin_Data"], axis=1, inplace=True, errors="ignore")
cabin_only["Deck"] = cabin_only["Deck"].fillna("N")
cabin_only["Room"] = cabin_only["Room"].fillna(cabin_only["Room"].mean())
def one_hot_column(df, label, drop_col=False):
    '''
    This function will one hot encode the chosen column.
    Args:
        df: Pandas dataframe
        label: Label of the column to encode
        drop_col: boolean to decide if the chosen column should be dropped
    Returns:
        pandas dataframe with the given encoding
    '''
    one_hot = pd.get_dummies(df[label], prefix=label)
    if drop_col:
        df = df.drop(label, axis=1)
    df = df.join(one_hot)
    return df


def one_hot(df, labels, drop_col=False):
    '''
    This function will one hot encode a list of columns.
    Args:
        df: Pandas dataframe
        labels: list of the columns to encode
        drop_col: boolean to decide if the chosen column should be dropped
    Returns:
        pandas dataframe with the given encoding
    '''
    for label in labels:
        df = one_hot_column(df, label, drop_col)
    return df
cabin_only = one_hot(cabin_only, ["Deck"],drop_col=True)
df = pd.concat([df, cabin_only], axis=1)
df.head()


In [None]:
y = np.array(df['Survived'])

X_features1 = df[['Age', 'Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'Room',
                  'Deck_A', 'Deck_B', 'Deck_C', 'Deck_D', 'Deck_E', 'Deck_F', 'Deck_G', 'Deck_N', 'Deck_T',
                  'C', 'Q', 'S', 'family_members',
                  'Capt', 'Col', 'Countess', 'Don', 'Dr', 'Jonkheer', 'Lady', 'Major', 'Master',
                  'Miss', 'Mlle', 'Mme', 'Mr', 'Mrs', 'Ms', 'Rev', 'Sir']].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_features1, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV

rf_clf = GridSearchCV(
    RandomForestClassifier(),
    {
        'n_estimators': [10000],
    },
    scoring=make_scorer(accuracy_score, greater_is_better=True),
    cv=5
).fit(X_train, y_train)

print("Best params:", rf_clf.best_params_)
print("Best accuracy on validation:", rf_clf.best_score_)
print("Accuracy on test:", accuracy_score(y_test, rf_clf.predict(X_test)))

Best params: {'n_estimators': 10000}
Best accuracy on validation: 0.8216292134831461
Accuracy on test: 0.8044692737430168


In [None]:
df1 = pd.read_csv('https://raw.githubusercontent.com/horoshenkih/harbour-space-ds210/master/datasets/titanic/test.csv')
df1.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [None]:
df1['Age'] = df1['Age'].fillna(df1['Age'].mean())
# Fill the missing fare with the mean fare, not the mean age
df1['Fare'] = df1['Fare'].fillna(df1['Fare'].mean())
dfd1 = pd.get_dummies(df1['Sex'])
df1 = pd.concat([df1, dfd1], axis=1)
dfE1 = pd.get_dummies(df1['Embarked'])
df1 = pd.concat([df1, dfE1], axis=1)
# Extract titles from the test passengers' own names (df1, not the training frame df)
df1['Title'] = df1.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
dfT1 = pd.get_dummies(df1['Title'])
df1 = pd.concat([df1, dfT1], axis=1)
df1['family_members'] = df1['SibSp'] + df1['Parch']
# Add all-zero columns for the training features that do not occur in the test data
expected_cols = ['Deck_T', 'Capt', 'Col', 'Countess', 'Don', 'Dr', 'Jonkheer', 'Lady', 'Major',
                 'Master', 'Miss', 'Mlle', 'Mme', 'Mr', 'Mrs', 'Ms', 'Rev', 'Sir']
for col in expected_cols:
    if col not in df1.columns:
        df1[col] = 0



cabin_only = df1[["Cabin"]].copy()
cabin_only["Cabin_Data"] = cabin_only["Cabin"].isnull().apply(lambda x: not x)
cabin_only["Deck"] = cabin_only["Cabin"].str.slice(0,1)
cabin_only["Room"] = cabin_only["Cabin"].str.slice(1,5).str.extract("([0-9]+)", expand=False).astype("float")
cabin_only.drop(["Cabin", "Cabin_Data"], axis=1, inplace=True, errors="ignore")
cabin_only["Deck"] = cabin_only["Deck"].fillna("N")
cabin_only["Room"] = cabin_only["Room"].fillna(cabin_only["Room"].mean())
# one_hot_column() and one_hot() are already defined in the training preprocessing cell above
cabin_only = one_hot(cabin_only, ["Deck"],drop_col=True)
df1 = pd.concat([df1, cabin_only], axis=1)
df1.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,female,male,C,Q,S,Title,Don,Dr,Master,Miss,Mme,Mr,Mrs,Rev,family_members,Deck_T,Col,Capt,Mlle,Sir,Jonkheer,Lady,Major,Ms,Countess,Room,Deck_A,Deck_B,Deck_C,Deck_D,Deck_E,Deck_F,Deck_G,Deck_N
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0,1,0,1,0,Mr,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,47.651685,0,0,0,0,0,0,0,1
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,1,0,0,0,1,Mrs,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,47.651685,0,0,0,0,0,0,0,1
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,0,1,0,1,0,Miss,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,47.651685,0,0,0,0,0,0,0,1
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,0,1,0,0,1,Mrs,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,47.651685,0,0,0,0,0,0,0,1
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,1,0,0,0,1,Mr,0,0,0,0,0,1,0,0,2,0,0,0,0,0,0,0,0,0,0,47.651685,0,0,0,0,0,0,0,1


In [None]:
X_features2 = df1[['Age', 'Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'Room',
                   'Deck_A', 'Deck_B', 'Deck_C', 'Deck_D', 'Deck_E', 'Deck_F', 'Deck_G', 'Deck_N', 'Deck_T',
                   'C', 'Q', 'S', 'family_members',
                   'Capt', 'Col', 'Countess', 'Don', 'Dr', 'Jonkheer', 'Lady', 'Major', 'Master',
                   'Miss', 'Mlle', 'Mme', 'Mr', 'Mrs', 'Ms', 'Rev', 'Sir']].values
print("y_test :", rf_clf.predict(X_features2))

y_test : [0 0 0 1 0 0 0 1 1 0 0 0 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 1 1 0 0 1 0 0 0 1 0
 0 0 1 1 1 0 1 1 0 0 0 1 1 1 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 1 0 1 0 0 1
 1 0 0 1 0 1 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 0
 1 1 0 1 0 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 1 0 0 0
 0 0 1 1 0 0 0 0 1 0 0 1 1 1 1 0 1 1 1 0 1 0 0 1 0 0 0 1 1 1 1 1 0 0 1 0 1
 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0 1 1 0 1 0 1 0
 1 1 1 0 0 1 0 0 0 1 0 0 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 0 0 0 0 0 0 0 1
 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 0 0
 1 0 0 0 0 1 0 0 1 1 1 1 0 0 0 1 0 1 1 1 1 0 1 1 0 0 1 1 1 0 1 1 1 0 1 0 0
 1 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 1 0 0 1 1 1
 0 1 0 1 1 1 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1 1 0
 1 1 1 1 1 0 0 1 0 0 1]


In [None]:
df1['Survived'] = rf_clf.predict(X_features2)
df1.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,female,male,C,Q,S,Title,Don,Dr,Master,Miss,Mme,Mr,Mrs,Rev,family_members,Deck_T,Col,Capt,Mlle,Sir,Jonkheer,Lady,Major,Ms,Countess,Room,Deck_A,Deck_B,Deck_C,Deck_D,Deck_E,Deck_F,Deck_G,Deck_N,Survived
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0,1,0,1,0,Mr,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,47.651685,0,0,0,0,0,0,0,1,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,1,0,0,0,1,Mrs,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,47.651685,0,0,0,0,0,0,0,1,0
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,0,1,0,1,0,Miss,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,47.651685,0,0,0,0,0,0,0,1,0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,0,1,0,0,1,Mrs,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,47.651685,0,0,0,0,0,0,0,1,1
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,1,0,0,0,1,Mr,0,0,0,0,0,1,0,0,2,0,0,0,0,0,0,0,0,0,0,47.651685,0,0,0,0,0,0,0,1,0


In [None]:
df1[['PassengerId', 'Survived']].to_csv('t.csv', index=False)

In [None]:
from sklearn.model_selection import train_test_split
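# Same test_size and random_state as the earlier split, so df_test holds exactly
# the rows that were kept out of training
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)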

In [None]:
Xtrain = df_train[['Age', 'Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'Room',
                   'Deck_A', 'Deck_B', 'Deck_C', 'Deck_D', 'Deck_E', 'Deck_F', 'Deck_G', 'Deck_N', 'Deck_T',
                   'C', 'Q', 'S', 'family_members',
                   'Capt', 'Col', 'Countess', 'Don', 'Dr', 'Jonkheer', 'Lady', 'Major', 'Master',
                   'Miss', 'Mlle', 'Mme', 'Mr', 'Mrs', 'Ms', 'Rev', 'Sir']].values
Xtest = df_test[['Age', 'Pclass', 'SibSp', 'Parch', 'Fare', 'female', 'male', 'Room',
                 'Deck_A', 'Deck_B', 'Deck_C', 'Deck_D', 'Deck_E', 'Deck_F', 'Deck_G', 'Deck_N', 'Deck_T',
                 'C', 'Q', 'S', 'family_members',
                 'Capt', 'Col', 'Countess', 'Don', 'Dr', 'Jonkheer', 'Lady', 'Major', 'Master',
                 'Miss', 'Mlle', 'Mme', 'Mr', 'Mrs', 'Ms', 'Rev', 'Sir']].values
# Single brackets give a Series for both targets
ytrain = df_train['Survived']
ytest = df_test['Survived']

In [None]:
a = rf_clf.predict(Xtest)

# Work on an explicit copy so that adding a column does not trigger SettingWithCopyWarning
df_test = df_test.copy()
# 1 where the prediction differs from the true label, 0 otherwise
df_test['misclassified'] = abs(a - ytest)
df_test


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,female,male,C,Q,S,family_members,Title,Capt,Col,Countess,Don,Dr,Jonkheer,Lady,Major,Master,Miss,Mlle,Mme,Mr,Mrs,Ms,Rev,Sir,Room,Deck_A,Deck_B,Deck_C,Deck_D,Deck_E,Deck_F,Deck_G,Deck_N,Deck_T,misclassified
709,710,1,3,"Moubarek, Master. Halim Gonios (""William George"")",male,29.699118,1,1,2661,15.2458,,C,0,1,1,0,0,2,Master,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,50.49,0,0,0,0,0,0,0,1,0,0
439,440,0,2,"Kvillner, Mr. Johan Henrik Johannesson",male,31.000000,0,0,C.A. 18723,10.5000,,S,0,1,0,0,1,0,Mr,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,50.49,0,0,0,0,0,0,0,1,0,0
840,841,0,3,"Alhomaki, Mr. Ilmari Rudolf",male,20.000000,0,0,SOTON/O2 3101287,7.9250,,S,0,1,0,0,1,0,Mr,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,50.49,0,0,0,0,0,0,0,1,0,0
720,721,1,2,"Harper, Miss. Annie Jessie ""Nina""",female,6.000000,0,1,248727,33.0000,,S,1,0,0,0,1,1,Miss,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,50.49,0,0,0,0,0,0,0,1,0,0
39,40,1,3,"Nicola-Yarred, Miss. Jamila",female,14.000000,1,0,2651,11.2417,,C,1,0,1,0,0,1,Miss,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,50.49,0,0,0,0,0,0,0,1,0,1
290,291,1,1,"Barber, Miss. Ellen ""Nellie""",female,26.000000,0,0,19877,78.8500,,S,1,0,0,0,1,0,Miss,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,50.49,0,0,0,0,0,0,0,1,0,0
300,301,1,3,"Kelly, Miss. Anna Katherine ""Annie Kate""",female,29.699118,0,0,9234,7.7500,,Q,1,0,0,1,0,0,Miss,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,50.49,0,0,0,0,0,0,0,1,0,0
333,334,0,3,"Vander Planke, Mr. Leo Edmondus",male,16.000000,2,0,345764,18.0000,,S,0,1,0,0,1,2,Mr,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,50.49,0,0,0,0,0,0,0,1,0,0
208,209,1,3,"Carr, Miss. Helen ""Ellen""",female,16.000000,0,0,367231,7.7500,,Q,1,0,0,1,0,0,Miss,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,50.49,0,0,0,0,0,0,0,1,0,0
136,137,1,1,"Newsom, Miss. Helen Monypeny",female,19.000000,0,2,11752,26.2833,D47,S,1,0,0,0,1,2,Miss,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,47.00,0,0,0,1,0,0,0,0,0,0


In [None]:
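# Print the positions of the misclassified test passengers.
# The predictions in `a` are 0 or 1, so compare them with the true labels.
for i in range(len(a)):
  if a[i] != ytest.iloc[i]:
    print(i)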