<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Gathering-data" data-toc-modified-id="Gathering-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Gathering data</a></span></li><li><span><a href="#Preparing-the-data---cleaning-and-exploration" data-toc-modified-id="Preparing-the-data---cleaning-and-exploration-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preparing the data - cleaning and exploration</a></span><ul class="toc-item"><li><span><a href="#Title-column" data-toc-modified-id="Title-column-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Title column</a></span></li><li><span><a href="#Age-column" data-toc-modified-id="Age-column-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Age column</a></span></li><li><span><a href="#Fare-column" data-toc-modified-id="Fare-column-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Fare column</a></span></li><li><span><a href="#Sex-column" data-toc-modified-id="Sex-column-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Sex column</a></span></li><li><span><a href="#Embarked-column" data-toc-modified-id="Embarked-column-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Embarked column</a></span></li><li><span><a href="#FamilySize-and-IsAlone-columns" data-toc-modified-id="FamilySize-and-IsAlone-columns-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>FamilySize and IsAlone columns</a></span></li><li><span><a href="#Cabin-column" data-toc-modified-id="Cabin-column-2.7"><span class="toc-item-num">2.7&nbsp;&nbsp;</span>Cabin column</a></span></li><li><span><a href="#Other-columns" data-toc-modified-id="Other-columns-2.8"><span class="toc-item-num">2.8&nbsp;&nbsp;</span>Other columns</a></span></li><li><span><a href="#Select-features" data-toc-modified-id="Select-features-2.9"><span class="toc-item-num">2.9&nbsp;&nbsp;</span>Select features</a></span></li></ul></li><li><span><a href="#Choosing-a-model" data-toc-modified-id="Choosing-a-model-3"><span 
class="toc-item-num">3&nbsp;&nbsp;</span>Choosing a model</a></span></li><li><span><a href="#Training-models" data-toc-modified-id="Training-models-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Training models</a></span></li><li><span><a href="#Evaluating-models" data-toc-modified-id="Evaluating-models-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Evaluating models</a></span></li><li><span><a href="#Hyperparameter-Tuning" data-toc-modified-id="Hyperparameter-Tuning-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Hyperparameter Tuning</a></span><ul class="toc-item"><li><span><a href="#Tuning-with-RandomizedSearchCV" data-toc-modified-id="Tuning-with-RandomizedSearchCV-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Tuning with RandomizedSearchCV</a></span></li><li><span><a href="#Tuning-with-GridSearchCV" data-toc-modified-id="Tuning-with-GridSearchCV-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Tuning with GridSearchCV</a></span><ul class="toc-item"><li><span><a href="#GradientBoostingClassifier" data-toc-modified-id="GradientBoostingClassifier-6.2.1"><span class="toc-item-num">6.2.1&nbsp;&nbsp;</span>GradientBoostingClassifier</a></span></li><li><span><a href="#XGBClassifier" data-toc-modified-id="XGBClassifier-6.2.2"><span class="toc-item-num">6.2.2&nbsp;&nbsp;</span>XGBClassifier</a></span></li><li><span><a href="#RandomForestClassifier" data-toc-modified-id="RandomForestClassifier-6.2.3"><span class="toc-item-num">6.2.3&nbsp;&nbsp;</span>RandomForestClassifier</a></span></li></ul></li></ul></li><li><span><a href="#Making-Predictions" data-toc-modified-id="Making-Predictions-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Making Predictions</a></span></li></ul></div>


# **Introduction**

This notebook explores the Titanic dataset end to end: cleaning the data, engineering features, and comparing several scikit-learn and XGBoost classifiers. It then repeats the modeling with H2O AutoML and generates the submission file for Kaggle.

## **Dataset**

Download `train.csv` and `test.csv` from the link below and upload them to the Google Colab runtime before running the notebook.

https://www.kaggle.com/c/titanic/data

# Gathering data

First load the training and test data in two separate DataFrames.

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
combined = [train_df,test_df]
combined_df = pd.concat(combined, sort=False)

# Preparing the data - cleaning and exploration

Let's have a look at the first 5 rows.

In [None]:
train_df.head(5)

And display some statistics about numerical columns.

In [None]:
train_df.describe()

Let's see how many missing values we have.

In [None]:
combined_df.isnull().sum()

## Title column

Here I will write a function that creates a Title column extracted from the Name column, merges some synonyms, and finally replaces all titles with fewer than 10 occurrences with 'Misc'.

In [None]:
def create_title_column(dataframe):
    dataframe['Title'] = dataframe['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
    dataframe['Title'] = dataframe['Title'].replace('Mlle', 'Miss')
    dataframe['Title'] = dataframe['Title'].replace('Ms', 'Miss')
    dataframe['Title'] = dataframe['Title'].replace('Mme', 'Mrs')
    
    stat_min = 10 #while small is arbitrary, we'll use the common minimum in statistics: http://nicholasjjackson.com/2012/03/08/sample-size-is-10-a-magic-number/
    title_names = (dataframe['Title'].value_counts() < stat_min) #this will create a true false series with title name as index
    #apply and lambda functions are quick and dirty code to find and replace with fewer lines of code: https://community.modeanalytics.com/python/tutorial/pandas-groupby-and-python-lambda-functions/
    dataframe['Title'] = dataframe['Title'].apply(lambda x: 'Misc' if title_names.loc[x] == True else x)
    
    return dataframe['Title']
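
As a small illustration of the extraction step, here is a minimal sketch on a few made-up names in the same "Surname, Title. Given names" layout. It uses `str.extract` with a regular expression instead of two `split` calls, and a fixed set of common titles instead of the count threshold, because the toy sample is too small for counts to be meaningful. The names and the `common` set are assumptions for the example only.

```python
import pandas as pd

# Toy names in the "Surname, Title. Given names" layout (made up for illustration).
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Sagesse, Mlle. Emma",
    "Uruchurtu, Don. Manuel E",
])

# Capture the text between ", " and the first ".".
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)

# Merge synonyms, then relabel anything outside the assumed common set as 'Misc'.
titles = titles.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})
common = {"Mr", "Mrs", "Miss", "Master"}
titles = titles.where(titles.isin(common), "Misc")

print(titles.tolist())
```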

In [None]:
for df in combined:
    df['Title'] = create_title_column(df)

print(train_df['Title'].value_counts())

Finally, the column is converted to integer category codes (Master=0, Misc=1, Miss=2, Mr=3, Mrs=4).

In [None]:
for df in combined:
    df['Title'] = pd.Categorical(df['Title']).codes
    
print(train_df['Title'].value_counts())
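
One thing to keep in mind: `pd.Categorical(...).codes` numbers the categories that are present in each frame, so the mapping above only holds if the training and test data contain the same set of titles. A hedged sketch of how the codes could be pinned to one explicit category list (the toy series and the `categories` list are assumptions for illustration):

```python
import pandas as pd

train_titles = pd.Series(["Mr", "Miss", "Master", "Misc", "Mrs"])
test_titles = pd.Series(["Mr", "Mrs", "Miss"])  # 'Master' and 'Misc' are absent here

# Without explicit categories, each series gets its own numbering.
independent_test_codes = pd.Categorical(test_titles).codes

# With a shared category list, both series use the same mapping.
categories = ["Master", "Misc", "Miss", "Mr", "Mrs"]
train_codes = pd.Categorical(train_titles, categories=categories).codes
test_codes = pd.Categorical(test_titles, categories=categories).codes

print(list(independent_test_codes))
print(list(train_codes), list(test_codes))
```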

## Age column

There are 263 passengers with a missing Age. We'll fill the missing values with a mean.  
But let's first explore how age is distributed across Pclass and whether it differs between male and female passengers:

In [None]:
plot = sns.violinplot(x='Pclass', y='Age', data=train_df, inner='quartile', hue='Sex', split=True)
plot_title = plot.set_title('Distribution of Age across Pclass')

First-class passengers (Pclass 1) tend to be older than those in second and third class, but there's no big difference between male and female.  
So we'll replace missing values with the mean age of the appropriate Pclass.

In [None]:
mean_age_by_pclass = combined_df.groupby('Pclass')['Age'].mean()
mean_age_by_pclass

In [None]:
def compute_age(row):
    if pd.isnull(row['Age']):
        return mean_age_by_pclass[row['Pclass']]
    return row['Age']

In [None]:
for df in combined:
    df["Age"] = df.apply(compute_age, axis=1)
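
The same per-class imputation can also be written without a row-wise `apply`. The sketch below is one possible vectorized variant on a toy frame (the data is made up); it computes the class mean on the same frame it fills, whereas the notebook uses the combined train and test data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Mean age of each row's Pclass, aligned with the original index.
class_mean = toy.groupby("Pclass")["Age"].transform("mean")

# Only the missing ages are replaced; known ages are kept as they are.
toy["Age_filled"] = toy["Age"].fillna(class_mean)
print(toy)
```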

In [None]:
sns.distplot(train_df['Age'])

Continuous values like Age can be harder for some models to use directly.  
So we will convert the column to bins.  
But before we do that, we create a new column that tells us whether a given person is a child (younger than 15) or not: _women and children first_.

In [None]:
for df in combined:
    df['IsChild'] = df['Age']<15

In [None]:
def get_quantile_based_boundaries(feature_values, num_buckets):
  boundaries = np.arange(1.0, num_buckets) / num_buckets
  quantiles = feature_values.quantile(boundaries)
  return [quantiles[q] for q in quantiles.keys()]

In [None]:
def compute_band(row, column_name, boundaries):
    i=0
    for boundary in boundaries:
        if row[column_name] < boundary:
            return i
        i=i+1
    return len(boundaries)
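
To make the banding rule concrete: the loop returns the number of boundaries that are less than or equal to the value. Under that reading, `np.searchsorted` with `side='right'` gives the same band in one call. This is a sketch on made-up ages, with a copy of the loop (renamed `compute_band_loop`, a name used only here) for comparison.

```python
import numpy as np
import pandas as pd

ages = pd.Series([2.0, 18.0, 25.0, 25.0, 31.0, 40.0, 58.0, 71.0])
num_buckets = 4

# Interior quantiles (25%, 50%, 75% for four buckets).
boundaries = ages.quantile(np.arange(1.0, num_buckets) / num_buckets).to_numpy()

# side='right' counts the boundaries that are <= each value.
bands = np.searchsorted(boundaries, ages.to_numpy(), side="right")

def compute_band_loop(value, boundaries):
    """Same rule as the row-wise loop: index of the first boundary above the value."""
    for i, boundary in enumerate(boundaries):
        if value < boundary:
            return i
    return len(boundaries)

loop_bands = [compute_band_loop(v, boundaries) for v in ages]
print(boundaries)
print(bands.tolist(), loop_bands)
```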

As I don't know yet how many bins I want, I'll create multiple columns with different numbers of bins (3, 4 and 5).

In [None]:
for i in np.arange(3, 6):
    age_boundaries = get_quantile_based_boundaries(combined_df.Age, i)
    for df in combined:
        df["Age_band_" + str(i)] = df.apply(lambda row : compute_band(row, 'Age', age_boundaries), axis=1)

In [None]:
sns.distplot(train_df.Age_band_5)

## Fare column

In [None]:
sns.distplot(train_df.Fare)

There's one missing fare, so we'll fill it with the mean fare of the passenger's Pclass.

In [None]:
missing_fare = test_df[test_df['Fare'].isna()]
missing_fare

In [None]:
fare_groupby_pclass = combined_df.groupby('Pclass')['Fare'].mean()
fare_groupby_pclass

In [None]:
mean_fare_for_Pclass3 = fare_groupby_pclass.loc[missing_fare.iloc[0].Pclass]
test_df['Fare'] = test_df['Fare'].fillna(mean_fare_for_Pclass3)

For the same reason as for Age, we will create bands of Fare (3, 4 and 5 bins).

In [None]:
for i in np.arange(3, 6):
    fare_boundaries = get_quantile_based_boundaries(combined_df.Fare, i)
    for df in combined:
        df["Fare_band_" + str(i)] = df.apply(lambda row : compute_band(row, 'Fare', fare_boundaries), axis=1)

In [None]:
sns.distplot(train_df.Fare_band_5)

## Sex column

Let's convert the Sex column to be numerical.

In [None]:
train_df.Sex.value_counts()

In [None]:
for df in combined:
    df['Sex'] = (df['Sex'] == 'male').astype(int)  # male=1, female=0, same encoding as get_dummies(..., drop_first=True)

In [None]:
train_df.Sex.value_counts()

## Embarked column

First fill empty values with the most frequent port of embarkation.  
Then convert the column to be numerical.

In [None]:
train_df.Embarked.value_counts()

In [None]:
freq_port = train_df.Embarked.dropna().mode()[0]

for df in combined:
    df['Embarked'] = df['Embarked'].fillna(freq_port)
    df['Embarked'] = pd.Categorical(df['Embarked']).codes

In [None]:
train_df.Embarked.value_counts()

## FamilySize and IsAlone columns

Let's create two new columns:
- FamilySize is the sum of SibSp and Parch + 1 (the passenger themselves)
- IsAlone: whether the passenger embarked alone or with family

In [None]:
for df in combined:
    df['FamilySize'] = df.SibSp + df.Parch + 1
    df['IsAlone'] = ((df.SibSp + df.Parch)==0)*1

## Cabin column

Here I will extract some information that may be interesting from the Cabin column: the letter of the cabin, its number, and whether the number is odd.

In [None]:
import re
def extract_cabin_nr(cabin):
    """ Extracts the cabin number.  If no number is found, return NaN """
    if not pd.isnull(cabin):
        cabin = cabin.split(' ')[-1]    # if several cabins on ticket, take last one
        re_numb = r'[A-Z]([0-9]+)'
        try:
            number = int(re.findall(re_numb, cabin)[0])
            return number
        except IndexError:    # no letter+number pattern in this cabin string
            return np.nan
    else:
        return np.nan

In [None]:
def extract_cabin_letter(cabin):
    """ Extracts the cabin letter.  If no letter is found, return NaN """
    if not pd.isnull(cabin):
        cabin = cabin.split(' ')[-1]    # if several cabins on ticket, take last one
        re_char = r'([A-Z])[0-9]+'
        try:
            character = re.findall(re_char, cabin)[0]
            return character
        except IndexError:    # no letter+number pattern in this cabin string
            return np.nan
    else:
        return np.nan

In [None]:
for df in combined:
    df['Cabin_char'] = list(map(extract_cabin_letter, df['Cabin']))
    df['Cabin_nr'] = list(map(extract_cabin_nr, df['Cabin']))
    df['Cabin_nr_odd'] = df['Cabin_nr'] % 2    # NaN stays NaN
    
    # deal with the NaN's in some of our newly created columns
    for col in ['Cabin_char', 'Cabin_nr', 'Cabin_nr_odd']:
        df[col] = df[col].fillna(-9999)
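
For comparison, the letter and number can also be pulled out in one pass with pandas string methods. This is only a sketch on made-up cabin strings, not a drop-in replacement for the functions above; it leaves missing values as NaN instead of filling them with -9999.

```python
import numpy as np
import pandas as pd

cabins = pd.Series(["C85", np.nan, "B57 B59 B63", "F G73", "E46", "T"])

# Keep the last cabin listed on the ticket.
last_cabin = cabins.str.split(" ").str[-1]

# One regex, two capture groups: deck letter followed by cabin number.
parts = last_cabin.str.extract(r"([A-Z])([0-9]+)")
letter = parts[0]
number = pd.to_numeric(parts[1])  # float column, missing values stay NaN
is_odd = number % 2               # 1.0 for odd, 0.0 for even, NaN if unknown

print(pd.DataFrame({"Cabin": cabins, "letter": letter, "number": number, "odd": is_odd}))
```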

## Other columns
Let's drop the columns we don't need anymore.

In [None]:
train_df = train_df.drop(['Name','Cabin','Ticket','Fare','Age'], axis=1)

In [None]:
train_df = pd.get_dummies(train_df, drop_first=True)
train_df.describe()

## Select features

I tried many different feature selections in the models I tested.  
I also explored feature ranking with recursive feature elimination and cross-validated selection of the best number of features with [RFECV](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html).  
I finally settled on a mix of my own intuition and what RFECV proposed.

In [None]:
X = np.array(train_df.drop(['Survived','PassengerId'], axis=1))
training_features = np.array(train_df.drop(['Survived','PassengerId'], axis=1).columns)
y = np.array(train_df['Survived'])

I need a first classifier to explore feature ranking.  
Machine learning is an iterative process, so here I'll use the classifier from my best submission: [XGBClassifier](https://xgboost.readthedocs.io/en/latest/python/python_api.html)

In [None]:
from sklearn import model_selection
import xgboost as xgb

clf = xgb.XGBClassifier(verbosity=1)
cv = model_selection.StratifiedKFold(n_splits=6, shuffle=True, random_state=1)
scores = model_selection.cross_val_score(clf, X, y, cv=cv, n_jobs=-1, scoring='accuracy', verbose=1)
clf.fit(X,y)
print('n_splits=' + str(cv.n_splits))  # number of folds used above
print(scores)
print('Accuracy: %.3f stdev: %.3f' % (np.mean(np.abs(scores)), np.std(scores)))
print()


In [None]:
from sklearn.feature_selection import RFECV

print("features used during training: ")
print(training_features)
print("")

featselect = RFECV(estimator=clf, cv=cv, scoring='accuracy', verbose=1, n_jobs=-1)
featselect.fit(X,y)

print("features proposed by RFECV: ")
print(training_features[featselect.support_])
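
To see what RFECV returns without needing the Titanic files, here is a minimal sketch on synthetic data. The dataset comes from `make_classification` and the estimator is a `LogisticRegression`, both chosen only to keep the example small and fast; they are assumptions, not what the notebook uses.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic data: 8 features, of which 3 carry signal.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
selector = RFECV(
    estimator=LogisticRegression(max_iter=1000), step=1, cv=cv_toy, scoring="accuracy"
)
selector.fit(X_toy, y_toy)

print("number of features kept:", selector.n_features_)
print("mask of kept features:", selector.support_)
print("ranking (1 = kept):", selector.ranking_)
```

`support_` is the boolean mask used above to index `training_features`, and `ranking_` shows the order in which the remaining features were eliminated.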

In [None]:
#features = ['Pclass', 'Sex', 'Age', 'IsAlone', 'Fare', 'Embarked', 'Title', 'Cabin_nr']
features = ['Pclass','Sex', 'IsChild', 'Age_band_5', 'Fare_band_4', 'Fare_band_5', 'Title', 'FamilySize', 'Cabin_nr']

# Choosing a model

The goal of this project is to get a first real experience with machine learning, Python and sklearn, and to try many different classifiers.  
So let's write a routine that tries several classifiers and, for each, searches over a grid of hyperparameters with [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).  
It also records the score and best parameters for every classifier.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.metrics import make_scorer, accuracy_score, f1_score
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import SGDClassifier

Here are the hyper parameters and their tested values:

In [None]:
n_estimators = [10,50,100,200,400]
n_neighbors = [3,4,5,6]
learning_rates = [1.0, 0.3, 0.1, 0.03, 0.01, 0.005, 0.003]
criterion = ['gini', 'entropy']
max_features = ['log2', 'sqrt']  # 'auto' (an alias of 'sqrt' for classifiers) is no longer accepted by recent scikit-learn releases
hidden_layer_sizes = [(100,), (100,100), (50,100,50)]
C = [0.1, 1, 3, 10, 30, 100]
gamma = [1, 0.3, 0.1, 0.03, 0.01, 0.001]
max_depth = [2, 3, 4]
reg_lambda = [0.50]
loss = ['log_loss', 'exponential']  # 'log_loss' (scikit-learn >= 1.1) replaces the old name 'deviance'

And here are the classifiers I try, with their hyperparameter grids:

In [None]:
models_and_grid_params = [
    #(SGDClassifier, {'penalty': ['l2', 'l1']}),
    (xgb.XGBClassifier, {'n_estimators': n_estimators, 'learning_rate': learning_rates, 'max_depth': max_depth, 'reg_lambda': reg_lambda}),
    (AdaBoostClassifier, {'n_estimators': n_estimators, 'learning_rate': learning_rates}),
    #(ExtraTreesClassifier, {'n_estimators': n_estimators}),
    (GradientBoostingClassifier,{'loss': loss, 'learning_rate': learning_rates}), 
    (RandomForestClassifier, {'n_estimators': n_estimators, 'criterion': criterion,'max_features': max_features}),
    #(KNeighborsClassifier, {'n_neighbors': n_neighbors}),
    #(MLPClassifier,{'hidden_layer_sizes': hidden_layer_sizes}),
    #(SVC,{'C': C, 'gamma': gamma}),
    #(GaussianProcessClassifier,{}),
    (DecisionTreeClassifier,{}),
    #(GaussianNB,{})
]

scores = pd.DataFrame(columns=['Model', 'Estimator', 'Trial', 'Best Params', 'Accuracy Score'])

# Training models

Here I split training data into random train (60%) and test subsets (40%). The test subset will be used to evaluate how the classifier can generalize with unseen data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_df[features], train_df['Survived'], test_size=0.4, random_state=42)
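
A small sketch of what a 60/40 split does to the class balance, on made-up labels with 30% positives. It adds `stratify=`, which the notebook's own call does not use; that is an assumption of this example, shown because it keeps the proportion of survivors the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))        # 100 rows, 3 made-up features
y_toy = np.array([0] * 70 + [1] * 30)    # 30% positive class

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))              # sizes of the two subsets
print(y_tr.mean(), y_te.mean())          # share of positives in each subset
```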

Define train_model function that will create a given model, and search for the best parameters across a given param_grid.
It appends the results to the scores dataframe for later evaluation.

In [None]:
def train_model(estimator_class, param_grid, scores, X_train, X_test, y_train, y_test, verbose=0):
    estimator = estimator_class()
    
    if verbose==1:
        print('Training ' + type(estimator).__name__ + '...')
    
    
    cv = model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

    model = GridSearchCV(estimator, param_grid, refit=True, cv=cv, scoring='accuracy', verbose=verbose, n_jobs=-1)
    model.fit(X_train,y_train)
    #pred = model.predict(X_test)
    
    #accuracy = round(model.score(X_test, y_test) * 100, 3)
    trial_row = scores[scores['Model']==type(estimator).__name__]
    trial = 1
    if not trial_row.empty:
        trial = int(trial_row['Trial'].max()) + 1
    new_row = pd.DataFrame([{'Model' : type(estimator).__name__, 
                             'Estimator' : model.best_estimator_, 
                             'Trial': trial,
                             'Best Params': str(model.best_params_),
                             'Accuracy Score': round(model.best_score_*100,3) #accuracy
                            }])
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    scores = pd.concat([scores, new_row], ignore_index=True)
    
    return scores, model.best_estimator_, model.best_params_, round(model.best_score_*100,3)
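
Note that `train_model` reports `best_score_`, which is the mean cross-validated accuracy on the training subset, not the accuracy on `X_test`. The sketch below shows the two numbers side by side on synthetic data; the dataset, the tiny grid, and the variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.4, random_state=42)

cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 50], "max_depth": [2, 4]},
    cv=cv_toy,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

cv_accuracy = search.best_score_             # mean accuracy over the 5 folds of X_tr
holdout_accuracy = search.score(X_te, y_te)  # accuracy of the refit model on unseen rows

print(search.best_params_)
print("cross-validated:", round(cv_accuracy * 100, 3))
print("hold-out:", round(holdout_accuracy * 100, 3))
```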

In [None]:
def print_results(model, params, accuracy, scores=None):
    print(type(model).__name__ + ' works best (' + str(accuracy) + '%) with ' + str(params))
    if not scores is None:
        scores = scores.sort_values('Accuracy Score', ascending=False)
    return scores

Now let's train all the models:

In [None]:
for model_class, param_grid in models_and_grid_params:
    scores, best_model, best_params, accuracy = train_model(model_class, param_grid, scores, X_train, X_test, y_train, y_test, verbose=1)
    print_results(best_model, best_params, accuracy)

# Evaluating models

The dataframe below shows the best results for the trained models, with the highest score on top.

In [None]:
def print_scores(scores):
    return scores.sort_values('Accuracy Score', ascending=False)

In [None]:
print_scores(scores)

# Hyperparameter Tuning

## Tuning with RandomizedSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from pprint import pprint

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1900, num = 10)]
# Number of features to consider at every split
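max_features = ['sqrt', 'log2']  # 'auto' (an alias of 'sqrt' for classifiers) is no longer accepted by recent scikit-learn releases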
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
criterion = ['gini', 'entropy']

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'criterion': criterion,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

pprint(random_grid)

In [None]:
rf_random = RandomizedSearchCV(estimator = RandomForestClassifier(), param_distributions = random_grid, n_iter = 200, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)

In [None]:
print('The best parameters after tuning are: ')
print()
pprint(rf_random.best_params_)

print()
best_model = rf_random.best_estimator_
# Best cross-validated accuracy recorded earlier for the same model class
old_score = scores.loc[scores['Model']==type(best_model).__name__, 'Accuracy Score'].max()
rf_random_score = round(best_model.score(X_test, y_test) * 100, 2)
print('Final Score on validation data: ' + str(rf_random_score) + '% (' + format(rf_random_score-old_score, '+.2f') + '%)')

In [None]:
feature_importances_df = pd.DataFrame(best_model.feature_importances_).transpose()
feature_importances_df.columns = features
axes = sns.barplot(data=feature_importances_df)
_ = axes.set_title('Feature importances in final model')

## Tuning with GridSearchCV

### GradientBoostingClassifier

In [None]:
learning_rates = [0.12, 0.11, 0.1, 0.09, 0.08]
max_depth = [3,5,8]
loss = ['log_loss', 'exponential']  # 'log_loss' (scikit-learn >= 1.1) replaces the old name 'deviance'

tuned_param_grid = {'learning_rate': learning_rates, 'loss': loss, 'max_depth': max_depth,
    "min_samples_split": np.linspace(0.1, 0.5, 6),
    "min_samples_leaf": np.linspace(0.1, 0.5, 6),
    "max_features":["log2","sqrt"],
    # 'mae' is no longer accepted by recent scikit-learn releases; 'squared_error' is the other supported criterion
    "criterion": ["friedman_mse", "squared_error"]}


In [None]:
scores, gb_tuned, gb_tuned_params, gb_accuracy = train_model(GradientBoostingClassifier, tuned_param_grid, scores, X_train, X_test, y_train, y_test, verbose=1)

In [None]:
print_results(gb_tuned, gb_tuned_params, gb_accuracy, scores)

### XGBClassifier

In [None]:
n_estimators = [8,10,11,12,13,14] #,100,200]
learning_rates = [0.3, 0.25, 0.2, 0.15, 0.1] #, 0.05, 0.03]
max_depth = [3]
reg_lambda = [.43, .44, .45, .46,.47, .48]
colsample_bytree = [.8, .9, 1]
reg_alpha = [0] #, 1e-06, 1e-07] #, 0.002, 0.003] #, 0.004, 0.005, 0.01]
booster = ['gbtree'] #, 'gblinear', 'dart']

tuned_param_grid = {'n_estimators': n_estimators, 'learning_rate': learning_rates, 'max_depth': max_depth, 
                    'reg_lambda': reg_lambda, 'reg_alpha': reg_alpha,
                    'colsample_bytree': colsample_bytree, 'booster': booster
                   }

pprint(tuned_param_grid)

In [None]:
scores, xgb_tuned, xgb_tuned_params, xgb_accuracy = train_model(xgb.XGBClassifier, tuned_param_grid, scores, X_train, X_test, y_train, y_test, verbose=1)

In [None]:
print_results(xgb_tuned, xgb_tuned_params, xgb_accuracy, scores)

### RandomForestClassifier

In [None]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 1000, stop = 1600, num = 7)]
# Number of features to consider at every split
max_features = ['sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.arange(16,23)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [int(x) for x in np.arange(9,12)]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1]
# Method of selecting samples for training each tree
bootstrap = [True]
criterion = ['entropy']

tuned_param_grid = {'n_estimators': n_estimators,
               'criterion': criterion,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

pprint(tuned_param_grid)

In [None]:
scores, rf_tuned, rf_tuned_params, rf_accuracy = train_model(RandomForestClassifier, tuned_param_grid, scores, X_train, X_test, y_train, y_test, verbose=1)

In [None]:
print_results(rf_tuned, rf_tuned_params, rf_accuracy, scores)

# Making Predictions

In [None]:
best_model = xgb_tuned

In [None]:
y_pred_test = best_model.predict(X_test)
y_pred_valid = best_model.predict(test_df[features])
submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": y_pred_valid
    })
submission.to_csv('submission.csv', index=False)

In [None]:
from sklearn import metrics

In [None]:
print('Train set')
print(metrics.classification_report(y_train, best_model.predict(X_train)))
print('Test set')
print(metrics.f1_score(y_test, y_pred_test))

# **Simplifying the process with H2O**

## **Install Java**

Run the following cell to install Java, which H2O requires.

Colab gives you a fresh, publicly hosted virtual machine on every connection, so packages have to be installed each time, as if on a brand new Ubuntu server controlled from a Jupyter notebook. The `!` prefix tells the notebook to run the line as a shell command.

In [None]:
!apt-get install -y default-jre
!java -version

## **Install H2O**

In [None]:
! pip install h2o

In [None]:
import pandas as pd
import numpy as np
import h2o
#from h2o.estimators.gbm import H2OGradientBoostingEstimator
#from h2o.grid.grid_search import H2OGridSearch
from h2o.automl import H2OAutoML
from sklearn import metrics
from sklearn.metrics import roc_auc_score

In [None]:
# initialize H2O
h2o.init()

In [None]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

## Treat Missing Values

In [None]:
all = pd.concat([train, test], sort = False)
all.info()

In [None]:
#Fill Missing numbers with median for Age and Fare
all['Age'] = all['Age'].fillna(value=all['Age'].median())
all['Fare'] = all['Fare'].fillna(value=all['Fare'].median())

#Treat Embarked
all['Embarked'] = all['Embarked'].fillna('S')

#Bin Age
#Age
all.loc[ all['Age'] <= 16, 'Age'] = 0
all.loc[(all['Age'] > 16) & (all['Age'] <= 32), 'Age'] = 1
all.loc[(all['Age'] > 32) & (all['Age'] <= 48), 'Age'] = 2
all.loc[(all['Age'] > 48) & (all['Age'] <= 64), 'Age'] = 3
all.loc[ all['Age'] > 64, 'Age'] = 4 

#Cabin
all['Cabin'] = all['Cabin'].fillna('Missing')
all['Cabin'] = all['Cabin'].str[0]

#Family Size & Alone 
all['Family_Size'] = all['SibSp'] + all['Parch'] + 1
all['IsAlone'] = 0
all.loc[all['Family_Size']==1, 'IsAlone'] = 1
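
The five `loc` assignments for the Age bins can be expressed as a single `pd.cut` call. A minimal sketch on made-up ages, assuming the same edges (16, 32, 48, 64) and right-inclusive intervals:

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.5, 16.0, 16.5, 32.0, 40.0, 64.0, 80.0])

# Intervals are (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf);
# labels=False returns the integer bin index instead of an interval label.
age_band = pd.cut(ages, bins=[-np.inf, 16, 32, 48, 64, np.inf], labels=False)

print(age_band.tolist())
```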

## Extra Features: Title

In [None]:
#Title
import re
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+\.)', name)
    
    if title_search:
        return title_search.group(1)
    return ""

In [None]:
all['Title'] = all['Name'].apply(get_title)
all['Title'].value_counts()

In [None]:
all['Title'] = all['Title'].replace(['Capt.', 'Dr.', 'Major.', 'Rev.'], 'Officer.')
all['Title'] = all['Title'].replace(['Lady.', 'Countess.', 'Don.', 'Sir.', 'Jonkheer.', 'Dona.'], 'Royal.')
all['Title'] = all['Title'].replace(['Mlle.', 'Ms.'], 'Miss.')
all['Title'] = all['Title'].replace(['Mme.'], 'Mrs.')
all['Title'].value_counts()

In [None]:
#Drop unwanted variables
all_1 = all.drop(['Name', 'Ticket'], axis = 1)
all_1.head()

In [None]:
all_dummies = pd.get_dummies(all_1, drop_first = True)
all_dummies.head()

## **Converting Pandas dataframe to H2O frame**

In [None]:
all_train = h2o.H2OFrame(all_dummies[all_dummies['Survived'].notna()])
all_test = h2o.H2OFrame(all_dummies[all_dummies['Survived'].isna()])

## Train/Test Split

In [None]:
target = 'Survived'
features = [f for f in all_train.columns if f not in ['Survived','PassengerId']]

In [None]:
train_df, valid_df, test_df = all_train.split_frame(ratios=[0.7, 0.15], seed=2018)

In [None]:
train_df[target] = train_df[target].asfactor()
valid_df[target] = valid_df[target].asfactor()
test_df[target] = test_df[target].asfactor()

## **Build Model**

Increase `max_runtime_secs` (for example to 5000) to give AutoML time to train more models, which may improve accuracy.

In [None]:
predictors = features

aml = H2OAutoML(max_models = 50, max_runtime_secs=500, seed = 1)
aml.train(x=predictors, y=target, training_frame=train_df, validation_frame=valid_df)

In [None]:
lb = aml.leaderboard
lb

In [None]:
aml.leader.params.keys()

In [None]:
aml.leader.model_id

In [None]:
pred_val = aml.predict(test_df[predictors])[0].as_data_frame()
pred_val

## Check Accuracy

In [None]:
true_val = (test_df[target]).as_data_frame()
prediction_auc = roc_auc_score(true_val, pred_val)  # signature is roc_auc_score(y_true, y_score)
prediction_auc
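
A short sketch of the difference between AUC and accuracy, on four made-up examples. `roc_auc_score` expects the true labels first and a score second, and it is usually most informative when the score is the predicted probability of the positive class. The 0.5 threshold and the toy values are assumptions for illustration.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 1, 1]
p1 = [0.1, 0.4, 0.35, 0.8]              # predicted probability of class 1

# Hard labels obtained with an assumed 0.5 threshold.
y_hat = [int(p >= 0.5) for p in p1]

auc = roc_auc_score(y_true, p1)         # ranks positives against negatives
accuracy = accuracy_score(y_true, y_hat)

print("AUC:", auc)
print("accuracy:", accuracy)
```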

## Final Predictions

In [None]:
TestForPred = all_test.drop(['PassengerId', 'Survived'], axis = 1)

In [None]:
fin_pred = aml.predict(TestForPred[predictors])[0].as_data_frame()

In [None]:
PassengerId = all_test['PassengerId'].as_data_frame()

In [None]:
h2o_Sub = pd.DataFrame({'PassengerId': PassengerId['PassengerId'].tolist(), 'Survived':fin_pred['predict'].tolist() })
h2o_Sub.head()

In [None]:
h2o_Sub.to_csv("1_auto_h2o_50_Submission.csv", index = False)

Copyright 2021 Abhishek Gargha Maheshwarappa and Nicholas Brown

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE