# About 

### Kaggle Competition | Titanic Machine Learning from Disaster

>The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.  This sensational tragedy shocked the international community and led to better safety regulations for ships.

>One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.  Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

>In this contest, we ask you to complete the analysis of what sorts of people were likely to survive.  In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

>This Kaggle Getting Started Competition provides an ideal starting place for people who may not have a lot of experience in data science and machine learning.

From the competition [homepage](http://www.kaggle.com/c/titanic-gettingStarted).

### Goal for this Notebook:
End-to-end example of predicting the survival of Titanic passengers with the Google Colaboratory PaaS using:

* Kaggle API
* Python 3
* GitHub
* Google Drive

#### This Notebook will show basic examples of: 


##### Data Handling
* Getting dataset with Kaggle CLI
* Importing Data with Pandas
* Cleaning Data
* Submitting predictions with Kaggle CLI
* Mounting Google Drive as a partition in Google Colab
* Using GitHub for file transfer


##### Data Analysis
Various libraries

##### Evaluation of the Analysis
* Output the results from the IPython Notebook to Kaggle

#### Required Libraries:
* [NumPy](http://www.numpy.org/)
* [IPython](http://ipython.org/)
* [Pandas](http://pandas.pydata.org/)
* [SciKit-Learn](http://scikit-learn.org/stable/)
* [SciPy](http://www.scipy.org/)
* [StatsModels](http://statsmodels.sourceforge.net/)
* [Patsy](http://patsy.readthedocs.org/en/latest/)
* [Matplotlib](http://matplotlib.org/)

#### References
* [Official Kaggle CLI API](https://github.com/kaggle/kaggle-api)
* [Collection of google colaboratory notebooks](https://github.com/todun/googlecolab)
* [Kaggle](https://www.kaggle.com/)
* [Google Colab Free GPU Tutorial](https://medium.com/deep-learning-turkey/google-colab-free-gpu-tutorial-e113627b9f5d)
* [Tutorial on Using Google Colab for Kaggle Competition](https://medium.com/@burakteke/tutorial-on-using-google-colab-for-kaggle-competition-620393c22821)
* [Make your Kaggle Submissions with Kaggle Official API!](https://medium.com/@nokkk/make-your-kaggle-submissions-with-kaggle-official-api-f49093c04f8a)
* [Google Colaboratory Cheat Sheet](https://medium.com/@rahul.metangale/google-colaboratory-cheat-sheet-24b99813b0f0)
* [How to Upload large files to Google Colab and remote Jupyter notebooks](https://medium.freecodecamp.org/how-to-transfer-large-files-to-google-colab-and-remote-jupyter-notebooks-26ca252892fa)
* [INTRODUCTION TO CONVOLUTIONAL NEURAL NETWORKS](https://medium.com/@johnolafenwa/introduction-to-convolutional-neural-networks-60e113744c4)
* [COMPONENTS OF NEURAL NETWORKS](https://medium.com/@johnolafenwa/components-of-neural-networks-2787589f464c)
* [INTRODUCTION TO NEURAL NETWORKS](https://medium.com/@johnolafenwa/introduction-to-neural-networks-ca7eab1d27d7)
* [Running Jupyter Notebook on Colab](https://medium.com/@margaretmz/running-jupyter-notebook-with-colab-f4a29a9c7156)

# 0. Define global variables

These are *NOT HYPERPARAMETERS*; they exist primarily for code maintenance.

In [4]:
KAGGLE_COMPETITION_NAME='titanic'
# All Drive paths share one mount point so that the cells below agree on where files live.
GDRIVE_ROOT='/content/my_google_drive'
GDRIVE_KAGGLE_TITANIC='/content/my_google_drive/kaggle-titanic'
GDRIVE_KAGGLE_TITANIC_AI_CODE='/content/my_google_drive/kaggle-titanic/kt-ai-code'
RESULTS='/content/my_google_drive/kaggle-titanic/results'
RESULT_CSV='/content/my_google_drive/kaggle-titanic/results/submission.csv'

KAGGLE_CLI_COLAB_DATA='/content/.kaggle/competitions'
KAGGLE_CLI_COLAB_ROOT='/content/.kaggle'

IS_MAX_SCORE_SUBMISSION = True # create submission.csv from the model with the maximum score
FINAL = False # indicate whether the predictions of every trained model are saved to CSV
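Because the Drive paths above share a common prefix, one option is to derive them from a single root so that they cannot drift apart. The following is a minimal sketch, assuming the Drive is mounted at `/content/my_google_drive`; the variable names mirror the ones in this notebook.

```python
import posixpath  # Colab runs Linux, so POSIX-style joins are appropriate

# Assumption: the Drive is mounted at this single root.
GDRIVE_ROOT = '/content/my_google_drive'

# Every other Drive path is built from the root instead of being typed out again.
GDRIVE_KAGGLE_TITANIC = posixpath.join(GDRIVE_ROOT, 'kaggle-titanic')
GDRIVE_KAGGLE_TITANIC_AI_CODE = posixpath.join(GDRIVE_KAGGLE_TITANIC, 'kt-ai-code')
RESULTS = posixpath.join(GDRIVE_KAGGLE_TITANIC, 'results')
RESULT_CSV = posixpath.join(RESULTS, 'submission.csv')

print(RESULT_CSV)
```

Changing the mount point then only requires editing `GDRIVE_ROOT`.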

# 1. Link Google Drive to the Google Colaboratory file system

This mounts Google Drive as a file system in the Google Colaboratory virtual machine (VM).


!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
# Redirect stdout first and then stderr, so that both streams are silenced.
!add-apt-repository -y ppa:nikbearbrown/ppa > /dev/null 2>&1
!apt-get update -qq > /dev/null 2>&1
!apt-get -y install -qq google-drive-ocamlfuse fuse
from google.colab import auth
auth.authenticate_user()
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

In [None]:
from google.colab import drive
drive.mount('/GRrive')

##### mount my google drive

In [None]:
!mkdir -p my_google_drive
!google-drive-ocamlfuse my_google_drive

### put titanic dataset into google drive

##### Create folder for competition data & AI

In [None]:
!ls $GDRIVE_ROOT

In [None]:
!mkdir $GDRIVE_ROOT/kaggle-titanic

##### Check that the Drive folder exists

In [None]:
!ls $GDRIVE_KAGGLE_TITANIC
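The same check can be done from Python, which also makes the folder creation safe to re-run. This is a minimal sketch; `ensure_dir` is a hypothetical helper, and a temporary directory stands in for the mounted Drive.

```python
import os
import tempfile

def ensure_dir(path):
    """Create `path` (and any missing parents), then report whether it is a directory."""
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return os.path.isdir(path)

# Demonstrated on a temporary directory; in the notebook, pass GDRIVE_KAGGLE_TITANIC instead.
base = tempfile.mkdtemp()
target = os.path.join(base, 'kaggle-titanic')
created = ensure_dir(target)
again = ensure_dir(target)  # a second call is harmless
print(created, again)
```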

### Link to kaggle

Use the Kaggle CLI to get the competition dataset

##### install kaggle

In [None]:
!pip install kaggle

#### kaggle API Credentials

To use the Kaggle API, sign up for a Kaggle account at https://www.kaggle.com. Then go to the 'Account' tab of your user profile (`https://www.kaggle.com/<username>/account`) and select 'Create API Token'. This will trigger the download of `kaggle.json`, a file containing your API credentials.

Place this file anywhere on your Google Drive.

The next snippet downloads your credentials to Colab so that you can start using the Kaggle API.

##### Manually upload kaggle.json to anywhere on Google Drive

##### transfer kaggle api keys from google drive to colab file system

In [None]:
from googleapiclient.discovery import build
import io, os
from googleapiclient.http import MediaIoBaseDownload
from google.colab import auth

auth.authenticate_user()

drive_service = build('drive', 'v3')
results = drive_service.files().list(
        q="name = 'kaggle.json'", fields="files(id)").execute()
kaggle_api_key = results.get('files', [])

filename = "/content/.kaggle/kaggle.json"
os.makedirs(os.path.dirname(filename), exist_ok=True)

request = drive_service.files().get_media(fileId=kaggle_api_key[0]['id'])
fh = io.FileIO(filename, 'wb')
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
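fh.close()  # release the file handle once the download is complete
os.chmod(filename, 0o600)  # octal mode: read/write for the owner only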

##### confirm transfer of kaggle api keys to google colab

In [None]:
!ls $KAGGLE_CLI_COLAB_ROOT/kaggle.json
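Beyond listing the file, it can be useful to confirm that it is well-formed and private. The sketch below assumes the usual `kaggle.json` layout with `username` and `key` fields; `check_kaggle_credentials` is a hypothetical helper, and the demo file holds placeholder values rather than real credentials.

```python
import json
import os
import stat
import tempfile

def check_kaggle_credentials(path):
    """Return True if `path` is a JSON object with non-empty 'username' and 'key'
    and, on POSIX systems, is not accessible by group or others."""
    with open(path) as f:
        creds = json.load(f)
    has_fields = all(creds.get(name) for name in ('username', 'key'))
    mode = stat.S_IMODE(os.stat(path).st_mode)
    private = os.name != 'posix' or (mode & 0o077) == 0
    return has_fields and private

# Demonstrated on a throwaway file with placeholder values.
demo_path = os.path.join(tempfile.mkdtemp(), 'kaggle.json')
with open(demo_path, 'w') as f:
    json.dump({'username': 'example-user', 'key': 'not-a-real-key'}, f)
os.chmod(demo_path, 0o600)
ok = check_kaggle_credentials(demo_path)
print(ok)
```

In the notebook, the path to check would be `/content/.kaggle/kaggle.json`.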

##### manually delete kaggle.json from google drive

##### confirm manual delete of kaggle.json from google drive
Should get an error like

    ls: cannot access '/content/my_google_drive/kaggle.json': No such file or directory

In [None]:
!ls $GDRIVE_ROOT/kaggle.json

##### download dataset from kaggle
using the official kaggle cli 

In [None]:
!kaggle competitions download -c $KAGGLE_COMPETITION_NAME

##### confirm data download

In [None]:
!ls $KAGGLE_CLI_COLAB_DATA/$KAGGLE_COMPETITION_NAME

##### move kaggle data into gdrive data folder

Copy all of the competition data downloaded from Kaggle to the mounted Google Drive folder `kaggle-titanic`.

In [None]:
!cp $KAGGLE_CLI_COLAB_DATA/$KAGGLE_COMPETITION_NAME/* $GDRIVE_KAGGLE_TITANIC

##### verify data folder contents

In [None]:
!ls $GDRIVE_KAGGLE_TITANIC

### put AI code into mounted google drive

##### [Download and move classifier folder/files into created GDrive folder](https://medium.freecodecamp.org/how-to-transfer-large-files-to-google-colab-and-remote-jupyter-notebooks-26ca252892fa)
Use github for file transfer

In [None]:
!rm -rf $GDRIVE_KAGGLE_TITANIC_AI_CODE
!pip install -q xlrd
!git clone https://github.com/aisaturday/kaggle-titanic-ai-code
!mkdir $GDRIVE_KAGGLE_TITANIC/kt-ai-code  
!mv ./kaggle-titanic-ai-code/*.py ./kaggle-titanic-ai-code/*.txt $GDRIVE_KAGGLE_TITANIC/kt-ai-code  
!rm -rf ./kaggle-titanic-ai-code

##### verify AI code

In [None]:
!ls $GDRIVE_KAGGLE_TITANIC_AI_CODE

##### [import a python file from google drive into google colab python environment](https://stackoverflow.com/a/20749411)


In [None]:
import sys 
import os
sys.path.append(os.path.abspath("/content/my_google_drive/kaggle-titanic/kt-ai-code"))

!pip install -r $GDRIVE_KAGGLE_TITANIC_AI_CODE/requirements.txt

import titanic_predict as tp # see https://github.com/aisaturday/kaggle-titanic-ai-code for more details

##### verify mount to my google drive

In [None]:
!ls $GDRIVE_KAGGLE_TITANIC_AI_CODE

# 2. Set up the project

### Data Handling
#### Let's read our data in using pandas:

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

%matplotlib inline

train_data = pd.read_csv('/content/my_google_drive/kaggle-titanic/train.csv')
test_data = pd.read_csv('/content/my_google_drive/kaggle-titanic/test.csv')
p_id = test_data['PassengerId']
data = pd.concat([train_data, test_data])

### Let's take a look:

train.csv **data** has 891 observations, or passengers, to analyze here:
    
    Int64Index: 891 entries, 0 to 890

Each column tells us something about each of our observations, like their `name`, `sex` or `age`. These columns are called the features of our dataset. You can think of the words column and feature as interchangeable in this notebook.

The summary printed by `DataFrame.info()` lists, after each feature, how many non-null values it contains. While most of our features have complete data on every observation, like the `survived` feature here:

    survived    891  non-null values 

some are missing information, like the `age` feature: 

    age         714  non-null values 

These missing values are represented as `NaN`s.
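The counts above can be reproduced directly with pandas. The following sketch uses a tiny made-up DataFrame as a stand-in for `train.csv`, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Age': [22.0, np.nan, 26.0, np.nan],
})

non_null = toy.count()      # number of non-null values per column
missing = toy.isna().sum()  # number of NaN values per column
print(non_null)
print(missing)
```

On the real data, `train_data.count()` and `train_data.isna().sum()` give the same kind of summary per feature.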

### Take care of missing values:
The `cabin` feature is missing for most passengers, and `ticket` is a free-form ticket identifier, so neither adds much value to our analysis. To handle this we will drop them from the dataframe to preserve the integrity of our dataset.

To do that we'll use this line of code to drop the features entirely:

    df = df.drop(['ticket','cabin'], axis=1) 


This line of code then removes every row that still contains a `NaN` in any remaining column / feature:
   
    df = df.dropna()
     
Now we have a clean and tidy dataset that is ready for analysis. Because `.dropna()` removes an observation from our data even if it only has 1 `NaN` in one of the features, it would have removed most of our dataset if we had not dropped the `ticket` and `cabin`  features first.
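The effect of the ordering can be seen on a small example. This is a sketch on made-up rows, using the lowercase column names from the text above; it is not the preprocessing that `titanic_predict` performs.

```python
import numpy as np
import pandas as pd

# Three made-up passengers: 'cabin' is mostly missing, 'age' is missing once.
toy = pd.DataFrame({
    'age': [22.0, np.nan, 26.0],
    'cabin': [np.nan, 'C85', np.nan],
    'ticket': ['A/5 21171', 'PC 17599', 'STON/O2. 3101282'],
})

# Calling dropna() first keeps only rows that are complete in every column.
rows_if_dropna_first = len(toy.dropna())

# Dropping the sparse columns first means only a missing 'age' removes a row.
cleaned = toy.drop(['ticket', 'cabin'], axis=1).dropna()
print(rows_if_dropna_first, len(cleaned))
```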



In [None]:
data = data.drop('PassengerId', axis=1)
survived = data['Survived'].dropna()  # labels exist only for the training rows
data['Survived'] = data['Survived'].fillna(-1)  # mark the unlabeled test rows with a -1 sentinel

In [None]:
processed_data = tp.preprocess_data(data)

training_data = processed_data[data['Survived'] != -1]
testing_data = processed_data[data['Survived'] == -1]

# Reassign instead of dropping in place, because both frames are slices of processed_data.
training_data = training_data.drop('Survived', axis=1)
testing_data = testing_data.drop('Survived', axis=1)

X_train, X_test, y_train, y_test = train_test_split(training_data, survived, test_size=0.20, random_state=42)
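The cells above combine the train and test rows, mark the unlabeled test rows with a `-1` sentinel, and split them apart again after preprocessing. A minimal sketch of that pattern on made-up data follows; `ignore_index=True` is a choice made here to keep the row labels unique and is not used in the notebook's own `pd.concat` call.

```python
import pandas as pd

# Made-up rows: the training frame has labels, the test frame does not.
train = pd.DataFrame({'Survived': [0, 1], 'Fare': [7.25, 71.28]})
test = pd.DataFrame({'Fare': [7.83, 8.05, 9.5]})

combined = pd.concat([train, test], ignore_index=True)
combined['Survived'] = combined['Survived'].fillna(-1)  # sentinel for unlabeled rows

is_train = combined['Survived'] != -1
train_rows = combined[is_train].drop('Survived', axis=1)
test_rows = combined[~is_train].drop('Survived', axis=1)
print(len(train_rows), len(test_rows))
```

Processing both sets together keeps any encoding or scaling consistent between them, at the cost of needing the sentinel to tell them apart afterwards.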

# 3. Training the models

This section fits several classifiers that are then used to make the predictions.

##### import the classifiers

These come from the scikit-learn Python library. They are untrained estimators, not pretrained models.

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

##### train and score each classifier
Each classifier is fit from scratch on the training split and scored on the held-out split; no transfer learning is involved.

In [None]:
models = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(max_depth=10),
    RandomForestClassifier(n_estimators=100),
    MLPClassifier(),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()
]

scores ={} # dictionary of model with associated accuracy score
for model in models:
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    scores[model] = score
    print(score)
    
model_with_max_score = max(scores, key=lambda k: scores[k])   
print(f"{model_with_max_score} has maximum score of {scores[model_with_max_score]}")    
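Keying the dictionary by estimator object works, but keying by name can make the comparison easier to read and log. The sketch below shows the same `max` selection on made-up accuracy values; they are illustrative only and are not results from this notebook.

```python
# Made-up accuracy values for illustration only.
scores_by_name = {
    'KNeighborsClassifier': 0.70,
    'RandomForestClassifier': 0.82,
    'GaussianNB': 0.78,
}

# The key with the highest score, plus a full ranking from best to worst.
best_name = max(scores_by_name, key=scores_by_name.get)
ranking = sorted(scores_by_name.items(), key=lambda item: item[1], reverse=True)

for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In the notebook, `type(model).__name__` would give such a name, although the two `SVC` entries would then need distinct labels.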

# 4. Predicting with models

To see how well the trained models generalize, predictions are made on the unseen test dataset.

#### create results directory
used to store predictions made with the trained models


In [None]:
!mkdir $GDRIVE_KAGGLE_TITANIC/results

##### verify results

In [None]:
!ls $RESULTS

##### TIPS



* [**numpy.savetxt** without hash mark at beginning of header line](https://stackoverflow.com/a/17361181)

      If you want to get rid of it, pass comments='' as option to savetxt.

* [Saving arrays as columns with np.savetxt](https://stackoverflow.com/a/15193026)

      np.savetxt('myfile.txt', np.c_[x,y,z])
      
* [Writing CSV files with NumPy and pandas](https://www.packtpub.com/mapt/book/big_data_and_business_intelligence/9781783553358/5/ch05lvl1sec50/writing-csv-files-with-numpy-and-pandas)

      The NumPy savetxt() function is the counterpart of the NumPy loadtxt() function and can save arrays in delimited file formats such as CSV. Save the array we created with the following function call:
      
      np.savetxt('np.csv', a, fmt='%.2f', delimiter=',', header=" #1,  #2,  #3,  #4")
      
* [**numpy.savetxt** *fmt* parameter ](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.savetxt.html)     

      d or i : signed decimal integer

      e or E : scientific notation with e or E.

      f : decimal floating point

      g,G : use the shorter of e,E or f

      o : signed octal

      s : string of characters

      u : unsigned decimal integer
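The tips above can be combined into one call. This sketch writes two made-up predictions to an in-memory buffer instead of a file, so the resulting text can be inspected directly.

```python
import io

import numpy as np

# Made-up passenger ids and predictions for illustration.
p_id = np.array([892, 893])
prediction = np.array([0, 1])

buffer = io.StringIO()
np.savetxt(
    buffer,
    np.c_[p_id, prediction],         # stack the two arrays as columns
    fmt='%d',                        # write both columns as integers
    delimiter=',',
    header='PassengerId,Survived',
    comments='',                     # no leading '# ' on the header line
)
csv_text = buffer.getvalue()
print(csv_text)
```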
     

In [None]:
if IS_MAX_SCORE_SUBMISSION:
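    best_model = model_with_max_score  # the top scorer, not the last model left over from the loop
    best_model.fit(training_data, survived)
    prediction = best_model.predict(testing_data)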
    np.savetxt('/content/my_google_drive/kaggle-titanic/results/submission.csv', np.c_[p_id, prediction], fmt=['%g', '%g'], delimiter=",", header="PassengerId,Survived", comments='')                
    
if FINAL:

    models = [
        KNeighborsClassifier(3),
        SVC(kernel="linear", C=0.025),
        SVC(gamma=2, C=1),
        DecisionTreeClassifier(max_depth=10),
        RandomForestClassifier(n_estimators=100),
        MLPClassifier(),
        AdaBoostClassifier(),
        GaussianNB(),
        QuadraticDiscriminantAnalysis()
    ]    
    
    for model in models:
        model.fit(training_data, survived)
        prediction = model.predict(testing_data)
        np.savetxt('/content/my_google_drive/kaggle-titanic/results/submission-{}.csv'.format(model), np.c_[p_id, prediction], fmt=['%g', '%g'], delimiter=",", header="PassengerId,Survived", comments='')                
    

# 5.  [Submit the Result with Kaggle CLI](https://www.kaggle.com/c/titanic/)

Use the official Kaggle CLI to make submissions to the Titanic Kaggle competition ***AFTER the COMPETITION RULES HAVE BEEN ACCEPTED***; otherwise the submission will fail.
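Before submitting, it can help to sanity-check the CSV locally. The following is a minimal sketch; `validate_submission` is a hypothetical helper that only checks the header and the value ranges written by this notebook, and it is demonstrated on a tiny throwaway file.

```python
import csv
import os
import tempfile

def validate_submission(path):
    """Return the number of data rows if the header is PassengerId,Survived
    and every Survived value is 0 or 1; raise ValueError otherwise."""
    with open(path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header != ['PassengerId', 'Survived']:
            raise ValueError(f'unexpected header: {header}')
        count = 0
        for row in reader:
            if len(row) != 2 or row[1] not in ('0', '1'):
                raise ValueError(f'bad row: {row}')
            int(row[0])  # raises ValueError if PassengerId is not an integer
            count += 1
    return count

# Demonstrated on a throwaway file; in the notebook, pass RESULT_CSV instead.
demo_csv = os.path.join(tempfile.mkdtemp(), 'submission.csv')
with open(demo_csv, 'w', newline='') as f:
    f.write('PassengerId,Survived\n892,0\n893,1\n')
n_rows = validate_submission(demo_csv)
print(n_rows)
```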


##### view previous submissions
using the official kaggle cli 

In [None]:
!kaggle competitions submissions -c $KAGGLE_COMPETITION_NAME

##### make submission to kaggle
using the official kaggle cli 

In [None]:
!kaggle competitions submit -c $KAGGLE_COMPETITION_NAME -f $RESULT_CSV  -m 'test kaggle cli 3'

##### confirm kaggle submission

In [None]:
!kaggle competitions submissions -c $KAGGLE_COMPETITION_NAME