# Our first Kaggle Competition.

For this assignment we are going to make our first submission on Kaggle for [the Titanic Dataset Competition](https://www.kaggle.com/c/titanic/data).

Kaggle is a website where data scientists compete in data science competitions; the goal is to provide the best predictions for a specific dataset. Companies launch these competitions and usually give substantial rewards (on the order of thousands of dollars).

For the Titanic competition, the dataset has information about the passengers who were aboard the Titanic on its first (and last) trip.

The target variable is whether the passenger survived when the ship sank.
You can download the competition data (and check the data dictionary) on [Kaggle](https://www.kaggle.com/c/titanic/data).

You will use the training data (file `train.csv`) to train your classifier, and will create submissions for the `test.csv`. 

Basically, you have to submit a CSV file in the following format:

```
PassengerId,Survived
892,0
893,1
894,1
895,0
...
```

Here `PassengerId` holds the ids of the passengers in the `test.csv` dataset and `Survived` is your model's prediction for that passenger (0 = died, 1 = survived).
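As a quick illustration of that format, the sketch below writes a tiny submission with Python's standard `csv` module. The `write_submission` helper and the example predictions are assumptions made for illustration, not part of the competition material.

```python
import csv
import io

def write_submission(passenger_ids, predictions, handle):
    """Write PassengerId,Survived rows to an open text handle (hypothetical helper)."""
    if len(passenger_ids) != len(predictions):
        raise ValueError("need exactly one prediction per passenger")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for pid, pred in zip(passenger_ids, predictions):
        # int() guards against values such as 1.0 ending up in the file
        writer.writerow([int(pid), int(pred)])

# Made-up predictions for the first four test ids
buffer = io.StringIO()
write_submission([892, 893, 894, 895], [0, 1, 1, 0], buffer)
print(buffer.getvalue())
```

Passing an open file instead of the `StringIO` buffer would write the same rows to disk.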

In order to submit a file you have to create a profile on the website. Then you can upload the submission using the website or the [Kaggle API](https://github.com/Kaggle/kaggle-api).

In [1]:
from sklearn.linear_model import LogisticRegression
titanic = LogisticRegression()

In [75]:
import pandas as pd
import warnings
warnings.simplefilter("ignore")

In [46]:
train = pd.read_csv('titanic_data/train.csv')
train[train['Survived'] == 0].head(20)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.05,,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S
16,17,0,3,"Rice, Master. Eugene",male,2.0,4,1,382652,29.125,,Q
18,19,0,3,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female,31.0,1,0,345763,18.0,,S


Encode `Sex` as a number: male = 0, female = 1.

In [14]:
# Restrict the replacement to the Sex column so no other column can be affected
train1 = train.replace({'Sex': {'male': 0, 'female': 1}})
train1.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S


Missing values (`NaN`) in `Age` are replaced with the median age.

In [64]:
train1['Age'].median()  # NaN values are skipped by default

28.0

In [65]:
train1['Age'].mean()  # NaN values are skipped by default

29.36158249158249
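The median is less sensitive than the mean to a few unusually high ages, which is a common reason to prefer it when filling gaps. A minimal sketch on made-up ages (the `ages` values are assumptions, not rows from `train.csv`):

```python
import numpy as np
import pandas as pd

# Toy ages with two missing values and one high value
ages = pd.Series([22.0, 38.0, np.nan, 26.0, 80.0, np.nan])

# median() and mean() skip NaN by default
median_age = ages.median()
mean_age = ages.mean()

# Fill the gaps with the median of the observed ages
filled = ages.fillna(median_age)
print(median_age, mean_age)
print(filled.tolist())
```

Here the single age of 80 pulls the mean well above the median, so the two choices would fill the gaps with noticeably different values.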

In [80]:
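# .copy() gives independent frames, so adding columns later does not touch train1
survived = train1[train1['Survived'] == 1].copy()
died = train1[train1['Survived'] == 0].copy()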
survived.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,14.0,1,0,237736,30.0708,,C


In [68]:
survived.Pclass.value_counts(normalize=True)

1    0.397661
3    0.347953
2    0.254386
Name: Pclass, dtype: float64

In [69]:
survived.Sex.value_counts(normalize=True)

1    0.681287
0    0.318713
Name: Sex, dtype: float64

In [92]:
died.Sex.value_counts(normalize=True)

0    0.852459
1    0.147541
Name: Sex, dtype: float64
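The two tables above give the share of each sex among survivors and among those who died. The complementary question, what fraction of each sex survived, can be answered with a group-wise mean. A sketch on a made-up frame (the rows below are illustrative, not taken from `train.csv`):

```python
import pandas as pd

# Toy data using the same encoding as above: 0 = male, 1 = female
toy = pd.DataFrame({
    "Sex":      [0, 0, 0, 0, 1, 1, 1, 1],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Survived is 0/1, so its mean within each group is that group's survival rate
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)
```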

In [71]:
survived.Parch.value_counts(normalize=True)

0    0.681287
1    0.190058
2    0.116959
3    0.008772
5    0.002924
Name: Parch, dtype: float64

In [72]:
max_age=survived.Age.max()
min_age=survived.Age.min()
print('Max age: {}\nMin age: {}'.format(max_age, min_age))

Max age: 80.0
Min age: 0.42


In [85]:
# Default label covers ages up to 12 (and any missing ages); the masks below overwrite it
survived['Age_type'] = 'child/baby'
# .loc assigns in one step and avoids chained assignment
survived.loc[(survived['Age'] > 12) & (survived['Age'] <= 18), 'Age_type'] = 'teen'
survived.loc[(survived['Age'] > 18) & (survived['Age'] <= 30), 'Age_type'] = 'young adult'
survived.loc[(survived['Age'] > 30) & (survived['Age'] <= 60), 'Age_type'] = 'adult'
survived.loc[survived['Age'] > 60, 'Age_type'] = 'senior'
survived.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_type
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,adult
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,young adult
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,adult
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,27.0,0,2,347742,11.1333,,S,young adult
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,14.0,1,0,237736,30.0708,,C,teen


In [81]:
# Same age bands as for the survivors, assigned with .loc instead of chained assignment
died['Age_type'] = 'child/baby'
died.loc[(died['Age'] > 12) & (died['Age'] <= 18), 'Age_type'] = 'teen'
died.loc[(died['Age'] > 18) & (died['Age'] <= 30), 'Age_type'] = 'young adult'
died.loc[(died['Age'] > 30) & (died['Age'] <= 60), 'Age_type'] = 'adult'
died.loc[died['Age'] > 60, 'Age_type'] = 'senior'

In [91]:
len(train1)

891

In [90]:
survived.Age_type.value_counts()

young adult    148
adult          119
child/baby      40
teen            30
senior           5
Name: Age_type, dtype: int64

In [88]:
died.Age_type.value_counts()

young adult    299
adult          164
teen            40
child/baby      29
senior          17
Name: Age_type, dtype: int64
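The same age bands can also be expressed with `pd.cut`, which assigns each value to a right-inclusive interval. This is a sketch of an alternative, not the code used above; note that `pd.cut` leaves missing ages as `NaN` rather than labelling them `child/baby`. The sample ages are made up.

```python
import numpy as np
import pandas as pd

# Bin edges give (0, 12], (12, 18], (18, 30], (30, 60], (60, inf);
# an age of exactly 0 would fall outside the first bin
bins = [0, 12, 18, 30, 60, np.inf]
labels = ["child/baby", "teen", "young adult", "adult", "senior"]

ages = pd.Series([0.42, 14.0, 27.0, 38.0, 80.0, np.nan])
age_type = pd.cut(ages, bins=bins, labels=labels)
print(age_type.tolist())
```

Calling `age_type.value_counts()` then gives the per-band counts in a single step.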

In [15]:
train1['Age'] = train1['Age'].fillna(train1['Age'].median())

In [18]:
train1.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",0,28.0,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",0,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",0,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,14.0,1,0,237736,30.0708,,C


In [102]:
X_train = train1[['Sex','Age','SibSp','Parch']]
# Use a 1-D Series as the target; a one-column DataFrame triggers a shape warning in scikit-learn
y_train = train1['Survived']

In [52]:
test = pd.read_csv('titanic_data/test.csv')
test.head(10)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.225,,S
6,898,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
7,899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0,,S
8,900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
9,901,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.15,,S


In [101]:
test1 = test.replace({'Sex': {'male': 0, 'female': 1}})
test1['Age'] = test1['Age'].fillna(test1['Age'].median())
X_test = test1[['Sex','Age','SibSp','Parch']]
X_test.head()

Unnamed: 0,Sex,Age,SibSp,Parch
0,0,34.5,0,0
1,1,47.0,1,0
2,0,62.0,0,0
3,0,27.0,0,0
4,1,22.0,1,1


In [103]:
titanic.fit(X_train, y_train)
predictions = titanic.predict(X_test)

predictions[:10]

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

In [36]:
test1['Survived'] = predictions

In [37]:
test1.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,892,3,"Kelly, Mr. James",0,34.5,0,0,330911,7.8292,,Q,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",1,47.0,1,0,363272,7.0,,S,0
2,894,2,"Myles, Mr. Thomas Francis",0,62.0,0,0,240276,9.6875,,Q,0
3,895,3,"Wirz, Mr. Albert",0,27.0,0,0,315154,8.6625,,S,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",1,22.0,1,1,3101298,12.2875,,S,1


In [104]:
pred_prob = titanic.predict_proba(X_test)
pred_prob[:10]

array([[0.79503807, 0.20496193],
       [0.31922266, 0.68077734],
       [0.8437458 , 0.1562542 ],
       [0.77994522, 0.22005478],
       [0.27010978, 0.72989022],
       [0.75193672, 0.24806328],
       [0.22015433, 0.77984567],
       [0.83482998, 0.16517002],
       [0.19637337, 0.80362663],
       [0.85802998, 0.14197002]])

In [48]:
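import numpy as np
def probabilities_to_classes(prediction_probabilities, threshold=0.5):
    # Column 1 holds the probability of class 1 (survived)
    # Note: np.zeros returns floats, so cast the result to int before submitting
    predictions = np.zeros(len(prediction_probabilities))
    predictions[prediction_probabilities[:, 1] >= threshold] = 1
    return predictions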

In [105]:
pred2 = probabilities_to_classes(pred_prob, threshold=0.6)

In [106]:
test1['Survived'] = pred2
test1.Survived = test1.Survived.astype(int)
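Thresholding and the integer cast can also be done in one step by comparing the positive-class column with the threshold. A sketch using the ten probabilities printed above (the function name `classes_from_probabilities` is hypothetical):

```python
import numpy as np

def classes_from_probabilities(probabilities, threshold=0.5):
    """Return integer 0/1 labels from an (n, 2) array of class probabilities."""
    # Column 1 is the probability of class 1; the boolean mask becomes 0/1 after the cast
    return (probabilities[:, 1] >= threshold).astype(int)

pred_prob = np.array([
    [0.79503807, 0.20496193],
    [0.31922266, 0.68077734],
    [0.8437458, 0.1562542],
    [0.77994522, 0.22005478],
    [0.27010978, 0.72989022],
    [0.75193672, 0.24806328],
    [0.22015433, 0.77984567],
    [0.83482998, 0.16517002],
    [0.19637337, 0.80362663],
    [0.85802998, 0.14197002],
])

labels = classes_from_probabilities(pred_prob, threshold=0.6)
print(labels)
```

Because the result is already an integer array, no separate `astype(int)` step is needed before building the submission.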

In [107]:
submit = test1[['PassengerId','Survived']]
submit.head(20)

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
5,897,0
6,898,1
7,899,0
8,900,1
9,901,0


In [108]:
submit.to_csv('titanic_submit.csv', index=False)

# Submissions

First

* Used logistic regression with default parameters; Pclass, Sex, Age, SibSp, Parch as independent variables
* Score: 0.74162

Second

* Same model and features as the first submission; changed the threshold to 0.6
* Score: 0. I didn't change the predictions from floats to ints... (facepalm)

Third

* Cast the predictions from floats to ints
* Score: 0.78468
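
The difference between the second and third submissions shows up in the file itself: a float column is written as `0.0`/`1.0`, while an integer column is written as `0`/`1`. A small sketch with made-up predictions:

```python
import numpy as np
import pandas as pd

# Float labels, as produced by code that starts from np.zeros
preds = np.array([0.0, 1.0, 1.0])
frame = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": preds})

# Same data written twice: once as floats, once after casting to int
as_float = frame.to_csv(index=False)
as_int = frame.astype({"Survived": int}).to_csv(index=False)
print(as_float)
print(as_int)
```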