### Get and Clean Data

In [14]:
import numpy as np
import pandas as pd

#get training data
data_train = pd.read_csv('train.csv')

data_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


From the Kaggle website, here is the meaning of each column:

| Variable | Definition                                 | Key                       |
|----------|--------------------------------------------|---------------------------|
| survival | Survival                                   | 0 = No, 1 = Yes           |
| pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex      | Sex                                        |                           |
| Age      | Age in years                               |                           |
| sibsp    | # of siblings / spouses aboard the Titanic |                           |
| parch    | # of parents / children aboard the Titanic |                           |
| ticket   | Ticket number                              |                           |
| fare     | Passenger fare                             |                           |
| cabin    | Cabin number                               |                           |
| embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |


Next, we will clean up the data a bit. I will remove the name, cabin, and ticket columns and convert the sex column to either a 1 (female) or 0 (male). I will also put ages and fares into groups and convert the port of embarkation to an integer.

In [15]:
#drop columns
data_train = data_train.drop(["Name", "Cabin", "Ticket"], axis = 1)

#group people by ages
data_train.Age = data_train.Age.fillna(-0.5)
bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
group_nums = [0, 1, 2, 3, 4, 5, 6, 7]
categories = pd.cut(data_train.Age, bins, labels=group_nums)
data_train.Age = categories

#group people by fare
bins = (0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
group_nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
categories = pd.cut(data_train.Fare, bins, labels=group_nums)
data_train.Fare = categories

#convert sex to integers
data_train['Sex'] = data_train['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

#convert embarked to an integer
freq_port = data_train.Embarked.dropna().mode()[0]
data_train['Embarked'] = data_train['Embarked'].fillna(freq_port)
data_train['Embarked'] = data_train['Embarked'].map( {'C': 0, 'Q': 1, 'S': 2} ).astype(int)

#display
data_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,0,4,1,0,1,2
1,2,1,1,1,6,1,0,8,0
2,3,1,3,1,5,0,0,1,2
3,4,1,1,1,5,1,0,6,2
4,5,0,3,0,5,0,0,1,2


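For the age column, 0 is an unknown age, 1 is a baby (up to 5), 2 is a child (over 5, up to 12), 3 is a teenager (over 12, up to 18), 4 is a young adult (over 18, up to 25), 5 is an adult of over 25, up to 35, 6 is an adult of over 35, up to 60, and 7 is a senior (over 60).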
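The bin edges in the cleaning cell decide which group each age lands in. By default `pd.cut` closes each interval on the right, so an age of exactly 18 falls in group 3 rather than group 4. Here is a minimal sketch that reuses the same bins and labels on a few made-up ages (the `sample_ages` values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Same bin edges and labels as the cleaning cell.
bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
group_nums = [0, 1, 2, 3, 4, 5, 6, 7]

# Hypothetical sample ages; -0.5 stands in for a missing age,
# matching the fillna(-0.5) step in the cleaning cell.
sample_ages = pd.Series([-0.5, 4, 12, 18, 22, 38, 61])
age_groups = pd.cut(sample_ages, bins, labels=group_nums)

print(age_groups.tolist())  # [0, 1, 2, 3, 4, 6, 7]
```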

For the fare column, a value of x means the person paid a fare greater than (x - 1)*10 and at most x*10.
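One thing worth checking is what happens to fares outside the bin range. With the bins used in the cleaning cell, `pd.cut` returns NaN for a fare of exactly 0 and for any fare above 100. The sketch below shows this on a few made-up fares, followed by one possible variant with an open-ended top bin. The variant is an assumption for illustration, not the method used in this notebook, and `sample_fares` and `open_bins` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Same bin edges and labels as the cleaning cell.
bins = (0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
group_nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Hypothetical sample fares, including values at and beyond the bin edges.
sample_fares = pd.Series([0.0, 7.25, 10.0, 71.2833, 512.3292])
fare_groups = pd.cut(sample_fares, bins, labels=group_nums)
print(fare_groups.tolist())  # first and last entries are NaN

# Possible variant (an assumption): include 0 in the first bin and
# fold every fare above 90 into group 10.
open_bins = (0, 10, 20, 30, 40, 50, 60, 70, 80, 90, np.inf)
open_groups = pd.cut(sample_fares, open_bins, labels=group_nums,
                     include_lowest=True)
print(open_groups.tolist())  # [1, 1, 1, 8, 10]
```

Whether to widen the bins or handle the NaN values separately depends on how the model should treat very cheap and very expensive tickets.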

### Logistic Regression on Data

In [16]:
data_train = data_train.values #convert to numpy ndarray

In logistic regression we want the hypothesis to output values between 0 and 1. For this, the hypothesis is:
![h(x) = g(theta*x)](hypothesis.png)

Where:
![g(z) = 1 over 1 + e to the negative z](sigmoid.png)



In [None]:
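import numpy as np
# compute sigmoid function; np.exp accepts a scalar or an ndarray,
# whereas math.exp only accepts a scalar and raises OverflowError
# for large negative z
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-1.0 * z))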
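To connect the sigmoid back to the hypothesis h(x) = g(theta*x), here is a minimal NumPy sketch that evaluates the hypothesis for several rows at once. The names `hypothesis`, `theta`, and `X` are illustrative assumptions and do not come from the notebook. The example matrix is made up, with a leading column of ones for the intercept term.

```python
import numpy as np

def sigmoid(z):
    # Element-wise sigmoid; works for scalars and ndarrays.
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    # h(x) = g(theta * x), computed for every row of X at once.
    return sigmoid(X @ theta)

# Hypothetical feature matrix: two rows, intercept column first.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 2.0, -1.0]])
theta = np.zeros(3)  # all-zero parameters as a starting point

print(hypothesis(theta, X))  # every entry is 0.5, since g(0) = 0.5
```

With all parameters at zero the model predicts 0.5 for every passenger, which is a handy sanity check before fitting.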

