### Get and Clean Data

In [1]:
import numpy as np
import pandas as pd

#get training data
data_train = pd.read_csv('train.csv')

data_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


From the Kaggle website, here is the meaning of each column:  

| Variable | Definition                                 | Key                       |
|----------|--------------------------------------------|---------------------------|
| survival | Survival                                   | 0 = No, 1 = Yes           |
| pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex      | Sex                                        |                           |
| Age      | Age in years                               |                           |
| sibsp    | # of siblings / spouses aboard the Titanic |                           |
| parch    | # of parents / children aboard the Titanic |                           |
| ticket   | Ticket number                              |                           |
| fare     | Passenger fare                             |                           |
| cabin    | Cabin number                               |                           |
| embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |


Next, we will clean up the data a bit. I will remove the name, cabin, and ticket columns and convert the sex column to either a 1 (female) or 0 (male). I will also put ages and fares into groups and convert the port of embarkation to an integer. 

In [2]:
#drop columns
data_train = data_train.drop(["Name", "Cabin", "Ticket"], axis = 1)

#group people by ages
data_train.Age = data_train.Age.fillna(-0.5)
bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
group_nums = [0, 1, 2, 3, 4, 5, 6, 7]
categories = pd.cut(data_train.Age, bins, labels=group_nums)
data_train.Age = categories

#group people by fare
data_train.Fare = data_train.Fare.fillna(10)

bins = (-1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 50000)
group_nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
categories = pd.cut(data_train.Fare, bins, labels=group_nums)
data_train.Fare = categories

#convert sex to integers
data_train['Sex'] = data_train['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

#convert embarked to an integer
freq_port = data_train.Embarked.dropna().mode()[0]
data_train['Embarked'] = data_train['Embarked'].fillna(freq_port)
data_train['Embarked'] = data_train['Embarked'].map( {'C': 0, 'Q': 1, 'S': 2} ).astype(int)

#display
data_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,0,4,1,0,1,2
1,2,1,1,1,6,1,0,8,0
2,3,1,3,1,5,0,0,1,2
3,4,1,1,1,5,1,0,6,2
4,5,0,3,0,5,0,0,1,2


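For the age column, 0 is an unknown age, 1 is a baby (up to 5), 2 is a child (over 5, up to 12), 3 is a teenager (over 12, up to 18), 4 is a young adult (over 18, up to 25), 5 is an adult over 25 and up to 35, 6 is an adult over 35 and up to 60, and 7 is a senior (over 60). 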

For the fare column, a value x from 1 to 10 means the person paid a fare of at most x*10 (and more than the previous group's limit); a value of 11 means the fare was above 100. 
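To see how the bins map raw values to group numbers, here is a minimal sketch on a few made-up ages (the sample values are hypothetical, not taken from `train.csv`). Because `pd.cut` uses right-closed intervals by default, an age of exactly 5 falls in group 1 and an age of exactly 18 falls in group 3.

```python
import pandas as pd

# Hypothetical ages on and around the bin edges; None stands for a missing age
ages = pd.Series([0.5, 5, 11, 18, 22, 30, 47, 71, None])

# Same treatment as in the cell above: missing ages become -0.5,
# so they land in the (-1, 0] bin, which is group 0
ages = ages.fillna(-0.5)
bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
group_nums = [0, 1, 2, 3, 4, 5, 6, 7]
age_groups = pd.cut(ages, bins, labels=group_nums)

print(age_groups.tolist())
```

The same pattern applies to the fare bins, only with different edges and labels.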

### Logistic Regression on Data

In [3]:
data_train.sample(200)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
193,194,1,2,0,1,1,1,3,2
703,704,0,3,0,4,0,0,1,1
556,557,1,1,1,6,1,0,4,0
325,326,1,1,1,6,0,0,11,0
270,271,0,1,0,0,0,0,4,2
770,771,0,3,0,4,0,0,1,2
110,111,0,1,0,6,0,0,6,2
419,420,0,3,1,2,0,2,3,2
638,639,0,3,1,6,0,5,4,2
370,371,1,1,0,4,1,0,6,0


In [4]:
data_train.insert(0, 'Ones', 1) # add column of ones for matrix multiplication later
X = data_train.drop(["Survived"], axis = 1)
Y = data_train["Survived"]
X = np.matrix(X.values) 
Y = np.matrix(Y.values) 
theta = np.zeros(X.shape[1])
theta = np.matrix(theta)

In [5]:
print(X.shape)
print(Y.shape)
print(theta.shape)

(891, 9)
(1, 891)
(1, 9)


In logistic regression we want the hypothesis to produce values between 0 and 1. For this, the hypothesis is:
![h(x) = g(theta*x)](hypothesis.png)

Where:
![g(z) = 1 over 1 + e to the negative z](sigmoid.png)

The hypothesis uses the sigmoid function to return a value between 0 and 1.  

In [6]:
import scipy.special as sp
# compute sigmoid function
def sigmoid(z):
    l = z.astype(float) #explicitly make each entry a float
    #return 1 / (1 + np.exp(-l))
    return sp.expit(l)
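As a quick sanity check, the sketch below calls `scipy.special.expit` directly on a few hand-picked inputs (chosen only for illustration). It shows the midpoint value of 0.5 at z = 0, and that in floating point very large negative or positive inputs round to exactly 0.0 or 1.0.

```python
import numpy as np
import scipy.special as sp

# Hand-picked inputs: a very negative value, zero, and a very positive value
z = np.array([-800.0, 0.0, 800.0])
values = sp.expit(z)

print(values)
```

That rounding matters later: the cost function takes the log of the hypothesis and of one minus the hypothesis, and the log of exactly 0 is undefined.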

Next, we want to compute a cost function. The cost function measures the error between the predicted value and the actual value. By minimizing the cost function, we minimize the error (in other words, we get better predictions!)

The cost function is: 
![cost-function](cost-function.png)

where the superscript i represents the ith example, and m is the total number of examples. 

In [16]:
def cost_function(theta, X, Y):
    '''Determine the cost. 
    theta is an array holding parameters.
    X is the input features. 
    Y is the actual classification. '''
    np.seterr(divide='raise') #raise an error instead of a warning on log(0)
    theta = np.matrix(theta)
    hypothesis = sigmoid(X * theta.T)
    
    #Y is 1 x m and hypothesis is m x 1, so each product below is a 1 x 1 matrix
    first_half = -Y * np.log(hypothesis)
    second_half = (1 - Y) * np.log(1 - hypothesis)
    
    overall_cost = (first_half - second_half) / X.shape[0]
    #this is a vectorized implementation of the cost function
    #therefore 'overall_cost' is a single element matrix with the total cost
    
    #this next line just converts it to a scalar value
    return np.sum(overall_cost) 
print(cost_function(theta, X, Y))

0.69314718056
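The value 0.693... is what the formula predicts for an all-zero theta: every hypothesis is g(0) = 0.5, so each example contributes -log(0.5) = log(2) regardless of its label. A minimal sketch on made-up labels (hypothetical, not the Titanic data) confirms this:

```python
import numpy as np

# Hypothetical labels; with theta = 0 the features do not matter
y = np.array([0, 1, 1, 0, 1])
h = np.full(y.shape, 0.5)  # sigmoid(0) for every example

# Average cross-entropy cost, the same formula as cost_function above
cost = np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

print(cost, np.log(2))
```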


The cost function calculates the error of a given theta. We want to minimize the cost (i.e. the error) as much as possible and find the best possible theta parameters. 

Therefore, for each parameter, we calculate the gradient and then adjust the parameter by that gradient. 

NOTE: Remember that the cost function is convex, so its only local minimum is also the global minimum. The gradient points in the direction of steepest increase, so by moving in the direction opposite to the gradient, we move closer to the minimum. 

To calculate the gradient for parameter j, we use:
![gradient](gradient.png)

In [17]:
def gradient(theta, X, Y):
    theta = np.matrix(theta)
    hypothesis = sigmoid(X * theta.T)
    
    #Y is 1 x m, so transpose it to match the m x 1 hypothesis;
    #without the transpose, the subtraction broadcasts to an m x m matrix
    error = hypothesis - Y.T
    
    grad = np.zeros(X.shape[1])
    
    #calculate the gradient for each parameter
    for i in range(X.shape[1]):
        #np.multiply is elementwise; '*' on matrices would be a matrix product
        grad[i] = np.sum(np.multiply(error, X[:,i])) / X.shape[0]
        
    return np.matrix(grad)

The above code does not execute gradient descent. Rather, it computes the gradient used in one step of gradient descent. 
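For illustration, repeating that step in a loop gives plain batch gradient descent. The sketch below uses ordinary NumPy arrays rather than `np.matrix`, a tiny made-up dataset, and a learning rate and step count (`alpha`, `n_steps`) picked arbitrarily; none of these come from the notebook.

```python
import numpy as np
from scipy.special import expit

def cost(theta, X, y):
    """Average cross-entropy cost for logistic regression."""
    h = expit(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def grad(theta, X, y):
    """Gradient of the cost with respect to theta (vectorized)."""
    return X.T @ (expit(X @ theta) - y) / X.shape[0]

# Hypothetical toy data: a column of ones plus one feature.
# The classes overlap, so the optimal theta is finite.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.5],
              [1.0, 1.0], [1.0, 2.0], [1.0, -0.5]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

theta = np.zeros(X.shape[1])
alpha = 0.1     # learning rate (assumed value)
n_steps = 500   # number of gradient steps (assumed value)

initial_cost = cost(theta, X, y)
for _ in range(n_steps):
    # Step against the gradient to reduce the cost
    theta = theta - alpha * grad(theta, X, y)
final_cost = cost(theta, X, y)

print(initial_cost, final_cost)
```

A hand-written loop like this needs a suitable learning rate and stopping rule, which is one reason to hand the problem to a library optimizer instead.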

To minimize the cost, we will use an optimization function from the scipy library (`fmin_tnc`, a truncated Newton method rather than plain gradient descent), which will repeatedly call the cost function, trying to minimize it, while using the gradient function to determine how to adjust theta. 

In [18]:
import scipy.optimize as opt
opt_theta = opt.fmin_tnc(func=cost_function, x0=theta, fprime=gradient, args = (X, Y))

cost_function(opt_theta[0], X, Y)

FloatingPointError: divide by zero encountered in log
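
An error like this arises when `np.log` receives exactly 0, which in `cost_function` means the hypothesis has rounded to 0.0 or 1.0; the `np.seterr(divide='raise')` call then turns what would be a warning into an exception. One common workaround, sketched here as an assumption rather than as the notebook's own fix, is to clip the hypothesis away from 0 and 1 before taking logs. The function name `safe_cost` and the tolerance `eps` are hypothetical.

```python
import numpy as np
from scipy.special import expit

def safe_cost(theta, X, y, eps=1e-12):
    """Cross-entropy cost with the hypothesis clipped into [eps, 1 - eps]."""
    h = np.clip(expit(X @ theta), eps, 1 - eps)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

# Hypothetical inputs large enough to saturate the sigmoid
X = np.array([[1.0, 800.0], [1.0, -800.0]])
y = np.array([0.0, 1.0])
theta = np.array([0.0, 1.0])

result = safe_cost(theta, X, y)
print(np.isfinite(result))
```

Large raw feature values make saturation more likely, so scaling the features or dropping identifier columns before fitting may also help; whether that resolves the error here has not been checked.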