# Titanic Prediction Problem Example
______________________________________________________________________________________________________

In [9]:
import matplotlib.pyplot as plt 
import numpy as np 
import pandas as pd 
import sklearn.linear_model


trainSet = pd.read_csv(r"Titanic\Data\train.csv")

**The above code imports necessary libraries and reads in values from a .csv file and loads them into the _DataFrame type_ defined within the pandas library.**

Entries are accessed by indexing on the column names of the csv file. For example, trainSet['Age'] accesses the column of the csv containing all of the ages.
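As a small illustration of column access, the sketch below builds a toy DataFrame in memory and selects first one column and then several. The rows are invented for the example and are not taken from train.csv.

```python
import pandas as pd

# Toy stand-in for the Titanic data; these rows are invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 3],
})

ages = toy["Age"]                # a single column name returns a pandas Series
subset = toy[["Age", "Pclass"]]  # a list of column names returns a DataFrame

print(type(ages).__name__)  # Series
print(subset.shape)         # (3, 2)
```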

In [10]:
# Drop vals
# trainSet.dropna(inplace=True)

# Replace vals with mean
trainSet.fillna(trainSet.mean(numeric_only=True), inplace=True)

**When the csv is read, entries without a value are loaded as _NaN_, which will cause problems in training if not handled correctly.**

There are two main options here: the **dropna** function, which removes rows that contain _NaN_ values, or the **fillna** function, which replaces these values with a preset value (e.g. the column average).
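The difference between the two options can be seen on a toy DataFrame with one missing age. This is only a sketch with invented values; the variable names dropped and filled are chosen for the example.

```python
import numpy as np
import pandas as pd

# Invented data with a single missing Age value.
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Pclass": [1, 2, 3]})

# Option 1: discard every row that contains a NaN.
dropped = toy.dropna()

# Option 2: keep all rows and replace each NaN with its column mean.
filled = toy.fillna(toy.mean(numeric_only=True))

print(len(dropped))            # 2
print(filled["Age"].tolist())  # [20.0, 30.0, 40.0]
```

Dropping rows shrinks the training set, while filling keeps every passenger at the cost of introducing an estimated value.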

In [11]:
# Map version
#trainSet['Sex'] = list(map(lambda x: 1 if x == 'male' else 0, trainSet['Sex']))

# List comprehension version
trainSet['Sex'] = [1 if i == 'male' else 0 for i in trainSet['Sex']]

**This code prepares the data in the column which lists passengers in terms of gender. In order to train the logistic regression model, this column should be converted into numerical data.**

The two methods shown here are equally valid ways of completing that task; however, the second option, which uses a list comprehension, is more readable than the first.

_Note: A for loop can also be used, but it is slower than a list comprehension and does not provide a significant increase in readability in this particular instance_
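The sketch below runs both approaches on a toy column to show that they produce the same result. It also includes a third variant using the pandas Series.map method with a dictionary. That variant is an alternative added here for comparison and is not part of the original notebook.

```python
import pandas as pd

# Invented values standing in for the 'Sex' column.
sex = pd.Series(["male", "female", "female", "male"])

# map returns a lazy iterator in Python 3, so it is wrapped in list().
via_map = list(map(lambda x: 1 if x == "male" else 0, sex))

# List comprehension version.
via_comprehension = [1 if s == "male" else 0 for s in sex]

# pandas alternative (not in the original notebook): look up each value in a dict.
via_series_map = sex.map({"male": 1, "female": 0}).tolist()

print(via_comprehension)  # [1, 0, 0, 1]
```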

In [12]:
X = np.array(trainSet[['Age','Sex','Pclass']]).reshape(-1,3)
y = trainSet['Survived']
model = sklearn.linear_model.LogisticRegression()
model.fit(X,y)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

**The selected columns are converted into a 2-D NumPy array, with one row per passenger and one column per feature, so that they can be passed into the learning function.**

The learning function in this model is **Logistic Regression**.
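To make the shapes concrete, the sketch below uses a two-row toy DataFrame with invented values. Selecting three columns already gives an array of shape (n_samples, 3), so reshape(-1,3) leaves it unchanged. Reshaping does matter when a single sample is taken out of the array, because that sample is 1-D.

```python
import numpy as np
import pandas as pd

# Invented rows; 'Sex' is already encoded as a number here.
toy = pd.DataFrame({"Age": [22.0, 38.0], "Sex": [1, 0], "Pclass": [3, 1]})

X = np.array(toy[["Age", "Sex", "Pclass"]])
print(X.shape)                 # (2, 3)
print(X.reshape(-1, 3).shape)  # (2, 3)

# A single sample is 1-D and must be reshaped into one row of features.
single = X[0]
row = single.reshape(1, -1)
print(single.shape)            # (3,)
print(row.shape)               # (1, 3)
```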


In [13]:
testSet = pd.read_csv(r"Titanic\Data\test.csv")
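# Fill missing test values with the training-set means
testSet.fillna(trainSet.mean(numeric_only=True), inplace=True)
# map returns an iterator in Python 3, so convert it to a list before assigning
testSet['Sex'] = list(map(lambda x: 1 if x == 'male' else 0, testSet['Sex']))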
X_test = np.array(testSet[['Age','Sex','Pclass']]).reshape(-1,3)


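**This code prepares the testing data in the same way as the training data: missing values are filled with the training-set means and the Sex column is converted to numbers.**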
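One way to keep the training and testing preparation in step is to put the shared steps in a single function. The sketch below is an assumption about how that could look, not the notebook's own code. The helper name prepare and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

def prepare(df, fill_values):
    """Hypothetical helper: fill NaNs, encode 'Sex', return the feature array."""
    out = df.fillna(fill_values)  # returns a new DataFrame, input is untouched
    out["Sex"] = [1 if s == "male" else 0 for s in out["Sex"]]
    return np.array(out[["Age", "Sex", "Pclass"]])

# Invented rows standing in for train.csv and test.csv.
train = pd.DataFrame({"Age": [20.0, 40.0], "Sex": ["male", "female"], "Pclass": [3, 1]})
test = pd.DataFrame({"Age": [np.nan, 50.0], "Sex": ["female", "male"], "Pclass": [2, 3]})

# The means come from the training data and are reused for the test data.
train_means = train.mean(numeric_only=True)
X_train = prepare(train, train_means)
X_test = prepare(test, train_means)

print(X_test.tolist())  # [[30.0, 0.0, 2.0], [50.0, 1.0, 3.0]]
```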

In [14]:
dataList = []
for item in X_test:
    res = model.predict(item.reshape(1,-1))
    dataList.append(*res)

**This code iterates over each sample and uses the previously trained model to make a list of predictions.**

The **\*** in front of res unpacks the single-element array returned by predict, so the bare value is appended to the list (i.e. [1] becomes 1).
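The predict method of a scikit-learn estimator also accepts all samples at once, which avoids the explicit loop. The sketch below compares the two approaches on a tiny invented data set with a single feature. It is not the Titanic model, and the names looped and batched are chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented toy data: one feature, two well-separated classes.
X_train = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X_train, y_train)

X_new = np.array([[0.5], [9.5]])

# Loop version, as in the cell above: one predict call per sample.
looped = []
for item in X_new:
    res = model.predict(item.reshape(1, -1))  # array holding one prediction
    looped.append(*res)                       # same effect as looped.append(res[0])

# Batched version: one predict call for every row.
batched = model.predict(X_new).tolist()

print(looped == batched)
```

Both versions give the same predictions. The batched call is shorter and avoids the per-row overhead of calling predict.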

In [16]:
dataDict = {'PassengerId': testSet['PassengerId'], 'Survived': dataList}

dataDF = pd.DataFrame.from_dict(dataDict)

dataDF.to_csv(path_or_buf=r"Titanic\Data\predictions.csv",mode='w',index=False)

print(dataDF)

     PassengerId  Survived
0            892         0
1            893         0
2            894         0
3            895         0
4            896         1
5            897         0
6            898         1
7            899         0
8            900         1
9            901         0
10           902         0
11           903         0
12           904         1
13           905         0
14           906         1
15           907         1
16           908         0
17           909         0
18           910         1
19           911         0
20           912         0
21           913         0
22           914         1
23           915         0
24           916         1
25           917         0
26           918         1
27           919         0
28           920         0
29           921         0
..           ...       ...
388         1280         0
389         1281         0
390         1282         0
391         1283         1
392         1284         0

**This takes the predictions that were generated and writes them to a csv file.**
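The effect of index=False can be checked without touching the file system by writing to an in-memory buffer and reading the result back. The sketch below assumes a three-row prediction table with invented values. The io.StringIO buffer stands in for the file path used above.

```python
import io
import pandas as pd

# Invented predictions in the same two-column layout.
preds = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 0, 1]})

buffer = io.StringIO()             # in-memory stand-in for a file path
preds.to_csv(buffer, index=False)  # index=False omits the row-number column
print(buffer.getvalue())

# Read the text back to confirm that the round trip preserves the table.
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.equals(preds))
```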