Tutorials:

* https://www.kaggle.com/c/titanic/details/getting-started-with-python
* https://www.kaggle.com/c/titanic/details/getting-started-with-python-ii
* https://www.dataquest.io/mission/74/getting-started-with-kaggle
* https://www.dataquest.io/mission/75/improving-your-submission

    

While the book provides a good introduction to a variety of topics, we thought another helpful exercise would be to analyze a dataset. Since our textbook omits pandas, we used the exercise to explore pandas as well. All of the tutorials listed above use the [Titanic dataset](https://www.kaggle.com/c/titanic/data) from Kaggle. The goal is to predict which passengers survived the Titanic disaster. We'll use two files for this, `train.csv` and `test.csv`. The training set includes whether each passenger survived or not.

The [second Kaggle tutorial](https://www.kaggle.com/c/titanic/details/getting-started-with-python-ii) uses pandas to explore and manipulate data.

In [None]:
import pandas as pd

# text import is pretty simple
titanic = pd.read_csv("train.csv")


You can see that we've read in the dataset using pandas. Pandas uses three data structures: Series, DataFrames, and Panels. A Series is one-dimensional and holds values of a single data type. A DataFrame is two-dimensional: a table of named columns, each of which may have a different data type. A Panel is three-dimensional. Note that Panel was deprecated and later removed from pandas (as of version 0.25), so current releases offer only Series and DataFrame.

You can [learn more about data structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html) through pandas documentation.
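As a quick illustration of the first two structures, here is a minimal sketch with made-up values. The names `ages` and `passengers` are hypothetical and do not come from the tutorials.

```python
import pandas as pd

# A Series: one-dimensional, a single dtype (values are made up)
ages = pd.Series([22.0, 38.0, 26.0], name="Age")

# A DataFrame: two-dimensional, named columns that may differ in dtype
passengers = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": ages,
    "Survived": [0, 1, 1],
})

print(ages.dtype)         # dtype of the Series
print(passengers.dtypes)  # one dtype per column
print(passengers.shape)   # (rows, columns)
```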

We look at the data type of the titanic object that we've created with the pandas package. 

In [None]:
type(titanic)

Next, we explore the dataset.  A [list of pandas functions](http://pandas.pydata.org/pandas-docs/version/0.15.1/api.html) describes the usage of each function. 

We look at the `head` method of the titanic object. All the attributes and methods of a pandas DataFrame can be found by using tab completion after the dot operator: type `titanic.` and press Tab, then scroll down using the down arrow to see what's available. We can look at the head and tail of the titanic object.

### Linear Regression

The first model uses linear regression to make the predictions.

In [None]:
from sklearn.linear_model import LinearRegression
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# initialize the algorithm class
lm = LinearRegression()

Next we make our predictions. We collect each fold's predictions in a list, which starts out empty.

In [None]:
predictions = []

for train, test in kf:
    # create a subset of training rows using the predictors and training indices
    train_predictors = titanic[predictors].iloc[train, :]
    # select the target values for the same training rows
    train_target = titanic["Survived"].iloc[train]
    # create a linear model using the predictor and target variables
    lm.fit(train_predictors, train_target)
    # now make predictions and store them in our predictions list.
    test_predictions = lm.predict(titanic[predictors].iloc[test,:])
    predictions.append(test_predictions)
    

Next we figure out our prediction error. Kaggle scores submissions by the percentage of correct predictions. `predictions` is a Python list holding one NumPy array per fold, so we use `np.concatenate` to join them into a single array.


In [None]:
#type(predictions[1])
# type(predictions)
# predictions[1]

In [None]:
import numpy as np

predictions = np.concatenate(predictions, axis=0)



In [None]:
# map predictions to outcomes (0 or 1)
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0

type(predictions)



In [None]:
predictions[10]

In [None]:
# accuracy is the fraction of predictions that match the true labels
accuracy = (predictions == titanic["Survived"]).mean()

In [None]:
print(accuracy)
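Accuracy is the fraction of predictions that equal the true labels. The sketch below uses made-up arrays (`predicted` and `actual` are hypothetical names) to show that averaging the boolean match mask gives this fraction, whereas summing the matched prediction values counts only the correctly predicted 1s.

```python
import numpy as np

# made-up predictions (already thresholded to 0/1) and true labels
predicted = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
actual = np.array([1, 0, 0, 0, 1])

# boolean mask: True where the prediction matches the label
matches = predicted == actual

# fraction of correct predictions
accuracy_demo = matches.mean()

# summing the matched *values* ignores correct predictions of 0
only_correct_ones = predicted[matches].sum() / len(predicted)

print(accuracy_demo, only_correct_ones)
```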

### Logistic Regression 

Prediction accuracy with linear regression was 78.3%. A model built using logistic regression may do better.

In [None]:
titanic.head(5) 

In [None]:
titanic.tail(5)

To see the data types of the columns in the object, use `.dtypes`.

In [None]:
titanic.dtypes

`object.info()` gives us more metadata, including number of non-null values in each column, the data type of each column, and a summary of the object.

In [None]:
titanic.info()

If we only want to view the column headers, we can use the .columns attribute.

In [None]:
print(titanic.columns)
print(titanic.columns.values)

We can also look at a summary of data using the .describe() method. This provides summary statistics for each column. 

In [None]:
print(titanic.describe())

Using `describe` with its default parameters only summarizes the columns with numeric values. If we specify `include='all'`, it covers every column.

In [None]:
print(titanic.describe(include = 'all'))

A [second tutorial from dataquest](https://www.dataquest.io/mission/74/getting-started-with-kaggle) uses the pandas library and scikit-learn. It cleans up the dataset and runs it through a couple of regression models.

We note from the data above that we have some missing values. A count of the Age and Cabin columns returns fewer than 891 values, because the count only includes values that aren't null, NA, or NaN. We fill in the missing values of the Age column with the median age. There may be a better way to do this.

We select a column by name using bracket indexing.

In [None]:
titanic["Age"].head(10)

We can use the .fillna method. This method takes one argument: the value used to replace the missing values.

In [None]:
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
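To see what the count, median, and fill steps do, here is a small sketch on a made-up column. The frame `toy` is hypothetical and is not the Titanic data.

```python
import numpy as np
import pandas as pd

# made-up ages with two missing values
toy = pd.DataFrame({"Age": [20.0, np.nan, 30.0, 40.0, np.nan]})

# count() only counts non-null entries
non_null = toy["Age"].count()

# median() skips NaN by default
median_age = toy["Age"].median()

# fillna returns a new Series, so assign it back to the column
toy["Age"] = toy["Age"].fillna(median_age)

print(non_null, median_age)
print(toy["Age"].tolist())
```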

Next we need to convert the "Sex" column into numeric values so that scikit-learn can work with it. It currently holds strings.

In [None]:
type(titanic["Sex"][0])

In [None]:
print(titanic["Sex"].head(10))

In [None]:
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1

# you can also do this with a single mapping (same encoding as above):
#titanic["Sex"] = titanic["Sex"].map({'male': 0, 'female': 1}).astype(int)
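The dictionary-based approach can be tried on a small made-up Series. This is only a sketch; `sex` and `encoded` are hypothetical names.

```python
import pandas as pd

# made-up values standing in for the Sex column
sex = pd.Series(["male", "female", "female", "male"])

# map each label to a number; labels missing from the dict would become NaN
encoded = sex.map({"male": 0, "female": 1})

print(encoded.tolist())
print(encoded.dtype)
```

One design note: `map` builds a new integer Series in one step, while the `.loc` assignments overwrite values in the existing column.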

Next we need to convert the embarked column. It has a couple of missing values, which we replace with the most common value using the mode method.


In [None]:
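# This fills nothing: .mode() returns a Series, and fillna aligns a Series by index label
#titanic["Embarked"] = titanic["Embarked"].fillna(titanic["Embarked"].mode())
# Passing a scalar works; "S" is the most common value (titanic["Embarked"].mode()[0])
titanic["Embarked"] = titanic["Embarked"].fillna("S")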
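The difference between passing the result of `mode()` and passing a scalar can be seen on a made-up Series. In this sketch `ports` is a hypothetical stand-in for the Embarked column.

```python
import numpy as np
import pandas as pd

# made-up ports with missing values at index labels 2 and 5
ports = pd.Series(["S", "C", np.nan, "S", "Q", np.nan], name="Embarked")

# mode() returns a Series (there can be ties), indexed from 0
most_common = ports.mode()

# fillna with a Series matches on index labels, so labels 2 and 5 stay NaN
still_missing = ports.fillna(most_common)

# taking the first element gives a scalar, which fills every NaN
filled = ports.fillna(most_common.iloc[0])

print(still_missing.isna().sum(), filled.isna().sum())
```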

Then we need to look to see what values exist in the Embarked column.

In [None]:
print(titanic["Embarked"].unique())

In [None]:
# printed this value because I was trying to get the mode attribute to work.
# index 829 originally contained NaN
print(titanic["Embarked"][829])

In [None]:
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2


Now the data is in the right format. Using scikit-learn, we create a linear model. We also use the KFold function, which is a [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) implementation. K-fold cross-validation partitions the original sample into k roughly equal-sized sub-samples (randomly, if shuffling is enabled). Each sub-sample is used once as the validation set while the model is trained on the rest of the dataset. The results are then combined to produce a single estimate. A typical approach is to use 10 "folds."

We'll use the scikit-learn [KFold function](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html).

This function takes the following parameters:

* n: int, the total number of elements
* n_folds: int, the number of folds (defaults to 3)
* shuffle: bool, whether to shuffle the data before splitting
* random_state: None, int, or RandomState; when shuffle=True, an int sets the seed. If None, the default NumPy RNG is used for shuffling.

In [None]:

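# note: sklearn.cross_validation was removed in scikit-learn 0.20;
# newer releases provide KFold in sklearn.model_selection with a different signature
from sklearn.cross_validation import KFold

# the tutorial sets random_state but not shuffle; the seed only matters when shuffling
# titanic.shape returns (rows, columns), so shape[0] is the number of rows
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)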

# just curious
print(kf)
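In scikit-learn 0.20 and later, `KFold` lives in `sklearn.model_selection`, takes `n_splits` instead of `n` and `n_folds`, and produces index pairs through its `split` method. Below is a minimal sketch on a made-up array of nine rows; `X_demo`, `kf_new`, and `folds` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import KFold

# nine made-up rows, one feature each
X_demo = np.arange(9).reshape(9, 1)

# without shuffling, folds are consecutive blocks of rows
kf_new = KFold(n_splits=3)

# split() yields (train indices, test indices) for each fold
folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in kf_new.split(X_demo)]

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every row appears in exactly one test fold, so concatenating per-fold predictions in order lines them up with the original rows.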

In [None]:
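# cross_validation must be imported too (sklearn.model_selection in scikit-learn 0.18+)
from sklearn import cross_validation, linear_model

logm = linear_model.LogisticRegression(random_state=1)

# score the model with 3-fold cross-validation
scores = cross_validation.cross_val_score(logm, titanic[predictors], titanic["Survived"], cv=3)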

print(scores.mean())
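In scikit-learn 0.20 and later, the same scoring step uses `cross_val_score` from `sklearn.model_selection`. The sketch below runs it on synthetic data; the arrays `X_demo` and `y_demo` are made up for illustration and are not the Titanic data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic data: two features, label depends on their sum
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(90, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

model = LogisticRegression(random_state=1)

# one accuracy score per fold
demo_scores = cross_val_score(model, X_demo, y_demo, cv=3)

print(demo_scores.mean())
```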

Apply the same changes to the test set as to the training set.

In [None]:
titanic_test = pd.read_csv("test.csv")
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0 
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")

titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2

Create File to Submit to Kaggle