Tutorials:

* https://www.kaggle.com/c/titanic/details/getting-started-with-python
* https://www.kaggle.com/c/titanic/details/getting-started-with-python-ii
* https://www.dataquest.io/mission/74/getting-started-with-kaggle
* https://www.dataquest.io/mission/75/improving-your-submission

    

While the book provides a good introduction to a variety of topics, we thought another helpful exercise would be to analyze a dataset. Since our textbook omits pandas, we used the exercise to explore pandas as well. All of the tutorials listed above use the [Titanic dataset](https://www.kaggle.com/c/titanic/data) from Kaggle. The goal is to predict which passengers survived the Titanic disaster. We'll be using two files for this, train.csv and test.csv. The training set records whether each passenger survived; the test set does not.

The [second Kaggle tutorial](https://www.kaggle.com/c/titanic/details/getting-started-with-python-ii) uses pandas to explore and manipulate data.

In [321]:
import pandas as pd

# text import is pretty simple
titanic = pd.read_csv("train.csv")


You can see that we've read in the dataset using pandas. Pandas has historically offered three data structures: Series, DataFrames, and Panels. A Series has a single dimension and holds values of a single data type. A DataFrame has two dimensions, can be thought of as a table, and contains named columns of potentially different data types. Panels are three-dimensional structures; they were deprecated and then removed in pandas 0.25, so current releases provide only Series and DataFrame.

You can [learn more about data structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html) through pandas documentation.
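
As a minimal sketch of the two structures used in the rest of this walkthrough (the values and names below are made up for illustration, not taken from train.csv):

```python
import pandas as pd

# A Series is one-dimensional and has a single dtype.
ages = pd.Series([22.0, 38.0, 26.0], name="Age")

# A DataFrame is two-dimensional; each column can have its own dtype.
people = pd.DataFrame({
    "Name": ["A", "B", "C"],      # made-up placeholder names
    "Age": [22.0, 38.0, 26.0],
})

print(ages.dtype)             # dtype of the single Series
print(people.dtypes)          # one dtype per column
print(type(people["Age"]))    # selecting one column returns a Series
```

Selecting a single column from a DataFrame gives back a Series, which is why the column-level methods used later (`fillna`, `median`, `unique`) are Series methods.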

We look at the data type of the titanic object that we've created with the pandas package. 

In [322]:
type(titanic)

pandas.core.frame.DataFrame

Next, we explore the dataset.  A [list of pandas functions](http://pandas.pydata.org/pandas-docs/version/0.15.1/api.html) describes the usage of each function. 

We start with the `head` method of the titanic object. All the attributes and methods of a pandas DataFrame can be found by using tab for code completion after the dot operator: type `object.` followed by Tab, then scroll down using the down arrow to see what's available. We can look at the head and tail of the titanic object.

In [323]:
titanic.head(5) 

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


In [324]:
titanic.tail(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


To see the data types of the columns in the object, use `.dtypes`.

In [325]:
titanic.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

`object.info()` gives us more metadata, including number of non-null values in each column, the data type of each column, and a summary of the object.

In [326]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB


If we only want to view the column headers, we can use the `.columns` attribute.

In [327]:
print(titanic.columns)
print(titanic.columns.values)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']


We can also look at a summary of data using the .describe() method. This provides summary statistics for each column. 

In [328]:
print(titanic.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  


Using describe with its default parameters only gives us the columns with numerical values. Specifying `include='all'` gives us every column.

In [329]:
print(titanic.describe(include = 'all'))

        PassengerId    Survived      Pclass                         Name  \
count    891.000000  891.000000  891.000000                          891   
unique          NaN         NaN         NaN                          891   
top             NaN         NaN         NaN  Shorney, Mr. Charles Joseph   
freq            NaN         NaN         NaN                            1   
mean     446.000000    0.383838    2.308642                          NaN   
std      257.353842    0.486592    0.836071                          NaN   
min        1.000000    0.000000    1.000000                          NaN   
25%      223.500000    0.000000    2.000000                          NaN   
50%      446.000000    0.000000    3.000000                          NaN   
75%      668.500000    1.000000    3.000000                          NaN   
max      891.000000    1.000000    3.000000                          NaN   

         Sex         Age       SibSp       Parch  Ticket        Fare    Cabin  \
count 

A [second tutorial from Dataquest](https://www.dataquest.io/mission/74/getting-started-with-kaggle) uses pandas and scikit-learn. It cleans up the dataset and runs it through a couple of regression models.

We note from the data above that we have some missing values. The counts for the Age (714), Cabin (204), and Embarked (889) columns are lower than 891, because the count only includes values that aren't null (None or NaN). We fill in the missing values of the Age column with the median age. There may be a better way to do this.

We select a column by indexing the DataFrame with the column name.

In [330]:
titanic["Age"].head(10)

0    22
1    38
2    26
3    35
4    35
5   NaN
6    54
7     2
8    27
9    14
Name: Age, dtype: float64

We can use the `.fillna` method. Here we pass it one argument: the value used to replace the missing values.

In [331]:
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())

In [332]:
titanic["Age"].head(10)

0    22
1    38
2    26
3    35
4    35
5    28
6    54
7     2
8    27
9    14
Name: Age, dtype: float64
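
The same fill can be sketched on a small made-up Series, which makes it easier to see that `median()` ignores the missing values and that `fillna` returns a new Series:

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan])

# median() skips missing values by default, so it is computed from 22, 26, 35.
median_age = ages.median()

# fillna returns a new Series; the original is unchanged unless reassigned.
filled = ages.fillna(median_age)

print(median_age)
print(filled.tolist())
print(ages.isnull().sum(), filled.isnull().sum())
```

This is why the cell above assigns the result back to `titanic["Age"]`.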

Next we need to convert the "Sex" column into numerical values so that scikit-learn can work with it. It currently holds strings.

In [333]:
type(titanic["Sex"][0])

str

In [334]:
print(titanic["Sex"].head(10))

0      male
1    female
2    female
3    female
4      male
5      male
6      male
7      male
8    female
9    female
Name: Sex, dtype: object


In [335]:
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1

# you can also do this with map (same encoding as above: male = 0, female = 1):
#titanic["Sex"] = titanic["Sex"].map({'male': 0, 'female': 1}).astype(int)

In [336]:
print(titanic["Sex"].head(10))

0    0
1    1
2    1
3    1
4    0
5    0
6    0
7    0
8    1
9    1
Name: Sex, dtype: object
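
Note that the output above still reports `dtype: object` even though the values are now 0 and 1. A sketch on made-up values, using `map` with the same codes as the `.loc` assignments (male 0, female 1) and `astype(int)` to end up with an integer column:

```python
import pandas as pd

sex = pd.Series(["male", "female", "female", "male"])

# Same encoding as the .loc assignments: male -> 0, female -> 1.
codes = sex.map({"male": 0, "female": 1}).astype(int)

print(codes.tolist())
print(codes.dtype)
```

Any value missing from the dictionary would become NaN under `map`, in which case `astype(int)` would fail, so this approach assumes the column contains only the two listed strings.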


Next we need to convert the Embarked column. It has a couple of missing values, which we replace with the most common value. The mode method finds that value, though in the end we fill in "S" directly.


In [337]:
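# This has no effect: mode() returns a Series, so fillna aligns it on the index
# instead of applying a single value. mode()[0] would give the scalar.
#titanic["Embarked"] = titanic["Embarked"].fillna(titanic["Embarked"].mode())
# Filling with the most common value directly works.
titanic["Embarked"] = titanic["Embarked"].fillna("S")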
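
One likely reason the commented-out line does nothing: `Series.mode()` returns a Series (several values can tie), and `fillna` given a Series matches on index labels instead of applying one value to every gap. A sketch on made-up values, taking the first mode as a scalar with `.iloc[0]`:

```python
import numpy as np
import pandas as pd

embarked = pd.Series(["S", "C", np.nan, "S", "Q", np.nan])

# mode() returns a Series because several values can tie for most common.
modes = embarked.mode()

# Take the first mode as a scalar so fillna applies it to every missing row.
most_common = modes.iloc[0]
filled = embarked.fillna(most_common)

print(most_common)
print(filled.tolist())
```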

Then we need to look to see what values exist in the Embarked column.

In [338]:
print(titanic["Embarked"].unique())

['S' 'C' 'Q']


In [339]:
# printed this value because I was trying to get the mode attribute to work.
# index 829 originally contained NaN
print(titanic["Embarked"][829])

S


In [340]:
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2


Now the data is in the right format. Using scikit-learn, we create a linear model. We also use the KFold function, which is an implementation of [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). K-fold cross-validation partitions the original sample into k roughly equal-sized sub-samples (randomly, if shuffling is enabled). Each sub-sample is used once as the validation set while the model is trained on the rest of the dataset. These results are combined to produce a single estimate. A typical approach is to use 10 "folds."

We'll use the scikit-learn [KFold function](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html).

This function takes the following parameters:

* n: int, the total number of elements 
* n_folds: int, number of folds, defaults to 3
* shuffle: bool, whether to shuffle the data first
* random_state: None, int, or RandomState. When shuffle=True, an int sets the seed. If None, the default numpy RNG is used for shuffling.

In [341]:

from sklearn.cross_validation import KFold

# the tutorial doesn't use shuffle but sets random_state, which only matters when shuffle=True.
# the shape attribute returns the dimensions of the object. Using an index of zero returns the number of rows
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

# just curious
print(kf)

sklearn.cross_validation.KFold(n=891, n_folds=3, shuffle=False, random_state=1)
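
The `sklearn.cross_validation` module was removed in scikit-learn 0.20. Newer releases provide `KFold` in `sklearn.model_selection`, where the number of folds is `n_splits` and the index pairs come from the `split()` method. A minimal sketch of that newer interface on a nine-row placeholder array (not the Titanic data):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(9).reshape(9, 1)  # placeholder data, nine rows

# Without shuffle the folds are contiguous blocks, so no random_state is needed.
kf = KFold(n_splits=3)

# split() yields (train_indices, test_indices) as numpy arrays, one pair per fold.
folds = [(train.tolist(), test.tolist()) for train, test in kf.split(X)]
for train, test in folds:
    print("train:", train, "test:", test)
```

Every row appears in exactly one test fold, which is what lets the per-fold predictions be stitched back together into one prediction per row.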


### Linear Regression

The first model uses linear regression to make the predictions.

In [342]:
from sklearn.linear_model import LinearRegression
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# initialize algorithm class
lm = LinearRegression()

Next we make our predictions. We collect them in a list, one array per fold.

In [343]:
predictions = []

for train, test in kf:
    # select the training rows of the predictor columns using the training indices
    train_predictors = titanic[predictors].iloc[train, :]
    # select the target values (Survived) for the same training rows
    train_target = titanic["Survived"].iloc[train]
    # create a linear model using the predictor and target variables
    lm.fit(train_predictors, train_target)
    # now make predictions and store them in our predictions list.
    test_predictions = lm.predict(titanic[predictors].iloc[test,:])
    predictions.append(test_predictions)
    

Next we figure out our prediction accuracy. Kaggle scores submissions by the percentage of correct predictions. `predictions` is a Python list containing one numpy array per fold, so we use numpy to concatenate them into a single array.


In [344]:
#type(predictions[1])
# type(predictions)
# predictions[1]

In [345]:
import numpy as np

predictions = np.concatenate(predictions, axis=0)
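
To make the step concrete, here is a sketch with made-up fold outputs: a list of three small arrays goes in, and one flat array comes out in fold order.

```python
import numpy as np

# Made-up predictions for three folds of two rows each.
fold_predictions = [
    np.array([0.1, 0.9]),
    np.array([0.6, 0.4]),
    np.array([0.8, 0.2]),
]

# axis=0 joins the 1-D arrays end to end, keeping fold order.
combined = np.concatenate(fold_predictions, axis=0)

print(type(fold_predictions).__name__, type(combined).__name__)
print(combined.shape)
```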



In [346]:
# map predictions to outcomes (0 or 1)
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0

type(predictions)



numpy.ndarray

In [347]:
predictions[10]

1.0

In [348]:
# fraction of predictions that match the true labels
accuracy = (predictions == titanic["Survived"].values).mean()


In [349]:
print(accuracy)

0.783389450056
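
Accuracy is the fraction of rows where the predicted label equals the true label. One way to compute it, sketched here on made-up arrays, is to compare elementwise and take the mean of the resulting booleans:

```python
import numpy as np

predicted = np.array([0.2, 0.7, 0.5, 0.9, 0.1])  # made-up regression outputs
actual = np.array([0, 1, 1, 1, 1])               # made-up true labels

# Threshold at 0.5 without modifying the raw scores.
labels = (predicted > 0.5).astype(int)

# True counts as 1 and False as 0, so the mean is the share of matches.
accuracy = (labels == actual).mean()
print(accuracy)
```

Summing the predicted values themselves would only count correct predictions of 1, so comparing first and averaging the comparison is the safer pattern.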


### Logistic Regression 

Prediction accuracy with linear regression was 78.3%. A model built using logistic regression may do better.

In [350]:
from sklearn import cross_validation, linear_model

logm = linear_model.LogisticRegression(random_state=1)

# 3-fold cross-validated accuracy, one score per fold
scores = cross_validation.cross_val_score(logm, titanic[predictors], titanic["Survived"], cv=3)

print(scores.mean())

0.787878787879


Apply the same changes to the test set as to the training set.

In [351]:
titanic_test = pd.read_csv("test.csv")
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0 
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")

titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2
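
Because the same steps are applied to both files, they could be wrapped in a helper. The function name `clean_titanic` and its parameters are hypothetical, and the tiny frames below are made up. The sketch uses `map` for the encodings and, like the cell above, fills test-set ages with the training median:

```python
import numpy as np
import pandas as pd

def clean_titanic(df, age_median, fare_median):
    """Return a cleaned copy of df (hypothetical helper, not from the tutorials)."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(age_median)
    out["Fare"] = out["Fare"].fillna(fare_median)
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Embarked"] = out["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    return out

# Made-up stand-ins for train.csv and test.csv.
train = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Fare": [7.0, 8.0, 9.0],
                      "Sex": ["male", "female", "male"],
                      "Embarked": ["S", np.nan, "Q"]})
test = pd.DataFrame({"Age": [np.nan, 30.0], "Fare": [np.nan, 10.0],
                     "Sex": ["female", "male"], "Embarked": ["C", "S"]})

age_median = train["Age"].median()  # training median reused for the test set
train_clean = clean_titanic(train, age_median, train["Fare"].median())
test_clean = clean_titanic(test, age_median, test["Fare"].median())

print(train_clean)
print(test_clean)
```

Working on a copy leaves the raw frames untouched, so the cleaning can be rerun without reloading the CSV files.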

Create the file to submit to Kaggle.

In [352]:
# Initialize the algorithm class
logm2 = linear_model.LogisticRegression(random_state=1)

# Train the algorithm using all the training data
logm2.fit(titanic[predictors], titanic["Survived"])

# Make predictions using the test set.
predictions = logm2.predict(titanic_test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pd.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })

In [353]:
submission.to_csv("kaggle.csv", index=False)

### Column Descriptions

* PassengerId: Numerical Id
* Survived: Whether passenger survived (1) or didn't (0)
* Pclass: Class the passenger was in
* Name: Name of the passenger
* Sex: Gender of the passenger
* Age: Age of the passenger
* SibSp: Number of siblings and spouses the passenger had on board
* Parch: Number of parents and children the passenger had on board
* Ticket: The ticket number of the passenger
* Fare: How much the passenger paid for their ticket
* Cabin: Which cabin the passenger was in
* Embarked: The passenger's point of embarkation