As an example of how to work with both categorical and numerical data, we will perform survival prediction for the passengers of the RMS Titanic.

We will use a version of the Titanic dataset (titanic3.xls) from [here](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls). We converted the .xls to .csv for easier manipulation but left the data otherwise unchanged.

We need to read in all the lines from the (titanic3.csv) file, set aside the keys from the first line, and find our labels (who survived or died) and data (attributes of that person). Let's look at the keys and some corresponding example lines.

In [2]:
!ls

Clase5.ipynb                       README.md
Clase7.ipynb                       Tarea_clase 4.ipynb
Clase7_LogisticRegressionMod.ipynb datasets
Clase_7.ipynb


In [4]:
import os
import pandas as pd

titanic = pd.read_csv(os.path.join("datasets","titanic3.csv"))
print(titanic.columns)

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')


Here is a broad description of the keys and what they mean:

```
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
survived        Survival
                (0 = No; 1 = Yes)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)
boat            Lifeboat
body            Body Identification Number
home.dest       Home/Destination
```

In general, it looks like `name`, `sex`, `cabin`, `embarked`, `boat`, `body`, and `home.dest` may be candidates for categorical features, while the rest appear to be numerical features. Below we separate the labels from a first selection of feature columns:

In [5]:
labels = titanic.survived.values
features = titanic[["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]]
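Whether a column is stored as numbers or as text can also be checked programmatically rather than by eye. The following is a minimal sketch on a small hypothetical frame (the values are illustrative and not taken from titanic3.csv); the names `sample`, `numeric_cols`, and `other_cols` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical three-row sample; values are illustrative only.
sample = pd.DataFrame({
    "pclass": [1, 3, 2],
    "sex": ["female", "male", "male"],
    "age": [29.0, None, 40.0],
    "embarked": ["S", "C", "Q"],
})

# Columns stored with a numeric dtype (integers or floats).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else holds text and needs encoding before modelling.
other_cols = [c for c in sample.columns if c not in numeric_cols]

print(numeric_cols)
print(other_cols)
```

Note that `pclass` lands in the numeric group because it is stored as integers, even though it arguably describes a category.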

We clearly want to discard the "boat" and "body" columns for any classification into survived vs not survived, because a lifeboat assignment or a body identification number already reveals the outcome. The name is (probably) unique to each person and therefore also non-informative.

For a first try, we will use "pclass", "sex", "age", "sibsp", "parch", "fare" and "embarked" as our features:

In [6]:
labels = titanic.survived.values
features = titanic[["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]]

In [7]:
labels

array([1, 1, 0, ..., 0, 0, 0])

In [6]:
features.head()

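|   | pclass | sex | age | sibsp | parch | fare | embarked |
|---|--------|-----|-----|-------|-------|------|----------|
| 0 | 1 | female | 29.0 | 0 | 0 | 211.3375 | S |
| 1 | 1 | male | 0.9167 | 1 | 2 | 151.55 | S |
| 2 | 1 | female | 2.0 | 1 | 2 | 151.55 | S |
| 3 | 1 | male | 30.0 | 1 | 2 | 151.55 | S |
| 4 | 1 | female | 25.0 | 1 | 2 | 151.55 | S |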


The data now contains only useful features, but they are not in a format that the machine learning algorithms can understand.

We need to transform the strings "male" and "female" into binary variables that indicate the gender, and similarly for "embarked".

We can do that using the pandas get_dummies function:

In [7]:
pd.get_dummies(features).head()

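|   | pclass | age | sibsp | parch | fare | sex_female | sex_male | embarked_C | embarked_Q | embarked_S |
|---|--------|-----|-------|-------|------|------------|----------|------------|------------|------------|
| 0 | 1 | 29.0 | 0 | 0 | 211.3375 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0.9167 | 1 | 2 | 151.55 | 0 | 1 | 0 | 0 | 1 |
| 2 | 1 | 2.0 | 1 | 2 | 151.55 | 1 | 0 | 0 | 0 | 1 |
| 3 | 1 | 30.0 | 1 | 2 | 151.55 | 0 | 1 | 0 | 0 | 1 |
| 4 | 1 | 25.0 | 1 | 2 | 151.55 | 1 | 0 | 0 | 0 | 1 |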


This transformation successfully encoded the string columns. However, `pclass` was left untouched because it is stored as integers, and one might argue that the class is also a categorical variable.

We can explicitly list the columns to encode using the `columns` parameter, and include `pclass`:

In [8]:
features_dummies = pd.get_dummies(features,columns=["pclass","sex","embarked"])
features_dummies.head(n=16)

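|   | age | sibsp | parch | fare | pclass_1 | pclass_2 | pclass_3 | sex_female | sex_male | embarked_C | embarked_Q | embarked_S |
|---|-----|-------|-------|------|----------|----------|----------|------------|----------|------------|------------|------------|
| 0 | 29.0 | 0 | 0 | 211.3375 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 0.9167 | 1 | 2 | 151.55 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 2.0 | 1 | 2 | 151.55 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 30.0 | 1 | 2 | 151.55 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4 | 25.0 | 1 | 2 | 151.55 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 5 | 48.0 | 0 | 0 | 26.55 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 6 | 63.0 | 1 | 0 | 77.9583 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 7 | 39.0 | 0 | 0 | 0.0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 8 | 53.0 | 2 | 0 | 51.4792 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 9 | 71.0 | 0 | 0 | 49.5042 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |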
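The difference between the default behaviour of `get_dummies` and the explicit `columns` argument is easier to see on a tiny frame. This is a sketch on hypothetical data (the frame `toy` and its values are made up for illustration): by default only text columns are expanded, while listing `pclass` explicitly expands the integer column as well.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative, not from titanic3.csv.
toy = pd.DataFrame({
    "pclass": [1, 3, 2],
    "sex": ["female", "male", "male"],
    "fare": [211.34, 7.25, 13.0],
})

# Default: only the text column "sex" is one-hot encoded.
default_cols = pd.get_dummies(toy).columns.tolist()

# Explicit: the integer column "pclass" is encoded too.
explicit_cols = pd.get_dummies(toy, columns=["pclass", "sex"]).columns.tolist()

print(default_cols)
print(explicit_cols)
```

Columns that are not encoded keep their position at the front, and the indicator columns are appended after them.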


In [10]:
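# Cast to float so np.isnan works even when the indicator columns are booleans (newer pandas)
data = features_dummies.values.astype(float)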

In [11]:
data[0]

array([ 29.    ,   0.    ,   0.    , 211.3375,   1.    ,   0.    ,
         0.    ,   1.    ,   0.    ,   0.    ,   0.    ,   1.    ])

In [12]:
import numpy as np
np.isnan(data).any()

True
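`np.isnan(data).any()` only reports that a missing value exists somewhere. To see which columns are affected, one option is to count missing entries per column on the DataFrame. The sketch below uses a hypothetical frame with made-up values; on the real data the equivalent call would be `features_dummies.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps; values are illustrative only.
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "fare": [211.34, 151.55, np.nan, 7.25],
    "sibsp": [0, 1, 1, 0],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# Show only the columns that actually contain missing values.
print(missing_per_column[missing_per_column > 0])
```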

In [13]:
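from sklearn.model_selection import train_test_split
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer

train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=0)

# Fill missing values with the per-column mean learned from the training split only
imp = SimpleImputer(strategy="mean")
imp.fit(train_data)
train_data_finite = imp.transform(train_data)
test_data_finite = imp.transform(test_data)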

In [14]:
np.isnan(train_data_finite).any()

False
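The imputer is fit on the training split and then applied to both splits, so the test data is filled with statistics learned from the training data alone. A minimal sketch of that behaviour, assuming a scikit-learn version that provides `sklearn.impute.SimpleImputer`; the arrays `train` and `test` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: two columns, some entries missing.
train = np.array([[20.0, 10.0],
                  [np.nan, 30.0],
                  [40.0, np.nan]])
test = np.array([[np.nan, np.nan]])

# Learn the per-column means from the training rows, ignoring NaN.
imp = SimpleImputer(strategy="mean").fit(train)
print(imp.statistics_)

# Missing test values are replaced with the training means.
test_filled = imp.transform(test)
print(test_filled)
```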

In [15]:
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression().fit(train_data_finite,train_labels)
print("logistic regression score: %f" % 
      lr.score(test_data_finite, test_labels))

logistic regression score: 0.792683


In [16]:
features_dummies.columns

Index(['age', 'sibsp', 'parch', 'fare', 'pclass_1', 'pclass_2', 'pclass_3',
       'sex_female', 'sex_male', 'embarked_C', 'embarked_Q', 'embarked_S'],
      dtype='object')

In [17]:
data[0]

array([ 29.    ,   0.    ,   0.    , 211.3375,   1.    ,   0.    ,
         0.    ,   1.    ,   0.    ,   0.    ,   0.    ,   1.    ])

In [22]:
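# Same feature vector as data[0], but with age changed from 29 to 19
psl = np.array([ 19.    ,   0.    ,   0.    , 211.3375,   1.    ,   0.    ,
         0.    ,   1.    ,   0.    ,   0.    ,   0.    ,   1.    ])
# predict expects a 2D array, so add a leading sample axis
lr.predict(psl[np.newaxis, :])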

array([1])
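Typing the feature vector by hand depends on remembering the column order. One possible alternative, sketched here, is to describe the passenger by column name and let pandas place the values; the dictionary `passenger` and the variable `row` are names invented for this example, and the column list is copied from `features_dummies.columns` above.

```python
import pandas as pd

# Column order as printed by features_dummies.columns.
columns = ['age', 'sibsp', 'parch', 'fare', 'pclass_1', 'pclass_2', 'pclass_3',
           'sex_female', 'sex_male', 'embarked_C', 'embarked_Q', 'embarked_S']

# Only the non-zero entries need to be spelled out.
passenger = {"age": 19.0, "fare": 211.3375,
             "pclass_1": 1, "sex_female": 1, "embarked_S": 1}

# Align to the training column order, fill the rest with 0,
# and add a leading axis so the shape is (1, n_features).
row = pd.Series(passenger).reindex(columns, fill_value=0.0).to_numpy(dtype=float)[None, :]
print(row)
```

With the fitted model from the notebook, `lr.predict(row)` takes the place of `lr.predict(psl[np.newaxis,:])`.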