# Introduction to sklearn: Feature vectorization and preprocessing

In [None]:
%matplotlib inline
import sklearn
import pandas as pd
import numpy as np

### Flow chart of a machine learning experiment in sklearn

![Flow chart](supervised_scikit_learn.png)

The flow chart above is borrowed from the AstroML [tutorial on sklearn](http://www.astroml.org/sklearn_tutorial/general_concepts.html). 

Compared to the setup of the first part, it includes an additional *feature vectorization* step.

### Task: Prediction of survival on the Titanic

The data is from a (currently running!) [competition](https://www.kaggle.com/c/titanic-gettingStarted) on the machine learning contest site [Kaggle](https://www.kaggle.com). You are asked to predict whether a person survived the disaster given demographic information such as gender and age, as well as additional information such as the price of the ticket, the cabin number, etc.

In [None]:
titanic = pd.read_csv("titanic_train.csv", index_col=0)
D_train = titanic.copy()
del D_train['Survived']

Let's have a look at the data. For convenience, the variable descriptions from the [dataset homepage](https://www.kaggle.com/c/titanic-gettingStarted/data?genderclassmodel.csv) on Kaggle are reproduced below:

````
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)
````

In [None]:
titanic.head()

### Getting to know the data

#### What are good predictors of survival?

In [None]:
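# Correlate numeric columns only; recent pandas versions raise an error on string columns otherwise
titanic.corr(numeric_only=True).Survived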

#### Survival per gender, grouped by passenger class

In [None]:
titanic.groupby(['Pclass', 'Sex']).Survived.mean()

### Handling missing values

Fraction of values missing per column

In [None]:
D_train.isnull().sum() / float(len(D_train))

Considering that the majority of values in `Cabin` are missing, we decide to drop this column. Missing values in `Embarked` and `Age` are replaced with the most common value of the respective column. Of course, one should be careful about such decisions, since they may greatly impact the predictive power of the final model.

In [None]:
del D_train['Cabin']
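# Assign the filled columns back instead of using `inplace=True` on an attribute,
# which is unreliable in recent pandas versions
D_train['Embarked'] = D_train['Embarked'].fillna(D_train['Embarked'].mode()[0])
D_train['Age'] = D_train['Age'].fillna(D_train['Age'].mode()[0])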
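
To make the caveat about imputation choices concrete, here is a minimal sketch comparing two fill strategies on a small made-up column. The values are hypothetical and are not taken from the Titanic data, and the median is shown only as one possible alternative to the mode used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for `Age`; not the real Titanic data.
age = pd.Series([22.0, 22.0, 35.0, np.nan, 58.0, np.nan, 41.0])

mode_value = age.mode()[0]     # most frequent observed value
median_value = age.median()    # middle of the observed values (NaN is skipped)

filled_mode = age.fillna(mode_value)
filled_median = age.fillna(median_value)

print("fill with mode:  ", filled_mode.tolist())
print("fill with median:", filled_median.tolist())
print("mean after mode fill:  ", filled_mode.mean())
print("mean after median fill:", filled_median.mean())
```

The two strategies leave different column means, which is one way an imputation choice can shift what a downstream model sees.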

Recheck missing values

In [None]:
D_train.isnull().sum() / float(len(D_train))

### Preparing the input using `DictVectorizer`

In [None]:
from sklearn.feature_extraction import DictVectorizer

In [None]:
vec = DictVectorizer()
vec

The `.fit` method of `DictVectorizer` accepts a list of dictionaries, where the keys are feature names and the values are the feature values:

In [None]:
toy_data = [{'sex': 'M', 'age': 45}, {'sex': 'F', 'age': 29, 'first_class': True}]
vec.fit(toy_data)
X = vec.transform(toy_data)
X

In [None]:
# Same result, but combines the two operations
X = vec.fit_transform(toy_data)
X

Visualizing the result

In [None]:
# `get_feature_names()` was removed in recent scikit-learn versions
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
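
The sketch below illustrates how `DictVectorizer` treats the two kinds of values: string values are expanded into one indicator column per `name=value` pair, while numeric values are passed through unchanged. The records are hypothetical, and `sparse=False` is used only to make the result easy to print.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy records: one string-valued and one numeric feature.
records = [
    {"embarked": "S", "fare": 7.25},
    {"embarked": "C", "fare": 71.28},
    {"embarked": "S", "fare": 8.05},
]

toy_vec = DictVectorizer(sparse=False)  # dense output for easy printing
X_toy = toy_vec.fit_transform(records)

# Column names in column order (sorted alphabetically by default).
print(toy_vec.feature_names_)
print(X_toy)
```

A key that is absent from a record, such as `first_class` in the first toy record above, simply ends up as 0 in that row.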

Tip: If your dataset is in a `DataFrame`, you can get the list-of-dictionaries format using the `to_dict` method:

In [None]:
titanic.iloc[:2].to_dict('records')

### Transform the Titanic dataset

In [None]:
vec = DictVectorizer()
y = titanic.Survived.values

# Note we must use `D_train`. Otherwise we are including the attribute to predict (Survived) in the training set
X = vec.fit_transform(D_train.to_dict('records'))
X

In [None]:
# `sklearn.cross_validation` was removed; `train_test_split` now lives in `sklearn.model_selection`
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
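
As a rough illustration of what `train_test_split` does, the sketch below splits a small made-up dataset. The arrays and the choices of `random_state` and `stratify` are assumptions for demonstration only; the notebook itself uses the default settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 samples, 2 features, balanced binary labels.
X_demo = np.arange(16).reshape(8, 2)
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.25,    # same as the default when no size is given
    random_state=0,    # fixed seed so the split is reproducible
    stratify=y_demo,   # keep the class proportions in both parts
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_tr), np.bincount(y_te))
```

Without a fixed `random_state`, each run produces a different split, so the scores reported further down will vary slightly from run to run.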

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
print("Classifier classes", clf.classes_)
features = pd.Series(clf.coef_[0], index=vec.get_feature_names_out())

In [None]:
# Rank features by the absolute value of their coefficient, largest first
importance_order = features.abs().sort_values(ascending=False).index
features[importance_order].head(20)

### Standardizing the data

In [None]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set only, then apply the same scaling to the test set
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train.toarray())
X_test_s = scaler.transform(X_test.toarray())

clf = LogisticRegression()
clf.fit(X_train_s, y_train)
y_pred = clf.predict(X_test_s)
print(classification_report(y_test, y_pred))
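
For reference, the following sketch shows what `StandardScaler` computes on a tiny made-up array: each column is centred on its training mean and divided by its training standard deviation, and the test data is transformed with those same training statistics. The arrays are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

demo_scaler = StandardScaler()
train_s = demo_scaler.fit_transform(train)  # learn mean/std from train only
test_s = demo_scaler.transform(test)        # reuse the training mean/std

# The same result by hand: (x - mean) / std, with the population std (ddof=0).
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(train_s, manual))
print(train_s.mean(axis=0), train_s.std(axis=0))
print(test_s)
```

Because the test row is scaled with the training statistics, its values need not have zero mean or unit variance.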

In [None]:
features = pd.Series(clf.coef_[0], index=vec.get_feature_names_out())
importance_order = features.abs().sort_values(ascending=False).index
features[importance_order].head(20)