<img style="center" width="300" src="static/images/logo-training.png" />

<h1  style="text-align:center"> Notebook #4 </h1>

<h1>Feature engineering</h1>
<img width="600" src="https://api.ning.com/files/ewwzspTVVqZ7yGyi4JAL8UaSr7FgAFg4HhNKRKM51v3ofDqR0VcBGJkio9C6je8BKC7DeCrxiZ91hpB0c*C6RlNOd04RPyK2/powertools.png" />

In this notebook we will learn how to create new features and deal with categorical variables.
Some models can handle categorical variables directly; you will discover them in the next notebook.
Here we will show you how to transform the data to make it usable by a logistic regression. "Numerical" models such as logistic regression can only work with numbers, so we need a way to transform categorical features (namely <code>Country</code>, <code>Campain</code> and <code>gender</code>) into numerical ones.



In [None]:
#We import the usual packages (the sklearn model is imported further down)
import pandas as pd
import numpy as np
import matplotlib.pyplot as pp

In [None]:
dataset = pd.read_csv("./data/customerLifetimeValue.csv", sep=";")
#We keep the numerical columns we need for our model
X_numeric = dataset[["price_first_item_purchased", "pages_visited"]].copy()
#We also take a categorical variable
X_categorical = dataset["Country"].copy()
#and we create a new feature (price per visited page) and add it to the X_numeric DataFrame
#Note: a row where pages_visited is 0 would yield an infinite or missing value here
X_numeric = X_numeric.assign(priceByVisited_pages = X_numeric["price_first_item_purchased"]/X_numeric["pages_visited"])
#We binarize the target: revenues greater than 175 become positive (1), the others negative (0)
Y = dataset["revenue"].apply(lambda x: 0 if x <= 175 else 1).values

In [None]:
from sklearn.preprocessing import LabelBinarizer
#We fill missing categorical values with "unknown"
X_categorical = X_categorical.fillna("unknown")
my_binarizer = LabelBinarizer()
binarized_categories = my_binarizer.fit_transform(X_categorical)

This is how to convert a categorical variable into numerical variables: each distinct category value gets its own column. For a given row, the column matching its category is set to 1 and all the other columns are set to 0.
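To make the encoding concrete, here is a minimal sketch on a toy series. The country values below are invented for illustration and are not taken from the data set.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Toy data, invented for illustration only (not from customerLifetimeValue.csv)
countries = pd.Series(["France", "Spain", None, "France", "Italy"])
# Missing values become their own category
countries = countries.fillna("unknown")

binarizer = LabelBinarizer()
encoded = binarizer.fit_transform(countries)

# One column per distinct value, in the order given by classes_
print(binarizer.classes_)
# Each row contains a single 1, in the column of its category
print(encoded)
```

Note that when a variable has only two distinct values, <code>LabelBinarizer</code> returns a single 0/1 column instead of two columns.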

In [None]:
#The binarized columns always sum to 1, which makes them redundant with the model's intercept.
#To avoid this, we drop one column: all estimated coefficients will be relative to the category we dropped
binarized_categories = binarized_categories[:, 1:]
#then we concatenate the matrix with the numerical variables
X = np.hstack([X_numeric.values, binarized_categories])


<p>
A new data set has been produced. It contains the following features:
    <ul>
        <li><code>price_first_item_purchased</code> and <code>pages_visited</code></li>
        <li><code>Country</code>, encoded as one column per distinct value (minus the reference category we dropped)</li>
        <li><code>priceByVisited_pages</code>, the feature we created</li>
    </ul>
 </p>
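The reason for dropping one binarized column can be checked numerically. The sketch below uses a toy one-hot matrix (invented for illustration) and compares the rank of the design matrix with and without the first indicator column.

```python
import numpy as np

# Toy one-hot matrix for three categories (invented for illustration)
one_hot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
])
# Column of ones standing in for the model's intercept
intercept = np.ones((one_hot.shape[0], 1))

# With every indicator column kept, the indicators sum to the intercept column,
# so the 4 columns only span 3 dimensions.
full = np.hstack([intercept, one_hot])
print(full.shape[1], np.linalg.matrix_rank(full))

# Dropping the first indicator column removes the redundancy:
# 3 columns, and they are linearly independent.
reduced = np.hstack([intercept, one_hot[:, 1:]])
print(reduced.shape[1], np.linalg.matrix_rank(reduced))
```

scikit-learn's <code>LogisticRegression</code> is regularized by default, so it would still fit with the redundant column; dropping it mainly makes the coefficients interpretable as differences from the dropped category.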
 
<p>
    Creating new features is a solid way to improve a model's performance :
</p>

In [None]:
#We create test and train datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=1337)
model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
from sklearn.metrics import roc_auc_score
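#ROC AUC is a ranking metric: we score the predicted probability of the positive class, not the hard 0/1 predictions
train_score = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])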
print("train score : %f, test score : %f"%(train_score, test_score))

<p>
    By computing a new feature and by encoding the feature <code>Country</code> we significantly improved the performance of the model.
</p>
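A note on the score itself: ROC AUC measures how well a model ranks positive examples above negative ones, so it is usually computed from continuous scores such as the output of <code>predict_proba</code>. The sketch below, on toy values invented for illustration, shows how thresholding to hard 0/1 labels first can lower the score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (invented for illustration)
y_true = np.array([0, 0, 1, 1])
proba = np.array([0.1, 0.2, 0.3, 0.9])

# Hard predictions with the usual 0.5 threshold: [0, 0, 0, 1]
labels = (proba >= 0.5).astype(int)

# The probabilities rank both positives above both negatives
auc_from_proba = roc_auc_score(y_true, proba)
# After thresholding, one positive is tied with the negatives
auc_from_labels = roc_auc_score(y_true, labels)

print("AUC from probabilities: %f, AUC from hard labels: %f" % (auc_from_proba, auc_from_labels))
```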

<hr>
<h1 style="text-align:center; color:orange">YOUR TURN</h1>
<hr>

<b>A] Binarize another feature and create a new one!</b>

<b>B] Feature engineering requires good business knowledge. Look at the age feature. What do you notice? How can this help you create a new feature?</b>