In this notebook we will learn how to create new variables and how to handle categorical variables with a linear model.
Some models can handle categorical variables natively; you will discover them in the next notebook.
Here we will show you how to transform the data so that it can be used by a linear model (here, a logistic regression).

In [85]:
#We import the usual packages (the sklearn tools are imported in the cells that use them)
import pandas as pd
import numpy as np
import matplotlib.pyplot as pp

In [90]:
dataset = pd.read_csv("/Users/jeanbaptiste/Downloads/customerLifetimeValue.csv", sep=";")
#We take the columns we need for our models and get the underlying matrix
X_numeric = dataset[["price_first_item_purchased", "pages_visited"]].values
#We also take a categorical variable
X_categorical = dataset["Country"]
#and we create a new feature
dataset["price/visited_pages"] = dataset["price_first_item_purchased"] / dataset["pages_visited"]
X_new_feature  = dataset["price/visited_pages"].values.reshape((-1, 1))
#We binarize the target: revenue greater than 175 becomes positive (1), everything else negative (0)
#Building a new array avoids modifying the "revenue" column of the dataset in place
y = (dataset["revenue"].values > 175).astype(int)

In [91]:
from sklearn.preprocessing import LabelBinarizer
#We fill missing categorical values with "unknown"
#(reassigning instead of inplace=True avoids modifying a slice of the dataset)
X_categorical = X_categorical.fillna("unknown")
my_binarizer = LabelBinarizer()
binarized_categories = my_binarizer.fit_transform(X_categorical)

This is how to convert a categorical variable into numeric variables: each category gets its own column. For each row, the column matching its category is filled with a 1 and all the other columns are filled with 0.
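
To make the encoding concrete, here is a minimal sketch on a toy list of countries. The values are made up for illustration and are not taken from the dataset.

```python
from sklearn.preprocessing import LabelBinarizer

# Toy country column (made-up values, not taken from the dataset)
countries = ["France", "Spain", "unknown", "France"]

binarizer = LabelBinarizer()
one_hot = binarizer.fit_transform(countries)

# classes_ lists the category behind each column, in sorted order
print(binarizer.classes_)
# Each row has a single 1, in the column of its category
print(one_hot)
```

Note that when a variable has exactly two categories, LabelBinarizer returns a single 0/1 column instead of two columns.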

In [92]:
#The binarized columns always sum to 1, which makes them redundant with the intercept.
#To avoid this collinearity, we drop one column of the binarized categories matrix.
#All estimated coefficients will be relative to the category we dropped
binarized_categories = binarized_categories[:, 1:]
#then we concatenate the matrix with the numerical variables
X = np.hstack([X_numeric, binarized_categories, X_new_feature])
X

array([[ 44.        ,   6.        ,   0.        , ...,   0.        ,
          0.        ,   7.33333333],
       [117.        ,   5.        ,   0.        , ...,   0.        ,
          0.        ,  23.4       ],
       [ 44.        ,   5.        ,   0.        , ...,   0.        ,
          0.        ,   8.8       ],
       ...,
       [ 15.5       ,   5.        ,   0.        , ...,   0.        ,
          0.        ,   3.1       ],
       [ 44.        ,   8.        ,   0.        , ...,   0.        ,
          0.        ,   5.5       ],
       [ 44.        ,   5.        ,   0.        , ...,   0.        ,
          0.        ,   8.8       ]])
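
The reason for dropping one column can be checked numerically. The sketch below uses a made-up one-hot matrix and an explicit intercept column (both are illustrative assumptions) and compares the rank of the design matrix before and after dropping a category.

```python
import numpy as np

# Toy one-hot matrix for three categories (made-up values)
one_hot = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1],
                    [1, 0, 0]])

# Explicit intercept column, as a linear model with fit_intercept=True would use
intercept = np.ones((one_hot.shape[0], 1))

full = np.hstack([intercept, one_hot])            # all categories kept
reduced = np.hstack([intercept, one_hot[:, 1:]])  # first category dropped

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

# The full matrix has more columns than its rank: one column is redundant
print(full.shape[1], rank_full)
# After dropping a category, the number of columns equals the rank
print(reduced.shape[1], rank_reduced)
```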

In [95]:
#We create test and train datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1337)
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
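
Since one category was dropped, each remaining category coefficient is read relative to that baseline. Here is a small sketch of how the coefficients could be paired with category names; the data and the names `countries`, `target`, `kept` and `baseline` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelBinarizer

# Made-up data: one categorical feature and a binary target
countries = np.array(["France", "Spain", "Italy"] * 20)
target = np.array([1, 0, 0] * 15 + [1, 1, 0] * 5)

binarizer = LabelBinarizer()
# Drop the first column, exactly as done for the real dataset
dummies = binarizer.fit_transform(countries)[:, 1:]
baseline = binarizer.classes_[0]  # the dropped category
kept = binarizer.classes_[1:]     # categories that still have a column

toy_model = LogisticRegression().fit(dummies, target)

# Each coefficient compares one kept category with the baseline
for name, coef in zip(kept, toy_model.coef_[0]):
    print("%s vs %s: %f" % (name, baseline, coef))
```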

In [96]:
from sklearn.metrics import roc_auc_score
#Note: model.predict returns hard 0/1 labels, so this AUC is computed from thresholded predictions
train_score = roc_auc_score(y_train, model.predict(X_train))
test_score = roc_auc_score(y_test, model.predict(X_test))
print("train score : %f, test score : %f"%(train_score, test_score))

train score : 0.773554, test score : 0.770236
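
ROC AUC is usually computed from scores rather than from 0/1 labels, so that every decision threshold is taken into account. The sketch below compares both ways of calling `roc_auc_score`. It runs on synthetic data generated with `make_classification` as a stand-in for the notebook's `X` and `y` (an assumption), so its numbers are not comparable with the scores above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the notebook's X and y (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=1337
)

demo_model = LogisticRegression()
demo_model.fit(X_tr, y_tr)

# AUC from hard 0/1 predictions: a single threshold is evaluated
auc_labels = roc_auc_score(y_te, demo_model.predict(X_te))
# AUC from the predicted probability of the positive class: all thresholds are evaluated
auc_proba = roc_auc_score(y_te, demo_model.predict_proba(X_te)[:, 1])

print("AUC from labels : %f, AUC from probabilities : %f" % (auc_labels, auc_proba))
```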


Now it's your turn: binarize another categorical variable, add it to the model, then compare your score with the one we obtained!