# First steps

In this notebook we will discover a few essential ideas about model selection.
We will test one of the most basic families of models: linear models (here, logistic and linear regression).

In [9]:
#We import the usual packages and the model from sklearn
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

In [31]:
dataset = pd.read_csv("/Users/jeanbaptiste/Downloads/customerLifetimeValue.csv", sep=";")
#We take the columns we need for our models and get the underlying matrix
X = dataset[["price_first_item_purchased", "pages_visited"]].values
#We binarize the target: every revenue greater than 175 becomes positive (1), the others negative (0)
#The comparison builds a new array, so the "revenue" column of the dataset is left untouched
print(dataset["revenue"].describe())
y = (dataset["revenue"] > 175).astype(int).values

count    29299.000000
mean       177.451654
std         69.052396
min         28.000000
25%        133.000000
50%        172.000000
75%        217.000000
max        549.000000
Name: revenue, dtype: float64


To estimate the model's performance, we need to test it on data the model was not fitted on. We will split our sample into two sets: train and test.

In [26]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1337)
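As a quick sanity check on what the split produces, here is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are assumptions standing in for the notebook's CSV). It also passes the optional `stratify` argument, which is not used in the cell above, to keep the share of positives the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (assumption: 1000 rows, 2 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 2))
y_demo = (X_demo[:, 0] + rng.normal(size=1000) > 0).astype(int)

# Same call as in the notebook, plus stratify to preserve the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=1337, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("positive rate train : %f, test : %f" % (y_tr.mean(), y_te.mean()))
```

With `test_size=0.33`, roughly one third of the rows go to the test set and the rest to the train set.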

Linear models are simple, fast to fit and easy to interpret. They often perform well on small and medium-sized datasets.
We will fit a very simple model on the train data. Note that the model API (`fit`, `predict`) is the same across the whole sklearn package.

In [27]:
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
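To illustrate the shared API, the sketch below fits two different linear classifiers with exactly the same calls. The data is synthetic and `RidgeClassifier` is only chosen as an example of a second estimator; neither appears in the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, RidgeClassifier

# Synthetic stand-in data (assumption: 200 rows, 2 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Every estimator is used the same way: construct, fit, predict
for estimator in (LogisticRegression(), RidgeClassifier()):
    estimator.fit(X_demo, y_demo)
    predictions = estimator.predict(X_demo)
    print(type(estimator).__name__, predictions[:5])
```

Swapping one model for another therefore only requires changing the line that constructs it.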

One of the most important steps in evaluating a model is choosing a metric. Here we will choose the ROC AUC metric (area under the ROC curve), but there are many other evaluation criteria for classification or regression.

In [28]:
from sklearn.metrics import roc_auc_score
train_score = roc_auc_score(y_train, model.predict(X_train))
test_score = roc_auc_score(y_test, model.predict(X_test))
print("train score : %f, test score : %f"%(train_score, test_score))

train score : 0.589723, test score : 0.582032
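ROC AUC measures how well a model ranks positives above negatives, so it is usually computed on continuous scores rather than on the hard 0/1 labels returned by `predict`. The sketch below compares both options on synthetic data (an assumption, not the notebook's CSV), using `predict_proba` to get the probability of the positive class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the notebook's CSV)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 2))
y_demo = (X_demo[:, 0] + rng.normal(size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=1337
)
clf = LogisticRegression().fit(X_tr, y_tr)

# Hard 0/1 labels: the ROC curve is reduced to a single operating point
auc_labels = roc_auc_score(y_te, clf.predict(X_te))
# Probability of the positive class: the full ranking of the samples is used
auc_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print("AUC from labels : %f, AUC from probabilities : %f" % (auc_labels, auc_proba))
```

On real data the gap between the two values depends on the model and the dataset, so it is worth checking both before comparing scores across cells.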


But the score we get here can be biased: it depends on the particular subsample we happened to draw when splitting. To reduce this effect, we can use K-Fold validation.
With K-Fold validation, we split the data into K folds, train the model on all the folds except one, and measure performance on the held-out fold. Then we put that fold back with the others and take another fold as the test fold. We repeat these steps until every fold has been used as the test fold, and we average the scores.
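To make the rotation of folds concrete, here is a small sketch on six toy samples split into three folds (the array `X_toy` is hypothetical). It only prints which row indices are used for training and testing at each step.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six toy samples with two features each (hypothetical data)
X_toy = np.arange(12).reshape(6, 2)

test_indices = []
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=3).split(X_toy)):
    # Each sample lands in the test fold exactly once
    test_indices.extend(test_idx.tolist())
    print("fold %d: train=%s test=%s" % (fold, train_idx, test_idx))
```

Without shuffling, the folds are consecutive blocks of rows: the first fold tests on rows 0 and 1, the second on rows 2 and 3, the third on rows 4 and 5.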

In [30]:
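from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
folds_maker = KFold(n_splits=10)
train_score = []
test_score = []
for train_index, test_index in folds_maker.split(X):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
    #This cell fits LinearRegression (not the LogisticRegression used above); its continuous predictions are scored with ROC AUC
    model = LinearRegression()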
    model.fit(X_train, y_train)
    train_score.append(roc_auc_score(y_train, model.predict(X_train)))
    test_score.append(roc_auc_score(y_test, model.predict(X_test)))
print("train score : %f, test score : %f"%(np.mean(train_score), np.mean(test_score)))

train score : 0.693303, test score : 0.693174
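The manual loop above can also be written more compactly. The sketch below uses `cross_val_score` with the `"roc_auc"` scorer on synthetic data (an assumption, not the notebook's CSV). It also sets `shuffle=True`, which the loop above does not do, so that the folds do not depend on the order of the rows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: 500 rows, 2 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = (X_demo[:, 0] + rng.normal(size=500) > 0).astype(int)

# Shuffled folds with a fixed seed so that the run is reproducible
cv = KFold(n_splits=10, shuffle=True, random_state=1337)
scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=cv, scoring="roc_auc")
print("test score : %f (+/- %f)" % (scores.mean(), scores.std()))
```

The returned array holds one test score per fold, so its spread gives a rough idea of how much the estimate varies from one split to another.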


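We fit the model once per fold, so this method can become really slow if the model takes time to fit. Note also that this cell scores the continuous predictions of a LinearRegression, so its ROC AUC is not directly comparable to the score of the LogisticRegression cell above, which was computed on hard 0/1 predictions.
Now it's your turn: fit another linear model with two variables of your choice and evaluate its performance with a train-test split and a K-Fold validation.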