Some models are interesting to work with because they show how a simple idea can lead to a powerful model.
Here we look at the K-nearest-neighbors (KNN) model. The idea is to find the K nearest neighbors of a point and average their labels to estimate the value for that point. For a classifier, this amounts to a majority vote among the neighbors.
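
To make the idea concrete, here is a minimal from-scratch sketch of the neighbor-averaging step. It is an illustration only, not how scikit-learn implements it: the function name `knn_predict` and the toy arrays are made up for this example, and plain Euclidean distance is assumed.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Estimate the label of x_query as the mean label of its k nearest training points."""
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Mean of the neighbors' labels; for 0/1 labels this is the share of positive neighbors
    return y_train[nearest].mean()

# Toy data (made up for illustration): two small clusters labeled 0 and 1
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

estimate = knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3)
print(estimate)
```

With 0/1 labels, thresholding this mean at 0.5 gives the majority class among the neighbors.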

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as pp

In [2]:
dataset = pd.read_csv("/Users/jeanbaptiste/Downloads/customerLifetimeValue.csv", sep=";")
#We take the columns we need for our models and get the underlying matrix
X_numeric = dataset[["price_first_item_purchased", "pages_visited"]].values
#We also take a categorical variable
X_categorical = dataset["Country"]
#and we create a new feature
dataset["price/visited_pages"] = dataset["price_first_item_purchased"] / dataset["pages_visited"]
X_new_feature  = dataset["price/visited_pages"].values.reshape((-1, 1))
#We binarize the target: revenue greater than 175 becomes positive (1), everything else negative (0)
#Building a new array avoids overwriting the revenue column of the dataset in place
y = (dataset["revenue"].values > 175).astype(int)

In [3]:
from sklearn.preprocessing import LabelBinarizer
#We fill missing categorical values with "unknown"
#Like linear models, KNN needs binarized categories
X_categorical = X_categorical.fillna("unknown")
my_binarizer = LabelBinarizer()
binarized_categories = my_binarizer.fit_transform(X_categorical)
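
As a small illustration of what this binarization produces, the sketch below runs the same steps on a made-up column. The country values and the names `countries`, `toy_binarizer` and `encoded` are hypothetical and are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical country values, including a missing one
countries = pd.Series(["France", "Spain", None, "France"])
countries = countries.fillna("unknown")

toy_binarizer = LabelBinarizer()
encoded = toy_binarizer.fit_transform(countries)

print(toy_binarizer.classes_)  # the categories, in sorted order
print(encoded)                 # one row per sample, one 0/1 column per category
```

Note that when a column has exactly two categories, `LabelBinarizer` returns a single 0/1 column instead of two.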

In [4]:
#We join the data; with this model we don't need to drop the first column
#We concatenate the numerical variables, the binarized categories and the new feature
X = np.hstack([X_numeric, binarized_categories, X_new_feature])
X

array([[ 44.        ,   6.        ,   0.        , ...,   0.        ,
          0.        ,   7.33333333],
       [117.        ,   5.        ,   0.        , ...,   0.        ,
          0.        ,  23.4       ],
       [ 44.        ,   5.        ,   0.        , ...,   0.        ,
          0.        ,   8.8       ],
       ...,
       [ 15.5       ,   5.        ,   0.        , ...,   0.        ,
          0.        ,   3.1       ],
       [ 44.        ,   8.        ,   0.        , ...,   0.        ,
          0.        ,   5.5       ],
       [ 44.        ,   5.        ,   0.        , ...,   0.        ,
          0.        ,   8.8       ]])

In [10]:
#We create test and train datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1337)
#first we try with ten neighbors
model = KNeighborsClassifier(n_neighbors=10)
model.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')

In [11]:
from sklearn.metrics import roc_auc_score
train_score = roc_auc_score(y_train, model.predict(X_train))
test_score = roc_auc_score(y_test, model.predict(X_test))
print("train score : %f, test score : %f"%(train_score, test_score))

train score : 0.770252, test score : 0.749126
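
The scores above are computed from hard 0/1 predictions, so the ROC curve behind them has a single threshold. `roc_auc_score` also accepts continuous scores, and `KNeighborsClassifier.predict_proba` returns the share of neighbors in each class. The sketch below compares both options on synthetic data from `make_classification`, used here as a stand-in for the customer file; the variable names are made up for the example and the numbers it prints are not comparable to the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the customer data (assumption for this example)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=1337
)

knn = KNeighborsClassifier(n_neighbors=10).fit(X_tr, y_tr)

# AUC computed from hard 0/1 predictions, as in the cell above
auc_labels = roc_auc_score(y_te, knn.predict(X_te))
# AUC computed from the predicted probability of the positive class (column 1)
auc_proba = roc_auc_score(y_te, knn.predict_proba(X_te)[:, 1])

print("AUC from labels : %f, AUC from probabilities : %f" % (auc_labels, auc_proba))
```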


As you can see, the model works well, but it requires choosing the number of neighbors. To find the best number of neighbors, we test different values and keep the one that gives the best score on the test set.

In [14]:
#we try every number of neighbors from 2 to 19; each printed line corresponds to one value of n
for n in range(2, 20):
    model = KNeighborsClassifier(n_neighbors=n)
    model.fit(X_train, y_train)
    predicted_test = model.predict(X_test)
    predicted_train = model.predict(X_train)
    score_test = roc_auc_score(y_test, predicted_test)
    score_train = roc_auc_score(y_train, predicted_train)
    print("train score : %f, test score : %f"%(score_train, score_test))

train score : 0.725738, test score : 0.691086
train score : 0.773477, test score : 0.736759
train score : 0.771147, test score : 0.748106
train score : 0.772679, test score : 0.739475
train score : 0.771405, test score : 0.745620
train score : 0.771006, test score : 0.746818
train score : 0.768164, test score : 0.747966
train score : 0.769715, test score : 0.750417
train score : 0.770252, test score : 0.749126
train score : 0.768981, test score : 0.753394
train score : 0.767429, test score : 0.751369
train score : 0.767771, test score : 0.754303
train score : 0.770884, test score : 0.754945
train score : 0.767276, test score : 0.749748
train score : 0.768643, test score : 0.754563
train score : 0.764238, test score : 0.750114
train score : 0.766601, test score : 0.751879
train score : 0.762652, test score : 0.750374


This way of finding the best number of neighbors is called "grid search". Now it's your turn: choose your variables and use K-Fold cross-validation to predict your target, then compare the results with those of the single train/test split.
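
As a possible starting point for the exercise, here is a hedged sketch that combines a grid search over the number of neighbors with K-Fold cross-validation, using scikit-learn's `GridSearchCV` and `KFold`. It runs on synthetic data from `make_classification` rather than on the customer file, and the choice of 5 folds is an assumption for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for your own X and y (assumption for this example)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

# 5 shuffled folds: each sample is used for validation exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=1337)

# Same range of neighbors as in the loop above, scored by ROC AUC on each fold
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(2, 20))},
    scoring="roc_auc",
    cv=cv,
)
search.fit(X_demo, y_demo)

print("best parameters : %s, mean cross-validated AUC : %f"
      % (search.best_params_, search.best_score_))
```

Because every candidate value is scored on several validation folds, the selected number of neighbors depends less on one particular train/test split.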