# CTR using logistic classifier

In this notebook, we use a logistic classifier to predict ad clicks on the Avazu dataset. Tree-based algorithms work fine for this task, but as the data grows they quickly become computationally expensive.

Logistic-regression-based algorithms are among the most scalable classification algorithms in machine learning and are widely used when the dataset is huge and computing power is limited. Although they might not match the accuracy of tree-based algorithms, they greatly reduce training time and suit online learning applications (ad click prediction being one of them).

So, let's implement it using scikit-learn.

In [3]:
import pandas as pd
n_rows = 300000

df = pd.read_csv("train.gz", nrows = n_rows)
df.head()

Unnamed: 0,id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,...,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
0,1.000009e+18,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,2,15706,320,50,1722,0,35,-1,79
1,1.000017e+19,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15704,320,50,1722,0,35,100084,79
2,1.000037e+19,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15704,320,50,1722,0,35,100084,79
3,1.000064e+19,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15706,320,50,1722,0,35,100084,79
4,1.000068e+19,0,14102100,1005,1,fe8cc448,9166c161,0569f928,ecad2386,7801e8d9,...,1,0,18993,320,50,2161,0,35,-1,157


Dropping the columns that seem to be of little use.

In [4]:
X = df.drop(['click', 'id', 'hour', 'device_id', 'device_ip'], axis=1).values
Y = df['click'].values

In [5]:
X.shape

(300000, 19)

# Training and validation sets

Because the data is chronological, we take the first 90% of the rows as the training set and the last 10% as the validation set.

In [6]:
n_train = int(n_rows * 0.9)

X_train = X[:n_train]
Y_train = Y[:n_train]
X_test = X[n_train:]
Y_test = Y[n_train:]

# One Hot Encoding

In [8]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
X_train_enc = enc.fit_transform(X_train)

In [9]:
# Show the non-zero (row, column) entries of the first encoded sample
print(X_train_enc[0])

  (0, 2)	1.0
  (0, 6)	1.0
  (0, 188)	1.0
  (0, 2608)	1.0
  (0, 2679)	1.0
  (0, 3771)	1.0
  (0, 3885)	1.0
  (0, 3929)	1.0
  (0, 4879)	1.0
  (0, 7315)	1.0
  (0, 7319)	1.0
  (0, 7475)	1.0
  (0, 7824)	1.0
  (0, 7828)	1.0
  (0, 7869)	1.0
  (0, 7977)	1.0
  (0, 7982)	1.0
  (0, 8021)	1.0
  (0, 8189)	1.0


In [10]:
X_test_enc = enc.transform(X_test)
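
To make the encoded output above easier to read, here is a minimal hand-rolled sketch of what one-hot encoding with ignored unknown categories does. It is not scikit-learn's implementation; the helper names `fit_one_hot` and `transform_one_hot` and the toy rows are assumptions made for illustration only.

```python
def fit_one_hot(rows):
    """Map every (column, value) pair seen in training to an output index."""
    n_cols = len(rows[0])
    index = {}
    for col in range(n_cols):
        # Sort the distinct values per column so indices are grouped by column
        for value in sorted({row[col] for row in rows}):
            index[(col, value)] = len(index)
    return index


def transform_one_hot(rows, index):
    """Return, for each row, the sorted list of active (non-zero) output indices."""
    encoded = []
    for row in rows:
        # Pairs never seen during fitting are skipped, i.e. left as all zeros
        active = [index[(c, v)] for c, v in enumerate(row) if (c, v) in index]
        encoded.append(sorted(active))
    return encoded


# Toy rows standing in for (C1, banner_pos, site_id); the values are made up
train_rows = [
    ["1005", "0", "site_a"],
    ["1005", "1", "site_b"],
    ["1002", "0", "site_a"],
]
index = fit_one_hot(train_rows)
train_enc = transform_one_hot(train_rows, index)

# "site_zzz" was never seen in training, so it contributes no active index
test_enc = transform_one_hot([["1005", "0", "site_zzz"]], index)
print(train_enc)
print(test_enc)
```

Each training row activates exactly one index per original column, which matches the 19 non-zero entries printed for the first sample above (one per retained column).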

# Training a logistic regression model

In this section, we train a logistic classifier using SGDClassifier (stochastic gradient descent classifier), which means that the model weights are updated after every data point the model sees. Each update is far cheaper than a full-batch gradient descent step, so training is much faster on large datasets, usually at little cost in accuracy.

In [None]:
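from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score

# Newer scikit-learn releases use loss='log_loss' and max_iter in place of loss='log' and n_iter
sgd_lr = SGDClassifier(loss='log', penalty=None, fit_intercept=True, n_iter=10, learning_rate='constant', eta0=0.01)
# SGDClassifier accepts sparse input, so the one-hot matrix is passed without densifying it
sgd_lr.fit(X_train_enc, Y_train)

# Predicted click probability (class 1) for every test sample
pred = sgd_lr.predict_proba(X_test_enc)[:, 1]
print('Training samples: {0}, AUC on testing set: {1:.3f}'.format(n_train, roc_auc_score(Y_test, pred)))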
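
The idea of updating the weights after every data point can be sketched in plain Python. This is a simplified illustration under stated assumptions (no regularisation, constant learning rate, a made-up toy dataset), not what SGDClassifier does internally; `sigmoid` and `sgd_logistic` are hypothetical helper names.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_logistic(X, y, eta0=0.01, n_iter=10):
    """Logistic regression with one weight update per sample."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(n_iter):          # passes over the data
        for xi, yi in zip(X, y):     # one update per data point
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi             # gradient of the log loss w.r.t. the linear score
            w = [wj - eta0 * err * xj for wj, xj in zip(w, xi)]
            b -= eta0 * err
    return w, b


# Toy one-hot data: the first feature marks clicks, the second marks non-clicks
X_toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
y_toy = [1, 1, 0, 0]
w, b = sgd_logistic(X_toy, y_toy, eta0=0.1, n_iter=50)

p_pos = sigmoid(w[0] + b)  # predicted click probability when feature 0 is active
p_neg = sigmoid(w[1] + b)  # predicted click probability when feature 1 is active
print(w, b, p_pos, p_neg)
```

The `eta0` and `n_iter` arguments here play the same role as the arguments of the same name passed to SGDClassifier above.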

# Feature selection with L1 regularisation


In [None]:
import numpy as np

# Newer scikit-learn releases use loss='log_loss' and max_iter in place of loss='log' and n_iter
sgd_lr_l1 = SGDClassifier(loss='log', penalty='l1', alpha=0.0001, fit_intercept=True, n_iter=10, learning_rate='constant', eta0=0.01)
sgd_lr_l1.fit(X_train_enc, Y_train)

# Absolute weight of every one-hot feature; the array has shape (1, n_features)
coef_abs = np.abs(sgd_lr_l1.coef_)
print(coef_abs)

# bottom 10 weights and the corresponding 10 least important features
print(np.sort(coef_abs)[0][:10])

# get_feature_names is called get_feature_names_out in newer scikit-learn releases
feature_names = enc.get_feature_names()
bottom_10 = np.argsort(coef_abs)[0][:10]
print('10 least important features are:\n', feature_names[bottom_10])

# top 10 weights and the corresponding 10 most important features
print(np.sort(coef_abs)[0][-10:])
top_10 = np.argsort(coef_abs)[0][-10:]
print('10 most important features are:\n', feature_names[top_10])
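
For reference, the objective that an L1-penalised logistic model minimises can be written as below. This is a sketch in generic notation: the symbols $w$, $b$, $\sigma$ and $n$ are introduced here rather than taken from the notebook, $\alpha$ corresponds to the `alpha` argument above, and scikit-learn's internal formulation may differ in details such as label encoding.

$$
\min_{w,\, b} \; \frac{1}{n} \sum_{i=1}^{n} -\left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] + \alpha \sum_{j} |w_j|,
\qquad p_i = \sigma\!\left(w^\top x_i + b\right)
$$

Because the penalty grows with $|w_j|$, weights that contribute little to the fit are pushed towards zero, which is why the smallest absolute coefficients are read as the least important features.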

# Online learning

Now that we can train our model and select features using L1 regularisation, we are done with our logistic classifier.

It's time to talk about online learning. Online learning basically means that data keeps arriving as a stream, and we use the newest data to refine our model each time it comes in.

This is helpful in the many cases where data is collected all the time, and ad click prediction is one of them. It works like this:
1. Data is collected every day from websites and stored in a database.
2. This data is used to train (or refine) our model every 10 days or so.
3. The newer model is used to show new ads to users, and the process repeats.

The good thing about this kind of learning is that trends are always changing, and a regularly refreshed model keeps up with those trends, so it can work better than a model trained once on a lot of data (known as batch learning).

One more thing to note here is that sometimes the data we have right now (our batch) is too big to fit in memory. In that case, online learning can consume the data as a stream instead of a single batch (for example, 1000 data points at a time), and the model can still be trained easily.
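
Reading a large file as a stream of fixed-size chunks can be sketched with the standard library alone. The helper name `iter_chunks` and the in-memory CSV are assumptions made for illustration; a real run would open the training file instead.

```python
import csv
import io
from itertools import islice


def iter_chunks(file_obj, chunk_size=1000):
    """Yield lists of at most chunk_size rows (as dicts) without loading the whole file."""
    reader = csv.DictReader(file_obj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk


# Hypothetical in-memory stand-in for a CSV that is too large to load at once
fake_csv = io.StringIO(
    "click,banner_pos\n" + "\n".join(f"{i % 2},{i % 3}" for i in range(2500))
)

chunk_sizes = [len(chunk) for chunk in iter_chunks(fake_csv, chunk_size=1000)]
print(chunk_sizes)
```

Each chunk would then be encoded and passed to the model's `partial_fit` method, so only one chunk needs to be held in memory at a time.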


In [None]:
# The number of iterations is set to 1 if using partial_fit.
sgd_lr_online = SGDClassifier(loss='log', penalty=None, fit_intercept=True, n_iter=1, learning_rate='constant', eta0=0.01)

In [None]:
# Feed the training set to the model in 10 consecutive chunks, then test on the held-out 10%
import timeit
from sklearn.metrics import roc_auc_score

chunk_size = n_train // 10
start_time = timeit.default_timer()
for i in range(10):
    x_train = X_train[i*chunk_size:(i+1)*chunk_size]
    y_train = Y_train[i*chunk_size:(i+1)*chunk_size]
    x_train_enc = enc.transform(x_train)
    # partial_fit accepts sparse input; classes must list every label up front
    sgd_lr_online.partial_fit(x_train_enc, y_train, classes=[0, 1])

print("--- %0.3f seconds ---" % (timeit.default_timer() - start_time))

x_test_enc = enc.transform(X_test)

pred = sgd_lr_online.predict_proba(x_test_enc)[:, 1]
print('Training samples: {0}, AUC on testing set: {1:.3f}'.format(n_train, roc_auc_score(Y_test, pred)))

As we can see, for online learning we have to use the partial_fit function and fix the number of iterations to 1.

This example covers the case where our data is on disk but too big to load at once. It can easily be adapted to data arriving from an external server or database using ordinary Python code.

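Some hyperparameters for SGDClassifier:
1. loss = 'log': specifies that we want a logistic classifier. The default loss for SGDClassifier ('hinge') gives a linear SVM classifier.
2. penalty: the regularisation term; we can use L1 or L2 regularisation.
3. n_iter: the number of iterations (passes over the training data).
4. learning_rate: the learning rate schedule; 'constant' keeps the rate fixed at eta0.
5. eta0 = 0.01: the initial learning rate.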