# OLX Assignment

Overview of the solution:

- First I tokenize and stem the text fields in Lucene - I did not find any polish stemmers for python, so I had to do in in Java. It also removes stopwords
- Then I create some basic features:
  - One-Hot Encoding of `category`, `price_type`, `category`+`price-type` interaction
  - Bag of Words for `title` and `description`
  - Interaction between `price_type` and `title`
  - Whether we have seen title previously or not, time since seen last, if seen 1 second ago, 1 minute ago, 1 hour ago, etc
  - Whether seen description - same as above
- Model I use is Linear SVM with L1-penalty
- Cross-Validation:
  - Data is ordered by `arrival_date` - so I kept the order and just split 
  - Reason: we train on historical data and want to see how well it will perform on future unseen data
  - Split into 3 parts: train, validation and testing. Validation is used for parameter tuning and testing is used for final model verificaiton
- For evaluation I use AUC. The final model's AUC is 0.768
- For feature importance I used XGBoost
- I spent ~4-5 hours on this assignment, and it wasn't enough to try many things. E.g. I haven't really used the price
- I have other feature engineering ideas which we can discuss during the interview - if you like the solution
  

In [1]:
import json

import re
import codecs
from collections import Counter

import numpy as np
import pandas as pd
import scipy.sparse as sp

from tqdm import tqdm

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

import xgboost as xgb
from sklearn.cross_validation import train_test_split

In [2]:
df_all = []
with open('train-stemmed.json', 'r') as f:
    for line in tqdm(f):
        df_all.append(json.loads(line))

df_all = pd.DataFrame(df_all)



In [3]:
df_all.label = df_all.label.astype('uint8')
df_all.price_type.fillna('N/A', inplace=1)
df_all.arrival_date = pd.to_datetime(df_all.arrival_date)
df_all.user_created_at = pd.to_datetime(df_all.user_created_at)

In [4]:
(df_all.category_id.value_counts() == 1).mean()

0.18768768768768768

In [5]:
translation = {u'ć': u'c', u'ę': u'e', u'ą': u'a', u'ś': u's', u'ć': u'c', 
               u'ź': u'z', u'ż': u'z', u'ó': u'o', u'ł': u'l', u'ń': u'n'}
translation = {ord(k): v for (k, v) in translation.items()}

def remove_diacritics(s):
    return s.translate(translation)

In [6]:
df_all.description = df_all.description.apply(remove_diacritics)
df_all.title  = df_all.title.apply(remove_diacritics)

In [7]:
df_all['time_from_registration'] = df_all.arrival_date - df_all.user_created_at
df_all.time_from_registration = df_all.time_from_registration.astype('timedelta64[s]').astype('float32')
df_all['small_time_gap'] = df_all.time_from_registration < 10000

Let us check if we saw any titles previously. If we saw a title, we keep the time we saw it last

In [8]:
seen = {}
seen_cnt = Counter()
title_last_seen = []
title_seen_cnt = []

for title, date in zip(df_all.title, df_all.arrival_date):
    if title in seen:
        last_seen = seen[title]
        title_last_seen.append(last_seen)
    else:
        title_last_seen.append(pd.NaT)

    title_seen_cnt.append(seen_cnt[title])
    seen_cnt[title] += 1
    seen[title] = date

In [9]:
one_sec = pd.to_timedelta('1s')
one_min = pd.to_timedelta('1m')
one_hour = pd.to_timedelta('1h')
three_hours = pd.to_timedelta('3h')

In [10]:
df_all['title_seen_cnt'] = title_seen_cnt
df_all['title_last_seen'] = title_last_seen
df_all.title_last_seen = (df_all.arrival_date - df_all.title_last_seen)

df_all['title_seen_previously'] = ~df_all.title_last_seen.isnull()
df_all['title_seen_onesec_ago'] = df_all.title_last_seen <= one_sec
df_all['title_seen_onemin_ago'] = df_all.title_last_seen <= one_min
df_all['title_seen_onehour_ago'] = df_all.title_last_seen <= one_hour
df_all['title_seen_threehours_ago'] = df_all.title_last_seen <= three_hours

Same for description

In [11]:
seen = {}
seen_cnt = Counter()
desc_last_seen = []
desc_seen_cnt = []

for desc, date in zip(df_all.description, df_all.arrival_date):
    if desc in seen:
        last_seen = seen[desc]
        desc_last_seen.append(last_seen)
    else:
        desc_last_seen.append(pd.NaT)

    desc_seen_cnt.append(seen_cnt[title])
    seen_cnt[desc] += 1
    seen[desc] = date

In [12]:
df_all['desc_seen_cnt'] = desc_seen_cnt
df_all['desc_last_seen'] = desc_last_seen
df_all.desc_last_seen = (df_all.arrival_date - df_all.desc_last_seen)

df_all['desc_seen_previously'] = ~df_all.desc_last_seen.isnull()
df_all['desc_seen_onesec_ago'] = df_all.desc_last_seen <= one_sec
df_all['desc_seen_onemin_ago'] = df_all.desc_last_seen <= one_min
df_all['desc_seen_onehour_ago'] = df_all.desc_last_seen <= one_hour
df_all['desc_seen_threehours_ago'] = df_all.desc_last_seen <= three_hours

In [13]:
df_all.head()

Unnamed: 0,arrival_date,category_id,description,label,price,price_type,title,user_created_at,time_from_registration,small_time_gap,...,title_seen_onemin_ago,title_seen_onehour_ago,title_seen_threehours_ago,desc_seen_cnt,desc_last_seen,desc_seen_previously,desc_seen_onesec_ago,desc_seen_onemin_ago,desc_seen_onehour_ago,desc_seen_threehours_ago
0,2016-08-27 04:00:42,1637,malowanie glada glazur terakotay panele solidnka,1,0.0,,malowanie rac,2014-12-27 11:50:55,52589388.0,False,...,False,False,False,0,NaT,False,False,False,False,False
1,2016-08-27 04:00:42,1637,malowanie glada glazur terakotay panele solidnka,1,0.0,,malowanie glada,2014-12-27 11:50:55,52589388.0,False,...,False,False,False,0,0 days,True,True,True,True,True
2,2016-08-27 04:00:48,2447,praca kostka brukowy ukladac docinac plantowac...,0,0.0,,praca kostka brukowy,2014-01-22 14:51:56,81868128.0,False,...,False,False,False,0,NaT,False,False,False,False,False
3,2016-08-27 04:00:52,1823,witac dwien dzisiejszy sprzedawac mebloscianka...,1,350.0,price,mebloscianka dobra cena dobry stan,2016-07-14 16:27:55,3756777.0,False,...,False,False,False,0,NaT,False,False,False,False,False
4,2016-08-27 04:01:15,1707,sprzedac pary buteleczka daivobet 60g zel niem...,0,120.0,price,daivobey 60g,2014-11-25 13:28:13,55348384.0,False,...,False,False,False,0,NaT,False,False,False,False,False


Let's check if some features give us a good baseline:

In [14]:
print 'time_from_registration, %.3f' % roc_auc_score(df_all.label, df_all.time_from_registration)
print 'title_seen_cnt, %.3f' % roc_auc_score(df_all.label, df_all.title_seen_cnt)

time_from_registration, 0.512
title_seen_cnt, 0.533


Now let's split the data for Cross-Validation

In [15]:
TRAIN = 0
VAL = 1
TEST = 2

n = len(df_all)
full_train_size = int(0.7 * n)
train_size = int(0.7 * full_train_size)

df_all['test'] = TRAIN
df_all.loc[df_all.index > train_size, 'test'] = VAL
df_all.loc[df_all.index > full_train_size, 'test'] = TEST

In [16]:
y_train = df_all.label[df_all.test == TRAIN].values
y_val = df_all.label[df_all.test == VAL].values

y_fulltrain = df_all.label[df_all.test != TEST].values
y_test = df_all.label[df_all.test == TEST].values

train_idx = (df_all.test == TRAIN).values
val_idx = (df_all.test == VAL).values

fulltrain_idx = (df_all.test != TEST).values
test_idx = (df_all.test == TEST).values

First let's try using the price type only

In [17]:
price_type = df_all.price_type
price_type_vectorizer = CountVectorizer(binary=1, dtype='uint8', min_df=10)
price_type_matrix = price_type_vectorizer.fit_transform(price_type)

In [18]:
price_type_train = price_type_matrix[train_idx]
price_type_val = price_type_matrix[val_idx]

In [19]:
C = 1.0
svm = LinearSVC(penalty='l1', dual=False, C=C, random_state=1)
svm.fit(price_type_train, y_train)

pred = svm.decision_function(price_type_val)
auc = roc_auc_score(y_val, pred)
print 'C=%.3f, auc=%.3f' % (C, auc)

C=1.000, auc=0.569


Also we can try using only category

In [20]:
categories = df_all.category_id.astype('str')
cat_vectorizer = CountVectorizer(binary=1, dtype='uint8', min_df=10)
cat_matrix = cat_vectorizer.fit_transform(categories)

In [21]:
cat_train = cat_matrix[train_idx]
cat_val = cat_matrix[val_idx]

In [22]:
for C in [0.01, 0.1, 0.5, 1]:
    svm = LinearSVC(penalty='l1', dual=False, C=C, random_state=1)
    svm.fit(cat_train, y_train)

    pred = svm.decision_function(cat_val)
    auc = roc_auc_score(y_val, pred)
    print 'C=%.3f, auc=%.3f' % (C, auc)

C=0.010, auc=0.637
C=0.100, auc=0.678
C=0.500, auc=0.683
C=1.000, auc=0.683


What if we combine category and price type?

In [23]:
pt_cat = price_type + "_" + categories
pt_cat_vectorizer = CountVectorizer(binary=1, dtype='uint8', min_df=10)
pt_cat_matrix = pt_cat_vectorizer.fit_transform(pt_cat)
pt_cat_matrix.shape

(95329, 667)

In [24]:
pt_cat_train = pt_cat_matrix[train_idx]
pt_cat_val = pt_cat_matrix[val_idx]

In [25]:
for C in [0.01, 0.1, 0.5, 1]:
    svm = LinearSVC(penalty='l1', dual=False, C=C, random_state=1)
    svm.fit(pt_cat_train, y_train)

    pred = svm.decision_function(pt_cat_val)
    auc = roc_auc_score(y_val, pred)
    print 'C=%.3f, auc=%.3f' % (C, auc)

C=0.010, auc=0.622
C=0.100, auc=0.701
C=0.500, auc=0.707
C=1.000, auc=0.707


Using only this already gives 70% AUC.

Let's try to use titles alone and then descriptions alone

In [26]:
titles = df_all.title
title_vectorizer = CountVectorizer(binary=1, dtype='uint8', min_df=10)
title_matrix = title_vectorizer.fit_transform(titles)

In [27]:
title_train = title_matrix[train_idx]
title_val = title_matrix[val_idx]

In [28]:
for C in [0.01, 0.05, 0.1, 0.5]:
    svm = LinearSVC(penalty='l1', dual=False, C=C, random_state=1)
    svm.fit(title_train, y_train)

    pred = svm.decision_function(title_val)
    auc = roc_auc_score(y_val, pred)
    print 'C=%.3f, auc=%.3f' % (C, auc)

C=0.010, auc=0.602
C=0.050, auc=0.672
C=0.100, auc=0.681
C=0.500, auc=0.670


In [29]:
desc = df_all.description
dect_vectorizer = CountVectorizer(binary=1, dtype='uint8', min_df=10)
desc_matrix = dect_vectorizer.fit_transform(desc)

In [30]:
desc_train = desc_matrix[train_idx]
desc_val = desc_matrix[val_idx]

In [31]:
for C in [0.01, 0.05, 0.1, 0.5]:
    svm = LinearSVC(penalty='l1', dual=False, C=C, random_state=1)
    svm.fit(desc_train, y_train)

    pred = svm.decision_function(desc_val)
    auc = roc_auc_score(y_val, pred)
    print 'C=%.3f, auc=%.3f' % (C, auc)

C=0.010, auc=0.662
C=0.050, auc=0.694
C=0.100, auc=0.694
C=0.500, auc=0.662


We see that they are less performant than category+price type. 

Now let's try to append the category to each token in title. Also, some of the categories occur very infrequently, so we can replace them with something like "X"

In [32]:
def append_cat(cat, tokens):
    split = tokens.split()
    return ' '.join(cat + "_" + t for t in split)

In [33]:
cat_cnt = Counter(categories)

hfcats = categories.copy()
hfcats[hfcats.apply(cat_cnt.get) <= 15] = 'X'

In [34]:
hfcats_titles = hfcats.combine(titles, append_cat)

In [35]:
hfcats_title_vectorizer = CountVectorizer(binary=1, dtype='uint8', min_df=10)
hfcats_title_matrix = hfcats_title_vectorizer.fit_transform(hfcats_titles)
hfcats_title_matrix.shape

(95329, 6668)

In [36]:
hfcats_title_train = hfcats_title_matrix[train_idx]
hfcats_title_val = hfcats_title_matrix[val_idx]

In [37]:
for C in [0.1, 0.5, 0.7, 1]:
    svm = LinearSVC(penalty='l1', dual=False, C=C, random_state=1)
    svm.fit(hfcats_title_train, y_train)

    pred = svm.decision_function(hfcats_title_val)
    auc = roc_auc_score(y_val, pred)
    print 'C=%.3f, auc=%.3f' % (C, auc)

C=0.100, auc=0.669
C=0.500, auc=0.678
C=0.700, auc=0.675
C=1.000, auc=0.671


The performance is almost the same as using plain titles.

Now let's put everything together and also use some of the features we created previsouly

In [38]:
df_all.title_seen_cnt = df_all.title_seen_cnt.astype('uint32')

In [39]:
orig_colums = ['title_seen_previously', 
               'title_seen_onesec_ago', 
               'title_seen_onemin_ago', 
               'title_seen_onehour_ago', 
               'title_seen_threehours_ago',
               'desc_seen_cnt',
               'desc_seen_previously',
               'desc_seen_onesec_ago',
               'desc_seen_onemin_ago',
               'desc_seen_onehour_ago',
               'desc_seen_threehours_ago',
               'small_time_gap']

X = sp.hstack([
        price_type_matrix, cat_matrix, pt_cat_matrix,
        title_matrix,
        desc_matrix,
        hfcats_title_matrix, 
        df_all[orig_colums].values.astype('float'),
    ], format='csr')
X

<95329x32877 sparse matrix of type '<type 'numpy.float64'>'
	with 3277965 stored elements in Compressed Sparse Row format>

In [40]:
X_train = X[train_idx]
X_val = X[val_idx]

In [41]:
for C in [0.01, 0.05, 0.1]:
    svm = LinearSVC(penalty='l1', dual=False, C=C, random_state=1)
    svm.fit(X_train, y_train)

    pred = svm.decision_function(X_val)
    auc = roc_auc_score(y_val, pred)
    print 'C=%.3f, auc=%.3f' % (C, auc)

C=0.010, auc=0.725
C=0.050, auc=0.761
C=0.100, auc=0.761


Value of $C=0.05$ looks reasonable, so we'll use it for the final model. Let's retrain it on the entire train+val set and then evaluate it on test

In [42]:
X_fulltest = X[fulltrain_idx]
X_test = X[test_idx]

In [43]:
C = 0.05
svm = LinearSVC(penalty='l1', dual=False, C=C, random_state=1)
svm.fit(X_fulltest, y_fulltrain)
pred = svm.decision_function(X_test)
auc = roc_auc_score(y_test, pred)
print 'auc=%.3f' % (auc)

auc=0.768


The final AUC is 0.768 which is in line with our Cross-Validation, so we didn't overfit

## Feature importance

I calculate the importance of categorical and text variables by:

- building individual models on these variables and then 
- evaluating the contribution of each model in an ensemble 
- I also add numerical variables there
- the ensembling technique is stacking
  - to prevent overfitting, I train these models on train, and apply them to validation
  - then I stack only the predictions on validation
  - finally, the validation data is also split into train and test to make sure xgb doesn't overfit
- I use fscore from xgb to rank features by importance - this is the number of times a feature is used

In [44]:
fi_train = pd.DataFrame()

svm = LinearSVC(penalty='l1', dual=False, C=1.0, random_state=1)
svm.fit(price_type_train, y_train)
fi_train['price_type'] = svm.decision_function(price_type_val)

svm = LinearSVC(penalty='l1', dual=False, C=0.5, random_state=1)
svm.fit(cat_train, y_train)
fi_train['category'] = svm.decision_function(cat_val)

svm = LinearSVC(penalty='l1', dual=False, C=0.5, random_state=1)
svm.fit(pt_cat_train, y_train)
fi_train['pt_cat'] = svm.decision_function(pt_cat_val)

svm = LinearSVC(penalty='l1', dual=False, C=0.1, random_state=1)
svm.fit(title_train, y_train)
fi_train['title'] = svm.decision_function(title_val)

svm = LinearSVC(penalty='l1', dual=False, C=0.05, random_state=1)
svm.fit(desc_train, y_train)
fi_train['desc'] = svm.decision_function(desc_val)

svm = LinearSVC(penalty='l1', dual=False, C=0.1, random_state=1)
svm.fit(hfcats_title_train, y_train)
fi_train['hfcats_title'] = svm.decision_function(hfcats_title_val)

df_val = df_all[df_all.test == VAL]
fi_train['time_from_registration'] = df_val.time_from_registration.values
fi_train['title_last_seen'] = df_val.title_last_seen.dt.seconds.values
fi_train['desc_last_seen'] = df_val.desc_last_seen.dt.seconds.values
fi_train['price'] = df_val.price.values

fi_train.head()

Unnamed: 0,price_type,category,pt_cat,title,desc,hfcats_title,time_from_registration,title_last_seen,desc_last_seen,price
0,-0.709288,-0.872748,-0.868735,-0.795494,-0.821338,-0.779687,35780424.0,100.0,,50.0
1,-0.703983,-0.595238,-0.72728,-0.730911,-0.622101,-0.69731,74114752.0,,,180.0
2,-0.703983,-0.872748,-0.908623,-0.787836,-0.79926,-0.877503,82266080.0,22339.0,,300.0
3,-0.703983,-0.701481,-0.790007,-0.761514,-0.8021,-0.729825,112347408.0,,,15.0
4,-0.709288,-0.701481,-0.765259,-0.773544,-0.656746,-0.894386,56841272.0,,,7.0


In [50]:
fi_train.corr()

Unnamed: 0,price_type,category,pt_cat,title,desc,hfcats_title,time_from_registration,title_last_seen,desc_last_seen,price
price_type,1.0,0.045328,0.311405,0.070014,0.116891,0.016803,-0.001709,-0.038156,-0.087992,-0.004706
category,0.045328,1.0,0.823139,0.369405,0.34844,0.454427,-0.041515,0.009836,-0.031583,-0.000925
pt_cat,0.311405,0.823139,1.0,0.35518,0.358891,0.42445,-0.02718,-0.001142,-0.040091,-0.009996
title,0.070014,0.369405,0.35518,1.0,0.466403,0.685262,-0.013708,-0.021927,-0.025111,-0.001817
desc,0.116891,0.34844,0.358891,0.466403,1.0,0.400582,-0.001113,-0.003371,0.018514,0.002691
hfcats_title,0.016803,0.454427,0.42445,0.685262,0.400582,1.0,-0.016322,-0.009666,0.008863,-0.003943
time_from_registration,-0.001709,-0.041515,-0.02718,-0.013708,-0.001113,-0.016322,1.0,0.014888,0.007475,-0.002217
title_last_seen,-0.038156,0.009836,-0.001142,-0.021927,-0.003371,-0.009666,0.014888,1.0,0.918523,0.01653
desc_last_seen,-0.087992,-0.031583,-0.040091,-0.025111,0.018514,0.008863,0.007475,0.918523,1.0,-0.014963
price,-0.004706,-0.000925,-0.009996,-0.001817,0.002691,-0.003943,-0.002217,0.01653,-0.014963,1.0


In [45]:
X_fi, X_fi_val, y_fi, y_fi_val = train_test_split(fi_train, y_val, test_size=0.2, random_state=1)

In [46]:
dtrain = xgb.DMatrix(X_fi, label=y_fi, feature_names=list(fi_train.columns))
dval = xgb.DMatrix(X_fi_val, label=y_fi_val, feature_names=list(fi_train.columns))
watchlist = [(dtrain, 'train'), (dval, 'val')]

In [47]:
xgb_pars = {
    'eta': 0.01,
    'gamma': 0,
    'max_depth': 6,
    'min_child_weight': 1,
    'max_delta_step': 0,
    'subsample': 1,
    'colsample_bytree': 1,
    'colsample_bylevel': 1,
    'lambda': 1,
    'alpha': 0,
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'nthread': 8,
    'seed': 42
}

In [48]:
n_estimators = 150
model = xgb.train(xgb_pars, dtrain, num_boost_round=n_estimators, verbose_eval=10, evals=watchlist)

[0]	train-auc:0.770331	val-auc:0.746386
[10]	train-auc:0.779498	val-auc:0.755073
[20]	train-auc:0.782698	val-auc:0.757453
[30]	train-auc:0.788296	val-auc:0.761578
[40]	train-auc:0.792206	val-auc:0.764498
[50]	train-auc:0.795856	val-auc:0.764895
[60]	train-auc:0.798147	val-auc:0.766376
[70]	train-auc:0.800251	val-auc:0.767069
[80]	train-auc:0.802492	val-auc:0.767156
[90]	train-auc:0.804297	val-auc:0.767609
[100]	train-auc:0.806101	val-auc:0.768718
[110]	train-auc:0.808090	val-auc:0.769767
[120]	train-auc:0.811202	val-auc:0.771340
[130]	train-auc:0.813151	val-auc:0.772883
[140]	train-auc:0.815301	val-auc:0.775022
[149]	train-auc:0.817036	val-auc:0.776527


In [49]:
sorted(model.get_fscore().items(), key=lambda (_, c): -c)

[('pt_cat', 1140),
 ('desc', 1091),
 ('price', 1035),
 ('hfcats_title', 972),
 ('title_last_seen', 832),
 ('title', 800),
 ('category', 693),
 ('time_from_registration', 691),
 ('desc_last_seen', 605),
 ('price_type', 325)]

The most important feature is the interaction between price type and category. No surprize - these features alone were able to give the best AUC. 

Price is also important, so including it to the model should imrove the performance