# Model with time decay

With this classifier model, we will use fewer weeks as features. In addition, the features of weeks further away from the target week will be divided by a number proportional to the difference in week numbers, so that those weeks have less influence on the model.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import math
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, f1_score

By now this is standard: load the dataset. Here we can load the version that contains only numeric attributes.

In [3]:
data = pd.read_csv('../data/data_numbers_only.csv')

We want to throw away some weeks. Let's find out the minimum and maximum week numbers.

In [4]:
data['week'].min()

23

In [5]:
data['week'].max()

36

Let's use weeks 23-29 as predictors and week 30 as the target, so we keep only the rows up to week 30.

In [6]:
data = data[data['week'] <= 30]

In [7]:
data.head()

,Unnamed: 0,week,user,tweets,total_length,total_words,hashtags,mentions,urls
1,1,25,0,1,134,24,0,1,0
2,2,30,00000000,1,77,10,0,0,1
12,12,28,000000knight,2,124,16,2,0,2
13,13,29,000000knight,1,84,11,1,0,1
14,14,30,000000knight,1,91,12,1,0,1


Also, drop the columns `total_length` and `total_words`, as well as the leftover `Unnamed: 0` column.

In [8]:
data = data.drop(['Unnamed: 0', 'total_words', 'total_length'], axis=1)

In [9]:
data.head()

,week,user,tweets,hashtags,mentions,urls
1,25,0,1,0,1,0
2,30,00000000,1,0,0,1
12,28,000000knight,2,2,0,2
13,29,000000knight,1,1,0,1
14,30,000000knight,1,1,0,1


Now build the pivot table, with one row per user and one column per (attribute, week) pair.

In [10]:
pivot = data.pivot_table(index='user', columns='week', aggfunc=np.sum, fill_value=0)

In [11]:
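# A user counts as active if they tweeted at least once in the target week (30).
pivot['target'] = pivot['tweets'][30] > 0
# Remove every week-30 column so the target week does not leak into the features.
pivot = pivot.drop(30, axis=1, level=1)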
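To make the reshaping step easier to follow, here is a minimal sketch of the same pivot-and-target logic on a tiny frame. The rows, user names, and the reduced column set are made up for illustration, and `aggfunc='sum'` is used as the string equivalent of `np.sum`.

```python
import pandas as pd

# Made-up rows in the same long format as the dataset: one row per user and week.
toy = pd.DataFrame({
    'week': [29, 30, 29, 28],
    'user': ['a', 'a', 'b', 'c'],
    'tweets': [2, 1, 3, 1],
    'urls': [1, 0, 2, 0],
})

# One row per user, one column per (attribute, week); missing weeks become 0.
pivot = toy.pivot_table(index='user', columns='week', aggfunc='sum', fill_value=0)

# Users who tweeted in week 30 form the positive class.
pivot['target'] = pivot['tweets'][30] > 0

# Drop all week-30 columns; 'target' has an empty second level, so it is kept.
pivot = pivot.drop(30, axis=1, level=1)

print(pivot)
```

Only user `a` has a week-30 row in this toy data, so only that user ends up with `target` set to `True`.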

Just as before, balance the dataset by sampling as many inactive users as there are active ones.

In [12]:
active = pivot[pivot['target'] == True]
inactive = pivot[pivot['target'] == False]

In [13]:
inactive = inactive.sample(active.shape[0])

In [14]:
balanced = pd.concat([active, inactive])

In [15]:
balanced.head()

,tweets,tweets,tweets,tweets,tweets,tweets,tweets,hashtags,hashtags,hashtags,...,mentions,mentions,urls,urls,urls,urls,urls,urls,urls,target
week,23,24,25,26,27,28,29,23,24,25,...,28,29,23,24,25,26,27,28,29,
user,,,,,,,,,,,,,,,,,,,,,
00000000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,True
000000knight,0,0,0,0,0,2,1,0,0,0,...,0,0,0,0,0,0,0,2,1,True
00001001000001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,True
0000thefilm,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,True
0000update,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,True
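The balancing step above draws a different random subset of inactive users on every run. A minimal sketch of the same downsampling on made-up data follows; the `random_state` argument is an addition that is not in the notebook, used here only to make the draw repeatable.

```python
import pandas as pd

# Made-up table: 2 active users and 5 inactive ones.
toy = pd.DataFrame({
    'tweets': [3, 1, 0, 0, 0, 0, 0],
    'target': [True, True, False, False, False, False, False],
})

active = toy[toy['target']]
inactive = toy[~toy['target']]

# Downsample the majority class to the size of the minority class.
inactive_sample = inactive.sample(n=len(active), random_state=0)

balanced = pd.concat([active, inactive_sample])
print(balanced['target'].value_counts())
```

Note that `sample` raises an error if more rows are requested than exist, so this only works while the inactive group is at least as large as the active one.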


Now it's time to apply the time decay. We simply divide each feature by the distance between its week and the target week (30), i.e. by `30 - week`. Week 29 is therefore divided by 1 and week 23 by 7.

In [16]:
# Note: this is an alias, not a copy, so the loop below also modifies `balanced`.
decay = balanced

In [17]:
target_week = 30
for week in range(23, target_week):
    # Divide every feature of this week by its distance from the target week.
    for column in ('tweets', 'hashtags', 'mentions', 'urls'):
        decay.loc[:, (column, week)] = decay[column][week] / (target_week - week)
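The same decay can also be applied to all feature columns at once instead of column by column. This is a minimal sketch on a toy frame with made-up values, assuming the same two-level `(attribute, week)` column layout as the pivot table. It builds one divisor per column from the week level and relies on broadcasting.

```python
import numpy as np
import pandas as pd

target_week = 30
weeks = [28, 29]

# Toy frame with made-up values, using the same (attribute, week) column layout.
columns = pd.MultiIndex.from_product([['tweets', 'urls'], weeks])
toy = pd.DataFrame([[2, 1, 2, 1], [4, 0, 0, 3]],
                   index=['user_a', 'user_b'], columns=columns)

# One divisor per column: the distance of that column's week from the target week.
divisors = target_week - np.asarray(toy.columns.get_level_values(1))

# Broadcasting divides every row by the per-column divisors.
decayed = toy / divisors
print(decayed)
```

In the notebook itself the `target` column would have to be set aside before such a division, because its second column level is not a week number.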

## Training

It's time to train the classifier. Let's hope it scores better than the previous models.

In [18]:
# Random split: each row goes to the training set with probability 0.7.
train_rows = np.random.rand(decay.shape[0]) < 0.7
train = decay[train_rows].drop('target', axis=1)
train_target = decay[train_rows]['target']
test = decay[~train_rows].drop('target', axis=1)
test_target = decay[~train_rows]['target']
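The random mask above yields only approximately 70% training rows, and the split changes on every run. As an alternative sketch (not what this notebook does), scikit-learn's `train_test_split` gives an exact, reproducible split and can keep the class balance the same in both parts. The data below is made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up balanced data: 20 rows, 2 features, 10 positives and 10 negatives.
X = np.arange(40).reshape(20, 2)
y = np.array([True] * 10 + [False] * 10)

# stratify=y keeps the share of each class equal in the train and test parts;
# random_state fixes the shuffle so the split can be reproduced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print(len(X_train), len(X_test))
```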

In [19]:
%%time
reg = LogisticRegressionCV(n_jobs=-1, verbose=1, max_iter=400)
model = reg.fit(train, train_target)

[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   33.5s finished


CPU times: user 9.13 s, sys: 1.53 s, total: 10.7 s
Wall time: 39.4 s


In [20]:
predicted = model.predict(test)

In [21]:
accuracy_score(test_target, predicted)

0.60977869935408102

In [22]:
f1_score(test_target, predicted, average='macro')

0.59660240693794497

In [23]:
f1_score(test_target, predicted, average='micro')

0.60977869935408102

And it doesn't :(. The accuracy is only about 0.61, probably because we used only 7 weeks as features.

## More weeks

Let's try the same model again, this time using all the weeks (23-35 as predictors, week 36 as the target) together with the time decay.

In [2]:
data = pd.read_csv('../data/data_numbers_only.csv')

In [3]:
pivot = data.pivot_table(index='user', columns='week', values=['tweets', 'hashtags', 'mentions', 'urls'], aggfunc=np.sum, fill_value=0)

In [4]:
pivot['target'] = pivot['tweets'][36] > 0
pivot = pivot.drop(36, axis=1, level=1)

In [5]:
active = pivot[pivot['target'] == True]
inactive = pivot[pivot['target'] == False]

In [6]:
inactive = inactive.sample(active.shape[0])

In [7]:
balanced = pd.concat([active, inactive])

Just as before, we will apply a time decay. Because there are more weeks now, we will divide the values by the square root of the difference in weeks rather than by the difference itself, which makes the decay gentler for the oldest weeks.

In [8]:
decay = balanced

In [9]:
target_week = 36
for week in range(23, target_week):
    # Divide every feature of this week by the square root of its distance from the target week.
    for column in ('tweets', 'hashtags', 'mentions', 'urls'):
        decay.loc[:, (column, week)] = decay[column][week] / math.sqrt(target_week - week)
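To see how much gentler the square-root decay is, the weights the two schemes give each week can be compared directly. This small sketch uses only the week numbers from this notebook; the names `linear` and `root` are just for illustration.

```python
import math

target_week = 36
first_week = 23

# Multiplicative weight each week receives under the two decay schemes.
linear = {w: 1 / (target_week - w) for w in range(first_week, target_week)}
root = {w: 1 / math.sqrt(target_week - w) for w in range(first_week, target_week)}

for w in range(first_week, target_week):
    print(f"week {w}: linear {linear[w]:.3f}, sqrt {root[w]:.3f}")
```

Week 35 keeps a weight of 1 under both schemes, while week 23 keeps about 0.277 of its value with the square root instead of about 0.077 with the plain difference.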

## New training

Let's train the second model and see how well it performs. But before we do that, we need to clear some variables because I'm running out of memory :/. The decayed table stays available through `decay`, which still refers to it.

In [10]:
data = pivot = balanced = active = inactive = reg = model = None

Now train the model.

In [11]:
# Random split: each row goes to the training set with probability 0.7.
train_rows = np.random.rand(decay.shape[0]) < 0.7
train = decay[train_rows].drop('target', axis=1)
train_target = decay[train_rows]['target']
test = decay[~train_rows].drop('target', axis=1)
test_target = decay[~train_rows]['target']

In [12]:
%%time
reg = LogisticRegressionCV(n_jobs=-1, verbose=1, max_iter=400)
model = reg.fit(train, train_target)

[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   50.9s finished


CPU times: user 12.8 s, sys: 1.96 s, total: 14.7 s
Wall time: 58.8 s


In [13]:
predicted = model.predict(test)

In [14]:
accuracy_score(test_target, predicted)

0.76138199608473989

In [15]:
f1_score(test_target, predicted, average='macro')

0.75582648295050858

In [16]:
f1_score(test_target, predicted, average='micro')

0.76138199608473989
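The micro-averaged F1 score is identical to the accuracy in both experiments, and that is not a coincidence. When every sample has exactly one label, each wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the pooled F1 reduces to the share of correct predictions. A tiny pure-Python check on made-up labels:

```python
# Toy labels, made up for illustration (not taken from the dataset).
y_true = [True, True, False, False, True, False]
y_pred = [True, False, False, True, True, False]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Micro-averaging pools the counts over both classes before computing F1.
tp = fp = fn = 0
for cls in (True, False):
    for t, p in zip(y_true, y_pred):
        tp += (p == cls) and (t == cls)
        fp += (p == cls) and (t != cls)
        fn += (p != cls) and (t == cls)

micro_f1 = 2 * tp / (2 * tp + fp + fn)
print(accuracy, micro_f1)
```

For this kind of binary task the macro-averaged score is therefore the more informative of the two F1 variants, since the micro-averaged one adds nothing beyond the accuracy.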