# Model with hashtags, mentions, and URLs

Reduce the lists of hashtags, mentions, and URLs to simple counts, and use these counts as additional features for the model.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from datetime import datetime
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, f1_score

sb.set_style('dark')
plt.rcParams['figure.figsize'] = (14,10)

As always, load the dataset first.

In [2]:
data = pd.read_csv('../data/data.csv')

Change week timestamps into week numbers.

In [3]:
%%time
data['week'] = data['week'].apply(lambda w: datetime.strptime(w, '%Y-%m-%d 00:00:00').isocalendar()[1])

CPU times: user 4min 31s, sys: 1.51 s, total: 4min 32s
Wall time: 4min 33s
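The row-wise `strptime` call above takes several minutes on this dataset. A vectorized variant might look like the sketch below. It assumes pandas 1.1 or newer (for `Series.dt.isocalendar`) and the same timestamp format; the toy frame `sample` and its dates are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame using the same timestamp format as the 'week' column.
sample = pd.DataFrame({'week': ['2009-06-01 00:00:00', '2009-09-07 00:00:00']})

# Parse the whole column in one call, then read off the ISO week number.
parsed = pd.to_datetime(sample['week'], format='%Y-%m-%d %H:%M:%S')
sample['week'] = parsed.dt.isocalendar().week.astype(int)
```

The result should match `datetime.isocalendar()[1]` row for row, since both follow the ISO 8601 week definition.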


Map the lists for hashtags, mentions, and URLs to their counts.

In [4]:
%%time
data['hashtags'] = data['hashtags'].apply(lambda lst: 0 if lst == '{}' else (lst.count(',') + 1))
data['mentions'] = data['mentions'].apply(lambda lst: 0 if lst == '{}' else (lst.count(',') + 1))
data['urls'] = data['urls'].apply(lambda lst: 0 if lst == '{}' else (lst.count(',') + 1))

CPU times: user 36.9 s, sys: 2.92 s, total: 39.8 s
Wall time: 39.8 s
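The counting rule used above can be checked in isolation. This is a minimal sketch, assuming the lists are stored as strings of the form `{a,b,c}` with `{}` for an empty list; the helper name `count_items` and the sample strings are hypothetical.

```python
def count_items(lst: str) -> int:
    """Count entries in a '{a,b,c}'-style string; '{}' means no entries."""
    # n entries are separated by n - 1 commas, hence the + 1.
    return 0 if lst == '{}' else lst.count(',') + 1


# Made-up sample values in the assumed format.
counts = [count_items(s) for s in ['{}', '{python}', '{python,pandas,sklearn}']]
```

Note that this rule overcounts any entry that itself contains a comma, which can happen with URLs.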


As before, get rid of the 43rd week by keeping only weeks below 40.

In [5]:
data = data[data['week'] < 40]

In [6]:
data.shape

(25354401, 8)

Finally, transform the data into a pivot table.

In [7]:
pivot = data.pivot_table(index='user', columns='week', aggfunc=np.sum, fill_value=0)
pivot.head()

| user | tweets 23 | tweets 24 | tweets 25 | tweets 26 | tweets 27 | tweets 28 | tweets 29 | tweets 30 | tweets 31 | tweets 32 | ... | urls 27 | urls 28 | urls 29 | urls 30 | urls 31 | urls 32 | urls 33 | urls 34 | urls 35 | urls 36 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bdogg | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 000000000000111 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 000000000101010 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [8]:
pivot.shape

(8261630, 84)
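To see what the pivot does, here is a small sketch on made-up rows. The frame `toy` and its values are hypothetical, and the string `'sum'` stands in for `np.sum`.

```python
import pandas as pd

# Hypothetical toy rows: one row per (user, week) observation.
toy = pd.DataFrame({
    'user': ['a', 'a', 'b'],
    'week': [23, 24, 23],
    'tweets': [2, 1, 5],
    'urls': [1, 0, 3],
})

# One row per user, one column per (feature, week) pair.
# (user, week) pairs that never occur are filled with 0.
toy_pivot = toy.pivot_table(index='user', columns='week',
                            aggfunc='sum', fill_value=0)
```

Two features over two weeks give four columns for the two users. By the same logic, the 84 columns above are consistent with 6 feature columns over the 14 weeks from 23 to 36.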

Add a target column.

In [9]:
pivot['target'] = pivot['tweets'][36] > 0
pivot = pivot.drop(36, axis=1, level=1)
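The same two steps on a toy frame show why every week-36 column is dropped: the label is derived from week 36, so leaving any week-36 column in the features would leak it. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical two-level columns: (feature, week), as produced by the pivot.
cols = pd.MultiIndex.from_product([['tweets', 'urls'], [35, 36]],
                                  names=[None, 'week'])
toy = pd.DataFrame([[1, 2, 0, 1], [3, 0, 1, 0]], index=['a', 'b'], columns=cols)

# Label: did the user tweet at all in the final week?
toy['target'] = toy['tweets'][36] > 0

# Remove every week-36 column so the label cannot leak into the features.
toy = toy.drop(36, axis=1, level=1)
```

After this, only the week-35 features and the `target` column remain.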

The dataset is heavily imbalanced. Before training the logistic regression, balance it by undersampling the inactive users to match the number of active users.

In [10]:
active = pivot[pivot['target'] == True]
inactive = pivot[pivot['target'] == False]

In [11]:
inactive = inactive.sample(active.shape[0])

In [12]:
balanced = pd.concat([active, inactive])
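The undersampling steps above can be sketched on a small made-up frame. The `random_state` argument is an addition for repeatability and is not part of the original cells; `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical imbalanced data: 2 active users, 8 inactive users.
toy = pd.DataFrame({'feature': range(10),
                    'target': [True] * 2 + [False] * 8})

active = toy[toy['target']]
inactive = toy[~toy['target']]

# Undersample the majority class down to the size of the minority class.
# random_state is an assumption added here so the draw is repeatable.
inactive = inactive.sample(n=len(active), random_state=0)

balanced = pd.concat([active, inactive])
```

Undersampling discards most inactive rows, which keeps training fast but means the model never sees the dropped users.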

## Training

Now split the data into train and test sets (roughly 70% and 30%), then train the logistic regression.

In [13]:
train_rows = np.random.rand(balanced.shape[0]) < 0.7
train = balanced[train_rows].drop('target', axis=1)
train_target = balanced[train_rows]['target']
test = balanced[~train_rows].drop('target', axis=1)
test_target = balanced[~train_rows]['target']
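The random mask above gives a split that is only approximately 70/30 and differs on every run. One possible alternative, not used in this notebook, is scikit-learn's `train_test_split`, which can also keep the class ratio equal in both parts. The arrays `X` and `y` below are made-up stand-ins for the features and the target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the feature matrix and the balanced target.
X = np.arange(40).reshape(20, 2)
y = np.array([True, False] * 10)

# Exact 70/30 split; stratify=y keeps both classes equally represented,
# and random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0)
```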

In [14]:
%%time
reg = LogisticRegressionCV(n_jobs=-1, verbose=1, max_iter=400)
model = reg.fit(train, train_target)

[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  9.4min finished


CPU times: user 2min 17s, sys: 10 s, total: 2min 27s
Wall time: 10min 50s


In [15]:
predicted = model.predict(test)

In [16]:
accuracy_score(test_target, predicted)

0.76100100714059193

In [17]:
f1_score(test_target, predicted, average='macro')

0.75549076502079426

In [18]:
f1_score(test_target, predicted, average='micro')

0.76100100714059193
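The micro-averaged F1 score is identical to the accuracy above. This is expected when each sample has exactly one label: every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so pooled precision and recall both equal accuracy. A plain-Python sketch with made-up labels, using hypothetical helpers `accuracy` and `micro_f1` rather than the scikit-learn functions:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred):
    """F1 computed from TP/FP/FN counts pooled over all classes."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            tp += (t == label and p == label)
            fp += (t != label and p == label)
            fn += (t == label and p != label)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 3 of 5 predictions are correct.
y_true = [True, True, False, False, True]
y_pred = [True, False, False, True, True]

acc = accuracy(y_true, y_pred)
f1_micro = micro_f1(y_true, y_pred)
```

Both values come out equal up to floating-point rounding, so the macro-averaged score is the more informative of the two F1 variants here.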