# Binarizing hashtags

In this notebook, we load the CSV containing the hashtags to binarize, process it, and save the result to another CSV so that it can later be concatenated with the rest of the dataset.

In [1]:
import numpy as np
import pandas as pd
import multiprocessing as mp
from sklearn.preprocessing import MultiLabelBinarizer

In [2]:
data = pd.read_csv('../data/hashtags_for_binarizer.csv')

In [4]:
data.shape

(3827022, 6)

In [5]:
data.head()

Unnamed: 0,week,user,tweets,hashtags,mentions,urls
0,33,000000000000111,1,internetmarketing|internet|cel,,http://wefollow.com
1,32,0000001xx,1,tinychat,,http://tinychat.com/stickamraids
2,28,000000knight,2,Ebuyer845|Ebuyer478,,http://snurl.com/ebuyer1|http://snurl.com/ebuyer2
3,29,000000knight,1,Ebuyer111,,http://sn.im/ebuyer_sat
4,30,000000knight,1,Ebuyer355,,http://sn.im/ebuyer_sun


Now use the binarizer to process the `hashtags` column. Each value is a `|`-separated string, so we first split it into a list of hashtags.

In [6]:
%%time
bin_columns = MultiLabelBinarizer(sparse_output=True).fit_transform(data['hashtags'].apply(lambda x: x.split('|')))

CPU times: user 21.3 s, sys: 985 ms, total: 22.3 s
Wall time: 22.4 s


In [7]:
bin_columns.shape

(3827022, 1169693)
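To make the transformation concrete, here is a minimal sketch on a toy series. The values in `toy` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: three rows of '|'-separated hashtags.
toy = pd.Series(['a|b', 'b', 'c|a'])

mlb = MultiLabelBinarizer(sparse_output=True)
# Split each string into a list of labels, then binarize.
toy_bin = mlb.fit_transform(toy.str.split('|'))

print(mlb.classes_)       # one column per distinct hashtag, in sorted order
print(toy_bin.toarray())  # one row per input row, 1 where the hashtag is present
```

Each distinct hashtag becomes one column, which is why the real matrix ends up with over a million columns and has to be kept sparse.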

Get the number of rows with a 1 for each hashtag. We will drop the columns that have very few rows with a 1.

In [8]:
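# sum() on a sparse matrix returns a (1, n_hashtags) np.matrix; flatten it to 1-D
# so that np.where(...)[0] below yields column indices rather than row indices.
hashtag_usage = np.asarray(bin_columns.sum(axis=0)).ravel()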

In [9]:
%%time
columns_with_high_usage = bin_columns[:, np.where(hashtag_usage >= 200)[0]]
columns_with_low_usage = bin_columns[:, np.where(hashtag_usage < 200)[0]]

CPU times: user 7.58 s, sys: 4.41 s, total: 12 s
Wall time: 12.4 s


In [10]:
has_other_hashtags = columns_with_low_usage.sum(1).astype(bool).astype(int)
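The thresholding step can be checked on a small matrix. The sketch below uses a made-up 3x3 matrix and a hypothetical threshold of 2 instead of 200. One detail worth verifying: summing a SciPy sparse matrix along an axis returns a 2-D `np.matrix`, and `np.where(...)[0]` on a 2-D result gives row indices, so the counts are flattened to 1-D first.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy matrix: 3 rows, 3 hashtag columns.
toy = sparse.csr_matrix(np.array([[1, 0, 1],
                                  [1, 1, 0],
                                  [1, 0, 0]]))

# Per-column usage counts, flattened to a 1-D array.
usage = np.asarray(toy.sum(axis=0)).ravel()

threshold = 2  # stands in for the 200 used in the notebook
high = toy[:, np.where(usage >= threshold)[0]]
low = toy[:, np.where(usage < threshold)[0]]

# 1 where a row uses at least one low-usage hashtag, else 0.
has_other = np.asarray(low.sum(axis=1)).ravel().astype(bool).astype(int)

print(usage, high.shape, has_other)
```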

In [11]:
columns_with_high_usage.shape

(3827022, 4887)

This leaves us with `4887` hashtag columns, plus one column flagging any other hashtags. The retained hashtags are those that appear in at least 200 rows. Now we need to drop the original `hashtags` column and insert the new columns.

In [12]:
data = data.drop('hashtags', axis=1).to_sparse(fill_value=0)

In [None]:
data['hashtag_other'] = pd.SparseSeries(np.asarray(has_other_hashtags).reshape(-1), fill_value=0)

In [None]:
%%time
num_columns = columns_with_high_usage.shape[1]
for i in range(num_columns):
    print('inserting column', i, '/', num_columns)
    data['hashtag_' + str(i)] = pd.SparseSeries(columns_with_high_usage[:, i].toarray().reshape(-1), fill_value=0)

inserting column 0 / 4887
inserting column 1 / 4887
inserting column 2 / 4887
...
inserting column 626 / 4887
[output truncated]

In [None]:
data.shape
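`DataFrame.to_sparse` and `pd.SparseSeries` were removed in pandas 1.0, so the cells above only run on older versions. As a rough sketch of an alternative on newer pandas, the sparse matrix can be converted in one call with `pd.DataFrame.sparse.from_spmatrix` instead of inserting columns one at a time. The names `base`, `high` and `has_other` and their values are hypothetical stand-ins for `data`, `columns_with_high_usage` and `has_other_hashtags`.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-ins for the notebook's objects.
base = pd.DataFrame({'week': [33, 32, 28], 'user': ['a', 'b', 'c']})
high = sparse.csr_matrix(np.array([[1, 0], [0, 0], [1, 1]]))
has_other = np.array([0, 1, 0])

# Build all sparse hashtag columns at once.
hashtag_cols = pd.DataFrame.sparse.from_spmatrix(
    high,
    index=base.index,
    columns=['hashtag_' + str(i) for i in range(high.shape[1])],
)

# Add the "other hashtags" flag as a sparse column, then join everything.
base = base.assign(hashtag_other=pd.arrays.SparseArray(has_other, fill_value=0))
result = pd.concat([base, hashtag_cols], axis=1)

print(result.shape)
print(result.dtypes)
```

Whether this scales to the full 3.8 million rows by 4887 columns has not been tested here.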

We now have the final dataset, so let's save it to a CSV file. *TODO: the following code block raises an **IndexError**.*

In [None]:
%%time
data.to_csv('../data/hashtags_binarized.csv', index=False)
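Until the **IndexError** is resolved, one possible workaround is to skip the wide CSV and store the binarized matrix in SciPy's `.npz` format, keeping the remaining columns in a regular CSV. This is only a sketch under that assumption, not a fix for the error itself. The matrix `high` and the temporary output path are hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Hypothetical stand-in for columns_with_high_usage.
high = sparse.csr_matrix(np.array([[1, 0], [0, 0], [1, 1]]))

# Write to a temporary directory here; the notebook would use ../data/ instead.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, 'hashtags_binarized.npz')

sparse.save_npz(path, high)        # compact on-disk sparse format
restored = sparse.load_npz(path)   # round-trips back to a sparse matrix

print(restored.shape)
```

Row order is preserved, so the restored matrix can be lined up with the rest of the dataset by position.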