# Binarizing lists

It's time to take into account not just the counts of hashtags, URLs, and mentions, but the lists themselves. In this notebook, we prepare the original data for binarizing the list-valued columns. (The binarization itself is left to the next notebook, because doing it here quickly exhausts the computer's memory.)

In [1]:
import numpy as np
import pandas as pd
from datetime import datetime

In [2]:
data = pd.read_csv('../data/data.csv')

In [3]:
data.head()

Unnamed: 0,week,user,tweets,total_length,total_words,hashtags,mentions,urls
0,2009-08-31 00:00:00,bdogg,2,133,24,{},{},{}
1,2009-06-15 00:00:00,0,1,134,24,{},{simonjjames},{}
2,2009-07-20 00:00:00,00000000,1,77,10,{},{},{http://www.livestream.com/00000000}
3,2009-08-03 00:00:00,00000000,1,28,5,{},{octavaria},{}
4,2009-08-24 00:00:00,00000000,1,68,12,{},{},{}


Map the week timestamps into week numbers.

In [4]:
%%time
data['week'] = data['week'].apply(lambda w: datetime.strptime(w, '%Y-%m-%d 00:00:00').isocalendar()[1])

CPU times: user 4min 15s, sys: 951 ms, total: 4min 16s
Wall time: 4min 16s
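
The row-by-row `strptime` call above is what makes this cell take several minutes. As a possible alternative, here is a minimal sketch that parses the whole column at once. It assumes pandas 1.1 or newer (for `Series.dt.isocalendar()`), and `sample` is a hypothetical stand-in for `data` built from the timestamps shown in the table above.

```python
import pandas as pd

# Hypothetical sample in the same format as the 'week' column above.
sample = pd.DataFrame({
    'week': ['2009-08-31 00:00:00', '2009-06-15 00:00:00', '2009-07-20 00:00:00']
})

# Parse every timestamp in one vectorized call instead of one strptime per row.
parsed = pd.to_datetime(sample['week'], format='%Y-%m-%d %H:%M:%S')

# isocalendar() returns year/week/day columns; keep only the ISO week number.
sample['week'] = parsed.dt.isocalendar().week.astype(int)

print(sample['week'].tolist())  # [36, 25, 30]
```

The week number alone does not record the year, so this mapping is only unambiguous while all rows fall within a single year, as the 2009 rows shown here do.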


Drop columns `total_length` and `total_words`.

In [5]:
data = data.drop(['total_length', 'total_words'], axis=1)

Now map the lists of hashtags, mentions, and URLs to strings with `|` as a separator: strip the surrounding braces and replace the commas.

In [6]:
%%time
for c in ['hashtags', 'mentions', 'urls']:
    data[c] = data[c].apply(lambda s: s[1:-1].replace(',', '|'))

CPU times: user 41.3 s, sys: 3.48 s, total: 44.8 s
Wall time: 45 s
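
To make the transformation concrete, the sketch below applies the same logic to a tiny hypothetical frame, using the vectorized `.str` accessor rather than `apply`. The raw value `{internetmarketing,internet,cel}` is an assumption inferred from the processed output shown further down, not a value copied from the source file.

```python
import pandas as pd

# Hypothetical raw values in the '{a,b,c}' format the columns appear to use.
sample = pd.DataFrame({
    'hashtags': ['{}', '{internetmarketing,internet,cel}'],
    'urls': ['{}', '{http://wefollow.com}'],
})

for c in ['hashtags', 'urls']:
    # Drop the surrounding braces, then replace the comma separator with '|'.
    sample[c] = sample[c].str[1:-1].str.replace(',', '|', regex=False)

print(sample['hashtags'].tolist())  # ['', 'internetmarketing|internet|cel']
print(sample['urls'].tolist())      # ['', 'http://wefollow.com']
```

One caveat applies to both versions: a URL that itself contains a comma would be split into two items, since every comma is treated as a separator.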


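Get rid of the 43rd week. The filter below keeps only rows with a week number below 40, so it also removes any other week numbered 40 or higher.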

In [7]:
data = data[data['week'] < 40]
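
To check exactly which weeks such a filter removes, one option is to compare the week numbers before and after. This is a minimal sketch on a hypothetical `sample` frame; the week values are made up for illustration.

```python
import pandas as pd

# Hypothetical week numbers, including one stray week 43.
sample = pd.DataFrame({'week': [25, 30, 36, 36, 43]})

# Rows per week number, in week order, before filtering.
counts = sample['week'].value_counts().sort_index()

# Same condition as the cell above.
kept = sample[sample['week'] < 40]

# Week numbers present before the filter but absent afterwards.
dropped_weeks = sorted(set(sample['week']) - set(kept['week']))
print(dropped_weeks)  # [43]
```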

Just to make sure we didn't make a mistake, let's look at the first rows and check the data type of the `hashtags` column.

In [8]:
data.head(10)

Unnamed: 0,week,user,tweets,hashtags,mentions,urls
0,36,bdogg,2,,,
1,25,0,1,,simonjjames,
2,30,00000000,1,,,http://www.livestream.com/00000000
3,32,00000000,1,,octavaria,
4,35,00000000,1,,,
5,36,00000000,1,,,
6,33,000000000000111,1,internetmarketing|internet|cel,,http://wefollow.com
7,32,000000000101010,1,,,
8,31,00000001,1,,,
9,32,00000001,1,,,


In [9]:
type(data.loc[0, 'hashtags'])

str

Save the data for later processing. In the next notebook, we will binarize the columns, one by one.

In [10]:
data.to_csv('../data/data_for_binarizer.csv')
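
For orientation, this is roughly what binarizing a `|`-separated column can look like on a tiny sample. It is only a sketch using `Series.str.get_dummies`, not necessarily the approach taken in the next notebook, and `sample` is a hypothetical series.

```python
import pandas as pd

# Hypothetical '|'-separated values, as produced by the steps above.
sample = pd.Series(['', 'internetmarketing|internet|cel', 'internet'], name='hashtags')

# One 0/1 indicator column per distinct item; empty strings give all-zero rows.
binary = sample.str.get_dummies(sep='|')

print(list(binary.columns))  # ['cel', 'internet', 'internetmarketing']
print(binary.sum(axis=1).tolist())  # [0, 3, 1]
```

The result is a dense frame with one column per distinct item, so on the full data set its size grows with the number of distinct hashtags, mentions, or URLs. That is consistent with the memory problem mentioned at the top of this notebook.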