# Binarizing lists

In the previous notebook, we prepared a CSV file for the binarizer. In this notebook, we keep only the rows that are relevant to the binarizer, that is, rows whose `hashtags`, `mentions`, or `urls` list is non-empty, and export one CSV per list type.

In [1]:
import numpy as np
import pandas as pd
import multiprocessing as mp

Start by loading the pre-processed dataset.

In [2]:
data = pd.read_csv('../data/data_for_binarizer.csv')

In [3]:
data.head()

Unnamed: 0,week,user,tweets,hashtags,mentions,urls
0,36,bdogg,2,,,
1,25,0,1,,simonjjames,
2,30,00000000,1,,,http://www.livestream.com/00000000
3,32,00000000,1,,octavaria,
4,35,00000000,1,,,


The empty lists are loaded as `NaN`s. Before processing the data, let's analyze it a little and see how many rows actually contain any values within the lists.

In [4]:
data.shape

(25354401, 6)

In [5]:
data['hashtags'].dropna().shape

(3827022,)

In [6]:
data['mentions'].dropna().shape

(11330162,)

In [7]:
data['urls'].dropna().shape

(12550323,)
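
The three counts above can also be computed in one step, together with each column's share of the total. The following is a minimal sketch on a made-up toy frame (`toy` and its values are assumptions for illustration, not the real dataset); it relies on `DataFrame.count()`, which skips `NaN`s.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; the values are invented for illustration only.
toy = pd.DataFrame({
    'week': [36, 25, 30, 32],
    'user': ['a', 'b', 'c', 'd'],
    'tweets': [2, 1, 1, 1],
    'hashtags': [np.nan, np.nan, np.nan, 'tag'],
    'mentions': [np.nan, 'simonjjames', np.nan, 'octavaria'],
    'urls': [np.nan, np.nan, 'http://example.com', np.nan],
})

list_columns = ['hashtags', 'mentions', 'urls']

# count() ignores NaN, so this is the number of non-empty rows per column.
non_empty = toy[list_columns].count()

# Share of all rows that carry a value in each column.
share = non_empty / len(toy)

print(non_empty)
print(share)
```

On the real frame, the same two lines with `data` in place of `toy` should reproduce the counts shown above.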

As we can see, the useful rows are only a fraction of the original dataset: about 15% of the rows have hashtags, about 45% have mentions, and just under 50% have urls. Instead of applying the binarizer to the whole dataset, let's use only the relevant rows. We'll export these rows as CSVs.

In [8]:
hashtags = data.dropna(subset=['hashtags'])

In [9]:
mentions = data.dropna(subset=['mentions'])

In [10]:
urls = data.dropna(subset=['urls'])
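
Since the three filters follow the same pattern, they could also be written as a single loop. The sketch below uses a made-up toy frame (`toy`, `list_columns`, and `subsets` are hypothetical names, not part of the original notebook) and checks that `dropna(subset=...)` selects the same rows as indexing with the index of the non-null values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; the values are invented for illustration only.
toy = pd.DataFrame({
    'user': ['a', 'b', 'c', 'd'],
    'hashtags': [np.nan, np.nan, np.nan, 'tag'],
    'mentions': [np.nan, 'simonjjames', np.nan, 'octavaria'],
    'urls': [np.nan, np.nan, 'http://example.com', np.nan],
})

list_columns = ['hashtags', 'mentions', 'urls']

# One filtered frame per list column: keep rows where that column is not NaN.
subsets = {col: toy.dropna(subset=[col]) for col in list_columns}

# Same rows as selecting by the index of the non-null values.
same_rows = all(
    subsets[col].equals(toy.loc[toy[col].dropna().index])
    for col in list_columns
)
print(same_rows)
```

Each subset keeps all columns of the original frame, so a row in the mentions subset may still have `NaN` in `hashtags` or `urls`.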

Now save each subset to disk as its own CSV file.

In [11]:
hashtags.to_csv('../data/hashtags_for_binarizer.csv', index=False)

In [12]:
mentions.to_csv('../data/mentions_for_binarizer.csv', index=False)

In [13]:
urls.to_csv('../data/urls_for_binarizer.csv', index=False)
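
A quick way to sanity-check an export is to read the file back and confirm that the filtered column has no missing values. This is a minimal sketch on a made-up toy frame written to a temporary directory (the frame, the file name `mentions_toy.csv`, and the variable names are assumptions for illustration).

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy stand-in for the filtered data; values are invented for illustration.
toy = pd.DataFrame({
    'user': ['a', 'b', 'c'],
    'mentions': [np.nan, 'simonjjames', 'octavaria'],
})
mentions_toy = toy.dropna(subset=['mentions'])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'mentions_toy.csv')
    # index=False keeps the row index out of the file, as in the cells above.
    mentions_toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded file should have the same rows and columns, with no empty mentions.
round_trip_ok = (
    len(reloaded) == len(mentions_toy)
    and list(reloaded.columns) == list(mentions_toy.columns)
    and bool(reloaded['mentions'].notna().all())
)
print(round_trip_ok)
```

Because the files are written with `index=False`, the reloaded frame gets a fresh index starting at 0 rather than the row labels of the original dataset.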