This notebook preprocesses the dataset for input into our ML models, using spaCy for tokenization, lemmatization, and part-of-speech (PoS) tagging.

In [47]:
from __future__ import unicode_literals, print_function

import tld
import spacy
import numpy as np
from sklearn.preprocessing import LabelEncoder

import utils

nlp = spacy.load('en')  # Load the English model and keep a reference to it.

%load_ext autoreload

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Pre-pre-processing
Read and clean the text of the articles we care about: those that Alchemy has labeled as political and that we have labeled as either "conservative" or "liberal". We'll also extract each article's domain (as returned by the `tld` package, e.g. `dailysurge.com`) for later. The output here is a pandas DataFrame.

In [2]:
# Read the data, excluding satirical articles because there are too few samples.

datadir = '../../news-crawler/data/articles/'
files = utils.get_file_list(datadir, exclude_regex='.*satirical')

print('Number of articles: {}'.format(len(files)))

Number of articles: 67292


In [3]:
%%time

# Use multiprocessing to pre-pre-process.

df = utils.create_dataframe(files)
print(df.shape)

(43226, 3)
CPU times: user 662 ms, sys: 305 ms, total: 967 ms
Wall time: 29.9 s


In [4]:
# Take a peek!

df.head()

| | text | label | url |
|---|---|---|---|
| 0 | Big government has been crushing the United St... | conservative | http://dailysurge.com/2016/12/commentary-resto... |
| 1 | During the eight years of the Obama administra... | conservative | http://dailysurge.com/2016/12/commentary-welco... |
| 2 | We are witnessing the rise of a new “right” wh... | conservative | http://dailysurge.com/2016/11/commentary-28th-... |
| 3 | If there’s one thing that Americans find intol... | conservative | http://dailysurge.com/2016/11/commentary-retur... |
| 4 | What is Airbnb? Airbnb is an online marketplac... | conservative | http://dailysurge.com/2016/11/commentary-airbn... |


In [7]:
# Extract top-level domain for later.

df['domain'] = df['url'].map(tld.get_tld)
df.head()

| | text | label | url | domain |
|---|---|---|---|---|
| 0 | Big government has been crushing the United St... | conservative | http://dailysurge.com/2016/12/commentary-resto... | dailysurge.com |
| 1 | During the eight years of the Obama administra... | conservative | http://dailysurge.com/2016/12/commentary-welco... | dailysurge.com |
| 2 | We are witnessing the rise of a new “right” wh... | conservative | http://dailysurge.com/2016/11/commentary-28th-... | dailysurge.com |
| 3 | If there’s one thing that Americans find intol... | conservative | http://dailysurge.com/2016/11/commentary-retur... | dailysurge.com |
| 4 | What is Airbnb? Airbnb is an online marketplac... | conservative | http://dailysurge.com/2016/11/commentary-airbn... | dailysurge.com |
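
For reference, a host name can also be pulled out of a URL with the standard library alone. This is only a minimal sketch and not what `tld.get_tld` does internally: `hostname_of` is a hypothetical helper, the example URL path is made up, the code assumes Python 3, and it returns the full host (including any subdomain other than `www.`) rather than the registered domain. Also note that, depending on the installed version of `tld`, `get_tld` may return only the suffix (e.g. `com`); newer releases provide `get_fld` for values like `dailysurge.com`.

```python
from urllib.parse import urlparse  # Python 3 standard library


def hostname_of(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    # .hostname is None when the URL has no network location.
    host = urlparse(url).hostname or ''
    if host.startswith('www.'):
        host = host[len('www.'):]
    return host


example = hostname_of('http://dailysurge.com/2016/12/some-article/')
print(example)
```

With a DataFrame like the one above, this could be applied as `df['url'].map(hostname_of)`.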


### Pre-processing using spacy NLP

__Here we are going to lemmatize the text and tag words with their part of speech.__

In [8]:
# Keep an article only if it has at least min_sents sentences.
min_sents = 3

# Whether to exclude stopwords.
keep_stops = False

In [31]:
%autoreload
import utils

In [32]:
%%time

# Tokenize, lemmatize, and PoS-tag the text.
df['tokenized'] = utils.parse_docs(list(df['text']), keep_stops, min_sents)

processing 43226 docs
CPU times: user 1.3 s, sys: 482 ms, total: 1.78 s
Wall time: 5min 38s
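
The internals of `utils.parse_docs` are not shown in this notebook, so the following is a hedged sketch of how `lemma_POS` strings could be produced from parsed sentences. `format_sentences` and the `Token` stand-in are hypothetical names; the stand-in models only the spaCy token attributes `lemma_`, `pos_`, `is_stop`, and `is_punct`, so the example runs without a language model installed. Dropping punctuation and returning an empty string for short documents are assumptions about the real behavior.

```python
from collections import namedtuple

# Stand-in for a spaCy token; only the attributes used below are modeled.
Token = namedtuple('Token', ['lemma_', 'pos_', 'is_stop', 'is_punct'])


def format_sentences(sentences, keep_stops=False, min_sents=3):
    """Join tokens as 'lemma_POS'; return '' for documents that are too short."""
    if len(sentences) < min_sents:
        return ''
    words = []
    for sentence in sentences:
        for tok in sentence:
            # Skip punctuation always, and stopwords unless keep_stops is set.
            if tok.is_punct or (tok.is_stop and not keep_stops):
                continue
            words.append('{}_{}'.format(tok.lemma_.lower(), tok.pos_))
    return ' '.join(words)


doc = [
    [Token('big', 'ADJ', False, False), Token('government', 'NOUN', False, False)],
    [Token('have', 'VERB', True, False), Token('crush', 'VERB', False, False)],
    [Token('.', 'PUNCT', False, True)],
]
tokenized = format_sentences(doc, keep_stops=False, min_sents=3)
short = format_sentences(doc[:2], keep_stops=False, min_sents=3)
print(tokenized)
```

With a real spaCy document, the sentences would come from `list(doc.sents)`, since `doc.sents` is a generator and has no length.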


In [38]:
# Some of the articles are empty; let's remove them.

df = df[df['tokenized'] != ''].reset_index(drop=True)  # Drop empty rows and re-index.
df.shape

(41917, 4)

__Next, let's build the vocabulary data structures and use them to encode the entire corpus.__

In [44]:
%%time

# Extract the vocabulary and related data structures for encoding/decoding the corpus.

vocab_list, vocab_word2idx, vocab_idx2word = utils.create_vocab(list(df['tokenized']))

dictionary size: 116081
CPU times: user 6.12 s, sys: 38.4 ms, total: 6.16 s
Wall time: 6.14 s


In [45]:
# Check out the 10 most frequent words.

vocab_list[:10]

[(u'say_VERB', 206657),
 (u'trump_PROPN', 185069),
 (u'people_NOUN', 58478),
 (u'president_PROPN', 56919),
 (u'house_PROPN', 52946),
 (u'year_NOUN', 51064),
 (u'president_NOUN', 47328),
 (u'obama_PROPN', 42440),
 (u'state_NOUN', 39197),
 (u'time_NOUN', 39013)]
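
The three structures returned by `utils.create_vocab` can be approximated with a `Counter`. This is a sketch under assumptions, not the actual implementation: `build_vocab` is a hypothetical name, indices are assumed to be 0-based frequency ranks, and ties are broken alphabetically only to make the order deterministic.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Count whitespace-separated tokens and index them by frequency rank."""
    counts = Counter()
    for doc in tokenized_docs:
        counts.update(doc.split())
    # Most frequent first; ties broken alphabetically so the order is stable.
    vocab_list = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    word2idx = {word: idx for idx, (word, _) in enumerate(vocab_list)}
    idx2word = {idx: word for word, idx in word2idx.items()}
    return vocab_list, word2idx, idx2word


docs = ['say_VERB trump_PROPN say_VERB', 'people_NOUN say_VERB trump_PROPN']
vocab_list, word2idx, idx2word = build_vocab(docs)
print(vocab_list)
```

Indexing by frequency rank keeps the most common words at small indices, which makes it easy to truncate the vocabulary to the top N entries later.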

In [49]:
%%time

# Encode the corpus.

df['encoded_text'] = df['tokenized'].map(lambda x: [vocab_word2idx[y] for y in x.split()])

CPU times: user 4.19 s, sys: 124 ms, total: 4.32 s
Wall time: 4.31 s
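
The encoding step is a plain dictionary lookup per token, and the index-to-word mapping inverts it. The sketch below shows the round trip on a toy mapping; `encode`, `decode`, and the three-word vocabulary are made up for illustration.

```python
def encode(tokenized, word2idx):
    """Map a whitespace-separated token string to a list of integer indices."""
    return [word2idx[word] for word in tokenized.split()]


def decode(indices, idx2word):
    """Invert encode(): map indices back to a token string."""
    return ' '.join(idx2word[idx] for idx in indices)


# Toy mapping, made up for illustration.
word2idx = {'say_VERB': 0, 'trump_PROPN': 1, 'people_NOUN': 2}
idx2word = {idx: word for word, idx in word2idx.items()}

encoded = encode('people_NOUN say_VERB', word2idx)
restored = decode(encoded, idx2word)
print(encoded, restored)
```

A token that is missing from `word2idx` raises a `KeyError` here; that is acceptable when the vocabulary was built from the same corpus, but new text would need an explicit out-of-vocabulary index.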


In [58]:
# Encode the labels.

df['encoded_label'] = LabelEncoder().fit_transform(df['label'])
df.head()

| | text | label | url | tokenized | encoded_label | encoded_text |
|---|---|---|---|---|---|---|
| 0 | Big government has been crushing the United St... | conservative | http://dailysurge.com/2016/12/commentary-resto... | big_ADJ government_NOUN crush_VERB united_PROP... | 0 | [147, 14, 4668, 25, 38, 444, 360, 5, 14550, 16... |
| 1 | During the eight years of the Obama administra... | conservative | http://dailysurge.com/2016/12/commentary-welco... | year_NOUN obama_PROPN administration_NOUN witn... | 0 | [5, 7, 16, 2889, 60912, 72, 41, 7, 480, 6, 406... |
| 2 | We are witnessing the rise of a new “right” wh... | conservative | http://dailysurge.com/2016/11/commentary-28th-... | witness_VERB rise_NOUN new_ADJ right_INTJ libe... | 0 | [2889, 1700, 20, 2813, 1241, 3611, 235, 30, 14... |
| 3 | If there’s one thing that Americans find intol... | conservative | http://dailysurge.com/2016/11/commentary-retur... | thing_NOUN americans_PROPN find_VERB intolerab... | 0 | [72, 129, 54, 9584, 6720, 7117, 178, 18, 564, ... |
| 4 | What is Airbnb? Airbnb is an online marketplac... | conservative | http://dailysurge.com/2016/11/commentary-airbn... | airbnb_PROPN airbnb_PROPN online_ADJ marketpla... | 0 | [7024, 7024, 1742, 4014, 179, 2, 154, 1015, 14... |
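
`LabelEncoder` numbers the classes in sorted order, which is why "conservative" maps to 0 above and "liberal" would map to 1. The dependency-free sketch below reproduces that mapping for illustration; `encode_labels` is a hypothetical helper, not part of scikit-learn.

```python
def encode_labels(labels):
    """Assign each distinct label an integer code by sorted order."""
    classes = sorted(set(labels))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[label] for label in labels], classes


codes, classes = encode_labels(['liberal', 'conservative', 'liberal'])
print(codes, classes)
```

Keeping the fitted encoder (or the `classes` list here) around is what allows predictions to be mapped back to label names later.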


### Write the data

In [68]:
_ = utils.write_dataset('../data/data', df, keep_stops, min_sents, vocab_list, vocab_word2idx, vocab_idx2word)

wrote to ../data/data-False-3
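
The output name above suggests that the path is tagged with the preprocessing settings (`keep_stops` and `min_sents`). The sketch below shows one way to do that; the on-disk format of `utils.write_dataset` is not shown in this notebook, so the use of `pickle`, the helper names `dataset_path` and `save_dataset`, and the toy payload are all assumptions.

```python
import os
import pickle
import tempfile


def dataset_path(prefix, keep_stops, min_sents):
    """Build a path like 'prefix-False-3' from the preprocessing settings."""
    return '{}-{}-{}'.format(prefix, keep_stops, min_sents)


def save_dataset(prefix, payload, keep_stops, min_sents):
    """Pickle payload to a settings-tagged path and return that path."""
    path = dataset_path(prefix, keep_stops, min_sents)
    with open(path, 'wb') as handle:
        pickle.dump(payload, handle)
    return path


# Write to a temporary directory so the example has no side effects elsewhere.
out_dir = tempfile.mkdtemp()
path = save_dataset(os.path.join(out_dir, 'data'), {'vocab': ['say_VERB']}, False, 3)
with open(path, 'rb') as handle:
    loaded = pickle.load(handle)
print(os.path.basename(path))
```

Encoding the settings in the file name lets several preprocessed variants of the same corpus sit side by side without overwriting each other.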
