# Drew - Preprocessing Demo Notebook

## This notebook demonstrates...
1. How to split a dataset into train and test sets
2. How to remove stop words from text using gensim
3. How to remove punctuation from text using standard Python
4. How to perform tokenization with spaCy
5. How to perform stemming with NLTK
6. How to perform lemmatization with spaCy

## #1 Split Dataset into train and test

In [1]:
import pandas as pd
import csv
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
%%time
# Read the initial dataset
df = pd.read_csv('../Datasets/Final/cleaned_reliable_fake.csv')
# Split into train and test sets and save each to CSV
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42, shuffle=True)  # 80/20 split; the fixed random_state makes the shuffle reproducible
df_train.to_csv('../Datasets/Model_1/train.csv', sep=',', index=False)
df_test.to_csv('../Datasets/Model_1/test.csv', sep=',', index=False)
df_train.head()

CPU times: user 2min 11s, sys: 6.75 s, total: 2min 18s
Wall time: 2min 19s


Unnamed: 0.1,Unnamed: 0,domain,type,content,scraped_at,title,label
535508,7872230,nytimes.com,reliable,Proving that this decade really is the 1980s i...,2018-02-11 00:43:19.062365+00:00,Michael Jackson to Play ‘Final’ Concerts in Lo...,real
1739964,1652583,beforeitsnews.com,fake,% of readers think this story is Fact. Add you...,2017-11-10 11:18:44.524042+00:00,Clip time lapse thú vị của anh chàng cơ khí,fake
396202,7071895,nytimes.com,reliable,Following are excerpts from Vice President Al ...,2018-02-11 00:35:34.945186+00:00,Excerpts From '60 Minutes' Interview,real
1375639,5542923,beforeitsnews.com,fake,Jawbone UP Users: How Is Your End Cap Holding ...,2017-11-27 01:14:08.745400+00:00,Jawbone UP Users: How Is Your End Cap Holding Up?,fake
785612,7918506,nytimes.com,reliable,"Tom Williams, a wing, went off bleeding from t...",2018-02-11 00:43:44.145806+00:00,"With a Scandal, Rugby’s Image Is at Stake",real
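
The role of `random_state` in the split can be illustrated with a small, standard-library-only sketch. This is not scikit-learn's implementation: `shuffle_split` is a hypothetical helper, and the way it rounds the test-set size is an assumption made for the demo. It only shows that a fixed seed yields the same shuffled partition on every run.

```python
import random

def shuffle_split(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows with a fixed seed, then slice off the test portion.

    Hypothetical helper for illustration; not how train_test_split is implemented.
    """
    rng = random.Random(seed)      # private generator, so the global RNG state is untouched
    shuffled = list(rows)          # copy, so the caller's data keeps its order
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))  # rounding rule is an assumption
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))
train_a, test_a = shuffle_split(rows)
train_b, test_b = shuffle_split(rows)   # same seed, so the same partition

print(len(train_a), len(test_a))
print(train_a == train_b and test_a == test_b)
```

Every row lands in exactly one of the two sets, and rerunning with the same seed reproduces the partition, which is what makes the saved `train.csv` and `test.csv` regenerable.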


## #2 Remove Stop Words with gensim

In [3]:
from gensim.parsing.preprocessing import remove_stopwords
import gensim
gensim_stopwords = gensim.parsing.preprocessing.STOPWORDS

In [4]:
%%time
text_SW = []
for content in df_train['content']:  # iterate the column directly; much faster than iterrows()
    text = remove_stopwords(str(content).lower())  # narrowing scope to just content
    text_SW.append(text)
print('Sample content with stop words:\n\n%s\n\n' %df_train['content'].iloc[0])
print('Sample content without stop words:\n\n%s\n\n' %text_SW[0])

Sample content with stop words:

Proving that this decade really is the 1980s in quotation marks, Michael Jackson, right, announced on Thursday that he would perform a series of concerts in London this summer, in what he called a “final curtain call,” Reuters reported. Mr. Jackson, 50, revealed the details of the concerts at a news conference in London, where he said he would perform 10 shows at the O2 Arena, beginning July 8. “When I say this is it, I mean this is it,” Mr. Jackson said, according to Reuters. “I’ll be performing the songs my fans want to hear.” The shows would be Mr. Jackson’s first major performances since 2001 and 2002, when he appeared at a pair of 30th anniversary celebrations and two benefit concerts. His brief appearance at the World Music Awards in 2006 was booed by some audience members.


Sample content without stop words:

proving decade 1980s quotation marks, michael jackson, right, announced thursday perform series concerts london summer, called “final curt
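
The sample output above still has commas attached to words (`marks,`, `jackson,`), which suggests the filter works on whitespace-separated tokens. A minimal sketch of that idea follows. `DEMO_STOPWORDS` and `drop_stopwords` are hypothetical names, and the tiny stop-word set is an assumption for illustration; gensim's `STOPWORDS` is far larger.

```python
# Hypothetical, deliberately tiny stop-word set; gensim's STOPWORDS is much larger.
DEMO_STOPWORDS = frozenset(
    {'that', 'this', 'really', 'is', 'the', 'in', 'on', 'he', 'would', 'a', 'of'}
)

def drop_stopwords(text, stopwords=DEMO_STOPWORDS):
    """Lowercase, split on whitespace, and drop tokens found in stopwords."""
    return ' '.join(tok for tok in text.lower().split() if tok not in stopwords)

sample = ('Proving that this decade really is the 1980s in quotation marks, '
          'Michael Jackson, right, announced on Thursday')
cleaned = drop_stopwords(sample)
print(cleaned)

# A stop word glued to punctuation is a different token, so it survives the filter.
glued = drop_stopwords('It is, in fact, the end.')
print(glued)
```

Under this whitespace-matching assumption, `is,` is not the same token as `is`, so stripping punctuation before removing stop words would catch more of them.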

## #3 Remove Punctuation

In [26]:
%%time
punc = r'''!()-[]{};:'"\,<>./?@#$%^&*_~’“”'''  # raw string, so the backslash is a literal character
text_list = [str(content) for content in df_train['content']]  # content column as plain strings
strip_punc = str.maketrans('', '', punc)  # translation table that deletes every character in punc
text_P = [item.translate(strip_punc) for item in text_list]  # one pass over each document
    
print('Sample content with punctuation:\n\n%s\n\n' %df_train['content'].iloc[0])
print('Sample content without punctuation:\n\n%s\n\n' %text_P[0])

Sample content with punctuation:

Proving that this decade really is the 1980s in quotation marks, Michael Jackson, right, announced on Thursday that he would perform a series of concerts in London this summer, in what he called a “final curtain call,” Reuters reported. Mr. Jackson, 50, revealed the details of the concerts at a news conference in London, where he said he would perform 10 shows at the O2 Arena, beginning July 8. “When I say this is it, I mean this is it,” Mr. Jackson said, according to Reuters. “I’ll be performing the songs my fans want to hear.” The shows would be Mr. Jackson’s first major performances since 2001 and 2002, when he appeared at a pair of 30th anniversary celebrations and two benefit concerts. His brief appearance at the World Music Awards in 2006 was booed by some audience members.


Sample content without punctuation:

Proving that this decade really is the 1980s in quotation marks Michael Jackson right announced on Thursday that he would perform a seri
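
One reason to hand-write the punctuation list is that Python's built-in `string.punctuation` covers ASCII only, so it misses the curly quotes that appear in this news text. The sketch below compares the two sets. `NOTEBOOK_PUNC` is copied from the cell above, and `strip_chars` is a hypothetical helper introduced for the demo.

```python
import string

# The notebook's custom list, copied from the cell above (raw string keeps the backslash literal).
NOTEBOOK_PUNC = r'''!()-[]{};:'"\,<>./?@#$%^&*_~’“”'''

# Characters that appear in one list but not the other.
only_in_notebook = sorted(set(NOTEBOOK_PUNC) - set(string.punctuation))
only_in_stdlib = sorted(set(string.punctuation) - set(NOTEBOOK_PUNC))

def strip_chars(text, chars):
    """Delete every character in chars from text in a single pass."""
    return text.translate(str.maketrans('', '', chars))

sample = 'he called a “final curtain call,” Reuters reported.'
with_notebook_list = strip_chars(sample, NOTEBOOK_PUNC)
with_stdlib_list = strip_chars(sample, string.punctuation)

print(only_in_notebook)
print(only_in_stdlib)
print(with_notebook_list)
print(with_stdlib_list)
```

The custom list removes the typographic quotes, while the standard-library constant leaves them in place. In the other direction, the custom list does not cover `+`, `=`, the backtick, or `|`, so those characters remain in the cleaned text.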

## #4 Tokenization with spaCy

In [8]:
import spacy
from spacy.cli import download
print(download('en'))  # 'en' is a spaCy 2.x shortcut link; spaCy 3+ needs the full name, e.g. download('en_core_web_sm')

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/Users/intel/opt/anaconda3/lib/python3.7/site-packages/en_core_web_sm -->
/opt/anaconda3/lib/python3.7/site-packages/spacy/data/en
You can now load the model via spacy.load('en')
None


In [27]:
%%time
en_nlp = spacy.load('en_core_web_sm')  # load spaCy's small English pipeline
# Process only the first 5 documents. Each result is a Doc, which prints as its original text.
text_T = [en_nlp(item) for item in text_list[:5]]
    
print('Sample content before tokenization:\n\n%s\n\n' %df_train['content'].iloc[0])
print('Sample content after tokenization:\n\n%s\n\n' %text_T[0])

Sample content before tokenization:

Proving that this decade really is the 1980s in quotation marks, Michael Jackson, right, announced on Thursday that he would perform a series of concerts in London this summer, in what he called a “final curtain call,” Reuters reported. Mr. Jackson, 50, revealed the details of the concerts at a news conference in London, where he said he would perform 10 shows at the O2 Arena, beginning July 8. “When I say this is it, I mean this is it,” Mr. Jackson said, according to Reuters. “I’ll be performing the songs my fans want to hear.” The shows would be Mr. Jackson’s first major performances since 2001 and 2002, when he appeared at a pair of 30th anniversary celebrations and two benefit concerts. His brief appearance at the World Music Awards in 2006 was booed by some audience members.


Sample content after tokenization:

Proving that this decade really is the 1980s in quotation marks, Michael Jackson, right, announced on Thursday that he would perform a
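
The "after tokenization" sample looks identical to the input because printing a spaCy `Doc` renders its original text; the individual tokens only become visible when iterating over it. A minimal sketch, assuming spaCy is installed. It uses `spacy.blank('en')`, a tokenizer-only English pipeline, rather than the `en_core_web_sm` model loaded in the notebook, so that no model download is needed.

```python
import spacy

# Tokenizer-only English pipeline (an assumption for this demo; the notebook uses en_core_web_sm).
nlp = spacy.blank('en')

text = 'Michael Jackson, right, announced on Thursday'
doc = nlp(text)

# Iterating over a Doc yields Token objects; .text is the token's string.
tokens = [token.text for token in doc]

print(str(doc))   # the Doc itself prints as the original text
print(tokens)
```

Punctuation marks come out as separate tokens, which matches the `','` entries in the stem and lemma samples later in the notebook.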

## #5 Stemming with NLTK

In [28]:
%%time
import nltk
stemmer = nltk.stem.PorterStemmer() #instantiate nltk's porter stemmer

stems = []
first_doc = en_nlp(text_list[0])  # only the first document is needed for the demo
for ts in first_doc[:20]:  # first 20 tokens
    stems.append(stemmer.stem(ts.norm_.lower()))
    
print('Sample stems:\n\n%s\n\n' %stems[0:20])

Sample stems:

['prove', 'that', 'thi', 'decad', 'realli', 'is', 'the', '1980', 'in', 'quotat', 'mark', ',', 'michael', 'jackson', ',', 'right', ',', 'announc', 'on', 'thursday']


CPU times: user 26.3 ms, sys: 1.91 ms, total: 28.2 ms
Wall time: 27.8 ms
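
The stemmer can also be tried on plain strings, without the spaCy pipeline. A small sketch, assuming NLTK is installed (the Porter stemmer needs no extra corpus download). The word list is taken from the sample article's opening sentence.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Lowercased words from the sample article's first sentence.
words = ['proving', 'this', 'decade', 'really', 'quotation', 'announced']
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(f'{word:>10} -> {stem}')
```

Stems such as `thi` and `decad` are not dictionary words: the Porter algorithm strips suffixes by rule, without consulting a vocabulary.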


## #6 Lemmatization with spaCy

In [29]:
%%time
lemmas = []
first_doc = en_nlp(text_list[0])  # only the first document is needed for the demo
for ts in first_doc[:20]:  # first 20 tokens
    lemmas.append(ts.lemma_)
        
print('Sample lemmas:\n\n%s\n\n' %lemmas[0:20])

Sample lemmas:

['prove', 'that', 'this', 'decade', 'really', 'be', 'the', '1980s', 'in', 'quotation', 'mark', ',', 'Michael', 'Jackson', ',', 'right', ',', 'announce', 'on', 'Thursday']


CPU times: user 23.3 ms, sys: 1.45 ms, total: 24.7 ms
Wall time: 23.6 ms
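
To see where the two techniques diverge, the sample outputs from the stemming and lemmatization cells can be aligned token by token. The sketch below copies both lists verbatim from the outputs above and compares them case-insensitively, since the stemming cell lowercased its input.

```python
# Sample outputs copied from the stemming and lemmatization cells above.
stems = ['prove', 'that', 'thi', 'decad', 'realli', 'is', 'the', '1980', 'in',
         'quotat', 'mark', ',', 'michael', 'jackson', ',', 'right', ',',
         'announc', 'on', 'thursday']
lemmas = ['prove', 'that', 'this', 'decade', 'really', 'be', 'the', '1980s', 'in',
          'quotation', 'mark', ',', 'Michael', 'Jackson', ',', 'right', ',',
          'announce', 'on', 'Thursday']

# Keep only positions where the stem and the lowercased lemma disagree.
diffs = [(s, l) for s, l in zip(stems, lemmas) if s != l.lower()]

for stem, lemma in diffs:
    print(f'{stem:>8}  vs  {lemma}')
```

In this sample, every lemma is a real word and proper nouns keep their case, while several stems are truncated forms (`thi`, `decad`, `quotat`). The pair `is` / `be` shows the lemmatizer mapping an inflected verb to its base form, something a suffix-stripping stemmer cannot do.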
