## Challenge: Build your own NLP model

For this challenge, you will need to choose a corpus of data from nltk or another source that includes categories you can predict and create an analysis pipeline that includes the following steps:

1. Data cleaning / processing / language parsing
2. Create features using two different NLP methods, for example BoW and tf-idf.
3. Use the features to fit supervised learning models for each feature set to predict the category outcomes.
4. Assess your models using cross-validation and determine whether one model performed better.
5. Pick one of the models and try to increase accuracy by at least 5 percentage points.

Write up your report in a Jupyter notebook. Be sure to explicitly justify the choices you make throughout, and submit it below.
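
As a rough illustration of step 2, the sketch below builds BoW and tf-idf features for the same toy corpus with scikit-learn. The sentences are made up for the example and are not taken from the dataset used later; the variable names are likewise only illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical sentences, not drawn from the news dataset)
docs = [
    "stocks fall as markets react to tariffs",
    "team wins the game in overtime",
    "markets rally after earnings report",
    "coach praises the team after the game",
]

bow = CountVectorizer(stop_words="english")
tfidf = TfidfVectorizer(stop_words="english")

X_bow = bow.fit_transform(docs)      # raw term counts per document
X_tfidf = tfidf.fit_transform(docs)  # counts reweighted by inverse document frequency

# Both vectorizers learn the same vocabulary here, so the matrices share a shape
print(X_bow.shape, X_tfidf.shape)

# Compare how one word is represented in the first document
j = bow.vocabulary_["markets"]
print("BoW count:", X_bow[0, j])
print("tf-idf weight:", round(X_tfidf[0, tfidf.vocabulary_["markets"]], 3))
```

Either matrix can be passed to the same classifier, which makes a side-by-side comparison of the two feature sets straightforward.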

In [1]:
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
import spacy
import time
%matplotlib inline

In [2]:
news_raw = pd.read_json('News_Category_Dataset.json', lines=True)
news_raw.head(10)

| | authors | category | date | headline | link | short_description |
|---|---|---|---|---|---|---|
| 0 | Melissa Jeltsen | CRIME | 2018-05-26 | There Were 2 Mass Shootings In Texas Last Week... | https://www.huffingtonpost.com/entry/texas-ama... | She left her husband. He killed their children... |
| 1 | Andy McDonald | ENTERTAINMENT | 2018-05-26 | Will Smith Joins Diplo And Nicky Jam For The 2... | https://www.huffingtonpost.com/entry/will-smit... | Of course it has a song. |
| 2 | Ron Dicker | ENTERTAINMENT | 2018-05-26 | Hugh Grant Marries For The First Time At Age 57 | https://www.huffingtonpost.com/entry/hugh-gran... | The actor and his longtime girlfriend Anna Ebe... |
| 3 | Ron Dicker | ENTERTAINMENT | 2018-05-26 | Jim Carrey Blasts 'Castrato' Adam Schiff And D... | https://www.huffingtonpost.com/entry/jim-carre... | The actor gives Dems an ass-kicking for not fi... |
| 4 | Ron Dicker | ENTERTAINMENT | 2018-05-26 | Julianna Margulies Uses Donald Trump Poop Bags... | https://www.huffingtonpost.com/entry/julianna-... | The "Dietland" actress said using the bags is ... |
| 5 | Ron Dicker | ENTERTAINMENT | 2018-05-26 | Morgan Freeman 'Devastated' That Sexual Harass... | https://www.huffingtonpost.com/entry/morgan-fr... | "It is not right to equate horrific incidents ... |
| 6 | Ron Dicker | ENTERTAINMENT | 2018-05-26 | Donald Trump Is Lovin' New McDonald's Jingle I... | https://www.huffingtonpost.com/entry/donald-tr... | It's catchy, all right. |
| 7 | Todd Van Luling | ENTERTAINMENT | 2018-05-26 | What To Watch On Amazon Prime That’s New This ... | https://www.huffingtonpost.com/entry/amazon-pr... | There's a great mini-series joining this week. |
| 8 | Andy McDonald | ENTERTAINMENT | 2018-05-26 | Mike Myers Reveals He'd 'Like To' Do A Fourth ... | https://www.huffingtonpost.com/entry/mike-myer... | Myer's kids may be pushing for a new "Powers" ... |
| 9 | Todd Van Luling | ENTERTAINMENT | 2018-05-26 | What To Watch On Hulu That’s New This Week | https://www.huffingtonpost.com/entry/hulu-what... | You're getting a recent Academy Award-winning ... |


In [3]:
#separate business and sports
business_df = news_raw.loc[news_raw['category']=='BUSINESS']
sports_df = news_raw.loc[news_raw['category']=='SPORTS']

#get text data
business_head = business_df['headline'].tolist()
business_desc = business_df['short_description'].tolist()
business_raw = [a + ' ' + b for a, b in zip(business_head, business_desc)]

sports_head = sports_df['headline'].tolist()
sports_desc = sports_df['short_description'].tolist()
sports_raw = [a + ' ' + b for a, b in zip(sports_head, sports_desc)]

#make docs from lists of strings
business = ' '.join(business_raw)
sports = ' '.join(sports_raw)

In [4]:
#clean text for spacy
def text_cleaner(text):
    text = re.sub(r'--', ' ', text)      #replace double dashes with a space
    text = re.sub(r'\[.*?\]', '', text)  #drop any text enclosed in square brackets
    text = ' '.join(text.split())        #collapse runs of whitespace
    return text

business_clean = text_cleaner(business)
sports_clean = text_cleaner(sports)
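
To make the effect of the cleaning step concrete, here is a small standalone check of the same three substitutions on a made-up string. The sample text is hypothetical and only chosen to trigger each rule.

```python
import re

def text_cleaner(text):
    text = re.sub(r'--', ' ', text)      # double dashes become a space
    text = re.sub(r'\[.*?\]', '', text)  # bracketed text is removed
    text = ' '.join(text.split())        # whitespace runs collapse to one space
    return text

# Hypothetical sample, not taken from the dataset
sample = "Markets fell--again [VIDEO]   on   Friday"
cleaned = text_cleaner(sample)
print(cleaned)  # Markets fell again on Friday
```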

In [5]:
sports_clean[:100]

"Jets Chairman Christopher Johnson Won't Fine Players For Anthem Protests “I never want to put restri"

In [6]:
#parse cleaned text
#the 'en' shortcut works on spaCy 2.x; spaCy 3+ needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')
business_doc = nlp(business_clean)
sports_doc = nlp(sports_clean)

In [7]:
print(len(business_doc))
print(len(sports_doc))

146683
105483


## 1. BoW

In [8]:
#group into sentences
business_sents = [[sent, 'business'] for sent in business_doc.sents]
sports_sents = [[sent, 'sports'] for sent in sports_doc.sents]

print(len(business_sents))
print(len(sports_sents))

10731
10462


In [9]:
bus_sample = business_sents[:1500]
sports_sample = sports_sents[:1500]

In [10]:
#combine sentences into single df
sentences = pd.DataFrame(bus_sample + sports_sample)
sentences.shape

(3000, 2)

In [11]:
sentences.head()

| | 0 | 1 |
|---|---|---|
| 0 | (U.S., Launches, Auto, Import, Probe, ,, China... | business |
| 1 | (To, Defend, Its, Interests) | business |
| 2 | (The, investigation, could, lead, to, new, U.S... | business |
| 3 | (Starbucks, Says, Anyone, Can, Now, Sit, In, I... | business |
| 4 | (Even, Without, Buying, Anything, The, new, po... | business |


In [12]:
#BoW, exclude stopwords & punctuation, use lemmas, 2000 most common words
from collections import Counter
def bag_of_words(text):
    
    #filter punct and stopwords
    allwords = [token.lemma_ for token in text 
                if not token.is_punct and not token.is_stop]
    
    #return most common words
    return [item[0] for item in Counter(allwords).most_common(2000)]

#creates a dataframe with features for each word in common word set
#values are count of times word appears in each sentence
def bow_features(sentences, common_words):
    
    #fix a column order; recent pandas versions reject sets as columns/indexers
    columns = list(common_words)
    
    #set df and initialize counts
    df = pd.DataFrame(columns=columns)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, columns] = 0
    
    #process each row, counting word occurrences
    for i, sentence in enumerate(df['text_sentence']):
        
        #convert sentence to lemmas & filter punct, stops, & uncommon words
        words = [token.lemma_ for token in sentence
                 if (not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words)]
        
        #populate row with word counts
        for word in words:
            df.loc[i, word] += 1
        
        #progress output to make sure the kernel isn't hanging
        #time.clock() was removed in Python 3.8; perf_counter() replaces it
        if i % 250 == 0:
            print('processing row {}'.format(i))
            print(time.perf_counter())
    
    return df

#set up bags
businesswords = bag_of_words(business_doc)
sportswords = bag_of_words(sports_doc)

#combine bags to create set of unique words
common_words = set(businesswords + sportswords)
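
Updating a DataFrame one cell at a time with `df.loc[i, word] += 1` tends to be slow. A possible alternative, sketched below under the assumption that each sentence has already been reduced to a list of filtered lemmas, is to count into a NumPy array and build the DataFrame once at the end. The function name `count_matrix` and the toy inputs are hypothetical and not part of the original notebook.

```python
from collections import Counter
import numpy as np

def count_matrix(token_lists, vocabulary):
    """Return an (n_sentences, n_words) count array and its column order."""
    columns = sorted(vocabulary)                        # fixed column order
    index = {word: j for j, word in enumerate(columns)}
    counts = np.zeros((len(token_lists), len(columns)), dtype=int)
    for i, tokens in enumerate(token_lists):
        for word, n in Counter(tokens).items():
            j = index.get(word)
            if j is not None:                           # skip words outside the vocabulary
                counts[i, j] = n
    return counts, columns

# Toy input: lists of lemmas (assumed already filtered for stopwords/punctuation)
toy_sentences = [["market", "fall", "market"], ["team", "win"], ["fan", "cheer"]]
toy_vocab = {"market", "team", "win", "fall"}

counts, columns = count_matrix(toy_sentences, toy_vocab)
print(columns)
print(counts)
```

The resulting array could then be wrapped with `pd.DataFrame(counts, columns=columns)` and the sentence and label columns added afterwards.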

In [13]:
#word_counts = bow_features(sentences, common_words)
#word_counts.to_csv('bow_features_business_sports', index=False)
word_counts = pd.read_csv('bow_features_business_sports')
word_counts.shape

(3000, 2959)

In [14]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split

X = np.array(word_counts.drop(['text_sentence', 'text_source'], axis=1))
y = word_counts['text_source']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

rfc = ensemble.RandomForestClassifier().fit(X_train, y_train)
print('train score: {}'.format(rfc.score(X_train, y_train)))
print('test score: {}'.format(rfc.score(X_test, y_test)))

train score: 0.9683333333333334
test score: 0.7416666666666667


In [15]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression().fit(X_train, y_train)
print('train score: {}'.format(lr.score(X_train, y_train)))
print('test score: {}'.format(lr.score(X_test, y_test)))

train score: 0.9405555555555556
test score: 0.7908333333333334
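
The scores above come from a single random train/test split, so they will shift from run to run. Step 4 of the challenge asks for cross-validation; the sketch below shows one way to compare two models with `cross_val_score`. It uses a synthetic count matrix as a stand-in for `word_counts` (an assumption made so the example runs on its own), and the variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the word-count features (not the real data)
rng = np.random.default_rng(0)
n_per_class, n_words = 100, 20
X_a = rng.poisson(lam=np.linspace(0.2, 1.0, n_words), size=(n_per_class, n_words))
X_b = rng.poisson(lam=np.linspace(1.0, 0.2, n_words), size=(n_per_class, n_words))
X_demo = np.vstack([X_a, X_b])
y_demo = np.array(['business'] * n_per_class + ['sports'] * n_per_class)

# Stratified folds keep the class balance the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    'random forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'logistic regression': LogisticRegression(max_iter=1000),
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    cv_scores[name] = scores
    print('{}: {:.3f} +/- {:.3f}'.format(name, scores.mean(), scores.std()))
```

Reporting the mean and spread across folds gives a steadier basis for deciding whether one model really performs better than another.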


In [16]:
#time.clock() was removed in Python 3.8; perf_counter() is its replacement
start_time = time.perf_counter()
gbc = ensemble.GradientBoostingClassifier().fit(X_train, y_train)
print('train score: {}'.format(gbc.score(X_train, y_train)))
print('test score: {}'.format(gbc.score(X_test, y_test)))
print('{} seconds'.format(time.perf_counter() - start_time))

train score: 0.7838888888888889
test score: 0.7333333333333333
14.314459 seconds


In [17]:
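import xgboost as xgb
start_time = time.perf_counter()
xgbc = xgb.XGBClassifier().fit(X_train, y_train)
#score the XGBoost model (xgbc), not gbc
#note: the output recorded below came from scoring gbc, so it repeats the gradient boosting numbers
print('train score: {}'.format(xgbc.score(X_train, y_train)))
print('test score: {}'.format(xgbc.score(X_test, y_test)))
print('{} seconds'.format(time.perf_counter() - start_time))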

train score: 0.7838888888888889
test score: 0.7333333333333333
7.013244999999998 seconds


In [28]:
# subsample of data was pretty small given the original size
# try a larger sample and see if models improve
bus_sample_3000 = business_sents[:3000]
sports_sample_3000 = sports_sents[:3000]
sentences = pd.DataFrame(bus_sample_3000 + sports_sample_3000)
sentences.shape

(6000, 2)

In [29]:
word_counts = bow_features(sentences, common_words)
word_counts.to_csv('bow_features_6000_business_sports', index=False)

processing row 0
113.660745
processing row 50
369.840926
processing row 100
602.289914
processing row 150
813.694071
processing row 200
1055.261497
processing row 250
1279.636973
processing row 300
1481.472321
processing row 350
1717.36095
processing row 400
1965.289835
processing row 450
2210.596457
processing row 500
2453.024662
processing row 550
2679.848909
processing row 600
2939.132854
processing row 650
3161.681253
processing row 700
3406.50285
processing row 750
3605.608094
processing row 800
3840.61901
processing row 850
4093.627801
processing row 900
4386.460868
processing row 950
4638.441826
processing row 1000
4903.51757
processing row 1050
5174.622038
processing row 1100
5388.965507
processing row 1150
5576.217431
processing row 1200
5837.319115
processing row 1250
6084.650505
processing row 1300
6311.531316
processing row 1350
6534.507609
processing row 1400
6777.303552
processing row 1450
6971.18641
processing row 1500
7201.50416
processing row 1550
7409.436992
processin

In [30]:
X = np.array(word_counts.drop(['text_sentence', 'text_source'], axis=1))
y = word_counts['text_source']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

rfc = ensemble.RandomForestClassifier().fit(X_train, y_train)
print('train score: {}'.format(rfc.score(X_train, y_train)))
print('test score: {}'.format(rfc.score(X_test, y_test)))

train score: 0.9713888888888889
test score: 0.7479166666666667


In [31]:
lr = LogisticRegression().fit(X_train, y_train)
print('train score: {}'.format(lr.score(X_train, y_train)))
print('test score: {}'.format(lr.score(X_test, y_test)))

train score: 0.9277777777777778
test score: 0.8091666666666667


In [32]:
start_time = time.perf_counter()
gbc = ensemble.GradientBoostingClassifier().fit(X_train, y_train)
print('train score: {}'.format(gbc.score(X_train, y_train)))
print('test score: {}'.format(gbc.score(X_test, y_test)))
print('{} seconds'.format(time.perf_counter() - start_time))

train score: 0.7452777777777778
test score: 0.69625
46.18460899999991 seconds


In [33]:
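start_time = time.perf_counter()
xgbc = xgb.XGBClassifier().fit(X_train, y_train)
#score the XGBoost model (xgbc), not gbc
#note: the output recorded below came from scoring gbc, so it repeats the gradient boosting numbers
print('train score: {}'.format(xgbc.score(X_train, y_train)))
print('test score: {}'.format(xgbc.score(X_test, y_test)))
print('{} seconds'.format(time.perf_counter() - start_time))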

train score: 0.7452777777777778
test score: 0.69625
15.65634399999908 seconds


## 2. word2vec

In [18]:
sentences_bus = []
for sentence in business_doc.sents:
    sentence = [token.lemma_.lower()
                for token in sentence
                if not token.is_stop
                and not token.is_punct]
    sentences_bus.append(sentence)
    
sentences_sports = []
for sentence in sports_doc.sents:
    sentence = [token.lemma_.lower()
                for token in sentence
                if not token.is_stop
                and not token.is_punct]
    sentences_sports.append(sentence)

In [19]:
import gensim
from gensim.models import word2vec

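start_time = time.perf_counter()
w2v = word2vec.Word2Vec(sentences_bus,
                        workers=2,
                        min_count=10,  #ignore words seen fewer than 10 times
                        window=6,
                        sg=0,          #CBOW rather than skip-gram
                        sample=1e-3,
                        size=300,      #vector length; named vector_size in gensim 4+
                        hs=1)          #hierarchical softmax
print('runtime: {} seconds'.format(time.perf_counter() - start_time))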

runtime: 1.373656000000011 seconds


In [20]:
#gensim 3.x API; in gensim 4+ the vocabulary is exposed as w2v.wv.key_to_index
vocab = w2v.wv.vocab.keys()

#print(vocab)

In [21]:
print(w2v.wv.most_similar(positive=['investigation', 'trump'], negative=['facebook']))

[('gop', 0.9502933025360107), ('buffett', 0.9450339078903198), ('december', 0.9429740905761719), ('world', 0.9420689940452576), ('responsibility', 0.9412590265274048), ('vote', 0.9407662749290466), ('seattle', 0.9399857521057129), ('global', 0.9397921562194824), ('elizabeth', 0.9389846920967102), ('name', 0.9381713271141052)]




In [22]:
print(w2v.wv.similarity('investigation', 'probe'))

0.9844992
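
The `similarity` score is the cosine similarity between the two word vectors. The short sketch below computes the same quantity with NumPy on made-up three-dimensional vectors, which are placeholders rather than values from the trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical vectors for illustration only
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(sim_ab, sim_ac)
```

Because the measure depends only on direction, a score near 1 means the two vectors point almost the same way. When nearly every pair of words scores that high, it may be a sign that the corpus is too small for the embeddings to separate meanings well.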




In [23]:
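start_time = time.perf_counter()
w2v = word2vec.Word2Vec(sentences_sports,
                        workers=2,
                        min_count=10,  #ignore words seen fewer than 10 times
                        window=6,
                        sg=0,          #CBOW rather than skip-gram
                        sample=1e-3,
                        size=300,      #vector length; named vector_size in gensim 4+
                        hs=1)          #hierarchical softmax
print('runtime: {} seconds'.format(time.perf_counter() - start_time))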

runtime: 0.9873839999999916 seconds


In [24]:
#gensim 3.x API; in gensim 4+ the vocabulary is exposed as w2v.wv.key_to_index
vocab = w2v.wv.vocab.keys()

#print(vocab)

In [25]:
print(w2v.wv.most_similar(positive=['anthem', 'protests'], negative=['kaepernick']))
print(w2v.wv.similarity('anthem', 'protests'))

[('really', 0.9863007068634033), ('think', 0.9798712134361267), ('friend', 0.9792352318763733), ('nothing', 0.976024866104126), ('penn', 0.9752417206764221), ('witness', 0.9738208055496216), ('draft', 0.9735362529754639), ('grand', 0.9729882478713989), ('tweet', 0.9694628715515137), ('idea', 0.9687024354934692)]
0.93453765
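
The word2vec models above are explored through similarity queries but are not yet turned into features for a classifier. One common approach, sketched here as an assumption rather than as the notebook's method, is to average the vectors of the words in each sentence. A plain dictionary of made-up vectors stands in for the trained embeddings; the function name and toy values are hypothetical.

```python
import numpy as np

def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Hypothetical 3-dimensional embeddings (placeholders, not trained values)
toy_embeddings = {
    'anthem': np.array([1.0, 0.0, 1.0]),
    'protest': np.array([0.0, 2.0, 1.0]),
}

features = np.vstack([
    sentence_vector(['anthem', 'protest', 'unknown'], toy_embeddings, dim=3),
    sentence_vector(['unknown'], toy_embeddings, dim=3),
])
print(features)
```

For this to produce comparable features across categories, a single word2vec model would probably need to be trained on the business and sports sentences together, rather than one model per category as above.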


