# Dataframe Cleaning & Feature Extraction

The <b> purpose </b>of this notebook is to merge and clean dataframes - all of theses steps will assist in feeding keywords into the Twitter API and the Gephi platform. 

## Libraries

In [59]:
import pandas as pd
import numpy as np
from pprint import pprint
import spacy
import json
import os
import matplotlib.pyplot as plt
import matplotlib
import time
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer
import gensim
from gensim import models
import warnings
warnings.filterwarnings('ignore')
import nltk
nltk.download('wordnet')
import pyLDAvis.gensim
import pickle

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/celinasprague/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Loading Exported Dataframes

Creating a "helper" function to do light cleaning so we can apply it quickly on multiple dataframes.

In [32]:
def initial_clean(data):
    
    "Light cleaning on raw data by dropping unnamed column, creating identifyer column, and re-ordering columns"
    
    data = data.drop(columns=['Unnamed: 0'])  # Drop the first column
 
    return (data)

### Datasets

Pulling in all compiled datasets as csv's. We'll then compile them into dataframes later on. For now we're pulling in the csv files and setting them to variables.

In [29]:
popular1 = pd.read_csv('popular1.csv', dtype=str)
popular2 = pd.read_csv('popular2.csv', dtype=str)

In [30]:
popular1.head()

Unnamed: 0.1,Unnamed: 0,author,crawled,entities_locations,entities_organizations,entities_persons,external_links,highlightText,highlightTitle,language,...,thread_social_stumbledupon_shares,thread_social_vk_shares,thread_spam_score,thread_title,thread_title_full,thread_url,thread_uuid,title,url,uuid
0,0,USNews,2015-10-02T17:33:59.981+03:00,,,,[['http://www.reddit.com/submit?url=http%3A%2F...,,,english,...,0,0,0.0,The Healthiest Pastas: From Quinoa to Buckwhea...,The Healthiest Pastas: From Quinoa to Buckwhea...,http://health.usnews.com/health-news/health-we...,8085f289866a814f7a443e1a31e48f8a307a040f,The Healthiest Pastas: From Quinoa to Buckwhea...,http://health.usnews.com/health-news/health-we...,8085f289866a814f7a443e1a31e48f8a307a040f
1,1,,2015-10-19T09:23:00.540+03:00,,,,,,,english,...,0,0,0.0,Photos: Operation Santa Claus visits Savoonga,Photos: Operation Santa Claus visits Savoonga,http://www.newsdump.com/article/photos-operati...,f4ad43deab0a72726d6165b37a971c578efdd4f5,Photos: Operation Santa Claus visits Savoonga,http://www.newsdump.com/article/photos-operati...,f4ad43deab0a72726d6165b37a971c578efdd4f5
2,2,,2015-10-08T17:42:28.717+03:00,,,,,,,english,...,0,0,0.0,"Watch: Video Shows 2,000-Year-Old Ancient Arch...","Watch: Video Shows 2,000-Year-Old Ancient Arch...",http://www.newsdump.com/article/watch-video-sh...,c98cbd870f52950ff685e772fd189bd01fc85767,"Watch: Video Shows 2,000-Year-Old Ancient Arch...",http://www.newsdump.com/article/watch-video-sh...,c98cbd870f52950ff685e772fd189bd01fc85767
3,3,,2015-10-05T10:10:00.218+03:00,,,,,,,english,...,0,0,0.0,'Fear the Walking Dead' ends Season 1 on a gri...,'Fear the Walking Dead' ends Season 1 on a gri...,http://www.newsdump.com/article/fear-the-walki...,3481ad311613e0da31e6017f854c7ded093b398a,'Fear the Walking Dead' ends Season 1 on a gri...,http://www.newsdump.com/article/fear-the-walki...,3481ad311613e0da31e6017f854c7ded093b398a
4,4,,2015-10-23T15:40:06.454+03:00,,,,,,,english,...,0,0,0.0,Facebook app draining your iPhone battery? Com...,Facebook app draining your iPhone battery? Com...,http://www.newsdump.com/article/facebook-app-d...,17954912c005732967b28ef81b4ebc58d3911efc,Facebook app draining your iPhone battery? Com...,http://www.newsdump.com/article/facebook-app-d...,17954912c005732967b28ef81b4ebc58d3911efc


### Dataframes

We run the helper function to every variable from the <b> dataset</b> section above and then we'll join them all together as one dataframe. 

In [33]:
popular1_df = initial_clean(popular1)
popular2_df = initial_clean(popular2)

## Compiling Dataframes

In [40]:
data = popular1_df.append(popular2_df, sort=False)

In [41]:
data.head()

Unnamed: 0,author,crawled,entities_locations,entities_organizations,entities_persons,external_links,highlightText,highlightTitle,language,locations,...,thread_social_vk_shares,thread_spam_score,thread_title,thread_title_full,thread_url,thread_uuid,title,url,uuid,thread_domain_rank
0,USNews,2015-10-02T17:33:59.981+03:00,,,,[['http://www.reddit.com/submit?url=http%3A%2F...,,,english,,...,0,0.0,The Healthiest Pastas: From Quinoa to Buckwhea...,The Healthiest Pastas: From Quinoa to Buckwhea...,http://health.usnews.com/health-news/health-we...,8085f289866a814f7a443e1a31e48f8a307a040f,The Healthiest Pastas: From Quinoa to Buckwhea...,http://health.usnews.com/health-news/health-we...,8085f289866a814f7a443e1a31e48f8a307a040f,
1,,2015-10-19T09:23:00.540+03:00,,,,,,,english,['Savoonga'],...,0,0.0,Photos: Operation Santa Claus visits Savoonga,Photos: Operation Santa Claus visits Savoonga,http://www.newsdump.com/article/photos-operati...,f4ad43deab0a72726d6165b37a971c578efdd4f5,Photos: Operation Santa Claus visits Savoonga,http://www.newsdump.com/article/photos-operati...,f4ad43deab0a72726d6165b37a971c578efdd4f5,
2,,2015-10-08T17:42:28.717+03:00,,,,,,,english,['Palmyra'],...,0,0.0,"Watch: Video Shows 2,000-Year-Old Ancient Arch...","Watch: Video Shows 2,000-Year-Old Ancient Arch...",http://www.newsdump.com/article/watch-video-sh...,c98cbd870f52950ff685e772fd189bd01fc85767,"Watch: Video Shows 2,000-Year-Old Ancient Arch...",http://www.newsdump.com/article/watch-video-sh...,c98cbd870f52950ff685e772fd189bd01fc85767,
3,,2015-10-05T10:10:00.218+03:00,,,,,,,english,,...,0,0.0,'Fear the Walking Dead' ends Season 1 on a gri...,'Fear the Walking Dead' ends Season 1 on a gri...,http://www.newsdump.com/article/fear-the-walki...,3481ad311613e0da31e6017f854c7ded093b398a,'Fear the Walking Dead' ends Season 1 on a gri...,http://www.newsdump.com/article/fear-the-walki...,3481ad311613e0da31e6017f854c7ded093b398a,
4,,2015-10-23T15:40:06.454+03:00,,,,,,,english,,...,0,0.0,Facebook app draining your iPhone battery? Com...,Facebook app draining your iPhone battery? Com...,http://www.newsdump.com/article/facebook-app-d...,17954912c005732967b28ef81b4ebc58d3911efc,Facebook app draining your iPhone battery? Com...,http://www.newsdump.com/article/facebook-app-d...,17954912c005732967b28ef81b4ebc58d3911efc,


Removing columns with the exact same values because they are unneeded.

In [43]:
for col in data.columns:
    if len(data[col].unique()) == 1:
        data.drop(col,inplace = True,axis = 1)

In [44]:
data.head()

Unnamed: 0,author,crawled,entities_locations,entities_organizations,entities_persons,external_links,highlightText,highlightTitle,language,locations,...,thread_social_vk_shares,thread_spam_score,thread_title,thread_title_full,thread_url,thread_uuid,title,url,uuid,thread_domain_rank
0,USNews,2015-10-02T17:33:59.981+03:00,,,,[['http://www.reddit.com/submit?url=http%3A%2F...,,,english,,...,0,0.0,The Healthiest Pastas: From Quinoa to Buckwhea...,The Healthiest Pastas: From Quinoa to Buckwhea...,http://health.usnews.com/health-news/health-we...,8085f289866a814f7a443e1a31e48f8a307a040f,The Healthiest Pastas: From Quinoa to Buckwhea...,http://health.usnews.com/health-news/health-we...,8085f289866a814f7a443e1a31e48f8a307a040f,
1,,2015-10-19T09:23:00.540+03:00,,,,,,,english,['Savoonga'],...,0,0.0,Photos: Operation Santa Claus visits Savoonga,Photos: Operation Santa Claus visits Savoonga,http://www.newsdump.com/article/photos-operati...,f4ad43deab0a72726d6165b37a971c578efdd4f5,Photos: Operation Santa Claus visits Savoonga,http://www.newsdump.com/article/photos-operati...,f4ad43deab0a72726d6165b37a971c578efdd4f5,
2,,2015-10-08T17:42:28.717+03:00,,,,,,,english,['Palmyra'],...,0,0.0,"Watch: Video Shows 2,000-Year-Old Ancient Arch...","Watch: Video Shows 2,000-Year-Old Ancient Arch...",http://www.newsdump.com/article/watch-video-sh...,c98cbd870f52950ff685e772fd189bd01fc85767,"Watch: Video Shows 2,000-Year-Old Ancient Arch...",http://www.newsdump.com/article/watch-video-sh...,c98cbd870f52950ff685e772fd189bd01fc85767,
3,,2015-10-05T10:10:00.218+03:00,,,,,,,english,,...,0,0.0,'Fear the Walking Dead' ends Season 1 on a gri...,'Fear the Walking Dead' ends Season 1 on a gri...,http://www.newsdump.com/article/fear-the-walki...,3481ad311613e0da31e6017f854c7ded093b398a,'Fear the Walking Dead' ends Season 1 on a gri...,http://www.newsdump.com/article/fear-the-walki...,3481ad311613e0da31e6017f854c7ded093b398a,
4,,2015-10-23T15:40:06.454+03:00,,,,,,,english,,...,0,0.0,Facebook app draining your iPhone battery? Com...,Facebook app draining your iPhone battery? Com...,http://www.newsdump.com/article/facebook-app-d...,17954912c005732967b28ef81b4ebc58d3911efc,Facebook app draining your iPhone battery? Com...,http://www.newsdump.com/article/facebook-app-d...,17954912c005732967b28ef81b4ebc58d3911efc,


In [46]:
data.to_csv('finaldata.csv', sep = ',')

## NLP Work

Now we have a compiled dataset, we have the choice of running processes on the full set or just subsets. Either way, the first step, however, will be to deal with null values.

In [2]:
data = pd.read_csv('finaldata.csv', dtype = str)

In [3]:
data = data.drop(columns=['Unnamed: 0'])

In [4]:
data_colnan=data.columns[data.isnull().any()]
data[data_colnan].isnull().sum()

author                                76379
entities_locations                   109139
entities_organizations               105393
entities_persons                     104007
external_links                       157547
highlightText                        170880
highlightTitle                       170880
locations                            153104
organizations                        145141
persons                              153170
text                                    361
thread_country                          657
thread_main_image                     53661
thread_participants_count              2433
thread_performance_score               2433
thread_published                       2433
thread_replies_count                   2433
thread_section_title                  27551
thread_site                            2433
thread_site_full                       2433
thread_site_section                   25529
thread_site_type                       4866
thread_social_facebook_comments 

In [5]:
data['title'] = data['title'].fillna("none")
data['text'] = data['text'].fillna("none")

### Running NLP on Text

In [6]:
count_vec = CountVectorizer()
X_train_count = count_vec.fit_transform(data.text)

In [7]:
print(count_vec.get_feature_names()[::1000])

['00', '0804', '10645598', '1099s', '12mph', '1553', '184bn', '1r4btxpxgq', '202h', '240000', '276248', '2gfor', '2njfq6i', '300158751', '300µg', '35466a848e51c1dbd36ecd', '3x_promotedstory', '45615', '4khz', '5547332', '5972', '64ymqvimj5', '704games', '790', '840x1', '900712003609', '9991pl3pbt', 'aaadhaar', 'abetted', 'acclimatise', 'adamant', 'adoringly', 'africans', 'ahm', 'akenzua', 'alegría', 'allisonpr', 'amaki', 'amongst', 'andreholland', 'annastazia', 'anukampa', 'appreciators', 'aregbesola', 'arriving', 'ashfordcastle', 'astrophes', 'atzimba', 'autója', 'ayeshashroff', 'babewatch', 'bahrun', 'bambolim', 'bargara', 'batane', 'beachland', 'behavioural', 'benefitting', 'besi', 'bhatte', 'bilirakis', 'bishphool', 'bleak', 'boafo', 'bona', 'bosintang', 'brabin', 'breira', 'brobby', 'bt14', 'bunchy', 'buttersafe', 'c94d46967414d063327750', 'callthemidwife', 'canonsburg', 'carla', 'castings', 'cchhhh', 'ceridono', 'chandaben', 'chautha', 'chiaraferragni', 'chofu', 'chuvanna', 'civi

In [12]:
lemtzer = WordNetLemmatizer()

def lemmatize_stemming(text):
    return lemtzer.lemmatize(text, pos='v')

# Write a function to perform the pre processing steps on the entire dataset
def preprocess(text):
    result=[]
    for token in simple_preprocess(text) :
        if token not in STOPWORDS:
            result.append(lemmatize_stemming(token))
            
    return result

### WARNING! 
The code below takes some time to run. 

In [16]:
processed_docs  = []

for doc in data.text:
    processed_docs.append(preprocess(doc))

Using 'Pickle' to save progress. 

In [60]:
file_1 = "processed_docs_text"

fileObject_1 = open(file_1,'wb') 

pickle.dump(processed_docs,fileObject_1)   

fileObject_1.close()

In [49]:
dictionary = gensim.corpora.Dictionary(processed_docs)

Setting parameters for our tokens - removing words occuring more or less than criteria and retaining most frequent tokens based on a threshold. 

In [50]:
dictionary.filter_extremes(no_below=15, no_above=0.2, keep_n=50000)

Using 'Pickle' to save progress. 

In [62]:
file_2 = "dictionary_text"

fileObject_2 = open(file_2,'wb') 

pickle.dump(dictionary,fileObject_2)   

fileObject_2.close()

We're formatting for "out-of-bag."

In [51]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

Using 'Pickle' to save progress. 

In [63]:
file_3 = "bowcorpus_text"

fileObject_3 = open(file_3,'wb') 

pickle.dump(bow_corpus,fileObject_3)   

fileObject_3.close()

In [54]:
%%time
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2)
lda_model.save('lda.model')

CPU times: user 4min 30s, sys: 50 s, total: 5min 20s
Wall time: 6min 23s


In [55]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} Word: {}\n'.format(idx, topic))

Topic: 0 Word: 0.011*"pm" + 0.008*"play" + 0.007*"game" + 0.007*"vs" + 0.006*"team" + 0.006*"sport" + 0.005*"nbcsn" + 0.005*"win" + 0.005*"season" + 0.005*"csn"

Topic: 1 Word: 0.007*"police" + 0.005*"old" + 0.005*"man" + 0.004*"family" + 0.004*"photo" + 0.004*"woman" + 0.003*"post" + 0.003*"day" + 0.003*"home" + 0.003*"party"

Topic: 2 Word: 0.007*"think" + 0.006*"want" + 0.005*"women" + 0.005*"need" + 0.005*"health" + 0.004*"way" + 0.004*"life" + 0.003*"feel" + 0.003*"help" + 0.003*"get"

Topic: 3 Word: 0.061*"trump" + 0.012*"president" + 0.011*"donald" + 0.007*"campaign" + 0.005*"million" + 0.005*"care" + 0.005*"white" + 0.005*"republican" + 0.005*"plan" + 0.004*"obamacare"

Topic: 4 Word: 0.038*"ht" + 0.017*"sign" + 0.014*"email" + 0.014*"forward" + 0.012*"celtics" + 0.012*"center" + 0.011*"nhl" + 0.011*"freestyle" + 0.010*"account" + 0.010*"link"

Topic: 5 Word: 0.008*"government" + 0.005*"china" + 0.005*"minister" + 0.004*"india" + 0.004*"report" + 0.004*"country" + 0.004*"bank" 

#### Quick Visualization

In [56]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(lda_model, bow_corpus, dictionary)

### NLP on Title

In [35]:
count_vec = CountVectorizer()
X_train_count = count_vec.fit_transform(data.title)

In [36]:
print(count_vec.get_feature_names()[::1000])

['00', '32811261', 'aar', 'agincourt', 'amphibious', 'aronofsky', 'ayanda', 'batumalai', 'bigs', 'boop', 'buc', 'canzieri', 'chandeliers', 'clarifying', 'conceding', 'coutinho', 'd3', 'della', 'dipali', 'dozen', 'eeo', 'epson', 'f12tdf', 'finalises', 'fraim', 'gautham', 'goodcall', 'guus', 'heathen', 'hoosiers', 'ilan', 'instructor', 'janot', 'kalie', 'kirit', 'lamu', 'lift', 'luy', 'maranello', 'medal', 'minn', 'motter', 'narcing', 'nirupa', 'odzyskać', 'outvets', 'pataca', 'piaget', 'ported', 'proposing', 'rafa', 'reds', 'rethinking', 'roselawn', 'sanjiv', 'seesaw', 'shobhaa', 'slots', 'speedy', 'stolinsky', 'surpass', 'taverne', 'tikes', 'trican', 'uncharted', 'vai', 'vonovia', 'wham', 'wwiii', 'řąřşř', 'قليل', 'กไม', 'ธนาร', 'ยหายเด', 'ากระแสหย', 'キングジムの']


In [37]:
processed_docs2  = []

for doc in data.title:
    processed_docs2.append(preprocess(doc))

Using 'Pickle' to save progress. 

In [69]:
file_4 = "processed_docs_title"

fileObject_4 = open(file_4,'wb') 

pickle.dump(processed_docs2,fileObject_4)   

fileObject_4.close()

In [39]:
dictionary2 = gensim.corpora.Dictionary(processed_docs)

Filtering out tokens by criterias - Frequency.

In [40]:
dictionary2.filter_extremes(no_below=15, no_above=0.2, keep_n=50000)

Using 'Pickle' to save progress. 

In [70]:
file_5 = "dictionary_title"

fileObject_5 = open(file_5,'wb') 

pickle.dump(dictionary2,fileObject_5)   

fileObject_5.close()

Preparing for "out-of-bag" format.

In [41]:
bow_corpus2 = [dictionary2.doc2bow(doc) for doc in processed_docs]

Using 'Pickle' to save progress. 

In [71]:
file_6 = "bowcorpus_title"

fileObject_6 = open(file_6,'wb') 

pickle.dump(bow_corpus2,fileObject_6)   

fileObject_6.close()

In [57]:
%%time
lda_model2 = gensim.models.LdaMulticore(bow_corpus2, num_topics=10, id2word=dictionary2, passes=2)
lda_model2.save('lda.model')

CPU times: user 3min 26s, sys: 42.1 s, total: 4min 8s
Wall time: 5min 25s


In [47]:
for idx, topic in lda_model2.print_topics(-1):
    print('Topic: {} Word: {}\n'.format(idx, topic))

Topic: 0 Word: 0.023*"house" + 0.020*"trump" + 0.010*"health" + 0.010*"care" + 0.009*"president" + 0.008*"republicans" + 0.008*"vote" + 0.008*"committee" + 0.008*"republican" + 0.007*"senate"

Topic: 1 Word: 0.007*"trump" + 0.007*"school" + 0.005*"president" + 0.005*"federal" + 0.005*"million" + 0.004*"law" + 0.004*"tax" + 0.004*"fund" + 0.004*"plan" + 0.004*"program"

Topic: 2 Word: 0.006*"women" + 0.004*"help" + 0.004*"need" + 0.004*"company" + 0.004*"use" + 0.003*"health" + 0.003*"children" + 0.003*"include" + 0.003*"study" + 0.003*"life"

Topic: 3 Word: 0.006*"play" + 0.005*"game" + 0.005*"get" + 0.004*"want" + 0.004*"think" + 0.004*"look" + 0.004*"season" + 0.004*"win" + 0.004*"team" + 0.004*"star"

Topic: 4 Word: 0.009*"government" + 0.008*"minister" + 0.007*"party" + 0.006*"india" + 0.005*"mr" + 0.004*"country" + 0.004*"china" + 0.003*"uk" + 0.003*"prime" + 0.003*"eu"

Topic: 5 Word: 0.010*"enw" + 0.009*"trump" + 0.008*"article" + 0.008*"right" + 0.008*"display" + 0.007*"com" + 

#### Quick Visualization

In [72]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(lda_model2, bow_corpus2, dictionary2)

# End