# Dataframe Cleaning & Feature Extraction

The <b>purpose</b> of this notebook is to merge and clean dataframes. All of these steps will assist in feeding keywords into the Twitter API and the Gephi platform.

## Libraries

In [79]:
import pandas as pd
import numpy as np
from pprint import pprint
import spacy
import json
import os
import matplotlib.pyplot as plt
import matplotlib
import time
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer
import gensim
from gensim import models
import warnings
warnings.filterwarnings('ignore')
import nltk
nltk.download('wordnet')
import pyLDAvis.gensim
import pickle

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/celinasprague/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Loading Exported Dataframes

Creating a "helper" function to do light cleaning so we can apply it quickly on multiple dataframes.

In [32]:
def initial_clean(data):
    """Light cleaning on raw data: drop the leftover 'Unnamed: 0' column."""
    # 'Unnamed: 0' is typically an old index that was written out with the csv
    return data.drop(columns=['Unnamed: 0'])
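
The helper above raises a `KeyError` when a frame has no `Unnamed: 0` column. A more tolerant variant could look like the sketch below. The name `drop_unnamed` and the toy frame are hypothetical and not part of the notebook, and unlike `initial_clean` this version also removes columns such as `Unnamed: 0.1`.

```python
import pandas as pd


def drop_unnamed(data):
    """Return a copy of `data` without any column whose name starts with 'Unnamed'."""
    # Hypothetical helper: collects every leftover index-like column by name prefix
    unnamed = [col for col in data.columns if str(col).startswith('Unnamed')]
    return data.drop(columns=unnamed)


# Toy frame standing in for one of the loaded csv files
toy = pd.DataFrame({'Unnamed: 0': ['0', '1'], 'author': ['USNews', None]})
cleaned = drop_unnamed(toy)
already_clean = drop_unnamed(cleaned)  # no 'Unnamed' columns left, and no error

print(list(cleaned.columns))
```

Because `drop` returns a new frame, the input is left unchanged and the helper can be called more than once without failing.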

### Datasets

Pulling in all compiled datasets as CSV files. For now we read each file into its own variable; we'll combine them into a single dataframe later on.

In [29]:
popular1 = pd.read_csv('popular1.csv', dtype=str)
popular2 = pd.read_csv('popular2.csv', dtype=str)
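
As a small illustration of what `dtype=str` changes when reading, the sketch below loads the same csv text twice. The in-memory text and the `id` column are made up for the example; only `thread_spam_score` is a column name taken from the data above.

```python
import io

import pandas as pd

# Hypothetical csv content; 'id' has leading zeros that type inference would lose
csv_text = "id,thread_spam_score\n007,0.0\n042,0.5\n"

as_str = pd.read_csv(io.StringIO(csv_text), dtype=str)  # every column kept as text
inferred = pd.read_csv(io.StringIO(csv_text))           # pandas guesses the types

print(as_str['id'].tolist())
print(inferred['id'].tolist())
```

Reading everything as text keeps identifiers exactly as they were written, at the cost of having to convert numeric columns explicitly before doing arithmetic on them.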

In [30]:
popular1.head()

Unnamed: 0.1,Unnamed: 0,author,crawled,entities_locations,entities_organizations,entities_persons,external_links,highlightText,highlightTitle,language,...,thread_social_stumbledupon_shares,thread_social_vk_shares,thread_spam_score,thread_title,thread_title_full,thread_url,thread_uuid,title,url,uuid
0,0,USNews,2015-10-02T17:33:59.981+03:00,,,,[['http://www.reddit.com/submit?url=http%3A%2F...,,,english,...,0,0,0.0,The Healthiest Pastas: From Quinoa to Buckwhea...,The Healthiest Pastas: From Quinoa to Buckwhea...,http://health.usnews.com/health-news/health-we...,8085f289866a814f7a443e1a31e48f8a307a040f,The Healthiest Pastas: From Quinoa to Buckwhea...,http://health.usnews.com/health-news/health-we...,8085f289866a814f7a443e1a31e48f8a307a040f
1,1,,2015-10-19T09:23:00.540+03:00,,,,,,,english,...,0,0,0.0,Photos: Operation Santa Claus visits Savoonga,Photos: Operation Santa Claus visits Savoonga,http://www.newsdump.com/article/photos-operati...,f4ad43deab0a72726d6165b37a971c578efdd4f5,Photos: Operation Santa Claus visits Savoonga,http://www.newsdump.com/article/photos-operati...,f4ad43deab0a72726d6165b37a971c578efdd4f5
2,2,,2015-10-08T17:42:28.717+03:00,,,,,,,english,...,0,0,0.0,"Watch: Video Shows 2,000-Year-Old Ancient Arch...","Watch: Video Shows 2,000-Year-Old Ancient Arch...",http://www.newsdump.com/article/watch-video-sh...,c98cbd870f52950ff685e772fd189bd01fc85767,"Watch: Video Shows 2,000-Year-Old Ancient Arch...",http://www.newsdump.com/article/watch-video-sh...,c98cbd870f52950ff685e772fd189bd01fc85767
3,3,,2015-10-05T10:10:00.218+03:00,,,,,,,english,...,0,0,0.0,'Fear the Walking Dead' ends Season 1 on a gri...,'Fear the Walking Dead' ends Season 1 on a gri...,http://www.newsdump.com/article/fear-the-walki...,3481ad311613e0da31e6017f854c7ded093b398a,'Fear the Walking Dead' ends Season 1 on a gri...,http://www.newsdump.com/article/fear-the-walki...,3481ad311613e0da31e6017f854c7ded093b398a
4,4,,2015-10-23T15:40:06.454+03:00,,,,,,,english,...,0,0,0.0,Facebook app draining your iPhone battery? Com...,Facebook app draining your iPhone battery? Com...,http://www.newsdump.com/article/facebook-app-d...,17954912c005732967b28ef81b4ebc58d3911efc,Facebook app draining your iPhone battery? Com...,http://www.newsdump.com/article/facebook-app-d...,17954912c005732967b28ef81b4ebc58d3911efc


### Dataframes

We apply the helper function to every variable from the <b>Datasets</b> section above, and then we'll join them all together as one dataframe.

In [33]:
popular1_df = initial_clean(popular1)
popular2_df = initial_clean(popular2)

## Compiling Dataframes

In [40]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
data = pd.concat([popular1_df, popular2_df], sort=False)
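
Row-wise combination of two frames with different columns can be sketched on toy data with `pd.concat`, which is the supported call in pandas 2.0 and later (where `DataFrame.append` is no longer available). The frames `first` and `second` are hypothetical stand-ins for the two cleaned dataframes.

```python
import pandas as pd

# Two toy frames that share 'title' but otherwise have different columns
first = pd.DataFrame({'title': ['a', 'b'], 'locations': ['x', 'y']})
second = pd.DataFrame({'title': ['c'], 'thread_domain_rank': ['5']})

# Columns are unioned in order of appearance; missing cells become NaN
combined = pd.concat([first, second], sort=False)

# Each frame keeps its own row labels unless ignore_index=True is passed
renumbered = pd.concat([first, second], sort=False, ignore_index=True)

print(combined.columns.tolist())
print(combined.index.tolist())
print(renumbered.index.tolist())
```

Without `ignore_index=True` the combined frame contains repeated index labels, which matters for any later label-based lookup with `.loc`.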

In [41]:
data.head()

Unnamed: 0,author,crawled,entities_locations,entities_organizations,entities_persons,external_links,highlightText,highlightTitle,language,locations,...,thread_social_vk_shares,thread_spam_score,thread_title,thread_title_full,thread_url,thread_uuid,title,url,uuid,thread_domain_rank
0,USNews,2015-10-02T17:33:59.981+03:00,,,,[['http://www.reddit.com/submit?url=http%3A%2F...,,,english,,...,0,0.0,The Healthiest Pastas: From Quinoa to Buckwhea...,The Healthiest Pastas: From Quinoa to Buckwhea...,http://health.usnews.com/health-news/health-we...,8085f289866a814f7a443e1a31e48f8a307a040f,The Healthiest Pastas: From Quinoa to Buckwhea...,http://health.usnews.com/health-news/health-we...,8085f289866a814f7a443e1a31e48f8a307a040f,
1,,2015-10-19T09:23:00.540+03:00,,,,,,,english,['Savoonga'],...,0,0.0,Photos: Operation Santa Claus visits Savoonga,Photos: Operation Santa Claus visits Savoonga,http://www.newsdump.com/article/photos-operati...,f4ad43deab0a72726d6165b37a971c578efdd4f5,Photos: Operation Santa Claus visits Savoonga,http://www.newsdump.com/article/photos-operati...,f4ad43deab0a72726d6165b37a971c578efdd4f5,
2,,2015-10-08T17:42:28.717+03:00,,,,,,,english,['Palmyra'],...,0,0.0,"Watch: Video Shows 2,000-Year-Old Ancient Arch...","Watch: Video Shows 2,000-Year-Old Ancient Arch...",http://www.newsdump.com/article/watch-video-sh...,c98cbd870f52950ff685e772fd189bd01fc85767,"Watch: Video Shows 2,000-Year-Old Ancient Arch...",http://www.newsdump.com/article/watch-video-sh...,c98cbd870f52950ff685e772fd189bd01fc85767,
3,,2015-10-05T10:10:00.218+03:00,,,,,,,english,,...,0,0.0,'Fear the Walking Dead' ends Season 1 on a gri...,'Fear the Walking Dead' ends Season 1 on a gri...,http://www.newsdump.com/article/fear-the-walki...,3481ad311613e0da31e6017f854c7ded093b398a,'Fear the Walking Dead' ends Season 1 on a gri...,http://www.newsdump.com/article/fear-the-walki...,3481ad311613e0da31e6017f854c7ded093b398a,
4,,2015-10-23T15:40:06.454+03:00,,,,,,,english,,...,0,0.0,Facebook app draining your iPhone battery? Com...,Facebook app draining your iPhone battery? Com...,http://www.newsdump.com/article/facebook-app-d...,17954912c005732967b28ef81b4ebc58d3911efc,Facebook app draining your iPhone battery? Com...,http://www.newsdump.com/article/facebook-app-d...,17954912c005732967b28ef81b4ebc58d3911efc,


Removing columns that hold the exact same value in every row, since they carry no information.

In [43]:
# Drop every column with a single distinct value (NaN counts as a value here)
for col in data.columns:
    if len(data[col].unique()) == 1:
        data.drop(col, inplace=True, axis=1)
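
The same filter can be written without mutating the frame inside a loop. This is a sketch on a made-up frame that reuses three column names from the data above; `nunique(dropna=False)` counts missing values as a value, which matches the `unique()` check.

```python
import pandas as pd

# Toy frame: one constant column, one mixed column, one entirely empty column
toy = pd.DataFrame({
    'language': ['english', 'english', 'english'],
    'author': ['USNews', None, None],
    'highlightText': [None, None, None],
})

# Columns with exactly one distinct value, missing values included
constant = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
trimmed = toy.drop(columns=constant)

print(constant)
print(trimmed.columns.tolist())
```

Note that `author` survives because a value plus missing entries counts as two distinct values, while the entirely empty `highlightText` column is removed.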

In [44]:
data.head()

Unnamed: 0,author,crawled,entities_locations,entities_organizations,entities_persons,external_links,highlightText,highlightTitle,language,locations,...,thread_social_vk_shares,thread_spam_score,thread_title,thread_title_full,thread_url,thread_uuid,title,url,uuid,thread_domain_rank
0,USNews,2015-10-02T17:33:59.981+03:00,,,,[['http://www.reddit.com/submit?url=http%3A%2F...,,,english,,...,0,0.0,The Healthiest Pastas: From Quinoa to Buckwhea...,The Healthiest Pastas: From Quinoa to Buckwhea...,http://health.usnews.com/health-news/health-we...,8085f289866a814f7a443e1a31e48f8a307a040f,The Healthiest Pastas: From Quinoa to Buckwhea...,http://health.usnews.com/health-news/health-we...,8085f289866a814f7a443e1a31e48f8a307a040f,
1,,2015-10-19T09:23:00.540+03:00,,,,,,,english,['Savoonga'],...,0,0.0,Photos: Operation Santa Claus visits Savoonga,Photos: Operation Santa Claus visits Savoonga,http://www.newsdump.com/article/photos-operati...,f4ad43deab0a72726d6165b37a971c578efdd4f5,Photos: Operation Santa Claus visits Savoonga,http://www.newsdump.com/article/photos-operati...,f4ad43deab0a72726d6165b37a971c578efdd4f5,
2,,2015-10-08T17:42:28.717+03:00,,,,,,,english,['Palmyra'],...,0,0.0,"Watch: Video Shows 2,000-Year-Old Ancient Arch...","Watch: Video Shows 2,000-Year-Old Ancient Arch...",http://www.newsdump.com/article/watch-video-sh...,c98cbd870f52950ff685e772fd189bd01fc85767,"Watch: Video Shows 2,000-Year-Old Ancient Arch...",http://www.newsdump.com/article/watch-video-sh...,c98cbd870f52950ff685e772fd189bd01fc85767,
3,,2015-10-05T10:10:00.218+03:00,,,,,,,english,,...,0,0.0,'Fear the Walking Dead' ends Season 1 on a gri...,'Fear the Walking Dead' ends Season 1 on a gri...,http://www.newsdump.com/article/fear-the-walki...,3481ad311613e0da31e6017f854c7ded093b398a,'Fear the Walking Dead' ends Season 1 on a gri...,http://www.newsdump.com/article/fear-the-walki...,3481ad311613e0da31e6017f854c7ded093b398a,
4,,2015-10-23T15:40:06.454+03:00,,,,,,,english,,...,0,0.0,Facebook app draining your iPhone battery? Com...,Facebook app draining your iPhone battery? Com...,http://www.newsdump.com/article/facebook-app-d...,17954912c005732967b28ef81b4ebc58d3911efc,Facebook app draining your iPhone battery? Com...,http://www.newsdump.com/article/facebook-app-d...,17954912c005732967b28ef81b4ebc58d3911efc,


In [46]:
data.to_csv('finaldata.csv', sep = ',')
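
By default `to_csv` also writes the row index, which is where an `Unnamed: 0` column comes from when the file is read back. The sketch below shows the round trip on a toy frame using in-memory buffers instead of real files; `frame` and the buffer names are hypothetical.

```python
import io

import pandas as pd

frame = pd.DataFrame({'title': ['a', 'b']})

# Default export: the index is written as a header-less first column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Passing `index=False` on export would make a cleanup step like `initial_clean` unnecessary for files produced this way.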

# End