# Dataframe Cleaning & NLP

The <b>purpose</b> of this notebook is to merge and clean the exported dataframes and to create keywords using NLP. All of these steps help prepare the keywords that will be fed into the Twitter API.

## Libraries

In [123]:
import pandas as pd
import numpy as np
from pprint import pprint
import spacy
import json
import os
import matplotlib.pyplot as plt
import matplotlib
import time

## Loading Exported Dataframes

We create a "helper" function for light cleaning so we can apply it quickly to multiple dataframes.

In [146]:
def initial_clean(data, article_type):
    """Light cleaning on raw data: drop the unnamed index column, add an
    identifier column, and move that column to the front."""

    data2 = data.drop(columns=['Unnamed: 0'])  # Drop the leftover index column (returns a copy)
    data2['Article Type'] = article_type       # Indicator for original JSON category from website
    cols = data2.columns.tolist()              # Making a list of all the columns in dataframe
    cols2 = cols[-1:] + cols[:-1]              # Take the last column and move it to the front
    data3 = data2[cols2]                       # Setting new ordered dataframe to variable

    return data3
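
To make the effect of the helper concrete, here is a minimal sketch on a toy dataframe. The frame `raw` and the label `"Toy News"` are hypothetical stand-ins, not part of the real data, and the helper is repeated in condensed form only so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical toy frame standing in for one exported CSV
raw = pd.DataFrame({
    "Unnamed: 0": ["0", "1"],
    "uuid": ["a1", "b2"],
    "text": ["first article", "second article"],
})

def initial_clean(data, article_type):
    """Condensed copy of the helper above, repeated so this snippet is self-contained."""
    out = data.drop(columns=["Unnamed: 0"])   # drop() returns a new frame; `data` is untouched
    out["Article Type"] = article_type        # new column is appended at the end
    cols = out.columns.tolist()
    return out[cols[-1:] + cols[:-1]]         # rotate the last column to the front

cleaned = initial_clean(raw, "Toy News")
print(cleaned.columns.tolist())  # ['Article Type', 'uuid', 'text']
```

Because `drop` returns a copy, the input frame keeps its `Unnamed: 0` column, so the helper can safely be re-run on the same raw variable.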

### Datasets

We read each compiled dataset from its CSV file into its own variable. The files are tab-separated despite the `.csv` extension, and every column is loaded as a string. We'll combine them into a single dataframe later on.

In [150]:
politicalnews = pd.read_csv('politicalnews.csv', sep="\t", encoding="utf8", dtype=str)
technews = pd.read_csv('technews.csv', sep="\t", encoding="utf8", dtype=str)
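
As a small illustration of what these `read_csv` options do, the sketch below parses an in-memory tab-separated sample instead of the real files. The sample string and its values are made up; it assumes the exports were written with their index, which is the usual reason an `Unnamed: 0` column shows up on reload.

```python
import io
import pandas as pd

# Hypothetical two-line sample: the first header field is empty, as happens
# when a dataframe is exported together with its index.
sample = "\tuuid\tthread_spam_score\n0\t00123\t0.0\n"

df = pd.read_csv(io.StringIO(sample), sep="\t", dtype=str)

print(df.columns.tolist())   # ['Unnamed: 0', 'uuid', 'thread_spam_score']
print(df.loc[0, "uuid"])     # 00123  (leading zeros survive because dtype=str)
```

Loading everything as `str` avoids pandas guessing numeric types for ID-like columns; numeric columns can be converted explicitly later where needed.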

### Dataframes

We apply the helper function to each variable from the <b>Datasets</b> section above; we'll then join the results into one dataframe.

In [151]:
politicalnews_df = initial_clean(politicalnews, 'Political News')
technews_df = initial_clean(technews, 'Tech News')

## Compiling Dataframes

In [152]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
data = pd.concat([politicalnews_df, technews_df], ignore_index=True)
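
A quick sanity check after stacking is to confirm that the index was rebuilt and that each source contributed the expected number of rows. The sketch below uses `pd.concat`, the general pandas function for stacking frames, on two tiny hypothetical frames (`a` and `b`) rather than the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the two cleaned dataframes
a = pd.DataFrame({"Article Type": ["Political News"] * 2, "uuid": ["p1", "p2"]})
b = pd.DataFrame({"Article Type": ["Tech News"], "uuid": ["t1"]})

# ignore_index=True discards the original row labels and renumbers from 0
combined = pd.concat([a, b], ignore_index=True)

print(combined.index.tolist())                            # [0, 1, 2]
print(combined["Article Type"].value_counts().to_dict())  # {'Political News': 2, 'Tech News': 1}
```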

In [153]:
data.head()

Unnamed: 0,Article Type,organizations,uuid,thread_social_gplus_shares,thread_social_pinterest_shares,thread_social_vk_shares,thread_social_linkedin_shares,thread_social_facebook_likes,thread_social_facebook_shares,thread_social_facebook_comments,...,entities_locations,entities_organizations,highlightText,language,persons,text,external_links,published,crawled,highlightTitle
0,Political News,,8085f289866a814f7a443e1a31e48f8a307a040f,0,0,0,0,0,0,0,...,,,,english,,The Healthiest Pastas: From Quinoa to Buckwhea...,[['http://www.reddit.com/submit?url=http%3A%2F...,2015-10-02T03:00:00.000+03:00,2015-10-02T17:33:59.981+03:00,
1,Political News,['Anchorage Daily News'],f4ad43deab0a72726d6165b37a971c578efdd4f5,0,0,0,0,0,0,0,...,,,,english,,Published By: Anchorage Daily News - Today \nP...,,2015-10-19T08:06:00.000+03:00,2015-10-19T09:23:00.540+03:00,
2,Political News,['ABC News'],c98cbd870f52950ff685e772fd189bd01fc85767,0,0,0,0,0,0,0,...,,,,english,,Published By: ABC News - Today \nVideo obtaine...,,2015-10-08T17:09:00.000+03:00,2015-10-08T17:42:28.717+03:00,
3,Political News,,3481ad311613e0da31e6017f854c7ded093b398a,0,0,0,0,0,0,0,...,,,,english,,Note: This post contains spoilers about Fear t...,,2015-10-05T07:28:00.000+03:00,2015-10-05T10:10:00.218+03:00,
4,Political News,,17954912c005732967b28ef81b4ebc58d3911efc,0,0,0,0,0,0,0,...,,,,english,,Facebook app draining your iPhone battery? Com...,,2015-10-23T13:08:00.000+03:00,2015-10-23T15:40:06.454+03:00,


## Adding Keyword Column Through NLP

Now that we have a compiled dataset, we'll use NLP to extract keywords, which will be fed into Twitter's API and used for other visualizations.

In [154]:
data.columns.tolist()

['Article Type',
 'organizations',
 'uuid',
 'thread_social_gplus_shares',
 'thread_social_pinterest_shares',
 'thread_social_vk_shares',
 'thread_social_linkedin_shares',
 'thread_social_facebook_likes',
 'thread_social_facebook_shares',
 'thread_social_facebook_comments',
 'thread_social_stumbledupon_shares',
 'thread_site_full',
 'thread_main_image',
 'thread_site_section',
 'thread_section_title',
 'thread_url',
 'thread_country',
 'thread_title',
 'thread_performance_score',
 'thread_site',
 'thread_participants_count',
 'thread_title_full',
 'thread_spam_score',
 'thread_site_type',
 'thread_published',
 'thread_replies_count',
 'thread_uuid',
 'author',
 'url',
 'ord_in_thread',
 'title',
 'locations',
 'entities_persons',
 'entities_locations',
 'entities_organizations',
 'highlightText',
 'language',
 'persons',
 'text',
 'external_links',
 'published',
 'crawled',
 'highlightTitle']
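
Of the columns listed above, `text` is the natural input for keyword extraction. The sketch below shows one simple way this could look, not necessarily the approach used later in this notebook: it counts frequent non-stop-word tokens with a blank spaCy English pipeline (tokenizer only, so no trained model has to be downloaded). The function name `top_keywords`, the minimum token length, and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter

import spacy

# Blank English pipeline: tokenizer and lexical attributes only (assumption:
# a trained model with POS tags or entities could be swapped in instead)
nlp = spacy.blank("en")

def top_keywords(text, n=5):
    """Return the n most frequent alphabetic, non-stop-word tokens, lowercased."""
    doc = nlp(text)
    words = [
        tok.lower_
        for tok in doc
        if tok.is_alpha and not tok.is_stop and len(tok.text) > 2
    ]
    return [word for word, _ in Counter(words).most_common(n)]

# Hypothetical sample text, loosely modelled on one of the article snippets
sample = ("Facebook app draining your iPhone battery? "
          "A Facebook app update may fix battery drain.")
print(top_keywords(sample, n=3))
```

Applied to the compiled dataframe, this could be used as `data['text'].fillna('').apply(top_keywords)` to build a keyword column; frequency counting is a deliberately simple baseline, and noun chunks or named entities from a trained model would likely give cleaner search terms.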