# Dataframe Cleaning & Feature Extraction

The <b>purpose</b> of this notebook is to merge and clean dataframes. These steps prepare the keywords that will be fed into the Twitter API and the Gephi platform.

## Libraries

In [1]:
import pandas as pd
import numpy as np
from pprint import pprint
import spacy
import json
import os
import matplotlib.pyplot as plt
import matplotlib
import time
from sklearn.feature_extraction.text import CountVectorizer

## Loading Exported Dataframes

Creating a "helper" function to do light cleaning so we can apply it quickly on multiple dataframes.

In [2]:
def initial_clean(data, article_type):
    """Light cleaning on raw data: drop the unnamed index column, add an
    identifier column, and move that column to the front."""

    data2 = data.drop(columns=['Unnamed: 0'])  # Drop the leftover index column
    data2['Article Type'] = article_type       # Indicator for original JSON category from website
    cols = data2.columns.tolist()              # Making a list of all the columns in dataframe
    cols2 = cols[-1:] + cols[:-1]              # Take the last column and move it to the front
    data3 = data2[cols2]                       # Setting new ordered dataframe to variable

    return data3
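
To see what the helper does, here is a minimal sketch on a made-up two-row frame. The toy column names (`uuid`, `title`) and the `'Toy News'` label are assumptions for illustration only; the function body mirrors `initial_clean` above.

```python
import pandas as pd

def initial_clean(data, article_type):
    """Drop the saved index column, tag rows with their category, move the tag to the front."""
    cleaned = data.drop(columns=['Unnamed: 0'])
    cleaned['Article Type'] = article_type
    cols = cleaned.columns.tolist()
    return cleaned[cols[-1:] + cols[:-1]]

# Hypothetical toy input standing in for one of the exported files
toy = pd.DataFrame({
    'Unnamed: 0': ['0', '1'],
    'uuid': ['a1', 'b2'],
    'title': ['First headline', 'Second headline'],
})

toy_clean = initial_clean(toy, 'Toy News')
print(toy_clean.columns.tolist())
```

Because `drop` returns a new frame, the input `toy` is left untouched and still carries its `Unnamed: 0` column.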

### Datasets

Reading each compiled dataset from its csv file (tab-separated, despite the extension) into its own variable, with every column read as a string. We'll combine them into a single dataframe later on.

In [3]:
politicalnews = pd.read_csv('politicalnews.csv', sep="\t", encoding="utf8", dtype=str)
technews = pd.read_csv('technews.csv', sep="\t", encoding="utf8", dtype=str)
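
The `Unnamed: 0` column that the helper drops typically appears when a dataframe was written out with its index and is read back without `index_col`. The sketch below reproduces that on an in-memory, tab-separated string (the contents are made up) and shows that `dtype=str` keeps every value as text.

```python
import io
import pandas as pd

# Made-up tab-separated content; the empty first header cell is the saved index
raw = "\tuuid\tthread_social_facebook_likes\n0\t0012ab\t0\n1\t0034cd\t7\n"

sample = pd.read_csv(io.StringIO(raw), sep="\t", dtype=str)

print(sample.columns.tolist())                          # first column is labelled 'Unnamed: 0'
print(sample['thread_social_facebook_likes'].tolist())  # values stay as strings
```

Reading everything as a string avoids surprises with mixed-type columns, but numeric columns such as the share counts would need a conversion (for example with `pd.to_numeric`) before any arithmetic.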

### Dataframes

We run the helper function on each variable from the <b>Datasets</b> section above; we'll then join them all together as one dataframe.

In [4]:
politicalnews_df = initial_clean(politicalnews, 'Political News')
technews_df = initial_clean(technews, 'Tech News')

## Compiling Dataframes

In [5]:
data = pd.concat([politicalnews_df, technews_df], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
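
A small sketch of what stacking rows with `ignore_index=True` does, using two made-up frames. It uses `pd.concat`, since `DataFrame.append` is no longer available in pandas 2.0 and later. The `author` column exists in only one of the toy frames to show how mismatched columns are handled.

```python
import pandas as pd

a = pd.DataFrame({'Article Type': ['Political News'] * 2, 'title': ['p1', 'p2']})
b = pd.DataFrame({'Article Type': ['Tech News'], 'title': ['t1'], 'author': ['someone']})

# Stack the rows and renumber the index from 0
stacked = pd.concat([a, b], ignore_index=True)
print(stacked)
```

Rows from `a` get `NaN` in `author`, which is one reason a compiled frame can show many nulls in columns that only some sources provide.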

In [6]:
data.head()

Unnamed: 0,Article Type,organizations,uuid,thread_social_gplus_shares,thread_social_pinterest_shares,thread_social_vk_shares,thread_social_linkedin_shares,thread_social_facebook_likes,thread_social_facebook_shares,thread_social_facebook_comments,...,entities_locations,entities_organizations,highlightText,language,persons,text,external_links,published,crawled,highlightTitle
0,Political News,,8085f289866a814f7a443e1a31e48f8a307a040f,0,0,0,0,0,0,0,...,,,,english,,The Healthiest Pastas: From Quinoa to Buckwhea...,[['http://www.reddit.com/submit?url=http%3A%2F...,2015-10-02T03:00:00.000+03:00,2015-10-02T17:33:59.981+03:00,
1,Political News,['Anchorage Daily News'],f4ad43deab0a72726d6165b37a971c578efdd4f5,0,0,0,0,0,0,0,...,,,,english,,Published By: Anchorage Daily News - Today \nP...,,2015-10-19T08:06:00.000+03:00,2015-10-19T09:23:00.540+03:00,
2,Political News,['ABC News'],c98cbd870f52950ff685e772fd189bd01fc85767,0,0,0,0,0,0,0,...,,,,english,,Published By: ABC News - Today \nVideo obtaine...,,2015-10-08T17:09:00.000+03:00,2015-10-08T17:42:28.717+03:00,
3,Political News,,3481ad311613e0da31e6017f854c7ded093b398a,0,0,0,0,0,0,0,...,,,,english,,Note: This post contains spoilers about Fear t...,,2015-10-05T07:28:00.000+03:00,2015-10-05T10:10:00.218+03:00,
4,Political News,,17954912c005732967b28ef81b4ebc58d3911efc,0,0,0,0,0,0,0,...,,,,english,,Facebook app draining your iPhone battery? Com...,,2015-10-23T13:08:00.000+03:00,2015-10-23T15:40:06.454+03:00,


Dropping columns that hold the same value in every row (including columns that are entirely null), since they carry no information.

In [7]:
constant_cols = [col for col in data.columns if len(data[col].unique()) == 1]  # NaN counts as a value
data = data.drop(columns=constant_cols)
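
The single-value check treats `NaN` as a value of its own, which changes which columns get dropped. A minimal sketch on a made-up frame (all column names here are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'constant_col': ['english', 'english', 'english'],  # one value -> dropped
    'all_missing': [np.nan, np.nan, np.nan],            # only NaN -> also one unique value
    'mostly_same': ['x', np.nan, 'x'],                  # 'x' plus NaN -> two unique values, kept
    'title': ['a', 'b', 'c'],
})

constant_cols = [col for col in toy.columns if len(toy[col].unique()) == 1]
toy = toy.drop(columns=constant_cols)

print(constant_cols)
print(toy.columns.tolist())
```

A column with one real value and a few nulls therefore survives this step.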

In [171]:
data.head()

Unnamed: 0,Article Type,organizations,uuid,thread_social_gplus_shares,thread_social_pinterest_shares,thread_social_vk_shares,thread_social_linkedin_shares,thread_social_facebook_likes,thread_social_facebook_shares,thread_social_facebook_comments,...,url,ord_in_thread,title,locations,language,persons,text,external_links,published,crawled
0,Political News,,8085f289866a814f7a443e1a31e48f8a307a040f,0,0,0,0,0,0,0,...,http://health.usnews.com/health-news/health-we...,0,The Healthiest Pastas: From Quinoa to Buckwhea...,,english,,The Healthiest Pastas: From Quinoa to Buckwhea...,[['http://www.reddit.com/submit?url=http%3A%2F...,2015-10-02T03:00:00.000+03:00,2015-10-02T17:33:59.981+03:00
1,Political News,['Anchorage Daily News'],f4ad43deab0a72726d6165b37a971c578efdd4f5,0,0,0,0,0,0,0,...,http://www.newsdump.com/article/photos-operati...,0,Photos: Operation Santa Claus visits Savoonga,['Savoonga'],english,,Published By: Anchorage Daily News - Today \nP...,,2015-10-19T08:06:00.000+03:00,2015-10-19T09:23:00.540+03:00
2,Political News,['ABC News'],c98cbd870f52950ff685e772fd189bd01fc85767,0,0,0,0,0,0,0,...,http://www.newsdump.com/article/watch-video-sh...,0,"Watch: Video Shows 2,000-Year-Old Ancient Arch...",['Palmyra'],english,,Published By: ABC News - Today \nVideo obtaine...,,2015-10-08T17:09:00.000+03:00,2015-10-08T17:42:28.717+03:00
3,Political News,,3481ad311613e0da31e6017f854c7ded093b398a,0,0,0,0,0,0,0,...,http://www.newsdump.com/article/fear-the-walki...,0,'Fear the Walking Dead' ends Season 1 on a gri...,,english,,Note: This post contains spoilers about Fear t...,,2015-10-05T07:28:00.000+03:00,2015-10-05T10:10:00.218+03:00
4,Political News,,17954912c005732967b28ef81b4ebc58d3911efc,0,0,0,0,0,0,0,...,http://www.newsdump.com/article/facebook-app-d...,0,Facebook app draining your iPhone battery? Com...,,english,,Facebook app draining your iPhone battery? Com...,,2015-10-23T13:08:00.000+03:00,2015-10-23T15:40:06.454+03:00


## Preparing Gephi Files

Now that we have a compiled dataset, we can run the remaining steps on the full set or on subsets. Either way, the first step is to deal with null values.

In [8]:
data_colnan=data.columns[data.isnull().any()]
data[data_colnan].isnull().sum()

organizations                 72880
thread_social_gplus_shares       10
thread_main_image             61130
thread_section_title             78
thread_url                       54
thread_country                  631
thread_title                     54
thread_performance_score         84
thread_site                     108
thread_participants_count       108
thread_title_full               108
thread_spam_score               108
thread_site_type                 54
thread_published                 85
thread_replies_count             54
thread_uuid                     108
author                        63993
url                              54
ord_in_thread                   108
title                           108
locations                     83335
language                        108
persons                       83307
text                            108
external_links                96787
published                       108
crawled                         108
dtype: int64

In [9]:
data['title'] = data['title'].fillna("none")
data['text'] = data['text'].fillna("none")
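
The same null check and fill can be tried on a made-up frame. The column names match the ones used above, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'title': ['Headline', np.nan, 'Another headline'],
    'text': [np.nan, 'Body text', np.nan],
    'uuid': ['a', 'b', 'c'],
})

# Columns that contain at least one null, with their null counts
cols_with_nan = toy.columns[toy.isnull().any()]
null_counts = toy[cols_with_nan].isnull().sum()
print(null_counts)

# Replace missing strings with a placeholder so later text steps receive strings only
toy[['title', 'text']] = toy[['title', 'text']].fillna('none')
print(toy)
```

Only `title` and `text` are filled here because they are the text columns used later; the other columns with nulls are left as they are.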

We don't need every column from the large dataset, only the article type, organizations, title, and text. The article type will eventually be removed, but we need it first to filter by topic, e.g., politics or tech.

In [11]:
gephi_df = data[['Article Type','organizations', 'title', 'text']]

In [15]:
gephi_df= gephi_df.loc[gephi_df['Article Type'] == 'Political News']

In [21]:
gephi_df = gephi_df.drop(columns=['Article Type']).head()  # Note: .head() keeps only the first five rows; remove it to process every political article

Gephi will read a separate dataframe, so we need to build it in a shape the platform can read easily: a square keyword-by-keyword co-occurrence matrix computed from the article text.

In [22]:
cv = CountVectorizer(ngram_range=(1,1), stop_words = 'english')  # Single words only, English stop words removed
X = cv.fit_transform(gephi_df['text'])  # Document-term counts; change the column to pull keywords from elsewhere

In [23]:
Xc = (X.T @ X)  # Term-by-term co-occurrence: entry (i, j) sums count_i * count_j over all documents
Xc.setdiag(0)   # Zero the diagonal, which only pairs each keyword with itself
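
The matrix step is easier to follow on a tiny example. Below, `X` is a hand-written document-term count matrix (two documents, three hypothetical keywords) standing in for the `CountVectorizer` output. A dense NumPy array is used instead of the sparse matrix purely for readability.

```python
import numpy as np
import pandas as pd

names = ['election', 'senate', 'vote']   # hypothetical vocabulary
X = np.array([
    [1, 1, 0],   # document 0 mentions 'election' and 'senate' once each
    [0, 2, 1],   # document 1 mentions 'senate' twice and 'vote' once
])

Xc = X.T @ X              # entry (i, j) = sum over documents of count_i * count_j
np.fill_diagonal(Xc, 0)   # drop the pairing of each keyword with itself

cooc = pd.DataFrame(Xc, index=names, columns=names)
print(cooc)
```

Since the entries are products of counts, a word repeated within one document weighs more (`senate` and `vote` score 2 here). If each document should count at most once per pair, passing `binary=True` to `CountVectorizer` is one option.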

In [24]:
names = cv.get_feature_names_out() # These are the keywords, in matrix column order (get_feature_names was removed in scikit-learn 1.2)
df = pd.DataFrame(data = Xc.toarray(), columns = names, index = names)

The cleaning process is done, so now we can export the file and open Gephi.

In [26]:
df.to_csv('to gephi.csv', sep = ',')
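
If the square matrix turns out to be awkward to import, an edge list is a possible alternative. The sketch below is an assumption-laden example rather than part of the original workflow: the toy matrix is made up, and the `Source`, `Target`, and `Weight` headers follow the convention we understand Gephi's spreadsheet importer to expect, so check them against your Gephi version.

```python
import numpy as np
import pandas as pd

names = ['election', 'senate', 'vote']   # hypothetical keywords
df = pd.DataFrame([[0, 1, 0], [1, 0, 2], [0, 2, 0]], index=names, columns=names)

# The matrix is symmetric, so read only the upper triangle (each pair once)
rows, cols = np.triu_indices(len(names), k=1)
labels = np.array(names)

edges = pd.DataFrame({
    'Source': labels[rows],
    'Target': labels[cols],
    'Weight': df.to_numpy()[rows, cols],
})
edges = edges[edges['Weight'] > 0].reset_index(drop=True)   # keep only pairs that co-occur
print(edges)
```

The result could then be written with `edges.to_csv(...)` using `index=False`, under whatever file name suits the project.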