# Loading Data & Creating Exported Dataframe

The <b>purpose</b> of this notebook is to define a class that loads each JSON file, to compile all of the files into one dataframe, and to export that dataframe. This will make data cleaning and manipulation easier later on.

## Libraries

In [44]:
import pandas as pd
import numpy as np
from pprint import pprint
import spacy
import json
import os
import matplotlib.pyplot as plt
import matplotlib
import time

## Creating A Class For Each JSON File

The short loop below shows that "os.listdir" returns only the file names, not the file contents, so we'll need a series of functions to open and clean each file.

In [45]:
counter = 0

for file in os.listdir('data1'):
    print(file)
        
    counter += 1 
    if counter >= 4: break 

news_0047188.json
news_0014010.json
news_0042670.json
news_0069520.json


### The Class

The <b>class 'data_file'</b> simplifies the process of importing, flattening, lightly cleaning, and forming a dataframe from a given file. This saves us from writing separate "for" loops to import and clean the data, and from joining multiple dataframes.

In [46]:
class data_file(object):
    """Import, flatten, lightly clean, and build a dataframe from one JSON file."""

    def __init__(self, filename):
        # Store the file name; self.data is filled in by open_file()
        self.filename = filename
        self.data = None

    def open_file(self):
        """Open the JSON file and load its content into self.data."""
        with open('./data1/' + self.filename) as f:
            self.data = json.load(f)

    @staticmethod
    def flatten_dict(dd, separator='_', prefix=''):
        """Helper that flattens nested dictionaries, joining nested keys with the separator."""
        return { prefix + separator + k if prefix else k : v
                 for kk, vv in dd.items()
                 for k, v in data_file.flatten_dict(vv, separator, kk).items()
                 } if isinstance(dd, dict) else { prefix : dd }

    def flatten_data(self):
        """Flatten self.data in place using flatten_dict."""
        self.data = data_file.flatten_dict(self.data)

    def emptyvals(self):
        """Replace empty lists and empty strings with None, and wrap longer lists."""
        for key in self.data.keys():
            if self.data[key] in [[], '']:
                self.data[key] = None
            elif isinstance(self.data[key], list) and len(self.data[key]) > 1:
                # Wrap multi-element lists so the whole list fits in a single dataframe cell
                self.data[key] = [self.data[key]]

    def dataframetable(self):
        """Create a one-row dataframe from the flattened dictionary."""
        return pd.DataFrame(self.data, index=[0])
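
To make the flattening step more concrete, here is a minimal standalone sketch. The function name `flatten` and the sample record are hypothetical and written only for illustration (the record imitates the kind of nesting that a column name such as `thread_social_vk_shares` suggests). The function is meant to behave like `flatten_dict`, just written as an explicit loop.

```python
def flatten(d, separator='_', prefix=''):
    """Flatten nested dictionaries by joining nested keys with `separator`."""
    flat = {}
    for key, value in d.items():
        # Build the full key name from the parent prefix, if there is one
        name = prefix + separator + key if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key path along
            flat.update(flatten(value, separator, name))
        else:
            flat[name] = value
    return flat


# Hypothetical record, not taken from the real data
record = {
    'uuid': 'abc123',
    'thread': {'social': {'vk': {'shares': 0}}},
    'persons': [],
}

flat_record = flatten(record)
print(flat_record)
```

Each nested key path collapses into a single key, which is what lets every JSON file become one flat row in the dataframe.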

## Setting Up The Loop For JSON Files

We run the class on the first file in the list of all JSON files because it will serve as the "starting point" dataframe for each additional file. In this sense, we'll be adding to the dataframe with each new JSON file. 

In [47]:
file_list = os.listdir('./data1/')

In [48]:
file_list[:10]

['news_0047188.json',
 'news_0014010.json',
 'news_0042670.json',
 'news_0069520.json',
 'news_0038485.json',
 'news_0075503.json',
 'news_0023363.json',
 'news_0008033.json',
 'news_0059196.json',
 'news_0043962.json']

We create a 'data_file' object for the first JSON file in the folder, to which we'll apply the class of functions.

In [49]:
temp_df = data_file(file_list[0])

Running the class of functions on the first JSON file in the list of all JSON files.

In [50]:
temp_df.open_file()
temp_df.flatten_data()  # flattens temp_df.data in place via flatten_dict
temp_df.emptyvals()
temp_df = temp_df.dataframetable()

## Creating The Dataframe

Using the very first JSON file, we'll create a variable for the columns to be included in the final dataframe. This is the equivalent of creating the "backbone" of the object; with each additional JSON file, the object will grow by one new row holding that file's data.

In [51]:
colname = temp_df.columns.tolist()

### Warning! 

The cell below will take a long time to run! The run shown here took about 22,888 seconds (roughly 6.4 hours).

#### Steps

- We start by creating an <b>empty dataframe</b>, a <b>counter</b> to track the loop's progress, and a <b>time stamp</b> to measure how long the loop takes
- Create a "for" loop applying the class of functions to each JSON file
- Print the elapsed time every 10,000 files
- Delete the temporary object after each file to free memory
- Finally, print the total amount of time the loop took to run

In [52]:
final_df = pd.DataFrame(columns= colname)
counter = 0
start_time = time.time()

for filename in file_list:
    temp_df = data_file(filename)
    temp_df.open_file()
    temp_df.flatten_data()  # flattens temp_df.data in place via flatten_dict
    temp_df.emptyvals()
    final_df.loc[len(final_df)] = temp_df.data
    
    counter += 1
    
    if counter % 10000 == 0:
        print("There have been {} files read so far".format(counter))
        print("Time elapsed: {}".format(time.time() - start_time))
        
    del temp_df
    
print("Operation complete after {} seconds.".format(time.time()-start_time))

There have been 10000 files read so far
Time elapsed: 367.1813039779663
There have been 20000 files read so far
Time elapsed: 1272.5758740901947
There have been 30000 files read so far
Time elapsed: 2696.3004338741302
There have been 40000 files read so far
Time elapsed: 4618.97672700882
There have been 50000 files read so far
Time elapsed: 7057.38160610199
There have been 60000 files read so far
Time elapsed: 10034.922996044159
There have been 70000 files read so far
Time elapsed: 14382.060152053833
There have been 80000 files read so far
Time elapsed: 19303.08251595497
Operation complete after 22888.047554016113 seconds.
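
The elapsed times above grow faster than linearly: the first 10,000 files took about 6 minutes, while the batch from 70,000 to 80,000 took about 82 minutes. A likely contributor (this was not profiled) is that each `final_df.loc[len(final_df)] = ...` assignment enlarges the dataframe one row at a time. A common alternative is to collect the flattened dictionaries in a list and build the dataframe once at the end. The sketch below is only an illustration under stated assumptions: it works on in-memory records rather than the files in `data1`, the names `flatten` and `build_frame` are hypothetical, and the empty-value handling from `emptyvals` is left out for brevity.

```python
import pandas as pd


def flatten(d, separator='_', prefix=''):
    """Flatten nested dictionaries by joining nested keys with `separator`."""
    flat = {}
    for key, value in d.items():
        name = prefix + separator + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, separator, name))
        else:
            flat[name] = value
    return flat


def build_frame(records):
    """Flatten every record first, then build the dataframe in one call."""
    rows = [flatten(record) for record in records]
    # One constructor call instead of one row assignment per record
    return pd.DataFrame(rows)


# Hypothetical records standing in for the parsed JSON files
records = [
    {'uuid': 'a', 'thread': {'social': {'vk': {'shares': 0}}}},
    {'uuid': 'b', 'language': 'english'},
]

example_df = build_frame(records)
print(example_df.shape)
```

Note one difference from the loop above: here the columns are the union of the keys seen across all records, not only the columns of the first file, and keys missing from a record become missing values.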


We will now take a look at the dataframe to check that all of the rows were added: the row count should equal the number of files in the original downloaded data.

In [53]:
final_df

Unnamed: 0,organizations,uuid,thread_social_gplus_shares,thread_social_pinterest_shares,thread_social_vk_shares,thread_social_linkedin_shares,thread_social_facebook_likes,thread_social_facebook_shares,thread_social_facebook_comments,thread_social_stumbledupon_shares,...,entities_locations,entities_organizations,highlightText,language,persons,text,external_links,published,crawled,highlightTitle
0,,8085f289866a814f7a443e1a31e48f8a307a040f,0,0,0,0,0,0,0,0,...,,,,english,,The Healthiest Pastas: From Quinoa to Buckwhea...,[[http://www.reddit.com/submit?url=http%3A%2F%...,2015-10-02T03:00:00.000+03:00,2015-10-02T17:33:59.981+03:00,
1,[Anchorage Daily News],f4ad43deab0a72726d6165b37a971c578efdd4f5,0,0,0,0,0,0,0,0,...,,,,english,,Published By: Anchorage Daily News - Today \nP...,,2015-10-19T08:06:00.000+03:00,2015-10-19T09:23:00.540+03:00,
2,[ABC News],c98cbd870f52950ff685e772fd189bd01fc85767,0,0,0,0,0,0,0,0,...,,,,english,,Published By: ABC News - Today \nVideo obtaine...,,2015-10-08T17:09:00.000+03:00,2015-10-08T17:42:28.717+03:00,
3,,3481ad311613e0da31e6017f854c7ded093b398a,0,0,0,0,0,0,0,0,...,,,,english,,Note: This post contains spoilers about Fear t...,,2015-10-05T07:28:00.000+03:00,2015-10-05T10:10:00.218+03:00,
4,,17954912c005732967b28ef81b4ebc58d3911efc,0,0,0,0,0,0,0,0,...,,,,english,,Facebook app draining your iPhone battery? Com...,,2015-10-23T13:08:00.000+03:00,2015-10-23T15:40:06.454+03:00,
5,,5bbf98bcfe73b21ec93242edbe28c726162587e7,0,0,0,0,0,0,0,0,...,,,,english,,Maroochydore MP Fiona Simpson is calling on th...,,2015-10-14T03:00:00.000+03:00,2015-10-14T06:07:23.084+03:00,
6,[Cincinnati Enquirer],bca68dcec52429a0a83a38388ee8bf733ade5d48,0,0,0,0,0,0,0,0,...,,,,english,,Published By: Cincinnati Enquirer - Today \nCi...,,2015-10-16T13:36:00.000+03:00,2015-10-16T14:42:08.387+03:00,
7,[Instagram Takeover Katy Perry],aa01573d89a949a310f069a8e1a4cb4a0595219c,0,0,0,0,0,0,0,0,...,,,,english,"[[Hillary Clinton, Katy Perry]]",Katy Perry Shows Her Support for Hillary Clint...,,2015-10-25T05:00:00.000+02:00,2015-10-25T06:24:55.144+02:00,
8,,d0cc1f0001b61cb5bea5acd414b0ec2be587708d,0,0,0,0,0,0,0,0,...,,,,english,,A Lebanese-born Sydney father has spent 10 mon...,,2015-10-26T02:00:00.000+02:00,2015-10-26T12:59:48.915+02:00,
9,[Finger Lake Times],c2487aab68c380ef6de6ed674b675c4e7dca55d6,0,0,0,0,0,0,0,0,...,,,,english,,Published By: Finger Lake Times - Today \nSENE...,,2015-10-18T21:00:00.000+03:00,2015-10-18T23:21:55.348+03:00,
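
The row-count check can also be done in code rather than by eye. This is a small sketch: the helper name `check_row_count` is hypothetical, and the toy dataframe and file list merely stand in for `final_df` and `file_list`.

```python
import pandas as pd


def check_row_count(df, filenames):
    """Return True when the dataframe has exactly one row per file name."""
    return len(df) == len(filenames)


# Toy stand-ins for final_df and file_list (hypothetical values)
toy_files = ['a.json', 'b.json']
toy_df = pd.DataFrame({'uuid': ['a', 'b']})

print(check_row_count(toy_df, toy_files))
```

In the notebook itself, the equivalent call would be `check_row_count(final_df, file_list)`.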


Now we will export the dataframe as a csv file, so we can access it going forward. Note that `sep='\t'` makes the file tab-separated despite its .csv extension, and that the file will be saved in the same folder as this notebook and the original downloaded data.

In [54]:
final_df.to_csv('politicalnews.csv', sep='\t')
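
Because the export uses a tab separator, the same separator has to be passed when the file is read back in. The round trip below is a minimal sketch on a toy dataframe written to a temporary folder; the file name `example.csv` and the toy values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for final_df (hypothetical values)
toy_df = pd.DataFrame({'uuid': ['a', 'b'], 'language': ['english', 'english']})

# Write with a tab separator, mirroring the export above
path = os.path.join(tempfile.mkdtemp(), 'example.csv')
toy_df.to_csv(path, sep='\t')

# Read back with the same separator; index_col=0 restores the saved row index
restored = pd.read_csv(path, sep='\t', index_col=0)
print(restored)
```

Reading without `index_col=0` would instead add the saved index as an extra column named `Unnamed: 0`. Also note that any list-valued cells are written out as text, so they come back as strings rather than lists.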