# Loading Data & Creating Exported Dataframe

The <b>purpose</b> of this notebook is to provide a class that loads and flattens each JSON file, compile all of the files into one dataframe, and export it. This will make data cleaning and manipulation easier later.

## Libraries

In [24]:
import pandas as pd
import numpy as np
from pprint import pprint
import spacy
import json
import os
import time

## Creating A Class For Each JSON File

Based on the short loop below, we'll need a series of functions to open and clean each file, since "os.listdir" only returns the file names, not the file contents.

In [106]:
counter = 0

for file in os.listdir('popular2'):
    print(file)
        
    counter += 1 
    if counter >= 4: break 

news_0047188.json
news_0014010.json
news_0042670.json
news_0069520.json
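
As a minimal sketch of the point above: listing a folder gives names only, and each file still has to be opened and parsed separately. The temporary folder and the two sample files are hypothetical stand-ins for 'popular2', so the example can run anywhere.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the 'popular2' folder.
folder = tempfile.mkdtemp()
for name, title in [("news_0000001.json", "First"), ("news_0000002.json", "Second")]:
    with open(os.path.join(folder, name), "w") as f:
        json.dump({"title": title, "thread": {"uuid": name[:-5]}}, f)

# os.listdir returns file names only, in arbitrary order; sort for reproducibility.
names = sorted(n for n in os.listdir(folder) if n.endswith(".json"))

# The content has to be read separately, one file at a time.
with open(os.path.join(folder, names[0])) as f:
    first = json.load(f)

print(names)
print(first["title"])
```

Note that the loaded content is a nested dictionary ("thread" holds another dictionary), which is why a flattening step is needed before building a dataframe.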


### The Class

The <b>class 'data_file'</b> simplifies importing, flattening, lightly cleaning, and forming a dataframe from a given file. This saves us from writing separate "for" loops to import and clean the data, and from joining multiple dataframes.

In [107]:
class data_file(object):
    """Load one JSON file, flatten it, lightly clean it, and return a one-row dataframe."""

    def __init__(self, filename):
        self.filename = filename
        self.data = None

    def open_file(self):
        """Open the JSON file and store its content as a dictionary."""
        with open('./popular2/' + self.filename) as f:
            self.data = json.load(f)

    @staticmethod
    def flatten_dict(dd, separator='_', prefix=''):
        """Flatten nested dictionaries, joining nested keys with the separator."""
        return { prefix + separator + k if prefix else k : v
                 for kk, vv in dd.items()
                 for k, v in data_file.flatten_dict(vv, separator, kk).items()
                 } if isinstance(dd, dict) else { prefix : dd }

    def flatten_data(self):
        """Replace the loaded data with its flattened version."""
        self.data = data_file.flatten_dict(self.data)

    def emptyvals(self):
        """Set empty lists and empty strings to None, and wrap longer lists."""
        for key in self.data.keys():
            if self.data[key] in [[], '']:
                self.data[key] = None
            elif isinstance(self.data[key], list) and len(self.data[key]) > 1:
                # A list with several items is wrapped in an outer list so that
                # it occupies a single cell of the one-row dataframe.
                self.data[key] = [self.data[key]]

    def dataframetable(self):
        """Create a one-row dataframe from the dictionary."""
        return pd.DataFrame(self.data, index=[0])
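
To make the flattening step concrete, here is a small worked example. It is a sketch, not the class's own code: the function "flatten" is a hypothetical loop-based rewrite that should give the same keys as the helper in the class, and the sample dictionary is made up to resemble the structure of the news files.

```python
def flatten(nested, separator="_", prefix=""):
    """Flatten nested dictionaries into one level, joining keys with `separator`."""
    if not isinstance(nested, dict):
        # Reached a plain value: store it under the key path built so far.
        return {prefix: nested}
    flat = {}
    for key, value in nested.items():
        new_key = prefix + separator + key if prefix else key
        flat.update(flatten(value, separator, new_key))
    return flat


# Made-up sample resembling one news record.
sample = {
    "title": "Example headline",
    "thread": {"uuid": "abc", "social": {"vk": {"shares": 0}}},
    "entities": {"persons": []},
}

flat = flatten(sample)
print(flat)
```

Nested keys are joined with an underscore, which is where column names such as "thread_social_vk_shares" and "entities_persons" come from. Lists are kept as values and are not flattened further.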

## Setting Up The Loop For JSON Files

We run the class on the first file in the list of all JSON files because it will serve as the "starting point" dataframe for each additional file. In this sense, we'll be adding to the dataframe with each new JSON file. 

In [108]:
file_list = os.listdir('./popular2/')

In [109]:
file_list[:10]

['news_0047188.json',
 'news_0014010.json',
 'news_0042670.json',
 'news_0069520.json',
 'news_0038485.json',
 'news_0075503.json',
 'news_0023363.json',
 'news_0008033.json',
 'news_0059196.json',
 'news_0088866.json']

We assign the first JSON file in the folder to a variable, to which we'll apply the class of functions.

In [110]:
temp_df = data_file(file_list[0])

Running the class of functions on the first JSON file in the list of all JSON files.

In [111]:
temp_df.open_file()
temp_df.flatten_data()
temp_df.emptyvals()
temp_df = temp_df.dataframetable()
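
The sketch below illustrates, on a made-up flat dictionary, what the last two steps do: empty values become None, a list with several items is wrapped so it stays in one cell, and the result becomes a one-row dataframe. The dictionary contents and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Made-up flat record, as it might look after flattening.
flat = {
    "title": "Example headline",
    "author": "",
    "external_links": ["http://a.example", "http://b.example"],
}

row = {}
for key, value in flat.items():
    if value in ([], ""):
        row[key] = None          # empty list or empty string
    elif isinstance(value, list) and len(value) > 1:
        row[key] = [value]       # wrap so the whole list sits in one cell
    else:
        row[key] = value

one_row = pd.DataFrame(row, index=[0])
print(one_row.shape)
print(one_row.loc[0, "external_links"])
```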

## Creating The Dataframe

Using the very first JSON file, we'll create a variable for the columns to be included in the final dataframe. This is the equivalent of creating the "backbone" of the dataframe; each additional JSON file then adds its data as a new row.

In [112]:
colname = temp_df.columns.tolist()

#### Steps

- We start by creating a <b>list</b> to hold the results, a <b>counter</b> to track the loop's progress, and a <b>time stamp</b> to measure how long it takes
- Create a "for" loop applying the class of functions to each JSON file
- Print the elapsed time every 10,000 files
- Delete the temporary object after each file to free memory
- Finally, print the total amount of time the loop took to run
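
The steps above can be sketched as a single function. This is a simplified illustration under stated assumptions: "load_records" and its "report_every" parameter are hypothetical names, the flattening and empty-value handling are left out to keep it short, and a temporary folder with five tiny files stands in for the real data.

```python
import json
import os
import tempfile
import time


def load_records(folder, report_every=10000):
    """Read every .json file in `folder` into a list of dicts, printing progress."""
    records = []
    start_time = time.time()
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".json"):
            continue  # skip anything that is not a JSON file
        with open(os.path.join(folder, name)) as f:
            records.append(json.load(f))
        if len(records) % report_every == 0:
            print("{} files read, {:.2f} s elapsed".format(
                len(records), time.time() - start_time))
    print("Operation complete after {:.2f} seconds.".format(time.time() - start_time))
    return records


# Hypothetical demo data: five small files in a temporary folder.
demo_folder = tempfile.mkdtemp()
for i in range(5):
    with open(os.path.join(demo_folder, "news_{:07d}.json".format(i)), "w") as f:
        json.dump({"uuid": str(i)}, f)

records = load_records(demo_folder, report_every=2)
```

Appending to a list means no placeholder values have to be created up front, and the list length doubles as the progress counter.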

In [113]:
len(file_list)

89659

In [114]:
final_df = list(range(0,len(file_list)))

In [115]:

counter = 0
start_time = time.time()

for filename in file_list:
    temp_df = data_file(filename)
    temp_df.open_file()
    temp_df.flatten_data()
    temp_df.emptyvals()
    final_df[counter] = temp_df.data
    
    counter += 1
    
    if counter % 10000 == 0:
        print("There have been {} files read so far".format(counter))
        print("Time elapsed: {}".format(time.time() - start_time))
        
    del temp_df
    
print("Operation complete after {} seconds.".format(time.time()-start_time))

There have been 10000 files read so far
Time elapsed: 14.17127513885498
There have been 20000 files read so far
Time elapsed: 31.045186042785645
There have been 30000 files read so far
Time elapsed: 44.850319147109985
There have been 40000 files read so far
Time elapsed: 63.95774602890015
There have been 50000 files read so far
Time elapsed: 76.82732796669006
There have been 60000 files read so far
Time elapsed: 88.94894218444824
There have been 70000 files read so far
Time elapsed: 102.82191324234009
There have been 80000 files read so far
Time elapsed: 114.26567101478577
Operation complete after 127.98632192611694 seconds.


We will now look at the dataframe as a whole to check that all of the rows were added: the row count should equal the number of files in the original downloaded data.

In [116]:
df = pd.DataFrame(final_df)

In [117]:
df

Unnamed: 0,author,crawled,entities_locations,entities_organizations,entities_persons,external_links,highlightText,highlightTitle,language,locations,...,thread_social_stumbledupon_shares,thread_social_vk_shares,thread_spam_score,thread_title,thread_title_full,thread_url,thread_uuid,title,url,uuid
0,Agence France-Presse,2017-03-20T04:30:16.027+02:00,"[{'name': 'us', 'sentiment': 'none'}]","[[{'name': 'cbs news', 'sentiment': 'none'}, {...","[[{'name': 'julia', 'sentiment': 'negative'}, ...",,,,english,,...,0,0,0.000,"Julia, the newest resident of ‘Sesame Street’,...","Julia, the newest resident of ‘Sesame Street’,...",http://www.scmp.com/news/world/united-states-c...,5a37b792d632090c157e0f82705f45fc70af4775,"Julia, the newest resident of ‘Sesame Street’,...",http://www.scmp.com/news/world/united-states-c...,5a37b792d632090c157e0f82705f45fc70af4775
1,,2017-03-07T20:37:39.015+02:00,,,,,,,english,,...,0,0,0.789,"Jewish community centers, schools close amid n...","Jewish community centers, schools close amid n...",https://www.peters.senate.gov/newsroom/press-r...,5256ad1f0ff328988421bc0968a44b6a8b45c4f9,"Jewish community centers, schools close amid n...",https://www.peters.senate.gov/newsroom/press-r...,5256ad1f0ff328988421bc0968a44b6a8b45c4f9
2,Jen,2017-03-17T00:51:47.366+02:00,,"[{'name': 'facebook', 'sentiment': 'none'}]","[[{'name': 'jen mills', 'sentiment': 'neutral'...",,,,english,,...,0,0,0.000,Why could that naked man be hiding outside on ...,Why could that naked man be hiding outside on ...,http://metro.co.uk/2017/03/16/naked-lover-film...,e403a77dc7e5232e65fac97948042145752dfe4a,Why could that naked man be hiding outside on ...,http://metro.co.uk/2017/03/16/naked-lover-film...,e403a77dc7e5232e65fac97948042145752dfe4a
3,"<a href=""https://www.washingtonpost.com/people...",2017-03-23T00:30:17.689+02:00,,,,,,,english,,...,0,0,0.000,Legislative rollback of Obama-era worker safet...,Legislative rollback of Obama-era worker safet...,https://www.washingtonpost.com/news/politics/w...,2cd6516e31392c36461d525ba50cc835f175b3c1,Legislative rollback of Obama-era worker safet...,https://www.washingtonpost.com/news/politics/w...,2cd6516e31392c36461d525ba50cc835f175b3c1
4,,2017-03-16T15:01:03.085+02:00,"[[{'name': 'ranchi', 'sentiment': 'none'}, {'n...","[{'name': 'zee media bureau', 'sentiment': 'no...","[[{'name': 'virat kohli', 'sentiment': 'negati...",,,,english,,...,0,0,0.236,"Virat Kohli picks up injury in Ranchi, Ajinkya...","Virat Kohli picks up injury in Ranchi, Ajinkya...",http://zeenews.india.com/cricket/virat-kohli-p...,4422a136ba4cc13bf8c5f38ebec760e322428dab,"Virat Kohli picks up injury in Ranchi, Ajinkya...",http://zeenews.india.com/cricket/virat-kohli-p...,4422a136ba4cc13bf8c5f38ebec760e322428dab
5,Alex Carlile,2017-03-24T05:36:58.154+03:00,"[[{'name': 'britain', 'sentiment': 'none'}, {'...","[{'name': 'parliament', 'sentiment': 'none'}]","[[{'name': 'ellwood', 'sentiment': 'none'}, {'...",[http://17909.cdx.c.ooyala.com/p4cnVnYTE6uIVSs...,,,english,,...,0,0,0.003,"After the Westminster attack, here is what we ...","After the Westminster attack, here is what we ...",http://www.telegraph.co.uk/news/2017/03/23/wes...,139115c35aa0b9ae4e3030e523595196bd3707b7,"After the Westminster attack, here is what we ...",http://www.telegraph.co.uk/news/2017/03/23/wes...,139115c35aa0b9ae4e3030e523595196bd3707b7
6,lem1,2017-03-10T08:19:24.914+02:00,"[[{'name': 'jolo', 'sentiment': 'none'}, {'nam...","[[{'name': 'defense', 'sentiment': 'negative'}...","[[{'name': 'delfin lorenzana', 'sentiment': 'n...",,,,english,,...,0,0,0.000,"Re: For the Defense chief, Abu Sayyaf the bigg...","Re: For the Defense chief, Abu Sayyaf the bigg...",http://newsinfo.inquirer.net/879491/for-the-de...,31953ab5502659ac8e7126187f77866fb958157d,"Re: For the Defense chief, Abu Sayyaf the bigg...",http://newsinfo.inquirer.net/879491/for-the-de...,31953ab5502659ac8e7126187f77866fb958157d
7,,2017-03-01T04:12:52.665+02:00,,"[[{'name': 'ft group markets', 'sentiment': 'n...",,,,,english,,...,0,0,0.216,Subscribe to read,Subscribe to read,https://www.ft.com/content/3d8910fe-fe1b-11e6-...,edcb28cb8a2b5043c2b192378e1d3d563f473f72,Subscribe to read,https://www.ft.com/content/3d8910fe-fe1b-11e6-...,edcb28cb8a2b5043c2b192378e1d3d563f473f72
8,Jennifer Williams,2017-03-21T08:47:46.905+02:00,"[[{'name': 'birmingham city', 'sentiment': 'no...","[[{'name': 'manchester', 'sentiment': 'negativ...","[[{'name': 'gary neville', 'sentiment': 'none'...",,,,english,,...,0,0,0.000,She's about to become Manchester's most powerf...,She's about to become Manchester's most powerf...,http://www.manchestereveningnews.co.uk/news/gr...,5ccf4477e1b130ea733a1afe5d16f8df5ea50ba0,She's about to become Manchester's most powerf...,http://www.manchestereveningnews.co.uk/news/gr...,5ccf4477e1b130ea733a1afe5d16f8df5ea50ba0
9,Michael McLaughlin,2017-03-31T03:01:54.340+03:00,"[{'name': 'florida', 'sentiment': 'none'}]","[[{'name': 'falcon', 'sentiment': 'none'}, {'n...","[{'name': 'michael mclaughlin', 'sentiment': '...",,,,english,,...,0,0,0.000,SpaceX Launches And Lands The First Recycled O...,SpaceX Launches And Lands The First Recycled O...,http://www.huffingtonpost.com/entry/spacex-rec...,8d49701cd75042a1f8fc188202c873de8cc70570,SpaceX Launches And Lands The First Recycled O...,http://www.huffingtonpost.com/entry/spacex-rec...,8d49701cd75042a1f8fc188202c873de8cc70570
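
The row-count check described above can also be done programmatically rather than by eye. The sketch below uses a made-up list of three records in place of the real data; "records" and "df_check" are hypothetical names.

```python
import pandas as pd

# Made-up records standing in for the flattened JSON files.
records = [
    {"title": "A", "author": "x"},
    {"title": "B"},                  # 'author' key missing on purpose
    {"title": "C", "author": None},
]

df_check = pd.DataFrame(records)

# One row per record: this fails loudly if any record was dropped.
assert len(df_check) == len(records)

print(df_check.shape)
print(df_check["author"].isna().sum())
```

Records that lack a key still produce a row; the missing value simply shows up as empty in that column.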


Now we will export the dataframe as a CSV file so we can access it going forward. Note that the file will be saved in the same folder as this notebook and the original downloaded data.

In [92]:
df.to_csv('popular2.csv', sep=",")

# End