# Loading Data & Creating Exported Dataframe

The <b>purpose</b> of this notebook is to provide a class that compiles each JSON file into a single dataframe and exports it. This makes later data cleaning and manipulation easier.

## Libraries

In [12]:
import pandas as pd
import numpy as np
from pprint import pprint
import spacy
import json
import os
import time

## Creating A Class For Each JSON File

Based on the short loop below, we'll need a series of functions to open and clean each file, since "os.listdir" returns only the file names, not the file contents.

In [14]:
counter = 0

for file in os.listdir('technews1'):
    print(file)
        
    counter += 1 
    if counter >= 4: break 

news_0014010.json
news_0008033.json
news_0000403.json
news_0004913.json
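
Because the listing yields names only, each name has to be joined to its folder and passed to "json.load" before any content is available. A minimal sketch of that step, assuming a throwaway folder and a made-up file `news_demo.json` created purely for illustration (neither is part of the dataset):

```python
import json
import os
import tempfile

# Hypothetical setup: a temporary folder holding one small JSON file.
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "news_demo.json"), "w") as f:
    json.dump({"title": "demo", "thread": {"site": "example.com"}}, f)

# os.listdir gives names only; json.load is what actually reads the content.
for name in os.listdir(folder):
    path = os.path.join(folder, name)
    with open(path) as f:
        record = json.load(f)
    print(name, "->", sorted(record.keys()))
```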


### The Class

The <b>class 'data_file'</b> simplifies importing, flattening, lightly cleaning, and building a dataframe from a given file. This saves us from writing separate "for" loops to import and clean the data, and from joining multiple dataframes.

In [15]:
class data_file(object):
    """Imports, flattens, lightly cleans, and tabulates a single JSON file."""

    def __init__(self, filename):
        """Store the file name; the data is loaded later by open_file()."""
        self.filename = filename
        self.data = None

    def open_file(self):
        """Open the JSON file and load its content into self.data."""
        with open('./data1/' + self.filename) as f:
            self.data = json.load(f)

    @staticmethod
    def flatten_dict(dd, separator='_', prefix=''):
        """Recursively flatten nested dictionaries, joining keys with the separator."""
        return { prefix + separator + k if prefix else k : v
                 for kk, vv in dd.items()
                 for k, v in data_file.flatten_dict(vv, separator, kk).items()
                 } if isinstance(dd, dict) else { prefix : dd }

    def flatten_data(self):
        """Replace self.data with its flattened version."""
        self.data = data_file.flatten_dict(self.data)

    def emptyvals(self):
        """Set empty lists and empty strings to None; wrap multi-item lists in a
        list so that each one occupies a single dataframe cell."""
        for key in self.data.keys():
            if self.data[key] in [[], '']:
                self.data[key] = None
            elif isinstance(self.data[key], list) and len(self.data[key]) > 1:
                self.data[key] = [self.data[key]]

    def dataframetable(self):
        """Return a one-row dataframe built from the dictionary."""
        return pd.DataFrame(self.data, index=[0])
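
To make the flattening and empty-value handling concrete, here is a small sketch that applies the same two ideas to a made-up record. The standalone functions `flatten` and `clean` are hypothetical stand-ins that mirror what "flatten_dict" and "emptyvals" do; the record and its values are invented for illustration.

```python
def flatten(d, separator="_", prefix=""):
    """Recursively flatten nested dictionaries, joining keys with `separator`."""
    flat = {}
    for key, value in d.items():
        full_key = prefix + separator + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, separator, full_key))
        else:
            flat[full_key] = value
    return flat


def clean(flat):
    """Replace empty lists/strings with None; wrap multi-item lists in a list."""
    cleaned = {}
    for key, value in flat.items():
        if value == [] or value == "":
            cleaned[key] = None
        elif isinstance(value, list) and len(value) > 1:
            cleaned[key] = [value]
        else:
            cleaned[key] = value
    return cleaned


# Made-up record with nesting, an empty list, an empty string, and a two-item list.
record = {
    "uuid": "abc",
    "thread": {"social": {"gplus": {"shares": 0}}},
    "persons": ["A", "B"],
    "locations": [],
    "title": "",
}

flat = flatten(record)
result = clean(flat)
print(result)
```

The nested keys collapse into one name per leaf value (for example `thread_social_gplus_shares`), which is where the long column names in the final dataframe come from.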

## Setting Up The Loop For JSON Files

We run the class on the first file in the list of all JSON files because its dataframe serves as the "starting point": each additional JSON file is then added to it.

In [32]:
file_list = os.listdir('./technews/')

In [33]:
file_list[:10]

['news_0014010.json',
 'news_0008033.json',
 'news_0000403.json',
 'news_0004913.json',
 'news_0018930.json',
 'news_0005601.json',
 'news_0011212.json',
 'news_0008463.json',
 'news_0014440.json',
 'news_0013385.json']

We assign the first JSON file in the folder to a variable, to which we'll apply the class of functions.

In [34]:
temp_df = data_file(file_list[0])

Running the class of functions on the first JSON file in the list of all JSON files.

In [35]:
temp_df.open_file()
temp_df.flatten_data()
temp_df.emptyvals()
temp_df = temp_df.dataframetable()

## Creating The Dataframe

Using the very first JSON file, we'll create a variable holding the columns to include in the final dataframe. This is the equivalent of creating the "backbone" of the object; each additional JSON file then adds its data as a new row.

In [36]:
colname = temp_df.columns.tolist()

### Warning! 

The cell below will take some time to run!

#### Steps

- We start by creating an <b>empty dataframe</b>, a <b>counter</b> to track the loop's progress, and a <b>time stamp</b> to measure elapsed time
- Create a "for" loop applying the class of functions to each JSON file
- Print the elapsed time for every 10,000 files
- Delete the temporary object to free memory
- Finally, print the total amount of time the loop took to run

In [37]:
final_df = pd.DataFrame(columns= colname)
counter = 0
start_time = time.time()

for filename in file_list:
    temp_df = data_file(filename)
    temp_df.open_file()
    temp_df.flatten_data()
    temp_df.emptyvals()
    final_df.loc[len(final_df)] = temp_df.data
    
    counter += 1
    
    if counter % 10000 == 0:
        print("There have been {} files read so far".format(counter))
        print("Time elapsed: {}".format(time.time() - start_time))
        
    del temp_df
    
print("Operation complete after {} seconds.".format(time.time()-start_time))

There have been 10000 files read so far
Time elapsed: 1162.9778900146484
Operation complete after 2868.747946023941 seconds.
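
Adding one row at a time with `final_df.loc[len(final_df)] = ...` tends to get slower as the dataframe grows, which likely contributes to the run time above. A common alternative, sketched here under assumptions, is to collect the per-file dictionaries in a list and build the dataframe once at the end. The `records` list and its contents are hypothetical stand-ins for the cleaned dictionaries produced per file; this was not timed against the original loop.

```python
import pandas as pd

# Hypothetical stand-ins for the flattened, cleaned dictionaries (one per file).
records = [
    {"uuid": "a1", "language": "english", "persons": None},
    {"uuid": "b2", "language": "english", "persons": ["Clinton"]},
    {"uuid": "c3", "language": "english", "title": "extra key"},
]

rows = []
for record in records:
    # In the notebook, this would be the dictionary held in `temp_df.data`.
    rows.append(record)

# Build the dataframe once; keys missing from a record are left empty.
demo_df = pd.DataFrame(rows)
print(demo_df.shape)
```

With this approach, list values stay in a single cell without extra wrapping, and passing `columns=colname` to `pd.DataFrame` should restrict the result to the columns of the first file if that is the intent.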


We will now look at the dataframe as a whole to visually check that all of the rows were added: the row count should equal the number of files in the original downloaded data.

In [38]:
final_df

Unnamed: 0,organizations,uuid,thread_social_gplus_shares,thread_social_pinterest_shares,thread_social_vk_shares,thread_social_linkedin_shares,thread_social_facebook_likes,thread_social_facebook_shares,thread_social_facebook_comments,thread_social_stumbledupon_shares,...,entities_locations,entities_organizations,highlightText,language,persons,text,external_links,published,crawled,highlightTitle
0,[Anchorage Daily News],f4ad43deab0a72726d6165b37a971c578efdd4f5,0,0,0,0,0,0,0,0,...,,,,english,,Published By: Anchorage Daily News - Today \nP...,,2015-10-19T08:06:00.000+03:00,2015-10-19T09:23:00.540+03:00,
1,[Instagram Takeover Katy Perry],aa01573d89a949a310f069a8e1a4cb4a0595219c,0,0,0,0,0,0,0,0,...,,,,english,"[[Hillary Clinton, Katy Perry]]",Katy Perry Shows Her Support for Hillary Clint...,,2015-10-25T05:00:00.000+02:00,2015-10-25T06:24:55.144+02:00,
2,,be02beb2a6e3b83cd26debc2b7012073a539c691,0,0,0,0,0,0,0,0,...,,,,english,,"( Source : City of Carrollton, TX ) Enjoy a ni...",,2015-10-23T03:00:00.000+03:00,2015-10-23T03:10:51.192+03:00,
3,,1c20fd96c76b6e50168f4b76db3ebdea6ce14ac9,0,0,0,0,0,0,0,0,...,,,,english,,OPINION\nReturn to video Video settings Please...,[[http://media.smh.com.au/video-news/video-nat...,2015-10-22T10:47:00.000+03:00,2015-10-26T07:49:05.006+02:00,
4,,472f1f1983bb7e35ddaa30bdf52754840e236722,0,0,0,0,0,0,0,0,...,,,,english,,Published By: Louisville Courier-Journal: Spor...,,2015-10-12T21:09:00.000+03:00,2015-10-12T23:50:48.817+03:00,
5,,cba9ec1d6ef684b4a14eb2167243df4cf8d462bf,0,0,0,0,0,0,0,0,...,,,,english,[Buemi],Published By: Fox Sports: Motor - Today \nFull...,,2015-10-24T12:32:00.000+03:00,2015-10-24T12:44:05.602+03:00,
6,,8dfe83ddc37107aed3e2024229e2b71b899f7381,0,0,0,0,0,0,0,0,...,,,,english,,By JOSH LEDERMAN and JULIE PACE\nAssociated Pr...,"[[http://twitter.com/jpaceDC, http://twitter.c...",2015-10-21T03:00:00.000+03:00,2015-10-22T07:52:57.663+03:00,
7,,001141aa1a5cf56994e687eb682cb5beba552edd,0,0,0,0,0,0,0,0,...,,,,english,,Published By: Louisville Courier-Journal: Spor...,,2015-10-19T02:10:00.000+03:00,2015-10-19T02:35:01.214+03:00,
8,[Democratic],acb7698e0b9aecdb7380e2220951798f2f999581,0,0,0,0,0,0,0,0,...,,,,english,[Clinton],Overnight ratings point to Democratic debate r...,,2015-10-14T20:29:00.000+03:00,2015-10-14T20:34:02.600+03:00,
9,,71e717f6a98bc29a9e6d441c457c220ad7e6f892,0,0,0,0,0,0,0,0,...,,,,english,,( Source : Southern Illinois University System...,,2015-10-17T03:00:00.000+03:00,2015-10-17T03:06:46.095+03:00,
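
The visual check can be backed by a direct comparison of the two counts. A minimal sketch, using hypothetical stand-ins (`demo_files`, `demo_df`) for `file_list` and `final_df`:

```python
import pandas as pd

# Hypothetical stand-ins for `file_list` and `final_df` from the cells above.
demo_files = ["news_0000001.json", "news_0000002.json", "news_0000003.json"]
demo_df = pd.DataFrame({"uuid": ["a1", "b2", "c3"]})

# One row per file is expected, so the two counts should agree.
rows_match = len(demo_df) == len(demo_files)
print("Rows: {}, files: {}, match: {}".format(len(demo_df), len(demo_files), rows_match))
```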


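Now we will export the dataframe to a file so we can access it going forward. Note that although the file is named 'technews.csv', the argument sep="\t" makes it tab-separated, so the same separator is needed when reading it back. The file will be saved in the same folder as this notebook and the original downloaded data.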

In [39]:
final_df.to_csv('technews.csv',  sep="\t")
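
For later notebooks, reading the file back needs the same separator. The sketch below round-trips a small hypothetical dataframe (`demo_df`, written to a temporary `demo.csv`) rather than the real export, and shows one thing to watch for: list values come back as plain strings.

```python
import os
import tempfile

import pandas as pd

# Hypothetical small frame standing in for `final_df`.
demo_df = pd.DataFrame({"uuid": ["a1", "b2"], "persons": [None, ["Clinton"]]})

path = os.path.join(tempfile.mkdtemp(), "demo.csv")
demo_df.to_csv(path, sep="\t")

# Same separator on the way back in; index_col=0 restores the saved index
# instead of producing an "Unnamed: 0" column.
loaded = pd.read_csv(path, sep="\t", index_col=0)
print(loaded.columns.tolist())
print(type(loaded.loc[1, "persons"]))
```

Since the lists are stored as text, columns such as `persons` would need to be parsed again (for example with `ast.literal_eval`) if list behaviour is required downstream.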

# End