# Part 1. Import Tweet JSON ZIP Files and Create DataFrame
### This Jupyter notebook loads the extracted .jsonl.gz files and creates a DataFrame from them.

The .jsonl.gz files are produced by running the hydrate.py file from the GitHub repo [us-pres-elections-2020](https://github.com/echen102/us-pres-elections-2020). The hydrate.py file uses the twarc package to hydrate tweet IDs, that is, to fetch the full tweet JSON for each ID. These IDs were originally collected by GitHub user [echen102](https://github.com/echen102), and the tweets are all associated with the 2020 US Presidential Election.

In [1]:
# load packages 
import pandas as pd
import numpy as np

# load packages to parse json and decompress .gz files
import json
import gzip

# load package to measure the execution time of the function
import time

# load package for matching file paths by pattern in order to load multiple json files
import glob

### GZIP Package
Reading files through the gzip package is slow. Decompressing each .gz file into a .jsonl file on disk first would let Python load the data much more quickly. However, I do not want to decompress the files onto my computer, because they would take up far more storage space than I have. I therefore use gzip to decompress each file inside Python as the data is loaded.
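As a minimal sketch of the approach described above, the snippet below reads a gzipped JSONL file line by line without writing a decompressed copy to disk. The helper name `iter_jsonl_gz`, the demo file, and the `id` and `full_text` fields are hypothetical and exist only so the example runs on its own; they are not part of hydrate.py or of this notebook's data.

```python
import gzip
import json
import os
import tempfile


def iter_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a gzipped JSONL file."""
    # 'rt' opens the compressed stream in text mode, so each line is a str
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny hypothetical file so the sketch is self-contained
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write('{"id": 1, "full_text": "first"}\n')
    f.write('{"id": 2, "full_text": "second"}\n')

records = list(iter_jsonl_gz(demo_path))
print(len(records))
```

Because the function is a generator, only one line is held in memory at a time until the caller decides to collect the results.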

In [3]:
# function for loading one json lines file into a list of dicts

def load_json(file):
    data = []
    with open(file) as f:
        # each line of the file holds one JSON object (one tweet)
        for jsonObj in f:
            dataDict = json.loads(jsonObj)
            data.append(dataDict)
    return data


### Sample json files
Each original json file contains about 100,000 tweets, with one file per day. I decided to sample 100 tweets from each file for two reasons. First, reading through gzip is slow, and processing 100,000 tweets per file would take a very long time. Second, my laptop does not have enough storage space to hold 100,000 tweets per day for five months. It is also important to work with a smaller dataset before working in the cloud.
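One possible way to draw a fixed-size sample without loading a whole file into memory is reservoir sampling over the gzip stream. This is a swapped-in technique shown only as a sketch, not the method used in this notebook; the function name `sample_jsonl_gz`, the `seed` parameter, and the demo data are assumptions.

```python
import gzip
import json
import os
import random
import tempfile


def sample_jsonl_gz(path, n=100, seed=None):
    """Return up to n randomly chosen JSON objects from a gzipped JSONL file."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0  # number of non-empty lines read so far
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            if seen < n:
                # fill the reservoir with the first n lines
                reservoir.append(line)
            else:
                # replace an existing entry with probability n / (seen + 1)
                j = rng.randint(0, seen)
                if j < n:
                    reservoir[j] = line
            seen += 1
    # parse only the lines that were kept
    return [json.loads(line) for line in reservoir]


# Hypothetical demo file with 1,000 small records
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    for i in range(1000):
        f.write(json.dumps({"id": i}) + "\n")

sampled = sample_jsonl_gz(demo_path, n=100, seed=0)
print(len(sampled))
```

Only the sampled lines are parsed with `json.loads`, so the cost of JSON decoding is paid for 100 tweets per file rather than for every tweet, although the whole file still has to be decompressed once.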

In [10]:
# Function for opening multiple .jsonl.gz files and combining their contents into one DataFrame
def jsonl_to_DF(path):

    
    startTime = time.time()
        
    
    filenames = glob.glob(path + "/*.jsonl.gz")

    dfs = []
    
    # open each gzip file in text mode and read its json lines into a DataFrame
    # sample(n=...) keeps at most 100 randomly chosen lines (tweets) from each file
    for filename in filenames:
        with gzip.open(filename, 'rt', encoding='utf-8') as f:
            day_df = pd.read_json(f, lines=True)
        dfs.append(day_df.sample(n=min(100, len(day_df))))
        
    # Concatenate all data into one DataFrame
    df = pd.concat(dfs, ignore_index=True)

    # save dataframe as .csv, name of file associated with the folder name: 2020-10, 2020-09, 2020-08, 2020-07, 2020-06
    file = path[-7:] + '.csv'
    df.to_csv(file)
    
    # report the execution time of running the function
    executionTime = (time.time() - startTime) 
    return 'Execution time in seconds: ' + str(executionTime)

In [6]:
# Example of path:
# path =r'/Users/user_name/Desktop/Tweet_Data/2020-08'


### Run function to transform unstructured jsonl.gz files into a structured DataFrame
#### DataFrame will be saved as a .csv file

In [None]:
jsonl_to_DF(r'/Tweet_Data/2020-06')
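When the saved .csv file is read back later, it may help to remember that `DataFrame.to_csv` writes the row index as an unnamed first column by default. The sketch below uses a small hypothetical DataFrame and a temporary file, not the real tweet data, to show the difference `index_col=0` makes on reload.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the tweet DataFrame
df = pd.DataFrame({"id": [1, 2], "full_text": ["first", "second"]})

csv_path = os.path.join(tempfile.mkdtemp(), "2020-06.csv")
df.to_csv(csv_path)  # default index=True writes row labels as the first column

# Without index_col, the saved index comes back as an extra 'Unnamed: 0' column
naive = pd.read_csv(csv_path)

# With index_col=0, the first column is used as the index again
restored = pd.read_csv(csv_path, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Passing `index=False` to `to_csv` would be an alternative if the row labels are not needed.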