# Converting .jsonl to .csv

This notebook converts the .jsonl file of tweets to .csv to make the data easier to work with.

I pulled tweets from two Twitter accounts using Twarc to build the tweet corpora. The final csv for each account keeps only the tweets posted from that account's first COVID-19 tweet onwards.

The two Twitter accounts:

@Health_ZA: started reporting on COVID-19 on 27th Jan 2020.

@nicd_sa: started reporting on COVID-19 on 15th Jan 2020. 

## Import Relevant Libraries 

In [238]:
import json          # parse each line of the .jsonl file
import pandas as pd  # build the dataframes and write the .csv files

## Path and file name setup

In [239]:
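# Raw Twarc output, one JSON object per line
input_name_health = "Updated_1_healthza_tweets.jsonl"
input_name_nicd = "Updated_1_nicd_tweets.jsonl"
# Base names only: convertToCSV appends '.json' and '.csv' itself
unsliced_output_name_health = "unsliced_health_za_tweets"
unsliced_output_name_nicd = "unsliced_nicd_sa_tweets"
# Final output files, including the extension
sliced_output_name_health = "sliced_health_za_tweets.csv"
sliced_output_name_nicd = "sliced_nicd_sa_tweets.csv"

# Timestamp (UTC) of each account's first COVID-19 tweet
startDateHealth = "2020-01-27 15:21:18+00:00"
startDateNICD = "2020-01-15 10:00:08+00:00"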

## Data Processing

### Convert .jsonl to .csv
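The code block below flattens each tweet's nested JSON into a single-level dict (nested keys are joined with `/`), then writes the combined records to a .json file and a .csv file: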

In [240]:
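def flatten_dict(d_obj):
    """Yield (key, value) pairs from a nested dict, joining nested keys with '/'.

    Only dicts are expanded; lists and scalar values are yielded unchanged.
    """
    for k, v in d_obj.items():
        if isinstance(v, dict):
            # Prefix each child key with its parent key, then recurse
            new_dict = {'{}/{}'.format(k, k2): v2 for k2, v2 in v.items()}
            yield from flatten_dict(new_dict)
        else:
            yield k, v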

def convertToCSV(inputname, unslicedoutputname):
    """Flatten every tweet in a .jsonl file and write the result as .json and .csv."""
    dict_list = []

    # Tweets contain non-ASCII text, so read the file explicitly as UTF-8
    with open(inputname, encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            j_data = json.loads(line)
            dict_list.append(dict(flatten_dict(j_data)))

    # Write a single JSON file holding the list of flattened tweets
    with open(unslicedoutputname + '.json', 'w', encoding='utf-8') as new_f:
        new_f.write(json.dumps(dict_list))

    # Convert list of dicts into dataframe and send to csv
    df = pd.DataFrame(dict_list)
    df.to_csv(unslicedoutputname + '.csv', index=False, header=True)
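
To illustrate what the flattening produces, here is a minimal sketch that runs the same `flatten_dict` logic on a small made-up record. The `sample` dict is hypothetical and only loosely shaped like a tweet. For comparison it also calls `pd.json_normalize` with `sep="/"`, a built-in pandas alternative that this notebook does not use.

```python
import pandas as pd

def flatten_dict(d_obj):
    # Same logic as the helper above, repeated so the example runs on its own.
    for k, v in d_obj.items():
        if isinstance(v, dict):
            new_dict = {'{}/{}'.format(k, k2): v2 for k2, v2 in v.items()}
            yield from flatten_dict(new_dict)
        else:
            yield k, v

# Hypothetical record, not real tweet data.
sample = {
    "id": 1,
    "user": {"screen_name": "example", "entities": {"url": "https://example.org"}},
    "hashtags": ["a", "b"],
}

flat = dict(flatten_dict(sample))
print(flat)

# pandas equivalent: one row per record, nested keys joined with "/"
normalized = pd.json_normalize([sample], sep="/")
print(sorted(normalized.columns))
```

Nested dicts become keys such as `user/entities/url`, while the `hashtags` list is kept as a single value in both approaches.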

Run the function for each Twitter corpus.

In [241]:
convertToCSV(input_name_health, unsliced_output_name_health)
convertToCSV(input_name_nicd, unsliced_output_name_nicd)

### Read created csv

In [242]:
df_health = pd.read_csv(unsliced_output_name_health + ".csv", sep=",", engine='python')
df_nicd = pd.read_csv(unsliced_output_name_nicd + ".csv", sep=",", engine='python')

### Check the shape

In [243]:
df_health.shape

(3204, 346)

In [244]:
df_nicd.shape

(2022, 347)
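
The two frames differ by one column (346 vs 347). A set difference on the column names is one way to see which columns are not shared. The sketch below uses two tiny stand-in frames with made-up column names, since the real column lists are not shown here. In the notebook, `df_health` and `df_nicd` would take their place.

```python
import pandas as pd

# Stand-in frames with hypothetical columns; substitute df_health and df_nicd.
df_a = pd.DataFrame({"id": [1], "full_text": ["x"]})
df_b = pd.DataFrame({"id": [2], "full_text": ["y"], "place/name": ["z"]})

# Columns present in one frame but missing from the other
only_in_b = sorted(set(df_b.columns) - set(df_a.columns))
only_in_a = sorted(set(df_a.columns) - set(df_b.columns))
print("only in b:", only_in_b)
print("only in a:", only_in_a)
```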

### Convert "created_at" column to datetime format

In [245]:
df_health.created_at = pd.to_datetime(df_health.created_at, errors='coerce')
df_nicd.created_at = pd.to_datetime(df_nicd.created_at, errors='coerce')

Verify that the column now has a datetime dtype:

In [246]:
df_health.created_at.head(5)

0   2020-03-19 20:59:19+00:00
1   2020-03-19 20:58:53+00:00
2   2020-03-19 19:45:29+00:00
3   2020-03-19 18:57:11+00:00
4   2020-03-19 17:52:58+00:00
Name: created_at, dtype: datetime64[ns, UTC]

In [247]:
df_nicd.created_at.head(5)

0   2020-03-19 16:41:07+00:00
1   2020-03-19 13:17:53+00:00
2   2020-03-19 09:40:49+00:00
3   2020-03-19 09:40:48+00:00
4   2020-03-19 09:40:48+00:00
Name: created_at, dtype: datetime64[ns, UTC]
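
Because `errors='coerce'` silently turns unparseable values into `NaT`, it can be worth counting them after the conversion. A minimal sketch on made-up strings, assuming the Twitter v1.1 `created_at` layout (for example `Thu Mar 19 20:59:19 +0000 2020`). The explicit `format` and `utc=True` arguments are additions for this example and are not used in the cells above.

```python
import pandas as pd

# Made-up values: two valid timestamps and one that cannot be parsed.
raw = pd.Series([
    "Thu Mar 19 20:59:19 +0000 2020",
    "not a date",
    "Thu Mar 19 18:57:11 +0000 2020",
])

parsed = pd.to_datetime(
    raw,
    format="%a %b %d %H:%M:%S %z %Y",  # assumed Twitter v1.1 layout
    errors="coerce",                   # unparseable values become NaT
    utc=True,
)

# Number of rows that failed to parse
n_bad = int(parsed.isna().sum())
print(n_bad)
```

A non-zero count on the real data would mean some rows have no usable timestamp and could not be matched or filtered by date.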

### Find row where we want to slice the data

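The rows are ordered newest first, so the tweets we want to keep are the ones from the top of the frame down to the first COVID-19 tweet.

For Health_ZA, the first tweet about COVID-19 was posted on 2020-01-27 at 15:21 UTC.

For nicd_sa, the first tweet about COVID-19 was posted on 2020-01-15 at 10:00 UTC.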

In [248]:
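def FindRowNum(df, startdate):
    """Return the position of the first row whose created_at equals startdate.

    startdate must match str(timestamp) exactly, e.g. "2020-01-27 15:21:18+00:00".
    Returns None if no row matches.
    """
    for i, v in enumerate(df.created_at):
        if str(v) == startdate:
            return i
    return None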

In [249]:
row_num_health = FindRowNum(df_health, startDateHealth)
row_num_nicd = FindRowNum(df_nicd, startDateNICD)
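
Comparing `str(v)` to the start date only matches when the string layout is identical, and a miss returns `None`. A stricter variant, sketched here under the assumption that `created_at` is already timezone-aware, compares timestamps directly and raises on a miss. `find_row_position` and the `demo` frame are hypothetical names used only for this example.

```python
import pandas as pd

def find_row_position(df, start_date):
    """Return the position of the first row whose created_at equals start_date."""
    start = pd.Timestamp(start_date)
    # Positions (not index labels) of all matching rows
    hits = (df["created_at"] == start).to_numpy().nonzero()[0]
    if len(hits) == 0:
        raise ValueError(f"no tweet with created_at == {start}")
    return int(hits[0])

# Made-up timestamps, newest first like the real data.
demo = pd.DataFrame({
    "created_at": pd.to_datetime(
        [
            "2020-03-19 20:59:19+00:00",
            "2020-01-27 15:21:18+00:00",
            "2019-12-01 08:00:00+00:00",
        ],
        utc=True,
    )
})

pos = find_row_position(demo, "2020-01-27 15:21:18+00:00")
print(pos)
```

Raising instead of returning `None` makes a mistyped start date visible immediately rather than at the slicing step.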

### Create clean .csv starting from the first COVID-19 tweet

In [250]:
# Rows are newest first; +1 keeps the first COVID-19 tweet itself in the output
df_health.iloc[:row_num_health + 1].to_csv(sliced_output_name_health, index=False, header=True)
df_nicd.iloc[:row_num_nicd + 1].to_csv(sliced_output_name_nicd, index=False, header=True)
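
Slicing by row position relies on the rows being sorted newest first. A date mask keeps every tweet at or after the start date without that assumption. This is a minimal sketch on made-up data: the `demo` frame, its `full_text` values, and `start` are illustrative only.

```python
import pandas as pd

# Made-up rows, deliberately not sorted by date.
demo = pd.DataFrame({
    "created_at": pd.to_datetime(
        [
            "2019-12-01 08:00:00+00:00",
            "2020-03-19 20:59:19+00:00",
            "2020-01-27 15:21:18+00:00",
        ],
        utc=True,
    ),
    "full_text": ["older tweet", "later tweet", "first COVID-19 tweet"],
})

start = pd.Timestamp("2020-01-27 15:21:18+00:00")

# Keep rows whose timestamp is at or after the start date
covid_era = demo[demo["created_at"] >= start]
print(covid_era["full_text"].tolist())
```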