# Project 2, Part 2, Parse Peak's sales JSON file into CSV files

University of California, Berkeley

Master of Information and Data Science (MIDS) program

w205 - Fundamentals of Data Engineering


# Included Modules and Packages

Code cell containing your includes for modules and packages

In [None]:
import csv

import json

import math
import numpy as np
import pandas as pd

import psycopg2

# Supporting code

Code cells containing any supporting code, such as connecting to the database, any functions, etc.  

Remember you can freely use any code from the labs. You do not need to cite code from the labs.

In [None]:
#
# function to run a select query and return rows in a pandas dataframe
# pandas converts numeric values from postgres to float
# if every value in a float column is a whole number, change the column to integer
#

def my_select_query_pandas(query, rollback_before_flag, rollback_after_flag):
    "function to run a select query and return rows in a pandas dataframe"
    
    if rollback_before_flag:
        connection.rollback()
    
    df = pd.read_sql_query(query, connection)
    
    if rollback_after_flag:
        connection.rollback()
    
    # fix the float columns that really should be integers
    
    for column in df:
    
        if df[column].dtype == "float64":

            fraction_flag = False

            for value in df[column].values:
                
                if not np.isnan(value):
                    if value - math.floor(value) != 0:
                        fraction_flag = True

            if not fraction_flag:
                df[column] = df[column].astype('Int64')
    
    return df
    

In [None]:
connection = psycopg2.connect(
    user = "postgres",
    password = "ucb",
    host = "postgres",
    port = "5432",
    database = "postgres"
)

In [None]:
cursor = connection.cursor()

In [None]:
def my_read_csv_file(file_name, limit):
    "read the csv file and print only the first limit rows"
    
    i = 0
    
    # the with block closes the file when reading is done, even on error
    with open(file_name, "r", newline="") as csv_file:
    
        csv_data = csv.reader(csv_file)
    
        for row in csv_data:
            i += 1
            if i <= limit:
                print(row)
            
    print("\nPrinted ", min(limit, i), "lines of ", i, "total lines.")

In [None]:
def my_recursive_print_json(j, level = -1):
    "given a json object print it"
    
    level += 1
    
    spaces = "    "
    
    if type(j) is dict:
        dict_2_list = list(j.keys())
        for k in dict_2_list:
            print(spaces * level + k)
            my_recursive_print_json(j[k], level)
            
    elif type(j) is list:
        for (i, l) in enumerate(j):
            print(spaces * level + "[" + str(i) + "]")
            my_recursive_print_json(l, level)
                  
    else:
        print(spaces * level + "value:", str(j))
                  


In [None]:
def my_read_nested_json(file_name):
    "given a file of json, read it and parse it meaningfully"
    
    f = open(file_name, "r")
    
    j = json.load(f)
    
    f.close()
    
    my_recursive_print_json(j)

# 2.2.1 Recursive walk of Peak's sales JSON file to help understand the structure

Peak has sent AGM a nested JSON file of sales data for October 3, 2020.  We need to first understand the structure of this file so we can parse it into CSV files to load into staging tables.

Use the function my_read_nested_json() which calls my_recursive_print_json() to take a recursive walk of the nested JSON file peak_sales_2020_10_03.json to help understand the structure.  These functions are from the lab, and are included above for your convenience.

Note: This will display a lot of output.  When running in Jupyter Notebook, the output will display in a scroll box.  When you look at it in GitHub in a web browser, it will be very long.  This is ok.
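To get a feel for what the recursive walk produces before running it on the full file, here is a minimal sketch on a tiny, made-up JSON string. The keys `orders`, `id`, and `tags` are hypothetical and are not taken from Peak's file. `walk_json` is a hypothetical stand-in that mirrors the indentation of my_recursive_print_json() but returns the lines instead of printing them.

```python
import json

def walk_json(j, level=0):
    """Return the indented lines that a recursive walk of j would print."""
    spaces = "    "
    lines = []
    if isinstance(j, dict):
        # dictionary: emit each key, then walk its value one level deeper
        for k, v in j.items():
            lines.append(spaces * level + str(k))
            lines.extend(walk_json(v, level + 1))
    elif isinstance(j, list):
        # list: emit each index in brackets, then walk the element
        for i, item in enumerate(j):
            lines.append(spaces * level + "[" + str(i) + "]")
            lines.extend(walk_json(item, level + 1))
    else:
        # leaf: emit the value itself
        lines.append(spaces * level + "value: " + str(j))
    return lines

# hypothetical sample, not Peak's actual structure
sample = json.loads('{"orders": [{"id": 1, "tags": ["a", "b"]}]}')
walk_lines = walk_json(sample)
print("\n".join(walk_lines))
```

Each extra level of indentation marks one more level of nesting, and a bracketed index such as `[0]` marks an element of a list.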

### Please take some time to study the structure until you understand it.  Understanding the structure will make it much easier to write the parsing code.


# 2.2.2 Parse Peak's nested JSON sales file into CSV files

Write Python code to parse Peak's nested JSON sales file, peak_sales_2020_10_03.json, into CSV files.

The first line of each CSV file should be a list of fields as specified below.

Do NOT sort them.  Keep the csv data rows in the same order as the data appears in the JSON file.

Do NOT remove duplicates.
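As a rough illustration of these rules, the sketch below writes a header row followed by data rows with `csv.writer`, leaving the rows unsorted and keeping a duplicate. The rows are made up for the example, and an in-memory `io.StringIO` buffer stands in for a real output file, so this is an assumption-laden sketch rather than the required solution.

```python
import csv
import io

# hypothetical rows in the order they were encountered; the first and last are duplicates
fields = ["sale_id", "sale_date"]
rows = [["3", "2020-10-03"], ["1", "2020-10-03"], ["3", "2020-10-03"]]

# in-memory stand-in for a file opened with open(name, "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer)

writer.writerow(fields)   # the header is the first line
writer.writerows(rows)    # data rows are written as-is: no sorting, no de-duplication

csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

Because nothing reorders or filters `rows` before writing, the output lines follow the input order exactly.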

peak_sales.csv
* sale_id - this is Peak's sale_id (NOT AGM's sale_id)
* sale_date
* sub_total - this should equal the sum of line item quantity x 12, however always use the data from the JSON file, as we will validate it later
* tax - this should always be 0 as our items are tax exempt, however always use the data from the JSON file, as we will validate it later
* total_amount - this should be equal to sub_total + tax, however always use the data from the JSON file, as we will validate it later

peak_stores.csv
* sale_id - we need this to be able to link this to peak_sales
* location_id - this is Peak's location_id (NOT AGM's store_id)
* name  
* street 
* city 
* state 
* zip 

peak_customers.csv
* sale_id - we need this to be able to link this to peak_sales
* customer_id - this is Peak's customer_id (NOT AGM's customer_id)
* first_name
* last_name
* street
* city
* state
* zip

peak_line_items.csv
* sale_id - we need this to be able to link this to peak_sales
* line_item_id - you will need to sequentially number them within each sale starting with 1, each new sale will start over at 1
* product_id - this is Peak's product ID (NOT AGM's product_id)
* price - this should always be 12, as all of our items are priced at 12, however always use the data from the JSON file, as we will validate it later
* quantity
* taxable - this should always be 'N' as our items are not taxable, however always use the data from the JSON file, as we will validate it later


After creating these CSV files, be sure to check them into your GitHub repo.

You may use as many code cells as you wish.

# 2.2.3 Display the CSV file peak_sales.csv

Use the function my_read_csv_file() from the labs (with a limit of 10) to display the CSV file you just created:
* peak_sales.csv

For your convenience, the function is provided above.

The output should look similar to this:
```
['sale_id', 'sale_date', 'sub_total', 'tax', 'total_amount']
['5763728874', '2020-10-03', '12', '0', '12']
['5763729036', '2020-10-03', '72', '0', '72']
['5763728904', '2020-10-03', '24', '0', '24']
['5763728973', '2020-10-03', '96', '0', '96']
['5763728757', '2020-10-03', '108', '0', '108']
['5763729051', '2020-10-03', '144', '0', '144']
['5763729153', '2020-10-03', '24', '0', '24']
['5763728608', '2020-10-03', '96', '0', '96']
['5763728696', '2020-10-03', '84', '0', '84']

Printed  10 lines of  98 total lines.
```

# 2.2.4 Display the CSV file peak_stores.csv

Use the function my_read_csv_file() from the labs (with a limit of 10) to display the CSV file you just created:
* peak_stores.csv

For your convenience, the function is provided above. 

The output should look similar to this:

```
['sale_id', 'location_id', 'name', 'street', 'city', 'state', 'zip']
['5763728874', '12573', 'Acme Gourmet Meals', '3000 Telegraph Ave', 'Berkeley', 'CA', '94705']
['5763729036', '12573', 'Acme Gourmet Meals', '3000 Telegraph Ave', 'Berkeley', 'CA', '94705']
['5763728904', '12573', 'Acme Gourmet Meals', '3000 Telegraph Ave', 'Berkeley', 'CA', '94705']
['5763728973', '12573', 'Acme Gourmet Meals', '3000 Telegraph Ave', 'Berkeley', 'CA', '94705']
['5763728757', '12573', 'Acme Gourmet Meals', '3000 Telegraph Ave', 'Berkeley', 'CA', '94705']
['5763729051', '12573', 'Acme Gourmet Meals', '3000 Telegraph Ave', 'Berkeley', 'CA', '94705']
['5763729153', '12573', 'Acme Gourmet Meals', '3000 Telegraph Ave', 'Berkeley', 'CA', '94705']
['5763728608', '12573', 'Acme Gourmet Meals', '3000 Telegraph Ave', 'Berkeley', 'CA', '94705']
['5763728696', '12573', 'Acme Gourmet Meals', '3000 Telegraph Ave', 'Berkeley', 'CA', '94705']

Printed  10 lines of  98 total lines.
```


# 2.2.5 Display the CSV file peak_customers.csv

Use the function my_read_csv_file() from the labs (with a limit of 10) to display the CSV file you just created:
* peak_customers.csv

For your convenience, the function is provided above.

The output should look similar to this:

```
['sale_id', 'customer_id', 'first_name', 'last_name', 'street', 'city', 'state', 'zip']
['5763728874', '3728404', 'Darrelle', 'Dohrmann', '46 Farwell Terrace', 'Oakland', 'CA', '94609']
['5763729036', '3729309', 'Moria', 'Greenwood', '8803 Delaware Crossing', 'Berkeley', 'CA', '94705']
['5763728904', '3728508', 'Josiah', 'Hulett', '6755 Melby Plaza', 'Oakland', 'CA', '94612']
['5763728973', '3728534', 'Gayle', 'MacGarrity', '286 Onsgard Center', 'Berkeley', 'CA', '94703']
['5763728757', '3729188', 'Courtenay', 'Shirrell', '75 West Park', 'Emeryville', 'CA', '94608']
['5763729051', '3729276', 'Christian', 'Anyene', '869 Transport Crossing', 'Berkeley', 'CA', '94707']
['5763729153', '3729242', 'Linnell', 'Barr', '521 Fallview Alley', 'Oakland', 'CA', '94602']
['5763728608', '3728705', 'Benedick', 'Staneland', '3852 Laurel Park', 'Berkeley', 'CA', '94704']
['5763728696', '3729340', 'Lanni', 'Pickavant', '481 Moose Pass', 'Oakland', 'CA', '94609']

Printed  10 lines of  98 total lines.
```

# 2.2.6 Display the CSV file peak_line_items.csv

Use the function my_read_csv_file() from the labs (with a limit of 10) to display the CSV file you just created:
* peak_line_items.csv

For your convenience, the function is provided above.

The output should look similar to this:

```
['sale_id', 'line_item_id', 'product_id', 'price', 'quantity', 'taxable']
['5763728874', '1', '42314680', '12', '1', 'N']
['5763729036', '1', '42314677', '12', '1', 'N']
['5763729036', '2', '42314682', '12', '3', 'N']
['5763729036', '3', '42314684', '12', '2', 'N']
['5763728904', '1', '42314680', '12', '1', 'N']
['5763728904', '2', '42314684', '12', '1', 'N']
['5763728973', '1', '42314677', '12', '2', 'N']
['5763728973', '2', '42314680', '12', '2', 'N']
['5763728973', '3', '42314682', '12', '2', 'N']

Printed  10 lines of  353 total lines.
```
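The sample rows shown in the expected outputs can be cross-checked against the rules stated for peak_sales.csv: sub_total should equal the sum of price x quantity over a sale's line items, and total_amount should equal sub_total + tax. The sketch below is one possible way to do that check, using only the three sales whose line items are fully visible in the sample output. Sale 5763728973 is left out because its line items appear to continue past the 10 displayed lines.

```python
# rows copied from the sample outputs (strings, as csv.reader returns them)
sales_rows = [
    ["5763728874", "2020-10-03", "12", "0", "12"],
    ["5763729036", "2020-10-03", "72", "0", "72"],
    ["5763728904", "2020-10-03", "24", "0", "24"],
]
line_item_rows = [
    ["5763728874", "1", "42314680", "12", "1", "N"],
    ["5763729036", "1", "42314677", "12", "1", "N"],
    ["5763729036", "2", "42314682", "12", "3", "N"],
    ["5763729036", "3", "42314684", "12", "2", "N"],
    ["5763728904", "1", "42314680", "12", "1", "N"],
    ["5763728904", "2", "42314684", "12", "1", "N"],
]

# sum price x quantity per sale_id
expected = {}
for sale_id, _, _, price, quantity, _ in line_item_rows:
    expected[sale_id] = expected.get(sale_id, 0) + int(price) * int(quantity)

# compare each sale's recorded amounts against the computed ones
checks = {}
for sale_id, _, sub_total, tax, total_amount in sales_rows:
    checks[sale_id] = (
        int(sub_total) == expected[sale_id]
        and int(sub_total) + int(tax) == int(total_amount)
    )

print(checks)
```

For example, sale 5763729036 has quantities 1, 3, and 2, so 12 x (1 + 3 + 2) = 72, which matches its sub_total.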