# Data 200: Database Systems and Data Management for Data Analytics


# Homework 11: Operations on JSON Files

**Dickinson College**<br/>
**Spring 2022**<br/>
**Instructor:** Dick Forrester<br/>
<font color='red'>**Due Date and Time:** 11:59pm on Monday, 4/18/2022 </font>

---
Enter your name in the markdown cell below.

# Name: Zimeng Liu

In [1]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING AND TO LOAD NumPy
import requests
import numpy as np
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

# Tasks

- Review pages 281-322 in the Course Notes.
- Complete the **Creating Plots on Data Aware Grids** chapter of the **Intermediate Data Visualization with Seaborn** course on DataCamp.
- Complete the **Creating and Manipulating Your Own Databases** and **Putting it all Together** chapters of the **Introduction to Databases in Python** course on DataCamp.
- Complete the **Key-Value Databases** and **Document Databases** chapters of the **NoSQL Concepts** course on DataCamp.
- E-mail me your completed Jupyter notebook.

# Exercises

This homework involves processing a file, `slac.json`, that contains
JSON-formatted data for the course catalog at a Small Liberal Arts College (slac).

**Run the code cell below to import the libraries we will use in this homework, and to set the data directory.**

In [2]:
import os.path
from lxml import etree
import json
import pandas as pd
import util

datadir="hw11Data"

<div class="exercise"><b>Exercise 1:</b></div> 

Open the `slac.json` file using a **text editor** and inspect the file to understand its structure:


- The top level is a dictionary with only one key (`course`), which maps to a list of dictionaries.
- Each of these dictionaries describes exactly one course offering, with fields about the course, including `reg_num`, `title`, etc.
- Two of the entries within these dictionaries are dictionaries themselves.
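
The structure described above can also be checked from Python. The sketch below is only an illustration: it uses a small hand-written sample that mirrors the shape of `slac.json` (one abbreviated record, not the real file) and reports which fields of a course are themselves dictionaries.

```python
import json

# Abbreviated sample that mirrors the shape of slac.json
# (assumption: a single shortened record, not the real file)
sample_text = """
{"course": [
  {"reg_num": "10577", "title": "Introduction to Anthropology",
   "time": {"start_time": "03:10PM", "end_time": "04:30"},
   "place": {"building": "ELIOT", "room": "414"}}
]}
"""
sample = json.loads(sample_text)

# The top level is a dictionary; list its keys
top_keys = list(sample.keys())

# Find the fields of the first course whose values are dictionaries
first = sample["course"][0]
nested = [key for key, value in first.items() if isinstance(value, dict)]

print(type(sample).__name__, top_keys)
print(type(sample["course"]).__name__, len(sample["course"]))
print("nested fields:", nested)
```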

<div class="exercise"><b>Exercise 2:</b></div> 

Write Python code to 
- Read in the JSON-formatted data from the `slac.json` file and generate a data structure called `slac` that contains the parsed data.  **Note:** The `slac.json` file is stored in the `hw11Data` folder.  Furthermore, we already defined `datadir = "hw11Data"` in the same code cell where we imported the needed Python libraries.  **Please use `os.path.join()`** to set the path to the JSON file.
- Use the utility function `util.print_data()` to print the first 38 lines of the data structure.

In [3]:
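# Build the path in a platform-independent way
slac_path = os.path.join(datadir, "slac.json")
# Parse the JSON file into nested Python dictionaries and lists
with open(slac_path) as file:
    slac = json.load(file)
util.print_data(slac, nlines=38)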

{
  "course": [
    {
      "reg_num": "10577",
      "subj": "ANTH",
      "crse": "211",
      "sect": "F01",
      "title": "Introduction to Anthropology",
      "units": "1.0",
      "instructor": "Brightman",
      "days": "M-W",
      "time": {
        "start_time": "03:10PM",
        "end_time": "04:30"
      },
      "place": {
        "building": "ELIOT",
        "room": "414"
      }
    },
    {
      "reg_num": "20573",
      "subj": "ANTH",
      "crse": "344",
      "sect": "S01",
      "title": "Sex and Gender",
      "units": "1.0",
      "instructor": "Makley",
      "days": "T-Th",
      "time": {
        "start_time": "10:30AM",
        "end_time": "11:50"
      },
      "place": {
        "building": "VOLLUM",
        "room": "120"
      }
    },
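
`util.print_data()` is a course-provided helper whose implementation is not shown in this notebook. As a rough stand-in, assuming it pretty-prints the structure as indented JSON and stops after `nlines` lines, something like the hypothetical `preview_json()` below gives similar output using only the standard library.

```python
import json

def preview_json(data, nlines=10):
    """Return the first nlines lines of the indented JSON text for data.

    Hypothetical helper; not the course's util.print_data().
    """
    lines = json.dumps(data, indent=2).splitlines()
    return "\n".join(lines[:nlines])

# Tiny sample in the same shape as slac (assumption: abbreviated record)
sample = {"course": [{"reg_num": "10577", "subj": "ANTH"}]}
preview = preview_json(sample, nlines=4)
print(preview)
```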


<div class="exercise"><b>Exercise 3:</b></div> 

Write code to iterate over the first 5 elements of the course list contained in `slac` and print
out the course title and registrar number. This is a little tricky because `slac` is a dictionary that contains a list of dictionaries.  Please study the output from the previous exercise carefully.  Note that in my solution, I simply used standard Python techniques for accessing lists and dictionaries.

Below is my output--your code should mimic it.

<code>
Introduction to Anthropology  (10577)
Sex and Gender  (20573)
Field Biology of Amphibians  (10624)
Bacterial Pathogenesis  (10626)
Seminar in Biology  (20626)
</code>

In [4]:
lis = slac['course']
for item in lis[:5]:
    print(item['title']+'  ('+item['reg_num']+')')

Introduction to Anthropology  (10577)
Sex and Gender  (20573)
Field Biology of Amphibians  (10624)
Bacterial Pathogenesis  (10626)
Seminar in Biology  (20626)
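
The same loop can be written with f-strings, and `dict.get()` guards against a record that lacks a field. This is a sketch on made-up sample data (the third record, with no title, is hypothetical), not a replacement for the solution above.

```python
# Sample records; the last one is a hypothetical course with no "title" key
courses = [
    {"reg_num": "10577", "title": "Introduction to Anthropology"},
    {"reg_num": "20573", "title": "Sex and Gender"},
    {"reg_num": "10624"},
]

lines = []
for course in courses[:5]:  # slicing past the end of a short list is safe
    # get() returns the fallback instead of raising KeyError
    title = course.get("title", "(untitled)")
    lines.append(f"{title}  ({course['reg_num']})")

print("\n".join(lines))
```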


<div class="exercise"><b>Exercise 4:</b></div> 

Write a function `slacDataFrame(data)` that creates and returns a Pandas data frame from the `slac` data. There should be
a row per course, and columns named as they are in the dictionary used to represent each course, but skipping the "time" and "place" sub-dictionaries. The data frame should have `reg_num` as the row index.   **Hint:** In my solution I deleted the "time" and "place" sub-dictionaries before passing the data into `pd.DataFrame()`. However, not all courses have this data, so I tested if "time" was contained in a course, and if so, I deleted it.  I did the same for "place".

In the second code cell below you will test your function by passing in `slac`. The output from my solution is as follows (note that `reg_num` is the index):<br>

<code>
         subj crse sect                         title units instructor  days
reg_num                                                                     
10577    ANTH  211  F01  Introduction to Anthropology   1.0  Brightman   M-W
20573    ANTH  344  S01                Sex and Gender   1.0     Makley  T-Th
10624    BIOL  431  F01   Field Biology of Amphibians   0.5     Kaplan     T
10626    BIOL  431  F03        Bacterial Pathogenesis   0.5        NaN   NaN
20626    BIOL  431  S04            Seminar in Biology   0.5  Yezerinac    Th
</code>

In [5]:
def slacDataFrame(data):
    LoD = []
    lis = data['course']
    for rowD in lis:
        rowD = rowD.copy()
        LoD.append(rowD)
        
    df = pd.DataFrame(LoD)
    df.set_index('reg_num',inplace=True)
    df = df.drop(['time', 'place'], axis=1, errors='ignore')
    return df
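
The solution above drops the `time` and `place` columns after building the data frame. The approach described in the hint, which removes the sub-dictionaries from each course first, could look roughly like the sketch below. The function name `slac_frame_from_hint` and the sample data are made up for illustration.

```python
import pandas as pd

def slac_frame_from_hint(data):
    """Hypothetical variant: remove nested fields per course, then build the frame."""
    rows = []
    for course in data["course"]:
        row = dict(course)  # shallow copy so the caller's dictionary keeps its keys
        for key in ("time", "place"):
            if key in row:  # not every course has these fields
                del row[key]
        rows.append(row)
    return pd.DataFrame(rows).set_index("reg_num")

# Made-up sample: the second course has neither "time" nor "place"
sample = {"course": [
    {"reg_num": "10577", "subj": "ANTH",
     "time": {"start_time": "03:10PM"}, "place": {"building": "ELIOT"}},
    {"reg_num": "10626", "subj": "BIOL"},
]}
hint_df = slac_frame_from_hint(sample)
print(hint_df)
```

Because each course is copied with `dict(course)` before any key is deleted, the input keeps its `time` and `place` entries.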

**Run the code cell below to test your function and make sure it matches my output.**  Note that I make a copy of `slac` before passing it into the function because my function modifies the data passed in.  Specifically, I use the `copy.deepcopy()` function.  I encourage you to read about this function on the web.

In [6]:
# Make a copy of slac before passing it in to the function
import copy
slac_copy = copy.deepcopy(slac)

slac_df = slacDataFrame(slac_copy)
print(slac_df.head())

         subj crse sect                         title units instructor  days
reg_num                                                                     
10577    ANTH  211  F01  Introduction to Anthropology   1.0  Brightman   M-W
20573    ANTH  344  S01                Sex and Gender   1.0     Makley  T-Th
10624    BIOL  431  F01   Field Biology of Amphibians   0.5     Kaplan     T
10626    BIOL  431  F03        Bacterial Pathogenesis   0.5        NaN   NaN
20626    BIOL  431  S04            Seminar in Biology   0.5  Yezerinac    Th
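
To see why `copy.deepcopy()` matters for nested data, compare it with a shallow copy. In this small sketch (the sample dictionary is made up), deleting a key through the shallow copy also changes the original, because both share the same inner dictionaries. The deep copy is unaffected.

```python
import copy

# Made-up nested sample in the same shape as slac
original = {"course": [{"reg_num": "10577", "time": {"start_time": "03:10PM"}}]}

shallow = copy.copy(original)    # new outer dict, but the same inner list and dicts
deep = copy.deepcopy(original)   # fully independent copy at every level

del shallow["course"][0]["time"]  # this also removes "time" from original

shallow_affected_original = "time" not in original["course"][0]
deep_unaffected = "time" in deep["course"][0]
print(shallow_affected_original, deep_unaffected)
```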


<div class="exercise"><b>Exercise 5:</b></div> 

Now write a function `slacDataFrame2(data)` that creates and returns a Pandas data frame from the same `slac` data used in the last
exercise. But in this case, traverse the "time" and "place" sub-dictionaries to populate columns `start_time`, `end_time`, `building`, and `room` in the data frame.  Once again note that not all courses have "time" and "place". Furthermore, some courses may have "time", but not both "start_time" and "end_time" (the same is true for "place").  Therefore, you will need to test for the existence of all of these within your code.  To be honest, it's a pain, but representative of real data!

In the second code cell below you will test your function by passing in `slac`. The output from my solution is as follows:<br>

![Dataframe](dataframe1.png)

In [7]:
def slacDataFrame2(data):
    LoD = []
    lis = data['course']
    for rowD in lis:
        if 'time' in rowD:
            if 'start_time' in rowD['time']:
                rowD['start_time'] = rowD['time']['start_time']
            if 'end_time' in rowD['time']:
                rowD['end_time'] = rowD['time']['end_time']
        
        if 'place' in rowD:
            if 'building' in rowD['place']:
                rowD['building'] = rowD['place']['building']
            if 'room' in rowD['place']:
                rowD['room'] = rowD['place']['room']
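        # Remove the nested dictionaries; pop() with a default avoids a
        # KeyError for courses that have no "time" or "place" entry
        rowD.pop('time', None)
        rowD.pop('place', None)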
        
        LoD.append(rowD)
        
    df = pd.DataFrame(LoD)
    df.set_index('reg_num',inplace=True)
    return df

**Run the code cell below to test your function and make sure it matches my output.** 

In [8]:
# Make a copy of slac
slac_copy = copy.deepcopy(slac)

slac_df2 = slacDataFrame2(slac_copy)
slac_df2.head()

| reg_num | subj | crse | sect | title | units | instructor | days | start_time | end_time | building | room |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10577 | ANTH | 211 | F01 | Introduction to Anthropology | 1.0 | Brightman | M-W | 03:10PM | 04:30 | ELIOT | 414 |
| 20573 | ANTH | 344 | S01 | Sex and Gender | 1.0 | Makley | T-Th | 10:30AM | 11:50 | VOLLUM | 120 |
| 10624 | BIOL | 431 | F01 | Field Biology of Amphibians | 0.5 | Kaplan | T | 06:10PM | 08:00 | PHYSIC | 240A |
| 10626 | BIOL | 431 | F03 | Bacterial Pathogenesis | 0.5 | NaN | NaN | NaN | NaN | NaN | 240B |
| 20626 | BIOL | 431 | S04 | Seminar in Biology | 0.5 | Yezerinac | Th | 06:10PM | 08:00 | BIOL | 200A |
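
As an alternative to testing for each key by hand, `pd.json_normalize()` flattens nested dictionaries into columns and fills missing fields with `NaN`. The sketch below uses made-up sample records and is not the approach the exercise asks for. It produces dotted column names such as `time.start_time`, which are then shortened to the final component.

```python
import pandas as pd

# Made-up sample: the second course has no "time" and only a partial "place"
sample_courses = [
    {"reg_num": "10577", "subj": "ANTH",
     "time": {"start_time": "03:10PM", "end_time": "04:30"},
     "place": {"building": "ELIOT", "room": "414"}},
    {"reg_num": "10626", "subj": "BIOL", "place": {"room": "240B"}},
]

# Nested keys become columns such as "time.start_time" and "place.room"
flat = pd.json_normalize(sample_courses)

# Keep only the last component of each dotted name
flat.columns = [name.split(".")[-1] for name in flat.columns]
flat = flat.set_index("reg_num")
print(flat)
```

Shortening the names this way is only safe when no two sub-dictionaries share a field name, which holds for `time` and `place` here.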
