# Code Toolkit: Python, Fall 2021
## Week 12 — Class notes
* Serialization with JSON
* Scraping and Data with Python
* REST API

## Serialization with JSON

![JSON](images/json.jpeg)
_JSON WTF?_

![JSON](images/yo_dog_json.jpeg)

JSON, or [JavaScript Object Notation](https://en.wikipedia.org/wiki/JSON), is an open standard that just about everyone uses, not just web developers. It stores and transmits data in a structured text format, so applications can exchange data with each other, or save and load data within your own application. I use JSON almost every day to cache data and load it later. Other file formats are built on top of JSON's open standard.

Let's look at a simple JSON object:

```
{
    "name": "Dan Moore"
}
```

Let's look at how you load that with Python:

In [None]:
import json

# open the JSON file and parse its contents
# (the with-block closes the file for us):
with open("dan_moore.json") as file:
    person = json.load(file)

# the result is a Python dictionary:
print(person["name"])
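
Loading is only half of serialization. Here is a minimal sketch of the other direction, using `json.dumps` / `json.dump` to turn a Python dictionary back into JSON. The temporary directory is just an assumption so the example can run anywhere without touching your own files:

```python
import json
import os
import tempfile

person = {"name": "Dan Moore"}

# Serialize to a JSON string, then parse that string back into a dict.
text = json.dumps(person)
round_tripped = json.loads(text)

# Save to a file and load it again (temporary folder, for illustration only).
path = os.path.join(tempfile.mkdtemp(), "dan_moore.json")
with open(path, "w") as f:
    json.dump(person, f, indent=4)
with open(path) as f:
    loaded = json.load(f)

print(text)
print(loaded == person)
```

The `s` in `dumps` / `loads` stands for "string": those two work on text in memory, while `dump` / `load` work on open files.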

Let's make it a little more official:

```
{
    "first_name": "Dan",
    "last_name": "Moore"
}
```
or

```
{
    "firstName": "Dan",
    "lastName": "Moore"
}
```

JSON supports lists (arrays) of items, written with square brackets: `[]`. A JSON array looks something like this:

```
{
    "first_name": "Dan",
    "last_name": "Moore",
    "pets":[
        {
            "name": "Voxel",
            "age": 2.5,
            "dob": "04/06/2021",
            "type": "Dog",
            "breed": "Mini Golden Doodle",
            "color": "Sable",
            "isAlive": true
        },
        {
            "name": "Tux",
            "age": 14,
            "dob": null,
            "type": "Cat",
            "breed": "House",
            "color": "Black and White",
            "isAlive": false
        }
    ]
}
```
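
To see how those JSON values arrive in Python, here is a small sketch that parses a trimmed-down version of the pets data from a string. The trimmed string is only for illustration; the type mapping is the standard behavior of the `json` module:

```python
import json

text = """
{
    "pets": [
        {"name": "Voxel", "age": 2.5, "isAlive": true},
        {"name": "Tux", "age": 14, "dob": null, "isAlive": false}
    ]
}
"""

data = json.loads(text)
pets = data["pets"]

# JSON array -> list, object -> dict, true/false -> True/False, null -> None
print(type(pets).__name__)
print(type(pets[0]).__name__)
print(type(pets[0]["age"]).__name__, type(pets[1]["age"]).__name__)
print(pets[0]["isAlive"], pets[1]["dob"])
```

Numbers with a decimal point become `float`, whole numbers become `int`.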

So let's make it more complex:

```
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [
    "Catherine",
    "Thomas",
    "Trevor"
  ],
  "spouse": null
}
```


_OK COOL_ but how do I use it???

In [None]:
with open("data/john_smith.json") as f:
    data = json.load(f)
    for item in data:
        print(f"{item} data: {data[item]}")

    print(f"{data['address']['city']}")
    print(f"{data['phoneNumbers'][0]['type']} {data['phoneNumbers'][0]['number']}")

    for item in data["phoneNumbers"]:
        print(item["type"])
        print(item["number"])
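
Real data often has missing keys. As a small sketch (the `email` key is a made-up example of a field that is not in the data), `dict.get` lets you ask for a key without crashing, and a dictionary comprehension can reshape the phone list into something easier to look up:

```python
import json

text = """
{
  "firstName": "John",
  "address": {"city": "New York"},
  "phoneNumbers": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "office", "number": "646 555-4567"}
  ],
  "spouse": null
}
"""
data = json.loads(text)

# data["email"] would raise KeyError; .get returns a default instead.
email = data.get("email", "unknown")

# A key that exists with a null value comes back as None.
spouse = data.get("spouse")

# Turn the list of phone objects into a {type: number} dictionary.
numbers_by_type = {p["type"]: p["number"] for p in data["phoneNumbers"]}

print(email)
print(spouse)
print(numbers_by_type["office"])
```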


# OK, so let's turn any CSV into JSON we can parse more easily:



In [None]:
import csv
import json
def convert_csv_to_json(csv_file_path):
    # Read CSV file
    # newline='' is what the csv module recommends when opening CSV files
    with open(csv_file_path, 'r', newline='') as file:
        reader = csv.DictReader(file)
        rows = list(reader)

    # Convert CSV data to JSON
    json_data = json.dumps(rows, indent=4)

    # Save JSON data to a file (optional)
    with open('notebooks/07-12-08-10.json', 'w') as json_file:
        json_file.write(json_data)

    return json_data

# Specify the path to your CSV file
csv_file_path = './notebooks/data_sources/Lottery_Powerball_Winning_Numbers__Beginning_2010-08-10-2023.csv'

# Convert CSV to JSON
json_data = convert_csv_to_json(csv_file_path)

print("Conversion completed. JSON data:")
print(json_data)
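
One thing to watch for: `csv.DictReader` hands back every value as a string. A minimal sketch, using a tiny in-memory CSV (made up from the pets example earlier, standing in for a real file) and converting one column to a number before dumping:

```python
import csv
import io
import json

# A tiny CSV held in memory, for illustration only.
csv_text = "name,type,age\nVoxel,Dog,2.5\nTux,Cat,14\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Every value starts out as a str, even the ages.
age_type_before = type(rows[0]["age"]).__name__

# Convert the numeric column so the JSON contains numbers, not "2.5".
for row in rows:
    row["age"] = float(row["age"])

json_text = json.dumps(rows, indent=4)
print(age_type_before)
print(json_text)
```

Whether you convert columns up front or leave everything as text depends on what you plan to do with the JSON later.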

In [None]:
%pip install BeautifulSoup4
%pip install requests 

### Beautiful Soup 
Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

In [1]:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>Some<b>bad<i>HTML")
print(soup.prettify())

<html>
 <body>
  <p>
   Some
   <b>
    bad
    <i>
     HTML
    </i>
   </b>
  </p>
 </body>
</html>
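
Once the page is parsed, you usually want specific pieces of it. Here is a small sketch using CSS selectors and `get_text`. The HTML snippet and the course id `ABCD1234` are made up for illustration, and `html.parser` is Python's built-in parser:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, loosely shaped like a course listing.
html = """
<div class="course">
  <div class="crse_id"><p>ABCD1234</p></div>
  <div class="title"><p>Intro to <em>Python</em></p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first element matching a CSS selector.
course_id = soup.select_one("div.crse_id p").get_text(strip=True)

# get_text drops the tags and keeps only the text inside them.
title = soup.select_one("div.title p").get_text(" ", strip=True)

print(course_id)
print(title)
```

`get_text` is often tidier than converting a tag to a string and stripping `<p>` by hand, because it also removes any nested tags.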



In [None]:
import requests
import json
import csv
import time

In [None]:
url = f"LET'S PICK A WEB SITE? WITH A LOT OF TEXT OR IMAGES"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0'}
response = requests.get(url, headers=headers)  # send HTTP request
# print(response.text)

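### [HEADERS?](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers)

HTTP headers are key-value pairs sent along with every request and response. The `User-Agent` header tells the server what kind of client is asking; here we send a browser-like value, since some sites respond differently to a script's default `User-Agent`.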

```
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0'}

```
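
If you want to look at a request before anything goes over the network, here is a rough sketch using the standard library's `urllib.request` as a stand-in for `requests`. The URL is a placeholder and nothing is actually sent:

```python
import urllib.request

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:24.0) "
                  "Gecko/20100101 Firefox/24.0"
}

# Build the request object only; example.com is a placeholder URL.
req = urllib.request.Request("https://example.com/", headers=headers)

method = req.get_method()
# urllib stores header names capitalized as "User-agent".
user_agent = req.get_header("User-agent")

print(method)
print(user_agent)
```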

### Let's Talk about Scraping
* Why would we want to scrape a whole website?
* What kind of information are we looking for?
* What kind of story do we want to tell?

In [None]:
data = {}
for i in range(0, 17):
    url = f"https://courses.newschool.edu/?term%5B%5D=202310&campus%5B%5D=GV&page={i}&mode=json&_=1670198429800"
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0'}
    response = requests.get(url, headers=headers)  # send HTTP request
    # print(response.text)
    json_data = json.loads(response.text)
    soup = BeautifulSoup(json_data["data"]["attributes"], 'html.parser')  # parse data
    # print(soup)
    # soup = soup.find_all("div",  {"class":"crse_id"})  # gets specific header
    crse_id = soup.select("div[class*=crse_id]")
    titles = soup.select("div[class*=title]")
    credit = soup.select("div[class*=credit]")
    # enumerate gives each course its position without reusing the page counter `i`
    for idx, course in enumerate(crse_id):
        course_num = course.select("p")
        course_num_str = str(course_num[0])
        course_num_str = course_num_str.replace("<p>", "").replace("</p>", "")
        course_credit = credit[idx].select("p")
        course_credit_str = str(course_credit[0])
        course_credit_str = course_credit_str.replace("<p>", "").replace("</p>", "")
        title = titles[idx].select("p")
        title_str = str(title[0])
        title_str = title_str.replace("<p>", "").replace("</p>", "")
        data[course_num_str] = {}
        data[course_num_str]["credits"] = course_credit_str
        data[course_num_str]["title"] = title_str
    
    with open("new_school_course_numbers_with_title_credit_fall2023_nyc.json", "w") as outfile:
        outfile.write(json.dumps(data, indent = 4))
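
The `%5B%5D` bits in that URL are just `[]` after URL encoding. As a sketch, the same query string can be built from a plain dictionary with the standard library's `urlencode`, which is easier to read and edit. The trailing `_=...` parameter from the original URL is left out here:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base = "https://courses.newschool.edu/"
params = {"term[]": "202310", "campus[]": "GV", "page": 0, "mode": "json"}

# urlencode percent-encodes the brackets for us.
url = f"{base}?{urlencode(params)}"
print(url)

# Going the other way: pull the parameters back out of a URL.
query = parse_qs(urlsplit(url).query)
print(query["term[]"], query["page"])
```

`requests.get` also accepts a `params=` dictionary and does this encoding itself.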

In [None]:
results = []
total = 0
with open("new_school_course_numbers_with_title_credit_fall2023_nyc.json", "r") as in_file:
    data = json.load(in_file)
    for course_id in data:
        print(course_id)
        url = f"https://courses.newschool.edu/courses/{course_id}/13942/"
        response = requests.get(url, headers=headers)  # send HTTP request
        print(response.text)
        # json_data = json.loads(response.text)
        soup = BeautifulSoup(response.text, 'html.parser')  # parse data
        print(soup)
    output = {}
    count = 0
    for course_id in data:
        print(course_id)
        url = f"https://courses.newschool.edu/?term%5B%5D=202310&page=1&crse_id={course_id}&mode=json&_=1669857183697"
        response = requests.get(url, headers=headers)  # send HTTP request
        # print(response.text)
        json_data = json.loads(response.text)
        soup = BeautifulSoup(json_data["data"]["attributes"], 'html.parser')  # parse data
        try:
            instructor_elements = soup.find_all("div", class_="instructor")
            day_elements = soup.find_all("div", class_="days")
            time_elements = soup.find_all("div", class_="times")
            count = 0
            output[course_id] = []
            for day in day_elements:
                day_num = day.select("p")
                day_num_str = str(day_num[0]).replace("<p>", "").replace("</p>", "").replace("<em class=\"fa fa-calendar\"></em>", "")
                instructor = instructor_elements[count].select("p")
                instructor_str = str(instructor[0]).replace("<p>", "").replace("</p>", "").replace("<b>Faculty</b>: ", "")
                _time_num = time_elements[count].select("p")
                _time_num_str =  str(_time_num[0]).replace("<p>", "").replace("</p>", "")
                section = {}
                section["day"] = day_num_str
                section["time"] = _time_num_str
                section["Instructor"] = instructor_str
                output[course_id].append(section)
                count += 1

            with open("sections_with_instructor_fall2023.json", "w") as outfile:
                outfile.write(json.dumps(output, indent = 4))
        except Exception as e:  # BaseException would also swallow Ctrl-C / kernel interrupts
            print(e)
        time.sleep(0.333)

In [None]:
with open("sections_with_instructor_fall2023.json", "r") as outfile:
    with open("new_school_course_numbers_with_title_credit_fall2023_nyc.json", "r") as titles:
        with open("people.csv", "r") as people_file:
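            # "w" starts a fresh file each run; "a" would append a second header row and duplicate rows on re-runs
            with open("fall_2023_contact_hours_for_ptf.csv", "w", newline="") as csv_out: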
                data = json.load(outfile)
                titles_data = json.load(titles)
                reader = csv.DictReader(people_file)
                csvwriter = csv.writer(csv_out, lineterminator='\n')
                csvwriter.writerow(["INSTRUCTOR","TITLE","COURSE NUMBER", "MEETINGS PER WEEK", "START TIME", "END TIME","PTF"])
                ptf = []
                for row in reader:
                    ptf.append(row["name"])
                instructor = []    
                for course_num in data :
                    for section in data[course_num]:
                        title = ""
                        if course_num in titles_data:
                            title = titles_data[course_num]["title"]                    
                        day = section["day"]
                        day = day.split(",")
                        _time = section["time"]
                        _time = _time.split(" - ")
                        _start_time = ""
                        _end_time = ""
                        _ptf = False
                        if len(_time) > 1:
                            _start_time = _time[0]
                            _end_time = _time[1]
                        if "," in section["Instructor"]:
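                            # normalize "A, B, and C" and "A, B and C" to "A, B, C" before splitting
                            section["Instructor"] = section["Instructor"].replace(", and ", ", ").replace(" and ", ", ")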
                            instructors = section["Instructor"].split(", ")
                            for inst in instructors:
                                if inst not in instructor:
                                    instructor.append(inst)
                                    print(len(instructor))
                                if inst in ptf:
                                    _ptf = True
                                else :
                                    _ptf = False
                                csvwriter.writerow([inst, title,course_num, len(day), _start_time, _end_time, _ptf])
                        elif " and " in section["Instructor"]:
                            instructors = section["Instructor"].split(" and ")
                            for inst in instructors:
                                if inst not in instructor:
                                    instructor.append(inst)
                                    print(len(instructor))
                                if inst in ptf:
                                    _ptf = True
                                else :
                                    _ptf = False
                                csvwriter.writerow([inst, title,course_num, len(day), _start_time, _end_time, _ptf])
                        else:
                            if section["Instructor"] in ptf:
                                _ptf = True
                            else :
                                _ptf = False
                            if section["Instructor"]  not in instructor:
                                instructor.append(section["Instructor"])
                                print(len(instructor))
                            print([section["Instructor"],title,course_num, len(day), _start_time, _end_time, _ptf])
                            csvwriter.writerow([section["Instructor"],title,course_num, len(day), _start_time, _end_time, _ptf])
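
The three branches above all do the same job: turn an instructor string into a list of names. As a sketch, a small helper with one regular expression could cover all three cases. `split_instructors` is a hypothetical name, the sample names are placeholders, and the real listings may use formats this does not handle:

```python
import re

def split_instructors(text):
    """Split strings like 'A, B and C' or 'A, B, and C' into a list of names."""
    # Split on a comma (optionally followed by "and"), or on a standalone " and ".
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", text)
    return [p.strip() for p in parts if p.strip()]

print(split_instructors("Ada Lovelace"))
print(split_instructors("Ada Lovelace and Alan Turing"))
print(split_instructors("Ada Lovelace, Alan Turing, and Grace Hopper"))
print(split_instructors("Sandra Anderson"))
```

Because the pattern requires whitespace around a standalone "and", names that merely contain those letters stay in one piece.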