# Code Toolkit: Python, Fall 2021
## Week 12 — Class notes
* Serialization with JSON
* Scraping and Data with Python

## Serialization with JSON

![JSON](images/json.jpeg)

![JSON](images/yo_dog_json.jpeg)

JSON, or [JavaScript Object Notation](https://en.wikipedia.org/wiki/JSON), is an open standard that just about everyone uses, not just web developers. It's used to store and transmit data in a structured text format, so that applications can talk to each other or so your own application can save and load its data. I use JSON almost every day to cache data and load it later. Other file formats are built on top of JSON's open standard.

Let's look at a simple JSON object:

```
{
    "name": "Dan Moore"
}
```

Let's look at how you load that with Python:

In [None]:
# the most important thing to remember is this
import json #<<<<<<<<<<<
import math
import time 

In [None]:
import json

# open the JSON file; "with" closes it again when the block ends
with open("./data/dan_moore.json") as file:
    # parse the file contents:
    person = json.load(file)

# the result is a Python dictionary:
print(person["pets"][0]['name'])
print(person["pets"][0]['dob'])
print(person["pets"][0]['breed'])
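
If you don't have a file handy, `json.loads` (note the `s`, for "string") parses JSON text directly. Here is a minimal sketch using the one-key object from above; the variable names are only for illustration:

```python
import json

# JSON text held in a Python string (the simple one-key object)
text = '{"name": "Dan Moore"}'

# json.loads parses a string; json.load (no "s") parses an open file
person = json.loads(text)

print(type(person).__name__)
print(person["name"])
```

Either way you end up with an ordinary Python dictionary.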

Let's make it a little more official:

```
{
    "first_name": "Dan",
    "last_name": "Moore"
}
```
or

```
{
    "firstName": "Dan",
    "lastName": "Moore"
}
```

JSON supports lists, or arrays, of items. They are written inside square brackets ```[]```. A JSON object containing an array looks something like this:

```
{
    "first_name": "Dan",
    "last_name": "Moore",
    "pets":[
        {
            "name": "Voxel",
            "age": 2.5,
            "dob": "04/06/2021",
            "type": "Dog",
            "breed": "Micro-Mini Golden Doodle",
            "color": "Sable",
            "isAlive": true
        },
        {
            "name": "Tux",
            "age": 14,
            "dob": null,
            "type": "Cat",
            "breed": "House",
            "color": "Black and White",
            "isAlive": false
        }
    ]
}
```
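
JSON types map onto Python types when loaded: objects become dicts, arrays become lists, `true`/`false` become `True`/`False`, and `null` becomes `None`. A small sketch with a trimmed-down version of the pets data (the string below is an illustrative cut, not the full file):

```python
import json

text = """
{
    "first_name": "Dan",
    "pets": [
        {"name": "Voxel", "age": 2.5, "isAlive": true},
        {"name": "Tux", "age": 14, "dob": null, "isAlive": false}
    ]
}
"""

person = json.loads(text)

# A JSON array becomes a Python list, so we can loop over it
for pet in person["pets"]:
    print(pet["name"], type(pet["age"]).__name__, pet["isAlive"])

# JSON null becomes Python None
print(person["pets"][1]["dob"] is None)
```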

So let's make it more complex:

```
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [
    "Catherine",
    "Thomas",
    "Trevor"
  ],
  "spouse": null
}
```


_OK COOL_ but how do I use it???

In [19]:
with open("./data/john_smith.json") as f:
    data = json.load(f)
    # for item in data:
    #     print(f"key {item} data: {data[item]}")

    # print(f"{data['address']['city']}")
    # print(f"{data['phoneNumbers'][0]['type']} {data['phoneNumbers'][0]['number']}")

    for item in data["phoneNumbers"]:
        print(item["type"])
        print(item["number"])


home
212 555-1234
office
646 555-4567
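
Real data often has missing keys or `null` values (like `"spouse"` in John Smith's record), and indexing with `[]` raises a `KeyError` when a key is absent. A hedged sketch of defensive access with `dict.get`; the `"email"` key is hypothetical and deliberately not in the data:

```python
import json

text = '{"firstName": "John", "spouse": null, "address": {"city": "New York"}}'
data = json.loads(text)

# .get returns None (or a default you choose) instead of raising KeyError
email = data.get("email", "no email on file")  # "email" is a made-up key
spouse = data.get("spouse")                    # present, but null -> None

# Chain .get with an empty-dict default to walk nested objects safely
city = data.get("address", {}).get("city")
postal_code = data.get("address", {}).get("postalCode", "unknown")

print(email, spouse, city, postal_code)
```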


In [30]:
import csv

file_path = "./data/For_Hire_Vehicles__FHV__-_Active_20241120.csv"

with open(file_path, mode='r') as file:
    # Create a CSV reader object
    csv_reader = csv.reader(file)
    count = 0
    data = {}
    uber_count = 0
    total_count = 0
    for row in csv_reader:
        if count == 0:
            print(f"headers : {row}")
        else:
            print(row[6])
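            data[row[6]] = {
                'name':row[2],
                'License Type':row[3],
                'Permit License Number':row[5],   # column 5 in the header list (7 is the VIN)
                'DMV License Plate Number':row[6],
                'Wheelchair Accessible':row[8],
                'Base Name':row[13],              # column 13 is Base Name
                'Base Address':row[18]            # column 18 is Base Address
            }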
            total_count += 1
            if "UBER USA, LLC" in row[13]:
                uber_count += 1
        count += 1
print(f"% of Ubers: {uber_count/total_count*100}")
print(f"# of Ubers: {uber_count}")
# print(json.dumps(data, indent=4))

# plate_number = 'T141456C'
# print(data[plate_number])

headers : ['Active', 'Vehicle License Number', 'Name', 'License Type', 'Expiration Date', 'Permit License Number', 'DMV License Plate Number', 'Vehicle VIN Number', 'Wheelchair Accessible', 'Certification Date', 'Hack Up Date', 'Vehicle Year', 'Base Number', 'Base Name', 'Base Type', 'VEH', 'Base Telephone Number', 'Website', 'Base Address', 'Reason', 'Order Date', 'Last Date Updated', 'Last Time Updated']
T700075C
T116792C
T141456C
T735666C
T119510C
T743716C
T777164C
T769041C
T757651C
T713239C
KTMNYC7
T696415C
T739436C
T103764C
T753217C
YOUSSEF
T708855C
T732147C
T525503C
T784914C
T119305C
T111248C
T777492C
T727672C
T130780C
T131152C
T755074C
T750471C
T685712C
T736196C
T765119C
T792033C
T116893C
T744314C
T748099C
ARS3NAL1
T793737C
T773871C
T689625C
T736206C
T115363C
T745264C
YV23
T733607C
RAYKING
T115044C
T106406C
T799117C
T526604C
T100359C
T763819C
T663331C
T141418C
T667603C
T734435C
T105612C
T786225C
T438552C
T742494C
T670458C
JPVIP85
T104026C
T131674C
T659410C
T769988C
T120022C
T625
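
The loop above skips the header by counting rows by hand. A minimal alternative sketch: pull the header off with `next()` and tally a column with `collections.Counter`. The three sample rows are invented stand-ins for the real CSV, and only two columns are kept to stay short:

```python
import csv
import io
from collections import Counter

# Made-up sample data standing in for the real file
sample = io.StringIO(
    "DMV License Plate Number,Base Name\n"
    "T000001C,\"UBER USA, LLC\"\n"
    "T000002C,SOME OTHER BASE\n"
    "T000003C,\"UBER USA, LLC\"\n"
)

reader = csv.reader(sample)
header = next(reader)                  # first row is the header
base_col = header.index("Base Name")   # look the column up by name

# Count how many rows belong to each base
counts = Counter(row[base_col] for row in reader)
total = sum(counts.values())
uber = counts["UBER USA, LLC"]

print(f"# of Ubers: {uber}")
print(f"% of Ubers: {uber / total * 100:.1f}")
```

Looking the column up by name means the code keeps working if the columns are ever reordered.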

# OK, so let's turn any CSV into JSON that we can parse more easily:



In [None]:
import csv
import json
def convert_csv_to_json(csv_file_path, json_file_path='./data/cars.json'):
    # Read the CSV file; each row becomes a dict keyed by the header row
    with open(csv_file_path, 'r', newline='') as file:
        reader = csv.DictReader(file)
        rows = list(reader)

    # Convert the list of row dicts to a JSON string
    json_data = json.dumps(rows, indent=4)

    # Save JSON data to a file (optional)
    with open(json_file_path, 'w') as json_file:
        json_file.write(json_data)

    return json_data

# Specify the path to your CSV file
csv_file_path = './data/For_Hire_Vehicles__FHV__-_Active_20241120.csv'

# Convert CSV to JSON
json_data = convert_csv_to_json(csv_file_path)

print("Conversion completed. JSON data:")
print(json_data)  # already a JSON string; calling json.dumps again would escape it a second time
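
One thing to watch for: `csv.DictReader` reads every value as a string, so numbers in the resulting JSON come out quoted. A small sketch on in-memory data (the sample row and the choice to cast `Vehicle Year` are assumptions for illustration):

```python
import csv
import io
import json

sample = io.StringIO("Name,Vehicle Year\nEXAMPLE DRIVER,2019\n")  # invented row

rows = list(csv.DictReader(sample))
print(json.dumps(rows))  # the year is still a string at this point

# Cast the fields you know are numeric before dumping
for row in rows:
    row["Vehicle Year"] = int(row["Vehicle Year"])

json_text = json.dumps(rows, indent=4)
print(json_text)
```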

In [None]:
data = { 'first_name': "Dan", 'last_name': "Moore", "more_info":{'isHere':True} }

json_data = json.dumps(data, indent=4)

# Save JSON data to a file (optional)
with open('notebooks/dan_without.json', 'w') as json_file:
    json_file.write(json_data)
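
`json.dump` (no `s`) writes straight to an open file, which saves the intermediate string. A minimal round-trip sketch; it writes to a temporary directory so it doesn't touch your project files, and the file name is arbitrary:

```python
import json
import os
import tempfile

data = {"first_name": "Dan", "last_name": "Moore", "more_info": {"isHere": True}}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "dan.json")

    # Write: json.dump serializes directly into the file object
    with open(path, "w") as json_file:
        json.dump(data, json_file, indent=4)

    # Read it back to confirm nothing was lost
    with open(path) as json_file:
        loaded = json.load(json_file)

print(loaded == data)
```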

In [18]:
%pip install BeautifulSoup4
%pip install requests 

Collecting BeautifulSoup4
  Using cached beautifulsoup4-4.12.3-py3-none-any.whl.metadata (3.8 kB)
Collecting soupsieve>1.2 (from BeautifulSoup4)
  Downloading soupsieve-2.6-py3-none-any.whl.metadata (4.6 kB)
Using cached beautifulsoup4-4.12.3-py3-none-any.whl (147 kB)
Downloading soupsieve-2.6-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, BeautifulSoup4
Successfully installed BeautifulSoup4-4.12.3 soupsieve-2.6
Note: you may need to restart the kernel to use updated packages.
Collecting requests
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting charset-normalizer<4,>=2 (from requests)
  Downloading charset_normalizer-3.4.0-cp310-cp310-macosx_11_0_arm64.whl.metadata (34 kB)
Collecting urllib3<3,>=1.21.1 (from requests)
  Downloading urllib3-2.2.3-py3-none-any.whl.metadata (6.5 kB)
Downloading requests-2.32.3-py3-none-any.whl (64 kB)
Downloading charset_normalizer-3.4.0-cp310-cp310-macosx_11_0_arm64.whl (120 kB)
Downloading urllib3-2.2.3-py3

### Beautiful Soup 
Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

In [19]:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>Some<b>bad<i>HTML", "html.parser")  # name the parser explicitly to avoid a warning
print(soup.prettify())

<p>
 Some
 <b>
  bad
  <i>
   HTML
  </i>
 </b>
</p>



In [20]:
import requests
import json
import csv
import time

In [21]:
url = "https://paulbourke.net/geometry/"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0'}
response = requests.get(url, headers=headers)  # send an HTTP GET request
print(response.text)

<html lang="en">
<head>
<meta name="description" content="Paul Bourke - Geometry, Surfaces, Curves, Polyhedra">
<link rel="StyleSheet" href="../pdbstyle.css" type="text/css" media=all>
<title>Geometry, Surfaces, Curves, Polyhedra</title>
</head>

<body>
<p><br><p>

<center><table width="80%" valign="top" align="center" cellspacing=0 cellpadding=0><tr><td>

   <center><table width=100% border=0 cellspacing="2" cellpadding="0" bgcolor="#cccccc">
   <tr><td valign="center" align="center" bgcolor="#eeeeee">
      <center>

      <h1><a href="http://paulbourke.net">P a u l &nbsp;&nbsp; B o u r k e</a></h1>
      <form action="https://paulbourke.net/cgi-bin/google.cgi" method="post">
      Search:&nbsp;<input type=text size=20 height=0.5 name="Choice" value="">
      <input type="submit" value="Submit">
      </form>

		<p>

      <a href="http://paulbourke.net">paulbourke.net</a> &minus;
      <a href="tel:61433338325">+61&nbsp;(0)433338325</a> &minus;
      <a href="mailto:paul.bourke@gmai

### [HEADERS?](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers)

```
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0'}

```
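
Headers are key/value pairs sent along with every HTTP request. The `User-Agent` header tells the server what kind of client is asking. Some sites turn away the default one that Python libraries send, which is why the examples here borrow a browser's. Below is a sketch that builds a request without sending it, so you can inspect the headers offline. Note that it uses the standard library's `urllib.request` rather than `requests`, and `https://example.com/` is a placeholder URL:

```python
from urllib.request import Request

headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0"}

# Building a Request object does not touch the network
req = Request("https://example.com/", headers=headers)

# urllib stores header names capitalized as "User-agent"
for name, value in req.header_items():
    print(f"{name}: {value}")
```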

### Let's Talk about Scraping
* Why would we want to scrape a whole website?
* What kind of information are we looking for?
* What kind of story do we want to tell?

What I did during the strike. 

In [None]:
data = {}
for i in range(0, 17):
    url = f"https://courses.newschool.edu/?term%5B%5D=202330&campus%5B%5D=GV&page={i}&mode=json&_=1700060222432"
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0'}
    response = requests.get(url, headers=headers)  # send an HTTP GET request
    print(response.text)
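
The `%5B%5D` in that URL is just `[]` percent-encoded. Rather than writing the encoding by hand, `urllib.parse.urlencode` can build the query string from a dictionary. This sketch only constructs the URLs (no requests are sent); the trailing `_` timestamp parameter is left out on the assumption that it is a cache-buster, and `page_url` is a hypothetical helper name:

```python
from urllib.parse import urlencode

base = "https://courses.newschool.edu/"

def page_url(page):
    # Keys containing [] are percent-encoded to %5B%5D automatically
    params = {"term[]": "202330", "campus[]": "GV", "page": page, "mode": "json"}
    return f"{base}?{urlencode(params)}"

urls = [page_url(i) for i in range(0, 17)]
print(urls[0])
print(len(urls))
```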

In [None]:
data = {}
# for i in range(0, 17):
url = "https://stockx.com/nike-air-max-90-se-running-club"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0'}
response = requests.get(url, headers=headers)  # send an HTTP GET request
print(response.text)
# json_data = json.loads(response.text)
soup = BeautifulSoup(response.text, 'html.parser')  # parse the HTML
print(soup)

In [None]:
data = {}
for i in range(0, 17):
    url = f"https://courses.newschool.edu/?term%5B%5D=202330&campus%5B%5D=GV&page={i}&mode=json&_=1670198429800"
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0'}
    response = requests.get(url, headers=headers)  # send an HTTP GET request
    # print(response.text)
    json_data = json.loads(response.text)
    soup = BeautifulSoup(json_data["data"]["attributes"], 'html.parser')  # parse data
    # print(soup)
    # soup = soup.find_all("div",  {"class":"crse_id"})  # gets specific header
    crse_id = soup.select("div[class*=crse_id]")
    titles = soup.select("div[class*=title]")
    credit = soup.select("div[class*=credit]")
    # enumerate supplies its own index, so the page counter i is left alone
    for idx, course in enumerate(crse_id):
        course_num = course.select("p")
        course_num_str = str(course_num[0])
        course_num_str = course_num_str.replace("<p>", "").replace("</p>", "")
        course_credit = credit[idx].select("p")
        course_credit_str = str(course_credit[0])
        course_credit_str = course_credit_str.replace("<p>", "").replace("</p>", "")
        title = titles[idx].select("p")
        title_str = str(title[0])
        title_str = title_str.replace("<p>", "").replace("</p>", "")
        data[course_num_str] = {}
        data[course_num_str]["credits"] = course_credit_str
        data[course_num_str]["title"] = title_str
    
    with open("new_school_course_numbers_with_title_credit_fall2023_nyc.json", "w") as outfile:
        outfile.write(json.dumps(data, indent = 4))
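
Stripping `<p>` tags with `str.replace` works, but Beautiful Soup can hand back the text directly with `get_text()`, and `zip` can walk the three lists in step. A sketch on a hand-written HTML fragment; the markup and the course number below are invented, a guess at the page's shape based on the class names used above rather than a copy of the real response:

```python
from bs4 import BeautifulSoup

# Invented fragment that mimics the class names used in the scraper
html = """
<div class="crse_id"><p>DEMO 1000</p></div>
<div class="title"><p>Example Course</p></div>
<div class="credit"><p>3</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
id_divs = soup.select("div[class*=crse_id]")
title_divs = soup.select("div[class*=title]")
credit_divs = soup.select("div[class*=credit]")

data = {}
for course, title, credit in zip(id_divs, title_divs, credit_divs):
    # get_text returns the text inside the tag, without the markup
    course_num = course.get_text(strip=True)
    data[course_num] = {
        "credits": credit.get_text(strip=True),
        "title": title.get_text(strip=True),
    }

print(data)
```

One difference to keep in mind: `get_text()` also drops any nested tags, whereas the `replace` approach leaves them in the string.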

In [None]:
results = []
total = 0
with open("new_school_course_numbers_with_title_credit_fall2023_nyc.json", "r") as in_file:
    data = json.load(in_file)
    for course_id in data:
        print(course_id)
        url = f"https://courses.newschool.edu/courses/{course_id}/13942/"
        response = requests.get(url, headers=headers)  # send an HTTP GET request
        print(response.text)
        # json_data = json.loads(response.text)
        soup = BeautifulSoup(response.text, 'html.parser')  # parse data
        print(soup)
    output = {}
    count = 0
    for course_id in data:
        print(course_id)
        url = f"https://courses.newschool.edu/?term%5B%5D=202310&page=1&crse_id={course_id}&mode=json&_=1669857183697"
        response = requests.get(url, headers=headers)  # send an HTTP GET request
        # print(response.text)
        json_data = json.loads(response.text)
        soup = BeautifulSoup(json_data["data"]["attributes"], 'html.parser')  # parse data
        try:
            instructor_elements = soup.find_all("div", class_="instructor")
            day_elements = soup.find_all("div", class_="days")
            time_elements = soup.find_all("div", class_="times")
            count = 0
            output[course_id] = []
            for day in day_elements:
                day_num = day.select("p")
                day_num_str = str(day_num[0]).replace("<p>", "").replace("</p>", "").replace("<em class=\"fa fa-calendar\"></em>", "")
                instructor = instructor_elements[count].select("p")
                instructor_str = str(instructor[0]).replace("<p>", "").replace("</p>", "").replace("<b>Faculty</b>: ", "")
                _time_num = time_elements[count].select("p")
                _time_num_str =  str(_time_num[0]).replace("<p>", "").replace("</p>", "")
                section = {}
                section["day"] = day_num_str
                section["time"] = _time_num_str
                section["Instructor"] = instructor_str
                output[course_id].append(section)
                count += 1

            with open("sections_with_instructor_fall2023.json", "w") as outfile:
                outfile.write(json.dumps(output, indent = 4))
        except Exception as e:  # Exception, not BaseException, so Ctrl-C can still stop the loop
            print(e)
        time.sleep(0.333)
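
A long scrape will eventually hit a response that isn't valid JSON (an error page, an empty body). A hedged sketch of a small helper, `parse_json_or_none` (a hypothetical name), that catches only the JSON decoding error, so other problems (and Ctrl-C) still surface. The two sample strings are made up:

```python
import json

def parse_json_or_none(text):
    """Return the parsed JSON, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        print(f"skipping bad response: {e}")
        return None

good = parse_json_or_none('{"data": {"attributes": "<div></div>"}}')
bad = parse_json_or_none("<html>503 Service Unavailable</html>")

print(good["data"]["attributes"])
print(bad)
```

In a loop you would check for `None` and `continue` to the next course.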

In [None]:
with open("sections_with_instructor_fall2023.json", "r") as outfile:
    with open("new_school_course_numbers_with_title_credit_fall2023_nyc.json", "r") as titles:
        with open("people.csv", "r") as people_file:
            with open("fall_2023_contact_hours_for_ptf.csv", "a") as csv_out:
                data = json.load(outfile)
                titles_data = json.load(titles)
                reader = csv.DictReader(people_file)
                csvwriter = csv.writer(csv_out, lineterminator='\n')
                csvwriter.writerow(["INSTRUCTOR","TITLE","COURSE NUMBER", "MEETINGS PER WEEK", "START TIME", "END TIME","PTF"])
                ptf = []
                for row in reader:
                    ptf.append(row["name"])
                instructor = []    
                for course_num in data :
                    for section in data[course_num]:
                        title = ""
                        if course_num in titles_data:
                            title = titles_data[course_num]["title"]                    
                        day = section["day"]
                        day = day.split(",")
                        _time = section["time"]
                        _time = _time.split(" - ")
                        _start_time = ""
                        _end_time = ""
                        _ptf = False
                        if len(_time) > 1:
                            _start_time = _time[0]
                            _end_time = _time[1]
                        if "," in section["Instructor"]:
                            section["Instructor"] = section["Instructor"].replace("and ", "")
                            instructors = section["Instructor"].split(", ")
                            for inst in instructors:
                                if inst not in instructor:
                                    instructor.append(inst)
                                    print(len(instructor))
                                if inst in ptf:
                                    _ptf = True
                                else :
                                    _ptf = False
                                csvwriter.writerow([inst, title,course_num, len(day), _start_time, _end_time, _ptf])
                        elif "and" in section["Instructor"]:
                            instructors = section["Instructor"].split(" and ")
                            for inst in instructors:
                                if inst not in instructor:
                                    instructor.append(inst)
                                    print(len(instructor))
                                if inst in ptf:
                                    _ptf = True
                                else :
                                    _ptf = False
                                csvwriter.writerow([inst, title,course_num, len(day), _start_time, _end_time, _ptf])
                        else:
                            if section["Instructor"] in ptf:
                                _ptf = True
                            else :
                                _ptf = False
                            if section["Instructor"]  not in instructor:
                                instructor.append(section["Instructor"])
                                print(len(instructor))
                            print([section["Instructor"],title,course_num, len(day), _start_time, _end_time, _ptf])
                            csvwriter.writerow([section["Instructor"],title,course_num, len(day), _start_time, _end_time, _ptf])
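
The three branches above all do the same job: turn a string like `"A, B and C"` into a list of names. A sketch that folds them into one helper using a regular expression; `split_instructors` is a hypothetical name, the sample names are invented, and this is an alternative to the code above rather than a line-for-line equivalent:

```python
import re

def split_instructors(text):
    """Split 'A, B and C' style strings into a list of names."""
    # Separators: a comma (optionally followed by "and"), or a standalone " and "
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", text)
    return [p.strip() for p in parts if p.strip()]

print(split_instructors("Ann Lee, Bob Ray and Cy Fox"))
print(split_instructors("Ann Lee and Bob Ray"))
print(split_instructors("Sandy Holland"))  # "and" inside a name is left alone
```

With a helper like this, the row-writing code only needs to appear once, inside a single `for inst in split_instructors(...)` loop.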

In [1]:
%pip install openai

Collecting openai
  Downloading openai-1.54.5-py3-none-any.whl.metadata (24 kB)
Collecting anyio<5,>=3.5.0 (from openai)
  Downloading anyio-4.6.2.post1-py3-none-any.whl.metadata (4.7 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.7.1-cp310-cp310-macosx_11_0_arm64.whl.metadata (5.2 kB)
Collecting pydantic<3,>=1.9.0 (from openai)
  Downloading pydantic-2.9.2-py3-none-any.whl.metadata (149 kB)
Collecting sniffio (from openai)
  Downloading sniffio-1.3.1-py3-none-any.whl.metadata (3.9 kB)
Collecting tqdm>4 (from openai)
  Downloading tqdm-4.67.0-py3-none-any.whl.metadata (57 kB)
Collecting idna>=2.8 (from anyio<5,>=3.5.0->openai)
  Downloading idna-3.10-py3-none-any.whl.metadata (10 kB)
Collecting certifi (from httpx<1,>=0.23.0->openai)
  Downloading certifi-2024.8

In [None]:
from openai import OpenAI

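# Set your OpenAI API key in the OPENAI_API_KEY environment variable.
# OpenAI() reads it from there by default, which keeps the key out of the notebook.
# (You can still pass api_key='...' explicitly, but don't commit or share it.)
client = OpenAI()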

def sarah(user_prompt, jessy_said):
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
                {"role": "system", "content": "You are Sarah, a shy big sister who only wants to be friends with your little sister Jessy."},
                {"role": "user", "content": f"Jessy said this:{jessy_said}\n {user_prompt}"}
            ],
        temperature=0.1
    )
    return completion.choices[0].message.content

def jessy(user_prompt, sarah_said):
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are Jessy, a mean little sister who makes fun of her big sister Sarah."},
            {"role": "user", "content": f"Sarah said this:{sarah_said}\n{user_prompt}"}
        ],        
    )
    return completion.choices[0].message.content 

count = 0
sarah_said = "How was your day?"
while (count < 50):
    jessy_said = jessy("Respond to Sarah's question, make fun of her", sarah_said)
    print(f"JESSY: {jessy_said}")

    sarah_said = sarah("Respond to Jessy and then ask her a question", jessy_said)
    print(f"SARAH: {sarah_said}")
    
    count += 1 




JESSY: Oh, my day was fantastic, unlike yours after I heard you tried to make toast and somehow managed to burn cereal. How do you even do that, Sarah? It's impressive how you can turn the simplest things into a comedy act!
SARAH: Oh wow, Jessy, I guess I have a special talent for turning breakfast into a disaster! Maybe I should stick to cereal without the toaster involved next time. But hey, at least I can make you laugh, right? So, what made your day so fantastic? I'd love to hear all about it!
JESSY: Oh Sarah, you're right, you definitely have a talent—if burning toast was an Olympic sport, you'd totally take home the gold! 😂 But hey, I guess it's good to have reliable breakfast fails to start the day with a laugh. As for my day, it was fantastic because I managed to get through breakfast without setting off the smoke alarm. But don't worry, I'm sure you'll master cereal one day! So proud of you, sis!
SARAH: Oh Jessy, you always know how to make me laugh! 😂 I guess I'll just have t

KeyboardInterrupt: 
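
The output ends with a `KeyboardInterrupt`, meaning the loop was stopped by hand. While you are still tuning prompts, it can help to rehearse the back-and-forth with stand-in functions and a small turn cap before spending API calls. A sketch with no network access; `fake_jessy`, `fake_sarah`, and `converse` are hypothetical names, and the canned replies are placeholders:

```python
def fake_jessy(sarah_said):
    # Stand-in for the API call: just echoes what it was given
    return f"(Jessy teases Sarah about: {sarah_said})"

def fake_sarah(jessy_said):
    # Stand-in for the API call: echoes and asks a follow-up
    return f"(Sarah replies to: {jessy_said}) What else?"

def converse(speaker_a, speaker_b, opening, max_turns=3):
    """Alternate two speakers, feeding each the other's last line."""
    transcript = []
    b_said = opening
    for _ in range(max_turns):
        a_said = speaker_a(b_said)
        transcript.append(("JESSY", a_said))
        b_said = speaker_b(a_said)
        transcript.append(("SARAH", b_said))
    return transcript

transcript = converse(fake_jessy, fake_sarah, "How was your day?", max_turns=2)
for name, line in transcript:
    print(f"{name}: {line}")
```

To switch to the real thing, pass small wrappers such as `lambda said: jessy("Respond to Sarah's question, make fun of her", said)` in place of the stand-ins.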