# Code Toolkit: Python, Fall 2021
## Week 12 — Class notes
* Serialization with JSON
* Scraping and Data with Python

## Serialization with JSON

![JSON](images/json.jpeg)

![JSON](images/yo_dog_json.jpeg)

JSON, or [JavaScript Object Notation](https://en.wikipedia.org/wiki/JSON), is an open standard used well beyond web development. It stores and transmits data as structured text, so separate applications can exchange data and a single application can save and reload its own data. I use JSON almost every day to cache data and load it later. Other file formats are built on top of JSON's open standard (GeoJSON and Jupyter's `.ipynb` notebooks, for example).

Let's look at a simple JSON object:

```
{
    "name": "Dan Moore"
}
```

Let's look at how you load that with Python:

In [1]:
# the most important thing to remember is this
import json #<<<<<<<<<<<

In [2]:
import json

# open the JSON file; the with-block closes it when we are done
with open("./data/dan_moore.json") as file:
    # parse the file contents
    person = json.load(file)

# the result is a Python dictionary:
print(person["name"])

Dan Moore
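`json.load` reads from a file object, while `json.loads` (with an "s") parses a string, and `json.dumps` goes back the other way. A minimal sketch of that round trip, using an inline string instead of the `dan_moore.json` file so it runs anywhere:

```python
import json

# json.loads parses a JSON string (json.load, without the "s", reads a file object)
text = '{"name": "Dan Moore"}'
person = json.loads(text)
print(type(person).__name__)  # dict
print(person["name"])         # Dan Moore

# json.dumps goes the other way: Python dictionary -> JSON string
round_trip = json.dumps(person)
print(round_trip)             # {"name": "Dan Moore"}
```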


Let's make it a little more official:

```
{
    "first_name": "Dan",
    "last_name": "Moore"
}
```
or

```
{
    "firstName": "Dan",
    "lastName": "Moore"
}
```

JSON supports lists (arrays) of items, written with square brackets `[]`. A JSON array looks something like this:

```
{
    "first_name": "Dan",
    "last_name": "Moore",
    "pets":[
        {
            "name": "Voxel",
            "age": 2.5,
            "dob": "04/06/2021",
            "type": "Dog",
            "breed": "Micro-Mini Golden Doodle",
            "color": "Sable",
            "isAlive": true
        },
        {
            "name": "Tux",
            "age": 14,
            "dob": null,
            "type": "Cat",
            "breed": "House",
            "color": "Black and White",
            "isAlive": false
        }
    ]
}
```
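When Python parses JSON like this, each JSON type maps to a Python type: arrays become lists, objects become dictionaries, `true`/`false` become `True`/`False`, and `null` becomes `None`. A small sketch, using a trimmed copy of the pets data as an inline string:

```python
import json

pets_json = """
{
    "pets": [
        {"name": "Voxel", "age": 2.5, "dob": "04/06/2021", "isAlive": true},
        {"name": "Tux", "age": 14, "dob": null, "isAlive": false}
    ]
}
"""

data = json.loads(pets_json)
pets = data["pets"]  # JSON array -> Python list

for pet in pets:
    # numbers come back as int or float, null as None, true/false as bool
    print(pet["name"], type(pet["age"]).__name__, pet["dob"], pet["isAlive"])
```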

So let's make it more complex:

```
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [
    "Catherine",
    "Thomas",
    "Trevor"
  ],
  "spouse": null
}
```


_OK COOL_ but how do I use it???

In [5]:
with open("data/john_smith.json") as f:
    data = json.load(f)
    for item in data:
        print(f"{item} data: {data[item]}")

    #print(f"{data['address']['city']}")
    #print(f"{data['phoneNumbers'][0]['type']} {data['phoneNumbers'][0]['number']}")

    for item in data["phoneNumbers"]:
        print(item["type"])
        print(item["number"])


firstName data: John
lastName data: Smith
isAlive data: True
age data: 27
address data: {'streetAddress': '21 2nd Street', 'city': 'New York', 'state': 'NY', 'postalCode': '10021-3100'}
phoneNumbers data: [{'type': 'home', 'number': '212 555-1234'}, {'type': 'office', 'number': '646 555-4567'}]
children data: ['Catherine', 'Thomas', 'Trevor']
spouse data: None
home
212 555-1234
office
646 555-4567
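Nested values are reached by chaining keys and list indexes, as in the commented-out lines of the cell above. Here is a short sketch of a few common access patterns, using a trimmed inline copy of the record (the `email` key is a made-up example of a missing field):

```python
import json

record = json.loads("""
{
  "address": {"city": "New York", "state": "NY"},
  "phoneNumbers": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "office", "number": "646 555-4567"}
  ],
  "spouse": null
}
""")

# dictionary inside a dictionary
city = record["address"]["city"]

# dictionary inside a list: index first, then key
first_number = record["phoneNumbers"][0]["number"]

# build a lookup from phone type to number
numbers_by_type = {phone["type"]: phone["number"] for phone in record["phoneNumbers"]}

# .get returns a default instead of raising KeyError when a key is missing
email = record.get("email", "unknown")

print(city, first_number, numbers_by_type["office"], email)
```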


### OK, so let's turn any CSV into JSON that we can parse more easily:



In [8]:
import csv
import json
def convert_csv_to_json(csv_file_path):
    # Read CSV file
    with open(csv_file_path, 'r') as file:
        reader = csv.DictReader(file)
        rows = list(reader)

    # Convert CSV data to JSON
    json_data = json.dumps(rows, indent=4)

    # Save JSON data to a file (optional)
    with open('notebooks/squirrels.json', 'w') as json_file:
        json_file.write(json_data)

    return json_data

# Specify the path to your CSV file
csv_file_path = './data/2018_Central_Park_Squirrel_Census_-_Squirrel_Data.csv'

# Convert CSV to JSON
json_data = convert_csv_to_json(csv_file_path)

print("Conversion completed. JSON data:")
print(json_data)

Conversion completed. JSON data:
[
    {
        "X": "-73.9561344937861",
        "Y": "40.7940823884086",
        "Unique Squirrel ID": "37F-PM-1014-03",
        "Hectare": "37F",
        "Shift": "PM",
        "Date": "10142018",
        "Hectare Squirrel Number": "3",
        "Age": "",
        "Primary Fur Color": "",
        "Highlight Fur Color": "",
        "Combination of Primary and Highlight Color": "+",
        "Color notes": "",
        "Location": "",
        "Above Ground Sighter Measurement": "",
        "Specific Location": "",
        "Running": "false",
        "Chasing": "false",
        "Climbing": "false",
        "Eating": "false",
        "Foraging": "false",
        "Other Activities": "",
        "Kuks": "false",
        "Quaas": "false",
        "Moans": "false",
        "Tail flags": "false",
        "Tail twitches": "false",
        "Approaches": "false",
        "Indifferent": "false",
        "Runs from": "false",
        "Other Interactions": "",
       
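Notice that every value in the output is a string, including `"false"` and the coordinates: `csv.DictReader` does not guess types. One possible way to convert values before dumping to JSON is sketched below. The `coerce_value` helper is hypothetical, and the two-line CSV is a stand-in for the real file. A blanket rule like this would also turn numeric-looking IDs or dates (such as `10142018`) into floats, so a real conversion would probably choose columns explicitly.

```python
import csv
import io
import json

def coerce_value(value):
    """Hypothetical helper: convert CSV text to None, bool, or float when it clearly is one."""
    if value == "":
        return None
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    try:
        return float(value)
    except ValueError:
        return value  # leave everything else as text

# a tiny in-memory CSV standing in for the squirrel census file
sample_csv = io.StringIO(
    "X,Y,Unique Squirrel ID,Age,Running\n"
    "-73.9561344937861,40.7940823884086,37F-PM-1014-03,,false\n"
)

rows = [
    {key: coerce_value(val) for key, val in row.items()}
    for row in csv.DictReader(sample_csv)
]

print(json.dumps(rows, indent=4))
```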

In [13]:
data = { 'first_name': "Dan", 'last_name': "Moore", "more_info":{'isHere':True} }

json_data = json.dumps(data, indent=4)

# Save JSON data to a file (optional)
with open('notebooks/dan_without.json', 'w') as json_file:
    json_file.write(json_data)
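`json.dump` (no "s") writes straight to an open file, which saves the separate `dumps` and `write` steps. A minimal sketch that writes the same dictionary and reads it back; the temporary directory is an assumption standing in for the `notebooks/` folder:

```python
import json
import os
import tempfile

data = {"first_name": "Dan", "last_name": "Moore", "more_info": {"isHere": True}}

# a temporary directory stands in for the notebooks/ folder (assumption for this sketch)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dan.json")

    with open(path, "w") as json_file:
        json.dump(data, json_file, indent=4)  # serialize and write in one call

    with open(path) as json_file:
        loaded = json.load(json_file)

print(loaded == data)  # True
```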

In [14]:
%pip install BeautifulSoup4
%pip install requests 

Collecting BeautifulSoup4
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
Collecting soupsieve>1.2
  Downloading soupsieve-2.5-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, BeautifulSoup4
Successfully installed BeautifulSoup4-4.12.2 soupsieve-2.5
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


### Beautiful Soup 
Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

In [15]:
from bs4 import BeautifulSoup
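# name the parser explicitly; otherwise Beautiful Soup picks one and prints a warning
soup = BeautifulSoup("<p>Some<b>bad<i>HTML", "html.parser")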
print(soup.prettify())

<p>
 Some
 <b>
  bad
  <i>
   HTML
  </i>
 </b>
</p>



In [16]:
import requests
import json
import csv
import time

In [18]:
url = "https://paulbourke.net/geometry/"
# identify ourselves as a browser via the User-Agent request header
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0'}
response = requests.get(url, headers=headers)  # send an HTTP GET request
print(response.text)

<html lang="en">
<head>
<meta name="description" content="Paul Bourke - Geometry, Surfaces, Curves, Polyhedra">
<link rel="StyleSheet" href="../pdbstyle.css" type="text/css" media=all>
<title>Geometry, Surfaces, Curves, Polyhedra</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<script language="JavaScript">
<!--
   if (self.location.href != top.location.href) {
      top.location.href = self.location.href;
   }
-->
</script>
<style id="compiled-css" type="text/css">
   input { 
      height:22px;
   }
</style>
</head>
<body>
<p><br><p>
<!-- W H A T    A R E    Y O U   L O O K I N G    F O R   -   Y E S   ,   Y O U -->

<!--
                                                         ...                                                    
                                                       ...::.....                                               
                                            

### [HEADERS?](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers)

```
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0'}

```
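Headers are key/value pairs sent along with every HTTP request. The `User-Agent` header tells the server what kind of client is asking. By default `requests` identifies itself as `python-requests/<version>`, and some sites may answer that differently than they answer a browser. The sketch below builds a request without sending it, so you can inspect what would go out:

```python
import requests

url = "https://paulbourke.net/geometry/"
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0"}

# prepare() builds the request without sending it, so nothing goes over the network
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.headers["User-Agent"])

# what requests would send if we did not override the header
print(requests.utils.default_user_agent())
```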

### Let's Talk about Scraping
* Why would we want to scrape a whole website?
* What kind of information are we looking for?
* What kind of story do we want to tell?

What I did during the strike. 

In [20]:
data = {}
for i in range(0, 17):
    url = f"https://courses.newschool.edu/?term%5B%5D=202330&campus%5B%5D=GV&page={i}&mode=json&_=1700060222432"
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0'}
    response = requests.get(url, headers=headers)  # send an HTTP GET request
    print(response.text)

{"meta":{"skip":0,"limit":100,"page":1,"total":1542,"found":100,"pages":16,"filters":1,"query":null},"links":{"self":"https://courses.newschool.edu/?campus%5B%5D=GV&mode=json&term%5B%5D=202330"},"data":{"type":"html","attributes":"<div class=\"list_row container-fluid p-0 m-0 col col-12 border-bottom\"  id=\"CAML3000\">\n\n  <div class=\"main_content row p-0 pt-2 px-2 m-0 col col-12 m-0\">\n\n    <div class=\"p-0 m-0 pr-2 pb-2 col col-12 col-sm-2\">\n      <div class=\"container col col-12\">\n        <div class=\"row overflow-hidden\">\n\n          <div class=\"row_item p-0 m-0 col col-12\">\n            <p class=\"m-0 p-0\">CAML3000</p>\n          </div>\n\n        </div>\n      </div>\n    </div>\n\n    <div class=\"p-0 m-0 pr-2 pb-2 col col-12 col-sm-3 user-select-none\">\n      <div class=\"col col-12\">\n        <div class=\"row overflow-hidden\">\n\n          <div class=\"row_item p-0 m-0 col col-12\">\n            <p class=\"m-0 p-0\">UG 2 cr. Major Lessons</p>\n          </div
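The `%5B%5D` in that URL is just `[]` percent-encoded, so `term%5B%5D=202330` is the query parameter `term[]=202330`. Rather than hand-writing the encoded string, the query can be built from a dictionary. This is a sketch using the standard library; `build_page_url` is a hypothetical helper, and the trailing `_=...` parameter from the original URL is left out (whether the server needs it is not checked here).

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "https://courses.newschool.edu/"

def build_page_url(page):
    """Hypothetical helper: build the course listing URL for one page of results."""
    params = {"term[]": "202330", "campus[]": "GV", "page": page, "mode": "json"}
    return base_url + "?" + urlencode(params)  # urlencode percent-encodes the brackets

url = build_page_url(1)
print(url)

# decoding the query string recovers the readable parameter names
query = parse_qs(urlsplit(url).query)
print(query)
```

With `requests`, the same dictionary can also be passed directly as `requests.get(base_url, params=params, headers=headers)`.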

In [25]:
url = "https://stockx.com/nike-air-max-90-se-running-club"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0'}
response = requests.get(url, headers=headers)  # send an HTTP GET request
print(response.text)
# json_data = json.loads(response.text)
soup = BeautifulSoup(response.text, 'html.parser')  # parse the HTML
print(soup)

<!DOCTYPE html>


In [22]:
data = {}
for i in range(0, 17):
    url = f"https://courses.newschool.edu/?term%5B%5D=202330&campus%5B%5D=GV&page={i}&mode=json&_=1670198429800"
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0'}
    response = requests.get(url, headers=headers)  # send an HTTP GET request
    # print(response.text)
    json_data = json.loads(response.text)
    soup = BeautifulSoup(json_data["data"]["attributes"], 'html.parser')  # parse data
    # print(soup)
    # soup = soup.find_all("div",  {"class":"crse_id"})  # gets specific header
    crse_id = soup.select("div[class*=crse_id]")
    titles = soup.select("div[class*=title]")
    credit = soup.select("div[class*=credit]")
    # use a separate index for the parallel lists so the page counter `i` is not overwritten
    idx = 0
    for course in crse_id:
        course_num = course.select("p")
        course_num_str = str(course_num[0])
        course_num_str = course_num_str.replace("<p>", "").replace("</p>", "")
        course_credit = credit[idx].select("p")
        course_credit_str = str(course_credit[0])
        course_credit_str = course_credit_str.replace("<p>", "").replace("</p>", "")
        title = titles[idx].select("p")
        title_str = str(title[0])
        title_str = title_str.replace("<p>", "").replace("</p>", "")
        data[course_num_str] = {}
        data[course_num_str]["credits"] = course_credit_str
        data[course_num_str]["title"] = title_str
        idx += 1
    
    with open("new_school_course_numbers_with_title_credit_fall2023_nyc.json", "w") as outfile:
        outfile.write(json.dumps(data, indent = 4))
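The `str(tag).replace("<p>", "")` pattern in the cell above works, but Beautiful Soup can also hand back the text directly with `get_text()`. A small sketch on a made-up snippet shaped roughly like the course rows (an assumption, not the live markup). One difference to be aware of: `get_text()` drops any nested tags, whereas the `replace` approach leaves them in the string.

```python
from bs4 import BeautifulSoup

# made-up snippet shaped like the course listing rows (assumption, not the live markup)
html = """
<div class="row_item crse_id"><p>CAML3000</p></div>
<div class="row_item title"><p>Major <b>Lessons</b></p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# same selector style as the scraper, then get_text instead of str().replace()
course_id = soup.select("div[class*=crse_id]")[0].select("p")[0].get_text(strip=True)

# select_one returns the first match; the " " separator joins text around nested tags
title = soup.select_one("div[class*=title] p").get_text(" ", strip=True)

print(course_id, "|", title)
```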

In [None]:
results = []
total = 0
with open("new_school_course_numbers_with_title_credit_fall2023_nyc.json", "r") as in_file:
    data = json.load(in_file)
    for course_id in data:
        print(course_id)
        url = f"https://courses.newschool.edu/courses/{course_id}/13942/"
        response = requests.get(url, headers=headers)  # send html request
        print(response.text)
        # json_data = json.loads(response.text)
        soup = BeautifulSoup(response.text, 'html.parser')  # parse data
        print(soup)
    output = {}
    count = 0
    for course_id in data:
        print(course_id)
        url = f"https://courses.newschool.edu/?term%5B%5D=202310&page=1&crse_id={course_id}&mode=json&_=1669857183697"
        response = requests.get(url, headers=headers)  # send html request
        # print(response.text)
        json_data = json.loads(response.text)
        soup = BeautifulSoup(json_data["data"]["attributes"], 'html.parser')  # parse data
        try:
            instructor_elements = soup.find_all("div", class_="instructor")
            day_elements = soup.find_all("div", class_="days")
            time_elements = soup.find_all("div", class_="times")
            count = 0
            output[course_id] = []
            for day in day_elements:
                day_num = day.select("p")
                day_num_str = str(day_num[0]).replace("<p>", "").replace("</p>", "").replace("<em class=\"fa fa-calendar\"></em>", "")
                instructor = instructor_elements[count].select("p")
                instructor_str = str(instructor[0]).replace("<p>", "").replace("</p>", "").replace("<b>Faculty</b>: ", "")
                _time_num = time_elements[count].select("p")
                _time_num_str =  str(_time_num[0]).replace("<p>", "").replace("</p>", "")
                section = {}
                section["day"] = day_num_str
                section["time"] = _time_num_str
                section["Instructor"] = instructor_str
                output[course_id].append(section)
                count += 1

            with open("sections_with_instructor_fall2023.json", "w") as outfile:
                outfile.write(json.dumps(output, indent = 4))
        except Exception as e:  # Exception (not BaseException) so Ctrl-C can still stop the loop
            print(e)
        time.sleep(0.333)

In [None]:
with open("sections_with_instructor_fall2023.json", "r") as outfile:
    with open("new_school_course_numbers_with_title_credit_fall2023_nyc.json", "r") as titles:
        with open("people.csv", "r") as people_file:
            with open("fall_2023_contact_hours_for_ptf.csv", "a") as csv_out:
                data = json.load(outfile)
                titles_data = json.load(titles)
                reader = csv.DictReader(people_file)
                csvwriter = csv.writer(csv_out, lineterminator='\n')
                csvwriter.writerow(["INSTRUCTOR","TITLE","COURSE NUMBER", "MEETINGS PER WEEK", "START TIME", "END TIME","PTF"])
                ptf = []
                for row in reader:
                    ptf.append(row["name"])
                instructor = []    
                for course_num in data :
                    for section in data[course_num]:
                        title = ""
                        if course_num in titles_data:
                            title = titles_data[course_num]["title"]                    
                        day = section["day"]
                        day = day.split(",")
                        _time = section["time"]
                        _time = _time.split(" - ")
                        _start_time = ""
                        _end_time = ""
                        _ptf = False
                        if len(_time) > 1:
                            _start_time = _time[0]
                            _end_time = _time[1]
                        if "," in section["Instructor"]:
                            section["Instructor"] = section["Instructor"].replace("and ", "")
                            instructors = section["Instructor"].split(", ")
                            for inst in instructors:
                                if inst not in instructor:
                                    instructor.append(inst)
                                    print(len(instructor))
                                if inst in ptf:
                                    _ptf = True
                                else :
                                    _ptf = False
                                csvwriter.writerow([inst, title,course_num, len(day), _start_time, _end_time, _ptf])
                        elif "and" in section["Instructor"]:
                            instructors = section["Instructor"].split(" and ")
                            for inst in instructors:
                                if inst not in instructor:
                                    instructor.append(inst)
                                    print(len(instructor))
                                if inst in ptf:
                                    _ptf = True
                                else :
                                    _ptf = False
                                csvwriter.writerow([inst, title,course_num, len(day), _start_time, _end_time, _ptf])
                        else:
                            if section["Instructor"] in ptf:
                                _ptf = True
                            else :
                                _ptf = False
                            if section["Instructor"]  not in instructor:
                                instructor.append(section["Instructor"])
                                print(len(instructor))
                            print([section["Instructor"],title,course_num, len(day), _start_time, _end_time, _ptf])
                            csvwriter.writerow([section["Instructor"],title,course_num, len(day), _start_time, _end_time, _ptf])
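The three branches above all do the same job: turn one instructor string into a list of names, then write one row per name. One way to tidy that up is to put the splitting in a single function. `split_instructors` below is a hypothetical helper, not part of the original notebook, and the names are made up. Unlike the original branches, it also separates the last two names in strings like "A, B and C".

```python
import re

def split_instructors(raw):
    """Hypothetical helper: split 'A, B and C' style strings into a list of names."""
    # split on a comma (optionally followed by "and"), or on a standalone " and "
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", raw)
    return [name.strip() for name in parts if name.strip()]

# made-up names for illustration
print(split_instructors("Sandra Lee"))
print(split_instructors("Sandra Lee and Omar Diaz"))
print(split_instructors("Sandra Lee, Omar Diaz, and Mei Chen"))

# the per-name work then needs to be written only once
ptf = {"Omar Diaz"}  # a set makes the membership test fast
for name in split_instructors("Sandra Lee, Omar Diaz and Mei Chen"):
    print(name, name in ptf)
```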

In [29]:
%pip install openai==0.28

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.6-cp39-cp39-macosx_11_0_arm64.whl (349 kB)
Collecting multidict<7.0,>=4.5
  Downloading multidict-6.0.4-cp39-cp39-macosx_11_0_arm64.whl (29 kB)
Collecting aiosignal>=1.1.2
  Downloading aiosignal-1.3.1-py3-none-any.whl (7.6 kB)
Collecting frozenlist>=1.1.1
  Downloading frozenlist-1.4.0-cp39-cp39-macosx_11_0_arm64.whl (46 kB)
Collecting yarl<2.0,>=1.0
  Downloading yarl-1.9.2-cp39-cp39-macosx_11_0_arm64.whl (62 kB)
Collecting attrs>=17.3.0
  Downloading attrs-23.1.0-py3-none-any.whl (61 kB)
Collec

In [36]:
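# NOTE: the OpenAI client class requires openai>=1.0; the 0.28 release does not provide it
from openai import OpenAI

# Set your OpenAI API key (the pre-1.0 style was: openai.api_key = '')
client = OpenAI(api_key='')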

completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a game designer, designing games with math"},
    {"role": "user", "content": "design a game to be played with prime numbers"}
  ]
)

print(completion.choices[0].message)

ChatCompletionMessage(content='Title: Prime Pursuit\n\nGame Concept:\nPrime Pursuit is a fast-paced and strategic math game that challenges players to utilize their knowledge of prime numbers to outwit their opponents. The objective is to collect as many prime number cards as possible by strategically forming and breaking prime number chains.\n\nComponents:\n1. Prime Number Cards: A deck of cards displaying prime numbers ranging from 2 to a predetermined maximum, such as 97.\n2. Game Board: A grid-style game board, consisting of spaces for placing and connecting the prime number cards.\n3. Player Tokens: Unique tokens representing each player, used to mark their progress on the game board.\n\nGameplay:\n1. Setup:\n   - Shuffle the prime number cards and place them face-down as a draw pile.\n   - Each player selects a token and places it on the starting space of the game board.\n   - Deal five cards to each player from the draw pile.\n\n2. Turn Structure:\n   - On their turn, a player m