In [1]:
from dsc80_utils import *

In [2]:
# For JSON evaluation example
def err():
    raise ValueError('i just deleted all your files lol 😂')

# Lecture 9 – HTTP, Midterm Review

## DSC 80, Fall 2023

## 📣 Announcements 📣

- Project 2 due tomorrow!
- Lab 5 released, due Monday.
    - No hidden tests for this lab since it's close to the midterm.
- The midterm exam is this **Thurs, Nov 2**.

## 📝 Midterm Exam

- Thurs, Nov 2 from 3:30-4:50pm in WLH 2005.
- Pen and paper only. No calculators, phones, or watches allowed.
- You are allowed to bring one double-sided 8.5" x 11" sheet of handwritten notes.
    - No reference sheet given, unlike DSC 10!
- We will display clarifications and the time remaining during the exam.
- Covers Lectures 1-8, Labs 1-4, and Projects 1-2.
- To review problems from old exams, go to [practice.dsc80.com](https://practice.dsc80.com).
    - Also look at the [Resources](https://dsc80.com/resources) tab on the course website.

## 📆 Agenda

- [ ] Introduction to HTTP
- [ ] Midterm Review

## Introduction to HTTP

<center><img src="imgs/ds-lifecycle.svg" width="40%"></center>

### Data sources

* Often, the data you need doesn't exist in "clean" `.csv` files.

* **Solution:** Collect your own data!
    - Design and administer your own survey or run an experiment.
    - Find related data on the internet.

- The internet contains **massive** amounts of historical record; for most questions you can think of, the answer exists somewhere on the internet.

### Collecting data from the internet

- There are two ways to programmatically access data on the internet:
    - through an API.
    - by scraping.

- We will discuss the differences between the two approaches, but for now, the important part is that they **both use HTTP**.

### HTTP

- HTTP stands for **Hypertext Transfer Protocol**.
    - It was developed in 1989 by Tim Berners-Lee (and friends).

- It is a **request-response** protocol.
    - Protocol = set of rules.

- HTTP allows...
    - computers to talk to each other over a network.
    - devices to fetch data from "web servers".

- The "S" in HTTPS stands for "secure".

<center><img src='imgs/ucsd.png' width=750></center>

UCSD was a node in ARPANET, the predecessor to the modern internet ([source](https://en.wikipedia.org/wiki/ARPANET#/media/File:Arpanet_map_1973.jpg/)).

### The request-response model

HTTP follows the **request-response** model.

<center><img src='imgs/req-response.png' width=600></center>

- A **request** is made by the **client**.

- A **response** is returned by the **server**.

- **Example:** DuckDuckGo Search 🎥.
    - https://html.duckduckgo.com/html/?q=data%20science
    - Your web browser, a **client**, makes an HTTP **request** with a search query.
    - The **server**, DuckDuckGo, is a computer that is sitting somewhere else.
    - The server returns a **response** that contains the search results.
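
The whole exchange can be sketched on a single machine. The following is a minimal illustration, not how DuckDuckGo works: it uses Python's built-in `http.server` and `http.client` modules to start a toy server on localhost and send it one `GET` request (one of the request methods described next). The names `EchoHandler` and `demo_request` and the reply text are made up for this example.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical toy server: replies with the path it was asked for."""

    def do_GET(self):
        body = f'You requested {self.path}'.encode('utf-8')
        self.send_response(200)  # status line
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()       # blank line that ends the headers
        self.wfile.write(body)   # response body

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def demo_request(path):
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(('127.0.0.1', 0), EchoHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # The client makes a request...
        conn = HTTPConnection('127.0.0.1', port, timeout=5)
        conn.request('GET', path)
        # ...and the server returns a response.
        response = conn.getresponse()
        status = response.status
        text = response.read().decode('utf-8')
        conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return status, text


status, text = demo_request('/html/?q=data%20science')
print(status, text)
```

Here the client and the server happen to be the same computer, but the roles are the same as in the browser example: one side asks, the other side answers.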
    

### Request methods

The request methods you will use most often are `GET` and `POST`; see [Mozilla's web docs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) for a detailed list of request methods.    

- `GET` is used to request data **from** a specified resource.

- `POST` is used to **send** data to the server. 
    - e.g. uploading a photo to Instagram or entering credit card information on Amazon.

### Example `GET` request

Below is an example `GET` HTTP request made by a browser when accessing [datascience.ucsd.edu](https://datascience.ucsd.edu).

```HTTP
GET / HTTP/1.1
Host: datascience.ucsd.edu
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36
Connection: keep-alive
Accept-Language: en-US,en;q=0.9
```

- The first line (`GET / HTTP/1.1`) is called the "request line", and the lines afterwards are called "header fields". Header fields contain metadata. 

- We _could_ also provide a "body" after the header fields.

- To see HTTP requests in Google Chrome, follow [these steps](https://mkyong.com/computer-tips/how-to-view-http-headers-in-google-chrome/).
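
To make the anatomy above concrete, the sketch below assembles the text of an HTTP/1.1 request by hand: a request line, header fields, a blank line, and an optional body. The helper `build_request` is a hypothetical name written for this example; in practice a browser or a library builds this text for you.

```python
def build_request(method, path, headers, body=''):
    """Assemble the raw text of an HTTP/1.1 request (illustrative helper)."""
    request_line = f'{method} {path} HTTP/1.1'
    header_lines = [f'{name}: {value}' for name, value in headers.items()]
    # HTTP separates lines with \r\n; a blank line ends the header fields.
    return '\r\n'.join([request_line, *header_lines, '', body])


raw = build_request(
    'GET',
    '/',
    {'Host': 'datascience.ucsd.edu', 'Connection': 'keep-alive'},
)
print(raw)
```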

### Example `GET` response

The response below was generated by executing the request on the previous slide.

```HTTP
HTTP/1.1 200 OK
Date: Fri, 29 Apr 2022 02:54:41 GMT
Server: Apache
Link: <https://datascience.ucsd.edu/wp-json/>; rel="https://api.w.org/"
Link: <https://datascience.ucsd.edu/wp-json/wp/v2/pages/2427>; rel="alternate"; type="application/json"
Link: <https://datascience.ucsd.edu/>; rel=shortlink
Content-Type: text/html; charset=UTF-8

<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8">
	<link rel="profile" href="https://gmpg.org/xfn/11">
	<style media="all">img.wp-smiley,img.emoji{display:inline !important;border:none
...
```

### Consequences of the request-response model

- When a request is sent to view content on a webpage, the server must:
    - process your request (i.e. prepare data for the response).
    - send content back to the client in its response.

- Remember, servers are computers. 
    - Someone has to pay to keep these computers running.
    - **This means that every time you access a website, someone has to pay.**

## Making HTTP requests

### Making HTTP requests

We'll see two ways to make HTTP requests outside of a browser:

- From the command line, with `curl`.

- **From Python, with the `requests` package.**

### Making HTTP requests using `curl`

[`curl`](https://curl.haxx.se/docs/httpscripting.html) is a **command-line tool** that sends HTTP requests, like a browser.

1. The client, `curl`, sends an HTTP request.
2. The request contains a method (e.g. `GET` or `POST`).
3. The HTTP server responds with:
    - a status line, indicating if things went well, 
    - response headers, and
    - (usually) a response body, containing the requested data.
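
The three parts of a response can be pulled apart with plain string operations. This is a rough sketch on a shortened, hand-typed copy of the kind of response shown earlier; the function name `split_response` is made up for the example, and real clients handle many details (repeated headers, encodings, chunked bodies) that are ignored here.

```python
def split_response(raw):
    """Split raw HTTP response text into (status_code, headers, body)."""
    # A blank line separates the status line and headers from the body.
    head, _, body = raw.partition('\r\n\r\n')
    status_line, *header_lines = head.split('\r\n')
    # The status line looks like: HTTP/1.1 200 OK
    version, code, reason = status_line.split(' ', 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(': ')
        headers[name] = value
    return int(code), headers, body


raw = (
    'HTTP/1.1 200 OK\r\n'
    'Server: Apache\r\n'
    'Content-Type: text/html; charset=UTF-8\r\n'
    '\r\n'
    '<!DOCTYPE html>\n<html lang="en-US">'
)
code, headers, body = split_response(raw)
print(code, headers['Content-Type'])
```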

### Example: `GET` requests via `curl`

- By default, `curl` issues a `GET` request.

```zsh
# `-v` is short for verbose
curl -v https://httpbin.org/html 
```

- Remember, you can run command-line commands in a Jupyter Notebook by placing a `!` before them. Let's try that here.

In [11]:
# Compare the output to what you see when you go to https://httpbin.org/html in your browser!
!curl -v https://httpbin.org/html

*   Trying 34.196.65.63:443...
* Connected to httpbin.org (34.196.65.63) port 443 (#0)
* ALPN: offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* (304) (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=httpbin.org
*  start date: Sep 21 00:00:00 2023 GMT
*  expire date: Oct 18 23:59:59 2024 GMT
*  subjectAltName: host "httpbin.org" matched cert's "httpbin.org"
*  issuer: C=US; O=Amazon; CN=Am

### Queries in a `GET` request

- In order to request more specific information, we can include a **query string** in the URL. `?` begins a query.

<a href="https://www.google.com/search?q=ucsd+dsc+80+hard"><pre>
https://www.google.com/search?q=ucsd+dsc+80+hard
</pre></a>

- This method works well when sending small amounts of data; we will use a similar technique when working with APIs next lecture.

- Be on the lookout for query strings in URLs you share on social media!
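
Query strings can also be built and taken apart in code. The sketch below uses the standard library's `urllib.parse` on the search URL from above; nothing is sent over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Build a query string from key-value pairs; spaces become '+'.
query = urlencode({'q': 'ucsd dsc 80 hard'})
url = 'https://www.google.com/search?' + query
print(url)

# Go the other way: pull the parameters back out of a URL.
params = parse_qs(urlsplit(url).query)
print(params)
```

With `requests`, passing a dictionary as the `params` argument of `requests.get` builds the query string for you in the same spirit.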

### Making HTTP requests using `requests`

- `requests` is a Python module that allows you to use Python to interact with the internet!  
- There are other packages that work similarly (e.g. `urllib`), but `requests` is arguably the easiest to use.

In [12]:
import requests

### Example: `GET` requests via `requests`

To access the source code of the UCSD home page, all we need to run is the following:

```py
requests.get('https://ucsd.edu').text
```

In [17]:
res = requests.get('https://ucsd.edu')
type(res.text)

str

`res` is now a `Response` object.

In [18]:
res

<Response [200]>

The `text` attribute of `res` is a string containing the body of the response.

In [19]:
type(res.text)

str

In [20]:
len(res.text)

46344

In [21]:
print(res.text[:1000])

<!DOCTYPE html>
<html lang="en">
  <head>
  
  

 





    <meta charset="utf-8"/>
    <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
    <meta content="width=device-width, initial-scale=1" name="viewport"/>
    <title>University of California San Diego</title>
    <meta content="University of California, San Diego" name="ORGANIZATION"/>
    <meta content="index,follow,noarchive" name="robots"/>
    <meta content="UCSD" name="SITE"/>
    <meta content="University of California San Diego" name="PAGETITLE"/>
    <meta content="The University California San Diego is one of the world's leading public research universities, located in beautiful La Jolla, California" name="DESCRIPTION"/>
    <link href="favicon.ico" rel="icon"/>


    
  




<!-- Site-specific CSS files -->
    
  <link href="https://www.ucsd.edu/_resources/css/vendor/brix_sans.css" rel="stylesheet" type="text/css"/>
  <link href="https://www.ucsd.edu/_resources/css/vendor/refrigerator_deluxe.css" rel="stylesheet" 

### Example: `POST` requests via `requests`

The following call to `requests.post` makes a `POST` request to https://httpbin.org/post, with a `'name'` parameter of `'King Triton'`.

In [22]:
post_res = requests.post(
    'https://httpbin.org/post',
    data={'name': 'King Triton'}
)

post_res

<Response [200]>

In [23]:
post_res.text

'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "name": "King Triton"\n  }, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate, br", \n    "Content-Length": "16", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.31.0", \n    "X-Amzn-Trace-Id": "Root=1-654184d1-52bd69fd49087463171d0d29"\n  }, \n  "json": null, \n  "origin": "128.54.24.143", \n  "url": "https://httpbin.org/post"\n}\n'

In [24]:
post_res.json()

{'args': {},
 'data': '',
 'files': {},
 'form': {'name': 'King Triton'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate, br',
  'Content-Length': '16',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.31.0',
  'X-Amzn-Trace-Id': 'Root=1-654184d1-52bd69fd49087463171d0d29'},
 'json': None,
 'origin': '128.54.24.143',
 'url': 'https://httpbin.org/post'}
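
The `'form'` entry and the `Content-Length` of `16` above come from how the dictionary passed as `data` is encoded before it is sent. As a small check that does not touch the network, the standard library's `urlencode` produces a form-encoded body of the same length. That `requests` builds the body in this format is inferred from the `application/x-www-form-urlencoded` content type in the output.

```python
from urllib.parse import urlencode

# Form-encode the same dictionary that was passed as `data`.
body = urlencode({'name': 'King Triton'})
print(body)       # name=King+Triton
print(len(body))  # 16
```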

What happens when we try to make a `POST` request to a site that doesn't accept one?

In [25]:
yt_res = requests.post(
    'https://youtube.com',
    data={'name': 'King Triton'})

yt_res

<Response [400]>

In [27]:
yt_res.text[:100]

'<html lang="en" dir="ltr"><head><title>Oops</title><style nonce="SUU7Km6-n0tEXgUNr6DE5g">html{font-f'

`yt_res.text` is a string containing HTML – we can render this in-line using `IPython.display.HTML`.

In [28]:
from IPython.display import HTML

In [29]:
HTML(yt_res.text)

### HTTP status codes

- When we **request** data from a website, the server includes an **HTTP status code** in the response.  

* The most common status code is `200`, which means there were no issues.  

* Other times, you will see a different status code, describing some sort of event or error.
    - Common examples: `400` – bad request, `404` – page not found, `500` – internal server error.
    - [The first digit of a status code describes its general "category".](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)

- See [https://httpstat.us](https://httpstat.us/) for a list of all HTTP status codes.
    - It also has example sites for each status code; for example, https://httpstat.us/404 returns a `404`.
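
As a small illustration of the "first digit" idea, the sketch below looks up the category and the standard phrase for a few codes using the standard library's `http.HTTPStatus`. The `CATEGORIES` dictionary and the `describe` helper are names made up for this example.

```python
from http import HTTPStatus

CATEGORIES = {
    1: 'informational',
    2: 'success',
    3: 'redirection',
    4: 'client error',
    5: 'server error',
}


def describe(code):
    """Return the standard phrase and category for an HTTP status code."""
    category = CATEGORIES[code // 100]  # first digit of the code
    phrase = HTTPStatus(code).phrase    # e.g. 'Not Found' for 404
    return f'{code}: {phrase} ({category})'


for code in [200, 400, 404, 500]:
    print(describe(code))
```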

### Successful requests ✅

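- You can check if a request was successful using the `ok` attribute, which is a bool.
    - A status code in the 200s means the request was successful. (Strictly, `ok` is `True` for any status code below 400, which includes redirects in the 300s.)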

In [32]:
yt_res.status_code, yt_res.ok

(400, False)

In [35]:
post_res.status_code, post_res.ok

(200, True)

- Unsuccessful requests can be re-tried, depending on the issue.
    - Wait a little, then try the request again.
    - You can even re-try requests programmatically (e.g. using a loop). If you are sending requests too quickly, pause between retries (e.g. using `time.sleep`).

- See the [textbook](https://learningds.org/ch/14/web_http.html) for more examples.
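
A retry loop along those lines might look like the sketch below. It is one possible design, not a prescribed one: the function name `get_with_retries`, its parameters, and the choice to double the wait after each failure are all assumptions made for illustration.

```python
import time


def get_with_retries(fetch, max_tries=3, wait=1.0):
    """Call `fetch()` until the response is ok or we run out of tries.

    `fetch` is any zero-argument function that returns an object with an
    `ok` attribute, e.g. `lambda: requests.get(url)`.
    """
    for attempt in range(max_tries):
        response = fetch()
        if response.ok:
            return response
        if attempt < max_tries - 1:
            # Wait longer after each failure: wait, 2 * wait, 4 * wait, ...
            time.sleep(wait * 2 ** attempt)
    return response  # still unsuccessful; the caller should check `ok`
```

For example, `get_with_retries(lambda: requests.get('https://httpbin.org/html'))` would try the request up to three times, pausing 1 second and then 2 seconds between attempts.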

## Data formats

### The data formats of the internet

Responses typically come in one of two formats: HTML or JSON.

- The response body of a `GET` request is usually either JSON (when using an API) or HTML (when accessing a webpage).

- The response body of a `POST` request is usually JSON.

- XML is also a common format, but not as popular as it once was.

<center><img src='imgs/json.png' width=50%></center>

### JSON

- JSON stands for **JavaScript Object Notation**. It is a lightweight format for storing and transferring data.

- It is:
    - very easy for computers to read and write.
    - moderately easy for programmers to read and write by hand.
    - meant to be generated and parsed.

- Most modern languages have an interface for working with JSON objects.
    - JSON objects _resemble_ Python dictionaries (but are not the same!).

### JSON data types

| Type | Description |
| --- | --- |
| String | Anything inside double quotes. |
| Number | Any number (no difference between ints and floats). |
| Boolean | `true` and `false`. |
| Null | JSON's empty value, denoted by `null`. |
| Array | Like Python lists. |
| Object | A collection of key-value pairs, like dictionaries. Keys must be strings, values can be anything (even other objects). |

See [json-schema.org](https://json-schema.org/understanding-json-schema/reference/type.html) for more details.
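
To see how these types arrive in Python, the sketch below parses a small JSON string with the `json` module and prints the Python type of each value. The keys and values in `text` are made up for this example.

```python
import json

text = (
    '{"name": "King Triton", "year": 2023, "gpa": 3.5, '
    '"enrolled": true, "minor": null, "labs": [1, 2, 3]}'
)
parsed = json.loads(text)

# JSON has a single Number type, but Python's parser returns int or float.
for key, value in parsed.items():
    print(key, type(value).__name__)

# Going the other way, True and None become true and null.
print(json.dumps({'enrolled': True, 'minor': None}))
```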

### Example JSON object

See `data/family.json`.

<center><img src='imgs/hierarchy.png' width=50%></center>

In [40]:
import json
from pathlib import Path

f = Path('data') / 'family.json'
family_tree = json.loads(f.read_text())

In [41]:
family_tree.keys()

dict_keys(['name', 'age', 'children'])

In [45]:
family_tree['children'][1]['children'][0]['age']

34
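
Chained indexing works when you know exactly where a value sits. For nested data of unknown depth, a recursive function is a common alternative. The sketch below uses a hand-typed copy of the tree (matching the contents of `data/family.json` as printed in the next section) so that it runs without the file; the names `tree` and `all_names` are made up for the example.

```python
tree = {
    'name': 'Grandma', 'age': 94,
    'children': [
        {'name': 'Dad', 'age': 60,
         'children': [{'name': 'Me', 'age': 24},
                      {'name': 'Brother', 'age': 22}]},
        {'name': 'My Aunt',
         'children': [{'name': 'Cousin 1', 'age': 34},
                      {'name': 'Cousin 2', 'age': 36,
                       'children': [{'name': 'Cousin 2 Jr.', 'age': 2}]}]},
    ],
}


def all_names(person):
    """Collect the name of a person and of all of their descendants."""
    names = [person['name']]
    # Not every person has a 'children' key, so fall back to an empty list.
    for child in person.get('children', []):
        names.extend(all_names(child))
    return names


print(all_names(tree))
```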

### Aside: `eval`

- `eval`, which stands for "evaluate", is a function built into Python.

- It takes in a **string containing a Python expression** and evaluates it in the current context.

In [46]:
x = 4
eval('x + 5')

9

- It seems like `eval` can do the same thing that `json.load` does...

In [47]:
eval(f.read_text())

{'name': 'Grandma',
 'age': 94,
 'children': [{'name': 'Dad',
   'age': 60,
   'children': [{'name': 'Me', 'age': 24}, {'name': 'Brother', 'age': 22}]},
  {'name': 'My Aunt',
   'children': [{'name': 'Cousin 1', 'age': 34},
    {'name': 'Cousin 2',
     'age': 36,
     'children': [{'name': 'Cousin 2 Jr.', 'age': 2}]}]}]}

- But you should **almost never use `eval`**. The next slide demonstrates why.

### `eval` gone wrong

Observe what happens when we use `eval` on a string representation of a JSON object:

In [48]:
f_other = Path('data') / 'evil_family.json'
eval(f_other.read_text())

ValueError: i just deleted all your files lol 😂

- Oh no! Since `evil_family.json`, which could have been downloaded from the internet, contained malicious code, we have now lost all of our files.


- This happened because `eval` **evaluates** all parts of the input string as if it were Python code.

- You never need to do this – instead, use the `.json()` method of a response object, or use the `json` library.

### Using the `json` module

Let's process the same file using the `json` module. Recall:
- `json.load(f)` parses JSON from a file object `f`.
- `json.loads(s)` parses JSON from a **s**tring `s`.
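
The difference between the two functions is only in what they accept as input. In the sketch below, `io.StringIO` stands in for an open file so that the example runs without any files on disk; the JSON text is a shortened, hand-typed sample.

```python
import io
import json

text = '{"name": "Grandma", "age": 94}'

# json.loads: the input is a string.
from_string = json.loads(text)

# json.load: the input is a file object (anything with a .read() method).
from_file = json.load(io.StringIO(text))

print(from_string == from_file)
```

With a real file, the usual pattern is to open it in a `with` block and pass the file object to `json.load`.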

In [49]:
s = f_other.read_text()
s

'{\n    "name": "Grandma",\n    "age": 94,\n    "children": [\n        {\n        "name": err(),\n        "age": 60,\n        "children": [{"name": "Me", "age": 24}, \n                     {"name": "Brother", "age": 22}]\n        },\n        {\n        "name": "My Aunt",\n        "children": [{"name": "Cousin 1", "age": 34}, \n                     {"name": "Cousin 2", "age": 36, "children": \n                        [{"name": "Cousin 2 Jr.", "age": 2}]\n                     }\n                    ]\n        }\n    ]\n}'

In [50]:
json.loads(s)

JSONDecodeError: Expecting value: line 6 column 17 (char 84)

- Since `err()` is not a valid JSON value (it is not a quoted string, a number, or any other JSON type), `json.loads` refuses to parse the text and raises an error instead of running it.

- This "safety check" is intentional.

### Handling _unfamiliar_ data

- Never trust data from an unfamiliar site.

- **Never** use `eval` on "raw" data that you didn't create!

- The JSON data format needs to be **parsed**, not evaluated as a dictionary.
    - It was designed with safety in mind!
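
One way to put this advice into practice is to wrap parsing in a `try`/`except`, so that malformed or suspicious input is reported rather than executed. This is a minimal sketch; the helper name `parse_json_or_none` and the choice to return `None` on failure are assumptions made for the example.

```python
import json


def parse_json_or_none(text):
    """Parse JSON text; return None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        print(f'Could not parse JSON: {e}')
        return None


good = parse_json_or_none('{"name": "Grandma", "age": 94}')
# A function call is not a JSON value, so this is rejected, never run.
bad = parse_json_or_none('{"name": err(), "age": 60}')
print(good, bad)
```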

## APIs and scraping

### Programmatic requests

* We learned how to use the Python `requests` package to exchange data via HTTP.
    - `GET` requests are used to request data **from** a server.
    - `POST` requests are used to **send** data to a server.

* There are two ways of collecting data through a request:
    * By using a published API (application programming interface).
    * By scraping a webpage to collect its HTML source code.

### APIs

An API is a service that makes data directly available to the user in a convenient fashion.

Advantages:

- The data are usually clean, up-to-date, and ready to use.

- The presence of an API signals that the data provider is okay with you using their data.

- The data provider can plan and regulate data usage.
    - Some APIs require you to create an API "key", which is like an account for using the API.
    - APIs can also give you access to data that isn't publicly available on a webpage.

Disadvantages:
- APIs don't always exist for the data you want!

### API terminology

- A URL, or uniform resource locator, describes the location of a website or resource.

- An **API endpoint** is a URL of the data source that the user wants to make requests to.

- For example, on the [Reddit API](https://www.reddit.com/dev/api/):
    * the `/comments` endpoint retrieves information about comments.
    * the `/hot` endpoint retrieves data about posts labeled "hot" right now. 
    - To access these endpoints, you add the endpoint name to the base URL of the API.
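
As a sketch of "base URL plus endpoint", the example below joins the pieces of a URL for the Pokémon API used in the next section. The helper name `endpoint_url` is made up, and only the `pokemon/squirtle` endpoint is taken from this lecture; no request is sent.

```python
BASE_URL = 'https://pokeapi.co/api/v2'


def endpoint_url(*parts):
    """Join the base URL and endpoint pieces with single slashes."""
    cleaned = [str(part).strip('/') for part in parts]
    return '/'.join([BASE_URL, *cleaned])


print(endpoint_url('pokemon', 'squirtle'))
```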

### API requests

- API requests are just `GET`/`POST` requests to a specially maintained URL.
- Let's test out the [Pokémon API](https://pokeapi.co).

First, let's make a `GET` request for `'squirtle'`.

In [51]:
r = requests.get('https://pokeapi.co/api/v2/pokemon/squirtle')
r

<Response [200]>

Remember, the 200 status code is good! Let's take a look at the **content**:

In [56]:
rr = r.json()
rr

{'abilities': [{'ability': {'name': 'torrent',
    'url': 'https://pokeapi.co/api/v2/ability/67/'},
   'is_hidden': False,
   'slot': 1},
  {'ability': {'name': 'rain-dish',
    'url': 'https://pokeapi.co/api/v2/ability/44/'},
   'is_hidden': True,
   'slot': 3}],
 'base_experience': 63,
 'forms': [{'name': 'squirtle',
   'url': 'https://pokeapi.co/api/v2/pokemon-form/7/'}],
 'game_indices': [{'game_index': 177,
   'version': {'name': 'red', 'url': 'https://pokeapi.co/api/v2/version/1/'}},
  {'game_index': 177,
   'version': {'name': 'blue', 'url': 'https://pokeapi.co/api/v2/version/2/'}},
  {'game_index': 177,
   'version': {'name': 'yellow',
    'url': 'https://pokeapi.co/api/v2/version/3/'}},
  {'game_index': 7,
   'version': {'name': 'gold', 'url': 'https://pokeapi.co/api/v2/version/4/'}},
  {'game_index': 7,
   'version': {'name': 'silver',
    'url': 'https://pokeapi.co/api/v2/version/5/'}},
  {'game_index': 7,
   'version': {'name': 'crystal',
    'url': 'https://pokeapi.co/api/

Looks like JSON. We extracted it from the response with the `json` method; passing `r.text` to `json.loads` would also work.

In [58]:
rr.keys()

dict_keys(['abilities', 'base_experience', 'forms', 'game_indices', 'height', 'held_items', 'id', 'is_default', 'location_area_encounters', 'moves', 'name', 'order', 'past_abilities', 'past_types', 'species', 'sprites', 'stats', 'types', 'weight'])

In [60]:
rr['abilities']

[{'ability': {'name': 'torrent',
   'url': 'https://pokeapi.co/api/v2/ability/67/'},
  'is_hidden': False,
  'slot': 1},
 {'ability': {'name': 'rain-dish',
   'url': 'https://pokeapi.co/api/v2/ability/44/'},
  'is_hidden': True,
  'slot': 3}]
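
Since the parsed response is made of ordinary lists and dictionaries, the usual Python tools apply. The sketch below works on a hand-copied version of the output above, so it runs without making a request.

```python
# A hand-copied piece of the response shown above.
abilities = [
    {'ability': {'name': 'torrent',
                 'url': 'https://pokeapi.co/api/v2/ability/67/'},
     'is_hidden': False,
     'slot': 1},
    {'ability': {'name': 'rain-dish',
                 'url': 'https://pokeapi.co/api/v2/ability/44/'},
     'is_hidden': True,
     'slot': 3},
]

# Each entry is a dictionary whose 'ability' key holds another dictionary.
names = [entry['ability']['name'] for entry in abilities]
hidden = [entry['ability']['name'] for entry in abilities if entry['is_hidden']]
print(names, hidden)
```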

Let's try a `GET` request for `'billy'`.

In [61]:
r = requests.get('https://pokeapi.co/api/v2/pokemon/billy')
r

<Response [404]>

We get a `404` error, since there is no Pokémon named `'billy'`!

### Scraping

Scraping is the act of programmatically "browsing" the web, downloading the source code (HTML) of pages that you're interested in extracting data from.

Advantages:

* You can always do it!
    - e.g. Google scrapes webpages in order to make them searchable.

Disadvantages:

- It is often difficult to parse and clean scraped data.
    - Source code often includes a lot of content unrelated to the data you're trying to find (e.g. formatting, advertisements, other text).

- Websites can change often, so scraping code can get outdated quickly.

- Websites may not want you to scrape their data!

- **In general, we prefer APIs, but scraping is a useful skill to learn.**
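
To give a feel for why scraped data needs parsing, the sketch below extracts the page title from a shortened, hand-typed copy of the UCSD source code seen earlier. It uses the standard library's `html.parser` only as one possible approach, and the class name `TitleParser` is made up for the example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Record the text inside the first <title> tag of a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and self.title is None:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.title = data.strip()

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False


html = '''<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8"/>
    <title>University of California San Diego</title>
    <meta content="UCSD" name="SITE"/>
  </head>
</html>'''

parser = TitleParser()
parser.feed(html)
print(parser.title)
```

Even for a single value, the code has to know where in the page to look, which is part of why scraping code breaks when a website's layout changes.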

## Midterm Review

https://app.sli.do/event/8N48uxYMVuraPquxCAQjn1

Submit your topics and previous exam questions. Sam will see which categories are most popular and review those.

## apply vs agg vs transform vs filter