In [None]:
import pandas as pd
import numpy as np
import os

import util

import plotly.express as px
import plotly.figure_factory as ff
pd.options.plotting.backend = 'plotly'

# Lecture 14 – HTTP Basics

## DSC 80, Spring 2023

### Agenda

- Recap: Imputation
- Introduction to HTTP.
- Making HTTP requests.
- Data formats.

## Recap: Imputation

### Example: Heights 🧍📏

In [None]:
heights = pd.read_csv(os.path.join('data', 'midparent.csv'))
heights = (
    heights
    .rename(columns={'childHeight': 'child', 'childNum': 'number'})
    .drop('midparentHeight', axis=1)
)
heights.head()

In [None]:
np.random.seed(42) # So that we get the same results each time (for lecture).
heights_mcar = util.make_mcar(heights, 'child', pct=0.5)
heights_mar = util.make_mar_on_cat(heights, 'child', 'gender', pct=0.5)

### Mean imputation

Suppose the `'child'` column has missing values.

- If `'child'` is MCAR, then fill in each of the missing values using the **mean of the observed values**.

- If `'child'` is MAR dependent on a categorical column, then fill in each of the missing values using the **mean of the observed values in each category**. For instance, if `'child'` is MAR dependent on `'gender'`, we can fill in:
    - missing female `'child'` heights with the observed mean for female children, and
    - missing male `'child'` heights with the observed mean for male children.

- If `'child'` is MAR dependent on a numerical column, then **bin the numerical column to make it categorical**, then follow the procedure above. See Lab 5, Question 5!

- Mean imputation, when done correctly, creates a distribution whose mean is an unbiased estimate of the true distribution's mean, but whose variance is **an underestimate** of the true variance.
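
To make the last point concrete, here is a minimal sketch using only Python's standard library. The heights below are made up for illustration (they are not taken from `midparent.csv`), and `None` stands in for a missing value.

```python
import statistics

# Hypothetical child heights in inches; None marks a missing value.
heights_with_gaps = [60.0, 62.5, None, 65.0, None, 70.0, 71.5, None]

# Keep only the observed (non-missing) values.
observed = [h for h in heights_with_gaps if h is not None]
observed_mean = statistics.mean(observed)

# Mean imputation: replace every missing value with the observed mean.
imputed = [observed_mean if h is None else h for h in heights_with_gaps]

mean_before = statistics.mean(observed)
mean_after = statistics.mean(imputed)
var_before = statistics.variance(observed)
var_after = statistics.variance(imputed)

print(mean_before, mean_after)  # the two means agree (up to rounding)
print(var_before, var_after)    # the imputed variance is smaller
```

Every filled-in value sits exactly at the mean, so it adds nothing to the spread while still increasing the number of data points. That is why the variance shrinks.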

### Conditional mean imputation of MAR data

In [None]:
def mean_impute(ser):
    return ser.fillna(ser.mean())

heights_mar_cond = heights_mar.groupby('gender')['child'].transform(mean_impute).to_frame() # Conditional mean imputation (good, since MAR).
heights_mar_mfilled = heights_mar.fillna(heights_mar['child'].mean()) # Single mean imputation (bad, since MAR).

df_map = {'Original': heights, 'MAR, Unfilled': heights_mar, 
          'MAR, Mean Imputed': heights_mar_mfilled, 'MAR, Conditional Mean Imputed': heights_mar_cond}

util.multiple_kdes(df_map)

The <span style='color:rgb(231,41,138)'><b>pink distribution (conditional mean imputation)</b></span> does a better job of approximating the <span style='color:rgb(27,158,119)'><b>turquoise distribution (the full dataset with no missing values)</b></span> than the <span style='color:rgb(117,112,179)'><b>purple distribution (single mean imputation)</b></span>.

### Probabilistic imputation

Suppose the `'child'` column has missing values.

- If `'child'` is MCAR, then fill in each of the missing values with **randomly selected observed `'child'` heights**.
    - For instance, if there are 5 missing `'child'` values, pick 5 of the not-missing `'child'` values.

- If `'child'` is MAR dependent on a categorical column, sample from the observed values separately for each category.

### Conditional probabilistic imputation of MAR data

In [None]:
def create_imputed(col):
    col = col.copy()
    
    # Find the number of missing child heights for that gender.
    num_null = col.isna().sum()
    
    # Sample num_null observed child heights for that gender.
    fill_values = np.random.choice(col.dropna(), num_null)
    
    # Fill in missing values and return col.
    col[col.isna()] = fill_values
    return col

Let's use `transform` to call `create_imputed` separately on each `'gender'`.

In [None]:
heights_mar_pfilled = heights_mar.copy()
heights_mar_pfilled['child'] = heights_mar.groupby('gender')['child'].transform(create_imputed)
heights_mar_pfilled['child'].head()

In [None]:
df_map['MAR, Conditionally Probabilistically Imputed'] = heights_mar_pfilled
util.multiple_kdes(df_map)

The <span style='color:rgb(102,166,30)'><b>green distribution (conditional probabilistic imputation)</b></span> does the best job of approximating the <span style='color:rgb(27,158,119)'><b>turquoise distribution (the full dataset with no missing values)</b></span>!

_Remember that the graph above is interactive – you can hide/show lines by clicking them in the legend._

### Randomness

- Unlike mean imputation, probabilistic imputation is **random** – each time you run the cell in which imputation is performed, the results could be different.

- If we're interested in estimating some population **parameter** given our (incomplete) sample, it's best not to rely on just a single random imputation.

- **Multiple imputation**: Generate multiple imputed datasets and aggregate the results!
    - Similar to bootstrapping.

### Multiple imputation of MCAR data

Steps:

0. Start with observed and incomplete data. 

1. Create $m$ **imputed** versions of the data through a probabilistic procedure.
    - The imputed datasets are identical for the observed data entries.
    - They differ in the imputed values. 
    - The differences reflect our **uncertainty** about what value to impute.

2. Then, compute parameter estimates on **each** imputed dataset.
    - For instance, the mean, standard deviation, median, etc.

3. Finally, pool the $m$ parameter estimates into one estimate.
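
The steps above can be sketched with the standard library alone. This is an illustration under assumptions, not the lecture's implementation: the data is made up, `impute_once` is a hypothetical helper, and averaging the $m$ estimates is just one simple way to pool them.

```python
import random
import statistics

rng = random.Random(42)  # seeded so the sketch is reproducible

# Step 0: hypothetical observed-but-incomplete data (None = missing).
data = [61.0, None, 64.5, 66.0, None, 69.0, 70.5, None, 72.0, 63.0]
observed = [x for x in data if x is not None]

def impute_once(values, observed_values, rng):
    """Fill each missing entry with a randomly chosen observed value."""
    return [rng.choice(observed_values) if v is None else v for v in values]

# Step 1: create m imputed versions of the data.
m = 100
imputed_datasets = [impute_once(data, observed, rng) for _ in range(m)]

# Step 2: compute the parameter estimate (here, the mean) on each version.
estimates = [statistics.mean(d) for d in imputed_datasets]

# Step 3: pool the m estimates into one (assumption: a plain average).
pooled_mean = statistics.mean(estimates)
print(pooled_mean)
```

The spread of `estimates` reflects how much the answer depends on which values happened to be imputed.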

### Multiple imputation of MCAR data

Let's try this procedure out on the `heights_mcar` dataset.

In [None]:
heights_mcar.head()

Each time we run the following cell, it generates a new imputed version of the `'child'` column.

In [None]:
create_imputed(heights_mcar['child']).head()

Let's run the above procedure 100 times.

In [None]:
mult_imp = pd.concat([create_imputed(heights_mcar['child']).rename(k) for k in range(100)], axis=1)
mult_imp.head()

Let's plot some of the imputed columns from the previous slide.

In [None]:
# Random sample of 15 imputed columns.
mult_imp_sample = mult_imp.sample(15, axis=1)
fig = ff.create_distplot(mult_imp_sample.to_numpy().T, list(mult_imp_sample.columns), show_hist=False, show_rug=False)
fig.update_xaxes(title='child')

Let's look at the distribution of means across the imputed columns.

In [None]:
px.histogram(pd.DataFrame(mult_imp.mean()), nbins=15, histnorm='probability',
             title='Distribution of Imputed Sample Means')

### Summary of imputation techniques

See the end of Lecture 13 for a detailed summary of all imputation techniques that we've seen so far.

## Introduction to HTTP

The material we're covering now is _not_ on the Midterm Exam.

<center><img src="imgs/DSLC.png" width="40%"></center>

### Data sources

* Often, the data you need doesn't exist in "clean" `.csv` files.

* **Solution:** Collect your own data!
    - Design and administer your own survey or run an experiment.
    - Find related data on the internet.

- The internet contains **massive** amounts of historical records; for most questions you can think of, the answer exists somewhere on the internet.

### Collecting data from the internet

- There are two ways to programmatically access data on the internet:
    - through an API.
    - by scraping.

- We will discuss the differences between the two approaches next lecture, but for now, the important part is that they **both use HTTP**.

### HTTP

- HTTP stands for **Hypertext Transfer Protocol**.
    - It was developed in 1989 by Tim Berners-Lee (and friends).

- It is a **request-response** protocol.
    - Protocol = set of rules.

- HTTP allows...
    - computers to talk to each other over a network.
    - devices to fetch data from "web servers".

- The "S" in HTTPS stands for "secure".

<center><img src='imgs/ucsd.png' width=750></center>

UCSD was a node in ARPANET, the predecessor to the modern internet ([source](https://en.wikipedia.org/wiki/ARPANET#/media/File:Arpanet_map_1973.jpg/)).

### The request-response model

HTTP follows the **request-response** model.

<center><img src='imgs/req-response.png' width=600></center>

- A **request** is made by the **client**.

- A **response** is returned by the **server**.

- **Example:** YouTube 🎥.
    - Your phone's web browser, a **client**, makes an HTTP **request** to view a video.
    - The **server**, YouTube, is a computer that is sitting somewhere else.
    - The server returns a **response** that contains the video.

### Request methods

The request methods you will use most often are `GET` and `POST`; see [Mozilla's web docs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) for a detailed list of request methods.    

- `GET` is used to request data **from** a specified resource.

- `POST` is used to **send** data to the server. 
    - e.g. uploading a photo to Instagram or entering credit card information on Amazon.
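
As a rough illustration of the difference, the sketch below builds (but never sends) one request of each kind with the standard library's `urllib`. The httpbin.org URLs and the form field are assumptions chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# A GET request asks for a resource and carries no body.
get_req = Request("https://httpbin.org/get")

# A POST request sends data to the server in its body.
# Here the body is a URL-encoded form with one hypothetical field.
form_body = urlencode({"name": "King Triton"}).encode()
post_req = Request("https://httpbin.org/post", data=form_body)

# Nothing has been sent; we are only inspecting the request objects.
print(get_req.get_method(), get_req.data)
print(post_req.get_method(), post_req.data)
```

`urllib` infers the method from whether a body is attached, which mirrors the idea that `GET` fetches and `POST` sends.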

### Example `GET` request

Below is an example `GET` HTTP request made by a browser when accessing [datascience.ucsd.edu](https://datascience.ucsd.edu).

```HTTP
GET / HTTP/1.1
Host: datascience.ucsd.edu
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36
Connection: keep-alive
Accept-Language: en-US,en;q=0.9
```

- The first line (`GET / HTTP/1.1`) is called the "request line", and the lines afterwards are called "header fields". Header fields contain metadata. 

- We _could_ also provide a "body" after the header fields.

- To see HTTP requests in Google Chrome, follow [these steps](https://mkyong.com/computer-tips/how-to-view-http-headers-in-google-chrome/).
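
To see how the pieces of a request fit together, here is a small sketch that splits a raw request into its request line, header fields, and body. The request text is a shortened version of the example above, and the parsing is deliberately simplistic. It is not how a real server parses HTTP.

```python
# A shortened copy of the example request. Lines end with \r\n,
# and a blank line separates the header fields from the (empty) body.
raw_request = (
    "GET / HTTP/1.1\r\n"
    "Host: datascience.ucsd.edu\r\n"
    "Connection: keep-alive\r\n"
    "Accept-Language: en-US,en;q=0.9\r\n"
    "\r\n"
)

# Split the head (request line + header fields) from the body.
head, _, body = raw_request.partition("\r\n\r\n")
request_line, *header_lines = head.split("\r\n")

# The request line holds the method, the path, and the protocol version.
method, path, version = request_line.split(" ")

# Each header field is a "Name: value" pair of metadata.
headers = dict(line.split(": ", 1) for line in header_lines)

print(method, path, version)
print(headers)
```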

### Example `GET` response

The response below was generated by executing the request on the previous slide.

```HTTP
HTTP/1.1 200 OK
Date: Fri, 29 Apr 2022 02:54:41 GMT
Server: Apache
Link: <https://datascience.ucsd.edu/wp-json/>; rel="https://api.w.org/"
Link: <https://datascience.ucsd.edu/wp-json/wp/v2/pages/2427>; rel="alternate"; type="application/json"
Link: <https://datascience.ucsd.edu/>; rel=shortlink
Content-Type: text/html; charset=UTF-8

<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8">
	<link rel="profile" href="https://gmpg.org/xfn/11">
	<style media="all">img.wp-smiley,img.emoji{display:inline !important;border:none
...
```

### Consequences of the request-response model

- When a request is sent to view content on a webpage, the server must:
    - process your request (i.e. prepare data for the response).
    - send content back to the client in its response.

- Remember, servers are computers. 
    - Someone has to pay to keep these computers running.
    - **This means that every time you access a website, someone has to pay.**

### Example: [istheshipstillstuck.com](https://istheshipstillstuck.com)

<center><img src='imgs/ships.png' width=35%></center>

Read [_Inside a viral website_](https://notfunatparties.substack.com/p/inside-a-viral-website), an account of what it's like to run a site that gained 50 million+ views in 5 days.

## Making HTTP requests

### Making HTTP requests

We'll see two ways to make HTTP requests outside of a browser:

- From the command line, with `curl`.

- **From Python, with the `requests` package.**

### Making HTTP requests using `curl`

[`curl`](https://curl.haxx.se/docs/httpscripting.html) is a **command-line tool** that sends HTTP requests, like a browser.

1. The client, `curl`, sends an HTTP request. 
2. The request contains a method (e.g. `GET` or `POST`).
3. The HTTP server responds with:
    - a status line, indicating if things went well, 
    - response headers, and
    - (usually) a response body, containing the requested data.

### Example: `GET` requests via `curl`

- By default, `curl` issues a `GET` request.

```zsh
# `-v` is short for verbose
curl -v https://httpbin.org/html 
```

- Remember, you can run command-line commands in a Jupyter Notebook by placing a `!` before them. Let's try that here.

In [None]:
# Compare the output to what you see when you go to https://httpbin.org/html in your browser!
!curl -v https://httpbin.org/html

### Queries in a `GET` request

- In order to request more specific information, we can include a **query string** in the URL. `?` begins a query.

<a href="https://www.google.com/search?q=ucsd+dsc+80+hard&client=safari"><pre>
https://www.google.com/search?q=ucsd+dsc+80+hard&client=safari
</pre></a>

- This method works well when sending small amounts of data; we will use a similar technique when working with APIs next lecture.

- Be on the lookout for query strings in URLs you share on social media!
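
Query strings don't have to be typed by hand. As a sketch, the standard library's `urllib.parse` can build the URL above from a dictionary and take it apart again; the dictionary name `params` is just an illustrative choice.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# The key-value pairs we want to send along with the request.
params = {"q": "ucsd dsc 80 hard", "client": "safari"}

# urlencode joins the pairs with & and escapes spaces as +.
url = "https://www.google.com/search?" + urlencode(params)
print(url)

# Going the other way: pull the query string out of a URL and decode it.
# parse_qs returns a list for each key, since keys may repeat.
query = urlsplit(url).query
decoded = parse_qs(query)
print(decoded)
```

The `requests` package accepts a dictionary like this through the `params` argument of `requests.get`, so you rarely need to assemble the string yourself.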

### Making HTTP requests using `requests`

- `requests` is a Python module that allows you to use Python to interact with the internet!  
- There are other packages that work similarly (e.g. `urllib`), but `requests` is arguably the easiest to use.

In [None]:
import requests

### Example: `GET` requests via `requests`

To access the source code of the UCSD home page, all we need to run is the following:

```py
requests.get('https://ucsd.edu').text
```

In [None]:
res = requests.get('https://ucsd.edu')

`res` is now a `Response` object.

In [None]:
res

The `text` attribute of `res` is a string containing the entire response body.

In [None]:
type(res.text)

In [None]:
len(res.text)

In [None]:
print(res.text[:1000])

### Example: `POST` requests via `requests`

The following call to `requests.post` makes a `POST` request to https://httpbin.org/post, with a `'name'` parameter of `'King Triton'`.

In [None]:
post_res = requests.post('https://httpbin.org/post',
                         data={'name': 'King Triton'})

post_res

In [None]:
post_res.text

In [None]:
# More on this shortly!
post_res.json()

What happens when we try to make a `POST` request somewhere where we're unable to?

In [None]:
yt_res = requests.post('https://youtube.com',
                       data={'name': 'King Triton'})

yt_res

`yt_res.text` is a string containing HTML – we can render this in-line using `IPython.display.HTML`.

In [None]:
from IPython.display import HTML

In [None]:
HTML(yt_res.text)

### HTTP status codes

- When we **request** data from a website, the server includes an **HTTP status code** in the response.  

* The most common status code is `200`, which means there were no issues.  

* Other times, you will see a different status code, describing some sort of event or error.
    - Common examples: `400` – bad request, `404` – page not found, `500` – internal server error.
    - [The first digit of a status describes its general "category".](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)

- See [https://httpstat.us](https://httpstat.us/) for a list of all HTTP status codes.
    - It also has example sites for each status code; for example, https://httpstat.us/404 returns a `404`.
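
Python's standard library knows the standard status codes, so you can look them up without a network connection. In the sketch below, `status_category` is a hypothetical helper that illustrates the "first digit" rule; it is not part of any library.

```python
from http import HTTPStatus

def status_category(code):
    """Return the general category implied by a status code's first digit."""
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return categories[code // 100]

# Look up the standard phrase and the category for a few common codes.
summary = {
    code: (HTTPStatus(code).phrase, status_category(code))
    for code in [200, 400, 404, 500]
}
for code, (phrase, category) in summary.items():
    print(code, phrase, category)
```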

In [None]:
yt_res.status_code

### Successful requests ✅

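- You can check if a request was successful using the `ok` attribute, which is a bool.
    - `ok` is `True` whenever the status code is less than 400, so statuses in the 200s (success) and 300s (redirection) count as successful, while statuses in the 400s and 500s do not.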

In [None]:
yt_res.status_code, yt_res.ok

In [None]:
post_res.status_code, post_res.ok

- Unsuccessful requests can be re-tried, depending on the issue.
    - Wait a little, then try the request again.
    - You can even re-try requests programmatically (e.g. using a loop). If the rate of requests is too high, wait between retries (e.g. using `time.sleep`).

- See the [course notes](https://notes.dsc80.com/content/07/requests.html#responsible-use-of-http-requests) for more examples.
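
One possible shape for a retry loop is sketched below. Everything here is illustrative: `get_with_retries` is a hypothetical helper, `fetch` is any function that returns an object with an `ok` attribute (such as `lambda: requests.get(url)`), and `FakeResponse` exists only so the sketch runs without touching the network. Doubling the wait after each failure is one common choice, not a requirement.

```python
import time

def get_with_retries(fetch, max_tries=3, wait_seconds=1.0, sleep=time.sleep):
    """Call fetch() until its response is ok, waiting longer after each failure."""
    for attempt in range(max_tries):
        response = fetch()
        if response.ok:
            return response
        # Don't wait after the final attempt; there is nothing left to try.
        if attempt < max_tries - 1:
            sleep(wait_seconds * 2 ** attempt)
    return response  # the last (unsuccessful) response

class FakeResponse:
    """Stand-in for a real response, used only for this demonstration."""
    def __init__(self, status_code):
        self.status_code = status_code
        self.ok = status_code < 400

# Simulate a server that fails twice, then succeeds.
statuses = iter([503, 503, 200])
waits = []  # record the waits instead of actually sleeping
result = get_with_retries(lambda: FakeResponse(next(statuses)), sleep=waits.append)
print(result.status_code, waits)
```

Passing `sleep` as an argument keeps the demonstration instant; in real use, leave it as the default `time.sleep`.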

## Data formats

### The data formats of the internet

Responses typically come in one of two formats: HTML or JSON.

- The response body of a `GET` request is usually either JSON (when using an API) or HTML (when accessing a webpage).

- The response body of a `POST` request is usually JSON.

- XML is also a common format, but not as popular as it once was.

<center><img src='imgs/json.png' width=50%></center>

### JSON

- JSON stands for **JavaScript Object Notation**. It is a lightweight format for storing and transferring data.

- It is:
    - very easy for computers to read and write.
    - moderately easy for programmers to read and write by hand.
    - meant to be generated and parsed.

- Most modern languages have an interface for working with JSON objects.
    - JSON objects _resemble_ Python dictionaries (but are not the same!).

### JSON data types

| Type | Description |
| --- | --- |
| String | Anything inside double quotes. |
| Number | Any number (no difference between ints and floats). |
| Boolean | `true` and `false`. |
| Null | JSON's empty value, denoted by `null`. |
| Array | Like Python lists. |
| Object | A collection of key-value pairs, like dictionaries. Keys must be strings, values can be anything (even other objects). |

See [json-schema.org](https://json-schema.org/understanding-json-schema/reference/type.html) for more details.
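
To see how these types arrive in Python, here is a small sketch using the built-in `json` module. The JSON text is made up for illustration.

```python
import json

# A hypothetical JSON object using each of the types in the table.
text = """
{
    "name": "King Triton",
    "age": 62,
    "gpa": 3.9,
    "enrolled": true,
    "advisor": null,
    "courses": ["DSC 80", "DSC 40A"]
}
"""

parsed = json.loads(text)

# Record which Python type each JSON value was converted to.
types = {key: type(value).__name__ for key, value in parsed.items()}
print(types)
```

Note that `true` and `null` are not valid Python names; the parser converts them to `True` and `None`. JSON has a single number type, but Python's `json` module returns an `int` or a `float` depending on how the number is written.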

### Example JSON object

See `data/family.json`.

<center><img src='imgs/hierarchy.png' width=50%></center>

In [None]:
import json

# Using `with` closes the file automatically once we're done reading it.
with open(os.path.join('data', 'family.json'), 'r') as f:
    family_tree = json.load(f)

In [None]:
family_tree

In [None]:
family_tree['children'][0]['children'][0]['age']

### Aside: `eval`

- `eval`, which stands for "evaluate", is a function built into Python.

- It takes in a **string containing a Python expression** and evaluates it in the current context.

In [None]:
x = 4
eval('x + 5')

- It seems like `eval` can do the same thing that `json.load` does...

In [None]:
f = open(os.path.join('data', 'family.json'), 'r')
eval(f.read())

- But you should **never use `eval`**. The next slide demonstrates why.

### `eval` gone wrong

Observe what happens when we use `eval` on a string representation of a JSON object:

In [None]:
f_other = open(os.path.join('data', 'evil_family.json'))
eval(f_other.read())

- Oh no! Since `evil_family.json`, which could have been downloaded from the internet, contained malicious code, we have now lost all of our files.


- This happened because `eval` **evaluates** all parts of the input string as if it were Python code.

- You never need to do this – instead, use the `.json()` method of a response object, or use the `json` library.

### Using the `json` module

Let's process the same file using the `json` module. Recall:
- `json.load(f)` parses JSON from a file object `f`.
- `json.loads(s)` parses JSON from a **s**tring `s`.

In [None]:
f_other = open(os.path.join('data', 'evil_family.json'))
s = f_other.read()
s

In [None]:
json.loads(s)

- Since `util.err()` is not a valid JSON value (it isn't a quoted string, a number, or any other JSON type), `json.loads` refuses to parse the text and raises an error instead of running it.

- This "safety check" is intentional.
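
The same behavior can be reproduced without any files. In the sketch below, `suspicious` is a hypothetical stand-in for text downloaded from an untrusted source; it is not the actual contents of `evil_family.json`.

```python
import json

# Looks like JSON, but the value for "age" is a Python function call.
# This is only a string; nothing in it is ever executed here.
suspicious = '{"name": "Grandma", "age": os.system("echo hi")}'

try:
    json.loads(suspicious)
    parsed_ok = True
except json.JSONDecodeError as err:
    # The parser rejects the text as soon as it hits the non-JSON value.
    parsed_ok = False
    print("Could not parse:", err)
```

Because `json.loads` only recognizes JSON syntax, the embedded call is never run; the worst outcome is an exception you can catch.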

### Handling _unfamiliar_ data

- Never trust data from an unfamiliar site.

- **Never** use `eval` on "raw" data that you didn't create!

- The JSON data format needs to be **parsed**, not evaluated as a dictionary.
    - It was designed with safety in mind!

## Summary, next time

### Summary

- HTTP is the protocol the internet uses for transferring information.
- Clients can make `GET` HTTP requests to ask for information and `POST` HTTP requests to send information.
- Servers send responses with the desired information.
- We can use `curl` in the command-line or the `requests` Python module to make HTTP requests.
- The two main file formats used for storing information on the internet are HTML and JSON.
    - JSON objects resemble Python dictionaries, but they are not quite the same. 
    - Use the `.json()` method of a response object or the `json` package to parse them, **not** `eval`.

### Next time

- Using HTTP to make API requests and scrape the web. 
- Parsing HTML files.