# Working With Data From the Internet

<small><a href="https://colab.research.google.com/github/brandeis-jdelfino/cosi-10a/blob/main/lectures/notebooks/17_requests.ipynb">Link to interactive slides on Google Colab</a></small>

# Where is the International Space Station?

# [Right here](http://api.open-notify.org/iss-now.json)

Does that data format look familiar?

It's JSON!

JSON is a very commonly used data exchange format.

## Requests library

We can fetch and work with this data from Python.

In [None]:
import requests
data = requests.get('http://api.open-notify.org/iss-now.json').json()
print(data)
print(f"The ISS is at lat: {data['iss_position']['latitude']}, long: {data['iss_position']['longitude']}")

This code fetches that JSON string, parses it into Python objects, then prints out the latitude/longitude.

## Networking on the Internet

Let's dig into what it means to "send a request" over the internet.

This is a 30,000-foot overview - many details are glossed over.

## Mailing letters

We'll use an analogy: sending a request over the internet is like mailing someone a letter.

A letter is sent to an address, and is contained inside an envelope.

A URL is like the address the letter is being mailed to.

## Anatomy of a URL

`http://video.google.com/videoplay?docid=1234&hl=en`

* `http://`: The "protocol", or "scheme"
   * This specifies the format of the request
* `video.google.com`: The "domain name", or a unique name for a location on the internet
   * Starts after the protocol, and after `://`
* `videoplay`: The "path", or the name of the resource being accessed
   * Starts after the end of the domain, and after a `/`
* `docid=1234&hl=en`: The "query", or parameters related to a request
   * Starts at the end of the path, and after a `?`
   * `docid=1234` and `hl=en` are each individual "parameters", and parameters are always separated by `&`
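
The same breakdown can be reproduced in Python. This is a small sketch using the standard library's `urllib.parse` module; the variable names are just for illustration.

```python
from urllib.parse import urlparse, parse_qs

url = "http://video.google.com/videoplay?docid=1234&hl=en"
parts = urlparse(url)

print(parts.scheme)   # protocol / scheme
print(parts.netloc)   # domain name
print(parts.path)     # path (includes the leading "/")
print(parts.query)    # raw query string

# parse_qs splits the query into individual parameters.
# Each value is a list, because a parameter may appear more than once.
params = parse_qs(parts.query)
print(params)
```

Note that `urlparse` reports the scheme without the `://` separator and keeps the leading `/` on the path.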

## Anatomy of our URL

`http://api.open-notify.org/iss-now.json`

* `http://`: protocol
* `api.open-notify.org`: domain
* `iss-now.json`: path

## IP Addresses and Domain Names

Every computer connected to the internet has a unique IP address. 

* They look like this: `4.36.195.10`

Every domain name has one or more IP addresses associated with it.

* brandeis.edu's IP address is: `76.223.22.0`

The first step of sending a request is to translate the domain to an IP address.


## IP addresses are like formal postal addresses

A domain is like the name of a location that you understand, like "Joe's house" or "the park around the corner from my house".

An IP address is like the actual address of the house or park, including all the necessary details, such as postal code.

## DNS

**DNS** (Domain Name System) is like a big dictionary that maps from domain name to IP address.

The owner of a domain has the ability to specify which IP address(es) a domain maps to.

DNS records are publicly accessible: https://traceroute-online.com/check-dns-records/

All routing of traffic on the internet is done using IP addresses.
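
The domain-to-IP translation can be tried from Python as well. Here is a minimal sketch using the standard library's `socket` module; `lookup_ip` is a made-up helper name, and the example uses `localhost` (a special name for "this computer") so that it works without a network connection.

```python
import socket

def lookup_ip(domain):
    """Ask the operating system's resolver for one IPv4 address of a domain."""
    return socket.gethostbyname(domain)

# "localhost" normally resolves to a 127.x.x.x address on the local machine.
print(lookup_ip("localhost"))
```

Calling `lookup_ip("brandeis.edu")` would return whatever address DNS reports at that moment, which can change over time and may differ from the one shown earlier.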

## DNS is like an address book

DNS is like your address book - it translates from a "place name" to a physical address that the postal service recognizes.

You can't mail a letter to "Joe's house", but you can mail a letter to "123 Python Lane, Boston MA, 02118" (not my real address).

## IP Routing

Requests will typically be passed between many servers on the way to arriving at the specified IP address. 

How this routing works is beyond the scope of this lecture, but we can look up the route a request takes: https://traceroute-online.com/

## IP Routing is like the postal network 

IP routing is like the network of trucks, planes, trains, etc. that is used to get your letter to its final destination. 

Your letter might stop at multiple other post offices along the way.

No one along the way knows or cares what is **in** your letter. They're just delivering an envelope.

You also don't care exactly what route your letter takes, as long as it gets to its destination.

## Load balancing

Eventually, your request ends up at a server with the IP address associated with the domain you requested.

That first server is often a "load balancer", which routes requests to other servers that can actually respond to them.

This routing is different from IP routing, but we don't need to worry about the difference.

Eventually, the request is routed to a server that can generate a response.

## Load balancing is like a campus mail room

If you mail a letter to me at my campus address, the postal service delivers it to the mail room. 

From there, the mail room staff on campus routes the letter to my mail box.

This is similar to the role of a load balancer.

## HTTP Requests

Our protocol is "http", which means we are making an "HTTP request".

**HTTP** stands for "Hypertext Transfer Protocol".

It is the protocol (agreed upon format) for the majority of requests on the internet.

Other common protocols include: 
* `ftp` (file transfer)
* `smtp` and `imap` (email)
* `udp` (a lower-level transport protocol, often used for real-time communication)

## HTTP is the language in which your letter is written

It doesn't have anything to do with how the letter is routed or handled by the postal service, but it is important that your recipient knows how to interpret the letter you wrote.

## Anatomy of an HTTP Request

An HTTP request is just a blob of text describing our request.

It consists of 3 parts:
1. Request line
   * Specifies the HTTP method, and the resource being requested
2. Headers (optional)
   * Directives on how to handle the request
3. Body (optional)
   * The payload of the message, if applicable

Here's what our HTTP request looks like when sent from a browser. It has 8 headers and no body:

```
GET /iss-now.json HTTP/1.1
Accept: text/html,application/xhtml+xml,[...]
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9
Cache-Control: max-age=0
Connection: keep-alive
Host: api.open-notify.org
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) [...]
```
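
To make the three parts concrete, here is a rough sketch that splits request text into its request line, headers, and body. `parse_http_request` is a hypothetical helper written for this illustration, and it simplifies things: real HTTP ends each line with `\r\n` rather than `\n`.

```python
def parse_http_request(raw):
    """Split raw HTTP request text into (method, path, version, headers, body)."""
    # A blank line separates the request line and headers from the body.
    head, _, body = raw.partition("\n\n")
    lines = head.split("\n")

    # Request line: method, resource, and protocol version.
    method, path, version = lines[0].split(" ")

    # Every remaining line is a "Name: value" header.
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(": ")
        headers[name] = value

    return method, path, version, headers, body


# A shortened version of the browser request, with only two headers.
raw_request = (
    "GET /iss-now.json HTTP/1.1\n"
    "Host: api.open-notify.org\n"
    "Connection: keep-alive\n"
    "\n"
)

method, path, version, headers, body = parse_http_request(raw_request)
print(method, path, version)
print(headers)
print(repr(body))
```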

## HTTP Methods

HTTP requests always have a "method".

There are several methods, but the 2 most common are "GET" and "POST".

**GET** requests are typically used to retrieve data.

**POST** requests are typically used to send data.
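
As a small illustration, the sketch below builds one request of each kind without sending anything. It uses the standard library's `urllib.request.Request` so it can run offline; the URL and the data are made up. With the `requests` library, the corresponding calls are `requests.get()` and `requests.post()`.

```python
from urllib.request import Request

# Hypothetical URL, used only to construct the request objects.
url = "http://example.com/guestbook"

# No data attached: this will be sent as a GET.
get_request = Request(url)

# Data attached: this will be sent as a POST, with the data as the body.
post_request = Request(url, data=b"message=hello")

print(get_request.get_method())
print(post_request.get_method())
```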

## Reading and Responding to the Letter

The letter has been delivered to a person who should be able to understand it.

That person will know how to interpret the contents, and how to respond.

That person might still call on other people (servers) to help create a response.

The response will be returned in the same language and format (protocol) as the original letter.

## HTTP Responses

HTTP responses have status codes, headers, and bodies.

Wikipedia has a [full list](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) of status codes. Common codes are:
* `200` - success
* `301` / `302` - permanent / temporary redirect
* `404` - not found
* `500` - server error
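
Python's standard library knows the official names of these codes. The sketch below looks them up with `http.HTTPStatus`; `describe_status` is a made-up helper, and the family labels are informal descriptions based on the first digit of the code.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    families = {
        1: "informational",
        2: "success",
        3: "redirect",
        4: "client error",
        5: "server error",
    }
    phrase = HTTPStatus(code).phrase
    return f"{code} {phrase} ({families[code // 100]})"

for code in [200, 301, 302, 404, 500]:
    print(describe_status(code))
```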

Here's what our HTTP response looks like when received in the browser. It has 6 headers and a body containing a JSON string:

```
HTTP/1.1 200 OK
Server: nginx/1.10.3
Date: Wed, 25 Oct 2023 03:10:09 GMT
Content-Type: application/json
Content-Length: 113
Connection: keep-alive
access-control-allow-origin: *

{"timestamp": 1698203409, "iss_position": {"latitude": "35.2661", "longitude": "110.0297"}, "message": "success"}
```
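
The body is plain text until we parse it. As a sketch, here is the body from the response above run through the standard library's `json` module, with no network involved:

```python
import json

body = '{"timestamp": 1698203409, "iss_position": {"latitude": "35.2661", "longitude": "110.0297"}, "message": "success"}'

# json.loads turns a JSON string into Python dicts, lists, strings and numbers.
data = json.loads(body)
print(type(data))

# The coordinates arrive as strings, so convert them before doing math.
latitude = float(data["iss_position"]["latitude"])
longitude = float(data["iss_position"]["longitude"])
print(latitude, longitude)
```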

## Letter Responses

* `200` is like successfully receiving a reply letter.
* `301` / `302` are a bit like forwarding mail to another address - the postal service handles this for you; similarly most HTTP clients will also handle it for you.
* `404` is like trying to send a letter to an unknown address, and having it returned.
* `500` is like the box of mail containing your letter falling off the truck and getting lost.

The body of the response is like the letter you receive in response to your inquiry.

## All the terminology together

`requests.get('http://api.open-notify.org/iss-now.json')`

We made an HTTP request to the domain `api.open-notify.org`, asking it to `GET` the data for the resource `iss-now.json`.

It returned a response with a `200` (success) status code, and a JSON object in the body.

## Back to the `requests` library...

`requests.get()` returns a `Response` object, which has many fields: [Response](https://requests.readthedocs.io/en/latest/api/#requests.Response)

In [None]:
response = requests.get('http://api.open-notify.org/iss-now.json')
print(response.status_code)
print()
print(response.headers)
print()
print(response.text)

We can use the handy `json()` method, which attempts to parse the body of the response as JSON:

In [None]:
print(type(response.json()))
print(response.json())

## All this networking stuff is great...

But I really just want to know what part of the world the ISS is over, and I can't translate latitude/longitudes in my head.

Let's use another free service, [geocode.xyz/api](http://geocode.xyz/api)

It has several options, but we want the simple one: turn a lat/long into an address.

In [None]:
import json
response = requests.get('http://api.open-notify.org/iss-now.json')
response_json = response.json()

pos = response_json['iss_position']
print(pos)

geocode_url = f"https://geocode.xyz/{pos['latitude']},{pos['longitude']}?json=1"

print(f"Reverse geocoding: {pos=} with {geocode_url=}")
geolocate_response = requests.get(geocode_url)

print(geolocate_response.status_code)
#print(geolocate_response.json())

#pretty print:
print(json.dumps(geolocate_response.json(), indent=2, ensure_ascii=False))

## Meteorites

[JSON Meteorite data from NASA](https://data.nasa.gov/resource/y77d-th95.json)

[Dataset description on NASA's Open Data Portal](https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh)

In [None]:
import requests
meteorites = requests.get('https://data.nasa.gov/resource/gh4g-9sfh.json').json()
print(type(meteorites))

In [None]:
print(len(meteorites))

Hm, exactly 1000. Seems suspicious - maybe we aren't getting all the data.

This required some digging, but NASA's datasets are powered by the "Socrata Data API". Here's [some documentation](https://dev.socrata.com/foundry/data.nasa.gov/gh4g-9sfh). 

A little Google searching unearths this documentation about how to ask for more results:
[Information on paging with Socrata Data API](https://dev.socrata.com/docs/paging.html)

In [None]:
import requests
meteorites = requests.get('https://data.nasa.gov/resource/gh4g-9sfh.json?$limit=50000').json()
print(type(meteorites))

In [None]:
print(len(meteorites))
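
Raising `$limit` works for a dataset of this size. For larger ones, the paging documentation describes combining `$limit` with `$offset` to fetch one page at a time. Below is a rough sketch of that loop. To keep it runnable offline, `fetch_page` is a hypothetical caller-supplied function, and the example uses a fake in-memory dataset rather than the real API.

```python
def fetch_all(fetch_page, page_size=1000):
    """Collect every record by requesting one page at a time.

    fetch_page(limit, offset) is a caller-supplied function (hypothetical here)
    that returns a list of at most `limit` records starting at `offset`.
    """
    records = []
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        records.extend(page)
        # A short page means we have reached the end of the dataset.
        if len(page) < page_size:
            break
        offset += page_size
    return records


# Stand-in for a real API: 2,500 fake records served from a list.
fake_dataset = list(range(2500))

def fake_fetch_page(limit, offset):
    return fake_dataset[offset:offset + limit]

print(len(fetch_all(fake_fetch_page)))
```

Against the real service, `fetch_page` could be a small function that calls `requests.get(url, params={"$limit": limit, "$offset": offset}).json()`.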

In [None]:
print(meteorites[0])

What can we do with this data?

List all meteorites above a certain size.

In [None]:
def list_bigger_than(mites, min_mass):
    results = []
    for m in mites:
        if m['mass'] > min_mass:
            results.append(m)
    return results

In [None]:
list_bigger_than(meteorites, 10000000)

Oh, right, mass is a string in the JSON. We need to convert it.

In [None]:
def list_bigger_than(mites, min_mass):
    results = []
    for m in mites:
        if float(m['mass']) > min_mass:
            results.append(m)
    return results

In [None]:
list_bigger_than(meteorites, 10000000)

Ok, not all meteorites have a mass... 

In [None]:
def list_bigger_than(mites, min_mass):
    results = []
    for m in mites:
        if float(m.get('mass', 0)) > min_mass:
            results.append(m)
    return results

In [None]:
list_bigger_than(meteorites, 10000000)

Let's list all the meteorites in a year. 

In [None]:
def list_in_year(mites, year):
    results = []
    for m in mites:
        if m['year'] == year:
            results.append(m)
    return results

In [None]:
list_in_year(meteorites, 1945)

This data is dirty - not all meteorites have years either. Let's get an idea of how bad the data is.

In [None]:
full_count = len(meteorites)
no_mass = 0
no_year = 0

for m in meteorites:
    if 'year' not in m:
        no_year += 1

for m in meteorites:
    if 'mass' not in m:
        no_mass += 1

print(f"No year: {no_year} ({no_year / full_count * 100:.2f} %)")
print(f"No mass: {no_mass} ({no_mass / full_count * 100:.2f} %)")
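
The same check can be generalized to any list of fields. This is a sketch with a made-up helper, `count_missing`, run on a few invented records rather than the real NASA data:

```python
from collections import Counter

def count_missing(records, fields):
    """Count how many records lack each of the given fields."""
    missing = Counter()
    for record in records:
        for field in fields:
            if field not in record:
                missing[field] += 1
    return missing

# Made-up records for illustration, not real NASA data.
sample = [
    {"name": "A", "mass": "21", "year": "1880-01-01T00:00:00.000"},
    {"name": "B", "year": "1951-01-01T00:00:00.000"},
    {"name": "C"},
]

missing = count_missing(sample, ["mass", "year", "geolocation"])
for field, count in missing.items():
    print(f"No {field}: {count} ({count / len(sample) * 100:.2f} %)")
```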

In [None]:
def list_in_year(mites, year):
    results = []
    for m in mites:
        if m.get('year', 0) == year:
            results.append(m)
    return results

In [None]:
list_in_year(meteorites, 1945)

Suspicious...

In [None]:
print(meteorites[0]['year'])

Aha, it's formatted as a full timestamp. 

We could parse the whole date using the [datetime](https://docs.python.org/3/library/datetime.html) module, but we only care about the year. 

In [None]:
def list_in_year(mites, year):
    results = []
    for m in mites:
        if int(m.get('year', '0000')[:4]) == year:
            results.append(m)
    return results

In [None]:
list_in_year(meteorites, 1945)
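
For comparison, here is a sketch of the `datetime` approach. The timestamp string below is an assumed example of the format, so check it against what `print(meteorites[0]['year'])` shows; `year_of` is a made-up helper name.

```python
from datetime import datetime

def year_of(timestamp):
    """Extract the year from an ISO-style timestamp string."""
    return datetime.fromisoformat(timestamp).year

# Assumed example of the timestamp format used in the dataset.
print(year_of("1945-01-01T00:00:00.000"))
```

Slicing the first four characters is simpler, but parsing the full date would catch malformed timestamps by raising a `ValueError`.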

Maybe we should clean up all the data at once, rather than doing it every time we try to access something.

A class would make sense here:

In [None]:
class Meteorite:
    def __init__(self, name, mass, year, lat, long):
        self.name = name
        self.mass = mass
        self.year = year
        self.lat = lat
        self.long = long

Now we need code to translate from that JSON to a `Meteorite` instance.

It's tempting to add a method to the class that populates the data from JSON. 

This is bad design - we prefer to fully initialize our instances on creation, rather than create them in an unusable state and then populate them.

It's also tempting to make the constructor take the JSON directly. 

This is also bad design - it "couples" our class too tightly to this particular representation of the data by making it impossible to create an instance without writing out the JSON.

Instead, we'll write a function that takes in JSON, and returns a new `Meteorite` instance:

In [None]:
def new_mite_from_json(json_data):
    if 'name' not in json_data or 'year' not in json_data or 'mass' not in json_data or 'geolocation' not in json_data:
        return None

    loc = json_data['geolocation']
    if 'latitude' not in loc or 'longitude' not in loc:
        return None

    return Meteorite(
        name = json_data['name'],
        year = int(json_data['year'][:4]),
        mass = float(json_data['mass']),
        lat = float(loc['latitude']),
        long = float(loc['longitude'])
    )

In [None]:
mites = []
for m in meteorites:
    mite = new_mite_from_json(m)
    if mite:
        mites.append(mite)
print(len(mites))

In [None]:
def list_bigger_than(mites, min_mass):
    results = []
    for m in mites:
        if m.mass > min_mass:
            results.append(m)
    return results

In [None]:
list_bigger_than(mites, 10000000)

One downside to using a class is that we can't easily print all the info out.

Let's define the handy `__str__` method.

In [None]:
class Meteorite:
    def __init__(self, name, mass, year, lat, long):
        self.name = name
        self.mass = mass
        self.year = year
        self.lat = lat
        self.long = long
    
    def __str__(self):
        return f"{self.name} ({self.year}) ({int(self.mass/1000)} kg) [{self.lat}, {self.long}]"

In [None]:
mites = []
for m in meteorites:
    mite = new_mite_from_json(m)
    if mite:
        mites.append(mite)
print(len(mites))

In [None]:
def list_bigger_than(mites, min_mass):
    results = []
    for m in mites:
        if m.mass > min_mass:
            results.append(m)
    return results

In [None]:
for m in list_bigger_than(mites, 10000000):
    print(m)

In [None]:
def list_in_year(mites, year):
    results = []
    for m in mites:
        if m.year == year:
            results.append(m)
    return results

In [None]:
for m in list_in_year(mites, 1945):
    print(m)


Count meteorites by year

In [None]:
def count_by_year(mites):
    years = {}
    for m in mites:
        if m.year in years:
            years[m.year] += 1
        else:
            years[m.year] = 1

    return years

In [None]:
count_by_year(mites)

We can sort by year, and print them out in order:

In [None]:
counts = count_by_year(mites)
sorted_years = sorted(counts.keys(), reverse=True)
for y in sorted_years:
    print(f"{y}: {counts[y]}")

We could also sort by count:

In [None]:
counts = count_by_year(mites)
sorted_counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)
for year, count in sorted_counts:
    print(f"{year}: {count}")
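
Counting things is common enough that the standard library has a tool for it. The sketch below shows an alternative version of `count_by_year` built on `collections.Counter`. To keep it self-contained, it uses `FakeMeteorite`, a minimal stand-in for the `Meteorite` class, with invented years.

```python
from collections import Counter

class FakeMeteorite:
    """Minimal stand-in for the Meteorite class, holding only a year."""
    def __init__(self, year):
        self.year = year

def count_by_year(mites):
    # Counter tallies how many times each year appears.
    return Counter(m.year for m in mites)

sample = [FakeMeteorite(1945), FakeMeteorite(2003), FakeMeteorite(1945)]
counts = count_by_year(sample)
print(counts)

# most_common(n) returns the n largest counts, already sorted.
print(counts.most_common(1))
```

A `Counter` behaves like a dictionary, so the sorting code above works on it unchanged.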

Find all meteorites within a certain distance.

In [None]:
def list_within_distance(mites, lat, long, miles):
    results = []
    for m in mites:
        if distance(m.lat, m.long, lat, long) <= miles:
            results.append(m)

    return results

def distance(lat1, long1, lat2, long2):
    # ???
    pass

It turns out that accurately calculating the distance between 2 lat/long pairs is **really** complicated.

The formulas for calculating the surface distance between 2 points on a sphere are a [bit complicated](https://en.wikipedia.org/wiki/Great-circle_distance). 
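
For a sense of what the spherical version involves, here is a sketch of the haversine formula, one common way to compute great-circle distance. The function name is made up, and the Earth radius is an approximate mean value, so the results are approximate too.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # approximate mean radius of the Earth

def haversine_miles(lat1, long1, lat2, long2):
    """Great-circle distance in miles, treating the Earth as a perfect sphere."""
    # The trig functions expect radians, not degrees.
    lat1, long1, lat2, long2 = map(radians, (lat1, long1, lat2, long2))
    dlat = lat2 - lat1
    dlong = long2 - long1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlong / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# One degree of longitude along the equator.
print(haversine_miles(0, 0, 0, 1))
```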

But the Earth isn't perfectly spherical, so if we really want to be accurate, we need some [serious math](https://en.wikipedia.org/wiki/Geodesics_on_an_ellipsoid)!

Luckily, someone else has written this for us. We can try the [geopy](https://github.com/geopy/geopy) library. 

Their homepage even contains a code sample for getting the geodesic distance between 2 points.

In [None]:
from geopy.distance import geodesic

def distance(lat1, long1, lat2, long2):
    return geodesic((lat1, long1), (lat2, long2)).miles

def list_within_miles(mites, lat, long, miles):
    results = []
    for m in mites:
        d = distance(m.lat, m.long, lat, long)
        if d <= miles:
            results.append((d, m))

    return results

In [None]:
brandeis_lat = 42.3654
brandeis_long = -71.2586
results = list_within_miles(mites, brandeis_lat, brandeis_long, 150)
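# Sort by distance only. A plain sort() would try to compare Meteorite
# objects whenever two distances are equal, which raises a TypeError.
results.sort(key=lambda r: r[0])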

In [None]:
for r in results:
    print(f"{r[1]}, {int(r[0])} miles")

You may have noticed that the `geopy` library also provides functions to geocode.

These functions are just wrappers around a service like [geocode.xyz](https://geocode.xyz), which we used earlier.

In [None]:
# Usage instructions on the geopy homepage: https://github.com/geopy/geopy
from geopy.geocoders import Nominatim

def geocode(lat, long):
    geolocator = Nominatim(user_agent="brandeis_cosi_10a")
    location = geolocator.reverse(f"{lat}, {long}")
    return location

print(results[0][1])
print(geocode(results[0][1].lat, results[0][1].long))

# Wrap up

We're starting to see the real power of Python, and coding in general. We can automate tasks and data processing that would be tedious and painful to do by hand.

Check out this [list of free JSON data sets](https://github.com/jdorfman/awesome-json-datasets) - try fetching one of them and doing something interesting. Or, try joining together the data from more than one!