# APIs and Scraping

Overview of today's topic:

  - What are APIs and how do you work with them?
  - Geocoding place names and addresses
  - Reverse-geocoding coordinates
  - Looking up places near some location
  - Web scraping when no API is provided
  - Using data portals programmatically

To follow along with this lecture, you need a working Google API key to use the Google Maps Geocoding API and the Google Places API Web Service. These APIs require you to set up billing info, but we won't use them in class beyond the free threshold.

In [None]:
import geopandas as gpd
import folium
import osmnx as ox
import pandas as pd
import re
import requests
import time
from bs4 import BeautifulSoup
from geopy.geocoders import GoogleV3

from keys import google_api_key

# define a pause duration between API requests
pause = 0.1

An API is an application programming interface. It provides a structured way to send commands or requests to a piece of software. "API" often refers to a web service API. This is like a web site (but designed for applications, rather than humans, to use) that you can send requests to in order to execute commands or query for data. Today, REST APIs are the most common. To use them, you simply send them a request, and they reply with a response, similar to how a web browser works. The request is sent to an endpoint (a URL), typically with a set of parameters that provide the details of your query or command.
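
To make "endpoint plus parameters" concrete, here is a minimal sketch that assembles a request URL by hand using only the standard library. The endpoint `https://example.com/api/search` and its `q` and `limit` parameters are made up for illustration; a real API's documentation tells you which parameters it accepts.

```python
from urllib.parse import urlencode


def build_url(endpoint, params):
    """Append URL-encoded query parameters to an endpoint URL."""
    return '{}?{}'.format(endpoint, urlencode(params))


# hypothetical endpoint and parameters, for illustration only
url = build_url('https://example.com/api/search',
                {'q': 'Los Angeles, CA', 'limit': 5})
print(url)
```

Spaces and commas in the parameter values are percent-encoded so the URL stays valid. The `requests` package can do the same encoding for you if you pass a dict through the `params` argument of `requests.get`.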

In the example below, we make a request to the [ipify](https://www.ipify.org/) API and request a JSON formatted response. Then we look up the location of the IP address it returned, using the [ip-api](https://ip-api.com/) API.

In [None]:
# what is your current public IP address?
url = 'https://api.ipify.org?format=json'
data = requests.get(url).json()
data

In [None]:
# and what is the location of that IP address?
url = 'http://ip-api.com/json/{}'.format(data['ip'])
requests.get(url).json()

What's the current weather? Use the [National Weather Service](https://www.weather.gov/documentation/services-web-api) API.

In [None]:
# query for the forecast url for a pair of lat-lng coords
location = '34.019268,-118.283554'
url = 'https://api.weather.gov/points/{}'.format(location)
data = requests.get(url).json()

# extract the forecast url and retrieve it
forecast_url = data['properties']['forecast']
forecast = requests.get(forecast_url).json()

# convert the forecast to a dataframe
pd.DataFrame(forecast['properties']['periods']).head()

You can use any web service's API in this same basic way: request the URL with some parameters. Read the API's documentation to know how to use it and what to send. You can also use many web services through a Python package that makes complex services easier to work with. For example, there's a fantastic package called [cenpy](https://cenpy-devs.github.io/cenpy/) that makes downloading and working with US census data super easy.

## 1. Geocoding

"Geocoding" means converting a text description of some place (such as the place's name or its address) into geographic coordinates identifying the place's location on Earth. These geographic coordinates may take the form of a single latitude-longitude coordinate pair, or a bounding box, or a boundary polygon, etc.
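
As a rough sketch of how these forms relate, the snippet below derives a bounding box and its center point from a boundary polygon's vertices. The vertex coordinates are made up (a small rectangle in Los Angeles), and the function names are illustrative. Note that a geocoder's single lat-lng result is not necessarily the center of the bounding box.

```python
def bounding_box(vertices):
    """Return (south, west, north, east) for a list of (lat, lng) vertices."""
    lats = [lat for lat, lng in vertices]
    lngs = [lng for lat, lng in vertices]
    return min(lats), min(lngs), max(lats), max(lngs)


def bbox_center(bbox):
    """Return the (lat, lng) midpoint of a (south, west, north, east) box."""
    south, west, north, east = bbox
    return (south + north) / 2, (west + east) / 2


# made-up boundary polygon vertices, as (lat, lng) pairs
vertices = [(34.018, -118.291), (34.025, -118.291),
            (34.025, -118.280), (34.018, -118.280)]
bbox = bounding_box(vertices)
center = bbox_center(bbox)
print(bbox, center)
```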

### 1a. Geocoding place names with OpenStreetMap via OSMnx

[OpenStreetMap](https://www.openstreetmap.org/) is a worldwide mapping platform that anyone can contribute to. [OSMnx](https://github.com/gboeing/osmnx) is a Python package to work with OpenStreetMap for geocoding, downloading geospatial data, and modeling/analyzing networks. OpenStreetMap and OSMnx are free to use and do not require an API key. We'll work with OSMnx more in a couple weeks.

In [None]:
# geocode a place name to lat-lng
place = 'University of Southern California'
latlng = ox.geocode(place)
latlng

In [None]:
# geocode a series of place names to lat-lng
places = pd.Series(['San Diego, California',
                    'Los Angeles, California',
                    'San Francisco, California',
                    'Seattle, Washington',
                    'Vancouver, British Columbia'])
coords = places.map(ox.geocode)

In [None]:
# parse out lats and lngs to individual columns in a dataframe
pd.DataFrame({'place': places,
              'lat': coords.map(lambda x: x[0]),
              'lng': coords.map(lambda x: x[1])})

Instead of lat-lng coordinates, we can also geocode place names to their *boundaries* with OSMnx. This essentially looks up the place in OpenStreetMap's database (note: that means the place has to exist in its database!) then returns its details, including geometry and bounding box, as a GeoPandas GeoDataFrame. We'll review GeoDataFrames next week.

In [None]:
# geocode a list of place names to a GeoDataFrame
# by default, OSMnx retrieves the first [multi]polygon object
# specify which_result=1 to retrieve the top match, regardless of geometry type
gdf_places = ox.geocode_to_gdf(places.to_list(), which_result=1)
gdf_places

In [None]:
# geocode a single place name to a GeoDataFrame
gdf = ox.geocode_to_gdf(place)
gdf

In [None]:
# extract the value from row 0's geometry column
polygon = gdf['geometry'].iloc[0]
polygon

Use OSMnx to query for geospatial entities within USC's boundary polygon. You can specify what kind of entities to retrieve by using a `tags` dictionary. In a couple weeks we'll see how to model street networks within a place's boundary.

In [None]:
# get all the buildings within that polygon
tags = {'building': True}
gdf_bldg = ox.geometries_from_polygon(polygon, tags)
gdf_bldg.shape

In [None]:
# plot the building footprints
fig, ax = ox.plot_footprints(gdf_bldg)

In [None]:
# now it's your turn
# get all the building footprints within santa monica


### 1b. Geocoding addresses to lat-lng

You can geocode addresses as well with OpenStreetMap, but it can be a little hit-or-miss compared to the data coverage of commercial closed-source services.

In [None]:
# geocode an address to lat-lng
address = '704 S Alvarado St, Los Angeles, California'
latlng = ox.geocode(address)
latlng

We will use the Google Maps geocoding API. Their geocoder is very powerful, but you do have to pay for it beyond a certain threshold of free usage.

Documentation: https://developers.google.com/maps/documentation/geocoding/start

In [None]:
locations = pd.DataFrame(['704 S Alvarado St, Los Angeles, CA',
                          '100 Larkin St, San Francisco, CA',
                          '350 5th Ave, New York, NY'], columns=['address'])
locations

In [None]:
# function accepts an address string, sends it to Google API, returns lat-lng result
def geocode(address, print_url=False):
    
    # pause for some duration before each request, to not hammer their server
    time.sleep(pause)
    
    # api url with placeholders to fill in with variables' values
    url_template = 'https://maps.googleapis.com/maps/api/geocode/json?address={}&key={}'
    url = url_template.format(address, google_api_key)
    if print_url: print(url)
    
    # send request to server, get response, and convert json string to dict
    data = requests.get(url).json()
    
    # if results were returned, extract lat-lng from top result
    if len(data['results']) > 0:
        lat = data['results'][0]['geometry']['location']['lat']
        lng = data['results'][0]['geometry']['location']['lng']
        
        # return lat-lng as a string
        return '{},{}'.format(lat, lng)

In [None]:
# test the function
geocode('350 5th Ave, New York, NY')

In [None]:
# for each value in the address column, geocode it, save results as new column
locations['latlng'] = locations['address'].map(geocode)
locations

In [None]:
# parse the result into separate lat and lng columns, if desired
locations[['lat', 'lng']] = pd.DataFrame(data=locations['latlng'].str.split(',').to_list())
locations
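
The split above assumes every address was geocoded successfully. Since the `geocode` function returns `None` when no results come back, a more defensive parse might look like the sketch below. The helper name `parse_latlng` and the sample values are illustrative.

```python
def parse_latlng(latlng):
    """Convert a 'lat,lng' string to a (lat, lng) tuple of floats.

    Returns (None, None) for anything that is not a string, such as
    the None that comes back when geocoding finds no results.
    """
    if not isinstance(latlng, str):
        return None, None
    lat, lng = latlng.split(',')
    return float(lat), float(lng)


# made-up sample: one successful result and one failed lookup
results = ['40.7484,-73.9857', None]
parsed = [parse_latlng(r) for r in results]
print(parsed)
```

This also converts the coordinates to floats, whereas splitting the string alone leaves them as text.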

In [None]:
# now it's your turn
# create a new pandas series of 3 addresses and use our function to geocode them
# then create a new pandas series of 3 famous site names and use our function to geocode them
# create new variables to contain your work so as to not overwrite the locations df


## 2. Google Places API

We will use Google's Places API to look up places in the vicinity of some location.
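
The API filters by distance on the server side, but it can be useful to check distances yourself. Here is a minimal sketch of the haversine great-circle distance, assuming a spherical Earth with a mean radius of 6,371 km (an approximation, so expect small errors). The function name `haversine_m` is illustrative.

```python
import math


def haversine_m(lat1, lng1, lat2, lng2, radius_m=6371000):
    """Great-circle distance in meters between two lat-lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


# distance spanned by one degree of latitude along a meridian
one_degree = haversine_m(0, 0, 1, 0)
print(round(one_degree))
```

With a function like this, you could keep only the results whose distance from the search location falls within your radius in meters.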

Documentation: https://developers.google.com/places/web-service/intro

In [None]:
# google places API URL, with placeholders
url_template = 'https://maps.googleapis.com/maps/api/place/search/json?keyword={}&location={}&radius={}&key={}'

# what keyword to search for
keyword = 'restaurant'

# define the radius (in meters) for the search
radius = 500

# define the location coordinates
location = '34.019268,-118.283554'

In [None]:
# add our variables into the url, submit the request to the api, and load the response
url = url_template.format(keyword, location, radius, google_api_key)
response = requests.get(url)
data = response.json()

In [None]:
# how many results did we get?
len(data['results'])

In [None]:
# inspect a result
data['results'][0]

In [None]:
# turn the results into a dataframe of places
places = pd.DataFrame(data=data['results'],
                      columns=['name', 'geometry', 'rating', 'vicinity'])
places.head()

In [None]:
# parse out lat-long and return it as a series
# this creates a dataframe of all the results when you .apply()
def parse_coords(geometry):
    if isinstance(geometry, dict):
        lng = geometry['location']['lng']
        lat = geometry['location']['lat']
        return pd.Series({'lat':lat, 'lng':lng})
    
# test our function
places['geometry'].head().apply(parse_coords)

In [None]:
# now run our function on the whole dataframe and save the output to 2 new dataframe columns
places[['lat', 'lng']] = places['geometry'].apply(parse_coords)
places_clean = places.drop('geometry', axis='columns')

In [None]:
# sort the places by rating
places_clean = places_clean.sort_values(by='rating', ascending=False)
places_clean.head(10)

In [None]:
# now it's your turn
# find the five highest-rated bars within 1/2 mile of pershing square
# create new variables to contain your work so as to not overwrite places and places_clean


## 3. Reverse geocoding

Reverse geocoding, as you might expect from its name, does the opposite of regular geocoding: it takes a pair of coordinates on the Earth's surface and looks up what address or place corresponds to that location.

We'll use Google's reverse geocoding API. Documentation: https://developers.google.com/maps/documentation/geocoding/intro#ReverseGeocoding

As we saw with OSMnx, you often don't have to query the API yourself manually: many popular APIs have dedicated Python packages to work with them. You *can* do this manually, just like in the previous Google examples, but it's a little more complicated to parse Google's address component results. If we just want addresses, we can use [geopy](https://geopy.readthedocs.io/) to interact with Google's API for us.

In [None]:
# we'll use the points from the Places API, but you could use any point data here
# copy the subset so that adding columns later does not trigger a SettingWithCopyWarning
points = places_clean[['lat', 'lng']].head().copy()
points

In [None]:
# create a column to put lat-lng into the format google likes
points['latlng'] = points.apply(lambda row: '{},{}'.format(row['lat'], row['lng']), axis='columns')
points.head()

In [None]:
# tell geopy to reverse geocode using Google's API and return address
def reverse_geopy(latlng):
    time.sleep(pause)
    geocoder = GoogleV3(api_key=google_api_key)
    address, _ = geocoder.reverse(latlng, exactly_one=True)
    return address

In [None]:
# now reverse-geocode the points to addresses
points['address'] = points['latlng'].map(reverse_geopy)
points.head()

### What if you just want the city or state?
You could try to parse the address strings, but you're relying on them always having a consistent format. This might not be the case if you have international location data. In this case, you should call the API manually and extract the individual address components you are interested in.

In [None]:
# pass the Google API latlng data to reverse geocode it
def reverse_geocode(latlng):
    time.sleep(pause)
    url_template = 'https://maps.googleapis.com/maps/api/geocode/json?latlng={}&key={}'
    url = url_template.format(latlng, google_api_key)
    response = requests.get(url)
    data = response.json()
    if len(data['results']) > 0:
        return data['results'][0]
    
geocode_results = points['latlng'].map(reverse_geocode)

In [None]:
geocode_results.iloc[0]

Now look inside each reverse geocode result to see if address_components exists. If it does, look inside each component to see if we can find the city or the state. Google calls the city name by the abstract term 'locality' and the state name by the abstract term 'administrative_area_level_1'. This lets them use consistent terminology anywhere in the world.
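
To make that structure concrete, here is a hand-written dict that mimics the shape of a single reverse geocode result, along with a sketch of a generic lookup by component type. The address is made up and the helper `get_component` is illustrative; real results contain more fields than shown here.

```python
# hand-written sample mimicking the shape of one geocode result
sample_result = {
    'formatted_address': '123 Example St, Los Angeles, CA 90007, USA',
    'address_components': [
        {'long_name': '123', 'short_name': '123',
         'types': ['street_number']},
        {'long_name': 'Example St', 'short_name': 'Example St',
         'types': ['route']},
        {'long_name': 'Los Angeles', 'short_name': 'Los Angeles',
         'types': ['locality', 'political']},
        {'long_name': 'California', 'short_name': 'CA',
         'types': ['administrative_area_level_1', 'political']},
        {'long_name': 'United States', 'short_name': 'US',
         'types': ['country', 'political']},
    ],
}


def get_component(geocode_result, component_type, key='long_name'):
    """Return the first address component of the given type, or None."""
    if not geocode_result:
        return None
    for component in geocode_result.get('address_components', []):
        if component_type in component['types']:
            return component[key]
    return None


print(get_component(sample_result, 'locality'))
print(get_component(sample_result, 'administrative_area_level_1', 'short_name'))
```

Each component carries a list of types, so the lookup checks membership in that list rather than comparing against a single value.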

In [None]:
def get_city(geocode_result):
    if 'address_components' in geocode_result:
        for address_component in geocode_result['address_components']:
            if 'locality' in address_component['types']:
                return address_component['long_name']

def get_state(geocode_result):
    if 'address_components' in geocode_result:
        for address_component in geocode_result['address_components']:
            if 'administrative_area_level_1' in address_component['types']:
                return address_component['long_name']

In [None]:
# now map our functions to extract city and state names
points['city'] = geocode_results.map(get_city)                
points['state'] = geocode_results.map(get_state)
points.head()

In [None]:
# now it's your turn
# write a new function get_neighborhood() to parse the neighborhood name and add it to the points df


## 4. Web Scraping

If you need data from a web page that doesn't offer an API, you can scrape it. Note that many web sites prohibit scraping in their terms of use, so proceed respectfully and cautiously. Web scraping means downloading a web page, parsing individual data out of its HTML, and converting those data into a structured dataset.
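
To make those three steps concrete, here is a minimal sketch that uses only Python's standard-library `html.parser` on a made-up HTML snippet. The table contents and the `TableParser` class are illustrative. Dedicated parsing packages make this far more convenient; the point is only to show what "parsing data out of HTML" involves.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # only keep text that appears inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# made-up HTML standing in for a downloaded web page
sample_html = """
<table>
  <tr><th>arena</th><th>capacity</th></tr>
  <tr><td>Example Arena</td><td>18,000</td></tr>
  <tr><td>Sample Center</td><td>19,500</td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample_html)

# first row is the header; the rest become records keyed by column name
header, *records = parser.rows
table = [dict(zip(header, record)) for record in records]
print(table)
```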

For straightforward web scraping tasks, you can use the powerful BeautifulSoup package. However, some web pages load content dynamically using JavaScript. For such complex web scraping tasks, consider using the Selenium browser automation package.

In this example, we'll scrape https://en.wikipedia.org/wiki/List_of_National_Basketball_Association_arenas

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_National_Basketball_Association_arenas'
response = requests.get(url)
html = response.text

In [None]:
# look at the html string
html[5000:7000]

In [None]:
# parse the html
soup = BeautifulSoup(html, features='html.parser')
#soup

In [None]:
rows = soup.find('tbody').findAll('tr')
#rows
#rows[1]

In [None]:
data = []
for row in rows[1:]:
    cells = row.findAll('td')
    d = [cell.text.strip('\n') for cell in cells[1:-1]]
    data.append(d)

In [None]:
cols = ['arena', 'city', 'team', 'capacity', 'opened']
df = pd.DataFrame(data=data, columns=cols).dropna()
df

In [None]:
# strip out all the wikipedia notes in square brackets
# the non-greedy .*? also catches multi-character notes such as [10]
df = df.applymap(lambda x: re.sub(r'\[.*?\]', '', x))
df
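
Here is a quick sketch of how bracket-stripping patterns behave, using made-up strings. The pattern `\[.\]` matches exactly one character between the brackets, while the non-greedy `\[.*?\]` matches a bracketed note of any length.

```python
import re

# made-up cell values with wikipedia-style notes
single_char_note = '19,812[a]'
multi_char_note = '2016[12]'

# one-character pattern: strips [a] but leaves [12] untouched
one_char = re.compile(r'\[.\]')
print(one_char.sub('', single_char_note))
print(one_char.sub('', multi_char_note))

# non-greedy pattern: strips notes of any length
any_length = re.compile(r'\[.*?\]')
print(any_length.sub('', single_char_note))
print(any_length.sub('', multi_char_note))
```

Any note left behind would make the later conversion to integer fail, so it is worth checking the cleaned values before converting.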

In [None]:
# convert capacity and opened to integer
df['capacity'] = df['capacity'].str.replace(',', '')
df[['capacity', 'opened']] = df[['capacity', 'opened']].astype(int)

In [None]:
df.sort_values('capacity', ascending=False)

Web scraping is really hard! It takes lots of practice. If you want to use it, read the BeautifulSoup and Selenium documentation carefully, and then practice, practice, practice. You'll be an expert before long.

## 5. Data Portals

Many governments and agencies now open up their data to the public through a data portal. These often offer APIs to query them for real-time data. This example uses the LA Open Data Portal. Browse the portal for public datasets: https://data.lacity.org/browse

Let's look at parking meter data for those that have sensors telling us if they're currently occupied or vacant: https://data.lacity.org/A-Livable-and-Sustainable-City/LADOT-Parking-Meter-Occupancy/e7h6-4a3e

In [None]:
# define API endpoint
url = 'https://data.lacity.org/resource/e7h6-4a3e.json'

# request the URL and download its response
response = requests.get(url)

# parse the json string into a Python dict
data = response.json()
len(data)
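
A single request may return only part of a large dataset. The sketch below assumes this portal is backed by Socrata's SODA API, which accepts `$limit` and `$offset` query parameters for paging; check the portal's API documentation before relying on this. The helper `paged_urls` is illustrative and only builds the URLs; it does not send any requests.

```python
from urllib.parse import urlencode


def paged_urls(endpoint, page_size, n_pages):
    """Yield one URL per page, using SODA-style $limit/$offset parameters."""
    for page in range(n_pages):
        params = {'$limit': page_size, '$offset': page * page_size}
        # keep the literal $ in the parameter names rather than encoding it
        yield '{}?{}'.format(endpoint, urlencode(params, safe='$'))


endpoint = 'https://data.lacity.org/resource/e7h6-4a3e.json'
urls = list(paged_urls(endpoint, page_size=1000, n_pages=3))
for u in urls:
    print(u)
```

You could then request each URL in turn, pausing between requests, and concatenate the resulting lists before building the dataframe.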

In [None]:
# turn the json data into a dataframe
df = pd.DataFrame(data)
df.shape

In [None]:
df.columns

In [None]:
df.head()

We have parking space ID, occupancy status, and reporting time. But we don't know where these spaces are! Fortunately the LA GeoHub has sensor location data: http://geohub.lacity.org/datasets/parking-meter-sensors/data

In [None]:
# define API endpoint
url = 'https://opendata.arcgis.com/datasets/723c00530ea441deaa35f25e53d098a8_16.geojson'

# request the URL and download its response
response = requests.get(url)

# parse the json string into a Python dict
data = response.json()
len(data['features'])

In [None]:
# turn the geojson data into a geodataframe
gdf = gpd.GeoDataFrame.from_features(data)
gdf.shape

In [None]:
# what columns are in our data?
gdf.columns

In [None]:
gdf.head()

In [None]:
# now merge sensor locations with current occupancy status
parking = pd.merge(left=gdf, right=df, left_on='SENSOR_UNIQUE_ID', right_on='spaceid', how='inner')
parking.shape

In [None]:
parking = parking[['occupancystate', 'geometry', 'ADDRESS_SPACE']]

# extract lat and lon from geometry column
parking['lon'] = parking['geometry'].x
parking['lat'] = parking['geometry'].y

parking

In [None]:
# how many vacant vs occupied spots are there right now?
parking['occupancystate'].value_counts()

In [None]:
# map it
vacant = parking[parking['occupancystate'] == 'VACANT']
ax = vacant.plot(c='b', markersize=1, alpha=0.5)

occupied = parking[parking['occupancystate'] == 'OCCUPIED']
ax = occupied.plot(ax=ax, c='r', markersize=1, alpha=0.5)

That's impossible to see! At this scale, all the vacant spots are obscured by occupied spots next to them. It would be much better if we had an interactive map. We'll use folium more in coming weeks to create interactive web maps, but here's a preview.

In [None]:
# create leaflet web map centered/zoomed to downtown LA
m = folium.Map(location=(34.05, -118.25), zoom_start=15, tiles='cartodbpositron')

# add blue markers for each vacant spot
cols = ['lat', 'lon', 'ADDRESS_SPACE']
for lat, lng, address in vacant[cols].values:
    folium.CircleMarker(location=(lat, lng), radius=5, color='#3186cc',
                        fill=True, fill_color='#3186cc', tooltip=address).add_to(m)

# add red markers for each occupied spot
for lat, lng, address in occupied[cols].values:  
    folium.CircleMarker(location=(lat, lng), radius=5, color='#dc143c',
                        fill=True, fill_color='#dc143c', tooltip=address).add_to(m)

In [None]:
# now view the web map we created
m

## Individual exercise

1. Visit the LA data portal (link provided above) or another data portal and identify a different data set of interest
1. Download it using Python as we did above
1. Clean the data set if necessary and calculate descriptive stats for 2 or more columns
1. Map the data. Do you see any patterns of interest?