# Lab 7 - Introduction to Web APIs
---

Material adapted from CYPLAN 101, [D-Lab Web API Workshop](https://github.com/dlab-berkeley/Python-Web-APIs)

In today's lab, we are going to download data from the internet using an API. API stands for application programming interface. Companies often create APIs as a way to allow users to more directly interact with their servers to retrieve data. Today, we are going to be using CKAN's API to download data from the City of Toronto's Open Data Portal to get some experience working with larger datasets.

## What is a web API?

APIs are often official services offered by companies and other entities, which allow you to directly query their servers in order to retrieve their data. Platforms like The New York Times, Twitter and Reddit offer APIs to retrieve data. In the case of this lab, you'll be working with Toronto Open Data stored on a CKAN instance, with its API documented [here](https://docs.ckan.org/en/latest/api/).

## What if one isn't available? 

Then you would do web scraping, which generally requires parsing through HTML tags (BeautifulSoup is a popular library to help with that), simulating browser clicks using Selenium (creates an instance of a browser you can 'drive' using Python), and/or simply inspecting front end content. If you're curious to try it out, there is a workshop available from Berkeley's D-Lab [here](https://github.com/dlab-berkeley/Python-Web-Scraping) that you can use as a guideline for your own web scraping adventures.

## Why wouldn't an API be available?

Building and maintaining an API takes time, money and staff, and there may be little user or developer interest in one. Do note that the absence of an API does not necessarily mean you have free rein to scrape a webpage.


## Imports
---

In [1]:
# Run this cell to set up your notebook
# standard library
import csv
import json
import re
import sys
import warnings
import zipfile
from io import StringIO

# third-party packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests

# helper functions written for this lab
from ckan_utils import *

%matplotlib inline

# Downloading data from the Internet

**Note:**

If you are having trouble with any of the following cells, fear not. You can read in an already downloaded version of the dataset(s) later on in the lab.

In [None]:
# toronto public library info

# Toronto Open Data is stored in a CKAN instance. Its APIs are documented here:
# https://docs.ckan.org/en/latest/api/

# To use the API, you'll be making requests to:
base_url = "https://ckan0.cf.opendata.inter.prod-toronto.ca"

# Datasets are called "packages". Each package can contain many "resources"
# To retrieve the metadata for this package and its resources, use the package name in this page's URL:
url = base_url + "/api/3/action/package_show"
params = { "id": "library-branch-general-information"}
package = requests.get(url, params = params).json()
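
It can help to see what `requests` does with the `params` dictionary: it URL-encodes the key/value pairs and appends them to the endpoint as a query string. Below is a minimal sketch of that step using only the standard library. The names `endpoint` and `full_url` are introduced here for illustration and are not used elsewhere in the lab.

```python
from urllib.parse import urlencode

base_url = "https://ckan0.cf.opendata.inter.prod-toronto.ca"
endpoint = base_url + "/api/3/action/package_show"
params = {"id": "library-branch-general-information"}

# Build the URL by hand: endpoint + "?" + encoded key=value pairs
full_url = endpoint + "?" + urlencode(params)
print(full_url)
```

Pasting the printed URL into a browser should show the same JSON that `requests.get(...).json()` parses for you.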

In [None]:
package

This is an example of another `python` data structure called a *dictionary*. Dictionaries store *values* by associating each one with a *key* rather than with an integer index. You can access a value with bracket notation, just as you would with a list, except that a key goes between the brackets instead of a position. For example:

In [None]:
# In this dictionary, the keys are strings, and the values are all numbers
d = {'a': 1,
    'b': 2,
    'c': 3}

d['a']

In the case of `package`, it is an example of a nested dictionary. To reach a value inside it, we chain keys: the first key returns an inner dictionary, and the second key looks up a value in that inner dictionary. From the documentation, we see that the `result` of each package contains a list under the key `resources`.
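
To make the chaining concrete, here is a small sketch on a hand-written dictionary that has the same overall shape as a CKAN response. The dataset name, ids and formats are placeholders, not real portal output.

```python
# A hand-written miniature with the same shape as a CKAN response.
# The values are placeholders, not real portal output.
toy_package = {
    "success": True,
    "result": {
        "name": "example-dataset",
        "resources": [
            {"id": "abc-123", "format": "CSV"},
            {"id": "def-456", "format": "JSON"},
        ],
    },
}

inner = toy_package["result"]      # first key returns the inner dictionary
resources = inner["resources"]     # second key returns a list of dictionaries

# Keys and a list index can be chained in a single expression
first_format = toy_package["result"]["resources"][0]["format"]
print(first_format)
```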

In [None]:
package.keys()

In [None]:
# print the metadata
[x for x in package["result"]["resources"]]

There are some important fields here to take note of that will guide how you download the information through the API. Note that the first resource has `datastore_active == True`. This means an instance of the data is stored in the Open Data portal's database. Not every resource has this value set to `True`; as you can see, the resources offered as downloadable `csv`, `json`, or `xml` files do not. For now, we will download the instance where this is true, but later in the lab we will learn what to do when the data is stored elsewhere.
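
As a rough illustration of how that flag can drive the download logic, the sketch below splits a resource list into the two cases. The list `toy_resources` is hypothetical and only mimics the shape of `package["result"]["resources"]`.

```python
# Hypothetical resource list shaped like package["result"]["resources"]
toy_resources = [
    {"id": "abc-123", "format": "CSV", "datastore_active": True},
    {"id": "def-456", "format": "XLSX", "datastore_active": False},
]

# Ids of resources that live in the portal's datastore
datastore_ids = [r["id"] for r in toy_resources if r["datastore_active"]]

# Everything else has to be fetched from its own download url instead
file_ids = [r["id"] for r in toy_resources if not r["datastore_active"]]

print(datastore_ids, file_ids)
```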

In [None]:
# To get resource data:
# iterate over the resources
for idx, resource in enumerate(package["result"]["resources"]):

    # set a condition for when you want to access the resource:
    if resource["datastore_active"]:

        # to get all records in CSV format, append the resource id to the base_url
        url = base_url + "/datastore/dump/" + resource["id"]
        # do a GET request on the url and access its text attribute
        resource_dump_data = requests.get(url).text
        # read the raw csv text into a pandas dataframe to work with it
        tpl_libraries = pd.read_csv(StringIO(resource_dump_data), sep=",")
tpl_libraries.head()

Now that we have information on the libraries, let's see if we can find out a little more about what goes on inside of them using the dataset `library-branch-programs-and-events-feed`. To do this, we will repeat the same actions as we did to retrieve the library location data, but instead of writing everything over again, we can create a helper function.
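
One possible shape for such a helper is sketched below. The names `dump_url` and `fetch_datastore_table` are hypothetical and are not part of `ckan_utils`. The sketch also assumes you want the first `datastore_active` resource, whereas the loop used earlier keeps the last one it finds.

```python
from io import StringIO

import pandas as pd
import requests

BASE_URL = "https://ckan0.cf.opendata.inter.prod-toronto.ca"


def dump_url(resource_id, base_url=BASE_URL):
    """Build the CSV dump URL for one datastore resource."""
    return base_url + "/datastore/dump/" + resource_id


def fetch_datastore_table(package_id, base_url=BASE_URL):
    """Return the first datastore_active resource of a package as a DataFrame.

    Hypothetical helper, not part of ckan_utils. Needs network access.
    """
    show_url = base_url + "/api/3/action/package_show"
    package = requests.get(show_url, params={"id": package_id}).json()
    for resource in package["result"]["resources"]:
        if resource["datastore_active"]:
            csv_text = requests.get(dump_url(resource["id"], base_url)).text
            return pd.read_csv(StringIO(csv_text))
    raise ValueError("no datastore_active resource in " + package_id)
```

With a helper like this, the download would reduce to a single call such as `fetch_datastore_table("library-branch-programs-and-events-feed")`. Because the helper keeps its URLs in local variables, it also avoids accidentally reusing a `url` that an earlier cell has overwritten.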

In [None]:
# repeat the setup from above
# `url` was overwritten inside the download loop, so reset it to the package_show endpoint
url = base_url + "/api/3/action/package_show"
params = { "id": "library-branch-programs-and-events-feed"}
package = requests.get(url, params = params).json()

In [None]:
package

In [None]:
[x for x in package["result"]["resources"]]

In [None]:
# To get resource data:
# iterate over the resources
for idx, resource in enumerate(package["result"]["resources"]):

    if resource["datastore_active"]:

        # to get all records in CSV format
        url = base_url + "/datastore/dump/" + resource["id"]
        # do a GET request on the url and access its text attribute
        resource_dump_data = requests.get(url).text
        # read the raw csv text into a pandas dataframe to work with it
        tpl_events = pd.read_csv(StringIO(resource_dump_data), sep=",")
tpl_events.head()

## Data Cleaning
---
Now, we want to extract only the columns that are relevant to us. Discarding columns that do not help us answer our question spares the computer unnecessary computation. However, if we want to be able to connect our conclusions back to the original records after dropping columns, it helps to keep an identifying column in the `DataFrame` even if we do not analyze it directly.

You can read about all of the columns under the data features tab [here](https://open.toronto.ca/dataset/library-branch-programs-and-events-feed/). It's good practice to read as much as you can about the metadata of a dataset, when and where it is available, to minimize the amount of guesswork or reconstruction you'll have to do.

In [None]:
tpl_events.columns

In [None]:
tpl_events = tpl_events[['_id', 'title', 'startdate', 'enddate', 'starttime', 'endtime',
       'length', 'library',  'description',  'id',
       'rcid', 'eventtype1', 'eventtype2', 'eventtype3', 'agegroup1',
       'agegroup2', 'agegroup3',  'lastupdated']]
tpl_events.head()

## Reshaping and pivoting dataframes

But that's not all we can do to the data to make it easier to work with. It would be nice if the event types and the age groups were each collapsed into a single column rather than spread across three separate columns. We can reshape dataframes into a 'long' format using the `melt` function.

There is an important distinction to make about pandas datatypes. Normally, `None` is not a string in Python; it is a special value that you can think of as null. In this dataset, however, the empty entries arrive as the text `None` inside columns of strings, so they are strings too. Therefore, to drop rows with `None`, you must use `!= "None"` rather than `!= None`.
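
The sketch below walks through both ideas on a toy table. The column names follow the lab, but the rows are made up, and only two of the three event type columns are included to keep it short.

```python
import pandas as pd

# Toy wide table; the column names follow the lab, the rows are made up
wide = pd.DataFrame({
    "id": [1, 2],
    "library": ["A", "B"],
    "eventtype1": ["Class", "Story Time"],
    "eventtype2": ["None", "Craft"],
})

# melt stacks eventtype1 and eventtype2 into one "eventtype" column
long_df = wide.melt(id_vars=["id", "library"],
                    value_vars=["eventtype1", "eventtype2"],
                    value_name="eventtype")

# The text "None" is not the None object, so this mask keeps every row
keeps_all = long_df["eventtype"] != None

# Comparing against the string itself flags the placeholder row
keeps_real = long_df["eventtype"] != "None"

long_clean = long_df[keeps_real].drop(columns="variable")
print(long_clean)
```

The `variable` column that `melt` creates records which original column each value came from. It is dropped here because only the event type itself matters.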

In [None]:
event_types = tpl_events.melt(id_vars = ["id", "library"], value_vars = ["eventtype1", "eventtype2", "eventtype3"], value_name = "eventtype")
event_types = event_types[event_types['eventtype'] != "None"].drop(columns = "variable")
event_types

In [None]:
age_groups = tpl_events.melt(id_vars = ["id", "library"], value_vars = ["agegroup1", "agegroup2", "agegroup3"], value_name = "agegroup")
age_groups = age_groups[age_groups['agegroup'] != "None"].drop(columns = "variable")
age_groups

In [None]:
# join these back to tpl_events
tpl_events_long = tpl_events.drop(columns = ["eventtype1", "eventtype2", "eventtype3", "agegroup1", "agegroup2", "agegroup3"]).merge(event_types, on = ["id", "library"], how = "left").merge(age_groups, on = ["id", "library"], how = "left")
tpl_events_long.head()

Let's use the `.groupby()` method to summarize event types.

The `.groupby()` method is called on a `DataFrame` and takes the column (or list of columns) to group by. It looks at the value in that column for each row and places rows that share the same value into the same group. You then apply an aggregate function to the groups, such as `.size()` (which counts the rows in each group), `.sum()`, `.max()` or `.min()`, and get back one result for each distinct value of the grouping column.
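
Here is a minimal sketch of those steps on a made-up table. The column names match `tpl_events_long`, but the rows and the `length` values are invented for illustration.

```python
import pandas as pd

# Made-up rows with the same column names as tpl_events_long
toy_events = pd.DataFrame({
    "library": ["A", "A", "B", "B", "B"],
    "eventtype": ["Class", "Craft", "Class", "Class", "Craft"],
    "length": [60, 45, 60, 90, 30],
})

# Count the rows in each eventtype group
counts = toy_events.groupby("eventtype").size()

# A different aggregate on the same groups: the largest length per type
longest = toy_events.groupby("eventtype")["length"].max()

print(counts)
print(longest)
```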

Let's see if we can find the most popular library event type.  

In [None]:
tpl_events_long.groupby('eventtype').size().sort_values(ascending = False)

**Your turn:** Let's find the most common public library event type by age group for each branch. We've provided some starter code, but you need to fill in wherever you see a `...`!

In [None]:
tpl_events_long.groupby(['library', 'agegroup'])['eventtype'].agg(pd.Series.mode)

## Temporal Data
---
Another facet of urban data that you may want to analyze is the time at which something occurs. Right now the dates are stored as strings, and `python` compares strings character by character, so a string that starts with `"10"` sorts before one that starts with `"9"`. We want to convert these strings to `datetime` objects, which will tell `python` when each event takes place.

Notice that we are not adding parentheses at the end of each line. That is because the `.day` and `.month` are not *functions* we are calling, but rather *attributes* of the particular `datetime` object. If we want to look at the day of the month library events start on, we can extract these attributes.

In [None]:
start_date = tpl_events[['id', 'library', 'startdate']].drop_duplicates()
start_date['time'] = pd.to_datetime(start_date['startdate'])
start_date['day'] = start_date['time'].dt.day
start_date['month'] = start_date['time'].dt.month
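
To see why the conversion matters, the sketch below compares two made-up dates first as strings and then as `datetime` values. The month/day/year layout is an assumption chosen for illustration and may not match the format used in `startdate`.

```python
import pandas as pd

# Made-up dates; the month/day/year layout is an assumption for illustration
raw = pd.Series(["9/15/2021", "10/2/2021"])

# As strings, "10/2/2021" sorts before "9/15/2021" because "1" < "9"
string_says_october_first = raw[1] < raw[0]

# As datetimes, the comparison follows the calendar
parsed = pd.to_datetime(raw, format="%m/%d/%Y")
datetime_says_october_first = parsed[1] < parsed[0]

days = parsed.dt.day       # attribute access, no parentheses
months = parsed.dt.month

print(string_says_october_first, datetime_says_october_first)
```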

In [None]:
start_date['day'].hist();
plt.xlabel("Day of Month");
plt.ylabel("Number of Events");

**Question:** What observations or trends do you notice about this graph?

**Question:** What could be improved about this graph or the process we used to obtain the data that generated it?

## Sentiment Analysis
---
We can use the words in a piece of text to measure its sentiment, or the positive/negative feeling that the text generates. To do so we will be using [VADER (Valence Aware Dictionary and sEntiment Reasoner)](https://github.com/cjhutto/vaderSentiment), which is a rule-based sentiment analysis tool specifically designed for social media. It even includes emojis! Run the following cell to load in the lexicon.

In [None]:
vader = load_vader()
vader.iloc[500:510, :]

# Toronto Office of Recovery and Rebuild – Public Survey Results

The more positive the polarity of a word, the more positive feeling the word evokes in the reader. The words in `vader` are all lowercase, while many of the words in our survey responses are not. We need to modify the text of the responses so that their words match up with the words stored in `vader`. We also need to remove punctuation, since that would likewise keep words from matching. We will keep the modified responses separate from the originals so that we still have access to the originals later.
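
A minimal sketch of that cleaning step on a single string is shown below. The function `clean_text` is an illustrative stand-in written for this example; the lab's own `clean_string` may work differently. Note that this version would also strip emojis, since they are neither word characters nor whitespace.

```python
import re


def clean_text(text):
    """Lowercase a string and drop punctuation.

    Illustrative stand-in; the lab's own clean_string may differ.
    """
    lowered = text.lower()
    # Replace anything that is not a word character or whitespace with a space
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    # Collapse the runs of whitespace left behind
    return re.sub(r"\s+", " ", no_punct).strip()


print(clean_text("More SAFE bike-lanes, please!"))
```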

**Note about the dataset:** 

**Limitations**

This dataset includes all survey records that were exported from the survey tool, including records that were started but not completed, i.e. partial completes. Respondent begins with #18 as #1-17 were test records and were removed. Also, this dataset has been cleaned to remove offensive language and personal data, along with those responses that may fall into the “offensive speech” category upon contextual review.

In [None]:

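# `url` was overwritten inside the download loop, so reset it to the package_show endpoint
url = base_url + "/api/3/action/package_show"
params = { "id": "toronto-office-of-recovery-and-rebuild-public-survey-results"}
package = requests.get(url, params = params).json()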

In [None]:
[x for x in package["result"]["resources"]]

In [None]:
# To get resource data:
for idx, resource in enumerate(package["result"]["resources"]):

    # To get metadata for non datastore_active resources:
    if resource["format"] == "CSV":
        url = base_url + "/api/3/action/resource_show?id=" + resource["id"]
        resource_data = requests.get(url).json()
        # do a GET request on the file's download url and access its text attribute
        resource_dump_data = requests.get(resource_data['result']['url']).text
        # read the raw csv text into a pandas dataframe to work with it
        public_survey_results = pd.read_csv(StringIO(resource_dump_data), sep=",")

In [None]:
public_survey_results.columns

In [None]:
# we only want the columns that have strings in their values - these will have dtype "object"
text_cols = public_survey_results.columns[public_survey_results.dtypes.values == "object"]
text_cols

In [None]:
public_survey_results[text_cols]

There are a lot of NA values and some of these responses are words or phrases rather than sentences. What are some characteristics of the columns that are more likely to contain free responses?

**Hint:** Free text responses are more likely to be unique to the individual respondent and contain more characters than fields containing phrases or dropdown text responses. 

In [None]:
# the columns with the most unique responses
public_survey_results[text_cols].apply(func = lambda x : x.nunique(), axis = 0).sort_values(ascending = False)

In [None]:
# of the top ten, what are the median characters per response by column?
top_text_cols = public_survey_results[text_cols].apply(func = lambda x : x.nunique(), axis = 0).sort_values(ascending = False)[:10].index.values
median_length = {}
for ttc in top_text_cols:
    response_text = public_survey_results[ttc].dropna()
    # median number of characters across the non-missing responses in this column
    median_length[ttc] = np.median(response_text.apply(func = lambda x : len(x)))
# sort the columns from longest to shortest median response
dict(sorted(median_length.items(), key=lambda item: item[1], reverse = True))

It seems like the 'Tell us...' columns have the longest responses and are also among the top 10 columns by number of unique responses. Therefore, these will be an ideal set of columns for text analysis. But in their current state, the column names don't say much about what the survey administrators want to hear about. Let's try renaming them to something more descriptive using their position in the readme.

In [None]:
public_survey_text = (public_survey_results[['Respondent', 
                                            '2. Tell us a bit about the choices you made above.(Please do not include any personal information (i.e. your name, telephone number, address) in your response.)',
                                            '4. Tell us a bit about the choices you made above.\n(Please do not include any personal information (i.e. your name, telephone number, address) in your response.)',
                                            '6. Tell us how the choices you made above will help you, your community, your neighbourhood or the city.(Please do not include any personal information (i.e. your name, telephone number, address) in your response.)',
                                            '8. Tell us more about this action you would like the City to consider in its recovery and rebuilding work?(Please do not include any personal information (i.e. your name, telephone number, address) in your response.)']]
                      .rename(columns = {'Respondent' : 'id',
                                         '2. Tell us a bit about the choices you made above.(Please do not include any personal information (i.e. your name, telephone number, address) in your response.)':'recover_rebuild_priority',
                                         '4. Tell us a bit about the choices you made above.\n(Please do not include any personal information (i.e. your name, telephone number, address) in your response.)':'city_fed_prov_govt_coord',
                                         '6. Tell us how the choices you made above will help you, your community, your neighbourhood or the city.(Please do not include any personal information (i.e. your name, telephone number, address) in your response.)':'city_nongovt_coord',
                                         '8. Tell us more about this action you would like the City to consider in its recovery and rebuilding work?(Please do not include any personal information (i.e. your name, telephone number, address) in your response.)':'civic_actions'
                                        })
                      .set_index('id'))
public_survey_text

In [None]:
# fill NA with blank strings
free_responses = public_survey_text.drop_duplicates().fillna("")

# Remove punctuation and lowercase all text
free_responses_cleaned = free_responses.apply(lambda x : clean_string(x), axis = 0)

free_responses_cleaned.head()

Next, we want to merge our sentiment lexicon with our cleaned responses. 
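
The sketch below shows one way such a merge could work on toy data. It is not the lab's `compose_polarity`, whose implementation is not shown here. The lexicon words and scores are invented and are not real VADER entries, and summing word polarities per respondent is an assumption about how the score is composed.

```python
import pandas as pd

# Toy lexicon: the words and scores are invented, not real VADER entries
toy_lexicon = pd.DataFrame({"polarity": [2.0, -1.5]}, index=["good", "bad"])

# Toy cleaned responses indexed by respondent id
toy_responses = pd.Series(
    ["good parks good transit", "bad traffic", ""],
    index=pd.Index([18, 19, 20], name="id"),
)

# One row per word, keeping the respondent id on every row
words = toy_responses.str.split().explode().to_frame("word")

# Look each word up in the lexicon; words without an entry get NaN
scored = words.join(toy_lexicon, on="word")

# Sum the word polarities per respondent (NaN is skipped, so it counts as 0)
polarity = scored.groupby(level="id")["polarity"].sum()
print(polarity)
```

Words that are missing from the lexicon contribute nothing to the total, so a blank response ends up with a polarity of 0.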

**Question:** Use the readme to see which of the free text responses corresponded to each set of survey questions. What conclusions can you draw about polarity and responses? How does this compare with your assumptions?

In [None]:
response_polarities = []
for col in free_responses_cleaned.columns.values:
    polarity_df = compose_polarity(free_responses_cleaned.loc[:, col], vader)
    polarity_df = polarity_df.rename(columns = {'polarity' : col})
    response_polarities.append(polarity_df)

In [None]:
response_polarities_df = pd.concat(response_polarities, axis = 1)

In [None]:
response_polarities_df.describe()