# Lab 7 - Web Scraping
---
In today's lab, we are going to download data from the internet using an API. API stands for application programming interface. Companies often create APIs as a way to allow users to more directly interact with their servers to retrieve data. Today, we are going to be using CKAN's API to download data from the City of Toronto's Open Data Portal to get some experience working with larger datasets.

In [26]:
# Run this cell to set up your notebook
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import zipfile
import warnings
import requests
import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

# Show up to 280 characters per column so that long text fields
# (such as event descriptions) are not cut off too early
pd.set_option('display.max_colwidth', 280)

%matplotlib inline
import re
import json

## Setup
---


In [27]:
# toronto public library info

# Toronto Open Data is stored in a CKAN instance. Its APIs are documented here:
# https://docs.ckan.org/en/latest/api/

# To hit our API, you'll be making requests to:
base_url = "https://ckan0.cf.opendata.inter.prod-toronto.ca"

# Datasets are called "packages". Each package can contain many "resources"
# To retrieve the metadata for this package and its resources, use the package name in this page's URL:
url = base_url + "/api/3/action/package_show"
params = { "id": "library-branch-general-information"}
package = requests.get(url, params = params).json()
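
To see what the `params` dictionary does, here is a minimal sketch that builds the request URL by hand, without contacting the server. It uses `urlencode` from the standard library as a stand-in for the encoding that `requests` performs; the variable names `endpoint` and `full_url` are introduced only for this illustration.

```python
from urllib.parse import urlencode

base_url = "https://ckan0.cf.opendata.inter.prod-toronto.ca"
endpoint = base_url + "/api/3/action/package_show"
params = {"id": "library-branch-general-information"}

# The key/value pairs in params end up after a "?" as a query string.
# urlencode builds that string without touching the network.
full_url = endpoint + "?" + urlencode(params)
print(full_url)
```

Pasting the printed URL into a browser should return the same JSON that `requests.get(url, params = params)` retrieves, assuming the portal is reachable.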

In [29]:
package

{'help': 'https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/help_show?name=package_show',
 'success': True,
 'result': {'author': 'planning@tpl.ca',
  'author_email': 'planning@tpl.ca',
  'creator_user_id': '329e1506-b545-4fc7-a4ea-e614f220eea7',
  'dataset_category': 'Table',
  'date_published': '2023-06-29 00:00:00',
  'excerpt': 'This dataset shows the current characteristics of Toronto Public Library branches, such as location, size, and the availability of specific features (e.g. parking, KidsStops, Digital Innovation Hubs, etc.)',
  'formats': 'CSV,JSON,XML',
  'id': 'f5aa9b07-da35-45e6-b31f-d6790eb9bd9b',
  'information_url': 'https://www.torontopubliclibrary.ca/opendata',
  'is_retired': 'false',
  'isopen': False,
  'last_refreshed': '2023-06-29 13:48:12.063852',
  'license_id': 'open-government-licence-toronto',
  'license_title': 'open-government-licence-toronto',
  'maintainer': None,
  'maintainer_email': 'planning@tpl.ca',
  'metadata_created': '2023-06-29T14:2

This is an example of another `python` data structure called a *dictionary*. Dictionaries store *values* by associating them with a *key* rather than by an integer index. You can access the values stored in a dictionary using bracket notation just like a list. For example:

In [30]:
# In this dictionary, the keys are strings, and the values are all numbers
d = {'a': 1,
    'b': 2,
    'c': 3}

d['a']

1
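
Dictionary values can themselves be dictionaries or lists. The sketch below uses a small, made-up `response` dictionary (not real data from the portal) that mimics the shape of an API reply, to show how chained brackets reach inner values and how `.get()` avoids an error when a key is missing.

```python
# A made-up reply with the same general shape as an API response (illustrative only)
response = {
    "success": True,
    "result": {
        "name": "example-dataset",
        "resources": [
            {"id": "abc", "format": "CSV"},
            {"id": "def", "format": "JSON"},
        ],
    },
}

# Chain keys (and a list index) to walk down the levels one at a time
first_format = response["result"]["resources"][0]["format"]

# A list comprehension collects one field from every inner dictionary
resource_ids = [r["id"] for r in response["result"]["resources"]]

# .get() returns a fallback instead of raising KeyError for a missing key
missing = response.get("error", "no error key")

print(first_format, resource_ids, missing)
```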

`package` is an example of a nested dictionary: some of its values are themselves dictionaries or lists. To reach an inner value, we chain keys, as in `package["result"]["resources"]`. There are many values to look at, so instead of hard-coding the keys one at a time, let's use a Python list comprehension to display every resource at once.

In [31]:
# print the metadata
[x for x in package["result"]["resources"]]

[{'cache_last_updated': None,
  'cache_url': None,
  'created': '2023-06-29T13:47:40.532513',
  'datastore_active': True,
  'datastore_cache': {'CSV': '1c9e7b16-c8fc-4925-9639-1253b6e02422',
   'XML': 'ba42d38c-02a4-46fb-9d3a-88c6a90af2c4',
   'JSON': '0a48b601-9a07-4de6-ae73-b955527b3e70'},
  'datastore_cache_last_update': '2023-06-29T14:23:37.874384',
  'extract_job': 'Airflow - files_to_datastore.py - library-branch-general-information',
  'format': 'CSV',
  'hash': '',
  'id': '77f8b217-b83a-4b71-be4f-90685137af20',
  'is_preview': True,
  'last_modified': None,
  'metadata_modified': '2023-06-29T14:23:37.963569',
  'mimetype': 'text/csv',
  'mimetype_inner': None,
  'name': 'tpl-branch-general-information-2023',
  'package_id': 'f5aa9b07-da35-45e6-b31f-d6790eb9bd9b',
  'package_name_or_id': 'library-branch-general-information',
  'position': 0,
  'resource_type': None,
  'size': None,
  'state': 'active',
  'url': 'https://ckan0.cf.opendata.inter.prod-toronto.ca/datastore/dump/77f

That's a lot of information compared to what we want: a simple dataset of library information! A few fields here guide how you download the data through the API. Note that the first resource has `datastore_active` set to `True`. This means a copy of the data is stored in the Open Data portal's own database. Not every resource has this value set to `True`; some are only offered as downloadable files in `csv`, `json`, or `xml` format. For now, we will download the resource where it is `True`; later in the lab we will learn what to do when the data is stored elsewhere.
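
The rule described above can be sketched offline. The `resources` list and the helper `download_url` below are hypothetical, made up for this illustration; only the `/datastore/dump/` path and the `datastore_active`, `id`, and `url` fields come from the lab itself.

```python
base_url = "https://ckan0.cf.opendata.inter.prod-toronto.ca"

# Hypothetical resources, shaped like the entries in package["result"]["resources"]
resources = [
    {"id": "aaa-111", "name": "in-datastore", "datastore_active": True,
     "url": "https://example.org/a.csv"},
    {"id": "bbb-222", "name": "file-only", "datastore_active": False,
     "url": "https://example.org/b.csv"},
]

def download_url(resource):
    """Pick where to download a resource from, based on datastore_active."""
    if resource["datastore_active"]:
        # Stored in the portal's database: use the CSV dump endpoint
        return base_url + "/datastore/dump/" + resource["id"]
    # Stored elsewhere: fall back to the resource's own url field
    return resource["url"]

urls = {r["name"]: download_url(r) for r in resources}
print(urls)
```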

In [32]:
# To get resource data:
# iterate over the resources
for idx, resource in enumerate(package["result"]["resources"]):

    # set a condition for when you want to access the resource:
    if resource["datastore_active"]:

        # to get all records in CSV format (this is specific to CKAN's API)
        url = base_url + "/datastore/dump/" + resource["id"]
        # do a GET request on the url and access its text attribute
        resource_dump_data = requests.get(url).text
        # read the raw csv text into a pandas dataframe to work with it
        tpl_libraries = pd.read_csv(StringIO(resource_dump_data), sep=",")
tpl_libraries.head()

Unnamed: 0,_id,BranchCode,PhysicalBranch,BranchName,Address,PostalCode,Website,Telephone,SquareFootage,PublicParking,...,Workstations,ServiceTier,Lat,Long,NBHDNo,NBHDName,TPLNIA,WardNo,WardName,PresentSiteYear
0,1,AB,1,Albion,"1515 Albion Road, Toronto, ON, M9V 1B2",M9V 1B2,https://www.tpl.ca/albion,416-394-5170,29000,59,...,38.0,DL,43.739826,-79.584096,2.0,Mount Olive-Silverstone-Jamestown,1.0,1.0,Etobicoke North,2017.0
1,2,ACD,1,Albert Campbell,"496 Birchmount Road, Toronto, ON, M1K 1N8",M1K 1N8,https://www.tpl.ca/albertcampbell,416-396-8890,28957,45,...,36.0,DL,43.708019,-79.269252,120.0,Clairlea-Birchmount,1.0,20.0,Scarborough Southwest,1971.0
2,3,AD,1,Alderwood,"2 Orianna Drive, Toronto, ON, M8W 4Y1",M8W 4Y1,https://www.tpl.ca/alderwood,416-394-5310,7341,shared,...,7.0,NL,43.601944,-79.547252,20.0,Alderwood,0.0,3.0,Etobicoke-Lakeshore,1999.0
3,4,AG,1,Agincourt,"155 Bonis Avenue, Toronto, ON, M1T 3W6",M1T 3W6,https://www.tpl.ca/agincourt,416-396-8943,27000,86,...,42.0,DL,43.785167,-79.29343,118.0,Tam O'Shanter-Sullivan,0.0,22.0,Scarborough-Agincourt,1991.0
4,5,AH,1,Armour Heights,"2140 Avenue Road, Toronto, ON, M5M 4M7",M5M 4M7,https://www.tpl.ca/armourheights,416-395-5430,2988,shared,...,5.0,NL,43.739337,-79.421889,39.0,Bedford Park-Nortown,0.0,8.0,Eglinton-Lawrence,1982.0
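
The `StringIO` step in the cell above may look mysterious: `pd.read_csv` expects a file (or a path), while `requests` hands us one long string. Wrapping the string in `StringIO` makes it behave like an open file. A minimal sketch, using a short hand-typed CSV string with a few values copied from the table above:

```python
from io import StringIO

import pandas as pd

# Raw CSV text, as it would arrive in the .text attribute of a response
csv_text = "BranchCode,BranchName,SquareFootage\nAB,Albion,29000\nAD,Alderwood,7341\n"

# StringIO wraps the string in a file-like object that read_csv can consume
df = pd.read_csv(StringIO(csv_text))
print(df)
```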


Now that we have information on the libraries, let's see if we can find out a little more about them using the dataset `library-branch-programs-and-events-feed`. 

In [33]:
# Toronto Open Data is stored in a CKAN instance. Its APIs are documented here:
# https://docs.ckan.org/en/latest/api/

# To interact with the API, you'll be making requests to:
base_url = "https://ckan0.cf.opendata.inter.prod-toronto.ca"

# Datasets are called "packages". Each package can contain many "resources"
# To retrieve the metadata for this package and its resources, use the package name in this page's URL:
url = base_url + "/api/3/action/package_show"
params = { "id": "library-branch-programs-and-events-feed"}
package = requests.get(url, params = params).json()

In [34]:
[x for x in package["result"]["resources"]]

[{'cache_last_updated': None,
  'cache_url': None,
  'created': '2023-06-20T19:41:22.030427',
  'datastore_active': True,
  'datastore_cache': {'CSV': '64b78724-6bba-45ac-b760-7faa046834bf',
   'XML': '493579ec-0305-459e-8e8d-4a32db622c19',
   'JSON': 'aa5e1425-77f4-4cb4-a8ea-eb5ce7d4f34c'},
  'datastore_cache_last_update': '2023-09-28T11:19:56.813671',
  'extract_job': 'Airflow - files_to_datastore.py - library-branch-programs-and-events-feed',
  'format': 'CSV',
  'hash': '',
  'id': 'c73bbe54-3a48-4ada-8eef-a1a2864021e4',
  'is_preview': True,
  'last_modified': None,
  'metadata_modified': '2023-09-28T11:19:56.983925',
  'mimetype': 'application/json',
  'mimetype_inner': None,
  'name': 'tpl-events-feed',
  'package_id': 'fb343332-03cd-40b9-a1c8-c03a4a85ca1e',
  'package_name_or_id': 'library-branch-programs-and-events-feed',
  'position': 0,
  'resource_type': None,
  'size': None,
  'state': 'active',
  'url': 'https://ckan0.cf.opendata.inter.prod-toronto.ca/datastore/dump/c73bb

In [35]:
# To get resource data:
# iterate over the resources
for idx, resource in enumerate(package["result"]["resources"]):

    # set a condition for when you want to access the resource:
    if resource["datastore_active"]:

        # to get all records in CSV format (this is specific to CKAN's API)
        url = base_url + "/datastore/dump/" + resource["id"]
        # do a GET request on the url and access its text attribute
        resource_dump_data = requests.get(url).text
        # read the raw csv text into a pandas dataframe to work with it
        tpl_events = pd.read_csv(StringIO(resource_dump_data), sep=",")
tpl_events.head()

Unnamed: 0,_id,title,startdate,enddate,starttime,endtime,length,library,location,description,...,agegroup2,agegroup3,relatedlink,relatedlinktext,imagepath,imagetext,imageheight,imagewidth,otherinfo,lastupdated
0,1,Down the Fairy Tale Road: 200 Years of Brothers Grimm in English,2023-10-25,2024-01-13,,,1380.0,Lillian H. Smith,,"Celebrating the 200th anniversary of the first English translation of the folk and fairy tales of the Brothers Grimm. Featuring tales collected by German linguists and folklorists Jacob and Wilhelm Grimm. Beloved tales including Snow White, Little Red Riding Hood, Hansel & Gr...",...,School-Age Children,Teen,,,,,,,"{""smallImageURL"":""https://tpl.razuna.com/assets/3434/AFFE3E8A206E4A0296C482610291AC65/img/454A8EA93B54442EAB17EE1A9CA5F0C9/cinderella_454A8EA93B54442EAB17EE1A9CA5F0C9.jpg"",""mediumImageURL"":""https://tpl.razuna.com/assets/3434/AFFE3E8A206E4A0296C482610291AC65/img/B9610D809A7241...",2023-09-14T16:56:19Z
1,2,Dog Days: Dogs in Children's Books,2023-07-31,2023-10-14,,,1380.0,Lillian H. Smith,,"Every dog has its day in this exhibit at the Osborne Collection of Early Children's Books. From Old Yeller to Lassie to Harry the Dirty Dog, fall in puppy love with all the books that celebrate our best friends. <br /><br />July 31 - October 14 <br /><br />Free. All are welc...",...,School-Age Children,Teen,,,,,,,"{""smallImageURL"":""https://tpl.razuna.com/assets/3434/B9890D3382984035A76621B50B5F38E3/img/14FC9EA5C7BA4EE5B170E2AEADEF748B/dog_14FC9EA5C7BA4EE5B170E2AEADEF748B.jpg"",""mediumImageURL"":""https://tpl.razuna.com/assets/3434/B9890D3382984035A76621B50B5F38E3/img/2B015187C6AC42208DB18...",2023-06-08T15:59:50Z
2,3,"Sunday Storytimes, Crafts and Games",2023-10-01,2023-12-17,,,,Maria A. Shchuka,,"Storytimes, crafts and games for children 0-10 years old.<br />Time: 3-4pm, every Sunday, October to December.",...,Pre-School Children,School-Age Children,,,,,,,,2023-09-17T17:02:15Z
3,4,Pop up Learning lab - After School Activities for Kids,2023-10-04,2023-10-25,,,,St. Clair/Silverthorn,,"Join us every Wednesday from 4-5pm with Pop up Learning Labs as we explore the world of tech, Coding 3D printing and 3D design. <br /><br />Scratch Coding <br />Wednesday October 4th 2023 <br /><br />3D Design for kids <br />Wednesday October 11th 2023<br /><br />Makey Makey<...",...,,,,,,,,,,2023-09-13T09:23:48Z
4,5,Handmade Humanoids: A Merril Collection Exhibit,2023-10-16,2023-12-30,,,,Lillian H. Smith,,"Whether they're made of metal, mud or magic, fiction is full of fantastic human-shaped creatures that aren't quite human. There are golems and robots, puppets and clones, magical constructs and more. Come and explore their stories at this exhibit!<br /><br />Free. All are wel...",...,Teen,,,,,,,,"{""smallImageURL"":""https://tpl.razuna.com/assets/3434/D153343069004201AE3553348BE37E30/img/115CDCC0E6C6434CB4318D0390AD16B4/Metropolis_Aubrey_hammond_events_115CDCC0E6C6434CB4318D0390AD16B4.jpg"",""mediumImageURL"":""https://tpl.razuna.com/assets/3434/D153343069004201AE3553348BE37...",2023-08-17T09:16:53Z


In [41]:
# what are the library locations?
events_loc = tpl_events.sort_values(by = 'library')['library'].unique()
events_loc

array(['Agincourt', 'Albert Campbell', 'Albion', 'Alderwood',
       'All Branches', 'Amesbury Park', 'Annette Street', 'Barbara Frum',
       'Beaches', 'Bendale', 'Black Creek', 'Bloor/Gladstone',
       'Brentwood', 'Bridlewood', 'Burrows Hall', 'Centennial',
       'City Hall', 'Cliffcrest', 'College/Shaw', 'Danforth/Coxwell',
       'Davenport', 'Dawes Road', 'Deer Park', 'Don Mills', 'Downsview',
       'Dufferin/St. Clair', 'Eatonville', 'Eglinton Square',
       'Elmbrook Park', "Ethennonnhawahstihnen'", 'Evelyn Gregory',
       'Fairview', 'Flemingdon Park', 'Forest Hill', 'Fort York',
       'Gerrard/Ashdale', 'Goldhawk Park', 'Guildwood', 'High Park',
       'Highland Creek', 'Hillcrest', 'Humber Summit', 'Humberwood',
       'Jane/Dundas', 'Jane/Sheppard', 'Jones', 'Kennedy/Eglinton',
       'Leaside', 'Lillian H. Smith', 'Locke', 'Long Branch',
       'Main Street', 'Malvern', 'Maria A. Shchuka', 'Maryvale',
       'McGregor Park', 'Mimico Centennial', 'Morningside',
     

In [43]:
libraries_loc = tpl_libraries.sort_values(by = 'BranchName')['BranchName'].unique()
libraries_loc

array(['Agincourt', 'Albert Campbell', 'Albion', 'Alderwood',
       'Amesbury Park', 'Annette Street', 'Answerline', 'Armour Heights',
       'Automated Phone System', 'Barbara Frum', 'Beaches', 'Bendale',
       'Black Creek', 'Bloor/Gladstone', 'Bookmobile One',
       'Bookmobile Two', 'Brentwood', 'Bridlewood', 'Brookbanks',
       'Burrows Hall', 'Cedarbrae', 'Centennial', 'City Hall',
       'Cliffcrest', 'College/Shaw', 'Danforth/Coxwell', 'Davenport',
       'Dawes Road', 'Deer Park', 'Departmental Staff', 'Don Mills',
       'Downsview', 'Dufferin/St. Clair', 'Eatonville', 'Eglinton Square',
       'Elmbrook Park', "Ethennonnhawahstihnen'", 'Evelyn Gregory',
       'Fairview', 'Flemingdon Park', 'Forest Hill', 'Fort York',
       'Gerrard/Ashdale', 'Goldhawk Park', 'Guildwood', 'High Park',
       'Highland Creek', 'Hillcrest', 'Home Library Service',
       'Humber Bay', 'Humber Summit', 'Humberwood', 'Interloan',
       'Jane/Dundas', 'Jane/Sheppard', 'Jones', 'Kennedy/Egli
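
Comparing the two printed arrays by eye is error-prone. One possible way to compare them, sketched here on short hand-picked subsets of the names above (the variable names are made up for this example), is `np.setdiff1d`, which returns the sorted values found in its first argument but not in its second.

```python
import numpy as np

# Small subsets of the names printed above, for illustration only
events_subset = np.array(["Agincourt", "Albion", "All Branches"])
libraries_subset = np.array(["Agincourt", "Albion", "Answerline", "Armour Heights"])

# Names that appear in the events data but not in the branch data
only_in_events = np.setdiff1d(events_subset, libraries_subset)

# Names that appear in the branch data but not in the events data
only_in_libraries = np.setdiff1d(libraries_subset, events_subset)

print(only_in_events)
print(only_in_libraries)
```

The same two calls should work on the full `events_loc` and `libraries_loc` arrays, provided neither contains missing values.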

In [45]:
tpl_events[['library', 'eventtype1', 'eventtype2', 'eventtype3', 'agegroup1', 'agegroup2', 'agegroup3']]

Unnamed: 0,library,eventtype1,eventtype2,eventtype3,agegroup1,agegroup2,agegroup3
0,Lillian H. Smith,00-Art Exhibits,00-Culture Arts & Entertainment,01-Osborne Collection of Early Children's Books,Adult,School-Age Children,Teen
1,Lillian H. Smith,00-Art Exhibits,00-Culture Arts & Entertainment,01-Osborne Collection of Early Children's Books,Adult,School-Age Children,Teen
2,Maria A. Shchuka,01-Ready for Reading Storytimes,00-Hobbies Crafts & Games,,All Children,Pre-School Children,School-Age Children
3,St. Clair/Silverthorn,01-Pop-Up Learning Labs,00-After School,,School-Age Children,,
4,Lillian H. Smith,00-Culture Arts & Entertainment,01-Merril Collection: Science Fiction & ...,00-Art Exhibits,Adult,Teen,
...,...,...,...,...,...,...,...
3979,St. James Town,00-Book Clubs & Writers Groups,,,Adult,Older Adult,
3980,St. James Town,00-After School,00-Science & Technology,00-Hobbies Crafts & Games,School-Age Children,,
3981,Gerrard/Ashdale,00-Science & Technology,,,School-Age Children,,
3982,Dufferin/St. Clair,00-Culture Arts & Entertainment,,,Adult,,


Let's use the `.groupby()` method to summarize event types by library locations.

In pandas, `.groupby()` is called on a `DataFrame` and takes the column (or list of columns) to group by. It splits the rows into groups that share the same value in those columns. On its own it does not compute anything; you follow it with an aggregate method such as `.size()` (the number of rows in each group), `.count()` (the number of non-missing values in each column), `.sum()`, `.max()`, or `.min()`. The result has one entry per distinct group, holding the aggregate computed over the rows in that group.
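
A minimal sketch of grouping on a tiny, made-up table (the rows below are invented for illustration and are not taken from the events feed). It shows the difference between `.size()` and `.count()`, and what grouping by two columns produces.

```python
import pandas as pd

# Invented rows; the last one has a missing event type
events = pd.DataFrame({
    "library": ["Albion", "Albion", "Jones", "Jones", "Jones"],
    "eventtype1": ["After School", "After School", "After School", "Book Clubs", None],
})

# .size() counts every row in each group, missing values included
sizes = events.groupby("library").size()

# .count() counts only the non-missing values of the selected column
counts = events.groupby("library")["eventtype1"].count()

# Grouping by two columns gives one entry per (eventtype1, library) pair;
# rows whose grouping key is missing are dropped by default
pairs = events.groupby(["eventtype1", "library"]).size()

print(sizes, counts, pairs, sep="\n\n")
```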

In [49]:
tpl_events[['library', 'eventtype1']].groupby(['eventtype1', 'library']).size()

eventtype1                     library                  
00-Adult Literacy              All Branches                  1
00-After School                Agincourt                     1
                               Albert Campbell               5
                               Albion                       12
                               Barbara Frum                 11
                                                            ..
01-TPL Teens                   Thorncliffe                  18
                               Toronto Reference Library     2
                               Woodside Square               3
01-Writer in Residence         Toronto Reference Library     7
z-DO NOT USE - Fragile Planet  Jones                         1
Length: 633, dtype: int64

In [None]:
# Toronto Open Data is stored in a CKAN instance. Its APIs are documented here:
# https://docs.ckan.org/en/latest/api/

# To hit our API, you'll be making requests to:
base_url = "https://ckan0.cf.opendata.inter.prod-toronto.ca"

# Datasets are called "packages". Each package can contain many "resources"
# To retrieve the metadata for this package and its resources, use the package name in this page's URL:
url = base_url + "/api/3/action/package_show"
params = { "id": "rain-gauge-locations-and-precipitation"}
package = requests.get(url, params = params).json()

In [24]:
# `package` is a dictionary, not a list, so indexing it with 0 looks for a key
# named 0 and raises a KeyError
package[0]

KeyError: 0

In [None]:
# let's look at which resources are available from this package
[x for x in package["result"]["resources"]]

In [None]:
# To get resource data, iterate over the package's resources
for idx, resource in enumerate(package["result"]["resources"]):

    # datastore_active is False when the data is not held in the portal's database,
    # so we download the file from the resource's own url instead
    if not resource["datastore_active"]:
        # pick the resource we want by name
        if resource['name'] == "precipitation-data-2022":
            # get its url and read it into a pandas dataframe in memory
            resource_url = resource["url"]
            resource_data = pd.read_csv(resource_url)

If you are having trouble with the previous cell, you can read in an already downloaded version of the dataset using the next cell.

In [None]:
resource_data

## Data Cleaning
---
The dictionary we were looking at above is a little hard to interpret because there are dictionaries nested inside some of its keys. We can look at only the first level of keys in our dictionary by using the `.keys()` method.

In [None]:
package.keys()
# You can also use .values() to access all of the values

Unfortunately, Twitter by default does not attach geographic data to the metadata of each tweet. To get around this, we can use the location associated to the account of each poster. First, we want to extract only the parts of the data that are relevant to what we are looking for. To do this, we first need to turn our list of dictionaries into a `pandas DataFrame`. Fortunately, there is a function that can do this easily for us.

In [None]:
#gentrification_df = pd.DataFrame(gentrification)
tpl_libraries.columns

Next, we want to extract only the columns that are relevant to us. Discarding columns that do not help us answer our question spares the computer unnecessary computation. However, if we want to be able to connect our conclusions back to the original records after dropping columns, it helps to keep an identifying column in the `DataFrame` even if we never analyze it directly.

In [None]:
users = tpl_libraries[['id', 'user']]

Let's take a closer look at an element of the `user` column.

In [None]:
users.loc[0, 'user']

In each row, the `user` column contains another dictionary with information about the user who posted the tweet. We can access the user's location using the `location` key.

The strategy that we are going to use to extract the locations from each user will be to iterate through the rows of `users`; at each row we will add the tweet id and the user location to a new dictionary. This dictionary will then be added to a list. Once we have iterated through all of the rows of `users`, we will convert our final list of dictionaries into a `DataFrame`.

In [None]:
# Create an empty list.
locations_list = list()

for i in range(len(users)):
    # Create an empty dictionary.
    new_entry = {}
    # Copy the tweet id into the new dictionary.
    new_entry['id'] = users.loc[i, 'id']
    # Create a new key ('location') and assign it to the location of the user who
    # wrote that tweet.
    new_entry['location'] = users.loc[i, 'user']['location']
    # Append the dictionary as another element of our list.
    locations_list.append(new_entry)
    
# Transform our list into a DataFrame. As before, each element of the list becomes
# a row in the DataFrame, and each key becomes a column.
all_locations = pd.DataFrame(locations_list)

# Display the first 10 tweets.
all_locations.head(10)
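
The row-by-row loop above spells out every step. For comparison, here is a more compact sketch of the same idea using `.apply()`, run on a small made-up `users` table (the ids and locations are invented). It assumes every entry in the `user` column is a dictionary that has a `location` key.

```python
import pandas as pd

# Invented stand-in for the `users` DataFrame: an id plus a nested dictionary
users = pd.DataFrame({
    "id": [1, 2, 3],
    "user": [{"location": "Oakland, CA"}, {"location": ""}, {"location": "Toronto"}],
})

# .apply() runs the lambda on each dictionary and collects the results in a column
all_locations = pd.DataFrame({
    "id": users["id"],
    "location": users["user"].apply(lambda u: u["location"]),
})

print(all_locations)
```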

Clearly this isn't a foolproof method, since the location associated with an account may have little bearing on the actual location from which a tweet was posted. Also, not all users have a specific location connected to their account. Depending on the data you have pulled from Twitter, you may also notice that some of the "locations" are not actually real places. We can do a bit of data cleaning to filter out the rows that contain true locations. First, let's get rid of the rows that do not contain any text at all in the `location` column.

In [None]:
# Keep only the rows whose location column is not an empty string.
# (DataFrame.append was removed in pandas 2.0, so we filter with a boolean mask instead.)
no_empties = all_locations[all_locations["location"] != '']
no_empties.head(10)

This looks pretty good! We would still like to filter through our locations for places that actually exist. Let's use the `.groupby()` method to take a look at what locations we have in our data.

As before, `.groupby('location')` splits the rows into groups that share the same value in the `location` column, and `.count()` then counts the non-missing entries of every remaining column within each group.

In [None]:
no_empties.groupby('location').count()

If you scroll through this list, you will likely see a whole litany of "locations" that do not resemble locations, since users are allowed to write whatever they like as their location. We do not have time today to sort through all of these, so we are going to move on to a few other techniques that we can use to analyze these kinds of data.

Since many of the tweets we scraped earlier do not have useful locations, we may want to filter by location when we ask the API for tweets. We can use the same function as before, using the optional `location` argument. The format of the location argument is `"latitude,longitude,radius"`. The following code searches for tweets hashtagged "gentrification" within a 5 km radius of the Temescal Oakland area.

In [None]:
gentrification_oak = download_recent_tweets_by_hashtag(hashtag = "gentrification",
                                                       keys = keys,
                                                       location = "37.829314,-122.264433,5km",
                                                       count = 100)

If you are running into errors downloading these tweets, uncomment and run the following cell to load in tweets that we scraped earlier.

**Your turn:** Let's use the procedure we went through earlier to find the most common public library event type for each year. We've provided some starter code, but you need to fill in wherever you see a `...`!

In [None]:
tpl_events = pd.DataFrame(gentrification_oak)
event_types = ... # select columns of interest

In [None]:
locations_list_oak = list()
for i in range(len(users_oak)):
    new_entry = {}
    new_entry['id'] = users_oak.loc[i, 'id']
    new_entry['location'] = users_oak.loc[i, 'user']['location']
    locations_list_oak.append(...) # we want to add the new entry to our list
all_locations_oak = ... # turn the list into a DataFrame

# Keep only the rows whose location is not an empty string
# (DataFrame.append was removed in pandas 2.0, so we filter with a boolean mask instead).
no_empties_oak = all_locations_oak[all_locations_oak["location"] != '']
        
grouped_locations = ...

# This finds the number of repeats of the most common location.
max_number_of_tweets = grouped_locations['id'].max()

most_common_location = grouped_locations[grouped_locations['id'] == max_number_of_tweets]

# most_common_location is a DataFrame with one item. This accesses all of the indices in the
# DataFrame, then takes the first (and only) one.
most_common_location.index[0]

## Temporal Data
---
Another facet of urban data that you may want to analyze is time, here the time at which each tweet was posted. Currently, our only information about posting time is the `'created_at'` column, which is a string. As you may remember from the Introductory lab, `python` compares strings character by character, based on each character's position in the character set (roughly, alphabetical order), so comparing these strings does not compare the times they describe. We want to convert these strings to `datetime` objects, which tell `python` the actual time at which each tweet was posted.
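
To make the problem concrete, here is a small sketch with two hand-written timestamps. The string layout and the explicit `format` argument are assumptions about what `created_at` looks like; adjust them to match your data.

```python
import pandas as pd

# Two invented timestamps; the first is the EARLIER one
created_at = pd.Series([
    "Wed Oct 10 20:19:24 +0000 2018",
    "Mon Oct 15 08:00:00 +0000 2018",
])

# As strings, "W" sorts after "M", so the earlier time looks "greater"
string_says_first_is_later = created_at[0] > created_at[1]

# After conversion, the comparison follows the calendar
times = pd.to_datetime(created_at, format="%a %b %d %H:%M:%S %z %Y")
datetime_says_first_is_later = times[0] > times[1]

print(string_says_first_is_later, datetime_says_first_is_later)
```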

In [None]:
post_time = pd.DataFrame(gentrification_oak)[['id', 'created_at']]
post_time['time'] = pd.to_datetime(post_time['created_at'])
post_time['time'].head()

Now that each string has been converted into a `datetime` object, we can extract the day, hour, minute, etc. of each time point like so:

In [None]:
post_time.loc[0, 'time'].day

In [None]:
post_time.loc[0, 'time'].hour

In [None]:
post_time.loc[0, 'time'].minute

Notice that we are not adding parentheses at the end of each line. That is because the `.day`, `.hour`, and `.minute` are not *functions* we are calling, but rather *attributes* of the particular `datetime` object. If we want to look at the time of day that people tend to tweet about #gentrification, we can extract these attributes.
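
The difference between attributes and methods can be seen on a single timestamp. This sketch builds one `pd.Timestamp` by hand; the names `ts` and `fractional_hour` are introduced only for this example.

```python
import pandas as pd

ts = pd.Timestamp("2023-09-14 16:56:19")

# Attributes: stored values, read without parentheses
hour = ts.hour
minute = ts.minute

# Method: computes something, so it needs parentheses
weekday = ts.day_name()

# Combining attributes into a single fractional hour of the day
fractional_hour = ts.hour + ts.minute / 60 + ts.second / 3600

print(hour, minute, weekday, round(fractional_hour, 2))
```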

In [None]:
# Fractional hour of the day, computed for the whole column with the .dt accessor
post_time['hour'] = (post_time['time'].dt.hour + post_time['time'].dt.minute / 60 +
                     post_time['time'].dt.second / 3600)
post_time['hour'].hist()
plt.xlabel("Hour (UTC)")
plt.ylabel("Number of Tweets");

**Question:** What observations or trends do you notice about this graph?

**Question:** What could be improved about this graph or the process we used to obtain the data that generated it?

## Sentiment Analysis
---
We can use the words in the tweets to measure their sentiment, or the positive/negative feeling each tweet generates. To do so we will use [VADER (Valence Aware Dictionary and sEntiment Reasoner)](https://github.com/cjhutto/vaderSentiment), a rule-based sentiment analysis tool specifically designed for social media. It even includes emojis! Run the following cell to load in the lexicon.

In [None]:
vader = load_vader()
vader.iloc[500:510, :]

The more positive the polarity of a word, the more positive the feeling it evokes in the reader. All of the words in `vader` are lowercase, while many of our tweets are not. We need to modify the text of the tweets so that their words match the words stored in `vader`. We also need to remove punctuation, since it would otherwise prevent words from matching. We will put these modified tweets into another column of our `DataFrame` so that we still have access to the originals later.
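
The lab's `clean_tweets` helper is not shown in this notebook. As a rough idea of what such a step can look like, here is a hypothetical stand-in called `clean_text`, applied to two invented sentences. It is a sketch under simple assumptions, not the helper's actual implementation.

```python
import pandas as pd

def clean_text(texts):
    """Lowercase a Series of strings and strip punctuation (hypothetical helper)."""
    lowered = texts.str.lower()
    # Remove every character that is neither a word character nor whitespace
    return lowered.str.replace(r"[^\w\s]", "", regex=True)

raw = pd.Series(["Gentrification is REAL!", "Rents, rents... rising?"])
cleaned = clean_text(raw)

print(cleaned.tolist())
```

Note that this simple pattern also removes emoji, so a real cleaning step for use with VADER would likely need to treat them separately.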

In [None]:
# Select our columns of interest
tweets_and_retweets = pd.DataFrame(gentrification_oak)[['id', 'text', 'retweet_count']]

# Set the index of the DataFrame to the tweet ID. This step is necessary
# in order to use our utility functions.
tweets_and_retweets.set_index('id', inplace = True)

# Remove punctuation and lowercase tweets
tweets_and_retweets['cleaned'] = clean_tweets(tweets_and_retweets['text'])

tweets_and_retweets.head()

Next, we want to merge our sentiment lexicon with our cleaned tweets. 

In [None]:
tweets_and_retweets['polarity'] = compose_polarity(tweets_and_retweets, vader)
tweets_and_retweets.head()
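
The `compose_polarity` helper is also not shown here. One simple way to combine a lexicon with cleaned text, sketched below, is to add up the scores of the words that appear in the lexicon. The tiny `lexicon` and its scores are invented for illustration, and `polarity` is a hypothetical function, not the lab's helper.

```python
# Invented word scores (not real VADER values)
lexicon = {"good": 2.0, "bad": -2.0, "awful": -3.0}

def polarity(cleaned_text):
    """Sum the lexicon scores of the words in a cleaned string (hypothetical helper)."""
    # Words missing from the lexicon contribute 0
    return sum((lexicon.get(word, 0.0) for word in cleaned_text.split()), 0.0)

scores = [polarity(t) for t in ["good food bad rent", "awful bad traffic", ""]]
print(scores)
```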

Next, we want to see if more polarizing tweets are retweeted more often. To do this, we can plot the `polarity` and `retweet_count` columns against each other.

In [None]:
tweets_and_retweets.plot('polarity', 'retweet_count', kind='scatter');

**Question:** What conclusions can you draw about polarity and retweets from this graph? How does this compare with your assumptions?

## Your turn!
---
If time allows, try these exercises on your own or as a class!

**Exercise 1:** Using the `gentrification_oak` tweets, make a histogram of the time of day the tweets were posted. Note that if you would like the x-axis of the plot to reflect the correct time of day, you will have to convert the time from UTC to PDT.

In [None]:
# YOUR CODE HERE

**Exercise 2:** Try scraping tweets from multiple locations and the same hashtag. Make a histogram for each location and see if there are any differences in the distribution of polarity of the tweets. Feel free to use multiple cells to avoid querying the API repeatedly.

In [None]:
# YOUR CODE HERE