In [1]:
# COMP5339 Week 5 Tutorial
# Material last updated: 23 March 2025
# Note materials were designed with the Roboto Condensed font, which can be installed here: https://www.1001fonts.com/roboto-condensed-font.html

from IPython.display import HTML
HTML('''
    <style> body {font-family: "Roboto Condensed Light", "Roboto Condensed";} h2 {padding: 10px 12px; background-color: #E64626; position: static; color: #ffffff; font-size: 40px;} .text_cell_render p { font-size: 15px; } .text_cell_render h1 { font-size: 30px; } h1 {padding: 10px 12px; background-color: #E64626; color: #ffffff; font-size: 40px;} .text_cell_render h3 { padding: 10px 12px; background-color: #0148A4; position: static; color: #ffffff; font-size: 20px;} h4:before{ 
    content: "@"; font-family:"Wingdings"; font-style:regular; margin-right: 4px;} .text_cell_render h4 {padding: 8px; font-family: "Roboto Condensed Light"; position: static; font-style: italic; background-color: #FFB800; color: #ffffff; font-size: 18px; text-align: center; border-radius: 5px;}input[type=submit] {background-color: #E64626; border: solid; border-color: #734036; color: white; padding: 8px 16px; text-decoration: none; margin: 4px 2px; cursor: pointer; border-radius: 20px;}</style>
    <script> code_show=true; function code_toggle() {if (code_show){$('div.input').hide();} else {$('div.input').show();} code_show = !code_show} $( document ).ready(code_toggle);</script>
    <form action="javascript:code_toggle()"><input type="submit" value="Hide/show all code."></form>
''')

# Week 5 - Web APIs: Working with semi-structured data

This week we go beyond scraping data from websites and use APIs to collect data efficiently. Web APIs are provided specifically to give programs access to data. Their advantage is that the data is well defined and consistent, with a predefined set of operations.

This tutorial will give you a taste of the potential that APIs offer, and will hopefully open your mind to the possibilities of data integration in your future projects.

This will require the following Python libraries:
- **Requests**        for interacting with websites and web services
- **Pandas**          for dataframe management

## 1. Web APIs

We'll explore a few fun examples of APIs with a variety of data types. These libraries will help us process JSON data types, and even to display images.

### 1.1 Exploring JSON Objects with APIs

The World Bank provides the following web API, which gives the total population of a country in a specific year. More information about the API can be found in their [Developer Information](https://datahelpdesk.worldbank.org/knowledgebase/topics/125589-developer-information).

In [2]:
import requests

year = '2022'

response = requests.get(f"https://api.worldbank.org/v2/country/aus/indicator/SP.POP.TOTL?date={year}&format=json")

response

<Response [200]>

Simply returning the response yields nothing of particular use, just the HTTP status code of our request (you should get [200](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes), which indicates success).
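
To make a status code a little more informative, the standard library's `http.HTTPStatus` enum can translate a numeric code into its standard phrase. The sketch below is only an illustration: `describe_status` is a hypothetical helper name, not part of Requests.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that the standard library does not know about.
        phrase = "Unknown status"
    if 200 <= code < 300:
        outcome = "success"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "other"
    return f"{code} {phrase} ({outcome})"

print(describe_status(200))
print(describe_status(404))
```

In the notebook, this could be called as `describe_status(response.status_code)`.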

To interpret the full response body as a JSON object, we can use the `.json()` method:

In [3]:
response.json()

[{'page': 1,
  'pages': 1,
  'per_page': 50,
  'total': 1,
  'sourceid': '2',
  'lastupdated': '2025-07-01'},
 [{'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
   'country': {'id': 'AU', 'value': 'Australia'},
   'countryiso3code': 'AUS',
   'date': '2022',
   'value': 26014399,
   'unit': '',
   'obs_status': '',
   'decimal': 0}]]

The **json** library can also be leveraged to display the information in a nicer way. ``json.dumps()`` takes a Python object (such as our parsed response) and turns it into a JSON string, while ``json.loads()`` parses a valid JSON string back into Python objects. The example below takes our response, formats it nicely, and sorts the keys alphabetically:

In [4]:
import json
print(json.dumps(response.json(), indent=4, sort_keys=True))

[
    {
        "lastupdated": "2025-07-01",
        "page": 1,
        "pages": 1,
        "per_page": 50,
        "sourceid": "2",
        "total": 1
    },
    [
        {
            "country": {
                "id": "AU",
                "value": "Australia"
            },
            "countryiso3code": "AUS",
            "date": "2022",
            "decimal": 0,
            "indicator": {
                "id": "SP.POP.TOTL",
                "value": "Population, total"
            },
            "obs_status": "",
            "unit": "",
            "value": 26014399
        }
    ]
]
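
As a small illustration of the two directions, the sketch below round-trips a dictionary through `json.dumps()` and `json.loads()`. The `record` dictionary is a hand-trimmed sample based on the response above, not a live API result.

```python
import json

# A small sample taken from the World Bank response shown above.
record = {"countryiso3code": "AUS", "date": "2022", "value": 26014399}

text = json.dumps(record, sort_keys=True)   # Python object -> JSON string
restored = json.loads(text)                 # JSON string -> Python object

print(type(text).__name__, text)
print(restored == record)
```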


In this case, the JSON object returned is a list of two elements: a dictionary of paging metadata, followed by a list of result dictionaries. We can access individual elements by index and name:

In [5]:
population = response.json()[1][0]['value']
print("Population of Australia in year", year, "is", population)

Population of Australia in year 2022 is 26014399
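
Chained indexing such as `[1][0]['value']` fails with an unhelpful error if the response does not have the expected shape. The sketch below shows one possible guard. `extract_population` is a hypothetical helper, `sample_payload` is a trimmed copy of the response above, and the assumption that a response without data lacks (or has an empty) second element is ours rather than something stated in the documentation quoted here.

```python
# Sample payload copied (and trimmed) from the World Bank response above.
sample_payload = [
    {"page": 1, "pages": 1, "per_page": 50, "total": 1},
    [{"country": {"id": "AU", "value": "Australia"},
      "date": "2022",
      "value": 26014399}],
]

def extract_population(payload):
    """Return (country, year, population) from the first record, or None."""
    # Expect a two-element list: a metadata dict, then a list of records.
    if not isinstance(payload, list) or len(payload) < 2 or not payload[1]:
        return None
    record = payload[1][0]
    return record["country"]["value"], record["date"], record["value"]

print(extract_population(sample_payload))
```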


### 1.2 Exploring Images with APIs

We can even query the web to extract images, rather than a JSON object. The helper function below will display an image as a JPG with fixed dimensions:

In [6]:
import ipywidgets as widgets
def display_image(response, w=200, h=300):
    return widgets.Image(value=response.content, format='jpg', width=w, height=h)

Take the simple PlaceDog website, which returns a randomised image of a dog for given dimensions (in our case, just 300x200):

In [7]:
response = requests.get('https://place.dog/300/200')
display_image(response)

Image(value=b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01,\x00\x00\x00\xc8\x08\x06\x00\x00\x00R\xdf\xdcU\x…
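
Note that the output above begins with the PNG signature (`\x89PNG`), even though the helper labels the image as `'jpg'`. As a rough sketch, the format of downloaded bytes can be guessed from their first few bytes. `sniff_image_format` is a hypothetical helper that only recognises three common formats.

```python
def sniff_image_format(data):
    """Guess an image format from the first bytes of the payload."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    return "unknown"

print(sniff_image_format(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
print(sniff_image_format(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
```

In the notebook, `sniff_image_format(response.content)` could be used to check what a service actually returned.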

### 1.3 Using API Documentation

Most APIs will have more functionality and require you to read their documentation to understand what other features can be extracted. Let's try a dog API with slightly more sophisticated documentation that allows us to extract both information and images - the [Dog API](https://dog.ceo/dog-api/documentation/), which is built on the Stanford Dogs Dataset.


In [8]:
response = requests.get('https://dog.ceo/api/breed/labrador/images/random')
breeds = response.json()
breeds.keys()

dict_keys(['message', 'status'])

As seen by observing the keys above, or consulting the documentation, two things are returned - a "message", and the "status". We're interested in the "message":

In [9]:
breeds['message']

'https://images.dog.ceo/breeds/labrador/n02099712_7815.jpg'

The "[breed](https://dog.ceo/dog-api/documentation/breed)" page of the documentation defines the general format of a URL to return a random image of a selected dog breed. The function below leverages this to return a single image from the selected breed:

In [10]:
def random_dog(breed):
    # The first request returns JSON whose 'message' is a URL to a random image of the selected breed.
    response = requests.get(f'https://dog.ceo/api/breed/{breed}/images/random')
    image_url = response.json()['message']
    # A second call to requests.get() is needed to download the actual image.
    response_image = requests.get(image_url)
    return display_image(response_image)

Try choosing a breed (the API documentation lists those available), and returning an image for it below (the example currently requests a 'husky'):

In [11]:
random_dog('husky')

Image(value=b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xdb\x00C\x00\x08\x06\x0…

### 1.4 Using DataFrames with APIs

Take another animal-related API as an example below: [**TheCatAPI**](https://developers.thecatapi.com/), where users can request animal data in JSON format.

In [12]:
import pandas as pd
response = requests.get('https://api.thecatapi.com/v1/images/search')
print(response.status_code)
animal_info = response.json()
animal_info

200


[{'id': '5lr',
  'url': 'https://cdn2.thecatapi.com/images/5lr.jpg',
  'width': 3264,
  'height': 1952}]

**Task: Return the random entries of 10 animals from the API as a dataframe.**

Investigate the documentation, find the URL to request so that the data of 10 animals can be extracted at once, and return this as a Pandas dataframe. Note that with dataframes, this will prove much trickier on less consistent data sources! JSON objects can vary greatly in structure and nesting depth - it is much simpler here, since each animal's object contains a single level of depth.

In [13]:
### TO DO


## 2. Map APIs

In addition to simply taking what is returned to us, we can specify parameters to send to a web service.
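
As a rough illustration of what sending parameters means, the sketch below uses the standard library to turn a dictionary into a URL query string. Requests builds the query string in a similar way when given a `params` dictionary. The base URL is the World Bank endpoint used earlier.

```python
from urllib.parse import urlencode

base_url = "https://api.worldbank.org/v2/country/aus/indicator/SP.POP.TOTL"
params = {"date": "2022", "format": "json"}

# urlencode turns the dictionary into a query string, escaping values as needed.
url = f"{base_url}?{urlencode(params)}"
print(url)
```

Keeping parameters in a dictionary avoids hand-building long URLs and takes care of escaping characters such as spaces.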

There are several Map API systems that allow you to convert a location address to a GPS location (and some more information). The most popular of these is Google Maps API. However, this underwent an access change in June 2018 which meant that it now requires an API key and associated billing information.

Instead, we will use the OpenStreetMap project at https://www.openstreetmap.org, particularly their Nominatim data access API ([documentation here](https://nominatim.org/release-docs/develop/api/Search/)).

**Note, however, that this service is restricted to 1 API call per second.**

The helper function below receives the given parameters and carries out the request, pausing for a few seconds first:

In [14]:
import time as t
def address_details(params, wait=5):
    base_url = 'https://nominatim.openstreetmap.org/search'
    headers = {'User-Agent': 'COMP5339'}
    # Pause before each request (5 seconds by default) to avoid overloading the service;
    # too many requests could lock out the whole uni's IP range!
    t.sleep(wait)
    response = requests.get(base_url, params=params, headers=headers)
    return response.json()
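
A fixed pause before every request is simple but slower than necessary. As an alternative sketch, the hypothetical `Throttle` class below waits only for whatever remains of the minimum interval since the previous call. The `clock` and `sleep` arguments are our own additions so that the behaviour can be checked without actually waiting.

```python
import time

class Throttle:
    """Ensure at least `min_interval` seconds pass between successive calls."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_call = None   # time of the previous call, if any

    def wait(self):
        """Sleep only as long as needed; return the number of seconds slept."""
        slept = 0.0
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
                now = self._clock()
        self._last_call = now
        return slept

throttle = Throttle(min_interval=1.0)
throttle.wait()   # first call returns immediately
throttle.wait()   # second call pauses for roughly one second
```

Calling `throttle.wait()` just before each `requests.get()` would keep requests at least one second apart, in line with the restriction noted above.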

The following example looks up the GPS location of the building opposite the School of CS building, at "50 Cleveland Street, Chippendale, Australia":

In [15]:
parameters = {'q': '50 Cleveland Street, Chippendale, Australia', 'format': 'json', 'addressdetails': 0}
results = address_details(parameters)
print(json.dumps(results, indent=4))

[
    {
        "place_id": 19476830,
        "licence": "Data \u00a9 OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright",
        "osm_type": "node",
        "osm_id": 4127856495,
        "lat": "-33.8879173",
        "lon": "151.1949359",
        "class": "place",
        "type": "house",
        "place_rank": 30,
        "importance": 8.246051728079679e-05,
        "addresstype": "place",
        "name": "",
        "display_name": "50, Cleveland Street, Chippendale, Sydney CBD, Sydney, Council of the City of Sydney, New South Wales, 2008, Australia",
        "boundingbox": [
            "-33.8879673",
            "-33.8878673",
            "151.1948859",
            "151.1949859"
        ]
    }
]
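
Note that `lat` and `lon` come back as strings. The sketch below converts them to floats for the first search result. `first_coordinates` is a hypothetical helper, and `results_sample` is a trimmed copy of the output above.

```python
# Trimmed copy of the first search result shown above.
results_sample = [
    {"lat": "-33.8879173", "lon": "151.1949359",
     "display_name": "50, Cleveland Street, Chippendale, Sydney CBD, Sydney, "
                     "Council of the City of Sydney, New South Wales, 2008, Australia"}
]

def first_coordinates(results):
    """Return (latitude, longitude) of the first result as floats, or None."""
    if not results:
        return None
    top = results[0]
    return float(top["lat"]), float(top["lon"])

print(first_coordinates(results_sample))
```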


If we are less specific about which suburb in Australia this "50 Cleveland Street" is in, multiple results are returned by the search:

In [16]:
parameters = {'q': '50 Cleveland Street, Australia', 'format': 'json', 'addressdetails': 1}
results = address_details(parameters)
print(json.dumps(results, indent=4))

[
    {
        "place_id": 19476830,
        "licence": "Data \u00a9 OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright",
        "osm_type": "node",
        "osm_id": 4127856495,
        "lat": "-33.8879173",
        "lon": "151.1949359",
        "class": "place",
        "type": "house",
        "place_rank": 30,
        "importance": 8.246051728079679e-05,
        "addresstype": "place",
        "name": "",
        "display_name": "50, Cleveland Street, Chippendale, Sydney CBD, Sydney, Council of the City of Sydney, New South Wales, 2008, Australia",
        "address": {
            "house_number": "50",
            "road": "Cleveland Street",
            "suburb": "Chippendale",
            "borough": "Sydney CBD",
            "city": "Sydney",
            "municipality": "Council of the City of Sydney",
            "state": "New South Wales",
            "ISO3166-2-lvl4": "AU-NSW",
            "postcode": "2008",
            "country": "Australia",
            "countr

Note that in this case, an extra layer of depth is introduced for each search result - the address, which is itself a dictionary inside the main dictionary. We can "flatten" this into a dataframe using Pandas' `json_normalize()` function, which will return a dataframe with one row per search result:

In [17]:
pd.json_normalize(results)

| | place_id | licence | osm_type | osm_id | lat | lon | class | type | place_rank | importance | ... | address.suburb | address.borough | address.city | address.municipality | address.state | address.ISO3166-2-lvl4 | address.postcode | address.country | address.country_code | address.city_district |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19476830 | Data © OpenStreetMap contributors, ODbL 1.0. h... | node | 4127856495 | -33.8879173 | 151.1949359 | place | house | 30 | 8.2e-05 | ... | Chippendale | Sydney CBD | Sydney | Council of the City of Sydney | New South Wales | AU-NSW | 2008 | Australia | au | |
| 1 | 23514924 | Data © OpenStreetMap contributors, ODbL 1.0. h... | node | 6351438673 | -27.4999818 | 153.044542 | place | house | 30 | 7.7e-05 | ... | Stones Corner | | City of Brisbane | | Queensland | AU-QLD | 4120 | Australia | au | Stones Corner |
| 2 | 29198069 | Data © OpenStreetMap contributors, ODbL 1.0. h... | way | 424520657 | -31.9043936 | 115.8776633 | building | detached | 30 | 7.5e-05 | ... | Dianella | | | City of Stirling | Western Australia | AU-WA | 6059 | Australia | au | |
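
To see where column names such as `address.suburb` come from, here is a minimal sketch of the flattening idea in plain Python. It is not how Pandas implements `json_normalize()`. `flatten` is a hypothetical helper, and `result` is a trimmed copy of one search result.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into one level with dot-separated keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Trimmed copy of one search result from above.
result = {
    "place_id": 19476830,
    "address": {"suburb": "Chippendale", "city": "Sydney", "postcode": "2008"},
}
print(flatten(result))
```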


**Task: Find the maximum road speed of a location of your choice using the API.**

Select a location known to you (or take the definitely random example of [150 Freston Road, London](https://www.youtube.com/watch?v=j0FyxbiEgo0)), and try using the 'extratags' parameter (check the documentation) to determine the maximum speed of that road.

In [18]:
### TO DO


## 3. Extension (Optional)

As data engineers, our responsibility is sometimes to create a data pipeline to extract information. Setting up these pipelines is time consuming: the data wrangling/cleaning phase of a data science project is often estimated at more than 80% of the total time cost [(Breaking the 80/20 rule: How data catalogs transform data scientists’ productivity)](https://medium.com/@armand_ruiz/breaking-the-80-20-rule-how-data-catalogs-transform-data-scientists-productivity-7759a23a8893). We will have to wear many hats, which is why the extension task this week uses YouTube's Data API.

The YouTube Data API will require a Google account and the creation of an API key to query data. Here are instructions on how to do that: https://developers.google.com/youtube/v3/getting-started

**Note: Your default quota allocation is 10,000 units per day. This is 100 search queries a day!**

Google calculates your quota usage by assigning a cost to each request. Different types of operations have different quota costs. For example:

* A read operation that retrieves a list of resources (channels, videos, playlists) usually costs 1 unit.
* A write operation that creates, updates, or deletes a resource usually costs 50 units.
* A search request costs 100 units.
* A video upload costs 1600 units.
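
The arithmetic behind the "100 search queries a day" figure can be sketched as follows, using the default quota and the unit costs listed above. `max_operations` is a hypothetical helper, and actual costs may vary by request.

```python
DAILY_QUOTA = 10_000  # default allocation quoted above

# Approximate unit costs listed above.
COSTS = {"read": 1, "write": 50, "search": 100, "upload": 1600}

def max_operations(operation, quota=DAILY_QUOTA):
    """How many operations of one kind fit into the daily quota."""
    return quota // COSTS[operation]

for name in COSTS:
    print(name, max_operations(name))
```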

After creating your API key, we will need to install a Python module:
- Anaconda: `conda install -c conda-forge google-api-python-client`
- or Pip: `pip install google-api-python-client`

Once this is configured, you can use your API key in the code below, which returns data on YouTube videos posted by the [University of Sydney on YouTube](https://www.youtube.com/channel/UChyxYzq0ZAB0iBw-A1l5jFA), most recent first.

In [19]:
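API_KEY = "?" # your api key here
CHANNEL_ID = "UChyxYzq0ZAB0iBw-A1l5jFA"
pageToken = ""  # set to a previous response's nextPageToken to fetch the next page

params = {
    "key": API_KEY,
    "channelId": CHANNEL_ID,
    "part": "snippet,id",
    "order": "date",
    "maxResults": 50,  # the search endpoint accepts at most 50 results per page
}
if pageToken:
    params["pageToken"] = pageToken

response = requests.get("https://www.googleapis.com/youtube/v3/search", params=params).json()
response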

{'error': {'code': 400,
  'message': 'API key not valid. Please pass a valid API key.',
  'errors': [{'message': 'API key not valid. Please pass a valid API key.',
    'domain': 'global',
    'reason': 'badRequest'}],
  'status': 'INVALID_ARGUMENT',
  'details': [{'@type': 'type.googleapis.com/google.rpc.ErrorInfo',
    'reason': 'API_KEY_INVALID',
    'domain': 'googleapis.com',
    'metadata': {'service': 'youtube.googleapis.com'}},
   {'@type': 'type.googleapis.com/google.rpc.LocalizedMessage',
    'locale': 'en-US',
    'message': 'API key not valid. Please pass a valid API key.'}]}}

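With a valid API key, the response contains an `items` list, which we can once more turn into a dataframe (with the placeholder key above, the response holds only an error, so the lookup fails with `KeyError: 'items'`):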

In [20]:
pd.json_normalize(response['items']).head()

KeyError: 'items'
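
The `KeyError` above arises because an error response has no `items` key. A minimal sketch of a guard that surfaces the API's own error message instead is shown below. `get_items` is a hypothetical helper, and `error_payload` is a trimmed copy of the error response above.

```python
def get_items(payload):
    """Return the 'items' list, or raise a readable error if the API reported one."""
    if "error" in payload:
        error = payload["error"]
        raise RuntimeError(f"API error {error.get('code')}: {error.get('message')}")
    return payload.get("items", [])

# Trimmed copy of the error response shown above.
error_payload = {"error": {"code": 400,
                           "message": "API key not valid. Please pass a valid API key."}}
try:
    get_items(error_payload)
except RuntimeError as exc:
    print(exc)
```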

Now, let's leverage `googleapiclient` so that we can send parameters to YouTube for our search query, in this case a search for videos matching "Will Smith Chris Rock":

In [0]:
import googleapiclient
import googleapiclient.discovery

api_service_name = "youtube"
api_version = "v3"

youtube = googleapiclient.discovery.build(api_service_name, api_version, developerKey = API_KEY)

request = youtube.search().list(
        part="id,snippet",
        type='video',
        q="Will Smith Chris Rock",
        videoDefinition='high',
        maxResults=10
)

response = request.execute()

In [0]:
pd.json_normalize(response['items'])[['id.videoId','snippet.title', 'snippet.description', 'snippet.channelTitle']]
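
Search results arrive in pages. In the YouTube Data API, a response with further results includes a `nextPageToken` value, which is sent back as the `pageToken` parameter of the next request. The sketch below shows the general loop without making any real API calls. `iter_items`, `fetch_page` and `FAKE_PAGES` are hypothetical stand-ins.

```python
def iter_items(fetch_page):
    """Yield items from every page, following 'nextPageToken' until it is absent."""
    token = None
    while True:
        page = fetch_page(token)
        yield from page.get("items", [])
        token = page.get("nextPageToken")
        if not token:
            break

# Stand-in for real API calls: two fake pages keyed by page token.
FAKE_PAGES = {
    None: {"items": ["video-1", "video-2"], "nextPageToken": "page-2"},
    "page-2": {"items": ["video-3"]},
}

print(list(iter_items(lambda token: FAKE_PAGES[token])))
```

With a real client, `fetch_page` would issue one request per token. Keep the quota in mind, since every page of search results is a separate request.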

**Task: Extract comments from the top video above.**

Try extracting 100 comments from the top video listed above.

Hint: Use the YouTube Data API documentation and look into `commentThreads`.

In [0]:
### TO DO
