<a href="https://colab.research.google.com/github/ayarzuki/Orbit-Futur-Academy/blob/main/tutorial_1_1_rest_api.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using REST API to Collect Data

*by [Vladimir Haltakov](https://twitter.com/haltakov)*

Welcome to the first tutorial of my upcoming Machine Learning course! I decided to share this one at a very early stage in order to gather feedback from the community. It still needs quite a bit of polishing, but I wanted to put it out there as fast as possible and get your opinion on the presentation style and the content.

I would appreciate it if you filled out this short survey (less than 5 minutes) after you complete the tutorial: [https://airtable.com/shr4p4kwKF6mqJl4h](https://airtable.com/shr4p4kwKF6mqJl4h). Feel free to reach out to me on Twitter as well: [@haltakov](https://twitter.com/haltakov).

Thank you!

---
## Setup

Please execute the cell below to set up your environment.

In [None]:
# @title Setup Environment

# Import packages
import os
import json
from pprint import pprint
from IPython.display import Image, YouTubeVideo, HTML, display
from urllib.request import urlopen, Request

# Embed a YouTube video and show a fallback link below it
def show_youtube_video(video_id):
  display(YouTubeVideo(video_id, 854, 480))
  display(HTML(f"Video available at <a href='https://www.youtube.com/watch?v={video_id}' target='_blank'>https://www.youtube.com/watch?v={video_id}</a>"))

---
## Introduction

Using REST APIs to collect data from web services is a common way to build a dataset. In this tutorial, you will learn how to interact with a REST API in Python and how to use it to collect data.

In [None]:
# @title Video 1 - Introduction
show_youtube_video("6Beo2399xWA")

## Register for the Unsplash API

In order to use the Unsplash API you need to register for a free account and create a demo application.
1. Create a free Unsplash account here: [https://unsplash.com/join](https://unsplash.com/join)
2. Create a new application here: [https://unsplash.com/oauth/applications](https://unsplash.com/oauth/applications)
3. Check out the documentation here: [https://unsplash.com/documentation](https://unsplash.com/documentation)

In [None]:
# @title Video 2 - Register on Unsplash
show_youtube_video("dYzpGWj57cc")

## Using the Unsplash API in the browser

Here is a simple GET request you can try out directly in your browser. Make sure you put your access key in the URL.

`https://api.unsplash.com/search/photos?query=beach&client_id=<YOUR ACCESS KEY>`

You can also try using [https://reqbin.com/](https://reqbin.com/) to create more complicated requests and inspect the responses in detail.
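
The URL above is just the endpoint followed by a query string of `name=value` pairs. As a minimal sketch of how the same URL could be assembled in Python, the helper below uses `urlencode` from the standard library so that spaces and special characters in the search term are escaped. The function name `build_search_url` and the placeholder key are illustrative assumptions, not part of the Unsplash API.

```python
from urllib.parse import urlencode


def build_search_url(query, access_key, **extra_params):
    """Build a URL for the /search/photos endpoint (hypothetical helper)."""
    params = {"query": query, "client_id": access_key, **extra_params}
    # urlencode escapes each value and joins the pairs with "&"
    return "https://api.unsplash.com/search/photos?" + urlencode(params)


# "YOUR_ACCESS_KEY" is a placeholder, not a real key
url = build_search_url("sandy beach", "YOUR_ACCESS_KEY", per_page=5)
print(url)
```

You can paste the printed URL into the browser after replacing the placeholder with your own access key.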

In [None]:
# @title Video 3 - API requests in the browser
show_youtube_video("mpj9QgCdDb8")

In [None]:
# @title Video 4 - API requests in reqbin.com
show_youtube_video("_sy2TUJFBdM")

## Handling secrets in Python

After you register your demo application on Unsplash, you will get 2 keys you need to access the API: an access key and a secret key. But where should we store them?

In [None]:
# @title Video 5 - Handling Secrets

show_youtube_video("MDdNRPlKaoU")

It is a bad idea to store keys and passwords together with your code, so we'll use the `dotenv` package (installed as `python-dotenv`) to keep them in a separate file.

You need to create a file called `.env` in the root folder of your project where you can put your keys. After that, you just need to run the code below and you can access your keys like this: `os.environ["ACCESS_KEY"]`.

The content of the `.env` file should look like this:

> `ACCESS_KEY=<YOUR ACCESS KEY>`

> `SECRET_KEY=<YOUR SECRET KEY>`
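
To make the file format concrete, here is a simplified sketch of what loading such a file amounts to: each `KEY=VALUE` line becomes an environment variable. This is not how `python-dotenv` is implemented (it also handles quoting, comments after values, and more); `load_env_text` is a hypothetical helper and the keys are made-up sample values.

```python
import os


def load_env_text(text, environ=os.environ):
    """Copy KEY=VALUE pairs from .env-style text into `environ` (simplified)."""
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # setdefault keeps a variable that is already set
        environ.setdefault(key.strip(), value.strip())
    return environ


# Made-up sample values; a plain dict stands in for os.environ here
sample = "ACCESS_KEY=abc123\nSECRET_KEY=xyz789\n"
env = load_env_text(sample, environ={})
print(env["ACCESS_KEY"])
```

Remember to keep the real `.env` file out of version control, for example by listing it in `.gitignore`.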

In [None]:
# @title Video 6 - Handling Secrets, Good Practices

show_youtube_video("IPPaklKxC0E")

In [None]:
# Load the variables from the .env file
from dotenv import load_dotenv

load_dotenv(".env")

## Using the Unsplash API in Python

It's now time to automate the usage of the API using Python. Let's start with the same simple GET request we used directly in the browser.

In [None]:
# @title Video 7 - REST API in Python

show_youtube_video("dIUvzmPu_pA")

In [None]:
# Create the request URL
url = f"https://api.unsplash.com/search/photos?query=beach&client_id={os.environ['ACCESS_KEY']}"

# Send the request
response = urlopen(url)

# Check the response code (should be 200)
print("Response code:", response.code)

In [None]:
# Fetch the response text and show the beginning
response_text = response.read()
print(response_text[:300])

In [None]:
# Parse the response as JSON and show the number of photos
response_json = json.loads(response_text)
print("Photos in the response:", len(response_json["results"]))

## Create a Request with custom headers

Now, let's create a more complex Request using custom headers, where we'll put the access key.
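
Before sending anything, it can help to build the `Request` object and inspect it. The sketch below only constructs the request and prints what would be sent, so it runs without network access. The helper name `build_authorized_request` and the placeholder key are assumptions for illustration.

```python
from urllib.request import Request


def build_authorized_request(url, access_key):
    """Create a Request that carries the access key in a header (hypothetical helper)."""
    request = Request(url)
    request.add_header("Authorization", f"Client-ID {access_key}")
    return request


# "YOUR_ACCESS_KEY" is a placeholder, not a real key
request = build_authorized_request(
    "https://api.unsplash.com/search/photos?query=beach", "YOUR_ACCESS_KEY"
)
print(request.full_url)
print(request.get_header("Authorization"))
```

Note that the key no longer appears in the URL, so it is less likely to end up in logs or the browser history.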

In [None]:
# @title Video 8 - Custom request headers

show_youtube_video("YSrWqwtvFlE")

In [None]:
# Create the request and add an authentication header
request = Request("https://api.unsplash.com/search/photos?query=beach")
request.add_header("Authorization", f"Client-ID {os.environ['ACCESS_KEY']}")
response = urlopen(request)

# Parse the response and show the number of photos
response_json = json.loads(response.read())
print("Photos in the response:", len(response_json["results"]))

## Creating multiple requests to fetch data for multiple photos

Now let's build a more complex script that fetches the information for multiple photos and some related information and visualizes them.
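
When printing fields from many photos, some of them may be missing or `null` in the JSON (for instance a description or a last name), which shows up as `None` in Python. The following is a small sketch of how such values could be handled before display. It assumes this can happen for these fields; `format_photographer` and the sample record are made up for illustration.

```python
def format_photographer(user):
    """Join the name parts that are present, skipping missing or null ones."""
    parts = [user.get("first_name"), user.get("last_name")]
    return " ".join(part for part in parts if part)


# Made-up sample record shaped like one search result
sample_photo = {
    "description": None,
    "user": {"first_name": "Ana", "last_name": None},
}

print("Photographer:", format_photographer(sample_photo["user"]))
print("Description:", sample_photo["description"] or "(no description)")
```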

In [None]:
# @title Video 9 - Multiple Requests Example

show_youtube_video("wGPNjnPdsg8")

In [None]:
# Function that fetches the data for a specified photo
def load_photo_data(photo_id):
    request = Request(f"https://api.unsplash.com/photos/{photo_id}")
    request.add_header("Authorization", f"Client-ID {os.environ['ACCESS_KEY']}")
    response = urlopen(request)
    return json.loads(response.read())


# Test the function by displaying the information for this photo: https://unsplash.com/photos/i5VhkcMiqHw
photo_data = load_photo_data("i5VhkcMiqHw")
pprint(photo_data)

In [None]:
# Define the search query
search_query = "beach"

# Build the search request
request = Request(f"https://api.unsplash.com/search/photos?query={search_query}&orientation=landscape")
request.add_header("Authorization", f"Client-ID {os.environ['ACCESS_KEY']}")
response = urlopen(request)
search_results = json.loads(response.read())

# Display some statistics about the results
print("Total photos found:", search_results["total"])

In [None]:
# Display each result photo
for photo in search_results["results"]:
    # Fetch additional photo information
    photo_data = load_photo_data(photo["id"])

    display(Image(url=photo["urls"]["small"]))
    print("Description:", photo["description"])
    print("Photographer:", photo["user"]["first_name"], photo["user"]["last_name"])
    print("Views:", photo_data["views"])
    print("Downloads:", photo_data["downloads"])
    print()

## Writing data to a CSV file

In machine learning we often work with datasets containing all the required information. A popular way to store data is using CSV files. Let's dump the collected data into a file.
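
As a quick sketch of how the `csv` module behaves with a `;` delimiter, the example below writes two made-up rows to an in-memory buffer and reads them back. It uses `io.StringIO` instead of a real file so it has no side effects. The sample rows are assumptions, not real API data.

```python
import csv
import io

# Made-up sample rows; note the ";" inside the first description
rows = [
    {"photo_id": "abc", "description": "Waves; sand and sun", "likes": 12},
    {"photo_id": "def", "description": None, "likes": 3},
]

# Write the rows to an in-memory buffer using ";" as the delimiter
buffer = io.StringIO(newline="")
writer = csv.DictWriter(
    buffer, fieldnames=["photo_id", "description", "likes"], delimiter=";"
)
writer.writeheader()
writer.writerows(rows)

# Read the rows back; every value comes back as a string
buffer.seek(0)
loaded = list(csv.DictReader(buffer, delimiter=";"))
print(loaded[0]["description"])
print(loaded[0]["likes"])
```

The writer quotes values that contain the delimiter, so the description survives the round trip. Numbers come back as strings, and `None` is written as an empty field.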

In [None]:
# @title Video 10 - Writing data in CSV file

show_youtube_video("lYRqEpPZZiA")

In [None]:
import csv

# Define the search query
search_query = "beach"

# Fetch the search results
request = Request(f"https://api.unsplash.com/search/photos?query={search_query}&per_page=30")
request.add_header("Authorization", f"Client-ID {os.environ['ACCESS_KEY']}")
response = urlopen(request)
search_results = json.loads(response.read())

# Write the results to photos.csv file
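# newline="" is recommended by the csv module to avoid blank lines on some platforms
with open("photos.csv", "w", newline="", encoding="utf-8") as csvfile: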
    csvwriter = csv.writer(csvfile, delimiter=";")

    # Write the column headers
    csvwriter.writerow(
        [
            "photo_id",
            "created_at",
            "description",
            "url",
            "likes",
            "username",
            "name",
        ]
    )

    # Write the data for each photo
    for photo_data in search_results["results"]:
        csvwriter.writerow(
            [
                photo_data["id"],
                photo_data["created_at"],
                photo_data["description"],
                photo_data["urls"]["small"],
                photo_data["likes"],
                photo_data["user"]["username"],
                f"{photo_data['user']['first_name']} {photo_data['user']['last_name']}",
            ]
        )

## EXERCISE: Fetch photos from multiple pages

You may have noticed while playing around with the API that you can't fetch the data of more than 30 photos at once. This limit is imposed by the Unsplash API on purpose in order to prevent scraping attempts. You can still download more images by creating multiple requests, one for each "page".

Read the chapter on Pagination in the Unsplash [API documentation](https://unsplash.com/documentation#pagination) and write the code to automatically download 100 photos and save the photo ID and URL in a CSV file.

In [None]:
# @title Video 11 - Handling pages. Exercise

show_youtube_video("RzDTAnqsSrs")

In [None]:
# Define the search query and number of photos to download
search_query = "beach"
photos_to_download = 100

### EXERCISE ###
# TODO Write the code to download more than 30 photos using pages
# ...

## SOLUTION: Fetch photos from multiple pages

After trying to write the code yourself, you can check out my answer below. Please, really give it a try - this is the best way to learn new things, even if it may be difficult.
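
The page arithmetic can be looked at in isolation before any requests are involved. The sketch below works out how many photos to keep from each page for a given target. It assumes 30 photos per page, as described above; `plan_pages` is a hypothetical helper and not part of the solution code.

```python
import math


def plan_pages(photos_to_download, photos_per_page=30):
    """Return (page, count) pairs: how many photos to keep from each page."""
    pages_count = math.ceil(photos_to_download / photos_per_page)
    plan = []
    for page in range(1, pages_count + 1):
        # Photos still needed when this page is requested
        remaining = photos_to_download - (page - 1) * photos_per_page
        plan.append((page, min(photos_per_page, remaining)))
    return plan


print(plan_pages(100))
```

For 100 photos this gives three full pages and a last page from which only 10 photos are kept.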

In [None]:
# @title Video 12 - Handling pages. Solution

show_youtube_video("94ovdEwxnYw")

In [None]:
import csv
import math

# Define the search query and number of photos to download
search_query = "beach"
photos_to_download = 100

# Calculate the number of pages needed
photos_per_page = 30
pages_count = math.ceil(photos_to_download / photos_per_page)

# Create the output CSV file
with open("photos_100.csv", "w", newline="") as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=";")

    # Write the column headers
    csvwriter.writerow([
        "photo_id",
        "url",
    ])

    # Photos left to download
    photos_left = photos_to_download

    # Download each page
    for page in range(1, pages_count + 1):
        # Check if there are more photos to download
        if photos_left <= 0:
            break

        # Request a page
        request = Request(f"https://api.unsplash.com/search/photos?query={search_query}&per_page={photos_per_page}&page={page}")
        request.add_header("Authorization", f"Client-ID {os.environ['ACCESS_KEY']}")
        print("New request:", request.full_url)

        response = urlopen(request)
        search_results = json.loads(response.read())

        # Check if enough photos are available
        total_photos = int(response.headers["X-Total"])
        if photos_left > total_photos:
            photos_left = total_photos

        # Determine how many photos need to be downloaded (handling the last page)
        if photos_left > photos_per_page:
            results = search_results["results"]
        else:
            results = search_results["results"][:photos_left]

        # Download all photos
        for photo_data in results:
            csvwriter.writerow([photo_data["id"], photo_data["urls"]["small"]])

        # Update the photos left to download
        photos_left -= photos_per_page

## Conclusion

Thank you for taking the time to work through this early version of the tutorial. I would appreciate it if you filled out this short survey (less than 5 minutes) to tell me what you think about it: [https://airtable.com/shr4p4kwKF6mqJl4h](https://airtable.com/shr4p4kwKF6mqJl4h).

Feel free to reach out to me on Twitter as well [@haltakov](https://twitter.com/haltakov).