<div width=50% style="display: block; margin: auto">
    <img src="figures/ucl-logo.svg" width=100%>
</div>

### [UCL-ELEC0136 Data Acquisition and Processing Systems 2024]()
University College London
# Lab 1: Data Acquisition


<hr width=70% style="float: left">

### Objectives

The data acquisition lab aims to show how we can find and acquire various data sources (e.g., web scraping, social media, databases, sensors).   
In this lab, you will learn:
- How to use Python to acquire data from various sources
- How to store data in JSON format
- How to commit and push your code to GitHub


### Outline

In this class we will cover the following topics:

1. [Accepting an assignment](#1.-Accepting-an-assignment)
2. [Cloning a repository](#2.-Cloning-a-repository)
3. [Creating a virtual environment](#3.-Creating-a-virtual-environment)
4. [Acquiring data using the `requests` module and the GitHub RESTful APIs](#4.-Acquiring-data-using-the-requests-module-and-the-GitHub-RESTful-APIs)
5. [Pushing your code to GitHub](#5.-Pushing-your-code-to-GitHub)
6. [Submitting your assignment](#6.-Submitting-your-assignment)

<hr width=70% style="float: left">

## 1. Accepting an assignment

- For every week of the course, you will be given an assignment to complete.

- You can find the **links** to your assignments on Moodle.

- When you click on the link, you will be redirected to GitHub Classroom, where you will be asked to accept the assignment.

- GitHub Classroom will create a **public repository** for you, which will contain the assignment and be named following the pattern `<assignment-name>-<your-github-username>`. The page will look like this:



<div width=50% style="display: block; margin-left: auto">
    <img src="figures/accepted-assignment.png" style="display: block; margin: auto" width=50%>
</div>


In [2]:
# write your code here, if applicable

## 2. Cloning a repository


<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Open a terminal
- Navigate to the directory where you want to store your assignment
- Clone the repository using the command `git clone <repository-url>`
- Navigate to the cloned repository using the command `cd <repository-name>`

</div>

In [3]:
# write your code here, if applicable

## 3. Creating a virtual environment

You should already have installed [**Anaconda**](https://docs.anaconda.com/free/anaconda/) or an open-source equivalent (preferred), such as [**Miniconda**](https://docs.conda.io/projects/miniconda/en/latest/miniconda-install.html).

In this course we advise managing your Python version (e.g. Python 3.9 vs 3.10) using `conda`, and the packages required by your project using `pip`.
Check out the preliminary material on the [UCL Moodle](https://moodle.ucl.ac.uk/mod/page/view.php?id=5693674) for instructions on how to use environment files.
Alternatively, have a look [here](https://github.com/UCL-ELEC0136/example-requirements) to see what an environment file looks like when the pip requirements are included in the file itself,
or [here](https://github.com/UCL-ELEC0136/setup) to see what it looks like when the pip requirements are stored separately in another file.


This is a key step as it guarantees that your development environment is **reproducible** by somebody else.
Failing to do this in the final assignment will get you a score of 0 (zero) in reproducibility.

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Create a `conda` environment file and name it `environment.yml`.
- The environment (not the file) **must** be named `daps`.
- Add `pip` dependencies **within** the environment file to specify the packages that your project needs to run. If you don't know yet which dependencies to use, you can leave them out for now and add them later.
- Create the environment using `conda env create -f environment.yml`.
- Activate the environment using `conda activate daps`.
- If you keep the `pip` requirements in a separate file instead, install them using `pip install -r requirements.txt`.
</div>
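
As a minimal sketch of the file described above, the snippet below writes an `environment.yml` whose environment is named `daps` and whose `pip` dependencies sit inside the file itself. Only the name `daps` comes from this lab: the Python version, the `requests` dependency, and the helper name `write_environment_file` are assumptions chosen for illustration, so adapt them to your project.

```python
from pathlib import Path

# Assumed contents: the Python version and the `requests` dependency are
# illustrative choices; only the environment name `daps` is required by the lab.
ENVIRONMENT_YML = """\
name: daps
dependencies:
  - python=3.10
  - pip
  - pip:
      - requests
"""


def write_environment_file(path="environment.yml"):
    """Write the sketch environment file to `path` and return the path."""
    target = Path(path)
    target.write_text(ENVIRONMENT_YML, encoding="utf-8")
    return target
```

Calling `write_environment_file()` from the root of your repository creates the file, which `conda env create -f environment.yml` can then turn into an environment. Writing the file by hand in a text editor works just as well.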


<div class="alert alert-heading alert-danger" style="background-color: white; border: 2px solid; border-radius: 5px; color: #000; border-color:#AAA; padding: 10px">
    <b>💎 Tip</b>

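You can prepend an exclamation mark `!` to any command in a Jupyter notebook to run it in a shell. For example, `!git status` shows the status of your repository. Each `!` command runs in its own temporary shell, so a command such as `!conda activate daps` does not change the environment used by the notebook: activate the environment in your terminal before launching Jupyter.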

</div>



In [4]:
# write your code here, if applicable

## 4. Acquiring data using the `requests` module and the GitHub RESTful APIs

The `requests` module is a Python module that allows you to send HTTP requests to a server and receive a response.
We will use it to acquire data from the web.

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Import the `requests` module using `import requests`.
- Use the `requests.get()` function to send a `GET` request to a server. The function takes the URL of the server as input and returns a `Response` object.
- Acquire 100 repositories from the `https://github.com/orgs/UCL-ELEC0136` organisation.
- Use the `Response` object to access the server's response. For example, you can access the response status code using `Response.status_code`.
- Use the `Response.json()` method to parse the response content into Python objects (for this endpoint, a `list` of `dict` objects, one per repository).
- Use the `json` module to save the parsed object to a file using `json.dump()`.
- Use the `json.load()` function to load the object back from the file.
- Verify that the object you loaded from the file is the same as the one you saved to the file.

</div>


<div class="alert alert-heading alert-danger" style="background-color: white; border: 2px solid; border-radius: 5px; color: #000; border-color:#AAA; padding: 10px">
    <b>💎 Tip</b>

Check out the GitHub REST API documentation to see how to acquire information from GitHub using the API.
For example, you can use the URL `https://api.github.com/repos/<username>/<repository-name>` to retrieve information about a single repository, but this is not the query we want. We want at least 100 repositories of an organisation.
You can check the API's documentation at https://docs.github.com/en/rest?apiVersion=2022-11-28 or look for answers on Stack Overflow.
</div>

In [1]:
import requests
import json

# Store the URL of the GitHub API endpoint under the variable query
query = "https://api.github.com/orgs/UCL-ELEC0136/repos"

# send GET request to the GitHub API endpoint, asking for 100 repositories
# (the API returns 30 items per page by default; 100 is the maximum)
response = requests.get(query, params={"per_page": 100})

# check if the response was successful and raise an exception if not
response.raise_for_status()

# print the response status code and the raw response body
print("Response status code:", response.status_code)
print("Response content:", response.content)

# parse the JSON response into a list of dictionaries (one per repository)
repositories = response.json()

# print the number of repositories
print("Number of repositories:", len(repositories))
# print the name of each repository
print("Repository names:", [repo["name"] for repo in repositories])

# save the parsed response to a file
with open("repositories.json", "w") as f:
    json.dump(repositories, f)

# read the saved data back from the file
with open("repositories.json", "r") as f:
    repositories_loaded = json.load(f)

# use assert to check that the loaded data matches what we saved
assert repositories == repositories_loaded
print(repositories_loaded)

Response status code: 200
Response content: b'[{"id":413766604,"node_id":"R_kgDOGKmTzA","name":"github-starter-course","full_name":"UCL-ELEC0136/github-starter-course","private":false,"owner":{"login":"UCL-ELEC0136","id":91945376,"node_id":"O_kgDOBXr5oA","avatar_url":"https://avatars.githubusercontent.com/u/91945376?v=4","gravatar_id":"","url":"https://api.github.com/users/UCL-ELEC0136","html_url":"https://github.com/UCL-ELEC0136","followers_url":"https://api.github.com/users/UCL-ELEC0136/followers","following_url":"https://api.github.com/users/UCL-ELEC0136/following{/other_user}","gists_url":"https://api.github.com/users/UCL-ELEC0136/gists{/gist_id}","starred_url":"https://api.github.com/users/UCL-ELEC0136/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/UCL-ELEC0136/subscriptions","organizations_url":"https://api.github.com/users/UCL-ELEC0136/orgs","repos_url":"https://api.github.com/users/UCL-ELEC0136/repos","events_url":"https://api.github.com/users/UCL-ELE
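
The save, load, and verify steps can also be tried offline. The sketch below round-trips a cut-down record through a temporary JSON file; the record reuses a few fields visible in the response above, while the helper name `round_trip` and the file name `sample.json` are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# A cut-down record using fields visible in the response above;
# a real record contains many more keys.
sample = [
    {
        "id": 413766604,
        "name": "github-starter-course",
        "full_name": "UCL-ELEC0136/github-starter-course",
        "private": False,
    }
]


def round_trip(obj, path):
    """Save `obj` as JSON to `path`, load it back, and return the loaded copy."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        json.dump(obj, f, indent=2)
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


# Use a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    loaded = round_trip(sample, Path(tmp) / "sample.json")

# Lists, dicts, strings, numbers and booleans survive the round trip unchanged.
assert loaded == sample
print([repo["name"] for repo in loaded])  # ['github-starter-course']
```

The equality check passes here because the data uses only types that JSON can represent directly. A tuple, for instance, is written as a JSON array and comes back as a `list`, so the same comparison would fail for data containing tuples.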

## 4.1 Pagination

<div class="alert alert-block alert-warning">
<b>👩‍💻👨‍💻 Optional action</b>

- What if we want 200 repositories?

</div>


<div class="alert alert-heading alert-danger" style="background-color: white; border: 2px solid; border-radius: 5px; color: #000; border-color:#AAA; padding: 10px">
    <b>💎 Tip</b>

The method is in the title of this section
</div>

In [5]:
# Store the URL of the GitHub API endpoint under the variable query
query = "https://api.github.com/orgs/UCL-ELEC0136/repos"


# accumulate the repositories from all requested pages in this list
repos = []

# iterate over the first two pages of the response (2 pages x 100 = 200 repositories)
for page in [1, 2]:
    # send GET request to the GitHub API endpoint for the current page
    response = requests.get(query, params={"page": page, "per_page": 100})
    # check if the response was successful and raise an exception if not
    response.raise_for_status()
    # parse the JSON response into a list of dictionaries
    data = response.json()
    print(data)
    # add the repositories in the current page to the list of repositories
    repos.extend(data)


# print the number of repositories
print("Number of repositories:", len(repos))

# Make sure that the number of repositories is 200
# (this holds only if the organisation has at least 200 repositories)
# assert len(repos) == 200

[{'id': 413766604, 'node_id': 'R_kgDOGKmTzA', 'name': 'github-starter-course', 'full_name': 'UCL-ELEC0136/github-starter-course', 'private': False, 'owner': {'login': 'UCL-ELEC0136', 'id': 91945376, 'node_id': 'O_kgDOBXr5oA', 'avatar_url': 'https://avatars.githubusercontent.com/u/91945376?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/UCL-ELEC0136', 'html_url': 'https://github.com/UCL-ELEC0136', 'followers_url': 'https://api.github.com/users/UCL-ELEC0136/followers', 'following_url': 'https://api.github.com/users/UCL-ELEC0136/following{/other_user}', 'gists_url': 'https://api.github.com/users/UCL-ELEC0136/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/UCL-ELEC0136/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/UCL-ELEC0136/subscriptions', 'organizations_url': 'https://api.github.com/users/UCL-ELEC0136/orgs', 'repos_url': 'https://api.github.com/users/UCL-ELEC0136/repos', 'events_url': 'https://api.github.com/users/UCL-ELEC0136/
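
The page loop above can be generalised so that it stops as soon as enough items are collected or the server runs out of pages. The sketch below is one way to do this, not the only one: `collect_items` and the `fetch_page(page, per_page)` callable are hypothetical names, and the demo replaces the network call with an in-memory list of 250 made-up items so that it runs offline.

```python
def collect_items(fetch_page, wanted, per_page=100):
    """Collect up to `wanted` items by requesting successive pages.

    `fetch_page(page, per_page)` is a hypothetical callable that returns the
    list of items on the given 1-indexed page (an empty list past the end).
    """
    items = []
    page = 1
    while len(items) < wanted:
        batch = fetch_page(page, per_page)
        if not batch:
            # An empty page means there is nothing left to fetch.
            break
        items.extend(batch)
        page += 1
    # The last page may overshoot, so trim to the requested size.
    return items[:wanted]


# Offline stand-in for the API: 250 made-up items served page by page.
fake_items = [{"name": f"repo-{i}"} for i in range(250)]


def fake_fetch_page(page, per_page):
    start = (page - 1) * per_page
    return fake_items[start:start + per_page]


print(len(collect_items(fake_fetch_page, 200)))  # 200
print(len(collect_items(fake_fetch_page, 300)))  # 250
```

To use it against GitHub, `fetch_page` would wrap the same call as in the cell above: `requests.get(query, params={"page": page, "per_page": per_page})`, followed by `raise_for_status()` and `json()`.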

## 5. Pushing your code to GitHub

We are now ready to push the code that acquires data from GitHub to our repository. The repository is also hosted on GitHub, but this is just a coincidence: we could have acquired data from any other API, such as Twitter's or Facebook's.

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Go into your terminal
- Check the git status of your repository using the command `git status`
- Verify that the files you want to commit are listed, and that there are no pending changes to pull from the remote
- Add the files to the staging area using the command `git add <file-name>`
- Commit the files using the command `git commit -m "<commit-message>"`
- Use a meaningful commit message, e.g., `Acquire data from GitHub`
- Push the files to GitHub using the command `git push`
- Verify that the files have been pushed to GitHub by refreshing the page of your repository on GitHub. 
</div>

In [7]:
# Check the status of the repository
!git status
# Pull the latest changes from the remote repository
!git pull
# Add the new files to the staging area
!git add repositories.json 1-data-acquisition-solutions.ipynb
# Commit the changes with the following message
!git commit -m "Add a method to acquire data from GitHub API, and also the acquired data"
# Push the changes to the remote repository
!git push
# Check the status of the repository to make sure that the changes have been pushed
!git status

On branch main
Your branch is up to date with 'origin/main'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	1-data-acquisition-solutions.ipynb
	repositories.json

nothing added to commit but untracked files present (use "git add" to track)
Already up to date.
[main 3f90610] Add a method to acquire data from GitHub API, and also the acquired data
 2 files changed, 404 insertions(+)
 create mode 100644 1-data-acquisition-solutions.ipynb
 create mode 100644 repositories.json
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 8 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 32.35 KiB | 5.39 MiB/s, done.
Total 4 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.com/UCL-ELEC0136/1-data-acquisition-uceewl4.git
   e8c2509..3f90610  main -> main
On branch main
Your branch is up to date 

## 6. Submitting your assignment

To submit this assignment and **every other future assignment**, including the **final assignment**, you have to:
- Commit and push your code to GitHub
- Go to **your** repository of the assignment. This must be on our course organisation `UCL-ELEC0136` and usually has the pattern `https://github.com/UCL-ELEC0136/<assignment-name>-<your-github-username>`.
- Go to the `Pull requests` tab and click on the `Feedback` pull request.
- Click on `Files changed` and verify that the files you have changed are listed.
- Merge the pull request by clicking on `Merge pull request` and then `Confirm merge`.

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

Submit your assignment by following the steps above.
</div>

In [8]:
# Write your code here, if applicable