<div width=50% style="display: block; margin: auto">
    <img src="figures/ucl-logo.svg" width=100%>
</div>

### [UCL-ELEC0136 Data Acquisition and Processing Systems 2024]()
University College London
# Lab 2: Data Acquisition


<hr width=70% style="float: left">

### Objectives

The data acquisition lab aims to show how we can find and acquire various data sources (e.g., web scraping, social media, databases, sensors).   
In this lab, you will learn:
- How to use Python to acquire data from various sources
- How to store data in JSON format
- How to commit and push your code to GitHub


### Outline

In this class we will cover the following topics:

1. [GitHub Classrooms explained](#1.-GitHub-Classrooms-explained)
2. [Activating the daps virtual environment](#3.-Activating-the-daps-virtual-environment)
3. [Acquiring data using the `requests` module and the GitHub RESTful APIs](#4.-Acquiring-data-using-the-requests-module-and-the-GitHub-RESTful-APIs)
4. [Pushing your code to GitHub](#5.-Pushing-your-code-to-GitHub)
5. [Submitting your assignment](#6.-Submitting-your-assignment)

<hr width=70% style="float: left">

## 1. GitHub Classrooms explained
### 1.1 Accepting an assignment

- For every week of the course, you will be given an assignment to complete.

- You can find the **links** to your assignments on Moodle.

- When you click on the link, you will be redirected to GitHub Classroom, where you will be asked to accept the assignment.

- GitHub Classroom will create a **public repository** for you, which contains the assignment and is named with the pattern `<assignment-name>-<your-github-username>`. Once the assignment is accepted, the page will look like this:



<div width=50% style="display: block; margin-left: auto">
    <img src="figures/accepted-assignment.png" style="display: block; margin: auto" width=50%>
</div>


### 1.2 Cloning a repository


<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Open a terminal
- Navigate to the directory where you want to store your assignment
- Clone the repository using the command `git clone <repository-url>`
- Navigate to the cloned repository using the command `cd <repository-name>`

</div>
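
The clone steps above can be sketched in Python by composing the repository URL from the naming pattern. This is only an illustration: the assignment name and username are hypothetical placeholders, the sketch assumes the repository lives under the `UCL-ELEC0136` organisation, and it prints the commands rather than running them.

```python
# Hypothetical values: replace them with your own assignment name and username.
ORGANISATION = "UCL-ELEC0136"
assignment_name = "lab-2-data-acquisition"  # hypothetical assignment name
github_username = "octocat"  # hypothetical GitHub username


def clone_commands(org, assignment, username):
    """Return the shell commands that clone the repository and enter it."""
    repository_name = f"{assignment}-{username}"
    repository_url = f"https://github.com/{org}/{repository_name}.git"
    return [f"git clone {repository_url}", f"cd {repository_name}"]


# Print the commands so they can be copied into a terminal.
for command in clone_commands(ORGANISATION, assignment_name, github_username):
    print(command)
```

Copy the printed commands into your terminal, or copy the exact URL from the green `Code` button on the repository page.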

As in lab `0-DAPS-SETUP`, your file structure should look like this:

```plaintext
├── root
│   ├── env
│   │   ├── environment.yml
│   │   ├── requirements.txt
│   ├── 0-DAPS-SETUP
│   │   ├── 0-SETUP.ipynb
│   │   ├── figures
│   │   ├── README.md
│   │   ├── .gitignore
│   ├── 1-GIT-AND-GITHUB
            .
            .
            .
```

## 3. Activating the daps virtual environment

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Make sure you have completed part 0.0 of the Notebook of lab `0-DAPS-SETUP`, which shows you how to create a virtual environment called `daps`.
- Make sure to run this Notebook on the Python kernel of the `daps` virtual environment.

</div>

<div width=50% style="display: block; margin: auto">
    <img src="./figures/kernel.png" width=75%>
</div>
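
One quick way to confirm which interpreter the Notebook is using is to print `sys.executable` from a cell. The helper `looks_like_env` below is a hypothetical convenience, not part of the lab material, and it only checks whether the environment name appears in the interpreter path, which is a rough heuristic.

```python
import sys

# Path of the Python interpreter behind the current kernel. For the `daps`
# environment it normally contains the environment's name.
print(sys.executable)


def looks_like_env(interpreter_path, env_name="daps"):
    """Rough check: does the interpreter path mention the environment name?"""
    return env_name in interpreter_path


print(looks_like_env(sys.executable))
```

If the second line prints `False`, switch the kernel as shown in the screenshot above and run the cell again.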

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Fill the `requirements.txt` file with all the packages you will need for this assignment, then run the following cell. You can look up the `requirements.txt` file you created in lab 0-SETUP for reference.
</div>


In [1]:
!pip install -r requirements.txt

Defaulting to user installation because normal site-packages is not writeable
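
After the installation, it can be useful to check that the packages are visible from the running kernel. The sketch below uses the standard-library `importlib.metadata` module; the helper name `installed_version` and the package list are assumptions for illustration, so adapt the list to your own `requirements.txt`.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Assumed package list: this lab only needs `requests` beyond the standard library.
for package in ["requests"]:
    version = installed_version(package)
    print(package, version if version else "NOT INSTALLED")
```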


## 4. Acquiring data using the `requests` module and the GitHub RESTful APIs

The `requests` module is a Python module that allows you to send HTTP requests to a server and receive a response.
We will use it to acquire data from the web.

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Import the `requests` module using `import requests`.
- Use the `requests.get()` function to send a `GET` request to a server. The function takes as input the URL of the server, and returns a `Response` object.
- Acquire 100 repositories from the `https://github.com/orgs/UCL-ELEC0136` organisation
- Use the `Response` object to access the response of the server. For example, you can access the response status code using `Response.status_code`.
- Use the `Response.json()` method to convert the response content to a Python object (for this endpoint, a `list` of `dict` objects, one per repository).
- Use the `json` module to save that object to a file using `json.dump()`.
- Use the `json.load()` function to load the object back from the file.
- Verify that the object you loaded from the file is the same as the one you saved.

</div>
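
The save, load, and verify steps rely on JSON round-tripping the data unchanged. The sketch below shows this in memory with `json.dumps()` and `json.loads()` on a hypothetical sample. It also shows a caveat: objects that are not JSON-native, such as tuples or integer keys, do not survive the round trip unchanged. Data returned by `Response.json()` contains only JSON-native types, so the comparison is expected to succeed there.

```python
import json

# Hypothetical sample that resembles a small API response.
original = {"name": "example-repo", "topics": ["python", "data"], "stars": 3}

# Serialise to a JSON string and parse it back.
restored = json.loads(json.dumps(original))
print(restored == original)  # True

# Caveat: tuples become lists and integer keys become strings.
tricky = {1: ("a", "b")}
restored_tricky = json.loads(json.dumps(tricky))
print(restored_tricky)  # {'1': ['a', 'b']}
print(restored_tricky == tricky)  # False
```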


<div class="alert alert-heading alert-danger" style="background-color: white; border: 2px solid; border-radius: 5px; color: #000; border-color:#AAA; padding: 10px">
    <b>💎 Tip</b>

Check out the GitHub REST API documentation to see how to acquire information from GitHub using the API.
For example, you can use the URL `https://api.github.com/repos/<username>/<repository-name>` to retrieve information about a single repository, but this is not the query we want: we want at least 100 repositories of an organisation.
You can check the API's documentation here https://docs.github.com/en/rest?apiVersion=2022-11-28 or check Stack Overflow answers.
</div>
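
As a hedged illustration of how such a query URL is put together, the sketch below builds the URL for listing an organisation's repositories with the standard-library `urlencode` function. The helper name `org_repos_url` is hypothetical; `page` and `per_page` are the query parameters of the GitHub REST API, which caps `per_page` at 100.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.github.com"


def org_repos_url(org, page=1, per_page=100):
    """Build the URL that lists one page of an organisation's repositories."""
    query = urlencode({"page": page, "per_page": per_page})
    return f"{API_ROOT}/orgs/{org}/repos?{query}"


print(org_repos_url("UCL-ELEC0136"))
# https://api.github.com/orgs/UCL-ELEC0136/repos?page=1&per_page=100
```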

In [2]:
import requests
import json

# Function to get repository information
def get_repositories():
    url = "https://api.github.com/orgs/UCL-ELEC0136/repos?page=1&per_page=100"
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        print("Successfully fetched the data!")
        data = response.json()  # Convert the response content to JSON
        return data
    else:
        print(f"Failed to fetch data, status code: {response.status_code}")
        return None

# Function to save data to a file
def save_to_file(data, filename):
    with open(filename, 'w') as file:
        json.dump(data, file, indent=4)
        print(f"Data successfully saved to {filename}")

# Function to load data from a file
def load_from_file(filename):
    with open(filename, 'r') as file:
        data = json.load(file)
        print(f"Data successfully loaded from {filename}")
        return data

# Function to verify that two objects (here, lists of dicts) are identical
def verify_data(original_data, loaded_data):
    if original_data == loaded_data:
        print("Verification successful: The data matches!")
    else:
        print("Verification failed: The data does not match.")

In [3]:
# Get repository information
repos_data = get_repositories()

if repos_data:
    # Save the data to a file
    filename = 'ucl_repos.json'
    save_to_file(repos_data, filename)

    # Load the data from the file
    loaded_data = load_from_file(filename)

    # Verify that the saved data and the loaded data are the same
    verify_data(repos_data, loaded_data)

Successfully fetched the data!
Data successfully saved to ucl_repos.json
Data successfully loaded from ucl_repos.json
Verification successful: The data matches!
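
Once the data is loaded, each entry of the list is a `dict` describing one repository. The sketch below pulls out a few fields; the sample entries are hypothetical stand-ins with the same shape as the real data (`name`, `html_url`, and `private` are fields of GitHub repository objects), so that the cell runs without a network connection.

```python
# Hypothetical sample with the same shape as two entries of the saved list.
sample_repos = [
    {"name": "lab-1", "html_url": "https://github.com/UCL-ELEC0136/lab-1", "private": False},
    {"name": "lab-2", "html_url": "https://github.com/UCL-ELEC0136/lab-2", "private": False},
]


def summarise(repos):
    """Return (name, URL) pairs for a list of repository dicts."""
    return [(repo["name"], repo["html_url"]) for repo in repos]


for name, url in summarise(sample_repos):
    print(f"{name}: {url}")
print(f"{len(sample_repos)} repositories")
```

To run it on the real data, pass `loaded_data` to `summarise` instead of `sample_repos`.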


### 4.1 Pagination

<div class="alert alert-block alert-warning">
<b>👩‍💻👨‍💻 Optional action</b>

- What if we want 200 repositories?

</div>


<div class="alert alert-heading alert-danger" style="background-color: white; border: 2px solid; border-radius: 5px; color: #000; border-color:#AAA; padding: 10px">
    <b>💎 Tip</b>

The method is in the title of this section.
</div>

In [4]:
import requests
import json

# Function to get repositories with pagination support
def get_repositories():
    all_repos = []
    for page in range(1, 3):  # Fetch pages 1 and 2, with at most 100 repos per page
        url = f"https://api.github.com/orgs/UCL-ELEC0136/repos?page={page}&per_page=100"
        response = requests.get(url)

        # Check if the request was successful
        if response.status_code == 200:
            print(f"Successfully fetched page {page} data!")
            data = response.json()  # Convert response content to JSON
            all_repos.extend(data)   # Add the repos from this page to the total list
        else:
            print(f"Failed to fetch data for page {page}, status code: {response.status_code}")
            return None

    return all_repos

# Function to save data to a file
def save_to_file(data, filename):
    with open(filename, 'w') as file:
        json.dump(data, file, indent=4)
        print(f"Data successfully saved to {filename}")

# Function to load data from a file
def load_from_file(filename):
    with open(filename, 'r') as file:
        data = json.load(file)
        print(f"Data successfully loaded from {filename}")
        return data

# Function to verify if the original data matches the loaded data
def verify_data(original_data, loaded_data):
    if original_data == loaded_data:
        print("Verification successful: The data matches!")
    else:
        print("Verification failed: The data does not match.")

In [5]:
# Get repositories data
repos_data = get_repositories()
    
if repos_data:
    # Save data to a file
    filename = 'ucl_repos_200.json'
    save_to_file(repos_data, filename)

    # Load data from the file
    loaded_data = load_from_file(filename)

    # Verify that the saved data and loaded data are the same
    verify_data(repos_data, loaded_data)

Successfully fetched page 1 data!
Successfully fetched page 2 data!
Data successfully saved to ucl_repos_200.json
Data successfully loaded from ucl_repos_200.json
Verification successful: The data matches!
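
The cell above fetches a fixed number of pages. A more general variant, sketched below under stated assumptions, keeps requesting pages until one comes back with fewer items than `per_page`. The names `fetch_all` and `fake_fetch_page` are hypothetical, and the fake fetcher stands in for the network call so that the logic can be run offline.

```python
def fetch_all(fetch_page, per_page=100, max_pages=10):
    """Collect items page by page until a page comes back short or empty."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, per_page)
        items.extend(batch)
        if len(batch) < per_page:
            break  # a short page means there is nothing left to fetch
    return items


# Stand-in for the API: 250 fake items served in pages.
fake_items = list(range(250))


def fake_fetch_page(page, per_page):
    start = (page - 1) * per_page
    return fake_items[start:start + per_page]


print(len(fetch_all(fake_fetch_page)))  # 250
```

With the real API, `fetch_page` could wrap a call such as `requests.get(url, params={"page": page, "per_page": per_page}).json()`, ideally after checking the status code as in the cells above.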


## 5. Pushing your code to GitHub

We are now ready to push the code that acquires data from GitHub to our repository. The repository is also hosted on GitHub, but this is just a coincidence: we could have used any other API, such as Twitter's or Facebook's.

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Go into your terminal
- Check the git status of your repository using the command `git status`
- Verify that the files you want to commit are listed, and that there are no pending changes to pull from the remote
- Add the files to the staging area using the command `git add <file-name>`
- Commit the files using the command `git commit -m "<commit-message>"`
- Use a meaningful commit message, e.g., `Acquire data from GitHub`
- Push the files to GitHub using the command `git push`
- Verify that the files have been pushed to GitHub by refreshing the page of your repository on GitHub. 
</div>
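
The steps above are normally typed into a terminal. As a rough sketch, they can also be scripted from Python with the standard-library `subprocess` module. The helper names and the Notebook file name below are hypothetical, and the sketch defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def git_commands(files, message):
    """Return the git commands for the status, add, commit, and push steps."""
    return [
        ["git", "status"],
        ["git", "add", *files],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]


def run_all(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for command in commands:
        if dry_run:
            print(shlex.join(command))
        else:
            subprocess.run(command, check=True)  # stop at the first failure


# Hypothetical file name: use the files that `git status` lists for you.
run_all(git_commands(["2-DATA-ACQUISITION.ipynb"], "Acquire data from GitHub"))
```

Call `run_all(..., dry_run=False)` only from inside your cloned repository, and only once the printed commands look right.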

In [6]:
# write your code here, if applicable

## 6. Submitting your assignment

To submit this assignment and **every other future assignment**, including the **final assignment**, you have to:
- Commit and push your code to GitHub
- Go to **your** repository of the assignment. This must be on our course organisation `UCL-ELEC0136` and usually has the pattern `https://github.com/UCL-ELEC0136/<assignment-name>-<your-github-username>`.
- Go to the `Pull requests` tab and click on the `Feedback` pull request.
- Click on `Files changed` and verify that the files you have changed are listed.
- Merge the pull request by clicking on `Merge pull request` and then `Confirm merge`.

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

Submit your assignment by following the steps above.
</div>

In [7]:
# Write your code here, if applicable