# Denison DA210/CS181 SW Lab #14 - Step 1

Before you get your checkpoints signed off, make sure everything runs as expected: **restart the kernel**, then **run all cells**.

Make sure you fill in any place that says `# YOUR CODE HERE` or "YOUR ANSWER HERE".

---

In [None]:
import os
import os.path
import sys
import importlib
import io
import pandas as pd
from lxml import etree

module_dir = "../../modules"
module_path = os.path.abspath(module_dir)
if module_path not in sys.path:
    sys.path.append(module_path)

import util
importlib.reload(util)

import requests

---

## Part A: High-Level Planning

We can use the GitHub API to retrieve information about organizations, repositories, and users.  Even without authenticating, we can find the commits and users that have made changes to specific files in a repository.

With this goal in mind, we will divide our work into two phases:
* Build a table of commits to a specific file.
* Build a table of users who have modified that file.  

Both phases can be further divided into several steps:
1. Understand the API endpoint.  
2. Design a function to issue our requests.  
3. Design the table (data frame).  
4. [If necessary] Handle multiple pages.

To accomplish our overall goals, we'll make use of two GitHub API endpoints:
* [/repos/{owner}/{repo}/commits](https://docs.github.com/en/rest/commits/commits#list-commits)
* [/users/{username}](https://docs.github.com/en/rest/users/users#get-a-user)
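
The `{owner}`, `{repo}`, and `{username}` parts of these paths are placeholders that get filled in for each request. As a minimal sketch (the helper name `fill_path` is made up for illustration and is not part of the course `util` module or the GitHub API), Python's `str.format` can perform the substitution:

```python
# Path templates copied from the two endpoints listed above.
COMMITS_TEMPLATE = "/repos/{owner}/{repo}/commits"
USER_TEMPLATE = "/users/{username}"


def fill_path(template, **values):
    """Hypothetical helper: substitute values for the {placeholders} in a path."""
    return template.format(**values)


commits_path = fill_path(COMMITS_TEMPLATE, owner="pandas-dev", repo="pandas")
user_path = fill_path(USER_TEMPLATE, username="cpcloud")
print(commits_path)  # /repos/pandas-dev/pandas/commits
print(user_path)     # /users/cpcloud
```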

---

## Part B: Building a table of commits to a specific file

We'll try to gather information about the `pandas` repository on GitHub, specifically looking at changes to the `groupby.py` file.  Here is its [file docstring](https://github.com/pandas-dev/pandas/blob/main/pandas/core/groupby/groupby.py):

```
    """
    Provide the groupby split-apply-combine paradigm. Define the GroupBy
    class providing the base-class of operations.

    The SeriesGroupBy and DataFrameGroupBy sub-class
    (defined in pandas.core.groupby.generic)
    expose these user-facing objects to provide specific functionality.
    """
```

#### Step 1: Understand the `list-commits` API endpoint

We can use the list-commits endpoint to query information about this file.  First, let's explore the results we get from this endpoint.

We can view the commits for this file on GitHub: https://github.com/pandas-dev/pandas/commits/main/pandas/core/groupby/groupby.py.

Also, as this is a GET request, we can view the general version (without query parameters) in a web browser to get a feel for the results: https://api.github.com/repos/pandas-dev/pandas/commits.  (Firefox in particular has a very nice view of the headers, raw JSON data, and parsed JSON result.)

The endpoint documentation tells us we'll need to specify the filepath (relative within the repo) as a _query parameter_.  Let's try out a request:

In [None]:
# Build the URL
host = "api.github.com"
resource_path = "/repos/pandas-dev/pandas/commits"
url = util.buildURL(resource_path, host, protocol="https")

# Make the request
query_params = {"path": "pandas/core/groupby/groupby.py"}
try:
    response = requests.get(url, params=query_params)
    assert response.status_code == 200
except AssertionError:
    print(f"Failed: {resource_path} with status code {response.status_code}")

# Display the parsed JSON object
data = response.json()
util.print_json(data, level=2)
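
For intuition about what a _query parameter_ is, the sketch below uses only the standard library to show how a dictionary of parameters is typically encoded and appended to a URL after a `?`. It illustrates the encoding only and does not necessarily match the exact steps `requests` takes internally; the variable names are chosen for this example.

```python
from urllib.parse import urlencode

base_url = "https://api.github.com/repos/pandas-dev/pandas/commits"
example_params = {"path": "pandas/core/groupby/groupby.py"}

# urlencode percent-encodes reserved characters, so each "/" becomes "%2F"
query_string = urlencode(example_params)
full_url = base_url + "?" + query_string
print(full_url)
```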

#### Step 2: Design a function to issue a request

Next, we can write a function to enable more programmatic access to this endpoint, and provide some abstraction between the endpoint parameters and how the request is made.

**Q1:** Write a function `getRepositoryCommitsSimple(owner, repo, path)` to access the [list-commits](https://docs.github.com/en/rest/commits/commits#list-commits) endpoint (the one from the example above) for a given path within a repository.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Example using list-commits endpoint
owner = "pandas-dev"
repo = "pandas"
query_path = "pandas/core/groupby/groupby.py"
data = getRepositoryCommitsSimple(owner, repo, query_path)

util.print_json(data, level=2)

The [GitHub documentation](https://docs.github.com/en/rest/commits/commits#list-commits) shows that this result should be a JSON array (corresponding to a Python list) of JSON objects (Python dictionaries).  There are at most 30 (by default) objects per "page", and each object should represent a single commit.
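
To make the JSON-to-Python correspondence concrete, here is a small sketch that parses a hand-written JSON string with the same overall shape (an array of objects containing nested objects). The values are made up for illustration and are not real API output.

```python
import json

# Hypothetical, heavily trimmed stand-in for a list-commits response body
sample_text = """
[
  {"sha": "abc123", "commit": {"message": "Fix typo", "author": {"date": "2024-01-02T03:04:05Z"}}},
  {"sha": "def456", "commit": {"message": "Add test", "author": {"date": "2024-01-01T00:00:00Z"}}}
]
"""

sample = json.loads(sample_text)
print(type(sample).__name__)           # list
print(type(sample[0]).__name__)        # dict
print(sample[0]["commit"]["message"])  # Fix typo
```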

In [None]:
# Check how many commits' info we got
len(data)

In [None]:
# Look at the most recent commit's info
commit_obj = data[0]
util.print_json(commit_obj, level=2)

In [None]:
# Look at the most recent commit's message
commit_obj["commit"]["message"]

In [None]:
# Look at the most recent commit's timestamp
commit_obj["commit"]["author"]["date"]

#### Step 3: Design the commit table

We'll collect the following information about each commit:
- commit ID
- message
- author username
- commit timestamp

We can write a function that produces a list of row dictionaries (LoD) representation from the JSON-parsed data structure of a request.  This is provided for you, below.  Take a look through it and make sure you understand where the pieces are coming from.

In [None]:
def commitResult2LoD(result, maxelements=None):
    """
    Build an LoD from a JSON result from the GitHub list-commits API endpoint.
    """
    assert isinstance(result, list)

    LoD = []
    count = 0
    for commit_obj in result:
        if maxelements is not None and count >= maxelements:
            break

        D = {}
        D["id"] = commit_obj["sha"]
        D["message"] = commit_obj["commit"]["message"]
        D["author"] = commit_obj["author"]["login"]
        D["timestamp"] = commit_obj["commit"]["author"]["date"]
        LoD.append(D)

        count += 1

    return LoD

In [None]:
# Try parsing the commit results from our previous request
LoD = commitResult2LoD(data)
for row in LoD[:3]:
    util.print_data(row)

#### Step 4: Handle multiple pages

API service providers often throttle results to avoid sending too much data at once.  The results are typically divided into _chunks_, or _pages_, and the request must specify the desired page and/or the desired number of results per page.  Then, it is up to the client to navigate this, and issue additional requests if necessary, until the desired amount of data is acquired.

If we want more than 30 results (the default page size for the list-commits endpoint), or a later page of results, we can add query parameters to the request, as outlined in the GitHub API documentation.
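
As an offline illustration of the paging idea (no requests are made, and the function name `get_page` is hypothetical), the sketch below slices a local list the way a server might divide results into pages, assuming pages are numbered starting at 1:

```python
import math


def get_page(items, num_per_page, page):
    """Return the slice of items that falls on the given 1-indexed page."""
    start = (page - 1) * num_per_page
    return items[start:start + num_per_page]


results = list(range(1, 26))  # stand-in for 25 results held by a server
page_size = 10

pages_needed = math.ceil(len(results) / page_size)
print(pages_needed)                          # 3
print(get_page(results, page_size, 1))       # first 10 results
print(len(get_page(results, page_size, 3)))  # 5 (a short final page)
print(get_page(results, page_size, 4))       # [] (past the end)
```

A page shorter than the requested size, or an empty page, indicates that there are no further results to fetch.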

> You've reached the first checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 1: Take a look at the [list-commits endpoint documentation](https://docs.github.com/en/rest/commits/commits#list-commits).  How can you specify the number of results per page, and the page to retrieve?

**Q2:** Write a new function `getRepositoryCommitsByPage(owner, repo, path, num_per_page=10, page=1)` that requests one _page_ of `num_per_page` results from the list-commits endpoint.  The `page` parameter lets us request different pages of results programmatically.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Example using list-commits endpoint
owner = "pandas-dev"
repo = "pandas"
query_path = "pandas/core/groupby/groupby.py"
data = getRepositoryCommitsByPage(owner, repo, query_path, page=2)

util.print_json(data, level=2)

We will use this new function to make several requests.  We do so in `getCommits(owner, repo, query_path, num_commits, num_per_page)`, below:

In [None]:
def getCommits(owner, repo, query_path, num_commits=15, num_per_page=10):
    """
    Retrieve up to num_commits commits for a given filepath in a GitHub repo,
    in pages of num_per_page commits at a time.
    """
    fullLoD = []

    page = 1
    commits_left = num_commits
    more_pages = True

    while more_pages and commits_left > 0:
        commit_page = getRepositoryCommitsByPage(owner, repo, query_path, num_per_page, page)

        if len(commit_page) < num_per_page:
            more_pages = False

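        # Cap this page at the number of commits still needed, so the total never exceeds num_commits
        pageLoD = commitResult2LoD(commit_page, maxelements=commits_left)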
        fullLoD.extend(pageLoD)

        commits_left -= len(pageLoD)
        page += 1

    df = pd.DataFrame(fullLoD)
    return df

In [None]:
# Build a table of commits
num_commits = 12
num_per_page = 8
commits_df = getCommits(owner, repo, query_path, num_commits, num_per_page)

print("Number of commits in DataFrame:", len(commits_df))
commits_df.iloc[:5, :]

> You've reached the second checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 2: Look at the code for `getCommits`.  Why do we use `fullLoD.extend()` instead of `fullLoD.append()`?

---

## Part C: Building a table of users who modified the file

Given the set of author usernames from the previous part, we can build a table of user information for those users.  This will involve multiple requests, one per user, to obtain information about each user.  From this, we can build a table and remove any duplicates.

#### Step 1: Understand the `users` API endpoint

First, we'll need to understand the `users` API endpoint.  Here is the documentation: https://docs.github.com/en/rest/users/users#get-a-user.



**Q3:** Write a function `getUser(username)` to make a request to the `users` endpoint for a given GitHub username.  Your function should return the parsed `JSON` object of the response.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Look up one of the people in the pandas-dev org
# (https://github.com/orgs/pandas-dev/people)
user1 = getUser("cpcloud")
util.print_json(user1, level=1, maxchildren=30)

#### Step 2: Understand the results

Let's look at what the documentation says about the results of requests to this endpoint:
- The root of the returned value is a JSON object.
- The root has many children, including: `"login"`, `"type"`, `"company"`, `"name"`, and `"email"`.
- The value of each of these children is a string, or `null` (Python `None`) if the user has not provided it.
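
A user may leave some of these fields unset, in which case the API returns JSON `null` for them. A minimal sketch with a made-up user object (not real API output) shows how that appears in Python and one way to handle it:

```python
import json

# Hypothetical, trimmed user object; real responses contain many more fields
user_text = '{"login": "example-user", "name": "Example User", "company": null, "email": null}'
user = json.loads(user_text)

print(user["name"])           # Example User
print(user["email"] is None)  # True

# One way to substitute a placeholder when a value is absent or null
company = user.get("company") or "(not provided)"
print(company)                # (not provided)
```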

In [None]:
# Get the user's name, if provided
user1["name"]

In [None]:
# Get the user's email, if provided
print(user1["email"])

#### Step 3: Design the users table

To build a tabular representation of GitHub users, we need to decide on the fields.  For simplicity, we'll specify just four fields:

| Field     | Python type | Notes |
| --------- | ----------- | ----- |
| `username`   | `str`       | The GitHub username of the user |
| `name`    | `str`       | The name of the user |
| `location`   | `str`       | The location of the user |
| `company` | `str`       | The company of the user |

**Q4:** Write a function `getUserRow(user)` to return a dictionary containing these values for a given user object returned from the `users` endpoint.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Try this out
getUserRow(user1)

In [None]:
# Test that it's working
user1_row = getUserRow(user1)

assert user1_row["username"] == "cpcloud"
assert user1_row["name"] == "Phillip Cloud"

**Q5:** Using these functions, we can build the full list of users given a list of usernames.  Write a function `getUsers(usernames)` that takes a list of usernames (as strings) and, for each username, queries the `users` GitHub API endpoint, then uses `getUserRow` to build a dictionary for that user with just the fields we care about.  Finally, your function should build a `pandas` `DataFrame` from the resulting data.

Note: Make sure to remove duplicates, ideally before making API requests, or at least from your resulting `DataFrame` (using the `DataFrame` method `drop_duplicates`).
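
One possible way to de-duplicate a list of usernames before making any requests, while preserving first-seen order, relies on dictionary keys being unique and insertion-ordered (Python 3.7+). This is a sketch with made-up usernames, not the required approach:

```python
usernames_with_repeats = ["alice", "bob", "alice", "carol", "bob"]  # made-up sample

# dict.fromkeys keeps the first occurrence of each key, in order
unique_usernames = list(dict.fromkeys(usernames_with_repeats))
print(unique_usernames)  # ['alice', 'bob', 'carol']
```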

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Given the function `getUsers`, we can now use the usernames of those who authored commits to this file in the `pandas` repository to build a `DataFrame` of those users' information:

In [None]:
# Build the users DataFrame
usernames = list(commits_df["author"])
users_df = getUsers(usernames)

print("Number of users in DataFrame:", len(users_df))
users_df.head()

> You've reached the third checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 3: Why is it important to remove duplicates from the Users table?  Reference both performance and tidy data assumptions.

---

## Summary

We created two DataFrames, one for commit info for a given file, and another for the users involved in those commits.  The two DataFrames are shown again below.

In [None]:
commits_df.head()

In [None]:
users_df.head()

> You've reached the fourth (and final) checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 4: If you run this notebook too often, you start getting a `403` status code for all requests.  Why?

---

## Part D

How much time (in minutes/hours) did you spend on this lab outside of class?

YOUR ANSWER HERE