# Learning about Jupyter in Jupyter Notebooks 🎉

As always, it is good to start with some imports.
For this demo, we will be using `ghapi`, `pandas`, `numpy`, and `matplotlib` (oh my!).

In [None]:
from ghapi.all import GhApi, paged, GhDeviceAuth
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

First, we need to create an instance of the API class.

In [None]:
# If you need to set up an API key for higher rate limits, uncomment the following lines and run them once
# ghauth = GhDeviceAuth()
# print(ghauth.url_docs())
# ghauth.open_browser()

In [None]:
api = GhApi()

Let's call the GitHub API and store the results as a list.
The API results are paged because some of the data you might ask for could be very long, so we use a nested list comprehension to "flip" through the pages and collect every repo into one flat list.

In [None]:
api_data = paged(api.repos.list_for_org, org="jupyter")  # type: ignore
jupyter_repos = [r for page in api_data for r in page]
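To see the page "flipping" without touching the network, here is a minimal sketch. `fake_pages` is a hypothetical stand-in for `paged(...)`, not part of `ghapi`; it only assumes that each page is an iterable of records.

```python
from typing import Iterator


def fake_pages(items: list, per_page: int) -> Iterator[list]:
    """Hypothetical stand-in for `paged(...)`: yields the items one page at a time."""
    for start in range(0, len(items), per_page):
        yield items[start:start + per_page]


repos = [{"name": f"repo_{i}"} for i in range(5)]
pages = fake_pages(repos, per_page=2)  # three pages: 2, 2, and 1 records

# The outer loop walks the pages, the inner loop walks the records in each page
flat = [r for page in pages for r in page]
print(len(flat))
```

The order of the `for` clauses matches the order you would write them as nested loops: pages first, then records.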

## Process and clean the data with `pandas`

Lists of dictionaries are OK, but let's turn this one into a `pd.DataFrame` so the data is easier to work with.
This is also where we can reduce the size of our data by picking out the interesting bits.

In [None]:
interesting_info = [
    "name",
    "html_url",
    "description",
    "homepage",
    "size",
    "stargazers_count",
    "watchers_count",
    "language",
    "forks_count",
    "open_issues_count",
    "license",
    "topics",
    "default_branch",
]

jupyter_df = pd.DataFrame(jupyter_repos, columns=interesting_info)
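As a small illustration of what the `columns=` argument does with a list of dictionaries, here is a sketch on made-up records (the values are invented for the example, not real API output). Keys that are not listed are dropped, and listed keys that are missing from the records come back as empty columns.

```python
import pandas as pd

# Made-up records standing in for the API results
records = [
    {"name": "repo_a", "stargazers_count": 10, "private": False},
    {"name": "repo_b", "stargazers_count": 5, "private": False},
]

# "homepage" is not present in the records, "private" is not requested
wanted = ["name", "stargazers_count", "homepage"]
toy_df = pd.DataFrame(records, columns=wanted)

print(toy_df.columns.tolist())          # only the requested columns, in the requested order
print(toy_df["homepage"].isna().all())  # the missing key becomes a column of NaN
```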

Now that the DataFrame is loaded, we can use the variable inspector in VS Code to look at the values, which is even better than printing them out here.

Let's also use some other tools to explore the data a bit.

In [None]:
# Show stats about numeric columns in the DataFrame
jupyter_df.describe()

In [None]:
jupyter_df.info()

In [None]:
# The license column has dtype object, but how can we extract the license info?

The license column holds dictionary values that would be nice to split out into separate columns, dropping the info we don't need. We can use the autoDocstring extension to help us document the function, and inlay hints can help show inferred missing type hints.

In [None]:
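def format_dict_column(data: pd.DataFrame, column: str, mapping: dict) -> pd.DataFrame:
    """Split a column of dicts into separate columns, keeping only the keys in `mapping`."""
    # Expand each dict into its own columns, then rename the keys we care about
    split_df = pd.json_normalize(data.loc[:, column]).rename(columns=mapping)  # type: ignore
    # Swap the original dict column for the renamed columns (joined on the index)
    return data.drop(column, axis=1).join(split_df.loc[:, list(mapping.values())])

jupyter_df_clean = format_dict_column(
    jupyter_df, "license", {"key": "license_key", "name": "license_name", "url": "license_url"}
)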
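To make the steps inside the function easier to follow, here is the same idea on a tiny made-up DataFrame (the repo names and URLs are invented for the example). It is a sketch that assumes every row holds a dictionary and that the DataFrame has the default integer index, since the join lines rows up by index.

```python
import pandas as pd

# Made-up data: one column holds a dict per row
toy = pd.DataFrame({
    "name": ["repo_a", "repo_b"],
    "license": [
        {"key": "mit", "name": "MIT License", "url": "https://example.com/mit"},
        {"key": "bsd-3-clause", "name": "BSD 3-Clause", "url": "https://example.com/bsd"},
    ],
})

# Step 1: expand each dict into its own columns ("key", "name", "url")
flat = pd.json_normalize(toy["license"].tolist())

# Step 2: rename the keys we want to keep so they don't clash with existing columns
flat = flat.rename(columns={"key": "license_key", "name": "license_name"})

# Step 3: drop the dict column and join the selected new columns back on the index
result = toy.drop(columns="license").join(flat[["license_key", "license_name"]])
print(result.columns.tolist())
```

Renaming before the join matters here: both the original frame and the expanded dict have a `name` column, and `join` would refuse to combine them without a suffix.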

In [None]:
import pandas_profiling  # note: newer releases of this package are published as `ydata-profiling` (import `ydata_profiling`)
# If you only want to look at numeric columns, you can use the following to filter the DataFrame
# df_numeric = jupyter_df_clean.select_dtypes(include=np.int64)
pandas_profiling.ProfileReport(jupyter_df_clean)  # can use minimal=True to reduce output


## Visualizations

In [None]:
language_totals = jupyter_df_clean.groupby("language").sum(numeric_only=True).sort_values("stargazers_count", ascending=False)

In [None]:
language_totals.plot.barh(figsize=(10, 5), fontsize=14)
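If the group-and-sum step feels opaque, this sketch runs the same chain on a few made-up rows (the numbers are invented for the example). Two things to notice: `numeric_only=True` leaves text columns such as `name` out of the totals, and rows with no language are dropped from the groups by default.

```python
import pandas as pd

# Made-up rows; the last repo has no language set
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "language": ["Python", "JavaScript", "Python", None],
    "stargazers_count": [10, 7, 5, 100],
    "forks_count": [1, 2, 3, 4],
})

totals = (
    toy.groupby("language")          # one group per language, missing languages excluded
    .sum(numeric_only=True)          # add up the numeric columns within each group
    .sort_values("stargazers_count", ascending=False)
)
print(totals)
```

Passing `dropna=False` to `groupby` would keep the repos without a language as their own group, if you want them in the chart.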

---
## Utility

Check your GitHub API limits:

In [None]:
api.rate_limit.get()  # type: ignore

If the API limit is reached, load the data from the included CSV file:

In [None]:
# If you need to export the DataFrame to a CSV file, you can use the following
# df.to_csv("jupyter_repos.csv")
pd.read_csv("jupyter_repos.csv")
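One thing to watch for when round-tripping through CSV: columns that hold dictionaries are written out as text and come back as plain strings. The sketch below shows this on made-up data using an in-memory buffer instead of a file. How the included `jupyter_repos.csv` was saved is not stated here, so treat this as an illustration rather than a description of that file.

```python
import ast
import io

import pandas as pd

# Made-up data with a dict-valued column
toy = pd.DataFrame({
    "name": ["repo_a"],
    "license": [{"key": "mit", "name": "MIT License"}],
})

# Write to an in-memory buffer; index=False avoids an extra unnamed index column
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
was_text = isinstance(loaded.loc[0, "license"], str)
print(was_text)  # the dict came back as a string

# Parse the text back into dicts; this assumes every row holds a valid Python literal
loaded["license"] = loaded["license"].apply(ast.literal_eval)
print(loaded.loc[0, "license"]["key"])
```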