## Intermediate Lab:  Creating A Dataset Using the GitHub API

In this lab you'll create a dataset containing the metadata for your classmates' GitHub repos, using only the CSV file with everyone's URL.

The process follows these four general steps:

 - load the CSV file
 - clean the individual lines of the file to get them ready to use
 - connect to the GitHub API to obtain information about everyone's repos
 - restructure this information to turn it into a dataframe.
 
It'll be a great way to work through the first step of many data science problems: creating a workable dataset out of unorganized, messy data.  Let's get started!

### Step 1:  Load in the CSV file with everyone's GitHub URL

In [1]:
import requests
import pandas as pd

In [2]:
with open('repos.csv') as f:
    data = f.readlines()

In [3]:
# notice that data is a list filled with strings that contain info about each line
data

['Name,URL\n',
 'Jonathan,https://github.com/JonathanBechtel\n',
 'Muhammad,https://github.com/fawad07\n',
 'Carlie,https://github.com/carliedeboer\n',
 'David,https://github.com/davidbroxmeyer']

### Step 2: Remove the `\n` from each item in the list, as well as the item that contains the header info.

In [4]:
# this loops through each item in the list, starting at position 1 to skip the header, and replaces the \n character with nothing
cleaned_data = [repo.replace('\n', "") for repo in data[1:]]

In [5]:
# we can confirm now that the newline characters and the header row are no longer there
cleaned_data

['Jonathan,https://github.com/JonathanBechtel',
 'Muhammad,https://github.com/fawad07',
 'Carlie,https://github.com/carliedeboer',
 'David,https://github.com/davidbroxmeyer']
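
Steps 1 and 2 can also be combined with Python's built-in `csv` module, which consumes the header row and strips the line endings on its own. The sketch below is one possible alternative, not the lab's method: it uses an in-memory `io.StringIO` object as a stand-in for `open('repos.csv')`, and the names `sample_file`, `rows`, and `profile_urls` are made up for illustration.

```python
import csv
import io

# Stand-in for open('repos.csv'); the rows are copied from the output above.
sample_file = io.StringIO(
    "Name,URL\n"
    "Jonathan,https://github.com/JonathanBechtel\n"
    "Muhammad,https://github.com/fawad07\n"
)

# DictReader treats the first line as the header, so every later line
# becomes a dict keyed by 'Name' and 'URL', with no trailing newline.
rows = list(csv.DictReader(sample_file))
profile_urls = [row["URL"] for row in rows]

print(profile_urls)
```

With a real file, the same code would read from `open('repos.csv', newline='')` instead of `sample_file`.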

### Step 3:  Separate the username in each string from everything else

In [7]:
# split each line on '/' and take the LAST item from the list returned by split(), which is the username
usernames = [url.split('/')[-1] for url in cleaned_data]
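
Taking the last item after `split('/')` works for the URLs in this file, but it returns an empty string if a classmate's URL ends with a trailing slash. As a hedged sketch, the hypothetical helper `username_from_line` below uses `urllib.parse.urlparse` to read the path portion of the URL and takes its first segment instead.

```python
from urllib.parse import urlparse

def username_from_line(line):
    """Extract the GitHub username from a 'Name,URL' line."""
    # Everything after the last comma is the URL.
    _, _, url = line.strip().rpartition(",")
    # The path looks like '/fawad07' or '/fawad07/'.
    path = urlparse(url).path
    # Drop surrounding slashes and keep the first path segment.
    return path.strip("/").split("/")[0]

print(username_from_line("Muhammad,https://github.com/fawad07"))
print(username_from_line("Muhammad,https://github.com/fawad07/"))
```

Both calls print `fawad07`, whereas `split('/')[-1]` would give `''` for the second line.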

### Step 4: Obtain the repo data for every single GitHub username

In [8]:
# this part of the url will never change
base_url = 'https://api.github.com'

In [9]:
# this goes through every username, and inserts it into the api url, and then passes that into requests.get().json()
# to obtain a list of repos for every single user
repo_lists = [requests.get(f"{base_url}/users/{username}/repos").json() for username in usernames]
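
The one-line comprehension assumes that every request succeeds and that each user's repos fit in a single response. The GitHub REST API returns 30 items per page by default and accepts `per_page` (up to 100) and `page` query parameters. The sketch below shows one way to add a timeout, a status check, and pagination. The helper name `fetch_repos` and its `get` argument are assumptions made for this example: `get` is expected to behave like `requests.get`, and passing it in keeps the sketch runnable without network access.

```python
BASE_URL = "https://api.github.com"

def fetch_repos(username, get, per_page=100):
    """Return every public repo for one user, following pagination.

    `get` is any callable with the same interface as requests.get.
    """
    repos = []
    page = 1
    while True:
        response = get(
            f"{BASE_URL}/users/{username}/repos",
            params={"per_page": per_page, "page": page},
            timeout=10,
        )
        # Raise an exception on 4xx/5xx instead of silently
        # returning an error payload.
        response.raise_for_status()
        batch = response.json()
        repos.extend(batch)
        # A short (or empty) page means there is nothing left to fetch.
        if len(batch) < per_page:
            return repos
        page += 1
```

In the notebook it could be called as `repo_lists = [fetch_repos(name, requests.get) for name in usernames]`.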

### Step 5: Create a 'flat' list that contains every unique repo for every single user

Answer with list comprehension:

In [10]:
# this is a nested for-loop using a list comprehension that returns each item inside the inner list
repos = [repo for user in repo_lists for repo in user]

Nested loops with comprehensions can be difficult to interpret sometimes, so if a regular for-loop is easier to digest, this is a different way of writing the same thing:

In [None]:
repos = []

for user in repo_lists:
    for repo in user:
        repos.append(repo)
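
A third option is `itertools.chain.from_iterable`, which flattens one level of nesting. The sketch below also guards against a case the lab does not cover: when a lookup fails, the API typically returns a dict with an error message rather than a list, and looping over that dict would yield its keys instead of repos. The sample data and the names `repo_lists_demo`, `valid_lists`, and `repos_demo` are made up for illustration.

```python
from itertools import chain

# Hypothetical responses: two users with repos and one failed lookup.
repo_lists_demo = [
    [{"name": "bitcoin"}, {"name": "covid-19"}],
    {"message": "Not Found"},  # stand-in for an error payload
    [{"name": "dashingdemo"}],
]

# Keep only the responses that really are lists of repos.
valid_lists = [resp for resp in repo_lists_demo if isinstance(resp, list)]

# chain.from_iterable walks each inner list in turn.
repos_demo = list(chain.from_iterable(valid_lists))

print([repo["name"] for repo in repos_demo])
```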

### Step 6:  Get information about the name, owner, url, and date of every single repo.

In [11]:
# this creates a list of all the values for the name key
repo_names = [repo['name'] for repo in repos]
# ditto for the login key -- notice it's accessed inside the owner key
owners     = [repo['owner']['login'] for repo in repos]
# next two work the same way
urls       = [repo['html_url'] for repo in repos]
dates      = [repo['created_at'] for repo in repos]
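
The four comprehensions each loop over `repos` once. An alternative, sketched below under the assumption that every repo dict has the same keys used above, is to build one small dict per repo in a single pass. The helper name `summarize` and the trimmed `sample_repos` entry are illustrative only. The values are taken from the first row of the lab's final output.

```python
def summarize(repo):
    """Pick out the four fields the lab cares about from one repo dict."""
    return {
        "Owner": repo["owner"]["login"],  # login is nested inside owner
        "Name": repo["name"],
        "URL": repo["html_url"],
        "Date": repo["created_at"],
    }

# Trimmed stand-in for one item of `repos`.
sample_repos = [
    {
        "name": "bitcoin",
        "owner": {"login": "JonathanBechtel"},
        "html_url": "https://github.com/JonathanBechtel/bitcoin",
        "created_at": "2020-05-02T19:57:48Z",
    }
]

records = [summarize(repo) for repo in sample_repos]
print(records[0])
```

A list of dicts like `records` can be passed straight to `pd.DataFrame(records)`, which gives the same columns as the dictionary of lists built in the next step.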

### Step 7:  Create a dictionary with the data created from step 6

In [12]:
data_dict = {
    'Owner': owners,
    'Name': repo_names,
    'URL': urls,
    'Date': dates
}

### Step 8:  Pass your dictionary into the `pd.DataFrame()` constructor to get your final dataset

In [13]:
import pandas as pd

# this will take your dictionary and turn it into a dataframe
df = pd.DataFrame(data_dict)

In [14]:
# look how pretty it is :)
df.head()

| | Owner | Name | URL | Date |
|---|---|---|---|---|
| 0 | JonathanBechtel | bitcoin | https://github.com/JonathanBechtel/bitcoin | 2020-05-02T19:57:48Z |
| 1 | JonathanBechtel | cdc-dashboard | https://github.com/JonathanBechtel/cdc-dashboard | 2016-11-02T14:39:37Z |
| 2 | JonathanBechtel | covid-19 | https://github.com/JonathanBechtel/covid-19 | 2020-05-01T14:46:48Z |
| 3 | JonathanBechtel | covid-app | https://github.com/JonathanBechtel/covid-app | 2021-01-18T00:57:37Z |
| 4 | JonathanBechtel | dashingdemo | https://github.com/JonathanBechtel/dashingdemo | 2021-01-19T03:48:59Z |
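
The `Date` column holds plain strings. The values shown above are UTC timestamps in the form `YYYY-MM-DDTHH:MM:SSZ`, so a possible next step is to convert them into real datetime objects before sorting or filtering by date. The sketch below does this for a single value with the standard library, and the helper name `parse_created_at` is made up for illustration. For a whole column, `pd.to_datetime(df['Date'])` is the pandas equivalent.

```python
from datetime import datetime, timezone

def parse_created_at(stamp):
    """Turn a 'created_at' string into a timezone-aware datetime."""
    # The trailing 'Z' means UTC, so attach that timezone explicitly.
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)

created = parse_created_at("2020-05-02T19:57:48Z")
print(created.year, created.month, created.day)
```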
