## Intermediate Lab: Creating a Dataset Using the GitHub API

In this lab you'll create a dataset containing the metadata for your classmates' GitHub repos, using only the CSV file with everyone's profile URL.

The process has four general steps:

 - load the CSV file
 - clean the individual lines of the file to get them ready to use
 - connect to the GitHub API to obtain information about everyone's repos
 - restructure this information into a DataFrame
 
It'll be a great way to work through the first step of many data science problems: creating a workable dataset out of unorganized, messy data.  Let's get started!

### Step 1: Load the CSV file with everyone's GitHub profile URL

In [1]:
import requests
import pandas as pd

In [2]:
with open('repos.csv') as f:
    data = f.readlines()

In [4]:
# notice that data is a list of strings, one for each line of the file
data

['Name,Repo\n',
 'Jonathan Bechtel,https://github.com/JonathanBechtel\n',
 'Aoife Duna,https://github.com/aoifeduna\n',
 'Erik Lindernoren,https://github.com/eriklindernoren']

### Step 2: Remove the `\n` from each item in the list, as well as the item that contains the header info.

In [5]:
# this loops through each item in the list, starting at position 1 to skip the header, and replaces the \n character with nothing
cleaned_data = [repo.replace('\n', "") for repo in data[1:]]

In [6]:
# we can confirm now that these marks are no longer there
cleaned_data

['Jonathan Bechtel,https://github.com/JonathanBechtel',
 'Aoife Duna,https://github.com/aoifeduna',
 'Erik Lindernoren,https://github.com/eriklindernoren']
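
As an alternative to Steps 1 and 2, Python's built-in `csv` module can read the header and strip the line endings in one pass. The following is a minimal sketch, not the lab's official solution: it uses an in-memory `io.StringIO` object as a stand-in for `repos.csv` (with the sample rows shown above), and the names `sample_csv`, `rows`, and `repo_urls_from_csv` are made up for illustration.

```python
import csv
import io

# Stand-in for repos.csv so the example runs without the file on disk.
# The rows are copied from the sample output shown in Step 1.
sample_csv = io.StringIO(
    "Name,Repo\n"
    "Jonathan Bechtel,https://github.com/JonathanBechtel\n"
    "Aoife Duna,https://github.com/aoifeduna\n"
    "Erik Lindernoren,https://github.com/eriklindernoren"
)

# DictReader treats the first row as the header and removes line endings,
# so each remaining row becomes a dict keyed by 'Name' and 'Repo'.
rows = list(csv.DictReader(sample_csv))

# Pull out just the profile URLs.
repo_urls_from_csv = [row["Repo"] for row in rows]
print(repo_urls_from_csv)
```

With a real file, `open('repos.csv', newline='')` would take the place of the `io.StringIO` object.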

### Step 3:  Separate the username in each string from everything else

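In [10]:
# split each line on the comma and keep the last item, which is the profile url
repo_urls = [line.split(',')[-1] for line in cleaned_data]
# we do the same thing with '/', again taking the LAST item from the list returned by split()
usernames = [url.split('/')[-1] for url in repo_urls]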
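
Splitting on `'/'` and taking the last item works for the URLs in this file, but it returns an empty string if a URL ends with a trailing slash. A slightly more defensive sketch using the standard library's `urllib.parse` is shown below. The helper name `username_from_url` and the trailing-slash example are hypothetical and only meant to illustrate the idea.

```python
from urllib.parse import urlparse


def username_from_url(url):
    """Return the first path segment of a GitHub profile URL, or None."""
    path = urlparse(url.strip()).path  # e.g. '/JonathanBechtel'
    # Drop the empty strings produced by leading or trailing slashes.
    segments = [part for part in path.split('/') if part]
    return segments[0] if segments else None


sample_urls = [
    'https://github.com/JonathanBechtel',
    'https://github.com/aoifeduna/',  # trailing slash (hypothetical variation)
]
print([username_from_url(u) for u in sample_urls])
```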

### Step 4: Obtain the repo data for every single github username

In [11]:
# this part of the url will never change
base_url = 'https://api.github.com'

In [12]:
# this goes through every username, and inserts it into the api url, and then passes that into requests.get().json()
# to obtain a list of repos for every single user
repo_lists = [requests.get(f"{base_url}/users/{username}/repos").json() for username in usernames]
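
One thing to watch for: when a request fails (for example, an unknown username or an exceeded rate limit), the GitHub API responds with a JSON object containing a `message` key instead of a list of repos. Looping over that object in the next step would yield its keys rather than repos. The sketch below separates the two cases. The helper `split_payloads` and the sample payloads are hypothetical stand-ins, so it runs without touching the network.

```python
def split_payloads(usernames, payloads):
    """Separate successful repo lists from error objects."""
    good, failed = [], {}
    for username, payload in zip(usernames, payloads):
        if isinstance(payload, list):
            # A list means the request succeeded (it may be empty).
            good.append(payload)
        elif isinstance(payload, dict):
            # Error responses are objects that carry a 'message' key.
            failed[username] = payload.get('message', 'unknown error')
        else:
            failed[username] = 'unexpected response'
    return good, failed


# Hypothetical payloads standing in for real API responses.
sample_usernames = ['JonathanBechtel', 'no-such-user']
sample_payloads = [
    [{'name': 'data'}, {'name': 'cdc-dashboard'}],
    {'message': 'Not Found'},
]

good, failed = split_payloads(sample_usernames, sample_payloads)
print(len(good), failed)
```

Passing `usernames` and `repo_lists` to a helper like this before flattening would make failed requests visible instead of letting them surface later as a confusing `TypeError`.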

### Step 5: Create a 'flat' list that contains every unique repo for every single user

Answer with list comprehension:

In [13]:
# this is a nested for-loop using a list comprehension that returns each item inside the inner list
repos = [repo for user in repo_lists for repo in user]

Nested loops with comprehensions can be difficult to interpret sometimes, so if a regular for-loop is easier to digest, this is a different way of writing the same thing:

In [14]:
repos = []

for user in repo_lists:
    for repo in user:
        repos.append(repo)
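
A third option, if you prefer the standard library, is `itertools.chain.from_iterable`, which flattens one level of nesting. The sketch below uses a toy nested list (`nested`, a made-up stand-in for `repo_lists`) so it can run on its own.

```python
from itertools import chain

# Toy stand-in for repo_lists: one inner list per user.
nested = [['repo-a', 'repo-b'], [], ['repo-c']]

# chain.from_iterable walks each inner list in order and yields its items.
flat = list(chain.from_iterable(nested))
print(flat)
```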

### Step 6:  Get information about the name, owner, url, and date of every single repo.

In [16]:
# this creates a list of all the values for the name key
repo_names = [repo['name'] for repo in repos]
# ditto for the login key -- notice it's accessed inside the owner key
owners     = [repo['owner']['login'] for repo in repos]
# next two work the same way
urls       = [repo['html_url'] for repo in repos]
dates      = [repo['created_at'] for repo in repos]
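
The `created_at` values come back as text such as `2016-11-02T14:39:37Z`, where the trailing `Z` marks UTC. If you want to sort or compare repos by date, it can help to convert them into `datetime` objects. This is a minimal sketch, assuming every timestamp follows that exact format. The helper name `parse_github_date` is made up for illustration.

```python
from datetime import datetime, timezone


def parse_github_date(stamp):
    """Parse a timestamp such as '2016-11-02T14:39:37Z' as a UTC datetime."""
    parsed = datetime.strptime(stamp, '%Y-%m-%dT%H:%M:%SZ')
    # strptime returns a naive datetime, so attach the UTC timezone explicitly.
    return parsed.replace(tzinfo=timezone.utc)


sample_dates = ['2016-11-02T14:39:37Z', '2020-01-21T12:57:43Z']
parsed = [parse_github_date(s) for s in sample_dates]
print(parsed[0].year, parsed[1].month)
```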

### Step 7:  Create a dictionary with the data created in step 6

In [17]:
data_dict = {
    'Owner': owners,
    'Name': repo_names,
    'URL': urls,
    'Date': dates
}
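
If you would rather loop over the repos only once, the four lists and the dictionary can be built together. The sketch below is one possible way to do that, not the required approach. `repos_to_columns` is a hypothetical helper, and `sample_repos` is a trimmed-down stand-in for the real API output that keeps only the keys used in this lab.

```python
def repos_to_columns(repos):
    """Build the column dictionary in a single pass over the repos."""
    columns = {'Owner': [], 'Name': [], 'URL': [], 'Date': []}
    for repo in repos:
        columns['Owner'].append(repo['owner']['login'])
        columns['Name'].append(repo['name'])
        columns['URL'].append(repo['html_url'])
        columns['Date'].append(repo['created_at'])
    return columns


# Trimmed-down stand-in for one item of the real API output.
sample_repos = [
    {
        'name': 'cdc-dashboard',
        'owner': {'login': 'JonathanBechtel'},
        'html_url': 'https://github.com/JonathanBechtel/cdc-dashboard',
        'created_at': '2016-11-02T14:39:37Z',
    },
]

columns = repos_to_columns(sample_repos)
print(columns)
```

Because every list is appended to in the same loop iteration, the columns always end up the same length, which is what `pd.DataFrame()` expects from a dictionary of lists.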

### Step 8:  Pass your dictionary into the `pd.DataFrame()` constructor to get your final dataset

In [18]:
import pandas as pd

# this will take your dictionary and turn it into a dataframe
df = pd.DataFrame(data_dict)

In [19]:
# look how pretty it is :)
df.head()

| | Owner | Name | URL | Date |
|---|---|---|---|---|
| 0 | JonathanBechtel | cdc-dashboard | https://github.com/JonathanBechtel/cdc-dashboard | 2016-11-02T14:39:37Z |
| 1 | JonathanBechtel | DAT-01-21 | https://github.com/JonathanBechtel/DAT-01-21 | 2020-01-21T12:57:43Z |
| 2 | JonathanBechtel | DAT-06-24 | https://github.com/JonathanBechtel/DAT-06-24 | 2019-06-26T15:12:49Z |
| 3 | JonathanBechtel | DAT-10-14 | https://github.com/JonathanBechtel/DAT-10-14 | 2019-10-14T16:13:47Z |
| 4 | JonathanBechtel | data | https://github.com/JonathanBechtel/data | 2019-01-14T22:09:06Z |
