## Intermediate Lab:  Creating A Dataset Using the GitHub API

In this lab you'll create a dataset containing the metadata for your classmates' GitHub repos, using only the csv file with everyone's URL.  

The process will be done in these 4 general steps:

 - load in the csv file
 - clean each line of the file to get it ready to use
 - connect to the GitHub API to obtain information about everyone's repos
 - re-structure this information to turn it into a dataframe.
 
It'll be a great way to work through the first step of many data science problems: creating a workable dataset out of unorganized, messy data.  Let's get started!

### Step 1:  Load in the csv file with everyone's github repo

**Note:** There are a number of ways to do this, but the easiest way is usually this:

    with open('file.csv') as f:
        data = f.readlines()

In [1]:
file = "/Users/cameronlefevre/Desktop/gitrepos.csv"

with open(file) as f:
    data = f.readlines()

What you should have now is a list in which each item is a string holding the comma-separated values of one row of the csv file.  

It should generically look like this:

    ['Name,Repo\n',
     'Person 1,https://github.com/username1\n',
     'Person 2,https://github.com/username2\n',
      ......
    ]

Double check this is the case.

In [2]:
data

['Name,Github URL\n',
 'Chloé,https://github.com/chloemd\n',
 ',\n',
 'Gary,https://github.com/Gmarin10\n',
 'Cameron ,https://github.com/clefevre01\n',
 'Oore,https://github.com/ladipoore\n',
 'Jaryd Thornton,https://github.com/jcolethornton\n',
 'Peter,https://github.com/Lothdyn/my-1019-repo\n',
 'Alvaro ,https://github.com/alvarog01/mydat1019\n',
 ',\n',
 'Amanda Chernishkin,https://github.com/amandachernishkin\n',
 'John Mayer,https://github.com/mayerjp01\n',
 'Nidhi Mahambre,https://github.com/nidhim03']

The only thing we need out of each item is the person's username, the part of the string that follows the domain: `https://github.com/username_here`.  Everything else is junk.  

We'll need to go through a few steps to get our info down to a usable format.  
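For comparison, Python's built-in `csv` module can split each row into cells for you. The sketch below is only an illustration: it reads from a made-up in-memory sample (`sample`) rather than the real file, and the row contents are placeholders.

```python
import csv
import io

# Hypothetical in-memory stand-in for the csv file, so the sketch runs anywhere.
sample = io.StringIO(
    "Name,Repo\n"
    "Person 1,https://github.com/username1\n"
    ",\n"
    "Person 2,https://github.com/username2/some-repo\n"
)

reader = csv.reader(sample)
header = next(reader)  # consume the header row

# Keep only rows whose second cell (the URL) is non-empty.
rows = [row for row in reader if len(row) > 1 and row[1].strip()]

print(header)
print(rows)
```

The lab itself sticks with `readlines()` so that you practice the string clean-up by hand.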

### Step 2: Remove the `\n` from each item in the list, as well as the item that contains the header info.

When you're done you should have a list that looks like this:

    [
     'Person 1,https://github.com/username1',
     'Person 2,https://github.com/username2',
      ......
    ]

**hint:** The `replace()` method for strings is probably one of the more useful options that you have.  If you want to replace something with nothing, you can simply specify `""` for that part.
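As a rough illustration of the hint, here is one way `replace()` and list slicing could be combined. The list `sample` is a hypothetical stand-in for `data`, and the blank `','` row mimics the empty rows in the real file.

```python
# Hypothetical sample in the same shape as `data`.
sample = [
    "Name,Repo\n",
    "Person 1,https://github.com/username1\n",
    ",\n",
    "Person 2,https://github.com/username2",
]

# Slicing from index 1 skips the header row.
without_header = sample[1:]

# replace() swaps every newline for an empty string.
cleaned = [line.replace("\n", "") for line in without_header]

# Drop rows that held nothing but the separating comma.
cleaned = [line for line in cleaned if line != ","]

print(cleaned)
```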

In [54]:
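# your code here
# Steps 2 and 3 in one pass: keep only rows that contain a GitHub URL
# (this skips the header and the blank ',' rows), then reduce each URL to its username.
users = []

for item in data:
    if "https://github.com/" in item:
        url = item.split(",")[1].strip()
        path = url.replace("https://github.com/", "")
        users.append(path.split("/")[0])  # drop any trailing repo name

print(users)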

['chloemd', 'Gmarin10', 'clefevre01', 'ladipoore', 'jcolethornton', 'Lothdyn', 'alvarog01', 'amandachernishkin', 'mayerjp01', 'nidhim03']


### Step 3:  Separate the username in each string from everything else

When you're done you should have a new list that looks like this:

    [
      'username1',
      'username2',
      ...
    ]

**hint:** `split()` is a helpful method for this.  Since the parts of the string are separated by a `/` you could use that to split the string.
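To make the hint concrete, here is a small sketch of what `split()` does to one of these strings. The list `cleaned` holds hypothetical placeholder rows, not the real class data.

```python
# Hypothetical rows in the shape produced by Step 2.
cleaned = [
    "Person 1,https://github.com/username1",
    "Person 2,https://github.com/username2/some-repo",
]

usernames = []
for line in cleaned:
    url = line.split(",")[1]  # the part after the comma
    # 'https://github.com/username1'.split('/') gives
    # ['https:', '', 'github.com', 'username1'], so the username sits at index 3.
    parts = url.split("/")
    usernames.append(parts[3])

print(usernames)
```

Taking index 3 rather than the last element also handles URLs that point at a specific repo instead of the user's profile.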

In [None]:
# your code here


### Step 4: Obtain the repo data for every single github username

The repository info for every single public account is available via the following url: `https://api.github.com/users/:the_username/repos`

Replacing `:the_username` with a real username, `requests.get('https://api.github.com/users/:the_username/repos').json()` returns a list of that user's public repos (the API pages its results, 30 repos per page by default).  

When you're done, you should have a *list of lists*, with each inner list containing one user's repos.  It'll look like this:

`[[{user1, repo1}, {user1, repo2}], [{user2, repo1}], [{user3, repo1}, {user3, repo2}, {user3, repo3}], .....]`

**Warning:** We're using the free, unauthenticated version of the API here.  That means we can only make 60 API calls per hour before getting throttled.  If you've used up that quota, the response will be a dictionary telling you that you've exceeded the rate limit, rather than a list of repos.

If that's the case, try using your phone (or your neighbor's) as a hotspot and connect from there to get a new IP address.
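Because a throttled call hands back a dictionary instead of a list, it can help to check the type of each response before keeping it. The sketch below is one possible way to do that. `collect_repo_lists` and `fake_fetch` are hypothetical helpers, and the canned error message is illustrative rather than the exact text GitHub sends.

```python
def collect_repo_lists(users, fetch):
    """Call fetch(user) for each user; keep list payloads, record everything else."""
    repo_lists, failures = [], {}
    for user in users:
        payload = fetch(user)
        if isinstance(payload, list):
            repo_lists.append(payload)
        elif isinstance(payload, dict):
            # An error response is a dict, typically with a "message" key.
            failures[user] = payload.get("message", "unexpected response")
        else:
            failures[user] = "unexpected response"
    return repo_lists, failures


# Hypothetical stand-in for requests.get(...).json(), so the sketch runs offline.
def fake_fetch(user):
    canned = {
        "username1": [{"name": "repo1"}, {"name": "repo2"}],
        "username2": {"message": "API rate limit exceeded"},
    }
    return canned[user]


repo_lists, failures = collect_repo_lists(["username1", "username2"], fake_fetch)
print(len(repo_lists), failures)
```

Against the real API, `fetch` could be a small function that wraps `requests.get(f"https://api.github.com/users/{user}/repos").json()`.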

In [53]:
import requests

# Rebuild the username list from Steps 2 and 3.
users = []

for item in data:
    if "https://github.com/" in item:
        users.append(item.split(",")[1].replace("https://github.com/","").split("/")[0].rstrip())

base_url = 'https://api.github.com'

# One request per user. Each call counts against the 60-per-hour limit,
# so build the list once and reuse it instead of repeating the comprehension.
repo_lists = [requests.get(f"{base_url}/users/{user}/repos").json() for user in users]

print(repo_lists)

[[{'id': 305542140, 'node_id': 'MDEwOlJlcG9zaXRvcnkzMDU1NDIxNDA=', 'name': 'DAT-1019-Chloe', 'full_name': 'chloemd/DAT-1019-Chloe', 'private': False, 'owner': {'login': 'chloemd', 'id': 73141231, 'node_id': 'MDQ6VXNlcjczMTQxMjMx', 'avatar_url': 'https://avatars1.githubusercontent.com/u/73141231?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/chloemd', 'html_url': 'https://github.com/chloemd', 'followers_url': 'https://api.github.com/users/chloemd/followers', 'following_url': 'https://api.github.com/users/chloemd/following{/other_user}', 'gists_url': 'https://api.github.com/users/chloemd/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/chloemd/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/chloemd/subscriptions', 'organizations_url': 'https://api.github.com/users/chloemd/orgs', 'repos_url': 'https://api.github.com/users/chloemd/repos', 'events_url': 'https://api.github.com/users/chloemd/events{/privacy}', 'received_events_

### Step 5: Create a 'flat' list that contains every unique repo for every single user

When you're done you should have a list that looks like this: `[{user1, repo1}, {user1, repo2}, .... {user n, repo m}]`

I.e., instead of a list filled with other lists that have dictionaries inside them, make it a single list of dictionaries, with no nested levels like you had before.

So, go from this:

`[[{user1, repo1}, {user1, repo2}, {user1, repo3}], [{user2, repo1}, {user2, repo2}]]`
    
To this:

`[{user1, repo1}, {user1, repo2}, {user1, repo3}, {user2, repo1}, {user2, repo2}]`
    
If you have questions about what this entails, then please contact me ASAP.
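As a small sketch of what flattening means, the example below uses made-up placeholder data (`nested`) and shows two equivalent approaches: an explicit loop with `extend()`, and `itertools.chain.from_iterable` from the standard library.

```python
from itertools import chain

# Hypothetical nested data: one inner list per user.
nested = [
    [{"user": "user1", "repo": "repo1"}, {"user": "user1", "repo": "repo2"}],
    [],  # a user with no public repos contributes nothing
    [{"user": "user2", "repo": "repo1"}],
]

# Approach 1: extend a running list with each inner list.
flat_loop = []
for user_repos in nested:
    flat_loop.extend(user_repos)

# Approach 2: chain the inner lists together in order.
flat_chain = list(chain.from_iterable(nested))

print(flat_loop == flat_chain, len(flat_loop))
```

Both approaches assume every element of the outer list is itself a list. A rate-limit error dictionary mixed in with the lists would not flatten the way you expect.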

In [57]:
# your code here
repos = [repo for user in repo_lists for repo in user]

repos[0]

{'id': 305542140,
 'node_id': 'MDEwOlJlcG9zaXRvcnkzMDU1NDIxNDA=',
 'name': 'DAT-1019-Chloe',
 'full_name': 'chloemd/DAT-1019-Chloe',
 'private': False,
 'owner': {'login': 'chloemd',
  'id': 73141231,
  'node_id': 'MDQ6VXNlcjczMTQxMjMx',
  'avatar_url': 'https://avatars1.githubusercontent.com/u/73141231?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/chloemd',
  'html_url': 'https://github.com/chloemd',
  'followers_url': 'https://api.github.com/users/chloemd/followers',
  'following_url': 'https://api.github.com/users/chloemd/following{/other_user}',
  'gists_url': 'https://api.github.com/users/chloemd/gists{/gist_id}',
  'starred_url': 'https://api.github.com/users/chloemd/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/chloemd/subscriptions',
  'organizations_url': 'https://api.github.com/users/chloemd/orgs',
  'repos_url': 'https://api.github.com/users/chloemd/repos',
  'events_url': 'https://api.github.com/users/chloemd/events{/priva

### Step 6:  Get information about the name, owner, url, and date of every single repo.

We are **almost** at a true dataset from our API calls.  The problem we have now is that each dictionary is nested, i.e., some of its values are other dictionaries or lists.  

To make these values accessible to pandas we need to have them all on an equal footing.

In the dictionary for each repo there are keys called `name`, `login`, `html_url`, and `created_at`.  These are going to populate the values for our different columns.

If you look, you'll see that the `login` key is nested inside another dictionary.

What we want to do is have the values for each of these keys inside their own separate lists, so their values can be re-used more easily.

So what you need to do is use for loops to create, for each of these dictionary keys, a list that contains all of its values.  

You should get results that look like this:

`names  = [list filled with all of the values of the name key]`

`logins = [list filled with all of the values of the login key]`

`urls   = [list filled with all of the values of the html_url key]`

`dates  = [list filled with all of the values of the created_at key]`

**hint:** Notice that the `login` key is nested inside a dictionary that's the value to the `owner` key at the outer level.
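The nested lookup in the hint can be sketched with two made-up repo dictionaries. `repos_sample` is a hypothetical stand-in that contains only the four keys this step needs, while real API responses carry many more.

```python
# Hypothetical, trimmed-down repo dictionaries.
repos_sample = [
    {
        "name": "repo1",
        "owner": {"login": "user1"},
        "html_url": "https://github.com/user1/repo1",
        "created_at": "2020-01-01T00:00:00Z",
    },
    {
        "name": "repo2",
        "owner": {"login": "user2"},
        "html_url": "https://github.com/user2/repo2",
        "created_at": "2020-02-01T00:00:00Z",
    },
]

names, logins, urls, dates = [], [], [], []
for repo in repos_sample:
    names.append(repo["name"])
    # 'login' lives one level down, inside the 'owner' dictionary.
    logins.append(repo["owner"]["login"])
    urls.append(repo["html_url"])
    dates.append(repo["created_at"])

print(logins)
```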

In [69]:
repo_names = [repo['name'] for repo in repos]
repo_logins = [repo['owner']['login'] for repo in repos]
repo_urls = [repo['html_url'] for repo in repos]  # 'html_url' is the browser link; 'url' is the API endpoint
repo_dates = [repo['created_at'] for repo in repos]

### Step 7:  Create a dictionary with the data created from step 6

Your final output should look like this:

    {
       'Owner': [list with the login values for each repo],
       'Name' : [list with the name values for each repo],
       'URL'  : [list with the html_url values for each repo],
       'Date' : [list with the created_at values for each repo]
    }
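pandas expects every list in a dictionary like this to have the same length, so a quick sanity check before building the dataframe can save some confusion. The sketch below uses made-up placeholder lists, and `columns` is a hypothetical name.

```python
# Hypothetical column lists, one entry per repo.
logins = ["user1", "user1", "user2"]
names = ["repo1", "repo2", "repo1"]
urls = [
    "https://github.com/user1/repo1",
    "https://github.com/user1/repo2",
    "https://github.com/user2/repo1",
]
dates = ["2020-01-01T00:00:00Z", "2020-01-02T00:00:00Z", "2020-02-01T00:00:00Z"]

columns = {"Owner": logins, "Name": names, "URL": urls, "Date": dates}

# Every column should hold exactly one value per repo.
lengths = {key: len(values) for key, values in columns.items()}
same_length = len(set(lengths.values())) == 1

# Peek at the first "row" by taking the first value from each column.
first_row = {key: values[0] for key, values in columns.items()}

print(lengths, same_length)
print(first_row)
```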

In [70]:
repo_dict = {
    'Owner': repo_logins,
    'Name': repo_names,
    'URL': repo_urls,
    'Date': repo_dates
}

### Step 8:  Pass your dictionary into `pd.DataFrame()` to get your final dataset  

Use the `df.head()` method to confirm that you have something that's formatted appropriately.

In [74]:
import pandas as pd

df = pd.DataFrame(repo_dict)

df.head()


With `URL` taken from each repo's `url` key, `df.head()` lists API endpoints:

|   | Owner | Name | URL | Date |
|---|---|---|---|---|
| 0 | chloemd | DAT-1019-Chloe | https://api.github.com/repos/chloemd/DAT-1019-... | 2020-10-20T00:00:09Z |
| 1 | Gmarin10 | DP2 | https://api.github.com/repos/Gmarin10/DP2 | 2018-02-01T22:47:29Z |
| 2 | Gmarin10 | GA-DAT-1019 | https://api.github.com/repos/Gmarin10/GA-DAT-1019 | 2020-10-20T00:00:09Z |
| 3 | Gmarin10 | Project-Basta | https://api.github.com/repos/Gmarin10/Project-... | 2019-06-11T23:08:20Z |
| 4 | Gmarin10 | Wallbreakers | https://api.github.com/repos/Gmarin10/Wallbrea... | 2019-06-24T22:59:34Z |

With `URL` taken from `html_url`, as Step 6 specifies, the first ten rows list browser links instead:

|   | Owner | Name | URL | Date |
|---|---|---|---|---|
| 0 | chloemd | DAT-1019-Chloe | https://github.com/chloemd/DAT-1019-Chloe | 2020-10-20T00:00:09Z |
| 1 | Gmarin10 | DP2 | https://github.com/Gmarin10/DP2 | 2018-02-01T22:47:29Z |
| 2 | Gmarin10 | GA-DAT-1019 | https://github.com/Gmarin10/GA-DAT-1019 | 2020-10-20T00:00:09Z |
| 3 | Gmarin10 | Project-Basta | https://github.com/Gmarin10/Project-Basta | 2019-06-11T23:08:20Z |
| 4 | Gmarin10 | Wallbreakers | https://github.com/Gmarin10/Wallbreakers | 2019-06-24T22:59:34Z |
| 5 | clefevre01 | Test-Repo | https://github.com/clefevre01/Test-Repo | 2020-10-20T00:00:08Z |
| 6 | ladipoore | DAT-1019 | https://github.com/ladipoore/DAT-1019 | 2020-10-20T00:00:13Z |
| 7 | ladipoore | EdgarSearch | https://github.com/ladipoore/EdgarSearch | 2017-07-27T01:36:00Z |
| 8 | ladipoore | Euler | https://github.com/ladipoore/Euler | 2017-02-10T04:59:29Z |
| 9 | ladipoore | PythonClass | https://github.com/ladipoore/PythonClass | 2017-02-14T00:54:42Z |
