## Intermediate Lab:  Creating A Dataset Using the GitHub API

In this lab you'll create a dataset containing the metadata for your classmates' GitHub repos, using only the CSV file with everyone's URL.

The process has four general steps:

 - load the CSV file
 - clean the individual lines of the file to get them ready to use
 - connect to the GitHub API to obtain information about everyone's repos
 - restructure this information into a dataframe
 
It'll be a great way to work through the first step of many data science problems: creating a workable dataset out of unorganized, messy data.  Let's get started!

### Step 1:  Load in the CSV file with everyone's GitHub repo

**Note:** There are a number of ways to do this, but the easiest way is usually this:

    with open('file.csv') as f:
        data = f.readlines()
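
As one of those other ways, here is a minimal sketch using the standard-library `csv` module, which splits each row into fields and drops the line endings for you. The in-memory text and the names `sample_csv`, `header`, and `rows` are hypothetical stand-ins for the real file, not part of the lab.

```python
import csv
import io

# Hypothetical stand-in for the real file; to read from disk,
# use open('file.csv', newline='') instead of io.StringIO(...).
sample_csv = io.StringIO(
    "Name,Repo\n"
    "Person 1,https://github.com/username1\n"
    "Person 2,https://github.com/username2\n"
)

reader = csv.reader(sample_csv)
header = next(reader)   # the first row holds the column names
rows = list(reader)     # every remaining row becomes a list of cell values

print(header)
print(rows)
```

The lab itself sticks with `readlines()` so that you practice the string cleaning in the next steps by hand.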

In [55]:
import requests

# Raw string so the backslashes in the Windows path are not treated as escape sequences.
with open(r"\Users\iulia\Downloads\DAT-07-28 Github Repos - Sheet1.csv") as f:
    the_file = f.readlines()

You should now have a list in which each item is a string containing the comma-separated values of one row of the CSV file.

In general it should look like this:

    ['Name,Repo\n',
     'Person 1,https://github.com/username1\n',
     'Person 2,https://github.com/username2\n',
     ......
    ]

Double check this is the case.

In [57]:
the_file[1:]

['Jonathan Bechtel,https://github.com/JonathanBechtel\n',
 'Luki Elizalde,https://github.com/groovyluki\n',
 'iuliana trufas,https://github.com/Yuliana-GitHub\n',
 'Neraj Thangarajah,https://github.com/nthang1\n',
 'Alina Urs,https://github.com/sprintkayaking\n',
 'Ashleigh Grant,https://github.com/AshleighGrant\n',
 'Nick Hudgell ,Https://github.com/nhudgell/GADS\n',
 'Elisa Scagnetto,https://github.com/lisadt/es_repo280720']

The only thing we need out of each item is the person's username, the part of the string that follows the domain: `https://github.com/username_here`.  Everything else is junk.

We'll need to go through a few steps to get our info down to a usable format.  

### Step 2: Remove the `\n` from each item in the list, as well as the item that contains the header info.

When you're done you should have a list that looks like this:

    [
     'Person 1,https://github.com/username1',
     'Person 2,https://github.com/username2',
     ......
    ]

**hint:** The `replace()` method for strings is probably one of the more useful options that you have.  If you want to replace something with nothing, you can simply specify `""` for that part.
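
A small sketch of that hint, using made-up rows in the same shape as the list from Step 1. The names `sample_lines`, `cleaned`, and `cleaned_alt` are hypothetical. It also shows `rstrip("\n")`, which removes only trailing newlines and gives the same result on this data.

```python
# Hypothetical sample rows; the last one has no trailing newline,
# which is common for the final line of a file.
sample_lines = [
    "Name,Repo\n",
    "Person 1,https://github.com/username1\n",
    "Person 2,https://github.com/username2",
]

# Slice off the header with [1:], then replace each newline with nothing.
cleaned = [line.replace("\n", "") for line in sample_lines[1:]]

# Alternative: strip newlines from the right-hand end only.
cleaned_alt = [line.rstrip("\n") for line in sample_lines[1:]]

print(cleaned)
print(cleaned == cleaned_alt)
```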

In [58]:
user_names1 = [row.replace('\n', "") for row in the_file[1:]]
user_names1

['Jonathan Bechtel,https://github.com/JonathanBechtel',
 'Luki Elizalde,https://github.com/groovyluki',
 'iuliana trufas,https://github.com/Yuliana-GitHub',
 'Neraj Thangarajah,https://github.com/nthang1',
 'Alina Urs,https://github.com/sprintkayaking',
 'Ashleigh Grant,https://github.com/AshleighGrant',
 'Nick Hudgell ,Https://github.com/nhudgell/GADS',
 'Elisa Scagnetto,https://github.com/lisadt/es_repo280720']

### Step 3:  Separate the url in each string from everything else

When you're done you should have a new list that looks like this:

    [
     'https://github.com/username1',
     'https://github.com/username2',
     ...
    ]
     
**hint:** The `split()` method will help you out here.
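
To illustrate the hint, here is a short sketch on made-up rows (the names `sample_rows` and `sample_urls` are hypothetical). `split(',')` turns each row into a `[name, url]` list, so index 1 is the url. The extra `strip()` is an optional safeguard against stray spaces around the comma.

```python
# Hypothetical rows, including one with a space before the comma
# and a url that also contains a repo name.
sample_rows = [
    "Person 1,https://github.com/username1",
    "Person 2 ,Https://github.com/username2/some_repo",
]

# Index 0 is the person's name, index 1 is the url.
sample_urls = [row.split(",")[1].strip() for row in sample_rows]

print(sample_urls)
```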

In [60]:
repo_urls = [row.split(',')[1] for row in user_names1]
repo_urls

['https://github.com/JonathanBechtel',
 'https://github.com/groovyluki',
 'https://github.com/Yuliana-GitHub',
 'https://github.com/nthang1',
 'https://github.com/sprintkayaking',
 'https://github.com/AshleighGrant',
 'Https://github.com/nhudgell/GADS',
 'https://github.com/lisadt/es_repo280720']

### Step 4:  Separate the username in each string from everything else

When you're done you should have a new list that looks like this:

    [
     'username1',
     'username2',
     ...
    ]
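
Some urls in the file point at a specific repo (`https://github.com/username/repo_name`), so taking the last piece of the url does not always give the username. One possible approach, sketched here with the standard-library `urllib.parse`, is to take the first path segment instead. The helper name `username_from_url` is hypothetical.

```python
from urllib.parse import urlparse


def username_from_url(url):
    """Return the first path segment of a GitHub url, which is the account name."""
    path = urlparse(url.strip()).path                   # e.g. '/username2/some_repo'
    segments = [part for part in path.split("/") if part]
    return segments[0] if segments else None            # None when there is no path


print(username_from_url("https://github.com/username1"))
print(username_from_url("Https://github.com/username2/some_repo"))
```

A plain `url.split('/')[3]` does the same job for well-formed urls like the ones in this lab.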

In [70]:
# Index 3 is the account name; [-1] would return the repo name for urls like .../nhudgell/GADS.
user_list = [url.split('/')[3] for url in repo_urls]
user_list[2]

'Yuliana-GitHub'

### Step 5: Obtain the repo data for every single GitHub username

The repository info for every single public account is available via the following url: `https://api.github.com/users/:the_username/repos`

So `requests.get('https://api.github.com/users/:the_username/repos').json()` will return a list of that user's public repos.  Note that the API returns at most 30 repos per request by default.

When you're done, you should have a *list of lists*, with each inner list containing one user's repos.  It'll look like this:

`[[{user1, repo1}, {user1, repo2}], [{user2, repo1}], [{user3, repo1}, {user3, repo2}, {user3, repo3}], .....]`

**Warning:** We're using the free, unauthenticated version of the API here.  That means we can only make 60 API calls per hour before getting throttled.  If we've used up our quota, the response will be a dictionary saying that we've exceeded our rate limit, instead of a list of repos.

If that's the case, try using your phone (or your neighbor's) as a hotspot and connect from there to get a new IP address.
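
Because a throttled call returns a dictionary rather than a list, it helps to check what came back before using it. The sketch below assumes that error responses are dictionaries with a `message` key, as in the rate-limit response shown in this lab. The helper `is_error_response` and both sample payloads are hypothetical.

```python
def is_error_response(payload):
    """True when the payload looks like an API error rather than a list of repos."""
    return isinstance(payload, dict) and "message" in payload


# Hypothetical payloads: a successful call returns a list of repo dictionaries,
# while a throttled call returns a single dictionary describing the error.
sample_ok = [{"name": "repo1"}, {"name": "repo2"}]
sample_limited = {
    "message": "API rate limit exceeded",
    "documentation_url": "https://developer.github.com/v3/#rate-limiting",
}

for payload in (sample_ok, sample_limited):
    if is_error_response(payload):
        print("error:", payload["message"])
    else:
        print("got", len(payload), "repos")
```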

In [82]:
base_url = 'https://api.github.com'
# The /repos suffix is required; /users/{user} alone returns the profile, not the repo list.
repo_lists = [requests.get(f"{base_url}/users/{user}/repos").json() for user in user_list]
repo_lists[7]

{'message': "API rate limit exceeded for 89.32.123.126. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)",
 'documentation_url': 'https://developer.github.com/v3/#rate-limiting'}

### Step 6: Create a 'flat' list that contains every unique repo for every single user

When you're done you should have a list that looks like this: `[{user1, repo1}, {user1, repo2}, ..., {user n, repo m}]`

That is, instead of a list of lists with dictionaries inside them, you want a single list that holds only dictionaries, with no nested levels like you had before.

So, go from this:

`[[{user1, repo1}, {user1, repo2}, {user1, repo3}], [{user2, repo1}, {user2, repo2}]]`
    
To this:

`[{user1, repo1}, {user1, repo2}, {user1, repo3}, {user2, repo1}, {user2, repo2}]`
    
If you have questions about what this entails, then please contact me ASAP.
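
As a rough illustration of flattening, here are two equivalent ways to do it on made-up data: `itertools.chain.from_iterable` and a nested list comprehension. The sample dictionaries and the names `nested`, `flat`, and `flat_alt` are hypothetical.

```python
from itertools import chain

# Hypothetical nested data: one inner list per user, one dictionary per repo.
nested = [
    [{"user": "user1", "repo": "repo1"}, {"user": "user1", "repo": "repo2"}],
    [{"user": "user2", "repo": "repo1"}],
    [],  # a user with no public repos
]

# Option 1: chain the inner lists together end to end.
flat = list(chain.from_iterable(nested))

# Option 2: a nested comprehension, read left to right like two for loops.
flat_alt = [repo for user_repos in nested for repo in user_repos]

print(len(flat))
print(flat == flat_alt)
```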

In [74]:
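repos = []
for user_repos in repo_lists:
    # A rate-limited call returns an error dict instead of a list, so skip those.
    if isinstance(user_repos, list):
        repos.extend(user_repos)
print(repos)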

### Step 7:  Get information about the name, owner, url, and date of every single repo.

In the dictionary for each repo there are keys called `name`, `login`, `html_url`, and `created_at`.  These are going to populate the values for our different columns.

Values for each one of these keys will need to exist inside their own lists.

**hint:** Notice that the `login` key is nested inside a dictionary that's the value to the `owner` key at the outer level.
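
To make the nesting concrete, here is a sketch with one made-up repo dictionary that contains only the four keys this step uses. A real response has many more fields, and all values here are invented for illustration.

```python
# Hypothetical, trimmed-down repo dictionary.
sample_repo = {
    "name": "repo1",
    "owner": {"login": "username1"},
    "html_url": "https://github.com/username1/repo1",
    "created_at": "2020-07-28T00:00:00Z",
}

# Top-level keys are read directly.
repo_name = sample_repo["name"]

# 'login' sits one level down, inside the dictionary stored under 'owner'.
owner_login = sample_repo["owner"]["login"]

print(repo_name, owner_login)
```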

In [78]:
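# type('name') == dict compares the type of a string literal to dict, so it is always False.
# Rebuild the flat list, skipping error dicts such as rate-limit responses, then read each key.
repos      = [repo for user_repos in repo_lists if isinstance(user_repos, list) for repo in user_repos]
repo_names = [repo['name'] for repo in repos]
owners     = [repo['owner']['login'] for repo in repos]
urls       = [repo['html_url'] for repo in repos]
dates      = [repo['created_at'] for repo in repos]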
repo_names

[]

### Step 8:  Create a dictionary with the data created from step 7

Your final output should look like this:

    {
      'Owner': [list with the login values for each repo],
      'Name' : [list with the name values for each repo],
      'URL'  : [list with the html_url values for each repo],
      'Date' : [list with the created_at values for each repo]
    }
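
A minimal sketch of this shape, built from two made-up repo dictionaries (the names `sample_repos` and `columns` are hypothetical). Every list must have the same length, one entry per repo, for the dictionary to convert cleanly into a dataframe, so the sketch checks that as well.

```python
# Hypothetical flat list of repo dictionaries, trimmed to the keys we need.
sample_repos = [
    {"name": "repo1", "owner": {"login": "username1"},
     "html_url": "https://github.com/username1/repo1",
     "created_at": "2020-07-28T00:00:00Z"},
    {"name": "repo2", "owner": {"login": "username2"},
     "html_url": "https://github.com/username2/repo2",
     "created_at": "2020-07-29T00:00:00Z"},
]

# One list per column, each built in the same order from the same source list.
columns = {
    "Owner": [repo["owner"]["login"] for repo in sample_repos],
    "Name": [repo["name"] for repo in sample_repos],
    "URL": [repo["html_url"] for repo in sample_repos],
    "Date": [repo["created_at"] for repo in sample_repos],
}

# All columns should contain exactly one value per repo.
same_length = len({len(values) for values in columns.values()}) == 1
print(same_length)
```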

In [49]:
data_dict = {
    'Owner': owners,
    'Name': repo_names,
    'URL': urls,
    'Date': dates
}


### Step 9:  Pass your dictionary into the `pd.DataFrame()` constructor to get your final dataset

Use the `df.head()` method to confirm that you have something that's formatted appropriately.

In [53]:
import pandas as pd
df = pd.DataFrame(data_dict)
df.head()

Unnamed: 0,Owner,Name,URL,Date
