## Intermediate Lab: Creating a Dataset Using the GitHub API

In this lab you'll create a dataset containing all the metadata about your classmates' GitHub repos, using only the CSV file with everyone's URL.

The process consists of these 4 general steps:

 - load the CSV file
 - clean the individual lines of the file to get them ready to use
 - connect to the GitHub API to obtain information about everyone's repos
 - restructure this information to turn it into a DataFrame.
 
It'll be a great way to work through the first step of many data science problems: creating a workable dataset out of unorganized, messy data.  Let's get started!

### Step 1: Load in the CSV file with everyone's GitHub repo

**Note:** There are a number of ways to do this, but the easiest is usually this:

    with open('file.csv') as f:
        data = f.readlines()

In [32]:
# your code here
with open('repos.csv') as f:
    data=f.readlines()

You should now have a list in which each item is a string containing the comma-separated values of one row of the CSV file.

In general, it should look like this:

    ['Name,Repo\n',
     'Person 1,https://github.com/username\n',
     'Person 2,https://github.com/username\n',
     ...
    ]

Double check this is the case.

In [75]:
# your code here
data


['Aoife Duna,https://github.com/aoifeduna\n',
 'Christos Simos,https://github.com/simoschr\n',
 'Brittani Kauf,https://github.com/brittanikauf\n',
 'Stanford Turner,https://github.com/sturner08\n',
 'Shelly Seroussi,https://github.com/sturner08\n',
 'Shela Wu,https://github.com/misowu\n',
 'Rishabh Chandra,https://github.com/rishabhchandra35\n',
 '"Behrang ""Brian"" Bidadi",https://github.com/brianb888\n',
 'Alec Schneider,https://github.com/Blitz33697\n',
 'Taku Takamatsu,https://github.com/taku-takamatsu\n',
 'Lindsey Gumann,https://github.com/lgumann\n',
 'David Hurley,https://github.com/davehurl\n',
 'Kina Abe,https://github.com/kinaabe57\n',
 'Elyse Chu,https://github.com/elysechu\n',
 'Michael Lawlor,https://github.com/lawlormj/sample_work\n',
 'Emily Lam,https://github.com/emilylam98\n',
 'Mike Golodner,https://github.com/mgolodner/ga_repo.git\n',
 'Siddharth Uppal,https://github.com/sid25393']

The only thing we need out of each item is the person's username, the part of the string that sits at `https://github.com/username_here`. Everything else is junk.

We'll need to go through a few steps to get our info down to a usable format.  

### Step 2: Remove the `\n` from each item in the list, as well as the item that contains the header info.

When you're done you should have a list that looks like this:

    [
     'Person 1,https://github.com/username',
     'Person 2,https://github.com/username',
     ...
    ]

**hint:** The `replace()` method for strings is probably one of the more useful options that you have.  If you want to replace something with nothing, you can simply specify `""` for that part.
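
As one possible sketch of this step, a list comprehension can drop the trailing newline from every line and skip the header row. The sample `lines` list and the helper name `clean_lines` below are made up to match the generic layout shown above, and `rstrip('\n')` is swapped in for `replace()` because it only touches the end of the string.

```python
# Hypothetical sample mirroring the generic layout above (not the real file)
lines = [
    'Name,Repo\n',
    'Person 1,https://github.com/username1\n',
    'Person 2,https://github.com/username2',
]


def clean_lines(raw_lines, header='Name,Repo'):
    """Strip trailing newlines and drop the header row if it is present."""
    stripped = [line.rstrip('\n') for line in raw_lines]
    return [line for line in stripped if line != header]


cleaned = clean_lines(lines)
print(cleaned)
```

If your file has no header row, the filter simply leaves every line in place.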

In [76]:
# your code here
newdata = []
for people in data:
    # remove the newline character from each line
    y = people.replace('\n', "")
    newdata.append(y)
newdata


['Aoife Duna,https://github.com/aoifeduna',
 'Christos Simos,https://github.com/simoschr',
 'Brittani Kauf,https://github.com/brittanikauf',
 'Stanford Turner,https://github.com/sturner08',
 'Shelly Seroussi,https://github.com/sturner08',
 'Shela Wu,https://github.com/misowu',
 'Rishabh Chandra,https://github.com/rishabhchandra35',
 '"Behrang ""Brian"" Bidadi",https://github.com/brianb888',
 'Alec Schneider,https://github.com/Blitz33697',
 'Taku Takamatsu,https://github.com/taku-takamatsu',
 'Lindsey Gumann,https://github.com/lgumann',
 'David Hurley,https://github.com/davehurl',
 'Kina Abe,https://github.com/kinaabe57',
 'Elyse Chu,https://github.com/elysechu',
 'Michael Lawlor,https://github.com/lawlormj/sample_work',
 'Emily Lam,https://github.com/emilylam98',
 'Mike Golodner,https://github.com/mgolodner/ga_repo.git',
 'Siddharth Uppal,https://github.com/sid25393']

### Step 3: Separate the URL in each string from everything else

When you're done you should have a new list that looks like this:

    [
     'https://github.com/username',
     'https://github.com/username',
     ...
    ]
     
**hint:** The `split()` method will help you out here.

In [79]:
splitlinks=[]
for newpeople in newdata:
    splitpeople = newpeople.split(',')
    splitlinks.append(splitpeople[1])

splitlinks

['https://github.com/aoifeduna',
 'https://github.com/simoschr',
 'https://github.com/brittanikauf',
 'https://github.com/sturner08',
 'https://github.com/sturner08',
 'https://github.com/misowu',
 'https://github.com/rishabhchandra35',
 'https://github.com/brianb888',
 'https://github.com/Blitz33697',
 'https://github.com/taku-takamatsu',
 'https://github.com/lgumann',
 'https://github.com/davehurl',
 'https://github.com/kinaabe57',
 'https://github.com/elysechu',
 'https://github.com/lawlormj/sample_work',
 'https://github.com/emilylam98',
 'https://github.com/mgolodner/ga_repo.git',
 'https://github.com/sid25393']
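
Splitting on every comma works for this file, but it would break if a quoted name ever contained a comma. As a hedged alternative, the standard-library `csv` module parses quoted fields for you. The sketch below feeds it two rows taken from the cleaned list; the variable names are only illustrative.

```python
import csv

# Two rows copied from the cleaned list, including the quoted name
rows = [
    'Aoife Duna,https://github.com/aoifeduna',
    '"Behrang ""Brian"" Bidadi",https://github.com/brianb888',
]

# csv.reader accepts any iterable of strings, so a list of lines works
parsed = list(csv.reader(rows))

names = [row[0] for row in parsed]  # first column: the person's name
links = [row[1] for row in parsed]  # second column: the GitHub URL

print(names)
print(links)
```

A side effect is that the doubled quotes around `Brian` are unescaped, which `split(',')` leaves untouched.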

### Step 4: Separate the username in each string from everything else

When you're done you should have a new list that looks like this:

    [
     'username1',
     'username2',
     ...
    ]

In [81]:
usernames = []
for link in splitlinks:
    # 'https://github.com/username/...'.split('/') gives
    # ['https:', '', 'github.com', 'username', ...], so index 3 is the username
    usernames.append(link.split('/')[3])

usernames

['aoifeduna',
 'simoschr',
 'brittanikauf',
 'sturner08',
 'sturner08',
 'misowu',
 'rishabhchandra35',
 'brianb888',
 'Blitz33697',
 'taku-takamatsu',
 'lgumann',
 'davehurl',
 'kinaabe57',
 'elysechu',
 'lawlormj',
 'emilylam98',
 'mgolodner',
 'sid25393']

### Step 5: Obtain the repo data for every single GitHub username

The repository info for every public account is available at the following URL: `https://api.github.com/users/:the_username/repos`

So `requests.get('https://api.github.com/users/:the_username/repos').json()` returns a list of dictionaries, one for each public repo that user has. (The API paginates this list; by default a single request returns at most 30 repos.)

When you're done, you should have a *list of lists*, with each inner list containing one user's repos. It'll look like this:

`[[{user1, repo1}, {user1, repo2}], [{user2, repo1}], [{user3, repo1}, {user3, repo2}, {user3, repo3}], .....]`
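
One way to sketch this step is shown below. The helper names `fetch_json` and `get_all_repos` are made up for illustration, and the sketch assumes the `requests` library is installed. Unauthenticated calls to the GitHub API are rate-limited, so it is worth running the loop once and reusing the result.

```python
def fetch_json(url):
    """Download a URL and decode its JSON body (assumes `requests` is installed)."""
    import requests  # imported here so the rest of the sketch works without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 404s or rate-limit errors
    return response.json()


def get_all_repos(usernames, fetch=fetch_json):
    """Return a list of lists: one list of repo dictionaries per username."""
    base = 'https://api.github.com/users/{}/repos'
    return [fetch(base.format(name)) for name in usernames]
```

Calling `get_all_repos(['aoifeduna', 'simoschr'])` would hit the API once per username. The `fetch` parameter is only there so the function can be tried with a stand-in instead of the network.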

In [None]:
# your code here


### Step 6: Create a 'flat' list that contains every unique repo for every single user

When you're done you should have a list that looks like this: `[{user1 repo1}, {user1 repo2}, ....{user n, repo m}]`

That is, no nested levels like you had before.
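
A minimal sketch of the flattening, using a made-up `nested` list in place of the real API results: the nested for-loop visits each user's list, then each repo inside it.

```python
# Hypothetical stand-in for the list of lists from Step 5
nested = [
    [{'name': 'repo1'}, {'name': 'repo2'}],
    [],  # a user with no public repos
    [{'name': 'repo3'}],
]

flat = []
for user_repos in nested:      # outer loop: one list per user
    for repo in user_repos:    # inner loop: one dictionary per repo
        flat.append(repo)

# The same result written as a single list comprehension
flat_again = [repo for user_repos in nested for repo in user_repos]

print(flat)
```

Users with no repos contribute an empty list, which the inner loop simply skips.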

In [None]:
# your code here
#nested for-loop


### Step 7: Get information about the name, owner, URL, and date of every single repo.

In the dictionary for each repo there are keys called `name`, `login`, `html_url`, and `created_at`.  These are going to populate the values for our different columns.

The values for each of these keys need to be collected into their own lists.

**hint:** Notice that the `login` key is nested inside a dictionary that is the value of the `owner` key at the outer level.
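
The sketch below shows one way to pull the four values out, including the nested `login` lookup. The `repos` list is a hypothetical, trimmed-down stand-in for the flat list from Step 6; real repo dictionaries carry many more keys.

```python
# Hypothetical repo dictionaries, trimmed to the four keys used in this step
repos = [
    {'name': 'repo1',
     'owner': {'login': 'user1'},
     'html_url': 'https://github.com/user1/repo1',
     'created_at': '2020-01-01T00:00:00Z'},
    {'name': 'repo2',
     'owner': {'login': 'user2'},
     'html_url': 'https://github.com/user2/repo2',
     'created_at': '2021-06-15T12:30:00Z'},
]

names = [repo['name'] for repo in repos]
owners = [repo['owner']['login'] for repo in repos]  # login lives inside owner
urls = [repo['html_url'] for repo in repos]
dates = [repo['created_at'] for repo in repos]

print(owners)
```

Because all four lists are built from the same `repos` list in the same order, position `i` in each list refers to the same repo.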

In [None]:
# your code here


### Step 8: Create a dictionary with the data created from Step 7

Your final output should look like this:

    {
     'Owner': [list with the login values],
     'Name' : [list with the name values],
     'URL'  : [list with the html_url values],
     'Date' : [list with the created_at values]
    }
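
As a rough sketch, the dictionary can be written out directly from the four lists. The list contents below are invented placeholders, and the length check at the end is an optional sanity test, not part of the lab's required output.

```python
# Hypothetical lists standing in for the results of Step 7
owners = ['user1', 'user2']
names = ['repo1', 'repo2']
urls = ['https://github.com/user1/repo1', 'https://github.com/user2/repo2']
dates = ['2020-01-01T00:00:00Z', '2021-06-15T12:30:00Z']

repo_data = {
    'Owner': owners,
    'Name': names,
    'URL': urls,
    'Date': dates,
}

# Every column should have the same number of entries
lengths = {len(column) for column in repo_data.values()}
print(lengths)
```

If `lengths` ever contains more than one number, one of the lists from Step 7 is missing entries.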

In [None]:
# your answer here


### Step 9: Pass your dictionary into the `pd.DataFrame()` constructor to get your final dataset

Use the `df.head()` method to confirm that you have something that's formatted appropriately.
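
A small sketch of this last step, assuming pandas is installed and using an invented `repo_data` dictionary in the shape described in Step 8. Converting the `Date` column with `pd.to_datetime` is an optional extra, not something the lab asks for.

```python
import pandas as pd

# Hypothetical dictionary in the shape described in Step 8
repo_data = {
    'Owner': ['user1', 'user2'],
    'Name': ['repo1', 'repo2'],
    'URL': ['https://github.com/user1/repo1', 'https://github.com/user2/repo2'],
    'Date': ['2020-01-01T00:00:00Z', '2021-06-15T12:30:00Z'],
}

# Each key becomes a column and each list becomes that column's values
df = pd.DataFrame(repo_data)

# Optional: turn the timestamp strings into real datetimes
df['Date'] = pd.to_datetime(df['Date'])

print(df.head())
```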

In [None]:
# your answer here
