# GitHub Jupyter Notebook Scraping

This notebook documents the scraping of the 1,020 Jupyter Notebooks posted most recently to GitHub as of Mon, 26 Jun 2017 17:37:51 GMT. 

For further analysis of this data, see the [analysis notebook](1000_jupyter_notebooks.ipynb).

## Setup
[GitHub's API documentation](https://developer.github.com/v3/) and especially their [getting started guide](https://developer.github.com/v3/guides/getting-started/) were helpful in doing this analysis.

In [129]:
import os
import time
import json
import requests

import pandas as pd

## Authentication
GitHub limits unauthenticated users to 60 API requests per hour. You can follow [GitHub's instructions](https://developer.github.com/v3/guides/getting-started/#authentication) to generate a personal access token for future use. These tokens are more useful than simply authenticating with one's GitHub account name because a token's access rights can be fine-tuned and the token can be revoked at any time.

For security's sake, I have not listed the token here explicitly. Instead, I ran `%env GITHUB_TOKEN = ...` to set an environment variable and then deleted the cell.
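
As a small illustration of the environment-variable approach, the sketch below builds the request header from the token and fails early with a readable message if the variable was never set. The helper name `github_auth_header` is hypothetical and not part of this notebook; the `token ...` header format is the same one used in the requests later on.

```python
import os


def github_auth_header(env_var="GITHUB_TOKEN"):
    """Build the Authorization header from a token stored in the environment.

    Hypothetical helper: raises instead of silently sending an empty token.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError("Set the %s environment variable first" % env_var)
    return {"Authorization": "token %s" % token}
```

Calling `github_auth_header()` after running `%env GITHUB_TOKEN = ...` should return a dictionary that can be passed as `headers=` to `requests.get`.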

## Redirecting Query Results to a File
I am going to scrape a fair amount of data (about 200 MB), so I want to save this information to a series of files rather than holding it all in memory.

I will create the `data` directory for storing the downloaded data. The output of an API request made with `curl` can be redirected to a file by appending `> file.txt` to the end of the command.

In [10]:
%%bash
mkdir data

## Searching for recently uploaded Jupyter Notebooks
Now, using GitHub's [search API](https://developer.github.com/v3/search/), I can search for notebooks. According to the following query, there are about 1.25 million notebooks on GitHub. Each page of results returns 30 notebooks along with a few pieces of information about each one.

I have sorted the results by indexing time, so the query returns the most recently indexed notebooks (a rough approximation of the most recently uploaded ones).

**Note:** The `repository` key holds a lot of information about the repo that contains the notebook, but I will not be analyzing that for now.

In [43]:
%%bash
curl -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/search/code?q=ipynb+in:path+extension:ipynb+sort:indexed" > data/recent_notebooks.json

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100  169k  100  169k    0     0   236k      0 --:--:-- --:--:-- --:--:--  237k


In [44]:
# let's look at the data included
with open('data/recent_notebooks.json') as data_file:    
    data = json.load(data_file)
    print(data['total_count'])
    print(len(data['items']))
    print(data['items'][0].keys())

1254897
30
dict_keys(['name', 'path', 'sha', 'url', 'git_url', 'html_url', 'repository', 'score'])


## Scaling the query to retrieve all 1000 search results
I am only getting 30 results per page, so I need to [traverse the pagination](https://developer.github.com/v3/guides/traversing-with-pagination/) to look at all the results. GitHub limits search results to 1000 items, so I have 34 pages of results.

If we look at the header of the search, we can see a `Link` field with information about the other search result pages. We can use this to traverse the paginated results.

In [59]:
%%bash
curl -i -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/search/code?q=ipynb+in:path+extension:ipynb+sort:indexed"

HTTP/1.1 200 OK
Date: Mon, 26 Jun 2017 17:37:51 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 173838
Server: GitHub.com
Status: 200 OK
X-RateLimit-Limit: 30
X-RateLimit-Remaining: 28
X-RateLimit-Reset: 1498498711
Cache-Control: no-cache
X-OAuth-Scopes: gist, public_repo, user
X-Accepted-OAuth-Scopes: 
X-GitHub-Media-Type: github.v3; format=json
Link: <https://api.github.com/search/code?q=ipynb+in%3Apath+extension%3Aipynb+sort%3Aindexed&page=2>; rel="next", <https://api.github.com/search/code?q=ipynb+in%3Apath+extension%3Aipynb+sort%3Aindexed&page=34>; rel="last"
Access-Control-Expose-Headers: ETag, Link, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval
Access-Control-Allow-Origin: *
Content-Security-Policy: default-src 'none'
Strict-Transport-Security: max-age=31536000; includeSubdomains; preload
X-Content-Type-Options: nosniff
X-Frame-Options: deny
X-XSS-Protectio

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0  0  169k    0   173    0     0    301      0  0:09:37 --:--:--  0:09:37   302100  169k  100  169k    0     0   251k      0 --:--:-- --:--:-- --:--:--  251k


I will use Python to run the same requests I have been running up to this point, but now with a loop to handle the pagination. For now, I will save each page of 30 results to its own JSON file.
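
To make the `Link` header shown above more concrete, here is a minimal sketch of how such a value can be split into its `rel` names and URLs. The function `parse_link_header` is a hypothetical illustration and handles only the simple `<url>; rel="name"` form seen in the response above. The `requests` library does this parsing itself and exposes the result as `r.links`, which is what the loop below relies on.

```python
import re


def parse_link_header(value):
    """Map each rel name (e.g. "next", "last") to its URL.

    Only handles entries of the form <url>; rel="name".
    """
    links = {}
    for url, rel in re.findall(r'<([^>]+)>;\s*rel="([^"]+)"', value):
        links[rel] = url
    return links


# Link header copied from the response shown above
header = (
    '<https://api.github.com/search/code?q=ipynb+in%3Apath+extension%3Aipynb'
    '+sort%3Aindexed&page=2>; rel="next", '
    '<https://api.github.com/search/code?q=ipynb+in%3Apath+extension%3Aipynb'
    '+sort%3Aindexed&page=34>; rel="last"'
)
links = parse_link_header(header)
print(links["next"])
print(links["last"])
```

On the last page of results there is no `rel="next"` entry, so the returned dictionary has no `"next"` key.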

In [77]:
# http request variables
header = {'Authorization': 'token %s' % os.environ['GITHUB_TOKEN']}
url = 'https://api.github.com/search/code?q=ipynb+in:path+extension:ipynb+sort:indexed'

# loop management variables
end_list = False
page = 0

# request loop to go through paginated results
while not end_list:
    print(url)
    r = requests.get(url, headers = header)
    file = 'data/recent_notebooks_%s.json' % page
    with open(file, "w") as json_file:
        json_file.write(r.text)
    if r.links["next"]:
        url = r.links["next"]['url']
        page += 1
    else:
        end_list = True
    if page > 50:
        end_list = True
    
    # sleep a bit to avoid triggering API abuse conditions
    time.sleep(2)

https://api.github.com/search/code?q=ipynb+in:path+extension:ipynb+sort:indexed
https://api.github.com/search/code?q=ipynb+in%3Apath+extension%3Aipynb+sort%3Aindexed&page=2
https://api.github.com/search/code?q=ipynb+in%3Apath+extension%3Aipynb+sort%3Aindexed&page=3
https://api.github.com/search/code?q=ipynb+in%3Apath+extension%3Aipynb+sort%3Aindexed&page=4
https://api.github.com/search/code?q=ipynb+in%3Apath+extension%3Aipynb+sort%3Aindexed&page=5
https://api.github.com/search/code?q=ipynb+in%3Apath+extension%3Aipynb+sort%3Aindexed&page=6
https://api.github.com/search/code?q=ipynb+in%3Apath+extension%3Aipynb+sort%3Aindexed&page=7
https://api.github.com/search/code?q=ipynb+in%3Apath+extension%3Aipynb+sort%3Aindexed&page=8
https://api.github.com/search/code?q=ipynb+in%3Apath+extension%3Aipynb+sort%3Aindexed&page=9
https://api.github.com/search/code?q=ipynb+in%3Apath+extension%3Aipynb+sort%3Aindexed&page=10
https://api.github.com/search/code?q=ipynb+in%3Apath+extension%3Aipynb+sort%3Ainde

KeyError: 'next'

I got an error there because I did not write my stop condition properly: I should have checked whether "next" was among the keys of `r.links` rather than referencing it directly. This should not have affected the results, since the error was raised only after the last page had been written to disk, and I wanted the loop to stop at that point anyway. I now have 34 pages of results saved to separate JSON files named `recent_notebooks_#.json`.
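
For reference, a version of the loop with the stop condition written as described could look like the sketch below. It is not the code that produced the files above. The function name `save_search_pages` and its parameters are assumptions made for illustration. The `get` argument stands in for `requests.get`, so the loop can also be tried against a fake response object without touching the network.

```python
import os
import time


def save_search_pages(url, get, headers, out_dir, delay=2.0, max_pages=50):
    """Follow rel="next" links, saving each page's body to its own file.

    `get` is any callable that behaves like requests.get, returning an
    object with `.text` and `.links`. Returns the number of pages saved.
    """
    os.makedirs(out_dir, exist_ok=True)
    page = 0
    while url and page < max_pages:
        r = get(url, headers=headers)
        path = os.path.join(out_dir, 'recent_notebooks_%s.json' % page)
        with open(path, 'w') as json_file:
            json_file.write(r.text)
        page += 1
        # .get() returns None on the last page instead of raising KeyError
        url = r.links.get('next', {}).get('url')
        if url:
            # sleep a bit to avoid triggering API abuse conditions
            time.sleep(delay)
    return page
```

With the variables from the cell above, a call would be `save_search_pages(url, requests.get, header, 'data')`.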

Inspecting the data about each notebook, for now I just want the name, path, and html_url. See the example below.

In [114]:
with open('data/recent_notebooks_32.json') as data_file:    
    data = json.load(data_file)
    print(data['items'][0]['name'])
    print(data['items'][0]['path'])
    print(data['items'][0]['html_url'])
    # print(data['items'][0]['sha'])
    # print(data['items'][0]['url'])    
    # print(data['items'][0]['git_url'])    
    # print(data['items'][0]['repository'])
    # print(data['items'][0]['score'])

futures-checkpoint.ipynb
BAH/inverse_regression/cython_tutorials/Learning Cython - Working Files/Chapter 09/futures/.ipynb_checkpoints/futures-checkpoint.ipynb
https://api.github.com/repositories/95462574/contents/BAH/inverse_regression/cython_tutorials/Learning%20Cython%20-%20Working%20Files/Chapter%2009/futures/.ipynb_checkpoints/futures-checkpoint.ipynb?ref=82ff0e51fe6d3dfa5d9290e06a78c97a1b50ea6e
https://github.com/datumsays/testing_dir/blob/82ff0e51fe6d3dfa5d9290e06a78c97a1b50ea6e/BAH/inverse_regression/cython_tutorials/Learning%20Cython%20-%20Working%20Files/Chapter%2009/futures/.ipynb_checkpoints/futures-checkpoint.ipynb


## Combining search results into a single json file
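Next, I want to combine all the results into a single file. I could possibly have done this from the start, but Python cannot insert new values into a JSON file without rewriting the entire file; there is no append to speak of. Note that the code below reads the per-page files from a `data/recent_notebooks/` directory, so it assumes they have been moved there from `data/`.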

In [127]:
# total_count is copied from the first search response above
notebooks = {"total_count":1254897,"incomplete_results":False,"items":[]}
for f in os.listdir('data/recent_notebooks'):
    with open(os.path.join('data/recent_notebooks', f)) as data_file:
        data = json.load(data_file)
        # keep only the three fields needed for the analysis
        for i in data['items']:
            notebooks['items'].append({
                'name': i['name'],
                'path': i['path'],
                'html_url': i['html_url'],
            })

with open('data/recent_notebooks.json', "w") as json_file:
    json.dump(notebooks, json_file)

# number of notebooks collected
print(len(notebooks['items']))

1020
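
One way around the append limitation mentioned above is the JSON Lines layout, where each line of the file is its own JSON object and new records can be added by opening the file in append mode. This notebook does not use that layout. The sketch below only illustrates the idea, and the helper names `append_items` and `read_items` are hypothetical.

```python
import json


def append_items(path, items):
    """Append one JSON object per line; existing lines are left untouched."""
    with open(path, "a") as out:
        for item in items:
            out.write(json.dumps(item) + "\n")


def read_items(path):
    """Read the file back into a list of dictionaries."""
    with open(path) as lines:
        return [json.loads(line) for line in lines if line.strip()]
```

With this layout, each page of search results could be appended as soon as it is downloaded, at the cost of the file no longer being a single JSON document.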


## Data Cleaning and Profiling
Now let's see what we have. From the results below, the maximum value count on `html_url` is 1, meaning every one of the 1020 URLs I pulled is unique. I was expecting only 1000 results, but it seems GitHub gave me a full 30 for each of the 34 pages.

In [130]:
with open('data/recent_notebooks.json') as data_file:
    data = json.load(data_file)
    df = pd.DataFrame(data['items'])

In [132]:
df.head()

| | html_url | name | path |
|---|---|---|---|
| 0 | https://github.com/ericschulz/compforca/blob/d... | Analysis.ipynb | Data analysis/Analysis.ipynb |
| 1 | https://github.com/huang12zheng/jupyter/blob/c... | SetEnv.ipynb | python_study/SetEnv.ipynb |
| 2 | https://github.com/bukosabino/btctrading/blob/... | xgboost.ipynb | xgboost.ipynb |
| 3 | https://github.com/bukosabino/btctrading/blob/... | Ensembles.ipynb | Ensembles.ipynb |
| 4 | https://github.com/BenjiDa/Flowstress/blob/ff9... | Compile_code.ipynb | Compile_code.ipynb |


In [151]:
df['html_url'].value_counts().max()

1

## Downloading notebooks as raw JSON

Finally, I want to download each of these notebooks. There may be several notebooks with the same name, so I will rename them according to their index when I save them (rather than keeping their original names). That will make iterating over them easier as well.

It will be easiest if I pull the data from each notebook's raw link, which I can generate programmatically from the standard link. While I'm at it, I'll create a new `download_index` column (in case I want to mess with the true indices later) and save a copy of the dataframe to a CSV file.

In [139]:
df['raw_url'] = df['html_url']

df['raw_url'] = df['raw_url'].str.replace("github.com", "raw.githubusercontent.com")
df['raw_url'] = df['raw_url'].str.replace("/blob", "")
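
The replacement above removes every occurrence of `/blob`, so a notebook stored in a folder whose name starts with `blob` would end up with a wrong path. A stricter conversion, sketched below under the assumption that every `html_url` has the form `https://github.com/<owner>/<repo>/blob/<ref>/<path>`, drops only the `blob` segment that follows the repository name. The function `to_raw_url` is hypothetical and is not what this notebook ran.

```python
from urllib.parse import urlsplit, urlunsplit


def to_raw_url(html_url):
    """Convert a github.com blob URL to its raw.githubusercontent.com form."""
    parts = urlsplit(html_url)
    # the path looks like /owner/repo/blob/ref/..., so segments[0] is ''
    segments = parts.path.split('/')
    if parts.netloc != 'github.com' or len(segments) < 5 or segments[3] != 'blob':
        raise ValueError('not a GitHub blob URL: %s' % html_url)
    del segments[3]  # drop only the "blob" marker after owner/repo
    return urlunsplit((parts.scheme, 'raw.githubusercontent.com',
                       '/'.join(segments), parts.query, parts.fragment))


# hypothetical URL with a second "blob" inside the file path
print(to_raw_url('https://github.com/owner/repo/blob/abc123/dir/blob/nb.ipynb'))
```

It could be applied to the dataframe with `df['html_url'].apply(to_raw_url)`.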

In [148]:
df['download_index'] = df.index

In [150]:
df.to_csv('data/recent_notebooks.csv')

Alright, after that kind of hacky string replacing, I want to loop over these files and download the contents to the `raw_notebooks` folder.

I will print a status statement after every 10 notebooks, roughly every 1% of the download, to track progress. I will also put a 1-second delay between notebooks so that this does not look like abuse of GitHub's servers.

In [147]:
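# make sure the output folder exists before writing to it
os.makedirs('data/raw_notebooks', exist_ok=True)

for index, row in df.iterrows():
    # status update every 10 notebooks
    if index % 10 == 0:
        print(index)
    r = requests.get(row['raw_url'])
    # name each file by its index rather than its original name
    file = 'data/raw_notebooks/nb%s.json' % index
    with open(file, "w") as json_file:
        json_file.write(r.text)
    # pause between requests to avoid looking like abuse
    time.sleep(1)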

0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320
330
340
350
360
370
380
390
400
410
420
430
440
450
460
470
480
490
500
510
520
530
540
550
560
570
580
590
600
610
620
630
640
650
660
670
680
690
700
710
720
730
740
750
760
770
780
790
800
810
820
830
840
850
860
870
880
890
900
910
920
930
940
950
960
970
980
990
1000
1010
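
The loop above writes whatever the server returns, so an error page would also be saved as if it were a notebook. A more cautious variant, sketched here with a hypothetical helper `save_if_json`, writes the file only when the request succeeded and the body parses as a JSON object. The `response` argument is assumed to behave like the object returned by `requests.get`.

```python
import json


def save_if_json(response, path):
    """Write response.text to path only if it is a successful JSON object.

    Returns True when the file was written, False otherwise.
    """
    if response.status_code != 200:
        return False
    try:
        parsed = json.loads(response.text)
    except ValueError:
        return False
    if not isinstance(parsed, dict):
        return False
    with open(path, "w") as json_file:
        json_file.write(response.text)
    return True
```

Collecting the indices for which this returns `False` would give a list of notebooks to retry or to exclude from later analysis.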


And that's a wrap! I now have 1020 notebooks downloaded locally for further analysis. The file structure looks something like:
```
github_scraping.ipynb
1000_jupyter_notebooks.ipynb
data/
  - recent_notebooks.json
  - recent_notebooks.csv
  - recent_notebooks/
    - recent_notebooks_0.json
    - recent_notebooks_1.json
    - ...
  - raw_notebooks/
    - nb0.json
    - nb1.json
    - ...
```