# Downloading Notebooks

This notebook is devoted to downloading the actual notebook files from GitHub. The download began at about 1:30 p.m. PST on Friday, July 14, 2017, and finished at 6:40 p.m. on Wednesday, July 19, 2017.

The download was done in batches to check data quality along the way and to avoid getting blocked by GitHub.

In [None]:
import os
import time
import json
import datetime
import requests

import pandas as pd
import numpy as np
import seaborn as sns


%matplotlib inline

First, let's create our dataframe. Then we can write our scraping code and work through the files by size, feeding the code a different dataframe for each size range. I elected not to download all the files at once so I could check the quality of the results along the way and, hopefully, avoid getting this IP address blocked by GitHub. We are not using the GitHub API for this download, so we won't get a 403 (request denied) response if we pull too much data. GitHub may simply block the IP address instead, which would be a hard stop on the download (and the project).

In [None]:
df = pd.read_csv('cleaned_nb_data.csv')
df.rename(columns = {'Unnamed: 0':'nb_id'}, inplace = True)
print(df.shape)
df.head()

In [None]:
def write_to_log(msg):
    # append one line per message; the with-block closes the file even on error
    with open('nb_log.txt', 'a') as log_file:
        log_file.write(msg + "\n")

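def scrape_nb_from_df(data_frame, out_dir='notebooks_under_100kb'):
    # out_dir defaults to the folder used originally; pass another folder for other batches

    # check the files already downloaded in case we need to restart the search;
    # a set keeps the membership test below fast with hundreds of thousands of files
    os.makedirs(out_dir, exist_ok=True)
    current_files = set(os.listdir(out_dir))
    print(len(current_files))

    count = 0

    for i, row in data_frame.iterrows():

        date_string = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        count += 1

        # keep track of the download progress
        if count % 10000 == 0:
            print(count)

        # don't download any files we already have
        if 'nb_%s.ipynb' % row['nb_id'] in current_files:
            continue

        try:
            # access the raw content webpage and download the file
            raw_url = row['html_url'].replace('github.com', 'raw.githubusercontent.com')
            raw_url = raw_url.replace('/blob', '')
            # a timeout keeps one stalled connection from hanging the whole run
            r = requests.get(raw_url, timeout=60)
            # treat 404s and other HTTP errors as failures instead of saving the error page
            r.raise_for_status()

            # write the raw bytes so the notebook's original encoding is preserved
            filename = os.path.join(out_dir, 'nb_%s.ipynb' % row['nb_id'])
            with open(filename, 'wb') as nb_file:
                nb_file.write(r.content)

            msg = "%s: downloaded %s" % (date_string, row['nb_id'])
            write_to_log(msg)

            # if needed we can uncomment this line to slow down the downloads
            # time.sleep(0.1)

        except Exception as e:
            # log the reason so failed downloads can be diagnosed and retried later
            msg = "%s: had trouble downloading %s (%s)" % (date_string, row['nb_id'], e)
            write_to_log(msg)
            print(msg)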
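
To make the URL rewrite inside `scrape_nb_from_df` concrete, here is a minimal sketch that applies the same two `replace` calls to a single URL. The helper name `to_raw_url` and the example repository URL are hypothetical and used only for illustration.

```python
def to_raw_url(html_url):
    """Turn a GitHub file-page URL into its raw-content URL (illustrative helper)."""
    # Point at the raw-content host instead of the HTML page host
    raw_url = html_url.replace('github.com', 'raw.githubusercontent.com')
    # The raw host does not use the '/blob' path segment
    return raw_url.replace('/blob', '')

# Hypothetical URL, for illustration only
example = 'https://github.com/some-user/some-repo/blob/master/analysis.ipynb'
print(to_raw_url(example))
# https://raw.githubusercontent.com/some-user/some-repo/master/analysis.ipynb
```

Note that both replacements act on every match in the string, so a repository or user name that itself contains `/blob` would be rewritten as well.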

## Download the files

The download proceeded in three major batches, separated into different folders in case any one folder held too many files or too much data for OS X's Finder to handle.

1. Files over 1 MB, downloaded in sub-batches of over 100 MB, over 50 MB, over 30 MB, and over 1 MB. In total this was 93,962 files and 390 GB of data.
2. Files between 100 KB and 1 MB. This was 341,705 files and 118 GB of data.
3. Finally, files under 100 KB in size. This was 817,953 files and only 21 GB of data.

In total we have 1,253,620 files and 529 GB of data.
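
The size thresholds behind these batches can be sketched as a small lookup. This is an illustration rather than the code used for the download: the function name `batch_label` and the label strings are assumptions, while the byte thresholds mirror the filters applied to `max_filesize` in the cells that follow.

```python
from bisect import bisect_right

# Lower bounds in bytes (decimal units), matching the max_filesize filters
THRESHOLDS = [100000, 1000000, 30000000, 50000000, 100000000]
# Hypothetical labels, one per size range
LABELS = ['under_100kb', 'over_100kb', 'over_1mb', 'over_30mb', 'over_50mb', 'over_100mb']

def batch_label(size_bytes):
    """Return the download batch a file of the given size falls into."""
    # bisect_right counts how many lower bounds are <= size_bytes
    return LABELS[bisect_right(THRESHOLDS, size_bytes)]

for size in [99999, 100000, 999999, 1000000, 45000000, 100000000]:
    print(size, batch_label(size))
```

With the dataframe loaded earlier, `df['max_filesize'].map(batch_label).value_counts()` would give the number of files in each range.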

In [None]:
df_over_100mb = df[df['max_filesize'] >= 100000000]
print(df_over_100mb['max_filesize'].sum() / 1000000000)
print(df_over_100mb.shape)

In [None]:
scrape_nb_from_df(df_over_100mb)

In [None]:
df_over_50mb = df[(df['max_filesize'] >= 50000000) & (df['max_filesize'] < 100000000)]
print(df_over_50mb['max_filesize'].sum() / 1000000000)
print(df_over_50mb.shape)

In [None]:
scrape_nb_from_df(df_over_50mb)

In [None]:
df_over_30mb = df[(df['max_filesize'] >= 30000000) & (df['max_filesize'] < 50000000)]
print(df_over_30mb['max_filesize'].sum() / 1000000000)
print(df_over_30mb.shape)

In [None]:
scrape_nb_from_df(df_over_30mb)

In [None]:
df_over_1mb = df[(df['max_filesize'] >= 1000000) & (df['max_filesize'] < 30000000)]
print(df_over_1mb['max_filesize'].sum() / 1000000000)
print(df_over_1mb.shape)

In [None]:
scrape_nb_from_df(df_over_1mb)

In [None]:
df_over_100kb = df[(df['max_filesize'] >= 100000) & (df['max_filesize'] < 1000000)]
print(df_over_100kb['max_filesize'].sum() / 1000000000)
print(df_over_100kb.shape)

In [None]:
scrape_nb_from_df(df_over_100kb)

In [None]:
df_under_100kb = df[df['max_filesize'] < 100000]
print(df_under_100kb['max_filesize'].sum() / 1000000000)
print(df_under_100kb.shape)

In [None]:
scrape_nb_from_df(df_under_100kb)
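
Because the batches were meant to allow quality checks along the way, a finished folder can be spot-checked by confirming that each file parses as notebook JSON. The sketch below is one way to do that, not the check used during the original download. The helper names `looks_like_notebook` and `find_bad_files` are hypothetical, and the test for a top-level `cells` or `worksheets` key assumes the standard nbformat layouts (version 4 and version 3 respectively).

```python
import json
import os

def looks_like_notebook(path):
    """Return True if the file parses as JSON and has a notebook-like top level."""
    try:
        with open(path, 'r', encoding='utf-8') as f:
            nb = json.load(f)
    except (OSError, ValueError):
        # ValueError covers both invalid JSON and undecodable bytes
        return False
    return isinstance(nb, dict) and ('cells' in nb or 'worksheets' in nb)

def find_bad_files(folder):
    """List the .ipynb files in a folder that fail the check above."""
    return sorted(
        name for name in os.listdir(folder)
        if name.endswith('.ipynb')
        and not looks_like_notebook(os.path.join(folder, name))
    )
```

For example, `find_bad_files('notebooks_under_100kb')` would return the names of files that were saved but do not look like notebooks, such as an error page stored in place of the real content.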

And that's a wrap. We have now downloaded all the notebook files. We may want to collect data on the commits for each notebook in the future, but this would take a long time. Even if there were just three commits per notebook, we would have to run one query to list the commits and another query for the details of each commit (for us, the most relevant number is the count of lines added or removed). That is four queries per notebook, so 1.25 million notebooks would need 5 million queries. With our limit of only 5,000 queries per hour on GitHub's API, that leaves us with 1,000 hours, or roughly 42 days, of nonstop querying. I don't think it's worth it at this time.
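
The estimate can be reproduced in a few lines. The variable names are illustrative, and the figure of three commits per notebook is the assumed average from the paragraph above, not a measured value.

```python
notebooks = 1250000          # approximate number of notebooks
commits_per_notebook = 3     # assumed average, as in the text
rate_limit_per_hour = 5000   # GitHub API limit quoted above

# One query to list a notebook's commits, plus one per commit for its details
queries_per_notebook = 1 + commits_per_notebook
total_queries = notebooks * queries_per_notebook

hours = total_queries / rate_limit_per_hour
days = hours / 24

print(total_queries)   # 5000000
print(hours)           # 1000.0
print(round(days, 1))  # 41.7
```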

Also, for cleanliness, we manually merged all the notebooks into a single folder under `data/notebooks` for future analyses.
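
The merge was done by hand, but the same step could be scripted. The sketch below is one possible approach under stated assumptions: the function name `merge_folders` is hypothetical, and it relies on the `nb_<id>.ipynb` filenames being unique across batches. It skips any name that already exists at the destination rather than overwriting it.

```python
import os
import shutil

def merge_folders(source_dirs, dest_dir):
    """Move every .ipynb file from source_dirs into dest_dir; return (moved, skipped)."""
    os.makedirs(dest_dir, exist_ok=True)
    moved, skipped = 0, 0
    for src in source_dirs:
        for name in os.listdir(src):
            if not name.endswith('.ipynb'):
                continue
            target = os.path.join(dest_dir, name)
            if os.path.exists(target):
                # Never overwrite: a duplicate name signals a problem worth inspecting
                skipped += 1
                continue
            shutil.move(os.path.join(src, name), target)
            moved += 1
    return moved, skipped
```

A call might look like `merge_folders(['notebooks_under_100kb', 'notebooks_over_100kb'], 'data/notebooks')`, where `notebooks_over_100kb` is a placeholder for whatever the other batch folders were actually named.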

Now on to [cleaning the notebook](3_nb_cleaning.ipynb) data.