# Repository Readme and Metadata Collection

This notebook examines how many distinct repositories contain Jupyter notebooks, and collects the preferred readme and metadata for each repository.

The repo readmes were downloaded from 11:42 a.m. PST on Thursday, July 20, 2017 to 2:00 a.m. on Saturday, July 22, 2017.

The repo metadata was downloaded from 9 a.m. on Monday, July 24, 2017 through 3 p.m. on Tuesday, July 25, 2017.

In total, the downloading of notebooks, notebook metadata, repo metadata, and repo readmes spanned about two weeks.

In [19]:
import os
import time
import json
import datetime
import requests

import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline

## Create Repository Dataframe

First let's make a dataframe with data about each repository.

In [2]:
df = pd.read_csv('../data/csv/nb_metadata.csv')
df.rename(columns = {'Unnamed: 0':'nb_id'}, inplace = True)
print(df.shape)
df.head()

(1253620, 16)


Unnamed: 0,nb_id,html_url,max_filesize,min_filesize,name,owner_html_url,owner_id,owner_login,path,query_page,repo_description,repo_fork,repo_html_url,repo_id,repo_name,repo_private
0,0,https://github.com/dalequark/emotivExperiment/...,10,0,EmotivDataAnalysis.ipynb,https://github.com/dalequark,2328571,dalequark,ipynb/EmotivDataAnalysis.ipynb,1,,False,https://github.com/dalequark/emotivExperiment,26093748,emotivExperiment,False
1,1,https://github.com/kevcisme/madelon_redux/blob...,10,0,Part_IV_Project_3-checkpoint_BASE_63907.ipynb,https://github.com/kevcisme,24496260,kevcisme,ipynb/.ipynb_checkpoints/Part_IV_Project_3-che...,1,,False,https://github.com/kevcisme/madelon_redux,95729593,madelon_redux,False
2,2,https://github.com/HaraldoFilho/DLND-Projects/...,10,0,_.ipynb,https://github.com/HaraldoFilho,15271881,HaraldoFilho,_.ipynb,1,"Index for the projects of the Udacity's ""Deep ...",False,https://github.com/HaraldoFilho/DLND-Projects,88182909,DLND-Projects,False
3,3,https://github.com/mhjensen/CPMLS/blob/4a5b37e...,10,0,csexmas2015.ipynb,https://github.com/mhjensen,2732953,mhjensen,doc/pub/CSETalks/csexmas2015/ipynb/csexmas2015...,1,Master program in Computational Science. The l...,False,https://github.com/mhjensen/CPMLS,35169104,CPMLS,False
4,4,https://github.com/freqn/atom_configuration/bl...,10,0,jupyter.ipynb,https://github.com/freqn,3611075,freqn,packages/file-icons/examples/jupyter.ipynb,1,Atom Config,False,https://github.com/freqn/atom_configuration,57460377,atom_configuration,False


In [3]:
repo_counts = df['repo_id'].value_counts()
df_repos = repo_counts.to_frame("num_nb")
print(df_repos.shape)
df_repos.head()

(193616, 1)


Unnamed: 0,num_nb
13769471,10950
79205031,4483
91523680,4004
68774322,3636
55120679,3074


There appear to be 193,616 unique repositories. That works out to about 6.5 notebooks per repository on average (1,253,620 / 193,616), though the median or mode is likely a better measure of central tendency given how skewed I expect this data to be.
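To make the concern about skew concrete, here is a small sketch using only the standard library. The totals come from the shapes printed above; `sample_counts` is made-up toy data, not drawn from the dataset, with one very large repository standing in for the long tail.

```python
import statistics

# Totals taken from the shapes printed above.
total_notebooks = 1253620
unique_repos = 193616
mean_nb_per_repo = total_notebooks / unique_repos
print(round(mean_nb_per_repo, 2))

# Hypothetical toy sample: most repos hold a few notebooks, one holds thousands.
sample_counts = [1, 1, 1, 2, 3, 5, 10950]
sample_mean = statistics.mean(sample_counts)
sample_median = statistics.median(sample_counts)
sample_mode = statistics.mode(sample_counts)

# The single large repo drags the mean far above the median and mode.
print(sample_mean, sample_median, sample_mode)
```

On the real data, `df_repos['num_nb'].median()` should give the equivalent figure.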

Let's now get the owner name and repo name for each repo by merging this table with our notebooks table.

In [4]:
df_repos['repo_id'] = df_repos.index 
df_repos.reset_index(drop=True, inplace=True)
print(df_repos.shape)
df_repos.head()

(193616, 2)


Unnamed: 0,num_nb,repo_id
0,10950,13769471
1,4483,79205031
2,4004,91523680
3,3636,68774322
4,3074,55120679


In [5]:
df_repos_full = df_repos.merge(df[['repo_id', 'owner_login', 'repo_name']], how='left', on='repo_id')
df_repos_full.drop_duplicates(inplace=True)
df_repos_full.reset_index(drop=True, inplace=True)
print(df_repos_full.shape)
df_repos_full.head()

(193656, 4)


Unnamed: 0,num_nb,repo_id,owner_login,repo_name
0,10950,13769471,FOSSEE,Python-Textbook-Companions
1,4483,79205031,nbadmin-ibm,automated-bdd-test-repository
2,4004,91523680,ucsd-edx,CSE255-DSE230-Grading
3,3636,68774322,wanglongjuan,15-ODE-homework
4,3074,55120679,TakeToh,pywork


Hmm, note that we now have 40 more rows than unique repositories. Let's check for duplicates.

In [25]:
# count rows per repo_id once, then keep the ids that appear more than once
id_counts = df_repos_full['repo_id'].value_counts()
multiples = id_counts[id_counts >= 2]
print(len(multiples))

df_repos_full[df_repos_full['repo_id']==multiples.index[0]]

40

So we have the same repo listed under different owner names. Checking these manually, it looks like the old links redirect. When I search for Williams0692 on GitHub, the user does not exist, but when I go to the repo, it redirects me to one with LoriWirth as the owner.

If this is the case for all 40, we will be safe removing the duplicates, since we will just be redirected if we use one of the other links.
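A toy illustration of why the merge produced extra rows, and what keeping the first row per `repo_id` does, written in plain Python so it runs without the dataset. The owner names come from the example above; the repo id `111` and the name `example-repo` are hypothetical placeholders.

```python
# (repo_id, owner_login, repo_name) rows as they might look after the merge.
# The id 111 and the name "example-repo" are hypothetical.
rows = [
    (111, "Williams0692", "example-repo"),
    (111, "LoriWirth", "example-repo"),
    (13769471, "FOSSEE", "Python-Textbook-Companions"),
    (13769471, "FOSSEE", "Python-Textbook-Companions"),
]

# Dropping exact duplicates still keeps both rows for repo 111,
# because the owner names differ.
unique_rows = list(dict.fromkeys(rows))

# Keeping only the first row seen per repo_id leaves one row per repository.
first_per_repo = {}
for repo_id, owner, name in unique_rows:
    first_per_repo.setdefault(repo_id, (owner, name))

print(len(unique_rows), len(first_per_repo))
```

Which owner name survives depends on row order, which is acceptable here only if the stale name redirects as described above.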

In [8]:
df_repos_dedup = df_repos_full.drop_duplicates(subset = ['repo_id'], keep='first')
df_repos_dedup.reset_index(drop=True, inplace=True)
print(df_repos_dedup.shape)
df_repos_dedup.head()

(193616, 4)


Unnamed: 0,num_nb,repo_id,owner_login,repo_name
0,10950,13769471,FOSSEE,Python-Textbook-Companions
1,4483,79205031,nbadmin-ibm,automated-bdd-test-repository
2,4004,91523680,ucsd-edx,CSE255-DSE230-Grading
3,3636,68774322,wanglongjuan,15-ODE-homework
4,3074,55120679,TakeToh,pywork


## Download Preferred Readmes

And now, we can attempt to download the preferred readme for each repo.

In [17]:
header = {'Authorization': 'token %s' % os.environ['GITHUB_TOKEN']}

def write_to_log(msg):
    # append a single line to the readme download log
    f = '../logs/repo_readme_log.txt'
    with open(f, "a") as log_file:
        log_file.write(msg + "\n")
    

def get_readmes_from_df(data_frame):
    
    # check the files already downloaded in case we need to restart the search
    current_files = os.listdir('../data/readmes')
    print("There are currently %s readme files saved"% len(current_files))
    
    for i, row in data_frame.iterrows():
        
        # keep track of the download progress
        if i % 10000 == 0:
            print(i)
        
        # don't save files we already have
        if 'readme_%s.json' % row['repo_id'] in current_files:
            continue
        
        recorded = False
        
        wait_time = 0        
        
        while not recorded:
            
            time.sleep(wait_time)
        
            date_string = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            
            try:
                # query the api
                url = 'https://api.github.com/repos/%s/%s/readme' % (row['owner_login'], row['repo_name'])

                r = requests.get(url, headers = header)
                j = r.json()
                h = r.headers

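                # handle rate limiting: check the HTTP status code, wait until the limit resets, then retry
                if r.status_code == 403:
                    print("%s: Hit rate limit. Retry at %s" % (h['Date'], h['X-RateLimit-Reset']))
                    # clamp at zero because time.sleep rejects negative values
                    wait_time = max(0, int(h['X-RateLimit-Reset']) - time.time() + 1)
                    continue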
                else:

                    filename = '../data/readmes/readme_%s.json' % row['repo_id']
                    with open(filename, 'w') as readme_file:
                        json.dump(j, readme_file)

                    msg = "%s: downloaded readme for repo %s" % (date_string, row['repo_id'])
                    write_to_log(msg)
                    recorded = True
                    wait_time = 0

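            except Exception:
                # log the failure; the loop then retries the same repo
                # (catching Exception rather than everything keeps Ctrl-C working)
                msg = "%s: had trouble downloading readme for repo %s" % (date_string, row['repo_id'])
                write_to_log(msg)
                print(msg)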
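
The wait time derived from the `X-RateLimit-Reset` header is the difference between two epoch timestamps. Here is a minimal sketch of that calculation as a standalone helper, clamped at zero because `time.sleep` raises `ValueError` for negative values. The name `seconds_until_reset` and the `now` parameter are hypothetical, added here only so the arithmetic can be checked without calling the API.

```python
import time

def seconds_until_reset(reset_epoch, now=None):
    """Return how long to sleep until the rate limit resets, never negative."""
    if now is None:
        now = time.time()
    # The header value arrives as a string of epoch seconds.
    # Add one second of padding, then clamp at zero.
    return max(0.0, int(reset_epoch) - now + 1)

# Reset is 60 seconds in the future: wait 61 seconds including the padding.
print(seconds_until_reset("1500000060", now=1500000000.0))

# Reset time has already passed: no wait.
print(seconds_until_reset("1500000000", now=1500000100.0))
```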

In [18]:
get_readmes_from_df(df_repos_dedup)

There are currently 193617 readme files saved
0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000
160000
170000
180000
190000
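
Each saved file holds the JSON returned by the readme endpoint. For a successful response, GitHub returns the file body base64-encoded in the `content` field, with `encoding` set to `base64`. A minimal sketch of decoding it follows. The payloads are hand-built stand-ins rather than real data, and `decode_readme` is a hypothetical helper that is not part of this notebook. Since the download loop saves whatever JSON comes back, repos without a readme will have an error message saved instead, which the helper maps to `None`.

```python
import base64

def decode_readme(payload):
    """Return the readme text from a readme API payload, or None if absent."""
    if payload.get("encoding") != "base64" or "content" not in payload:
        return None
    raw = base64.b64decode(payload["content"])
    return raw.decode("utf-8", errors="replace")

# Hand-built stand-in for a saved successful response (not real data).
fake_payload = {
    "name": "README.md",
    "encoding": "base64",
    "content": base64.b64encode(b"# Demo\nHello").decode("ascii"),
}

# Stand-in for a saved error response, e.g. a repo with no readme.
missing_payload = {"message": "Not Found"}

print(decode_readme(fake_payload))
print(decode_readme(missing_payload))
```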


## Download Repo Metadata

We can now use GitHub's API to get the metadata for each repo as well. We are especially interested in the number of forks, watchers, and other measures that indicate collaboration.
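
As a rough sketch of which fields carry those signals, the repository endpoint's JSON includes counts such as `forks_count`, `stargazers_count`, `subscribers_count`, and `open_issues_count`. The payload below is a hand-built stand-in with made-up values, and `collaboration_summary` is a hypothetical helper, not something this notebook defines.

```python
def collaboration_summary(repo_json):
    """Pick out collaboration-related counts, using None when a field is missing."""
    fields = [
        "forks_count",
        "stargazers_count",
        "subscribers_count",
        "open_issues_count",
    ]
    return {field: repo_json.get(field) for field in fields}

# Hand-built stand-in for a saved repo metadata file (values are made up).
fake_repo = {
    "id": 1,
    "full_name": "example-owner/example-repo",
    "forks_count": 3,
    "stargazers_count": 12,
    "subscribers_count": 2,
    "open_issues_count": 0,
}

summary = collaboration_summary(fake_repo)
print(summary)

# An error payload (e.g. a deleted repo) yields None for every field.
print(collaboration_summary({"message": "Not Found"}))
```

Using `.get` keeps the helper from failing on error payloads, which matters because the download loop writes whatever JSON the API returns.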

In [None]:
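# set your own GitHub personal access token before running; the original value is not included here
# os.environ['GITHUB_TOKEN'] = ...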
header = {'Authorization': 'token %s' % os.environ['GITHUB_TOKEN']}

In [None]:
def write_to_log(msg):
    # append a single line to the repo metadata download log
    f = '../logs/repo_metadata_query_log.txt'
    with open(f, "a") as log_file:
        log_file.write(msg + "\n")
    

def get_repo_data_from_df(data_frame):
    
    # check the files already downloaded in case we need to restart the search
    current_files = os.listdir('../data/api_results/repo_metadata')
    print("There are currently %s repo metadata files saved"% len(current_files))
    
    for i, row in data_frame.iterrows():
        
        # keep track of the download progress
        if i % 100000 == 0:
            print("%s repos processed" % i)
        
        # don't save files we already have
        if 'repo_%s.json' % row['repo_id'] in current_files:
            continue
        
        recorded = False
        
        wait_time = 0        
        
        while not recorded:
            
            time.sleep(wait_time)
        
            date_string = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            
            try:
                # query the api
                url = 'https://api.github.com/repos/%s/%s' % (row['owner_login'], row['repo_name'])

                r = requests.get(url, headers = header)
                j = r.json()
                h = r.headers

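                # handle rate limiting: check the HTTP status code, wait until the limit resets, then retry
                if r.status_code == 403:
                    print("%s: Hit rate limit. Retry at %s" % (h['Date'], h['X-RateLimit-Reset']))
                    # clamp at zero because time.sleep rejects negative values
                    wait_time = max(0, int(h['X-RateLimit-Reset']) - time.time() + 1)
                    continue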
                else:

                    filename = '../data/api_results/repo_metadata/repo_%s.json' % row['repo_id']
                    with open(filename, 'w') as metadata_file:
                        json.dump(j, metadata_file)

                    msg = "%s: downloaded metadata for repo %s" % (date_string, row['repo_id'])
                    write_to_log(msg)
                    recorded = True
                    wait_time = 0

            except Exception:
                # log the failure first, then re-raise so the error surfaces
                msg = "%s: had trouble downloading metadata for repo %s" % (date_string, row['repo_id'])
                write_to_log(msg)
                print(msg)
                raise

In [None]:
get_repo_data_from_df(df_repos_dedup)

And that's a wrap. We have now downloaded metadata for 1.25 million notebooks, as well as the preferred readme and metadata for each repo containing these notebooks. Now on to [cleaning the repo metadata and readmes](5_repo_metadata_cleaning.ipynb).