# Notebook Metadata Cleaning

This notebook cleans and profiles the metadata collected with GitHub's Search API for the roughly 1.25 million Jupyter Notebooks hosted publicly on GitHub. The metadata starts as a series of JSON files detailing the results of each search for notebooks, and ends as a cleaned and deduplicated dataframe holding the most relevant metadata for each notebook, saved to a CSV file for later use.

In the end, we have 1,253,620 unique notebooks with data about them saved in `cleaned_nb_data.csv`. We are missing 43,304 notebooks because a query cannot return more than 1,000 results, and a few individual file sizes each matched more than 1,000 notebooks. Most of these are remarkably small files (~35,000 of them are empty 72-byte files). We are additionally missing 2 notebooks that were not returned with the query results. In total, we have notebook metadata on 1,253,620 of 1,296,926 possible notebooks, or 96.66% of all notebooks.
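
As a quick sanity check, the coverage figure can be recomputed from the three counts quoted above. This is only a sketch of the arithmetic; the variable names are ours and are not used elsewhere in the notebook.

```python
# Counts as reported in this notebook.
collected = 1_253_620          # unique notebooks kept after cleaning
missed_by_size_cap = 43_304    # notebooks in file sizes with more than 1,000 results
missed_in_results = 2          # notebooks absent from the returned query results

possible = collected + missed_by_size_cap + missed_in_results
coverage = collected / possible

print(possible)            # 1296926
print(f"{coverage:.2%}")   # 96.66%
```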

In [1]:
import os
import json
import datetime

import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline

## Missing Ranges

Due to the nature of our search, we may have missed a few files. This would occur when more than 1,000 notebooks had exactly the same filesize.

In [1]:
with open('../logs/nb_metadata_query_log.txt', 'r') as log:
    for l in log:
        if l.startswith('TOO MANY RESULTS'):
            print(l)

TOO MANY RESULTS: 72-72 bytes, 34895 results

TOO MANY RESULTS: 181-181 bytes, 3303 results

TOO MANY RESULTS: 301-301 bytes, 1032 results

TOO MANY RESULTS: 581-581 bytes, 1416 results

TOO MANY RESULTS: 582-582 bytes, 1300 results

TOO MANY RESULTS: 1134-1134 bytes, 1358 results



This is not too many: just 6 file sizes totaling about 43,000 files. Most of these files are remarkably small (72 bytes), with 1134 bytes being the largest. Only this largest set seems to have any content. Everything below it (i.e. 582 bytes and below) is just empty files, or files with one empty cell, so it is not a problem that we missed these. However, we should note that we would have close to 1.3 million files if we kept these.
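
The total can be checked by parsing the log lines printed above. The sketch below copies those six lines into a list rather than rereading the log file, and the regular expression is our assumption about the line format, inferred from that output.

```python
import re

# The "TOO MANY RESULTS" lines printed above, copied verbatim.
log_lines = [
    "TOO MANY RESULTS: 72-72 bytes, 34895 results",
    "TOO MANY RESULTS: 181-181 bytes, 3303 results",
    "TOO MANY RESULTS: 301-301 bytes, 1032 results",
    "TOO MANY RESULTS: 581-581 bytes, 1416 results",
    "TOO MANY RESULTS: 582-582 bytes, 1300 results",
    "TOO MANY RESULTS: 1134-1134 bytes, 1358 results",
]

# Assumed line format: "TOO MANY RESULTS: <min>-<max> bytes, <n> results"
pattern = re.compile(r"TOO MANY RESULTS: (\d+)-(\d+) bytes, (\d+) results")

skipped = {}
for line in log_lines:
    match = pattern.match(line)
    if match:
        min_size, max_size, n_results = (int(g) for g in match.groups())
        skipped[(min_size, max_size)] = n_results

total_skipped = sum(skipped.values())
print(len(skipped), total_skipped)  # 6 43304
```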

## Decide what data to collect

First, let's look at what kind of data we have access to.

In [2]:
with open('data/notebooks/github_notebooks_8948_8988_p1.json') as f:
    test_file = json.load(f)
    print(test_file.keys())
    print("")
    print(test_file['items'][0].keys())
    print("")
    print(test_file['items'][0]['repository'].keys())
    print("")
    print(test_file['items'][0]['repository']['owner'].keys())

dict_keys(['total_count', 'incomplete_results', 'items'])

dict_keys(['name', 'path', 'sha', 'url', 'git_url', 'html_url', 'repository', 'score'])

dict_keys(['id', 'name', 'full_name', 'owner', 'private', 'html_url', 'description', 'fork', 'url', 'forks_url', 'keys_url', 'collaborators_url', 'teams_url', 'hooks_url', 'issue_events_url', 'events_url', 'assignees_url', 'branches_url', 'tags_url', 'blobs_url', 'git_tags_url', 'git_refs_url', 'trees_url', 'statuses_url', 'languages_url', 'stargazers_url', 'contributors_url', 'subscribers_url', 'subscription_url', 'commits_url', 'git_commits_url', 'comments_url', 'issue_comment_url', 'contents_url', 'compare_url', 'merges_url', 'archive_url', 'downloads_url', 'issues_url', 'pulls_url', 'milestones_url', 'notifications_url', 'labels_url', 'releases_url', 'deployments_url'])

dict_keys(['login', 'id', 'avatar_url', 'gravatar_id', 'url', 'html_url', 'followers_url', 'following_url', 'gists_url', 'starred_url', 'subscriptions_url', 'organizati

Based on inspecting an example of the data included with the search results above, let's include the following data in our dataframe.

1. html_url
2. name
3. path
4. repository
    1. description
    2. fork
    3. html_url
    4. id
    5. name
    6. owner
        1. id
        2. html_url
        3. login
    7. private

## Move data from distributed JSON files to a single array

In [3]:
all_nbs = []

# get the names of all our data files
nb_search_files = [f for f in os.listdir('data/notebooks') if f[-5:] == '.json']

# a subset list that was used for debugging the following code
small_list = nb_search_files[0:100]

def write_to_log(msg):
    # open in append mode so messages from earlier runs are kept
    with open('nb_metadata_cleaning_log.txt', 'a') as log_file:
        log_file.write(msg + "\n")

for j, f in enumerate(nb_search_files):
    
    # keep track of our progress
    if j % 1000 == 0:
        print("%s / %s data files processed" % (j, len(nb_search_files)))
    
    with open('data/notebooks/' + f, 'r') as json_file:
        file_components = f.split(".")[0].split('_')
        
        # get data from the filename
        min_filesize = file_components[2]
        max_filesize = file_components[3]
        query_page = file_components[4][1:]
        
        file_dict = json.load(json_file)
        
        # do a little tracking of where we may be missing data
        if 'incomplete_results' in file_dict:
            if file_dict['incomplete_results'] == True:
                msg = "%s has incomplete results" % f
                write_to_log(msg)
        
        
        if 'items' in file_dict:
            # track if we have no items in this file
            if len(file_dict['items']) == 0:
                msg = "%s has 0 items" % f
                write_to_log(msg)
            
            # otherwise save all the data for each item
            else:
                for i in file_dict['items']:
                    nb_stats = {}
                    nb_stats['html_url'] = i['html_url']
                    nb_stats['name'] = i['name']
                    nb_stats['path'] = i['path']
                    nb_stats['repo_description'] = i['repository']['description']
                    nb_stats['repo_fork'] = i['repository']['fork']
                    nb_stats['repo_html_url'] = i['repository']['html_url']
                    nb_stats['repo_id'] = i['repository']['id']
                    nb_stats['repo_name'] = i['repository']['name']
                    nb_stats['owner_id'] = i['repository']['owner']['id']
                    nb_stats['owner_html_url'] = i['repository']['owner']['html_url']
                    nb_stats['owner_login'] = i['repository']['owner']['login']
                    nb_stats['repo_private'] = i['repository']['private']
                    nb_stats['min_filesize'] = min_filesize
                    nb_stats['max_filesize'] = max_filesize
                    nb_stats['query_page'] = query_page
                    
                    all_nbs.append(nb_stats)
                    
        else:
            msg = "%s has no items object" % f
            write_to_log(msg)    

print("")            
print("We have %s notebooks"% len(all_nbs))

0 / 13538 data files processed
1000 / 13538 data files processed
2000 / 13538 data files processed
3000 / 13538 data files processed
4000 / 13538 data files processed
5000 / 13538 data files processed
6000 / 13538 data files processed
7000 / 13538 data files processed
8000 / 13538 data files processed
9000 / 13538 data files processed
10000 / 13538 data files processed
11000 / 13538 data files processed
12000 / 13538 data files processed
13000 / 13538 data files processed

We have 1261705 notebooks
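
The loop above recovers the file-size range and page number by splitting each filename on `_`. A regular expression makes the expected pattern explicit and rejects unexpected names. This is a sketch: `parse_result_filename` is a hypothetical helper, and it returns integers, whereas the loop above stores these values as strings.

```python
import re

# Expected pattern: github_notebooks_<min bytes>_<max bytes>_p<page>.json
FILENAME_RE = re.compile(r'^github_notebooks_(\d+)_(\d+)_p(\d+)\.json$')

def parse_result_filename(filename):
    """Return (min_filesize, max_filesize, query_page) or None if no match."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    min_filesize, max_filesize, query_page = (int(g) for g in match.groups())
    return min_filesize, max_filesize, query_page

print(parse_result_filename('github_notebooks_3846_3862_p4.json'))  # (3846, 3862, 4)
print(parse_result_filename('notes.json'))                          # None
```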


## Check For Missing Notebooks 
Now that we have the data loaded, we want to check whether any query result files describe no notebooks, or fewer notebooks than expected.

In [4]:
no_items = []
incomplete = []

with open('nb_metadata_cleaning_log.txt', 'r') as log:
    for l in log:
        parts = l.split()
        
        if parts[2] == '0':
            no_items.append(parts[0])
        
        # an incomplete flag only matters if the page holds fewer than 100 items
        # and is not simply the last (legitimately short) page of its query
        elif parts[2] == 'incomplete':
            with open('data/' + parts[0], 'r') as json_file:                            
                file_dict = json.load(json_file)
                if len(file_dict['items']) < 100:
                    page = int(parts[0].split('_')[-1].split('.')[0][1:])
                    if not len(file_dict['items']) + 100 * (page-1) == file_dict['total_count']:
                        incomplete.append([parts[0], len(file_dict['items']), file_dict['total_count']])

In [5]:
print(no_items)
print(incomplete)

[]
[['github_notebooks_3846_3862_p4.json', 98, 688], ['github_notebooks_3846_3862_p4.json', 98, 688], ['github_notebooks_3846_3862_p4.json', 98, 688]]


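It looks like only one file, page 4 of the 3846-3862 byte query, is missing notebooks: it holds 98 items where a full page has 100, so two notebooks are missing for some reason. The file is listed three times, most likely because the cleaning log is opened in append mode and was written on more than one run. Not bad.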
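
The count of two missing notebooks follows from the page arithmetic. The sketch below assumes 100 results per page, as the check above does, and that every page is filled before the next one starts; `expected_items_on_page` is a hypothetical helper, not part of the notebook.

```python
def expected_items_on_page(total_count, page, per_page=100):
    """Number of items a given 1-indexed page should hold."""
    already_listed = per_page * (page - 1)
    return max(0, min(per_page, total_count - already_listed))

# Values from the flagged file: page 4, 98 items, total_count of 688.
expected = expected_items_on_page(688, 4)
print(expected, expected - 98)  # 100 2
```

With 688 results in total, pages 1 through 6 should each be full and page 7 should hold the remaining 88, so a page 4 with 98 items is short by two.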

We'll also want to check our coverage. Did we really cover all the byte sizes, or did we skip a range? Based on the results below, it seems we covered every size with no gaps.

In [6]:
ranges = []
gaps = []

for f in nb_search_files:
    parts = f.split('_')
    if parts[4] == 'p1':
        start = int(parts[2])
        end = int(parts[3])
        ranges.append([start, end])
        
ranges.sort(key=lambda x: x[0])

for i, r in enumerate(ranges):
    # compare every adjacent pair of ranges, including the last pair
    if i < len(ranges) - 1:
        if ranges[i+1][0] != ranges[i][1] + 1:
            gaps.append([ranges[i][1], ranges[i+1][0]])

len(gaps)

0
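
The same check can be written as a small function and tried on synthetic ranges, which makes it easy to confirm that a gap is detected, including one before the final range. This is a sketch; `find_gaps` is a hypothetical name. Like the cell above, it flags any pair where the next range does not start exactly one byte after the previous range ends.

```python
def find_gaps(ranges):
    """Return (previous_end, next_start) for each non-contiguous adjacent pair."""
    ordered = sorted(ranges)
    gaps = []
    # zip pairs every range with its successor, so the last pair is included
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start != prev_end + 1:
            gaps.append((prev_end, next_start))
    return gaps

print(find_gaps([(0, 10), (11, 20), (21, 30)]))  # []
print(find_gaps([(0, 10), (11, 20), (25, 30)]))  # [(20, 25)]
```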

## Create Dataframe & CSV

Now we can create a dataframe to clean the data more easily.

In [7]:
df = pd.DataFrame(all_nbs)
df.shape

(1261705, 15)

First off, it would be good to know if we have any duplicates. 

It turns out we have quite a few (~12,000 the first time through), which likely come either from files being re-committed at a larger file size while we were running the search, or from errors in the query results. We will go ahead and drop all the duplicates.

In [8]:
nb_counts = df['html_url'].value_counts()
len(nb_counts[nb_counts > 1])

11368

In [9]:
df_dedup = df.drop_duplicates(subset = df.columns[df.columns != 'query_page'], keep='first')
df_dedup.shape

(1253620, 15)
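
Note that the drop above compares every column except `query_page`, so two rows that share an `html_url` but differ in another column (for example `min_filesize`) both survive. The sketch below illustrates this on a made-up toy dataframe with only three of the columns; it is not run on the real data.

```python
import pandas as pd

# Toy rows (made up): "a" repeats across pages, "b" repeats across size ranges.
toy = pd.DataFrame([
    {'html_url': 'https://example.com/a.ipynb', 'min_filesize': '100', 'query_page': '1'},
    {'html_url': 'https://example.com/a.ipynb', 'min_filesize': '100', 'query_page': '2'},
    {'html_url': 'https://example.com/b.ipynb', 'min_filesize': '200', 'query_page': '1'},
    {'html_url': 'https://example.com/b.ipynb', 'min_filesize': '250', 'query_page': '1'},
])

# Same rule as the cell above: ignore query_page when comparing rows.
cols = toy.columns[toy.columns != 'query_page']
toy_dedup = toy.drop_duplicates(subset=cols, keep='first')

# Rows whose URL still appears more than once after the drop.
remaining = toy_dedup['html_url'].duplicated(keep=False).sum()
print(len(toy_dedup), remaining)  # 3 2
```

On the real data, `df_dedup['html_url'].is_unique` would show whether any URL still appears more than once after the drop.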

In [10]:
df_dedup.to_csv('../data/csv/nb_metadata.csv')

So now I have 1.25 million files that I'm going to want to start [downloading](2_nb_download.ipynb).