# Repo Metadata and Readme Cleaning

This notebook cleans the repo metadata and readme files downloaded for each repository containing at least one Jupyter notebook.

In [31]:
import os
import time
import json

import pandas as pd
import numpy as np
import seaborn as sns

%matplotlib inline

## Check for Completeness

Now we should check that we have downloaded the metadata and the readme for each repo, and that this data is complete. The first step is to see whether we have a metadata file and a readme file for every repository in our list.

In [32]:
# pull in notebook metadata
df_nbs = pd.read_csv('../data/csv/nb_metadata.csv')
df_nbs.rename(columns = {'Unnamed: 0':'nb_id'}, inplace = True)

# create df of just the repos for each of our notebooks
repo_counts = df_nbs['repo_id'].value_counts()
df_repos = repo_counts.to_frame("num_nb")
df_repos['repo_id'] = df_repos.index 
df_repos.reset_index(drop=True, inplace=True)

print(df_repos.shape)
df_repos.head()

(193616, 2)


|  | num_nb | repo_id |
|---|---|---|
| 0 | 10950 | 13769471 |
| 1 | 4483 | 79205031 |
| 2 | 4004 | 91523680 |
| 3 | 3636 | 68774322 |
| 4 | 3074 | 55120679 |


Let's turn this into a list of ids that we can quickly compare with the list of ids from both the metadata we actually downloaded and the readme files we downloaded.

In [33]:
# .values replaces Series.as_matrix(), which was removed in pandas 1.0
expected_repos = df_repos['repo_id'].values
np.sort(expected_repos)

array([    2058,    66233,    87831, ..., 97245373, 97246456, 97255702])

In [34]:
repo_metadata_ids = [int(f.split('.')[0].split('_')[1]) for f in os.listdir('../data/api_results/repo_metadata') if f[-5:] == '.json']
repo_readme_ids = [int(f.split('.')[0].split('_')[1]) for f in os.listdir('../data/readmes') if f[-5:] == '.json']

repo_metadata_ids.sort()
repo_readme_ids.sort()

print('We have %s repo metadata files' % len(repo_metadata_ids))
print('We have %s repo readme files' % len(repo_readme_ids))

We have 193616 repo metadata files
We have 193616 repo readme files


So it looks like we have the same number of files as repos in our list. Now we should compare these lists to make sure the ids themselves match.

In [35]:
repo_metadata_ids = np.array(repo_metadata_ids)
repo_readme_ids = np.array(repo_readme_ids)

In [36]:
# ids we expected but have no metadata file for
missing_repo_metadata = np.setdiff1d(expected_repos, repo_metadata_ids, assume_unique=True)
len(missing_repo_metadata)

0

In [37]:
missing_repo_readmes = np.setdiff1d(expected_repos, repo_readme_ids,  assume_unique=True)
len(missing_repo_readmes)

0

So our first check is done. We have a metadata file and a readme file for each repo. Now to check that these have content in them.
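
The comparison above only looks in one direction at a time. As a minimal sketch (the helper name `compare_ids` and the toy ids are assumptions for illustration, not part of the original notebook), both directions can be checked at once with plain Python sets:

```python
def compare_ids(expected, found):
    """Return (missing, unexpected) ids as sorted lists.

    missing:    ids we expected but did not find on disk
    unexpected: ids found on disk that were not in the expected list
    """
    expected_set = set(int(i) for i in expected)
    found_set = set(int(i) for i in found)
    missing = sorted(expected_set - found_set)
    unexpected = sorted(found_set - expected_set)
    return missing, unexpected

# toy ids, not taken from the real dataset
missing, unexpected = compare_ids([1, 2, 3, 4], [2, 3, 5])
print(missing, unexpected)
```

Both lists being empty would mean the files on disk match the expected repos exactly.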

## Metadata Content Completeness
Let's see what content we can inspect from the repo metadata.

In [20]:
with open('../data/api_results/repo_metadata/repo_%s.json' % repo_metadata_ids[0]) as f:
    test_file = json.load(f)
    print(test_file.keys())
    print('')
    print(test_file['owner'].keys())

dict_keys(['id', 'name', 'full_name', 'owner', 'private', 'html_url', 'description', 'fork', 'url', 'forks_url', 'keys_url', 'collaborators_url', 'teams_url', 'hooks_url', 'issue_events_url', 'events_url', 'assignees_url', 'branches_url', 'tags_url', 'blobs_url', 'git_tags_url', 'git_refs_url', 'trees_url', 'statuses_url', 'languages_url', 'stargazers_url', 'contributors_url', 'subscribers_url', 'subscription_url', 'commits_url', 'git_commits_url', 'comments_url', 'issue_comment_url', 'contents_url', 'compare_url', 'merges_url', 'archive_url', 'downloads_url', 'issues_url', 'pulls_url', 'milestones_url', 'notifications_url', 'labels_url', 'releases_url', 'deployments_url', 'created_at', 'updated_at', 'pushed_at', 'git_url', 'ssh_url', 'clone_url', 'svn_url', 'homepage', 'size', 'stargazers_count', 'watchers_count', 'language', 'has_issues', 'has_projects', 'has_downloads', 'has_wiki', 'has_pages', 'forks_count', 'mirror_url', 'open_issues_count', 'forks', 'open_issues', 'watchers', 'de

Based on these values, I think we want:
1. id
2. owner id
3. owner login
4. owner type
5. name
6. description
7. private
8. fork
9. html_url
10. language
11. forks_count
12. stargazers_count
13. watchers_count
14. subscribers_count
15. network_count
16. size
17. open_issues_count
18. topics
19. has_issues
20. has_wiki
21. has_pages
22. has_downloads
23. pushed_at
24. created_at
25. updated_at

We can compile this information from the separate .json files into one repo metadata dataframe that we can then save as a CSV file for later analysis.
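
The per-file extraction can be expressed compactly by driving it from lists of field names. This is only a sketch: the helper `extract_fields` and the toy payload are hypothetical, and it assumes every wanted key is present in a successful response (a missing key raises `KeyError`, which the caller can log).

```python
def extract_fields(payload, top_level, owner):
    """Flatten one repo payload into a single-level dict.

    top_level: keys copied as-is from the payload
    owner:     keys copied from payload['owner'] with an 'owner_' prefix
    """
    record = {key: payload[key] for key in top_level}
    for key in owner:
        record['owner_' + key] = payload['owner'][key]
    return record

# toy payload with only a handful of the fields listed above
sample = {
    'id': 1,
    'name': 'demo',
    'fork': False,
    'owner': {'id': 7, 'login': 'someone', 'type': 'User'},
}
record = extract_fields(sample, ['id', 'name', 'fork'], ['id', 'login', 'type'])
print(record)
```

Keeping the field names in lists makes it easy to add or drop a column without touching the loop body.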

In [21]:
all_repos = []

def write_to_log(msg):
    f = '../logs/repo_metadata_cleaning_log.txt'
    log_file = open(f, "a")
    log_file.write(msg + "\n")
    log_file.close()

for i, n in enumerate(repo_metadata_ids):
    
    # keep track of our progress
    if i % 10000 == 0:
        print("%s / %s repo data files processed" % (i, len(repo_metadata_ids)))
    
    with open('../data/api_results/repo_metadata/repo_%s.json' % n, 'r') as json_file:

        try:
            j = json.load(json_file)
            
            repo_stats = {}
            repo_stats['id'] = j['id']
            repo_stats['owner_id'] = j['owner']['id']
            repo_stats['owner_login'] = j['owner']['login']
            repo_stats['owner_type'] = j['owner']['type']
            repo_stats['name'] = j['name']
            repo_stats['description'] = j['description']
            repo_stats['private'] = j['private']
            repo_stats['fork'] = j['fork']
            repo_stats['html_url'] = j['html_url']
            repo_stats['language'] = j['language']
            repo_stats['forks_count'] = j['forks_count']
            repo_stats['stargazers_count'] = j['stargazers_count']
            repo_stats['watchers_count'] = j['watchers_count']
            repo_stats['subscribers_count'] = j['subscribers_count']
            repo_stats['network_count'] = j['network_count']
            repo_stats['size'] = j['size']
            repo_stats['open_issues_count'] = j['open_issues_count']
            repo_stats['has_issues'] = j['has_issues']
            repo_stats['has_wiki'] = j['has_wiki']
            repo_stats['has_pages'] = j['has_pages']
            repo_stats['has_downloads'] = j['has_downloads']
            repo_stats['pushed_at'] = j['pushed_at']
            repo_stats['created_at'] = j['created_at']
            repo_stats['updated_at'] = j['updated_at']
            
            all_repos.append(repo_stats)
            
        except Exception:  # e.g. KeyError on an error payload; still lets Ctrl-C through
            msg = "Repo %s metadata did not process" % n
            write_to_log(msg)
                            

print("")            
print("We have %s repos" % len(all_repos))

0 / 193616 repo data files processed
10000 / 193616 repo data files processed
20000 / 193616 repo data files processed
30000 / 193616 repo data files processed
40000 / 193616 repo data files processed
50000 / 193616 repo data files processed
60000 / 193616 repo data files processed
70000 / 193616 repo data files processed
80000 / 193616 repo data files processed
90000 / 193616 repo data files processed
100000 / 193616 repo data files processed
110000 / 193616 repo data files processed
120000 / 193616 repo data files processed
130000 / 193616 repo data files processed
140000 / 193616 repo data files processed
150000 / 193616 repo data files processed
160000 / 193616 repo data files processed
170000 / 193616 repo data files processed
180000 / 193616 repo data files processed
190000 / 193616 repo data files processed

We have 193026 repos


Okay, so it looks like we could not process 590 repos (193616 - 193026). Let's get a list of those.

In [23]:
missing_metadata_ids = []

with open('../logs/repo_metadata_cleaning_log.txt', 'r') as f:
    for l in f:
        missing_metadata_ids.append(int(l.split(' ')[1]))
        
len(missing_metadata_ids)

1180
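
Note that 1180 is exactly twice the 590 failures computed above. That would be consistent with the log file, which is opened in append mode, having been written by two runs of the processing cell, though this is an inference rather than something the log confirms. A minimal sketch, assuming that explanation, of parsing the log while keeping each id only once (the helper `read_logged_ids` and the toy lines are hypothetical):

```python
def read_logged_ids(lines):
    """Parse 'Repo <id> ...' log lines into ids, keeping first occurrences only."""
    seen = set()
    ids = []
    for line in lines:
        parts = line.split(' ')
        if len(parts) < 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        repo_id = int(parts[1])
        if repo_id not in seen:
            seen.add(repo_id)
            ids.append(repo_id)
    return ids

# toy log contents, as if the same two failures were logged by two runs
log_lines = [
    "Repo 101 metadata did not process\n",
    "Repo 202 metadata did not process\n",
    "Repo 101 metadata did not process\n",
    "Repo 202 metadata did not process\n",
]
unique_ids = read_logged_ids(log_lines)
print(len(log_lines), len(unique_ids))
```

Duplicates do not change the result of the `isin` filter used later, but they do inflate any count taken directly from the list.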

In [24]:
with open('../data/api_results/repo_metadata/repo_%s.json' % missing_metadata_ids[400], 'r') as f:
    for l in f:
        print(l)

{"message": "Not Found", "documentation_url": "https://developer.github.com/v3"}


Yep, it looks like we are just missing these. Let me spot check by looking up one of the repos.

In [25]:
missing_repo_example = df_nbs[df_nbs.repo_id == missing_metadata_ids[100]]

In [26]:
missing_repo_example.repo_html_url

1007851    https://github.com/gdmarmerola/hyper-optim-thesis
1007852    https://github.com/gdmarmerola/hyper-optim-thesis
Name: repo_html_url, dtype: object

Yep, this link is broken. Let's see how many notebooks in total we are missing if we exclude these 590 repositories.

In [27]:
df_missing_repo_nbs = df_nbs[df_nbs.repo_id.isin(missing_metadata_ids)]
df_missing_repo_nbs.shape

(4161, 16)

Bummer, that is a fair number of notebooks, though not as bad as I feared it could be. We will not analyze these files, as it seems the repositories were renamed, moved, or deleted between our initial notebook query and this follow-up query to get repository metadata.
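
The failed file inspected above holds an error body rather than repo data. As a sketch, assuming all failures share that shape, the error case could be detected explicitly instead of being inferred from an exception. The helper name `is_api_error` and its `required_key` parameter are hypothetical:

```python
import json

def is_api_error(payload, required_key):
    """True if the payload looks like an error body rather than real data.

    An error body is assumed to carry a 'message' key and to lack the key
    that every real record has (for example 'id' for repo metadata).
    """
    return (
        isinstance(payload, dict)
        and 'message' in payload
        and required_key not in payload
    )

error_body = json.loads(
    '{"message": "Not Found", "documentation_url": "https://developer.github.com/v3"}'
)
repo_body = {'id': 1, 'name': 'demo'}  # toy record

print(is_api_error(error_body, 'id'), is_api_error(repo_body, 'id'))
```

Logging the `message` value alongside the repo id would also distinguish "Not Found" from other kinds of failure, if any exist in the data.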

Now I should do the same pipeline for the readme files to see what data is missing.

## Readme Content Completeness
Let's see what content we can load.

In [38]:
with open('../data/readmes/readme_%s.json' % repo_readme_ids[2]) as f:
    test_file = json.load(f)
    print(test_file.keys())

dict_keys(['name', 'path', 'sha', 'size', 'url', 'html_url', 'git_url', 'download_url', 'type', 'content', 'encoding', '_links'])


From this list I think we want:
1. name
2. path
3. size
4. html_url
5. type
6. content (base64-encoded)
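
Since the content field is base64-encoded, it needs decoding before the readme text can be analyzed. A minimal sketch using only the standard library; the helper `decode_readme` is hypothetical, and it assumes the readmes are UTF-8 text (undecodable bytes are replaced rather than raising):

```python
import base64

def decode_readme(content):
    """Decode a base64 string into text."""
    # With the default validate=False, b64decode discards characters outside
    # the base64 alphabet, so line breaks inside the encoded text are harmless.
    raw = base64.b64decode(content)
    return raw.decode('utf-8', errors='replace')

# toy readme, encoded and then wrapped with a line break
encoded = base64.b64encode('# Demo\nHello'.encode('utf-8')).decode('ascii')
wrapped = encoded[:8] + '\n' + encoded[8:]
text = decode_readme(wrapped)
print(text)
```

Decoding could happen here or in a later notebook; storing the encoded string in the CSV, as this notebook does, avoids quoting problems with multi-line text.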

In [39]:
all_readmes = []

def write_to_log(msg):
    f = '../logs/repo_readme_cleaning_log.txt'
    log_file = open(f, "a")
    log_file.write(msg + "\n")
    log_file.close()

for i, n in enumerate(repo_readme_ids):
    
    # keep track of our progress
    if i % 10000 == 0:
        print("%s / %s repo readme files processed" % (i, len(repo_readme_ids)))
    
    with open('../data/readmes/readme_%s.json' % n, 'r') as json_file:

        try:
            j = json.load(json_file)
            
            readme_stats = {}
            readme_stats['repo_id'] = n
            readme_stats['name'] = j['name']
            readme_stats['path'] = j['path']
            readme_stats['size'] = j['size']
            readme_stats['html_url'] = j['html_url']
            readme_stats['type'] = j['type']
            readme_stats['content'] = j['content']
            
            all_readmes.append(readme_stats)
            
        except Exception:  # e.g. KeyError on an error payload; still lets Ctrl-C through
            msg = "Repo %s readme did not process" % n
            write_to_log(msg)
                            

print("")            
print("We have %s notebook readmes"% len(all_readmes))

0 / 193616 repo readme files processed
10000 / 193616 repo readme files processed
20000 / 193616 repo readme files processed
30000 / 193616 repo readme files processed
40000 / 193616 repo readme files processed
50000 / 193616 repo readme files processed
60000 / 193616 repo readme files processed
70000 / 193616 repo readme files processed
80000 / 193616 repo readme files processed
90000 / 193616 repo readme files processed
100000 / 193616 repo readme files processed
110000 / 193616 repo readme files processed
120000 / 193616 repo readme files processed
130000 / 193616 repo readme files processed
140000 / 193616 repo readme files processed
150000 / 193616 repo readme files processed
160000 / 193616 repo readme files processed
170000 / 193616 repo readme files processed
180000 / 193616 repo readme files processed
190000 / 193616 repo readme files processed

We have 142449 notebook readmes


In [40]:
missing_readme_ids = []

with open('../logs/repo_readme_cleaning_log.txt', 'r') as f:
    for l in f:
        missing_readme_ids.append(int(l.split(' ')[1]))
        
len(missing_readme_ids)

102334

Hmm, this is a much larger number of missing readmes. Let's take a closer look at these 51167 missing readmes (193616 - 142449; the log count of 102334 is exactly double that).

In [41]:
with open('../data/readmes/readme_%s.json' % missing_readme_ids[3000], 'r') as f:
    for l in f:
        print(l)

{"message": "Not Found", "documentation_url": "https://developer.github.com/v3"}


So it looks like we've got quite a few missing readmes. Let's check whether this is because the repo itself is no longer there, or because the repo does not actually have a readme.

In [42]:
missing_readme_example = df_nbs[df_nbs.repo_id == missing_readme_ids[1000]]
missing_readme_example.repo_html_url

169641     https://github.com/daveorstu/SFU-bootcamp-backup
411614     https://github.com/daveorstu/SFU-bootcamp-backup
1110447    https://github.com/daveorstu/SFU-bootcamp-backup
1229011    https://github.com/daveorstu/SFU-bootcamp-backup
Name: repo_html_url, dtype: object

Just spot checking a few examples, it looks like many of these are because the repo did not actually have a readme. That is encouraging, but we won't be able to tell exactly how many of these missing readmes are due to the repo not having a readme and how many are because the repo was taken down between our downloads.
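
A rough split may still be possible from the two id lists built in this notebook: a repo whose metadata request also failed was probably gone, while one with metadata but no readme probably never had one. This is only a sketch, since the two downloads happened at different times; the helper `split_missing_readmes` and the toy ids are hypothetical.

```python
def split_missing_readmes(missing_readme_ids, missing_metadata_ids):
    """Split missing-readme ids by whether the repo metadata was missing too."""
    readme_missing = set(missing_readme_ids)      # sets also drop duplicate ids
    metadata_missing = set(missing_metadata_ids)
    repo_probably_gone = sorted(readme_missing & metadata_missing)
    probably_no_readme = sorted(readme_missing - metadata_missing)
    return repo_probably_gone, probably_no_readme

# toy ids, not taken from the real dataset
gone, no_readme = split_missing_readmes([10, 20, 30, 30], [20, 99])
print(gone, no_readme)
```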

## Save CSVs

In [29]:
df_repo_metadata = pd.DataFrame(all_repos)
df_readmes = pd.DataFrame(all_readmes)

In [30]:
df_repo_metadata.to_csv('../data/csv/repo_metadata.csv')
df_readmes.to_csv('../data/csv/repo_readme.csv')
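
One thing to be aware of when these files are read back: `to_csv` writes the dataframe index as an unnamed first column by default, which `read_csv` then loads as `Unnamed: 0` (the column renamed to `nb_id` at the top of this notebook). The sketch below, on a toy dataframe and an in-memory buffer, shows the difference `index=False` makes; whether to drop the index is a judgment call, since later notebooks may rely on that column.

```python
import io
import pandas as pd

df = pd.DataFrame([{'id': 1, 'name': 'demo'}, {'id': 2, 'name': 'other'}])

# default: the index is written as an extra, unnamed column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the real columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```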

And that's a wrap. We now have the core data about our nbs and their repos in CSV files. We can now build on these by [calculating more stats about each notebook](6_compute_nb_data.ipynb).