# Is Software Updated?

This notebook looks at references to GitHub URLs in papers available in the OA corpus from EuroPMC,
and identifies:

  * How many times the GitHub repositories have been updated since the paper referencing them was published
  
Note that at present we do not distinguish between URLs referencing software *created* by the paper authors
and software merely *used* by them, nor do we identify which software was created as a result of the work in the paper.

In [None]:
import pandas
import json
from datetime import datetime, timedelta
from collections import defaultdict
from github import Github
import numpy as np
import matplotlib.pyplot as plt

import process_eupmc
import process_urls

## File locations

In [None]:
# Directory containing the data
data_dir = '../data'

# File containing the list of matching papers
matching_papers = data_dir + '/' + 'eupmc_fulltext_html_urls.txt'

# File for the output
output_jsonfile = data_dir + '/' + 'dict_of_papers.json'

# File containing the GitHub token
gh_token = '../secrets/github_token'

In [None]:
with open(gh_token, 'r') as f:
    github_token = f.read().rstrip()

## Use getpapers to download fulltext of papers

We currently do this outside of the notebook, and assume that the files are available locally.

The command we are using is:

>getpapers --query 'github' -x --limit 100 -o data

which queries EuPMC for all papers containing the term 'github' and downloads the full text of the first 100 matching papers into the directory 'data'.
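
If this step were moved into the notebook, the same command could be assembled and run from Python. The following is only a sketch: it assumes `getpapers` is installed and on the `PATH`, the helper name `build_getpapers_command` is hypothetical, and only the flags shown in the command above are used.

```python
import shlex
import subprocess


def build_getpapers_command(query, limit, outdir):
    """Build the argument list for the getpapers call described above."""
    return ['getpapers', '--query', query, '-x', '--limit', str(limit), '-o', outdir]


cmd = build_getpapers_command('github', 100, 'data')
# Show the command as it would be typed in a shell
print(shlex.join(cmd))

# Uncomment to run the download (assumes getpapers is installed):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if the query ever contains spaces.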

## Textmine each paper

In [None]:
# Get the list of subdirectories dumped by ContentMine
paper_ids = process_eupmc.get_pmcids(matching_papers)

In [None]:
# Process the papers and extract all the references to GitHub and Zenodo URLs
papers_info = process_eupmc.process_papers(paper_ids, data_dir)
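
The GitHub URLs collected at this stage may point at blobs, issues or the GitHub home page rather than at a repository root, so the analysis below reduces them to `owner/repo` names. A minimal sketch of that reduction follows. `repo_from_url` is a hypothetical helper, not part of `process_eupmc`, and it assumes URLs that include the scheme, of the form `https://github.com/owner/repo/...`. The example URLs are made up.

```python
from urllib.parse import urlparse


def repo_from_url(url):
    """Return 'owner/repo' for a GitHub URL, or None if no repository is named."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ('github.com', 'www.github.com'):
        return None
    # Drop empty segments caused by leading, trailing or doubled slashes
    parts = [part for part in parsed.path.split('/') if part]
    if len(parts) < 2:
        return None
    owner, repo = parts[0], parts[1]
    if repo.endswith('.git'):
        repo = repo[:-len('.git')]
    return owner + '/' + repo


urls = [
    'https://github.com/',
    'https://github.com/example-owner/example-repo',
    'https://github.com/example-owner/example-repo/issues/1',
    'https://github.com/example-owner/example-repo/blob/master/README.md',
]

# Collect each repository once, preserving first-seen order
repos = []
for url in urls:
    name = repo_from_url(url)
    if name is not None and name not in repos:
        repos.append(name)
print(repos)
```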

## Analyse GitHub repos to see frequency of commits

In [None]:
g = Github(github_token)
number_of_updates = defaultdict(int)
frequency_of_updates = defaultdict(int)

for p in papers_info:

    repos = []
    # The following skips references to the main github.com site
    # and treats references to blobs / issues as references to the repo
    for gh_url in p.references['github']:
        # e.g. ['https:', '', 'github.com', owner, repo, ...]
        words = gh_url.split('/')
        if len(words) > 4 and words[4]:
            reponame = words[3] + '/' + words[4]
            if reponame not in repos:
                repos.append(reponame)
    
    for repo in repos:
        print("Processing: ", repo)
        code = g.get_repo(repo)
        # Limit to commits made since the publication date
        since = datetime.strptime(p.pub_date, '%Y-%m-%d')
        days = (datetime.now() - since).days
        # Commits are listed newest first, so stop at the first one
        # that is not later than the publication date
        num_commits = 0
        for commit in code.get_commits():
            # Drop any timezone info so the comparison with `since` is valid
            if commit.commit.author.date.replace(tzinfo=None) <= since:
                break
            num_commits += 1
        print("Number of commits since publication: ", num_commits)
        # Avoid dividing by zero for papers published today
        commit_freq = num_commits / days if days > 0 else 0.0
        print("Commit frequency: ", commit_freq, "commits/day since publication")
        number_of_updates[num_commits] += 1
        # I'm using the magic number 100 until I get a sense of the correct bins to use
        frequency_of_updates[int(100 * commit_freq)] += 1
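
The counting step can be checked without network access by separating it from the GitHub calls. The sketch below assumes commit dates arrive newest first, as the loop above does. `commits_since` and `commit_frequency` are hypothetical helpers, and the dates are invented for illustration.

```python
from datetime import datetime


def commits_since(commit_dates, since):
    """Count leading dates later than `since`; expects newest-first order."""
    count = 0
    for date in commit_dates:
        if date <= since:
            break
        count += 1
    return count


def commit_frequency(num_commits, since, now):
    """Commits per day between `since` and `now`; 0.0 if less than a day has passed."""
    days = (now - since).days
    return num_commits / days if days > 0 else 0.0


pub_date = datetime.strptime('2016-01-01', '%Y-%m-%d')
# Made-up commit dates, newest first
dates = [datetime(2016, 3, 1), datetime(2016, 2, 1), datetime(2015, 12, 1)]

n = commits_since(dates, pub_date)
freq = commit_frequency(n, pub_date, now=datetime(2016, 4, 10))
print(n, freq)
```

Iterating over the dates rather than indexing into them means a repository whose commits are all later than the publication date simply runs to the end of the list instead of raising an `IndexError`.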

## Plotting the results

We use a defaultdict so that entries with no data default to zero.
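
As a minimal sketch of why this helps, the example below tallies some made-up commit counts in a `defaultdict(int)` and then reads every bin in a range, including bins that were never written. The name `number_of_updates` mirrors the variable used above, but the data are invented for illustration.

```python
from collections import defaultdict

number_of_updates = defaultdict(int)
for num_commits in [0, 0, 3, 5, 3]:
    number_of_updates[num_commits] += 1

# Fix the range of bins first, then look each one up;
# bins that were never incremented read back as 0
bins = list(range(max(number_of_updates) + 1))
counts = [number_of_updates[b] for b in bins]
print(bins)
print(counts)
```

The two lists have equal length and could be passed directly to `plt.bar(bins, counts)`.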