# Analyzing the RepoReapers dataset

[Home page](https://reporeapers.github.io/results/1.html) of the dataset.

Import libraries

In [1]:
import configparser
import json
import requests
import pandas as pd

from github import Github
from datetime import datetime, timedelta

Import RepoReaper dataset

In [2]:
df = pd.read_csv('../data/reporeaper.csv', header=0, sep=",", dtype={"stars": object})
print(df.shape)
df.head()

(1853195, 16)


| | repository | language | architecture | community | continuous_integration | documentation | history | issues | license | size | unit_test | stars | scorebased_org | randomforest_org | scorebased_utl | randomforest_utl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | matplotlib/matplotlib.github.com | Python | 0.770463 | 2 | 0 | 0.014931 | 2.297872 | 0.212766 | 0 | 1575488 | 0.013242 | 5 | 1 | 1 | 1 | 1 |
| 1 | NCIP/c3pr-docs | Java | 0.997449 | 3 | 0 | 0.087444 | 1.434211 | 0.0 | 0 | 765164 | 0.0 | 0 | 0 | 0 | 1 | 0 |
| 2 | AnXgotta/Sur | C++ | 0.714286 | 1 | 0 | 0.123698 | 0.0 | 0.0 | 0 | 2155 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 3 | bigloupe/SoS-JobScheduler | Java | 0.957573 | 3 | 1 | 0.315557 | 11.428571 | 0.0 | 1 | 657960 | 0.007257 | 1 | 1 | 0 | 1 | 0 |
| 4 | barons/zf_shop | Ruby | 0.381323 | 3 | 0 | 0.327179 | 0.0 | 0.0 | 1 | 472610 | 0.055335 | 0 | 1 | 0 | 1 | 1 |


In [3]:
df.loc[:, 'scorebased_org':'randomforest_utl'].sum()

scorebased_org       200336
randomforest_org     111106
scorebased_utl      1288683
randomforest_utl     446511
dtype: int64

Kinsman et al. used the subset `randomforest_utl` in their study. This subset comprises the repositories that a RandomForest classifier labeled as containing an *engineered software project*. For this subset, the classifier was trained on the "Utility" dataset, which was built according to the following definition of a *repository containing an engineered software project*:

> A repository is said to contain an engineered software project if it is similar to repositories that have a fairly general-purpose utility to users other than the developers themselves. For instance, a repository containing a Chrome plug-in is considered to have a general-purpose utility, however, a repository containing a mobile application developed by a student as a course project may not be considered to have a general-purpose utility.

First, I drop all rows that have no information on the number of stars (stored as the string `"None"`) and convert the column to integers:

In [4]:
df.drop(df.index[df["stars"] == "None"], inplace=True)
df["stars"] = df["stars"].astype(int)
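The same cleaning step can also be written without mutating the original frame. The following is only a minimal sketch on a made-up three-row sample: the helper name `clean_stars` and the sample data are hypothetical, and it assumes that every non-numeric entry in `stars` (not only the string `"None"`) should be discarded.

```python
import pandas as pd


def clean_stars(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` without rows whose 'stars' value is not numeric."""
    out = frame.copy()
    # Non-numeric entries, such as the string "None", become NaN.
    out["stars"] = pd.to_numeric(out["stars"], errors="coerce")
    # Drop the NaN rows, then store the remaining values as integers.
    return out.dropna(subset=["stars"]).astype({"stars": int})


# Hypothetical sample, not taken from the dataset.
sample = pd.DataFrame(
    {"repository": ["a/a", "b/b", "c/c"], "stars": ["5", "None", "0"]}
)
cleaned = clean_stars(sample)
print(cleaned)
```

Unlike the `inplace=True` variant, this leaves `sample` untouched, which makes it easier to re-run the cell.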

Then I try different filtering schemes:

In [5]:
engineered = df['randomforest_utl'] == 1
at_least_2_stars = df['stars'] > 1
at_least_2_core_contributors = df['community'] > 1

- `engineered` + `stars > 1` + `contributors > 1`

In [6]:
df.loc[engineered & at_least_2_stars & at_least_2_core_contributors, :].shape

(70820, 16)

- `engineered` + `contributors > 1`

In [7]:
df.loc[engineered & at_least_2_core_contributors, :].shape

(144692, 16)

- `engineered` + `stars > 1`

In [8]:
df.loc[engineered & at_least_2_stars, :].shape

(183127, 16)

- `stars > 1` + `contributors > 1`

In [9]:
df.loc[at_least_2_stars & at_least_2_core_contributors, :].shape

(93255, 16)
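The four counts above could also be produced in a single pass. The sketch below is one possible way to do it, shown on a hypothetical four-row frame rather than on the real dataset; the helper name `count_filter_combinations` and the toy values are assumptions made for illustration.

```python
from itertools import combinations

import pandas as pd


def count_filter_combinations(frame, masks):
    """Map every combination of two or more named boolean masks to its row count."""
    counts = {}
    for size in range(2, len(masks) + 1):
        for names in combinations(masks, size):
            combined = pd.Series(True, index=frame.index)
            for name in names:
                combined = combined & masks[name]
            counts[" + ".join(names)] = int(combined.sum())
    return counts


# Hypothetical toy data using the same column names as the dataset.
toy = pd.DataFrame(
    {
        "randomforest_utl": [1, 1, 0, 1],
        "stars": [5, 0, 3, 2],
        "community": [2, 3, 2, 1],
    }
)
toy_masks = {
    "engineered": toy["randomforest_utl"] == 1,
    "stars > 1": toy["stars"] > 1,
    "contributors > 1": toy["community"] > 1,
}
counts = count_filter_combinations(toy, toy_masks)
for label, n_rows in counts.items():
    print(f"{label}: {n_rows}")
```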

---

# Filtering ideas from _"The Promises and Perils of Mining GitHub"_

In [10]:
config = configparser.ConfigParser()
config.read("../env.ini")
token_list = json.loads(config["GITHUB"]["TOKEN_LIST"])
token = token_list[0]

g = Github(token)

In [11]:
repo = g.get_repo("collab-uniba/behaviz_frontend")

## Number of commits

In [12]:
repo.get_commits().totalCount

52

## Life-span of the project

*Date of repo creation* vs. *date of last commit*

In [13]:
date_of_repo_creation = repo.created_at
print(date_of_repo_creation)

2021-04-12 13:23:07


In [14]:
commits = repo.get_commits()

In [15]:
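# get_commits() lists the most recent commit first
last_commit = repo.get_commit(sha=commits[0].sha)
# last_modified holds the HTTP Last-Modified response header as a string
last_commit_date = last_commit.last_modified
print(last_commit_date)
last_commit_date = datetime.strptime(last_commit_date, "%a, %d %b %Y %X GMT")
print(last_commit_date)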

Sat, 01 Jan 2022 22:58:45 GMT
2022-01-01 22:58:45
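Note that `%a`, `%b`, and `%X` in `strptime` depend on the current locale, so the format string above can fail on a machine with a non-English locale. A possible locale-independent alternative, sketched here with the standard library only, is `email.utils.parsedate_to_datetime`; the helper name `parse_http_date` is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_http_date(value: str) -> datetime:
    """Parse an HTTP-style date string into a naive datetime expressed in UTC."""
    parsed = parsedate_to_datetime(value)
    if parsed.tzinfo is not None:
        # Normalize to UTC, then drop the tzinfo so the result can be
        # compared with other naive datetimes.
        parsed = parsed.astimezone(timezone.utc).replace(tzinfo=None)
    return parsed


print(parse_http_date("Sat, 01 Jan 2022 22:58:45 GMT"))  # 2022-01-01 22:58:45
```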


In [16]:
first_commit = repo.get_commit(sha=commits.reversed[0].sha)
first_commit_date = first_commit.last_modified
print(first_commit_date)
first_commit_date = datetime.strptime(first_commit_date, "%a, %d %b %Y %X GMT")
print(first_commit_date)

Wed, 14 Apr 2021 14:10:32 GMT
2021-04-14 14:10:32


In [17]:
repo_lifetime = last_commit_date - date_of_repo_creation
print(repo_lifetime)

264 days, 9:35:38
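Depending on the PyGithub version, `created_at` may be a timezone-aware datetime, in which case subtracting it from a naive datetime raises a `TypeError`. The sketch below shows one way to guard against that; the helper names `to_naive_utc` and `lifespan` are hypothetical, and naive inputs are assumed to already be in UTC. It also computes the alternative life-span from the first commit to the last commit, using the dates printed above.

```python
from datetime import datetime, timedelta, timezone


def to_naive_utc(moment: datetime) -> datetime:
    """Convert an aware datetime to naive UTC; naive input is assumed to be UTC."""
    if moment.tzinfo is None:
        return moment
    return moment.astimezone(timezone.utc).replace(tzinfo=None)


def lifespan(start: datetime, end: datetime) -> timedelta:
    """Time elapsed between two moments, regardless of whether they carry tzinfo."""
    return to_naive_utc(end) - to_naive_utc(start)


created = datetime(2021, 4, 12, 13, 23, 7, tzinfo=timezone.utc)  # aware
first_commit = datetime(2021, 4, 14, 14, 10, 32)                 # naive
last_commit = datetime(2022, 1, 1, 22, 58, 45)                   # naive

print(lifespan(created, last_commit))       # 264 days, 9:35:38
print(lifespan(first_commit, last_commit))  # 262 days, 8:48:13
```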


## Trivial repositories

I.e., repositories containing only a `README`, a `.gitignore`, or a `LICENSE`.

In [18]:
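# File names that, on their own, do not make a repository non-trivial
basic_files = {'README.md', '.gitignore', 'LICENSE'}
# Also accept the names of the licenses known to GitHub
license_names = {lic['name'] for lic in requests.get("https://api.github.com/licenses").json()}
basic_files = basic_files.union(license_names)

# Top-level entries that are not basic files; a non-empty set means the repository is not trivial
difference = {file.path for file in repo.get_contents("")}.difference(basic_files)
len(difference) != 0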

True
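The check can be isolated from the GitHub API so that it is easy to try on plain lists of paths. This is only a sketch: the function name `is_trivial` and the example paths are hypothetical, and an empty repository is treated as trivial, which is an assumption rather than something stated in the paper.

```python
DEFAULT_BASIC_FILES = frozenset({"README.md", ".gitignore", "LICENSE"})


def is_trivial(paths, basic_files=DEFAULT_BASIC_FILES):
    """Return True if every top-level path is one of the basic files.

    An empty list of paths also counts as trivial (assumption).
    """
    return set(paths) <= set(basic_files)


# Hypothetical top-level listings.
print(is_trivial(["README.md", "LICENSE"]))       # True
print(is_trivial(["README.md", "src", "setup.py"]))  # False
```

With PyGithub, the `paths` argument could be built as `[file.path for file in repo.get_contents("")]`, as in the cell above.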

## Last commit N months past the public release of GitHub Actions

In [19]:
GITHUB_ACTIONS_RELEASE_DATE = datetime(2019, 11, 1)
print(GITHUB_ACTIONS_RELEASE_DATE)

2019-11-01 00:00:00


In [20]:
months = 6
days_in_a_month = 30
OFFSET = timedelta(days=days_in_a_month * months)

OFFSET

datetime.timedelta(days=180)

In [21]:
time_past_GHA_release = last_commit_date - GITHUB_ACTIONS_RELEASE_DATE
time_past_GHA_release > OFFSET

True
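The cell above approximates a month as 30 days, so six months after 2019-11-01 becomes 2020-04-29. If calendar months are preferred, a possible variant is sketched below using only the standard library; the helper names `add_months` and `committed_after` are hypothetical.

```python
import calendar
from datetime import datetime


def add_months(moment: datetime, months: int) -> datetime:
    """Shift a datetime by whole calendar months, clamping the day if needed."""
    index = moment.month - 1 + months
    year = moment.year + index // 12
    month = index % 12 + 1
    # Clamp, e.g., 31 January + 1 month to the last day of February.
    day = min(moment.day, calendar.monthrange(year, month)[1])
    return moment.replace(year=year, month=month, day=day)


def committed_after(last_commit_date, release_date, months):
    """True if the last commit is more than `months` calendar months past the release."""
    return last_commit_date > add_months(release_date, months)


release = datetime(2019, 11, 1)
print(add_months(release, 6))                                           # 2020-05-01 00:00:00
print(committed_after(datetime(2022, 1, 1, 22, 58, 45), release, 6))    # True
print(committed_after(datetime(2020, 4, 30), release, 6))               # False
```

The last call shows where the two approaches disagree: 2020-04-30 is past the 180-day offset but not past six calendar months.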

---

# `get_workflows()`

In [22]:
list(repo.get_workflows())

[Workflow(url="https://api.github.com/repos/collab-uniba/behaviz_frontend/actions/workflows/11725727", name="DevelopBuild"),
 Workflow(url="https://api.github.com/repos/collab-uniba/behaviz_frontend/actions/workflows/11732063", name="ReleaseBuild")]

---