# Project Git Metrics for Landscape Analysis

Git metrics for projects in a software landscape analysis related to the Cytomining ecosystem.

## Setup

Set an environment variable named `LANDSCAPE_ANALYSIS_GH_TOKEN` to a [GitHub access token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens) before running the notebook. For example: `export LANDSCAPE_ANALYSIS_GH_TOKEN=token_here`
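If the variable is missing, the failure can be hard to read once it surfaces inside the GitHub client. The following is a minimal sketch of an early check, assuming a hypothetical helper named `read_github_token` (it is not part of this notebook or of PyGithub):

```python
import os


def read_github_token(env=os.environ, name="LANDSCAPE_ANALYSIS_GH_TOKEN"):
    """Return the token stored under `name`, or raise a readable error.

    `read_github_token` is a hypothetical helper used for illustration only.
    """
    # treat a missing variable and a blank variable the same way
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "export it before running this notebook."
        )
    return token


# usage with an explicit mapping, so the sketch runs without a real token
print(read_github_token({"LANDSCAPE_ANALYSIS_GH_TOKEN": " token_here "}))
```

Passing the mapping as an argument keeps the check easy to try out without touching the real environment.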

In [1]:
import os
from datetime import datetime

import pandas as pd
import pytz
from box import Box
from github import Auth, Github

# set github authorization and client
github_client = Github(
    auth=Auth.Token(os.environ.get("LANDSCAPE_ANALYSIS_GH_TOKEN")), per_page=100
)
# get the current datetime
tz = pytz.timezone("UTC")
current_datetime = datetime.now(tz)

In [2]:
# gather projects data
projects = Box.from_yaml(filename="data/projects.yaml").projects[:10]

# check the number of projects
print("number of projects: ", len(projects))

number of projects:  10


In [3]:
# show the keys available for the projects
projects[0].keys()

dict_keys(['category', 'homepage_url', 'name', 'repo_url'])
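The `repo_url` key is what later identifies each repository to the GitHub API, which expects an `owner/name` string rather than a full URL. As a hedged sketch, a small parser can tolerate trailing slashes or a `.git` suffix; the helper name `repo_full_name` is an assumption made for this example and does not appear elsewhere in the notebook:

```python
from urllib.parse import urlparse


def repo_full_name(repo_url: str) -> str:
    """Convert a GitHub repository URL into the 'owner/name' form.

    Hypothetical helper for illustration; raises ValueError for URLs
    which do not look like https://github.com/<owner>/<name>.
    """
    parsed = urlparse(repo_url)
    if parsed.netloc.lower() != "github.com":
        raise ValueError(f"not a github.com URL: {repo_url}")

    # drop empty segments produced by leading or trailing slashes
    parts = [part for part in parsed.path.split("/") if part]
    if len(parts) < 2:
        raise ValueError(f"missing owner or repository name: {repo_url}")

    owner, name = parts[0], parts[1]
    # clone URLs sometimes carry a .git suffix
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return f"{owner}/{name}"


print(repo_full_name("https://github.com/cytomining/pycytominer"))
# cytomining/pycytominer
```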

In [4]:
def try_to_detect_license(repo):
    """
    Tries to detect the license (as an SPDX identifier) from the GitHub API,
    returning None if no license can be retrieved.
    """

    try:
        return repo.get_license().license.spdx_id
    except Exception:
        return None

In [5]:
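def try_to_gather_commit_count(repo):
    """
    Tries to gather the commit count of a repo from the GitHub API,
    returning 0 if the count cannot be retrieved.
    """

    try:
        # ask the API for the total rather than paging through every commit
        return repo.get_commits().totalCount
    except Exception:
        return 0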

In [6]:
def try_to_gather_most_recent_commit_date(repo):
    """
    Tries to detect the most recent commit date of a repo from the GitHub API,
    using the repo's last push time (`pushed_at`) as a proxy.
    """

    try:
        return repo.pushed_at.replace(tzinfo=pytz.UTC)
    except Exception:
        return None
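The three helpers above share one pattern: attempt an API call and fall back to a default when it fails. A minimal sketch of that pattern as a single function follows; `try_or_default` is a hypothetical name introduced here, not something the notebook or PyGithub provides:

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def try_or_default(func: Callable[[], T], default: Optional[T] = None) -> Optional[T]:
    """Call `func` and return its result, or `default` if it raises.

    Hypothetical helper for illustration; it catches Exception rather than
    using a bare `except`, so KeyboardInterrupt and SystemExit still propagate.
    """
    try:
        return func()
    except Exception:
        return default


# stand-ins for API calls: one succeeds, one fails
print(try_or_default(lambda: "BSD-3-Clause"))
print(try_or_default(lambda: 1 / 0, default=0))
```

With a repository object in hand, the license lookup could then be written as `try_or_default(lambda: repo.get_license().license.spdx_id)`.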

In [7]:
df_projects = pd.DataFrame(
    # create a list of repo data records for a dataframe
    [
        {
            "Project Name": repo.name,
            "Project Homepage": repo.homepage,
            "Project Repo URL": repo.html_url,
            "GitHub Stars": repo.stargazers_count,
            "GitHub Forks": repo.forks_count,
            "GitHub Watchers": repo.subscribers_count,
            "GitHub Open Issues": repo.get_issues(state="open").totalCount,
            "GitHub Contributors": repo.get_contributors().totalCount,
            "GitHub License Type": try_to_detect_license(repo),
            "GitHub Description": repo.description,
            "GitHub Tags": [tag.name for tag in repo.get_tags()],
            "GitHub Detected Languages": repo.get_languages(),
            "Date Created": repo.created_at.replace(tzinfo=pytz.UTC),
            "Date Most Recent Commit": try_to_gather_most_recent_commit_date(repo),
            "Duration Created to Most Recent Commit": "",
            "Duration Most Recent Commit to Now": "",
            "Repository Size (KB)": repo.size,
            "GitHub Repo Archived": repo.archived,
        }
        # make a request for github repo data with pygithub
        for repo in [
            github_client.get_repo(project.repo_url.replace("https://github.com/", ""))
            for project in projects
        ]
    ]
)

# calculate time deltas
df_projects["Duration Created to Most Recent Commit"] = (
    df_projects["Date Most Recent Commit"] - df_projects["Date Created"]
)
df_projects["Duration Most Recent Commit to Now"] = (
    current_datetime - df_projects["Date Most Recent Commit"]
)

# show the result
df_projects

,Project Name,Project Homepage,Project Repo URL,GitHub Stars,GitHub Forks,GitHub Watchers,GitHub Open Issues,GitHub Contributors,GitHub License Type,GitHub Description,GitHub Tags,GitHub Detected Languages,Date Created,Date Most Recent Commit,Duration Created to Most Recent Commit,Duration Most Recent Commit to Now,Repository Size (KB),GitHub Repo Archived
0,pycytominer,https://pycytominer.readthedocs.io,https://github.com/cytomining/pycytominer,52,32,6,83,22,BSD-3-Clause,Python package for processing image-based prof...,"[v0.2.0, v0.1.5, v0.1]","{'Python': 373578, 'Jupyter Notebook': 16489, ...",2019-07-03 18:22:51+00:00,2023-10-11 13:58:57+00:00,1560 days 19:36:06,1 days 08:37:10.420587,721073,False
1,CytoSnake,https://cytosnake.readthedocs.io,https://github.com/WayScience/CytoSnake,3,3,0,35,3,CC-BY-4.0,Orchestrating high-dimensional cell morphology...,"[v0.0.2, v0.0.1]",{'Python': 132842},2022-02-15 18:02:45+00:00,2023-09-01 23:10:17+00:00,563 days 05:07:32,40 days 23:25:50.420587,780,False
2,CytoTable,https://cytomining.github.io/CytoTable/,https://github.com/cytomining/CytoTable,3,4,4,43,4,BSD-3-Clause,Transform data for processing image-based prof...,"[v0.0.2, v0.0.1]",{'Python': 157082},2022-09-08 15:46:25+00:00,2023-10-11 21:41:55+00:00,398 days 05:55:30,1 days 00:54:12.420587,6817,False
3,IDR_stream,,https://github.com/WayScience/IDR_stream,4,2,2,2,2,BSD-3-Clause,Software for feature extraction from IDR image...,[],"{'Jupyter Notebook': 311010, 'Python': 88583}",2022-08-09 21:16:48+00:00,2023-02-24 22:08:54+00:00,199 days 00:52:06,230 days 00:27:13.420587,37026,False
4,pandas,https://pandas.pydata.org,https://github.com/pandas-dev/pandas,39990,16795,1121,3638,411,BSD-3-Clause,Flexible and powerful data analysis / manipula...,"[v2.2.0dev0, v2.2.0.dev0, v2.1.1, v2.1.0, v2.1...","{'Python': 20312472, 'Cython': 1276599, 'HTML'...",2010-08-24 01:37:33+00:00,2023-10-12 22:34:01+00:00,4797 days 20:56:28,0 days 00:02:06.420587,334958,False
5,numpy,https://numpy.org,https://github.com/numpy/numpy,24715,8640,596,2193,435,BSD-3-Clause,The fundamental package for scientific computi...,"[with_maskna, v2.0.0.dev0, v1.26.0, v1.26.0rc1...","{'Python': 10457640, 'C': 6220070, 'C++': 2057...",2010-09-13 23:02:39+00:00,2023-10-12 21:29:03+00:00,4776 days 22:26:24,0 days 01:07:04.420587,131800,False
6,anndata,http://anndata.readthedocs.io,https://github.com/scverse/anndata,450,138,15,221,43,BSD-3-Clause,Annotated data.,"[0.11.0.dev0, 0.10.2, 0.10.1, 0.10.0, 0.10.0rc...",{'Python': 658028},2017-08-11 14:10:06+00:00,2023-10-12 16:21:47+00:00,2253 days 02:11:41,0 days 06:14:20.420587,4092,False
7,CellProfiler,http://cellprofiler.org,https://github.com/CellProfiler/CellProfiler,802,363,43,270,66,NOASSERTION,An open-source application for biological imag...,"[v4.2.6, v4.2.6rc5, v4.2.6rc4, v4.2.6rc3, v4.2...","{'Python': 5811419, 'HTML': 30638}",2011-04-05 12:10:12+00:00,2023-10-12 12:17:12+00:00,4573 days 00:07:00,0 days 10:18:55.420587,129888,False
8,DeepProfiler,,https://github.com/cytomining/DeepProfiler,83,37,13,27,11,NOASSERTION,Morphological profiling using deep learning,"[v0.5.0, v0.3.1, v0.3.0, v0.2.0, v0.1.0]",{'Python': 233500},2016-11-10 21:33:35+00:00,2023-06-15 13:17:09+00:00,2407 days 15:43:34,119 days 09:18:58.420587,4830,False
9,ImageJ,http://imagej.org,https://github.com/imagej/ImageJ,378,173,46,48,3,NOASSERTION,Public domain software for processing and anal...,"[v1.54f, v1.54e, v1.54d, v1.54c, v1.54b, v1.54...","{'Java': 6070304, 'HTML': 343806, 'ImageJ Macr...",2011-08-01 12:29:11+00:00,2023-10-07 17:10:36+00:00,4450 days 04:41:25,5 days 05:25:31.420587,32411,False
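The duration columns are plain subtractions of timezone-aware datetimes. As a small worked example using only the standard library, the values reported for pycytominer in the table above reproduce its "Duration Created to Most Recent Commit" entry; the sketch also shows why every timestamp is given a UTC timezone first, since Python refuses to subtract naive and aware datetimes:

```python
from datetime import datetime, timezone

# values copied from the pycytominer row above
created = datetime(2019, 7, 3, 18, 22, 51, tzinfo=timezone.utc)
most_recent = datetime(2023, 10, 11, 13, 58, 57, tzinfo=timezone.utc)

delta = most_recent - created
print(delta)  # 1560 days, 19:36:06

# mixing a naive datetime with an aware one raises TypeError
try:
    datetime(2023, 10, 11, 13, 58, 57) - created
    mixing_failed = False
except TypeError:
    mixing_failed = True
print(mixing_failed)  # True
```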


In [8]:
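# filter the results
df_projects = df_projects[
    # keep projects whose repository is at least 50 KB
    # (parentheses are required because `&` binds more tightly than `>=`)
    (df_projects["Repository Size (KB)"] >= 50)
    # keep projects which have not been archived
    & ~df_projects["GitHub Repo Archived"]
    # copy so that later column assignments do not target a slice
].copy()
df_projects.tail()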

,Project Name,Project Homepage,Project Repo URL,GitHub Stars,GitHub Forks,GitHub Watchers,GitHub Open Issues,GitHub Contributors,GitHub License Type,GitHub Description,GitHub Tags,GitHub Detected Languages,Date Created,Date Most Recent Commit,Duration Created to Most Recent Commit,Duration Most Recent Commit to Now,Repository Size (KB),GitHub Repo Archived
5,numpy,https://numpy.org,https://github.com/numpy/numpy,24715,8640,596,2193,435,BSD-3-Clause,The fundamental package for scientific computi...,"[with_maskna, v2.0.0.dev0, v1.26.0, v1.26.0rc1...","{'Python': 10457640, 'C': 6220070, 'C++': 2057...",2010-09-13 23:02:39+00:00,2023-10-12 21:29:03+00:00,4776 days 22:26:24,0 days 01:07:04.420587,131800,False
6,anndata,http://anndata.readthedocs.io,https://github.com/scverse/anndata,450,138,15,221,43,BSD-3-Clause,Annotated data.,"[0.11.0.dev0, 0.10.2, 0.10.1, 0.10.0, 0.10.0rc...",{'Python': 658028},2017-08-11 14:10:06+00:00,2023-10-12 16:21:47+00:00,2253 days 02:11:41,0 days 06:14:20.420587,4092,False
7,CellProfiler,http://cellprofiler.org,https://github.com/CellProfiler/CellProfiler,802,363,43,270,66,NOASSERTION,An open-source application for biological imag...,"[v4.2.6, v4.2.6rc5, v4.2.6rc4, v4.2.6rc3, v4.2...","{'Python': 5811419, 'HTML': 30638}",2011-04-05 12:10:12+00:00,2023-10-12 12:17:12+00:00,4573 days 00:07:00,0 days 10:18:55.420587,129888,False
8,DeepProfiler,,https://github.com/cytomining/DeepProfiler,83,37,13,27,11,NOASSERTION,Morphological profiling using deep learning,"[v0.5.0, v0.3.1, v0.3.0, v0.2.0, v0.1.0]",{'Python': 233500},2016-11-10 21:33:35+00:00,2023-06-15 13:17:09+00:00,2407 days 15:43:34,119 days 09:18:58.420587,4830,False
9,ImageJ,http://imagej.org,https://github.com/imagej/ImageJ,378,173,46,48,3,NOASSERTION,Public domain software for processing and anal...,"[v1.54f, v1.54e, v1.54d, v1.54c, v1.54b, v1.54...","{'Java': 6070304, 'HTML': 343806, 'ImageJ Macr...",2011-08-01 12:29:11+00:00,2023-10-07 17:10:36+00:00,4450 days 04:41:25,5 days 05:25:31.420587,32411,False
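None of the ten projects shown here is small or archived, so the filter leaves the table unchanged. To illustrate what it is meant to remove, here is a minimal sketch on made-up rows (the project names and values are invented for this example). In pandas, `&` binds more tightly than `>=`, so each condition is wrapped in parentheses before the two are combined:

```python
import pandas as pd

# invented rows: one too small, one archived, one which should be kept
toy = pd.DataFrame(
    {
        "Project Name": ["tiny", "archived", "kept"],
        "Repository Size (KB)": [10, 500, 500],
        "GitHub Repo Archived": [False, True, False],
    }
)

# each condition is parenthesized before combining with `&`
mask = (toy["Repository Size (KB)"] >= 50) & (~toy["GitHub Repo Archived"])
filtered = toy[mask].copy()

print(filtered["Project Name"].tolist())  # ['kept']
```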


In [9]:
# negate this duration so that, when sorting in descending order,
# more recently changed projects sort toward the top
df_projects["Negative Duration Most Recent Commit to Now"] = -df_projects[
    "Duration Most Recent Commit to Now"
]
df_projects = df_projects.sort_values(
    by=[
        "GitHub Stars",
        "GitHub Watchers",
        "GitHub Contributors",
        "GitHub Forks",
        "GitHub Open Issues",
        "Negative Duration Most Recent Commit to Now",
        "Duration Created to Most Recent Commit",
    ],
    ascending=False,
)
df_projects

,Project Name,Project Homepage,Project Repo URL,GitHub Stars,GitHub Forks,GitHub Watchers,GitHub Open Issues,GitHub Contributors,GitHub License Type,GitHub Description,GitHub Tags,GitHub Detected Languages,Date Created,Date Most Recent Commit,Duration Created to Most Recent Commit,Duration Most Recent Commit to Now,Repository Size (KB),GitHub Repo Archived,Negative Duration Most Recent Commit to Now
4,pandas,https://pandas.pydata.org,https://github.com/pandas-dev/pandas,39990,16795,1121,3638,411,BSD-3-Clause,Flexible and powerful data analysis / manipula...,"[v2.2.0dev0, v2.2.0.dev0, v2.1.1, v2.1.0, v2.1...","{'Python': 20312472, 'Cython': 1276599, 'HTML'...",2010-08-24 01:37:33+00:00,2023-10-12 22:34:01+00:00,4797 days 20:56:28,0 days 00:02:06.420587,334958,False,-1 days +23:57:53.579413
5,numpy,https://numpy.org,https://github.com/numpy/numpy,24715,8640,596,2193,435,BSD-3-Clause,The fundamental package for scientific computi...,"[with_maskna, v2.0.0.dev0, v1.26.0, v1.26.0rc1...","{'Python': 10457640, 'C': 6220070, 'C++': 2057...",2010-09-13 23:02:39+00:00,2023-10-12 21:29:03+00:00,4776 days 22:26:24,0 days 01:07:04.420587,131800,False,-1 days +22:52:55.579413
7,CellProfiler,http://cellprofiler.org,https://github.com/CellProfiler/CellProfiler,802,363,43,270,66,NOASSERTION,An open-source application for biological imag...,"[v4.2.6, v4.2.6rc5, v4.2.6rc4, v4.2.6rc3, v4.2...","{'Python': 5811419, 'HTML': 30638}",2011-04-05 12:10:12+00:00,2023-10-12 12:17:12+00:00,4573 days 00:07:00,0 days 10:18:55.420587,129888,False,-1 days +13:41:04.579413
6,anndata,http://anndata.readthedocs.io,https://github.com/scverse/anndata,450,138,15,221,43,BSD-3-Clause,Annotated data.,"[0.11.0.dev0, 0.10.2, 0.10.1, 0.10.0, 0.10.0rc...",{'Python': 658028},2017-08-11 14:10:06+00:00,2023-10-12 16:21:47+00:00,2253 days 02:11:41,0 days 06:14:20.420587,4092,False,-1 days +17:45:39.579413
9,ImageJ,http://imagej.org,https://github.com/imagej/ImageJ,378,173,46,48,3,NOASSERTION,Public domain software for processing and anal...,"[v1.54f, v1.54e, v1.54d, v1.54c, v1.54b, v1.54...","{'Java': 6070304, 'HTML': 343806, 'ImageJ Macr...",2011-08-01 12:29:11+00:00,2023-10-07 17:10:36+00:00,4450 days 04:41:25,5 days 05:25:31.420587,32411,False,-6 days +18:34:28.579413
8,DeepProfiler,,https://github.com/cytomining/DeepProfiler,83,37,13,27,11,NOASSERTION,Morphological profiling using deep learning,"[v0.5.0, v0.3.1, v0.3.0, v0.2.0, v0.1.0]",{'Python': 233500},2016-11-10 21:33:35+00:00,2023-06-15 13:17:09+00:00,2407 days 15:43:34,119 days 09:18:58.420587,4830,False,-120 days +14:41:01.579413
0,pycytominer,https://pycytominer.readthedocs.io,https://github.com/cytomining/pycytominer,52,32,6,83,22,BSD-3-Clause,Python package for processing image-based prof...,"[v0.2.0, v0.1.5, v0.1]","{'Python': 373578, 'Jupyter Notebook': 16489, ...",2019-07-03 18:22:51+00:00,2023-10-11 13:58:57+00:00,1560 days 19:36:06,1 days 08:37:10.420587,721073,False,-2 days +15:22:49.579413
3,IDR_stream,,https://github.com/WayScience/IDR_stream,4,2,2,2,2,BSD-3-Clause,Software for feature extraction from IDR image...,[],"{'Jupyter Notebook': 311010, 'Python': 88583}",2022-08-09 21:16:48+00:00,2023-02-24 22:08:54+00:00,199 days 00:52:06,230 days 00:27:13.420587,37026,False,-231 days +23:32:46.579413
2,CytoTable,https://cytomining.github.io/CytoTable/,https://github.com/cytomining/CytoTable,3,4,4,43,4,BSD-3-Clause,Transform data for processing image-based prof...,"[v0.0.2, v0.0.1]",{'Python': 157082},2022-09-08 15:46:25+00:00,2023-10-11 21:41:55+00:00,398 days 05:55:30,1 days 00:54:12.420587,6817,False,-2 days +23:05:47.579413
1,CytoSnake,https://cytosnake.readthedocs.io,https://github.com/WayScience/CytoSnake,3,3,0,35,3,CC-BY-4.0,Orchestrating high-dimensional cell morphology...,"[v0.0.2, v0.0.1]",{'Python': 132842},2022-02-15 18:02:45+00:00,2023-09-01 23:10:17+00:00,563 days 05:07:32,40 days 23:25:50.420587,780,False,-41 days +00:34:09.579413
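Because "GitHub Stars" is the first sort key, the later keys only break ties between projects with equal star counts. As an alternative to adding a negated column, `sort_values` also accepts a list for `ascending`, one flag per key. The sketch below uses invented data and column names to show that both approaches give the same order:

```python
import pandas as pd

# invented rows: "a" and "b" tie on stars, "b" was changed more recently
toy = pd.DataFrame(
    {
        "name": ["a", "b", "c"],
        "stars": [5, 5, 9],
        "since_last_commit": pd.to_timedelta(["3 days", "1 days", "10 days"]),
    }
)

# per-key sort direction: most stars first, then shortest duration first
ranked = toy.sort_values(
    by=["stars", "since_last_commit"], ascending=[False, True]
)

# the negation approach: flip the duration, then sort everything descending
ranked_negated = toy.assign(negated=-toy["since_last_commit"]).sort_values(
    by=["stars", "negated"], ascending=False
)

print(ranked["name"].tolist())  # ['c', 'b', 'a']
print(ranked_negated["name"].tolist())  # ['c', 'b', 'a']
```

The list form avoids carrying an extra helper column into any exported data, at the cost of keeping the flag list in step with the key list.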


In [11]:
# export to parquet for later use
df_projects.to_parquet("data/project-github-metrics.parquet")