# Project Git Metrics for Landscape Analysis

Project git metrics for software landscape analysis related to Cytomining ecosystem.

## Setup

Set an environment variable named `LANDSCAPE_ANALYSIS_GH_TOKEN` to a [GitHub access token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens). E.g.: `export LANDSCAPE_ANALYSIS_GH_TOKEN=token_here`

In [1]:
import os
from datetime import datetime

import pandas as pd
import pytz
from box import Box
from github import Auth, Github

# set github authorization and client
github_client = Github(
    auth=Auth.Token(os.environ.get("LANDSCAPE_ANALYSIS_GH_TOKEN")), per_page=100
)
# get the current datetime
tz = pytz.timezone("UTC")
current_datetime = datetime.now(tz)

In [2]:
# gather projects data
projects = Box.from_yaml(filename="data/projects.yaml").projects

# check the number of projects
print("number of projects: ", len(projects))

number of projects:  1239


In [3]:
# show the keys available for the projects
projects[0].keys()

dict_keys(['category', 'homepage_url', 'name', 'repo_url'])

In [4]:
def try_to_detect_license(repo):
    """
    Tries to detect the license from GitHub API
    """

    try:
        return repo.get_license().license.spdx_id
    except:
        return None

In [5]:
def try_to_gather_commit_count(repo):
    """
    Tries to detect commit count of repo from GitHub API
    """

    try:
        return len(list(repo.get_commits()))
    except:
        return 0

In [6]:
def try_to_gather_most_recent_commit_date(repo):
    """
    Tries to detect most recent commit date of repo from GitHub API
    """

    try:
        return repo.pushed_at.replace(tzinfo=pytz.UTC)
    except:
        return None

In [7]:
df_projects = pd.DataFrame(
    # create a list of repo data records for a dataframe
    [
        {
            "Project Name": repo.name,
            "Project Homepage": repo.homepage,
            "Project Repo URL": repo.html_url,
            "Project Landscape Category": project.category,
            "GitHub Stars": repo.stargazers_count,
            "GitHub Forks": repo.forks_count,
            "GitHub Watchers": repo.subscribers_count,
            "GitHub Open Issues": repo.get_issues(state="open").totalCount,
            "GitHub Contributors": repo.get_contributors().totalCount,
            "GitHub License Type": try_to_detect_license(repo),
            "GitHub Description": repo.description,
            "GitHub Topics": repo.topics,
            "GitHub Detected Languages": repo.get_languages(),
            "Date Created": repo.created_at.replace(tzinfo=pytz.UTC),
            "Date Most Recent Commit": try_to_gather_most_recent_commit_date(repo),
            # placeholders for later datetime calculations
            "Duration Created to Most Recent Commit": "",
            "Duration Created to Now": "",
            "Duration Most Recent Commit to Now": "",
            "Repository Size (KB)": repo.size,
            "GitHub Repo Archived": repo.archived,
        }
        # make a request for github repo data with pygithub
        for project, repo in [
            (
                project,
                github_client.get_repo(
                    project.repo_url.replace("https://github.com/", "")
                ),
            )
            for project in projects
        ]
    ]
)

# calculate time deltas
df_projects["Duration Created to Most Recent Commit"] = (
    df_projects["Date Most Recent Commit"] - df_projects["Date Created"]
)
df_projects["Duration Created to Now"] = current_datetime - df_projects["Date Created"]
df_projects["Duration Most Recent Commit to Now"] = (
    current_datetime - df_projects["Date Most Recent Commit"]
)

# show the result
df_projects

Following Github server redirection from /repos/theislab/scanpy to /repositories/80342493
Following Github server redirection from /repos/YosefLab/scvi-tools to /repositories/102567256
Following Github server redirection from /repos/shenorrLab/bseqsc to /repositories/62131343
Request GET /repos/Teichlab/sctkr/languages failed with 403: Forbidden
Setting next backoff to 1528.109756s


Unnamed: 0,Project Name,Project Homepage,Project Repo URL,Project Landscape Category,GitHub Stars,GitHub Forks,GitHub Watchers,GitHub Open Issues,GitHub Contributors,GitHub License Type,GitHub Description,GitHub Topics,GitHub Detected Languages,Date Created,Date Most Recent Commit,Duration Created to Most Recent Commit,Duration Created to Now,Duration Most Recent Commit to Now,Repository Size (KB),GitHub Repo Archived
0,pycytominer,https://pycytominer.readthedocs.io,https://github.com/cytomining/pycytominer,"[loi-focus, cytomining-ecosystem]",52,32,6,83,22,BSD-3-Clause,Python package for processing image-based prof...,"[carpenter-lab, cellprofiler, cytominer, image...","{'Python': 373578, 'Jupyter Notebook': 16489, ...",2019-07-03 18:22:51+00:00,2023-10-11 13:58:57+00:00,1560 days 19:36:06,1563 days 01:41:12.235293,2 days 06:05:06.235293,721073,False
1,CytoSnake,https://cytosnake.readthedocs.io,https://github.com/WayScience/CytoSnake,"[loi-focus, cytomining-ecosystem]",3,3,0,35,3,CC-BY-4.0,Orchestrating high-dimensional cell morphology...,"[cell-morphology, microscopy-images, pipeline,...",{'Python': 132842},2022-02-15 18:02:45+00:00,2023-09-01 23:10:17+00:00,563 days 05:07:32,605 days 02:01:18.235293,41 days 20:53:46.235293,780,False
2,CytoTable,https://cytomining.github.io/CytoTable/,https://github.com/cytomining/CytoTable,"[loi-focus, cytomining-ecosystem]",3,4,4,41,4,BSD-3-Clause,Transform data for processing image-based prof...,"[cellprofiler, python, single-cell-analysis, w...",{'Python': 154533},2022-09-08 15:46:25+00:00,2023-10-13 13:46:51+00:00,399 days 22:00:26,400 days 04:17:38.235293,0 days 06:17:12.235293,6829,False
3,IDR_stream,,https://github.com/WayScience/IDR_stream,"[loi-focus, cytomining-ecosystem]",4,2,2,2,2,BSD-3-Clause,Software for feature extraction from IDR image...,[],"{'Jupyter Notebook': 311010, 'Python': 88583}",2022-08-09 21:16:48+00:00,2023-02-24 22:08:54+00:00,199 days 00:52:06,429 days 22:47:15.235293,230 days 21:55:09.235293,37026,False
4,pandas,https://pandas.pydata.org,https://github.com/pandas-dev/pandas,[cytomining-ecosystem-relevant-open-source],40001,16798,1121,3640,411,BSD-3-Clause,Flexible and powerful data analysis / manipula...,"[alignment, data-analysis, data-science, flexi...","{'Python': 20322774, 'Cython': 1277471, 'HTML'...",2010-08-24 01:37:33+00:00,2023-10-13 20:03:44+00:00,4798 days 18:26:11,4798 days 18:26:30.235293,0 days 00:00:19.235293,334934,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1234,Marmoset_FetalBrain_singlecellRNAseq,,https://github.com/parulvarma123/Marmoset_Feta...,[related-tools-github-query-result],1,0,1,0,1,,This repository contains scripts that have bee...,[],{'R': 18458},2022-04-21 18:56:18+00:00,2022-04-21 22:53:17+00:00,0 days 03:56:59,540 days 01:07:45.235293,539 days 21:10:46.235293,7,False
1235,Myotome-volume-Nucleus-count-and-Color-analysis,,https://github.com/peggyscshu/Myotome-volume-N...,[related-tools-github-query-result],1,0,1,6,1,Apache-2.0,The tools in this repository are designed to a...,[],"{'ImageJ Macro': 17729, 'Python': 807}",2022-04-18 09:28:35+00:00,2022-04-27 08:42:53+00:00,8 days 23:14:18,543 days 10:35:28.235293,534 days 11:21:10.235293,103,False
1236,scRNAseq_PBMCs,,https://github.com/jwiarda/scRNAseq_PBMCs,[related-tools-github-query-result],1,1,3,0,2,,Scripts & figures used to finish scRNA-seq ana...,[],{},2020-06-22 20:49:09+00:00,2021-05-20 21:00:49+00:00,332 days 00:11:40,1207 days 23:14:54.235293,875 days 23:03:14.235293,4101,False
1237,thymus_spatial_atlas,,https://github.com/Teichlab/thymus_spatial_atlas,[related-tools-github-query-result],1,0,0,0,1,,general repo that holds all analysis and figur...,"[bioinformatics-tool, common-coordinate-framwo...","{'Jupyter Notebook': 165406342, 'Python': 1268...",2023-05-01 11:53:13+00:00,2023-09-22 13:04:47+00:00,144 days 01:11:34,165 days 08:10:50.235293,21 days 06:59:16.235293,395652,False


In [28]:
# filter the results
df_projects = df_projects[
    # filter projects which are < 50 KB
    df_projects["Repository Size (KB)"]
    >= 50
    # filter projects which have been archived
    & ~df_projects["GitHub Repo Archived"]
][  # filter projects which have no detected programming languages
    df_projects["GitHub Detected Languages"].str.len() > 0
]
df_projects.tail()

Unnamed: 0,Project Name,Project Homepage,Project Repo URL,Project Landscape Category,GitHub Stars,GitHub Forks,GitHub Watchers,GitHub Open Issues,GitHub Contributors,GitHub License Type,GitHub Description,GitHub Topics,GitHub Detected Languages,Date Created,Date Most Recent Commit,Duration Created to Most Recent Commit,Duration Created to Now,Duration Most Recent Commit to Now,Repository Size (KB),GitHub Repo Archived
1233,Vaccine-associated-enhanced-respiratory-pathol...,,https://github.com/Berlin-Hamster-Single-Cell-...,[related-tools-github-query-result],1,0,0,1,1,GPL-3.0,This is the GitHub repository providing access...,[],{'R': 47789},2022-05-16 11:21:45+00:00,2022-07-17 10:27:24+00:00,61 days 23:05:39,515 days 08:42:18.235293,453 days 09:36:39.235293,60,False
1234,Marmoset_FetalBrain_singlecellRNAseq,,https://github.com/parulvarma123/Marmoset_Feta...,[related-tools-github-query-result],1,0,1,0,1,,This repository contains scripts that have bee...,[],{'R': 18458},2022-04-21 18:56:18+00:00,2022-04-21 22:53:17+00:00,0 days 03:56:59,540 days 01:07:45.235293,539 days 21:10:46.235293,7,False
1235,Myotome-volume-Nucleus-count-and-Color-analysis,,https://github.com/peggyscshu/Myotome-volume-N...,[related-tools-github-query-result],1,0,1,6,1,Apache-2.0,The tools in this repository are designed to a...,[],"{'ImageJ Macro': 17729, 'Python': 807}",2022-04-18 09:28:35+00:00,2022-04-27 08:42:53+00:00,8 days 23:14:18,543 days 10:35:28.235293,534 days 11:21:10.235293,103,False
1237,thymus_spatial_atlas,,https://github.com/Teichlab/thymus_spatial_atlas,[related-tools-github-query-result],1,0,0,0,1,,general repo that holds all analysis and figur...,"[bioinformatics-tool, common-coordinate-framwo...","{'Jupyter Notebook': 165406342, 'Python': 1268...",2023-05-01 11:53:13+00:00,2023-09-22 13:04:47+00:00,144 days 01:11:34,165 days 08:10:50.235293,21 days 06:59:16.235293,395652,False
1238,scMultiR,,https://github.com/tagtag/scMultiR,[related-tools-github-query-result],1,1,3,0,1,,This is a R source code by which we can perfor...,[],{'R': 31551},2021-09-12 13:13:07+00:00,2022-04-09 23:23:44+00:00,209 days 10:10:37,761 days 06:50:56.235293,551 days 20:40:19.235293,24,False


In [29]:
# negate this duration value for sorting descendingly,
# with projects that have been more recently changed sorting to the top
df_projects["Negative Duration Most Recent Commit to Now"] = -df_projects[
    "Duration Most Recent Commit to Now"
]
df_projects = df_projects.sort_values(
    by=[
        "GitHub Stars",
        "GitHub Watchers",
        "GitHub Contributors",
        "GitHub Forks",
        "GitHub Open Issues",
        "Negative Duration Most Recent Commit to Now",
        "Duration Created to Most Recent Commit",
    ],
    ascending=False,
)
df_projects

Unnamed: 0,Project Name,Project Homepage,Project Repo URL,Project Landscape Category,GitHub Stars,GitHub Forks,GitHub Watchers,GitHub Open Issues,GitHub Contributors,GitHub License Type,...,GitHub Topics,GitHub Detected Languages,Date Created,Date Most Recent Commit,Duration Created to Most Recent Commit,Duration Created to Now,Duration Most Recent Commit to Now,Repository Size (KB),GitHub Repo Archived,Negative Duration Most Recent Commit to Now
4,pandas,https://pandas.pydata.org,https://github.com/pandas-dev/pandas,[cytomining-ecosystem-relevant-open-source],40001,16798,1121,3640,411,BSD-3-Clause,...,"[alignment, data-analysis, data-science, flexi...","{'Python': 20322774, 'Cython': 1277471, 'HTML'...",2010-08-24 01:37:33+00:00,2023-10-13 20:03:44+00:00,4798 days 18:26:11,4798 days 18:26:30.235293,0 days 00:00:19.235293,334934,False,-1 days +23:59:40.764707
5,numpy,https://numpy.org,https://github.com/numpy/numpy,[cytomining-ecosystem-relevant-open-source],24720,8644,595,2196,435,BSD-3-Clause,...,"[numpy, python]","{'Python': 10457640, 'C': 6220070, 'C++': 2057...",2010-09-13 23:02:39+00:00,2023-10-13 19:01:51+00:00,4777 days 19:59:12,4777 days 21:01:24.235293,0 days 01:02:12.235293,131862,False,-1 days +22:57:47.764707
14,arrow,https://arrow.apache.org/,https://github.com/apache/arrow,[cytomining-ecosystem-relevant-open-source],12603,3094,351,3902,366,Apache-2.0,...,[arrow],"{'C++': 26858030, 'Java': 7353737, 'Go': 56198...",2016-02-17 08:00:23+00:00,2023-10-13 17:57:49+00:00,2795 days 09:57:26,2795 days 12:03:40.235293,0 days 02:06:14.235293,170891,False,-1 days +21:53:45.764707
16,duckdb,http://www.duckdb.org,https://github.com/duckdb/duckdb,[cytomining-ecosystem-relevant-open-source],12351,1154,156,321,253,MIT,...,"[analytics, database, embedded-database, olap,...","{'C++': 33575975, 'C': 1761733, 'Python': 1407...",2018-06-26 15:04:45+00:00,2023-10-13 13:57:33+00:00,1934 days 22:52:48,1935 days 04:59:18.235293,0 days 06:06:30.235293,226865,False,-1 days +17:53:29.764707
15,parquet-mr,,https://github.com/apache/parquet-mr,[cytomining-ecosystem-relevant-open-source],2176,1332,95,130,189,Apache-2.0,...,"[big-data, java, parquet]","{'Java': 5920194, 'Shell': 14860, 'Python': 14...",2014-06-10 07:00:07+00:00,2023-10-13 15:48:32+00:00,3412 days 08:48:25,3412 days 13:03:56.235293,0 days 04:15:31.235293,18492,False,-1 days +19:44:28.764707
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
151,ImagingCells,,https://github.com/jesnyder/ImagingCells,[related-tools-github-query-result],0,0,1,0,1,,...,[],{'Jupyter Notebook': 1523},2018-06-15 19:50:00+00:00,2018-08-31 19:21:33+00:00,76 days 23:31:33,1946 days 00:14:03.235293,1869 days 00:42:30.235293,1,False,-1870 days +23:17:29.764707
224,course-bia,,https://github.com/denzf/course-bia,[related-tools-github-query-result],0,0,1,0,1,MIT,...,[],{'Python': 6010},2018-01-28 21:58:13+00:00,2018-01-24 03:22:19+00:00,-5 days +05:24:06,2083 days 22:05:50.235293,2088 days 16:41:44.235293,203,False,-2089 days +07:18:15.764707
156,Cell-virulence-Detection-using-Image-Processing,,https://github.com/arushigupta148/Cell-virulen...,[related-tools-github-query-result],0,0,0,0,1,,...,[],{'Python': 12989},2018-12-27 08:27:06+00:00,2019-05-12 23:09:02+00:00,136 days 14:41:56,1751 days 11:36:57.235293,1614 days 20:55:01.235293,1751,False,-1615 days +03:04:58.764707
160,Image-analysis,,https://github.com/dguin/Image-analysis,[related-tools-github-query-result],0,0,0,0,1,,...,[],{'MATLAB': 40670},2018-10-13 18:53:42+00:00,2018-10-13 19:32:22+00:00,0 days 00:38:40,1826 days 01:10:21.235293,1826 days 00:31:41.235293,26,False,-1827 days +23:28:18.764707


In [30]:
# export to parquet for later use
df_projects.to_parquet("data/project-github-metrics.parquet")