# Seek GitHub Projects for Landscape Analysis

Gather GitHub projects and other data entries for a software landscape analysis of the Cytomining ecosystem.

## Setup

Set an environment variable named `LANDSCAPE_ANALYSIS_GH_TOKEN` to a [GitHub access token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens). E.g.: `export LANDSCAPE_ANALYSIS_GH_TOKEN=token_here`
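
The setup step above can fail in a confusing way when the variable is missing. Below is a minimal sketch of a check that fails early with a readable message. `require_token` is a hypothetical helper introduced here for illustration; it is not part of this notebook or of PyGithub.

```python
import os


def require_token(name: str = "LANDSCAPE_ANALYSIS_GH_TOKEN") -> str:
    """Return the token stored in the environment variable `name`.

    Raises a RuntimeError with a readable message when the variable
    is unset or contains only whitespace.
    """
    token = os.environ.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {name} environment variable to a GitHub access token."
        )
    return token
```

Assuming this helper is defined, the client setup could pass `require_token()` to `Auth.Token(...)` instead of reading the environment directly.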

In [1]:
import os
from datetime import datetime

import numpy as np
import pandas as pd
from box import Box
from github import Auth, Github

# set up GitHub authorization and client;
# indexing os.environ raises a KeyError naming the variable if the token is unset
github_client = Github(
    auth=Auth.Token(os.environ["LANDSCAPE_ANALYSIS_GH_TOKEN"]), per_page=100
)

In [2]:
# gather projects data
queries = Box.from_yaml(filename="data/queries.yaml").queries

# observe the queries
queries.to_list()

['single-cell image morphology',
 'cell image morphology',
 'high-dimensional cell morphology',
 'cell image-based profiling',
 'biological image analysis',
 'analyzing scientific images',
 'single-cell analysis']

In [3]:
# set up a reference list of target project URLs so they are not added again from search results
target_project_html_urls = [
    project["repo_url"]
    for project in Box.from_yaml(filename="data/target-projects.yaml").projects
]
target_project_html_urls

['https://github.com/cytomining/pycytominer',
 'https://github.com/WayScience/CytoSnake',
 'https://github.com/cytomining/CytoTable',
 'https://github.com/WayScience/IDR_stream',
 'https://github.com/pandas-dev/pandas',
 'https://github.com/numpy/numpy',
 'https://github.com/scverse/anndata',
 'https://github.com/CellProfiler/CellProfiler',
 'https://github.com/cytomining/DeepProfiler',
 'https://github.com/imagej/ImageJ',
 'https://github.com/qupath/qupath',
 'https://github.com/napari/napari',
 'https://github.com/menchelab/BioProfiling.jl',
 'https://github.com/AltschulerWu-Lab/phenoripper',
 'https://github.com/apache/arrow',
 'https://github.com/apache/parquet-mr',
 'https://github.com/duckdb/duckdb',
 'https://github.com/snakemake/snakemake',
 'https://github.com/Parsl/parsl']

In [4]:
# gather repo data from GitHub based on the results of search queries
results = [
    {
        "name": result.name,
        "homepage_url": result.homepage,
        "repo_url": result.html_url,
        "category": ["related-tools-github-query-result"],
    }
    for query in queries
    for result in github_client.search_repositories(
        query=query, sort="stars", order="desc"
    )
    if result.html_url not in target_project_html_urls
]
len(results)

1137
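
The record-building step above can also be sketched as two small functions, which makes the filtering logic testable without calling GitHub. This is a sketch under stated assumptions: `to_record`, `collect_records`, and the `limit` parameter are hypothetical names introduced here, and the stand-in objects below only mimic the three attributes the notebook reads (`name`, `homepage`, `html_url`). GitHub's search API documents a cap of 1,000 results per query, so an optional `limit` mainly helps keep exploratory runs short.

```python
from itertools import islice
from types import SimpleNamespace


def to_record(repo, category="related-tools-github-query-result"):
    """Convert a repository-like object into a projects-style record."""
    return {
        "name": repo.name,
        "homepage_url": repo.homepage,
        "repo_url": repo.html_url,
        "category": [category],
    }


def collect_records(search_results, excluded_urls, limit=None):
    """Build records from search results, skipping excluded URLs.

    `limit` caps how many results are read from the iterable;
    None reads until the iterable is exhausted.
    """
    excluded = set(excluded_urls)
    return [
        to_record(repo)
        for repo in islice(search_results, limit)
        if repo.html_url not in excluded
    ]


# stand-in objects used only for illustration, not real search results
fake_results = [
    SimpleNamespace(
        name="pycytominer",
        homepage=None,
        html_url="https://github.com/cytomining/pycytominer",
    ),
    SimpleNamespace(
        name="example-tool",
        homepage="https://example.org",
        html_url="https://github.com/example/example-tool",
    ),
]
records = collect_records(
    fake_results, excluded_urls=["https://github.com/cytomining/pycytominer"]
)
```

With a real client, the iterable passed as `search_results` would be the value returned by `github_client.search_repositories(...)` for one query.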

In [5]:
# read the scRNA-Tools export and display its shape
df_scrna_tools = pd.read_csv("data/scRNA-Tools-tableExport-2023-10-12.csv")
print(df_scrna_tools.shape)

# replace placeholder citation values ("-" and "'-") with 0 so the column can be sorted numerically
df_scrna_tools["Citations"] = (
    df_scrna_tools["Citations"].replace("-", "0").replace("'-", "0")
).astype("int64")

# drop rows where we don't have a repository
df_scrna_tools = df_scrna_tools.dropna(subset=["Code"])

df_scrna_tools["name"] = df_scrna_tools["Name"]
df_scrna_tools["repo_url"] = df_scrna_tools["Code"]

df_scrna_tools["category"] = np.tile(
    ["cytomining-ecosystem-adjacent-tools"], (len(df_scrna_tools), 1)
).tolist()

# filter results to only those with a github link and sort values by number of citations
df_scrna_tools = df_scrna_tools[
    df_scrna_tools["Code"].str.contains("https://github.com", regex=False)
].sort_values(by=["Citations"], ascending=False)

# show a preview of the results
df_scrna_tools.head(5)[["name", "repo_url", "category"]]

(1602, 9)


| | name | repo_url | category |
|---|---|---|---|
| 1455 | STAR | https://github.com/alexdobin/STAR | [cytomining-ecosystem-adjacent-tools] |
| 1344 | Seurat | https://github.com/satijalab/seurat | [cytomining-ecosystem-adjacent-tools] |
| 701 | Monocle | https://github.com/cole-trapnell-lab/monocle-r... | [cytomining-ecosystem-adjacent-tools] |
| 601 | kallisto | https://github.com/pachterlab/kallisto | [cytomining-ecosystem-adjacent-tools] |
| 901 | salmon | https://github.com/COMBINE-lab/salmon | [cytomining-ecosystem-adjacent-tools] |


In [6]:
# convert the 100 most-cited tools to project-like records
df_scrna_tools_records = df_scrna_tools.head(100)[
    ["name", "repo_url", "category"]
].to_dict(orient="records")
df_scrna_tools_records[:5]

[{'name': 'STAR',
  'repo_url': 'https://github.com/alexdobin/STAR',
  'category': ['cytomining-ecosystem-adjacent-tools']},
 {'name': 'Seurat',
  'repo_url': 'https://github.com/satijalab/seurat',
  'category': ['cytomining-ecosystem-adjacent-tools']},
 {'name': 'Monocle',
  'repo_url': 'https://github.com/cole-trapnell-lab/monocle-release',
  'category': ['cytomining-ecosystem-adjacent-tools']},
 {'name': 'kallisto',
  'repo_url': 'https://github.com/pachterlab/kallisto',
  'category': ['cytomining-ecosystem-adjacent-tools']},
 {'name': 'salmon',
  'repo_url': 'https://github.com/COMBINE-lab/salmon',
  'category': ['cytomining-ecosystem-adjacent-tools']}]

In [7]:
# combine both datasets, placing the scRNA-Tools records first
results = df_scrna_tools_records + results

In [8]:
# filter the list of results to uniques
seen_url = set()
results = [
    result
    for result in results
    # check whether we have seen the result yet
    if result["repo_url"] not in seen_url
    # set.add always returns None, so `not ...` is True; the call records the URL as seen
    and not seen_url.add(result["repo_url"])
]
len(results)

1220
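
The comprehension above relies on the side effect of `set.add` inside a condition. The same order-preserving deduplication can be written as an explicit loop, shown here as a sketch. `unique_by_key` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def unique_by_key(records, key="repo_url"):
    """Return records with duplicate `key` values removed.

    The first occurrence of each value is kept and the original
    order of the remaining records is preserved.
    """
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value in seen:
            # a record with this value was already kept
            continue
        seen.add(value)
        unique.append(record)
    return unique


# small illustrative input, not data from the notebook
sample = [
    {"name": "a", "repo_url": "https://github.com/example/a"},
    {"name": "b", "repo_url": "https://github.com/example/b"},
    {"name": "a-duplicate", "repo_url": "https://github.com/example/a"},
]
deduplicated = unique_by_key(sample)
```

Like the comprehension, this compares URLs as exact strings, so two URLs that differ only in letter case or a trailing slash are treated as distinct entries.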

In [9]:
# prepend target projects to the results
results = Box.from_yaml(filename="data/target-projects.yaml").projects + results

In [10]:
# export the results to a yaml file for later processing
Box({"projects": results}).to_yaml("data/projects.yaml")