# OSS Insight + CHAOSS Metrics Analysis Notebook

This notebook provides a full tutorial and report on how to:

- Query the OSS Insight API for GitHub repositories  
- Transform those JSON API responses into pandas DataFrames  
- Map each retrieved metric to its corresponding **CHAOSS community health metrics**  
- Produce **interactive visualizations** using Bokeh  
- Understand what each visualization means in terms of project health  
- Apply all steps to a chosen repository (the run shown here uses **epfl-dlab/aiflows**)

The goal is to replicate what tools like **GrimoireLab**, **Augur**, or **GHViz** provide,
but directly inside a Python notebook, using OSS Insight as the data source.

By the end of this report, you will have:
- A reusable workflow  
- A complete visualization dashboard  
- A blueprint to build your own open-source analytics system 

# 0. Initialize Notebook Environment
We load the required Python libraries.  
If something is missing, uncomment the pip install commands.

We will use:
- `requests` to call the OSS Insight REST API  
- `pandas` to manipulate tabular data  
- `bokeh` for fully interactive plots (zoom, hover, pan) 

In [76]:
# !pip install requests pandas bokeh

import requests
import pandas as pd
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import HoverTool

output_notebook()


Explanation: The public OSS Insight API is documented under `https://api.ossinsight.io/v1`. The cells below call the `/gh/repo/...` and `/q/...` endpoints on the same host to retrieve repository data. We will construct requests to these endpoints for our target repository. Bokeh’s `output_notebook()` is called to enable interactive plot display within the notebook.
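
Most later cells request URLs of the form `https://api.ossinsight.io/q/<query-name>?repoId=<id>`. As a minimal sketch, a small helper can build these URLs in one place. The name `build_query_url` is hypothetical and not part of OSS Insight or any library; the host and the `repoId` parameter are taken from the requests used in this notebook.

```python
from urllib.parse import urlencode

# Host taken from the requests made later in this notebook.
OSSINSIGHT_HOST = "https://api.ossinsight.io"


def build_query_url(query_name: str, repo_id: int, **params) -> str:
    """Build a /q/<query_name> URL (hypothetical helper, not a library function)."""
    query = {"repoId": repo_id, **params}
    return f"{OSSINSIGHT_HOST}/q/{query_name}?{urlencode(query)}"


url = build_query_url("analyze-commits-time-distribution", 674002978, period="last_1_year")
print(url)
```

The resulting string can be passed to `requests.get(url, timeout=30)` followed by `raise_for_status()`, so that a slow or failing endpoint surfaces as an error rather than as an empty DataFrame.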

# 1. Select Repository  
This is the only value you need to change if you want to analyze another GitHub project.

Each entry in `repo_list` must have the form:  
`{"owner": "ORG_OR_USER", "repo": "REPOSITORY_NAME"}`  
Set `selected_repo` to the entry you want to analyze.


In [77]:
repo_list = [
    {"owner": "scala", "repo": "scala"},
    {"owner": "deeplabcut", "repo": "deeplabcut"},
    {"owner": "arkworks-rs", "repo": "snark"},
    {"owner": "JuliaMolSim", "repo": "DFTK.jl"},
    {"owner": "epfl-dlab", "repo": "aiflows"},
]
selected_repo = repo_list[4]

In [78]:
output = {}

owner = selected_repo["owner"]
repo  = selected_repo["repo"]


def get_repo_info(owner, repo):
    url = f"https://api.ossinsight.io/gh/repo/{owner}/{repo}"
    res = requests.get(url)
    res.raise_for_status()
    return res.json()["data"]

repo_info = get_repo_info(owner, repo)
repo_id = repo_info["id"]

print("Repo:", repo_info["full_name"])
print("Repo ID:", repo_id)
print("Stars (live GitHub count):", repo_info["stargazers_count"])
output["stars"] = repo_info["stargazers_count"]
output["repo"] = repo_info["full_name"]

Repo: epfl-dlab/aiflows
Repo ID: 674002978
Stars (live GitHub count): 271


# 2. STAR HISTORY  
### CHAOSS Relation  
Although stars are *not* an official CHAOSS metric, they are widely used as a **signal of growth and adoption**, which falls under the CHAOSS *Growth-Maturity-Decline* model.

### Why it matters  
- Shows overall interest in the project  
- Detects periods of major attention (papers, releases, conferences, viral posts)  
- Indicates growth trajectories (accelerating, stagnant, declining)

### OSS Insight Endpoint  
`/q/analyze-stars-history?repoId=<repo_id>`  
This gives a **monthly timeline** of cumulative stars.


In [79]:
def get_star_history(repo_id: int) -> pd.DataFrame:
    url = f"https://api.ossinsight.io/q/analyze-stars-history?repoId={repo_id}"
    res = requests.get(url)
    res.raise_for_status()
    data = res.json()["data"]
    df = pd.DataFrame(data)
    if df.empty:
        return df
    
    df["event_month"] = pd.to_datetime(df["event_month"])
    df = df.rename(columns={"event_month": "date", "total": "stargazers"})
    return df

df_stars = get_star_history(repo_id)

if df_stars.empty:
    print("No star history data returned by OSS Insight for this repository.")
else:
    display(df_stars.tail())



|    | repo_id   | date       | stargazers |
|----|-----------|------------|------------|
| 17 | 674002978 | 2025-03-01 | 247        |
| 18 | 674002978 | 2025-04-01 | 248        |
| 19 | 674002978 | 2025-05-01 | 249        |
| 20 | 674002978 | 2025-06-01 | 250        |
| 21 | 674002978 | 2025-08-01 | 257        |


## Plot: Star Growth Curve  
Interpreting this chart:

- A steady positive slope → healthy long-term interest  
- Sharp jumps → major announcements/releases  
- Plateaus → stagnation or reduced visibility  
- Drops → stars removed (rare but possible)
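
The reading rules above can be made concrete by differencing the cumulative series. The sketch below uses the five rows printed by `df_stars.tail()` and labels each change. The cutoff `JUMP_THRESHOLD` is an assumed value chosen for illustration, not something OSS Insight or CHAOSS defines.

```python
# Cumulative star counts copied from the df_stars.tail() output in this notebook.
cumulative = [
    ("2025-03-01", 247),
    ("2025-04-01", 248),
    ("2025-05-01", 249),
    ("2025-06-01", 250),
    ("2025-08-01", 257),
]

JUMP_THRESHOLD = 5  # assumed cutoff for a "sharp jump"; tune per project


def classify(delta: int) -> str:
    """Label a change between two consecutive rows."""
    if delta < 0:
        return "drop"
    if delta == 0:
        return "plateau"
    if delta >= JUMP_THRESHOLD:
        return "jump"
    return "steady"


changes = []
for (_, prev), (date, curr) in zip(cumulative, cumulative[1:]):
    delta = curr - prev
    changes.append((date, delta, classify(delta)))

for row in changes:
    print(row)
```

Note that the data has no row for 2025-07, so the last change spans two months. Differences are taken between consecutive rows, not consecutive calendar months, which can overstate a jump.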

Bokeh allows:
- scroll to zoom  
- click-drag to pan  
- hover to see exact values  


In [80]:
if df_stars.empty:
    print("No star history data returned for this repository.")
else:
    p = figure(
        title=f"Stargazers Over Time – {owner}/{repo}",
        x_axis_type='datetime',
        width=900, height=420,
        tools="pan,wheel_zoom,box_zoom,reset"
    )

    p.line(df_stars['date'], df_stars['stargazers'], line_width=2)
    p.scatter(df_stars['date'], df_stars['stargazers'], size=3)  # circle markers; circle(size=...) is deprecated in recent Bokeh

    p.add_tools(HoverTool(
        tooltips=[("Date", "@x{%F}"), ("Stars", "@y")],
        formatters={"@x": "datetime"}
    ))

    p.xaxis.axis_label = "Date"
    p.yaxis.axis_label = "Cumulative Stars"
    show(p)



# 3. COMMIT ACTIVITY (Time Distribution)

## CHAOSS Mapping
**Code Development - Commit Activity**

Understanding *when* developers contribute helps in:
- Identifying global contributor timezones
- Detecting "crunch times" or unhealthy work hours (e.g., weekends/late nights)
- Planning meetings and support hours

### OSS Insight Endpoint
`/q/analyze-commits-time-distribution?period=last_1_year`
This provides a heatmap of commit pushes by **Day of Week** and **Hour of Day**.
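
To show what the heatmap encodes, here is a minimal sketch that arranges such records into a 7 × 24 grid and derives the weekend share of pushes. The records are hypothetical; the field names `dayofweek`, `hour`, and `pushes` are the ones this notebook reads, and the mapping 0 = Sunday is assumed from the notebook's own day list.

```python
# Hypothetical records in the shape the notebook expects from the endpoint:
# one row per (dayofweek, hour) with a push count.
records = [
    {"dayofweek": 0, "hour": 22, "pushes": 3},  # Sunday, per the notebook's day list
    {"dayofweek": 2, "hour": 10, "pushes": 8},
    {"dayofweek": 2, "hour": 15, "pushes": 5},
    {"dayofweek": 6, "hour": 9, "pushes": 4},   # Saturday
]

# 7 x 24 grid of zeros; grid[day][hour] holds the push count.
grid = [[0] * 24 for _ in range(7)]
for rec in records:
    grid[rec["dayofweek"]][rec["hour"]] += rec["pushes"]

total = sum(map(sum, grid))
weekend = sum(grid[0]) + sum(grid[6])  # assumes 0 = Sunday and 6 = Saturday
weekend_share = weekend / total if total else 0.0
print(f"Weekend share of pushes: {weekend_share:.0%}")
```

A high weekend or late-night share is only a hint. Hours are meaningful only once you know which time zone the endpoint reports in.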

In [81]:
url_commits = f"https://api.ossinsight.io/q/analyze-commits-time-distribution?repoId={repo_id}&period=last_1_year"
res_commits = requests.get(url_commits).json()
df_commits = pd.DataFrame(res_commits['data'])

if not df_commits.empty:
    days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
    df_commits["day_name"] = df_commits["dayofweek"].apply(lambda x: days[x])
    df_commits["hour_str"] = df_commits["hour"].astype(str)
    display(df_commits.head())

In [82]:
from bokeh.models import LinearColorMapper, BasicTicker, ColorBar
from bokeh.transform import transform

if df_commits.empty:
    print("No commit data found.")
else:
    days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
    hours = [str(x) for x in range(24)]
    
    mapper = LinearColorMapper(palette="Viridis256", low=df_commits.pushes.min(), high=df_commits.pushes.max())

    p_commits = figure(title="Commit Time Distribution (Last 1 Year)",
               x_range=hours, y_range=list(reversed(days)),
               x_axis_location="above", width=900, height=400,
               tools="save,pan,box_zoom,reset,wheel_zoom")  # hover is added once, below

    p_commits.rect(x="hour_str", y="day_name", width=1, height=1, source=df_commits,
            fill_color=transform('pushes', mapper), line_color=None)

    p_commits.add_tools(HoverTool(
        tooltips=[
            ('Day', '@day_name'),
            ('Hour', '@hour'),
            ('Pushes', '@pushes'),
        ]
    ))
    
    color_bar = ColorBar(color_mapper=mapper, location=(0, 0),
                         ticker=BasicTicker(desired_num_ticks=len(days)))

    p_commits.add_layout(color_bar, 'right')
    p_commits.axis.axis_line_color = None
    p_commits.axis.major_tick_line_color = None
    p_commits.axis.major_label_text_font_size = "10pt"
    p_commits.axis.major_label_standoff = 0
    p_commits.xaxis.major_label_orientation = 0
    
    show(p_commits)

No commit data found.


# 4. PULL REQUEST HEALTH

## CHAOSS Mapping
**Code Development - Code Review Efficiency**

Analyzing PRs gives insight into the *velocity* and *quality* of the development process.

### Metrics
1.  **PR Overview**: High-level volume stats.
2.  **PR Size**: Smaller PRs are easier to review and merge (Good practice).
3.  **Time to Merge**: How long does a contributor wait? Long wait times discourage contribution.
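
As a sketch of the time-to-merge metric, the block below computes the median hours from open to merge for one month of pull requests. The timestamps are hypothetical. We assume the `p50` column plotted later is this kind of median, as the chart title "Median PR Merge Time (Hours)" suggests.

```python
from datetime import datetime
from statistics import median

# Hypothetical (opened_at, merged_at) pairs for one month of merged PRs.
merged_prs = [
    (datetime(2025, 1, 3, 9, 0), datetime(2025, 1, 3, 12, 0)),     # 3 h
    (datetime(2025, 1, 10, 8, 0), datetime(2025, 1, 11, 8, 0)),    # 24 h
    (datetime(2025, 1, 20, 14, 0), datetime(2025, 1, 20, 20, 0)),  # 6 h
]

hours_to_merge = [
    (merged - opened).total_seconds() / 3600 for opened, merged in merged_prs
]
p50 = median(hours_to_merge)
print(f"Median hours to merge: {p50:.1f}")
```

The median is preferred over the mean here because a single long-running PR (24 h in this sample) would pull the mean up to 11 h, while the typical wait is 6 h.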

### OSS Insight Endpoints
- `/q/analyze-repo-pr-overview`
- `/q/analyze-pull-requests-size-per-month`
- `/q/analyze-pull-request-open-to-merged`

In [83]:
# 4.1 PR Overview
url_pr_overview = f"https://api.ossinsight.io/q/analyze-repo-pr-overview?repoId={repo_id}"
res_overview = requests.get(url_pr_overview).json()
data_overview = res_overview['data'][0] if res_overview['data'] else {}
print(data_overview)

if data_overview:
    print(f"Total PRs:       {data_overview.get('pull_requests')}")
    print(f"PR Creators:     {data_overview.get('pull_request_creators')}")
    print(f"Total Reviews:   {data_overview.get('pull_request_reviews')}")
    print(f"Reviewers:       {data_overview.get('pull_request_reviewers')}")
else:
    print("No PR overview data found.")

output['pull_requests'] = data_overview.get('pull_requests')
output['pull_request_creators'] = data_overview.get('pull_request_creators')

{'repo_id': 674002978, 'pull_requests': 8, 'pull_request_creators': 3, 'pull_request_reviews': 1, 'pull_request_reviewers': 1}
Total PRs:       8
PR Creators:     3
Total Reviews:   1
Reviewers:       1


In [84]:
# 4.2 PR Size History
url_pr_size = f"https://api.ossinsight.io/q/analyze-pull-requests-size-per-month?repoId={repo_id}"
res_size = requests.get(url_pr_size).json()
df_pr_size = pd.DataFrame(res_size['data'])

if not df_pr_size.empty:
    df_pr_size["date"] = pd.to_datetime(df_pr_size["event_month"])
    
    sizes = ['xs', 's', 'm', 'l', 'xl', 'xxl']
    colors = ["#e8f5e9", "#c8e6c9", "#a5d6a7", "#81c784", "#66bb6a", "#4caf50"] 
    
    p_size = figure(title="PR Size Distribution Over Time",
                x_axis_type="datetime", width=900, height=400,
                tools="pan,wheel_zoom,box_zoom,reset")
    
    p_size.vbar_stack(sizes, x='date', width=20 * 24 * 3600 * 1000, source=df_pr_size,
                  color=colors, legend_label=sizes)
    
    p_size.legend.location = "top_left"
    p_size.legend.orientation = "horizontal"
    p_size.yaxis.axis_label = "Number of PRs"
    
    show(p_size)

In [85]:
# 4.3 Time to Merge
url_merge = f"https://api.ossinsight.io/q/analyze-pull-request-open-to-merged?repoId={repo_id}"
res_merge = requests.get(url_merge).json()
df_merge = pd.DataFrame(res_merge['data'])

if not df_merge.empty:
    df_merge["date"] = pd.to_datetime(df_merge["event_month"])
    
    p_merge = figure(title="Median PR Merge Time (Hours)",
                x_axis_type="datetime", width=900, height=400,
                tools="pan,wheel_zoom,box_zoom,reset")
    
    p_merge.line(df_merge['date'], df_merge['p50'], line_width=2, color="navy", legend_label="Median Merge Time")
    p_merge.scatter(df_merge['date'], df_merge['p50'], size=4, color="navy")  # circle(size=...) is deprecated in recent Bokeh
    
    p_merge.add_tools(HoverTool(
        tooltips=[("Date", "@x{%F}"), ("Median Hours", "@y{0.0}")],
        formatters={"@x": "datetime"}
    ))
    
    p_merge.yaxis.axis_label = "Hours to Merge"
    
    show(p_merge)



In [86]:
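if df_merge.empty:  # no merged-PR history: avoid a KeyError on "p50"
    output.update(merge_time_1m=None, merge_time_6m=None, merge_time_12m=None)
else:  # mean of the monthly medians over the last 1, 6 and 12 rows
    output["merge_time_1m"] = df_merge["p50"].iloc[-1:].mean()
    output["merge_time_6m"] = df_merge["p50"].iloc[-6:].mean()
    output["merge_time_12m"] = df_merge["p50"].iloc[-12:].mean()
output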

{'stars': 271,
 'repo': 'epfl-dlab/aiflows',
 'pull_requests': 8,
 'pull_request_creators': 3,
 'merge_time_1m': np.float64(10.26638889),
 'merge_time_6m': np.float64(5.378333335),
 'merge_time_12m': np.float64(5.378333335)}
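
In the output above, `merge_time_6m` and `merge_time_12m` are identical. This is expected when the history has six rows or fewer, because slicing the last `n` rows returns every available row when fewer than `n` exist. The sketch below reproduces the effect with hypothetical values; `trailing_mean` is an illustrative helper, not a pandas function.

```python
def trailing_mean(values, n):
    """Mean of the last n values, or of all values when fewer than n exist."""
    window = values[-n:]
    return sum(window) / len(window) if window else None


# Hypothetical monthly p50 values (hours) for a repository with 3 rows of data.
monthly_p50 = [2.0, 4.0, 12.0]

print(trailing_mean(monthly_p50, 1))
print(trailing_mean(monthly_p50, 6))
print(trailing_mean(monthly_p50, 12))
```

The windows count rows, not calendar months. If the endpoint omits months without merged PRs, "last 6 rows" may cover more than half a year.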

# 5. ISSUE HEALTH

## CHAOSS Mapping
**Issue Resolution & Responsiveness**

Issues are the primary channel for user feedback and bug reports.
- **Response Time**: How fast does the community acknowledge a problem?
- **Resolution Rate**: Are issues being closed, or are they piling up?

### OSS Insight Endpoints
- `/q/analyze-repo-issue-overview`
- `/q/analyze-issue-open-to-first-responded`
- `/q/analyze-issue-opened-and-closed`

In [87]:
# 5.0 Issue Overview
url_issue_overview = f"https://api.ossinsight.io/q/analyze-repo-issue-overview?repoId={repo_id}"
res_issue_overview = requests.get(url_issue_overview).json()
data_issue_overview = res_issue_overview['data'][0] if res_issue_overview['data'] else {}

if data_issue_overview:
    print(f"Total Issues:       {data_issue_overview.get('issues')}")
    print(f"Issue Creators:     {data_issue_overview.get('issue_creators')}")
    print(f"Total Comments:     {data_issue_overview.get('issue_comments')}")
    print(f"Commenters:         {data_issue_overview.get('issue_commenters')}")
else:
    print("No Issue overview data found.")

output["issues"] = data_issue_overview.get('issues')
output["issue_creators"] = data_issue_overview.get('issue_creators')


Total Issues:       2
Issue Creators:     2
Total Comments:     11
Commenters:         5


In [88]:
# 5.1 Issue Response Time
url_issue_resp = f"https://api.ossinsight.io/q/analyze-issue-open-to-first-responded?repoId={repo_id}"
res_resp = requests.get(url_issue_resp).json()
df_issue_resp = pd.DataFrame(res_resp['data'])

if not df_issue_resp.empty:
    df_issue_resp["date"] = pd.to_datetime(df_issue_resp["event_month"])
    
    p_resp = figure(title="Median Issue Response Time (Hours)",
                x_axis_type="datetime", width=900, height=400,
                tools="pan,wheel_zoom,box_zoom,reset")
    
    p_resp.line(df_issue_resp['date'], df_issue_resp['p50'], line_width=2, color="firebrick", legend_label="Median Response Time")
    p_resp.scatter(df_issue_resp['date'], df_issue_resp['p50'], size=4, color="firebrick")  # circle(size=...) is deprecated in recent Bokeh
    
    p_resp.add_tools(HoverTool(
        tooltips=[("Date", "@x{%F}"), ("Median Hours", "@y{0.0}")],
        formatters={"@x": "datetime"}
    ))
    
    p_resp.yaxis.axis_label = "Hours to First Response"
    
    show(p_resp)



In [89]:
# 5.2 Issues Opened vs Closed
url_issue_oc = f"https://api.ossinsight.io/q/analyze-issue-opened-and-closed?repoId={repo_id}"
res_oc = requests.get(url_issue_oc).json()
df_issue_oc = pd.DataFrame(res_oc['data'])

if not df_issue_oc.empty:
    df_issue_oc["date"] = pd.to_datetime(df_issue_oc["event_month"])
    
    p_oc = figure(title="Issues Opened vs Closed",
                x_axis_type="datetime", width=900, height=400,
                tools="pan,wheel_zoom,box_zoom,reset")
    
    p_oc.line(df_issue_oc['date'], df_issue_oc['opened'], line_width=2, color="green", legend_label="Opened")
    p_oc.line(df_issue_oc['date'], df_issue_oc['closed'], line_width=2, color="red", legend_label="Closed")
    
    p_oc.legend.location = "top_left"
    p_oc.yaxis.axis_label = "Count"
    
    show(p_oc)

In [90]:
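# closed / (closed + opened): share of close events among all issue open/close events.
if df_issue_oc.empty:
    output["closed_issues_ratio"] = None
else:
    output["closed_issues_ratio"] = df_issue_oc["closed"].mean() / (df_issue_oc["closed"] + df_issue_oc["opened"]).mean()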
print(output)

#df_issue_oc

{'stars': 271, 'repo': 'epfl-dlab/aiflows', 'pull_requests': 8, 'pull_request_creators': 3, 'merge_time_1m': np.float64(10.26638889), 'merge_time_6m': np.float64(5.378333335), 'merge_time_12m': np.float64(5.378333335), 'issues': 2, 'issue_creators': 2, 'closed_issues_ratio': np.float64(0.6666666666666666)}
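
The stored `closed_issues_ratio` is closed / (closed + opened), which differs from the more familiar "closed per opened" figure. A minimal sketch with hypothetical monthly counts shows both side by side.

```python
# Hypothetical monthly issue counts in the shape of df_issue_oc.
monthly = [
    {"opened": 2, "closed": 1},
    {"opened": 1, "closed": 2},
    {"opened": 0, "closed": 3},
]

opened = sum(m["opened"] for m in monthly)
closed = sum(m["closed"] for m in monthly)

# What the notebook stores: close events as a share of all open/close events.
closed_share = closed / (closed + opened)

# A different, commonly used figure: closed issues per opened issue.
closure_rate = closed / opened if opened else float("nan")

print(f"closed share: {closed_share:.3f}, closure rate: {closure_rate:.2f}")
```

Under this definition, 0.5 means closes keep pace with opens, and values above 0.5 mean more issues were closed than opened in the observed window. Dividing the two means, as the notebook does, gives the same value as dividing the two sums, because both averages run over the same rows.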


# 6. RECENT TRENDING CONTRIBUTORS (Last 28 Days)

## CHAOSS Mapping  
- **Contribution Distribution**  
- **Bus Factor / Elephant Factor**  
- **Maintainer Overload Detection**  

It’s useful to see who the top contributors are and how contributions are distributed among individuals. If one or two people contribute the majority of code, the project might have a sustainability risk.
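
One way to quantify this risk is a bus-factor style count: the smallest number of contributors who together account for a given share of contributions. The sketch below assumes the common 50% threshold and uses hypothetical PR counts. `bus_factor` is an illustrative helper, and the exact CHAOSS definition should be checked before the number is reported.

```python
def bus_factor(contributions: dict, threshold: float = 0.5) -> int:
    """Smallest number of contributors whose combined share reaches the threshold."""
    total = sum(contributions.values())
    if total == 0:
        return 0
    covered = 0
    ranked = sorted(contributions.values(), reverse=True)
    for rank, count in enumerate(ranked, start=1):
        covered += count
        if covered / total >= threshold:
            return rank
    return len(contributions)


# Hypothetical PR counts per contributor login.
prs_by_author = {"alice": 60, "bob": 25, "carol": 10, "dave": 5}
print(bus_factor(prs_by_author))
```

A result of 1 means a single person already covers half of the activity. Higher values indicate that work is spread across more people.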

**Note:** This section shows **recent trending contributors** based on activity in the **last 28 days** compared to the previous period. It highlights currently active members rather than all-time top contributors.

### OSS Insight Endpoint  
`/q/analyze-people-code-pr-contribution-rank`
Lists recent trending contributors based on PR activity (last month vs previous month).

In [91]:
url_list = f"https://api.ossinsight.io/q/analyze-people-code-pr-contribution-rank?repoId={repo_id}&excludeBots=true"
res_list = requests.get(url_list).json()

df_top = pd.DataFrame(res_list['data'])
# Columns: actor_login, changes, is_new_contributor, last_month_events, last_2nd_month_events, proportion, row_num

if not df_top.empty:
    print("Top Trending Contributors (Last 28 Days):")
    display(df_top[['row_num', 'actor_login', 'last_month_events', 'proportion']].head(10))
else:
    print("No trending contributor data found.")

No trending contributor data found.


# 7. RECENT TRENDING ISSUE PARTICIPANTS (Last 28 Days)

## CHAOSS Mapping  
- **Issue Participants**  
- **Community Engagement**  
- **User Base Growth**  

Not all contributions are code – a thriving community also involves users who file bug reports or feature requests. The number of issue creators reflects community engagement from users and stakeholders of the project.

**Note:** This section shows **recent trending issue participants** (commenters) based on activity in the **last 28 days**.

### OSS Insight Endpoint  
`/q/analyze-people-issue-comment-contribution-rank`
Lists recent trending issue participants (commenters) based on activity.

In [92]:
url_ic = f"https://api.ossinsight.io/q/analyze-people-issue-comment-contribution-rank?repoId={repo_id}&excludeBots=true"
res_ic = requests.get(url_ic).json()

df_ic = pd.DataFrame(res_ic['data'])
# Columns: actor_login, changes, is_new_contributor, last_month_events, last_2nd_month_events, proportion, row_num

if not df_ic.empty:
    print("Top Trending Issue Participants (Last 28 Days):")
    display(df_ic[['row_num', 'actor_login', 'last_month_events', 'proportion']].head(10))
else:
    print("No trending issue participant data found.")

No trending issue participant data found.


# 8. CONTRIBUTOR GEOGRAPHY

## CHAOSS Mapping  
**Diversity & Inclusion – Geographic Diversity**

Diversity in geographic location is an important aspect of an inclusive open source community. A globally distributed contributor base suggests:
- higher inclusivity  
- resilience across time zones  
- broader adoption  

If all or most contributors hail from one region, the project might be missing perspectives from other parts of the world.

### OSS Insight Endpoint  
`/q/analyze-pull-request-creators-map`
Returns the count of PR creators by country. Note: This data relies on what users list in their GitHub profiles.
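
As a rough sketch of how the country counts translate into a concentration figure, the block below converts counts into shares and reports the largest one. The rows are hypothetical; only the column names `country_or_area` and `count` come from this notebook, and the country codes are placeholders.

```python
# Hypothetical rows using the column names the notebook reads from the endpoint.
rows = [
    {"country_or_area": "CH", "count": 6},
    {"country_or_area": "US", "count": 3},
    {"country_or_area": "IN", "count": 1},
]

total = sum(r["count"] for r in rows)
shares = {r["country_or_area"]: r["count"] / total for r in rows}
top_country, top_share = max(shares.items(), key=lambda kv: kv[1])

print(f"{len(shares)} countries; largest is {top_country} with {top_share:.0%}")
```

Because the location comes from self-reported profile fields, contributors without a location are not counted, so these shares describe only part of the contributor base.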

In [93]:
def get_geo_distribution(repo_id: int, metric_type: str = "pr_creators") -> pd.DataFrame:
    endpoints = {
        "pr_creators": f"https://api.ossinsight.io/q/analyze-pull-request-creators-map?repoId={repo_id}",
        "stargazers": f"https://api.ossinsight.io/q/analyze-stars-map?repoId={repo_id}&period=all_times",
        "issue_creators": f"https://api.ossinsight.io/q/analyze-issue-creators-map?repoId={repo_id}"
    }
    url = endpoints.get(metric_type)
    if not url: return pd.DataFrame()
    
    res = requests.get(url).json()
    return pd.DataFrame(res['data'])

df_geo = get_geo_distribution(repo_id, "pr_creators")
if not df_geo.empty:
    display(df_geo.head())

## Plot: Top 10 Contributor Countries


In [94]:
if not df_geo.empty:
    top10 = df_geo.head(10)
    countries = top10['country_or_area'].astype(str).tolist()
    counts = top10['count'].tolist()

    p_geo = figure(title="Top 10 Contributor Countries (PR Creators)",
                 x_range=countries, width=800, height=400,
                 tools="pan,wheel_zoom,box_zoom,reset")  # HoverTool is added explicitly below

    p_geo.vbar(x=countries, top=counts, width=0.6, color="teal")

    p_geo.xaxis.axis_label = "Country"
    p_geo.yaxis.axis_label = "Number of Contributors"
    p_geo.y_range.start = 0

    p_geo.add_tools(HoverTool(tooltips=[("Country", "@x"), ("Contributors", "@top")]))

    show(p_geo)

# 9. ORGANIZATIONAL AFFILIATION OF CONTRIBUTORS

## CHAOSS Mapping
- **Organizational Diversity**
- **Elephant Factor** (organization-level)
- **Corporate Governance Risk**

Another important metric is the distribution of contributions across organizations or employers. If one company employs the majority of contributors, the project might have an “Elephant Factor” risk – meaning the project heavily depends on one organization. On the other hand, participation from many organizations indicates healthy organizational diversity.

A project maintained by a single company can be healthy but might be less community-driven. A project with contributors from numerous organizations is more likely to have balanced governance and resilience if individuals leave.

### OSS Insight Endpoints
- `/q/analyze-pull-request-creators-company`
- `/q/analyze-issue-creators-company`
Returns organizations of PR creators and Issue creators. This data comes from GitHub user profiles.
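
The `proportion` column returned by these endpoints can be recomputed from the raw counts, which makes it easy to flag a dominant organization. The sketch below uses the two issue-creator rows that this notebook's run returned for epfl-dlab/aiflows. The 0.5 cutoff is an assumption for illustration.

```python
# Rows copied from this notebook's issue-creator output for epfl-dlab/aiflows.
issue_creators = {"plurigrid": 1, "google": 1}

total = sum(issue_creators.values())
proportions = {org: n / total for org, n in issue_creators.items()}

DOMINANCE_CUTOFF = 0.5  # assumed: flag an organization that exceeds half of creators
dominant = [org for org, share in proportions.items() if share > DOMINANCE_CUTOFF]

print(proportions)
print("Dominant organizations:", dominant or "none")
```

With only two issue creators who list a company, the sample is too small to support conclusions about organizational diversity. The same calculation becomes informative on larger repositories.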

In [95]:
# 9.1 PR Creators Company
print("--- PR Creators by Company ---")
url_org_pr = f"https://api.ossinsight.io/q/analyze-pull-request-creators-company?repoId={repo_id}"
res_org_pr = requests.get(url_org_pr).json()
df_org_pr = pd.DataFrame(res_org_pr['data'])
if not df_org_pr.empty:
    display(df_org_pr.head(10))

# 9.2 Issue Creators Company
print("\n--- Issue Creators by Company ---")
url_org_issue = f"https://api.ossinsight.io/q/analyze-issue-creators-company?repoId={repo_id}"
res_org_issue = requests.get(url_org_issue).json()
df_org_issue = pd.DataFrame(res_org_issue['data'])
if not df_org_issue.empty:
    display(df_org_issue.head(10))

--- PR Creators by Company ---

--- Issue Creators by Company ---


|   | company_name | issue_creators | proportion |
|---|--------------|----------------|------------|
| 0 | plurigrid    | 1              | 0.5        |
| 1 | google       | 1              | 0.5        |


# 10. Conclusions and Insights

This notebook provides a full CHAOSS-style analysis using **only OSS Insight data**, without GitHub API quotas or tokens.

### You have seen metrics for:
- Project attention (Stars)
- Code contributor community
- New contributor momentum
- User engagement (Issue creators)
- Geographic diversity
- Organizational ecosystem
- Contributor concentration (risk analysis)

### This forms the foundation for:
- Open source health dashboards  
- Research reports  
- FAIR software sustainability assessments  
- Governance analysis  
- Community management frameworks  

You may now:
- change the repository  
- extend the notebook  
- add GitHub API metrics (review time, commit frequency)  
- combine with GrimoireLab or Augur

In [96]:
import os

output_file = 'metrics.csv'
df = pd.DataFrame([output])
if os.path.exists(output_file):  # append to earlier runs; on the first run start a new table
    df = pd.concat([pd.read_csv(output_file), df], ignore_index=True)

cols = ['repo', 'stars', 'pull_requests', 'pull_request_creators',  'merge_time_12m', 'merge_time_6m', 'merge_time_1m', 'issues', 'issue_creators', 'closed_issues_ratio']
df[cols]

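|   | stars | repo | pull_requests | pull_request_creators | merge_time_1m | merge_time_6m | merge_time_12m | issues | issue_creators | closed_issues_ratio |
|---|-------|------|---------------|-----------------------|---------------|---------------|----------------|--------|----------------|---------------------|
| 0 | 14424 | scala/scala | 9927 | 752 | 0.044444 | 79.961343 | 116.260046 | 62 | 55 | 0.487395 |
| 1 | 5373 | DeepLabCut/DeepLabCut | 985 | 162 | 14.310556 | 518.629306 | 276.141852 | 1951 | 1122 | 0.497553 |
| 2 | 887 | arkworks-rs/snark | 265 | 51 | 0.278889 | 643.922361 | 359.888727 | 143 | 34 | 0.376344 |
| 3 | 499 | JuliaMolSim/DFTK.jl | 819 | 54 | 21.8075 | 75.516944 | 150.123657 | 268 | 42 | 0.438515 |
| 4 | 271 | epfl-dlab/aiflows | 8 | 3 | 10.266389 | 5.378333 | 5.378333 | 2 | 2 | 0.666667 |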


In [97]:
df.to_csv(output_file, index=False)
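
The last two cells read `metrics.csv`, append the current `output` row, and write the file back. As an alternative sketch that does not need pandas, the same append-or-create step can be written with the standard `csv` module. The helper `append_row` and the reduced column list are illustrative assumptions, and the demonstration writes to a temporary directory so no real file is touched.

```python
import csv
import os
import tempfile

FIELDS = ["repo", "stars", "pull_requests"]  # subset of the notebook's columns


def append_row(path: str, row: dict) -> None:
    """Append one row, writing the header only when the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
        if is_new:
            writer.writeheader()
        writer.writerow(row)


# Demonstration in a temporary directory; values taken from the table above.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metrics.csv")
    append_row(path, {"repo": "epfl-dlab/aiflows", "stars": 271, "pull_requests": 8})
    append_row(path, {"repo": "scala/scala", "stars": 14424, "pull_requests": 9927})
    with open(path, newline="", encoding="utf-8") as fh:
        saved = list(csv.DictReader(fh))

print(saved)
```

Either approach appends a new row on every run, so re-running the notebook for the same repository produces duplicate rows unless they are removed afterwards.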