
# Industrial Relevance Mapping – Analysis Notebook
_Last generated: 2025-10-24 07:46_

This notebook scores **open-source repos** on two heuristic axes, **industrial relevance** and **tech-core strength**, and plots one against the other.

## Files
- **Input CSV (default)**: `/mnt/data/repos_top10.csv`
- **Exports**: scatter plot to `/mnt/data/tech_vs_industry_scatter.png`

> You can replace the CSV with a larger dataset at any time. It must contain these columns:
> `full_name, language, stargazers_count, activity_from_cutoff, contributors_from_cutoff, issue_close_rate, pr_merge_rate, score`
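
Before swapping in a replacement file, it can help to confirm that its header has every required column. A minimal sketch, assuming a hypothetical helper `missing_columns` (not part of the notebook) that accepts any iterable of column names, such as `df.columns`:

```python
# Column names copied from the required-columns list above.
REQUIRED_COLUMNS = [
    "full_name", "language", "stargazers_count", "activity_from_cutoff",
    "contributors_from_cutoff", "issue_close_rate", "pr_merge_rate", "score",
]

def missing_columns(columns):
    """Return the required columns absent from `columns`, in required order."""
    present = set(columns)
    return [c for c in REQUIRED_COLUMNS if c not in present]

# Illustrative header with two columns left out on purpose.
example_header = ["full_name", "language", "stargazers_count",
                  "activity_from_cutoff", "contributors_from_cutoff", "score"]
print(missing_columns(example_header))
```

With the notebook's DataFrame, `missing_columns(df.columns)` should return an empty list before `compute_scores` is called.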


In [None]:

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

INPUT_CSV = "/mnt/data/repos_top10.csv"  # change to your CSV path
df = pd.read_csv(INPUT_CSV)
df.head()


In [None]:

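def compute_scores(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with normalized metrics and heuristic scores added."""
    df = df.copy()
    numeric_cols = [
        "stargazers_count","activity_from_cutoff","contributors_from_cutoff",
        "issue_close_rate","pr_merge_rate","score"
    ]
    # Coerce to numeric; values that cannot be parsed become NaN.
    for c in numeric_cols:
        df[c] = pd.to_numeric(df[c], errors="coerce")

    # Rescale each metric to [0, 1] so the weights below are comparable.
    # Note: "score" is coerced above but not used in the heuristics.
    scaler = MinMaxScaler()
    norm_cols = ["stargazers_count","activity_from_cutoff","contributors_from_cutoff","issue_close_rate","pr_merge_rate"]
    df[[f"norm_{c}" for c in norm_cols]] = scaler.fit_transform(df[norm_cols])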

    # ---- Heuristics (edit weights as needed) ----
    df["industrial_linkage_score"] = (
        0.35*df["norm_contributors_from_cutoff"]
        +0.20*df["norm_activity_from_cutoff"]
        +0.20*df["norm_issue_close_rate"]
        +0.15*df["norm_pr_merge_rate"]
        +0.10*df["norm_stargazers_count"]
    )

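    # Language prior: unlisted or missing languages fall back to 0.75,
    # while the literal string "Unknown" is mapped to 0.7.
    lang_prior = {"Python": 1.0, "C++": 0.9, "C#": 0.85, "Unknown": 0.7}
    df["lang_prior"] = df["language"].map(lang_prior).fillna(0.75)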
    df["tech_core_score"] = (
        0.45*df["norm_stargazers_count"]
        +0.20*df["norm_pr_merge_rate"]
        +0.20*df["norm_issue_close_rate"]
        +0.15*df["lang_prior"]
    )

    # Combined score: equal-weight average of the two axes.
    df["industrial_x_tech"] = 0.5*df["industrial_linkage_score"] + 0.5*df["tech_core_score"]
    return df

scored = compute_scores(df)
scored.sort_values("industrial_x_tech", ascending=False).head(10)
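
To make the weighting concrete, here is a small pure-Python sketch of the same two steps for `industrial_linkage_score`: min-max normalization followed by a weighted sum. The three repos and their metric values are made up for illustration, and `min_max` is a hypothetical stand-in for `MinMaxScaler`, not the notebook's code. Because the weights sum to 1 and every normalized input lies in [0, 1], the score also stays in [0, 1].

```python
# Weights copied from industrial_linkage_score in compute_scores.
WEIGHTS = {
    "contributors_from_cutoff": 0.35,
    "activity_from_cutoff": 0.20,
    "issue_close_rate": 0.20,
    "pr_merge_rate": 0.15,
    "stargazers_count": 0.10,
}

def min_max(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up metrics for three hypothetical repos (not real data).
repos = {
    "contributors_from_cutoff": [10, 50, 30],
    "activity_from_cutoff": [100, 300, 200],
    "issue_close_rate": [0.5, 0.9, 0.7],
    "pr_merge_rate": [0.4, 0.8, 0.6],
    "stargazers_count": [1000, 5000, 3000],
}

normalized = {name: min_max(vals) for name, vals in repos.items()}
linkage = [
    sum(WEIGHTS[name] * normalized[name][i] for name in WEIGHTS)
    for i in range(3)
]
print([round(s, 3) for s in linkage])
```

The repo that is lowest on every metric scores about 0, the one that is highest on every metric scores about 1, and the one sitting halfway on every metric scores about 0.5. This also shows that the scores are relative to the dataset: adding or removing repos changes each column's minimum and maximum, and with them every score.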


In [None]:

def plot_scatter(
    df_scored: pd.DataFrame,
    x="tech_core_score",
    y="industrial_linkage_score",
    label_col="full_name",
    title="Tech vs Industrial Map",
    out_png="/mnt/data/tech_vs_industry_scatter.png",
):
    """Scatter x against y, label each point, save the figure, and return its path."""
    plt.figure()
    plt.scatter(df_scored[x], df_scored[y])
    # Label every point with the repo name, offset slightly from the marker.
    for _, r in df_scored.iterrows():
        plt.annotate(r[label_col], (r[x], r[y]), xytext=(5,5), textcoords="offset points", fontsize=7)
    plt.xlabel(x)
    plt.ylabel(y)
    plt.title(title)
    plt.tight_layout()
    plt.savefig(out_png, dpi=160)
    plt.close()
    return out_png

out_path = plot_scatter(scored)
out_path


In [None]:

rank_by_ind = scored.sort_values("industrial_linkage_score", ascending=False)
rank_by_tech = scored.sort_values("tech_core_score", ascending=False)
rank_by_combo = scored.sort_values("industrial_x_tech", ascending=False)

rank_by_ind.head(10), rank_by_tech.head(10), rank_by_combo.head(10)



## Next Steps
- Replace the CSV with your expanded dataset (e.g., 100–500 repos).
- (Optional) Add columns like `topics`, `org_type_fork_share`, `corp_contributor_ratio` and extend the scoring.
- Export charts/tables for reports.
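
One way to extend the scoring with an extra column is to blend it into an existing score while keeping the result in [0, 1]. The sketch below is an assumption, not the notebook's method: `blend_extra_signal` is a hypothetical helper, the default weight of 0.2 is arbitrary, and the extra signal (for example `corp_contributor_ratio`) is assumed to be a ratio already in [0, 1].

```python
def blend_extra_signal(base_score, extra_signal, weight=0.2):
    """Mix an extra [0, 1] signal into an existing [0, 1] score.

    The base score is scaled by (1 - weight) so the result stays in [0, 1].
    Works on plain floats and, elementwise, on pandas Series.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return (1.0 - weight) * base_score + weight * extra_signal

# Made-up values: a linkage score of 0.6 and a corporate contributor ratio of 0.9.
print(round(blend_extra_signal(0.6, 0.9), 2))
```

In the notebook this might be applied as `blend_extra_signal(scored["industrial_linkage_score"], scored["corp_contributor_ratio"])`, once the new column exists in the CSV.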
