
# Rocket — Cutting‑Edge Matching & Team Formation (End‑to‑End)

This notebook implements a production‑minded prototype for **networking + team formation**, with:
- **Rich intake** (DOB→age, location & time zone, availability, energy, collaboration style, skills **have**/**want**, interests, role, seniority, years_exp, personality TIPI; free‑text sections for Human/Professional/Contributor/Interests/Reason).
- **Semantic embeddings** (Sentence‑Transformers if installed, TF‑IDF fallback) for content & skills.
- **Two skills modes**: **similar** (peer discovery) and **complementary** (Hungarian/coverage).
- **Hybrid matching**: content + skills + graph (Personalized PageRank) + collaborative filtering (CF) + personality + social‑fit (energy/collab/tz/availability), made symmetric with a **reciprocity** step.
- **Diversification**: **MMR** and **DPP‑greedy** options.
- **Team formation**: greedy submodular objective for coverage+compatibility+diversity under constraints.
- **Adaptive learning**: small **learning‑to‑rank** stub to update weights from accept/decline outcomes.
- **Demo cohort** (100 users) + utilities. The final cell shows the full profiles of the selected team members.


## Setup (optional installs if running locally)

In [1]:

# !pip install numpy pandas scikit-learn networkx geopy scipy
# !pip install sentence-transformers  # for SBERT embeddings
# !pip install spacy keybert pdfplumber python-docx  # for richer extraction (optional)
# !python -m spacy download en_core_web_sm


In [2]:

import importlib.util  # find_spec lives in the util submodule, which a bare `import importlib` may not load
import numpy as np, pandas as pd, random, math
from dataclasses import dataclass
from typing import List, Dict, Any, Optional, Tuple
from datetime import date, datetime
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from geopy.distance import geodesic
import networkx as nx

np.random.seed(101); random.seed(101)


## Embedding utilities — SBERT if available, TF‑IDF fallback

In [3]:

class Embedder:
    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        self.sbert_ok = False
        self.model = None
        if importlib.util.find_spec("sentence_transformers") is not None:
            try:
                from sentence_transformers import SentenceTransformer
                self.model = SentenceTransformer(model_name)
                self.sbert_ok = True
            except Exception:
                self.sbert_ok = False
        self.tfidf = None

    def fit(self, corpus: List[str]):
        if self.sbert_ok:
            return self
        self.tfidf = TfidfVectorizer(ngram_range=(1,2), min_df=1)
        self.tfidf.fit(corpus)
        return self

    def encode(self, items: List[str]) -> np.ndarray:
        if self.sbert_ok:
            return np.array(self.model.encode(items, show_progress_bar=False, normalize_embeddings=True))
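        # TF-IDF fallback: densify first, then L2-normalise each row so dot products are cosines.
        X = self.tfidf.transform(items).toarray().astype(np.float64)
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        norms[norms == 0] = 1.0  # leave all-zero rows at zero instead of dividing by zero
        return X / norms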

def cos_sim_mat(A: np.ndarray) -> np.ndarray:
    return (A @ A.T).astype(float)
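
Because both embedding paths return unit-length rows, a plain dot product already equals cosine similarity. The sketch below illustrates this with a hypothetical bag-of-words helper (`unit_bow` is not part of the notebook and stands in for the real embedder):

```python
import math
from collections import Counter

def unit_bow(text, vocab):
    """Bag-of-words count vector over `vocab`, scaled to unit L2 norm."""
    counts = Counter(text.lower().split())
    vec = [float(counts[t]) for t in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against empty text
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

docs = ["python nlp", "python nlp nlp", "figma branding"]
vocab = sorted({t for d in docs for t in d.split()})
vecs = [unit_bow(d, vocab) for d in docs]

sim_01 = dot(vecs[0], vecs[1])  # overlapping terms: high similarity
sim_02 = dot(vecs[0], vecs[2])  # no shared terms: zero similarity
print(round(sim_01, 4), sim_02)
```

The same reasoning is why `cos_sim_mat` can use `A @ A.T` without dividing by norms.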


## Utilities — DOB → Age / Bands

In [4]:

def parse_dob(dob_str: str) -> date:
    return datetime.strptime(dob_str, "%Y-%m-%d").date()

def compute_age(dob: date, today: Optional[date] = None) -> int:
    today = today or date.today()
    years = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return max(0, years)

def age_band(age: int) -> str:
    if age < 18: return "<18"  # under-18s must not fall through to "55+"
    for lo, hi in [(18,24),(25,34),(35,44),(45,54)]:
        if lo <= age <= hi: return f"{lo}-{hi}"
    return "55+"
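
The birthday adjustment in `compute_age` relies on tuple comparison of `(month, day)`. A small self-contained check with a fixed reference date (the dates are made up for illustration) shows the boundary behaviour:

```python
from datetime import date

def compute_age(dob: date, today: date) -> int:
    # Subtract one year if this year's birthday has not happened yet.
    before_birthday = (today.month, today.day) < (dob.month, dob.day)
    return max(0, today.year - dob.year - before_birthday)

reference = date(2025, 6, 15)  # fixed so the result is reproducible
age_before = compute_age(date(1990, 6, 16), reference)  # birthday is tomorrow
age_on = compute_age(date(1990, 6, 15), reference)      # birthday is today
print(age_before, age_on)
```

Passing `today` explicitly, as the notebook's signature allows, keeps tests independent of the wall clock.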


## Personality — TIPI (fast intake)

In [5]:

@dataclass
class BigFive:
    O: float; C: float; E: float; A: float; N: float

def clip01(x):
    return float(np.clip(x, 0.0, 1.0))

TIPI_KEY = {
    1: ("E", False), 2: ("A", True), 3: ("C", False), 4: ("N", False), 5: ("O", False),
    6: ("E", True),  7: ("A", False),8: ("C", True),  9: ("N", True),  10:("O", True)
}

def score_tipi(responses_1to7):
    assert len(responses_1to7)==10, "TIPI expects exactly 10 responses on a 1-7 scale"
    r = np.array(responses_1to7, dtype=float)
    r01 = (r-1)/6.0
    traits = {"O":[], "C":[], "E":[], "A":[], "N":[]}
    for i,val in enumerate(r01, start=1):
        trait, rev = TIPI_KEY[i]
        traits[trait].append(1.0-val if rev else val)
    return BigFive(*(clip01(np.mean(traits[t])) for t in ["O","C","E","A","N"]))

def bigfive_cosine(u: BigFive, v: BigFive) -> float:
    a = np.array([u.O,u.C,u.E,u.A,u.N])
    b = np.array([v.O,v.C,v.E,v.A,v.N])
    return float(a @ b / (np.linalg.norm(a)*np.linalg.norm(b) + 1e-9))
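
To make the reverse-keyed scoring concrete, here is a dependency-free sketch of the same procedure. It returns a plain dict rather than the `BigFive` dataclass, and the sample responses are invented:

```python
from statistics import mean

# (trait, reverse_scored) for TIPI items 1..10, mirroring TIPI_KEY above.
TIPI_KEY = {
    1: ("E", False), 2: ("A", True), 3: ("C", False), 4: ("N", False), 5: ("O", False),
    6: ("E", True), 7: ("A", False), 8: ("C", True), 9: ("N", True), 10: ("O", True),
}

def score_tipi(responses):
    if len(responses) != 10:
        raise ValueError("TIPI expects exactly 10 responses on a 1-7 scale")
    traits = {t: [] for t in "OCEAN"}
    for item, raw in enumerate(responses, start=1):
        val = (raw - 1) / 6.0            # rescale 1..7 to 0..1
        trait, rev = TIPI_KEY[item]
        traits[trait].append(1.0 - val if rev else val)
    return {t: mean(v) for t, v in traits.items()}

# Strongly extraverted and agreeable, neutral on everything else.
scores = score_tipi([7, 1, 4, 4, 4, 1, 7, 4, 4, 4])
print(scores)
```

Item 6 ("reserved") is reverse-keyed, so answering 1 there pushes extraversion up, not down.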


## Intake schema + normalizer (with 5 interview sections)

In [6]:

INTAKE_FIELDS = [
    "name","dob","location_city","location_country","lat","lon","tz_offset",
    "availability_hours","energy_1to5","collab_style",
    "role","seniority","years_exp",
    "skills_have","skills_want","interests",
    "human","professional","contributor","interests_long","reason"
]

def parse_comma_list(s: str) -> List[str]:
    return [x.strip() for x in (s or "").split(",") if x.strip()]

def normalize_intake(row: Dict[str, Any]) -> Dict[str, Any]:
    def wclip(t): 
        ws = (t or "").split()
        return " ".join(ws[:250])
    dob_str = row.get("dob","1989-01-01")
    try:
        dob = parse_dob(dob_str)
    except Exception:
        dob = date(1989,1,1); dob_str="1989-01-01"
    age_val = compute_age(dob)
    return {
        "name": row.get("name","Unnamed"),
        "dob": dob_str, "age": age_val, "age_band": age_band(age_val),
        "location_city": row.get("location_city",""), "location_country": row.get("location_country",""),
        "lat": float(row.get("lat", 43.6532)), "lon": float(row.get("lon", -79.3832)),
        "tz_offset": int(row.get("tz_offset", -5)),
        "availability_hours": row.get("availability_hours","5-10"),
        "energy_1to5": int(row.get("energy_1to5",3)),
        "collab_style": row.get("collab_style","hybrid"),
        "role": row.get("role","Undecided"),
        "seniority": row.get("seniority","Mid"),
        "years_exp": int(row.get("years_exp",3)),
        "skills_have": ", ".join(parse_comma_list(row.get("skills_have",""))[:24]),
        "skills_want": ", ".join(parse_comma_list(row.get("skills_want",row.get("interests","")))[:24]),
        "interests": ", ".join(parse_comma_list(row.get("interests",""))[:24]),
        "human": wclip(row.get("human","")),
        "professional": wclip(row.get("professional","")),
        "contributor": wclip(row.get("contributor","")),
        "interests_long": wclip(row.get("interests_long","")),
        "reason": wclip(row.get("reason","")),
    }


## Feature builders — content, geo, experience, role, social‑fit

In [7]:

def build_text_matrix(df: pd.DataFrame, embedder):
    corpus = (df['interests'].fillna('') + " ; " + df['skills_have'].fillna('') + " ; " + df['professional'].fillna('')).tolist()
    embedder.fit(corpus)
    X = embedder.encode(corpus)
    S = (X @ X.T)
    S = (S - S.min())/(S.max()-S.min()+1e-9)
    return S

def geo_similarity(df: pd.DataFrame, decay_km: float = 2500.0) -> np.ndarray:
    n = len(df); S = np.zeros((n,n), dtype=float)
    coords = list(zip(df['lat'], df['lon']))
    for i in range(n):
        for j in range(n):
            if i==j: continue
            d_km = geodesic(coords[i], coords[j]).km
            S[i,j] = np.exp(-d_km/decay_km)
    if S.max()>0: S = S/S.max()
    return S

def experience_compatibility(years: List[int], sweet_spot: float = 3.0) -> np.ndarray:
    years = np.array(years); n=len(years); S=np.zeros((n,n),dtype=float)
    for i in range(n):
        for j in range(n):
            if i==j: continue
            gap = abs(years[i]-years[j])
            S[i,j] = np.exp(-((gap-sweet_spot)**2)/(2*(sweet_spot**2)))
    if S.max()>0: S = S/S.max()
    return S

ROLE_COMP = {
    "Founder": {"Engineer": 1.0, "Designer": 1.0, "Researcher": 0.8, "Founder": 0.2, "Writer":0.6, "Scientist":0.7, "Creator":0.8},
    "Engineer": {"Founder": 1.0, "Designer": 0.7, "Engineer": 0.2, "Researcher": 0.6, "Writer":0.6, "Scientist":0.8, "Creator":0.7},
    "Designer": {"Founder": 1.0, "Engineer": 0.7, "Designer": 0.2, "Researcher": 0.5, "Writer":0.6, "Scientist":0.5, "Creator":0.9},
    "Researcher": {"Founder": 0.8, "Engineer": 0.7, "Designer": 0.5, "Researcher": 0.3, "Writer":0.5, "Scientist":0.9, "Creator":0.6},
    "Writer": {"Founder":0.8, "Engineer":0.6, "Designer":0.7, "Researcher":0.5, "Writer":0.2, "Scientist":0.5, "Creator":0.9},
    "Scientist":{"Founder":0.9, "Engineer":0.9, "Designer":0.5, "Researcher":0.8, "Writer":0.5, "Scientist":0.2, "Creator":0.6},
    "Creator":{"Founder":0.9, "Engineer":0.7, "Designer":0.9, "Researcher":0.6, "Writer":0.9, "Scientist":0.6, "Creator":0.3},
    "Undecided": {"Founder":0.6,"Engineer":0.6,"Designer":0.6,"Researcher":0.6,"Writer":0.6,"Scientist":0.6,"Creator":0.6,"Undecided":0.2}
}

def role_complementarity(df: pd.DataFrame) -> np.ndarray:
    roles = df['role'].tolist(); n=len(roles); S=np.zeros((n,n),dtype=float)
    for i in range(n):
        for j in range(n):
            if i==j: continue
            S[i,j] = ROLE_COMP.get(roles[i], {}).get(roles[j], 0.2)
    return S

def energy_compatibility(energies: List[int], target_gap=0):
    e = np.array(energies); n=len(e); S=np.zeros((n,n))
    for i in range(n):
        for j in range(n):
            if i==j: continue
            gap = abs(e[i]-e[j])
            S[i,j] = np.exp(-((gap-target_gap)**2)/(2*(1.25**2)))
    if S.max()>0: S = S/S.max()
    return S

COLLAB_COMP = {
    "async": {"async":1.0, "hybrid":0.7, "sync":0.3},
    "hybrid":{"async":0.7, "hybrid":1.0, "sync":0.7},
    "sync":  {"async":0.3, "hybrid":0.7, "sync":1.0},
}

def collab_style_compatibility(styles: List[str]) -> np.ndarray:
    n=len(styles); S=np.zeros((n,n))
    for i in range(n):
        for j in range(n):
            if i==j: continue
            S[i,j] = COLLAB_COMP.get(styles[i],{}).get(styles[j], 0.5)
    return S

def availability_overlap(avails: List[str]) -> np.ndarray:
    map_mid = {"2-5":3.5,"5-10":7.5,"10-20":15.0,"20+":25.0}
    v = np.array([map_mid.get(a,7.5) for a in avails])
    n=len(v); S=np.zeros((n,n))
    for i in range(n):
        for j in range(n):
            if i==j: continue
            gap = abs(v[i]-v[j])
            S[i,j] = np.exp(-gap/15.0)
    if S.max()>0: S = S/S.max()
    return S

def time_zone_overlap(tz_list: List[int]) -> np.ndarray:
    tz = np.array(tz_list); n=len(tz); S=np.zeros((n,n))
    for i in range(n):
        for j in range(n):
            if i==j: continue
            diff = abs(tz[i]-tz[j])
            S[i,j] = np.exp(-diff/6.0)
    if S.max()>0: S = S/S.max()
    return S

def combine_content(S_text, S_geo, S_exp, S_role, S_energy, S_collab, S_avail, S_tz, w=(0.26,0.10,0.08,0.10,0.12,0.12,0.11,0.11)):
    a,b,c,d,e,f,g,h = w
    S = a*S_text + b*S_geo + c*S_exp + d*S_role + e*S_energy + f*S_collab + g*S_avail + h*S_tz
    return S / (S.max() + 1e-9)
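
The experience kernel is a Gaussian centred on `sweet_spot`, so it rewards a gap of about three years and scores identical experience lower. A scalar sketch of the same formula (the function name `experience_kernel` is illustrative) makes the shape visible:

```python
import math

def experience_kernel(gap_years: float, sweet_spot: float = 3.0) -> float:
    # Gaussian bump centred on the preferred experience gap.
    return math.exp(-((gap_years - sweet_spot) ** 2) / (2 * sweet_spot ** 2))

values = {gap: experience_kernel(gap) for gap in (0, 3, 6, 12)}
for gap, val in values.items():
    print(f"gap={gap:>2} years -> {val:.4f}")
```

A gap of 0 and a gap of 6 score the same because the kernel is symmetric around the sweet spot. Set `sweet_spot` lower if peers with equal experience should rank highest.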


## Skills — similar vs complementary (Hungarian/coverage)

In [8]:

def parse_skill_list(sk: str) -> List[str]:
    return [s.strip().lower() for s in (sk or "").split(",") if s.strip()]

def tfidf_cosine(a_list: List[str], b_list: List[str]) -> float:
    docs = ["; ".join(a_list), "; ".join(b_list)]
    vec = TfidfVectorizer(ngram_range=(1,2), min_df=1)
    X = vec.fit_transform(docs)
    return float(cosine_similarity(X[0], X[1])[0,0])

def similar_skills_matrix(df: pd.DataFrame, embedder=None) -> np.ndarray:
    n=len(df); S=np.zeros((n,n))
    if embedder is not None and getattr(embedder, "sbert_ok", False):
        corpus = df['skills_have'].fillna('').tolist()
        X = embedder.encode(corpus)
        S = (X @ X.T)
        S = (S - S.min())/(S.max()-S.min()+1e-9)
        np.fill_diagonal(S, 0.0)
        return S
    parsed = [parse_skill_list(x) for x in df['skills_have'].fillna('')]
    for i in range(n):
        for j in range(n):
            if i==j: continue
            S[i,j] = tfidf_cosine(parsed[i], parsed[j])
    if S.max()>0: S = S/S.max()
    return S

def complementary_skills_matrix(df: pd.DataFrame) -> np.ndarray:
    wants = [parse_skill_list(row.get('skills_want', row.get('interests',''))) for _,row in df.iterrows()]
    haves = [parse_skill_list(row.get('skills_have','')) for _,row in df.iterrows()]
    n=len(df); S=np.zeros((n,n))
    try:
        from scipy.optimize import linear_sum_assignment
        for i in range(n):
            need = wants[i]
            for j in range(n):
                if i==j: continue
                have = haves[j]
                if not need or not have: 
                    S[i,j]=0.0; continue
                A = ["; ".join([n1]) for n1 in need]
                B = ["; ".join([h1]) for h1 in have]
                vec = TfidfVectorizer(ngram_range=(1,2), min_df=1)
                X = vec.fit_transform(A + B)
                m, k = len(need), len(have)
                Csim = np.zeros((m,k))
                for p in range(m):
                    for q in range(k):
                        Csim[p,q] = cosine_similarity(X[p], X[m+q])[0,0]
                size = max(m,k)
                padded = np.ones((size,size))
                padded[:m,:k] = 1.0 - Csim
                r_ind, c_ind = linear_sum_assignment(padded)
                total_sim = 0.0; count = 0
                for r,c in zip(r_ind, c_ind):
                    if r < m and c < k:
                        total_sim += 1.0 - padded[r,c]; count += 1
                S[i,j] = total_sim / (count + 1e-9)
        if S.max()>0: S = S/S.max()
    except Exception:
        for i in range(n):
            need = wants[i]
            for j in range(n):
                if i==j: continue
                have = haves[j]
                if not need or not have: 
                    S[i,j]=0.0; continue
                sims = []
                for nterm in need:
                    sims.append(max(tfidf_cosine([nterm], [h]) for h in have))
                S[i,j] = float(np.mean(sims)) if sims else 0.0
        if S.max()>0: S = S/S.max()
    return S
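
The complementary score is a one-to-one assignment between one user's wanted skills and another user's held skills. The sketch below shows the idea on a tiny case under two stated simplifications: token Jaccard overlap replaces TF-IDF cosine, and exhaustive search over permutations replaces `linear_sum_assignment` (only viable for very short lists). All names here are hypothetical.

```python
from itertools import permutations

def token_jaccard(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def assignment_score(needs, haves):
    """Mean similarity of the best one-to-one need/have matching."""
    m, k = len(needs), len(haves)
    if m == 0 or k == 0:
        return 0.0
    sim = [[token_jaccard(n, h) for h in haves] for n in needs]
    best = 0.0
    if m <= k:   # assign every need to a distinct have
        for cols in permutations(range(k), m):
            best = max(best, sum(sim[r][c] for r, c in enumerate(cols)))
    else:        # more needs than haves: assign every have to a distinct need
        for rows in permutations(range(m), k):
            best = max(best, sum(sim[r][c] for c, r in enumerate(rows)))
    return best / min(m, k)  # average over matched pairs, as in the notebook

score = assignment_score(["react", "design systems"],
                         ["react native", "figma", "design tokens"])
print(round(score, 4))
```

Averaging over matched pairs means unmet needs do not lower the score. Dividing by `len(needs)` instead would turn it into a stricter coverage measure.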


## Graph (PPR) + CF + Personality + Fusion (+ reciprocity)

In [9]:

def reciprocalize(S: np.ndarray) -> np.ndarray:
    return np.sqrt(S * S.T + 1e-12)

def fuse_scores(S_content, S_cf, S_graph, S_person, S_skills, weights=(0.30,0.18,0.14,0.12,0.26)):
    Sc = reciprocalize(S_content)
    Sf = reciprocalize(S_cf)
    Sg = reciprocalize(S_graph)
    Sp = reciprocalize(S_person)
    Ss = reciprocalize(S_skills)
    a,b,c,d,e = weights
    S = a*Sc + b*Sf + c*Sg + d*Sp + e*Ss
    return S / (S.max() + 1e-12)
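
`reciprocalize` takes the geometric mean of the two directed scores, which penalises one-sided interest more than an arithmetic mean would. A scalar sketch (without the small epsilon the notebook adds for numerical safety):

```python
import math

def reciprocal(s_uv: float, s_vu: float) -> float:
    # Geometric mean: high only when both directions are high.
    return math.sqrt(s_uv * s_vu)

one_sided = reciprocal(0.9, 0.1)      # u likes v, v is lukewarm
mutual = reciprocal(0.5, 0.5)         # moderate interest both ways
arithmetic = (0.9 + 0.1) / 2          # what a plain average would give
print(round(one_sided, 3), mutual, arithmetic)
```

Both pairs average 0.5 arithmetically, but the one-sided pair drops to about 0.3 under the geometric mean.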


## Diversification — MMR and DPP‑greedy

In [10]:

def mmr_rank(query_idx: int, S: np.ndarray, K: int = 5, lambda_rel: float = 0.7):
    n = S.shape[0]
    candidates = [i for i in range(n) if i != query_idx]
    selected = []
    while candidates and len(selected) < K:
        if not selected:
            i = max(candidates, key=lambda j: S[query_idx, j])
            selected.append(i); candidates.remove(i)
        else:
            def score(j):
                redundancy = max(S[j, s] for s in selected) if selected else 0.0
                return lambda_rel * S[query_idx, j] - (1-lambda_rel) * redundancy
            i = max(candidates, key=score)
            selected.append(i); candidates.remove(i)
    return selected

def dpp_greedy(query_idx: int, S: np.ndarray, K: int = 5, quality: Optional[np.ndarray] = None):
    # Greedy quality-minus-redundancy heuristic in the spirit of DPP selection.
    # It does not compute determinants, so treat it as an approximation.
    n = S.shape[0]
    if quality is None:
        quality = S[query_idx].copy()
    # q is indexed by user id (same index space as S), so look it up with the id directly.
    q = quality / (quality.max() + 1e-9)
    selected = []
    remaining = [i for i in range(n) if i != query_idx]
    while remaining and len(selected) < K:
        def score(r):
            # score = q_r - max similarity to anything already selected
            max_sim = max((S[r, s] for s in selected), default=0.0)
            return q[r] - max_sim
        chosen = max(remaining, key=score)
        selected.append(chosen)
        remaining.remove(chosen)
    return selected
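
A four-item toy case shows what MMR buys over plain relevance ranking. The similarity values are invented: items 1 and 2 are near-duplicates that both match the query (item 0) well, while item 3 is less relevant but different.

```python
def mmr(query, sim, k, lambda_rel=0.7):
    candidates = [i for i in range(len(sim)) if i != query]
    selected = []
    while candidates and len(selected) < k:
        def score(j):
            # Redundancy is the closest similarity to anything already picked.
            redundancy = max((sim[j][s] for s in selected), default=0.0)
            return lambda_rel * sim[query][j] - (1 - lambda_rel) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

sim = [
    [1.00, 0.90, 0.85, 0.60],
    [0.90, 1.00, 0.95, 0.20],
    [0.85, 0.95, 1.00, 0.20],
    [0.60, 0.20, 0.20, 1.00],
]
diverse = mmr(0, sim, k=2, lambda_rel=0.7)
relevance_only = mmr(0, sim, k=2, lambda_rel=1.0)
print(diverse, relevance_only)
```

With `lambda_rel=0.7` the second slot goes to item 3 instead of the near-duplicate item 2. Raising `lambda_rel` toward 1.0 recovers plain top-K ranking.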


## Team formation — greedy submodular objective

In [11]:

def team_score(set_ids: List[int], query_idx: int, S_final: np.ndarray, skills_need: List[str], users_df: pd.DataFrame):
    if not set_ids: return 0.0
    rel = np.mean([S_final[query_idx, j] for j in set_ids])
    need = set([s.strip().lower() for s in skills_need if s.strip()])
    have = set()
    for j in set_ids:
        have |= set([s.strip().lower() for s in users_df.iloc[j].skills_have.split(",") if s.strip()])
    coverage = len(need & have) / (len(need) + 1e-9)
    if len(set_ids) > 1:
        pair_sims = []
        for a in range(len(set_ids)):
            for b in range(a+1, len(set_ids)):
                pair_sims.append(S_final[set_ids[a], set_ids[b]])
        div = 1.0 - float(np.mean(pair_sims))
    else:
        div = 1.0
    return 0.5*rel + 0.35*coverage + 0.15*div

def form_team(query_idx: int, S_final: np.ndarray, users_df: pd.DataFrame, K: int, skills_need: List[str], constraints: Optional[Dict[str, Any]] = None):
    n = S_final.shape[0]
    candidates = [i for i in range(n) if i != query_idx]
    selected = []
    def feasible(j):
        if not constraints: return True
        u = users_df.iloc[query_idx]; v = users_df.iloc[j]
        if constraints.get("max_tz_diff") is not None:
            if abs(int(u.tz_offset) - int(v.tz_offset)) > constraints["max_tz_diff"]:
                return False
        if constraints.get("min_avail_mid") is not None:
            map_mid = {"2-5":3.5,"5-10":7.5,"10-20":15.0,"20+":25.0}
            if map_mid.get(v.availability_hours, 0) < constraints["min_avail_mid"]:
                return False
        if constraints.get("allowed_styles"):
            if v.collab_style not in constraints["allowed_styles"]:
                return False
        return True
    while candidates and len(selected) < K:
        best_j, best_gain = None, -1
        base = team_score(selected, query_idx, S_final, skills_need, users_df)
        for j in candidates:
            if not feasible(j): 
                continue
            gain = team_score(selected+[j], query_idx, S_final, skills_need, users_df) - base
            if gain > best_gain:
                best_gain, best_j = gain, j
        if best_j is None:
            break
        selected.append(best_j)
        candidates.remove(best_j)
    return selected
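
The coverage term is what pulls complementary people onto a team. The sketch below isolates it: a greedy pick by marginal skill coverage only, with no relevance or diversity terms, so it is a simplification of `team_score` and not a replacement. Candidate names and skills are made up.

```python
def greedy_cover(need, candidates, k):
    """Pick up to k candidates, each time maximising newly covered skills."""
    need = set(need)
    covered, team = set(), []
    pool = dict(candidates)
    while pool and len(team) < k:
        name = max(pool, key=lambda c: len((pool[c] & need) - covered))
        gain = (pool[name] & need) - covered
        if not gain:          # nobody adds a new needed skill: stop early
            break
        team.append(name)
        covered |= gain
        del pool[name]
    return team, covered

candidates = {
    "A": {"react", "product"},
    "B": {"react", "seo"},
    "C": {"branding", "growth", "sql"},
    "D": {"growth"},
}
team, covered = greedy_cover(["react", "product", "branding", "growth"], candidates, k=3)
print(team, sorted(covered))
```

Two people cover all four needs here, so the loop stops before reaching `k`. The notebook's `form_team` keeps adding members because relevance and diversity still contribute gain.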


## Adaptive learning — simple LTR stub (learn blend weights)

In [12]:

from sklearn.linear_model import SGDRegressor

def learn_weights(feature_matrix: np.ndarray, labels: np.ndarray):
    model = SGDRegressor(loss="squared_error", alpha=1e-4, max_iter=2000, tol=1e-4, learning_rate="optimal")
    model.fit(feature_matrix, labels)
    w = model.coef_
    w = np.clip(w, 1e-6, None)
    w = w / (w.sum() + 1e-9)
    return w
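
The clip-and-normalise step at the end of `learn_weights` floors negative coefficients at a tiny positive value, so a fit with only one positive coefficient collapses the blend onto that single signal. A plain-Python sketch with invented coefficients:

```python
def to_blend_weights(coefs, floor=1e-6):
    # Negative or zero coefficients are floored, then everything is rescaled to sum to 1.
    clipped = [max(c, floor) for c in coefs]
    total = sum(clipped)
    return [c / total for c in clipped]

raw = [-0.2, -0.1, 0.3, -0.05, -0.4]   # hypothetical regression coefficients
weights = to_blend_weights(raw)
print([f"{w:.6f}" for w in weights])
```

With few labelled outcomes this collapse is likely, so it may be worth blending learned weights with the hand-set defaults rather than replacing them outright.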


## Generate 100 synthetic users for testing

In [13]:

roles = ["Founder","Engineer","Designer","Researcher","Writer","Scientist","Creator"]
seniorities = ["Junior","Mid","Senior","Lead/Principal","Executive/Founder"]
cities = [
    ("Toronto",43.6532,-79.3832,-5),("New York",40.7128,-74.0060,-5),("San Francisco",37.7749,-122.4194,-8),
    ("London",51.5072,-0.1276,0),("Berlin",52.52,13.405,1),("Nairobi",-1.286389,36.817223,3),
    ("Sydney",-33.8688,151.2093,10),("Bangalore",12.9716,77.5946,5),("Paris",48.8566,2.3522,1),("Mexico City",19.4326,-99.1332,-6)
]
skill_bank = [
    "python","pytorch","tensorflow","django","react","nextjs","go","kubernetes","aws","gcp",
    "video editing","storyboarding","scriptwriting","podcasting","seo","branding","figma","design systems",
    "statistics","causal inference","nlp","cv","prompt engineering","sql","dbt","airflow",
    "grant writing","field research","lab techniques","oceanography","genomics","biostatistics",
    "supply chain","marketing","growth","product","fundraising","strategy"
]
interest_bank = [
    "ocean conservation","coral reef restoration","climate tech","educational apps","healthcare AI",
    "creator economy","open source tools","social impact","rural connectivity","financial inclusion",
    "short-form video","long-form YouTube","beauty brand","lipstick R&D","fashion sustainability",
    "music production","publishing","newsletter growth","sports analytics","mental health",
    "language learning","VR social spaces","next social network","privacy-first messaging"
]

def rand_words(pool, kmin, kmax):
    k = random.randint(kmin, kmax)
    return ", ".join(random.sample(pool, k))

def random_dob():
    y = random.randint(1961, 2004)
    m = random.randint(1,12); d = random.randint(1,28)
    return f"{y:04d}-{m:02d}-{d:02d}"

def mk_user(i):
    name = f"User{i:03d}"
    (city, lat, lon, tz) = random.choice(cities)
    role = random.choice(roles)
    seniority = random.choice(seniorities)
    skills_have = rand_words(skill_bank, 3, 7)
    skills_want = rand_words(skill_bank, 2, 5)
    interests = rand_words(interest_bank, 3, 7)
    years = random.randint(1, 18)
    human = f"I live in {city}. I value calm schedules and async collaboration; I like running, filming short videos, and cooking. Pets: none."
    professional = f"As a {role.lower()} with {years} years, I worked across startups and labs. I can produce prototypes, brand systems, docs, and production code."
    contributor = "I prefer weekly demos and short design docs. I bring reliability, curiosity, and momentum to small teams with clear ownership."
    interests_long = f"Goals: {random.choice(['launch a YouTube channel on ML','build ocean microplastics sensors','start a cruelty-free lipstick brand','prototype a privacy-first social app'])}."
    reason = random.choice(["Find projects","Expand network","Find collaborators","Build a dream"])
    row = dict(
        name=name, dob=random_dob(), location_city=city, location_country="",
        lat=lat, lon=lon, tz_offset=tz, availability_hours=random.choice(["2-5","5-10","10-20","20+"]),
        energy_1to5=random.randint(1,5), collab_style=random.choice(["async","hybrid","sync"]),
        role=role, seniority=seniority, years_exp=years,
        skills_have=skills_have, skills_want=skills_want, interests=interests,
        human=human, professional=professional, contributor=contributor, interests_long=interests_long, reason=reason
    )
    return normalize_intake(row)

records = [mk_user(i) for i in range(1,101)]
tipi_all = [[random.randint(2,6) for _ in range(10)] for __ in range(100)]
bfs = [score_tipi(t) for t in tipi_all]
users = pd.DataFrame(records)
users['bf'] = bfs
users.head(3)


Unnamed: 0,name,dob,age,age_band,location_city,location_country,lat,lon,tz_offset,availability_hours,...,years_exp,skills_have,skills_want,interests,human,professional,contributor,interests_long,reason,bf
0,User001,1981-04-21,44,35-44,Mexico City,,19.4326,-99.1332,-6,20+,...,12,"prompt engineering, oceanography, django, supp...","podcasting, cv, lab techniques, react, figma","creator economy, climate tech, fashion sustain...",I live in Mexico City. I value calm schedules ...,"As a creator with 12 years, I worked across st...",I prefer weekly demos and short design docs. I...,Goals: build ocean microplastics sensors.,Build a dream,"BigFive(O=0.4166666666666667, C=0.583333333333..."
1,User002,1984-12-12,40,35-44,New York,,40.7128,-74.006,-5,2-5,...,4,"podcasting, kubernetes, cv, design systems, ai...","seo, field research, react, airflow, grant wri...","fashion sustainability, open source tools, new...",I live in New York. I value calm schedules and...,"As a researcher with 4 years, I worked across ...",I prefer weekly demos and short design docs. I...,Goals: start a cruelty-free lipstick brand.,Find projects,"BigFive(O=0.5833333333333334, C=0.333333333333..."
2,User003,2004-12-20,20,18-24,San Francisco,,37.7749,-122.4194,-8,20+,...,12,"design systems, react, gcp, airflow","fundraising, genomics, video editing, biostati...","financial inclusion, creator economy, publishi...",I live in San Francisco. I value calm schedule...,"As a founder with 12 years, I worked across st...",I prefer weekly demos and short design docs. I...,Goals: prototype a privacy-first social app.,Find projects,"BigFive(O=0.33333333333333337, C=0.24999999999..."


## Build similarity signals

In [14]:

embedder = Embedder()
S_text = build_text_matrix(users, embedder)
S_geo  = geo_similarity(users)
S_exp  = experience_compatibility(users['years_exp'].tolist())
S_role = role_complementarity(users)
S_energy = energy_compatibility(users['energy_1to5'].tolist())
S_collab = collab_style_compatibility(users['collab_style'].tolist())
S_avail  = availability_overlap(users['availability_hours'].tolist())
S_tz     = time_zone_overlap(users['tz_offset'].tolist())
S_content = combine_content(S_text, S_geo, S_exp, S_role, S_energy, S_collab, S_avail, S_tz)

S_skills_sim = similar_skills_matrix(users, embedder if getattr(embedder, "sbert_ok", False) else None)
S_skills_comp = complementary_skills_matrix(users)

# Simulated interaction matrix for the demo: R[u, v] = 1 means user u accepted user v.
n = len(users); R = np.zeros((n,n), dtype=float)
for _ in range(600):
    u = random.randrange(n); v = random.randrange(n)
    if u==v: continue
    if users.iloc[u].role=="Founder" and users.iloc[v].role in ["Engineer","Designer"]: R[u,v]=1.0
    elif users.iloc[u].role=="Creator" and users.iloc[v].role in ["Writer","Designer","Engineer"]: R[u,v]=1.0
    elif random.random() < 0.05: R[u,v]=1.0
# CF signal: two users are similar when the same people accepted them (columns of R).
S_cf = cosine_similarity(R.T); S_cf = (S_cf - S_cf.min())/(S_cf.max()-S_cf.min()+1e-9)

# Graph signal: one personalized PageRank run per user over the accept graph.
G = nx.DiGraph(); G.add_nodes_from(range(n))
edges = [(u,v) for u in range(n) for v in range(n) if R[u,v]>0]
G.add_edges_from(edges)
S_graph = np.zeros((n,n))
for u in range(n):
    # All teleport mass goes to u, so scores measure proximity to u.
    personalization = {k:(1.0 if k==u else 0.0) for k in range(n)}
    pr = nx.pagerank(G, alpha=0.8, personalization=personalization)
    for v,s in pr.items():
        S_graph[u, v] = s
S_graph = (S_graph - S_graph.min())/(S_graph.max()-S_graph.min()+1e-12)

S_person = np.zeros((n,n))
for i in range(n):
    for j in range(n):
        if i==j: continue
        S_person[i,j] = bigfive_cosine(users.iloc[i].bf, users.iloc[j].bf)
S_person = (S_person - S_person.min())/(S_person.max()-S_person.min()+1e-9)
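
The personalized PageRank signal can be illustrated without networkx. This is a minimal power-iteration sketch on a hypothetical four-node graph. It assumes that mass from dangling nodes (no outgoing edges) returns to the source, which I believe matches networkx's default when a personalization vector is supplied.

```python
def personalized_pagerank(adj, source, alpha=0.8, iters=100):
    """adj[u] lists the nodes u points to; all teleport mass goes to `source`."""
    n = len(adj)
    rank = [1.0 / n] * n
    for _ in range(iters):
        nxt = [0.0] * n
        for u, outs in enumerate(adj):
            if outs:
                share = alpha * rank[u] / len(outs)
                for v in outs:
                    nxt[v] += share
            else:
                nxt[source] += alpha * rank[u]  # dangling node: return mass to source
        nxt[source] += 1.0 - alpha              # teleport step
        rank = nxt
    return rank

# Edges: 0 -> 1 -> 2, and 3 -> 0. Node 3 is unreachable from node 0.
adj = [[1], [2], [], [0]]
rank = personalized_pagerank(adj, source=0)
print([round(r, 4) for r in rank])
```

Scores decay with distance from the source, and the unreachable node ends at zero, which is the behaviour the graph signal relies on.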




## Top‑K recommendations (skills_mode + diversifier)

In [15]:

def top_k_for(query_idx: int, skills_mode="similar", k=5, diversifier="mmr", weights=(0.30,0.18,0.14,0.12,0.26)):
    S_sk = S_skills_sim if skills_mode=="similar" else S_skills_comp
    S_final = fuse_scores(S_content, S_cf, S_graph, S_person, S_skills=S_sk, weights=weights)
    if diversifier=="mmr":
        picks = mmr_rank(query_idx, S_final, K=k, lambda_rel=0.72)
    elif diversifier=="dpp":
        picks = dpp_greedy(query_idx, S_final, K=k)
    else:
        scores = list(enumerate(S_final[query_idx]))
        scores = [(j,s) for j,s in scores if j!=query_idx]
        picks = [j for j,_ in sorted(scores, key=lambda x: -x[1])[:k]]
    cols = ['name','role','seniority','interests','skills_have','skills_want','years_exp','age','age_band','energy_1to5','collab_style','availability_hours','reason']
    out = users.iloc[picks][cols].copy()
    out['score'] = [S_final[query_idx,j] for j in picks]
    return out

demo_sim = top_k_for(0, "similar", k=5, diversifier="mmr")
demo_comp = top_k_for(0, "complementary", k=5, diversifier="dpp")
demo_sim, demo_comp


Unnamed: 0,name,role,seniority,interests,skills_have,...
91,User092,Writer,Junior,"language learning, creator economy, financial ...","podcasting, tensorflow, cv, supply chain, prod...",...
76,User077,Creator,Junior,"VR social spaces, ocean conservation, open sou...","branding, design systems, django, aws, nlp",...
88,User089,Scientist,Executive/Founder,"rural connectivity, mental health, creator eco...","biostatistics, product, field research, django",...
39,User040,Writer,Mid,"beauty brand, privacy-first messaging, ocean c...","python, growth, branding, statistics, fundrais...",...
29,User030,Engineer,Junior,"language learning, newsletter growth, privacy-...","figma, biostatistics, react, seo, oceanography",...

## Team formation demo

In [20]:

def build_team_for(query_idx: int, skills_need_text: str, K: int = 4):
    need = [s.strip() for s in skills_need_text.split(",") if s.strip()]
    S_sk = S_skills_comp
    S_final = fuse_scores(S_content, S_cf, S_graph, S_person, S_sk)
    team_ids = form_team(query_idx, S_final, users, K=K, skills_need=need, constraints={"max_tz_diff":6, "min_avail_mid":5.0, "allowed_styles":{"async","hybrid"}})
    cols = ['name','role','seniority','skills_have','years_exp','tz_offset','availability_hours','collab_style']
    df = users.iloc[team_ids][cols].copy()
    df['match_score'] = [S_final[query_idx,j] for j in team_ids]
    return df

team_example = build_team_for(0, "react, product, branding, growth", K=10)
team_example


Unnamed: 0,name,role,seniority,skills_have,years_exp,tz_offset,availability_hours,collab_style,match_score
50,User051,Founder,Executive/Founder,"scriptwriting, grant writing, nextjs, product,...",12,-6,20+,hybrid,0.629111
65,User066,Researcher,Executive/Founder,"branding, nextjs, cv, pytorch, react, statistics",3,-5,5-10,hybrid,0.455585
17,User018,Designer,Executive/Founder,"storyboarding, podcasting, product",5,-6,20+,hybrid,0.756776
46,User047,Researcher,Executive/Founder,"cv, tensorflow, sql",6,-6,20+,async,0.752282
76,User077,Creator,Junior,"branding, design systems, django, aws, nlp",15,-5,10-20,async,0.619448
67,User068,Engineer,Mid,"podcasting, nextjs, strategy, biostatistics, d...",11,-5,10-20,hybrid,0.657503
84,User085,Designer,Mid,"podcasting, grant writing, biostatistics, go, ...",13,-5,5-10,hybrid,0.649519
29,User030,Engineer,Junior,"figma, biostatistics, react, seo, oceanography",14,-5,20+,hybrid,0.593065
49,User050,Founder,Lead/Principal,"sql, go, tensorflow",9,-5,5-10,hybrid,0.59003
88,User089,Scientist,Executive/Founder,"biostatistics, product, field research, django",8,-6,10-20,hybrid,0.578176


## Adaptive learning — weight update demo

In [17]:

from sklearn.linear_model import SGDRegressor
def signal_breakdown(query_idx: int, picked_idx: List[int], skills_mode="similar"):
    S_sk = S_skills_sim if skills_mode=="similar" else S_skills_comp
    signals = []
    for j in picked_idx:
        signals.append([S_content[query_idx,j], S_cf[query_idx,j], S_graph[query_idx,j], S_person[query_idx,j], S_sk[query_idx,j]])
    return np.array(signals)

q = 0
picks_idx = top_k_for(q, "similar", k=5, diversifier="mmr").index.tolist()
X = signal_breakdown(q, picks_idx, "similar")
y = np.array([1,1,0,0,0], dtype=float)
# Reuse learn_weights (same SGD fit, clip, and normalisation). Five labelled pairs cannot
# pin down five weights, so treat the result as a smoke test rather than a tuned blend.
w = learn_weights(X, y)
w


array([1.27763537e-15, 1.27763537e-15, 1.00000000e+00, 1.27763537e-15,
       1.27763537e-15])

## Final — full profiles of the selected team members

In [21]:
users.loc[users['name'].isin(team_example['name'])]

Unnamed: 0,name,dob,age,age_band,location_city,location_country,lat,lon,tz_offset,availability_hours,...,years_exp,skills_have,skills_want,interests,human,professional,contributor,interests_long,reason,bf
17,User018,1972-01-16,53,45-54,Mexico City,,19.4326,-99.1332,-6,20+,...,5,"storyboarding, podcasting, product","python, storyboarding, django, dbt","next social network, publishing, open source t...",I live in Mexico City. I value calm schedules ...,"As a designer with 5 years, I worked across st...",I prefer weekly demos and short design docs. I...,Goals: build ocean microplastics sensors.,Find collaborators,"BigFive(O=0.4166666666666667, C=0.5, E=0.5, A=..."
29,User030,1976-06-28,49,45-54,New York,,40.7128,-74.006,-5,20+,...,14,"figma, biostatistics, react, seo, oceanography","gcp, scriptwriting, design systems, growth, st...","language learning, newsletter growth, privacy-...",I live in New York. I value calm schedules and...,"As a engineer with 14 years, I worked across s...",I prefer weekly demos and short design docs. I...,Goals: prototype a privacy-first social app.,Find projects,"BigFive(O=0.5833333333333334, C=0.249999999999..."
46,User047,1962-01-20,63,55+,Mexico City,,19.4326,-99.1332,-6,20+,...,6,"cv, tensorflow, sql","supply chain, marketing, biostatistics, cv, ne...","newsletter growth, climate tech, music product...",I live in Mexico City. I value calm schedules ...,"As a researcher with 6 years, I worked across ...",I prefer weekly demos and short design docs. I...,Goals: launch a YouTube channel on ML.,Build a dream,"BigFive(O=0.41666666666666663, C=0.41666666666..."
49,User050,1976-05-16,49,45-54,New York,,40.7128,-74.006,-5,5-10,...,9,"sql, go, tensorflow","fundraising, product, grant writing","beauty brand, ocean conservation, newsletter g...",I live in New York. I value calm schedules and...,"As a founder with 9 years, I worked across sta...",I prefer weekly demos and short design docs. I...,Goals: prototype a privacy-first social app.,Expand network,"BigFive(O=0.33333333333333337, C=0.33333333333..."
50,User051,1972-11-01,52,45-54,Mexico City,,19.4326,-99.1332,-6,20+,...,12,"scriptwriting, grant writing, nextjs, product,...","marketing, cv","rural connectivity, educational apps, newslett...",I live in Mexico City. I value calm schedules ...,"As a founder with 12 years, I worked across st...",I prefer weekly demos and short design docs. I...,Goals: start a cruelty-free lipstick brand.,Find projects,"BigFive(O=0.6666666666666667, C=0.583333333333..."
65,User066,1998-08-04,27,25-34,New York,,40.7128,-74.006,-5,5-10,...,3,"branding, nextjs, cv, pytorch, react, statistics","react, growth, strategy","open source tools, social impact, rural connec...",I live in New York. I value calm schedules and...,"As a researcher with 3 years, I worked across ...",I prefer weekly demos and short design docs. I...,Goals: build ocean microplastics sensors.,Find projects,"BigFive(O=0.24999999999999997, C=0.75, E=0.5, ..."
67,User068,1992-01-24,33,25-34,Toronto,,43.6532,-79.3832,-5,10-20,...,11,"podcasting, nextjs, strategy, biostatistics, d...","aws, branding, django","climate tech, creator economy, language learni...",I live in Toronto. I value calm schedules and ...,"As a engineer with 11 years, I worked across s...",I prefer weekly demos and short design docs. I...,Goals: prototype a privacy-first social app.,Find collaborators,"BigFive(O=0.5, C=0.8333333333333334, E=0.66666..."
76,User077,2001-01-01,24,18-24,Toronto,,43.6532,-79.3832,-5,10-20,...,15,"branding, design systems, django, aws, nlp","tensorflow, growth, branding, react, figma","VR social spaces, ocean conservation, open sou...",I live in Toronto. I value calm schedules and ...,"As a creator with 15 years, I worked across st...",I prefer weekly demos and short design docs. I...,Goals: launch a YouTube channel on ML.,Build a dream,"BigFive(O=0.6666666666666667, C=0.5, E=0.41666..."
84,User085,1975-05-12,50,45-54,Toronto,,43.6532,-79.3832,-5,5-10,...,13,"podcasting, grant writing, biostatistics, go, ...","storyboarding, statistics, go, aws","coral reef restoration, rural connectivity, la...",I live in Toronto. I value calm schedules and ...,"As a designer with 13 years, I worked across s...",I prefer weekly demos and short design docs. I...,Goals: build ocean microplastics sensors.,Build a dream,"BigFive(O=0.6666666666666667, C=0.333333333333..."
88,User089,1982-11-15,42,35-44,Mexico City,,19.4326,-99.1332,-6,10-20,...,8,"biostatistics, product, field research, django","aws, lab techniques, strategy, python, django","rural connectivity, mental health, creator eco...",I live in Mexico City. I value calm schedules ...,"As a scientist with 8 years, I worked across s...",I prefer weekly demos and short design docs. I...,Goals: start a cruelty-free lipstick brand.,Find projects,"BigFive(O=0.3333333333333333, C=0.583333333333..."


In [18]:

all_names_df = users[['name']].copy()
print(all_names_df.to_string(index=False))
all_names_df.head()


   name
User001
User002
User003
User004
User005
User006
User007
User008
User009
User010
User011
User012
User013
User014
User015
User016
User017
User018
User019
User020
User021
User022
User023
User024
User025
User026
User027
User028
User029
User030
User031
User032
User033
User034
User035
User036
User037
User038
User039
User040
User041
User042
User043
User044
User045
User046
User047
User048
User049
User050
User051
User052
User053
User054
User055
User056
User057
User058
User059
User060
User061
User062
User063
User064
User065
User066
User067
User068
User069
User070
User071
User072
User073
User074
User075
User076
User077
User078
User079
User080
User081
User082
User083
User084
User085
User086
User087
User088
User089
User090
User091
User092
User093
User094
User095
User096
User097
User098
User099
User100


Unnamed: 0,name
0,User001
1,User002
2,User003
3,User004
4,User005
