
# Rocket Matching — Full Hybrid Recommender + AI Interview + Skills Modes (MVP)

This notebook prototypes Rocket's **matching engine** end‑to‑end:

- Hybrid signals: **Content (text/geo/experience/role)** + **Collaborative Filtering (implicit)** + **Graph/Markov (Personalized PageRank)** + **Personality (Big Five)** + **Reciprocity** + **Diversification (MMR)**
- **AI interview** scaffold with 4 seed questions (Human / Professional / Contributor / Interests) **+ reason for joining**
- **Unstructured extraction** (spaCy + KeyBERT + Sentence‑Transformers) to populate interests/skills/locations
- **Skills strategy switch**: **similar skills** (peer discovery) vs **complementary skills** (team formation via Hungarian assignment)
- Structured so that TF‑IDF can later be swapped for sentence embeddings stored in Postgres **pgvector**

In [2]:

# !pip install numpy pandas scikit-learn networkx geopy spacy keybert sentence-transformers scipy
# !python -m spacy download en_core_web_sm


In [3]:

import numpy as np, pandas as pd
from typing import List, Tuple, Dict, Any
from dataclasses import dataclass
from geopy.distance import geodesic
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx
import importlib.util  # find_spec lives in the importlib.util submodule, which must be imported explicitly

np.random.seed(42)



## 1) Synthetic users (replace with real data)
Adds a **reason_for_joining** intake field.


In [4]:

users = pd.DataFrame([
    dict(user_id=1, name="Alex", role="Founder", interests="ai, healthcare, remote work", 
         skills="product, fundraising, strategy", bio="building AI for healthcare ops", 
         lat=43.6532, lon=-79.3832, years_exp=7, reason_for_joining="Looking for collaborators to build a vision"),
    dict(user_id=2, name="Sam", role="Engineer", interests="ml, data, open source", 
         skills="python, pytorch, django", bio="ml engineer into OSS", 
         lat=43.7001, lon=-79.4163, years_exp=5, reason_for_joining="Expand network in my domain"),
    dict(user_id=3, name="Jamie", role="Designer", interests="ux, motion, branding", 
         skills="figma, design systems, prototyping", bio="designing for clarity", 
         lat=51.5072, lon=-0.1276, years_exp=6, reason_for_joining="Find projects to work on"),
    dict(user_id=4, name="Taylor", role="Founder", interests="creator economy, fintech", 
         skills="growth, marketing, product", bio="creator tools + fintech", 
         lat=40.7128, lon=-74.0060, years_exp=8, reason_for_joining="Find people to help build a dream"),
    dict(user_id=5, name="Riley", role="Engineer", interests="distributed systems, infra", 
         skills="go, kubernetes, aws", bio="SRE with taste for scale", 
         lat=37.7749, lon=-122.4194, years_exp=9, reason_for_joining="Expand network in infra/SRE"),
    dict(user_id=6, name="Morgan", role="Researcher", interests="nlp, recsys, fairness", 
         skills="python, data science, statistics", bio="research-minded data person", 
         lat=43.6532, lon=-79.3832, years_exp=4, reason_for_joining="Collaborate with founders on AI ideas"),
])
users


,user_id,name,role,interests,skills,bio,lat,lon,years_exp,reason_for_joining
0,1,Alex,Founder,"ai, healthcare, remote work","product, fundraising, strategy",building AI for healthcare ops,43.6532,-79.3832,7,Looking for collaborators to build a vision
1,2,Sam,Engineer,"ml, data, open source","python, pytorch, django",ml engineer into OSS,43.7001,-79.4163,5,Expand network in my domain
2,3,Jamie,Designer,"ux, motion, branding","figma, design systems, prototyping",designing for clarity,51.5072,-0.1276,6,Find projects to work on
3,4,Taylor,Founder,"creator economy, fintech","growth, marketing, product",creator tools + fintech,40.7128,-74.006,8,Find people to help build a dream
4,5,Riley,Engineer,"distributed systems, infra","go, kubernetes, aws",SRE with taste for scale,37.7749,-122.4194,9,Expand network in infra/SRE
5,6,Morgan,Researcher,"nlp, recsys, fairness","python, data science, statistics",research-minded data person,43.6532,-79.3832,4,Collaborate with founders on AI ideas



## 2) Personality (TIPI → Big Five)
For the MVP, we support the **TIPI** (Ten‑Item Personality Inventory: 10 items, each rated 1–7, two per trait). Swap to **BFI‑2** later for stronger psychometrics.


In [5]:

@dataclass
class BigFive:
    """Big Five trait scores, each on a 0..1 scale."""
    O: float; C: float; E: float; A: float; N: float

def clip01(x):
    return float(np.clip(x, 0.0, 1.0))

# TIPI item number -> (trait, reverse-scored?).
# N is Neuroticism, so the "anxious" item (4) is scored directly and the "calm" item (9) is reversed.
TIPI_KEY = {
    1: ("E", False), 2: ("A", True), 3: ("C", False), 4: ("N", False), 5: ("O", False),
    6: ("E", True),  7: ("A", False), 8: ("C", True),  9: ("N", True),  10: ("O", True)
}

def score_tipi(responses_1to7):
    """Score ten TIPI answers (1..7) into Big Five traits on a 0..1 scale."""
    assert len(responses_1to7) == 10, "TIPI has exactly 10 items"
    r = np.array(responses_1to7, dtype=float)
    r01 = (r-1)/6.0  # 1..7 -> 0..1
    traits = {"O":[], "C":[], "E":[], "A":[], "N":[]}
    for i,val in enumerate(r01, start=1):
        trait, rev = TIPI_KEY[i]
        traits[trait].append(1.0-val if rev else val)
    return BigFive(*(clip01(np.mean(traits[t])) for t in ["O","C","E","A","N"]))

def _bf_vector(p: BigFive) -> np.ndarray:
    return np.array([p.O, p.C, p.E, p.A, p.N])

def bigfive_cosine(u: BigFive, v: BigFive) -> float:
    """Cosine similarity of two trait vectors (the epsilon guards against zero norms)."""
    a, b = _bf_vector(u), _bf_vector(v)
    return float(a @ b / (np.linalg.norm(a)*np.linalg.norm(b) + 1e-9))

def bigfive_complementarity(u: BigFive, v: BigFive, target_gap=0.25):
    """Gaussian bump per trait that peaks when the two users differ by target_gap.
    Identical profiles score exp(-0.5), about 0.61, rather than 1."""
    gap = np.abs(_bf_vector(u) - _bf_vector(v))
    score = np.exp(-((gap-target_gap)**2)/(2*(target_gap**2)))
    return float(score.mean())

# Assign synthetic traits for now
users['bf'] = [
    BigFive(0.7,0.6,0.5,0.7,0.3),
    BigFive(0.6,0.7,0.4,0.6,0.4),
    BigFive(0.5,0.5,0.6,0.7,0.5),
    BigFive(0.7,0.6,0.7,0.5,0.4),
    BigFive(0.5,0.8,0.3,0.6,0.3),
    BigFive(0.8,0.6,0.4,0.7,0.2),
]
users[['name','bf']]


,name,bf
0,Alex,"BigFive(O=0.7, C=0.6, E=0.5, A=0.7, N=0.3)"
1,Sam,"BigFive(O=0.6, C=0.7, E=0.4, A=0.6, N=0.4)"
2,Jamie,"BigFive(O=0.5, C=0.5, E=0.6, A=0.7, N=0.5)"
3,Taylor,"BigFive(O=0.7, C=0.6, E=0.7, A=0.5, N=0.4)"
4,Riley,"BigFive(O=0.5, C=0.8, E=0.3, A=0.6, N=0.3)"
5,Morgan,"BigFive(O=0.8, C=0.6, E=0.4, A=0.7, N=0.2)"
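
As a sanity check on the TIPI scoring above, the following dependency-free sketch scores one response set by hand. It mirrors this notebook's key (N is Neuroticism, so item 4 counts directly and item 9 is reversed). The helper name `score_tipi_plain` and the sample answers are illustrative assumptions, not part of the pipeline.

```python
# Item number -> (trait, reverse-scored?), the same mapping as TIPI_KEY in this notebook.
KEY = {
    1: ("E", False), 2: ("A", True), 3: ("C", False), 4: ("N", False), 5: ("O", False),
    6: ("E", True), 7: ("A", False), 8: ("C", True), 9: ("N", True), 10: ("O", True),
}

def score_tipi_plain(responses):
    """Map ten 1..7 answers to per-trait means on a 0..1 scale (hypothetical helper)."""
    if len(responses) != 10 or not all(1 <= r <= 7 for r in responses):
        raise ValueError("expected ten responses in the range 1..7")
    buckets = {trait: [] for trait in "OCEAN"}
    for item, raw in enumerate(responses, start=1):
        trait, reverse = KEY[item]
        unit = (raw - 1) / 6.0  # rescale 1..7 to 0..1
        buckets[trait].append(1.0 - unit if reverse else unit)
    return {trait: sum(vals) / len(vals) for trait, vals in buckets.items()}

scores = score_tipi_plain([5, 5, 6, 3, 6, 3, 6, 3, 4, 5])
print({trait: round(value, 3) for trait, value in scores.items()})
```

Each trait is the mean of one direct and one reversed item, so a respondent who answers 4 (the midpoint) everywhere lands at 0.5 on every trait.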



## 3) AI interview (4 seed questions + reason for joining)
- **human**: who/age range/location/lifestyle/hobbies/pets  
- **professional**: skills/employers/seniority/deliverables  
- **contributor**: work style/past projects/value add  
- **interests**: passions/project types/collaborators  
- **reason**: why they joined (expand network, find projects, build dream, collaborators, etc.)


In [6]:

INTERVIEW_QUESTIONS = {
    "human": "Tell me about yourself as a person: age range, city/country, lifestyle, hobbies, and pets (if any).",
    "professional": "Describe your professional background: skills, notable employers/clients, seniority level, and things you can produce.",
    "contributor": "How do you like to work? What projects have you worked on? What do you bring to a project or team?",
    "interests": "What are your passions? What gets you out of bed? What kinds of projects or collaborators excite you?",
    "reason": "Why have you joined Rocket? (Expand network, find projects, find collaborators, build out a dream, etc.)"
}

INTERVIEW_SCHEMA = {
    "human": {"age_range": str, "location": str, "lifestyle": str, "hobbies": list, "pets": str},
    "professional": {"skills": list, "employers": list, "seniority": str, "deliverables": list},
    "contributor": {"work_style": str, "past_projects": list, "value_add": list},
    "interests": {"passions": list, "project_types": list, "collaborators": list},
    "reason": {"reason_for_joining": str}
}

def llm_interview_stub(answers: Dict[str, str]) -> Dict[str, Any]:
    """Placeholder for mapping free-text answers into the schema (replace with an LLM call).
    Here we simply pass through and fill minimal fields; extraction runs later.
    """
    # Start from an empty skeleton derived from INTERVIEW_SCHEMA (str -> "", list -> []).
    parsed = {section: {field: ftype() for field, ftype in fields.items()}
              for section, fields in INTERVIEW_SCHEMA.items()}
    parsed["reason"]["reason_for_joining"] = answers.get("reason", "").strip()
    return parsed
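
Once the stub is replaced with a real LLM call, its output will need checking against the schema before it reaches the profile. The sketch below shows one way to do that, under the assumption that the LLM returns a nested dict. `validate_parsed` and the two-section `SCHEMA` are hypothetical, trimmed for brevity.

```python
# Trimmed copy of the interview schema: section -> {field: expected Python type}.
SCHEMA = {
    "human": {"age_range": str, "location": str, "lifestyle": str, "hobbies": list, "pets": str},
    "reason": {"reason_for_joining": str},
}

def validate_parsed(parsed, schema):
    """Return a list of problems; an empty list means the payload fits the schema."""
    problems = []
    for section, fields in schema.items():
        got = parsed.get(section)
        if not isinstance(got, dict):
            problems.append(f"{section}: missing section")
            continue
        for field, expected in fields.items():
            if field not in got:
                problems.append(f"{section}.{field}: missing")
            elif not isinstance(got[field], expected):
                problems.append(f"{section}.{field}: expected {expected.__name__}")
    return problems

good = {
    "human": {"age_range": "30-34", "location": "Toronto", "lifestyle": "", "hobbies": ["running"], "pets": "dog"},
    "reason": {"reason_for_joining": "Find collaborators"},
}
bad = {"human": {"age_range": 32, "location": "Toronto", "lifestyle": "", "hobbies": "running"}}

print(validate_parsed(good, SCHEMA))
print(validate_parsed(bad, SCHEMA))
```

Returning the problems as a list, rather than raising on the first one, makes it easy to feed them back to the model in a retry prompt.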



## 4) Unstructured extraction (spaCy + KeyBERT + Sentence‑Transformers)
Extract locations/orgs and salient phrases from free text.


In [7]:

from functools import lru_cache

@lru_cache(maxsize=1)
def ensure_loaded_spacy():
    """Load the spaCy pipeline once; later calls reuse the cached object."""
    if importlib.util.find_spec("spacy") is None:
        raise ImportError("spaCy not installed. Run: pip install spacy && python -m spacy download en_core_web_sm")
    import spacy
    try:
        return spacy.load("en_core_web_sm")
    except OSError as e:  # spaCy raises OSError when the model package is missing
        raise RuntimeError("spaCy model not found. Run: python -m spacy download en_core_web_sm") from e

@lru_cache(maxsize=1)
def ensure_loaded_keybert():
    """Load KeyBERT with a MiniLM sentence encoder once; later calls reuse it."""
    if importlib.util.find_spec("keybert") is None:
        raise ImportError("KeyBERT not installed. Run: pip install keybert sentence-transformers")
    from keybert import KeyBERT
    from sentence_transformers import SentenceTransformer
    return KeyBERT(model=SentenceTransformer("all-MiniLM-L6-v2"))

def extract_profile_fields(text: str, top_k=10):
    nlp = ensure_loaded_spacy()
    kw_model = ensure_loaded_keybert()
    doc = nlp(text)

    locations = [ent.text for ent in doc.ents if ent.label_ in {"GPE","LOC"}]
    orgs = [ent.text for ent in doc.ents if ent.label_ in {"ORG"}]

    kp = kw_model.extract_keywords(
        text, keyphrase_ngram_range=(1,3), stop_words="english",
        top_n=top_k, use_mmr=True, diversity=0.6
    )
    keyphrases = [k for k,_ in kp]

    def dedupe(seq):
        # dicts preserve insertion order, so this keeps the first occurrence of each item
        return list(dict.fromkeys(seq))

    return {"locations": dedupe(locations), "orgs": dedupe(orgs), "keyphrases": dedupe(keyphrases)}



## 5) Normalize interview → profile fields


In [8]:

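import re, warnings

# Hint words that mark a keyphrase as a skill rather than an interest.
SKILL_HINTS = {"python","ml","nlp","design","product","kubernetes","aws","go","django","pytorch",
               "data","ux","branding","marketing","growth","strategy"}

def normalize_from_interview(answers: Dict[str,str]) -> Dict[str,Any]:
    blob = "\n\n".join([answers.get(k,"") for k in ["human","professional","contributor","interests"]])
    try:
        extracted = extract_profile_fields(blob, top_k=12)
    except Exception as e:
        # Extraction dependencies are optional: fall back to empty fields, but say why.
        warnings.warn(f"Profile extraction skipped: {e}")
        extracted = {"locations":[], "orgs":[], "keyphrases":[]}

    location = extracted["locations"][0] if extracted["locations"] else ""
    skills = []
    for k in extracted["keyphrases"]:
        # Match whole tokens: a substring test would let "go" match "going" and "ml" match "html".
        if SKILL_HINTS & set(re.findall(r"[a-z0-9+#]+", k.lower())):
            skills.append(k)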
    interests = [k for k in extracted["keyphrases"] if k not in skills]
    bio = answers.get("professional","")[:160]
    reason = answers.get("reason","").strip()

    return {
        "location": location,
        "skills": ", ".join(skills[:10]),
        "interests": ", ".join(interests[:10]),
        "bio": bio,
        "reason_for_joining": reason
    }

# Demo extraction
demo_answers = {
    "human": "I'm 32 in Toronto, into running and cooking; a timid dog named Sonata.",
    "professional": "Full‑stack engineer (Python, Django, React, AWS). Built recommender POCs.",
    "contributor": "Prefer async with weekly demos; I bring velocity & reliability.",
    "interests": "Creator tools, healthcare AI, community projects; like working with founders and researchers.",
    "reason": "Find collaborators and expand my network"
}
normalize_from_interview(demo_answers)


{'location': '',
 'skills': '',
 'interests': '',
 'bio': 'Full‑stack engineer (Python, Django, React, AWS). Built recommender POCs.',
 'reason_for_joining': 'Find collaborators and expand my network'}
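
When keyphrases are split into skills and interests using short hint words, substring checks and whole-token checks can disagree. The sketch below compares the two on a few made-up phrases. `HINTS` is a shortened, illustrative hint list and both helper names are hypothetical.

```python
import re

HINTS = {"python", "ml", "go", "ux", "data"}  # illustrative subset of hint words

def substring_hit(phrase):
    """True if any hint occurs anywhere inside the phrase, even inside a longer word."""
    return any(hint in phrase.lower() for hint in HINTS)

def token_hit(phrase):
    """True only if a hint appears as a whole token of the phrase."""
    tokens = set(re.findall(r"[a-z0-9+#]+", phrase.lower()))
    return bool(tokens & HINTS)

for phrase in ["python backend", "go services", "going hiking", "luxury travel", "html email"]:
    print(f"{phrase!r}: substring={substring_hit(phrase)} token={token_hit(phrase)}")
```

The trade-off is recall: whole-token matching will not catch inflected forms, so a real skills taxonomy with embeddings remains the more robust route.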


## 6) Content features (text/geo/experience/role)


In [9]:

def build_text_matrix(df: pd.DataFrame) -> Tuple[TfidfVectorizer, np.ndarray]:
    corpus = (df['interests'].fillna('') + " ; " + df['skills'].fillna('') + " ; " + df['bio'].fillna('')).tolist()
    vec = TfidfVectorizer(ngram_range=(1,2), min_df=1)
    X = vec.fit_transform(corpus)
    return vec, X

vec, X_text = build_text_matrix(users)
S_text = cosine_similarity(X_text)
S_text = (S_text - S_text.min()) / (S_text.max() - S_text.min() + 1e-9)

def geo_similarity(df: pd.DataFrame, decay_km: float = 2500.0) -> np.ndarray:
    n = len(df); S = np.zeros((n,n), dtype=float)
    coords = list(zip(df['lat'], df['lon']))
    for i in range(n):
        for j in range(n):
            if i==j: continue
            d_km = geodesic(coords[i], coords[j]).km
            S[i,j] = np.exp(-d_km/decay_km)
    if S.max()>0: S = S/S.max()
    return S

S_geo = geo_similarity(users)

def experience_compatibility(years: List[int], sweet_spot: float = 3.0) -> np.ndarray:
    years = np.array(years); n=len(years); S=np.zeros((n,n),dtype=float)
    for i in range(n):
        for j in range(n):
            if i==j: continue
            gap = abs(years[i]-years[j])
            S[i,j] = np.exp(-((gap-sweet_spot)**2)/(2*(sweet_spot**2)))
    if S.max()>0: S = S/S.max()
    return S

S_exp = experience_compatibility(users['years_exp'].tolist())

ROLE_COMP = {
    "Founder": {"Engineer": 1.0, "Designer": 1.0, "Researcher": 0.8, "Founder": 0.2},
    "Engineer": {"Founder": 1.0, "Designer": 0.7, "Engineer": 0.2, "Researcher": 0.6},
    "Designer": {"Founder": 1.0, "Engineer": 0.7, "Designer": 0.2, "Researcher": 0.5},
    "Researcher": {"Founder": 0.8, "Engineer": 0.6, "Designer": 0.5, "Researcher": 0.3},
}

def role_complementarity(df: pd.DataFrame) -> np.ndarray:
    roles = df['role'].tolist(); n=len(roles); S=np.zeros((n,n),dtype=float)
    for i in range(n):
        for j in range(n):
            if i==j: continue
            S[i,j] = ROLE_COMP.get(roles[i], {}).get(roles[j], 0.2)
    return S

S_role = role_complementarity(users)

def combine_content(S_text, S_geo, S_exp, S_role, w=(0.45,0.2,0.15,0.2)):
    a,b,c,d = w
    S = a*S_text + b*S_geo + c*S_exp + d*S_role
    return S / (S.max() + 1e-9)

S_content = combine_content(S_text, S_geo, S_exp, S_role)
pd.DataFrame(S_content, index=users['name'], columns=users['name']).round(3)


name,Alex,Sam,Jamie,Taylor,Riley,Morgan
Alex,0.831,1.0,0.647,0.616,0.735,0.942
Sam,1.0,0.831,0.518,0.942,0.422,0.889
Jamie,0.647,0.518,0.831,0.671,0.597,0.484
Taylor,0.616,0.942,0.671,0.831,0.662,0.854
Riley,0.735,0.422,0.597,0.662,0.831,0.529
Morgan,0.942,0.889,0.484,0.854,0.529,0.831
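
To make the geo and experience kernels concrete, here is a small standalone sketch of both formulas before the per-matrix rescaling. It assumes the haversine great-circle distance in place of `geopy`'s geodesic (a swapped-in approximation that keeps the example dependency-free). The function names are hypothetical.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = math.sin((lat2 - lat1) / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0088 * math.asin(math.sqrt(h))

def geo_score(a, b, decay_km=2500.0):
    """Exponential distance decay: 1.0 when co-located, about 0.37 at decay_km apart."""
    return math.exp(-haversine_km(a, b) / decay_km)

def exp_score(years_a, years_b, sweet_spot=3.0):
    """Gaussian bump that peaks when the experience gap equals sweet_spot years."""
    gap = abs(years_a - years_b)
    return math.exp(-((gap - sweet_spot) ** 2) / (2 * sweet_spot ** 2))

toronto, london = (43.6532, -79.3832), (51.5072, -0.1276)
print(round(haversine_km(toronto, london)), round(geo_score(toronto, london), 3))
print(exp_score(7, 4), round(exp_score(5, 5), 3), round(exp_score(9, 3), 3))
```

Note that the experience kernel is symmetric around the sweet spot: a gap of 0 years and a gap of 6 years score the same, which favors mentor-style pairings over exact peers.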



## 7) Skills strategy — similar vs complementary
In production, map extracted phrases to canonical skills by embedding them and taking the nearest neighbor in a skills taxonomy. For the demo, we compare the raw `skills` strings directly.


In [10]:

# In production, you would maintain a skills taxonomy (ESCO/O*NET) with embeddings.
# Here we approximate using token overlap + cosine on TF-IDF text to keep it runnable.

def parse_skill_list(sk: str) -> List[str]:
    return [s.strip().lower() for s in (sk or "").split(",") if s.strip()]

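def tfidf_cosine(a_list: List[str], b_list: List[str]) -> float:
    # An empty side has nothing to compare, and two empty documents would make
    # TfidfVectorizer raise an "empty vocabulary" error.
    if not a_list or not b_list:
        return 0.0
    docs = ["; ".join(a_list), "; ".join(b_list)]
    vec = TfidfVectorizer(ngram_range=(1,2), min_df=1)
    X = vec.fit_transform(docs)
    return float(cosine_similarity(X[0], X[1])[0,0])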

def similar_skills_matrix(df: pd.DataFrame) -> np.ndarray:
    n=len(df); S=np.zeros((n,n))
    parsed = [parse_skill_list(x) for x in df['skills'].fillna('')]
    for i in range(n):
        for j in range(n):
            if i==j: continue
            S[i,j] = tfidf_cosine(parsed[i], parsed[j])
    if S.max()>0: S = S/S.max()
    return S

def complementary_skills_matrix(df: pd.DataFrame) -> np.ndarray:
    # Interpret 'skills' as HAVEs; if a 'skills_want' column exists it supplies the WANTs, else fall back to interests.
    rows = [row for _, row in df.iterrows()]
    wants = [parse_skill_list(row.get('skills_want', row.get('interests', ''))) for row in rows]
    haves = [parse_skill_list(row.get('skills', '')) for row in rows]

    # Only a missing SciPy triggers the fallback; other errors should surface instead of being swallowed.
    try:
        from scipy.optimize import linear_sum_assignment
    except ImportError:
        linear_sum_assignment = None

    n = len(df); S = np.zeros((n,n))
    for i in range(n):
        for j in range(n):
            need, have = wants[i], haves[j]
            if i == j or not need or not have:
                continue
            # Pairwise TF-IDF cosine between each wanted term (rows) and each offered term (columns)
            X = TfidfVectorizer(ngram_range=(1,2), min_df=1).fit_transform(need + have)
            m = len(need)
            Csim = cosine_similarity(X[:m], X[m:])
            if linear_sum_assignment is None:
                # Fallback: every need takes its best-matching have (haves may be reused)
                S[i,j] = float(Csim.max(axis=1).mean())
            else:
                # Hungarian assignment: one-to-one matching that maximizes total similarity.
                # Rectangular input is accepted; maximize=True needs SciPy >= 1.4.
                r_ind, c_ind = linear_sum_assignment(Csim, maximize=True)
                S[i,j] = float(Csim[r_ind, c_ind].mean())
    if S.max()>0: S = S/S.max()
    return S

S_skills_sim = similar_skills_matrix(users)
S_skills_comp = complementary_skills_matrix(users)
pd.DataFrame(S_skills_sim, index=users['name'], columns=users['name']).round(3)


name,Alex,Sam,Jamie,Taylor,Riley,Morgan
Alex,0.0,0.0,0.0,1.0,0.0,0.0
Sam,0.0,0.0,0.0,0.0,0.0,0.832
Jamie,0.0,0.0,0.0,0.0,0.0,0.0
Taylor,1.0,0.0,0.0,0.0,0.0,0.0
Riley,0.0,0.0,0.0,0.0,0.0,0.0
Morgan,0.0,0.832,0.0,0.0,0.0,0.0
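
The difference between "average best match" and a one-to-one assignment is easiest to see on a tiny case. The sketch below swaps in two simplifications so it runs without dependencies: Jaccard token overlap instead of TF-IDF cosine, and exhaustive search instead of the Hungarian algorithm (fine for a handful of skills, but factorial in general). All names are hypothetical.

```python
from itertools import permutations

def jaccard(a, b):
    """Token-set overlap between two skill phrases, in [0, 1]."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def best_assignment(needs, haves):
    """Exhaustive one-to-one matching that maximizes total similarity."""
    if not needs or not haves:
        return 0.0, []
    sim = [[jaccard(n, h) for h in haves] for n in needs]
    m, k = len(needs), len(haves)
    if m <= k:
        options = (list(zip(range(m), cols)) for cols in permutations(range(k), m))
    else:
        options = (list(zip(rows, range(k))) for rows in permutations(range(m), k))
    best = max(options, key=lambda pairs: sum(sim[r][c] for r, c in pairs))
    score = sum(sim[r][c] for r, c in best) / len(best)
    return score, [(needs[r], haves[c]) for r, c in best]

def average_best_match(needs, haves):
    """Greedy alternative: each need takes its best have, and haves may be reused."""
    return sum(max(jaccard(n, h) for h in haves) for n in needs) / len(needs)

needs = ["python backend", "python data"]
haves = ["python", "data pipelines"]
score, pairs = best_assignment(needs, haves)
print(round(score, 3), pairs)
print(round(average_best_match(needs, haves), 3))
```

Both needs would prefer the single `python` skill. The greedy average counts it twice, while the assignment forces the second need onto `data pipelines`, which better reflects what one teammate can actually cover.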



## 8) Collaborative filtering (implicit placeholder)


In [11]:

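# Placeholder directed interactions (u -> v); user_ids are 1-based.
edges = [(1,2),(1,6),(2,1),(2,3),(3,4),(4,5),(5,2),(6,1),(6,2)]
n = len(users)
R = np.zeros((n,n), dtype=float)
for u,v in edges: R[u-1,v-1]=1.0
# Row u of R holds u's outgoing interactions; column v holds v's incoming ones.
# cosine_similarity(R.T) compares columns, so two users score high when the same
# people interacted with both of them. Use cosine_similarity(R) to compare outgoing behavior instead.
S_cf = cosine_similarity(R.T)
S_cf = (S_cf - S_cf.min())/(S_cf.max()-S_cf.min()+1e-9)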
pd.DataFrame(S_cf, index=users['name'], columns=users['name']).round(3)


name,Alex,Sam,Jamie,Taylor,Riley,Morgan
Alex,1.0,0.408,0.707,0.0,0.0,0.0
Sam,0.408,1.0,0.0,0.0,0.0,0.577
Jamie,0.707,0.0,1.0,0.0,0.0,0.0
Taylor,0.0,0.0,0.0,1.0,0.0,0.0
Riley,0.0,0.0,0.0,0.0,1.0,0.0
Morgan,0.0,0.577,0.0,0.0,0.0,1.0
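
The table above can be reproduced by hand. For binary interaction vectors, the cosine between two columns reduces to the number of shared sources divided by the geometric mean of the two column sizes. The sketch below recomputes three entries from the same edge list, with hypothetical helper names.

```python
import math

edges = [(1, 2), (1, 6), (2, 1), (2, 3), (3, 4), (4, 5), (5, 2), (6, 1), (6, 2)]

def incoming(edge_list):
    """Map each target user to the set of users with an edge pointing at them."""
    sources = {}
    for src, dst in edge_list:
        sources.setdefault(dst, set()).add(src)
    return sources

def co_interaction_cosine(sources, a, b):
    """Cosine of two binary columns: |shared| / sqrt(|A| * |B|)."""
    sa, sb = sources.get(a, set()), sources.get(b, set())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / math.sqrt(len(sa) * len(sb))

sources = incoming(edges)
print(round(co_interaction_cosine(sources, 1, 3), 3))  # Alex vs Jamie
print(round(co_interaction_cosine(sources, 1, 2), 3))  # Alex vs Sam
print(round(co_interaction_cosine(sources, 2, 6), 3))  # Sam vs Morgan
```

Because the smallest entry is 0 and the diagonal is 1, the min-max rescaling leaves these values unchanged, which is why they match the table directly.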



## 9) Graph / Personalized PageRank


In [12]:

G = nx.DiGraph(); G.add_nodes_from(users['user_id'].tolist()); G.add_edges_from(edges)

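def personalized_pagerank_scores(G: nx.DiGraph, source: int, alpha: float = 0.2):
    """Personalized PageRank from `source`. Here `alpha` is the restart probability,
    so networkx receives the damping factor 1 - alpha."""
    personalization = {n: 0.0 for n in G.nodes()}; personalization[source]=1.0
    return nx.pagerank(G, alpha=1-alpha, personalization=personalization)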

nodes = sorted(G.nodes()); idx = {u:i for i,u in enumerate(nodes)}
S_graph = np.zeros((n,n))
for u in nodes:
    pr = personalized_pagerank_scores(G, u, alpha=0.2)
    for v,s in pr.items(): S_graph[idx[u], idx[v]] = s
S_graph = (S_graph - S_graph.min())/(S_graph.max()-S_graph.min()+1e-12)
pd.DataFrame(S_graph, index=users['name'], columns=users['name']).round(3)


name,Alex,Sam,Jamie,Taylor,Riley,Morgan
Alex,0.941,0.631,0.187,0.128,0.081,0.312
Sam,0.42,1.0,0.335,0.247,0.176,0.103
Jamie,0.162,0.459,0.705,0.542,0.412,0.0
Taylor,0.23,0.601,0.176,0.705,0.542,0.027
Riley,0.314,0.778,0.247,0.176,0.705,0.061
Morgan,0.523,0.631,0.187,0.128,0.081,0.73
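
For intuition about what Personalized PageRank computes here, the following is a minimal power-iteration sketch in plain Python. It is an illustration rather than a replacement for `nx.pagerank`: a random walker follows an outgoing edge with probability `1 - restart` and otherwise jumps back to the source. Sending the mass of nodes without outgoing edges back to the source is an assumption of this sketch, and the function name is hypothetical.

```python
def personalized_pagerank(edge_list, nodes, source, restart=0.2, iters=200):
    """Power iteration for a random walk with restart to a single source node."""
    nodes = list(nodes)
    out = {u: [] for u in nodes}
    for u, v in edge_list:
        out[u].append(v)
    rank = {u: 1.0 if u == source else 0.0 for u in nodes}
    for _ in range(iters):
        nxt = {u: restart if u == source else 0.0 for u in nodes}
        for u in nodes:
            if out[u]:
                share = (1 - restart) * rank[u] / len(out[u])
                for v in out[u]:
                    nxt[v] += share
            else:
                # Dangling node: the walker has nowhere to go, so it restarts at the source.
                nxt[source] += (1 - restart) * rank[u]
        rank = nxt
    return rank

edges = [(1, 2), (1, 6), (2, 1), (2, 3), (3, 4), (4, 5), (5, 2), (6, 1), (6, 2)]
pr = personalized_pagerank(edges, range(1, 7), source=3)
print({u: round(s, 3) for u, s in pr.items()})
```

With user 3 as the source, user 4 is reachable only through the single edge 3 -> 4, so at convergence its score is exactly `(1 - restart)` times the score of user 3.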



## 10) Personality similarity matrix


In [13]:

S_person = np.zeros((n,n))
for i in range(n):
    for j in range(n):
        if i==j: continue
        S_person[i,j] = bigfive_cosine(users.iloc[i].bf, users.iloc[j].bf)
S_person = (S_person - S_person.min())/(S_person.max()-S_person.min()+1e-9)
pd.DataFrame(S_person, index=users['name'], columns=users['name']).round(3)


name,Alex,Sam,Jamie,Taylor,Riley,Morgan
Alex,0.0,0.994,0.979,0.983,0.97,1.0
Sam,0.994,0.0,0.974,0.974,0.996,0.979
Jamie,0.979,0.974,0.0,0.977,0.934,0.939
Taylor,0.983,0.974,0.977,0.0,0.931,0.956
Riley,0.97,0.996,0.934,0.931,0.0,0.961
Morgan,1.0,0.979,0.939,0.956,0.961,0.0
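
Every off-diagonal value in the table above falls between 0.93 and 1.0. All trait scores are positive, so the raw vectors point in nearly the same direction and cosine barely separates them. One possible remedy, offered here as an assumption rather than the notebook's method, is to center each trait at the scale midpoint (0.5) before taking the cosine. The sketch compares the two on three of the synthetic profiles.

```python
import math

def cosine(a, b):
    """Plain cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centered_cosine(a, b, midpoint=0.5):
    """Cosine after shifting each trait so the scale midpoint becomes zero."""
    return cosine([x - midpoint for x in a], [y - midpoint for y in b])

# (O, C, E, A, N) for three of the synthetic users
alex = (0.7, 0.6, 0.5, 0.7, 0.3)
taylor = (0.7, 0.6, 0.7, 0.5, 0.4)
morgan = (0.8, 0.6, 0.4, 0.7, 0.2)

for name, other in [("Taylor", taylor), ("Morgan", morgan)]:
    print(name, round(cosine(alex, other), 3), round(centered_cosine(alex, other), 3))
```

Centered cosine ranges over [-1, 1], so it would need rescaling, for example `(c + 1) / 2`, before entering a fusion step that expects non-negative scores.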



## 11) Final fusion + reciprocity + MMR
You can toggle **skills_mode**: `"similar"` or `"complementary"`.


In [14]:

def reciprocalize(S: np.ndarray) -> np.ndarray:
    return np.sqrt(S * S.T + 1e-12)

def fuse_scores(S_content, S_cf, S_graph, S_person, S_skills, weights=(0.35,0.2,0.15,0.15,0.15)):
    # weights: content, cf, graph, personality, skills
    Sc = reciprocalize(S_content)
    Sf = reciprocalize(S_cf)
    Sg = reciprocalize(S_graph)
    Sp = reciprocalize(S_person)
    Ss = reciprocalize(S_skills)
    a,b,c,d,e = weights
    S = a*Sc + b*Sf + c*Sg + d*Sp + e*Ss
    return S / (S.max() + 1e-12)

def mmr(query_idx: int, S: np.ndarray, K: int = 3, lambda_rel: float = 0.7):
    """Greedy Maximal Marginal Relevance: trade relevance to the query against
    redundancy with the candidates already picked."""
    candidates = [i for i in range(S.shape[0]) if i != query_idx]
    selected = []
    while candidates and len(selected) < K:
        def score(j):
            # With nothing selected yet, redundancy is 0 and the most relevant candidate wins.
            redundancy = max((S[j, s] for s in selected), default=0.0)
            return lambda_rel * S[query_idx, j] - (1-lambda_rel) * redundancy
        best = max(candidates, key=score)
        selected.append(best); candidates.remove(best)
    return selected

def top_matches(user_name: str, skills_mode="similar", k=3):
    if skills_mode not in {"similar","complementary"}:
        raise ValueError("skills_mode must be 'similar' or 'complementary'")
    S_sk = S_skills_sim if skills_mode=="similar" else S_skills_comp
    S_final = fuse_scores(S_content, S_cf, S_graph, S_person, S_sk)
    i = users.index[users['name']==user_name][0]
    picks = mmr(i, S_final, K=k, lambda_rel=0.7)
    cols = ['user_id','name','role','interests','skills','years_exp','lat','lon','reason_for_joining']
    return users.iloc[picks][cols].assign(score=[S_final[i,j] for j in picks])

# Demo:
top_matches("Alex", skills_mode="similar", k=3)


,user_id,name,role,interests,skills,years_exp,lat,lon,reason_for_joining,score
1,2,Sam,Engineer,"ml, data, open source","python, pytorch, django",5,43.7001,-79.4163,Expand network in my domain,0.893456
2,3,Jamie,Designer,"ux, motion, branding","figma, design systems, prototyping",6,51.5072,-0.1276,Find projects to work on,0.734287
3,4,Taylor,Founder,"creator economy, fintech","growth, marketing, product",8,40.7128,-74.006,Find people to help build a dream,0.731614
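
Two ideas in this section are easy to verify on a toy matrix: reciprocity as a geometric mean, which punishes one-sided interest, and MMR, which can skip a strong candidate that is nearly a duplicate of one already chosen. The sketch below is plain Python with made-up scores; `reciprocal` and `mmr_pick` are hypothetical names that mirror the logic above.

```python
import math

def reciprocal(s_ab, s_ba):
    """Geometric mean of the two directed scores; low if either side is lukewarm."""
    return math.sqrt(s_ab * s_ba)

def mmr_pick(query, S, k, lambda_rel=0.7):
    """Greedy MMR over a similarity matrix given as a list of lists."""
    candidates = [j for j in range(len(S)) if j != query]
    picked = []
    while candidates and len(picked) < k:
        def score(j):
            redundancy = max((S[j][p] for p in picked), default=0.0)
            return lambda_rel * S[query][j] - (1 - lambda_rel) * redundancy
        best = max(candidates, key=score)
        picked.append(best)
        candidates.remove(best)
    return picked

# Toy symmetric scores: users 1 and 2 are both strong matches for user 0
# but nearly identical to each other (0.95); user 3 is weaker but different.
S = [
    [0.00, 0.90, 0.85, 0.50],
    [0.90, 0.00, 0.95, 0.10],
    [0.85, 0.95, 0.00, 0.10],
    [0.50, 0.10, 0.10, 0.00],
]
by_relevance = sorted(range(1, 4), key=lambda j: -S[0][j])[:2]
diversified = mmr_pick(0, S, k=2)
print(by_relevance, diversified, round(reciprocal(0.9, 0.1), 3))
```

After user 1 is picked, user 2 scores `0.7 * 0.85 - 0.3 * 0.95 = 0.31` while user 3 scores `0.7 * 0.50 - 0.3 * 0.10 = 0.32`, so MMR takes the less similar candidate.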



## 12) Create a new user from the AI interview → see matches


In [15]:

def add_user_from_interview(name: str, role: str, lat: float, lon: float, years_exp: int, tipi_responses_1to7: List[int], answers: Dict[str,str]):
    global users, vec, X_text, S_text, S_geo, S_exp, S_role, S_content, S_cf, S_graph, S_person, S_skills_sim, S_skills_comp
    bf = score_tipi(tipi_responses_1to7)
    norm = normalize_from_interview(answers)
    row = dict(
        user_id = int(users['user_id'].max())+1,
        name=name, role=role,
        interests=norm['interests'],
        skills=norm['skills'],
        bio=norm['bio'],
        lat=lat, lon=lon, years_exp=years_exp,
        reason_for_joining=norm['reason_for_joining'],
        bf=bf
    )
    users = pd.concat([users, pd.DataFrame([row])], ignore_index=True)

    # Recompute similarities (content)
    vec, X_text = build_text_matrix(users)
    S_text = cosine_similarity(X_text)
    S_text = (S_text - S_text.min()) / (S_text.max() - S_text.min() + 1e-9)
    S_geo = geo_similarity(users)
    S_exp = experience_compatibility(users['years_exp'].tolist())
    S_role = role_complementarity(users)
    S_content = combine_content(S_text, S_geo, S_exp, S_role)

    # CF (expand with zeros for new user)
    n2 = len(users)
    R2 = np.zeros((n2,n2), dtype=float)
    for u,v in edges: 
        if u-1 < n2 and v-1 < n2:
            R2[u-1, v-1] = 1.0
    S_cf2 = cosine_similarity(R2.T)
    S_cf2 = (S_cf2 - S_cf2.min())/(S_cf2.max()-S_cf2.min()+1e-9)

    # Graph
    G2 = nx.DiGraph(); G2.add_nodes_from(users['user_id'].tolist()); G2.add_edges_from(edges)
    nodes2 = sorted(G2.nodes()); idx2 = {u:i for i,u in enumerate(nodes2)}
    S_graph2 = np.zeros((n2,n2))
    for u in nodes2:
        pr = personalized_pagerank_scores(G2, u, alpha=0.2)
        for v,s in pr.items(): S_graph2[idx2[u], idx2[v]] = s
    S_graph2 = (S_graph2 - S_graph2.min())/(S_graph2.max()-S_graph2.min()+1e-12)

    # Personality
    S_person2 = np.zeros((n2,n2))
    for i in range(n2):
        for j in range(n2):
            if i==j: continue
            S_person2[i,j] = bigfive_cosine(users.iloc[i].bf, users.iloc[j].bf)
    S_person2 = (S_person2 - S_person2.min())/(S_person2.max()-S_person2.min()+1e-9)

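    # Skills
    S_skills_sim2 = similar_skills_matrix(users)
    S_skills_comp2 = complementary_skills_matrix(users)

    # Keep every module-level matrix the same shape as `users`. Without this, top_matches()
    # would try to fuse the enlarged S_content with stale, smaller CF/graph/personality/skills matrices.
    S_cf, S_graph, S_person = S_cf2, S_graph2, S_person2
    S_skills_sim, S_skills_comp = S_skills_sim2, S_skills_comp2

    return users, S_content, S_cf2, S_graph2, S_person2, S_skills_sim2, S_skills_comp2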

new_user_answers = {
    "human": "I’m 30 in Toronto; I run, cook, and hike with my dog.",
    "professional": "Full‑stack engineer: Python, Django, React, AWS; ex‑fintech startup. I ship MVPs fast.",
    "contributor": "Async, milestone‑driven; I bring velocity, clean code, and reliability.",
    "interests": "Creator tools, healthcare AI, social impact; love working with founders and researchers.",
    "reason": "Looking for collaborators and to expand my network"
}
users2, S_content2, S_cf2, S_graph2, S_person2, S_sk_sim2, S_sk_comp2 = add_user_from_interview(
    name="Casey", role="Engineer", lat=43.65, lon=-79.38, years_exp=6,
    tipi_responses_1to7=[5,5,6,3,6,3,6,3,4,5], answers=new_user_answers
)

def top_matches_latest(skills_mode="similar", k=3):
    S_sk = S_sk_sim2 if skills_mode=="similar" else S_sk_comp2
    S_final = fuse_scores(S_content2, S_cf2, S_graph2, S_person2, S_sk)
    i = len(users2)-1  # new user
    picks = mmr(i, S_final, K=k, lambda_rel=0.7)
    cols = ['user_id','name','role','interests','skills','years_exp','lat','lon','reason_for_joining']
    return users2.iloc[picks][cols].assign(score=[S_final[i,j] for j in picks])

top_matches_latest("similar", k=3)


,user_id,name,role,interests,skills,years_exp,lat,lon,reason_for_joining,score
3,4,Taylor,Founder,"creator economy, fintech","growth, marketing, product",8,40.7128,-74.006,Find people to help build a dream,0.687672
0,1,Alex,Founder,"ai, healthcare, remote work","product, fundraising, strategy",7,43.6532,-79.3832,Looking for collaborators to build a vision,0.674704
5,6,Morgan,Researcher,"nlp, recsys, fairness","python, data science, statistics",4,43.6532,-79.3832,Collaborate with founders on AI ideas,0.624849



## 13) Notes & next steps
- Replace TF‑IDF with **Sentence‑Transformers** embeddings; store in **pgvector**.
- Map extracted phrases to a canonical skills taxonomy (ESCO/O*NET) using nearest‑neighbor on skill embeddings.
- Log outcomes to tune weights (precision@n, reply rate, acceptance, coverage/diversity).
- Always request consent for personality processing; allow view/edit.
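
For the outcome-logging step, two of the listed metrics can be sketched in a few lines. The log format here (a dict of ranked recommendations and a dict of accepted user_ids per user) is an assumption made for illustration, as are the function names and sample data.

```python
def precision_at_k(recommended, accepted, k):
    """Fraction of the top-k recommendations that the user accepted."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for r in recommended[:k] if r in accepted) / k

def catalog_coverage(recs_by_user, catalog_size):
    """Share of all users who appear in at least one recommendation list."""
    shown = {r for recs in recs_by_user.values() for r in recs}
    return len(shown) / catalog_size

# Hypothetical logs: ranked recommendations shown, and which ones were accepted.
recs_by_user = {1: [2, 6, 3], 2: [1, 6, 4]}
accepted_by_user = {1: {2, 3}, 2: {5}}

per_user = {u: precision_at_k(recs, accepted_by_user.get(u, set()), k=3)
            for u, recs in recs_by_user.items()}
mean_precision = sum(per_user.values()) / len(per_user)
print(per_user, round(mean_precision, 3), round(catalog_coverage(recs_by_user, 6), 3))
```

Tracking coverage alongside precision helps catch a weight setting that looks accurate only because it keeps recommending the same few popular users.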
