
# Rocket Matching — Interview (with DOB) + Hybrid Recommender + 100 Famous-Figure Test Users

**Purpose**: End‑to‑end prototype that:
- Captures interview data (including **DOB** → dynamic **age**) across 5 sections.
- Normalizes free‑text into profile fields.
- Runs a **hybrid recommender**: content (text/geo/exp/role) + **skills (similar vs complementary)** + CF (implicit) + graph (**Personalized PageRank**) + personality + reciprocity + MMR.
- Generates **100 synthetic test users** named after **famous figures** (names only). **All attributes are fictional for testing**: DOBs, bios, skills, interests, locations, personality, etc.

> References: TIPI scoring for Big Five; Personalized PageRank; Hungarian algorithm for complementary matching. See notes at bottom for citations.


## Setup

In [1]:

# Optional installs if running locally:
# !pip install numpy pandas scikit-learn networkx geopy scipy
# For extraction extras (not required to run core demo):
# !pip install spacy keybert sentence-transformers pdfplumber python-docx
# !python -m spacy download en_core_web_sm


In [2]:

import numpy as np, pandas as pd, random, math
from dataclasses import dataclass
from typing import List, Dict, Any, Optional, Tuple
from datetime import date, datetime
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from geopy.distance import geodesic
import networkx as nx

np.random.seed(11); random.seed(11)


## Utilities — DOB → Age / Bands

In [3]:

def parse_dob(dob_str: str) -> date:
    return datetime.strptime(dob_str, "%Y-%m-%d").date()

def compute_age(dob: date, today: Optional[date] = None) -> int:
    today = today or date.today()
    years = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return max(0, years)

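def age_band(age: int) -> str:
    # Ages below the first band get their own label instead of falling through to "55+".
    if age < 18: return "<18"
    bands = [(18,24),(25,34),(35,44),(45,54)]
    for lo,hi in bands:
        if lo <= age <= hi: return f"{lo}-{hi}"
    return "55+"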
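
The birthday check in `compute_age` works by comparing `(month, day)` tuples. The sketch below isolates that comparison on fixed dates so the edge cases are visible; `age_on` is a hypothetical helper that mirrors the logic above, and the dates are arbitrary illustrations.

```python
from datetime import date

def age_on(dob: date, today: date) -> int:
    """Whole years elapsed; subtract one if this year's birthday has not happened yet."""
    before_birthday = (today.month, today.day) < (dob.month, dob.day)
    return max(0, today.year - dob.year - int(before_birthday))

dob = date(1990, 6, 15)
day_before = age_on(dob, date(2025, 6, 14))                # birthday is tomorrow
on_birthday = age_on(dob, date(2025, 6, 15))               # birthday is today
leap_baby = age_on(date(2000, 2, 29), date(2025, 2, 28))   # Feb 29 DOB in a non-leap year

print(day_before, on_birthday, leap_baby)  # 34 35 24
```

Under this rule a Feb 29 birthday only counts as reached on Mar 1 in non-leap years, which is one of several conventions in use.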


## Personality — TIPI → Big Five (short)

In [4]:

@dataclass
class BigFive:
    O: float; C: float; E: float; A: float; N: float

def clip01(x):
    # numpy is already imported as np in the setup cell
    return float(np.clip(x, 0.0, 1.0))

TIPI_KEY = {
    1: ("E", False), 2: ("A", True), 3: ("C", False), 4: ("N", False), 5: ("O", False),
    6: ("E", True),  7: ("A", False),8: ("C", True),  9: ("N", True),  10:("O", True)
}

def score_tipi(responses_1to7):
    # Expects the ten TIPI answers in item order, each on the 1..7 scale
    assert len(responses_1to7)==10
    r = np.array(responses_1to7, dtype=float)
    r01 = (r-1)/6.0  # 1..7 -> 0..1
    traits = {"O":[], "C":[], "E":[], "A":[], "N":[]}
    for i,val in enumerate(r01, start=1):
        trait, rev = TIPI_KEY[i]
        traits[trait].append(1.0-val if rev else val)
    return BigFive(*(clip01(np.mean(traits[t])) for t in ["O","C","E","A","N"]))
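
To make the reverse-keying concrete, here is a small standalone sketch that scores two response vectors by hand. `KEY` is copied from `TIPI_KEY` above; the `score` helper and the two response vectors are illustrative assumptions, not part of the pipeline.

```python
from statistics import mean

# Item number -> (trait, reverse-keyed?), copied from TIPI_KEY above.
KEY = {
    1: ("E", False), 2: ("A", True), 3: ("C", False), 4: ("N", False), 5: ("O", False),
    6: ("E", True),  7: ("A", False), 8: ("C", True),  9: ("N", True),  10: ("O", True),
}

def score(responses):
    """Map ten 1..7 answers to 0..1 trait scores, flipping reverse-keyed items."""
    buckets = {t: [] for t in "OCEAN"}
    for item, raw in enumerate(responses, start=1):
        trait, rev = KEY[item]
        val = (raw - 1) / 6.0          # 1..7 -> 0..1
        buckets[trait].append(1.0 - val if rev else val)
    return {t: mean(vals) for t, vals in buckets.items()}

# All-midpoint answers land every trait at 0.5.
neutral = score([4] * 10)
# Strong agreement on odd items, strong disagreement on even items.
extreme = score([7, 1, 7, 1, 7, 1, 7, 1, 7, 1])

print(neutral)
print(extreme)
```

In the second vector, items 4 and 9 both push `N` to 0: item 4 is answered 1 (not reversed), and item 9 is answered 7 but reverse-keyed.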


## Interview schema + normalizer (with DOB)

In [5]:

INTERVIEW_FIELDS = [
    "dob","human","role","interests","skills","years_exp",
    "professional","contributor","interests_long","reason","location_city","location_country"
]

def parse_comma_list(s: str) -> List[str]:
    return [x.strip() for x in (s or "").split(",") if x.strip()]

def normalize_interview(answers: Dict[str,str]) -> Dict[str,Any]:
    # clip long answers to <=250 words per section
    def wclip(t): 
        ws = t.split(); 
        return " ".join(ws[:250])
    answers = {k: wclip(v or "") for k,v in answers.items()}

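    dob_str = answers.get("dob","1985-01-01")
    try:
        dob = parse_dob(dob_str)
    except ValueError:
        # Missing or unparseable DOB: fall back to a placeholder date
        dob = date(1985,1,1)
    age_val = compute_age(dob)
    out = {
        "dob": dob.isoformat(),  # store the date actually used for age, not the raw input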
        "age": age_val,
        "age_band": age_band(age_val),
        "role": answers.get("role","Undecided").strip() or "Undecided",
        "interests": ", ".join(parse_comma_list(answers.get("interests",""))[:20]),
        "skills": ", ".join(parse_comma_list(answers.get("skills",""))[:20]),
        "years_exp": int(str(answers.get("years_exp","0")).strip() or 0),
        "bio": (answers.get("professional","") or "")[:220],
        "reason_for_joining": (answers.get("reason","") or "").strip(),
        "long_text": " ".join([answers.get("human",""), answers.get("contributor",""), answers.get("interests_long","")])[:1500],
        "location_city": answers.get("location_city",""),
        "location_country": answers.get("location_country",""),
    }
    return out
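
Free-text answers for `years_exp` such as "5+" or "about ten" would make the plain `int(...)` call above raise `ValueError`. One possible hardening is sketched below; `parse_years`, its `default`, and the `cap` of 60 are hypothetical choices for illustration, not part of the original normalizer.

```python
import re

def parse_years(raw, default=0, cap=60):
    """Return the first run of digits in `raw` as an int, capped at `cap`; `default` if none."""
    match = re.search(r"\d+", str(raw or ""))
    if match is None:
        return default
    return min(cap, int(match.group()))

examples = ["7", " 12 years", "5+", "", None, "about ten"]
parsed = [parse_years(x) for x in examples]
print(parsed)  # [7, 12, 5, 0, 0, 0]
```

Spelled-out numbers ("about ten") fall back to the default here; handling them would need a word-to-number step.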


## Content features — text + geo + experience + role

In [6]:

def build_text_matrix(df: pd.DataFrame):
    corpus = (df['interests'].fillna('') + " ; " + df['skills'].fillna('') + " ; " + df['bio'].fillna('')).tolist()
    vec = TfidfVectorizer(ngram_range=(1,2), min_df=1)
    X = vec.fit_transform(corpus)
    return vec, X

def geo_similarity(df: pd.DataFrame, decay_km: float = 2500.0) -> np.ndarray:
    n = len(df); S = np.zeros((n,n), dtype=float)
    coords = list(zip(df['lat'], df['lon']))
    for i in range(n):
        for j in range(n):
            if i==j: continue
            # great-circle via geopy
            d_km = geodesic(coords[i], coords[j]).km
            S[i,j] = np.exp(-d_km/decay_km)
    if S.max()>0: S = S/S.max()
    return S

def experience_compatibility(years: List[int], sweet_spot: float = 3.0) -> np.ndarray:
    years = np.array(years); n=len(years); S=np.zeros((n,n),dtype=float)
    for i in range(n):
        for j in range(n):
            if i==j: continue
            gap = abs(years[i]-years[j])
            S[i,j] = np.exp(-((gap-sweet_spot)**2)/(2*(sweet_spot**2)))
    if S.max()>0: S = S/S.max()
    return S

ROLE_COMP = {
    "Founder": {"Engineer": 1.0, "Designer": 1.0, "Researcher": 0.8, "Founder": 0.2, "Writer":0.6, "Scientist":0.7, "Creator":0.8},
    "Engineer": {"Founder": 1.0, "Designer": 0.7, "Engineer": 0.2, "Researcher": 0.6, "Writer":0.6, "Scientist":0.8, "Creator":0.7},
    "Designer": {"Founder": 1.0, "Engineer": 0.7, "Designer": 0.2, "Researcher": 0.5, "Writer":0.6, "Scientist":0.5, "Creator":0.9},
    "Researcher": {"Founder": 0.8, "Engineer": 0.7, "Designer": 0.5, "Researcher": 0.3, "Writer":0.5, "Scientist":0.9, "Creator":0.6},
    "Writer": {"Founder":0.8, "Engineer":0.6, "Designer":0.7, "Researcher":0.5, "Writer":0.2, "Scientist":0.5, "Creator":0.9},
    "Scientist":{"Founder":0.9, "Engineer":0.9, "Designer":0.5, "Researcher":0.8, "Writer":0.5, "Scientist":0.2, "Creator":0.6},
    "Creator":{"Founder":0.9, "Engineer":0.7, "Designer":0.9, "Researcher":0.6, "Writer":0.9, "Scientist":0.6, "Creator":0.3},
    "Undecided": {"Founder":0.6,"Engineer":0.6,"Designer":0.6,"Researcher":0.6,"Writer":0.6,"Scientist":0.6,"Creator":0.6,"Undecided":0.2}
}

def role_complementarity(df: pd.DataFrame) -> np.ndarray:
    roles = df['role'].tolist(); n=len(roles); S=np.zeros((n,n),dtype=float)
    for i in range(n):
        for j in range(n):
            if i==j: continue
            S[i,j] = ROLE_COMP.get(roles[i], {}).get(roles[j], 0.2)
    return S

def combine_content(S_text, S_geo, S_exp, S_role, w=(0.45,0.2,0.15,0.2)):
    a,b,c,d = w
    S = a*S_text + b*S_geo + c*S_exp + d*S_role
    return S / (S.max() + 1e-9)
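
The experience signal is a Gaussian bump over the experience gap, centred on `sweet_spot` rather than on zero, so a moderate seniority difference scores higher than identical experience. A standalone sketch of just that kernel (`exp_compat` is a hypothetical name; the formula matches the one in `experience_compatibility` before normalization):

```python
import math

def exp_compat(gap_years: float, sweet_spot: float = 3.0) -> float:
    """Gaussian bump centred on `sweet_spot`, with standard deviation equal to `sweet_spot`."""
    return math.exp(-((gap_years - sweet_spot) ** 2) / (2 * sweet_spot ** 2))

# Print the kernel value for a few experience gaps (in years).
for gap in (0, 3, 6, 12):
    print(gap, round(exp_compat(gap), 3))
```

Because the bump is symmetric around the sweet spot, peers with no gap score the same as a pair six years apart, and the score peaks at exactly three years.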


## Skills — similar vs complementary (Hungarian fallback)

In [7]:

from sklearn.feature_extraction.text import TfidfVectorizer

def parse_skill_list(sk: str) -> List[str]:
    return [s.strip().lower() for s in (sk or "").split(",") if s.strip()]

def tfidf_cosine(a_list: List[str], b_list: List[str]) -> float:
    docs = ["; ".join(a_list), "; ".join(b_list)]
    vec = TfidfVectorizer(ngram_range=(1,2), min_df=1)
    X = vec.fit_transform(docs)
    return float(cosine_similarity(X[0], X[1])[0,0])

def similar_skills_matrix(df: pd.DataFrame) -> np.ndarray:
    n=len(df); S=np.zeros((n,n))
    parsed = [parse_skill_list(x) for x in df['skills'].fillna('')]
    for i in range(n):
        for j in range(n):
            if i==j: continue
            S[i,j] = tfidf_cosine(parsed[i], parsed[j])
    if S.max()>0: S = S/S.max()
    return S

def complementary_skills_matrix(df: pd.DataFrame) -> np.ndarray:
    wants = [parse_skill_list(row.get('skills_want', row.get('interests',''))) for _,row in df.iterrows()]
    haves = [parse_skill_list(row.get('skills','')) for _,row in df.iterrows()]
    n=len(df); S=np.zeros((n,n))
    try:
        from scipy.optimize import linear_sum_assignment
        for i in range(n):
            need = wants[i]
            for j in range(n):
                if i==j: continue
                have = haves[j]
                if not need or not have: 
                    S[i,j]=0.0; continue
                A = ["; ".join([n1]) for n1 in need]
                B = ["; ".join([h1]) for h1 in have]
                vec = TfidfVectorizer(ngram_range=(1,2), min_df=1)
                X = vec.fit_transform(A + B)
                m, k = len(need), len(have)
                Csim = np.zeros((m,k))
                for p in range(m):
                    for q in range(k):
                        Csim[p,q] = cosine_similarity(X[p], X[m+q])[0,0]
                size = max(m,k)
                padded = np.ones((size,size))
                padded[:m,:k] = 1.0 - Csim  # cost = 1 - sim
                r_ind, c_ind = linear_sum_assignment(padded)
                total_sim = 0.0; count = 0
                for r,c in zip(r_ind, c_ind):
                    if r < m and c < k:
                        total_sim += 1.0 - padded[r,c]; count += 1
                S[i,j] = total_sim / (count + 1e-9)
        if S.max()>0: S = S/S.max()
    except ImportError:
        # SciPy unavailable: fall back to the average best-match similarity per wanted skill
        for i in range(n):
            need = wants[i]
            for j in range(n):
                if i==j: continue
                have = haves[j]
                if not need or not have: 
                    S[i,j]=0.0; continue
                sims = []
                for nterm in need:
                    sims.append(max(tfidf_cosine([nterm], [h]) for h in have))
                S[i,j] = float(np.mean(sims)) if sims else 0.0
        if S.max()>0: S = S/S.max()
    return S
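
The point of the one-to-one assignment is that each "have" skill can cover only one "want", which a greedy best-match can get wrong. The sketch below uses brute-force enumeration rather than the Hungarian algorithm, so it is only suitable for tiny inputs, but it optimizes the same objective as `linear_sum_assignment` on `1 - sim`. The function name and the toy similarity values are assumptions for illustration.

```python
from itertools import permutations

def best_assignment(sim):
    """Exhaustively find the one-to-one row->column pairing with maximum total similarity.

    Assumes a rectangular matrix with len(rows) <= len(columns).
    Returns (pairs, mean similarity per row).
    """
    m, k = len(sim), len(sim[0])
    best_total, best_cols = -1.0, None
    for cols in permutations(range(k), m):
        total = sum(sim[r][c] for r, c in enumerate(cols))
        if total > best_total:
            best_total, best_cols = total, cols
    return list(enumerate(best_cols)), best_total / m

# Rows: skills user i wants. Columns: skills user j has. Values are toy similarities.
sim = [
    [0.90, 0.80, 0.00],
    [0.85, 0.10, 0.00],
]
pairs, coverage = best_assignment(sim)
print(pairs, round(coverage, 3))  # [(0, 1), (1, 0)] 0.825
```

A greedy pass would give row 0 its best column (0.90) and leave row 1 with 0.10; the optimal assignment swaps them for a higher total.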


## Graph/CF/Personality/Fusion + Diversification

In [8]:

def reciprocalize(S: np.ndarray) -> np.ndarray:
    return np.sqrt(S * S.T + 1e-12)

def fuse_scores(S_content, S_cf, S_graph, S_person, S_skills, weights=(0.35,0.2,0.15,0.15,0.15)):
    Sc = reciprocalize(S_content)
    Sf = reciprocalize(S_cf)
    Sg = reciprocalize(S_graph)
    Sp = reciprocalize(S_person)
    Ss = reciprocalize(S_skills)
    a,b,c,d,e = weights
    S = a*Sc + b*Sf + c*Sg + d*Sp + e*Ss
    return S / (S.max() + 1e-12)

def mmr(query_idx: int, S: np.ndarray, K: int = 3, lambda_rel: float = 0.7):
    n = S.shape[0]
    candidates = [i for i in range(n) if i != query_idx]
    selected = []
    while candidates and len(selected) < K:
        if not selected:
            i = max(candidates, key=lambda j: S[query_idx, j])
            selected.append(i); candidates.remove(i)
        else:
            def score(j):
                redundancy = max(S[j, s] for s in selected) if selected else 0.0
                return lambda_rel * S[query_idx, j] - (1-lambda_rel) * redundancy
            i = max(candidates, key=score)
            selected.append(i); candidates.remove(i)
    return selected
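
A toy run shows how MMR departs from a plain top-K by relevance. The sketch re-implements the same greedy rule on nested lists so it runs on its own; `mmr_pick` and the 4x4 matrix are illustrative assumptions.

```python
def mmr_pick(query, S, k=2, lam=0.7):
    """Greedy MMR: trade relevance to `query` against similarity to already-picked items."""
    candidates = [j for j in range(len(S)) if j != query]
    selected = []
    while candidates and len(selected) < k:
        def score(j):
            redundancy = max((S[j][s] for s in selected), default=0.0)
            return lam * S[query][j] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Users 1 and 2 are both highly relevant to user 0 but nearly identical to each other.
S = [
    [0.00, 0.90, 0.85, 0.50],
    [0.90, 0.00, 0.95, 0.10],
    [0.85, 0.95, 0.00, 0.10],
    [0.50, 0.10, 0.10, 0.00],
]
by_relevance = sorted(range(1, 4), key=lambda j: -S[0][j])[:2]
diversified = mmr_pick(0, S)
print(by_relevance, diversified)  # [1, 2] [1, 3]
```

After user 1 is picked, user 2 scores 0.7·0.85 − 0.3·0.95 = 0.31 while user 3 scores 0.7·0.50 − 0.3·0.10 = 0.32, so the less relevant but less redundant user 3 wins the second slot.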


## Generate 100 famous‑figure test users (fictional attributes)

In [9]:

# Names only; everything else is synthetic for testing (not factual).
famous_names = [
    "Albert Einstein","Marie Curie","Ada Lovelace","Alan Turing","Nikola Tesla","Katherine Johnson","Rosalind Franklin","Grace Hopper","Stephen Hawking","Tim Berners-Lee",
    "Elon Musk","Oprah Winfrey","Beyoncé","Taylor Swift","Rihanna","Ariana Grande","Drake","Ed Sheeran","Billie Eilish","Bruno Mars",
    "LeBron James","Serena Williams","Lionel Messi","Rafael Nadal","Usain Bolt","Simone Biles","Roger Federer","Naomi Osaka","Cristiano Ronaldo","Megan Rapinoe",
    "J.K. Rowling","George R.R. Martin","Neil Gaiman","Margaret Atwood","Chimamanda Adichie","Malala Yousafzai","Michelle Obama","Barack Obama","Greta Thunberg","Nelson Mandela",
    "Steve Jobs","Sundar Pichai","Satya Nadella","Jeff Bezos","Mark Zuckerberg","Susan Wojcicki","Sheryl Sandberg","Jack Dorsey","Brian Chesky","Whitney Wolfe Herd",
    "Quincy Jones","Hans Zimmer","John Williams","Lin-Manuel Miranda","Pharrell Williams","Lady Gaga","Adele","Kendrick Lamar","J. Cole","The Weeknd",
    "Hayao Miyazaki","Christopher Nolan","Greta Gerwig","Quentin Tarantino","Ava DuVernay","James Cameron","Peter Jackson","Kathryn Bigelow","Taika Waititi","Chloé Zhao",
    "Noam Chomsky","Yuval Noah Harari","Steven Pinker","Angela Davis","Cornel West","Jane Goodall","David Attenborough","Neil deGrasse Tyson","Brian Cox","Carl Sagan",
    "Ilya Sutskever","Demis Hassabis","Fei-Fei Li","Yann LeCun","Geoffrey Hinton","Andrew Ng","Yoshua Bengio","Lex Fridman","Sam Altman","Dario Amodei",
    "Martha Stewart","Gordon Ramsay","Jamie Oliver","Nigella Lawson","Ree Drummond","Heston Blumenthal","Massimo Bottura","Christina Tosi","Yotam Ottolenghi","Anthony Bourdain"
]
assert len(famous_names) == 100

roles = ["Founder","Engineer","Designer","Researcher","Writer","Scientist","Creator"]
cities = [
    ("Toronto",43.6532,-79.3832),("New York",40.7128,-74.0060),("San Francisco",37.7749,-122.4194),
    ("London",51.5072,-0.1276),("Berlin",52.52,13.405),("Nairobi",-1.286389,36.817223),
    ("Sydney",-33.8688,151.2093),("Bangalore",12.9716,77.5946),("Paris",48.8566,2.3522),("Mexico City",19.4326,-99.1332)
]

skill_bank = [
    "python","pytorch","tensorflow","django","react","nextjs","go","kubernetes","aws","gcp",
    "video editing","storyboarding","scriptwriting","podcasting","seo","branding","figma","design systems",
    "statistics","causal inference","nlp","cv","llm prompting","sql","dbt","airflow",
    "grant writing","field research","lab techniques","oceanography","genomics","biostatistics",
    "supply chain","marketing","growth","product","fundraising","strategy"
]
interest_bank = [
    "ocean conservation","coral reef restoration","climate tech","educational apps","healthcare AI",
    "creator economy","open source tools","social impact","rural connectivity","financial inclusion",
    "short-form video","long-form YouTube","beauty brand","lipstick R&D","fashion sustainability",
    "music production","publishing","newsletter growth","sports analytics","mental health",
    "language learning","VR social spaces","next social network","privacy-first messaging"
]

goals_examples = [
    "launch a YouTube channel teaching ML from scratch",
    "build low-cost sensors to monitor microplastics in rivers",
    "create a community-powered social network with better moderation",
    "start a cruelty-free lipstick brand with transparent supply chain",
    "develop AI tools for writers to plan book outlines",
    "spin up an educational game for climate science",
    "ship a mobile app to connect volunteers with ocean NGOs",
    "build a data pipeline for grassroots health clinics",
    "open-source a toolkit for video creators to analyze audience retention",
    "prototype a privacy-first group chat app with local-first sync"
]

def rand_words(pool, kmin, kmax):
    k = random.randint(kmin, kmax)
    return ", ".join(random.sample(pool, k))

def make_human(city):
    opts = [
        f"In {city}, I balance focused work with creative side projects. Evenings go to running, cooking, and sketching ideas. Weekends: parks and long walks with friends or a pet.",
        f"{city} is home base. I keep a simple routine—deep work blocks, gym, and cooking. I love small meetups where people demo what they’re building.",
        f"I live in {city}. I value calm schedules, async collaboration, and consistent routines. I like filming short videos, reading widely, and trying new coffee spots."
    ]
    return random.choice(opts)

def random_dob():
    # Birth years 1960-2002 keep every synthetic user an adult; exact ages depend on the run date
    y = random.randint(1960, 2002)
    m = random.randint(1,12)
    d = random.randint(1,28)
    return f"{y:04d}-{m:02d}-{d:02d}"

records = []
for i, name in enumerate(famous_names, start=1):
    role = random.choice(roles)
    (city, lat, lon) = random.choice(cities)
    years = random.randint(1, 20)
    skills = rand_words(skill_bank, 3, 7)
    interests = rand_words(interest_bank, 3, 7)
    human = make_human(city)
    professional = f"As a {role.lower()} with {years} years across startups and studios, I work on: {skills}. I’ve shipped prototypes and production systems. Lately I’ve focused on {random.choice(interest_bank)}. I can produce clear docs, stable code or creative assets, and collaborate across functions."
    contributor = "I prefer async collaboration with tight scopes and weekly demos. I write short design docs, propose milestones, and keep momentum. I aim for psychological safety and crisp ownership."
    interests_long = f"Goals: {random.choice(goals_examples)}. I like teaming with thoughtful builders who care about impact and craft."
    reason = random.choice(["Find projects","Expand network","Find collaborators","Build a dream"])
    answers = dict(
        dob=random_dob(),
        human=human, role=role, interests=interests, skills=skills, years_exp=str(years),
        professional=professional, contributor=contributor, interests_long=interests_long, reason=reason,
        location_city=city, location_country=""
    )
    norm = normalize_interview(answers)
    # Random TIPI responses drawn from the mid-range 2..6 (avoids the extremes 1 and 7)
    tipi = [random.randint(2,6) for _ in range(10)]
    bf = score_tipi(tipi)
    records.append(dict(
        user_id=i, name=name, role=norm["role"], interests=norm["interests"], skills=norm["skills"],
        years_exp=norm["years_exp"], bio=norm["bio"], reason_for_joining=norm["reason_for_joining"],
        long_text=norm["long_text"], lat=lat, lon=lon, dob=norm["dob"], age=norm["age"], age_band=norm["age_band"], bf=bf
    ))

users = pd.DataFrame(records)
users.head(10)


| | user_id | name | role | interests | skills | years_exp | bio | reason_for_joining | long_text | lat | lon | dob | age | age_band | bf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Albert Einstein | Researcher | fashion sustainability, financial inclusion, h... | supply chain, scriptwriting, storyboarding, st... | 15 | As a researcher with 15 years across startups ... | Build a dream | In Paris, I balance focused work with creative... | 48.8566 | 2.3522 | 1988-11-24 | 36 | 35-44 | BigFive(O=0.75, C=0.8333333333333334, E=0.8333... |
| 1 | 2 | Marie Curie | Writer | music production, ocean conservation, VR socia... | lab techniques, strategy, scriptwriting, marke... | 15 | As a writer with 15 years across startups and ... | Find projects | I live in Toronto. I value calm schedules, asy... | 43.6532 | -79.3832 | 1976-06-25 | 49 | 45-54 | BigFive(O=0.3333333333333333, C=0.416666666666... |
| 2 | 3 | Ada Lovelace | Researcher | next social network, beauty brand, lipstick R&... | podcasting, django, genomics | 1 | As a researcher with 1 years across startups a... | Find projects | I live in New York. I value calm schedules, as... | 40.7128 | -74.006 | 1979-06-01 | 46 | 45-54 | BigFive(O=0.4166666666666667, C=0.333333333333... |
| 3 | 4 | Alan Turing | Scientist | lipstick R&D, open source tools, ocean conserv... | supply chain, scriptwriting, aws, grant writin... | 7 | As a scientist with 7 years across startups an... | Build a dream | In Paris, I balance focused work with creative... | 48.8566 | 2.3522 | 1998-11-19 | 26 | 25-34 | BigFive(O=0.5833333333333333, C=0.249999999999... |
| 4 | 5 | Nikola Tesla | Researcher | mental health, long-form YouTube, fashion sust... | podcasting, strategy, branding | 3 | As a researcher with 3 years across startups a... | Expand network | In New York, I balance focused work with creat... | 40.7128 | -74.006 | 1984-03-21 | 41 | 35-44 | BigFive(O=0.33333333333333337, C=0.24999999999... |
| 5 | 6 | Katherine Johnson | Researcher | privacy-first messaging, next social network, ... | django, go, fundraising, tensorflow, supply ch... | 3 | As a researcher with 3 years across startups a... | Find collaborators | Mexico City is home base. I keep a simple rout... | 19.4326 | -99.1332 | 1993-03-24 | 32 | 25-34 | BigFive(O=0.75, C=0.5833333333333334, E=0.1666... |
| 6 | 7 | Rosalind Franklin | Engineer | financial inclusion, long-form YouTube, publis... | grant writing, lab techniques, branding, djang... | 3 | As a engineer with 3 years across startups and... | Expand network | In Toronto, I balance focused work with creati... | 43.6532 | -79.3832 | 1988-06-22 | 37 | 35-44 | BigFive(O=0.5, C=0.41666666666666663, E=0.8333... |
| 7 | 8 | Grace Hopper | Founder | newsletter growth, long-form YouTube, coral re... | react, causal inference, nlp, aws, strategy, m... | 3 | As a founder with 3 years across startups and ... | Find projects | In Mexico City, I balance focused work with cr... | 19.4326 | -99.1332 | 1986-01-28 | 39 | 35-44 | BigFive(O=0.75, C=0.5, E=0.5, A=0.5, N=0.5) |
| 8 | 9 | Stephen Hawking | Founder | fashion sustainability, newsletter growth, cli... | grant writing, cv, dbt, strategy, oceanography | 4 | As a founder with 4 years across startups and ... | Build a dream | New York is home base. I keep a simple routine... | 40.7128 | -74.006 | 1961-04-23 | 64 | 55+ | BigFive(O=0.3333333333333333, C=0.666666666666... |
| 9 | 10 | Tim Berners-Lee | Engineer | open source tools, language learning, lipstick... | lab techniques, biostatistics, branding, nlp, ... | 6 | As a engineer with 6 years across startups and... | Expand network | I live in Paris. I value calm schedules, async... | 48.8566 | 2.3522 | 1968-08-12 | 56 | 55+ | BigFive(O=0.25, C=0.5, E=0.25, A=0.75, N=0.5) |


## Build all similarity signals

In [10]:

# Text
vec, X_text = build_text_matrix(users)
S_text = cosine_similarity(X_text)
S_text = (S_text - S_text.min())/(S_text.max()-S_text.min()+1e-9)

# Geo/Exp/Role
S_geo = geo_similarity(users)
S_exp = experience_compatibility(users['years_exp'].tolist())
S_role = role_complementarity(users)
S_content = combine_content(S_text, S_geo, S_exp, S_role)

# Skills
S_skills_sim = similar_skills_matrix(users)
S_skills_comp = complementary_skills_matrix(users)

# Implicit CF from synthetic likes (biased by roles)
n = len(users)
R = np.zeros((n,n), dtype=float)
for _ in range(600):
    u = random.randrange(n); v = random.randrange(n)
    if u==v: continue
    if users.iloc[u].role=="Founder" and users.iloc[v].role in ["Engineer","Designer"]: R[u,v]=1.0
    elif users.iloc[u].role=="Creator" and users.iloc[v].role in ["Writer","Designer","Engineer"]: R[u,v]=1.0
    elif random.random() < 0.05: R[u,v]=1.0
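# Item-based view: rows of R.T are "who liked this user", so two users are
# similar when they are liked by the same people.
S_cf = cosine_similarity(R.T)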
S_cf = (S_cf - S_cf.min())/(S_cf.max()-S_cf.min()+1e-9)

# Graph PPR (directed)
G = nx.DiGraph(); G.add_nodes_from(users['user_id'].tolist())
edges = [(int(users.iloc[u].user_id), int(users.iloc[v].user_id)) for u in range(n) for v in range(n) if R[u,v]>0]
G.add_edges_from(edges)
nodes = sorted(G.nodes()); idx = {u:i for i,u in enumerate(nodes)}
S_graph = np.zeros((n,n))
for u in nodes:
    pr = nx.pagerank(G, alpha=0.8, personalization={k:(1.0 if k==u else 0.0) for k in nodes})
    for v,s in pr.items(): S_graph[idx[u], idx[v]] = s
S_graph = (S_graph - S_graph.min())/(S_graph.max()-S_graph.min()+1e-12)

# Personality matrix: cosine similarity between Big Five vectors (diagonal left at 0)
P = np.array([[bf.O, bf.C, bf.E, bf.A, bf.N] for bf in users['bf']])
norms = np.linalg.norm(P, axis=1)
S_person = (P @ P.T) / (np.outer(norms, norms) + 1e-9)
np.fill_diagonal(S_person, 0.0)
S_person = (S_person - S_person.min())/(S_person.max()-S_person.min()+1e-9)
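
For intuition about the graph signal, here is a minimal pure-Python power iteration for Personalized PageRank on a four-node toy graph. It is a sketch of the idea, not the `networkx` implementation used above: the function name, the adjacency-list input, and the fixed iteration count are assumptions. Dangling nodes return their mass to the seed, which is also what `nx.pagerank` does by default when `dangling` is not given.

```python
def personalized_pagerank(adj, seed, alpha=0.8, iters=200):
    """Power iteration for PPR on an adjacency list; all restart mass goes to `seed`.

    Dangling nodes (no out-edges) hand their mass back to the seed.
    """
    n = len(adj)
    p = [1.0 if i == seed else 0.0 for i in range(n)]
    for _ in range(iters):
        nxt = [0.0] * n
        nxt[seed] = 1.0 - alpha              # restart step
        for u, outs in enumerate(adj):
            if outs:
                share = alpha * p[u] / len(outs)
                for v in outs:
                    nxt[v] += share          # follow an out-edge
            else:
                nxt[seed] += alpha * p[u]    # dangling node: jump back to the seed
        p = nxt
    return p

# Toy "likes" graph: 0 -> 1 -> 2, and 3 -> 0 (nobody likes user 3).
adj = [[1], [2], [], [0]]
scores = personalized_pagerank(adj, seed=0)
print([round(s, 4) for s in scores])
```

Scores decay with distance from the seed, and user 3 gets zero because no walk starting at user 0 can reach it. This is why each row of `S_graph` is a view of the network from one user's position.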


## Top‑K recommendations & mutual pairs

In [11]:

def top_matches_for(idx: int, skills_mode="similar", k=3):
    S_sk = S_skills_sim if skills_mode=="similar" else S_skills_comp
    S_final = fuse_scores(S_content, S_cf, S_graph, S_person, S_sk)
    picks = mmr(idx, S_final, K=k, lambda_rel=0.7)
    cols = ['user_id','name','role','interests','skills','years_exp','age','age_band','reason_for_joining']
    return users.iloc[picks][cols].assign(score=[S_final[idx,j] for j in picks])

# Example: show for a few seed users under both modes
sample_idx = [0, 5, 20, 40, 75]
demo = {}
for i in sample_idx:
    uname = users.iloc[i]["name"]
    demo[f"{uname} (similar)"] = top_matches_for(i, "similar", 3)
    demo[f"{uname} (complementary)"] = top_matches_for(i, "complementary", 3)
demo


`'Albert Einstein (similar)'`:

| | user_id | name | role | interests | skills | years_exp | age | age_band | reason_for_joining | score |
|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 19 | Billie Eilish | Writer | financial inclusion, VR social spaces, educati... | gcp, scriptwriting, storyboarding, kubernetes,... | 19 | 59 | 55+ | Find projects | 0.859246 |
| 3 | 4 | Alan Turing | Scientist | lipstick R&D, open source tools, ocean conserv... | supply chain, scriptwriting, aws, grant writin... | 7 | 26 | 25-34 | Build a dream | 0.800828 |
| 70 | 71 | Noam Chomsky | Scientist | next social network, short-form video, languag... | scriptwriting, product, field research, dbt, p... | 18 | 57 | 55+ | Expand network | 0.765735 |

`'Albert Einstein (complementary)'`: output truncated in the source partway through its first row (index 18, user_id 19).

In [12]:

def mutual_best_pairs(skills_mode="similar"):
    S_sk = S_skills_sim if skills_mode=="similar" else S_skills_comp
    S_final = fuse_scores(S_content, S_cf, S_graph, S_person, S_sk)
    n = S_final.shape[0]
    best = {i:int(np.argmax(S_final[i,:] + (np.arange(n)==i)*-1e9)) for i in range(n)}
    used=set(); pairs=[]
    for i in range(n):
        if i in used: continue
        j = best[i]
        if j!=i and best.get(j)==i and j not in used:
            pairs.append((i,j,float(S_final[i,j])))
            used.add(i); used.add(j)
    pairs = sorted(pairs, key=lambda x: -x[2])
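    # Use ["name"]: attribute access (.name) on a row returns the row's index label, not the column.
    return [(users.iloc[i]["name"], users.iloc[j]["name"], score) for i,j,score in pairs]

pairs_similar = mutual_best_pairs("similar")[:20]
pairs_complementary = mutual_best_pairs("complementary")[:20]
pairs_similar[:5], pairs_complementary[:5]


([('Carl Sagan', 'Nigella Lawson', 0.9560341484405946),
  ('Mark Zuckerberg', 'Steven Pinker', 0.9527607725742798),
  ('Greta Thunberg', 'Ava DuVernay', 0.9252068344650105),
  ('Drake', 'Noam Chomsky', 0.893487970813419),
  ('Whitney Wolfe Herd', 'Neil deGrasse Tyson', 0.8834666424168107)],
 [('Carl Sagan', 'Nigella Lawson', 0.9349303738187729),
  ('Mark Zuckerberg', 'Steven Pinker', 0.9176032487591314),
  ('Yoshua Bengio', 'Gordon Ramsay', 0.9004042753407553),
  ('Whitney Wolfe Herd', 'Neil deGrasse Tyson', 0.8834666424168107),
  ('Tim Berners-Lee', 'Taika Waititi', 0.8179259232431988)])

In [14]:

pairs_similar

[('Carl Sagan', 'Nigella Lawson', 0.9560341484405946),
 ('Mark Zuckerberg', 'Steven Pinker', 0.9527607725742798),
 ('Greta Thunberg', 'Ava DuVernay', 0.9252068344650105),
 ('Drake', 'Noam Chomsky', 0.893487970813419),
 ('Whitney Wolfe Herd', 'Neil deGrasse Tyson', 0.8834666424168107),
 ('Michelle Obama', 'Greta Gerwig', 0.8724798460771322),
 ('Stephen Hawking', 'Quincy Jones', 0.8690482557878062),
 ('J.K. Rowling', 'Jane Goodall', 0.868352330693554),
 ('Tim Berners-Lee', 'Taika Waititi', 0.8620415550820019),
 ('Albert Einstein', 'Billie Eilish', 0.8592461488135739),
 ('Geoffrey Hinton', 'Lex Fridman', 0.855925112719942),
 ('Usain Bolt', 'Kendrick Lamar', 0.854671382050292),
 ('Serena Williams', 'Christina Tosi', 0.8536957914903776),
 ('Rihanna', 'Simone Biles', 0.8488206294323145),
 ('Sundar Pichai', 'Kathryn Bigelow', 0.8415471855196504),
 ('Ada Lovelace', 'John Williams', 0.8407571079817655),
 ('LeBron James', 'Neil Gaiman', 0.8356406008959758),
 ('J. Cole', 'David Attenborough', 0.8282395271525106),
 ('Ariana Grande', 'Jamie Oliver', 0.820559268911069),
 ('Steve Jobs', 'Adele', 0.815983720948886)]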
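
Mutual-best pairing only keeps pairs in which each user is the other's top choice, so some users end up unpaired. A toy sketch of that rule on a 4x4 score matrix (`mutual_best` and the numbers are illustrative assumptions; the matrix is deliberately asymmetric to show a non-reciprocated pick, whereas the fused matrix above is symmetric after reciprocalization):

```python
def mutual_best(S):
    """Return index pairs (i, j), i < j, where i's top pick is j and j's top pick is i."""
    n = len(S)
    best = [max((j for j in range(n) if j != i), key=lambda j: S[i][j]) for i in range(n)]
    return sorted((i, j) for i, j in enumerate(best) if i < j and best[j] == i)

S = [
    [0.0, 0.9, 0.4, 0.2],
    [0.9, 0.0, 0.3, 0.1],
    [0.4, 0.3, 0.0, 0.35],
    [0.2, 0.1, 0.6, 0.0],
]
pairs = mutual_best(S)
print(pairs)  # [(0, 1)]
```

User 3 prefers user 2, but user 2 prefers user 0, who is already matched with user 1, so only one pair survives.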


## Notes & references
- **TIPI** (10‑item Big Five) — scoring & properties. Use as a quick intake; upgrade to BFI‑2 later.
- **Personalized PageRank** — graph signal for friends‑of‑friends and network adjacency.
- **Hungarian algorithm** — complementary skills coverage via `scipy.optimize.linear_sum_assignment`.

**Important**: All attributes generated here for public‑figure names are **fictional** and for **testing only**. Do not treat as factual.
