
# Rocket Matching MVP — Hybrid Recommender (IPYNB)

This notebook scaffolds a **hybrid** matching system for Rocket that scores potential connections using:
- **Content-based similarity** on interests/skills/bios/location
- **Collaborative filtering** on implicit interactions (likes, matches, follows)
- **Graph/Markov signal** via personalized PageRank/random walks on the user network
- **Reciprocity** (u likes v and v likely likes u) and **complementarity** (e.g., founder ↔ engineer/designer)

> Swap synthetic data for your real datasets as soon as your Django API is ready. The plumbing below is designed to be modular.


In [25]:

# !pip install scikit-learn numpy pandas networkx geopy
# If running locally, uncomment the line above to install missing packages.
import numpy as np
import pandas as pd
from typing import List, Tuple
from geopy.distance import geodesic  # for haversine
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

np.random.seed(42)



## 1) Synthetic data (replace with real)
Fields: `user_id, name, role, interests, skills, bio, lat, lon, years_exp`.


In [None]:

users = pd.DataFrame([
    dict(user_id=1, name="Alex", role="Founder", interests="ai, healthcare, remote work", 
         skills="product, fundraising, strategy", bio="building AI for healthcare ops", 
         lat=43.6532, lon=-79.3832, years_exp=7),
     dict(user_id=1, name="Billybob", role="Founder", interests="ai, healthcare, remote work", 
         skills="product, fundraising, strategy", bio="building AI for healthcare ops", 
         lat=43.6532, lon=-79.3832, years_exp=7),
    dict(user_id=2, name="Sam", role="Engineer", interests="ml, data, open source", 
         skills="python, pytorch, django", bio="ml engineer into OSS", 
         lat=43.7001, lon=-79.4163, years_exp=5),
    dict(user_id=3, name="Jamie", role="Designer", interests="ux, motion, branding", 
         skills="figma, design systems, prototyping", bio="designing for clarity", 
         lat=51.5072, lon=-0.1276, years_exp=6),
    dict(user_id=4, name="Taylor", role="Founder", interests="creator economy, fintech", 
         skills="growth, marketing, product", bio="creator tools + fintech", 
         lat=40.7128, lon=-74.0060, years_exp=8),
    dict(user_id=5, name="Riley", role="Engineer", interests="distributed systems, infra", 
         skills="go, kubernetes, aws", bio="SRE with taste for scale", 
         lat=37.7749, lon=-122.4194, years_exp=9),
#     dict(user_id=6, name="Morgan", role="Researcher", interests="nlp, recsys, fairness", 
#          skills="python, data science, statistics", bio="research-minded data person", 
#          lat=43.6532, lon=-79.3832, years_exp=4),
])
users


Unnamed: 0,user_id,name,role,interests,skills,bio,lat,lon,years_exp
0,1,Alex,Founder,"ai, healthcare, remote work","product, fundraising, strategy",building AI for healthcare ops,43.6532,-79.3832,7
1,1,Billybob,Founder,"ai, healthcare, remote work","product, fundraising, strategy",building AI for healthcare ops,43.6532,-79.3832,7
2,2,Sam,Engineer,"ml, data, open source","python, pytorch, django",ml engineer into OSS,43.7001,-79.4163,5
3,3,Jamie,Designer,"ux, motion, branding","figma, design systems, prototyping",designing for clarity,51.5072,-0.1276,6
4,4,Taylor,Founder,"creator economy, fintech","growth, marketing, product",creator tools + fintech,40.7128,-74.006,8
5,5,Riley,Engineer,"distributed systems, infra","go, kubernetes, aws",SRE with taste for scale,37.7749,-122.4194,9
6,6,Morgan,Researcher,"nlp, recsys, fairness","python, data science, statistics",research-minded data person,43.6532,-79.3832,4



## 2) Content features (TF‑IDF on text fields) + location + experience
We build a **content vector** per user from text (`interests + skills + bio`) and add:
- **Geo proximity** (Toronto ≈ Toronto > London > SF/NY by distance)
- **Experience compatibility** (complementary years of experience, configurable)


In [27]:

def build_text_matrix(df: pd.DataFrame) -> Tuple[TfidfVectorizer, np.ndarray]:
    corpus = (df['interests'] + " ; " + df['skills'] + " ; " + df['bio']).tolist()
    vec = TfidfVectorizer(ngram_range=(1,2), min_df=1)
    X = vec.fit_transform(corpus)
    return vec, X

vec, X_text = build_text_matrix(users)
S_text = cosine_similarity(X_text)
S_text = (S_text - S_text.min()) / (S_text.max() - S_text.min() + 1e-9)  # normalize
S_text.round(3)


array([[1.   , 1.   , 0.   , 0.021, 0.026, 0.02 , 0.   ],
       [1.   , 1.   , 0.   , 0.021, 0.026, 0.02 , 0.   ],
       [0.   , 0.   , 1.   , 0.   , 0.   , 0.   , 0.094],
       [0.021, 0.021, 0.   , 1.   , 0.   , 0.056, 0.   ],
       [0.026, 0.026, 0.   , 0.   , 1.   , 0.   , 0.   ],
       [0.02 , 0.02 , 0.   , 0.056, 0.   , 1.   , 0.   ],
       [0.   , 0.   , 0.094, 0.   , 0.   , 0.   , 1.   ]])

In [28]:

def geo_similarity(df: pd.DataFrame, decay_km: float = 3000.0) -> np.ndarray:
    n = len(df)
    S = np.zeros((n, n), dtype=float)
    coords = list(zip(df['lat'], df['lon']))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d_km = geodesic(coords[i], coords[j]).km
            S[i, j] = np.exp(-d_km / decay_km)  # RBF-style decay
    if S.max() > 0:
        S = S / S.max()
    return S

S_geo = geo_similarity(users, decay_km=2500)  # tune decay
S_geo.round(3)


array([[0.   , 1.   , 0.998, 0.101, 0.802, 0.232, 1.   ],
       [1.   , 0.   , 0.998, 0.101, 0.802, 0.232, 1.   ],
       [0.998, 0.998, 0.   , 0.101, 0.8  , 0.232, 0.998],
       [0.101, 0.101, 0.101, 0.   , 0.107, 0.032, 0.101],
       [0.802, 0.802, 0.8  , 0.107, 0.   , 0.191, 0.802],
       [0.232, 0.232, 0.232, 0.032, 0.191, 0.   , 0.232],
       [1.   , 1.   , 0.998, 0.101, 0.802, 0.232, 0.   ]])

In [29]:

def experience_compatibility(years: List[int], sweet_spot: float = 2.0) -> np.ndarray:
    # High when the experience gap is around sweet_spot (complementarity) rather than identical
    years = np.array(years)
    n = len(years)
    S = np.zeros((n, n), dtype=float)
    for i in range(n):
        for j in range(n):
            if i == j: 
                continue
            gap = abs(years[i] - years[j])
            S[i, j] = np.exp(-((gap - sweet_spot)**2) / (2 * (sweet_spot**2)))
    if S.max() > 0:
        S = S / S.max()
    return S

S_exp = experience_compatibility(users['years_exp'].tolist(), sweet_spot=3.0)
S_exp.round(3)


array([[0.   , 0.607, 0.946, 0.801, 0.801, 0.946, 1.   ],
       [0.607, 0.   , 0.946, 0.801, 0.801, 0.946, 1.   ],
       [0.946, 0.946, 0.   , 0.801, 1.   , 0.946, 0.801],
       [0.801, 0.801, 0.801, 0.   , 0.946, 1.   , 0.946],
       [0.801, 0.801, 1.   , 0.946, 0.   , 0.801, 0.946],
       [0.946, 0.946, 0.946, 1.   , 0.801, 0.   , 0.801],
       [1.   , 1.   , 0.801, 0.946, 0.946, 0.801, 0.   ]])


## 3) Role complementarity
Encourage **Founders ↔ Engineers/Designers/Researchers** (tuneable mapping). This enforces *reciprocal/professional* matching logic.


In [30]:

ROLE_COMP = {
    "Founder": {"Engineer": 1.0, "Designer": 1.0, "Researcher": 0.8, "Founder": 0.2},
    "Engineer": {"Founder": 1.0, "Designer": 0.7, "Engineer": 0.2, "Researcher": 0.6},
    "Designer": {"Founder": 1.0, "Engineer": 0.7, "Designer": 0.2, "Researcher": 0.5},
    "Researcher": {"Founder": 0.8, "Engineer": 0.6, "Designer": 0.5, "Researcher": 0.3},
}

def role_complementarity(df: pd.DataFrame) -> np.ndarray:
    roles = df['role'].tolist()
    n = len(roles)
    S = np.zeros((n, n), dtype=float)
    for i in range(n):
        for j in range(n):
            if i == j: 
                continue
            S[i, j] = ROLE_COMP.get(roles[i], {}).get(roles[j], 0.2)
    return S

S_role = role_complementarity(users)
S_role


array([[0. , 0.2, 1. , 1. , 0.2, 1. , 0.8],
       [0.2, 0. , 1. , 1. , 0.2, 1. , 0.8],
       [1. , 1. , 0. , 0.7, 1. , 0.2, 0.6],
       [1. , 1. , 0.7, 0. , 1. , 0.7, 0.5],
       [0.2, 0.2, 1. , 1. , 0. , 1. , 0.8],
       [1. , 1. , 0.2, 0.7, 1. , 0. , 0.6],
       [0.8, 0.8, 0.6, 0.5, 0.8, 0.6, 0. ]])


## 4) Content score aggregation
Weighted sum of text, geo, experience, role signals. We’ll calibrate weights later.


In [31]:

def combine_content(S_text, S_geo, S_exp, S_role, w=(0.45, 0.2, 0.15, 0.2)):
    a, b, c, d = w
    S = a*S_text + b*S_geo + c*S_exp + d*S_role
    S = S / (S.max() + 1e-9)
    return S

S_content = combine_content(S_text, S_geo, S_exp, S_role)
pd.DataFrame(S_content, index=users['name'], columns=users['name']).round(3)


name,Alex,Billybob,Sam,Jamie,Taylor,Riley,Morgan
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Alex,0.576,1.0,0.693,0.448,0.425,0.509,0.653
Billybob,1.0,0.576,0.693,0.448,0.425,0.509,0.653
Sam,0.693,0.693,0.576,0.359,0.653,0.292,0.617
Jamie,0.448,0.448,0.359,0.576,0.465,0.412,0.336
Taylor,0.425,0.425,0.653,0.465,0.576,0.459,0.592
Riley,0.509,0.509,0.292,0.412,0.459,0.576,0.367
Morgan,0.653,0.653,0.617,0.336,0.592,0.367,0.576



## 5) Collaborative filtering (implicit) — user↔user
Simulate implicit interactions (likes/follows). In production, train an ALS/BPR model and infer user embeddings.


In [32]:

# Synthetic implicit interactions: list of (src_user, dst_user) "likes"
edges = [
    (1, 2), (1, 6),   # Alex likes Sam, Morgan
    (2, 1), (2, 3),   # Sam likes Alex, Jamie
    (3, 4),           # Jamie likes Taylor
    (4, 5),           # Taylor likes Riley
    (5, 2),           # Riley likes Sam
    (6, 1), (6, 2),   # Morgan likes Alex, Sam
]
n = len(users)
R = np.zeros((n, n), dtype=float)
for u,v in edges:
    R[u-1, v-1] = 1.0  # implicit feedback

# Simple item-based CF: similarity between "receivers"
S_cf = cosine_similarity(R.T)  # who tends to be liked by similar people
S_cf = (S_cf - S_cf.min()) / (S_cf.max() - S_cf.min() + 1e-9)
pd.DataFrame(S_cf, index=users['name'], columns=users['name']).round(3)


name,Alex,Billybob,Sam,Jamie,Taylor,Riley,Morgan
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Alex,1.0,0.408,0.707,0.0,0.0,0.0,0.0
Billybob,0.408,1.0,0.0,0.0,0.0,0.577,0.0
Sam,0.707,0.0,1.0,0.0,0.0,0.0,0.0
Jamie,0.0,0.0,0.0,1.0,0.0,0.0,0.0
Taylor,0.0,0.0,0.0,0.0,1.0,0.0,0.0
Riley,0.0,0.577,0.0,0.0,0.0,1.0,0.0
Morgan,0.0,0.0,0.0,0.0,0.0,0.0,0.0



## 6) Graph/Markov signal — Personalized PageRank (random walk with restart)
Encodes “friends‑of‑friends” discovery. We compute a PPR vector per source user on the **directed** like/follow graph.


In [33]:

G = nx.DiGraph()
G.add_nodes_from(users['user_id'].tolist())
G.add_edges_from(edges)

def personalized_pagerank_scores(G: nx.DiGraph, source: int, alpha: float = 0.15):
    personalization = {n: 0.0 for n in G.nodes()}
    personalization[source] = 1.0
    pr = nx.pagerank(G, alpha=1-alpha, personalization=personalization)  # restart prob = alpha
    return pr

# Example: PPR scores from user 1 (Alex)
ppr_1 = personalized_pagerank_scores(G, 1)
sorted(ppr_1.items(), key=lambda x: x[1], reverse=True)


[(1, 0.31841925775474156),
 (2, 0.2609510265903641),
 (6, 0.13532774797368194),
 (3, 0.11090471697680314),
 (4, 0.09426841590038328),
 (5, 0.08012883480402593)]


## 7) Reciprocity & final ranking
For reciprocal/professional matching, we want `score(u→v)` **and** `score(v→u)` to be high. We symmetrize by geometric mean.
Then blend content, CF, and graph signals.


In [34]:

def reciprocalize(S: np.ndarray) -> np.ndarray:
    # symmetric by geometric mean
    return np.sqrt(S * S.T + 1e-12)

def final_score(S_content, S_cf, G: nx.DiGraph, alpha_graph=0.2, w=(0.55, 0.25, 0.2)):
    # Compute PPR matrix (n source users)
    nodes = sorted(G.nodes())
    idx = {u:i for i,u in enumerate(nodes)}
    n = len(nodes)
    S_graph = np.zeros((n,n))
    for u in nodes:
        pr = personalized_pagerank_scores(G, u, alpha=alpha_graph)
        for v, s in pr.items():
            S_graph[idx[u], idx[v]] = s
    # normalize
    S_graph = (S_graph - S_graph.min()) / (S_graph.max() - S_graph.min() + 1e-12)

    # Symmetrize (reciprocity) each signal
    S_c = reciprocalize(S_content)
    S_cf_sym = reciprocalize(S_cf)
    S_g = reciprocalize(S_graph)

    a,b,c = w
    S = a*S_c + b*S_cf_sym + c*S_g
    S = S / (S.max() + 1e-12)
    return S, S_graph

S_final, S_graph = final_score(S_content, S_cf, G)
pd.DataFrame(S_final, index=users['name'], columns=users['name']).round(3)


ValueError: operands could not be broadcast together with shapes (7,7) (6,6) 


## 8) Diversification (MMR) and top‑K picks
Avoid showing near‑duplicates. Use **Maximal Marginal Relevance** versus the query user.


In [15]:

def mmr(query_idx: int, S: np.ndarray, K: int = 3, lambda_rel: float = 0.7):
    n = S.shape[0]
    candidates = [i for i in range(n) if i != query_idx]
    selected = []
    while candidates and len(selected) < K:
        if not selected:
            # pick most relevant
            i = max(candidates, key=lambda j: S[query_idx, j])
            selected.append(i); candidates.remove(i)
        else:
            def score(j):
                redundancy = max(S[j, s] for s in selected) if selected else 0.0
                return lambda_rel * S[query_idx, j] - (1-lambda_rel) * redundancy
            i = max(candidates, key=score)
            selected.append(i); candidates.remove(i)
    return selected

def top_matches_for(user_name: str, k=3):
    i = users.index[users['name'] == user_name][0]
    picks = mmr(i, S_final, K=k, lambda_rel=0.7)
    cols = ['user_id','name','role','interests','skills','years_exp','lat','lon']
    return users.iloc[picks][cols].assign(score=[S_final[i, j] for j in picks])

top_matches_for("Alex", k=3)


Unnamed: 0,user_id,name,role,interests,skills,years_exp,lat,lon,score
1,2,Sam,Engineer,"ml, data, open source","python, pytorch, django",5,43.7001,-79.4163,0.832235
2,3,Jamie,Designer,"ux, motion, branding","figma, design systems, prototyping",6,51.5072,-0.1276,0.625508
5,6,Morgan,Researcher,"nlp, recsys, fairness","python, data science, statistics",4,43.6532,-79.3832,0.660076



## 9) Tuning
- Weights `w` in `combine_content` and `final_score`
- Geo decay, experience sweet spot, role map
- Graph restart rate `alpha_graph`
- Diversification `lambda_rel`
Use offline A/B on historical acceptance/reply/match labels (precision@k, MAP, nDCG, coverage, diversity).



## 10) Next steps (hook to production)
- Replace synthetic `users` with your **Django API** dump (profiles + interactions)
- Swap TF‑IDF with **sentence embeddings** (e.g., `sentence-transformers`) and **pgvector** for scalable ANN
- Swap toy CF with **ALS/BPR** (e.g., `implicit` or **LightFM** for hybrid)
- Move PPR to a batch job; cache per user; refresh daily
- Keep a **reciprocity gate**: only show if `score(u→v)` and `score(v→u)` exceed thresholds
- Add fairness/serendipity constraints and cold‑start rules
