# User Matching Experiment

This experiment demonstrates how to compute a compatibility score between two users based on a shared conversation transcript.

## Decision 
- TF-IDF Vectorizer
- Topic Weight : 0.5, Psychometrics Weight : 1.0
- Compatibility Classes: High / Moderate / Low (score above 0.8 / above 0.5 / otherwise)

Method Choices
1. TF-IDF Vectorization : converts the topic keywords into a numeric array of real-valued weights
2. One-hot Encoding : produces only 0s and 1s; it works better for categorical features and does not fit this task

Weight Choices
1. Topic Weight : range 0.0 - 1.0. I chose 0.5 because users who listen to the same audio (there is only one audio/transcript here) may be more likely to be compatible
2. Psychometrics Weight : range 0.0 - 1.0. I chose 1.0 because, in my opinion, personality is still the more dominant factor

#### Step1: Env Setup & Load Users

**Observation**
- From the raw data, a clear difference can be noticed: the per-trait differences between the two users range from 0.4 to 0.6
    - traits = [openness, conscientiousness, extraversion, agreeableness, neuroticism]
    - user_1 {'id': 'user_1', 'psychometrics': [0.8, 0.4, 0.7, 0.2, 0.9]}
    - user_2 {'id': 'user_2', 'psychometrics': [0.3, 0.9, 0.1, 0.6, 0.4]}
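The 0.4 to 0.6 range quoted above can be checked with a few lines of plain Python. This is a minimal sketch that hard-codes the two psychometric vectors shown in the observation; the names `trait_gaps`, `user_1_psych`, and `user_2_psych` are illustrative and not part of the notebook.

```python
# Trait order as listed in the observation above.
traits = ["openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism"]
user_1_psych = [0.8, 0.4, 0.7, 0.2, 0.9]
user_2_psych = [0.3, 0.9, 0.1, 0.6, 0.4]

# Absolute per-trait gap, rounded to hide floating-point noise.
trait_gaps = {
    name: round(abs(a - b), 2)
    for name, a, b in zip(traits, user_1_psych, user_2_psych)
}

for name, gap in trait_gaps.items():
    print(f"{name:<18} {gap:.1f}")

print("min gap:", min(trait_gaps.values()), "max gap:", max(trait_gaps.values()))
```

Extraversion shows the largest gap and agreeableness the smallest; the other three traits differ by 0.5.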

In [None]:
import json
from pathlib import Path
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


USER_PATH = Path("../sample_data/synthetic_users.json")

with open(USER_PATH) as f:
    users = json.load(f)

user_1 = users[0]
user_2 = users[1]
print("user_1", user_1)
print("user_2", user_2)

user_1 {'id': 'user_1', 'psychometrics': [0.8, 0.4, 0.7, 0.2, 0.9]}
user_2 {'id': 'user_2', 'psychometrics': [0.3, 0.9, 0.1, 0.6, 0.4]}


#### Step2: Vectorize the topics with TF-IDF

**Note**
1. KeyBERT gives semantic keywords (strings), but not a fixed-length vector
2. TF-IDF allows us to **encode** these keywords numerically into a vector space
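To make the encoding step concrete, here is a hand-rolled, pure-Python sketch of smoothed TF-IDF with L2 normalisation. It is intended to mirror the default formula documented for scikit-learn's `TfidfVectorizer`, but treat that equivalence as an assumption; tokenisation is not reproduced, since keywords are taken as-is. The helper `tfidf_vectors` and the second keyword list `hypothetical_topics` are made up for illustration and do not come from the experiment data.

```python
import math

def tfidf_vectors(documents):
    """Encode keyword lists as L2-normalised TF-IDF vectors over one shared vocabulary."""
    vocabulary = sorted({term for doc in documents for term in doc})
    n_docs = len(documents)

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {}
    for term in vocabulary:
        df = sum(1 for doc in documents if term in doc)
        idf[term] = math.log((1 + n_docs) / (1 + df)) + 1.0

    vectors = []
    for doc in documents:
        raw = [doc.count(term) * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # guard against empty docs
        vectors.append([x / norm for x in raw])
    return vocabulary, vectors

keybert_topics = ['starships', 'terraforming', 'scales', 'plan', 'synchronization']
hypothetical_topics = ['starships', 'plan', 'budget']  # invented second user, illustration only

# Fitting on both keyword lists at once keeps the two vectors in the same space.
vocab, (vec_a, vec_b) = tfidf_vectors([keybert_topics, hypothetical_topics])
print(vocab)
print([round(x, 3) for x in vec_a])
print([round(x, 3) for x in vec_b])

# With a single document, every keyword gets the same weight.
_, (single_vec,) = tfidf_vectors([keybert_topics])
print([round(x, 7) for x in single_vec])
```

The design point is that all users are encoded against one shared vocabulary, so the vectors have the same length and their dimensions line up. Encoding each user separately only yields comparable vectors when their keyword lists happen to be identical, as they are in this experiment.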

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

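def vectorize_topics(topics, method='tfidf'):
    # `method` is currently unused; only TF-IDF is implemented.
    # Join the topic list into one document string for this user.
    text = " ".join(topics)
    # The vectorizer is fit on this single document, so the vocabulary
    # (and the meaning of each dimension) depends on this user's topics only.
    vectorizer = TfidfVectorizer()
    vec = vectorizer.fit_transform([text])
    return vec.toarray()[0]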

# Both users listened to the same transcript (topic_extraction_demo.ipynb)
keybert_topics = ['starships', 'terraforming', 'scales', 'plan', 'synchronization']
user_1_topic_vec = vectorize_topics(keybert_topics)
user_2_topic_vec = vectorize_topics(keybert_topics)
print("user_1_topic_vec", user_1_topic_vec)
print("user_2_topic_vec", user_2_topic_vec)


user_1_topic_vec [0.4472136 0.4472136 0.4472136 0.4472136 0.4472136]
user_2_topic_vec [0.4472136 0.4472136 0.4472136 0.4472136 0.4472136]


**Observation**
- With only one transcript, both users get the same topic vector, so topics alone cannot differentiate them
- The psychometric weight is the key lever for producing meaningful compatibility scores

**Solution**
- Adjust w_topic and w_psych to control how much psychometrics versus topics contribute to the compatibility score in the next step:

    - w_psych >> w_topic → compatibility mostly reflects personality similarity
    - w_topic >> w_psych → topics dominate (less meaningful here, since topics are identical)
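The effect of the two weights can be sketched with a small standalone sweep. This is a pure-Python illustration rather than the notebook's own code: `cosine` and `weighted_score` are hypothetical helpers, the topic vector is the uniform 1/sqrt(5) vector produced for the shared transcript, and the weight pairs other than (0.5, 1.0) are arbitrary examples.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length lists of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def weighted_score(topic_1, psych_1, topic_2, psych_2, w_topic, w_psych):
    """Scale each block by its weight, concatenate, and compare the two users."""
    vec_1 = [w_topic * x for x in topic_1] + [w_psych * x for x in psych_1]
    vec_2 = [w_topic * x for x in topic_2] + [w_psych * x for x in psych_2]
    return cosine(vec_1, vec_2)

# Identical topic vector for both users (same transcript, five equally weighted keywords).
topic_vec = [1 / math.sqrt(5)] * 5
psych_1 = [0.8, 0.4, 0.7, 0.2, 0.9]
psych_2 = [0.3, 0.9, 0.1, 0.6, 0.4]

# Example weight pairs, ordered from "personality only" to "topics dominate".
weight_pairs = [(0.0, 1.0), (0.5, 1.0), (1.0, 1.0), (1.0, 0.5)]
scores = {
    pair: weighted_score(topic_vec, psych_1, topic_vec, psych_2, *pair)
    for pair in weight_pairs
}

for (w_topic, w_psych), value in scores.items():
    print(f"w_topic={w_topic:.1f}  w_psych={w_psych:.1f}  score={value:.3f}")
```

Because the topic block is identical for both users, giving it more relative weight pulls the score toward 1 regardless of personality, which is why a large `w_topic` is less informative in this setup.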

#### Step3: Combine with Psychometrics

As discussed above, the topic and psychometrics (personality) vectors are weighted before being combined.

I define the weights to be:
- topic weight : 0.5, since users who listen to the same audio may share similar interests (e.g. people with the same interests tend to listen to the same podcasts)
- psychometrics weight : 1.0, since personality is the more dominant factor in deciding whether people get along

In [None]:
def combine_vectors(topic_vec, psych_vec, topic_weight=1.0, psych_weight=1.0):
    topic_vec = np.array(topic_vec)
    psych_vec = np.array(psych_vec)
    
    # Scale each block by its weight, then concatenate into one vector
    combined = np.concatenate([topic_weight * topic_vec, psych_weight * psych_vec])
    return combined

w_topic = 0.5  
w_psych = 1.0  

user_1_combined = combine_vectors(user_1_topic_vec, user_1["psychometrics"], w_topic, w_psych)
user_2_combined = combine_vectors(user_2_topic_vec, user_2["psychometrics"], w_topic, w_psych)
print("user_1_combined", user_1_combined)
print("user_2_combined", user_2_combined)

user_1_combined [0.2236068 0.2236068 0.2236068 0.2236068 0.2236068 0.8       0.4
 0.7       0.2       0.9      ]
user_2_combined [0.2236068 0.2236068 0.2236068 0.2236068 0.2236068 0.3       0.9
 0.1       0.6       0.4      ]


#### Step4: Compute Compatibility with Cosine Similarity

Define the compatibility classes:
- High : 0.8 < score ≤ 1.0
- Moderate : 0.5 < score ≤ 0.8
- Low : 0.0 ≤ score ≤ 0.5

In [43]:
def compute_compatibility(vec1, vec2):
    score = cosine_similarity([vec1], [vec2])[0][0]
    
    # Simple interpretation
    if score > 0.8:
        interpretation = "Highly compatible"
    elif score > 0.5:
        interpretation = "Moderately compatible"
    else:
        interpretation = "Low compatibility"
    
    return score, interpretation

score, interpretation = compute_compatibility(user_1_combined, user_2_combined)

print(f"Compatibility Score: {score:.3f}")
print(f"Interpretation: {interpretation}")

Compatibility Score: 0.699
Interpretation: Moderately compatible
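As a sanity check on the printed score, the cosine similarity of the two concatenated vectors can be expanded by block. The notation is introduced here for illustration only: $t_i$ and $p_i$ are the topic and psychometric vectors of user $i$, and $w_t$, $w_p$ are the topic and psychometric weights.

$$
\cos(u_1, u_2) = \frac{w_t^2\,(t_1 \cdot t_2) + w_p^2\,(p_1 \cdot p_2)}{\sqrt{w_t^2\lVert t_1\rVert^2 + w_p^2\lVert p_1\rVert^2}\;\sqrt{w_t^2\lVert t_2\rVert^2 + w_p^2\lVert p_2\rVert^2}}
$$

With $w_t = 0.5$, $w_p = 1.0$, identical unit-length topic vectors ($t_1 \cdot t_2 = \lVert t_i\rVert^2 = 1$), $p_1 \cdot p_2 = 1.15$, $\lVert p_1\rVert^2 = 2.14$, and $\lVert p_2\rVert^2 = 1.43$:

$$
\cos(u_1, u_2) = \frac{0.25 + 1.15}{\sqrt{0.25 + 2.14}\;\sqrt{0.25 + 1.43}} = \frac{1.40}{\sqrt{2.39}\,\sqrt{1.68}} \approx 0.699
$$

The shared topic block contributes the same 0.25 to the numerator and to both squared norms, which is how it nudges the score upward without changing the personality terms.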


**Observation**
- Since both users share the same transcript topics, the topic block gets the smaller weight (0.5). The differentiating factor comes primarily from their psychometric vectors, so a higher weight (1.0) is used to distinguish them
- From the raw data, the per-trait differences range from 0.4 to 0.6, which is roughly low to moderate
- Because both users listened to the same audio, they likely share similar interests, so including the weighted topic block raises the final compatibility score relative to psychometrics alone
