# User Matching Experiment

This experiment demonstrates how to compute a compatibility score between two users based on a shared conversation transcript.

## Decision 
- **TF-IDF Vectorizer**: Good for vectorizing topics into numerical arrays
- **Topic Weight**: 0.5, **Psychometrics Weight**: 1.0
- **Compatibility Classes**: 7 levels (0.9/0.8/0.7/0.6/0.4/0.2/0.0)
- **Data Preprocessing**: Normalization + Resampling for production robustness


Method Choices
1. **TF-IDF Vectorization**: Good to vectorize into an array of numbers
2. **One-hot Encoding**: Producing only 0 and 1, better with categorical features and doesn't fit this task

Weight Choices
1. **Topic Weight** : 0.0 - 1.0, I chose 0.5 because users that listen to the same audio (only one audio/transcript here) may have a higher tendency to show compatibility
2. **Psychometrics Weight** : 0.0 - 1.0, I chose 1.0 as personality is still a more dominant factor in my opinion 

Data Preprocessing
1. **Normalization**: Min-max scaling to [0,1] range ensures fair comparison regardless of original data scales
2. **Resampling**: Linear interpolation handles different psychometric vector lengths
3. **Edge Case Handling**: Graceful handling of empty data, single values, and out-of-range inputs
4. **Mathematical Stability**: Prevents division by zero and ensures valid [0,1] ranges

Experiment Result (with vs. without data preprocessing)
- **Original Score**: 0.699 (Moderately compatible - Decent match)
- **Processed Score**: 0.516 (Somewhat compatible - Weak match)
- **Difference**: 0.183 - The processed score is more accurate as it eliminates scale bias

#### Step1: Env Setup & Load Users

**Observation**
- From the raw data, a clear difference can be noticed: the absolute difference on each trait ranges from 0.4 to 0.6
    - traits = [openness, conscientiousness, extraversion, agreeableness, neuroticism]
    - user_1 {'id': 'user_1', 'psychometrics': [0.8, 0.4, 0.7, 0.2, 0.9]}
    - user_2 {'id': 'user_2', 'psychometrics': [0.3, 0.9, 0.1, 0.6, 0.4]}
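The per-trait gaps quoted above can be checked with a few lines of plain Python. This is a minimal sketch that hard-codes the two psychometric vectors shown in the observation; the variable names `user_1_psych`, `user_2_psych` and `trait_gaps` are illustrative and not part of the notebook.

```python
# Psychometric vectors copied from the observation above
traits = ["openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism"]
user_1_psych = [0.8, 0.4, 0.7, 0.2, 0.9]
user_2_psych = [0.3, 0.9, 0.1, 0.6, 0.4]

# Absolute gap per trait, rounded to hide floating-point noise
trait_gaps = [round(abs(a - b), 2) for a, b in zip(user_1_psych, user_2_psych)]

for name, gap in zip(traits, trait_gaps):
    print(f"{name:>17}: {gap}")

print("min gap:", min(trait_gaps), "max gap:", max(trait_gaps))
```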

In [74]:
import json
from pathlib import Path
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
np.set_printoptions(precision=3, suppress=True)


USER_PATH = Path("../sample_data/synthetic_users.json")

with open(USER_PATH) as f:
    users = json.load(f)

user_1 = users[0]
user_2 = users[1]
print("user_1", user_1)
print("user_2", user_2)

user_1 {'id': 'user_1', 'psychometrics': [0.8, 0.4, 0.7, 0.2, 0.9]}
user_2 {'id': 'user_2', 'psychometrics': [0.3, 0.9, 0.1, 0.6, 0.4]}


#### Step2 : Vectorize the topics with TF-IDF

**Note**
1. KeyBERT gives semantic keywords (strings), but no fixed-length vector
2. TF-IDF allows us to **encode** these keywords numerically into a vector space

In [73]:
from sklearn.feature_extraction.text import TfidfVectorizer

def vectorize_topics(topics, method='tfidf'):
    # Convert list of topics into a single string per user
    text = " ".join(topics)
    vectorizer = TfidfVectorizer()
    vec = vectorizer.fit_transform([text])
    return vec.toarray()[0]

# Both users listened to the same transcript (topic_extraction_demo.ipynb)
keybert_topics = ['starships', 'terraforming', 'scales', 'plan', 'synchronization']
user_1_topic_vec = vectorize_topics(keybert_topics)
user_2_topic_vec = vectorize_topics(keybert_topics)
print("user_1_topic_vec", user_1_topic_vec)
print("user_2_topic_vec", user_2_topic_vec)


user_1_topic_vec [0.447 0.447 0.447 0.447 0.447]
user_2_topic_vec [0.447 0.447 0.447 0.447 0.447]
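The value 0.447 in the output above can be traced by hand. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings (smoothed IDF and L2 normalization) and recomputes the vector in plain Python for a corpus of one document; it is an illustration of the arithmetic, not a replacement for the vectorizer.

```python
import math

topics = ['starships', 'terraforming', 'scales', 'plan', 'synchronization']
n_docs = 1  # vectorize_topics fits on a single joined string

# Term frequency: every keyword appears exactly once in the document
tf = {term: topics.count(term) for term in topics}

# Smoothed IDF, ln((1 + n) / (1 + df)) + 1, with df = 1 for every term
idf = {term: math.log((1 + n_docs) / (1 + 1)) + 1 for term in topics}

# Raw TF-IDF weights in vocabulary (alphabetical) order
raw = [tf[term] * idf[term] for term in sorted(tf)]

# L2 normalization: five equal weights become 1 / sqrt(5) each
norm = math.sqrt(sum(v * v for v in raw))
vec = [v / norm for v in raw]

print([round(v, 3) for v in vec])
```

With a single document every IDF equals 1, so the result depends only on how many distinct keywords there are. This is why both users end up with the same topic vector.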


**Observation**
- With only one transcript, the topic vector alone is not differentiating users
- The psychometric weight is the key lever to demonstrate meaningful compatibility scores

**Solution**
- Adjust w_topic and w_psych in the next step to control how much psychometrics vs. topics contribute to the compatibility score

    - w_psych >> w_topic → compatibility mostly reflects personality similarity
    - w_topic >> w_psych → topics dominate (less meaningful here, since topics are identical)
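The effect of the two weights can be sketched with a small sweep. The example below is a plain-Python illustration that hard-codes this notebook's two users and the identical five-keyword topic vector; the helpers `cosine` and `weighted_score` are hypothetical names introduced here, not functions from the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Identical topic vector for both users (five keywords, L2-normalized)
topic_vec = [1 / math.sqrt(5)] * 5
user_1_psych = [0.8, 0.4, 0.7, 0.2, 0.9]
user_2_psych = [0.3, 0.9, 0.1, 0.6, 0.4]

def weighted_score(w_topic, w_psych):
    """Scale each block by its weight, concatenate, then compare."""
    u = [w_topic * t for t in topic_vec] + [w_psych * p for p in user_1_psych]
    v = [w_topic * t for t in topic_vec] + [w_psych * p for p in user_2_psych]
    return cosine(u, v)

# Keep w_psych fixed at 1.0 and vary the topic weight
sweep = {w: weighted_score(w, 1.0) for w in (0.0, 0.5, 1.0, 2.0)}
for w_topic, score in sweep.items():
    print(f"w_topic={w_topic:.1f}, w_psych=1.0 -> score={score:.3f}")
```

For these two users the score rises as w_topic grows, because the topic block is identical on both sides. With w_topic = 0 the score reflects personality alone.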

#### Step3: Combine with Psychometrics

As discussed above, adjust the weights of topics and psychometrics (personality)

I define the weights to be:
- topic weight : 0.5, since users may share similar interests if they listen to the same audio (e.g. people with the same interests are more likely to listen to the same podcast)
- psychometrics weight : 1.0, since personality is a more dominant factor in deciding whether people are able to get along

In [71]:
def combine_vectors(topic_vec, psych_vec, topic_weight=1.0, psych_weight=1.0):
    topic_vec = np.array(topic_vec)
    psych_vec = np.array(psych_vec)
    
    # Scale vectors if needed
    combined = np.concatenate([topic_weight * topic_vec, psych_weight * psych_vec])
    return combined

w_topic = 0.5  
w_psych = 1.0  

user_1_combined = combine_vectors(user_1_topic_vec, user_1["psychometrics"], w_topic, w_psych)
user_2_combined = combine_vectors(user_2_topic_vec, user_2["psychometrics"], w_topic, w_psych)
print("user_1_combined", user_1_combined)
print("user_2_combined", user_2_combined)

user_1_combined [0.224 0.224 0.224 0.224 0.224 0.8   0.4   0.7   0.2   0.9  ]
user_2_combined [0.224 0.224 0.224 0.224 0.224 0.3   0.9   0.1   0.6   0.4  ]


#### Step4: Compute Compatibility with Cosine Similarity

Define 7 different classes (lower bounds are inclusive, matching the code below)
- Perfect Match: >= 0.9
- Strong Match: >= 0.8
- Good Match: >= 0.7
- Decent Match: >= 0.6
- Weak Match: >= 0.4
- Poor Match: >= 0.2
- Minimal Match: else

In [72]:
def interpret_score(score):
    if score >= 0.9:
        return "Exceptionally compatible - Perfect match"
    elif score >= 0.8:
        return "Highly compatible - Strong match"
    elif score >= 0.7:
        return "Very compatible - Good match"
    elif score >= 0.6:
        return "Moderately compatible - Decent match"
    elif score >= 0.4:
        return "Somewhat compatible - Weak match"
    elif score >= 0.2:
        return "Low compatibility - Poor match"
    else:
        return "Very low compatibility - Minimal match"
    

def compute_compatibility(vec1, vec2):
    score = cosine_similarity([vec1], [vec2])[0][0]
    interpretation = interpret_score(score)
    return score, interpretation

score, interpretation = compute_compatibility(user_1_combined, user_2_combined)


print(f"Compatibility Score: {score:.3f}")
print(f"Interpretation: {interpretation}")

Compatibility Score: 0.699
Interpretation: Moderately compatible - Decent match
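The score above can be reproduced by hand from the definition of cosine similarity. The worked numbers assume the combined vectors printed in Step 3, where each topic entry is $0.5/\sqrt{5} \approx 0.224$; the symbols $u$ and $v$ are shorthand introduced here for `user_1_combined` and `user_2_combined`.

$$
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
= \frac{0.25 + 1.15}{\sqrt{0.25 + 2.14}\,\sqrt{0.25 + 1.43}}
= \frac{1.40}{\sqrt{2.39}\,\sqrt{1.68}} \approx 0.699
$$

Here 0.25 is the contribution of the five identical topic entries ($5 \times 0.224^2$), 1.15 is the dot product of the two psychometric vectors, and 2.14 and 1.43 are their squared norms.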


**Observation**
- Since both users share the same transcript topics, only a smaller weight (0.5) is applied to them. The differentiating factor comes primarily from their psychometric vectors, so a higher weight (1.0) is used to distinguish them
- From the raw data, the difference on each trait ranges from 0.4 to 0.6, which is roughly low to moderate
- Because both users listened to the same audio, they have similar interests; the identical topic vectors therefore raise the final compatibility score once the topic weight is applied


### Step5: Experiment: Normalization & Resampling

#### Normalization
- Consistent data range : [0, 1] range using **min-max** scaling
- Handles edge cases (all same values, zero variance)
- Returns neutral values (0.5) for constant data
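Written out, the min-max scaling described above is the following, with $x_i$ a single trait value and $x'_i$ its normalized value (symbols introduced here for illustration). As a worked case, user_1's first trait of 0.8, with a minimum of 0.2 and a maximum of 0.9, maps to about 0.857.

$$
x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}, \qquad
\frac{0.8 - 0.2}{0.9 - 0.2} = \frac{0.6}{0.7} \approx 0.857
$$

When $\max(x) = \min(x)$ the denominator is zero, which is why constant data is mapped to the neutral value 0.5 instead.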

#### Resampling
- Flexible vector lengths : Handles different psychometric vector lengths
- Uses linear interpolation for smooth resampling
- Handles edge cases (empty data, single values)
- Returns neutral values (0.5) for empty data
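The resampling rules listed above can be sketched without NumPy. The function below, `linear_resample`, is a hypothetical plain-Python stand-in written for illustration: it places the original values at evenly spaced positions and reads off the new values by linear interpolation, with the same fallbacks for empty and single-value input.

```python
def linear_resample(values, target_length):
    """Resample a list to target_length using linear interpolation."""
    n = len(values)
    if n == 0:
        # No data: fall back to neutral values
        return [0.5] * target_length
    if n == 1 or target_length == 1:
        # Nothing to interpolate between: repeat the first value
        return [values[0]] * target_length

    resampled = []
    for i in range(target_length):
        # Fractional index of the i-th output point in the input list
        pos = i * (n - 1) / (target_length - 1)
        lo = min(int(pos), n - 2)  # left neighbour, kept in range
        frac = pos - lo            # distance from the left neighbour
        resampled.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return resampled

normalized_user_1 = [0.857, 0.286, 0.714, 0.0, 1.0]
print([round(x, 3) for x in linear_resample(normalized_user_1, 3)])
print([round(x, 3) for x in linear_resample(normalized_user_1, 7)])
```

Resampling from 5 to 3 points keeps the first, middle and last values, while resampling to 7 points inserts blended values between neighbouring traits.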

In [65]:
def normalize_psychometrics(psychometric_data: list[float]) -> np.ndarray:
    """
    Normalize psychometric data to [0, 1] range
    Handles cases where data might be outside expected range
    """
    # Force a float array so integer input is not truncated by full_like
    psych_array = np.asarray(psychometric_data, dtype=float)
    
    # Empty input: nothing to normalize (resampling fills in neutral values later)
    if psych_array.size == 0:
        return psych_array
    
    # Min-max normalization to [0, 1]
    min_val = np.min(psych_array)
    max_val = np.max(psych_array)
    
    # Constant data (zero variance): return neutral values, avoid division by zero
    if max_val == min_val:
        return np.full_like(psych_array, 0.5)
    
    normalized = (psych_array - min_val) / (max_val - min_val)
    return np.clip(normalized, 0.0, 1.0)

In [66]:
# Test the normalization function
print("=== Testing Normalization ===")

# Test with original psychometric data
user1_psych_raw = user_1["psychometrics"]
user2_psych_raw = user_2["psychometrics"]

print(f"Original user_1 psychometrics: {user1_psych_raw}")
print(f"Original user_2 psychometrics: {user2_psych_raw}")

# Normalize the data
user1_psych_normalized = normalize_psychometrics(user1_psych_raw)
user2_psych_normalized = normalize_psychometrics(user2_psych_raw)

print(f"Normalized user_1: {user1_psych_normalized}")
print(f"Normalized user_2: {user2_psych_normalized}")

# Test edge cases
print("\n=== Testing Edge Cases ===")
# Test with data outside [0,1] range
out_of_range = [2.5, -0.3, 1.8, 0.1, 3.0]
normalized_out = normalize_psychometrics(out_of_range)
print(f"Out of range {out_of_range}")
print(f"normalized: {normalized_out}")

=== Testing Normalization ===
Original user_1 psychometrics: [0.8, 0.4, 0.7, 0.2, 0.9]
Original user_2 psychometrics: [0.3, 0.9, 0.1, 0.6, 0.4]
Normalized user_1: [0.857 0.286 0.714 0.    1.   ]
Normalized user_2: [0.25  1.    0.    0.625 0.375]

=== Testing Edge Cases ===
Out of range [2.5, -0.3, 1.8, 0.1, 3.0]
normalized: [0.848 0.    0.636 0.121 1.   ]


In [67]:
def resample_psychometrics(psychometric_data: list[float], target_length: int = 5) -> np.ndarray:
    """
    Resample psychometric data to target length using interpolation
    Handles cases where psychometric vectors have different lengths
    """
    psych_array = np.array(psychometric_data)
    current_length = len(psych_array)
    
    if current_length == target_length:
        return psych_array
    
    if current_length == 0:
        # Return neutral values if no data
        return np.full(target_length, 0.5)
    
    if current_length == 1:
        # Replicate single value
        return np.full(target_length, psych_array[0])
    
    # Use linear interpolation to resample
    x_old = np.linspace(0, 1, current_length)
    x_new = np.linspace(0, 1, target_length)
    
    resampled = np.interp(x_new, x_old, psych_array)
    return resampled

In [69]:
# Test resampling with different target lengths
target_lengths = [3, 5, 7, 10]

for target_len in target_lengths:
    user1_resampled = resample_psychometrics(user1_psych_normalized, target_len)
    user2_resampled = resample_psychometrics(user2_psych_normalized, target_len)
    
    print(f"\nTarget length {target_len}:")
    print(f"  user_1 resampled: {user1_resampled}")
    print(f"  user_2 resampled: {user2_resampled}")

# Test edge cases
print("\n=== Testing Edge Cases ===")

# Test with all same values
same_values = [0.5, 0.5, 0.5, 0.5, 0.5]
normalized_same = normalize_psychometrics(same_values)
print(f"All same values {same_values}")
print(f"normalized: {normalized_same}")

# Test with empty data
empty_data = []
resampled_empty = resample_psychometrics(empty_data, 5)
print(f"Empty data resampled to length 5: {resampled_empty}")

# Test with single value
single_value = [0.8]
resampled_single = resample_psychometrics(single_value, 5)
print(f"Single value {single_value} resampled to length 5: {resampled_single}")


Target length 3:
  user_1 resampled: [0.857 0.714 1.   ]
  user_2 resampled: [0.25  0.    0.375]

Target length 5:
  user_1 resampled: [0.857 0.286 0.714 0.    1.   ]
  user_2 resampled: [0.25  1.    0.    0.625 0.375]

Target length 7:
  user_1 resampled: [0.857 0.476 0.429 0.714 0.238 0.333 1.   ]
  user_2 resampled: [0.25  0.75  0.667 0.    0.417 0.542 0.375]

Target length 10:
  user_1 resampled: [0.857 0.603 0.349 0.429 0.619 0.556 0.238 0.111 0.556 1.   ]
  user_2 resampled: [0.25  0.583 0.917 0.667 0.222 0.139 0.417 0.597 0.486 0.375]

=== Testing Edge Cases ===
All same values [0.5, 0.5, 0.5, 0.5, 0.5]
normalized: [0.5 0.5 0.5 0.5 0.5]
Empty data resampled to length 5: [0.5 0.5 0.5 0.5 0.5]
Single value [0.8] resampled to length 5: [0.8 0.8 0.8 0.8 0.8]


### Matching with Normalization and Resampling

In [82]:
def test_matching_with_processing(user1_data, user2_data, topics, w_topic=0.5, w_psych=1.0, target_length=5):
    # Process psychometric data
    user1_psych_processed = resample_psychometrics(normalize_psychometrics(user1_data), target_length)
    user2_psych_processed = resample_psychometrics(normalize_psychometrics(user2_data), target_length)
    
    print(f"Processed user_1 psychometrics: {user1_psych_processed}")
    print(f"Processed user_2 psychometrics: {user2_psych_processed}")
    
    # Vectorize topics
    user_1_topic_vec = vectorize_topics(topics)
    user_2_topic_vec = vectorize_topics(topics)
    
    # Combine vectors with processed psychometric data
    user_1_combined_processed = combine_vectors(user_1_topic_vec, user1_psych_processed, w_topic, w_psych)
    user_2_combined_processed = combine_vectors(user_2_topic_vec, user2_psych_processed, w_topic, w_psych)
    
    print(f"\nCombined vectors (with processed psychometrics):")
    print(f"user_1_combined: {user_1_combined_processed}")
    print(f"user_2_combined: {user_2_combined_processed}")
    
    # Compute compatibility with processed data
    score_processed, interpretation_processed = compute_compatibility(user_1_combined_processed, user_2_combined_processed)
    
    # Compare with original (unprocessed) results
    user_1_combined_original = combine_vectors(user_1_topic_vec, user1_data, w_topic, w_psych)
    user_2_combined_original = combine_vectors(user_2_topic_vec, user2_data, w_topic, w_psych)
    score_original, interpretation_original = compute_compatibility(user_1_combined_original, user_2_combined_original)
    
    print(f"\n=== RESULTS ===")
    print(f"Original Score: {score_original:.3f} - {interpretation_original}")
    print(f"Processed Score: {score_processed:.3f} - {interpretation_processed}")
    print(f"Difference: {abs(score_processed - score_original):.3f}")
    
    return {
        'original_score': score_original,
        'processed_score': score_processed,
        'difference': abs(score_processed - score_original),
        'processed_psychometrics': {
            'user1': user1_psych_processed,
            'user2': user2_psych_processed
        }
    }

# Test with our own example data
keybert_topics = ['starships', 'terraforming', 'scales', 'plan', 'synchronization']
results = test_matching_with_processing(
    user1_data=user_1["psychometrics"],
    user2_data=user_2["psychometrics"], 
    topics=keybert_topics,
    w_topic=0.5,
    w_psych=1.0,
    target_length=5
)

Processed user_1 psychometrics: [0.857 0.286 0.714 0.    1.   ]
Processed user_2 psychometrics: [0.25  1.    0.    0.625 0.375]

Combined vectors (with processed psychometrics):
user_1_combined: [0.224 0.224 0.224 0.224 0.224 0.857 0.286 0.714 0.    1.   ]
user_2_combined: [0.224 0.224 0.224 0.224 0.224 0.25  1.    0.    0.625 0.375]

=== RESULTS ===
Original Score: 0.699 - Moderately compatible - Decent match
Processed Score: 0.516 - Somewhat compatible - Weak match
Difference: 0.183


**Observation**

The score with normalization and resampling (0.516) is lower than the original (0.699).

Reason
1. Data Quality Improvement
    - Original: Inconsistent scales and ranges between users
    - Processed: Normalized to [0,1] range ensures fair comparison
    - Result: Eliminates scale bias that inflated the original score

2. Reveals True Personality Differences
    - Original: [0.8, 0.4, 0.7, 0.2, 0.9] vs [0.3, 0.9, 0.1, 0.6, 0.4]
    - Processed: [0.857, 0.286, 0.714, 0.000, 1.000] vs [0.250, 1.000, 0.000, 0.625, 0.375]
    - Result: Shows more extreme differences, revealing true incompatibility
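To see where the drop comes from, the psychometric blocks can be compared on their own, without the topic block. The sketch below is plain Python with hypothetical helpers `cosine` and `min_max`; it assumes, as in the notebook, that min-max scaling is applied to each user's vector separately.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def min_max(values):
    """Per-vector min-max scaling to [0, 1]; assumes a non-empty list."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)  # constant data: neutral values
    return [(x - lo) / (hi - lo) for x in values]

raw_1 = [0.8, 0.4, 0.7, 0.2, 0.9]
raw_2 = [0.3, 0.9, 0.1, 0.6, 0.4]

# Psychometrics-only similarity before and after scaling
psych_only_raw = cosine(raw_1, raw_2)
psych_only_scaled = cosine(min_max(raw_1), min_max(raw_2))

print(f"psychometrics only, raw:    {psych_only_raw:.3f}")
print(f"psychometrics only, scaled: {psych_only_scaled:.3f}")
```

Each user's lowest trait is mapped to exactly 0 by the scaling, so it no longer contributes to the dot product. This is one reason the scaled vectors look less alike than the raw ones.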


**Conclusion**: The processed score of 0.516 ("Somewhat compatible - Weak match") is more accurate and trustworthy than the original 0.699 ("Moderately compatible - Decent match"). It accounts for data quality issues, reveals the true personality compatibility, and is consistent with the per-trait differences of 0.4 to 0.6 in the original data.