# LinkedIn Compatibility Analysis

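This notebook predicts the directed network value (`network_value_a_to_b`) between pairs of LinkedIn-style profiles. Each profile's text fields are embedded with a Sentence Transformer, the two embeddings are concatenated, and an MLP regressor is trained on the result.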

## Setup & Imports

In [27]:
import kagglehub
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sentence_transformers import SentenceTransformer
from sklearn.neural_network import MLPRegressor
import time

print("✅ All imports loaded successfully!")

✅ All imports loaded successfully!


## Load Dataset

In [28]:
# Download latest version
path = kagglehub.dataset_download("likithagedipudi/linkedin-compatibility-dataset-50k-profiles")
print("Path to dataset files:", path)

Path to dataset files: /Users/mali8/.cache/kagglehub/datasets/likithagedipudi/linkedin-compatibility-dataset-50k-profiles/versions/1


In [29]:
# Load Pairs
csv_file_path_compatibility = os.path.join(path, "compatibility_pairs.csv")
pairs_df = pd.read_csv(csv_file_path_compatibility)

# Load Profiles
csv_file_path_profiles = os.path.join(path, "profiles.csv")
profiles_df = pd.read_csv(csv_file_path_profiles)

print(f"Pairs Loaded: {pairs_df.shape}")
print(f"Profiles Loaded: {profiles_df.shape}")

Pairs Loaded: (4999890, 14)
Profiles Loaded: (50000, 20)
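
Before joining the two tables, it may be worth confirming that every profile ID referenced in the pairs file actually exists in the profiles file. The sketch below uses two tiny hypothetical frames (`toy_pairs`, `toy_profiles`) in place of `pairs_df` and `profiles_df`; the helper name `missing_profile_ids` is an assumption and not part of the notebook.

```python
import pandas as pd

def missing_profile_ids(pairs, profiles):
    """Return the set of profile IDs used in `pairs` that are absent from `profiles`."""
    known = set(profiles["profile_id"])
    referenced = set(pairs["profile_a_id"]) | set(pairs["profile_b_id"])
    return referenced - known

# Toy stand-ins for profiles_df and pairs_df (hypothetical data).
toy_profiles = pd.DataFrame({"profile_id": ["p1", "p2", "p3"]})
toy_pairs = pd.DataFrame({
    "profile_a_id": ["p1", "p1", "p2"],
    "profile_b_id": ["p2", "p3", "p9"],  # "p9" has no profile row
})

missing = missing_profile_ids(toy_pairs, toy_profiles)
print(f"Unmatched profile IDs: {sorted(missing)}")
```

An empty result on the real data would mean the left joins later in the notebook cannot introduce all-NaN profile columns.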


In [30]:
profiles_df.head()

Unnamed: 0,profile_id,name,email,location,headline,about,current_role,current_company,industry,years_experience,seniority_level,skills,experience,education,connections,goals,needs,can_offer,remote_preference,source
0,ab04b973af478550ddf247879393df42,Daniel Doyle,garzaanthonyexample.org,"East William, AK",Analyst Product Building impactful solutions,Experienced professional focused on driving gr...,Assistant,Microsoft,Healthcare,2,entry,"['Prototyping', 'Go', 'C', 'C', 'NLP']","[{'title': 'Assistant', 'company': 'Google', '...","[{'school': 'Penn', 'degree': 'MS', 'field': '...",106,"['Get promoted', 'Build network']","['funding', 'mentorship', 'business advice']","['partnership opportunities', 'investment', 'c...",remote,synthetic
1,b620e3fa2ec361b1d728115eeabb71af,Jennifer Cole,lisa02example.net,"Petersonberg, IL",Senior Engineer Design Building impactful so...,Passionate about building innovative solutions...,Lead Data Scientist,Stripe,Consulting,9,senior,"['Business Development', 'SQL', 'Ruby', 'Marke...","[{'title': 'Senior Engineer', 'company': 'Rapi...","[{'school': 'Princeton', 'degree': 'MBA', 'fie...",2372,"['Strategic role', 'Scale impact']","['job opportunities', 'mentorship', 'clients']","['industry connections', 'consulting', 'produc...",hybrid,synthetic
2,cfeeb31581a0b3e0515c01691b9dc2b5,Brent Abbott,lindsay78example.org,"Millerport, MP",Software Engineer Product Building impactful...,Passionate about building innovative solutions...,Data Scientist,NextGen,Telecommunications,5,mid,"['NLP', 'Business Development', 'Sales', 'Big ...","[{'title': 'Consultant', 'company': 'ScaleUp',...","[{'school': 'CMU', 'degree': 'MS', 'field': 'D...",874,"['Specialize', 'Mentor others']","['job opportunities', 'business advice', 'care...","['partnership opportunities', 'product feedbac...",onsite,synthetic
3,5d54826665a5898662661a96719cc4a7,Corey Jones,kendragallowayexample.org,"South Joshuastad, GA",Chief Data Officer Tech Building impactful s...,Passionate about building innovative solutions...,COO,NextGen,Healthcare,17,executive,"['Sketch', 'Machine Learning', 'AWS', 'TensorF...","[{'title': 'VP Engineering', 'company': 'Tesla...","[{'school': 'MIT', 'degree': 'BS', 'field': 'E...",3259,"['Build company', 'Advisory roles']","['clients', 'hiring', 'career guidance']","['consulting', 'career advice', 'hiring referr...",onsite,synthetic
4,6ad3c64c6cb4bac60b692f3d5bab271d,Timothy Wong,amandasanchezexample.com,"Nelsonside, IN",Consultant Design Building impactful solutions,Strategic thinker with expertise in scaling or...,Product Manager,NextGen,Healthcare,4,mid,"['Data Science', 'DevOps', 'Flask', 'Analytics...","[{'title': 'Software Engineer', 'company': 'Ai...","[{'school': 'Berkeley', 'degree': 'MS', 'field...",372,"['Specialize', 'Lead projects']","['clients', 'hiring', 'funding']","['partnership opportunities', 'career advice',...",remote,synthetic


In [31]:
pairs_df.head()

Unnamed: 0,skill_match_score,skill_complementarity_score,network_value_a_to_b,network_value_b_to_a,career_alignment_score,experience_gap,industry_match,geographic_score,seniority_match,compatibility_score,mutual_benefit_explanation,pair_id,profile_a_id,profile_b_id
0,0.0,0.0,5.3,14.55,80.0,0,0.0,60.0,85.0,24.98,Peer-level relationship - can learn together,742f902f23b9d1be5fa0ba0816e3490b,ab04b973af478550ddf247879393df42,fdf3243d1ad97255e0ce313aebc0be79
1,5.555556,0.0,5.3,42.8,80.0,2,0.0,60.0,100.0,30.33,Peer-level relationship - can learn together,fc6c6a8029bc4e9beb5dd8186147f042,ab04b973af478550ddf247879393df42,371bc2adbdc4ca8f0dd16d373f85f2ae
2,5.555556,0.0,5.3,38.8,80.0,2,0.0,60.0,100.0,29.73,Peer-level relationship - can learn together,6cb0929495e49ddd5dff151ead7b3f5e,ab04b973af478550ddf247879393df42,d18e7cd91fc4e621fd6879ca5ef6e1b2
3,8.0,0.0,5.3,80.0,40.0,27,0.0,60.0,50.0,28.39,Valuable network connections in same industry,7b191de3be4c52fd0874f31936d51819,ab04b973af478550ddf247879393df42,6615dabdd3b5b8f9627ca933dc3d9ae3
4,7.142857,0.0,5.3,28.6,90.0,5,0.0,60.0,100.0,30.51,Ideal mentorship gap (5 years experience diffe...,7b568a4ee7254a9b388e6430ac252568,ab04b973af478550ddf247879393df42,fc73d0e790dea954d20db176f638ab86


## Data Preprocessing

In [33]:
# Remove unnecessary columns from profiles_df and pairs_df
profiles_df = profiles_df.drop(['name','email','location'], axis=1, errors='ignore')
pairs_df = pairs_df.drop(['industry_match','compatibility_score','mutual_benefit_explanation',
                          'geographic_score','career_alignment_score',
                          'skill_complementarity_score', 'skill_match_score'],
                          axis=1, errors='ignore')

# Shuffle and keep only 5000 rows to reduce dataset size
pairs_df = pairs_df.sample(frac=1, random_state=93).reset_index(drop=True).head(5000)

print(f"✅ Cleaned data shape: {pairs_df.shape}")
print(f"📊 Columns retained in pairs: {list(pairs_df.columns)}")

✅ Cleaned data shape: (5000, 7)
📊 Columns retained in pairs: ['network_value_a_to_b', 'network_value_b_to_a', 'experience_gap', 'seniority_match', 'pair_id', 'profile_a_id', 'profile_b_id']


In [34]:
# Merge Profile A
data = pairs_df.merge(profiles_df, left_on='profile_a_id', right_on='profile_id', how='left')
# Rename columns for A
cols_to_rename = {col: f"{col}_a" for col in profiles_df.columns if col != 'profile_id'}
data = data.rename(columns=cols_to_rename)

# Merge Profile B
data = data.merge(profiles_df, left_on='profile_b_id', right_on='profile_id', how='left')
# Rename columns for B
cols_to_rename_b = {col: f"{col}_b" for col in profiles_df.columns if col != 'profile_id'}
data = data.rename(columns=cols_to_rename_b)

# Cleanup IDs
data = data.drop(['profile_id_x', 'profile_id_y'], axis=1, errors='ignore')

# Ensure target is numeric
data['network_value_a_to_b'] = pd.to_numeric(data['network_value_a_to_b'], errors='coerce').fillna(0)

print(f"✅ Merged dataset shape: {data.shape}")
print(f"📊 Target variable stats:")
print(f"   Mean: {data['network_value_a_to_b'].mean():.2f}")
print(f"   Std: {data['network_value_a_to_b'].std():.2f}")
print(f"   Range: [{data['network_value_a_to_b'].min():.2f}, {data['network_value_a_to_b'].max():.2f}]")

✅ Merged dataset shape: (5000, 39)
📊 Target variable stats:
   Mean: 51.37
   Std: 29.13
   Range: [2.50, 100.00]
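
The merge-then-rename pattern above works because the A-side columns are renamed before the B-side merge. A more compact variation, sketched here on hypothetical toy frames, suffixes the profile columns first with `DataFrame.add_suffix` and asks `merge` to verify that profile IDs are unique via `validate="many_to_one"`. The function name `attach_profiles` and the toy data are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for profiles_df and pairs_df (hypothetical data).
toy_profiles = pd.DataFrame({
    "profile_id": ["p1", "p2", "p3"],
    "headline": ["Engineer", "Designer", "Analyst"],
    "industry": ["Tech", "Media", "Finance"],
})
toy_pairs = pd.DataFrame({
    "pair_id": ["x", "y"],
    "profile_a_id": ["p1", "p2"],
    "profile_b_id": ["p2", "p3"],
    "network_value_a_to_b": [5.3, 42.8],
})

def attach_profiles(pairs, profiles):
    """Join profile columns onto each pair twice, suffixed _a and _b."""
    merged = pairs
    for side in ("a", "b"):
        # profile_id -> profile_id_a, headline -> headline_a, ...
        suffixed = profiles.add_suffix(f"_{side}")
        merged = merged.merge(
            suffixed,
            left_on=f"profile_{side}_id",
            right_on=f"profile_id_{side}",
            how="left",
            validate="many_to_one",  # raises if a profile ID is duplicated
        ).drop(columns=f"profile_id_{side}")
    return merged

toy_data = attach_profiles(toy_pairs, toy_profiles)
print(toy_data.columns.tolist())
```

Suffixing up front avoids relying on pandas' automatic `_x`/`_y` suffixes and keeps the redundant ID column easy to drop.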


## Feature Engineering

In [35]:
# Combine all useful text fields into one rich string per user
def create_profile_text(row, suffix):
    # e.g., suffix='_a' -> grabs headline_a, current_role_a...
    return (
        str(row[f'headline{suffix}']) + " | " +
        str(row[f'current_role{suffix}']) + " | " +
        str(row[f'current_company{suffix}']) + " | " +
        str(row[f'about{suffix}']) + " | " +
        str(row[f'skills{suffix}']) + " | " +
        str(row[f'experience{suffix}']) + " | " +
        str(row[f'seniority_level{suffix}']) + " | " +
        str(row[f'industry{suffix}']) + " | " +
        str(row[f'needs{suffix}']) + " | " +
        str(row[f'can_offer{suffix}'])
    )

print("📝 Constructing text profiles...")
start_time = time.time()

data['full_text_a'] = data.apply(lambda row: create_profile_text(row, '_a'), axis=1)
data['full_text_b'] = data.apply(lambda row: create_profile_text(row, '_b'), axis=1)

print(f"✅ Created {len(data)} profile pairs in {time.time() - start_time:.2f}s")
print(f"📊 Average text length: {data['full_text_a'].str.len().mean():.0f} characters")

📝 Constructing text profiles...
✅ Created 5000 profile pairs in 0.14s
📊 Average text length: 817 characters
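
Columns such as `skills`, `needs`, and `can_offer` appear in the CSV as stringified Python lists (for example `"['Prototyping', 'Go', 'C', 'C', 'NLP']"`), so `str(...)` passes the brackets, quotes, and duplicates straight into the text that gets embedded. One possible cleanup, sketched below with a hypothetical helper `flatten_list_field`, parses the value with `ast.literal_eval` and joins the unique items. It is intended for flat lists; list-of-dict columns such as `experience` would need their own handling.

```python
import ast

def flatten_list_field(raw):
    """Turn a stringified Python list such as "['Go', 'C', 'C']" into "Go, C".

    Falls back to the raw string if the value cannot be parsed as a list.
    """
    try:
        parsed = ast.literal_eval(str(raw))
    except (ValueError, SyntaxError):
        return str(raw)
    if not isinstance(parsed, list):
        return str(raw)
    # dict.fromkeys keeps first-seen order while dropping duplicates
    unique_items = dict.fromkeys(str(item) for item in parsed)
    return ", ".join(unique_items)

print(flatten_list_field("['Prototyping', 'Go', 'C', 'C', 'NLP']"))
print(flatten_list_field("not a list"))
```

Whether this changes model quality is untested here; it mainly removes punctuation noise from the embedded text.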


In [37]:
# Load pre-trained model for semantic understanding
print("🧠 Loading Sentence Transformer model...")
model = SentenceTransformer('all-MiniLM-L6-v2')

print("⏳ Converting text to embeddings (this may take a minute)...")
start_time = time.time()

embeddings_a = model.encode(data['full_text_a'].tolist(), show_progress_bar=True)
embeddings_b = model.encode(data['full_text_b'].tolist(), show_progress_bar=True)

print(f"✅ Embeddings created in {time.time() - start_time:.2f}s")
print(f"📊 Embedding shape: Profile A = {embeddings_a.shape}, Profile B = {embeddings_b.shape}")

# Stack them side-by-side: [User A Vector, User B Vector]
X = np.hstack([embeddings_a, embeddings_b])
y = data['network_value_a_to_b'].values

print(f"📊 Final feature matrix: {X.shape} (768 = 384 + 384 dimensions)")

🧠 Loading Sentence Transformer model...
⏳ Converting text to embeddings (this may take a minute)...

Batches: 100%|██████████| 157/157 [00:11<00:00, 13.69it/s]
Batches: 100%|██████████| 157/157 [00:11<00:00, 13.81it/s]

✅ Embeddings created in 23.05s
📊 Embedding shape: Profile A = (5000, 384), Profile B = (5000, 384)
📊 Final feature matrix: (5000, 768) (768 = 384 + 384 dimensions)
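
Plain concatenation `[a, b]` leaves it to the MLP to discover any interaction between the two profiles. A common variation for pairwise models, which is not what this notebook does, also appends the element-wise absolute difference and product. The sketch below uses random arrays as stand-ins for `embeddings_a` and `embeddings_b`; the helper name `build_pair_features` is hypothetical.

```python
import numpy as np

def build_pair_features(emb_a, emb_b):
    """Concatenate [a, b, |a - b|, a * b] for each pair of embedding rows."""
    if emb_a.shape != emb_b.shape:
        raise ValueError("embedding arrays must have the same shape")
    return np.hstack([emb_a, emb_b, np.abs(emb_a - emb_b), emb_a * emb_b])

# Random stand-ins for embeddings_a / embeddings_b (hypothetical data).
rng = np.random.default_rng(0)
toy_a = rng.normal(size=(4, 384)).astype(np.float32)
toy_b = rng.normal(size=(4, 384)).astype(np.float32)

X_toy = build_pair_features(toy_a, toy_b)
print(X_toy.shape)  # (4, 1536)
```

Keeping `a` and `b` in fixed positions preserves the direction of the target (`network_value_a_to_b`), while the two added blocks are symmetric in the pair.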


## Train Model

In [38]:
print("🔄 Splitting data into train/test sets...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"📊 Train set: {X_train.shape[0]} samples")
print(f"📊 Test set: {X_test.shape[0]} samples")

# Train Neural Network
print("\n🧠 Training Neural Network (MLP)...")
start_time = time.time()

regressor = MLPRegressor(
    hidden_layer_sizes=(256, 128, 64), 
    activation='relu',
    max_iter=500, 
    random_state=42, 
    verbose=False  # Set to True to see training progress
)

regressor.fit(X_train, y_train)

print(f"✅ Model trained in {time.time() - start_time:.2f}s")

🔄 Splitting data into train/test sets...
📊 Train set: 4000 samples
📊 Test set: 1000 samples

🧠 Training Neural Network (MLP)...
✅ Model trained in 17.41s
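
`train_test_split` assigns pairs to train and test at random, so the same profile may appear on both sides of the split, which can make the test score look better than it would on unseen profiles. One way to probe this, sketched below on hypothetical toy arrays, is scikit-learn's `GroupShuffleSplit`, which keeps all rows that share a group label together. Grouping by `profile_a_id` is an assumption made for illustration; it does not stop a profile from also appearing as profile B in the other split.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-ins: 10 pairs, 3 features, grouped by a hypothetical profile_a_id.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(10, 3))
y_toy = rng.uniform(0, 100, size=10)
groups = np.array(["p1", "p1", "p1", "p2", "p2", "p3", "p3", "p4", "p5", "p5"])

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X_toy, y_toy, groups=groups))

X_train_toy, X_test_toy = X_toy[train_idx], X_toy[test_idx]
y_train_toy, y_test_toy = y_toy[train_idx], y_toy[test_idx]

# No profile_a_id should appear on both sides of the split.
shared = set(groups[train_idx]) & set(groups[test_idx])
print(f"Groups shared between train and test: {len(shared)}")
```

Note that `test_size` here is a fraction of groups, not of rows, so the resulting row counts can differ from an 80/20 split.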




## Evaluate Results

In [39]:
print("📊 Evaluating model performance...\n")

# Make predictions
preds = regressor.predict(X_test)

# Calculate metrics
mse = mean_squared_error(y_test, preds)
r2 = r2_score(y_test, preds)

# Display results
print("=" * 50)
print("🎯 MODEL PERFORMANCE")
print("=" * 50)
print(f"R² Score:            {r2:.4f}")
print(f"Mean Squared Error:  {mse:.2f}")
print(f"RMSE:                {np.sqrt(mse):.2f}")
print("=" * 50)

# Show example predictions
print("\n📋 Sample Predictions (first 5 test samples):")
print(f"{'Actual':<10} {'Predicted':<10} {'Difference':<10}")
print("-" * 35)
for i in range(min(5, len(y_test))):
    diff = abs(y_test[i] - preds[i])
    print(f"{y_test[i]:<10.2f} {preds[i]:<10.2f} {diff:<10.2f}")

# Overall accuracy insight
print(f"\n💡 The model explains {r2*100:.1f}% of the variance in network value!")

📊 Evaluating model performance...

==================================================
🎯 MODEL PERFORMANCE
==================================================
R² Score:            0.8061
Mean Squared Error:  162.64
RMSE:                12.75
==================================================

📋 Sample Predictions (first 5 test samples):
Actual     Predicted  Difference
-----------------------------------
70.00      84.87      14.87     
12.75      6.95       5.80      
6.00       14.41      8.41      
52.55      34.24      18.31     
70.00      74.83      4.83      

💡 The model explains 80.6% of the variance in network value!
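
To make the reported metrics concrete, the sketch below recomputes MSE, RMSE, and R² by hand on the five sample rows printed above, then checks that the reported test-set numbers are consistent with each other. Since R² = 1 − MSE / Var(y), the reported MSE and R² imply a test-set target standard deviation, which should land close to the 29.13 reported for the full 5,000-pair sample. The function names `mse` and `r2` are illustrative; the notebook itself uses scikit-learn's implementations.

```python
import math

def mse(actual, predicted):
    """Mean of squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r2(actual, predicted):
    """1 - SS_res / SS_tot, the definition used by sklearn's r2_score."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# The five sample rows printed above (far too few for a stable estimate).
actual = [70.00, 12.75, 6.00, 52.55, 70.00]
predicted = [84.87, 6.95, 14.41, 34.24, 74.83]
print(f"Sample MSE:  {mse(actual, predicted):.2f}")
print(f"Sample RMSE: {math.sqrt(mse(actual, predicted)):.2f}")
print(f"Sample R2:   {r2(actual, predicted):.4f}")

# Consistency check on the reported test-set numbers.
reported_mse, reported_r2 = 162.64, 0.8061
implied_target_std = math.sqrt(reported_mse / (1 - reported_r2))
print(f"Implied test-set target std: {implied_target_std:.2f}")
```

An RMSE of 12.75 against a target spread of roughly 29 is another way of reading the 80.6% figure: the typical error is a bit under half of what a constant mean prediction would give.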



## Summary

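✅ Built an ML model that predicts `network_value_a_to_b` for pairs of LinkedIn-style profiles  
✅ Used Sentence Transformers (`all-MiniLM-L6-v2`) to embed each profile's combined text  
✅ Trained an MLP regressor on a 5,000-pair sample (4,000 train, 1,000 test), reaching R² = 0.8061 and RMSE = 12.75 on the test set  

**Next Steps:** Fine-tune the model, deploy an API, and validate with real data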