<a href="https://colab.research.google.com/github/ayush-030/Mentora-Matching-Optimization/blob/main/notebook/Mentora_Matching_Algorithm_Optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mentora – Matching Algorithm Optimization

## Problem Statement
The original Mentora matching algorithm is a rule-based, keyword overlap system
used to match students with faculty, industry mentors, and projects based on skills
and interests.

While fast and explainable, this baseline approach fails to capture semantic similarity
(e.g., "ML" vs "Machine Learning"), leading to poor match quality in realistic scenarios.

This notebook:
1. Computes baseline matching performance using the existing algorithm
2. Introduces an optimized semantic matching approach using sentence embeddings from a pretrained transformer model
3. Compares both approaches on dummy data
4. Quantifies the performance lift achieved


# Existing Baseline Version (Code)

In [None]:
# Core imports
import numpy as np
import pandas as pd

In [None]:
# Dummy Student Profiles

students = [
    {
        "id": "S1",
        "skills": ["Python", "ML", "React"],
        "interests": ["AI", "Healthcare"]
    },
    {
        "id": "S2",
        "skills": ["Java", "Spring", "SQL"],
        "interests": ["Backend Systems", "Databases"]
    },
    {
        "id": "S3",
        "skills": ["C++", "Data Structures", "Algorithms"],
        "interests": ["Competitive Programming", "Optimization"]
    }
]


In [None]:
# Dummy Faculty Profiles

faculty = [
    {
        "id": "F1",
        "expertise": ["Machine Learning", "Deep Learning", "Python"],
        "research_areas": ["Artificial Intelligence", "Medical AI"]
    },
    {
        "id": "F2",
        "expertise": ["Java", "Distributed Systems"],
        "research_areas": ["Backend Architecture", "Scalable Databases"]
    },
    {
        "id": "F3",
        "expertise": ["Algorithms", "C++"],
        "research_areas": ["Competitive Programming", "Graph Theory"]
    }
]

In [None]:
students, faculty

In [None]:
# Existing Logic (Baseline Matching Function)
def calc_match_percent(set1, set2):
    """Percentage of exact (case-insensitive) overlap, relative to the larger set."""
    # Normalize: lowercase, trim whitespace, drop empty entries
    set1 = set(s.lower().strip() for s in (set1 or []) if s)
    set2 = set(s.lower().strip() for s in (set2 or []) if s)
    if not set1 or not set2:
        return 0
    overlap = set1 & set2
    if not overlap:
        return 0
    score = round((len(overlap) / max(len(set1), len(set2))) * 100)
    return score


def student_to_faculty_baseline(student_profile, faculty_profile):
    skills_score = calc_match_percent(
        student_profile.get("skills", []),
        faculty_profile.get("expertise", [])
    )
    interests_score = calc_match_percent(
        student_profile.get("interests", []),
        faculty_profile.get("research_areas", [])
    )

    return round((skills_score + interests_score) / 2)

In [None]:
# Baseline Function results
baseline_results = []

for s in students:
    for f in faculty:
        score = student_to_faculty_baseline(s, f)
        baseline_results.append({
            "student_id": s["id"],
            "faculty_id": f["id"],
            "baseline_score": score
        })

baseline_df = pd.DataFrame(baseline_results)
baseline_df

## Baseline Matching Results – Analysis

The baseline matching algorithm relies on exact keyword overlap between skills
and interests.

### Observations
- Strong semantic matches are not captured, for example:
  - "ML" vs "Machine Learning"
  - "AI" vs "Artificial Intelligence"
- This leads to low scores even for matches that are obvious to a human reader.
- Pairs that share exact terms (e.g., "C++" and "Algorithms" for S3 and F3) score comparatively well.
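
These observations can be checked by hand. The sketch below reproduces the notebook's `calc_match_percent` in compact form (same formula: shared terms divided by the size of the larger set) and applies it to the S1/F1 and S3/F3 profiles defined earlier. The variable names are chosen here for illustration only.

```python
def calc_match_percent(items_a, items_b):
    """Baseline score: shared terms / size of the larger set, as a percentage."""
    set_a = {s.lower().strip() for s in (items_a or []) if s}
    set_b = {s.lower().strip() for s in (items_b or []) if s}
    if not set_a or not set_b:
        return 0
    return round(len(set_a & set_b) / max(len(set_a), len(set_b)) * 100)


# S1 vs F1: only "Python" matches exactly; "ML" and "AI" are missed.
s1_skills = calc_match_percent(
    ["Python", "ML", "React"],
    ["Machine Learning", "Deep Learning", "Python"],
)  # 1 shared term out of 3
s1_interests = calc_match_percent(
    ["AI", "Healthcare"],
    ["Artificial Intelligence", "Medical AI"],
)  # no shared terms
s1_f1 = round((s1_skills + s1_interests) / 2)

# S3 vs F3: "C++", "Algorithms" and "Competitive Programming" match exactly.
s3_skills = calc_match_percent(
    ["C++", "Data Structures", "Algorithms"],
    ["Algorithms", "C++"],
)  # 2 shared terms out of 3
s3_interests = calc_match_percent(
    ["Competitive Programming", "Optimization"],
    ["Competitive Programming", "Graph Theory"],
)  # 1 shared term out of 2
s3_f3 = round((s3_skills + s3_interests) / 2)

print("S1-F1:", s1_skills, s1_interests, s1_f1)
print("S3-F3:", s3_skills, s3_interests, s3_f3)
```

Note that Python's `round` rounds halves to the nearest even integer, so an average of 16.5 becomes 16 and 58.5 becomes 58.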

### Conclusion
While the baseline approach is fast and interpretable, it lacks semantic
understanding and fails in realistic academic matching scenarios.


# Optimized Version (Code)

In [None]:
# Optimization step: install the Sentence Transformers library,
# which provides the semantic embedding model
!pip install -q sentence-transformers

In [None]:
# Loading Embedding Model
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Loading a lightweight semantic embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

In [None]:
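# Semantic matching utility function
def semantic_match(list1, list2):
    """Return the best pairwise cosine similarity between two term lists as a 0-100 score."""
    if not list1 or not list2:
        return 0

    # Encode each term into a dense vector (one row per term)
    embeddings_1 = model.encode(list1)
    embeddings_2 = model.encode(list2)

    # Pairwise cosine similarities, shape (len(list1), len(list2))
    similarity_matrix = cosine_similarity(embeddings_1, embeddings_2)

    # Take the single best semantic match across all pairs
    max_similarity = similarity_matrix.max()

    # Cosine similarity can be negative; clamp so the score stays in [0, 100]
    return round(max(0.0, float(max_similarity)) * 100)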

In [None]:
# Optimized Student to Faculty Matching
def student_to_faculty_optimized(student_profile, faculty_profile):
    skill_score = semantic_match(
        student_profile.get("skills", []),
        faculty_profile.get("expertise", [])
    )

    interest_score = semantic_match(
        student_profile.get("interests", []),
        faculty_profile.get("research_areas", [])
    )

    # Weighted scoring (skills more important)
    final_score = round(0.7 * skill_score + 0.3 * interest_score)

    return final_score

In [None]:
# Computing Optimized Results
optimized_results = []

for s in students:
    for f in faculty:
        score = student_to_faculty_optimized(s, f)
        optimized_results.append({
            "student_id": s["id"],
            "faculty_id": f["id"],
            "optimized_score": score
        })

optimized_df = pd.DataFrame(optimized_results)
optimized_df

In [None]:
# Merging Baseline & Optimized Results
comparison_df = baseline_df.merge(
    optimized_df,
    on=["student_id", "faculty_id"]
)

comparison_df

In [None]:
# Computing Quantitative Lift

# Calculating absolute and relative improvement
comparison_df["absolute_lift"] = (
    comparison_df["optimized_score"] - comparison_df["baseline_score"]
)

comparison_df["relative_lift_percent"] = comparison_df.apply(
    lambda row: round(
        (row["absolute_lift"] / row["baseline_score"]) * 100, 2
    ) if row["baseline_score"] > 0 else None,
    axis=1
)

comparison_df

In [None]:
# Aggregate Performance Metrics
summary_metrics = {
    "Average Baseline Score": round(comparison_df["baseline_score"].mean(), 2),
    "Average Optimized Score": round(comparison_df["optimized_score"].mean(), 2),
    "Average Absolute Lift": round(comparison_df["absolute_lift"].mean(), 2),
    "Max Optimized Score": comparison_df["optimized_score"].max(),
}

pd.DataFrame(summary_metrics.items(), columns=["Metric", "Value"])

## Performance Comparison & Lift Analysis

### Quantitative Results
- On the dummy data, the optimized semantic matching algorithm scores
  higher than the baseline keyword-based approach (see `comparison_df`
  and `summary_metrics` above for the exact values).
- Average match scores increased across the student–faculty pairs.
- Strong semantic matches that scored poorly in the baseline (e.g., "ML" vs
  "Machine Learning") now receive high scores.

### Key Improvements
- Semantic understanding of skills and research areas
- Robust handling of synonyms and related concepts
- Better alignment with human intuition
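
One caveat on alignment with intuition: `semantic_match` scores a pair of lists by its single most similar item pair, so one shared term can drive the whole score. The sketch below contrasts that with a hypothetical alternative (not part of the notebook) that averages each row's best match. The function names and the similarity values are made up for illustration and are not model output.

```python
def best_pair_score(similarity_matrix):
    """Aggregation used by semantic_match: the single highest similarity."""
    best = max(max(row) for row in similarity_matrix)
    return round(max(0.0, best) * 100)


def mean_best_match_score(similarity_matrix):
    """Hypothetical alternative: average of each row's best match, clamped at 0."""
    row_best = [max(0.0, max(row)) for row in similarity_matrix]
    return round(sum(row_best) / len(row_best) * 100)


# Illustrative similarities (invented values):
# rows = student skills, columns = faculty expertise terms.
toy_similarities = [
    [0.90, 0.20],  # one skill matches a faculty term closely
    [0.10, 0.30],  # the other skills match only weakly
    [0.05, 0.10],
]

print(best_pair_score(toy_similarities))        # 90
print(mean_best_match_score(toy_similarities))  # 43
```

Which aggregation fits better depends on whether a single strong shared topic should be enough for a recommendation; that is a product decision this notebook does not settle.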

### Conclusion
On this dummy dataset, the optimized approach achieves a large absolute lift
in match scores, which suggests that embedding-based semantic matching is a
promising choice for real-world academic matching systems.


## Final Conclusion

The baseline Mentora matching algorithm provided a fast and explainable
starting point but failed to capture semantic similarity.

By integrating embedding-based semantic matching, the system achieved
higher match scores on the dummy data while remaining modular and extensible.

This optimized approach can be further enhanced using:
- Caching of embeddings
- Hybrid exact + semantic matching
- Threshold-based recommendations
- Offline batch processing for scalability
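
As a rough illustration of the first two ideas, the sketch below combines an embedding cache with a hybrid exact + semantic score. It is an assumption-laden outline, not the notebook's implementation: `EmbeddingCache`, `hybrid_match`, and `toy_encode` are hypothetical names, the encoder is passed in as a plain callable so the example runs without downloading a model, and terms are lowercased before encoding so that cache keys are case-insensitive.

```python
import math


class EmbeddingCache:
    """Memoizes embeddings per normalized term so each term is encoded once."""

    def __init__(self, encode_fn):
        # encode_fn is assumed to map a list of strings to a sequence of
        # vectors, as model.encode does in the notebook.
        self._encode_fn = encode_fn
        self._store = {}

    def get(self, texts):
        keys = [t.lower().strip() for t in texts]
        # Encode only unseen terms, preserving order and dropping duplicates.
        missing = [k for k in dict.fromkeys(keys) if k not in self._store]
        if missing:
            for key, vec in zip(missing, self._encode_fn(missing)):
                self._store[key] = list(vec)
        return [self._store[k] for k in keys]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_match(list1, list2, cache):
    """100 on any exact (case-insensitive) overlap, else best cosine similarity."""
    if not list1 or not list2:
        return 0
    terms1 = {s.lower().strip() for s in list1}
    terms2 = {s.lower().strip() for s in list2}
    if terms1 & terms2:
        return 100  # exact match short-circuits the embedding step
    vecs1 = cache.get(list1)
    vecs2 = cache.get(list2)
    best = max(cosine(u, v) for u in vecs1 for v in vecs2)
    return round(max(0.0, best) * 100)


# Toy stand-in for model.encode; its vectors carry no semantic meaning.
def toy_encode(texts):
    return [[float(len(t)), float(t.count("a"))] for t in texts]


cache = EmbeddingCache(toy_encode)
print(hybrid_match(["Python", "ML"], ["python", "Deep Learning"], cache))  # 100
```

In the notebook, the cache could presumably be constructed as `EmbeddingCache(model.encode)`, so repeated terms across the student and faculty loops are encoded only once.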
