# 🎓 VKM Recommender System: A Step-by-Step Guide

Welcome! In this notebook, we will build a **smart recommendation engine** for VKM modules. 

**What does this do?**
Imagine a student looking for a specific minor or module. Instead of reading through hundreds of descriptions, they can describe what they are interested in (e.g., "machine learning", "healthcare", "design"), and our system will suggest the best matches.

**How does it work?**
1. **Understand the Data:** We load the module descriptions.
2. **Process Text:** We turn text into numbers so the computer can understand it (using TF-IDF).
3. **Match Interests:** We compare the student's interests with module descriptions.
4. **Check Constraints:** We ensure the recommendations fit practical needs (location, difficulty, etc.).
5. **Rank & Recommend:** We give each module a score and show the top results.

Let's get started! 🚀


## 1. Setup and Tools
First, we need to import the Python libraries (tools) we'll be using. 
- **Pandas** is for working with data tables.
- **Scikit-learn** provides the machine learning tools for text analysis.
- **re** is Python's built-in module for regular expressions (used here for text cleaning).


In [1]:
import re
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional, Tuple, Dict, Any

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Set pandas option to show more content in columns so we can read full descriptions
pd.set_option("display.max_colwidth", 200)

def norm_col(name: str) -> str:
    """
    Helper function to normalize column names.
    It strips surrounding whitespace, converts names to lowercase, and
    replaces spaces and hyphens with underscores.
    Example: "Study Credit" -> "study_credit"
    """
    name = str(name).strip().lower()
    name = re.sub(r"\s+", "_", name)
    name = name.replace("-", "_")
    return name
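
As a quick check of what `norm_col` does, the sketch below re-declares the same helper and applies it to a few header names. The sample headers are made up for illustration. Note that a name without spaces or hyphens, such as `shortdescription`, passes through unchanged, so no underscore is inserted.

```python
import re

def norm_col(name: str) -> str:
    """Same logic as the helper above: trim, lowercase, spaces/hyphens -> underscores."""
    name = str(name).strip().lower()
    name = re.sub(r"\s+", "_", name)
    return name.replace("-", "_")

# Hypothetical raw headers, chosen only to illustrate the behaviour
sample_headers = [" Study Credit ", "Module-Tags", "shortdescription"]
normalized = [norm_col(h) for h in sample_headers]
print(normalized)  # ['study_credit', 'module_tags', 'shortdescription']
```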


## 2. Loading and Exploring the Data
Now we load our dataset. This CSV file contains all the information about the available modules.
We will also clean up the column names to make them easier to work with.


In [2]:
# Path to your dataset (adjust the filename if necessary)
DATA_PATH = Path('Opgeschoonde_VKM_dataset.csv')

# Load the CSV file. sep=None together with engine="python" lets pandas auto-detect the separator (like ; or ,)
df_raw = pd.read_csv(DATA_PATH, sep=None, engine="python")
df = df_raw.copy()

# Normalize all column names using our helper function
df.columns = [norm_col(c) for c in df.columns]

# Display the shape of the data (rows, columns)
print(f"Data Shape: {df.shape[0]} rows and {df.shape[1]} columns")

# Show the first few rows to inspect the data
display(df.head(2))

# Show information about columns and data types
print("\nColumn Information:")
print(df.info())

# Check for missing values
print("\nMissing values per column:")
print(df.isna().sum())


Data Shape: 211 rows and 15 columns


Unnamed: 0,id,name,shortdescription,description,studycredit,location,contact_id,level,learningoutcomes,module_tags,interests_match_score,popularity_score,estimated_difficulty,available_spots,start_date
0,159,kennismaking met psychologie,"brein, gedragsbeinvloeding, ontwikkelingspsychologie, gespreksvoering en ontwikkelingsfasen.",in deze module leer je hoe je gedrag van jezelf en van anderen kunt begrijpen en beinvloeden. je maakt kennis met de basistheorie van psychologie. aan bod komen onderwerpen die te maken hebben met...,15,den bosch,58,nlqf5,a. je beantwoordt vragen in een meerkeuze kennistoets waarin je laat zien dat je de basis van de psychologie kunt reproduceren en begrijpt. je laat zien dat je gedrag van individuen en groepen in ...,"['brein', 'gedragsbeinvloeding', 'ontwikkelingspsychologie', 'gespreksvoering', 'en', 'ontwikkelingsfasen']",0.54,319,1,79,2025-12-24
1,160,learning and working abroad,"internationaal, persoonlijke ontwikkeling, verpleegkunde","studenten kiezen binnen de (stam) van de opleiding van verpleegkunde steeds vaker voor een stage in het buitenland, waarbij zij de beroepsprestaties graag in een internationale stagecontext willen...",15,den bosch,58,nlqf5,de student toont professioneel gedrag conform de beroepscode bij laagcomplexe zorgvragers en collega's in de zorgsetting.,"['internationaal', 'persoonlijke', 'ontwikkeling', 'verpleegkunde']",0.92,172,5,56,2025-12-20



Column Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211 entries, 0 to 210
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     211 non-null    int64  
 1   name                   211 non-null    object 
 2   shortdescription       211 non-null    object 
 3   description            211 non-null    object 
 4   studycredit            211 non-null    int64  
 5   location               211 non-null    object 
 6   contact_id             211 non-null    int64  
 7   level                  211 non-null    object 
 8   learningoutcomes       211 non-null    object 
 9   module_tags            211 non-null    object 
 10  interests_match_score  211 non-null    float64
 11  popularity_score       211 non-null    int64  
 12  estimated_difficulty   211 non-null    int64  
 13  available_spots        211 non-null    int64  
 14  start_date             211 non-null  

## 3. Creating Content Profiles (What is the module about?)
To recommend a module, the computer needs to know what it's about. 
We will combine the available text columns (such as `description` and `module_tags`) into one big text field called `VKM_text`. 
This gives us a rich description for every module.


In [3]:
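# Select columns that contain text descriptions
# We check if they exist in the dataframe first to avoid errors
# Note: this dataset spells two columns "shortdescription" and "learningoutcomes",
# so only "description" and "module_tags" pass the check below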
text_columns = [c for c in ["short_description", "description", "module_tags", "learning_outcomes"] if c in df.columns]

if not text_columns:
    raise ValueError("No suitable text columns found to create 'VKM_text'. Please check your dataset.")

# Combine the selected text columns into a single 'VKM_text' column
# We fill missing values with empty strings and join them with spaces
df["VKM_text"] = (
    df[text_columns]
    .fillna("")
    .astype(str)
    .agg(" ".join, axis=1)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# Determine which column to use as the title/name
title_col = "title" if "title" in df.columns else ("name" if "name" in df.columns else df.columns[0])

# Show the result: The Name and the combined Text
df[[title_col, "VKM_text"]].head()


Unnamed: 0,name,VKM_text
0,kennismaking met psychologie,in deze module leer je hoe je gedrag van jezelf en van anderen kunt begrijpen en beinvloeden. je maakt kennis met de basistheorie van psychologie. aan bod komen onderwerpen die te maken hebben met...
1,learning and working abroad,"studenten kiezen binnen de (stam) van de opleiding van verpleegkunde steeds vaker voor een stage in het buitenland, waarbij zij de beroepsprestaties graag in een internationale stagecontext willen..."
2,proactieve zorgplanning,"het jeroen bosch ziekenhuis wil graag samen met de opleiding verpleegkunde een module ontwikkelen, waarin de studenten de mogelijkheid krijgen om zich te verdiepen in de ziekenhuissetting. jbz sta..."
3,rouw en verlies,"in deze module wordt stil gestaan bij rouw en verlies, vanuit diverse invalshoeken waaronder de palliatieve zorg. thema's zoals oncologie kunnen hier een plaats krijgen (werkveld verpleegkunde vra..."
4,acuut complexe zorg,"in deze module kunnen studenten zich verdiepen in de acuut, complexe zorg binnen het verpleegkundig vakgebied. ['acute', 'zorg', 'complexiteit', 'ziekenhuis', 'revalidatie']"


## 4. Converting Text to Numbers (TF-IDF)
Computers can't work with English (or Dutch) text directly; they operate on numbers.
We use a technique called **TF-IDF** (Term Frequency-Inverse Document Frequency).

*   **TF:** How often does a word appear in this document? (More = important)
*   **IDF:** How rare is this word across all documents? (Rarer = more specific/important)

This transforms our `VKM_text` into a mathematical vector.


In [4]:
import nltk
from nltk.corpus import stopwords

# Download the NLTK stopword lists; the Dutch list covers common words
# like 'de', 'het', 'een' that carry little meaning
nltk.download('stopwords')

dutch_stopwords = stopwords.words('dutch')

# Initialize the TF-IDF Vectorizer (TfidfVectorizer was imported in the setup cell)
# max_features=8000: Keep only the 8000 most frequent terms across the corpus
# ngram_range=(1, 2): Look at single words and pairs of words (bi-grams)
tfidf = TfidfVectorizer(
    max_features=8000,
    ngram_range=(1, 2),
    stop_words=dutch_stopwords  # Ignore common Dutch words
)

# 'Fit' the vectorizer to our data and 'transform' the text into numbers
# We use the 'VKM_text' we created earlier. 
# (Note: The original code used 'name', but 'VKM_text' is much richer!)
X = tfidf.fit_transform(df["VKM_text"].fillna(""))

print(f"Transformed text into a matrix of shape: {X.shape}")


Transformed text into a matrix of shape: (211, 8000)




## 5. Defining the Student Profile
Now we need a way to represent the student. 
We create a `CandidateProfile` class that holds:
*   **Interests:** What they want to learn (text).
*   **Preferences:** Location, difficulty, study credits, etc.


In [5]:
@dataclass
class CandidateProfile:
    """
    A data structure to hold the student's profile and preferences.
    """
    interests_text: str  # The student's description of what they want to learn
    preferred_location: Optional[str] = None # e.g., 'breda', 'Den Bosch'
    min_studycredits: Optional[float] = None # e.g., 15 or 30
    max_difficulty: Optional[float] = None # Scale of 1 to 5
    moduletags_include: Optional[List[str]] = None  # Specific tags they are looking for
    min_level: Optional[str] = None # e.g., 'NLQF5' (stored, but not yet checked by constraint_score)


def profile_vector(profile: CandidateProfile):
    """
    Converts the candidate's interest text into the same numerical format (vector)
    as our modules, so we can compare them.
    """
    return tfidf.transform([profile.interests_text])


## 6. The Constraint Score (Practical Fit)
A module might match your interests perfectly, but if it's in the wrong city or has the wrong number of credits, it's not a good option.

We calculate a **Constraint Score** (0 to 1) based on:
*   **Location:** Does it match?
*   **Study Credits:** Is it enough credits?
*   **Difficulty:** Is it too hard?
*   **Tags:** Does it have specific required tags?
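
The arithmetic behind this score can be sketched in a few lines. The weights (0.4, 0.3, 0.2, 0.1) are the ones used in the code cell below. The `weighted_average` helper and the example scores are hypothetical and only illustrate the calculation. Constraints the student did not set are left out, and the remaining weights are renormalised.

```python
# (score, weight) pairs for the constraints that are active for one module
active = {
    "location":   (1.0, 0.4),  # preferred location matches
    "tags":       (0.5, 0.3),  # 1 of 2 wanted tags found
    "difficulty": (1.0, 0.2),  # at or below the maximum difficulty
    "credits":    (1.0, 0.1),  # at least the minimum credits
}

def weighted_average(parts):
    """Weighted mean of the scores; returns 0.0 when nothing is active."""
    total_weight = sum(w for _, w in parts.values())
    if total_weight == 0:
        return 0.0
    return sum(s * w for s, w in parts.values()) / total_weight

full = weighted_average(active)

# If the student only sets a location and a difficulty limit, the other
# weights drop out and the remaining ones are renormalised (0.4 / 0.6).
partial = weighted_average({
    "location":   (1.0, 0.4),
    "difficulty": (0.0, 0.2),  # module is too difficult
})
print(round(full, 2), round(partial, 2))  # 0.85 0.67
```

The first value, 0.85, is the same constraint score that the demo in section 9 reports for modules where the location matches and one of two tags is found.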


In [6]:
def extract_numeric_series(series: pd.Series) -> pd.Series:
    """Helper to extract numbers from text columns (e.g. '15 EC' -> 15.0)"""
    s = series.astype(str).str.replace(",", ".", regex=False)
    nums = pd.to_numeric(
        s.str.extract(r"([0-9]+\.?[0-9]*)")[0],
        errors="coerce"
    )
    return nums


def constraint_score(row: pd.Series, profile: CandidateProfile) -> Tuple[float, Dict[str, str]]:
    """
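    Calculates how well a module (row) fits the practical preferences in the profile.
    These are soft constraints: a mismatch lowers the score but never excludes a module.
    Returns a weighted score (0.0 to 1.0) and a dictionary of reasons. Only constraints
    that are set on the profile count, and their weights are renormalised to sum to 1.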
    
    Priorities:
    1. Location (Highest)
    2. Module Tags
    3. Difficulty
    4. Study Credits (Lowest)
    """
    scores = []
    weights = []
    reasons: Dict[str, str] = {}

    # Define Weights
    W_LOCATION = 0.4
    W_TAGS = 0.3
    W_DIFFICULTY = 0.2
    W_CREDITS = 0.1

    # 1. Location Check (Weight: 0.4)
    loc_col = "location" if "location" in row.index else None
    if profile.preferred_location and loc_col:
        module_loc = str(row[loc_col]).lower()
        pref = profile.preferred_location.lower()
        # Substring match, so "breda" also matches "breda en den bosch"
        match = pref in module_loc
        scores.append(1.0 if match else 0.0)
        weights.append(W_LOCATION)
        reasons["location"] = f"Location {'matches' if match else 'does not match'} (Weight {W_LOCATION}): candidate={profile.preferred_location}, module={row[loc_col]}"
    else:
        reasons["location"] = "No location preference or column found."

    # 2. Module Tags / Domain Match (Weight: 0.3)
    tags_col = "module_tags" if "module_tags" in row.index else None
    wanted = profile.moduletags_include
    if wanted and tags_col:
        tags = str(row[tags_col]).lower()
        wanted = [w.lower() for w in wanted]
        # Count how many wanted tags appear in the module tags.
        # Note: this is a substring match on the stringified tag list,
        # so "data" would also match a longer tag such as "datagedreven".
        match_count = sum(1 for w in wanted if w in tags)
        # Score is the fraction of wanted tags found
        sc = min(1.0, match_count / max(1, len(wanted)))
        scores.append(sc)
        weights.append(W_TAGS)
        reasons["module_tags"] = f"Tag Match (Weight {W_TAGS}): Found {match_count} of {len(wanted)} tags."
    else:
        reasons["module_tags"] = "No tag constraint or column."

    # 3. Difficulty Level (Weight: 0.2)
    diff_col = "estimated_difficulty" if "estimated_difficulty" in row.index else None
    if getattr(profile, "max_difficulty", None) is not None and diff_col:
        diff_val = extract_numeric_series(pd.Series([row[diff_col]])).iloc[0]
        if pd.isna(diff_val):
            scores.append(0.5)
            weights.append(W_DIFFICULTY)
            reasons["estimated_difficulty"] = f"Difficulty unknown (Weight {W_DIFFICULTY}); neutral score."
        else:
            match = diff_val <= getattr(profile, "max_difficulty")
            scores.append(1.0 if match else 0.0)
            weights.append(W_DIFFICULTY)
            reasons["estimated_difficulty"] = f"Difficulty {'OK' if match else 'too high'} (Weight {W_DIFFICULTY}) (Level {diff_val})."
    else:
        reasons["estimated_difficulty"] = "No difficulty constraint or column."

    # 4. Minimum Study Credits (Weight: 0.1)
    credits_col = "studycredit" if "studycredit" in row.index else None
    min_credits = profile.min_studycredits
    
    if min_credits is not None and credits_col:
        val = extract_numeric_series(pd.Series([row[credits_col]])).iloc[0]
        if pd.isna(val):
            scores.append(0.5)
            weights.append(W_CREDITS)
            reasons["studycredit"] = f"Credits unknown (Weight {W_CREDITS}); neutral score."
        else:
            match = val >= min_credits
            scores.append(1.0 if match else 0.0)
            weights.append(W_CREDITS)
            reasons["studycredit"] = f"Credits {'sufficient' if match else 'too low'} (Weight {W_CREDITS}) (found {val})."
    else:
        reasons["studycredit"] = "No credit constraint or column."

    if not scores:
        return 0.0, reasons

    # Calculate Weighted Average
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0, reasons
        
    weighted_score = sum(s * w for s, w in zip(scores, weights)) / total_weight
    return float(weighted_score), reasons


## 7. The Recommendation Engine
This is the core function. It calculates a **Final Score** for every module by combining:

1.  **Content Similarity (`alpha`):** How much does the text match? (Cosine Similarity)
2.  **Constraint Score (`beta`):** Does it fit the practical requirements?
3.  **Popularity (`gamma`):** Is it a popular module? (Optional)

Formula: `Final Score = (alpha * Content) + (beta * Constraints) + (gamma * Popularity)`
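
As a worked example, the snippet below plugs in the component scores that the demo in section 9 reports for its top result, using the default weights from the `recommend` function. It is only an arithmetic check of the formula, not part of the pipeline.

```python
# Default weights used by recommend()
alpha, beta, gamma = 0.7, 0.2, 0.1

# Component scores of the top demo result (section 9)
content_sim = 0.240564  # cosine similarity between profile and module text
constraint = 0.85       # weighted constraint score
popularity = 0.977551   # min-max scaled popularity

final_score = alpha * content_sim + beta * constraint + gamma * popularity
print(f"{final_score:.5f}")  # 0.43615
```

Because cosine similarities for short queries tend to be small, the constraint and popularity terms can contribute a large share of the final score even with `alpha = 0.7`. In the demo, the fifth result has a content similarity of only 0.033 but a popularity of 1.0.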


In [7]:
def first_existing_col(df: pd.DataFrame, candidates: list) -> Optional[str]:
    """Finds the first column from a list that exists in the dataframe."""
    for c in candidates:
        if c in df.columns:
            return c
    return None

def recommend(
    profile: CandidateProfile,
    k: int = 10,
    alpha: float = 0.7,   # Weight for Content Match
    beta: float = 0.2,    # Weight for Constraints
    gamma: float = 0.1    # Weight for Popularity
) -> pd.DataFrame:
    
    # Ensure weights sum to 1
    if not np.isclose(alpha + beta + gamma, 1.0):
        raise ValueError("Weights (alpha + beta + gamma) must sum to 1.0")

    # 1. Calculate Content Similarity
    p_vec = profile_vector(profile)
    # cosine_similarity returns a 2-D array of shape (1, n_modules); flatten it to a 1-D array of scores
    content_scores = cosine_similarity(p_vec, X).flatten()

    # 2. Calculate Popularity Score (Normalized to 0-1)
    pop_col = first_existing_col(df, ["popularity_score"])
    if pop_col:
        pop_raw = extract_numeric_series(df[pop_col]).fillna(0.0)
        # Min-Max Scaling
        pop_scaled = (pop_raw - pop_raw.min()) / (pop_raw.max() - pop_raw.min() + 1e-9)
    else:
        pop_scaled = pd.Series(0.0, index=df.index)

    rows = []
    title_col = first_existing_col(df, ["name"]) or df.columns[0]
    loc_col = first_existing_col(df, ["location"])

    # 3. Iterate through all modules to calculate scores
    # (content_scores is indexed by position; this works because df has a default RangeIndex)
    for idx, row in df.iterrows():
        cstr_score, reasons = constraint_score(row, profile)
        c_score = float(content_scores[idx])
        pop_score = float(pop_scaled.loc[idx])

        # Calculate Final Weighted Score
        final_score = alpha * c_score + beta * cstr_score + gamma * pop_score

        rows.append({
            "index": idx,
            "name": row.get(title_col, ""),
            "location": row.get(loc_col, ""),
            "final_score": final_score,
            "content_sim": c_score,
            "constraint_score": cstr_score,
            "popularity_score": pop_score,
            "constraint_reasons": reasons,
        })

    # Create DataFrame and sort by Final Score (highest first)
    rec_df = pd.DataFrame(rows).sort_values("final_score", ascending=False).head(k).reset_index(drop=True)
    return rec_df
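
The popularity term relies on min-max scaling, which maps the lowest raw value to 0 and the highest to (almost exactly) 1. The sketch below reproduces that step on a plain list. The `min_max_scale` helper and the raw values are illustrative assumptions. The small `eps` mirrors the `1e-9` in `recommend` and avoids a division by zero when all values are equal.

```python
def min_max_scale(values, eps=1e-9):
    """Scale values to the 0-1 range; eps guards against a zero range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + eps) for v in values]

raw_popularity = [172, 319, 500]          # hypothetical raw popularity scores
scaled = min_max_scale(raw_popularity)
constant = min_max_scale([42, 42, 42])    # all equal: every value scales to 0.0

print([round(s, 3) for s in scaled])      # [0.0, 0.448, 1.0]
print(constant)                           # [0.0, 0.0, 0.0]
```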


## 8. Diversity & Explanations
To make the results easier to evaluate and to trust, we add:
*   **Diversity Score:** Are the recommendations too similar to each other? (We want variety).
*   **Explanation:** Which words actually matched? This helps the user understand *why* they got this recommendation.
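
The diversity score is one minus the average pairwise cosine similarity of the recommended modules. Here is a minimal sketch with a hypothetical 3x3 similarity matrix (the values are invented, not computed from the dataset):

```python
import numpy as np

# Hypothetical pairwise cosine similarities for three recommended modules
sim = np.array([
    [1.0, 0.2, 0.4],
    [0.2, 1.0, 0.6],
    [0.4, 0.6, 1.0],
])

# Keep each pair once: the upper triangle, excluding the diagonal
mask = np.triu(np.ones_like(sim, dtype=bool), k=1)
avg_sim = sim[mask].mean()             # mean of 0.2, 0.4 and 0.6
toy_diversity = float(1.0 - avg_sim)
print(round(toy_diversity, 2))         # 0.6
```

The diagonal is excluded because every module has similarity 1.0 with itself, which would otherwise pull the diversity score down.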


In [8]:
def diversity_score(indices: List[int]) -> float:
    """
    Calculates how diverse the set of recommendations is.
    Returns 1.0 (very diverse) to 0.0 (all identical).
    """
    if len(indices) < 2:
        return 0.0
    sub = X[indices]
    sim = cosine_similarity(sub)
    # We only care about the upper triangle of the similarity matrix (excluding diagonal)
    mask = np.triu(np.ones_like(sim, dtype=bool), k=1)
    if mask.sum() == 0:
        return 0.0
    avg_sim = sim[mask].mean()
    return float(1.0 - avg_sim)

def explain_overlap(profile_text: str, module_text: str, top_k: int = 8) -> List[str]:
    """
    Finds the words that appear in both the profile text and the module text,
    ordered by how often they occur in the module text.
    """
    def tokenize(s: str) -> List[str]:
        return re.findall(r"\w+", s.lower())

    # Build the profile vocabulary once for fast membership checks
    p_tokens = set(tokenize(profile_text))
    m_tokens = tokenize(module_text)

    # Keep the module tokens that also occur in the profile
    common = [t for t in m_tokens if t in p_tokens]
    
    # Count frequency of common words
    counts: Dict[str, int] = {}
    for t in common:
        counts[t] = counts.get(t, 0) + 1

    # Sort by frequency
    sorted_terms = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    return [w for w, _ in sorted_terms[:top_k]]


## 9. Demo: Let's Try It Out!
We will create a sample student profile and see what the system recommends.
You can change the `interests_text` or `preferred_location` to test different scenarios.


In [None]:
# Create a demo profile
demo_profile = CandidateProfile(
    interests_text="data analytics machine learning python", # something like physiotherapy also works
    preferred_location="breda", # or den bosch, tilburg
    min_studycredits=15, # 15, 30
    max_difficulty=4.0, # 1 to 5
    moduletags_include=["data", "analytics"] # e.g. 'fysio', 'health', 'medisch'
)

# Get recommendations
print("Calculating recommendations...")
demo_recs = recommend(demo_profile, k=5, alpha=0.7, beta=0.2, gamma=0.1)

# Display the results
display(demo_recs)


Calculating recommendations...


Unnamed: 0,index,name,location,final_score,content_sim,constraint_score,popularity_score,constraint_reasons
0,171,molecular modeling & data-driven analysis,breda,0.43615,0.240564,0.85,0.977551,"{'location': 'Location matches (Weight 0.4): candidate=breda, module=breda', 'module_tags': 'Tag Match (Weight 0.3): Found 1 of 2 tags.', 'estimated_difficulty': 'Difficulty OK (Weight 0.2) (Level..."
1,146,smart industry & internet of things,breda,0.357558,0.212838,0.85,0.385714,"{'location': 'Location matches (Weight 0.4): candidate=breda, module=breda', 'module_tags': 'Tag Match (Weight 0.3): Found 1 of 2 tags.', 'estimated_difficulty': 'Difficulty OK (Weight 0.2) (Level..."
2,130,datagedreven besluitvorming met ai,breda,0.316496,0.168171,0.85,0.287755,"{'location': 'Location matches (Weight 0.4): candidate=breda, module=breda', 'module_tags': 'Tag Match (Weight 0.3): Found 1 of 2 tags.', 'estimated_difficulty': 'Difficulty OK (Weight 0.2) (Level..."
3,131,ai driven robotics,breda,0.304174,0.100715,0.7,0.936735,"{'location': 'Location matches (Weight 0.4): candidate=breda, module=breda', 'module_tags': 'Tag Match (Weight 0.3): Found 0 of 2 tags.', 'estimated_difficulty': 'Difficulty OK (Weight 0.2) (Level..."
4,158,chemie & gezondheid,breda en den bosch,0.293329,0.033327,0.85,1.0,"{'location': 'Location matches (Weight 0.4): candidate=breda, module=breda en den bosch', 'module_tags': 'Tag Match (Weight 0.3): Found 1 of 2 tags.', 'estimated_difficulty': 'Difficulty OK (Weigh..."


### Why these recommendations?
Let's look at the diversity and the matching keywords for the top results.


In [10]:
# Calculate diversity of the top 5
div_score = diversity_score(demo_recs["index"].tolist())
print(f"Diversity Score (0=Similar, 1=Diverse): {div_score:.2f}\n")

# Explain the matches
print("--- Explanation of Matches ---")
for i, row in demo_recs.iterrows():
    idx = row["index"]
    # Get the full text for this module
    module_text = df.loc[idx, "VKM_text"]
    
    # Find overlapping words
    overlap = explain_overlap(demo_profile.interests_text, module_text)
    
    print(f"Rank {i+1}: {row['name']}")
    print(f"  -> Matching Keywords: {', '.join(overlap)}")
    print(f"  -> Location: {row['location']}")
    print("")


Diversity Score (0=Similar, 1=Diverse): 0.97

--- Explanation of Matches ---
Rank 1: molecular modeling & data-driven analysis
  -> Matching Keywords: data, python, machine, learning
  -> Location: breda

Rank 2: smart industry & internet of things
  -> Matching Keywords: data, machine, learning
  -> Location: breda

Rank 3: datagedreven besluitvorming met ai 
  -> Matching Keywords: data, python
  -> Location: breda

Rank 4: ai driven robotics
  -> Matching Keywords: machine, learning
  -> Location: breda

Rank 5: chemie & gezondheid
  -> Matching Keywords: data
  -> Location: breda en den bosch



## 10. Saving the Model for Later Use

In [11]:

import joblib
from pathlib import Path

ARTIFACT_DIR = Path("models")
ARTIFACT_DIR.mkdir(exist_ok=True)

artifact_path = ARTIFACT_DIR / "vkm_recommender.pkl"
joblib.dump({"tfidf": tfidf, "df": df, "X": X}, artifact_path, compress=3)

print(f"Saved recommender to: {artifact_path}")


Saved recommender to: models\vkm_recommender.pkl
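
To reuse a saved artifact later, it can be read back with `joblib.load`. The sketch below shows the round trip on a tiny stand-in artifact in a temporary directory, so that it runs on its own. The toy documents and the file name `toy_recommender.pkl` are assumptions for illustration. With the real artifact you would load `models/vkm_recommender.pkl` and read the `tfidf`, `df`, and `X` keys.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in artifact: a tiny vectorizer fitted on made-up text
toy_docs = ["data analyse python", "zorg verpleegkunde ziekenhuis"]
toy_tfidf = TfidfVectorizer()
toy_X = toy_tfidf.fit_transform(toy_docs)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_recommender.pkl"
    joblib.dump({"tfidf": toy_tfidf, "X": toy_X}, path, compress=3)

    # Later (for example in another script): load and reuse without refitting
    artifact = joblib.load(path)

loaded_tfidf, loaded_X = artifact["tfidf"], artifact["X"]

# The loaded vectorizer produces vectors in the same feature space as loaded_X
query_vec = loaded_tfidf.transform(["python"])
print(query_vec.shape, loaded_X.shape)  # (1, 6) (2, 6)
```

Since these files are pickles, they should only be loaded from trusted sources, ideally with the same library versions that were used to save them.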
