# 3 Generate the RIASEC vectors for the programmes

**Idea**: Lexicon score, concept and math\
RIASEC gives six interest areas: Realistic, Investigative, Artistic, Social, Enterprising, and Conventional. O*NET’s Interest Profiler materials define these areas and show example activities and descriptors that map to each one. We will build a small dictionary of words and short phrases for each area, then score any programme text by how much of that text overlaps with each dictionary. That gives a six-number vector per programme.


Each course produces a continuous six-number vector, and the course vectors are then aggregated with weights.

Per-course vector: v_course = lexicon scores computed over that course's description.

Per-course weight: w = ECTS credits × year factor, with a year factor of 1.0 for year 1, 0.6 for year 2, and 0.4 for year 3. The cells below weight by ECTS only; the year factor is not applied yet.

Programme course vector: V_courses = weighted average of v_course over all required courses.

Programme vector: V_programme = 0.4 × (scores from the programme description) + 0.6 × V_courses. Cell 8 below currently uses a 50/50 blend instead.
Tune these weights later using advisor feedback.

This gives you a fair picture of what students will actually do, not only the marketing copy.
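
The weighting scheme above can be sketched in a few lines of plain Python. The year factors and the 0.4/0.6 blend are the ones quoted above; the toy course records, their vectors, and the helper names (`YEAR_FACTOR`, `weighted_mean`, `blend`) are assumptions made up for this illustration and are not part of the notebook.

```python
from typing import Dict, List, Sequence

LETTERS = ["R", "I", "A", "S", "E", "C"]
YEAR_FACTOR: Dict[int, float] = {1: 1.0, 2: 0.6, 3: 0.4}  # values quoted in the text


def weighted_mean(vectors: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted average of six-number vectors; equal weights if all weights are zero."""
    total = sum(weights)
    if total <= 0:
        weights = [1.0] * len(vectors)
        total = float(len(vectors))
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(len(LETTERS))]


def blend(v_desc: Sequence[float], v_courses: Sequence[float], w_desc: float = 0.4) -> List[float]:
    """V_programme = w_desc * description scores + (1 - w_desc) * V_courses."""
    return [w_desc * d + (1.0 - w_desc) * c for d, c in zip(v_desc, v_courses)]


# Hypothetical toy data: one purely Investigative year 1 course, one purely Social year 3 course
courses = [
    {"ects": 6, "year": 1, "vec": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]},
    {"ects": 6, "year": 3, "vec": [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]},
]
weights = [c["ects"] * YEAR_FACTOR[c["year"]] for c in courses]  # 6 * 1.0 and 6 * 0.4
v_courses = weighted_mean([c["vec"] for c in courses], weights)

v_desc = [0.0, 0.5, 0.5, 0.0, 0.0, 0.0]  # hypothetical description scores
v_programme = blend(v_desc, v_courses)
print(dict(zip(LETTERS, (round(x, 3) for x in v_programme))))
```

Because both inputs sum to one and the blend weights sum to one, the blended vector also sums to one without a further rescaling step.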

Level 2. Improve quality with embeddings and metadata




**Core steps:**

1. **Preprocess the text**
Lowercase, strip markup, keep words, and optionally lemmatize.

2. **Choose a lexicon for each letter**
Example for Investigative might include analyze, theory, model, hypothesis, data, experiment. The O*NET Interest Profiler explains what belongs under each area and gives good guidance for building such seed sets.

3. **Weight the words**
Two simple choices:
    a. Raw counts per thousand tokens
    b. TF-IDF weights, which downweight very common words and upweight informative ones

4. **Aggregate to the six areas**
For a programme p and letter L, the raw score is the sum of the weights of all tokens in the text that appear in the L lexicon.

5. **Normalize**
Turn the six raw scores into proportions that sum to one. This gives a clean six number vector.
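
The five steps can be sketched end to end with the simpler of the two weighting choices, raw counts per thousand tokens. This is a minimal illustration in plain Python, not the notebook's implementation (which uses TF-IDF); `TOY_LEXICON`, `tokenize`, and `riasec_proportions` are hypothetical names, and the lexicon is far smaller than a real seed set.

```python
import re
from typing import Dict, List

# Toy seed lexicon for illustration only
TOY_LEXICON: Dict[str, List[str]] = {
    "R": ["lab", "equipment"],
    "I": ["data", "hypothesis", "experiment"],
    "A": ["design", "music"],
    "S": ["teach", "community"],
    "E": ["business", "marketing"],
    "C": ["audit", "procedure"],
}


def tokenize(text: str) -> List[str]:
    # Step 1: lowercase and keep alphabetic words only
    return re.findall(r"[a-z]+", text.lower())


def riasec_proportions(text: str, lexicon: Dict[str, List[str]]) -> Dict[str, float]:
    tokens = tokenize(text)
    if not tokens:
        return {letter: 0.0 for letter in lexicon}
    # Steps 3 and 4: counts per thousand tokens, summed over each letter's lexicon
    raw = {
        letter: 1000.0 * sum(tokens.count(term) for term in terms) / len(tokens)
        for letter, terms in lexicon.items()
    }
    total = sum(raw.values())
    if total == 0:
        return {letter: 0.0 for letter in lexicon}
    # Step 5: proportions that sum to one
    return {letter: score / total for letter, score in raw.items()}


example = "Students design an experiment, collect data in the lab, and test a hypothesis."
scores = riasec_proportions(example, TOY_LEXICON)
print({letter: round(value, 2) for letter, value in scores.items()})
```

In this example three of the five lexicon hits are Investigative terms, so the Investigative share is 0.6, with 0.2 each for Realistic and Artistic.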

## 1. Imports, paths, and loader

In [20]:
# Cell 1. Imports, paths, and a robust JSON reader

from pathlib import Path
import json
import re
from typing import List, Dict, Iterable

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


# Files (paths built part by part so they also resolve on macOS and Linux)
PATH_PROG = Path("..") / "data_programmes_courses" / "silver" / "df_programmes_silver.csv"   # programmes table
PATH_COUR = Path("..") / "data_programmes_courses" / "silver" / "df_courses_silver.csv"      # courses table

def read_json_table(path: Path) -> pd.DataFrame:
    """
    Read either a JSON list or a JSON lines file into a DataFrame.
    Keeps this flexible so the scraper format can evolve.
    """
    raw = path.read_text(encoding="utf-8").strip()
    if raw.startswith("["):
        return pd.DataFrame(json.loads(raw))
    rows = [json.loads(x) for x in raw.splitlines() if x]
    return pd.DataFrame(rows)

# Read data (the silver tables are stored as CSV, so read_json_table is not used here)
df_prog = pd.read_csv(PATH_PROG)
df_courses = pd.read_csv(PATH_COUR)

# Minimal sanity checks
assert "programme_title" in df_prog.columns, f"programme_title missing in {PATH_PROG.name}"
assert "programme_title" in df_courses.columns, f"programme_title missing in {PATH_COUR.name}"

## 2. Text fields, cleaning, and programme texts by column
Selects the programme text columns to score.

Cleans each text field into a stable, lowercase, punctuation-free form.

Produces a long table with one row per programme per field.
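
The cleaning rules have two consequences that are easy to miss: digits are removed, and any character outside a to z, such as an accented letter in a Dutch text, is replaced by a space. The sketch below mirrors the regular expressions of `clean_text` in a standalone function and applies them to a made-up string; `clean_text_demo` is a name invented for this illustration.

```python
import re


def clean_text_demo(s: object) -> str:
    """Standalone copy of the cleaning steps: lowercase, drop URLs, keep a-z, collapse spaces."""
    if not isinstance(s, str):
        return ""
    s = s.lower()
    s = re.sub(r"http[s]?://\S+", " ", s)   # remove URLs
    s = re.sub(r"[^a-z\s]", " ", s)         # everything except a-z and whitespace becomes a space
    s = re.sub(r"\s+", " ", s).strip()      # collapse runs of whitespace
    return s


# "\u00e9" is an accented e; the digit, hyphen, colon, and URL are dropped as well
cleaned = clean_text_demo("Caf\u00e9 3D-printing: see https://vu.nl/x!")
missing = clean_text_demo(None)
print(repr(cleaned), repr(missing))
```

Here the accented word loses its last letter and `3D` shrinks to `d`, so lexicon terms containing digits or non-ASCII letters would never match after cleaning.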



In [21]:
# Cell 2. Select programme columns, clean text, and build per-field texts

# Programme columns we will score
PROG_FIELDS = [
    "sg_description",
    "vunl_description",
    "vunl_description_curriculum",
    "vunl_future_description",
    "vunl_future_career",
    "year1_description",
    "year2_description",
    "year3_description",
]

# Create empty columns if any are missing
for col in PROG_FIELDS:
    if col not in df_prog.columns:
        df_prog[col] = None

def clean_text(s: str) -> str:
    """
    Lowercase, remove urls, keep letters and spaces, collapse whitespace.
    This keeps tokenisation stable for TF IDF with a small vocabulary.
    """
    if not isinstance(s, str):
        return ""
    s = s.lower()
    s = re.sub(r"http[s]?://\S+", " ", s)
    s = re.sub(r"[^a-z\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

# Cleaned versions of each programme level field
for col in PROG_FIELDS:
    df_prog[f"clean__{col}"] = df_prog[col].fillna("").apply(clean_text)

# For later convenience keep a long shaped view of programme texts by field
prog_long = (
    df_prog[["programme_title"] + [f"clean__{c}" for c in PROG_FIELDS]]
      .rename(columns={f"clean__{c}": c for c in PROG_FIELDS})
      .melt(id_vars="programme_title", var_name="field", value_name="text")
)

# Drop empty rows to avoid zero only documents in the corpus
prog_long = prog_long[prog_long["text"].str.len() > 0].reset_index(drop=True)

prog_long.head(3)


| | programme_title | field | text |
|---|---|---|---|
| 0 | Ancient Studies | sg_description | the ancient world and its heritage are the foc... |
| 1 | Archaeology | sg_description | archaeology as a discipline focuses upon the s... |
| 2 | Artificial Intelligence | sg_description | artificial intelligence ai is the science of b... |



## 3. RIASEC seed lexicon and a vectoriser that knows only these terms

Defines a seed lexicon for the six letters.

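Adds a tiny set of two-word phrases that capture strong signals. These phrases are part of the vocabulary, but the per-letter indices are built from `LEXICON` only, so they do not yet count toward any letter.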

Fits a TF IDF on the real corpus but with a restricted vocabulary. This enforces that only lexicon terms carry weight.

Stores the column indices for each letter so we can sum the TF IDF weights per letter.
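
A small check makes it easy to see which vocabulary entries no letter claims. The sketch below runs it on a toy lexicon and shows one possible remedy, a phrase-to-letter mapping merged into the lexicon before the indices are built. The mapping `PHRASE_LETTER`, its letter assignments, and both helper functions are hypothetical and are not part of the notebook.

```python
from typing import Dict, List


def unassigned_terms(lexicon: Dict[str, List[str]], phrases: List[str]) -> List[str]:
    """Vocabulary entries (here: phrases) that appear in no letter's term list."""
    assigned = {term for terms in lexicon.values() for term in terms}
    return sorted(set(phrases) - assigned)


def with_phrases(lexicon: Dict[str, List[str]], phrase_letter: Dict[str, str]) -> Dict[str, List[str]]:
    """Return a copy of the lexicon with each phrase appended to its letter."""
    merged = {letter: list(terms) for letter, terms in lexicon.items()}
    for phrase, letter in phrase_letter.items():
        if phrase not in merged[letter]:
            merged[letter].append(phrase)
    return merged


toy_lexicon = {"R": ["lab", "field"], "I": ["data", "model"], "C": ["audit", "quality"]}
toy_phrases = ["field work", "data analysis", "quality control"]

# Hypothetical assignment of phrases to letters; an assumption, to be reviewed by a domain expert
PHRASE_LETTER = {"field work": "R", "data analysis": "I", "quality control": "C"}

before = unassigned_terms(toy_lexicon, toy_phrases)
after = unassigned_terms(with_phrases(toy_lexicon, PHRASE_LETTER), toy_phrases)
print(before, after)
```

With a merged lexicon, a lookup such as `np.isin(feats, lexicon[L])` would pick up the phrase columns as well.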

In [22]:
# Cell 3. RIASEC lexicon and TF IDF vectoriser

# Small seed lexicon. Expand later with O*NET style terms and Dutch equivalents.
LEXICON: Dict[str, List[str]] = {
    "R": ["lab","field","equipment","tools","build","repair","operate","install","measure",
          "laboratory","prototype","machinery","hardware","electronics","sample","specimen","safety"
          ,"construction","manual","physical","technician","maintenance","inspection","diagnose","weld"],
    "I": ["analyze","theory","model","proof","derive","experiment","hypothesis","data",
          "research","statistics","algorithm","simulate","evidence","inference","mathematics","physics","logic"
          ,"quantitative","scientific","compute","computation","evaluate","study","investigate"],
    "A": ["design","draw","sketch","compose","write","narrative","visual","media","art",
          "music","film","theatre","creative","story","photography","gallery","curation"
          ,"performance","aesthetic","illustrate","exhibit","craft","fashion","style"],
    "S": ["help","support","advise","coach","teach","tutor","counsel","community","team",
          "care","wellbeing","interview","facilitate","mentor","outreach","collaborate","group","clients"
          ,"service","social","develop","train","educate"],
    "E": ["business","lead","manage","strategy","sales","marketing","finance","entrepreneurship",
          "pitch","negotiate","market","revenue","growth","product","stakeholder","budget","plan"
          ,"customer","commercial","operation","organisational","investor","network"],
    "C": ["organize","detail","procedure","policy","regulation","compliance","audit","accounting",
          "schedule","record","document","database","spreadsheet","report","inventory","forms","workflow","quality"
          ,"administration","logistics","systematic","process","standard"],
}

# Optional short list of two word phrases that are very diagnostic
PHRASES: List[str] = [
    "field work",
    "case study",
    "data analysis",
    "random assignment",
    "time series",
    "quality control",
]

def build_vectorizer(corpus_texts: Iterable[str], lexicon: Dict[str, List[str]], phrases: List[str]) -> tuple[TfidfVectorizer, np.ndarray, Dict[str, np.ndarray]]:
    """
    Fit a TF IDF on the actual corpus, but restrict the vocabulary to the lexicon terms and chosen phrases.
    Returns the fitted vectoriser, the feature name array, and indices for each letter.
    """
    vocab = sorted(set(sum(lexicon.values(), [])) | set(phrases))
    vectorizer = TfidfVectorizer(
        vocabulary=vocab,
        ngram_range=(1, 2),   # allow the phrases
        norm="l2",
        min_df=1
    )
    vectorizer.fit(list(corpus_texts))
    feats = np.array(vectorizer.get_feature_names_out())
    letter_to_idx = {L: np.where(np.isin(feats, lexicon[L]))[0] for L in ["R","I","A","S","E","C"]}
    return vectorizer, feats, letter_to_idx

# Fit on the programme texts only for now; Cell 5 refits on programme plus course texts before scoring
corpus_prog = prog_long["text"].tolist()
tfidf_prog, feat_names_prog, idx_prog = build_vectorizer(corpus_prog, LEXICON, PHRASES)

len(feat_names_prog), {k: len(v) for k, v in idx_prog.items()}


(148, {'R': 25, 'I': 24, 'A': 24, 'S': 23, 'E': 23, 'C': 23})

## 4. Scoring and aggregation functions for programme and course vectors
Words that appear in almost every text say little about a programme, while rare words can be very informative. TF-IDF gives higher weight to the informative words and lower weight to the common ones.
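
To make the weighting concrete, the sketch below computes the inverse document frequency by hand on a toy corpus. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the scikit-learn default for `TfidfVectorizer` (`smooth_idf=True`); the corpus and the function name `smoothed_idf` are made up for illustration. In the notebook each TF-IDF row is additionally scaled to unit L2 length (`norm="l2"`), which this sketch leaves out.

```python
import math
from typing import Dict, List


def smoothed_idf(docs: List[str], vocabulary: List[str]) -> Dict[str, float]:
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1 for each term in the vocabulary."""
    n = len(docs)
    tokenised = [set(doc.split()) for doc in docs]
    idf: Dict[str, float] = {}
    for term in vocabulary:
        df = sum(term in tokens for tokens in tokenised)  # number of documents containing the term
        idf[term] = math.log((1 + n) / (1 + df)) + 1.0
    return idf


toy_docs = [
    "study data and theory",
    "study design and media",
    "study business strategy",
    "study weld and repair",
]
idf = smoothed_idf(toy_docs, ["study", "data", "weld"])
print({term: round(value, 3) for term, value in idf.items()})
```

A word found in every document, like `study` here, gets the minimum idf of 1.0, while a word found in a single document gets roughly 1.92, so one occurrence of the rare word counts almost twice as much.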

In [23]:
# Cell 4. Scoring helpers and the two aggregations: V_progdesc and V_courses

LETTERS = ["R","I","A","S","E","C"]

def riasec_props_from_text(text: str, vectorizer: TfidfVectorizer, letter_to_idx: Dict[str, np.ndarray]) -> np.ndarray:
    """
    Compute p(T) in R^6 for a single text.
    1. Transform text to TF IDF row
    2. Sum weights over the feature indices for each letter
    3. Normalise to unit sum with a tiny epsilon
    """
    if not isinstance(text, str) or not text.strip():
        return np.zeros(6, dtype=float)
    row = vectorizer.transform([text])
    sums = []
    for L in LETTERS:
        idx = letter_to_idx[L]
        val = float(row[:, idx].sum()) if idx.size else 0.0
        sums.append(val)
    v = np.array(sums, dtype=float)
    # The epsilon avoids division by zero; a text with no lexicon hits stays all zeros
    return v / (v.sum() + 1e-8)

def aggregate_programme_description(df_prog: pd.DataFrame, fields: List[str], vectorizer: TfidfVectorizer, letter_to_idx: Dict[str, np.ndarray], alpha: Dict[str, float] | None = None) -> pd.DataFrame:
    """
    Compute V_progdesc for each programme_title as a weighted blend of p(T_k) over selected programme fields.
    If alpha is None, use equal weights over nonempty fields per programme.
    Returns a DataFrame with one row per programme and six columns R,I,A,S,E,C.
    """
    rows = []
    for title, group in df_prog.groupby("programme_title", dropna=False):
        pieces = []
        weights = []
        for col in fields:
            txt = group[f"clean__{col}"].iloc[0] if f"clean__{col}" in group.columns else ""
            if not txt:
                continue
            pieces.append(riasec_props_from_text(txt, vectorizer, letter_to_idx))
            if alpha is None:
                weights.append(1.0)
            else:
                weights.append(alpha.get(col, 0.0))
        if not pieces:
            vec = np.zeros(6, dtype=float)
        else:
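            w = np.array(weights, dtype=float)
            if w.sum() <= 0:
                # every supplied alpha weight is zero: fall back to equal weights
                w = np.ones(len(pieces), dtype=float)
            vec = np.average(np.vstack(pieces), axis=0, weights=w)
            # renormalise for safety; the epsilon keeps an all-zero vector at zero
            vec = vec / (vec.sum() + 1e-8)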
        rows.append({"programme_title": title, **{LETTERS[i]: vec[i] for i in range(6)}})
    return pd.DataFrame(rows)

def make_course_text(row: pd.Series) -> str:
    """
    Concatenate the course text fields we want to score.
    These are the highest value fields with the least boilerplate.
    """
    parts = [
        row.get("course_objective", ""),
        row.get("course_content", ""),
        row.get("method_of_assessment", ""),
        row.get("recommended_background_knowledge", "")
    ]
    return clean_text(" ".join([p for p in parts if isinstance(p, str)]))

def aggregate_courses_by_programme(df_cour: pd.DataFrame, vectorizer: TfidfVectorizer, letter_to_idx: Dict[str, np.ndarray]) -> pd.DataFrame:
    """
    Compute V_courses for each programme as a credits weighted mean of course vectors p(T_c).
    w_c equals ects_c divided by total ects in the programme.
    Returns one row per programme with columns R I A S E C.
    """
    # Prepare cleaned course text and numeric ects
    df = df_cour.copy()
    for col in ["ects"]:
        df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0.0)
    df["course_text_clean"] = df.apply(make_course_text, axis=1)

    rows = []
    for title, group in df.groupby("programme_title", dropna=False):
        group = group[group["course_text_clean"].str.len() > 0]
        if group.empty:
            vec = np.zeros(6, dtype=float)
        else:
            ects = group["ects"].to_numpy(dtype=float)
            # if no credits are recorded for the programme, fall back to equal weights
            total = ects.sum()
            if total <= 0:
                w = np.ones(len(group), dtype=float) / len(group)
            else:
                w = ects / total
            mats = []
            for txt in group["course_text_clean"]:
                mats.append(riasec_props_from_text(txt, vectorizer, letter_to_idx))
            M = np.vstack(mats)
            vec = np.average(M, axis=0, weights=w)
            s = vec.sum() + 1e-8
            vec = vec / s if s > 0 else vec
        rows.append({"programme_title": title, **{LETTERS[i]: vec[i] for i in range(6)}})
    return pd.DataFrame(rows)


## 5. Build a single TF IDF model on the full corpus, programme texts plus course texts

A single shared model is fitted on both programme and course texts, so the idf part reflects the whole dataset. The vocabulary stays limited to the RIASEC lexicon and a few phrases, which keeps interpretation simple. The feature indices per letter are computed once for fast summing.
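
Why the fitting corpus matters can be seen on a toy example: a term that is rare in programme descriptions but routine in course descriptions gets a lower idf once the course texts are included. The texts and the helper `idf` below are invented for this illustration; the formula is the smoothed idf that scikit-learn uses by default.

```python
import math
from typing import List


def idf(term: str, docs: List[str]) -> float:
    """Smoothed idf, ln((1 + n) / (1 + df)) + 1, for a single term."""
    n = len(docs)
    df = sum(term in doc.split() for doc in docs)
    return math.log((1 + n) / (1 + df)) + 1.0


# Hypothetical texts: "report" never occurs in the programme copy but is common in course pages
programme_texts = ["research and data in society", "design and media"]
course_texts = ["report on lab data", "written report", "group report and presentation"]

idf_prog_only = idf("report", programme_texts)                 # df = 0 of 2 documents
idf_combined = idf("report", programme_texts + course_texts)   # df = 3 of 5 documents
print(round(idf_prog_only, 3), round(idf_combined, 3))
```

Scoring course texts with weights fitted on programme texts alone would therefore overweight such course boilerplate, which is the motivation for a shared fit.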

In [24]:
# Cell 5. Fit one TF IDF on programme and course corpora combined
# This gives a shared idf baseline for both sides

# Prepare a cleaned course text for fitting and later scoring
# (make_course_text is already defined in Cell 4 and is reused here)

df_courses["course_text_clean"] = df_courses.apply(make_course_text, axis=1)

# Build the combined corpus
corpus_all = []
corpus_all.extend(prog_long["text"].tolist())
corpus_all.extend(df_courses["course_text_clean"].dropna().tolist())

# Fit a vectoriser using the same lexicon and phrases as Cell 3
# (build_vectorizer from Cell 3 is reused so the vocabulary and settings stay identical)
tfidf_all, feat_names_all, idx_all = build_vectorizer(corpus_all, LEXICON, PHRASES)

len(feat_names_all), {k: len(v) for k, v in idx_all.items()}


(148, {'R': 25, 'I': 24, 'A': 24, 'S': 23, 'E': 23, 'C': 23})

In [25]:
# Cell 6. Compute V_progdesc using equal weights over non empty programme fields

# aggregate_programme_description is defined in Cell 4 and is reused here

V_progdesc = aggregate_programme_description(
    df_prog=df_prog,
    fields=PROG_FIELDS,
    vectorizer=tfidf_all,
    letter_to_idx=idx_all,
    alpha=None  # equal weights over non empty fields
)

V_progdesc.head(3)


| | programme_title | R | I | A | S | E | C |
|---|---|---|---|---|---|---|---|
| 0 | Ancient Studies | 0.0 | 0.34361 | 0.487695 | 0.097507 | 0.031115 | 0.040073 |
| 1 | Archaeology | 0.416881 | 0.395154 | 0.07838 | 0.085196 | 0.024389 | 0.0 |
| 2 | Artificial Intelligence | 0.126682 | 0.374078 | 0.087097 | 0.304267 | 0.099832 | 0.008043 |


In [26]:
# Cell 7. Compute V_courses as a credits weighted mean of course vectors

# aggregate_courses_by_programme is defined in Cell 4 and is reused here;
# it converts ects to numbers and cleans the course text itself
V_courses = aggregate_courses_by_programme(
    df_cour=df_courses,
    vectorizer=tfidf_all,
    letter_to_idx=idx_all
)

V_courses.head(3)


| | programme_title | R | I | A | S | E | C |
|---|---|---|---|---|---|---|---|
| 0 | Ancient Studies | 0.206828 | 0.333915 | 0.153338 | 0.125839 | 0.037534 | 0.142545 |
| 1 | Archaeology | 0.216848 | 0.364213 | 0.13111 | 0.082458 | 0.036773 | 0.168598 |
| 2 | Artificial Intelligence | 0.099028 | 0.332302 | 0.135109 | 0.17644 | 0.043423 | 0.213698 |


In [27]:
# Cell 8. Final blend, simple audit, and save files for the next notebook

OUT_DIR = Path("..") / "data_RIASEC"
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Join programme and course vectors
V_progdesc = V_progdesc.set_index("programme_title")
V_courses  = V_courses.set_index("programme_title")

# Align indices
all_titles = sorted(set(V_progdesc.index) | set(V_courses.index))
V_progdesc = V_progdesc.reindex(all_titles).fillna(0.0)
V_courses  = V_courses.reindex(all_titles).fillna(0.0)

# Final 50/50 blend of description and course vectors, renormalised so each row sums to one
V_final = 0.5 * V_progdesc.values + 0.5 * V_courses.values
V_final = V_final / (V_final.sum(axis=1, keepdims=True) + 1e-8)

df_final = pd.DataFrame(V_final, columns=LETTERS, index=all_titles).reset_index().rename(columns={"index":"programme_title"})
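
The renormalisation in Cell 8 has a side effect worth knowing: a programme that is missing from one of the two tables is filled with zeros on that side, so after rescaling its final vector is practically the vector from the other table. A plain-Python sketch on made-up numbers shows this; `blend_and_renormalise` is an illustrative name, not a function from the notebook.

```python
from typing import List, Sequence


def blend_and_renormalise(v_desc: Sequence[float], v_courses: Sequence[float],
                          w_desc: float = 0.5, eps: float = 1e-8) -> List[float]:
    """Blend two six-number vectors and rescale so the result sums to (almost) one."""
    mixed = [w_desc * d + (1.0 - w_desc) * c for d, c in zip(v_desc, v_courses)]
    total = sum(mixed) + eps
    return [m / total for m in mixed]


v_desc = [0.1, 0.5, 0.1, 0.1, 0.1, 0.1]   # hypothetical description vector
no_courses = [0.0] * 6                     # programme absent from the course table

final = blend_and_renormalise(v_desc, no_courses)
print([round(x, 3) for x in final])
```

The output matches `v_desc` up to the tiny epsilon, so such programmes are scored on their description alone rather than being halved.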


In [28]:
# save the vectors
df_final.to_csv(OUT_DIR / "df_RIASEC_programmes_vectors.csv", index=False)
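
A quick audit before handing the file to the next notebook could check that each row sums to roughly one and read off the three highest letters as a short code. The sketch below is one way to do that in plain Python; the helper names are illustrative assumptions, and the example row is the Archaeology row of `V_progdesc` printed earlier.

```python
from typing import List, Sequence

LETTERS = ["R", "I", "A", "S", "E", "C"]


def top_letters(vec: Sequence[float], k: int = 3) -> str:
    """Return the k letters with the highest scores, highest first."""
    ranked = sorted(zip(LETTERS, vec), key=lambda pair: pair[1], reverse=True)
    return "".join(letter for letter, _ in ranked[:k])


def sums_to_one(vec: Sequence[float], tol: float = 1e-4) -> bool:
    """True if the six proportions add up to one within a tolerance."""
    return abs(sum(vec) - 1.0) <= tol


# Archaeology row of V_progdesc from the output above
archaeology: List[float] = [0.416881, 0.395154, 0.07838, 0.085196, 0.024389, 0.0]

code = top_letters(archaeology)
ok = sums_to_one(archaeology)
print(code, ok)
```

For this row the three highest letters are Realistic, Investigative, and Social, and the proportions pass the sum check.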