# 3 Generate the RIASEC vectors on the programmes

**Idea**: Lexicon score, concept and math

RIASEC gives six interest areas: Realistic, Investigative, Artistic, Social, Enterprising, and Conventional.

O*NET's Interest Profiler materials define these areas and show example activities and descriptors that map to each one. We build a small dictionary of words and short phrases for each area, then score any programme text by how strongly that text overlaps with each dictionary.

Method: **TF-IDF**

**Output:**
A six-number vector per programme. Each vector is L2-normalized, so it has unit length.
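
The scoring described in sections 3 and 4 can be written compactly. The notation is an assumption introduced here for illustration and does not appear in the notebook: $d$ is a cleaned text, $\mathrm{LEX}_\ell$ is the word list for letter $\ell$, and $w(t, d)$ is the TF-IDF weight of term $t$ in $d$.

```latex
s_\ell(d) = \sum_{t \in \mathrm{LEX}_\ell} w(t, d),
\qquad \ell \in \{R, I, A, S, E, C\}

v(d) = \frac{s(d)}{\lVert s(d) \rVert_2},
\qquad s(d) = \bigl(s_R(d), s_I(d), s_A(d), s_S(d), s_E(d), s_C(d)\bigr)
```

When a text contains no lexicon term, $s(d)$ is the zero vector and the code keeps it as zeros rather than dividing by zero.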


## 1. Imports, paths, and loader

In [1]:
from pathlib import Path
import json
import re
from typing import List, Dict, Iterable

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Files (forward slashes resolve correctly on Windows, macOS, and Linux)
PATH_PROG = Path("../data_programmes_courses/silver/df_programmes_silver.csv")   # programmes table
PATH_COUR = Path("../data_programmes_courses/silver/df_courses_silver.csv")      # courses table

# Read data
df_prog = pd.read_csv(PATH_PROG)
df_cour = pd.read_csv(PATH_COUR)


## 2. Text fields, cleaning, and programme texts by column
Select the programme and course text columns used for scoring; any column missing from the data is added as an empty string.

Clean each text field into a stable, lowercase, punctuation-free form.



In [2]:

# Keep only the columns we care about
PROG_FIELDS = [
    "sg_description",
    "vunl_description",
    "vunl_description_curriculum",
    "vunl_future_description",
    "vunl_future_career",
    "year1_description",
    "year2_description",
    "year3_description",
]

# Ensure all columns exist
for c in PROG_FIELDS:
    if c not in df_prog.columns:
        df_prog[c] = ""

COURSE_FIELDS = [
    "course_objective",
    "course_content",
    "method_of_assessment",
    "recommended_background_knowledge",
]
for c in COURSE_FIELDS:
    if c not in df_cour.columns:
        df_cour[c] = ""

# Simple cleaner
def clean_text(s: str) -> str:
    if not isinstance(s, str):
        return ""
    s = s.lower()
    s = re.sub(r"http[s]?://\S+", " ", s)
    s = re.sub(r"[^a-z\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s
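
To make the effect of the cleaner concrete, here is a small sketch that repeats the same regular expressions on two sample strings. The sample strings are made up for illustration. Note that the `[^a-z\s]` step removes digits and accented letters as well as punctuation, which may matter for non-English descriptions.

```python
import re

def clean_text(s: str) -> str:
    # Same steps as the notebook's cleaner: lowercase, drop URLs,
    # keep only a-z and whitespace, collapse runs of whitespace.
    if not isinstance(s, str):
        return ""
    s = s.lower()
    s = re.sub(r"http[s]?://\S+", " ", s)
    s = re.sub(r"[^a-z\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

# Hypothetical inputs, not taken from the programme data
cleaned = clean_text("Visit https://example.org/info! Students analyse 3D-models & data.")
accented = clean_text("Café 2024")
missing = clean_text(None)

print(repr(cleaned))   # 'visit students analyse d models data'
print(repr(accented))  # 'caf'
print(repr(missing))   # ''
```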



## 3. RIASEC seed lexicon and a vectoriser

- We build one combined text per programme and another per course.
- We then define a seed lexicon for the six letters.
- We fit TF-IDF on the real corpus but with a vocabulary restricted to the lexicon, so that only lexicon terms carry weight.
- We store the column indices for each letter so that the TF-IDF weights can be summed per letter.

In [3]:
# One combined programme text per row
df_prog["programme_text_clean"] = (
    df_prog[PROG_FIELDS].fillna("").agg(" ".join, axis=1).apply(clean_text)
)

# One combined course text per row
df_cour["course_text_clean"] = (
    df_cour[COURSE_FIELDS].fillna("").agg(" ".join, axis=1).apply(clean_text)
)

# Small seed lexicon. Expand later with O*NET style terms.
LEX: Dict[str, List[str]] = {
    "R": ["lab","field","equipment","tools","build","repair","operate","install","measure",
          "laboratory","prototype","machinery","hardware","electronics","sample","specimen","safety"
          ,"construction","manual","physical","technician","maintenance","inspection","diagnose","weld"],
    "I": ["analyze","theory","model","proof","derive","experiment","hypothesis","data",
          "research","statistics","algorithm","simulate","evidence","inference","mathematics","physics","logic"
          ,"quantitative","scientific","compute","computation","evaluate","study","investigate"],
    "A": ["design","draw","sketch","compose","write","narrative","visual","media","art",
          "music","film","theatre","creative","story","photography","gallery","curation"
          ,"performance","aesthetic","illustrate","exhibit","craft","fashion","style"],
    "S": ["help","support","advise","coach","teach","tutor","counsel","community","team",
          "care","wellbeing","interview","facilitate","mentor","outreach","collaborate","group","clients"
          ,"service","social","develop","train","educate"],
    "E": ["business","lead","manage","strategy","sales","marketing","finance","entrepreneurship",
          "pitch","negotiate","market","revenue","growth","product","stakeholder","budget","plan"
          ,"customer","commercial","operation","organisational","investor","network"],
    "C": ["organize","detail","procedure","policy","regulation","compliance","audit","accounting",
          "schedule","record","document","database","spreadsheet","report","inventory","forms","workflow","quality"
          ,"administration","logistics","systematic","process","standard"],
}


LETTERS = ["R","I","A","S","E","C"]

# Restrict TF IDF to these words so the scores are easy to explain
vocab = sorted({w for terms in LEX.values() for w in terms})
corpus = pd.concat([
    df_prog["programme_text_clean"],
    df_cour["course_text_clean"]
], ignore_index=True)

tfidf = TfidfVectorizer(vocabulary=vocab, ngram_range=(1, 1), norm="l2")
tfidf.fit(corpus.tolist())

# Map each letter to the feature indices
feat_names = np.array(tfidf.get_feature_names_out())
IDX = {L: np.where(np.isin(feat_names, LEX[L]))[0] for L in LETTERS}



## 4. Scoring functions, then programme and course vectors

- `riasec_from_text` sums the TF-IDF weights for each letter and L2-normalizes the six numbers, so the dot product of two vectors equals their cosine similarity.
- The programme side uses one combined text per programme.
- The course side builds one vector per course, averages them with ECTS weights, then normalizes the result.
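
The ECTS-weighted average in the last bullet can be checked by hand on two made-up courses. This is a sketch under stated assumptions: the course vectors and credit values are hypothetical, and `l2_normalize` is a simplified copy of the helper defined in the next cell.

```python
import numpy as np

def l2_normalize(vec: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # Divide by the Euclidean length; eps keeps a zero vector at zero
    return vec / (np.sqrt((vec * vec).sum()) + eps)

# Hypothetical course vectors in R, I, A, S, E, C order
course_vecs = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # a purely Realistic course
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],   # a purely Investigative course
])
ects = np.array([3.0, 9.0])            # hypothetical credits

weights = ects / ects.sum()            # [0.25, 0.75]
mean_vec = (weights.reshape(-1, 1) * course_vecs).sum(axis=0)
programme_vec = l2_normalize(mean_vec)

# With unit-length vectors, the dot product is the cosine similarity
investigative = l2_normalize(np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0]))
cosine = float(programme_vec @ investigative)

print(programme_vec.round(3), round(cosine, 3))
```

The 9-credit course dominates the result, which is the intended effect of weighting by ECTS.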

In [4]:

def l2_normalize(vec: np.ndarray, eps=1e-8) -> np.ndarray:
    z = np.sqrt((vec * vec).sum()) + eps
    return vec / z if z > 0 else vec

def riasec_from_text(text: str) -> np.ndarray:
    """Turn one cleaned text into a six-number vector with L2 norm 1 (all zeros if the text is empty or has no lexicon match)."""
    if not isinstance(text, str) or not text.strip():
        return np.zeros(6, dtype=float)
    row = tfidf.transform([text])
    sums = []
    for L in LETTERS:
        idx = IDX[L]
        val = float(row[:, idx].sum()) if idx.size else 0.0
        sums.append(val)
    return l2_normalize(np.array(sums, dtype=float))


In [5]:

# Programme description vectors, one row per programme_title
V_progdesc = (
    df_prog
      .groupby("programme_title", as_index=False)
      .agg(programme_text_clean=("programme_text_clean","first"))
)
V_progdesc[LETTERS] = V_progdesc["programme_text_clean"].apply(riasec_from_text).apply(pd.Series)

# Course vectors, credit weighted by ECTS, then one row per programme_title
df_cour["ects"] = pd.to_numeric(df_cour["ects"], errors="coerce").fillna(0.0)

def programme_course_vector(group: pd.DataFrame) -> np.ndarray:
    g = group[group["course_text_clean"].str.len() > 0]
    if g.empty:
        return np.zeros(6, dtype=float)
    weights = g["ects"].to_numpy(dtype=float)
    total = weights.sum()
    if total <= 0:
        weights = np.ones(len(g), dtype=float) / len(g)
    else:
        weights = weights / total
    mats = np.vstack([riasec_from_text(t) for t in g["course_text_clean"]])
    vec = (weights.reshape(-1,1) * mats).sum(axis=0)
    return l2_normalize(vec)

V_courses = (
    df_cour
      .groupby("programme_title")
      .apply(programme_course_vector)
      .reset_index(name="vec")
)
V_courses[LETTERS] = V_courses["vec"].apply(pd.Series)
V_courses = V_courses.drop(columns=["vec"])




## 5. Final blend and save


In [6]:

# Join by programme_title
P = V_progdesc.set_index("programme_title")[LETTERS]
C = V_courses.set_index("programme_title")[LETTERS]
titles = sorted(set(P.index) | set(C.index))
P = P.reindex(titles).fillna(0.0)
C = C.reindex(titles).fillna(0.0)

# L2 blend for matching
V_final_l2 = 0.5 * P.values + 0.5 * C.values
V_final_l2 = np.vstack([l2_normalize(r) for r in V_final_l2]) # re-normalize
DF_l2 = pd.DataFrame(V_final_l2, columns=LETTERS, index=titles).reset_index().rename(columns={"index":"programme_title"})

# Quick checks
assert np.allclose((DF_l2[LETTERS].to_numpy()**2).sum(axis=1), 1.0, atol=1e-6)
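
One caveat on the check above: a programme with no lexicon match in either its description or its courses keeps an all-zero vector, and the unit-norm assertion would then fail. The following sketch shows one possible way to separate such rows before asserting. The `demo` frame and its two titles are hypothetical, and whether zero rows occur in the real data is not known from this notebook.

```python
import numpy as np
import pandas as pd

LETTERS = ["R", "I", "A", "S", "E", "C"]

# Hypothetical data: one unit-length row and one all-zero row
demo = pd.DataFrame(
    [[0.6, 0.8, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]],
    columns=LETTERS,
    index=["With matches", "No matches"],
)

norms = np.sqrt((demo[LETTERS].to_numpy() ** 2).sum(axis=1))
is_zero = norms < 1e-6

# Only rows with at least one lexicon match are expected to have norm 1
assert np.allclose(norms[~is_zero], 1.0, atol=1e-6)

zero_titles = demo.index[is_zero].tolist()
print("Programmes without any lexicon match:", zero_titles)
```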


In [7]:
import numpy as np
import pandas as pd

# we list the columns that contain the vector entries
# change these names if your DataFrame uses different ones
VECTOR_COLS = ["R", "I", "A", "S", "E", "C"]

# we set the target entropy (computed with the natural log)
TARGET_ENTROPY = 1.18
TOL = 0.01


def entropy_from_probs(p: np.ndarray) -> float:
    """
    We compute Shannon entropy using natural log.
    We ignore zero entries to avoid log(0).
    """
    p = np.array(p, dtype=float)
    p = p[p > 0]
    if p.size == 0:
        return 0.0
    return float(-np.sum(p * np.log(p)))


def adjust_vector_to_target_entropy(vec: np.ndarray,
                                   target_entropy: float = TARGET_ENTROPY,
                                   tol: float = TOL):
    """
    We take an existing vector and find a temperature T for softmax(v / T)
    so that the entropy of the resulting probability vector is close to target_entropy.

    We return:
    - p: the probability vector that sums to one
    - v_l2: the same vector rescaled to have L2 norm equal to one
    - entropy: the final entropy we achieved
    - T: the temperature we ended up using
    """
    # we convert to a simple one dimensional numpy array
    v = np.array(vec, dtype=float).reshape(-1)

    # we set the search interval for the temperature
    # small T gives very peaked distributions (low entropy)
    # large T gives flatter distributions (high entropy)
    T_low, T_high = 0.01, 10.0

    p = None
    entropy = None
    T = None

    # we run a fixed number of iterations of binary search
    for _ in range(50):
        T = (T_low + T_high) / 2.0

        # we compute softmax(v / T)
        x = v / T
        x = x - x.max()  # we improve numerical stability
        exp_x = np.exp(x)
        p = exp_x / exp_x.sum()

        # we compute entropy of this probability vector
        entropy = entropy_from_probs(p)

        # we check if we are close enough to the target
        if abs(entropy - target_entropy) < tol:
            break

        # if entropy is greater than target, we want lower entropy, so we reduce T
        if entropy > target_entropy:
            T_high = T
        # otherwise entropy is lower than target, we increase T
        else:
            T_low = T

    # we finally create the L2 normalized version of p
    norm = np.linalg.norm(p)
    if norm == 0:
        # we protect against division by zero, we fall back to uniform vector
        v_l2 = np.ones_like(p) / np.sqrt(len(p))
    else:
        v_l2 = p / norm

    return p, v_l2, entropy, T


def adjust_row(row: pd.Series) -> pd.Series:
    """
    We take one programme row, adjust its vector to have the target entropy,
    and return a new row with updated vector and extra info.
    """
    # we extract the original vector values as numpy array
    original_vec = row[VECTOR_COLS].to_numpy(dtype=float)

    # we compute the original entropy (for diagnostics)
    original_probs = original_vec / original_vec.sum() if original_vec.sum() != 0 else np.ones_like(original_vec) / len(original_vec)
    original_entropy = entropy_from_probs(original_probs)

    # we adjust to target entropy
    p, v_l2, new_entropy, T_used = adjust_vector_to_target_entropy(original_vec,
                                                                   target_entropy=TARGET_ENTROPY,
                                                                   tol=TOL)

    # we create a copy of the row so we do not overwrite the original in place
    new_row = row.copy()

    # we replace the vector columns with the new L2 normalized vector
    for col, value in zip(VECTOR_COLS, v_l2):
        new_row[col] = value

    # we store diagnostics for later inspection
    new_row["entropy_original"] = original_entropy
    new_row["entropy"] = new_entropy
    new_row["temperature_used"] = T_used

    return new_row


# we apply the adjustment to all programmes in the DataFrame
df_adjusted = DF_l2.apply(adjust_row, axis=1)

# we can quickly inspect the change in entropy
print(df_adjusted[["programme_title", "entropy_original", "entropy", "temperature_used"]])


                          programme_title  entropy_original   entropy  \
0                         Ancient Studies          1.643295  1.178040   
1                             Archaeology          1.562317  1.170860   
2                 Artificial Intelligence          1.619523  1.171021   
3                     Biomedical Sciences          1.317086  1.178910   
4                      Business Analytics          1.497401  1.171738   
5   Communication and Information Studies          1.447639  1.172476   
6                        Computer Science          1.502733  1.173959   
7           Econometrics and Data Science          1.424194  1.176652   
8    Econometrics and Operations Research          1.384730  1.178846   
9        Economics and Business Economics          1.526821  1.171210   
10                                History          1.563773  1.179533   
11  International Business Administration          1.485759  1.181811   
12                 Literature and Society          
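
The binary search in `adjust_vector_to_target_entropy` relies on the entropy of softmax(v / T) rising as T rises. The sketch below checks this on one made-up vector. It also shows the upper bound ln(6), about 1.79, which is reached when all six entries are equal: for such a vector no temperature can bring the entropy down to the target. The vector `v` and the temperature grid are assumptions chosen for illustration.

```python
import numpy as np

def softmax_entropy(v, T: float) -> float:
    # Entropy (natural log) of softmax(v / T), computed stably
    x = np.asarray(v, dtype=float) / T
    x = x - x.max()
    p = np.exp(x) / np.exp(x).sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Hypothetical RIASEC vector, scaled to unit length
v = np.array([0.1, 0.7, 0.2, 0.4, 0.5, 0.2])
v = v / np.linalg.norm(v)

temps = [0.05, 0.2, 1.0, 5.0]
ents = [softmax_entropy(v, T) for T in temps]
max_entropy = float(np.log(6))

# A flat (here all-zero) vector gives the uniform distribution at any T
flat = softmax_entropy(np.zeros(6), 0.05)

for T, h in zip(temps, ents):
    print(f"T={T:<5} entropy={h:.3f}")
print(f"upper bound ln(6)={max_entropy:.3f}, flat vector entropy={flat:.3f}")
```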

In [8]:
# remove diagnostic columns before saving
df_adjusted = df_adjusted.drop(columns=["entropy_original", "temperature_used"]) 
# Save
outdir = Path("../data_RIASEC")
outdir.mkdir(parents=True, exist_ok=True)
df_adjusted.to_csv(outdir / "df_RIASEC_programmes_vectors_adjusted.csv", index=False)

In [None]:

# Save
outdir = Path("../data_RIASEC")
outdir.mkdir(parents=True, exist_ok=True)
DF_l2.to_csv(outdir / "df_RIASEC_programmes_vectors.csv", index=False)

print("Saved:", outdir / "df_RIASEC_programmes_vectors.csv")