# 2. Column Generation and Cleaning of the Data

We have two datasets coming from the scraping:

1. df_programmes
2. df_courses

**Purpose**
 1. Load the programme and course CSV files from the bronze layer
 2. Build clean text fields per programme
 3. Create a small seed RIASEC lexicon
 4. Compute simple lexicon scores using counts and TF-IDF
 5. Save tidy outputs for the vector notebook

In [1]:
import pandas as pd
import json, re
from pathlib import Path


## 1. Load data

In [25]:
bronze = Path("../data_programmes_courses/bronze") 
silver = Path("../data_programmes_courses/silver")

df_prog = pd.read_csv(bronze / "df_programmes_bronze.csv")
df_courses = pd.read_csv(bronze / "df_courses_bronze.csv")
print("df_programmes:", df_prog.shape)
print("df_courses:", df_courses.shape)


df_programmes: (17, 11)
df_courses: (845, 14)


In [9]:
# drop the track_from_label column
df_courses = df_courses.drop(columns=["track_from_label"], errors="ignore")
# remove duplicates
df_courses = df_courses.drop_duplicates()
print("df_courses after removing duplicates:", df_courses.shape)

df_courses after removing duplicates: (568, 13)


## 2. Course features
Builds a tidy table with level, methods, paragraph flags, and simple text metrics.

In [10]:
#  Clean course_programmes and build number_programmes

# extract the number inside "Hide full list (30)"
pat = r"Hide\s*full\s*list\s*\((\d+)\)"
df_courses["number_programmes"] = (
    df_courses["course_programmes"]
    .fillna("")
    .str.extract(pat, expand=False)
    .astype("Int64")
)

# remove the "Hide full list (30)" text
df_courses["course_programmes"] = (
    df_courses["course_programmes"]
    .fillna("")
    .str.replace(pat, "", regex=True)
    .str.replace(r"\s{2,}", " ", regex=True)
    .str.replace(r"(;\s*){2,}", "; ", regex=True)
    .str.strip(" ;")
)

# simple fallback when the number is missing
def guess_programme_count(s):
    s = (s or "").strip(" ;")
    if not s:
        return pd.NA
    parts = [p.strip() for p in s.split(";") if p.strip()]
    return len(parts) if parts else pd.NA

df_courses["number_programmes"] = df_courses["number_programmes"].fillna(
    df_courses["course_programmes"].map(guess_programme_count)
).astype("Int64")



# Expand course_paragraphs_json into tidy columns

targets = [
    "course_objective",
    "course_content",
    "additional_information_teaching_methods",
    "method_of_assessment",
    "literature",
    "additional_information_target_audience",
    "recommended_background_knowledge",
]

def norm_title(t):
    t = re.sub(r"[^\w\s]", " ", str(t)).lower()
    t = re.sub(r"\s+", " ", t).strip()
    return t.replace(" ", "_")

def parse_blocks(s):
    out = {k: None for k in targets}
    if not isinstance(s, str) or not s.strip():
        return out
    try:
        items = json.loads(s)
    except Exception:
        return out
    if not isinstance(items, list):
        return out
    for item in items:
        if not isinstance(item, str) or not item.strip():
            continue
        head, body = item.split("\n", 1) if "\n" in item else (item, "")
        key = norm_title(head)
        if key in out:
            out[key] = body.strip()
    return out

parsed = df_courses["course_paragraphs_json"].apply(parse_blocks)
for k in targets:
    df_courses[k] = parsed.map(lambda d: d.get(k))


# Make course_level an integer
df_courses["course_level"] = pd.to_numeric(df_courses["course_level"], errors="coerce").astype("Int64")


In [11]:
df_courses[
    ["course_programmes", "number_programmes"]
    + targets
].head(10)

Unnamed: 0,course_programmes,number_programmes,course_objective,course_content,additional_information_teaching_methods,method_of_assessment,literature,additional_information_target_audience,recommended_background_knowledge
0,Ancient Studies; Archaeology; Artificial Intel...,21,"At VU Amsterdam, having a strong command of th...",The test consists of multiplechoice and fill-i...,The language proficiency test will take place ...,The language proficiency test is a digital tes...,,,
1,Ancient Studies,1,A distinct feature of Ancient Studies is the c...,In this course you will familiarize yourself w...,Lectures and seminars (3 x 2hrs p/w),Multiple choice exam (40%)\nSet-up for writing...,,,
2,Ancient Studies; Archaeology; Communication an...,8,This first canon module (followed up by a modu...,This course prepares for the module ‘Canon II’...,"Lectures/seminars, 2 x 2 hours\nThe first half...",Oral presentation of a group project (40% of t...,A selection of articles and book chapters whic...,First year students --> Bachelor's in Ancient ...,
3,Ancient Studies; Archaeology; Greek and Latin ...,3,After succesfully completing this course you\n...,This course offers you an overview of the majo...,Lectures (twice a week) and seminars (once eve...,"Mid term exam (10%), final exam (75%) and assi...","De Blois, L. & R. J. van der Spek 2019: An Int...","First-year ACASA students of Classics, Archaeo...",
4,Ancient Studies; History,2,After completing this course:\nThe student is ...,"For a long time, Western scholars were mostly ...",One lecture and one seminar per week.,Written assignments (not graded)\nMid-term wri...,"A selection of articles and chapters, provided...",First year students in Bachelor Oudheidwetensc...,First year students in ACASA Bachelor in Oudhe...
5,Ancient Studies; Archaeology; Greek and Latin ...,3,The days of classical antiquity are irretrieva...,"In this module, students are introduced to the...","This course consists of lectures, seminars and...",Participation: assignments in preparation of a...,,"First-year ACASA students of Classics, Archaeo...",
6,Ancient Studies; Archaeology; Greek and Latin ...,3,"Upon completion of this course, students will:...",This course offers an overview of the major hi...,Lectures twice a week (2x2 hours/week); semina...,Two seminar assignments (individual written wo...,You're expected to buy the following handbooks...,"First-year ACASA students of Classics, Archaeo...",
7,Ancient Studies; Archaeology,2,After following this course the student\nHas a...,This course is a students’ first encounter wit...,The course consists of 6 lectures and 6 semina...,Written examination at the end of the course (...,Literature is provided at the start of the cou...,first year bachelor students of Archaeology\ns...,Knowledge from archaeology courses in the firs...
8,Ancient Studies; Archaeology; Communication an...,8,This is an UvA course. Please follow the link ...,This is an UvA course. Please follow the link ...,This is an UvA course. Please follow the link ...,This is an UvA course. Please follow the link ...,,tba,
9,Ancient Studies; Archaeology; Communication an...,8,"After successful completion of this course, st...",This course is designed to teach students with...,Seminars (twice a week).,Participation (40%)\nParticipation will be ass...,For this course you will need three items: a g...,BA students of Ancient Studies (Oudheidwetensc...,
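
The cleaning of `course_programmes` is easier to follow on a tiny example. The sketch below uses a made-up three-row frame (the values are illustrative, not scraped data) and applies the same pattern as the cell above: read the count from "Hide full list (N)" when it is present, otherwise fall back to counting the `;`-separated names. It uses `pd.to_numeric` before the `Int64` cast, which is an assumption of this sketch rather than what the notebook does.

```python
import pandas as pd

# Toy stand-in for the scraped column (hypothetical values).
toy = pd.DataFrame({"course_programmes": [
    "Hide full list (3) Ancient Studies; Archaeology; History",
    "Philosophy; ; History",
    None,
]})

pat = r"Hide\s*full\s*list\s*\((\d+)\)"
raw = toy["course_programmes"].fillna("")

# Count stated on the page, when the widget text is present.
stated = pd.to_numeric(raw.str.extract(pat, expand=False), errors="coerce").astype("Int64")

# Remove the widget text and tidy up separators.
cleaned = (
    raw.str.replace(pat, "", regex=True)
    .str.replace(r"(;\s*){2,}", "; ", regex=True)
    .str.strip(" ;")
)

def count_items(s):
    """Count non-empty ';'-separated names, or return NA for an empty string."""
    parts = [p for p in s.split(";") if p.strip()]
    return len(parts) if parts else pd.NA

# Use the stated count first, then the fallback count.
fallback = cleaned.map(count_items).astype("Int64")
number = stated.fillna(fallback)
print(pd.DataFrame({"course_programmes": cleaned, "number_programmes": number}))
```

Rows without any programme name keep a missing count, which keeps "unknown" distinct from "zero programmes".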


## 3. Programme features
Splits the scraped description blocks into one description column per study year.

In [12]:
# Parse 'vunl_firstyear_description_blocks' into three year description columns
import re, ast
import pandas as pd

def ensure_list(x):
    # accept real lists or a string that looks like a list
    if isinstance(x, list):
        return x
    if isinstance(x, str):
        x = x.strip()
        if x.startswith("[") and x.endswith("]"):
            try:
                return ast.literal_eval(x)
            except Exception:
                return [x]
        if x:
            return [x]
    return []

def extract_year_descriptions(blocks):
    out = {"year1_description": None, "year2_description": None, "year3_description": None}
    for raw in ensure_list(blocks):
        s = str(raw).strip()
        m = re.match(r"^\s*(first|second|third)\s*year", s, flags=re.I)
        if not m:
            continue
        which = m.group(1).lower()
        body = re.sub(r"^\s*(first|second|third)\s*year\s*", "", s, flags=re.I).strip()
        col = {"first": "year1_description", "second": "year2_description", "third": "year3_description"}[which]
        prev = out[col]
        # keep the longest version when duplicates exist
        if prev is None or len(body) > len(prev):
            out[col] = body
    # fill empty with empty string for easier use
    for k in out:
        if out[k] is None:
            out[k] = ""
    return pd.Series(out)

df_prog[["year1_description", "year2_description", "year3_description"]] = (
    df_prog["vunl_firstyear_description_blocks"].apply(extract_year_descriptions)
)

# drop the source column
df_prog = df_prog.drop(columns=["vunl_firstyear_description_blocks"])

# quick peek
df_prog[["programme_title", "year1_description", "year2_description", "year3_description"]].head(6)



Unnamed: 0,programme_title,year1_description,year2_description,year3_description
0,Ancient Studies,"In the first year, you acquire a solid groundi...","In your second and third year, you will follow...","In the third year, you will write a Bachelor’s..."
1,Archaeology,You acquire a solid foundation of historical k...,The second year deepens your knowledge of arch...,"In the third year, you can follow aminorto eit..."
2,Artificial Intelligence,"In the first year, students from both tracks t...","In the second year, 40% of the courses are com...","In the third year, you can take aminorof your ..."
3,Biomedical Sciences,,,
4,Business Analytics,"In the first year, you are introduced to compu...","In the second year, you dive deeper into the k...","In the first semester, you choose a minor to s..."
5,Communication and Information Studies,,,
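
The year split relies on each block starting with "First year", "Second year" or "Third year". A minimal sketch of that prefix match on made-up blocks is shown below; the helper name `split_year_block` is hypothetical and only illustrates the regular expression used above.

```python
import re

# Same prefix pattern as in the cell above, compiled once.
YEAR_PREFIX = re.compile(r"^\s*(first|second|third)\s*year\s*", flags=re.I)

def split_year_block(text):
    """Return (year_number, body) if the block starts with a year prefix, else None."""
    m = YEAR_PREFIX.match(text)
    if not m:
        return None
    year = {"first": 1, "second": 2, "third": 3}[m.group(1).lower()]
    return year, text[m.end():].strip()

# Hypothetical blocks, not scraped text.
blocks = [
    "First year In the first year, you take introductory courses.",
    "Second year\nYou choose a specialisation.",
    "Programme overview",  # no year prefix, so it is ignored
]
parsed = [split_year_block(b) for b in blocks]
print(parsed)
```

Blocks without the prefix return `None`. A programme whose blocks never start with a year label ends up with empty descriptions, which would explain rows 3 and 5 in the preview above.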


## 4. Handling one programme with missing data

In [13]:
title_key = "Econometrics and Data Science"

vunl_description_txt = """In today’s society, massive amounts of data are collected. But how is all that data used? How can a bank efficiently combine econometric models and machine learning methods to predict the expected inflation in a country? Which time series and statistical methods can a supermarket use to forecast the inventory levels of fresh products, such as vegetables and fruits, based on seasonal influences and consumer demand? Additionally, how can a soft drink company make a reliable quantitative analysis of the impact of a television advertisement on the sales of a specific product?
If you’re curious to find out, we’re curious to meet you.

In years 1 and 2, you can choose to follow parts of the education, such as exams and tutorials, in Dutch. Econometrics and Data Science can also be followed entirely in English."""

vunl_description_curriculum_txt = """If you choose to study Econometrics and Data Science, you’ll first get a broad and solid foundation in mathematics, programming and data science. Afterwards, you will be further trained in computer science, econometrics, machine learning and statistics. These are important tools in our data driven society to analyze and understand, for instance, financial and economic data and to make predictions for such data.
For instance, how to design machine learning methods that ensure that all customers in the financial sector have fair access to financial services, regardless of their background? Thanks to super fast computers the Federal Reserve Bank of St. Louis has a large data set with hundreds of macroeconomic variables. Which traps are there when analyzing such big data and how to avoid them?

You will attend lectures covering theoretical concepts, engage in group assignments, analyze case studies, and gain firsthand experience with various companies. Your instructors are experts at the forefront of their respective fields, actively participating in research, and some also hold positions in the business sector when not teaching at the university. This dual involvement ensures that the program remains both relevant and up to date. The Econometrics and Data Science programme has been rated as a “topopleiding” high quality education by the Keuzegids Universiteiten four years in a row. Kraket, our active study association, adds a fun social side to your programme, organizes careers events, invites representatives from large companies such as KLM, and even plans trips to companies abroad!

Are you curious about the differences between the bachelor’s programmes in Econometrics and Data Science, Econometrics and Operations Research, and Business Analytics? Then check out this comparison chart!"""

vunl_future_description_txt = """Vast amounts of data are being collected every second. And businesses, governments and societies at large need people who can take these large data sets and summarise, analyse, interpret and present them, often to other stakeholders who are not experts. As a graduate in Econometrics and Data Science, you are the ideal person for the job.
The majority of graduates from the Bachelor’s programme in Econometrics and Data Science go on to do a Master’s or a double Master’s in a related field, or continue with a PhD at the VU.

Whatever you choose to do, you will have excellent quantitative and problem solving, communication and presentation skills. You will leave with a large network of like minded peers, which will put you ahead of the crowd in your career. And the experience you have gained during the programme will make you resilient and ready to take on the world.

Are you curious to know what kind of jobs are available after graduation? Then read about the different options at a brewery here. And immediately see the differences with related bachelor programmes."""

year1_txt = """In your first year you will receive a broad introduction to data science. You will develop your methodological skills in data analysis, linear algebra, probability, and statistics. You will receive an introduction to macroeconomics and to finance, and you will start learning how to program. You will also learn key skills such as academic writing and how to cite sources. Almost all the first year courses for operations research and data science are the same, so it is easy to switch tracks if you discover you are more interested in operations research after having started."""

year2_txt = """Your second year will build on your core foundation. You will deepen your methodological skills when it comes to econometrics, computer science, and statistics. You will learn how to set up and structure a database, for example, and how to create and work with algorithms. You will deal with statistical models for multivariate data. Plus, you will study the ethical dilemmas behind using data. You will work on real life case studies in small groups, in which you will use the data analysis techniques you have learned to develop practical solutions. You will also report on and present the results of your project, learning how to give and receive feedback."""

year3_txt = """In your third year, you will broaden your horizons by choosing a minor, either within the faculty or outside it. Alternatively, you can study abroad at one of VU Amsterdam’s partner universities. In the second half of the year, you will follow in depth courses on machine learning and multivariate econometrics, plus you will write a Bachelor’s thesis on a subject of your interest. For example, if a bank fails, what is the risk to other banks within the same financial system?"""

admissions_full_txt = """I have non Dutch previous education
Note that for diplomas obtained outside of the Netherlands an application fee of 100 euros applies.
Application fee payment options and possible exemptions

Admission Requirements
Applicants holding a non Dutch pre university diploma apply via the International Office. We check if your previous education meets a number of requirements. If you do not yet meet the requirements but expect to do so in the future, such as obtaining your diploma, you can already apply. We will evaluate your application and inform you of our admission decision.

Overview of IB Diploma requirements per programme PDF
Overview of GCE A level requirements per programme PDF
Overview of College Board Advanced Placement AP requirements per programme PDF

Requirements that apply
1. A diploma equivalent to the Dutch pre university VWO diploma
See the Diploma Requirement List for examples of accepted diplomas per country. This list is meant to give you an indication of admissibility. No rights can be derived from it.

2. Proof of sufficient proficiency in English
You can find all accepted tests and scores on our Language Requirements webpage. Although complete applications are preferred, you can begin your application before you have completed the test and then submit your passing score once you have been conditionally admitted.

3. Proof of sufficient proficiency in Mathematics
After you have applied for the programme and uploaded the required documents in your VU Dashboard, the International Office will determine whether your diploma is equivalent to the Dutch VWO diploma and whether your mathematics level is sufficient, equivalent to VWO Mathematics B. Examples of diplomas that demonstrate sufficient proficiency in mathematics
International Baccalaureate Mathematics HL, Analysis and Approaches HL
United Kingdom GCE A levels A level in mathematics completed with a grade A, B or C
Germany Zeugnis der allgemeinen Hochschulreife, including Mathematics on erhöhtem Anforderungsniveau eA or as Leistungsfach
European Baccalaureate Mathematics, written or oral examination, at least 5 hours during the Orientation Cycle
College Board AP scores AP Calculus BC minimum score 3

Application documents
Scan of your passport or national ID card ID for EEA students only valid at the start date of the programme
VU Application Form Bachelor. In the document you are asked to provide further details about your previous education level.
Proof of English language proficiency if already obtained. Upload your proof of English language proficiency or English language test results.

Application procedure and deadlines
The final application deadline for non EU EEA students is 1 April and for Dutch and EU students the final deadline is 1 May.
If you have a non Dutch nationality you may be eligible for housing via the International Office Accommodation Services. An early application is strongly recommended.

If your diploma is not considered to be at the right level and or if your proficiency in mathematics is considered to be insufficient, you may meet the requirements with additional certificates. Recognized options include
Boswell Beta English. Boswell Beta in Utrecht provides a mathematics B course in English with exams in December, May and July.
CCVX Dutch and English. CCVX offers mathematics exams equivalent to the Dutch VWO mathematics B level.
Online Mathematics Placement Test. OMPT B is an online mathematics test with proctoring. A positive result, 5.5 out of 10 or 60 percent in OMPT B, is compulsory for admission. A maximum of two attempts per year is allowed.

Additional entry exam
Applicants who do not meet the diploma requirement level will also be asked to pass the following test
History see the VU History test information page.
Apply before 15 December if you are applying via the 21 plus Entrance exam route. Apply before 1 April for all other routes.

I have Dutch previous education
VWO diploma
Natuur en Gezondheid, supplemented with Mathematics B
Natuur en Techniek
Economie en Maatschappij, supplemented with Mathematics B
Cultuur en Maatschappij, supplemented with Mathematics B

Higher professional education HBO propaedeutic year
Obtain additionally English at 6 VWO level and Mathematics B at 6 VWO level

Higher professional education HBO completed programme
Obtain additionally English at 6 VWO level and Mathematics B at 6 VWO level
This does not apply to a completed English taught HBO bachelor

Ready to apply
Click to see the application procedure. Complete your application to 100 percent in your VU dashboard within six weeks and no later than one week after the application deadline closes."""

# create columns if they do not exist yet
for col in ["vunl_description", "vunl_description_curriculum", "vunl_future_description",
            "year1_description", "year2_description", "year3_description",
            "vunl_admission_dutch_diploma"]:
    if col not in df_prog.columns:
        df_prog[col] = ""

mask = df_prog["programme_title"].eq(title_key)

df_prog.loc[mask, "vunl_description"] = vunl_description_txt
df_prog.loc[mask, "vunl_description_curriculum"] = vunl_description_curriculum_txt
df_prog.loc[mask, "vunl_future_description"] = vunl_future_description_txt
df_prog.loc[mask, "year1_description"] = year1_txt
df_prog.loc[mask, "year2_description"] = year2_txt
df_prog.loc[mask, "year3_description"] = year3_txt
df_prog.loc[mask, "vunl_admission_dutch_diploma"] = admissions_full_txt

# optional save
# df_prog.to_csv("./data/df_programmes_patched.csv", index=False, encoding="utf-8-sig")

# small check
df_prog.loc[mask, ["programme_title", "vunl_description", "vunl_description_curriculum",
                   "vunl_future_description", "year1_description", "year2_description",
                   "year3_description", "vunl_admission_dutch_diploma"]].head(1)

Unnamed: 0,programme_title,vunl_description,vunl_description_curriculum,vunl_future_description,year1_description,year2_description,year3_description,vunl_admission_dutch_diploma
7,Econometrics and Data Science,"n today’s society, massive amounts of data are...",If you choose to study Econometrics and Data S...,Vast amounts of data are being collected every...,In your first year you will receive a broad in...,Your second year will build on your core found...,"In your third year, you will broaden your hori...",I have non Dutch previous education\nNote that...
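
A manual patch like this silently does nothing if the title does not match any row. The sketch below, on a made-up two-row frame with placeholder texts, shows one way to guard against that: keep the patch in a dictionary and check that the mask selects exactly one programme before writing. The dictionary and the check are assumptions of this sketch, not part of the notebook.

```python
import pandas as pd

# Toy programme table (hypothetical values).
df = pd.DataFrame({
    "programme_title": ["Business Analytics", "Econometrics and Data Science"],
    "vunl_description": ["Existing text", None],
})

# Column name -> manually collected text (placeholders here).
patch = {
    "vunl_description": "Manually copied description",
    "year1_description": "Manually copied first-year text",
}

mask = df["programme_title"].eq("Econometrics and Data Science")
if mask.sum() != 1:
    raise ValueError(f"Expected exactly one matching programme, found {mask.sum()}")

for col, text in patch.items():
    if col not in df.columns:
        df[col] = ""  # create the column with an empty default
    df.loc[mask, col] = text

print(df)
```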


## 5. First analysis


### 5.1 Missing values

We first count the missing values per column and then try to impute them.

In [14]:
# Check how many missings in every column
print("These are the missings in the df_programmes:\n",df_prog.isna().sum())
print("These are the missings in the df_courses:\n", df_courses.isna().sum()) 

These are the missings in the df_programmes:
 programme_title                 0
programme_url                   0
sg_description                  0
info_links                      0
vunl_base_url                   1
vunl_description                0
vunl_description_curriculum     2
vunl_future_description         2
vunl_future_career              3
vunl_admission_dutch_diploma    2
year1_description               0
year2_description               0
year3_description               0
dtype: int64
These are the missings in the df_courses:
 code                                         0
course_name                                  0
programme_title                              0
faculty                                     13
programme_url                                0
year_num                                   216
period                                      43
ects                                         0
course_level                                 0
course_coordinator               

### 5.2 Missing year_num
year_num is the study year of the course. Very likely some courses are not tied to a specific year. We could assign them 0 or another dedicated value.

#### 5.2.1 Relation between year_num and course level

In [15]:
# Check the relation between year_num and other variables

# Rows with a missing year_num, with ects and period cast to numbers
cols_view = ["programme_title", "code", "course_name", "ects", "period", "course_level"]
miss_year_rows = (
    df_courses[df_courses["year_num"].isna()][cols_view]
    .copy()
    .assign(
        ects=pd.to_numeric(df_courses.loc[df_courses["year_num"].isna(), "ects"], errors="coerce"),
        period=pd.to_numeric(df_courses.loc[df_courses["year_num"].isna(), "period"], errors="coerce")
    )
    .sort_values(["programme_title", "code"])
    .reset_index(drop=True)
)
miss_year_rows.head(20)  # preview first rows
# miss_year_rows.to_csv("missing_year_num_rows.csv", index=False)  # optional save

# Count missing year_num per programme (one row per programme)
miss_year_counts = (
    df_courses["year_num"].isna()
    .groupby(df_courses["programme_title"])
    .sum()
    .rename("missing_year_num")
    .reset_index()
    .sort_values("missing_year_num", ascending=False)
    .reset_index(drop=True)
)
miss_year_counts.head(20)  # preview top programmes
# miss_year_counts.to_csv("missing_year_num_by_programme.csv", index=False)  # optional save



Unnamed: 0,programme_title,missing_year_num
0,Economics and Business Economics,50
1,Econometrics and Operations Research,35
2,Mathematics,29
3,"Media, Art, Design and Architecture",26
4,Econometrics and Data Science,24
5,International Business Administration,22
6,Business Analytics,10
7,Computer Science,6
8,Artificial Intelligence,6
9,"Philosophy, Politics and Economics",4
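
To look at the relation with the course level directly, the same grouping idea can be applied to `course_level` instead of `programme_title`. The sketch below runs on made-up data and only illustrates the pattern; the column names `missing` and `courses` are chosen for this example.

```python
import pandas as pd

# Toy course table (hypothetical values).
toy = pd.DataFrame({
    "year_num": [1, None, None, 2, None],
    "course_level": [100, 100, 300, 200, 300],
})

# Per course level: how many rows lack year_num, out of how many courses.
missing_by_level = (
    toy["year_num"].isna()
    .groupby(toy["course_level"])
    .agg(["sum", "size"])
    .rename(columns={"sum": "missing", "size": "courses"})
)
print(missing_by_level)
```

If most missing values sit at levels that map cleanly to a study year, imputing year_num from the level is a reasonable first approach.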


In [16]:
# Solution
# Impute year_num from the course level: 1 if 100, 2 if 200, 3 if 300 or 400
def imputate_year_num(row): 
    if pd.isna(row["year_num"]):
        level = row["course_level"]
        if level == 100:
            return 1
        elif level == 200:
            return 2
        elif level in [300, 400]:
            return 3
    return row["year_num"] 
df_courses["year_num"] = df_courses.apply(imputate_year_num, axis=1)



Takeaways:

1. At Vrije Universiteit Amsterdam (VU), course levels are numeric codes: 100-level courses are for first-year bachelor's students, 200-level for second-year, and 300-level for third-year. Higher numbers generally indicate more advanced courses; 400-level often corresponds to master's-level courses and 500-level to more specialised postgraduate courses.\
**Solution**: impute year_num from the course level: 1 if 100, 2 if 200, 3 if 300 or 400.
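
The level-to-year rule can also be written as a dictionary lookup instead of a row-wise function. This is a sketch on made-up data, assuming `course_level` has no missing values (as in the counts above). Levels outside the mapping, such as 500, stay missing.

```python
import pandas as pd

# Mapping described in the takeaway: 100 -> 1, 200 -> 2, 300 and 400 -> 3.
level_to_year = {100: 1, 200: 2, 300: 3, 400: 3}

# Toy course table (hypothetical values).
toy = pd.DataFrame({
    "year_num": [1.0, None, None, None, 2.0],
    "course_level": [100, 200, 400, 500, 300],
})

# Only missing year_num values are filled; existing values are kept as they are.
imputed = toy["year_num"].fillna(toy["course_level"].map(level_to_year))
print(imputed.tolist())
```

The last row shows that a course with a known year keeps it, even when its level would suggest a different year.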


### 5.3 Relation between year_num and programme
Check the number of courses per programme.

In [17]:
# number of courses per programme
courses_per_programme = (
    df_courses.groupby("programme_title")["code"]
    .count()
    .rename("number_of_courses")
    .reset_index()
    .sort_values("number_of_courses", ascending=False)
    .reset_index(drop=True)
)
courses_per_programme.head(20)

Unnamed: 0,programme_title,number_of_courses
0,History,70
1,Communication and Information Studies,62
2,Economics and Business Economics,61
3,Literature and Society,57
4,Econometrics and Operations Research,57
5,Ancient Studies,56
6,Philosophy,44
7,Mathematics,29
8,Archaeology,27
9,"Media, Art, Design and Architecture",26


Takeaways:
1. The number of courses per programme looks wrong for several programmes, probably because of an issue in the scraping code.\
**Solution 1**: for now, remove the programmes with too few courses. Keep only the programmes with at least 2 first-year courses in period 1 worth 6 ECTS (core courses).\
**Solution 2**: modify the scraping code to fetch more courses.

In [18]:
# Solution

# Select first-year core courses: period 1 and 6 ECTS
# (the "at least 2 per programme" filter is applied in a later cell)
mask_keep = (
    (df_courses['year_num'] == 1)  
    & (pd.to_numeric(df_courses["period"], errors="coerce") == 1)
    & (pd.to_numeric(df_courses["ects"], errors="coerce") == 6)
)   

df_courses_tasks = df_courses[mask_keep].reset_index(drop=True)
print(df_courses_tasks.shape)

print("These are the programmes kept:\n", df_courses_tasks["programme_title"].value_counts())


(23, 21)
These are the programmes kept:
 programme_title
Econometrics and Operations Research     4
Literature and Society                   4
Philosophy                               4
Communication and Information Studies    3
Ancient Studies                          2
History                                  2
Archaeology                              2
Economics and Business Economics         1
International Business Administration    1
Name: count, dtype: int64


#### Filter programmes and save the files

In [23]:
# keep only programmes that have at least 2 core courses
programme_counts = df_courses_tasks["programme_title"].value_counts()
programmes_to_keep = programme_counts[programme_counts >= 2].index
df_courses_tasks = df_courses_tasks[df_courses_tasks["programme_title"].isin(programmes_to_keep)].reset_index(drop=True)
print("After keeping only programmes with at least 2 courses we have", len(programmes_to_keep), " programmes, the shape is:", df_courses_tasks.shape)

After keeping only programmes with at least 2 courses we have 7  programmes, the shape is: (21, 21)
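
The two filtering steps (select core courses, then keep programmes with at least two of them) can be checked on a small made-up table. The sketch below uses a grouped `transform` to attach the per-programme count to each row. This is an alternative to the `value_counts` approach above, not the notebook's method.

```python
import pandas as pd

# Toy course table (hypothetical values); period and ects arrive as strings.
toy = pd.DataFrame({
    "programme_title": ["A", "A", "B", "C", "C", "C"],
    "year_num": [1, 1, 1, 1, 2, 1],
    "period": ["1", "1", "1", "1", "1", "2"],
    "ects": ["6", "6", "6", "6", "6", "6"],
})

# Step 1: first-year courses in period 1 worth 6 ECTS.
is_core = (
    toy["year_num"].eq(1)
    & pd.to_numeric(toy["period"], errors="coerce").eq(1)
    & pd.to_numeric(toy["ects"], errors="coerce").eq(6)
)
core = toy[is_core]

# Step 2: keep programmes with at least 2 core courses.
n_core = core.groupby("programme_title")["programme_title"].transform("count")
kept = core[n_core >= 2].reset_index(drop=True)
print(kept)
```

Here only programme A survives: B has a single core course, and C loses two of its three rows in step 1.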


In [24]:
# save final files (silver dataset)
df_prog.to_csv(silver / "df_programmes_silver.csv", index=False, encoding="utf-8-sig")
df_courses.to_csv(silver / "df_courses_silver.csv", index=False, encoding="utf-8-sig")

# This is the dataset we will use for the program vectors generation
## only the programmes that are inside the programmes_to_keep
df_prog_filtered = df_prog[df_prog["programme_title"].isin(programmes_to_keep)]
df_prog_filtered.to_csv(silver / "df_programmes_filtered_silver.csv", index=False, encoding="utf-8-sig")
## only the courses that are inside the programmes_to_keep
df_courses_filtered = df_courses[df_courses["programme_title"].isin(programmes_to_keep)]
df_courses_filtered.to_csv(silver / "df_courses_filtered_silver.csv", index=False, encoding="utf-8-sig")

# This is the dataset we will use for the tasks generation
df_courses_tasks.to_csv(silver / "df_courses_tasks_silver.csv", index=False, encoding="utf-8-sig")
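
Before relying on the silver files, it can help to confirm that a write and read round trip preserves the text, including non-ASCII characters such as the curly apostrophes in the descriptions. The sketch below does this in a temporary directory with a one-row toy frame. The file name `toy_silver.csv` and the `mkdir` call are assumptions of this example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# One-row toy frame with a non-ASCII character (hypothetical values).
toy = pd.DataFrame({
    "programme_title": ["Philosophy"],
    "year1_description": ["Bachelor’s thesis"],
})

with tempfile.TemporaryDirectory() as tmp:
    silver = Path(tmp) / "silver"
    silver.mkdir(parents=True, exist_ok=True)  # make sure the output folder exists
    out = silver / "toy_silver.csv"
    toy.to_csv(out, index=False, encoding="utf-8-sig")
    back = pd.read_csv(out, encoding="utf-8-sig")

print(back)
```

Reading with the same `utf-8-sig` encoding strips the byte order mark, so the first column name comes back unchanged.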


