
# **SECTION 1 - Data Preparation & Text Preprocessing**

This section prepares the raw curriculum dataset for analysis.  
It includes:

- library imports  
- environment setup  
- dataset loading & validation  
- structural cleaning  
- shared-module handling  
- categorical normalisation  
- construction of `text_combined`  
- *full text preprocessing pipeline*, including:
  - tokenisation  
  - lemmatisation  
  - domain-specific stopwords  
  - synonym mapping  
  - entity resolution  
  - bigram/trigram phrase detection  

This ensures all downstream NLP, skill analysis, and topic modelling use clean and consistent text.


## 1.1 Import Libraries and Setup

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [None]:
# Core libraries
import pandas as pd
import numpy as np
import re
import string

# NLP
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Gensim for bigrams/trigrams
from gensim.models.phrases import Phrases, Phraser

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Download corpora
nltk.download("stopwords")
nltk.download("wordnet")

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Set random seeds
import random
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


### 1.1.1 Log Versions

In [None]:
# Log versions for reproducibility
print("PACKAGE VERSIONS:")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
print(f"spaCy: {spacy.__version__}")
print(f"NLTK: {nltk.__version__}")


PACKAGE VERSIONS:
pandas: 2.2.2
numpy: 2.0.2
spaCy: 3.8.11
NLTK: 3.9.1


## **1.2 Loading the Dataset**

We now load the dataset containing module-level curriculum information from 8 universities in London and Hong Kong.

### Validation steps include:
- verifying the schema  
- checking for missing values  
- confirming character encodings  
- retaining only necessary columns  

The dataset will serve as the core input for later NLP and topic modelling steps.
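
The schema and missing-value checks listed above could be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's own validation code: `validate_schema` and `REQUIRED_COLUMNS` are hypothetical names, and the column list is only a subset of the raw spreadsheet headers.

```python
import pandas as pd

# Hypothetical subset of the raw spreadsheet headers, for illustration only
REQUIRED_COLUMNS = ["City", "Program_Name", "Module_Title",
                    "Module_Description", "Core_or_Elective"]

def validate_schema(frame, required):
    """Return the required columns that are missing and null counts for those present."""
    missing = [col for col in required if col not in frame.columns]
    present = [col for col in required if col in frame.columns]
    null_counts = frame[present].isnull().sum().to_dict()
    return missing, null_counts

# Toy data: one required column is absent and one description is missing
toy = pd.DataFrame({
    "City": ["HK", "London"],
    "Program_Name": ["BSc A", "MSc B"],
    "Module_Title": ["Auditing", "Statistics"],
    "Module_Description": ["Covers audit practice.", None],
})

missing, null_counts = validate_schema(toy, REQUIRED_COLUMNS)
print("Missing columns:", missing)
print("Null counts:", null_counts)
```

Running a check like this immediately after loading makes a renamed or dropped spreadsheet column fail loudly instead of surfacing later in the pipeline.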


In [None]:
df = pd.read_excel('Data Science Curricula.xlsx')
df.head(2)

Unnamed: 0,City,Univeristy,Program_Name,Faculty,Level,Module_Code,Module_Title,Module_Description,Core_or_Elective,Link
0,HK,CityU,Bsc in Business Decision Analytics,Department of Decision Analytics and Operations,Bachelor,AC4301,Corporate Accounting II,This course aims to:\n\ndevelop students' conc...,0,https://www.cb.cityu.edu.hk/dao/bbabda/intro
1,HK,CityU,Bsc in Business Decision Analytics,Department of Decision Analytics and Operations,Bachelor,AC4342,Auditing,The primary objective of this course is to pro...,0,https://www.cb.cityu.edu.hk/dao/bbabda/intro


## **1.3 Structural Cleaning**

This step ensures the dataset is structurally consistent and analytically ready.

### Steps include:
1. Drop modules with missing descriptions  
2. Identify and remove exact duplicates  
3. Create a robust composite module ID  
   - `module_id = university + module_title + level`
4. Identify and label shared modules across programmes  
   - Store programme counts for weighting later  
5. Explain consequences for downstream analysis  
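
The composite ID and shared-module flag can be illustrated on a toy frame. This sketch is an assumption-laden variant rather than the notebook's exact method: it counts *distinct* programmes per `module_id` with `nunique`, whereas the notebook counts rows with `.size()`. The two agree whenever each programme lists a module only once.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "university": ["CityU", "CityU", "CityU", "HKU"],
    "program_name": ["BSc A", "BSc B", "BSc A", "MSc C"],
    "module_title": ["Auditing", "Auditing ", "Statistics", "Auditing"],
    "level": ["Bachelor", "Bachelor", "Bachelor", "Master"],
})

# Normalise each component before joining so that stray whitespace or
# capitalisation does not produce two IDs for the same module
parts = [toy[col].str.lower().str.strip()
         for col in ("university", "module_title", "level")]
toy["module_id"] = parts[0] + "_" + parts[1] + "_" + parts[2]

# Number of distinct programmes that list each module
toy["program_count"] = (
    toy.groupby("module_id")["program_name"].transform("nunique")
)
toy["is_shared_module"] = (toy["program_count"] > 1).astype(int)

print(toy[["module_id", "program_count", "is_shared_module"]])
```

Here the two CityU "Auditing" rows collapse to one ID shared by two programmes, while the HKU module with the same title stays separate because the university and level differ.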


In [None]:
# Standardise column names (lowercase, no surrounding whitespace)
df.columns = df.columns.str.lower().str.strip()
# Fix the misspelt source header by name rather than reassigning every column by position
df = df.rename(columns={"univeristy": "university"})

# Drop unused column
df = df.drop(columns=['link'])

# Check missing values
print("Missing values:")
print(df.isnull().sum())

Missing values:
city                    0
university              0
program_name            0
faculty                 0
level                   0
module_code           431
module_title            0
module_description     35
core_or_elective        0
dtype: int64


In [None]:
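# Remove exact duplicates and rows without a usable description
before = len(df)
df = df.drop_duplicates()
df = df.dropna(subset=['module_description'])
after = len(df)
# The difference covers both exact duplicates and rows with a missing description
print(f"Removed {before - after} rows (exact duplicates or missing descriptions).")

Removed 37 rows (exact duplicates or missing descriptions).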


In [None]:
# Create composite module_id
df["module_id"] = (
    df["university"].str.lower().str.strip() + "_" +
    df["module_title"].str.lower().str.strip() + "_" +
    df["level"].str.lower().str.strip()
)

# Identify shared modules
shared_counts = (
    df.groupby(["university", "module_id", "level"])
      .size()
      .reset_index(name="program_count")
)

df = df.merge(shared_counts, on=["university","module_id", "level"], how="left")
df["is_shared_module"] = (df["program_count"] > 1).astype(int)

df.head(3)


Unnamed: 0,city,university,program_name,faculty,level,module_code,module_title,module_description,core_or_elective,module_id,program_count,is_shared_module
0,HK,CityU,Bsc in Business Decision Analytics,Department of Decision Analytics and Operations,Bachelor,AC4301,Corporate Accounting II,This course aims to:\n\ndevelop students' conc...,0,cityu_corporate accounting ii_bachelor,1,0
1,HK,CityU,Bsc in Business Decision Analytics,Department of Decision Analytics and Operations,Bachelor,AC4342,Auditing,The primary objective of this course is to pro...,0,cityu_auditing_bachelor,1,0
2,HK,CityU,Bsc in Business Decision Analytics,Department of Decision Analytics and Operations,Bachelor,CB2100,Introduction to Financial Accounting,This course aims to:\n\nprovide students with ...,1,cityu_introduction to financial accounting_bac...,1,0


In [None]:
# Check that the shared-module counts look sensible
df["program_count"].unique()

array([1, 2, 3, 4, 5])

## **1.4 Mapping and Normalising Categorical Columns**

### 1. City and University Mapping  
- Convert "HK" → "Hong Kong" for clarity.

- Convert abbreviated university names back to full names.

### 2. Core/Elective Cleaning  
Retained because core modules may later receive weighting in skill analysis.

### 3. Programme Name Normalisation  
Standardise programme names:
- remove extra whitespace  
- apply title-case  
- ensure consistent comparisons across universities  
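
One caveat with dictionary-based mapping: pandas `Series.map` returns `NaN` for any value that is not a key in the dictionary, so an unexpected spelling silently becomes missing. A small pre-check, sketched below with a hypothetical helper `find_unmapped` and invented sample values, lists such values before the mapping is applied.

```python
def find_unmapped(values, mapping):
    """Return the distinct values that have no entry in `mapping`, sorted."""
    return sorted(set(values) - set(mapping))

# Abbreviated mapping and observed values, invented for illustration
name_mapping = {
    "CityU": "City University of Hong Kong",
    "HKU": "The University of Hong Kong",
}
observed = ["CityU", "HKU", "CityU", "UnknownU"]

unmapped = find_unmapped(observed, name_mapping)
print(unmapped)
```

With a real frame, the same check could be run as `find_unmapped(df["university"].unique(), university_name_mapping)`; an empty list means the mapping is safe to apply.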


In [None]:
df["university"].unique()

array(['CityU', 'CUHK', 'Univeristy College London', 'HKU', 'MU',
       'Imperial College London',
       'London School of Economics and Political Science',
       'University of East London'], dtype=object)

In [None]:
# City mapping
df["city"] = df["city"].replace({"HK": "Hong Kong"})

# University mapping

university_name_mapping = {
    "CityU": "City University of Hong Kong",
    "CUHK": "The Chinese University of Hong Kong",
    "HKU": "The University of Hong Kong",
    "MU": "Hong Kong Metropolitan University",
    "Univeristy College London": "University College London",
    "Imperial College London": "Imperial College London",
    "London School of Economics and Political Science": "London School of Economics and Political Science",
    "University of East London": "University of East London"
}

df["university"] = df["university"].map(university_name_mapping)

# Core/Elective mapping
df["core_or_elective"] = df["core_or_elective"].map({1: "Core", 0: "Elective"})

# Programme name cleaning
df["program_name"] = (
    df["program_name"]
    .astype(str)
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.title()
)

df.head(3)

Unnamed: 0,city,university,program_name,faculty,level,module_code,module_title,module_description,core_or_elective,module_id,program_count,is_shared_module
0,Hong Kong,City University of Hong Kong,Bsc In Business Decision Analytics,Department of Decision Analytics and Operations,Bachelor,AC4301,Corporate Accounting II,This course aims to:\n\ndevelop students' conc...,Elective,cityu_corporate accounting ii_bachelor,1,0
1,Hong Kong,City University of Hong Kong,Bsc In Business Decision Analytics,Department of Decision Analytics and Operations,Bachelor,AC4342,Auditing,The primary objective of this course is to pro...,Elective,cityu_auditing_bachelor,1,0
2,Hong Kong,City University of Hong Kong,Bsc In Business Decision Analytics,Department of Decision Analytics and Operations,Bachelor,CB2100,Introduction to Financial Accounting,This course aims to:\n\nprovide students with ...,Core,cityu_introduction to financial accounting_bac...,1,0


## **1.5 Create `text_combined`**

This column merges module titles and descriptions into a single text field.

- Titles contain condensed keyword cues (e.g., “Machine Learning”, “Cloud Computing”)  
- Combining ensures these crucial terms are available during preprocessing

We also remove:
- newlines  
- tabs  
- excessive whitespace  
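
The whitespace clean-up can be illustrated with the standard `re` module. In this sketch, `collapse_whitespace` is a hypothetical helper name; the pattern `\s+` already matches newlines and tabs as well as repeated spaces, so a single substitution covers all three cases.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

sample = "Auditing\tThis course aims to:\n\n  develop skills  "
cleaned = collapse_whitespace(sample)
print(repr(cleaned))
```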


In [None]:
df["text_combined"] = (
    df["module_title"].fillna("") + " " +
    df["module_description"].fillna("")
)

# Remove newline + tabs + extra whitespace
df["text_combined"] = df["text_combined"].str.replace(r"[\n\t]+", " ", regex=True)
df["text_combined"] = df["text_combined"].str.replace(r"\s+", " ", regex=True).str.strip()

df["text_combined"].head(3)


Unnamed: 0,text_combined
0,Corporate Accounting II This course aims to: d...
1,Auditing The primary objective of this course ...
2,Introduction to Financial Accounting This cour...


## **1.6 Text Normalisation Pipeline**

This section performs the complete text-preprocessing workflow required for all downstream NLP tasks.  

It integrates:
- baseline text cleaning  
- standard & domain-specific stopwords  
- automated synonym expansion using word embeddings  
- automatic multi-word entity extraction (data-driven)  
- synonym + entity replacement  
- bigram & trigram phrase detection  
- final enriched text field for LDA, skill extraction, and TF–IDF  

This pipeline satisfies the supervisor’s requirements for:
- synonym resolution  
- entity normalisation  
- phrase detection  
- domain stopword justification  

The output of this section is the column **clean_text_final**, which will be used in all further analyses.

### **1.6.1 Baseline Linguistic Cleaning**

This step creates an initial clean version of each module description. It performs:

- lowercase conversion  
- spaCy tokenisation  
- removal of punctuation & numeric tokens  
- lemmatisation  
- removal of standard English stopwords  

This produces an initial field: `clean_text_basic`.

In [None]:
stop_words_standard = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def basic_clean(text):
    doc = nlp(text.lower())
    tokens = []
    for token in doc:
        if token.is_alpha and token.text not in stop_words_standard:
            if token.text == "data":      # keep "data" as-is so the lemmatiser does not turn it into "datum"
                tokens.append("data")
                continue

            lemma = token.lemma_.strip()
            tokens.append(lemma)

    return " ".join(tokens)

df["clean_text_basic"] = df["text_combined"].apply(basic_clean)
df[["text_combined", "clean_text_basic"]].head(1)

Unnamed: 0,text_combined,clean_text_basic
0,Corporate Accounting II This course aims to: d...,corporate accounting ii course aim develop stu...


### **1.6.2 Domain-Specific Stopwords**

Academic modules frequently contain high-frequency words that do not convey thematic or skill-related meaning.

Examples:
- *course, student, introduction, topic, understand*

These are removed **after lemmatization**, and they will be documented in **Appendix A**.

The output of this step is an updated field: `clean_text_domain`.
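
One way to justify a domain stopword list is to look at document frequency: the share of module descriptions in which a token appears at least once. The sketch below uses invented toy documents and a hypothetical 50% cut-off; it is not how the list in this notebook was actually derived.

```python
from collections import Counter

def document_frequency(docs):
    """Share of documents in which each token appears at least once."""
    if not docs:
        return {}
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.split()))  # set(): count each token once per document
    return {tok: n / len(docs) for tok, n in counts.items()}

# Toy lemmatised descriptions, invented for illustration
docs = [
    "course aim develop regression model",
    "course provide student audit practice",
    "student learn regression course",
    "course cover cloud computing",
]

doc_share = document_frequency(docs)
# Tokens that occur in at least half of the documents are stopword candidates
candidates = sorted(tok for tok, share in doc_share.items() if share >= 0.5)
print(candidates)
```

In this toy corpus the cut-off flags `regression` alongside `course` and `student`, which shows why a frequency threshold can only propose candidates: the final list still needs manual review so that meaningful skill terms are kept.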

In [None]:
domain_stopwords = set([
    "course", "student", "module", "learning", "science",
    "concept", "topic", "study", "programme", "knowledge",
    "introduction", "including", "method", "approach", "example",
    "also", "basic", "cover", "include", "using", "use", "skill",
    "learn", "understanding", "principle", "aim", "provide", "well", "various",
    "relate", "focus", "aspect", "overview", "develop", "apply", "introduce", "different"
])

domain_stopwords = {
    lemmatizer.lemmatize(w.lower()) for w in domain_stopwords
}

def remove_domain_stops(text):
    return " ".join([w for w in text.split() if w not in domain_stopwords])

df["clean_text_domain"] = df["clean_text_basic"].apply(remove_domain_stops)
df[["clean_text_basic", "clean_text_domain"]].head()

Unnamed: 0,clean_text_basic,clean_text_domain
0,corporate accounting ii course aim develop stu...,corporate accounting ii conceptual professiona...
1,audit primary objective course provide student...,audit primary objective regulatory legal repor...
2,introduction financial accounting course aim p...,financial accounting technical processing prep...
3,introduction managerial accounting course aim ...,managerial accounting management account caree...
4,business statistic course aim facilitate stude...,business statistic facilitate statistical comm...


### **1.6.4 Automatic Multi-Word Entity Extraction**

Using Gensim's `Phrases`, we automatically extract statistically significant multi-word concepts such as:

- *machine learning*
- *data mining*
- *time series analysis*
- *cloud computing*
- *feature engineering*

These are essential for accurate topic modeling and skill extraction.

We extract:
- bigrams  
- trigrams  

Then convert them into canonical single-token forms:  
`time series analysis → time_series_analysis`

Output field: `clean_text_entity`.
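
To make `min_count` and `threshold` more concrete, the sketch below re-implements the scoring rule that Gensim documents for its default scorer: a pair is merged when `(count_ab - min_count) * vocab_size / (count_a * count_b)` exceeds the threshold. The function `phrase_score` and all counts are hypothetical; this is an illustration of the formula, not Gensim's own code.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Default-style phrase score: higher means the pair co-occurs more than chance."""
    denom = count_a * count_b
    if denom == 0:
        return float("-inf")
    return (count_ab - min_count) / denom * vocab_size

THRESHOLD = 10  # same value as the bigram threshold used in this notebook

# Invented counts: the pair appears together in most of its occurrences
frequent_pair = phrase_score(count_a=40, count_b=50, count_ab=30,
                             vocab_size=2000, min_count=3)
# Invented counts: the pair barely co-occurs
rare_pair = phrase_score(count_a=40, count_b=50, count_ab=4,
                         vocab_size=2000, min_count=3)

print(frequent_pair > THRESHOLD, rare_pair > THRESHOLD)
```

Subtracting `min_count` penalises pairs seen only a handful of times, so raising either `min_count` or `threshold` yields fewer, more reliable phrases.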


In [None]:
from gensim.models.phrases import Phrases, Phraser

# Existing bigram and trigram creation
sentences = [text.split() for text in df["clean_text_domain"]]

bigram = Phrases(sentences, min_count=3, threshold=10)
trigram = Phrases(bigram[sentences], min_count=2, threshold=8)

bigram_phraser = Phraser(bigram)
trigram_phraser = Phraser(trigram)

# Step 1: Create a list of custom phrases
custom_bigrams = ["data_science", "machine_learning", "data_analysis", "data_management", "data_modeling", "data_extraction", "web_application"]
custom_trigrams = ["natural_language_processing", "deep_neural_network"]

# Step 2: Add custom phrases to the phrasegrams
# Assumption: gensim 4.x keys `phrasegrams` by the delimiter-joined string
# (e.g. "machine_learning"), so the phrase itself is used as the key.
for phrase in custom_bigrams:
    bigram_phraser.phrasegrams[phrase] = 1e8  # Very high score so the pair always clears the threshold

for phrase in custom_trigrams:
    trigram_phraser.phrasegrams[phrase] = 1e8  # Only fires when the bigram pass has already merged two of the words

# Step 3: Normalize multi-word entities
def normalise_multiword_entities(tokens):
    tokens = bigram_phraser[tokens]
    tokens = trigram_phraser[tokens]
    return tokens

df["entity_tokens"] = df["clean_text_domain"].apply(lambda x: x.split())
df["entity_tokens"] = df["entity_tokens"].apply(normalise_multiword_entities)

df["clean_text_entity"] = df["entity_tokens"].apply(lambda x: " ".join(x))
df[["clean_text_domain", "clean_text_entity"]].head()

Unnamed: 0,clean_text_domain,clean_text_entity
0,corporate accounting ii conceptual professiona...,corporate accounting ii conceptual professiona...
1,audit primary objective regulatory legal repor...,audit primary_objective regulatory legal repor...
2,financial accounting technical processing prep...,financial_accounting technical processing prep...
3,managerial accounting management account caree...,managerial_accounting management_account caree...
4,business statistic facilitate statistical comm...,business statistic facilitate statistical comm...


### **1.6.3 Automated Synonym Expansion**

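Instead of manually constructing long synonym lists, we expand synonyms **automatically** using spaCy’s vector similarity. This step runs on the phrase-merged tokens from `clean_text_entity`. Note that `en_core_web_sm` ships without static word vectors, so its similarity scores are only approximate.

### Steps:
1. Define “anchor skills” (core Data Science terms).  
2. For each anchor, find the top semantically similar tokens in the corpus.  
3. Keep only high-similarity candidates (similarity ≥ 0.95, matching `similarity_threshold` in the code).  
4. Merge into a synonym dictionary.  
5. Apply replacements globally.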

This produces broad, curriculum-specific synonym coverage.

Output field: `clean_text_syn`.
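
The matching rule in steps 2 and 3 can be sketched without spaCy. The example below uses invented three-dimensional vectors and hypothetical helpers `cosine` and `best_anchor`. Unlike a loop that stops at the first anchor above the threshold, it compares a token against every anchor and keeps the most similar one.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_anchor(vec, anchors, threshold):
    """Return the most similar anchor name, or None if none reaches the threshold."""
    if not anchors:
        return None
    name, sim = max(
        ((n, cosine(vec, a)) for n, a in anchors.items()),
        key=lambda pair: pair[1],
    )
    return name if sim >= threshold else None

# Toy vectors, invented for illustration (real embeddings have many more dimensions)
anchors = {"statistics": [1.0, 0.0, 0.0], "coding": [0.0, 1.0, 0.0]}

close_match = best_anchor([0.9, 0.1, 0.0], anchors, threshold=0.95)
no_match = best_anchor([0.5, 0.5, 0.7], anchors, threshold=0.95)
print(close_match, no_match)
```

A threshold as strict as 0.95 keeps only near-identical vectors, which limits false merges at the cost of finding few synonyms.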

In [None]:
CUSTOM_PHRASES = {
    ("data", "science"): "data_science",
    ("machine", "learning"): "machine_learning",
    ("data", "analysis"): "data_analysis",
    ("data", "management"): "data_management",
    ("data", "modeling"): "data_modeling",
    ("data", "extraction"): "data_extraction",
    ("data", "ecology"): "data_ecology",
    ("web", "application"): "web_application",
    ("natural", "language", "processing"): "natural_language_processing",
}

def force_phrases(tokens):
    i = 0
    result = []
    while i < len(tokens):
        matched = False
        for size in (3, 2):
            chunk = tuple(tokens[i:i+size])
            if chunk in CUSTOM_PHRASES:
                result.append(CUSTOM_PHRASES[chunk])
                i += size
                matched = True
                break
        if not matched:
            result.append(tokens[i])
            i += 1
    return result


In [None]:
df["clean_tokens"] = df["clean_text_entity"].apply(lambda x: x.split())
df["clean_tokens"] = df["clean_tokens"].apply(lambda tokens: trigram_phraser[bigram_phraser[tokens]])
df["clean_tokens"] = df["clean_tokens"].apply(force_phrases)

In [None]:
manual_synonyms = {
    "data_analysis": "data_analytic",
}
anchor_terms = [
    "statistics", "hypothesis testing", "coding",
    "big data", "data engineering", "machine learning", "data analytics",
    "deep learning", "data preprocessing", "etl",
    "cloud computing", "model deployment", "mlops",
    "ai ethics", "data governance", "domain expertise",
    "experimental design", "dashboards", "stakeholder communication",
    "data visualisation", "storytelling", "version control", "git",
]


In [None]:
anchor_terms_norm = [t.replace(" ", "_").lower() for t in anchor_terms]

from collections import Counter

all_tokens = Counter(
    token
    for tokens in df["clean_tokens"]
    for token in tokens
)

candidate_tokens = [
    t for t in all_tokens
    if "_" in t or t.isalpha()   # allow phrases + words
]

similarity_threshold = 0.95
synonym_replacements = {}

anchor_docs = {
    term: nlp(term.replace("_", " "))
    for term in anchor_terms_norm
}

for token in candidate_tokens:
    token_doc = nlp(token.replace("_", " "))
    if not token_doc.vector_norm:
        continue

    for anchor, anchor_doc in anchor_docs.items():
        sim = token_doc.similarity(anchor_doc)
        if sim >= similarity_threshold:
            synonym_replacements[token] = anchor
            break  # stop at the first anchor that clears the threshold

def apply_automatic_synonyms(tokens):
    return [synonym_replacements.get(t, t) for t in tokens]

def apply_manual_synonyms(tokens):
    return [manual_synonyms.get(t, t) for t in tokens]

df["clean_tokens"] = df["clean_tokens"].apply(apply_automatic_synonyms)
df["clean_tokens"] = df["clean_tokens"].apply(apply_manual_synonyms)




### **1.6.5 Phrase Detection (Bigrams & Trigrams)**

We enhance entity extraction by applying final phrase merging that captures:

- compound skills  
- methodological terms  
- statistical concepts  
- algorithm names  

The resulting tokens will be highly informative for LDA and keyword analysis.

Output fields: `clean_tokens` (token list) and `clean_text_final` (the same tokens joined into a single string).


In [None]:
df["clean_text_final"] = df["clean_tokens"].apply(lambda x: " ".join(x))
df[["clean_text_entity", "clean_text_final"]].head()


Unnamed: 0,clean_text_entity,clean_text_final
0,corporate accounting ii conceptual professiona...,corporate accounting ii conceptual professiona...
1,audit primary_objective regulatory legal repor...,audit primary_objective regulatory legal repor...
2,financial_accounting technical processing prep...,financial_accounting technical processing prep...
3,managerial_accounting management_account caree...,managerial_accounting management_account caree...
4,business statistic facilitate statistical comm...,business statistic facilitate statistical comm...


### **1.6.6 Final Output Fields**

At the end of the pipeline, we produce:

### **`clean_text_basic`**  
Lowercased, lemmatised text with standard English stopwords removed; domain stopwords are still present.

### **`clean_text_domain`**  
Domain stopwords removed.

### **`clean_text_syn`**  
Automatic synonym unification applied.

### **`clean_text_entity`**  
Automatic multi-word entities merged.

### **`clean_tokens`**  
Token list with phrase detection applied.

### **`clean_text_final`**  
**The final text used for:**
- frequency analysis  
- TF–IDF  
- LDA topic modelling  
- skill extraction  
- cross-city comparison  

This completes Section 1.6.


## **1.7 Final Review**

We conduct a final examination of the dataset and save it as `curriculum_cleaned.parquet` and `curriculum_cleaned.csv`. All columns are kept for possible future use.

In [None]:
df.head()

Unnamed: 0,city,university,program_name,faculty,level,module_code,module_title,module_description,core_or_elective,module_id,...,is_shared_module,text_combined,clean_text_basic,clean_text_domain,clean_text_syn,entity_tokens,clean_text_entity,clean_tokens,clean_text_final,tokens_final
0,Hong Kong,City University of Hong Kong,Bsc In Business Decision Analytics,Department of Decision Analytics and Operations,Bachelor,AC4301,Corporate Accounting II,This course aims to:\n\ndevelop students' conc...,Elective,cityu_corporate accounting ii_bachelor,...,0,Corporate Accounting II This course aims to: d...,corporate accounting ii course aim develop stu...,corporate accounting ii conceptual professiona...,corporate accounting ii conceptual professiona...,"[corporate, accounting, ii, conceptual, profes...",corporate accounting ii conceptual professiona...,"[corporate, accounting, ii, conceptual, profes...",corporate accounting ii conceptual professiona...,"[corporate, accounting, ii, conceptual, profes..."
1,Hong Kong,City University of Hong Kong,Bsc In Business Decision Analytics,Department of Decision Analytics and Operations,Bachelor,AC4342,Auditing,The primary objective of this course is to pro...,Elective,cityu_auditing_bachelor,...,0,Auditing The primary objective of this course ...,audit primary objective course provide student...,audit primary objective regulatory legal repor...,audit primary objective regulatory legal repor...,"[audit, primary_objective, regulatory, legal, ...",audit primary_objective regulatory legal repor...,"[audit, primary_objective, regulatory, legal, ...",audit primary_objective regulatory legal repor...,"[audit, primary_objective, regulatory, legal, ..."
2,Hong Kong,City University of Hong Kong,Bsc In Business Decision Analytics,Department of Decision Analytics and Operations,Bachelor,CB2100,Introduction to Financial Accounting,This course aims to:\n\nprovide students with ...,Core,cityu_introduction to financial accounting_bac...,...,0,Introduction to Financial Accounting This cour...,introduction financial accounting course aim p...,financial accounting technical processing prep...,financial accounting technical processing prep...,"[financial_accounting, technical, processing, ...",financial_accounting technical processing prep...,"[financial_accounting, technical, processing, ...",financial_accounting technical processing prep...,"[financial_accounting, technical, processing, ..."
3,Hong Kong,City University of Hong Kong,Bsc In Business Decision Analytics,Department of Decision Analytics and Operations,Bachelor,CB2101,Introduction to Managerial Accounting,This course aims to: 1.provide students with b...,Core,cityu_introduction to managerial accounting_ba...,...,0,Introduction to Managerial Accounting This cou...,introduction managerial accounting course aim ...,managerial accounting management account caree...,managerial accounting management account caree...,"[managerial_accounting, management_account, ca...",managerial_accounting management_account caree...,"[managerial_accounting, management_account, ca...",managerial_accounting management_account caree...,"[managerial_accounting, management_account, ca..."
4,Hong Kong,City University of Hong Kong,Bsc In Business Decision Analytics,Department of Decision Analytics and Operations,Bachelor,CB2200,Business Statistics,This course aims to facilitate students' learn...,Core,cityu_business statistics_bachelor,...,0,Business Statistics This course aims to facili...,business statistic course aim facilitate stude...,business statistic facilitate statistical comm...,business statistic facilitate statistical comm...,"[business, statistic, facilitate, statistical,...",business statistic facilitate statistical comm...,"[business, statistic, facilitate, statistical,...",business statistic facilitate statistical comm...,"[business, statistic, facilitate, statistical,..."


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1584 entries, 0 to 1583
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   city                1584 non-null   object
 1   university          1584 non-null   object
 2   program_name        1584 non-null   object
 3   faculty             1584 non-null   object
 4   level               1584 non-null   object
 5   module_code         1161 non-null   object
 6   module_title        1584 non-null   object
 7   module_description  1584 non-null   object
 8   core_or_elective    1584 non-null   object
 9   module_id           1584 non-null   object
 10  program_count       1584 non-null   int64 
 11  is_shared_module    1584 non-null   int64 
 12  text_combined       1584 non-null   object
 13  clean_text_basic    1584 non-null   object
 14  clean_text_domain   1584 non-null   object
 15  clean_text_syn      1584 non-null   object
 16  entity_tokens       1584

In [None]:
df.to_parquet("curriculum_cleaned.parquet", index=False)
df.to_csv("curriculum_cleaned.csv", index=False)
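
One practical note on the two output formats: CSV has no list type, so list-valued columns such as `entity_tokens` and `clean_tokens` are written as their string representation and come back as plain strings when the file is reloaded. The sketch below shows one way to restore them; `parse_token_list` is a hypothetical helper and the sample cell is invented.

```python
import ast

def parse_token_list(cell):
    """Convert a stringified list such as "['a', 'b']" back into a Python list."""
    value = ast.literal_eval(cell)  # safely evaluates literals only, no arbitrary code
    if not isinstance(value, list):
        raise ValueError(f"expected a list, got {type(value).__name__}")
    return value

# What a list cell looks like after a CSV round trip
cell_from_csv = "['audit', 'primary_objective', 'regulatory']"
tokens = parse_token_list(cell_from_csv)
print(tokens)
```

After reloading the CSV with pandas, the helper could be applied with `df["clean_tokens"].apply(parse_token_list)`. The Parquet file generally avoids this step because it stores the column as a list type.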