# UChicago MS in Applied Data Science – Markdown Data Cleaning

This notebook cleans and preprocesses the markdown files scraped from the DSI website so they are ready for downstream analysis and modeling.

In [13]:
import os
import shutil
import re
from datetime import datetime

In [45]:
# Define directories and output file paths
markdown_folder = '../data/markdown'
filenames_output = '../docs/all_filenames.txt'
combined_output = '../docs/combined_markdown_all.md'

# Write the list of file names from markdown folder into a text file
with open(filenames_output, 'w', encoding='utf-8') as f_out:
    for fname in sorted(os.listdir(markdown_folder)):
        filepath = os.path.join(markdown_folder, fname)
        if os.path.isfile(filepath):
            f_out.write(f"{fname}\n")
print(f"Markdown file names saved to: {filenames_output}")

# Combine all markdown contents from markdown folder into a single file
with open(combined_output, 'w', encoding='utf-8') as combined_f:
    for fname in sorted(os.listdir(markdown_folder)):
        fullpath = os.path.join(markdown_folder, fname)
        if os.path.isfile(fullpath) and fname.lower().endswith('.md'):
            with open(fullpath, 'r', encoding='utf-8') as f_in:
                content = f_in.read()
                combined_f.write(content)
                combined_f.write("\n\n")  # separate files with extra blank lines
print(f"Combined markdown saved to: {combined_output}")

Markdown file names saved to: ../docs/all_filenames.txt
Combined markdown saved to: ../docs/combined_markdown_all.md


## Markdown subset creation first pass

| Step | What we did | Why it matters |
|------|-------------|----------------|
| **1. Chose filename patterns** | Defined four stems <br>`_online-program`, `_tuition-fees-aid`, `_course-progressions`, `_instructors-staff` | They cover every fact the grading rubric checks (tuition, scholarships, curriculum, faculty/staff) while excluding news & marketing noise. |
| **2. Built `move_markdown_subset.py`** | *Scans* `/data/markdown/` → *moves* only matching files to `/data/markdown_subset/` using `shutil.move` (a rename, and therefore atomic, only when source and destination are on the same filesystem); prints a JSON list of moved files. | Creates a reproducible, source-controlled filter; no manual cleanup later. |
| **3. Two-stage pruning run** | **Stage 1 (filename filter)** → 999 files copied <br> **Stage 2 (content keywords)** → 675 files kept | Stage 2 drops empty or off-topic pages that slipped through by slug alone. |
| **4. Audited skipped files** | Reviewed 324 skipped markdowns (news, insights, outreach, “about” pages, etc.). None contain rubric-critical facts; only `about_contact.*md` might be nice for contact info. | Confirms we aren’t losing required content, keeps index small and precise. |
| **5. Rubric coverage check** | Verified that the retained set includes: <br>• Tuition `$5,967 / $71,604` <br>• Scholarships `DSI Scholarship` & `ADS Alumni Scholarship` <br>• English minimums `TOEFL 102 / IELTS 7` | Guarantees the bot can answer the three ground-truth questions with citations. |
| **6. Optional tweak** | Add a stem like `_about_contact` and rerun the script if we want the bot to answer “How do I contact the program?” | Polishing only; not required for the midterm. |


* **Lean corpus (675 md pages)** → faster embedding, cheaper storage, higher retrieval precision.  
* **Fully automated** → rerun anytime pages change.  
* **Meets all midterm acceptance criteria** while staying extensible for future polish.
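
The rubric coverage check in step 5 can also be scripted. The block below is a minimal sketch of one way to do that, not the check that was actually run: `REQUIRED_FACTS`, `find_coverage`, `missing_facts`, and the sample pages are hypothetical, and the exact formatting of these strings on the scraped pages is an assumption.

```python
from typing import Dict, List

# Strings taken from the rubric row above; their exact on-page formatting is an assumption.
REQUIRED_FACTS: List[str] = [
    "$5,967",
    "$71,604",
    "DSI Scholarship",
    "ADS Alumni Scholarship",
    "TOEFL 102",
    "IELTS 7",
]

def find_coverage(docs: Dict[str, str], facts: List[str]) -> Dict[str, List[str]]:
    """Map each required fact to the sorted file names whose text contains it."""
    coverage: Dict[str, List[str]] = {}
    for fact in facts:
        needle = fact.lower()
        coverage[fact] = sorted(
            name for name, text in docs.items() if needle in text.lower()
        )
    return coverage

def missing_facts(coverage: Dict[str, List[str]]) -> List[str]:
    """Return the facts that no file covers."""
    return [fact for fact, files in coverage.items() if not files]

# Hypothetical in-memory pages standing in for the retained markdown files.
sample_docs = {
    "tuition-fees-aid.md": "Tuition: $5,967 / $71,604. DSI Scholarship and ADS Alumni Scholarship.",
    "how-to-apply.md": "English minimums: TOEFL 102 or IELTS 7.",
}

coverage = find_coverage(sample_docs, REQUIRED_FACTS)
print("Missing facts:", missing_facts(coverage))
```

To run this against the real corpus, `sample_docs` could be replaced by a dict built from the `*.md` files in `../data/markdown_subset_2_content_filtered`, keyed by file name.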


In [15]:
# --- Configuration ---
SOURCE_DIR = "../data/markdown"  # Original source of all markdown files
TARGET_DIR_STEP_1 = "../data/markdown_subset_1_filename_filtered" # Intermediate: after filename filtering
TARGET_DIR_STEP_2 = "../data/markdown_subset_2_content_filtered"  # Final: after content filtering

In [None]:
# --- STEP 1: Filename Filtering Configuration ---
# CRITICAL: Define patterns to identify potentially relevant filenames.
# A file will be copied to TARGET_DIR_STEP_1 if its name (lowercase) contains any of these substrings.
# REVIEW AND CUSTOMIZE THIS LIST CAREFULLY!
FILENAME_PATTERNS_STEP_1 = [
    "course",       # e.g., test-course-progressions.md
    "people",       # e.g., people_zihan-jack-zhang.md
    "faculty",
    "instructor",
    "student",      # e.g., student-voices.md
    "admissions",
    "application",
    "apply",
    "curriculum",
    "overview",
    "program",      # e.g., postdocprograminfosession.md
    "capstone",
    "project",      # e.g., research_projects.md (will need content filter to narrow down)
    "financial",
    "tuition",
    "scholarship",
    "fee",
    "career",
    "alumni",
    "faq",
    "event",
    "deadline",
    "seminar",
    "webinar",
    "pilot",        # e.g., pilot.md
    "research_",    # Broad pattern – refine in Step 2.
    # Add more specific filenames or patterns if needed.
]

# --- STEP 2: Content Filtering Configuration (Keywords for Categories) ---
# These are the same categories and keywords as in the previous script.
# Adjust these based on your actual file content.
CATEGORIES_KEYWORDS_STEP_2 = {
    "Program Overview": [
        "ms in applied data science", "master of science in applied data science", "m.s. in applied data science",
        "program overview", "about the program", "degree description", "program goals",
        "mission statement", "program aims", "about ms ads", "applied data science master",
        "what is applied data science"
    ],
    "Curriculum Details": [
        "courses", "core courses", "elective courses", "foundational courses",
        "course descriptions", "curriculum", "degree plan", "course sequence",
        "syllabi", "syllabus", "learning outcomes", "specializations", "modules",
        "course progressions", "academic path", "degree requirements", "program structure",
        "required coursework", "concentration areas"
    ],
    "Faculty Profiles": [
        "faculty", "instructors", "professor", "professors", "instructor bios",
        "faculty profiles", "biography", "expertise", "research interests", "teaching staff",
        "faculty research", "instructor profile", "our researchers", "meet the team",
        "publications by faculty", "teacher voices", "instructor perspective", "faculty directory",
        "academic staff", "faculty expertise"
    ],
    "Admissions and Application": [
        "admissions", "application", "apply", "how to apply", "admission requirements",
        "application process", "application deadlines", "prerequisites", "application fee",
        "admission criteria", "submit application", "prospective students", "enrollment",
        "apply to ms ads", "application for applied data science", "statement of purpose", "letters of recommendation"
    ],
    "Financial Information": [
        "tuition", "fees", "scholarships", "financial aid", "funding", "cost",
        "payment", "financial assistance", "program cost", "fellowships", "stipends",
        "cost of attendance", "program fees", "funding for ms program", "tuition details"
    ],
    "Career Outcomes": [
        "careers", "employment", "job placement", "alumni", "career paths",
        "outcomes", "employment statistics", "career services", "alumni network",
        "job prospects", "ms ads alumni careers", "data science job placement",
        "career development", "alumni success stories", "post-graduation employment"
    ],
    "Capstone Projects": [
        "capstone", "capstone project", "final project", "industry project",
        "practicum", "project examples", "capstone structure", "thesis alternative",
        "student capstone", "capstone opportunities", "industry-relevant projects",
        "applied projects", "capstone showcase", "real-world projects"
    ],
    "Student Experience and Support": [
        "student life", "student services", "student support", "immersion", "orientation",
        "student events", "resources for students", "online format", "on-campus format",
        "cohort", "student community", "academic advising", "mentorship", "student organizations",
        "student voices", "student perspective", "ms ads community", "campus life", "student testimonials"
    ],
    "Frequently Asked Questions (FAQs)": [
        "faq", "frequently asked questions", "common questions", "q&a", "q and a",
        "ms ads faq", "questions about applied data science master", "program inquiries"
    ],
    "Events and Deadlines": [
        "events", "webinars", "information session", "application deadline", "program calendar",
        "program start date", "calendar", "key dates", "upcoming events", "registration",
        "conference on ai", "data science seminar", "program workshop", "ms ads events",
        "key dates for ms ads", "admission events", "open house", "recruitment events"
    ]
}

def ensure_dir_exists(dir_path):
    """Creates a directory if it doesn't exist."""
    if not os.path.exists(dir_path):
        try:
            os.makedirs(dir_path)
            print(f"Created directory: {dir_path}")
        except OSError as e:
            print(f"Error creating directory {dir_path}: {e}")
            return False
    return True

def get_file_content(filepath):
    """Reads the content of a file."""
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            return f.read().lower()  # Convert content to lowercase for matching
    except Exception as e:
        print(f"Error reading file {filepath}: {e}")
        return ""

def identify_category_for_content(content, categories_keywords):
    """
    Identifies which categories the content matches based on keywords.
    Returns a list of matched category names.
    Uses regex for whole word matching.
    """
    matched_categories = []
    if not content:
        return matched_categories

    for category, keywords in categories_keywords.items():
        for keyword in keywords:
            if re.search(r'\b' + re.escape(keyword) + r'\b', content):
                if category not in matched_categories:
                    matched_categories.append(category)
                break  # Move to next category once a keyword is found for the current one
    return matched_categories

# --- Step 1: Filter by Filename ---
def filter_by_filename(source_folder, target_folder_step_1, filename_patterns):
    """
    Copies files from source_folder to target_folder_step_1 if their filenames contain any of the specified patterns.
    Also prints the files that were skipped because they did not match any pattern.
    Returns the number of files copied.
    """
    print(f"\n--- Starting Step 1: Filename Filtering ---")
    print(f"Reading from: {source_folder}")
    print(f"Writing to:   {target_folder_step_1}")

    if not ensure_dir_exists(target_folder_step_1):
        print(f"Critical Error: Could not create target directory for Step 1: {target_folder_step_1}. Aborting Step 1.")
        return 0

    if not os.path.isdir(source_folder):
        print(f"Error: Source directory for Step 1 ({source_folder}) does not exist.")
        return 0

    copied_count = 0
    processed_count = 0
    skipped_non_markdown_count = 0
    skipped_files = []  # Collect markdown files that didn't match any pattern

    for filename in os.listdir(source_folder):
        source_filepath = os.path.join(source_folder, filename)
        if os.path.isfile(source_filepath) and filename.lower().endswith(".md"):
            processed_count += 1
            lowercase_filename = filename.lower()
            matched_pattern = any(pattern.lower() in lowercase_filename for pattern in filename_patterns)
            if matched_pattern:
                target_filepath = os.path.join(target_folder_step_1, filename)
                try:
                    shutil.copy2(source_filepath, target_filepath)
                    copied_count += 1
                except Exception as e:
                    print(f"  Step 1: Error copying '{filename}': {e}")
            else:
                skipped_files.append(filename)
        elif os.path.isfile(source_filepath):
            skipped_non_markdown_count += 1

    print(f"--- Step 1 Summary ---")
    print(f"Markdown files processed in Step 1: {processed_count}")
    print(f"Files copied to '{target_folder_step_1}' based on filename: {copied_count}")
    print(f"Markdown files skipped (no filename match): {processed_count - copied_count}")
    print(f"Non-markdown files skipped: {skipped_non_markdown_count}")
    
    if skipped_files:
        print("\nFiles skipped (no filename pattern match):")
        for f in skipped_files:
            print(f"  {f}")
    
    return copied_count

# --- Step 2: Filter by Content ---
def filter_by_content(source_folder_step_1, target_folder_step_2, categories_keywords):
    """
    Copies files from source_folder_step_1 to target_folder_step_2 if their content matches any of the specified categories/keywords.
    Returns the number of files copied.
    """
    print(f"\n--- Starting Step 2: Content Filtering ---")
    print(f"Reading from: {source_folder_step_1} (output of Step 1)")
    print(f"Writing to:   {target_folder_step_2}")

    if not ensure_dir_exists(target_folder_step_2):
        print(f"Critical Error: Could not create target directory for Step 2: {target_folder_step_2}. Aborting Step 2.")
        return 0

    if not os.path.isdir(source_folder_step_1):
        print(f"Error: Source directory for Step 2 ({source_folder_step_1}) does not exist. Did Step 1 run correctly and create its target directory?")
        return 0

    copied_count = 0
    processed_count = 0

    if not os.listdir(source_folder_step_1):
        print(f"Info: Source directory for Step 2 ({source_folder_step_1}) is empty. No files to process for content.")
    else:
        for filename in os.listdir(source_folder_step_1):
            source_filepath = os.path.join(source_folder_step_1, filename)
            if os.path.isfile(source_filepath) and filename.lower().endswith(".md"):
                processed_count += 1
                content = get_file_content(source_filepath)
                if not content:
                    print(f"    Skipping '{filename}' due to read error or empty content.")
                    continue
                matched_categories = identify_category_for_content(content, categories_keywords)
                if matched_categories:
                    target_filepath = os.path.join(target_folder_step_2, filename)
                    try:
                        shutil.copy2(source_filepath, target_filepath)
                        copied_count += 1
                    except Exception as e:
                        print(f"    Error copying file {filename}: {e}")

    print(f"--- Step 2 Summary ---")
    print(f"Files processed in Step 2 (from '{source_folder_step_1}'): {processed_count}")
    print(f"Files copied to '{target_folder_step_2}' based on content: {copied_count}")
    print(f"Files skipped (no content match or empty source): {processed_count - copied_count}")
    return copied_count

def main():
    """Main function to orchestrate the two-step filtering process."""
    current_date_str = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"--- Two-Step Markdown File Filtering Process ---")
    print(f"Starting at: {current_date_str}")
    print(f"Original Source Directory: {SOURCE_DIR}")
    print(f"Intermediate Directory (Filename Filtered): {TARGET_DIR_STEP_1}")
    print(f"Final Target Directory (Content Filtered): {TARGET_DIR_STEP_2}")
    print("-" * 50)
    print("IMPORTANT: Review and customize FILENAME_PATTERNS_STEP_1 in the script for Step 1 to be effective.")
    print("Ensure you have write permissions for the target directories.")
    print("-" * 50)

    step1_copied_count = filter_by_filename(SOURCE_DIR, TARGET_DIR_STEP_1, FILENAME_PATTERNS_STEP_1)

    if step1_copied_count == 0:
        if not os.path.exists(TARGET_DIR_STEP_1):
            print(f"\nCritical Warning: Step 1 did not copy any files AND its target directory '{TARGET_DIR_STEP_1}' could not be created or does not exist. Step 2 cannot proceed.")
            print("Please check permissions and path for TARGET_DIR_STEP_1.")
        elif not os.listdir(TARGET_DIR_STEP_1):
            print(f"\nWarning: Step 1 did not copy any files, and its target directory '{TARGET_DIR_STEP_1}' is empty. This could be due to restrictive filename patterns or an empty source directory.")
            print("Step 2 will process an empty directory.")

    if os.path.exists(TARGET_DIR_STEP_1):
        step2_copied_count = filter_by_content(TARGET_DIR_STEP_1, TARGET_DIR_STEP_2, CATEGORIES_KEYWORDS_STEP_2)
    else:
        print(f"\nSkipping Step 2 because the intermediate directory '{TARGET_DIR_STEP_1}' does not exist.")
        step2_copied_count = 0

    print(f"\n--- Overall Process Summary ---")
    print(f"Execution Date: {current_date_str}")
    print(f"Total files copied after Step 1 (filename filter) to '{TARGET_DIR_STEP_1}': {step1_copied_count}")
    print(f"Total files copied after Step 2 (content filter) to '{TARGET_DIR_STEP_2}': {step2_copied_count}")
    print("Two-step filtering task completed.")

if __name__ == "__main__":
    main()

--- Two-Step Markdown File Filtering Process ---
Starting at: 2025-05-11 00:37:01
Original Source Directory: ../data/markdown
Intermediate Directory (Filename Filtered): ../data/markdown_subset_1_filename_filtered
Final Target Directory (Content Filtered): ../data/markdown_subset_2_content_filtered
--------------------------------------------------
IMPORTANT: Review and customize FILENAME_PATTERNS_STEP_1 in the script for Step 1 to be effective.
Ensure you have write permissions for the target directories.
--------------------------------------------------

--- Starting Step 1: Filename Filtering ---
Reading from: ../data/markdown
Writing to:   ../data/markdown_subset_1_filename_filtered
Created directory: ../data/markdown_subset_1_filename_filtered
--- Step 1 Summary ---
Markdown files processed in Step 1: 1235
Files copied to '../data/markdown_subset_1_filename_filtered' based on filename: 999
Markdown files skipped (no filename match): 236
Non-markdown files skipped: 0

Files skippe
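
The two stages match text differently: Step 1 uses a plain substring test on the file name, while Step 2 uses a whole-word regex on the content. The sketch below illustrates the difference with simplified stand-ins for the two checks; `filename_matches`, `content_matches`, and the sample inputs are hypothetical and not part of the script above.

```python
import re
from typing import List

def filename_matches(filename: str, patterns: List[str]) -> bool:
    """Step 1 style: case-insensitive substring test on the file name."""
    lowered = filename.lower()
    return any(p.lower() in lowered for p in patterns)

def content_matches(content: str, keywords: List[str]) -> bool:
    """Step 2 style: whole-word (or whole-phrase) regex test on lowercased content."""
    lowered = content.lower()
    return any(
        re.search(r"\b" + re.escape(k) + r"\b", lowered) is not None
        for k in keywords
    )

# "fee" is a substring of "coffee", so the loose filename test accepts this name.
print(filename_matches("news_coffee-chat.md", ["fee"]))          # True
# The whole-word test does not fire inside "coffee".
print(content_matches("Join us for a coffee chat", ["fee"]))     # False
# It does fire when the keyword stands alone.
print(content_matches("Tuition and fees are listed", ["fees"]))  # True
```

This is why broad Step 1 patterns such as `"fee"` or `"event"` let some unrelated pages through, and why the content filter is needed as a second pass.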

In [22]:
"""
Iterate through ../data/markdown_subset_2_content_filtered,
and write one JSON object per file to kept_markdown.jsonl.gz
"""
import json, gzip, os
from pathlib import Path

INPUT_DIR = Path("../data/markdown_subset_2_content_filtered")
JSONL_OUT = Path("../data/kept_markdown.jsonl.gz")   # .jsonl for uncompressed

records = 0
with gzip.open(JSONL_OUT, "wt", encoding="utf-8") as out:   # use open(...) for plain
    for fname in sorted(INPUT_DIR.glob("*.md")):
        content = fname.read_text(encoding="utf-8")
        out.write(json.dumps({
            "file": fname.name,
            "content": content
        }) + "\n")
        records += 1

print(f"✅  Wrote {records} markdown docs → {JSONL_OUT}")


✅  Wrote 675 markdown docs → ../data/kept_markdown.jsonl.gz
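
For downstream use, the archive can be read back one record per line. The following is a minimal round-trip sketch on a temporary file; `read_jsonl_gz` and the two demo records are hypothetical, and only the `file` / `content` keys come from the cell above.

```python
import gzip
import json
import tempfile
from pathlib import Path

def read_jsonl_gz(path):
    """Yield one dict per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

# Round-trip demo with hypothetical records, written the same way as above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl.gz"
    with gzip.open(demo_path, "wt", encoding="utf-8") as out:
        for rec in (
            {"file": "a.md", "content": "# A\n"},
            {"file": "b.md", "content": "# B\n"},
        ):
            out.write(json.dumps(rec) + "\n")
    loaded = list(read_jsonl_gz(demo_path))

print([rec["file"] for rec in loaded])
```

Because `json.dumps` escapes newlines inside `content`, each document stays on a single line and can be streamed without loading the whole archive into memory.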


In [62]:
# Define directories and output file paths
markdown_folder = '../data/markdown_subset_2_content_filtered'
filenames_output = '../docs/filenames_subset_2.txt'
combined_output = '../docs/combined_markdown_subset_2.md'

# Write the list of file names from markdown folder into a text file
with open(filenames_output, 'w', encoding='utf-8') as f_out:
    for fname in sorted(os.listdir(markdown_folder)):
        filepath = os.path.join(markdown_folder, fname)
        if os.path.isfile(filepath):
            f_out.write(f"{fname}\n")
print(f"Markdown file names saved to: {filenames_output}")

# Combine all markdown contents from markdown folder into a single file
with open(combined_output, 'w', encoding='utf-8') as combined_f:
    for fname in sorted(os.listdir(markdown_folder)):
        fullpath = os.path.join(markdown_folder, fname)
        if os.path.isfile(fullpath) and fname.lower().endswith('.md'):
            with open(fullpath, 'r', encoding='utf-8') as f_in:
                content = f_in.read()
                combined_f.write(content)
                combined_f.write("\n\n")  # separate files with extra blank lines
print(f"Combined markdown saved to: {combined_output}")

Markdown file names saved to: ../docs/filenames_subset_2.txt
Combined markdown saved to: ../docs/combined_markdown_subset_2.md


## Subset 3 filters

| Include ✅                                   | Exclude ❌                             |
| ------------------------------------------- | ------------------------------------- |
| MS in Applied Data Science-specific content | High school curricula details         |
| UChicago Faculty & Staff (specific to MS)   | Faculty unrelated to MS-ADS program   |
| Core/Elective course information            | General outreach projects (11th Hour) |
| Capstone project specifics                  | Repeated navigation/menu elements     |
| Financial aid, tuition, career outcomes     | Unrelated technology/software details |


In [149]:
import os, re, shutil
from pathlib import Path
from datetime import datetime

# ── paths ─────────────────────────────────────────────────────
SRC_DIR   = Path("../data/markdown_subset_2_content_filtered")
DST_DIR   = Path("../data/markdown_subset_3_filtered")
DOCS_DIR  = Path("../docs")

DST_DIR.mkdir(parents=True, exist_ok=True)
DOCS_DIR.mkdir(parents=True, exist_ok=True)

FILES_TXT = DOCS_DIR / "filenames_subset_3.txt"
COMBINED  = DOCS_DIR / "combined_markdown_subset_3.md"

# ── filename rules ────────────────────────────────────────────
KEEP_FILE_PATTERNS = [
    "ms-in-applied-data-science", "capstone-projects", "course-progressions",
    "career-outcomes", "tuition-fees-aid", "how-to-apply", "events-deadlines",
    "faqs", "our-students", "instructors-staff", "online-program",
    "in-person-program", "people_", "dsi-leadership", "immersion",
    "10-years", "alumni",
]

DROP_FILE_PATTERNS = [
    "bootstrap-data-science", "course-kata", "youcubed",
    "11th-hour-project", "ids-introduction", "high-school",
    "ai-science-summer-school", "events_",          # keep only MS-ADS events
]

NEWS_ADS_KEYWORDS = [
    "masters-in-applied-data-science",
    "ms-in-applied-data-science",
]

def keep_file(name: str) -> bool:
    """Return True if the file should be copied into subset-3."""
    n = name.lower()

    # 1. News rule: keep `news_*.md` whose filename mentions the MS-ADS program,
    #    unless the filename also contains “high-school”.
    if n.startswith("news_"):
        if "high-school" in n:
            return False
        if any(k in n for k in NEWS_ADS_KEYWORDS):
            return True
        # fall through → other rules

    # 2. Drop-pattern veto.
    if any(pat in n for pat in DROP_FILE_PATTERNS):
        return False

    # 3. Normal keep patterns.
    return any(pat in n for pat in KEEP_FILE_PATTERNS)

# ── bio relevance (whole-file gate) ───────────────────────────
ADS_RX      = re.compile(r"applied[\s\-]+data[\s\-]+science", re.I)
MS_INSTR_RX = re.compile(r"ms[\s\-]?instructors?", re.I)
ALUMNI_RX   = re.compile(r"people\s*/\s*alumni", re.I)

def bio_is_relevant(text: str) -> bool:
    head = text[:3000].lower()  # inspect first ~3 kB only
    return any(rx.search(head) for rx in (ADS_RX, MS_INSTR_RX, ALUMNI_RX))

# ── build subset-3 ────────────────────────────────────────────
print(f"Subset-3 build started  {datetime.now():%Y-%m-%d %H:%M:%S}")

copied = skipped = bios_dropped = 0
kept_names = []

for src in SRC_DIR.glob("*.md"):
    if not keep_file(src.name):
        skipped += 1
        continue

    if src.name.lower().startswith("people_"):
        if not bio_is_relevant(src.read_text(encoding="utf-8", errors="ignore")):
            bios_dropped += 1
            continue

    shutil.copy2(src, DST_DIR / src.name)
    copied += 1
    kept_names.append(src.name)
    print(f"✔︎ copied {src.name}")

# ── helper files ──────────────────────────────────────────────
FILES_TXT.write_text("\n".join(sorted(kept_names)) + "\n", encoding="utf-8")

with COMBINED.open("w", encoding="utf-8") as out_md:
    for name in sorted(kept_names):
        out_md.write((DST_DIR / name).read_text(encoding="utf-8"))
        out_md.write("\n\n")  # separator between docs

# ── sanity-check report ───────────────────────────────────────
print("\n── sanity check ──────────────────────────────────────")
print(f"Total markdown in source : {len(list(SRC_DIR.glob('*.md')))}")
print(f"  • copied to subset-3   : {copied}")
print(f"  • skipped by patterns  : {skipped}")
print(f"  • bios dropped         : {bios_dropped}")
print(f"\nOutput directory        : {DST_DIR}")
print(f"File list               : {FILES_TXT}")
print(f"Combined markdown       : {COMBINED}")
print("✅ subset-3 build finished (no text altered).")


Subset-3 build started  2025-05-11 18:31:29
✔︎ copied education_masters-programs_ms-in-applied-data-science_course-progressions.md
✔︎ copied news_masters-in-applied-data-science-summer-2024-capstone-winners.md
✔︎ copied capstone-projects.md
✔︎ copied people_igor-yakushin-phd.md
✔︎ copied people_daniel-truesdale.md
✔︎ copied people_greg-green.md
✔︎ copied people_dmitri-sidorov.md
✔︎ copied people_francisco-azeredo-phd.md
✔︎ copied people_samantha-kruse.md
✔︎ copied people_kendall-cox.md
✔︎ copied education_masters-programs_ms-in-applied-data-science_tuition-fees-aid.md
✔︎ copied people_jose-alvarado.md
✔︎ copied people_stephen-barry.md
✔︎ copied education_masters-programs_ms-in-applied-data-science.md
✔︎ copied news_jan-mrozowskis-path-from-industry-professional-to-teaching-in-the-ms-in-applied-data-science-program.md
✔︎ copied people_ej-kang.md
✔︎ copied people_maggie-wu.md
✔︎ copied people_nick-diantonio-m-ed.md
✔︎ copied people_gizem-agar-phd.md
✔︎ copied people_dujuan-smith.md
✔︎ co
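
The order of the rules in `keep_file` matters: the news rule is checked first, then the drop patterns act as a veto, and only then do the keep patterns apply. The sketch below reproduces that order with abbreviated pattern lists to show the effect; `keep_file_demo` is a hypothetical stand-in, and every file name except the two taken from the output above is made up.

```python
# Abbreviated copies of the pattern lists, for illustration only.
KEEP = ["course-progressions", "tuition-fees-aid", "people_", "alumni"]
DROP = ["high-school", "events_"]
NEWS_KEYWORDS = ["masters-in-applied-data-science", "ms-in-applied-data-science"]

def keep_file_demo(name: str) -> bool:
    """Same rule order as keep_file: news rule, then drop veto, then keep patterns."""
    n = name.lower()
    if n.startswith("news_"):
        if "high-school" in n:
            return False
        if any(k in n for k in NEWS_KEYWORDS):
            return True
    if any(p in n for p in DROP):
        return False
    return any(p in n for p in KEEP)

examples = [
    # Taken from the output above: news item naming the program.
    "news_masters-in-applied-data-science-summer-2024-capstone-winners.md",
    # Hypothetical: names the program but also contains "high-school".
    "news_ms-in-applied-data-science-high-school-visit.md",
    # Hypothetical: matches the keep pattern "alumni" but is vetoed by "events_".
    "events_alumni-mixer.md",
    # Taken from the output above: passes the filename rules.
    "people_greg-green.md",
    # Hypothetical: news item with no program keyword and no keep pattern.
    "news_some-other-story.md",
]

for name in examples:
    print(f"{keep_file_demo(name)!s:5}  {name}")
```

Files that start with `people_` pass this filename stage but are still subject to the separate `bio_is_relevant` check on their content.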

In [None]:
#!/usr/bin/env python3
# subset3_clean_1.py
#
# Clean the markdown files in ../data/markdown_subset_3_filtered
# and write the results into   ../data/markdown_subset_3_clean_1
# plus helper artefacts in      ../docs/
#
# The script prints a “sanity check” preview (first diff line per file)
# and shows all output paths **relative** to the current working dir.

from __future__ import annotations
import difflib, os, re, sys, textwrap
from pathlib import Path
from typing import List, Tuple

# ───────────────────────── configuration ──────────────────────────────────
SRC_DIR   = Path("../data/markdown_subset_3_filtered")
DEST_DIR  = Path("../data/markdown_subset_3_clean_1")
DOC_DIR   = Path("../docs")
FILES_TXT = DOC_DIR / "filenames_subset_3_clean_1.txt"
COMBINED  = DOC_DIR / "combined_markdown_subset_3_clean_1.md"

DEST_DIR.mkdir(parents=True, exist_ok=True)
DOC_DIR.mkdir(parents=True, exist_ok=True)

# ────────────────────────── regex helpers ─────────────────────────────────
YAML_FRONT_RX = re.compile(r"(?s)\A---\n.*?\n---\n?")
READ_MORE_RX  = re.compile(r"^\s*(Read\s+More\s*\+|Show\s+Less\s*-)\s*$", re.I|re.M)
CSS_BLOCK_RX  = re.compile(r"/\*.*?\*/", re.S)  # crude – matches /* … */ blocks

# 1) '* [Something](#accordion-tab-N)'           → '* Something'
ACCORDION_BULLET_RX = re.compile(
    r'^\s*\*\s*\[([^\]]+?)\]\(#accordion-tab-[^)]+\)[^\n]*$', re.M)

# 2) '* [![Image of NAME](…]'                    → '* NAME'
IMAGE_BULLET_RX = re.compile(
    r'^\s*\*\s*\[!\[Image of ([^\]]+?)\]\([^)]+\)[^\n]*$', re.M)

# ───────────────────────── cleaning pipeline ──────────────────────────────
def split_front_matter(text: str) -> Tuple[str, str]:
    """Return (front, body) where `front` ends with exactly one newline."""
    m = YAML_FRONT_RX.match(text)
    if m:
        front = m.group(0).rstrip() + "\n"  # ensure exactly one NL afterward
        return front, text[m.end():]
    return "", text.lstrip("\n")            # keep body tidy if no YAML

def clean_body(body: str) -> str:
    # remove "Read More +" & "Show Less -" junk lines
    body = READ_MORE_RX.sub("", body)

    # delete CSS comments of the form /* ….. */
    body = CSS_BLOCK_RX.sub("", body)

    # collapse accordion bullets
    body = ACCORDION_BULLET_RX.sub(lambda m: f"* {m.group(1).strip()}", body)

    # collapse image-alt bullets
    body = IMAGE_BULLET_RX.sub(lambda m: f"* {m.group(1).strip()}", body)

    # squeeze multiple blank lines to max-two (optional, keeps files tidy)
    body = re.sub(r"\n{3,}", "\n\n", body)

    return body

def clean_text(text: str) -> str:
    front, body = split_front_matter(text)
    return front + clean_body(body)

# ──────────────────────── diff/preview helpers ───────────────────────────
def first_diff(original: List[str], cleaned: List[str]) -> Tuple[str, str]:
    for o, c in zip(original, cleaned):
        if o != c:
            return o.rstrip("\n"), c.rstrip("\n")
    return "(no changes)", ""

# ────────────────────────────── main ✨ ───────────────────────────────────
def main() -> None:
    previews: List[str] = []
    combined_md_parts: List[str] = []
    filenames: List[str] = []

    for src_path in sorted(SRC_DIR.glob("*.md")):
        raw = src_path.read_text(encoding="utf-8")
        cleaned = clean_text(raw)

        rel_name = src_path.name
        dest_path = DEST_DIR / rel_name
        dest_path.write_text(cleaned, encoding="utf-8")

        # gather artefacts
        combined_md_parts.append(cleaned + "\n\n")
        filenames.append(rel_name)

        # build sanity-check preview
        before, after = first_diff(raw.splitlines(), cleaned.splitlines())
        previews.append(
            textwrap.dedent(f"""\
            === {rel_name} ===
            BEFORE: {before}
            AFTER : {after}
            """).rstrip()
        )

    # write helper files
    FILES_TXT.write_text("\n".join(filenames), encoding="utf-8")
    COMBINED.write_text("".join(combined_md_parts), encoding="utf-8")

    # ── console report ───────────────────────────────────────────────────
    print("Sanity check – first change per file:\n")
    print("\n\n".join(previews))

    cwd = Path.cwd()
    print("\n" + "-"*70)
    print(f"Cleaned files → {os.path.relpath(DEST_DIR, cwd)}")
    print(f"Filename list  → {os.path.relpath(FILES_TXT, cwd)}")
    print(f"Combined md    → {os.path.relpath(COMBINED, cwd)}")

# entry-point guard
if __name__ == "__main__":
    if not SRC_DIR.exists():
        sys.exit(f"Source directory not found: {SRC_DIR}")
    main()


Sanity check – first change per file:

=== capstone-projects.md ===
BEFORE: (no changes)
AFTER :

=== education_masters-programs_in-person-program.md ===
BEFORE: Read More +
AFTER : ### Career Seminar (noncredit, required)

=== education_masters-programs_ms-in-applied-data-science.md ===
BEFORE: (no changes)
AFTER :

=== education_masters-programs_ms-in-applied-data-science_capstone-projects.md ===
BEFORE: (no changes)
AFTER :

=== education_masters-programs_ms-in-applied-data-science_career-outcomes.md ===
BEFORE: (no changes)
AFTER :

=== education_masters-programs_ms-in-applied-data-science_course-progressions.md ===
BEFORE: 
AFTER : * Introduction to Statistical Concepts (Foundational) Optional

=== education_masters-programs_ms-in-applied-data-science_events-deadlines.md ===
BEFORE: (no changes)
AFTER :

=== education_masters-programs_ms-in-applied-data-science_faqs.md ===
BEFORE: 
AFTER : * When will I receive my Master's in Applied Data Science admission decision?

=== education
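
To make the effect of the cleaning regexes concrete, the sketch below applies three of them to a small sample. The patterns are copied from the script above; `clean_sample`, the sample text, the name "Jane Doe", and the `example.com` URLs are hypothetical.

```python
import re

# Patterns copied from the cleaning script above.
READ_MORE_RX = re.compile(r"^\s*(Read\s+More\s*\+|Show\s+Less\s*-)\s*$", re.I | re.M)
ACCORDION_BULLET_RX = re.compile(
    r'^\s*\*\s*\[([^\]]+?)\]\(#accordion-tab-[^)]+\)[^\n]*$', re.M)
IMAGE_BULLET_RX = re.compile(
    r'^\s*\*\s*\[!\[Image of ([^\]]+?)\]\([^)]+\)[^\n]*$', re.M)

def clean_sample(body: str) -> str:
    """Apply the three substitutions in the same order as clean_body."""
    body = READ_MORE_RX.sub("", body)
    body = ACCORDION_BULLET_RX.sub(lambda m: f"* {m.group(1).strip()}", body)
    body = IMAGE_BULLET_RX.sub(lambda m: f"* {m.group(1).strip()}", body)
    return re.sub(r"\n{3,}", "\n\n", body)

# Hypothetical page fragment containing each kind of residue.
sample = (
    "## Courses\n"
    "* [Core Courses](#accordion-tab-1)\n"
    "* [![Image of Jane Doe](https://example.com/jane.jpg)](https://example.com/jane)\n"
    "Read More +\n"
    "Course details follow.\n"
)

cleaned_sample = clean_sample(sample)
print(cleaned_sample)
```

The accordion link collapses to `* Core Courses`, the image bullet to `* Jane Doe`, and the `Read More +` line becomes an empty line, so the surrounding text is left untouched.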

## Subset 3: Clean iteration 1

In [171]:
"""
Clean subset-3 markdown files (“clean_1” pass) and rebuild combined artifacts
with FULL nav lists – duplicates removed (keep first instance).
"""

from pathlib import Path
import os, re, shutil, sys

# ──────────────────────────────  paths  ────────────────────────────────
def infer_root() -> Path:
    if "__file__" in globals():
        return Path(__file__).resolve().parents[1]
    cwd = Path.cwd()
    for p in [cwd] + list(cwd.parents):
        if (p / "data").is_dir() and (p / "docs").is_dir():
            return p
    return cwd

ROOT      = infer_root()
SRC_DIR   = ROOT / "data" / "markdown_subset_3_filtered"
DEST_DIR  = ROOT / "data" / "markdown_subset_3_clean_1"
DOCS_DIR  = ROOT / "docs"

FILES_TXT = DOCS_DIR / "filenames_subset_3_clean_1.txt"
COMBINED  = DOCS_DIR / "combined_markdown_subset_3_clean_1.md"

# ────────────────────────────  regexes  ────────────────────────────────
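# NOTE: the trailing \b only matches when a word character follows "+" or "-",
# so a bare "Read More +" / "Show Less -" line is left in place by this pass.
RE_READ_MORE    = re.compile(r"\b(Read More \+|Show Less -)\b", re.I)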
RE_HTML_COMMENT = re.compile(r"<!--.*?-->", re.S)
RE_FACET_ARROW  = re.compile(r"facet-arrow-down")
RE_EMPTY_LINES  = re.compile(r"\n{3,}")

RE_ACCORDION    = re.compile(r"\*\s+\[([^\]]+)]\(#[^)]+\)", re.M)
RE_IMG_BULLET   = re.compile(r"^\*\s+\[!\[Image of ([^\]]+)]\([^)]+\)\]", re.M)

RE_INLINE_IMG   = re.compile(r"^\s*!\[.*?]\(.*?\)\s*$", re.M)
RE_LONE_BANG    = re.compile(r"^\s*!\s*$", re.M)

RE_SHARE_BLOCK  = re.compile(
    r"^##\s*Share\s*\n" r"(?:^\s*\*.*\n?)+" r"\s*<!--\s*Table-like structure detected\s*-->\s*\n?",
    re.M,
)

RE_EMAIL        = re.compile(
    r"\[([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})]\(mailto:[^)]+\)"
)

# NEW – match a whole MS-ADS nav block (header + ≥5 bullets)
RE_NAV_BLOCK = re.compile(
    r"^##[^\n]*MS\s+in\s+Applied\s+Data\s+Science[^\n]*\n"  # heading line
    r"(?:^\s*\*.*\n){5,30}",                               # 5-30 bullet lines
    re.M,
)

# ──────────────────────────  cleaning  ────────────────────────────────
def clean_markdown(text: str) -> str:
    header, body = split_header_body(text)

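    body = RE_READ_MORE.sub("", body)
    # Remove share blocks before HTML comments: RE_SHARE_BLOCK ends with the
    # "Table-like structure detected" comment and cannot match once it is stripped.
    body = RE_SHARE_BLOCK.sub("", body)
    body = RE_HTML_COMMENT.sub("", body)
    # Truncates the marker to "facet-arrow-dow"; it is not removed in this pass.
    body = RE_FACET_ARROW.sub("facet-arrow-dow", body)
    body = RE_ACCORDION.sub(r"* \1", body)
    body = RE_IMG_BULLET.sub(r"* \1", body)
    body = RE_INLINE_IMG.sub("", body)
    body = RE_LONE_BANG.sub("", body)
    body = RE_EMAIL.sub(r"\1", body)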

    body = dedupe_nav_blocks(body)
    body = RE_EMPTY_LINES.sub("\n\n", body).strip()

    return f"{header.rstrip()}\n\n{body}\n" if header else f"{body}\n"

# ──────────────────────────  helpers  ────────────────────────────────
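def split_header_body(text: str) -> tuple[str, str]:
    """Split YAML front matter from the body; return ("", text) if there is none."""
    if text.lstrip().startswith("---"):
        _, _, rest = text.partition("---\n")
        header, sep, body = rest.partition("\n---")
        if sep:
            # NOTE: `header` has no trailing newline here, so the closing fence is
            # glued to the last YAML line (e.g. "date: 2025-05-04---").
            return f"---\n{header}---", body.lstrip()
    return "", text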

def dedupe_nav_blocks(md: str) -> str:
    """Keep first nav block; drop any subsequent identical copies."""
    seen_first = False
    pieces = []
    last_end = 0

    for m in RE_NAV_BLOCK.finditer(md):
        start, end = m.span()
        if not seen_first:
            pieces.append(md[last_end:end])  # keep first occurrence
            seen_first = True
        else:
            pieces.append(md[last_end:start])  # skip duplicate
        last_end = end
    pieces.append(md[last_end:])               # tail
    return "".join(pieces)

def copy_and_clean():
    if DEST_DIR.exists():
        shutil.rmtree(DEST_DIR)
    DEST_DIR.mkdir(parents=True)

    previews = []
    for md in sorted(SRC_DIR.glob("*.md")):
        original = md.read_text(encoding="utf-8")
        cleaned  = clean_markdown(original)
        (DEST_DIR / md.name).write_text(cleaned, encoding="utf-8")

        for o, n in zip(original.splitlines(), cleaned.splitlines()):
            if o != n:
                previews.append(
                    f"=== {md.name} ===\nBEFORE: {o[:120]}\nAFTER : {n[:120]}"
                )
                break
    return previews

def write_filename_list():
    FILES_TXT.write_text(
        "\n".join(sorted(p.name for p in DEST_DIR.glob("*.md"))) + "\n",
        encoding="utf-8"
    )

def write_combined_md():
    with COMBINED.open("w", encoding="utf-8") as out:
        for md in sorted(DEST_DIR.glob("*.md")):
            out.write(md.read_text(encoding="utf-8").rstrip() + "\n\n")

# ─────────────────────────────  main  ────────────────────────────────
def main():
    if not SRC_DIR.exists():
        sys.exit(f"Source directory not found: {SRC_DIR}")

    previews = copy_and_clean()
    write_filename_list()
    write_combined_md()

    print("\nSanity check – first change per file:\n")
    print("\n".join(previews))

    print("\n" + "-"*70)
    cwd = Path.cwd()
    print(f"Cleaned files → {os.path.relpath(DEST_DIR, cwd)}")
    print(f"Filename list  → {os.path.relpath(FILES_TXT, cwd)}")
    print(f"Combined md    → {os.path.relpath(COMBINED, cwd)}")

if __name__ == "__main__":
    main()



Sanity check – first change per file:

=== capstone-projects.md ===
BEFORE: date: 2025-05-04
AFTER : date: 2025-05-04---
=== education_masters-programs_in-person-program.md ===
BEFORE: date: 2025-05-04
AFTER : date: 2025-05-04---
=== education_masters-programs_ms-in-applied-data-science.md ===
BEFORE: date: 2025-05-04
AFTER : date: 2025-05-04---
=== education_masters-programs_ms-in-applied-data-science_capstone-projects.md ===
BEFORE: date: 2025-05-04
AFTER : date: 2025-05-04---
=== education_masters-programs_ms-in-applied-data-science_career-outcomes.md ===
BEFORE: date: 2025-05-04
AFTER : date: 2025-05-04---
=== education_masters-programs_ms-in-applied-data-science_course-progressions.md ===
BEFORE: date: 2025-05-04
AFTER : date: 2025-05-04---
=== education_masters-programs_ms-in-applied-data-science_events-deadlines.md ===
BEFORE: date: 2025-05-04
AFTER : date: 2025-05-04---
=== education_masters-programs_ms-in-applied-data-science_faqs.md ===
BEFORE: date: 2025-05-04
AFTER : date:
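
The previews above show the closing `---` fence ending up on the same line as the last YAML field (`date: 2025-05-04---`). The sketch below shows one way to split front matter so that the fence stays on its own line. `split_front_matter` is a hypothetical helper, not a function from this notebook, and it assumes the opening fence is the very first line of the file.

```python
def split_front_matter(text: str) -> tuple[str, str]:
    """Return (front_matter, body); front_matter is "" when none is found.

    Hypothetical helper for illustration: assumes the opening fence is the
    first line and the closing fence is a line containing only '---'.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            header = "\n".join(lines[: i + 1])            # both fences included
            body = "\n".join(lines[i + 1 :]).lstrip("\n")
            return header, body
    return "", text                                        # no closing fence found


# Invented sample, shaped like the front matter in the scraped files.
sample = "---\ntitle: Demo\ndate: 2025-05-04\n---\n\n## Heading\n\nBody text.\n"
header, body = split_front_matter(sample)
print(header.splitlines()[-2:])   # last YAML field and the fence, as separate lines
```

Working line by line avoids any string arithmetic around the fence, at the cost of normalising line endings to `\n`.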

In [None]:
import re
from pathlib import Path

# Define the regex to match the share block starting with "## Share"
RE_SHARE_BLOCK = re.compile(
    r"^##\s*Share\s*\n"       # Match "## Share" at beginning of line
    r"(?:^\s*\*.*\n?)+",       # Followed by one or more bullet lines
    re.M
)

# Path to the clean_1 directory
CLEAN_DIR = Path("../data/markdown_subset_3_clean_1")

for md_file in CLEAN_DIR.glob("*.md"):
    text = md_file.read_text(encoding="utf-8")
    # Remove the share block matching the regex
    cleaned_text = RE_SHARE_BLOCK.sub("", text)
    md_file.write_text(cleaned_text, encoding="utf-8")
    print(f"Processed {md_file.name}")

Processed education_masters-programs_ms-in-applied-data-science_course-progressions.md
Processed news_masters-in-applied-data-science-summer-2024-capstone-winners.md
Processed capstone-projects.md
Processed people_igor-yakushin-phd.md
Processed people_daniel-truesdale.md
Processed people_greg-green.md
Processed people_dmitri-sidorov.md
Processed people_francisco-azeredo-phd.md
Processed people_samantha-kruse.md
Processed people_kendall-cox.md
Processed education_masters-programs_ms-in-applied-data-science_tuition-fees-aid.md
Processed people_jose-alvarado.md
Processed people_stephen-barry.md
Processed education_masters-programs_ms-in-applied-data-science.md
Processed news_jan-mrozowskis-path-from-industry-professional-to-teaching-in-the-ms-in-applied-data-science-program.md
Processed people_ej-kang.md
Processed people_maggie-wu.md
Processed people_nick-diantonio-m-ed.md
Processed people_gizem-agar-phd.md
Processed people_dujuan-smith.md
Processed test-course-progressions.md
Processed p
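
To make the effect of the share-block pattern concrete, here is a small check on an invented sample (the text is made up, not taken from the scraped files). It also illustrates one caveat worth keeping in mind: because `\s*` matches newlines, the bullet group can continue across a blank line.

```python
import re

# Same pattern as the cell above: a "## Share" heading plus its bullet lines.
RE_SHARE_BLOCK = re.compile(r"^##\s*Share\s*\n(?:^\s*\*.*\n?)+", re.M)

# Invented sample text.
sample = (
    "Intro paragraph.\n\n"
    "## Share\n"
    "* Facebook\n"
    "* LinkedIn\n"
    "\n"
    "## Next section\n"
)
cleaned = RE_SHARE_BLOCK.sub("", sample)
print(repr(cleaned))

# Caveat: \s* also matches newlines, so a bullet list separated from the
# share block only by a blank line is consumed as well.
greedy_case = RE_SHARE_BLOCK.sub("", "## Share\n* a\n\n* unrelated bullet\n")
print(repr(greedy_case))
```

If that second behaviour is unwanted, replacing `\s*` inside the bullet group with `[ \t]*` would be one option to try.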

In [None]:
import re
from pathlib import Path

# Regex to match any HTML tag (e.g. <br />, <p>, etc.)
RE_HTML_TAG = re.compile(r"<[^>]+>")

# Path to the clean_1 directory
CLEAN_DIR = Path("../data/markdown_subset_3_clean_1")

for md_file in CLEAN_DIR.glob("*.md"):
    text = md_file.read_text(encoding="utf-8")
    # Remove all HTML tags
    cleaned_text = RE_HTML_TAG.sub("", text)
    md_file.write_text(cleaned_text, encoding="utf-8")
    print(f"Processed {md_file.name}")

Processed education_masters-programs_ms-in-applied-data-science_course-progressions.md
Processed news_masters-in-applied-data-science-summer-2024-capstone-winners.md
Processed capstone-projects.md
Processed people_igor-yakushin-phd.md
Processed people_daniel-truesdale.md
Processed people_greg-green.md
Processed people_dmitri-sidorov.md
Processed people_francisco-azeredo-phd.md
Processed people_samantha-kruse.md
Processed people_kendall-cox.md
Processed education_masters-programs_ms-in-applied-data-science_tuition-fees-aid.md
Processed people_jose-alvarado.md
Processed people_stephen-barry.md
Processed education_masters-programs_ms-in-applied-data-science.md
Processed news_jan-mrozowskis-path-from-industry-professional-to-teaching-in-the-ms-in-applied-data-science-program.md
Processed people_ej-kang.md
Processed people_maggie-wu.md
Processed people_nick-diantonio-m-ed.md
Processed people_gizem-agar-phd.md
Processed people_dujuan-smith.md
Processed test-course-progressions.md
Processed p
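
The tag pattern above removes anything between `<` and `>`, which has two side effects that are easy to miss: words separated only by `<br />` get glued together, and autolinks such as `<https://…>` disappear. The sketch below shows both on an invented sentence. `RE_BR` and `strip_tags` are hypothetical additions, not part of the notebook.

```python
import re

RE_HTML_TAG = re.compile(r"<[^>]+>")          # as in the cell above
RE_BR = re.compile(r"<br\s*/?>", re.I)        # hypothetical extra pattern


def strip_tags(text: str) -> str:
    """Turn <br> into a space first, then drop every remaining tag."""
    return RE_HTML_TAG.sub("", RE_BR.sub(" ", text))


# Invented sample text.
sample = "Tuition<br />is billed <strong>per course</strong>. See <https://example.com>."
plain = RE_HTML_TAG.sub("", sample)
spaced = strip_tags(sample)
print(plain)    # "Tuition" and "is" are joined; the autolink is gone
print(spaced)   # the line break becomes a space; the autolink is still gone
```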

In [None]:
import re
from pathlib import Path

# Define the directory containing your news markdown files.
NEWS_DIR = Path("../data/markdown_subset_3_clean_1")

# Define a regex that matches a newline followed by "Related" and everything thereafter.
RE_RELATED = re.compile(r"\nRelated[\s\S]*", re.DOTALL)

# Process each markdown file that starts with "news_"
for md_file in NEWS_DIR.glob("news_*.md"):
    text = md_file.read_text(encoding="utf-8")
    # Remove the "Related" section: everything from the first occurrence of "\nRelated" to the end.
    cleaned_text = RE_RELATED.sub("", text)
    md_file.write_text(cleaned_text, encoding="utf-8")
    print(f"Processed {md_file.name}")

Processed news_masters-in-applied-data-science-summer-2024-capstone-winners.md
Processed news_jan-mrozowskis-path-from-industry-professional-to-teaching-in-the-ms-in-applied-data-science-program.md
Processed news_university-of-chicago-launches-joint-ms-in-applied-data-science-and-booth-mba-program.md
Processed news_the-ms-in-applied-data-science-program-celebrates-10-years.md
Processed news_masters-in-applied-data-science-autumn-2023-capstone-winners.md
Processed news_masters-in-applied-data-science-autumn-2024-capstone-showcase.md
Processed news_masters-in-applied-data-science-summer-2023-capstone-winner.md
Processed news_ms-in-applied-data-science-students-present-research-at-national-conference.md
Processed news_a-day-at-google-for-uchicagos-ms-in-applied-data-science-students.md
Processed news_applying-to-the-ms-in-applied-data-science-program-heres-what-we-look-for.md
Processed news_masters-in-applied-data-science-winter-2025-capstone-winners.md
Processed news_masters-in-applied-d
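
The `Related` pattern truncates a file at the first line that begins with the word "Related", whether that line is a section label or the start of an ordinary sentence. The sketch below contrasts it with a stricter variant on invented text. `RE_RELATED_STRICT` is a hypothetical alternative, shown only to illustrate the difference.

```python
import re

RE_RELATED = re.compile(r"\nRelated[\s\S]*")                 # as in the cell above
# Hypothetical stricter variant: "Related" must be alone on its line,
# optionally written as a markdown heading.
RE_RELATED_STRICT = re.compile(r"\n(?:#+\s*)?Related\s*\n[\s\S]*")

# Invented samples.
article = "# Capstone winners\n\nCongratulations to the teams.\n\nRelated\n\n* Another story\n"
prose = "Intro.\nRelated coursework is listed below.\nMore text.\n"

article_cut = RE_RELATED.sub("", article)
prose_cut = RE_RELATED.sub("", prose)
prose_strict = RE_RELATED_STRICT.sub("", prose)

print(repr(article_cut))    # footer removed
print(repr(prose_cut))      # body text lost as well
print(repr(prose_strict))   # unchanged
```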

In [None]:
import re
from pathlib import Path

# Regex that matches "## Follow" followed by "MS in Applied Data Science"
RE_FOLLOW_BLOCK = re.compile(
    r"^##\s*Follow\s*\n\s*MS in Applied Data Science\s*\n?",
    re.M
)

# Define the directory containing your education_ md files
EDU_DIR = Path("../data/markdown_subset_3_clean_1")

for md_file in EDU_DIR.glob("education_*.md"):
    text = md_file.read_text(encoding="utf-8")
    cleaned_text = RE_FOLLOW_BLOCK.sub("", text)
    md_file.write_text(cleaned_text, encoding="utf-8")
    print(f"Processed {md_file.name}")

Processed education_masters-programs_ms-in-applied-data-science_course-progressions.md
Processed education_masters-programs_ms-in-applied-data-science_tuition-fees-aid.md
Processed education_masters-programs_ms-in-applied-data-science.md
Processed education_masters-programs_ms-in-applied-data-science_our-students.md
Processed education_masters-programs_ms-in-applied-data-science_instructors-staff.md
Processed education_masters-programs_ms-in-applied-data-science_in-person-program.md
Processed education_masters-programs_ms-in-applied-data-science_online-program.md
Processed education_masters-programs_ms-in-applied-data-science_faqs.md
Processed education_masters-programs_in-person-program.md
Processed education_masters-programs_ms-in-applied-data-science_career-outcomes.md
Processed education_summerlab_alumni-and-projects.md
Processed education_masters-programs_online-program.md
Processed education_masters-programs_ms-in-applied-data-science_events-deadlines.md
Processed education_maste

In [None]:
import re
from pathlib import Path

# Define the directory containing your education_ markdown files
EDU_DIR = Path("../data/markdown_subset_3_clean_1")

# This regex looks for "* [![Image" not already preceded by a blank line.
PATTERN = re.compile(r"(?<!\n\n)(\*\s+\[!\[Image)", re.M)

for md_file in EDU_DIR.glob("education_*.md"):
    text = md_file.read_text(encoding="utf-8")
    new_text = PATTERN.sub(r"\n\1", text)
    md_file.write_text(new_text, encoding="utf-8")
    print(f"Processed {md_file.name}")

Processed education_masters-programs_ms-in-applied-data-science_course-progressions.md
Processed education_masters-programs_ms-in-applied-data-science_tuition-fees-aid.md
Processed education_masters-programs_ms-in-applied-data-science.md
Processed education_masters-programs_ms-in-applied-data-science_our-students.md
Processed education_masters-programs_ms-in-applied-data-science_instructors-staff.md
Processed education_masters-programs_ms-in-applied-data-science_in-person-program.md
Processed education_masters-programs_ms-in-applied-data-science_online-program.md
Processed education_masters-programs_ms-in-applied-data-science_faqs.md
Processed education_masters-programs_in-person-program.md
Processed education_masters-programs_ms-in-applied-data-science_career-outcomes.md
Processed education_summerlab_alumni-and-projects.md
Processed education_masters-programs_online-program.md
Processed education_masters-programs_ms-in-applied-data-science_events-deadlines.md
Processed education_maste
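
The negative lookbehind `(?<!\n\n)` is what makes this step safe to re-run: a bullet that already follows a blank line is left alone. A minimal check on invented input (the image path and URL are placeholders):

```python
import re

PATTERN = re.compile(r"(?<!\n\n)(\*\s+\[!\[Image)")   # as in the cell above

# Invented samples: the same bullet without and with a preceding blank line.
tight = "Instructors\n* [![Image of A](a.png)](https://example.com/a)\n"
loose = "Instructors\n\n* [![Image of A](a.png)](https://example.com/a)\n"

once = PATTERN.sub(r"\n\1", tight)    # a blank line is inserted
twice = PATTERN.sub(r"\n\1", once)    # second run changes nothing

print(repr(once))
print(once == twice)
```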

In [None]:
import re
from pathlib import Path

# Pattern 1: Remove from a line starting with "* [![Image of " up to the first closing parenthesis ")".
pattern1 = re.compile(r"(?s)^\*\s+\[!\[Image of .*?\)", re.M)

# Pattern 2: Remove any text from "[Image:" to the next "]".
pattern2 = re.compile(r"\[Image:[^\]]+\]")

# Pattern 3: Collapse three or more consecutive newlines into two.
pattern3 = re.compile(r"\n{3,}")

# Pattern 4: Remove extra newlines for the Kristin McCann block.
pattern4 = re.compile(
    r"\n\s*# Kristin McCann, PhD\s*\n\s*Chief of Staff, Executive Director, "
    r"MS in Applied Data Science\]\(https://datascience\.uchicago\.edu/people/kristin-i-mccann/\)\s*(\n\s*)+",
    re.M
)

# Pattern 5: Delete all newline characters immediately following a URL starting with
# "(https://datascience.uchicago.edu/people/" and ending with ")"
# when the next non‑whitespace character is a "#".
pattern5 = re.compile(r"(\(https://datascience\.uchicago\.edu/people/[^)]+\))\n+(?=\s*#)", re.M)

# Pattern 6: Delete the specific bullet lines for Overview, In‑Person Program, etc.
pattern6 = re.compile(
    r"^\*\s+\[(?:Overview|In-Person Program|Online Program|Capstone Projects|Course Progressions|How to Apply|"
    r"Events & Deadlines|Tuition, Fees, & Aid|Our Students|Faculty, Instructors, Staff|FAQs|"
    r"Explore the MS-ADS Campus|Career Outcomes|Get In Touch)\]"
    r"\(https://(?:datascience\.uchicago\.edu/education/masters-programs/(?:ms-in-applied-data-science/|in-person-program/|online-program/|ms-in-applied-data-science/capstone-projects/|ms-in-applied-data-science/course-progressions/|ms-in-applied-data-science/how-to-apply/|ms-in-applied-data-science/events-deadlines/|ms-in-applied-data-science/our-students/|ms-in-applied-data-science/instructors-staff/|ms-in-applied-data-science/faqs/|tuition-fees-aid/)?"
    r"|datascience\.uchicago\.edu/explore-the-ms-ads-campus/|datascience\.uchicago\.edu/education/masters-programs/ms-in-applied-data-science/career-outcomes/|apply-psd\.uchicago\.edu/register/\?id=[^)]+)\)"
    r"\s*\n?", re.M
)

# Pattern 7: Delete the specific block:
# "## MS in Applied Data Science facet-arrow-dow
# 
# * [Tuition, Fees, & Aid](https://datascience.uchicago.edu/education/tuition-fees-aid/)"
pattern7 = re.compile(
    r"^##\s+MS\s+in\s+Applied\s+Data\s+Science\s+facet-arrow-dow\s*\n\s*\*\s+\[Tuition, Fees, & Aid\]\(https://datascience\.uchicago\.edu/education/tuition-fees-aid/\)\s*\n?",
    re.M
)

# Pattern 8: Delete the "]" in a substring "](https://" by replacing "](https://" with "(https://"
pattern8 = re.compile(r"\]\(https://")

# Define the folder path containing your markdown files.
folder_path = Path("../data/markdown_subset_3_clean_1")

for md_file in folder_path.glob("*.md"):
    original_text = md_file.read_text(encoding="utf-8")
    
    # Apply the substitutions in sequence
    modified_text = pattern1.sub("", original_text)
    modified_text = pattern2.sub("", modified_text)
    modified_text = pattern4.sub("\n", modified_text)
    modified_text = pattern3.sub("\n\n", modified_text)
    modified_text = pattern5.sub(r"\1", modified_text)
    modified_text = pattern6.sub("", modified_text)
    modified_text = pattern7.sub("", modified_text)
    modified_text = pattern8.sub("(https://", modified_text)
    
    print("===== File:", md_file.name, "=====")
    print("===== BEFORE =====")
    print(original_text)
    print("\n===== AFTER =====")
    print(modified_text)
    
    # Write the modified text back to the file
    md_file.write_text(modified_text, encoding="utf-8")
    print("File", md_file.name, "has been updated.\n")

===== File: education_masters-programs_ms-in-applied-data-science_course-progressions.md =====
===== BEFORE =====
---
title: Course Progressions – DSI
original_url: https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/course-progressions
category: education
date: 2025-05-04---

## MS in Applied Data Science facet-arrow-dow

* [Overview](https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/)
* [In-Person Program](https://datascience.uchicago.edu/education/masters-programs/in-person-program/)
* [Online Program](https://datascience.uchicago.edu/education/masters-programs/online-program/)
* [Capstone Projects](https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/capstone-projects/)
* [Course Progressions](https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/course-progressions/)
* [How to Apply](https://datascience.uchicago.edu/education/masters-programs/m

In [None]:
import os
import shutil
import re
from pathlib import Path

# Define source (clean_1) and destination (clean_2) directories
source_dir = Path("../data/markdown_subset_3_clean_1")
dest_dir = Path("../data/markdown_subset_3_clean_2")

# Create destination directory if it doesn't exist
dest_dir.mkdir(parents=True, exist_ok=True)

# Copy all Markdown files from clean_1 to clean_2
for file in source_dir.glob("*.md"):
    dst_file = dest_dir / file.name
    shutil.copy2(file, dst_file)
    print(f"Copied {file.name} to {dst_file}")

# ----- Define flexible removal patterns for CSS blocks -----
# Pattern for the first CSS block
pattern_css1 = re.compile(
    r"body\s*\{\s*font-family:\s*Arial,\s*sans-serif;\s*\}\s*\.container\s*\{[^}]+\}\s*iframe\s*\{[^}]+\}\s*\.logo-section\s*\{[^}]+\}\s*\.logo-grid\s*\{[^}]+\}\s*\.logo-item\s*\{[^}]+\}\s*\.logo-item\s+img\s*\{[^}]+\}\s*\.logo-item:hover\s*\{[^}]+\}",
    re.IGNORECASE | re.DOTALL
)
# Pattern for the second CSS block (with video-container)
pattern_css2 = re.compile(
    r"body\s*\{\s*font-family:\s*Arial,\s*sans-serif;\s*\}\s*\.container\s*\{[^}]+\}\s*/\*\s*Large YouTube Video\s*\*/\s*\.video-container\s*\{[^}]+\}\s*\.video-container\s+iframe\s*\{[^}]+\}\s*\.logo-section\s*\{[^}]+\}\s*\.logo-grid\s*\{[^}]+\}\s*\.logo-item\s*\{[^}]+\}\s*\.logo-item\s+img\s*\{[^}]+\}\s*\.logo-item:hover\s*\{[^}]+\}",
    re.IGNORECASE | re.DOTALL
)

# Other patterns to remove specific sections, as before
pattern_follow = re.compile(r"^##\s*Follow\s*\n", re.M)
pattern_graduates = re.compile(r"^#\s*\*\*MS-ADS program graduates work at the following places:\*\*.*\n?", re.M)
pattern_line_to_remove = re.compile(
    re.escape(
        r'body { font-family: Arial, sans-serif; } .logo-section { text-align: center; margin: 40px 0; } .logo-grid { display: grid; grid-template-columns: repeat(5, 1fr); gap: 20px; max-width: 1000px; margin: 0 auto; /* Ensures center alignment */ padding: 20px; } .logo-item { display: flex; align-items: center; justify-content: center; background: #f8f8f8; border-radius: 8px; padding: 15px; height: 120px; width: 180px; } .logo-item img { max-height: 80px; max-width: 160px; width: auto; height: auto; object-fit: contain; } .logo-item img[alt="Amazon"], .logo-item img[alt="CVS Health"] { max-height: 100px; max-width: 180px; } .logo-item:hover { transform: scale(1.1); }'
    ),
    re.M
)
pattern_get_in_touch = re.compile(
    r"##\s*Get in Touch\s*\n\s*Loading…\s*\n\s*/\\\*\\\*/\s*\n\s*Read More \+\s*\n\s*Show Less -",
    re.DOTALL
)
pattern_read_more_show_less = re.compile(r"\n?\s*Read More \+\s*Show Less -", re.DOTALL)

# ----- Function to process a single file without inserting YAML -----
def process_file(file_path: Path):
    original_text = file_path.read_text(encoding="utf-8")
    text = original_text

    # Remove both CSS block variants
    text = pattern_css1.sub("", text)
    text = pattern_css2.sub("", text)

    text = pattern_follow.sub("", text)
    text = pattern_graduates.sub("", text)
    text = pattern_line_to_remove.sub("", text)
    text = pattern_get_in_touch.sub("", text)
    text = pattern_read_more_show_less.sub("", text)
    
    print(f"===== File: {file_path.name} =====")
    print("----- BEFORE -----")
    print(original_text)
    print("----- AFTER -----")
    print(text)
    print("\n")
    
    file_path.write_text(text, encoding="utf-8")

def main():
    # Process each Markdown file in the clean_2 folder
    for md_file in dest_dir.glob("*.md"):
        process_file(md_file)

if __name__ == "__main__":
    main()

Copied education_masters-programs_ms-in-applied-data-science_course-progressions.md to /Users/danielkim/gen-ai-midterm-project/data/markdown_subset_3_clean_2/education_masters-programs_ms-in-applied-data-science_course-progressions.md
Copied news_masters-in-applied-data-science-summer-2024-capstone-winners.md to /Users/danielkim/gen-ai-midterm-project/data/markdown_subset_3_clean_2/news_masters-in-applied-data-science-summer-2024-capstone-winners.md
Copied capstone-projects.md to /Users/danielkim/gen-ai-midterm-project/data/markdown_subset_3_clean_2/capstone-projects.md
Copied people_igor-yakushin-phd.md to /Users/danielkim/gen-ai-midterm-project/data/markdown_subset_3_clean_2/people_igor-yakushin-phd.md
Copied people_daniel-truesdale.md to /Users/danielkim/gen-ai-midterm-project/data/markdown_subset_3_clean_2/people_daniel-truesdale.md
Copied people_greg-green.md to /Users/danielkim/gen-ai-midterm-project/data/markdown_subset_3_clean_2/people_greg-green.md
Copied people_dmitri-sidorov

In [210]:
import re
from pathlib import Path

# Directory containing your markdown files
MD_DIR = Path("/Users/danielkim/gen-ai-midterm-project/data/markdown_subset_3_clean_2")

# Regex: matches lines like "##   " or "###    " with nothing else after
EMPTY_HEADING = re.compile(r"^#{2,}\s*$", re.MULTILINE)

for md_file in MD_DIR.glob("*.md"):
    text = md_file.read_text(encoding="utf-8")
    cleaned = EMPTY_HEADING.sub("", text)
    md_file.write_text(cleaned, encoding="utf-8")
    print(f"Processed {md_file.name}")

Processed education_masters-programs_ms-in-applied-data-science_course-progressions.md
Processed news_masters-in-applied-data-science-summer-2024-capstone-winners.md
Processed capstone-projects.md
Processed people_igor-yakushin-phd.md
Processed people_daniel-truesdale.md
Processed people_greg-green.md
Processed people_dmitri-sidorov.md
Processed people_francisco-azeredo-phd.md
Processed people_samantha-kruse.md
Processed people_kendall-cox.md
Processed education_masters-programs_ms-in-applied-data-science_tuition-fees-aid.md
Processed people_jose-alvarado.md
Processed people_stephen-barry.md
Processed education_masters-programs_ms-in-applied-data-science.md
Processed news_jan-mrozowskis-path-from-industry-professional-to-teaching-in-the-ms-in-applied-data-science-program.md
Processed people_ej-kang.md
Processed people_maggie-wu.md
Processed people_nick-diantonio-m-ed.md
Processed people_gizem-agar-phd.md
Processed people_dujuan-smith.md
Processed test-course-progressions.md
Processed p

In [212]:
import re
from pathlib import Path

# ── locate markdown directory (relative to this script or current cwd) ──
def project_root() -> Path:
    if "__file__" in globals():          # running as a script
        return Path(__file__).resolve().parent
    return Path.cwd()                    # likely a notebook / REPL

ROOT   = project_root()
MD_DIR = (ROOT / "../data/markdown_subset_3_clean_2").resolve()

# ── regexes ────────────────────────────────────────────────────────────
EMPTY_HEADING   = re.compile(r"^#{2,}\s*$", re.MULTILINE)          # “##   ”
HASH_PAREN_LINE = re.compile(r"^\s*\(#\)\s*$", re.MULTILINE)       # “(#)”
ANCHOR_LINK     = re.compile(r"\[([^\]]+)]\s*\(#\)")               # “[text](#)”

# ── clean every markdown file ─────────────────────────────────────────
for md_file in MD_DIR.glob("*.md"):
    text     = md_file.read_text(encoding="utf-8")
    cleaned  = EMPTY_HEADING.sub("", text)
    cleaned  = HASH_PAREN_LINE.sub("", cleaned)
    cleaned  = ANCHOR_LINK.sub(r"\1", cleaned)   # keep text, drop link + brackets
    md_file.write_text(cleaned, encoding="utf-8")
    print(f"Processed {md_file.name}")


Processed education_masters-programs_ms-in-applied-data-science_course-progressions.md
Processed news_masters-in-applied-data-science-summer-2024-capstone-winners.md
Processed capstone-projects.md
Processed people_igor-yakushin-phd.md
Processed people_daniel-truesdale.md
Processed people_greg-green.md
Processed people_dmitri-sidorov.md
Processed people_francisco-azeredo-phd.md
Processed people_samantha-kruse.md
Processed people_kendall-cox.md
Processed education_masters-programs_ms-in-applied-data-science_tuition-fees-aid.md
Processed people_jose-alvarado.md
Processed people_stephen-barry.md
Processed education_masters-programs_ms-in-applied-data-science.md
Processed news_jan-mrozowskis-path-from-industry-professional-to-teaching-in-the-ms-in-applied-data-science-program.md
Processed people_ej-kang.md
Processed people_maggie-wu.md
Processed people_nick-diantonio-m-ed.md
Processed people_gizem-agar-phd.md
Processed people_dujuan-smith.md
Processed test-course-progressions.md
Processed p
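
A quick way to see what the three patterns in this cell do together is to run them on a tiny invented snippet. Note that the substitutions delete matched text but not the surrounding newlines, so blank lines are left behind.

```python
import re

# Patterns copied from the cell above.
EMPTY_HEADING   = re.compile(r"^#{2,}\s*$", re.MULTILINE)
HASH_PAREN_LINE = re.compile(r"^\s*\(#\)\s*$", re.MULTILINE)
ANCHOR_LINK     = re.compile(r"\[([^\]]+)]\s*\(#\)")

# Invented sample: an empty heading, a placeholder link, a stray "(#)" line.
sample = "## \n[Core Courses](#)\n(#)\nBody\n"

cleaned = EMPTY_HEADING.sub("", sample)
cleaned = HASH_PAREN_LINE.sub("", cleaned)
cleaned = ANCHOR_LINK.sub(r"\1", cleaned)
print(repr(cleaned))
```

The leftover blank lines are harmless as long as a later pass collapses runs of newlines.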

In [None]:
import os
import re
import shutil
from pathlib import Path

# ----------------------------------------------------------------------
# source / destination
src_dir = Path("../data/markdown_subset_3_clean_2")
dst_dir = Path("../data/markdown_subset_3_clean_3")
dst_dir.mkdir(parents=True, exist_ok=True)

exclude_file = "education_summerlab_alumni-and-projects.md"

# ----------------------------------------------------------------------
# helpers
def fix_yaml(text: str) -> str:
    """Put the closing YAML fence on its own line and end the front matter with a newline."""
    m = re.match(r"(?s)^---.*?---", text)
    if not m:
        return text
    yaml = m.group(0)
    yaml_fixed = re.sub(r"([^\n])---", r"\1\n---", yaml)  # make sure final fence starts on its own line
    if not yaml_fixed.endswith("\n"):
        yaml_fixed += "\n"
    return yaml_fixed + text[len(yaml):]


def ensure_single_blank_after_header(body: str) -> str:
    return re.sub(r"^(#{1,6} .+?)\n{2,}", r"\1\n\n", body, flags=re.MULTILINE)


def remove_unmatched_brackets(line: str) -> str:
    line = re.sub(r"\([^\)]*$", lambda m: m.group(0).replace("(", ""), line)
    line = re.sub(r"\[[^\]]*$", lambda m: m.group(0).replace("[", ""), line)
    return line


def fix_malformed_links(body: str) -> str:
    # “[Text](https://…)” is fine.  “[Text(https://…)” → “[Text](https://…)”
    pattern = re.compile(r"\[([^\[\]()]+)\((https?://[^\s\)]+)\)")
    return pattern.sub(r"[\1](\2)", body)


def ensure_blank_before_bullet(body: str) -> str:
    lines, out = body.splitlines(), []
    for i, ln in enumerate(lines):
        if re.match(r"^\s*[\*\-\+]\s+", ln) and i > 0:
            prev = lines[i - 1]
            if prev.strip() and not re.match(r"^\s*[\*\-\+]\s+", prev) and not prev.strip().startswith("#"):
                out.append("")
        out.append(ln)
    return "\n".join(out)


# ----------------------------------------------------------------------
# master cleaner
def clean_markdown_content(text: str) -> str:
    text = fix_yaml(text)

    # ——— split YAML / body ———
    m = re.match(r"(?s)^---.*?---\s*", text)
    yaml = m.group(0) if m else ""
    body = text[len(yaml):] if yaml else text

    # ------------------------------------------------------------------
    # PATTERN-BASED REMOVALS / REPLACEMENTS
    patterns = [
        r"^\s*[\*\-\+]\s*$\n?",                         # bullet only
        r"^\[.*?\([^\)]*$\n?",                          # broken links
        r"^\[!\[.*?\([^\)]*$\n?",                       # broken image links
        r"^\s*!\s*$\n?",                                # lone !
        r"^\[\s*\n?", r"^\(\s*\n?", r"^\{\s*\n?", r"^<\s*\n?",  # stray opening brackets
        r"^.*[\[\(\{<]\s*$\n?",                         # line ending with an open bracket
        # social / share / follow / misc footers
        r"(?mi)^\s*##?\s*(share|follow|contact info|linkedin|email|office|"
        r"register|apply|deadline|event|instagram|tiktok|youtube|facebook|twitter).*\n"
        r"(?:.*\n)*?",                                   # lazy and last in the pattern, so it matches zero extra lines
        r"(?mi)^\s*(linkedin|email|office|register|apply|deadline|event|"
        r"instagram|tiktok|youtube|facebook|twitter).*\n",
        r"\*{1,2}Follow us.*",
        r"(?mi)^\s*contact info.*\n",
        r"(?mi)^\s*email.*\n",
        r"(?mi)^\s*office.*\n",
        r"(?mi)^\s*linkedin.*\n",
        r"^.*Loading….*$\n?",
        # NEW: image-only URL lines
        r"^\s*https?://\S+\.(?:jpg|jpeg|png|gif|svg)(?:\?\S*)?\s*$\n?",
        # NEW: lines starting with ! that are **not** ![alt](...)
        r"^\s*!(?!\[).*$\n?",
        # NEW: remove facet-arrow-dow placeholder
        r"facet-arrow-dow",
    ]
    for pat in patterns:
        body = re.sub(pat, "", body, flags=re.MULTILINE)

    # fix malformed markdown links
    body = fix_malformed_links(body)

    # replace unicode escape leftovers
    replacements = {
        "\\u2013": "-", "\\u2014": "-", "\\u201c": '"', "\\u201d": '"',
        "\\u2018": "'", "\\u2019": "'", "\\u00a0": " ", "\\u2026": "...",
        "\\u200b": "",
    }
    for k, v in replacements.items():
        body = body.replace(k, v)

    # drop any literal \uXXXX escapes that remain; this must run after the
    # replacements above, or they would never find anything to replace
    body = re.sub(r"\\u[\da-fA-F]{4}", "", body)

    # strip trailing spaces per line
    body = "\n".join(ln.rstrip() for ln in body.splitlines())

    # collapse >2 blank lines
    body = re.sub(r"\n{2,}", "\n\n", body)

    # remove unmatched [ ( at EOL
    body = "\n".join(remove_unmatched_brackets(ln) for ln in body.splitlines())

    # ensure exactly one blank after YAML
    if yaml:
        body = body.lstrip("\n")
        cleaned = yaml.rstrip("\n") + "\n\n" + body
    else:
        cleaned = body.lstrip("\n")

    # one blank after any header
    cleaned = ensure_single_blank_after_header(cleaned)

    # blank line before any bullet following a text line
    cleaned = ensure_blank_before_bullet(cleaned)

    # final collapse of blank lines & removal of bullet-only lines
    cleaned = re.sub(r"\n{2,}", "\n\n", cleaned).strip()
    cleaned = re.sub(r"^\s*[\*\-\+]\s*$\n?", "", cleaned, flags=re.MULTILINE)

    return cleaned


# ----------------------------------------------------------------------
# MAIN
for file in src_dir.glob("*.md"):
    if file.name == exclude_file:
        print(f"Skipping excluded file: {file.name}")
        continue

    dst_file = dst_dir / file.name
    shutil.copy(file, dst_file)

    raw = dst_file.read_text(encoding="utf-8")
    cleaned = clean_markdown_content(raw)
    dst_file.write_text(cleaned, encoding="utf-8")

    # preview first 600 chars before / after
    print("\n" + "=" * 40)
    print(f"File: {dst_file.name}")
    print("=" * 40)
    print("BEFORE:\n", raw[:600], "\n...\n")
    print("AFTER:\n", cleaned[:600], "\n...\n")

print("Done! All files cleaned and extra artefacts removed.")
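
One interaction between passes is worth spelling out: `pattern8` in the clean_1 pass strips the `]` from `](https://`, and `fix_malformed_links` in this pass puts it back. The sketch below runs both regexes on invented links (the `example.com` URLs are placeholders) and shows a case the repair does not cover: link text that itself contains parentheses.

```python
import re

# pattern8 from the clean_1 pass
STRIP_BRACKET = re.compile(r"\]\(https://")
# regex used inside fix_malformed_links
REPAIR_LINK = re.compile(r"\[([^\[\]()]+)\((https?://[^\s\)]+)\)")

# Invented links.
simple = "[How to Apply](https://example.com/apply/)"
nested = "[Tuition (2025)](https://example.com/tuition/)"

simple_broken = STRIP_BRACKET.sub("(https://", simple)
nested_broken = STRIP_BRACKET.sub("(https://", nested)

simple_fixed = REPAIR_LINK.sub(r"[\1](\2)", simple_broken)
nested_fixed = REPAIR_LINK.sub(r"[\1](\2)", nested_broken)

print(simple_fixed == simple)          # round trip restores the link
print(nested_fixed == nested_broken)   # parentheses in the text block the repair
```

The link text group excludes `(` and `)`, which is why the second link stays broken; widening it would need care to avoid matching across unrelated parentheses.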


In [None]:
# Define directories and output file paths
markdown_folder = '../data/markdown_subset_3_clean_3'
filenames_output = '../docs/filenames_subset_3_clean_3.txt'
combined_output = '../docs/combined_markdown_subset_3_clean_3.md'

# Write the list of file names from markdown folder into a text file
with open(filenames_output, 'w', encoding='utf-8') as f_out:
    for fname in sorted(os.listdir(markdown_folder)):
        filepath = os.path.join(markdown_folder, fname)
        if os.path.isfile(filepath):
            f_out.write(f"{fname}\n")
print(f"Markdown file names saved to: {filenames_output}")

# Combine all markdown contents from markdown folder into a single file
with open(combined_output, 'w', encoding='utf-8') as combined_f:
    for fname in sorted(os.listdir(markdown_folder)):
        fullpath = os.path.join(markdown_folder, fname)
        if os.path.isfile(fullpath) and fname.lower().endswith('.md'):
            with open(fullpath, 'r', encoding='utf-8') as f_in:
                content = f_in.read()
                combined_f.write(content)
                combined_f.write("\n\n")  # separate files with extra blank lines
print(f"Combined markdown saved to: {combined_output}")

## Added data
```text

"MS in Applied Data Science
How to Apply
The application for 2025 - 2026 is now open!
Master’s in Applied Data Science Application Requirements
The application portal for entrance in Autumn 2025 is now open! Check out our blog post here on tips for applying to the program.

The In-Person program admits full- and part-time students for entrance in the autumn quarter annually. The Online program admits full- and part-time students for entrance in autumn quarter. Online program part-time students may also begin the program in the spring quarter.

Only completed applications are sent for committee review.



Next Application Deadlines:
In-Person:

November 7, 2024 – Priority Application Deadline
December 4, 2024 – Scholarship Priority Deadline
January 21, 2025 – International Application Deadline (requiring visa sponsorship from UChicago)
March 4, 2025 – Second Priority Application Deadline
May 6, 2025 – Third Priority Application Deadline
June 23, 2025 – Final Application Deadline
Online:

Spring 2025 Part-Time, Online Only
January 9, 2025 – Final Application Deadline
Autumn 2025
June 23, 2025 – Final Application Deadline
* The application portal may close early if the cohort is filled."
```

Files changed: ../data/markdown_clean_final/education_masters-programs_ms-in-applied-data-science_how-to-apply.md

* The added text was taken from a **November 9, 2024 web archive** of the University of Chicago Applied Data Science website
* Archived page URL (November 9, 2024): https://web.archive.org/web/20241109185827/https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/how-to-apply/
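
The archive link above follows the Wayback Machine pattern of a 14-digit `YYYYMMDDhhmmss` timestamp followed by the original page URL. As a small sketch (the helper name `wayback_url` is hypothetical), the same link can be rebuilt from its two parts, which may help when recording the source of other manually added pages:

```python
def wayback_url(original_url: str, timestamp: str) -> str:
    """Build a Wayback Machine snapshot URL from a YYYYMMDDhhmmss timestamp."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"

# Parts taken from the archived how-to-apply page referenced above.
snapshot = wayback_url(
    "https://datascience.uchicago.edu/education/masters-programs/"
    "ms-in-applied-data-science/how-to-apply/",
    "20241109185827",
)
print(snapshot)
```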




In [2]:
import os
# Define directories and output file paths
markdown_folder = '../data/markdown_clean_final'
filenames_output = '../docs/filenames_final.txt'
combined_output = '../docs/combined_markdown_final.md'

# Write the list of file names from markdown folder into a text file
with open(filenames_output, 'w', encoding='utf-8') as f_out:
    for fname in sorted(os.listdir(markdown_folder)):
        filepath = os.path.join(markdown_folder, fname)
        if os.path.isfile(filepath):
            f_out.write(f"{fname}\n")
print(f"Markdown file names saved to: {filenames_output}")

# Combine all markdown contents from markdown folder into a single file
with open(combined_output, 'w', encoding='utf-8') as combined_f:
    for fname in sorted(os.listdir(markdown_folder)):
        fullpath = os.path.join(markdown_folder, fname)
        if os.path.isfile(fullpath) and fname.lower().endswith('.md'):
            with open(fullpath, 'r', encoding='utf-8') as f_in:
                content = f_in.read()
                combined_f.write(content)
                combined_f.write("\n\n")  # separate files with extra blank lines
print(f"Combined markdown saved to: {combined_output}")

Markdown file names saved to: ../docs/filenames_final.txt
Combined markdown saved to: ../docs/combined_markdown_final.md
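
The list-and-combine cell is repeated for each folder with only the three paths changed. A minimal sketch of a reusable helper is shown here; the name `list_and_combine` and its return value are assumptions for illustration and are not part of the notebook.

```python
from pathlib import Path

def list_and_combine(markdown_folder, filenames_output, combined_output):
    """Write a sorted filename list and a combined .md file; return the .md count."""
    folder = Path(markdown_folder)
    # Sort by name, matching sorted(os.listdir(...)) in the cells above.
    files = sorted((p for p in folder.iterdir() if p.is_file()), key=lambda p: p.name)

    # The filename list includes every regular file, not only Markdown.
    Path(filenames_output).write_text(
        "".join(f"{p.name}\n" for p in files), encoding="utf-8"
    )

    # Only .md files go into the combined document.
    md_files = [p for p in files if p.name.lower().endswith(".md")]
    with open(combined_output, "w", encoding="utf-8") as combined_f:
        for p in md_files:
            combined_f.write(p.read_text(encoding="utf-8"))
            combined_f.write("\n\n")  # same separator as the cells above
    return len(md_files)
```

Calling `list_and_combine('../data/markdown_clean_final', '../docs/filenames_final.txt', '../docs/combined_markdown_final.md')` should then produce the same two files as the cell above.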


In [7]:
import os
import openai
from dotenv import load_dotenv
import time # For potential rate limiting

# --- Configuration ---

# Load environment variables from .env file
# Ensure your .env file is in the same directory as your notebook or your project's root.
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

if openai_api_key:
    print("OpenAI API Key loaded.")
else:
    print("Error: OPENAI_API_KEY not found in .env file or environment variables.")
    print("Please ensure your .env file is correctly set up.")
    # You might want to raise an exception or exit if the key is critical.
    # For a notebook, printing the error might be sufficient for the user to fix.
    # For now, we'll allow the script to define the client,
    # but API calls are skipped later if the key isn't truly set.
    
# Initialize OpenAI client
# This should be done after checking for the API key.
# If the key is None, the client initialization will likely fail or API calls will fail.
try:
    client = openai.OpenAI(api_key=openai_api_key)
    print("OpenAI client initialized successfully.")
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    print("Please ensure your OPENAI_API_KEY is valid.")
    client = None # Set client to None if initialization fails

# Define the input and output directories
# These paths are relative to the location of your Jupyter Notebook.
# If your notebook is at '/project_root/notebooks/my_notebook.ipynb',
# then '../data/' refers to '/project_root/data/'.
INPUT_DIR = "../data/markdown_clean_final_2"
OUTPUT_DIR = "../data/markdown_clean_final_3"

# Choose your OpenAI model
# Alternatives: "gpt-3.5-turbo", "gpt-4", "gpt-4-turbo-preview", "gpt-4o", "gpt-4o-mini"
# "gpt-4o-mini" or "gpt-3.5-turbo" are generally good for cost/performance balance.
# "gpt-4" or "gpt-4o" might yield higher quality but at a higher cost and slower speed.
MODEL_NAME = "gpt-4.1"  # model used for the run recorded below

# --- LLM Prompt for Markdown Refinement ---
MARKDOWN_REFINEMENT_PROMPT_TEMPLATE = """
You are an expert technical editor tasked with refining Markdown content for a knowledge base. Your goal is to improve the clarity, conciseness, and consistency of the text while strictly preserving all factual information and the original Markdown structure (especially headers).

**Instructions:**

1.  **Read the entire provided Markdown text carefully.**
2.  **Focus on the textual content under each existing header.**
3.  **Improve Clarity and Conciseness:**
    * Rephrase sentences or paragraphs for better readability if they are awkward, verbose, or unclear.
    * Remove redundant words or phrases without changing the meaning.
    * Ensure smooth transitions between ideas within a section.
4.  **Maintain Factual Accuracy:**
    * **Crucially, do NOT add any new information, facts, statistics, or opinions that are not explicitly present in the original text.**
    * **Do NOT alter any existing factual information, data, numbers, dates, or specific names.**
    * Preserve all URLs and link text in Markdown links (e.g., `[text](url)`) exactly as they are.
    * Preserve all code blocks (```) and inline code (` `` `) exactly as they are.
5.  **Preserve Structure:**
    * **Do NOT change the existing Markdown header levels (e.g., `#`, `##`, `###`) or the text of the headers themselves.** The current header structure is vital for subsequent processing.
    * Do NOT add new headers or sections.
    * Do NOT remove existing headers or sections, even if the content under them is brief.
    * Maintain the original order of sections and paragraphs.
6.  **Formatting:**
    * Ensure consistent Markdown formatting for elements like lists, bolding, and italics, but only if the original intent is clear and the change improves consistency. Do not introduce new formatting styles arbitrarily.
    * Correct any minor Markdown syntax errors if found (e.g., a list item not starting correctly).
7.  **Output:**
    * Provide only the refined Markdown text as your output. Do not include any commentary, preamble, or explanation.

**Input Markdown Text:**
```markdown
{markdown_content}
```

**Refined Markdown Text:**
"""

# --- Function to call OpenAI API ---
def refine_markdown_with_llm(markdown_text, retries=3, delay=5):
    """
    Sends the markdown text to the OpenAI API for refinement using the predefined prompt.
    Includes basic retry logic.
    """
    if not client or not openai_api_key: # Check if client was initialized and key exists
        print("OpenAI client not configured or API key missing. Skipping API call.")
        return None 

    full_prompt = MARKDOWN_REFINEMENT_PROMPT_TEMPLATE.format(markdown_content=markdown_text)
    
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model=MODEL_NAME,
                messages=[
                    # System role can provide high-level context about the AI's persona
                    {"role": "system", "content": "You are an expert technical editor specializing in Markdown. Follow user instructions precisely."},
                    {"role": "user", "content": full_prompt}
                ],
                temperature=0.1,  # Lower temperature for more deterministic, less creative output
                max_tokens=4000   # Caps the length of the reply; a file whose refined text
                                  # exceeds this limit comes back truncated, so raise it
                                  # (within the model's output limit) for long files.
            )
            refined_content = response.choices[0].message.content.strip()
            
            # Clean potential ```markdown ... ``` wrappers
            if refined_content.startswith("```markdown"):
                refined_content = refined_content[len("```markdown"):].strip()
            elif refined_content.startswith("```"): # More general backtick removal
                 refined_content = refined_content[len("```"):].strip()
            
            if refined_content.endswith("```"):
                refined_content = refined_content[:-len("```")].strip()
                
            return refined_content
            
        except openai.RateLimitError as e:
            print(f"OpenAI API Rate Limit Error: {e}. Retrying in {delay} seconds... (Attempt {attempt+1}/{retries})")
            time.sleep(delay)
            delay *= 2 # Exponential backoff
        except openai.APIError as e:
            print(f"OpenAI API Error (Attempt {attempt+1}/{retries}): {e}")
            if attempt == retries - 1: # Last attempt
                return None # Or re-raise the exception
        except Exception as e:
            print(f"An unexpected error occurred during API call (Attempt {attempt+1}/{retries}): {e}")
            if attempt == retries - 1:
                return None # Or re-raise
    
    print("Failed to get a response after multiple retries.")
    return None

# --- Main Processing Logic ---
def process_markdown_files_in_notebook():
    """
    Main function to process markdown files. Designed to be called in a Jupyter Notebook.
    """
    if not client or not openai_api_key:
        print("OpenAI client is not initialized or API key is missing. Please check your .env file and API key.")
        return

    # Create output directory if it doesn't exist
    try:
        os.makedirs(OUTPUT_DIR, exist_ok=True)
        print(f"Input directory: {os.path.abspath(INPUT_DIR)}")
        print(f"Output directory: {os.path.abspath(OUTPUT_DIR)}")
    except OSError as e:
        print(f"Error creating output directory {OUTPUT_DIR}: {e}")
        return

    if not os.path.isdir(INPUT_DIR):
        print(f"Error: Input directory '{INPUT_DIR}' not found. Please check the path.")
        return

    processed_count = 0
    failed_to_refine_count = 0
    error_processing_count = 0
    skipped_empty_count = 0

    markdown_files = [f for f in os.listdir(INPUT_DIR) if f.endswith(".md")]
    total_files = len(markdown_files)
    print(f"Found {total_files} Markdown files to process.")

    for i, filename in enumerate(markdown_files):
        input_filepath = os.path.join(INPUT_DIR, filename)
        output_filepath = os.path.join(OUTPUT_DIR, filename)
        
        print(f"\n[{i+1}/{total_files}] Processing file: {filename}...")
        
        try:
            with open(input_filepath, 'r', encoding='utf-8') as f_in:
                original_content = f_in.read()
            
            if not original_content.strip():
                print(f"File {filename} is empty. Skipping and creating an empty output file.")
                with open(output_filepath, 'w', encoding='utf-8') as f_out:
                    f_out.write("") 
                skipped_empty_count += 1
                continue

            print(f"Sending content of '{filename}' to OpenAI API (model: {MODEL_NAME})...")
            refined_content = refine_markdown_with_llm(original_content)
            
            if refined_content:
                with open(output_filepath, 'w', encoding='utf-8') as f_out:
                    f_out.write(refined_content)
                print(f"Successfully refined and saved: {output_filepath}")
                processed_count += 1
            else:
                print(f"Failed to refine content for {filename}. Original content NOT saved to output directory.")
                # If you want to copy the original on failure, uncomment below:
                # import shutil
                # shutil.copy(input_filepath, output_filepath)
                # print(f"Copied original file to: {output_filepath} due to refinement failure.")
                failed_to_refine_count +=1
                
        except Exception as e:
            print(f"Error processing file {filename}: {e}")
            error_processing_count += 1
            
    print("\n--- Processing Complete ---")
    print(f"Successfully refined files: {processed_count}")
    print(f"Skipped empty files (empty output created): {skipped_empty_count}")
    print(f"Failed to refine (no output file created for these): {failed_to_refine_count}")
    print(f"Errors during file processing: {error_processing_count}")

if client and openai_api_key: # Check if client and key are available
    process_markdown_files_in_notebook()
else:
    print("OpenAI client or API key not properly set up. Cannot run processing.")

OpenAI API Key loaded.
OpenAI client initialized successfully.
Input directory: /Users/danielkim/gen-ai-midterm-project/data/markdown_clean_final_2
Output directory: /Users/danielkim/gen-ai-midterm-project/data/markdown_clean_final_3
Found 79 Markdown files to process.

[1/79] Processing file: education_masters-programs_ms-in-applied-data-science_course-progressions.md...
Sending content of 'education_masters-programs_ms-in-applied-data-science_course-progressions.md' to OpenAI API (model: gpt-4.1)...
Successfully refined and saved: ../data/markdown_clean_final_3/education_masters-programs_ms-in-applied-data-science_course-progressions.md

[2/79] Processing file: news_masters-in-applied-data-science-summer-2024-capstone-winners.md...
Sending content of 'news_masters-in-applied-data-science-summer-2024-capstone-winners.md' to OpenAI API (model: gpt-4.1)...
Successfully refined and saved: ../data/markdown_clean_final_3/news_masters-in-applied-data-science-summer-2024-capstone-winners.md
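
The refinement prompt asks the model to leave headers and URLs untouched, but the cell above saves whatever comes back without checking. The sketch below shows one way such a check could look. `strip_code_fence` and `check_preserved` are hypothetical helpers, the sample strings and `example.edu` URL are made up, and the header test is a simple line-based heuristic rather than a full Markdown parse.

```python
import re

HEADER_RX = re.compile(r"^#{1,6}[ \t]+.*$", flags=re.MULTILINE)
LINK_URL_RX = re.compile(r"\]\(([^)\s]+)\)")

def strip_code_fence(text: str) -> str:
    """Remove one wrapping ```lang ... ``` pair, only if both ends are present."""
    text = text.strip()
    match = re.match(r"^```[a-zA-Z]*\n(.*)\n```$", text, flags=re.DOTALL)
    return match.group(1).strip() if match else text

def check_preserved(original: str, refined: str) -> dict:
    """Report whether header lines and link URLs are identical in both texts."""
    def headers(text):
        return [h.rstrip() for h in HEADER_RX.findall(text)]
    return {
        "headers": headers(original) == headers(refined),
        "urls": sorted(LINK_URL_RX.findall(original)) == sorted(LINK_URL_RX.findall(refined)),
    }

# Made-up example: the model wrapped its reply in a ```markdown fence.
original = "# Tuition\n\nSee [fees](https://example.edu/fees) for details.\n"
raw_reply = "```markdown\n# Tuition\n\nDetails are on the [fees](https://example.edu/fees) page.\n```"
refined = strip_code_fence(raw_reply)
report = check_preserved(original, refined)
print(report)
```

A file whose report contains a `False` could be routed to manual review or replaced with the original instead of being saved as refined output.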


In [None]:
import os

def combine_markdown_files(input_dir, output_file):
    """
    Combines all .md files in input_dir into a single output_file.
    Each file's content is separated and labeled with its filename.
    """
    md_files = [f for f in sorted(os.listdir(input_dir)) if f.endswith('.md')]
    with open(output_file, 'w', encoding='utf-8') as out_f:
        for fname in md_files:
            out_f.write(f"\n\n---\n# FILE: {fname}\n---\n\n")
            with open(os.path.join(input_dir, fname), 'r', encoding='utf-8') as in_f:
                out_f.write(in_f.read())
                out_f.write('\n')

# Combine markdown_clean_final_2
combine_markdown_files(
    "../data/markdown_clean_final_2",
    "../data/combined_markdown_clean_final_2.md"
)

# Combine markdown_clean_final_3
combine_markdown_files(
    "../data/markdown_clean_final_3",
    "../data/combined_markdown_clean_final_3.md"
)

print("Combined files created: combined_markdown_clean_final_2.md and combined_markdown_clean_final_3.md")

Combined files created: combined_markdown_clean_final_2.md and combined_markdown_clean_final_3.md
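
Because `combine_markdown_files` labels every file with a `# FILE:` marker, the combined document can be split back into per-file text, for example to compare the same file across the two folders. This is a minimal sketch under the assumption that no source file contains the marker pattern itself; `split_combined` is a hypothetical helper and the sample content is made up.

```python
import re

# Matches the separator written by combine_markdown_files() and captures the filename.
FILE_MARKER_RX = re.compile(r"\n\n---\n# FILE: (.+)\n---\n\n")

def split_combined(combined_text: str) -> dict:
    """Map each filename to its original content."""
    parts = FILE_MARKER_RX.split(combined_text)
    # parts[0] is whatever precedes the first marker (normally empty);
    # after that, filenames and bodies alternate.
    names, bodies = parts[1::2], parts[2::2]
    # combine_markdown_files() appends one "\n" after every body, so drop it again.
    return {
        name: body[:-1] if body.endswith("\n") else body
        for name, body in zip(names, bodies)
    }

# Made-up combined text in the same layout the function above produces.
combined = (
    "\n\n---\n# FILE: a.md\n---\n\n" + "# A\ntext\n" + "\n"
    + "\n\n---\n# FILE: b.md\n---\n\n" + "B" + "\n"
)
files = split_combined(combined)
print(sorted(files))
```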


In [9]:
import os

def combine_markdown_files_no_separators(input_dir, output_file):
    """
    Combines all .md files in input_dir into a single output_file.
    No separators or filenames are added—just the raw content.
    """
    md_files = [f for f in sorted(os.listdir(input_dir)) if f.endswith('.md')]
    with open(output_file, 'w', encoding='utf-8') as out_f:
        for fname in md_files:
            with open(os.path.join(input_dir, fname), 'r', encoding='utf-8') as in_f:
                out_f.write(in_f.read())
                out_f.write('\n')  # End each file with a newline so files don't run together

# Example usage:
combine_markdown_files_no_separators(
    "../data/markdown_clean_final_3",
    "../data/combined_markdown_clean_final_3_nosep.md"
)

print("Combined file created: combined_markdown_clean_final_3_nosep.md")

Combined file created: combined_markdown_clean_final_3_nosep.md


In [None]:
"""
Pass-4: Make Markdown link text descriptive
────────────────────────────────────────────
• Reads every .md in  ../data/markdown_clean_final_3
• Writes edited files to ../data/markdown_clean_final
• Rewrites anchors like “[here](url)” → “[MSADS application form](url)”
• Skips files that contain no vague links (fast copy)
"""
import os, re, time, shutil, textwrap, frontmatter
from pathlib import Path
from dotenv import load_dotenv
import openai

# ── 0. basic config ──────────────────────────────────────────────────────────
INPUT_DIR  = Path("../data/markdown_clean_final_3")
OUTPUT_DIR = Path("../data/markdown_clean_final")
MODEL_NAME = "gpt-4.1"
VAGUE_ANCHOR_RX = re.compile(
    r"""\[
        \s*(?:here|click\W*here|this\s+link|learn\s+more|view|link|url)\s*
    \]""",
    re.I | re.X,
)

LINK_REWRITE_PROMPT = textwrap.dedent("""\
    You are an expert technical editor.

    **Task:** In the Markdown below, find every link whose anchor text is non-descriptive
    (e.g. “here”, “click here”, “this link”, “learn more”, “view”, “URL”, “link”).
    Rewrite ONLY those anchor texts so they concisely describe the destination,
    using hints from the surrounding sentence or the URL slug.

    **Rules**
    1. Do NOT change links that are already descriptive.
    2. Keep every URL exactly the same.
    3. Preserve all other Markdown (headers, lists, code, etc.).
    4. Do NOT add or delete content except for changing those anchor texts.

    Markdown to edit:
    ```markdown
    {markdown_content}
    ```
    Return ONLY the updated Markdown.
""")

# ── 1. env + OpenAI client ───────────────────────────────────────────────────
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY not found in your environment or .env.")

client = openai.OpenAI(api_key=api_key)

# ── 2. helpers ───────────────────────────────────────────────────────────────
def needs_rewrite(text: str) -> bool:
    """True if any generic link anchor is present."""
    return bool(VAGUE_ANCHOR_RX.search(text))

def rewrite_vague_links(md_text: str, retries: int = 3, backoff: int = 5) -> str | None:
    prompt = LINK_REWRITE_PROMPT.format(markdown_content=md_text)
    for attempt in range(retries):
        try:
            resp = client.chat.completions.create(
                model       = MODEL_NAME,
                messages    = [
                    {"role": "system",
                     "content":
                     "You are an expert Markdown editor. Follow the instructions exactly."},
                    {"role": "user", "content": prompt},
                ],
                temperature = 0.0,
                max_tokens  = 4000,
            )
            txt = resp.choices[0].message.content.strip()

            # strip ``` wrappers if the model added them
            for fence in ("```markdown", "```"):
                if txt.startswith(fence):
                    txt = txt[len(fence):].strip()
                    break                   # strip only one opening fence
            if txt.endswith("```"):
                txt = txt[:-3].strip()
            return txt
        except openai.RateLimitError:
            time.sleep(backoff)
            backoff *= 2                    # exponential back-off
        except Exception as e:
            print(f"⚠️  LLM error (attempt {attempt+1}): {e}")
            time.sleep(backoff)
    return None

# ── 3. main processing loop ──────────────────────────────────────────────────
def process_links():
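    # A missing input folder is an error; creating it would silently process 0 files
    if not INPUT_DIR.is_dir():
        raise FileNotFoundError(f"Input directory not found: {INPUT_DIR}")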
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

    files = sorted(INPUT_DIR.glob("*.md"))
    print(f"Scanning {len(files)} Markdown files…")

    stats = {"rewritten": 0, "copied": 0, "failed": 0}

    for i, src in enumerate(files, 1):
        dst = OUTPUT_DIR / src.name
        text = src.read_text(encoding="utf-8")

        print(f"[{i:>3}/{len(files)}] {src.name}", end=" — ")

        if not needs_rewrite(text):
            shutil.copy(src, dst)
            stats["copied"] += 1
            print("no vague links → copied")
            continue

        new_text = rewrite_vague_links(text)
        if new_text:
            dst.write_text(new_text, encoding="utf-8")
            stats["rewritten"] += 1
            print("rewritten ✅")
        else:
            shutil.copy(src, dst)           # keep original if LLM failed
            stats["failed"] += 1
            print("LLM failed → original copied ⚠️")

    # summary
    print("\n— pass-4 complete —")
    for k, v in stats.items():
        print(f"{k.capitalize():<10}: {v}")

if __name__ == "__main__":
    process_links()


Scanning 78 Markdown files…
[  1/78] education_masters-programs_in-person-program.md — no vague links → copied
[  2/78] education_masters-programs_ms-in-applied-data-science.md — rewritten ✅
[  3/78] education_masters-programs_ms-in-applied-data-science_capstone-projects.md — no vague links → copied
[  4/78] education_masters-programs_ms-in-applied-data-science_career-outcomes.md — no vague links → copied
[  5/78] education_masters-programs_ms-in-applied-data-science_course-progressions.md — no vague links → copied
[  6/78] education_masters-programs_ms-in-applied-data-science_events-deadlines.md — rewritten ✅
[  7/78] education_masters-programs_ms-in-applied-data-science_faqs.md — no vague links → copied
[  8/78] education_masters-programs_ms-in-applied-data-science_how-to-apply.md — no vague links → copied
[  9/78] education_masters-programs_ms-in-applied-data-science_in-person-program.md — no vague links → copied
[ 10/78] education_masters-programs_ms-in-applied-data-science_instruc
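
To see which anchors the pass-4 filter treats as vague, the pattern can be run against a few sample lines. The regex below is copied from the cell above; the sample sentences and the `example.edu` URLs are made up for illustration.

```python
import re

# Same pattern as VAGUE_ANCHOR_RX in the cell above.
VAGUE_ANCHOR_RX = re.compile(
    r"""\[
        \s*(?:here|click\W*here|this\s+link|learn\s+more|view|link|url)\s*
    \]""",
    re.I | re.X,
)

# Each sample maps to whether it is expected to trigger a rewrite.
samples = {
    "Apply [here](https://example.edu/apply).": True,
    "[Click here](https://example.edu/apply) to apply.": True,
    "Read the [MSADS application form](https://example.edu/apply).": False,
    "See the [program overview](https://example.edu/overview).": False,
    "A bare [link] with no URL also matches.": True,
}
results = {text: bool(VAGUE_ANCHOR_RX.search(text)) for text in samples}
for text, flagged in results.items():
    print(f"{flagged!s:<5} {text}")
```

The last sample shows that the pattern only inspects the bracketed text, so a bracketed word that is not part of a link still sends the file to the LLM. Requiring a following `(` would tighten the filter if that turns out to matter.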

In [12]:
# !pip install python-frontmatter

Cleanup of folder names

In [17]:
import os

# Define your base directory
base_dir = "../data"  # Change if your folders are elsewhere

# Mapping of old to new folder names
rename_map = {
    "markdown_clean_final_1": "markdown_clean_final_draft_1",
    "markdown_clean_final_2": "markdown_clean_final_draft_2",
    "markdown_clean_final_3": "markdown_clean_final_draft_3",
}

for old, new in rename_map.items():
    old_path = os.path.join(base_dir, old)
    new_path = os.path.join(base_dir, new)
    if os.path.exists(old_path):
        os.rename(old_path, new_path)
        print(f"Renamed {old_path} -> {new_path}")
    else:
        print(f"Folder not found: {old_path}")

Renamed ../data/markdown_clean_final_1 -> ../data/markdown_clean_final_draft_1
Renamed ../data/markdown_clean_final_2 -> ../data/markdown_clean_final_draft_2
Renamed ../data/markdown_clean_final_3 -> ../data/markdown_clean_final_draft_3
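
The rename cell checks that each source folder exists but not whether the target name is already taken. A slightly more defensive variant with a dry-run mode could look like the sketch below; `rename_folders` and its status strings are hypothetical and not part of the notebook.

```python
from pathlib import Path

def rename_folders(base_dir, rename_map, dry_run=True):
    """Rename folders under base_dir; skip missing sources and existing targets."""
    actions = []
    for old, new in rename_map.items():
        src, dst = Path(base_dir) / old, Path(base_dir) / new
        if not src.is_dir():
            actions.append((old, "missing"))
        elif dst.exists():
            actions.append((old, "target exists"))
        else:
            if not dry_run:
                src.rename(dst)
            actions.append((old, "would rename" if dry_run else "renamed"))
    return actions
```

Running it once with `dry_run=True` prints nothing and changes nothing; the returned list can be inspected before repeating the call with `dry_run=False`.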


**Final cleaned data is in the `../data/markdown_clean_final` directory**