<a href="https://colab.research.google.com/github/annek77/llamaindex-chat-with-streamlit-docs/blob/main/Party_pdf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


In [3]:
# Import os module for operating-system interactions
# Enables robust path handling, directory creation, existence checks, and secure environment variable access
import os


## 0. System Imports (Path Handling & Dynamic Filenames)

**Purpose:** Provide essential OS-level utilities to guarantee consistent file operations and environment configuration across the entire pipeline.

* **What:**

  ```python
  import os
  ```

  Load Python’s built-in `os` module.

* **Why:**
  Enables cross-platform path construction, file and directory management, existence checks, and reading environment variables so that secrets such as API keys stay out of the code.

* **How:**
  Use

  * `os.path.join()` to assemble file paths
  * `os.makedirs()` to create required directories
  * `os.path.exists()` to verify file or folder presence
  * `os.getenv()` to fetch environment variables (e.g. API keys) without hard-coding
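
A minimal sketch of how these four calls might fit together. The directory name `output`, the file name `tokens.json`, and the environment variable `TRANSLATION_API_KEY` are placeholders chosen for illustration; none of them are taken from this notebook.

```python
import os
import tempfile

# Work inside a throwaway directory so the sketch leaves no files in the project
base_dir = tempfile.mkdtemp()

# Build a path with the separator of the current operating system
output_dir = os.path.join(base_dir, "output")  # "output" is a placeholder name

# Create the directory; exist_ok=True makes repeated runs harmless
os.makedirs(output_dir, exist_ok=True)

# Check for a file before trying to read it
tokens_path = os.path.join(output_dir, "tokens.json")  # placeholder file name
tokens_available = os.path.exists(tokens_path)

# Read an optional secret from the environment; the result is None if it is not set
api_key = os.getenv("TRANSLATION_API_KEY")  # hypothetical variable name

print(os.path.isdir(output_dir), tokens_available, api_key is not None)
```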


In [4]:
# Specify the PDF to analyze and backup settings
# Enables dynamic PDF selection, optional text backup, and controlled extraction stopping
ACTIVE_PDF = "Introduction_Why_Data_Science_Needs_Feminism.pdf"
EXTRACTED_TEXT_FILE = "extracted_text.txt"
EXTRACT_MARKERS = ["references", "bibliography"]


## 1. PDF Settings

**Purpose:** Define the source PDF file and control text extraction parameters.

* **What:**

  ```python
  ACTIVE_PDF = "Introduction_Why_Data_Science_Needs_Feminism.pdf"
  EXTRACTED_TEXT_FILE = "extracted_text.txt"
  EXTRACT_MARKERS = ["references", "bibliography"]
  ```

  Specify the input PDF filename, optional backup text file, and extraction stop markers.

* **Why:**

  * Allows switching to a different PDF by changing one constant instead of editing the extraction logic.
  * Keeps a copy of the cleaned text on disk for debugging or further analysis (written only when `save_to_file=True`).
  * Prevents inclusion of unwanted sections (e.g., references or bibliography) in the main text.

* **How:**

  * Use `ACTIVE_PDF` to locate and open the target PDF.
  * Write the sanitized text to `EXTRACTED_TEXT_FILE` when `extract_clean_text` is called with `save_to_file=True`.
  * Stop extraction at the first case-insensitive occurrence of any term in `EXTRACT_MARKERS`; `extract_clean_text` additionally treats `footnotes` as a stop term.


In [None]:
# Configure spaCy model and pipeline components
# Use lightweight small model for speed and memory efficiency; disable parser to reduce computation
SPACY_MODEL = "en_core_web_sm"
SPACY_DISABLE = ["parser"]
SPACY_BATCH_SIZE = 1000

# Define output path for tokenization JSON
# Persist tokenized data for debugging and further analysis
TOKENS_JSON_OUTPUT = "tokens_step2.json"


## 2. Tokenization & Linguistics (spaCy)

**Purpose:** Configure the NLP pipeline to efficiently process and structure the extracted text.

* **What:**

  ```python
  SPACY_MODEL = "en_core_web_sm"
  SPACY_DISABLE = ["parser"]
  SPACY_BATCH_SIZE = 1000
  TOKENS_JSON_OUTPUT = "tokens_step2.json"
  ```

  Select the spaCy model, disable unused components, set batch size, and define the JSON export path for tokenized data.

* **Why:**

  * Use the lightweight `en_core_web_sm` model to balance speed and accuracy in tokenization, POS-tagging, and NER.
  * Disable the parser to reduce computational overhead when dependency parsing is not required.
  * Pass `SPACY_BATCH_SIZE` as the `batch_size` of `nlp.pipe`, which buffers that many texts (whole documents, not tokens) per batch and so trades memory for throughput.
  * Export token data to a JSON file for downstream analysis, debugging, and reproducibility.

* **How:**

  1. Load the spaCy model via `spacy.load(SPACY_MODEL, disable=SPACY_DISABLE)`.
  2. Split the text into smaller units (for example, paragraphs) and stream them through `nlp.pipe(texts, batch_size=SPACY_BATCH_SIZE)`; the batch size counts texts, not tokens.
  3. Serialize each document’s tokens and annotations to `TOKENS_JSON_OUTPUT`.
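
The batching step could be sketched as follows. The helper names `split_paragraphs` and `tokenize_in_batches` are illustrative and do not appear elsewhere in this notebook, and splitting on blank lines is only one assumed way to obtain several texts from one document.

```python
from typing import Iterator, List

SPACY_MODEL = "en_core_web_sm"
SPACY_DISABLE = ["parser"]
SPACY_BATCH_SIZE = 1000


def split_paragraphs(text: str) -> List[str]:
    """Split text on blank lines and drop empty chunks (illustrative helper)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def tokenize_in_batches(texts: List[str]) -> Iterator[dict]:
    """Yield one dict per token, processing up to SPACY_BATCH_SIZE texts per batch."""
    import spacy  # imported lazily so split_paragraphs works without spaCy installed

    nlp = spacy.load(SPACY_MODEL, disable=SPACY_DISABLE)
    for doc in nlp.pipe(texts, batch_size=SPACY_BATCH_SIZE):
        for token in doc:
            yield {
                "word": token.text,
                "lemma": token.lemma_,
                "pos": token.pos_,
                "ner": token.ent_type_,
            }


paragraphs = split_paragraphs("First paragraph.\n\nSecond paragraph.\n\n\n")
print(paragraphs)
```

Calling `list(tokenize_in_batches(paragraphs))` requires spaCy and the `en_core_web_sm` model to be installed in the runtime.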


In [None]:
# Define CSV export columns for token data
# Ensure consistent ordering across all CSV outputs for correct parsing
CSV_COLUMNS = [
    "word", "lemma", "pos", "ner",
    "cefr_predicted", "translation_de",
    "anki_front", "anki_back", "anki_type",
    "status_learning", "repetition_stage"
]


## 3. CSV Export: Columns for Token Data

**Purpose:** Define a consistent schema for CSV outputs to ensure correct parsing and interoperability across the pipeline.

* **What:**

  ```python
  CSV_COLUMNS = [
      "word", "lemma", "pos", "ner",
      "cefr_predicted", "translation_de",
      "anki_front", "anki_back", "anki_type",
      "status_learning", "repetition_stage"
  ]
  ```

  Central list of column names used in all generated CSV files.

* **Why:**

  * Maintains a uniform column order in every CSV export, preventing misalignment when loading data.
  * Simplifies maintenance by defining column schema in a single location, avoiding duplication.
  * Enhances readability and clarity of data structure for further analysis and tool integration.

* **How:**

  * Reference `CSV_COLUMNS` in export code, e.g. `df.to_csv(path, columns=CSV_COLUMNS, index=False)` with pandas or `csv.DictWriter(f, fieldnames=CSV_COLUMNS)` with the standard library.
  * Update this list as new fields are added to the processing pipeline to automatically propagate changes to all outputs.
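
As a rough illustration of why a single column list helps, the sketch below writes rows with the standard-library `csv.DictWriter`. The function name `rows_to_csv` and the sample row are made up for this example; `restval` fills columns a row does not provide, and `extrasaction="ignore"` drops keys that are not in the schema.

```python
import csv
import io

CSV_COLUMNS = [
    "word", "lemma", "pos", "ner",
    "cefr_predicted", "translation_de",
    "anki_front", "anki_back", "anki_type",
    "status_learning", "repetition_stage"
]


def rows_to_csv(rows, columns=CSV_COLUMNS) -> str:
    """Serialize dict rows to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        restval="",             # value for columns missing from a row
        extrasaction="ignore",  # silently skip keys outside the schema
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up row: key order differs from CSV_COLUMNS and one key is not in the schema
sample_rows = [{"pos": "VERB", "word": "absorb", "lemma": "absorb", "note": "ignored"}]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```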


## 4. Translation Configuration

**Purpose:** Define translation service preferences, fallback behavior, and caching mechanism to ensure reliable and efficient translations.

* **What:**

  ```python
  TRANSLATION_SERVICE = "wiktionary"
  ENABLE_FALLBACK = True
  FALLBACK_SERVICE = "libre"
  TRANSLATION_CACHE_FILE = "translation_cache.json"
  TRANSLATE_TARGET = "token"
  ```

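  Configure the primary translation source (Wiktionary), enable fallback to a second service (LibreTranslate), specify the cache file, and set the translation granularity. As presented here, these constants appear only in this Markdown block, so they need to be run in a code cell before the translation module in section 12 is used.

* **Why:**

  * Word translations are essential for creating bilingual vocabulary entries.
  * Trying a dictionary source first reduces dependence on machine-translation APIs. The code names this lookup "offline", but the `wiktionaryparser` package retrieves entries from wiktionary.org, so it still needs network access.
  * Fallback to LibreTranslate provides coverage when the Wiktionary lookup returns nothing.
  * Caching avoids redundant API calls and accelerates repeated translations.
  * Translation target controls whether tokens or entire vocabulary entries are translated.

* **How:**

  * Read `TRANSLATION_SERVICE` to select the primary lookup; `"wiktionary"` selects `WiktionaryParser`.
  * Check `ENABLE_FALLBACK` and call `_translate_api` when the primary lookup yields no result.
  * Store and retrieve translations from `TRANSLATION_CACHE_FILE` to minimize network requests.
  * Apply translations at the token level as specified by `TRANSLATE_TARGET`.
  * In `translate_word` as shown in section 12, both branches are commented out and `GoogleTranslator` performs the translation, so `TRANSLATION_SERVICE` and `ENABLE_FALLBACK` take effect only once that code is re-enabled.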
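
The lookup order described in this section (cache, then primary source, then fallback) could be sketched like this. The function `translate_with_fallback` and the two stand-in backends are hypothetical; they return hard-coded strings instead of calling Wiktionary or LibreTranslate, so the example runs without network access.

```python
from typing import Callable, Dict, Optional

ENABLE_FALLBACK = True


def translate_with_fallback(
    word: str,
    cache: Dict[str, str],
    primary: Callable[[str], str],
    fallback: Optional[Callable[[str], str]] = None,
) -> str:
    """Return a translation from the cache, the primary source, or the fallback."""
    key = word.strip().lower()
    if key in cache:
        return cache[key]
    translation = primary(word)
    if not translation and fallback is not None:
        translation = fallback(word)
    if translation:
        cache[key] = translation  # cache only successful lookups
    return translation


# Stand-in backends with hard-coded answers (assumptions for this sketch)
def fake_dictionary(word: str) -> str:
    return {"data": "Daten"}.get(word.lower(), "")


def fake_api(word: str) -> str:
    return {"feminism": "Feminismus"}.get(word.lower(), "")


cache: Dict[str, str] = {}
fallback_fn = fake_api if ENABLE_FALLBACK else None
first = translate_with_fallback("Data", cache, fake_dictionary, fallback_fn)
second = translate_with_fallback("feminism", cache, fake_dictionary, fallback_fn)
third = translate_with_fallback("unknownword", cache, fake_dictionary, fallback_fn)
print(first, second, repr(third), cache)
```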


In [None]:
# Section 5: Export Files (Dynamic Naming)
# Output filenames are derived from the active PDF name.
# No fixed constants are defined here; generated names follow the pattern:
#    <PDF_BASE_NAME>_<descriptor>.csv, .txt, etc.
# Example: "Introduction_Why_Data_Science_Needs_Feminism_all_tokens.csv"


## 5. Export Files (Dynamic Naming)

**Purpose:** Automatically derive output filenames from the active PDF name to maintain consistency and avoid manual renaming.

* **What:**

  * Dynamic naming pattern for all export files (tokens, CSVs, Anki decks, metadata).
  * No hard-coded filenames; base names are constructed from `ACTIVE_PDF`.

* **Why:**

  * Ensures uniform naming conventions across different export types.
  * Reduces risk of filename collisions and manual errors.
  * Simplifies batch processing when handling multiple PDFs.

* **How:**

  * Extract the base name from `ACTIVE_PDF` (e.g., `Introduction_Why_Data_Science_Needs_Feminism`).
  * Append descriptors (e.g., `_all_tokens.csv`, `_vocab_list.csv`).
  * Use `os.path.join()` to construct full paths for each output file in the desired directory structure.
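
A minimal sketch of the naming pattern, assuming a helper called `build_output_path` and an output directory named `output`; both are illustrative choices rather than names used elsewhere in this notebook.

```python
import os

ACTIVE_PDF = "Introduction_Why_Data_Science_Needs_Feminism.pdf"


def build_output_path(pdf_path: str, descriptor: str, out_dir: str = "output") -> str:
    """Derive '<out_dir>/<PDF base name>_<descriptor>' from the PDF path."""
    # Strip any directory part and the '.pdf' extension
    base_name = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(out_dir, f"{base_name}_{descriptor}")


tokens_csv = build_output_path(ACTIVE_PDF, "all_tokens.csv")
vocab_csv = build_output_path(ACTIVE_PDF, "vocab_list.csv")
print(tokens_csv)
print(vocab_csv)
```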


In [None]:
# Section 6: Interactive Quiz & Anki Control
# Enable or disable the terminal quiz and Anki deck generation
ENABLE_TERMINAL_QUIZ = True
EXPORT_ANKI_DECK = True


## 6. Interactive Quiz & Anki Control

**Purpose:** Configure whether to activate the terminal-based vocabulary quiz and whether to generate Anki flashcard decks for spaced repetition.

* **What:**

  ```python
  ENABLE_TERMINAL_QUIZ = True
  EXPORT_ANKI_DECK = True
  ```

  Toggle flags for enabling the built-in terminal quiz and exporting Anki decks.

* **Why:**

  * Provides an interactive practice mode directly in the console for quick vocabulary reinforcement.
  * Generates Anki decks to support long-term retention through spaced repetition.
  * Allows flexibility to disable one or both features during development or batch processing.

* **How:**

  * Check `ENABLE_TERMINAL_QUIZ` before launching the quiz loop.
  * Use `EXPORT_ANKI_DECK` to conditionally call the Anki deck export function (e.g., via `genanki`).
  * Maintain separation between data extraction and learning interface to ensure modularity.
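
One possible shape for the flag checks is sketched below. The functions `run_quiz` and `run_learning_steps` are hypothetical, and the Anki export is passed in as a plain callable so the sketch does not depend on `genanki`. The `ask` parameter defaults to `input` for terminal use; the demo replaces it with scripted answers.

```python
from typing import Callable, Dict, List, Optional

ENABLE_TERMINAL_QUIZ = True
EXPORT_ANKI_DECK = True

Vocab = List[Dict[str, str]]


def run_quiz(vocab: Vocab, ask: Callable[[str], str] = input) -> int:
    """Ask for the German translation of each word and count correct answers."""
    correct = 0
    for entry in vocab:
        answer = ask(f"Translate '{entry['word']}': ")
        if answer.strip().lower() == entry["translation_de"].lower():
            correct += 1
    return correct


def run_learning_steps(
    vocab: Vocab,
    ask: Callable[[str], str] = input,
    export: Optional[Callable[[Vocab], None]] = None,
) -> List[str]:
    """Run only the steps whose flags are enabled and report which ones ran."""
    performed = []
    if ENABLE_TERMINAL_QUIZ:
        run_quiz(vocab, ask)
        performed.append("quiz")
    if EXPORT_ANKI_DECK and export is not None:
        export(vocab)
        performed.append("anki")
    return performed


# Demo with scripted answers and a list that stands in for a deck exporter
vocab = [{"word": "data", "translation_de": "Daten"}]
scripted = iter(["Daten"])
exported: Vocab = []
steps = run_learning_steps(vocab, ask=lambda prompt: next(scripted), export=exported.extend)
print(steps, exported)
```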


In [None]:
# Section 7: Vocabulary Filters, Stopwords & Phrases
# Configure custom stopwords, phrase lists, and raw vocabulary sources
EXTRA_STOPWORDS = []
PHRASE_LIST_PATH = "phrases.txt"
RAW_VOCAB_LIST_PATHS = [
    "The_Oxford_5000_by_CEFR_level.pdf",
    "digital_humanities_terms.txt"
]
VOCAB_LIST_OUTPUT = "vocab_list.csv"


## 7. Vocabulary Filters, Stopwords & Phrases

**Purpose:** Configure filtering of tokens using custom stopwords and phrase lists, and specify raw vocabulary sources for list construction.

* **What:**

  ```python
  EXTRA_STOPWORDS = []
  PHRASE_LIST_PATH = "phrases.txt"
  RAW_VOCAB_LIST_PATHS = [
      "The_Oxford_5000_by_CEFR_level.pdf",
      "digital_humanities_terms.txt"
  ]
  VOCAB_LIST_OUTPUT = "vocab_list.csv"
  ```

  Define additional stopwords to exclude, path to custom phrase list, sources for raw vocabulary extraction, and output file for consolidated vocabulary list.

* **Why:**

  * Exclude irrelevant or high-frequency words via `EXTRA_STOPWORDS`.
  * Identify multi-word expressions using a phrase list for more accurate token grouping.
  * Incorporate external vocabulary sources (e.g., Oxford 5000 list, domain-specific terms) to enrich the final list.
  * Export filtered and merged vocabulary as a CSV for downstream quiz and flashcard generation.

* **How:**

  * Load `EXTRA_STOPWORDS` and remove matching tokens during filtering.
  * Read phrases from `PHRASE_LIST_PATH` and detect them in the token stream.
  * Iterate through each path in `RAW_VOCAB_LIST_PATHS`, parse entries, and merge lists.
  * Write the consolidated list to `VOCAB_LIST_OUTPUT` using `pandas.DataFrame.to_csv` with predefined `CSV_COLUMNS`.
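
The filtering and phrase-detection steps might look roughly like this. Both helper functions are illustrative, and the greedy longest-match strategy is an assumption about how phrases could be detected, not a description of the notebook's implementation.

```python
from typing import Iterable, List


def filter_stopwords(tokens: List[str], extra_stopwords: Iterable[str]) -> List[str]:
    """Drop tokens that match an extra stopword, ignoring case."""
    blocked = {w.lower() for w in extra_stopwords}
    return [t for t in tokens if t.lower() not in blocked]


def merge_phrases(tokens: List[str], phrases: Iterable[str]) -> List[str]:
    """Join adjacent tokens that form a known phrase, preferring longer phrases."""
    phrase_tuples = sorted(
        (tuple(p.lower().split()) for p in phrases), key=len, reverse=True
    )
    merged = []
    i = 0
    while i < len(tokens):
        for phrase in phrase_tuples:
            n = len(phrase)
            window = tuple(t.lower() for t in tokens[i:i + n])
            if n > 1 and window == phrase:
                merged.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts at this position, keep the single token
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["data", "science", "needs", "also", "feminism"]
kept = filter_stopwords(tokens, ["also"])
result = merge_phrases(kept, ["data science"])
print(result)
```

Note that removing stopwords before merging can make words adjacent that were separated in the original text; running `merge_phrases` first avoids that, at the cost of keeping stopwords inside phrases.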


In [None]:
# Section 8: Metadata & Source Management
# Configure metadata fields and output files for citations and vocabulary identifiers
DUBLIN_CORE_FIELDS = [
    "title", "creator", "publisher", "date",
    "format", "identifier", "language"
]
DUBLIN_CORE_OUTPUT = "dublin_core_metadata.txt"
APA_CITATION_OUTPUT = "citation_APA.txt"
VOCAB_ID_PREFIX = "vocab_"
SOURCE_ID_PREFIX = "source_"


## 8. Metadata & Source Management

**Purpose:** Define citation metadata fields and identifier prefixes for vocabulary entries and data sources, ensuring standardized reference handling and export.

* **What:**

  ```python
  DUBLIN_CORE_FIELDS = [
      "title", "creator", "publisher", "date",  
      "format", "identifier", "language"
  ]
  DUBLIN_CORE_OUTPUT = "dublin_core_metadata.txt"
  APA_CITATION_OUTPUT = "citation_APA.txt"
  VOCAB_ID_PREFIX = "vocab_"
  SOURCE_ID_PREFIX = "source_"
  ```

  Specify the metadata fields for Dublin Core, output filenames for metadata and APA citations, and prefixes for generated IDs.

* **Why:**

  * Ensure consistent metadata formatting across all exports.
  * Automate creation of standardized citation files for documentation and publication.
  * Generate clear, unique identifiers for vocabulary items and sources, avoiding naming conflicts.

* **How:**

  * Iterate over `DUBLIN_CORE_FIELDS` to build metadata entries in the correct order.
  * Write metadata lines to `DUBLIN_CORE_OUTPUT` and the APA-formatted citation to `APA_CITATION_OUTPUT`.
  * Prepend `VOCAB_ID_PREFIX` to each vocabulary identifier and `SOURCE_ID_PREFIX` to each source identifier when exporting vocabulary or citation data.
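
A small sketch of these steps, under two assumptions that the notebook does not specify: metadata is written as `field: value` lines, and identifiers are zero-padded running numbers. The helper names and the sample metadata are made up for illustration.

```python
DUBLIN_CORE_FIELDS = [
    "title", "creator", "publisher", "date",
    "format", "identifier", "language"
]
VOCAB_ID_PREFIX = "vocab_"
SOURCE_ID_PREFIX = "source_"


def format_dublin_core(metadata: dict) -> str:
    """Render one 'field: value' line per Dublin Core field, in list order."""
    lines = [f"{field}: {metadata.get(field, '')}" for field in DUBLIN_CORE_FIELDS]
    return "\n".join(lines)


def make_id(prefix: str, number: int, width: int = 4) -> str:
    """Build an identifier such as 'vocab_0001' (the padding width is an assumption)."""
    return f"{prefix}{number:0{width}d}"


sample_metadata = {"title": "Example Title", "language": "en"}  # made-up values
dc_text = format_dublin_core(sample_metadata)
vocab_id = make_id(VOCAB_ID_PREFIX, 1)
source_id = make_id(SOURCE_ID_PREFIX, 12)
print(dc_text)
print(vocab_id, source_id)
```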


In [None]:
# Section 9: Debug & Logging
# Configure debug mode and logging outputs for development and troubleshooting
DEBUG_MODE = True
LOG_TO_CONSOLE = True
LOG_TO_FILE = False


## 9. Debug & Logging

**Purpose:** Configure debugging behavior and logging destinations to facilitate development, troubleshooting, and audit trail creation.

* **What:**

  ```python
  DEBUG_MODE = True
  LOG_TO_CONSOLE = True
  LOG_TO_FILE = False
  ```

  Toggle flags for enabling verbose debug output and for switching console and file logging on or off independently.

* **Why:**

  * Activate `DEBUG_MODE` to print detailed internal state and error messages during development.
  * Use `LOG_TO_CONSOLE` for immediate feedback in the terminal.
  * Enable `LOG_TO_FILE` to persist logs for later analysis or auditing.

* **How:**

  * Wrap key functions with conditional logging statements based on `DEBUG_MODE`.
  * Route log messages to stdout when `LOG_TO_CONSOLE` is `True`.
  * If `LOG_TO_FILE` is `True`, write log entries to a configured log file using Python’s `logging` module with a file handler.
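
The three flags could be wired to Python's `logging` module along these lines. The function `build_logger`, the logger name `pipeline`, and the file name `pipeline.log` are assumptions made for this sketch.

```python
import logging
import sys

DEBUG_MODE = True
LOG_TO_CONSOLE = True
LOG_TO_FILE = False
LOG_FILE = "pipeline.log"  # hypothetical file name, not defined in the notebook


def build_logger(name: str = "pipeline") -> logging.Logger:
    """Create a logger whose level and handlers follow the three flags."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if DEBUG_MODE else logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers when a notebook cell is re-run
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    if LOG_TO_CONSOLE:
        console = logging.StreamHandler(sys.stdout)
        console.setFormatter(formatter)
        logger.addHandler(console)

    if LOG_TO_FILE:
        file_handler = logging.FileHandler(LOG_FILE, encoding="utf-8")
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)

    return logger


logger = build_logger()
logger.debug("Debug logging is active")
```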



In [5]:
# Install required packages for PDF processing
%pip install tools
%pip install PyMuPDF



Collecting tools
  Downloading tools-1.0.2-py3-none-any.whl.metadata (1.4 kB)
Downloading tools-1.0.2-py3-none-any.whl (37 kB)
Installing collected packages: tools
Successfully installed tools-1.0.2
Collecting PyMuPDF
  Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.26.3


## Install Dependencies

**Purpose:** Ensure required packages are installed for PDF processing and text sanitization.

* **What:**

  ```python
  %pip install tools
  %pip install PyMuPDF
  ```

  Install the `tools` utility package and the `PyMuPDF` library for PDF parsing.

* **Why:**

  * `tools` is installed by the notebook, although none of the code cells shown here import it.
  * `PyMuPDF` (imported as `fitz`) provides reliable text extraction capabilities from PDF documents.

* **How:**

  * Run the `%pip install` commands in the first cell of the notebook.
  * Use Colab’s magic commands to ensure packages install into the active environment without restarting the runtime.



In [6]:
# Section 10: PDF Extraction & Sanitization
# Install required packages for PDF processing
%pip install tools
%pip install PyMuPDF

# Import libraries for PDF parsing and text cleaning
import fitz  # PyMuPDF for PDF reading
import re    # Regular expressions for text sanitization

# Function: sanitize_text
# Remove footnote markers, back-reference arrows, and extraneous lines
def sanitize_text(text: str) -> str:
    """
    Clean the raw text by:
    1. Removing bracketed footnote numbers ([1], [23], …)
    2. Stripping back-reference symbols (↩)
    3. Dropping lines composed solely of digits
    4. Excluding lines like 'Page 7' or 'Seite 8'
    5. Skipping lines starting with numbered lists (e.g., '1. ', '2. ')
    """
    # (1) Remove bracketed footnote numbers
    text = re.sub(r'\[\d+\]', '', text)
    # (2) Strip back-reference arrows
    text = text.replace('↩', '')

    filtered_lines = []
    for line in text.splitlines():
        stripped = line.strip()
        # (3) Skip lines composed only of digits
        if re.fullmatch(r'\d+', stripped):
            continue
        # (4) Skip page labels like 'Page 7' or 'Seite 8'
        if re.fullmatch(r'(page|seite)\s*\d+', stripped.lower()):
            continue
        # (5) Skip lines starting with numbered lists
        if re.match(r'^\d+\.\s', stripped):
            continue
        filtered_lines.append(line)
    return "\n".join(filtered_lines)

# Main function: extract_clean_text
# Parse PDF until a marker is found, then sanitize and optionally save
def extract_clean_text(
    pdf_path: str = ACTIVE_PDF,
    markers: list = EXTRACT_MARKERS,
    save_to_file: bool = False
) -> str:
    """
    Read text from a PDF, stop at defined markers, sanitize content, and optionally write to file.
    """
    full_text = ""
    stops = [m.lower() for m in markers] + ['footnotes']

    try:
        doc = fitz.open(pdf_path)
    except Exception as e:
        # Chain the original exception so the full traceback is preserved
        raise RuntimeError(f"Error opening PDF '{pdf_path}': {e}") from e

    try:
        for page in doc:
            txt = page.get_text("text")
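            low = txt.lower()
            # Collect (position, marker) for every stop term present on this page.
            # This is a plain substring match, so a stop term inside running text also counts.
            hits = [(low.find(m), m) for m in stops if m in low]
            if hits:
                # Cut at the earliest marker and keep only the text before it
                idx, mark = min(hits, key=lambda x: x[0])
                full_text += txt[:idx]
                print(f"Marker '{mark}' found on page {page.number+1}, stopping extraction.")
                break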
            full_text += txt
    finally:
        doc.close()

    # Clean the concatenated text
    clean = sanitize_text(full_text)

    # Optionally save to file
    if save_to_file:
        try:
            with open(EXTRACTED_TEXT_FILE, "w", encoding="utf-8") as f:
                f.write(clean)
            print(f"Text saved to: {EXTRACTED_TEXT_FILE}")
        except Exception as e:
            print(f"Error writing file '{EXTRACTED_TEXT_FILE}': {e}")
    return clean

# Script entry point
if __name__ == "__main__":
    result = extract_clean_text(save_to_file=True)
    print("\n--- Preview (first 500 characters) ---\n")
    print(result[:500])
    print("\n-------------------------------------")




RuntimeError: Error opening PDF 'Introduction_Why_Data_Science_Needs_Feminism.pdf': no such file: 'Introduction_Why_Data_Science_Needs_Feminism.pdf'

## 10. PDF Extraction & Sanitization

**Purpose:** Extract text from the source PDF up to defined stop markers, then clean it by removing footnotes, page numbers, and other noise to prepare for further analysis.

* **What:**

  ```python
  # Install dependencies
  %pip install tools
  %pip install PyMuPDF

  # Imports for PDF parsing and sanitization
  import fitz  # PyMuPDF for PDF reading
  import re    # Regular expressions for text cleaning

  # sanitize_text(text: str) -> str
  #   Remove bracketed footnotes, back-reference arrows, numeric-only lines, page labels, and list markers.

  # extract_clean_text(pdf_path, markers, save_to_file) -> str
  #   Open PDF, read until a marker is found, concatenate page text, apply sanitize_text, optionally save to file.

  # Script entry point for standalone execution
  ```

* **Why:**

  * Ensure reliable extraction of main content by stopping at markers (e.g., 'references', 'footnotes').
  * Use `PyMuPDF` for accurate, page-level text retrieval.
  * Clean common artifacts (footnotes, page numbers) to improve text quality.
  * Provide optional file output for auditing and iterative debugging.

* **How:**

  1. Run installation commands at the top of the notebook using Colab magics.
  2. Import `fitz` and `re` for core functionality.
  3. Define `sanitize_text` to apply regex-based cleaning and line filtering.
  4. Define `extract_clean_text` to iterate through pages, detect stop markers, accumulate raw text, and clean.
  5. Use the `__main__` guard to allow direct script execution and preview of cleaned text.
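
Because `extract_clean_text` matches stop markers as substrings, a sentence such as "the author references earlier work" would also end extraction. The sketch below shows one alternative, offered as an assumption rather than the notebook's method: accept a marker only when it stands alone on a line, as a section heading typically does. The function name `find_heading_marker` and the sample page text are illustrative.

```python
import re
from typing import List, Optional


def find_heading_marker(page_text: str, markers: List[str]) -> Optional[int]:
    """Return the offset of the first line that consists only of a marker, else None."""
    if not markers:
        return None
    alternatives = "|".join(re.escape(m) for m in markers)
    pattern = re.compile(
        rf"^[ \t]*(?:{alternatives})[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(page_text)
    return match.start() if match else None


# Made-up page text: the first line mentions "references" inside a sentence
page = (
    "The author references earlier work.\n"
    "More body text.\n"
    "References\n"
    "Doe, J. (2020)."
)
cut = find_heading_marker(page, ["references", "bibliography"])
body = page if cut is None else page[:cut]
print(repr(body))
```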



In [None]:
import spacy
import json

def load_text(path: str) -> str:
    """
    Read cleaned text from a file and return it as a string.
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    except Exception as e:
        raise RuntimeError(f"Error loading file '{path}': {e}")


def tokenize_text(text: str) -> list:
    """
    Perform linguistic analysis using spaCy and generate a list of token dictionaries with lemma, POS, and NER.
    """
    nlp = spacy.load(SPACY_MODEL, disable=SPACY_DISABLE)
    doc = nlp(text)

    token_data = []
    for token in doc:
        # Skip stopwords
        if token.is_stop:
            continue
        # Keep only alphabetic tokens
        if not token.is_alpha:
            continue
        token_entry = {
            "word": token.text,
            "lemma": token.lemma_,
            "pos": token.pos_,
            "ner": token.ent_type_
        }
        token_data.append(token_entry)
    return token_data


def save_tokens_json(tokens: list, output_path: str):
    """
    Save the token list as a JSON file.
    """
    try:
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(tokens, f, ensure_ascii=False, indent=2)
        print(f"💾 Tokens saved to: {output_path}")
    except Exception as e:
        print(f"❌ Error saving tokens to '{output_path}': {e}")


def preview_tokens(tokens: list, limit: int = 10):
    """
    Display a preview of the first N tokens.
    """
    print(f"\n🧾 Preview of first {limit} tokens:")
    for t in tokens[:limit]:
        print(f"{t['word']:15} | Lemma: {t['lemma']:15} | POS: {t['pos']:5} | NER: {t['ner']}")


if __name__ == "__main__":
    print("🧪 Starting tokenization...")
    raw_text = load_text(EXTRACTED_TEXT_FILE)
    tokens = tokenize_text(raw_text)
    print(f"✅ {len(tokens)} valid tokens extracted.")
    preview_tokens(tokens)
    save_tokens_json(tokens, TOKENS_JSON_OUTPUT)


## 11. Text Loading & Tokenization

**Purpose:** Read the cleaned text from file, perform NLP tokenization using spaCy, and manage token persistence and preview for further processing.

* **What:**

  ```python
  import spacy
  import json

  def load_text(path: str) -> str:
      # Read cleaned text file and return its content as a string
      ...

  def tokenize_text(text: str) -> list:
      # Load spaCy model and generate a list of token dicts with word, lemma, POS, and NER
      ...

  def save_tokens_json(tokens: list, output_path: str):
      # Serialize token list to a JSON file at the given path
      ...

  def preview_tokens(tokens: list, limit: int = 10):
      # Print a preview of the first `limit` tokens for inspection
      ...

  if __name__ == "__main__":
      # Orchestrate loading, tokenization, preview, and saving when run as a script
      ...
  ```

* **Why:**

  * **Reproducible I/O:** Ensures consistent loading of preprocessed text.
  * **Linguistic Analysis:** Extracts lexical features (lemmas, POS tags, named entities) for each token.
  * **Noise Reduction:** Filters out stopwords and non-alphabetic tokens for clean vocabulary lists.
  * **Persistence:** Saves token data in JSON format for downstream modules or manual inspection.
  * **Feedback Loop:** Provides a quick console preview of tokens to validate processing steps.

* **How:**

  1. Call `load_text(EXTRACTED_TEXT_FILE)` to fetch cleaned text.
  2. Initialize `nlp = spacy.load(SPACY_MODEL, disable=SPACY_DISABLE)` and process text.
  3. Iterate over `nlp(text)` results, skipping stopwords and non-alpha tokens, and build `token_data`.
  4. Use `json.dump(tokens, f)` to write `TOKENS_JSON_OUTPUT`.
  5. Invoke `preview_tokens` to display the first few tokens and verify output.


In [7]:
# Install required packages for PDF processing and translation
%pip install tools
%pip install PyMuPDF
%pip install deep_translator


Collecting deep_translator
  Downloading deep_translator-1.11.4-py3-none-any.whl.metadata (30 kB)
Downloading deep_translator-1.11.4-py3-none-any.whl (42 kB)
Installing collected packages: deep_translator
Successfully installed deep_translator-1.11.4


## Install Dependencies

**Purpose:** Ensure required packages are installed for PDF processing, text sanitization, and translation functionality.

* **What:**

  ```python
  %pip install tools
  %pip install PyMuPDF
  %pip install deep_translator
  ```

  Install the `tools` utility package, `PyMuPDF` for PDF parsing, and `deep_translator` for translation services.

* **Why:**

  * `tools` is installed by the notebook, although none of the code cells shown here import it.
  * `PyMuPDF` (imported as `fitz`) offers accurate PDF text extraction.
  * `deep_translator` enables seamless integration with translation APIs (e.g., Google Translate).

* **How:**

  * Execute these `%pip install` commands in the notebook’s first cell.
  * Use Colab magic commands to install packages into the active runtime without restarting the kernel.


In [None]:
# translator.py – central module for translations with offline lookup, API fallback, and caching
# ================================================================
# Bundles all translation functions:
# 1. Cache management to reuse previously translated words.
# 2. Offline translation via WiktionaryParser (with fallback if methods are missing).
# 3. API fallback via LibreTranslate.
# 4. Return and update cache.

import os
import json
import requests
from deep_translator import GoogleTranslator

# Optional import for offline Wiktionary parsing
try:
    from wiktionaryparser import WiktionaryParser
except ImportError:
    WiktionaryParser = None

# ----------------------------------------------------------------
# Load translation cache from file
# ----------------------------------------------------------------
def _load_cache() -> dict:
    if os.path.exists(TRANSLATION_CACHE_FILE):
        try:
            with open(TRANSLATION_CACHE_FILE, "r", encoding="utf-8") as f:
                return json.load(f)
        except json.JSONDecodeError:
            return {}
    return {}

# ----------------------------------------------------------------
# Save translation cache to file
# ----------------------------------------------------------------
def _save_cache(cache: dict):
    os.makedirs(os.path.dirname(TRANSLATION_CACHE_FILE) or '.', exist_ok=True)
    with open(TRANSLATION_CACHE_FILE, "w", encoding="utf-8") as f:
        json.dump(cache, f, ensure_ascii=False, indent=2)

# ----------------------------------------------------------------
# Attempt offline translation via WiktionaryParser
# ----------------------------------------------------------------
def _translate_offline(word: str) -> str:
    if WiktionaryParser is None:
        return ""

    parser = WiktionaryParser()
    try:
        parser.set_default_language("english")
    except Exception:
        pass
    try:
        parser.set_languages(["german"])
    except Exception:
        pass

    try:
        entries = parser.fetch(word)
        if entries:
            definitions = entries[0].get("definitions", []) or []
            for d in definitions:
                translations = d.get("translations") or {}
                german = translations.get("german") or translations.get("deutsch")
                if german:
                    return ", ".join(german)
    except Exception:
        pass
    return ""

# ----------------------------------------------------------------
# API fallback via LibreTranslate
# ----------------------------------------------------------------
def _translate_api(word: str) -> str:
    if FALLBACK_SERVICE.lower() != "libre":
        return ""
    try:
        resp = requests.post(
            "https://libretranslate.com/translate",
            data={"q": word, "source": "en", "target": "de", "format": "text"},
            timeout=10
        )
        if resp.ok:
            return resp.json().get("translatedText", "")
    except Exception:
        pass
    return ""

# ----------------------------------------------------------------
# Main translation function: translate_word
# ----------------------------------------------------------------
def translate_word(word: str) -> str:
    cache = _load_cache()
    key = word.strip().lower()
    # 1) Check cache
    if key in cache:
        return cache[key]

    # 2) Primary translation via GoogleTranslator (network call, may fail)
    try:
        translation = GoogleTranslator(source='en', target='de').translate(word) or ""
    except Exception:
        translation = ""

    # 3) (Optional) Wiktionary or API fallback
    # if TRANSLATION_SERVICE.lower() == "wiktionary":
    #     offline = _translate_offline(word)
    #     if offline:
    #         translation = offline
    # if not translation and ENABLE_FALLBACK:
    #     translation = _translate_api(word)

    # 4) Cache only non-empty results so failed lookups can be retried later
    if translation:
        cache[key] = translation
        _save_cache(cache)

    return translation


## 12. Translation Module (translator.py)

**Purpose:** Centralize English→German word translation behind a single function with a JSON cache. As written, `GoogleTranslator` performs the translation; the Wiktionary lookup and the LibreTranslate fallback are defined but commented out in `translate_word`.

* **What:**

  ```python
  import os
  import json
  import requests
  from deep_translator import GoogleTranslator
  try:
      from wiktionaryparser import WiktionaryParser
  except ImportError:
      WiktionaryParser = None

  def _load_cache() -> dict: ...
  def _save_cache(cache: dict): ...
  def _translate_offline(word: str) -> str: ...
  def _translate_api(word: str) -> str: ...
  def translate_word(word: str) -> str: ...
  ```

  Functions handle cache I/O, Wiktionary lookup, API calls to LibreTranslate, and orchestration in `translate_word`, which calls GoogleTranslator and keeps the two fallback branches commented out.

* **Why:**

  * `_load_cache / _save_cache`: Reuse previous translations and minimize API usage.
  * `_translate_offline`: Attempt a dictionary lookup through `WiktionaryParser` before resorting to machine translation (the parser itself fetches from wiktionary.org).
  * `_translate_api`: Provide a LibreTranslate fallback when the dictionary lookup returns nothing.
  * `translate_word`: Offer a single entry point that checks cache, uses GoogleTranslator, and updates the cache.

* **How:**

  1. Load translation cache from `TRANSLATION_CACHE_FILE`.
  2. If the word exists in the cache, return the cached result.
  3. Use `GoogleTranslator(source='en', target='de')` for primary translation.
  4. (Optional) Uncomment offline or API fallback code to enable additional lookup methods.
  5. Store the final translation in the cache and save it to disk.
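
`translate_word` reads and rewrites the cache file on every call. When many words are translated in one run, a variant that loads the file once and saves once may be preferable. The class `TranslationCache` below is a hypothetical sketch of that idea; the translation backend is passed in as a callable and replaced here by a stand-in with hard-coded answers.

```python
import json
import os
import tempfile
from typing import Callable, Dict


class TranslationCache:
    """Keep translations in memory and write them to disk only on save()."""

    def __init__(self, path: str):
        self.path = path
        self.entries: Dict[str, str] = {}
        self.dirty = False
        if os.path.exists(path):
            try:
                with open(path, "r", encoding="utf-8") as f:
                    self.entries = json.load(f)
            except json.JSONDecodeError:
                self.entries = {}

    def lookup(self, word: str, translate: Callable[[str], str]) -> str:
        """Return the cached translation or ask the backend once and remember it."""
        key = word.strip().lower()
        if key not in self.entries:
            translation = translate(word)
            if not translation:
                return ""  # do not cache failed lookups
            self.entries[key] = translation
            self.dirty = True
        return self.entries[key]

    def save(self) -> None:
        """Write the cache to disk if it changed since loading."""
        if self.dirty:
            with open(self.path, "w", encoding="utf-8") as f:
                json.dump(self.entries, f, ensure_ascii=False, indent=2)
            self.dirty = False


# Stand-in backend that records how often it is called (assumption for this sketch)
calls = []

def fake_translate(word: str) -> str:
    calls.append(word)
    return {"data": "Daten"}.get(word.strip().lower(), "")


cache_path = os.path.join(tempfile.mkdtemp(), "translation_cache.json")
cache = TranslationCache(cache_path)
results = [cache.lookup(w, fake_translate) for w in ["data", "Data", "data "]]
cache.save()
print(results, calls)
```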


In [None]:
# ===============================================================
# 📄 step3a_cefr_parser.py – Oxford 5000 PDF Parsing and Filtering
# ===============================================================
# Extract B2 and C1 level vocabulary from the "The Oxford 5000 by CEFR level" PDF,
# clean footers, parse word + POS, and write results to CSV per CSV_COLUMNS in config.py.
# ===============================================================

import fitz  # PyMuPDF
import re
import csv

# Path to the Oxford 5000 PDF
PDF_PATH = "The_Oxford_5000_by_CEFR_level.pdf"
# Output CSV file for filtered vocabulary
OUTPUT_CSV = "oxford_5000_cefr.csv"

# Regex to remove footers (e.g., "© Oxford University Press The Oxford 5000™ by CEFR level 2 / 8")
FOOTER_PATTERN = re.compile(r"© Oxford University Press.*\d+\s*/\s*\d+")

# Regex for entries: word followed by POS notation, e.g., "absorb v." or "boost v., n."
ENTRY_PATTERN = re.compile(r"^([a-zA-Z\-]+)\s+([a-z\.,\s]+)\.")

# CEFR levels to extract
VALID_LEVELS = {"B2", "C1"}

def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Load the PDF and extract raw text, removing defined footer patterns.
    """
    doc = fitz.open(pdf_path)
    all_text = []

    for page in doc:
        text = page.get_text("text")
        # Remove footer if present
        text = FOOTER_PATTERN.sub("", text)
        all_text.append(text.strip())

    doc.close()
    return "\n".join(all_text)

def parse_cefr_vocab(text: str) -> list:
    """
    Parse the extracted text line by line, detect level markers and entries,
    and build a list of vocabulary entry dicts.
    """
    rows = []
    current_level = None

    for line in text.splitlines():
        line = line.strip()
        # Skip empty lines and headings
        if not line or line.lower().startswith("the oxford"):
            continue

        # Detect level markers
        if line in VALID_LEVELS:
            current_level = line
            continue

        # Match vocabulary entries
        match = ENTRY_PATTERN.match(line)
        if match and current_level in VALID_LEVELS:
            word = match.group(1)
            pos_raw = match.group(2).replace(" ", "")
            # Strip periods so multi-tag entries ("v., n.") yield "v" and "n", not "v." and "n"
            pos_list = [pos.strip(".") for pos in pos_raw.split(",") if pos.strip(".")]

            for pos in pos_list:
                entry = {
                    "word": word,
                    "lemma": "",
                    "pos": pos,
                    "ner": "",
                    "cefr_predicted": current_level,
                    "translation_de": "",
                    "anki_front": "",
                    "anki_back": "",
                    "anki_type": "",
                    "status_learning": "new",
                    "repetition_stage": ""
                }
                rows.append(entry)

    return rows

def write_csv(rows: list, csv_path: str):
    """
    Write the vocabulary rows to a CSV file using CSV_COLUMNS from config.
    """
    with open(csv_path, "w", encoding="utf-8", newline="") as f_out:
        writer = csv.DictWriter(f_out, fieldnames=CSV_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)

    print(f"✅ Completed! Wrote {len(rows)} entries to '{csv_path}'")

# -------------------- Main Script --------------------
if __name__ == "__main__":
    print("📥 Extracting text from PDF...")
    pdf_text = extract_text_from_pdf(PDF_PATH)

    print("🧩 Parsing CEFR vocabulary entries...")
    vocab_entries = parse_cefr_vocab(pdf_text)

    print("💾 Writing entries to CSV...")
    write_csv(vocab_entries, OUTPUT_CSV)


## step3a\_cefr\_parser.py – Oxford 5000 PDF Parsing and Filtering

**Purpose:** Extract and filter B2/C1 level vocabulary entries from the “The Oxford 5000 by CEFR level” PDF, clean page footers, and export structured entries to CSV using the schema defined in `config.py`.

* **What:**

  ```python
  # Open PDF and remove footer patterns
  def extract_text_from_pdf(pdf_path: str) -> str: ...

  # Parse cleaned text, detect CEFR level markers, and build vocabulary entries
  def parse_cefr_vocab(text: str) -> list: ...

  # Write the resulting entries to CSV using CSV_COLUMNS
  def write_csv(rows: list, csv_path: str): ...

  # Main script: orchestrate extraction, parsing, and CSV writing
  ```

* **Why:**

  * Automate extraction of target-level vocabulary (B2 and C1) from a standardized CEFR list.
  * Remove recurring footer noise to prevent false matches and ensure data cleanliness.
  * Structure output in CSV format for seamless integration with quiz and flashcard modules.

* **How:**

  1. Load the PDF using `fitz.open()` and extract raw text, applying `FOOTER_PATTERN` to strip out footers.
  2. Iterate over each line of text: identify level markers, apply `ENTRY_PATTERN` to match word + POS, and accumulate entries in a list of dictionaries.
  3. Use `csv.DictWriter` with `CSV_COLUMNS` to serialize the list of entry dicts into the output CSV file.
  4. Run as a standalone script, printing progress messages at each stage.
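
The line-by-line parsing described in step 2 can be tried on a few hand-written lines. This is a minimal sketch, not the script itself: the sample text is invented for illustration (it is not taken from the PDF), and `parse_lines` is a hypothetical, simplified stand-in for `parse_cefr_vocab` that keeps only word, POS, and level. The sketch also strips periods from the POS tags so that multi-tag entries are stored uniformly.

```python
import re

# Same pattern as in the script: a word, whitespace, then POS tags ending in a period
ENTRY_PATTERN = re.compile(r"^([a-zA-Z\-]+)\s+([a-z\.,\s]+)\.")
VALID_LEVELS = {"B2", "C1"}

def parse_lines(text: str) -> list:
    """Hypothetical simplified parser: returns (word, pos, level) tuples."""
    results = []
    current_level = None
    for line in text.splitlines():
        line = line.strip()
        if line in VALID_LEVELS:
            current_level = line  # a level marker switches the active level
            continue
        match = ENTRY_PATTERN.match(line)
        if match and current_level:
            # One tuple per POS tag; periods are stripped from each tag
            for pos in match.group(2).split(","):
                pos = pos.strip().strip(".")
                if pos:
                    results.append((match.group(1), pos, current_level))
    return results

# Invented sample input: two level markers and three entries
sample = "B2\nabsorb v.\nboost v., n.\nC1\nabolish v."
for row in parse_lines(sample):
    print(row)
```

An entry with several POS tags, such as `boost v., n.`, produces one row per tag, which mirrors how the script emits one dictionary per POS.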


In [None]:
# ===============================================================
# 📄 step3aa_translate_oxford.py – One-time Oxford Vocabulary Translation
# ===============================================================
# Reads 'oxford_5000_cefr.csv', translates each 'word' field into German,
# and writes the output to 'oxford_5000_translated.csv'.
# Uses primary translation via GoogleTranslator, with optional offline or API fallback.
# ===============================================================

import csv

INPUT_CSV = "oxford_5000_cefr.csv"
OUTPUT_CSV = "oxford_5000_translated.csv"


def load_vocab(csv_path: str) -> list:
    """
    Read vocabulary entries from a CSV file and return them as a list of dictionaries.
    """
    with open(csv_path, encoding="utf-8") as f_in:
        reader = csv.DictReader(f_in)
        return list(reader)


def translate_vocab(rows: list) -> list:
    """
    Translate entries where 'translation_de' is empty, skipping already translated words.
    """
    for row in rows:
        word = row["word"].strip()
        if row.get("translation_de"):
            continue
        try:
            row["translation_de"] = translate_word(word)
        except Exception as e:
            row["translation_de"] = "[ERROR]"
            print(f"⚠️ Error translating '{word}': {e}")
    return rows


def save_vocab(rows: list, csv_path: str):
    """
    Save the translated vocabulary list to a new CSV file.
    """
    with open(csv_path, "w", encoding="utf-8", newline="") as f_out:
        writer = csv.DictWriter(f_out, fieldnames=CSV_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    print(f"✅ Translated file saved to: {csv_path}")


if __name__ == "__main__":
    print("📥 Loading Oxford vocabulary list...")
    data = load_vocab(INPUT_CSV)
    print("🌍 Translating vocabulary entries...")
    translated = translate_vocab(data)
    print("💾 Saving translated list...")
    save_vocab(translated, OUTPUT_CSV)


## step3aa\_translate\_oxford.py – One-time Oxford Vocabulary Translation

**Purpose:** Translate B2/C1 vocabulary entries from the extracted Oxford list into German and export results for downstream usage.

* **What:**

  ```python
  import csv
  INPUT_CSV = "oxford_5000_cefr.csv"
  OUTPUT_CSV = "oxford_5000_translated.csv"

  def load_vocab(csv_path: str) -> list: ...
  def translate_vocab(rows: list) -> list: ...
  def save_vocab(rows: list, csv_path: str): ...

  if __name__ == "__main__":
      # Orchestrate loading, translating, and saving
  ```

* **Why:**

  * Automate translation of extracted vocabulary without manual effort.
  * Skip entries already translated to minimize API calls.
  * Handle translation errors gracefully with placeholders.
  * Produce a cleaned CSV file ready for integration into quiz or flashcard modules.

* **How:**

  1. Read input CSV using `csv.DictReader`.
  2. Loop through rows, calling `translate_word` for each word needing translation.
  3. Populate the `translation_de` field and catch exceptions.
  4. Write updated entries to `OUTPUT_CSV` with `csv.DictWriter`.
  5. Execute logic in the `__main__` block, printing status messages at each step.
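
The skip-and-placeholder behaviour can be checked without any network access. The following is a minimal sketch under stated assumptions: `translate_rows` is a hypothetical variant of `translate_vocab` that takes the translator as an argument, and `fake_translate` is an invented stand-in for `translate_word` backed by a one-entry dictionary.

```python
def translate_rows(rows: list, translate) -> list:
    """Fill empty 'translation_de' fields; `translate` is any callable str -> str."""
    for row in rows:
        if row.get("translation_de"):
            continue  # already translated: no lookup needed
        try:
            row["translation_de"] = translate(row["word"].strip())
        except Exception as exc:
            row["translation_de"] = "[ERROR]"  # placeholder marks the failed row
            print(f"Error translating '{row['word']}': {exc}")
    return rows

# Hypothetical stand-in for translate_word (assumption, not the real translator)
def fake_translate(word: str) -> str:
    return {"absorb": "absorbieren"}[word]  # raises KeyError for unknown words

rows = [
    {"word": "absorb", "translation_de": ""},       # needs translation
    {"word": "boost", "translation_de": "Schub"},   # already translated, skipped
    {"word": "zzz", "translation_de": ""},          # lookup fails, gets placeholder
]
translate_rows(rows, fake_translate)
print([r["translation_de"] for r in rows])
```

Passing the translator as a parameter is a design choice of this sketch only; it makes the loop testable offline, while the script itself calls `translate_word` directly.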



In [None]:
# ===============================================================
# 📄 step3b_active-pdf-translate.py – Complete Step 3b: Oxford List, Token Loading & Translation
# ===============================================================
# Combines the Oxford CEFR list with token data, translates new tokens, and exports an extended vocabulary CSV.
# ===============================================================

import csv
import json
import time
import logging
from pathlib import Path
from deep_translator import GoogleTranslator

# Configure logging for translation errors
logging.basicConfig(
    filename='translation_errors.log',
    filemode='a',
    format='%(asctime)s %(levelname)s: %(message)s',
    level=logging.WARNING
)

# Load or initialize translation cache
cache_path = Path(TRANSLATION_CACHE_FILE)
translation_cache = {}
if cache_path.exists():
    try:
        translation_cache = json.loads(cache_path.read_text(encoding='utf-8'))
    except json.JSONDecodeError:
        translation_cache = {}

# ----------------------------------------------------------------
# Load Oxford vocabulary as a set
# ----------------------------------------------------------------
def load_oxford_vocab(csv_path: str) -> set:
    vocab = set()
    with open(csv_path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            word = row.get('word')
            if word:
                vocab.add(word)
    return vocab

# ----------------------------------------------------------------
# Get translation with caching and logging
# ----------------------------------------------------------------
def get_translation(word: str) -> str:
    key = word.lower()
    if key in translation_cache:
        return translation_cache[key].get('translation_de', '')
    try:
        translation = GoogleTranslator(source='en', target='de').translate(word)
    except Exception as e:
        logging.warning(f"GoogleTranslator failed for {word}: {e}")
        # Do not cache failures, so the word is retried on the next run
        return ''
    translation_cache[key] = {'translation_de': translation}
    return translation

# ----------------------------------------------------------------
# Enrich token list with translations for unknown words
# ----------------------------------------------------------------
def enrich_tokens(tokens: list, known_vocab: set) -> list:
    new_entries = []
    for token in tokens:
        word = token['word']
        if word in known_vocab:
            continue
        translation = get_translation(word)
        if not translation:
            print(f"⚠️ No translation for '{word}', skipping.")
            continue
        entry = {
            'word': word,
            'lemma': token.get('lemma', ''),
            'pos': token.get('pos', ''),
            'ner': token.get('ner', ''),
            'cefr_predicted': '',
            'translation_de': translation,
            'anki_front': '',
            'anki_back': '',
            'anki_type': '',
            'status_learning': 'new',
            'repetition_stage': ''
        }
        new_entries.append(entry)
        print(f"Translated: {word} → {translation}")
        time.sleep(0.2)
    return new_entries

# ----------------------------------------------------------------
# Save translation cache back to file
# ----------------------------------------------------------------
def save_cache():
    with open(TRANSLATION_CACHE_FILE, 'w', encoding='utf-8') as f:
        json.dump(translation_cache, f, ensure_ascii=False, indent=2)
    print(f"💾 Cache saved to '{TRANSLATION_CACHE_FILE}'")

# ----------------------------------------------------------------
# Save enriched entries to CSV
# ----------------------------------------------------------------
def save_to_csv(entries: list, csv_path: str):
    with open(csv_path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=CSV_COLUMNS)
        writer.writeheader()
        writer.writerows(entries)
    print(f"✅ Saved {len(entries)} entries to '{csv_path}'")

# -------------------- Main --------------------
def main():
    oxford_csv = 'oxford_5000_cefr.csv'
    tokens_json = 'tokens_step2.json'
    output_csv = 'extended_vocab.csv'

    with open(tokens_json, 'r', encoding='utf-8') as f:
        tokens = json.load(f)

    print('🔍 Loading Oxford vocabulary...')
    known_vocab = load_oxford_vocab(oxford_csv)

    print('🔄 Enriching tokens with translations...')
    new_entries = enrich_tokens(tokens, known_vocab)

    print('💾 Saving enriched vocabulary to CSV...')
    save_to_csv(new_entries, output_csv)

    save_cache()

if __name__ == '__main__':
    main()


## step3b\_active-pdf-translate.py – Combined Vocabulary Enrichment

**Purpose:** Integrate Oxford CEFR list and tokenized text to translate and export new vocabulary entries, extending the core dataset.

* **What:**

  ```python
  import csv, json, time, logging
  from pathlib import Path
  from deep_translator import GoogleTranslator

  # load_oxford_vocab(csv_path) -> set
  # get_translation(word) -> str
  # enrich_tokens(tokens, known_vocab) -> list
  # save_cache() -> None
  # save_to_csv(entries, csv_path) -> None
  # main() orchestrates the above steps
  ```

  Functions handle loading existing vocabulary, caching translations, enriching tokens, and persisting results.

* **Why:**

  * Leverage existing Oxford list to avoid duplicate translations.
  * Cache results to minimize API calls and handle transient failures.
  * Provide logging of translation errors for later review.
  * Generate an extended vocabulary CSV for quiz and flashcard modules.

* **How:**

  1. Initialize logging and load translation cache.
  2. Load existing Oxford words into a set via `load_oxford_vocab`.
  3. Read tokens from `tokens_step2.json`.
  4. For each token not in the known set, call `get_translation` and build entry dicts.
  5. Write enriched entries to `extended_vocab.csv` and save cache back to `TRANSLATION_CACHE_FILE`.
  6. Execute all logic within a `main()` function guarded by `if __name__ == '__main__'`.
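
The cache round trip (load, look up, store, save) can be sketched in isolation. This is an illustrative sketch, not the script's code: `load_cache` and `cached_translate` are hypothetical helpers, `fake_translate` is an invented stand-in for `GoogleTranslator`, and `cache.json` is an assumed file name inside a temporary directory.

```python
import json
import tempfile
from pathlib import Path

def load_cache(path: Path) -> dict:
    """Return the cached translations, or an empty dict if missing or corrupt."""
    if path.exists():
        try:
            return json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            pass
    return {}

def cached_translate(word: str, cache: dict, translate) -> str:
    """Look up `word` in the cache first; call `translate` only on a miss."""
    key = word.lower()  # keys are lowercased, as in get_translation
    if key in cache:
        return cache[key].get("translation_de", "")
    translation = translate(word)
    cache[key] = {"translation_de": translation}
    return translation

calls = []  # records every real lookup so cache hits are visible

def fake_translate(word: str) -> str:  # hypothetical stand-in for GoogleTranslator
    calls.append(word)
    return word.upper()

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "cache.json"  # assumed file name
    cache = load_cache(cache_file)         # file does not exist yet: empty dict
    cached_translate("Data", cache, fake_translate)
    cached_translate("data", cache, fake_translate)  # cache hit, no second call
    cache_file.write_text(
        json.dumps(cache, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    reloaded = load_cache(cache_file)

print(calls, reloaded)
```

Because keys are lowercased, `Data` and `data` share one cache entry, so only the first call reaches the translator.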



## Install genanki

**Purpose:** Ensure the `genanki` library is available for programmatic creation of Anki flashcard decks.

* **What:**

  ```python
  %pip install genanki
  ```

  Install the `genanki` package into the current notebook environment.

* **Why:**

  * Enables automated generation of spaced-repetition flashcards.
  * Integrates with the project’s export scripts to produce `.apkg` files directly from code.
  * Simplifies the workflow by removing manual steps for Anki deck creation.

* **How:**

  1. Run the `%pip install genanki` command in a code cell at the start of the notebook.
  2. The `%pip` magic installs the package into the environment of the active kernel, so a runtime restart is normally not needed.


In [None]:
import csv
from genanki import Model, Note, Deck, Package

# ---------------------------------------------------------------
# 1. Load translated CSV (from step3b/3aa)
# ---------------------------------------------------------------
def load_translated_csv(path: str) -> list:
    with open(path, encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return [
            row for row in reader
            if row.get("translation_de") and not row["translation_de"].startswith("[")
        ]

# ---------------------------------------------------------------
# 2. Export minimal CSV (english/german)
# ---------------------------------------------------------------
def export_to_csv(data: list, path: str):
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["english", "german"])
        for item in data:
            writer.writerow([item["word"], item["translation_de"]])

# ---------------------------------------------------------------
# 3. Export Anki deck (.apkg) via genanki
# ---------------------------------------------------------------
def export_to_anki(data: list, deck_name: str, output_file: str):
    model = Model(
        1607392319,
        "Simple Model",
        fields=[{"name": "Question"}, {"name": "Answer"}],
        templates=[{
            "name": "Card 1",
            "qfmt": "{{Question}}",
            "afmt": "{{FrontSide}}<hr id='answer'>{{Answer}}",
        }]
    )

    deck = Deck(2059400110, deck_name)

    for item in data:
        note = Note(
            model=model,
            fields=[item["word"], item["translation_de"]]
        )
        deck.add_note(note)

    Package(deck).write_to_file(output_file)

# ---------------------------------------------------------------
# ▶️ Main (script entry)
# ---------------------------------------------------------------
if __name__ == "__main__":
    print("🧪 Starting Anki export...")

    # 1. Load translated vocabulary
    tokens = load_translated_csv("oxford_5000_translated.csv")
    print(f"📚 {len(tokens)} translated entries loaded.")

    # 2. Export to simple CSV for manual use
    export_to_csv(tokens, "anki_cards.csv")
    print("✅ CSV saved: anki_cards.csv")

    # 3. Create Anki deck
    export_to_anki(tokens, "English-German Vocabulary Training", "anki_deck.apkg")
    print("🎉 Anki deck saved: anki_deck.apkg")


## step3c\_anki\_export.py – CSV and Anki Deck Export

**Purpose:** Generate a simple bilingual CSV and a fully formatted Anki deck (`.apkg`) from translated vocabulary entries to support interactive learning.

* **What:**

  ```python
  import csv
  from genanki import Model, Note, Deck, Package

  def load_translated_csv(path: str) -> list: ...
  def export_to_csv(data: list, path: str): ...
  def export_to_anki(data: list, deck_name: str, output_file: str): ...

  if __name__ == "__main__":
      # Load translated entries, save CSV, create Anki deck
  ```

* **Why:**

  * **load\_translated\_csv:** Filter out entries lacking valid translations to focus on usable vocabulary.
  * **export\_to\_csv:** Provide a lightweight CSV of English–German pairs for manual review or import into other systems.
  * **export\_to\_anki:** Automate creation of a spaced-repetition deck in Anki format, removing manual deck-building steps.

* **How:**

  1. Read the translated CSV using `csv.DictReader` and filter rows where `translation_de` is non-empty and not marked as error.
  2. Write a minimal CSV with headers `english,german` via `csv.writer`.
  3. Define an Anki `Model` with `Question` and `Answer` fields and a single card template.
  4. Create a `Deck` and iterate through data to add `Note` objects with question-answer fields.
  5. Use `genanki.Package(deck).write_to_file(output_file)` to save the deck as `.apkg`.
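
Steps 1 and 2 can be tried without files or `genanki` by reading from an in-memory string. This is a minimal sketch with invented sample rows; the filter expression is the one used in `load_translated_csv`, while the `io.StringIO` buffers and the `lineterminator` setting are choices made for this illustration only.

```python
import csv
import io

# Invented sample input: one usable row, one empty translation, one error placeholder
raw = io.StringIO(
    "word,translation_de\n"
    "absorb,absorbieren\n"
    "boost,\n"
    "zzz,[ERROR]\n"
)

# Keep rows whose translation is non-empty and not a bracketed placeholder
usable = [
    row for row in csv.DictReader(raw)
    if row.get("translation_de") and not row["translation_de"].startswith("[")
]

# Write the minimal english/german CSV into a second in-memory buffer
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["english", "german"])
for row in usable:
    writer.writerow([row["word"], row["translation_de"]])

print(out.getvalue())
```

Only the `absorb` row survives the filter, so the exported CSV contains the header and a single English–German pair.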


In [None]:
%pip install genanki