Learn English with real TV dialogue. Powered by LLMs.
VocaFlick is an experimental pipeline that uses large language models (LLMs) to help English learners acquire context-rich vocabulary from subtitles of TV series. For every episode, the system selects 50 relevant words and provides their definitions — all based on real-world usage from subtitle data.
- Extracts real vocabulary from TV show episodes
- Uses LLMs and clustering to group words by semantic similarity
- Automatically assigns definitions from English WordNet
- Filters and selects relevant learning vocabulary per episode
- Outputs per-episode JSON files with 50 learning words + definitions
The project is structured as a modular pipeline of 5 main steps:
| Step | Script | Purpose |
|---|---|---|
| 1️⃣ | `01_extract_dict.py` | Extracts word–definition pairs from English WordNet XML |
| 2️⃣ | `02_extract_phrases.py` | Filters multiword expressions from the dictionary |
| 3️⃣ | `03_extract_tokens.py` | Processes subtitle data, filters tokens, clusters by meaning |
| 4️⃣ | `04_filter_tokens.py` | Uses an LLM to assign semantic categories and filter vocabulary |
| 5️⃣ | `05_generate_output.py` | Assigns per-episode vocabulary and exports final JSON |
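
As a rough illustration of the kind of records the final step produces, the sketch below keeps the most frequent tokens of one episode that have a dictionary definition and emits entries with `token`, `frequency`, and `definition` fields. The function name `select_vocabulary`, the ranking by raw frequency, and the toy inputs are assumptions made for this example; the actual selection logic in `05_generate_output.py` may differ.

```python
from collections import Counter


def select_vocabulary(tokens, definitions, limit=50):
    """Return up to `limit` records for the most frequent defined tokens.

    Hypothetical helper: ranking by raw frequency is an assumption,
    not necessarily what 05_generate_output.py does.
    """
    counts = Counter(tokens)
    records = []
    for token, frequency in counts.most_common():
        if token not in definitions:
            continue  # skip tokens that have no dictionary definition
        records.append(
            {
                "token": token,
                "frequency": frequency,
                "definition": definitions[token],
            }
        )
        if len(records) == limit:
            break
    return records


# Toy data for illustration only (not taken from the corpus).
episode_tokens = ["negotiate", "deal", "negotiate", "uh", "deal", "negotiate"]
toy_definitions = {
    "negotiate": "discuss the terms of an arrangement",
    "deal": "a particular instance of buying or selling",
}
vocabulary = select_vocabulary(episode_tokens, toy_definitions, limit=50)
```

In this toy run, `uh` is dropped because it has no definition, and the remaining tokens come out ordered by how often they occur.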
- Dictionary: English WordNet (Global WordNet)
- Subtitles: Taiga Subtitles Corpus
Each episode is processed into a .json file with the following structure:
```json
[
  {
    "token": "negotiate",
    "frequency": 14,
    "definition": "discuss the terms of an arrangement"
  },
  ...
]
```

Metadata linking episodes to filenames is saved in `output/metadata.json`.
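
Since each per-episode file is a JSON array of records, it can be read with the standard library alone. The following is a minimal sketch of loading one file and rendering it as a study list; the helper names `load_episode` and `format_study_list` and the file name `episode.json` are hypothetical, and the sample record is the one shown above.

```python
import json
import tempfile
from pathlib import Path


def load_episode(path):
    """Load one per-episode file and return its list of word records."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def format_study_list(records):
    """Render records as 'token (frequency): definition', most frequent first."""
    ordered = sorted(records, key=lambda r: r["frequency"], reverse=True)
    return [f'{r["token"]} ({r["frequency"]}): {r["definition"]}' for r in ordered]


# Write a one-record sample to a temporary file so the example is self-contained.
sample = [
    {
        "token": "negotiate",
        "frequency": 14,
        "definition": "discuss the terms of an arrangement",
    }
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "episode.json"  # hypothetical file name
    sample_path.write_text(json.dumps(sample), encoding="utf-8")
    study_lines = format_study_list(load_episode(sample_path))

for line in study_lines:
    print(line)
```

To use it on real output, pass the path of a generated episode file to `load_episode` instead of the temporary sample.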
This is an MVP-level, research-driven project built entirely through LLM-driven development. Expect rough edges — contributions, feedback and forks welcome!
The pipeline is designed to run sequentially:
```bash
python 01_extract_dict.py wordnet.xml.gz dict.csv
python 02_extract_phrases.py dict.csv phrases.csv
python 03_extract_tokens.py
python 04_filter_tokens.py
python 05_generate_output.py
```

You will need:

- Subtitle `.parquet` files from the Taiga corpus
- spaCy model `en_core_web_md`
- Access to an Ollama-compatible LLM endpoint (for semantic filtering)
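
Because the steps depend on each other's output, it can help to run them from a single driver that stops at the first failure. The sketch below is one way to do that, assuming it is run from the directory that contains the scripts; `PIPELINE_STEPS` and `run_pipeline` are hypothetical names, and the arguments are simply the ones from the command sequence above.

```python
import subprocess
import sys

# Script names and arguments copied from the command sequence in this README.
PIPELINE_STEPS = [
    ["01_extract_dict.py", "wordnet.xml.gz", "dict.csv"],
    ["02_extract_phrases.py", "dict.csv", "phrases.csv"],
    ["03_extract_tokens.py"],
    ["04_filter_tokens.py"],
    ["05_generate_output.py"],
]


def run_pipeline(steps=PIPELINE_STEPS, runner=subprocess.run):
    """Run each step in order with the current Python interpreter.

    check=True makes subprocess.run raise CalledProcessError on a
    non-zero exit code, so later steps never run on incomplete output.
    """
    for step in steps:
        command = [sys.executable, *step]
        print("Running:", " ".join(step))
        runner(command, check=True)
```

Calling `run_pipeline()` executes all five steps. The `runner` parameter exists only so the function can be exercised without launching the real scripts.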
MIT License
- Subtitle corpus by Fascinat0r (HuggingFace)
- Dictionary from Global WordNet
- Language model orchestration inspired by modern NLP workflows