consulitsk/VocaFlick

📺 VocaFlick

Learn English with real TV dialogue. Powered by LLMs.

VocaFlick is an experimental pipeline that uses large language models (LLMs) to help English learners acquire vocabulary in context from TV series subtitles. For every episode, the system selects 50 relevant words from the subtitle data and pairs each with a definition from English WordNet.


✨ Features

  • Extracts real vocabulary from TV show episodes
  • Uses LLMs and clustering to group words by semantic similarity
  • Automatically assigns definitions from English WordNet
  • Filters and selects relevant learning vocabulary per episode
  • Outputs per-episode JSON files with 50 learning words + definitions

🧠 Pipeline Overview

The project is structured as a modular pipeline of 5 main steps:

| Step | Script | Purpose |
|------|--------|---------|
| 1️⃣ | 01_extract_dict.py | Extracts word–definition pairs from English WordNet XML |
| 2️⃣ | 02_extract_phrases.py | Filters multiword expressions from the dictionary |
| 3️⃣ | 03_extract_tokens.py | Processes subtitle data, filters tokens, clusters them by meaning |
| 4️⃣ | 04_filter_tokens.py | Uses an LLM to assign semantic categories and filter vocabulary |
| 5️⃣ | 05_generate_output.py | Assigns per-episode vocabulary and exports the final JSON |

📦 Data Sources


📁 Output Format

Each episode is processed into a .json file with the following structure:

[
  {
    "token": "negotiate",
    "frequency": 14,
    "definition": "discuss the terms of an arrangement"
  },
  ...
]

Metadata linking episodes to filenames is saved in output/metadata.json.
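
A minimal sketch of how a consumer might read one of these files, assuming only the three fields shown above (`token`, `frequency`, `definition`). The function name `parse_episode_vocab` and the second sample entry are hypothetical and not part of the repository; the structure of `output/metadata.json` is not documented here, so it is left out.

```python
import json

REQUIRED_FIELDS = {"token", "frequency", "definition"}


def parse_episode_vocab(text):
    """Parse the JSON text of one episode file and return its entries,
    most frequent first. Raises ValueError if the shape is unexpected."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of vocabulary entries")
    for entry in entries:
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"entry {entry!r} is missing fields: {sorted(missing)}")
    return sorted(entries, key=lambda e: e["frequency"], reverse=True)


# The first entry is the example from this README; the second is made up
# purely to show the sorting.
SAMPLE = """
[
  {"token": "negotiate", "frequency": 14,
   "definition": "discuss the terms of an arrangement"},
  {"token": "alibi", "frequency": 3,
   "definition": "(hypothetical entry for illustration)"}
]
"""

for entry in parse_episode_vocab(SAMPLE):
    print(f'{entry["token"]} ({entry["frequency"]}x): {entry["definition"]}')
```

To read a real file, pass its contents in, for example `parse_episode_vocab(Path(some_file).read_text(encoding="utf-8"))` with `pathlib.Path`.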


🧪 Status

This is an MVP-level, research-driven project built entirely through LLM-driven development. Expect rough edges — contributions, feedback and forks welcome!


🚀 Usage

The pipeline is designed to run sequentially:

python 01_extract_dict.py wordnet.xml.gz dict.csv
python 02_extract_phrases.py dict.csv phrases.csv
python 03_extract_tokens.py
python 04_filter_tokens.py
python 05_generate_output.py
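
Since each step depends on the previous one, the five commands above could also be chained from a small driver that stops at the first failure. This is only a sketch: the repository does not ship such a script, and `build_commands` and `run_pipeline` are hypothetical names. The script names and arguments are copied from the listing above.

```python
import subprocess
import sys

# Script names and arguments exactly as listed in the usage section.
STEPS = [
    ["01_extract_dict.py", "wordnet.xml.gz", "dict.csv"],
    ["02_extract_phrases.py", "dict.csv", "phrases.csv"],
    ["03_extract_tokens.py"],
    ["04_filter_tokens.py"],
    ["05_generate_output.py"],
]


def build_commands(python=sys.executable):
    """Return the full command line for each step, in pipeline order."""
    return [[python, *step] for step in STEPS]


def run_pipeline(runner=subprocess.run):
    """Run every step in order. With the default runner, check=True makes
    subprocess.run raise CalledProcessError on a non-zero exit code, so
    later steps never run on top of a failed one."""
    for command in build_commands():
        print("running:", " ".join(command))
        runner(command, check=True)
```

Calling `run_pipeline()` from the repository root would then run the steps in the order shown. The `runner` parameter exists only so the sequence can be exercised without launching the scripts.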

You will need:

  • Subtitle .parquet files from the Taiga corpus
  • spaCy model en_core_web_md
  • Access to an Ollama-compatible LLM endpoint (for semantic filtering)
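
Before starting a run, the three requirements above can be checked up front. The sketch below is not part of the repository and rests on several assumptions: the subtitle directory is passed in by the caller because this README does not say where the scripts expect it, the spaCy model is installed as a regular Python package (as `python -m spacy download en_core_web_md` does), and the LLM endpoint is a stock Ollama server on its default address `http://localhost:11434` that answers `GET /api/tags`. Other Ollama-compatible servers may not expose that route.

```python
import importlib.util
import urllib.error
import urllib.request
from pathlib import Path


def has_parquet_files(directory):
    """True if the given directory contains at least one .parquet file."""
    return any(Path(directory).glob("*.parquet"))


def has_spacy_model(name="en_core_web_md"):
    """True if the model package is importable in this environment."""
    return importlib.util.find_spec(name) is not None


def ollama_reachable(base_url="http://localhost:11434", timeout=2.0):
    """True if the endpoint answers the model-listing request with HTTP 200.
    The default URL is Ollama's usual local address, an assumption here."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Each function returns a plain boolean, so a caller can report everything that is missing in one pass instead of failing partway through step 3 or 4.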

📚 License

MIT License


🧩 Credits

  • Subtitle corpus by Fascinat0r (HuggingFace)
  • Dictionary from Global WordNet
  • Language model orchestration inspired by modern NLP workflows
