📺 VocaFlick

Learn English with real TV dialogue. Powered by LLMs.

VocaFlick is an experimental pipeline that uses large language models (LLMs) to help English learners acquire vocabulary in context from TV series subtitles. For every episode, the system selects 50 relevant words that occur in the subtitle data and attaches a definition from English WordNet to each.

✨ Features

Extracts real vocabulary from TV show episodes
Uses LLMs and clustering to group words by semantic similarity
Automatically assigns definitions from English WordNet
Filters and selects relevant learning vocabulary per episode
Outputs per-episode JSON files with 50 learning words + definitions

🧠 Pipeline Overview

The project is structured as a modular pipeline of 5 main steps:

Step	Script	Purpose
1️⃣	`01_extract_dict.py`	Extracts word–definition pairs from English WordNet XML
2️⃣	`02_extract_phrases.py`	Extracts multiword expressions from the dictionary
3️⃣	`03_extract_tokens.py`	Processes subtitle data, filters tokens, clusters by meaning
4️⃣	`04_filter_tokens.py`	Uses LLM to assign semantic categories and filter vocabulary
5️⃣	`05_generate_output.py`	Assigns per-episode vocabulary and exports final JSON
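
To make steps 1 and 2 concrete, here is a minimal sketch of pulling word–definition pairs out of a WordNet XML document and separating multiword expressions. It is not the code in `01_extract_dict.py` or `02_extract_phrases.py`. It assumes the Global WordNet LMF layout (`LexicalEntry`/`Lemma`/`Sense` elements pointing at `Synset`/`Definition` elements), treats any entry containing a space as a multiword expression, and runs on a tiny inline sample whose second entry and definition are made up for illustration. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tiny inline sample in the assumed WN-LMF layout (illustrative data only).
SAMPLE = """<LexicalResource>
  <Lexicon id="sample" label="Sample" language="en">
    <LexicalEntry id="w1">
      <Lemma writtenForm="negotiate" partOfSpeech="v"/>
      <Sense id="s1" synset="syn-1"/>
    </LexicalEntry>
    <LexicalEntry id="w2">
      <Lemma writtenForm="give up" partOfSpeech="v"/>
      <Sense id="s2" synset="syn-2"/>
    </LexicalEntry>
    <Synset id="syn-1" partOfSpeech="v">
      <Definition>discuss the terms of an arrangement</Definition>
    </Synset>
    <Synset id="syn-2" partOfSpeech="v">
      <Definition>stop maintaining or insisting on</Definition>
    </Synset>
  </Lexicon>
</LexicalResource>"""


def extract_pairs(xml_text):
    """Return (word, definition) pairs, one per sense that has a definition."""
    root = ET.fromstring(xml_text)

    # Map synset id -> definition text.
    definitions = {}
    for synset in root.iter("Synset"):
        node = synset.find("Definition")
        if node is not None and node.text:
            definitions[synset.get("id")] = node.text.strip()

    # Resolve every sense of every lexical entry through that map.
    pairs = []
    for entry in root.iter("LexicalEntry"):
        lemma = entry.find("Lemma")
        if lemma is None:
            continue
        word = lemma.get("writtenForm")
        for sense in entry.findall("Sense"):
            definition = definitions.get(sense.get("synset"))
            if definition:
                pairs.append((word, definition))
    return pairs


def is_multiword(word):
    """Assumed criterion: a multiword expression contains a space."""
    return " " in word


pairs = extract_pairs(SAMPLE)
phrases = [(w, d) for w, d in pairs if is_multiword(w)]
print(pairs)
print(phrases)
```

For the real, gzip-compressed file, the XML text could be read with `gzip.open(path, "rt", encoding="utf-8")` before parsing; for a file of that size, a streaming parser such as `ET.iterparse` would likely be a better fit than loading everything at once.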

📦 Data Sources

Dictionary: English WordNet (Global WordNet)
Subtitles: Taiga Subtitles Corpus

📁 Output Format

Each episode is processed into a .json file with the following structure:

[
  {
    "token": "negotiate",
    "frequency": 14,
    "definition": "discuss the terms of an arrangement"
  },
  ...
]

Metadata linking episodes to filenames is saved in output/metadata.json.
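
As an illustration of the structure above, the following sketch builds entries of the same shape from a token list and a definition lookup, then serializes them. It is a simplified stand-in, not the logic of `05_generate_output.py`: ranking purely by frequency is an assumption (the actual selection also involves LLM-assigned categories), and `build_episode_vocab`, the sample tokens, and the one-entry dictionary are hypothetical.

```python
import json
from collections import Counter


def build_episode_vocab(tokens, definitions, limit=50):
    """Count tokens that have a definition and keep the `limit` most frequent."""
    counts = Counter(tok for tok in tokens if tok in definitions)
    return [
        {"token": tok, "frequency": freq, "definition": definitions[tok]}
        for tok, freq in counts.most_common(limit)
    ]


# Illustrative input: tokens from one episode and a one-entry dictionary.
tokens = ["negotiate", "the", "negotiate", "terms"]
definitions = {"negotiate": "discuss the terms of an arrangement"}

entries = build_episode_vocab(tokens, definitions)
print(json.dumps(entries, indent=2))
```

Reading a generated file back is the reverse: `json.load` on the episode file returns a list of dictionaries with the `token`, `frequency`, and `definition` keys.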

🧪 Status

This is an MVP-level, research-driven project built entirely through LLM-driven development. Expect rough edges — contributions, feedback and forks welcome!

🚀 Usage

The pipeline is designed to run sequentially:

python 01_extract_dict.py wordnet.xml.gz dict.csv
python 02_extract_phrases.py dict.csv phrases.csv
python 03_extract_tokens.py
python 04_filter_tokens.py
python 05_generate_output.py
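
If you prefer to run all five steps with one command, a small driver along these lines could work. This is a sketch that is not part of the repository: `build_commands` and `run_pipeline` are hypothetical helpers, and the script arguments are copied from the commands above. It stops at the first failing step because later steps depend on earlier outputs.

```python
import subprocess
import sys

# Script names and arguments as listed in the usage commands.
STEPS = [
    ["01_extract_dict.py", "wordnet.xml.gz", "dict.csv"],
    ["02_extract_phrases.py", "dict.csv", "phrases.csv"],
    ["03_extract_tokens.py"],
    ["04_filter_tokens.py"],
    ["05_generate_output.py"],
]


def build_commands(python=sys.executable):
    """Prefix each step with the interpreter to get full command lines."""
    return [[python, *step] for step in STEPS]


def run_pipeline(commands, runner=subprocess.run):
    """Run the commands in order; check=True raises on a non-zero exit code."""
    for cmd in commands:
        print("Running:", " ".join(cmd))
        runner(cmd, check=True)
```

Calling `run_pipeline(build_commands())` from the repository root would execute the steps in sequence. The `runner` parameter exists only so the order can be checked without launching the scripts.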

You will need:

Subtitle .parquet files from the Taiga corpus
spaCy model en_core_web_md
Access to an Ollama-compatible LLM endpoint (for semantic filtering)
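
Before starting a run, it can help to confirm that these prerequisites are in place. The sketch below is one possible check, not something shipped with the project. It assumes the subtitle files sit in a single directory, relies on spaCy models being installed as importable Python packages, and assumes the endpoint exposes Ollama's `/api/tags` route on the default local address `http://localhost:11434`; adjust all three to your setup.

```python
import importlib.util
import urllib.request
from pathlib import Path


def has_parquet_files(subtitle_dir):
    """True if the directory contains at least one .parquet file."""
    return any(Path(subtitle_dir).glob("*.parquet"))


def has_spacy_model(name="en_core_web_md"):
    """True if the model package is importable in this environment."""
    return importlib.util.find_spec(name) is not None


def ollama_reachable(base_url="http://localhost:11434", timeout=2.0):
    """True if the (assumed) model-listing route answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False
```

A missing model can be installed with `python -m spacy download en_core_web_md`.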

📚 License

MIT License

🧩 Credits

Subtitle corpus by Fascinat0r (HuggingFace)
Dictionary from Global WordNet
Language model orchestration inspired by modern NLP workflows
