
LexSrt

This tool helps language mentors get an overview of all lexeme forms found in sentences or SRT files.

It uses spaCy to tokenize each subtitle in the SRT file and looks up the corresponding lexeme in Wikidata based on the detected part of speech and the token representation.
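
A minimal, illustrative sketch of that tokenize-then-look-up idea (assuming the en_core_web_sm model is installed; this is not LexSrt's actual code):

# spaCy supplies the token text and part-of-speech tag that the
# Wikidata lexeme lookup keys on.
import spacy

nlp = spacy.load("en_core_web_sm")
for token in nlp("This is a test sentence."):
    print(token.text, token.pos_)  # e.g. "This PRON", "is AUX", ...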

The CLI script encourages the user to contribute to Wikidata if senses are completely missing on the matched lexemes.

The API does not currently check whether senses exist in Wikidata; it simply cleans and matches the tokens and outputs the result.

Features

  • API to match a sentence to lexeme forms
  • CLI to analyze an SRT file, get an overview of all lexemes and their senses, and produce two CSV output files with details

Limitations

  • Only languages with a spaCy model are currently supported

Installation

  1. Clone the repository.
  2. Install dependencies using Poetry:
poetry install
  3. Install a language model for spaCy, for example:
poetry run python3 -m spacy download en_core_web_sm

Recommended models

These models are recommended over the standard spaCy models listed at https://spacy.io/models.

Use

CLI

python cli.py -i path-to-srt.srt --lang en --spacy_model en_core_web_sm

You can fiddle with the configuration options in config.py.

API

An API using FastAPI has been implemented.

It supports two fields sent via an HTTP POST request:

  • spacy_model
  • sentence
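
For illustration, the request body could be modeled with a Pydantic class like the hypothetical sketch below; the field names and the /process_sentence path come from this README, but the actual api.py may be structured differently:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SentenceRequest(BaseModel):
    spacy_model: str  # name of an installed spaCy model, e.g. en_core_web_sm
    sentence: str     # the sentence to tokenize and match

@app.post("/process_sentence")
def process_sentence(request: SentenceRequest) -> dict:
    # The real endpoint tokenizes the sentence with the given spaCy model
    # and matches each token to Wikidata lexeme forms.
    return {"data": []}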

After installing Uvicorn, you can start the API with auto-reload enabled (handy during development):

uvicorn api:app --reload

Test it with:

curl -X POST -H "Content-Type: application/json" http://localhost:8000/process_sentence \
     -d '{"spacy_model": "en_core_web_sm", "sentence": "This is a test sentence."}' 

It should output something like:

{
  "data": [
    {
      "token": "This",
      "spacy_pos": "PRON",
      "matched_forms": [
        "L643260-F1"
      ]
    },
    {
      "token": "is",
      "spacy_pos": "AUX",
      "matched_forms": [
        "L1883-F4"
      ]
    },
    {
      "token": "a",
      "spacy_pos": "DET",
      "matched_forms": [
        "L2767-F1"
      ]
    },
    {
      "token": "test",
      "spacy_pos": "NOUN",
      "matched_forms": [
        "L220909-F1"
      ]
    },
    {
      "token": "sentence",
      "spacy_pos": "NOUN",
      "matched_forms": [
        "L6117-F1"
      ]
    },
    {
      "token": ".",
      "spacy_pos": "ADJ",
      "matched_forms": []
    }
  ]
}
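
Each entry in matched_forms is a Wikidata lexeme form identifier (a lexeme ID plus a form number, e.g. L1883-F4). The same request can be made from Python with the requests library:

import requests

response = requests.post(
    "http://localhost:8000/process_sentence",
    json={"spacy_model": "en_core_web_sm", "sentence": "This is a test sentence."},
)
print(response.json())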

Examples

Ice Age with English, limit 8

[screenshot]

Life of Brian, English, limit 11

matches: [screenshot]; match errors: [screenshot]

Shrek, Norwegian, limit 9

matches: [screenshot]; match errors: [screenshot]

License

GPLv3+, with the exception of the code borrowed from Ordia; see the license in that file.

Thanks

Big thanks to Finn Nielsen for writing the spaCy->lexeme function. I improved it a bit to fit my purposes.

What I learned

  • Having your own classes is nice.
  • Reusing easy-to-read, quality code from others is nice.
  • Reusing classes from other projects is nice.
  • WBI is still lacking convenience methods I want, so I'll propose including them upstream to make life easier for others.
  • Pydantic and black go a long way; I don't really need all the other linting and mypy stuff for whipping up working code like this in a few hours. Those two plus the inspections in PyCharm give me enough stability.
  • FastAPI is super duper nice, and ChatGPT supports it well, so using it is a breeze.
