Data extractor for 手話・日本語大辞典 (JSL Dictionary)

This repo contains scripts for extracting the treasure trove of data from this Japanese Sign Language dictionary. The data I would like to extract are: Japanese glosses; hand shapes, locations, and movements; a prose description of the sign; and the illustration.

What does the data look like?

Obviously I can't upload this information or the PDFs I'm working with, but you'll find sample pages under sample_data. The dictionary has three types of entries:

one-handed signs
two-handed signs with the same shape for both hands
two-handed signs with different shapes for each hand.
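
As a rough illustration of how the fields and entry types above could fit together, here is a minimal sketch of a record type. Everything in it is an assumption made for illustration: the names `SignEntry`, `EntryType`, and all field names are hypothetical and are not taken from the scripts in this repo, and the real dictionary layout may need a different structure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class EntryType(Enum):
    """The three kinds of entries described above (names are hypothetical)."""
    ONE_HANDED = "one-handed"
    TWO_HANDED_SAME_SHAPE = "two-handed, same shape"
    TWO_HANDED_DIFFERENT_SHAPES = "two-handed, different shapes"


@dataclass
class SignEntry:
    """One dictionary entry; every field name here is an assumption."""
    glosses: List[str]                        # Japanese glosses
    hand_shape: str                           # shape of the first (or only) hand
    second_hand_shape: Optional[str] = None   # None for one-handed signs
    location: str = ""                        # where the sign is made
    movement: str = ""                        # how the hands move
    description: str = ""                     # prose description of the sign
    illustration_path: Optional[str] = None   # path to the extracted image

    @property
    def entry_type(self) -> EntryType:
        # Derive the entry type from the hand shapes instead of storing it.
        if self.second_hand_shape is None:
            return EntryType.ONE_HANDED
        if self.second_hand_shape == self.hand_shape:
            return EntryType.TWO_HANDED_SAME_SHAPE
        return EntryType.TWO_HANDED_DIFFERENT_SHAPES
```

The entry type is computed from the hand shapes so that the two cannot disagree. The values used to build a `SignEntry` would be whatever the extraction scripts produce; none are shown here because the dictionary data itself can't be published.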

What do you plan to do with the data?

I want to play around with newer NLP technologies and multi-modal learning. Some random ideas I have:

(simplest) generate illustrations from movement descriptions or vice versa
collect hand/face/body position data and train a model to convert the dictionary data into the control info for an avatar
figure out a writing system and automatically generate spellings

garfieldnate/syuwa-nihongo-daijiten-extractor