ewdowiak/Sicilian_Translator

Sicilian Translator

This repository documents Tradutturi Sicilianu, the first neural machine translator for the Sicilian language, along with the steps to reproduce our work. We hope it opens the door to natural language processing for the Sicilian language.

What is the Sicilian language?

It's the language spoken by the people of Sicily, Calabria and Puglia. It's the language that they speak at home, with family and friends. It's the language that the Sicilian School of Poets recited at the imperial court of Frederick II in the 13th century. And it's a language spoken here in Brooklyn, NY.

How is Sicilian different from Italian?

Comparing Sicilian to Italian is like comparing American football to Australian rules football. Both football codes trace their origins to 19th-century England, but they evolved separately and have different sets of rules. Analogously, both Sicilian and Italian are Romance languages. Most of their vocabularies and grammars come from Latin, but they evolved separately and have different sets of rules.

But they have (of course) influenced each other. Sicilian poetry inspired Dante, the "father of the Italian language," to write poetry in his native Florentine. And this influence on Dante reveals Sicilian's cultural importance: Sicilian had emerged as a literary language long before Italian.

Can you help me learn Sicilian?

Yes. That's our goal. We hope that the Dieli Dictionary will help you learn vocabulary, that Chiù dâ Palora will help you learn grammar and that Tradutturi Sicilianu will help you write in Sicilian.

One of the best sources of information and learning materials is Arba Sicula. For over 40 years, they have been publishing books and journals about Sicilian history, language, literature, art, folklore and cuisine. And the Mparamu lu sicilianu textbook by its editor, Gaetano Cipolla, is more than just a grammar book. It's a complete introduction to Sicily, its language, culture and people.

What's in this repository?

This repository documents the individual steps that we took to create a neural machine translator and provides the code necessary to reproduce them. Separately, the "With Patience and Dedication" introduction provides a broader overview.

In this repository, the extract-text directory contains the scripts that we used to collect parallel text from issues of Arba Sicula (which are in PDF format). The dataset directory contains the scripts that we used to prepare the data for training, while its subdirectory sockeye_n30_sw3000 contains the scripts that we used to train the models.

The perl-module/Napizia directory provides a Perl module with tokenization and detokenization subroutines. The cgi-bin directory contains scripts to put the translator on a website.
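
To illustrate what tokenization and detokenization mean here, the following is a minimal Python sketch. It is not the Napizia Perl module and does not reproduce its rules. The function names `tokenize` and `detokenize`, and the choice to keep apostrophes inside a word, are assumptions made only for illustration.

```python
import re

# Assumption: a token is either a run of word characters (optionally joined
# by internal apostrophes, as in "l'omu") or a single punctuation mark.
_TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

# Assumption: these marks attach to the preceding word when detokenizing.
_CLOSING_PUNCT_RE = re.compile(r"\s+([,.;:!?])")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return _TOKEN_RE.findall(text)


def detokenize(tokens):
    """Join tokens with spaces, then reattach closing punctuation."""
    return _CLOSING_PUNCT_RE.sub(r"\1", " ".join(tokens))


if __name__ == "__main__":
    tokens = tokenize("Bon jornu, comu stai?")
    print(tokens)
    print(detokenize(tokens))
```

A real tokenizer for Sicilian would need more careful handling of apostrophes, quotation marks and contractions than this sketch provides.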

The embeddings directory contains some experimental work, where we lemmatize the text of both languages and train word embedding models. By computing the matrix of cosine similarities from the embeddings, we can create lists of contextually similar words and, one day, include them in our dictionary.
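
The cosine similarity idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in the embeddings directory: the function names are hypothetical and the three-dimensional vectors are made up for the example rather than taken from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings):
    """Return the sorted word list and the matrix of pairwise similarities."""
    words = sorted(embeddings)
    matrix = [
        [cosine_similarity(embeddings[w1], embeddings[w2]) for w2 in words]
        for w1 in words
    ]
    return words, matrix


def most_similar(word, embeddings, top_n=3):
    """List the top_n other words ranked by similarity to `word`."""
    scores = [
        (other, cosine_similarity(embeddings[word], vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Made-up toy vectors, for illustration only.
toy_embeddings = {
    "casa": [0.9, 0.1, 0.0],
    "palazzu": [0.8, 0.2, 0.1],
    "mari": [0.0, 0.9, 0.4],
}

if __name__ == "__main__":
    print(most_similar("casa", toy_embeddings, top_n=2))
```

With real embeddings, each row of the matrix would be sorted to produce the list of similar words for one dictionary entry.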

And the presentation directory contains our presentation of this project along with links to the resources that made this project possible.

Data Sources

Our largest source of parallel text is the literary journal Arba Sicula. We mixed that data with Arthur Dieli's translations of poetry, proverbs and Giuseppe Pitrè's Folk Tales. And to "learn" Sicilian, we also collected parallel text from the Mparamu lu sicilianu textbook by Gaetano Cipolla (2013) and from Kirk Bonner's Introduction to Sicilian Grammar (2001).

The "Developing a Parallel Corpus" article provides a longer discussion of our data sources and introduces the question of how much parallel text is needed to create a good translator.

Translation Models and Practices

To translate, we use Sockeye's implementation of Vaswani et al.'s (2017) Transformer model along with Sennrich et al.'s subword-nmt. Following the best practices of Sennrich and Zhang (2019), we keep the networks small, with fewer layers, and train the models with smaller batch sizes and larger dropout parameters.
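
To give a feel for subword splitting, here is a toy Python sketch of byte-pair encoding, the technique behind subword-nmt. It is a simplified illustration, not the subword-nmt code or its API: the function names, the tie-breaking rule and the word frequencies are assumptions made for the example.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge operations from word frequencies."""
    # Each word starts as characters plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Most frequent pair; ties broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = next(iter(merge_pair(pair, {symbols: 1})))
    return list(symbols)


# Made-up frequencies, for illustration only.
toy_freqs = {"sicilianu": 5, "sicilia": 3, "talianu": 4}

if __name__ == "__main__":
    merges = learn_bpe(toy_freqs, num_merges=6)
    print(apply_bpe("sicilianu", merges))
```

Fewer merge operations give a smaller subword vocabulary, which is one way to keep a model small when the parallel corpus is small.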

The "Just Split, Dropout and Pay Attention" article explains why the method works. In short: we need a smaller model for our smaller dataset.

Unni si trova stu Tradutturi Sicilianu? (Where is this Sicilian Translator?)

A Napizia! (At Napizia!) Come visit us there. Come Behind the Curtain. And come join us in our study of the Sicilian language!
