ewdowiak/Sicilian_Translator

Sicilian Translator

This repository documents our Tradutturi Sicilianu, the first neural machine translator for the Sicilian language, and the steps to reproduce our work. We hope it opens the door to natural language processing for the Sicilian language.

What is the Sicilian language?

It's the language spoken by the people of Sicily, Calabria and Puglia. It's the language that they speak at home, with family and friends. It's the language that the Sicilian School of Poets recited at the imperial court of Frederick II in the 13th century. And it's a language spoken here in Brooklyn, NY.

How is Sicilian different from Italian?

Comparing Sicilian to Italian is like comparing American football to Australian rules football. Both football codes trace their origins to 19th-century England, but they evolved separately and have different sets of rules. Analogously, both Sicilian and Italian are Romance languages. Most of their vocabularies and grammars come from Latin, but they evolved separately and have different sets of rules.

But they have (of course) influenced each other. Sicilian poetry inspired Dante, the "father of the Italian language," to write poetry in his native Florentine. And this influence on Dante reveals Sicilian's cultural importance: Sicilian had emerged as a literary language long before Italian.

Can you help me learn Sicilian?

Yes. That's our goal. We hope that the Dieli Dictionary will help you learn vocabulary, that Chiù dâ Palora will help you learn grammar and that Tradutturi Sicilianu will help you write in Sicilian.

One of the best sources of information and learning materials is Arba Sicula. For over 40 years, they have been publishing books and journals about Sicilian history, language, literature, art, folklore and cuisine. And the Learn Sicilian and Learn Sicilian Two textbooks by its editor, Gaetano Cipolla, provide more than a grammar book. They're a complete introduction to Sicily, its language, culture and people.

What's in this repository?

This repository documents the individual steps that we took to create a neural machine translator and provides the code necessary to reproduce them. Separately, the "With Patience and Dedication" introduction provides a broader overview.

Here in this repository, the extract-text directory contains the scripts that we used to collect parallel text from issues of Arba Sicula (which are in PDF format). The dataset directory contains the scripts that we used to prepare the data for training. The training directory contains the scripts that we use to train the models. And the translations directory contains scripts to score our models.

The perl-module directory provides a Perl module with tokenization and detokenization subroutines. The web-app directory contains a Mojolicious application to put the translator on a website. And the fastapi directory contains a rewritten version of Sockeye's translate.py, which we use with FastAPI to load the translation model's parameters and keep them ready for translation.
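
As a rough illustration of what a tokenization and detokenization pair does, here is a minimal Python sketch. It is not the Perl module from this repository: the punctuation set and the function names `tokenize` and `detokenize` are assumptions chosen for the example, and the real subroutines may follow different rules.

```python
import re

# Hypothetical punctuation set for this sketch; apostrophes are left
# attached to words so that forms like "l'amuri" stay in one token.
_PUNCT = re.compile(r"([.,;:!?()«»])")


def tokenize(text: str) -> str:
    """Put spaces around punctuation and normalize whitespace."""
    spaced = _PUNCT.sub(r" \1 ", text)
    return " ".join(spaced.split())


def detokenize(text: str) -> str:
    """Undo tokenize(): re-attach punctuation to neighbouring words."""
    # Closing punctuation joins the preceding token.
    text = re.sub(r"\s+([.,;:!?)»])", r"\1", text)
    # Opening punctuation joins the following token.
    text = re.sub(r"([(«])\s+", r"\1", text)
    return text


if __name__ == "__main__":
    sentence = "Bon jornu, comu si senti?"
    tokens = tokenize(sentence)
    print(tokens)
    print(detokenize(tokens))
```

The point of the pair is that the translation model sees the tokenized form, while the reader sees the detokenized form.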

Data Sources

Our largest source of parallel text is the literary journal Arba Sicula. We mixed that data with Arthur Dieli's translations of poetry, proverbs and Giuseppe Pitrè's Folk Tales. And to "learn" Sicilian, we also collected text from Gaetano Cipolla's Learn Sicilian and Learn Sicilian Two textbooks and from Kirk Bonner's Introduction to Sicilian Grammar.

The "Developing a Parallel Corpus" article provides a longer discussion of our data sources and introduces the question of how much parallel text is needed to create a good translator.

Translation Models and Practices

To translate, we use Sockeye's implementation of Vaswani et al.'s (2017) Transformer model along with Sennrich et al.'s subword-nmt. Following the best practices of Sennrich and Zhang (2019), the networks are small with fewer layers, and we train them with small batch sizes and larger dropout parameters.

The "Just Split, Dropout and Pay Attention" article explains why the method works. In short: we need a smaller model for our smaller dataset.
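
To give a feel for how much "smaller" can mean, the sketch below estimates the parameter count of a Transformer from its dimensions. The `base` configuration uses the layer count and widths of the base model in Vaswani et al. (2017); its vocabulary size and every value in `small` are hypothetical numbers picked for illustration, not the settings used in this repository. The estimate ignores biases and layer normalization and assumes a single shared embedding matrix.

```python
from dataclasses import dataclass


@dataclass
class TransformerSize:
    layers: int   # number of encoder layers (and of decoder layers)
    d_model: int  # model (embedding) width
    d_ff: int     # feed-forward inner width
    vocab: int    # subword vocabulary size


def approx_params(cfg: TransformerSize) -> int:
    """Rough parameter count: attention, feed-forward and embeddings only."""
    attn = 4 * cfg.d_model ** 2          # Q, K, V and output projections
    ffn = 2 * cfg.d_model * cfg.d_ff     # two feed-forward matrices
    encoder = cfg.layers * (attn + ffn)
    decoder = cfg.layers * (2 * attn + ffn)  # self- plus cross-attention
    embeddings = cfg.vocab * cfg.d_model     # assumed shared embeddings
    return encoder + decoder + embeddings


# Layer count and widths follow the Transformer "base" model; vocab is illustrative.
base = TransformerSize(layers=6, d_model=512, d_ff=2048, vocab=32000)
# Hypothetical small configuration, not this project's actual settings.
small = TransformerSize(layers=3, d_model=256, d_ff=1024, vocab=5000)

if __name__ == "__main__":
    for name, cfg in (("base", base), ("small", small)):
        print(f"{name}: about {approx_params(cfg):,} parameters")
```

Because the embedding term scales with the vocabulary size, a smaller subword vocabulary also reduces the parameter count.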

To shrink our model, we also use small subword vocabularies. And, as explained in the "Subword Splitting" article, we bias the learned subword vocabulary towards the desinences one finds in a textbook.
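
The "Subword Splitting" article describes the actual procedure. As a simplified illustration of the idea, the sketch below splits a word at a textbook ending and marks the split with the `@@` continuation marker that subword-nmt uses. The `ENDINGS` tuple (present-tense endings of a verb like "parrari"), the `split_desinence` function and the `min_stem` parameter are all hypothetical and are not taken from this repository.

```python
# Illustrative endings only; not the list used to train the translator.
ENDINGS = ("amu", "ati", "anu", "u", "i", "a")


def split_desinence(word: str, endings=ENDINGS, min_stem: int = 2) -> str:
    """Split a word into stem and ending, preferring the longest ending."""
    for ending in sorted(endings, key=len, reverse=True):
        stem = word[: len(word) - len(ending)]
        # Only split when a reasonably long stem remains.
        if word.endswith(ending) and len(stem) >= min_stem:
            return f"{stem}@@ {ending}"
    return word


if __name__ == "__main__":
    for form in ("parru", "parramu", "parrati", "di"):
        print(form, "->", split_desinence(form))
```

Splitting at the ending lets different forms of the same verb share a stem unit, so the model needs fewer vocabulary entries.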

The "Multilingual Translation" article explains how we can train a single model to translate between multiple languages, including some for which there is little or no parallel text.
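
A common way to train one model on several language pairs is to prepend a token to each source sentence that names the target language. The sketch below assumes that approach; the `<2xx>` tag format, the function names and the sample sentences are illustrative assumptions, and the article describes what this project actually does.

```python
def tag_source(sentence: str, target_lang: str) -> str:
    """Prepend a target-language token such as <2en> to a source sentence."""
    return f"<2{target_lang}> {sentence}"


def build_training_pairs(corpus):
    """Turn (target_lang, source, target) triples into tagged training pairs.

    Mixing the tagged pairs from all directions into one training set lets
    a single model learn every direction.
    """
    return [(tag_source(src, tgt_lang), tgt) for tgt_lang, src, tgt in corpus]


if __name__ == "__main__":
    # Tiny illustrative corpus; "scn" is the ISO 639 code for Sicilian.
    corpus = [
        ("en", "Bon jornu", "Good morning"),
        ("scn", "Good morning", "Bon jornu"),
        ("scn", "Buongiorno", "Bon jornu"),
    ]
    for source, target in build_training_pairs(corpus):
        print(source, "=>", target)
```

At translation time the same tag selects the output language, which is how such a model can be asked for a direction that had little or no parallel text.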

Finally, the "Reverse Training Strategy" article reverses the order in which we think about the training stages. First, we think about the fine-tuning stage (the last stage). Then, we think backwards through the stages, so that we pre-train a model which will provide a good starting point for the subsequent fine-tuning.

Unni si trova stu Tradutturi Sicilianu? (Where can you find this Sicilian Translator?)

A Napizia! (At Napizia!) Come visit us there. Come Behind the Curtain. And come join us in our study of the Sicilian language!
