
##### LET-REMA-LCEX19 Introduction to Language and Speech Technology Take-home exam


---


# **Theme: Een Taal is geen Taal**


---


Welcome to the take-home exam of the first part of the course. The theme of this exam is **Een Taal is geen Taal**: most people speak more than one language, yet our language technologies tend to struggle with multilingual input, especially if the language is not one of the few languages that most work in NLP is done on. The challenges and limitations of enabling human language technologies to support all languages in the world equally are plenty. How are words (and other elements) in different languages even represented in computers? What about other levels of representation, large and small? From the very basics to complex ML applications, how do Natural Language Processing (NLP) pipelines typically deal with different languages, and why is it so hard to build systems that work well across multiple languages? What practices and solutions has the field come up with? What challenges can multilingual data cause in specific NLP applications, e.g. text classification or language modeling? What inequalities are built into such technologies? Feel free to pursue your personal interests within this theme. Possible directions include:

- building a multilingual NLP pipeline that works equally well across diverse languages;
- comparing two languages and contemplating the differences, challenges, and linguistic issues that arise;
- comparing different varieties, dialects or sociolects with NLP pipelines/applications in mind.

If you have a significantly different type of essay in mind, please check with us via email first.

###  Instructions

This is an **individual** take-home exam. Please complete the exam on your own and write entirely in your own words.

Think of this exam as an essay discussing the theme “Een Taal is geen Taal”. Your text should comparatively discuss various issues that relate to how (elements of) different human languages are represented in NLP pipelines. Your discussion should include (at least) the following types of issues:

- How does character encoding work in different languages?
- How does tokenizing work in different languages?
- How does stemming/lemmatization work in different languages?
- How does parsing work in different languages?
- How does text classification work (e.g. bag of words, Naive Bayes vs LLMs)?
- How does language modelling work (e.g. n-grams, Markov assumption, (instruction-tuned) large language models)?


Write the exam in the form of a Python notebook (.ipynb). The easiest way to get started is to make a copy of this document and start working below. The exam should contain both text cells (written in full sentences) and code cells (with pseudocode) to illustrate relevant concepts.

Use pseudocode only for illustrative purposes. No data is needed: you do not need to actually build a data processing pipeline! You could, for instance, use pseudocode to illustrate how a word can be tokenised in different ways or how words of different languages are encoded in Unicode.

No minimum number of code cells is specified - include as many code cells in your document as you see fit.

### What is pseudocode?

If you are not familiar with pseudocode, start with this [short video tutorial](https://www.youtube.com/watch?v=qfckDdsEIq8).
Your pseudocode does not have to be in any particular format. Just writing out the steps of your pipeline in plain English (as shown in the video tutorial) will suffice. You can also write actual code, e.g. Python, but you will not get extra credit for doing so.

### Length

Aim for a submission of around 2500 words (excluding this existing instruction text, the bibliography, and your pseudocode cells). Diverging from this aim by more than 20% will impact the grade.   

The easiest way to check the length of your document is to copy the text into a word processor or to download the file as Python code (.py) and open that file in Microsoft Word to count the words.


### Exam weight
This exam is 50% of the overall course grade.

### Grading

The exam will be graded for:

- **Completeness** [primary]: Does the submission cover all aspects of preprocessing and NLP applications that were covered in the course? Please stick to the theme "From words to numbers" but cover all relevant linguistic and technical problems, challenges, and solutions of how various elements of written natural human language are represented and used in language technology.

- **Depth of description** [secondary]: Are all concepts and steps explained clearly, thoroughly and concisely? Are all steps of a typical language processing pipeline described in sufficient detail? Is (pseudo)code used well for illustration purposes?

- **Presentation** [tertiary]: Is the submission well organized, formatted, and professionally presented using both text and code cells? Your submission should read like an essay: it should be written entirely in full sentences (except the pseudocode cells), and the quality of writing matters. Write in a clear, academic, and succinct style. Include references to the course textbook ([Jurafsky and Martin 2023](https://web.stanford.edu/~jurafsky/slp3/ed3book_jan72023.pdf)) where relevant and include a bibliography. References to literature other than (Jurafsky and Martin 2023, p.xx) are encouraged but not required.

The minimum grade to pass the exam is 5.5.

The use of text generators (e.g. ChatGPT) is **not allowed** and is treated as a form of plagiarism. We will test each submission for that.

### Submission

Submit the exam via Brightspace as an iPython notebook (.ipynb) file **by the deadline**.


Good luck!

Andreas Liesenfeld and Cristian Tejedor-García



---

Your name: Daan Brugmans

Student number: S1080742


---



## Ideas
Character Encoding
- Character encoding is relatively straightforward for alphabets, abjads, syllabaries, and logographies.
- Abugidas are not as easy; how should we represent abugidas?
  - Take example language(s)
  - Encode all combinations? Straightforward, but inefficient
  - Alternative: encode consonants and vowels in an abugida separately, but render them as one character.
  - How does the Unicode Consortium do this?
- Example: typing kanji on Japanese phones (users write katakana/hiragana and the phone infers kanji)
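For the abugida question above, Devanagari is one script where Unicode takes the second route: consonant letters and dependent vowel signs are separate code points, and the renderer combines them into one visible unit. (Not every abugida is handled this way; Ethiopic, for instance, is encoded syllable by syllable.) The snippet below is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is made up for illustration.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Devanagari syllable "ki": rendered as one unit, stored as two code points
# (a consonant letter followed by a dependent vowel sign).
syllable = "\u0915\u093F"

for code_point, name in describe(syllable):
    print(code_point, name)

# len() counts code points, not rendered characters.
print(len(syllable))
```

The consequence for a pipeline is that "one character" on screen is not necessarily one element in the string, so character counts and character-level tokenization behave differently across scripts.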

Tokenization
- Word tokenization
  - Multi-word units in alphabets
  - Characters in logographies (Chinese languages)
- Sentence tokenization
  - Languages without end-of-sentence markers (Thai)
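The word tokenization problem for scripts without spaces can be sketched as follows. Whitespace splitting, which is a reasonable first step for English, returns a Chinese sentence as a single token. A greedy longest-match segmenter (often called MaxMatch) is one simple dictionary-based alternative. The three-entry lexicon here is a toy assumption; a real system would need a large dictionary or a learned segmenter.

```python
def whitespace_tokenize(text):
    """Split on whitespace; adequate as a first pass for English."""
    return text.split()


def max_match(text, lexicon):
    """Greedy longest-match segmentation against a lexicon.

    Falls back to a single character when no lexicon entry matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


toy_lexicon = {"我", "喜欢", "猫"}  # hypothetical mini-dictionary
sentence = "我喜欢猫"  # "I like cats"

print(whitespace_tokenize("I like cats"))
print(whitespace_tokenize(sentence))
print(max_match(sentence, toy_lexicon))
```

The same kind of mismatch applies to sentence tokenization: a splitter that looks for full stops has nothing to look for in Thai text.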

Stemming/Lemmatization
- Defining the word and the stem
- Example: roots and stems in Arabic
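To illustrate why "stem" means something different in Arabic, the sketch below compares naive suffix stripping, which works for regular English inflection, with extracting a consonant skeleton from transliterated Arabic words. Both functions are deliberately crude, and the Latin transliterations (without long vowels) are a simplification; this is not how a real Arabic morphological analyser works.

```python
def strip_suffix(word, suffixes=("ing", "ed", "s")):
    """Naive English-style stemmer: remove the first matching suffix."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def consonant_skeleton(word, vowels="aeiou"):
    """Keep only the consonants of a (transliterated) word."""
    return "".join(ch for ch in word if ch not in vowels)


english = ["walks", "walked", "walking"]
# kitab "book", kutub "books", kataba "he wrote": all from the root k-t-b
arabic = ["kitab", "kutub", "kataba"]

print([strip_suffix(w) for w in english])
print([strip_suffix(w) for w in arabic])
print([consonant_skeleton(w) for w in arabic])
```

Suffix stripping leaves the three Arabic forms unrelated because the variation sits in the vowels between the root consonants, not at the end of the word. The consonant skeleton recovers `ktb` here, but it would also keep affix consonants (such as the *m* in *maktab*, "office"), which is why root extraction needs knowledge of the patterns as well.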

Parsing
- Syntactic parsing
  - Word order
  - Case markings for indicating subject, objects, and the like
  - In general: the linguistic typology
- Semantic parsing
  - One meaning can be represented with very different phones in different languages (girl, meisje, famke, dèrnje for closely related languages), and vice versa
  - Learning a semantic space is a solution for representing semantics across languages
  - What about sentences with the same meaning, but with very different syntax ("Hoe heet u" vs. "Van wie bu' gèj d'r een?")?
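The word order and case marking points above can be made concrete with a toy sketch. In English, grammatical roles are largely read off position; in Japanese, particles such as *ga* (subject) and *o* (object) mark the role, so the constituents can be reordered without changing who did what. The two functions below are illustrative assumptions, not real parsers: they only handle three-word SVO input and romanized, pre-tokenized Japanese with a final verb.

```python
CASE_PARTICLES = {"ga": "subject", "o": "object"}


def roles_by_position(tokens):
    """Assume strict subject-verb-object order with one-word constituents."""
    subject, verb, obj = tokens
    return {"subject": subject, "verb": verb, "object": obj}


def roles_by_particle(tokens):
    """Assign roles from the particle that follows each noun; verb is last."""
    roles = {}
    for previous, token in zip(tokens, tokens[1:]):
        if token in CASE_PARTICLES:
            roles[CASE_PARTICLES[token]] = previous
    roles["verb"] = tokens[-1]
    return roles


# Swapping the English nouns changes the meaning ...
print(roles_by_position("dog saw cat".split()))
print(roles_by_position("cat saw dog".split()))

# ... while both Japanese orders mean "the dog saw the cat".
print(roles_by_particle("inu ga neko o mita".split()))
print(roles_by_particle("neko o inu ga mita".split()))
```

A parser built around positional cues therefore does not transfer to a case-marking language, and vice versa, which is the typological point made in the list above.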

Text Classification
- Since text classification is directly influenced by the contents of a text, it is inherently subject to the language(s) present in the unclassified texts
- A multilingual text classification system must either be capable of handling any of the potential languages present in a text indiscriminately, or have separate models or systems for every potential language and apply the right one, the latter requiring language identification techniques.
- Preferably, we would choose the former, but that requires a single system that can handle any number of languages, which is hard in theory and even harder in practice
- Would machine translation offer a solution? Likely not, as syntactic and semantic intricacies of distant languages are often lost when translated.
- Then how do we build a system that can accept any language as input, yet give the same result for the same task?
  - Semantic spaces?
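The semantic space idea can be sketched with cosine similarity. If words from different languages are mapped into one shared vector space, a downstream classifier can operate on the vectors and never has to know which language the input was in. The three-dimensional vectors below are invented by hand purely for illustration; real multilingual embeddings are learned from data and have hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical, hand-made vectors: NOT taken from any trained model.
toy_space = {
    "girl": [0.90, 0.10, 0.00],    # English
    "meisje": [0.85, 0.15, 0.05],  # Dutch
    "famke": [0.88, 0.12, 0.02],   # Frisian
    "table": [0.00, 0.20, 0.90],   # English, unrelated meaning
}

for word in ("meisje", "famke", "table"):
    print(word, round(cosine(toy_space["girl"], toy_space[word]), 3))
```

In such a space the translation equivalents end up close together regardless of their surface form, while the unrelated word is far away even though it is in the same language as "girl".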

Language Modelling
- In LLMs, the tokens a model must process depend on the language it tries to model, and what these tokens represent also changes with the type of script
- Whereas a text classification system only has to accept any potential language as input, a language model must also produce output in any potential language
  - This could be, and is currently theorized to be, achieved by learning a semantic space.
- Can a language model be capable of code switching? If so, does it know that it is using multiple languages in the same produced text, or does it think it is only using one language?
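One concrete way in which tokens depend on the script: if text is broken down to UTF-8 bytes (an assumption made here for illustration, not a claim about how any particular model tokenizes), the same number of characters costs a different number of units depending on the script. Latin letters take one byte each, Cyrillic letters two, and Japanese kana three.

```python
def utf8_length(text):
    """Number of bytes needed to store text in UTF-8."""
    return len(text.encode("utf-8"))


greetings = {
    "English": "hello",
    "Russian": "привет",
    "Japanese": "こんにちは",
}

for language, word in greetings.items():
    # characters vs. bytes for the same greeting
    print(language, len(word), utf8_length(word))
```

For a model with a fixed context length, this means that the same context window holds less text in some scripts than in others, which is one of the inequalities built into the representation itself.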

# Introduction: state of the art, challenges and opportunities

The concept of "Eén Taal is Géén Taal" ("One Language is No Language") is currently a hot topic within the field of NLP. As professional international communication is typically performed in English, which can be considered the modern *lingua franca*, it may come as no surprise that state-of-the-art language technology research is typically performed in, and in service of, English. The vast wealth of English language resources offers researchers plenty of data to train their models. English is (generally) considered a ***high-resource language***: a language with many resources from which plenty of data can be extracted for language technology purposes. English may be considered one of the "lucky few" in the high-resource language group, as the ***low-resource language*** group, for which only a relatively small number of resources exist, encompasses most of the languages spoken and written. The low-resource language group encompasses not only many minority languages that are recognized by governments, such as Papiamento and Frisian ([*Erkende talen in Nederland*, 2024](https://www.rijksoverheid.nl/onderwerpen/erkende-talen/erkende-talen-in-nl)), but also languages that are not officially recognized, such as Brabantian and Zeelandic, local dialects, such as Huissens ([*Historische kring Huessen - Huussese Taol*, 2024](https://www.huessen.nl/over-huissen/dialect)), and even constructed languages, such as Esperanto.

The current state of the art in NLP is mostly based on high-resource languages, most likely because the deep learning models that underpin it require vast amounts of training data. Such models cannot be trained from scratch on low-resource languages, as the necessary data simply does not exist for those languages. As a consequence, most of the world's languages cannot be processed by state-of-the-art models, because the pool of high-resource languages is very small. This is one major reason why multilingual NLP is such a hot topic: most languages are not yet supported by modern tools, and the lack of data for those languages is a major hurdle in realizing multilingual NLP systems.

Within research on multilingual language models, one potential, if partial, solution to this problem is to take existing language models trained on high-resource languages and apply their knowledge to low-resource languages ([*Joshi, S. et al., 2024, Fine Tuning LLMs for Low Resource Languages*](https://doi.org/10.1109/icipcn63822.2024.00090)). Large Language Models (LLMs) pretrained on vast quantities of high-resource language data show a remarkable capability for modelling and producing language. These pretrained LLMs can then be finetuned on low-resource languages. The idea is that the LLM has already acquired much knowledge about language in general during pretraining: it has developed a semantic space in which meanings are represented as vectors, it has learned to produce coherent and syntactically correct sentences, and, when Reinforcement Learning from Human Feedback is used, it has gained a sense of how to produce texts that cater to the needs of human users. This knowledge comes on top of the language-specific knowledge it has learned, such as the vocabulary of English. When the LLM is finetuned on a low-resource language, the expectation is that these "general language skills" carry over, so that the model can focus on what is specific to the low-resource language, such as its vocabulary and syntax. If that holds, the model may need (much) less data to achieve state-of-the-art language modelling and generation for the low-resource language. In practice, that is easier said than done.

Of course, this only describes one potential solution for building multilingual language modelling systems. But a multilingual NLP pipeline has many other problems to consider as well. I will discuss some of these problems in the rest of this essay.

# Elements of a maximally multilingual NLP pipeline

# Preprocessing

# NLP applications

# Conclusion