<a href="https://colab.research.google.com/github/dgromann/MCMLR/blob/main/MCMLR_BonusExercise1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# First Bonus Exercise

This is a notebook for the first bonus exercise of the lecture "Multilingual and Crosslingual Methods and Language Resources". 

In this first exercise, you will automatically tag natural language sequences with Part-of-Speech (POS) tags and compare:

*   Three different multilingual tools
*   Three different languages: English and two languages of different language families

When selecting languages from different language families, please do not consider the highest level of the classification but the level of "Germanic" for English or "Romance" for Spanish. You can use Ethnologue (https://www.ethnologue.com/guides/ethnologue200) to search for languages and look up the language family (listed under "Classification" when you click on a language). At least one person in your group should be able to speak the second language; the third can be any language of your choice (make sure the tools below support the languages you want to select!).

Below you will find the code for all three POS taggers used in the comparison, set up for English, with tips on where to adapt the code for the respective language.

To accommodate as many languages as possible, this exercise uses the first sentence, or at most the first two sentences, of the very generic Wikipedia page for "House" (https://en.wikipedia.org/wiki/House).

*Step-wise instructions for this exercise:*


1.   Select two languages and copy the Wikipedia sentences from the "House" entry in these languages (first or first two if the first is very short) 
2.   Run the code below for all three models for English 
3.   Extend the code to store the result in a file 
4.   Run each POS tagger on the other two languages and store the results 
5.   Compare the results of the POS taggers - which one is more reliable for which language and why?
6.   Compare the POS tags regarding similarities and differences of the three languages - what does this tell us about the languages?
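
For step 3, one possible way to store results is a tab-separated file with one token per line. The following is only a minimal sketch: the helper name `save_tags`, the file name `spacy_en.tsv`, and the example tags are assumptions made for illustration and are not part of the exercise material.

```python
import csv


def save_tags(tagged_tokens, path):
    """Write (token, tag) pairs to a tab-separated file, one token per line."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["token", "pos"])  # header row
        writer.writerows(tagged_tokens)


# Hypothetical example data; replace it with the output of a real tagger.
example_tags = [("A", "DET"), ("house", "NOUN"), ("is", "AUX")]
save_tags(example_tags, "spacy_en.tsv")
```

With spaCy, the list of pairs could be built as `[(token.text, token.pos_) for token in text]`; Polyglot's `text.pos_tags` is already a list of such pairs.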




## POS Tagging with spaCy 

In order to POS tag a different language with the spaCy POS-Tagger, you need to load the correct model for the language. You can choose from the available languages here (https://spacy.io/usage/models) - this might restrict your possible choice of languages! 




In [None]:
#spaCy is already pre-installed - no need to install it
#Download the model for the correct language by replacing "en_core_web_trf"
!python -m spacy download en_core_web_trf

In [None]:
import spacy

#Replace the English model "en_core_web_trf" with the model for the respective language
nlp = spacy.load("en_core_web_trf")

text = nlp("A house is a single-unit residential building. It may range in complexity from a rudimentary hut to a complex structure of wood, masonry, concrete or other material, outfitted with plumbing, electrical, and heating, ventilation, and air conditioning systems.")

print(text)

for token in text:
    print(token, token.pos_)

## POS Tagging with Polyglot

In order to run polyglot on a different language, you need to first download the corresponding model by replacing "pos2.en" with the corresponding language, e.g. for Spanish this would be "pos2.es". The list of supported languages and the meaning of the POS tags can be found here: https://polyglot.readthedocs.io/en/latest/POS.html  

In [None]:
#For Polyglot we need to install the library but also a lot of dependencies
!pip install polyglot
!pip install pyicu
!pip install Morfessor
!pip install pycld2



In [None]:
# Here you need to replace ".en" twice with the two-letter ISO code of your chosen language (which must also be supported by the tool)
!polyglot download pos2.en
!polyglot download embeddings2.en

In [None]:
from polyglot.detect import Detector
from polyglot.text import Text

input_text = "A house is a single-unit residential building. It may range in complexity from a rudimentary hut to a complex structure of wood, masonry, concrete or other material, outfitted with plumbing, electrical, and heating, ventilation, and air conditioning systems."
text = Text(input_text)
print(text.pos_tags)
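
To compare languages (step 6), it may help to look at how often each tag occurs in a sentence. The sketch below assumes the tagger output is available as a list of (token, tag) pairs, which is the shape of `text.pos_tags`; the helper name `tag_distribution` and the sample tags are hypothetical and only illustrate the idea.

```python
from collections import Counter


def tag_distribution(tagged_tokens):
    """Return the relative frequency of each POS tag, most frequent first."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.most_common()}


# Hypothetical tagger output, not produced by a real model.
sample = [("A", "DET"), ("house", "NOUN"), ("is", "VERB"),
          ("a", "DET"), ("building", "NOUN"), (".", "PUNCT")]

for tag, share in tag_distribution(sample).items():
    print(f"{tag}\t{share:.2f}")
```

Running the same function on the output for each of the three languages gives distributions that can be compared side by side.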

## POS Tagging with FLAIR

FLAIR is a multilingual neural model and theoretically needs no adaptation for individual languages, so you only need to replace the input sentence with the sentence in a different language. More information on FLAIR can be found here: https://huggingface.co/flair/upos-multi


FLAIR is not the newest model, but highly multilingual. If you prefer to use a different neural multilingual POS tagger or a language-specific POS tagger from Huggingface, feel free to replace the model for this last example of a POS tagger. 

In [None]:
#First you have to install flair
!pip install flair


In [None]:
from flair.data import Sentence
from flair.models import SequenceTagger

# Load tagger
tagger = SequenceTagger.load("flair/upos-multi")

sentence = Sentence("A house is a single-unit residential building.")

tagger.predict(sentence)

# print predicted POS tags
print('The following POS tags are found:')
# iterate over the tokens and print each one with its predicted tag
for token in sentence:
    print(token)

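
For step 5, the outputs of two taggers can be compared token by token. Since the tools may tokenize the same sentence differently, the sketch below first aligns the two token sequences with `difflib.SequenceMatcher` and only counts tokens that both taggers produced. The function name `tag_agreement` and both example outputs are assumptions for illustration; they are not real tagger results.

```python
from difflib import SequenceMatcher


def tag_agreement(tagged_a, tagged_b):
    """Return (agreeing, aligned) counts for two lists of (token, tag) pairs."""
    tokens_a = [token for token, _ in tagged_a]
    tokens_b = [token for token, _ in tagged_b]
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    agreeing = aligned = 0
    # Each matching block is a run of identical tokens in both sequences.
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            aligned += 1
            tag_a = tagged_a[block.a + offset][1]
            tag_b = tagged_b[block.b + offset][1]
            if tag_a == tag_b:
                agreeing += 1
    return agreeing, aligned


# Hypothetical outputs of two taggers with different tokenization.
tagger_one = [("A", "DET"), ("house", "NOUN"), ("is", "AUX"), ("a", "DET"),
              ("single-unit", "ADJ"), ("building", "NOUN"), (".", "PUNCT")]
tagger_two = [("A", "DET"), ("house", "NOUN"), ("is", "VERB"), ("a", "DET"),
              ("single", "ADJ"), ("-", "PUNCT"), ("unit", "NOUN"),
              ("building", "NOUN"), (".", "PUNCT")]

agreeing, aligned = tag_agreement(tagger_one, tagger_two)
print(f"{agreeing} of {aligned} aligned tokens have the same tag")
```

Tokens that only one tagger produced (here the pieces of "single-unit") are skipped, so the counts say nothing about which tokenization is better; that still has to be judged by someone who speaks the language.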