Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

# Lab 2.3: Linguistic Units

In this lab, we will take a closer look at how to distinguish between words. We use the processed article from the previous lab. **Modify the code to work with all articles from your dataset.**

In [None]:
import stanza
import pickle

processed_article_file = "../data/processed_data/nlp_article1.pkl"
# Load the stanza document that was pickled in the previous lab
with open(processed_article_file, "rb") as f:
    nlp_output = pickle.load(f)
print(nlp_output)

## 1. Tokens vs Lemmas

In the HLT course, you already learned about the difference between tokens and lemmas. Let's see how they differ in our article.

It depends on the language you work with and on your analysis goal whether you are more interested in tokens or in lemmas. **Think about some examples.**
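
As a small illustration of why the choice matters, the sketch below counts the same words once by token and once by lemma. The (token, lemma) pairs are hand-written toy data, not stanza output; in the lab they would come from `word.text` and `word.lemma`.

```python
from collections import Counter

# Hand-written (token, lemma) pairs for illustration only.
words = [
    ("Les", "le"), ("chats", "chat"), ("dorment", "dormir"),
    ("et", "et"), ("le", "le"), ("chat", "chat"), ("dort", "dormir"),
]

token_counts = Counter(token for token, _ in words)
lemma_counts = Counter(lemma for _, lemma in words)

# "chat" and "chats" are separate tokens but share one lemma
print(token_counts["chat"], lemma_counts["chat"])
# "dort" and "dorment" are both counted under the lemma "dormir"
print(token_counts["dort"], lemma_counts["dormir"])
```

Counting lemmas groups inflected forms together, which tends to matter more for morphologically rich languages.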

In [None]:
# Only check the first 20 sentences
for i, sentence in enumerate(nlp_output.sentences[:20]):
    print(i, sentence.text)
    # Print only the words whose surface form differs from the lemma
    for word in sentence.words:
        if word.text != word.lemma:
            print(word.id, word.text, word.lemma)
    print()


## 2. Testing Lemmatization

In the output above, sentence 19 contains the token "VEULENT", which is lemmatized as "vevler". The correct lemma is "vouloir".

One possible reason for this mistake is that the word is written in all caps. Let's check this:

In [None]:
# Let's write a function for testing a single sentence.
def get_lemmas(text, stanza_pipeline):
    # This is a fairly complex list comprehension. Make sure you understand what it does.
    # Note that it only looks at the first sentence of the processed text.
    lemmas = [word.lemma for word in stanza_pipeline(text).sentences[0].words]
    return lemmas

# We use a faster pipeline that does not perform all processing steps, only tokenization, POS-tagging and lemmatization
french_pipeline = stanza.Pipeline('fr', processors='tokenize,pos,lemma')
test1 = 'Ils veulent fabriquer quelque chose.'
test2 = 'Ils VEULENT fabriquer quelque chose.'

print(get_lemmas(test1, french_pipeline))
print(get_lemmas(test2, french_pipeline))

**Test the lemmatization quality for your own datasets. Collect the tricky cases in this [document](https://docs.google.com/document/d/1tU7KD-WrwYAieMH_Q-6z69NFleTJwp8zIUpCr_rbwlA/edit?usp=sharing)**

Sometimes the problem already lies in the tokenization. Do you also find cases of incorrect tokenization?

If you find many inconsistencies, you can compare the quality to the output of the NLTK or spaCy lemmatizer.
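
For such a comparison, a small helper that lists the disagreements can be handy. The sketch below works on plain Python lists, so it does not depend on any particular library; `compare_lemmas` is a hypothetical name, and the lemma lists are hand-written for illustration rather than real system output.

```python
def compare_lemmas(tokens, lemmas_a, lemmas_b):
    """Return (token, lemma_a, lemma_b) triples where two lemmatizers disagree."""
    if not (len(tokens) == len(lemmas_a) == len(lemmas_b)):
        # Different lengths usually mean the systems tokenized the text differently
        raise ValueError("Tokenizations differ; align the tokens before comparing lemmas.")
    return [(t, a, b) for t, a, b in zip(tokens, lemmas_a, lemmas_b) if a != b]

tokens = ["Ils", "VEULENT", "fabriquer"]
system_a = ["il", "vevler", "fabriquer"]   # hand-written stand-in for one lemmatizer
system_b = ["il", "vouloir", "fabriquer"]  # hand-written stand-in for another
print(compare_lemmas(tokens, system_a, system_b))
```

The length check also surfaces tokenization differences, which have to be resolved before lemmas can be compared position by position.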

## 3. Adding exceptions

The stanza lemmatizer combines a dictionary with a neural model: the lemma of any word that is not found in the dictionary is predicted by the neural model. Combining several resources like this is often called an ensemble.
 
We can customize the dictionary to add our own solutions. Check the [documentation](https://stanfordnlp.github.io/stanza/lemma.html#accessing-lemma-for-word).  

**Important: If you modify the pipeline like this, you need to be very transparent in your documentation and provide the modified model (or the code to obtain it) to ensure reproducibility!**
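
One possible way to document such changes is to keep every manual override in a small JSON file that is shared together with your code. This is only a sketch: the file name, the helper names `save_overrides` and `load_overrides`, and the file layout are assumptions, not part of stanza.

```python
import json
import os
import tempfile

# Manual lemma overrides: word form -> lemma
custom_lemmas = {"VEULENT": "vouloir"}

def save_overrides(overrides, path):
    """Write the overrides to a human-readable JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(overrides, f, ensure_ascii=False, indent=2, sort_keys=True)

def load_overrides(path):
    """Read the overrides back, e.g. before applying them to the lemma dictionary."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Written to the temp directory here so the sketch runs anywhere;
# in a project you would store the file next to your code.
path = os.path.join(tempfile.gettempdir(), "custom_lemmas.json")
save_overrides(custom_lemmas, path)
print(load_overrides(path))
```

Anyone who reruns your experiments can then apply the same overrides to a fresh copy of the original model.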

In [None]:
import torch
from os.path import expanduser

# Load the current dictionaries
home = expanduser("~")
lemma_dir = home + '/stanza_resources/fr/lemma/'
# If this is not working, double-check which dictionary is used by your version of stanza
model = torch.load(lemma_dir + 'gsd.pt', map_location='cpu')
word_dict = model['dicts'][0]

# Add a word to the dictionary
word_dict['VEULENT'] = 'vouloir'

# Save the modified model under a different name
customized_model_path = lemma_dir + 'gsd_customized.pt'
torch.save(model, customized_model_path)

# Load your customized pipeline
customized_pipeline = stanza.Pipeline(
    'fr',
    package='gsd',
    processors='tokenize,pos,lemma',
    lemma_model_path=customized_model_path,
)
test = 'Ils VEULENT fabriquer quelque chose.'
print(get_lemmas(test, customized_pipeline))

## 3. POS-tags

The same lemma can occur in different word classes. For example, *run* can be a verb or a noun. When calculating word frequencies, you might want to distinguish between different POS-tags.   
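
The effect can be seen on a toy example first. The (lemma, POS) pairs below are hand-written for illustration; in the lab they come from `word.lemma` and `word.pos`.

```python
from collections import Counter

# Hand-written (lemma, POS) pairs for illustration only
tagged = [("run", "VERB"), ("run", "NOUN"), ("run", "VERB"), ("dog", "NOUN")]

lemma_only = Counter(lemma for lemma, _ in tagged)
lemma_pos = Counter(tagged)

# Without POS-tags, all uses of "run" are merged into one count
print(lemma_only["run"])
# With POS-tags, the verb and the noun are counted separately
print(lemma_pos[("run", "VERB")], lemma_pos[("run", "NOUN")])
```

Because tuples are hashable, a `Counter` can use (lemma, POS) pairs directly as keys.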

In [None]:
from collections import Counter

token_pos_frequencies = Counter()
for sentence in nlp_output.sentences:
    # Here you could also use word.text instead of word.lemma. Test if it makes a difference!
    token_pos = [(word.lemma, word.pos) for word in sentence.words]
    token_pos_frequencies.update(token_pos)
    
print(token_pos_frequencies.most_common(50))

## 4. Stopwords

The most frequent words in a text are usually stopwords: function words such as determiners, prepositions and pronouns that carry little content on their own. For some research questions, it makes sense to ignore them.

**Search for the commonly used stopwords for your target language. Discuss the role of stopwords for your dataset.**

Do you see a difference in the most frequent tokens if you ignore stopwords?
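
To get a feeling for the effect, here is a minimal sketch on a toy token list. The tiny stopword set and the tokens are made up for illustration; with your own data, the picture may look different.

```python
from collections import Counter

# Toy data for illustration only
toy_stopwords = {"le", "la", "de", "et"}
tokens = ["le", "chat", "et", "le", "chien", "de", "la", "voisine", "le", "chat"]

all_counts = Counter(tokens)
content_counts = Counter(t for t in tokens if t not in toy_stopwords)

# With stopwords, a determiner is the most frequent token
print(all_counts.most_common(1))
# Without stopwords, a content word comes out on top
print(content_counts.most_common(1))
```

Using a `set` for the stopwords makes the membership test fast, which becomes noticeable on larger datasets.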

In [None]:
import string
# These are the stopwords defined for French in the nltk module; the elided forms "d'" and "l'" were added manually
stopwords = ['au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', "d'",'elle', 'en', 'et', 'eux', 'il', 'ils', 'je', 'la', 'le', 'les',"l'", 'leur', 'lui', 'ma', 'mais', 'me', 'même', 'mes', 'moi', 'mon', 'ne', 'nos', 'notre', 'nous', 'on', 'ou', 'par', 'pas', 'pour', 'qu', 'que', 'qui', 'sa', 'se', 'ses', 'son', 'sur', 'ta', 'te', 'tes', 'toi', 'ton', 'tu', 'un', 'une', 'vos', 'votre', 'vous', 'c', 'd', 'j', 'l', 'à', 'm', 'n', 's', 't', 'y', 'été', 'étée', 'étées', 'étés', 'étant', 'étante', 'étants', 'étantes', 'suis', 'es', 'est', 'sommes', 'êtes', 'sont', 'serai', 'seras', 'sera', 'serons', 'serez', 'seront', 'serais', 'serait', 'serions', 'seriez', 'seraient', 'étais', 'était', 'étions', 'étiez', 'étaient', 'fus', 'fut', 'fûmes', 'fûtes', 'furent', 'sois', 'soit', 'soyons', 'soyez', 'soient', 'fusse', 'fusses', 'fût', 'fussions', 'fussiez', 'fussent', 'ayant', 'ayante', 'ayantes', 'ayants', 'eu', 'eue', 'eues', 'eus', 'ai', 'as', 'avons', 'avez', 'ont', 'aurai', 'auras', 'aura', 'aurons', 'aurez', 'auront', 'aurais', 'aurait', 'aurions', 'auriez', 'auraient', 'avais', 'avait', 'avions', 'aviez', 'avaient', 'eut', 'eûmes', 'eûtes', 'eurent', 'aie', 'aies', 'ait', 'ayons', 'ayez', 'aient', 'eusse', 'eusses', 'eût', 'eussions', 'eussiez', 'eussent']

def calculate_token_frequencies(nlp_output, ignore_stopwords=False):
    """Count how often each token occurs in a processed stanza document."""
    token_frequencies = Counter()
    for sentence in nlp_output.sentences:
        if ignore_stopwords:
            # Take some time to understand the syntax of this list comprehension:
            # the trailing "if" keeps only the tokens that are not in the stopword list.
            # The comparison is case-sensitive, so "Le" is not removed.
            tokens = [token.text for token in sentence.tokens if token.text not in stopwords]
        else:
            tokens = [token.text for token in sentence.tokens]

        token_frequencies.update(tokens)
    return token_frequencies

token_frequencies = calculate_token_frequencies(nlp_output, ignore_stopwords=False)
print(token_frequencies.most_common(20))

## 5. Normalization

If we want to determine the relative importance of a term for an article, we can normalize its frequency by the frequency of the term in all articles. 

The code currently distinguishes between uppercase and lowercase words. For many languages and tasks, it is useful to lowercase all words. **Think about the influence of casing on your research question.**
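
If you decide to ignore casing, one option is to merge the counts of all case variants. The sketch below assumes the frequencies are stored in a `Counter`, as in this lab; `lowercase_counts` is a hypothetical helper name and the counts are made up.

```python
from collections import Counter

def lowercase_counts(frequencies):
    """Merge the counts of tokens that only differ in casing."""
    merged = Counter()
    for token, freq in frequencies.items():
        merged[token.lower()] += freq
    return merged

# Made-up counts for illustration only
cased = Counter({"Paris": 3, "paris": 1, "VEULENT": 1, "veulent": 2})
print(lowercase_counts(cased))
```

If you lowercase, apply the same step to both the article frequencies and the dataset frequencies so that the tokens still match.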

The code below currently raises a *ZeroDivisionError*. **What does that mean and how can you fix it?**

In [None]:
frequencies_currentarticle = calculate_token_frequencies(nlp_output)

# You calculated the document frequencies in the previous lab
with open("../data/processed_data/tokenfrequencies.pkl", "rb") as f:
    frequencies_dataset = pickle.load(f)

normalized_frequencies = Counter()
for token, freq in frequencies_currentarticle.items():
    # Remove stopwords and punctuation? --> experimental choice
    if token not in stopwords and token not in string.punctuation:
        # Count in this article divided by the count in the whole dataset
        normalized_frequencies[token] = freq / frequencies_dataset[token]
    
print(normalized_frequencies.most_common(100))