Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

# Lab 2.3: Linguistic Units

In this lab, we will take a closer look at how to distinguish between words. We use the processed article from the previous lab. **Modify the code to work with all articles from your dataset.**

In [2]:
import stanza
import pickle

processed_article_file = "../data/processed_data/nlp_article1.pkl"
# Use a context manager so that the file is closed after loading
with open(processed_article_file, "rb") as infile:
    nlp_output = pickle.load(infile)
print(nlp_output)

[
  [
    {
      "id": 1,
      "text": "Les",
      "lemma": "le",
      "upos": "DET",
      "feats": "Definite=Def|Number=Plur|PronType=Art",
      "head": 2,
      "deprel": "det",
      "start_char": 0,
      "end_char": 3,
      "ner": "S-LOC",
      "multi_ner": [
        "S-LOC"
      ]
    },
    {
      "id": 2,
      "text": "véganeries",
      "lemma": "véganerie",
      "upos": "NOUN",
      "feats": "Gender=Fem|Number=Plur",
      "head": 3,
      "deprel": "nsubj",
      "start_char": 4,
      "end_char": 14,
      "ner": "O",
      "multi_ner": [
        "O"
      ]
    },
    {
      "id": 3,
      "text": "vont",
      "lemma": "aller",
      "upos": "VERB",
      "feats": "Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin",
      "head": 0,
      "deprel": "root",
      "start_char": 15,
      "end_char": 19,
      "ner": "O",
      "multi_ner": [
        "O"
      ]
    },
    {
      "id": 4,
      "text": "de",
      "lemma": "de",
      "upos": "ADP",
      

## 1. Tokens vs Lemmas

In the HLT course, you already learned about the difference between tokens and lemmas. Let's look at this difference in our article.

It depends on the language you work with and on your analysis goal whether you are more interested in tokens or in lemmas. **Think about some examples.**
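As a small illustration, the sketch below counts a few (token, lemma) pairs once by surface form and once by lemma. The pairs are copied by hand from the stanza output of this lab so that the snippet runs without stanza; with real data you would read them from `word.text` and `word.lemma`.

```python
from collections import Counter

# (token, lemma) pairs copied by hand from the lab output; nothing is computed by stanza here
pairs = [
    ("veux", "vouloir"),
    ("veut", "vouloir"),
    ("vont", "aller"),
    ("animaux", "animal"),
    ("animale", "animal"),
]

# Counting by surface form keeps every inflected form apart
token_counts = Counter(token for token, _ in pairs)
# Counting by lemma merges inflected forms of the same word
lemma_counts = Counter(lemma for _, lemma in pairs)

print(token_counts.most_common())
print(lemma_counts.most_common())
```

If your research question is about which concepts are discussed, the lemma counts are probably more informative. If you care about person, number or tense, the token counts preserve that information.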

In [3]:
for i, sentence in enumerate(nlp_output.sentences):
    # Only check first 20 sentences
    if i==20:
        break
        
    print(i, sentence.text)
    for word in sentence.words:
        # Only print words whose surface form differs from the lemma
        if word.text != word.lemma:
            print(word.id, word.text, word.lemma)
    print()


0 Les véganeries vont de plus en plus loin.
1 Les le
2 véganeries véganerie
3 vont aller

1 Connaissez-vous le "faux mage" à base de lait de noix de cajou ? ... Article sans intérêt.
1 Connaissez connaître
2 -vous vous
18 Article article

2 Oui le faux-mage existe et alors ?
1 Oui oui
4 existe exister

3 Oui il y a des simili comme le lait d'amande, des steak de soja et alors ?
1 Oui oui
2 il lui
4 a avoir
5 des un
10 d' de
13 des un

4 Tu veux pas en manger, personne t'y oblige.
1 Tu toi
2 veux vouloir
8 t' toi
10 oblige obliger

5 Critiquer l'existence de produits et en vouloir au vegan ça vous avance à quoi ?
1 Critiquer critiquer
2 l' le
5 produits produit
14 avance avancer

6 C'est vous qui êtes intolérant là...
1 C' ce
2 est être
5 êtes être

7 Être vegan ça veut juste dire boycotter l'utilisation des animaux, la maltraitance animale ect.
1 Être être
4 veut vouloir
8 l' le
11 les le
12 animaux animal
14 la le
16 animale animal

8 L'Homme est aussi un animal donc une bonne partie 

## 2. Testing Lemmatization

In the example, we see that in sentence 19, the lemma for "VEULENT" is "vevler". The correct lemma should be "vouloir". 

A reason for this mistake might be that the word is written in all-caps. Let's check this: 

In [4]:
# Let's write a function for testing a single sentence.
def get_lemmas(text, stanza_pipeline):
    # This is quite a complex list comprehension. Make sure you understand what it does.
    # Note that it only returns the lemmas of the first sentence in the text.
    lemmas = [word.lemma for word in stanza_pipeline(text).sentences[0].words]
    return lemmas

# We use a faster pipeline that does not perform all processing steps, only tokenization, POS-tagging and lemmatization
french_pipeline = stanza.Pipeline('fr', processors='tokenize,pos,lemma')
test1 = 'Ils veulent fabriquer quelque chose.'
test2 = 'Ils VEULENT fabriquer quelque chose.'

print(get_lemmas(test1, french_pipeline))
print(get_lemmas(test2, french_pipeline))

2023-10-18 11:37:06 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES



2023-10-18 11:37:06 INFO: Loading these models for language: fr (French):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |

2023-10-18 11:37:06 INFO: Using device: cpu
2023-10-18 11:37:06 INFO: Loading: tokenize
2023-10-18 11:37:07 INFO: Loading: mwt
2023-10-18 11:37:07 INFO: Loading: pos
2023-10-18 11:37:07 INFO: Loading: lemma
2023-10-18 11:37:07 INFO: Done loading processors!


['eux', 'vouloir', 'fabriquer', 'quelque', 'chose', '.']
['eux', 'veuloir', 'fabriquer', 'quelque', 'chose', '.']


**Test the lemmatization quality for your own datasets. Collect the tricky cases in this [document](https://docs.google.com/document/d/1tU7KD-WrwYAieMH_Q-6z69NFleTJwp8zIUpCr_rbwlA/edit?usp=sharing)**

Sometimes the problem already lies in the tokenization. Do you also find cases of incorrect tokenization?

If you find many inconsistencies, you can compare the quality to the output of the NLTK or spaCy lemmatizer.
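To collect tricky cases systematically, you can compare the predicted lemmas against lemmas that you consider correct. The following is a minimal sketch: `find_lemma_mismatches` is a hypothetical helper (not part of stanza), the predicted lemmas are copied from the output of the all-caps test sentence, and the expected lemmas and the tokenization are written by hand.

```python
def find_lemma_mismatches(tokens, predicted, expected):
    """Return (token, predicted, expected) for every position where the lemmas differ."""
    return [(t, p, e) for t, p, e in zip(tokens, predicted, expected) if p != e]

# Hand-written tokenization of the all-caps test sentence (an assumption)
tokens = ["Ils", "VEULENT", "fabriquer", "quelque", "chose", "."]
# Lemmas copied from the pipeline output for the all-caps sentence
predicted = ["eux", "veuloir", "fabriquer", "quelque", "chose", "."]
# Hand-written lemmas that we consider correct
expected = ["eux", "vouloir", "fabriquer", "quelque", "chose", "."]

mismatches = find_lemma_mismatches(tokens, predicted, expected)
print(mismatches)
```

The same helper could be used to compare the output of two different lemmatizers, as long as both use the same tokenization.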

## 3. Adding exceptions

The stanza lemmatizer uses a combination of a dictionary and a neural model. The lemma for any word that cannot be found in the dictionary is approximated by the neural model. A system that combines several components in this way is called an ensemble.
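The dictionary-first strategy can be sketched in a few lines. This is only a toy illustration and not stanza's actual implementation: `lemmatize`, `toy_fallback` and `toy_dictionary` are hypothetical names, and the fallback simply lowercases the word instead of running a neural model.

```python
def lemmatize(word, dictionary, fallback):
    """Return the dictionary lemma if the word is known, otherwise ask the fallback."""
    if word in dictionary:
        return dictionary[word]
    return fallback(word)

def toy_fallback(word):
    # Stand-in for the neural model: it only lowercases the word
    return word.lower()

# A tiny hand-written dictionary (lookup is case-sensitive)
toy_dictionary = {"veulent": "vouloir", "vont": "aller"}

print(lemmatize("veulent", toy_dictionary, toy_fallback))  # found in the dictionary
print(lemmatize("VEULENT", toy_dictionary, toy_fallback))  # not found, the fallback answers

# Adding an exception to the dictionary changes the result for this word
toy_dictionary["VEULENT"] = "vouloir"
print(lemmatize("VEULENT", toy_dictionary, toy_fallback))
```

The sketch shows why an all-caps word can end up with a different lemma than its lowercase form: the two are different keys in a case-sensitive dictionary.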
 
We can customize the dictionary to add our own solutions. Check the [documentation](https://stanfordnlp.github.io/stanza/lemma.html#accessing-lemma-for-word).  

**Important: If you modify the pipeline like this, you need to be very transparent in your documentation and provide the modified model (or the code to obtain it) to ensure reproducibility!**

In [7]:
import torch
from os.path import expanduser

# Load the current dictionaries
home = expanduser("~")
# if this is not working, double-check which dictionary is used by your version of stanza
model = torch.load(home +'/stanza_resources/fr/lemma/gsd.pt', map_location='cpu')
word_dict = model['dicts'][0]

# Add a word to the dictionary
word_dict['VEULENT'] = 'vouloir'

# Save the modified model under a different name
torch.save(model, home + '/stanza_resources/fr/lemma/gsd_customized.pt')

# Load your customized pipeline
customized_pipeline = stanza.Pipeline('fr', package='gsd', processors='tokenize,pos,lemma', lemma_model_path=home + '/stanza_resources/fr/lemma/gsd_customized.pt')
test = 'Ils VEULENT fabriquer quelque chose.'
print(get_lemmas(test, customized_pipeline))

2023-10-18 11:38:19 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES



2023-10-18 11:38:20 INFO: Loading these models for language: fr (French):
| Processor | Package                 |
---------------------------------------
| tokenize  | gsd                     |
| mwt       | gsd                     |
| pos       | gsd_charlm              |
| lemma     | /Users/lis...tomized.pt |

2023-10-18 11:38:20 INFO: Using device: cpu
2023-10-18 11:38:20 INFO: Loading: tokenize
2023-10-18 11:38:20 INFO: Loading: mwt
2023-10-18 11:38:20 INFO: Loading: pos
2023-10-18 11:38:20 INFO: Loading: lemma
2023-10-18 11:38:20 INFO: Done loading processors!


['il', 'vouloir', 'fabriquer', 'quelque', 'chose', '.']


## 3. POS-tags

The same lemma can occur in different word classes. For example, *run* can be a verb or a noun. When calculating word frequencies, you might want to distinguish between different POS-tags.   

In [8]:
from collections import Counter

token_pos_frequencies = Counter()
for sentence in nlp_output.sentences:
    # Here you could also use word.text instead of word.lemma. Test if it makes a difference!
    token_pos = [(word.lemma, word.pos) for word in sentence.words]
    token_pos_frequencies.update(token_pos)
    
print(token_pos_frequencies.most_common(50))

[(('le', 'DET'), 55), (('de', 'ADP'), 49), ((',', 'PUNCT'), 31), (('un', 'DET'), 20), (('.', 'PUNCT'), 18), (('à', 'ADP'), 16), (('être', 'AUX'), 16), (('son', 'DET'), 15), (('et', 'CCONJ'), 14), (('...', 'PUNCT'), 13), (('?', 'PUNCT'), 12), (('pas', 'ADV'), 10), (('produit', 'NOUN'), 10), (('ne', 'ADV'), 10), (('pour', 'ADP'), 10), (('en', 'PRON'), 9), (('eux', 'PRON'), 8), (('vous', 'PRON'), 7), (('vouloir', 'VERB'), 7), (('lui', 'PRON'), 6), (('qui', 'PRON'), 6), (('on', 'PRON'), 6), (('plus', 'ADV'), 5), (('lait', 'NOUN'), 5), (('cajou', 'NOUN'), 5), (('alors', 'ADV'), 5), (('manger', 'VERB'), 5), (('ça', 'PRON'), 5), (('ce', 'PRON'), 5), (('ou', 'CCONJ'), 5), (('moi', 'PRON'), 5), (('viande', 'NOUN'), 5), (('pouvoir', 'VERB'), 5), (('nous', 'PRON'), 5), (('avoir', 'AUX'), 5), (('si', 'SCONJ'), 5), (('mais', 'CCONJ'), 5), (('en', 'ADP'), 4), (('"', 'PUNCT'), 4), (('noix', 'NOUN'), 4), (('y', 'PRON'), 4), (('soja', 'NOUN'), 4), (('vegan', 'NOUN'), 4), (('que', 'SCONJ'), 4), (('(', '
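The output above lists the lemma *en* twice, once as a pronoun and once as an adposition. The sketch below shows what happens to these two entries if you ignore the POS-tag. The counts are copied from the output above; the rest is plain Python.

```python
from collections import Counter

# Counts copied from the output above for the lemma "en"
lemma_pos_counts = Counter({("en", "PRON"): 9, ("en", "ADP"): 4})

# Sum the counts over all POS-tags of the same lemma
lemma_counts = Counter()
for (lemma, pos), count in lemma_pos_counts.items():
    lemma_counts[lemma] += count

print(lemma_counts["en"])
```

Without the POS-tag, both uses are merged into a single count, which can hide differences that matter for your analysis.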

## 4. Stopwords

Most of the highly frequent words are stopwords: function words such as determiners and prepositions that carry little content on their own. For some research questions, it might make sense to ignore the stopwords.

**Search for the commonly used stopwords for your target language. Discuss the role of stopwords for your dataset.**

Do you see a difference in the most frequent tokens if you ignore stopwords?
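One detail to watch out for is casing: stopword lists are usually lowercase, while tokens at the beginning of a sentence are capitalized. The following sketch compares case-sensitive and case-insensitive filtering. The stopword set is a tiny hypothetical example and the sentence is tokenized by hand.

```python
# A tiny, hypothetical stopword set (real lists are much longer)
toy_stopwords = {"le", "les", "de", "et"}

# Hand-tokenized first sentence of the article
tokens = ["Les", "véganeries", "vont", "de", "plus", "en", "plus", "loin", "."]

# Case-sensitive: "Les" is kept because only "les" is in the set
case_sensitive = [t for t in tokens if t not in toy_stopwords]
# Case-insensitive: compare the lowercased token against the set
case_insensitive = [t for t in tokens if t.lower() not in toy_stopwords]

print(case_sensitive)
print(case_insensitive)
```

Using a set instead of a list also makes the membership test faster, which becomes noticeable on larger datasets.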

In [9]:
import string
# These are the stopwords defined for French in the nltk module, plus the elided forms "d'" and "l'" that I added
stopwords = ['au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', "d'",'elle', 'en', 'et', 'eux', 'il', 'ils', 'je', 'la', 'le', 'les',"l'", 'leur', 'lui', 'ma', 'mais', 'me', 'même', 'mes', 'moi', 'mon', 'ne', 'nos', 'notre', 'nous', 'on', 'ou', 'par', 'pas', 'pour', 'qu', 'que', 'qui', 'sa', 'se', 'ses', 'son', 'sur', 'ta', 'te', 'tes', 'toi', 'ton', 'tu', 'un', 'une', 'vos', 'votre', 'vous', 'c', 'd', 'j', 'l', 'à', 'm', 'n', 's', 't', 'y', 'été', 'étée', 'étées', 'étés', 'étant', 'étante', 'étants', 'étantes', 'suis', 'es', 'est', 'sommes', 'êtes', 'sont', 'serai', 'seras', 'sera', 'serons', 'serez', 'seront', 'serais', 'serait', 'serions', 'seriez', 'seraient', 'étais', 'était', 'étions', 'étiez', 'étaient', 'fus', 'fut', 'fûmes', 'fûtes', 'furent', 'sois', 'soit', 'soyons', 'soyez', 'soient', 'fusse', 'fusses', 'fût', 'fussions', 'fussiez', 'fussent', 'ayant', 'ayante', 'ayantes', 'ayants', 'eu', 'eue', 'eues', 'eus', 'ai', 'as', 'avons', 'avez', 'ont', 'aurai', 'auras', 'aura', 'aurons', 'aurez', 'auront', 'aurais', 'aurait', 'aurions', 'auriez', 'auraient', 'avais', 'avait', 'avions', 'aviez', 'avaient', 'eut', 'eûmes', 'eûtes', 'eurent', 'aie', 'aies', 'ait', 'ayons', 'ayez', 'aient', 'eusse', 'eusses', 'eût', 'eussions', 'eussiez', 'eussent']

def calculate_token_frequencies(nlp_output, ignore_stopwords=False):
    token_frequencies = Counter()
    for sentence in nlp_output.sentences:
        if ignore_stopwords:
            # Take some time to understand the syntax of the list comprehension for ignoring stopwords.
            # It is not intuitive.
            tokens = [token.text for token in sentence.tokens if token.text not in stopwords]
        else:
            tokens = [token.text for token in sentence.tokens]

        token_frequencies.update(tokens)
    return token_frequencies

token_frequencies = calculate_token_frequencies(nlp_output, ignore_stopwords=False)
print(token_frequencies.most_common(20))


## 5. Normalization

If we want to determine the relative importance of a term for an article, we can normalize its frequency: we divide the frequency of the term in the article by its frequency in all articles.

The code currently distinguishes between uppercase and lowercase words. For many languages and tasks, it is useful to lowercase all words. **Think about the influence of casing on your research question.**
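The effect of lowercasing on the counts can be illustrated with a toy example. `lowercase_counts` is a hypothetical helper and the counts are invented for illustration; they are not taken from the article.

```python
from collections import Counter

def lowercase_counts(counts):
    """Merge the counts of tokens that only differ in casing."""
    merged = Counter()
    for token, freq in counts.items():
        merged[token.lower()] += freq
    return merged

# Invented counts for illustration
toy_counts = Counter({"Oui": 2, "oui": 1, "Article": 1, "article": 3})
merged = lowercase_counts(toy_counts)

print(merged.most_common())
```

Keep in mind that lowercasing also merges cases where the capital letter carries meaning, for example proper names or emphasis expressed in all-caps.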

In [14]:
frequencies_currentarticle = calculate_token_frequencies(nlp_output)

# You calculated the document frequencies in the previous lab
frequencies_dataset = pickle.load(open("../data/processed_data/tokenfrequencies.pkl","rb"))

normalized_frequencies = Counter()
for token, freq in frequencies_currentarticle.items():
    # Remove stopwords and punctuation? --> experimental choice
    if token not in stopwords and token not in string.punctuation:
        normalized_frequency = freq / frequencies_dataset[token]
        normalized_frequencies[token] = normalized_frequency
    
print(normalized_frequencies.most_common(100))

ZeroDivisionError: division by zero

The code currently throws a *ZeroDivisionError*. **What does that mean and how can you fix it?** 