Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

# Lab 4.2: Ngrams and Terminology

In this lab, we will take a closer look at how to distinguish between words. We use the processed article from the previous lab. **Modify the code to work with all articles from your dataset.**

In [23]:
import stanza
import pandas as pd
language = "fr"
article_file = "../data/veganism_overview_" + language +".tsv"
content = pd.read_csv(article_file, sep="\t", header = 0, keep_default_na=False)

# Prepare the nlp pipeline
stanza.download(language)
nlp = stanza.Pipeline(language)

current_article = content["Text"][0]
nlp_output = nlp(current_article)

2023-11-06 12:06:00 INFO: Downloading default packages for language: fr (French) ...
2023-11-06 12:06:01 INFO: File exists: /Users/lisabeinborn/stanza_resources/fr/default.zip
2023-11-06 12:06:04 INFO: Finished downloading models and saved to /Users/lisabeinborn/stanza_resources.
2023-11-06 12:06:04 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2023-11-06 12:06:06 INFO: Loading these models for language: fr (French):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |
| depparse  | combined_charlm   |
| ner       | wikiner           |

2023-11-06 12:06:06 INFO: Using device: cpu
2023-11-06 12:06:06 INFO: Loading: tokenize
2023-11-06 12:06:06 INFO: Loading: mwt
2023-11-06 12:06:06 INFO: Loading: pos
2023-11-06 12:06:06 INFO: Loading: lemma
2023-11-06 12:06:06 INFO: Loading: depparse
2023-11-06 12:06:06 INFO: Loading: ner
2023-11-06 12:06:07 INFO: Done loading processors!


## 1. Ngrams

Often, a sequence of several tokens should be interpreted as a compound or a fixed phrase:
- New York
- front door
- Chief Executive Officer
- kick the bucket
- state of the art

In order to account for frequent phrases, we can extract ngram statistics. **Try different values for n and analyze the results!**

In [15]:
from collections import Counter
def calculate_ngram_frequencies(n, nlp_output):
    ngram_frequencies = Counter()
    for sentence in nlp_output.sentences:
        tokens = [token.text for token in sentence.tokens]
        ngrams = [" ".join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
        ngram_frequencies.update(ngrams)
    return ngram_frequencies
n = 3
ngram_frequencies = calculate_ngram_frequencies(n, nlp_output)
print(ngram_frequencies.most_common(20))

[('noix de cajou', 3), ('de noix de', 2), ('et alors ?', 2), ('il y a', 2), ('indispensable à notre', 2), ('de la viande', 2), ("d' origine animale", 2), ('( de soja', 2), ('huile de cajou', 2), (', nous ne', 2), ('nous ne serions', 2), ('ne serions pas', 2), ('pour en parler', 2), ('Encore une minorité', 2), ("une minorité d'", 2), ("minorité d' ayatollah", 2), ("d' ayatollah qui", 2), ('pourrir la vie', 2), ('la vie de', 2), ('vie de la', 2)]
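
To run this over every article rather than only the first one, the per-article counts can be summed into one counter. The sketch below is a minimal illustration that assumes each article has already been tokenized into a list of sentences, each of which is a list of token strings. The helper names `ngrams_from_tokens` and `count_ngrams_in_corpus` and the toy data are hypothetical and not part of the lab code.

```python
from collections import Counter

def ngrams_from_tokens(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams_in_corpus(articles, n):
    """Sum n-gram counts over all sentences of all articles."""
    frequencies = Counter()
    for sentences in articles:
        for tokens in sentences:
            frequencies.update(ngrams_from_tokens(tokens, n))
    return frequencies

# Toy data (hypothetical): two articles, each a list of tokenized sentences
toy_articles = [
    [["huile", "de", "noix", "de", "cajou"]],
    [["lait", "de", "noix", "de", "cajou"], ["et", "alors", "?"]],
]
toy_counts = count_ngrams_in_corpus(toy_articles, 3)
print(toy_counts.most_common(3))
```

With stanza, the token lists for one article could be built as `[[token.text for token in s.tokens] for s in nlp(text).sentences]`. Ngrams are still built per sentence, so they never cross a sentence boundary.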


## 2. Stopwords

The most frequent words in a text are usually stopwords, that is, function words such as determiners, prepositions, and pronouns. For some research questions, it might make sense to ignore the stopwords.

**Search for the commonly used stopwords for your target language. Discuss how stopword removal affects the interpretation of the ngram statistics.** 

In [16]:
import string

# These are the stopwords defined for French in the nltk module; I added the elided forms "d'" and "l'"
stopwords = ['au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', "d'",'elle', 'en', 'et', 'eux', 'il', 'ils', 'je', 'la', 'le', 'les',"l'", 'leur', 'lui', 'ma', 'mais', 'me', 'même', 'mes', 'moi', 'mon', 'ne', 'nos', 'notre', 'nous', 'on', 'ou', 'par', 'pas', 'pour', 'qu', 'que', 'qui', 'sa', 'se', 'ses', 'son', 'sur', 'ta', 'te', 'tes', 'toi', 'ton', 'tu', 'un', 'une', 'vos', 'votre', 'vous', 'c', 'd', 'j', 'l', 'à', 'm', 'n', 's', 't', 'y', 'été', 'étée', 'étées', 'étés', 'étant', 'étante', 'étants', 'étantes', 'suis', 'es', 'est', 'sommes', 'êtes', 'sont', 'serai', 'seras', 'sera', 'serons', 'serez', 'seront', 'serais', 'serait', 'serions', 'seriez', 'seraient', 'étais', 'était', 'étions', 'étiez', 'étaient', 'fus', 'fut', 'fûmes', 'fûtes', 'furent', 'sois', 'soit', 'soyons', 'soyez', 'soient', 'fusse', 'fusses', 'fût', 'fussions', 'fussiez', 'fussent', 'ayant', 'ayante', 'ayantes', 'ayants', 'eu', 'eue', 'eues', 'eus', 'ai', 'as', 'avons', 'avez', 'ont', 'aurai', 'auras', 'aura', 'aurons', 'aurez', 'auront', 'aurais', 'aurait', 'aurions', 'auriez', 'auraient', 'avais', 'avait', 'avions', 'aviez', 'avaient', 'eut', 'eûmes', 'eûtes', 'eurent', 'aie', 'aies', 'ait', 'ayons', 'ayez', 'aient', 'eusse', 'eusses', 'eût', 'eussions', 'eussiez', 'eussent']

def calculate_ngram_frequencies_without_stopwords(n, nlp_output):
    ngram_frequencies = Counter()
    for sentence in nlp_output.sentences:
        # Here we remove stopwords. Note: take some time to understand the syntax of the list comprehension; it is not intuitive.
        tokens = [token.text for token in sentence.tokens if token.text not in stopwords]

        ngrams = [" ".join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
        ngram_frequencies.update(ngrams)
    return ngram_frequencies

n = 2
ngram_frequencies = calculate_ngram_frequencies_without_stopwords(n, nlp_output)
print(ngram_frequencies.most_common(20))

[('noix cajou', 3), (') ,', 3), (', lait', 3), (', alors', 3), ('alors ?', 2), ('vegan ça', 2), ('plus vite', 2), (", c'", 2), ('origine animale', 2), ('( soja', 2), ('huile cajou', 2), ('etc ...', 2), ('chose ?', 2), (', parler', 2), ('Encore minorité', 2), ('minorité ayatollah', 2), ('pourrir vie', 2), ('vie majorité', 2), ('majorité ...', 2), ('Les véganeries', 1)]
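
Note that removing stopwords before building the ngrams joins words that were not adjacent in the text: the bigram 'noix cajou' comes from the phrase 'noix de cajou'. One alternative, sketched below, is to build the ngrams first and then discard those that start or end with a stopword or that contain punctuation. This is only one possible design, not the method used in the lab; the function names, the tiny stopword set, and the example sentence are hypothetical.

```python
import string
from collections import Counter

def is_punctuation(token):
    """True if the token consists only of ASCII punctuation characters."""
    return token != "" and all(ch in string.punctuation for ch in token)

def count_ngrams_filtered(sentences, n, stopwords):
    """Count n-grams, skipping those that start or end with a stopword
    or that contain a punctuation token."""
    frequencies = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            ngram = tokens[i:i + n]
            if ngram[0].lower() in stopwords or ngram[-1].lower() in stopwords:
                continue
            if any(is_punctuation(token) for token in ngram):
                continue
            frequencies[" ".join(ngram)] += 1
    return frequencies

# Toy data (hypothetical)
toy_stopwords = {"de", "la", "et", "d'"}
toy_sentences = [["huile", "de", "noix", "de", "cajou", ",", "lait", "de", "soja"]]
filtered_counts = count_ngrams_filtered(toy_sentences, 3, toy_stopwords)
print(filtered_counts)
```

Stopwords inside an ngram are kept here, so 'noix de cajou' survives as it appears in the text, while fragments such as 'de noix de' are dropped.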


## 3. Normalization

If we want to determine the relative importance of a term for an article, we can normalize its frequency by the frequency of the term in all articles. 

**Important: frequencies need to be calculated for the same ngram size.**

The code currently distinguishes between uppercase and lowercase words. For many languages and tasks, it is useful to lowercase all words. **Think about the influence of casing on your research question.**
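
As a minimal sketch of the effect of casing, the snippet below counts the same tokens with and without lowercasing. The token list is made up for illustration.

```python
from collections import Counter

# Toy tokens (hypothetical): the same words occur with different casing
toy_tokens = ["Encore", "une", "minorité", "encore", "Vegan", "vegan", "vegan"]

cased_counts = Counter(toy_tokens)
lowercased_counts = Counter(token.lower() for token in toy_tokens)

# "Vegan" and "vegan" are separate terms unless we lowercase
print(cased_counts["vegan"], lowercased_counts["vegan"])
```

Lowercasing merges sentence-initial forms with their lowercase variants, but it can also merge names with ordinary words, which matters if named entities are relevant for your research question.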

In [17]:
import pickle

frequencies_currentarticle = calculate_ngram_frequencies(1, nlp_output)
# You calculated the document frequencies in an earlier lab
with open("../data/processed_data/tokenfrequencies.pkl", "rb") as infile:
    frequencies_dataset = pickle.load(infile)

normalized_frequencies = Counter()
for term, freq in frequencies_currentarticle.items():
    # Remove stopwords and punctuation? --> experimental choice
    if term not in stopwords and term not in string.punctuation:
        normalized_frequency = float(freq/frequencies_dataset[term])
        normalized_frequencies[term] = normalized_frequency
    
print(normalized_frequencies.most_common(100))

ZeroDivisionError: division by zero

The code currently throws a *ZeroDivisionError*. **What does that mean and how can you fix it?** 
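
One possible explanation, assuming `frequencies_dataset` is a `Counter`: a `Counter` returns 0 for keys it has never seen, so any term of the current article that is missing from the dataset counts (for example because of different tokenization or casing) causes a division by zero. The sketch below shows two hedged ways to handle this on toy data: skipping unseen terms, or falling back to the article's own count. The function name and all numbers are hypothetical.

```python
from collections import Counter

def normalize_frequencies(article_counts, dataset_counts, skip_unseen=True):
    """Divide each article count by the dataset count, guarding against zero."""
    normalized = Counter()
    for term, freq in article_counts.items():
        dataset_freq = dataset_counts.get(term, 0)
        if dataset_freq == 0:
            if skip_unseen:
                continue
            # Fallback: assume the term occurs only in this article
            dataset_freq = freq
        normalized[term] = freq / dataset_freq
    return normalized

# Toy data (hypothetical): "ayatollah" is missing from the dataset counts
toy_article = Counter({"cajou": 3, "vegan": 2, "ayatollah": 1})
toy_dataset = Counter({"cajou": 4, "vegan": 40})

skipped = normalize_frequencies(toy_article, toy_dataset)
kept = normalize_frequencies(toy_article, toy_dataset, skip_unseen=False)
print(skipped.most_common())
print(kept.most_common())
```

Which option is appropriate is an experimental choice. It is also worth checking why a term is missing: if the dataset frequencies were computed with a different tokenization or casing, fixing that mismatch is better than guarding the division.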

## 4. Named Entity Recognition

The stanza pipeline performs named entity recognition. On the token level, a named entity label can be split into the position and the category. **The named entity labels might vary across languages because they depend on the labels used in the training dataset. Check [the documentation](https://stanfordnlp.github.io/stanza/available_models.html) for your language (scroll down to NER).** We are trying out the English article here.

In [24]:
import stanza
import pandas as pd
language = "en"
article_file = "../data/veganism_overview_" + language +".tsv"
content = pd.read_csv(article_file, sep="\t", header = 0, keep_default_na=False)

# Prepare the nlp pipeline
stanza.download(language)
nlp = stanza.Pipeline(language)

current_article = content["Text"][0]
nlp_output = nlp(current_article)

2023-11-06 12:07:36 INFO: Downloading default packages for language: en (English) ...
2023-11-06 12:07:37 INFO: File exists: /Users/lisabeinborn/stanza_resources/en/default.zip
2023-11-06 12:07:39 INFO: Finished downloading models and saved to /Users/lisabeinborn/stanza_resources.
2023-11-06 12:07:39 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2023-11-06 12:07:41 INFO: Loading these models for language: en (English):
| Processor    | Package             |
--------------------------------------
| tokenize     | combined            |
| pos          | combined_charlm     |
| lemma        | combined_nocharlm   |
| constituency | ptb3-revised_charlm |
| depparse     | combined_charlm     |
| sentiment    | sstplus             |
| ner          | ontonotes_charlm    |

2023-11-06 12:07:41 INFO: Using device: cpu
2023-11-06 12:07:41 INFO: Loading: tokenize
2023-11-06 12:07:41 INFO: Loading: pos
2023-11-06 12:07:41 INFO: Loading: lemma
2023-11-06 12:07:41 INFO: Loading: constituency
2023-11-06 12:07:42 INFO: Loading: depparse
2023-11-06 12:07:42 INFO: Loading: sentiment
2023-11-06 12:07:42 INFO: Loading: ner
2023-11-06 12:07:43 INFO: Done loading processors!


In [26]:
sentences = nlp_output.sentences

# For the example, I only look at the first five sentences. Make sure to change this.
for sentence in sentences[0:5]:
    print()
    print(sentence.text)
    print()
    for token in sentence.tokens:
        if token.ner != "O":

            # This shows us the labels on the token level
            print(token.ner, token.text)
            # Split only on the first hyphen, in case a category name contains one
            position, category = token.ner.split("-", 1)

            # Code to combine token labels into entity labels
            if (position == "S"):
                print("Single-token entity: " + category, token.text)
                print("----")
            if (position == "B"):
                current_token = token.text
            if (position == "I"):
                current_token = current_token + " " + token.text
            if (position == "E"):
                current_token = current_token + " " + token.text
                print("Multi-token entity: " + category, current_token)
                print("----")




Thirty years ago, a few Indian-Americans got together to form Vegetarian Vision when they saw more and more Indian immigrants becoming non-vegetarian of the difficulty accessing their traditional Indian products.

B-DATE Thirty
I-DATE years
E-DATE ago
Multi-token entity: DATE Thirty years ago
----
B-NORP Indian
I-NORP -
E-NORP Americans
Multi-token entity: NORP Indian - Americans
----
B-ORG Vegetarian
E-ORG Vision
Multi-token entity: ORG Vegetarian Vision
----
S-NORP Indian
Single-token entity: NORP Indian
----
S-NORP Indian
Single-token entity: NORP Indian
----

“People coming from India couldn’t find enough vegetarian food.

S-GPE India
Single-token entity: GPE India
----

So they were changing their lifestyle.


We felt an organization like this was needed,” Chairman H.K.

S-PERSON H.K
Single-token entity: PERSON H.K
----

Shah, founder of Vegetarian Vision founded in 1992, now called World Vegan Vision, told News India Times.

S-PERSON Shah
Single-token entity: PERSON Shah
----
B-

In [27]:
# We can also directly access the entities
for sentence in sentences[0:2]:
    for entity in sentence.entities:
        print(entity.type, entity.text)

DATE Thirty years ago
NORP Indian-Americans
ORG Vegetarian Vision
NORP Indian
NORP Indian
GPE India


**What role do named entities play in your dataset?** How can you adjust the frequency calculations to make sure that named entities consisting of multiple words are treated as a single term?
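
One way to approach the second question is to merge the tokens of each multi-token entity before counting. The sketch below assumes token-level labels in the B/I/E/S/O scheme shown in the output above and works on plain `(text, label)` pairs; the function name `merge_entity_tokens` and the toy input are hypothetical.

```python
from collections import Counter

def merge_entity_tokens(tagged_tokens):
    """Merge B/I/E-tagged tokens so that each multi-token entity becomes one term."""
    terms = []
    current_entity = []
    for text, tag in tagged_tokens:
        position = tag.split("-", 1)[0]
        if position == "B":
            current_entity = [text]
        elif position == "I":
            current_entity.append(text)
        elif position == "E":
            current_entity.append(text)
            terms.append(" ".join(current_entity))
            current_entity = []
        else:
            # "O" (no entity) and "S-..." (single-token entity) stay as they are
            terms.append(text)
    return terms

# Toy input (hypothetical), labelled like the token-level output above
toy_tagged = [
    ("Thirty", "B-DATE"), ("years", "I-DATE"), ("ago", "E-DATE"), (",", "O"),
    ("a", "O"), ("few", "O"), ("Indian", "S-NORP"), ("immigrants", "O"),
    ("founded", "O"), ("Vegetarian", "B-ORG"), ("Vision", "E-ORG"), (".", "O"),
]
merged_terms = merge_entity_tokens(toy_tagged)
term_counts = Counter(merged_terms)
print(merged_terms)
```

With stanza, the pairs could be built as `[(token.text, token.ner) for token in sentence.tokens]`. Joining with a space yields "Indian - Americans" where `entity.text` gives "Indian-Americans", so using the spans in `sentence.entities` may be preferable if the exact surface form matters.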