Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

# Lab 2.2: Basic Language Analysis

In the HLT course, you learned how to perform natural language processing steps using the libraries *nltk* and *spacy* in [Lab 1](https://github.com/cltl/ma-hlt-labs/tree/master/lab1.toolkits). Now is a good time to refresh your memory.  
Spacy and NLTK provide trained models for only a limited number of languages. In this lab, we will work with *stanza*, which is available for more than 60 languages. Take a look at the [documentation](https://stanfordnlp.github.io/stanza/) for details. Stanza is optimized for accuracy rather than speed, so processing takes longer than with spacy. 

You are free to choose any of the libraries for your assignments. Just make sure to document your selection. 

In [1]:
import pandas as pd
import stanza

# Read in the data 
language = "fr"
article_file = "../data/veganism_overview_" + language + ".tsv"
content = pd.read_csv(article_file, sep="\t", header=0, keep_default_na=False)

# Prepare the nlp pipeline
stanza.download(language)
nlp = stanza.Pipeline(language)


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json:   0%|   …

2023-10-11 13:45:30 INFO: Downloading default packages for language: fr (French) ...


Downloading https://huggingface.co/stanfordnlp/stanza-fr/resolve/v1.6.0/models/default.zip:   0%|          | 0…

2023-10-11 13:46:17 INFO: Finished downloading models and saved to /Users/lisabeinborn/stanza_resources.
2023-10-11 13:46:17 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json:   0%|   …

2023-10-11 13:46:19 INFO: Loading these models for language: fr (French):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |
| depparse  | combined_charlm   |
| ner       | wikiner           |

2023-10-11 13:46:19 INFO: Using device: cpu
2023-10-11 13:46:19 INFO: Loading: tokenize
2023-10-11 13:46:19 INFO: Loading: mwt
2023-10-11 13:46:19 INFO: Loading: pos
2023-10-11 13:46:19 INFO: Loading: lemma
2023-10-11 13:46:19 INFO: Loading: depparse
2023-10-11 13:46:19 INFO: Loading: ner
2023-10-11 13:46:20 INFO: Done loading processors!


We have loaded the pipeline and can start processing our content. In the example, we only use the first article. **Once you understand how it works, modify the code to iterate through all articles and save the result.** 

In [2]:
# Process the first article
current_article = content["Text"][0]
processed_article = nlp(current_article)

## 1. Tokenization

The stanza pipeline detects sentence boundaries and segments the text into tokens. **Analyze the output and check the tokenization quality.** 

In [3]:
sentences = processed_article.sentences

# For the example, I only look at the first two sentences. Make sure to change this. 
for sentence in sentences[0:2]:
    print("Sentence: ", sentence.text)
    print("Tokens: ")
    for token in sentence.tokens:
        print(token.id[0], token.text)
    print()

Sentence:  Les véganeries vont de plus en plus loin.
Tokens: 
1 Les
2 véganeries
3 vont
4 de
5 plus
6 en
7 plus
8 loin
9 .

Sentence:  Connaissez-vous le "faux mage" à base de lait de noix de cajou ? ... Article sans intérêt.
Tokens: 
1 Connaissez
2 -vous
3 le
4 "
5 faux
6 mage
7 "
8 à
9 base
10 de
11 lait
12 de
13 noix
14 de
15 cajou
16 ?
17 ...
18 Article
19 sans
20 intérêt
21 .



## 2. Token Frequencies

Now, let's count the token frequencies in this article. If necessary, go back to [chapter 10](https://github.com/cltl/python-for-text-analysis/blob/master/Chapters/Chapter%2010%20-%20Dictionaries.ipynb) of the Python course and refresh your skills on how to count with dictionaries and counters.  

In [4]:
from collections import Counter

token_frequencies = Counter()

for sentence in sentences:
    all_tokens = [token.text for token in sentence.tokens]
    token_frequencies.update(all_tokens)
    
# Print the most common tokens
print(token_frequencies.most_common(40))

[('de', 34), (',', 31), ('.', 18), ('la', 18), ('des', 15), ('...', 13), ('et', 13), ('en', 12), ('?', 12), ('à', 11), ('pas', 10), ('le', 9), ("l'", 9), ('produits', 9), ('pour', 9), ('les', 9), ("d'", 7), ('est', 7), ('a', 6), ('vous', 6), ('qui', 6), ('ne', 6), ('ils', 6), ('plus', 5), ('lait', 5), ('cajou', 5), ('alors', 5), ('ça', 5), ('ou', 5), ('notre', 5), ('viande', 5), ('nous', 5), ('mais', 5), ('"', 4), ('noix', 4), ('y', 4), ('soja', 4), ('un', 4), ('une', 4), ("n'", 4)]


The type-token ratio is the number of distinct tokens (types) divided by the total number of tokens. It is an indicator of lexical variation. **Think about example texts with a very high or a very low type-token ratio.**

In [5]:
# Type-token ratio
num_types = len(token_frequencies.keys())
num_tokens = sum(token_frequencies.values())

tt_ratio = num_types/num_tokens

print(num_types, num_tokens)

# Print the type token ratio with 4 decimals
print("%.4f" % tt_ratio)

334 738
0.4526
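
As a starting point for the question above, the following minimal sketch compares a repetitive and a varied toy sentence. The helper `type_token_ratio` and both example sentences are illustrative assumptions that are not part of the lab data, and the texts are split on whitespace instead of being processed with stanza.

```python
def type_token_ratio(tokens):
    """Return the number of distinct tokens divided by the total number of tokens."""
    if not tokens:
        raise ValueError("cannot compute a type-token ratio for an empty text")
    return len(set(tokens)) / len(tokens)


# Toy inputs (assumption: whitespace splitting is good enough for this illustration)
repetitive = "the cat and the dog and the bird and the fish".split()
varied = "quick brown foxes jump over lazy sleeping dogs".split()

low_ratio = type_token_ratio(repetitive)   # 6 types / 11 tokens
high_ratio = type_token_ratio(varied)      # 8 types / 8 tokens

print("%.4f" % low_ratio)    # 0.5455
print("%.4f" % high_ratio)   # 1.0000
```

Keep in mind that the ratio also depends on text length: longer texts tend to repeat more words, so it is safest to compare texts of similar size.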


## 3. Saving as pickle

We can save complex Python objects like dictionaries (or like the processed articles) in a *pickle* file. This can be convenient if you are running a processing step that takes a lot of time and you want to do it only once. You can then save the intermediate output in a pickle file and load it directly when you continue working on it. 

Note that pickle files can also be used to hide harmful code: loading a pickle file can execute arbitrary code. So only open pickle files if you know who created them and what they contain. More information can be found in this [tutorial](https://www.datacamp.com/community/tutorials/pickle-python-tutorial).

When opening files, *w* stands for write, *r* for read and *b* indicates that the file is binary.

In [7]:
import pickle
frequency_file = "../data/processed_data/tokenfrequencies_article1.pkl"
pickle.dump(token_frequencies, open(frequency_file, "wb"))

stanza_objects_file = "../data/processed_data/nlp_article1.pkl"
pickle.dump(processed_article, open(stanza_objects_file, "wb"))

# You can then later load the file like this: 
# loaded_frequencies = pickle.load(open(frequency_file, "rb")) 
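
The round trip of saving and loading can be sketched as follows. This is one possible variant and not part of the original lab: it uses `with` blocks so that the files are closed automatically, and it writes to a temporary directory so that it does not touch your data folder. The names `example_frequencies` and `example_file` are made up for the illustration; the three counts are copied from the output in section 2.

```python
import os
import pickle
import tempfile
from collections import Counter

# Small example counter (counts taken from the frequency output above)
example_frequencies = Counter({"de": 34, "la": 18, "lait": 5})

with tempfile.TemporaryDirectory() as tmp_dir:
    example_file = os.path.join(tmp_dir, "example_frequencies.pkl")

    # "wb": write in binary mode
    with open(example_file, "wb") as outfile:
        pickle.dump(example_frequencies, outfile)

    # "rb": read in binary mode
    with open(example_file, "rb") as infile:
        loaded_frequencies = pickle.load(infile)

# The loaded object is again a Counter with the same content
print(loaded_frequencies == example_frequencies)
```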

## 4. Token frequencies of all articles

In the example, we only used the first article. **Once you understand how it works, modify the code to iterate through all articles and save the result in "../data/processed_data/tokenfrequencies.pkl".** 

Stanza processing takes quite a long time, so you can try it directly with your own dataset if you prefer. Do not save all stanza objects, because the file will get quite big. 

In [None]:
# Initialize variables


# Iterate through all articles
    
    # Skip empty articles
    
    # Process the article with the stanza pipeline
    
    # Iterate through all sentences of the article
    
        # Add the tokens to a counter

# Save the token frequencies as a pickle file
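
If the accumulation pattern is unclear, the following sketch shows it on toy data. It is deliberately not a solution for the exercise: `example_articles` is a made-up list, and the hypothetical helper `whitespace_tokenize` only splits on whitespace as a stand-in for the stanza pipeline. In your own code, the texts come from `content["Text"]` and the tokens from the processed sentences.

```python
from collections import Counter

# Made-up stand-in data; the second "article" is empty on purpose
example_articles = ["le lait de soja", "", "le lait de noix de cajou"]


def whitespace_tokenize(text):
    """Stand-in for the stanza pipeline: split on whitespace only."""
    return text.split()


all_frequencies = Counter()
num_skipped = 0

for article in example_articles:
    # Skip empty or whitespace-only articles
    if not article.strip():
        num_skipped += 1
        continue

    # One shared counter collects the tokens of all articles
    all_frequencies.update(whitespace_tokenize(article))

print(all_frequencies.most_common(3))
print("Skipped articles:", num_skipped)
```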


## 5. More linguistic processing

The pipeline provides a large amount of additional information. Let's check the representation of the first sentence. **Try to understand which information is represented by the attributes *lemma*, *upos*, *feats*, *head*, *deprel* and *ner*.** We will learn more about this next week, but you can already collect some statistics over the information.   

In [None]:
print(sentences[0])
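
Collecting statistics over such attributes follows the same counting pattern as before. The sketch below counts part-of-speech tags on hand-written `(text, upos)` pairs. These pairs are an assumption made for illustration and are not stanza output. With the real pipeline, you would iterate over `sentence.words` and read `word.upos` instead.

```python
from collections import Counter

# Hand-written (text, upos) pairs for illustration; this is NOT stanza output
example_words = [
    ("Les", "DET"),
    ("produits", "NOUN"),
    ("sont", "AUX"),
    ("bons", "ADJ"),
    (".", "PUNCT"),
    ("Le", "DET"),
    ("lait", "NOUN"),
]

# Count how often each part-of-speech tag occurs
upos_frequencies = Counter(upos for _, upos in example_words)

print(upos_frequencies.most_common())
```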