# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Teaching Assistant

**Credits:** Moreno La Quatra

**Practice 1:** Text processing and topic modeling

# **Text processing**
---
The text processing phase is a preliminary stage where the text to be manipulated is processed to be ready for subsequent analysis.

Text processing usually entails several steps that could possibly include:
- **Language Identification**: identifying the language of a given text.
- **Tokenization**: splitting a given text in several sentences/words.
- **Dependency tree parsing:** analyzing the depencies between words composing the text.
- **Stemming/Lemmatization:** obtain the root form for each word in text.
- **Stopword removal**: removing words that are si commonly used that they carry very little useful information.
- **Part of Speech Tagging:** given a word, retrieve its part of speech (proper noun, common noun or verb).



### Language Identification

| Text                                                                                                                                | Language Code |
|-------------------------------------------------------------------------------------------------------------------------------------|---------------|
| The "Deep Natural Language Processing" course is offered during the first semester of the second year at Politecnico di Torino      | `EN`            |
| Il corso "Deep Natural Language Processing" viene impartito al Politecnico di Torino durante il primo semestre del secondo anno.    | `IT`            |
| Le cours "Deep Natural Language Processing" est enseigné au Politecnico di Torino pendant le premier semestre de la deuxième année. | `FR`            |

**Language Identification** is a crucial prelimiary step because each language has its own characteristics. The knowledge of the main language associated to a given text could be beneficial for all subsequent steps in text processing pipeline.

The data collection used in this first part of the practice is provided [here](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P1/langid_dataset.csv) - [source: Kaggle](https://www.kaggle.com/martinkk5575/language-detection)


# Exercise 1:

Benchmark different language-detection algorithm by computing the accuracy of each approach:
- [FastText](https://pypi.org/project/fastlangid/)
- [LangID](https://github.com/saffsd/langid.py)
- [langdetect](https://pypi.org/project/langdetect/)

**Hint:** language code conversion: [iso639-lang](https://pypi.org/project/iso639-lang/)

For each method report:
- Accuracy
- Average time per example

In [1]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/langid_dataset.csv

--2023-11-01 16:33:12--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/langid_dataset.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12990065 (12M) [text/plain]
Saving to: ‘langid_dataset.csv.2’


2023-11-01 16:33:12 (119 MB/s) - ‘langid_dataset.csv.2’ saved [12990065/12990065]



In [2]:
!pip install iso639-lang langid langdetect fasttext fastlangid nltk spacy scikit-learn gensim



In [3]:
import pandas as pd
from iso639 import Lang
from fastlangid.langid import LID
import langid
from langdetect import detect
import time

fast_text = LID()

dataset = pd.read_csv("langid_dataset.csv")
language_codes = [Lang(x).pt1 for x in dataset["language"]]
dataset["code"] = language_codes
dataset

Unnamed: 0,Text,language,code
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian,et
1,sebes joseph pereira thomas på eng the jesuit...,Swedish,sv
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai,th
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil,ta
4,de spons behoort tot het geslacht haliclona en...,Dutch,nl
...,...,...,...
21995,hors du terrain les années et sont des année...,French,fr
21996,ใน พศ หลักจากที่เสด็จประพาสแหลมมลายู ชวา อินเ...,Thai,th
21997,con motivo de la celebración del septuagésimoq...,Spanish,es
21998,年月，當時還只有歲的她在美國出道，以mai-k名義推出首張英文《baby i like》，由...,Chinese,zh


In [4]:
# Accuracy measures

# FastText
counter = 0
for idx, row in dataset.iterrows():
    sample = row["Text"]
    expected = row["code"]
    predicted = fast_text.predict(sample)
    if expected == predicted:
        counter += 1

print(
    f"FastText algorithm has performed with an accuracy of {100 * counter / len(dataset)}%"
)

# LangID
counter = 0
for idx, row in dataset.iterrows():
    sample = row["Text"]
    expected = row["code"]
    predicted = langid.classify(sample)[0]
    if expected == predicted:
        counter += 1

print(
    f"LangID algorithm has performed with an accuracy of {100 * counter / len(dataset)}%"
)

# langdetect
counter = 0
for idx, row in dataset.iterrows():
    sample = row["Text"]
    expected = row["code"]
    try:
        predicted = detect(sample)
    except:
        continue
    if expected == predicted:
        counter += 1

print(
    f"langdetect algorithm has performed with an accuracy of {100 * counter / len(dataset)}%"
)

FastText algorithm has performed with an accuracy of 92.23636363636363%
LangID algorithm has performed with an accuracy of 95.42727272727272%
langdetect algorithm has performed with an accuracy of 84.3409090909091%


In [5]:
# Time measures

# FastText
start = time.time()
for idx, row in dataset.iterrows():
    predicted = fast_text.predict(row["Text"])
end = time.time()
elapsed = end - start
print(f"FastText algorithm classified the dataset in {elapsed} seconds")

# LangID
start = time.time()
for idx, row in dataset.iterrows():
    predicted = langid.classify(row["Text"])[0]
end = time.time()
elapsed = end - start
print(f"LangID algorithm classified the dataset in {elapsed} seconds")

# LangID
start = time.time()
for idx, row in dataset.iterrows():
    try:
        predicted = detect(row["Text"][0])
    except:
        continue
end = time.time()
elapsed = end - start
print(f"langdetect algorithm classified the dataset in {elapsed} seconds")

FastText algorithm classified the dataset in 4.166727781295776 seconds
LangID algorithm classified the dataset in 42.5226833820343 seconds
langdetect algorithm classified the dataset in 203.21818041801453 seconds


# Exercise 2

For English-written text, apply word-level tokenization. What is the average number of words per sentence?

Implement word-tokenization using both [nltk](https://www.nltk.org/) and [spacy](https://spacy.io/). Report the results for both of them.

For spaCy use the `en_core_web_sm` model.

In [6]:
import nltk

nltk.download("punkt")

import spacy

spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")
import en_core_web_sm

nlp = en_core_web_sm.load()

english_samples = dataset[dataset["code"] == "en"]["Text"]

# nltk
average = 0
for sample in english_samples:
    tokens = nltk.word_tokenize(sample)
    average += len(tokens)

average /= len(english_samples)
print(f"The average number of word tokens per sentence using nltk is {average}.")

# spacy
average = 0
for sample in english_samples:
    tokens = nlp(sample)
    average += len(tokens)

average /= len(english_samples)
print(f"The average number of word tokens per sentence using spacy is {average}.")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


The average number of word tokens per sentence using nltk is 68.752.
The average number of word tokens per sentence using spacy is 72.334.


# Exercise 3

Dependency Parsing aims at analyzing the grammatical structure of sentences. The main goal is to find out related words as well as the type of the relationship between them.

The output of this step is a dependency tree similar to the one reported in the figure below.

![dependency tree](http://www.rangakrish.com/wp-content/uploads/2018/04/Deptree-example2.png)

Use spacy to parse the dependency tree of a **randomly selected** sentence. You can both use English sentences or your native language (if supported in [spaCy](https://spacy.io/usage/models/)). Use [displaCy](https://explosion.ai/demos/displacy) to visualize the result in the notebook.

In [7]:
from spacy import displacy

sample = dataset[dataset["code"] == "en"].sample()["Text"].item()
print(sample)

doc = nlp(sample)
displacy.render(doc, style="dep", jupyter=True, options={"distance": 90})

sayyid yusuf hashim al-rifa’i nasiha li-ikhwaninia ulama’ najd “advice to our brethren the scholars of najd” introduction by msr al-buti with sayyid ‘alawi ahmad al-haddad’s misbah al-anam “the light of mankind” english


# Exercise 4
For the same sentence selected in the previous step apply all the following steps:
1. Lemmatization: convert each word to its root form.
2. Stopword removal: remove language-specific stopwords.
3. Part of Speech Tagging: for each word in the sentence display its part-of-speech.

For each step, print the resulting list on the console.

In [8]:
print(f"Original sentence: {sample}")

doc = nlp(sample)
lemmas = " ".join([token.lemma_ for token in doc])
print(f"Lemmas: {lemmas}")

no_stop = " ".join([token.lemma_ for token in doc if not token.is_stop])
print(f"No stopwords: {no_stop}")

tagged = [(token.lemma_, token.tag_) for token in doc if not token.is_stop]
print(tagged)

Original sentence: sayyid yusuf hashim al-rifa’i nasiha li-ikhwaninia ulama’ najd “advice to our brethren the scholars of najd” introduction by msr al-buti with sayyid ‘alawi ahmad al-haddad’s misbah al-anam “the light of mankind” english
Lemmas: sayyid yusuf hashim al - rifa’i nasiha li - ikhwaninia ulama ' najd " advice to our brother the scholar of najd " introduction by msr al - buti with sayyid ' alawi ahmad al - haddad ’s misbah al - anam " the light of mankind " english
No stopwords: sayyid yusuf hashim al - rifa’i nasiha li - ikhwaninia ulama ' najd " advice brother scholar najd " introduction msr al - buti sayyid ' alawi ahmad al - haddad misbah al - anam " light mankind " english
[('sayyid', 'NN'), ('yusuf', 'NNP'), ('hashim', 'NNP'), ('al', 'NNP'), ('-', 'HYPH'), ('rifa’i', 'NNP'), ('nasiha', 'NNP'), ('li', 'NNP'), ('-', 'HYPH'), ('ikhwaninia', 'NNP'), ('ulama', 'NNP'), ("'", "''"), ('najd', 'NN'), ('"', '``'), ('advice', 'NN'), ('brother', 'NNS'), ('scholar', 'NNS'), ('najd

# **Occurrence-based text representation - TF-IDF**

---

TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It allows to create occurrence-based vector representation for each document.

# Exercise 5
Use TF-IDF to vectorize each sentence in the original data collection. You can choose your preferred implementation for TF-IDF vectorization. It is also available on [SciKit-Learn library](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer


vectorizer = TfidfVectorizer()
corpus = list(dataset["Text"])

X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names_out()
print(X.shape)

(22000, 277719)


# Exercise 6

Build a supervised multi-class language detector using as features the vector obtained by TF-IDF representation. Use 80% of the data to train the language detector and 20% of the data for assessing its accuracy.

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, dataset["language"], test_size=0.2
)
clf = RandomForestClassifier(n_estimators=10)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9177272727272727


# **Topic Modelling**
Occurrence-based representations are high-dimensional, what is the dimension of the generated TF-IDF vector representation?
Topic modelling focuses on caturing latent topics in large document corpora.

The data collection used in this second part of the practice is provided [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv) - [source: Zenodo](https://zenodo.org/record/4282522#.YVdCXcbOOpd)


# Exercise 7

Latent Semantic Indexing (LSI) models underlying concepts by using SVD (Singular Value Decomposition).

Use [gensim](https://radimrehurek.com/gensim/) library to:
1. Create a corpus composed of the headlines contained in the data collection.
2. Generate a [dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) to create a word -> id mapping (required by LSI module).
3. Using the dictionary, preprocess the corpus to obtain the representation required for LSI model training ([documentation here](https://radimrehurek.com/gensim/models/lsimodel.html)).
4. Inspect the top-5 topics generated by the LSI model for the analysed corpus.

In [None]:
# Sometimes google colab bugs and doesn't let me run commands
# this is a workaround found in
# https://stackoverflow.com/questions/56081324/why-are-google-colab-shell-commands-not-working

import locale

print(locale.getpreferredencoding())


def getpreferredencoding(do_setlocale=True):
    return "UTF-8"


if locale.getpreferredencoding() != "UTF-8":
    locale.getpreferredencoding = getpreferredencoding

print(locale.getpreferredencoding())

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv

In [15]:
from gensim.corpora.dictionary import Dictionary
from gensim.models import LsiModel

dataset_tm = pd.read_csv("CovidFake_filtered.csv")
headlines = list(dataset_tm["headlines"])
corpus = [h.split() for h in headlines]

dic = Dictionary(corpus)
corpus_doc2bow = [dic.doc2bow(t) for t in corpus]

lsi_model = LsiModel(corpus_doc2bow, id2word=dic)
lsi_model.print_topics(5)

[(0,
  '0.636*"the" + 0.389*"of" + 0.314*"in" + 0.280*"a" + 0.254*"to" + 0.179*"and" + 0.155*"that" + 0.121*"is" + 0.107*"coronavirus" + 0.106*"A"'),
 (1,
  '0.629*"the" + -0.598*"a" + -0.386*"in" + -0.105*"A" + -0.097*"and" + -0.078*"has" + -0.077*"COVID-19" + -0.071*"video" + -0.070*"on" + -0.064*"been"'),
 (2,
  '-0.709*"to" + 0.633*"of" + -0.137*"and" + -0.123*"is" + -0.071*"for" + -0.069*"be" + -0.054*"that" + -0.051*"are" + -0.049*"from" + -0.048*"due"'),
 (3,
  '0.734*"in" + -0.428*"of" + -0.363*"to" + -0.283*"a" + 0.172*"the" + 0.097*"coronavirus" + -0.046*"from" + 0.040*"was" + -0.037*"COVID-19." + -0.034*"on"'),
 (4,
  '-0.479*"to" + -0.425*"of" + 0.407*"a" + -0.391*"in" + 0.259*"that" + 0.240*"the" + 0.213*"is" + 0.187*"and" + 0.115*"for" + 0.075*"coronavirus"')]

# Exercise 8 (Optional)

The top-scored words contributing to each topic (if no stopword removal is applied) are english common words (e.g., *to, for, in, of, on*..). Moreover, missing punctuation removal could be critical for topic identification. Repeat the same procedure of Ex. 7 by adding preliminary preprocessing step to:
1. **remove stopwords**
2. **strip punctuation**
3. **lowercase all words**

In [16]:
import nltk
from nltk.corpus import stopwords
import string

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
corpus_processed = [
    [w.lower() for w in h if w.lower() not in stop_words] for h in corpus
]
corpus_processed = [
    [w.translate(str.maketrans("", "", string.punctuation)) for w in s]
    for s in corpus_processed
]

dic_processed = Dictionary(corpus_processed)
corpus_processed_doc2bow = [dic_processed.doc2bow(t) for t in corpus_processed]

lsi_model_processed = LsiModel(corpus_processed_doc2bow, id2word=dic_processed)
lsi_model_processed.print_topics(5)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


[(0,
  '0.743*"coronavirus" + 0.406*"covid19" + 0.172*"video" + 0.139*"people" + 0.120*"shows" + 0.120*"facebook" + 0.109*"novel" + 0.107*"claim" + 0.098*"new" + 0.096*"shared"'),
 (1,
  '0.811*"covid19" + -0.544*"coronavirus" + 0.065*"video" + 0.049*"shows" + -0.045*"novel" + -0.040*"new" + 0.039*"hospital" + 0.037*"claims" + 0.037*"facebook" + 0.034*"lockdown"'),
 (2,
  '-0.358*"video" + -0.326*"facebook" + -0.297*"claim" + 0.294*"covid19" + -0.278*"shows" + -0.278*"posts" + -0.267*"times" + 0.250*"coronavirus" + -0.250*"shared" + -0.196*"multiple"'),
 (3,
  '0.652*"video" + 0.300*"shows" + 0.253*"people" + -0.240*"facebook" + -0.237*"posts" + -0.214*"claim" + -0.202*"shared" + -0.174*"times" + -0.152*"multiple" + 0.143*"lockdown"'),
 (4,
  '-0.880*"people" + 0.303*"video" + 0.130*"shows" + 0.097*"coronavirus" + 0.089*"covid19" + -0.073*"lockdown" + -0.072*"government" + -0.068*"virus" + -0.065*"died" + 0.065*"patients"')]

# Exercise 9 (Optional)

Leveraging the same corpus used for LSI model generation, apply LDA modelling setting the number of topics to 5. Display the words most contributing to the those topics according to the LDA model.

In [17]:
from gensim.models import LdaModel

lda_model = LdaModel(corpus_processed_doc2bow, id2word=dic_processed, num_topics=5)
lda_model.print_topics(5)



[(0,
  '0.040*"coronavirus" + 0.023*"covid19" + 0.015*"water" + 0.010*"hospital" + 0.009*"new" + 0.009*"video" + 0.008*"people" + 0.007*"patients" + 0.006*"cure" + 0.006*"infected"'),
 (1,
  '0.032*"coronavirus" + 0.021*"covid19" + 0.008*"indian" + 0.008*"video" + 0.008*"chinese" + 0.007*"outbreak" + 0.007*"minister" + 0.006*"india" + 0.006*"cure" + 0.006*"found"'),
 (2,
  '0.071*"coronavirus" + 0.011*"people" + 0.011*"covid19" + 0.010*"new" + 0.008*"china" + 0.007*"due" + 0.007*"outbreak" + 0.007*"virus" + 0.007*"predicted" + 0.006*"novel"'),
 (3,
  '0.049*"coronavirus" + 0.026*"covid19" + 0.015*"china" + 0.015*"shows" + 0.014*"video" + 0.011*"novel" + 0.011*"wuhan" + 0.010*"people" + 0.010*"vaccine" + 0.009*"infected"'),
 (4,
  '0.036*"coronavirus" + 0.030*"covid19" + 0.013*"people" + 0.009*"video" + 0.009*"new" + 0.008*"chinese" + 0.008*"government" + 0.007*"shows" + 0.006*"china" + 0.006*"says"')]

# Exercise 10 (Optional)

Using [pyLDAvis]() library build an interactive visualization for the trained LDA model.

In [None]:
!pip install pyLDAvis "pandas<2.0.0"

In [26]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()

display_lda = gensimvis.prepare(lda_model, corpus_processed_doc2bow, dic_processed)

pyLDAvis.display(display_lda)

  and should_run_async(code)
