<a href="https://colab.research.google.com/github/francescodisalvo05/polito-nlp/blob/main/Labs/Lab_01_text_processing_and_topic_modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import Everything

In [1]:
# install
!pip install fastlangid
!pip install iso639-lang
!pip install langdetect
!pip install langid



In [17]:

import re
import pandas as pd

from iso639 import Lang
from langdetect import detect
from fastlangid.langid import LID
import langid

import spacy
from spacy import displacy

import nltk
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
from nltk.corpus import stopwords

nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

from gensim.corpora import Dictionary
from gensim.models import LdaModel, LsiModel

import pyLDAvis
from pyLDAvis import gensim_models
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 1:** Text processing and topic modeling

# **Text processing**
---
The text processing phase is a preliminary stage where the text to be manipulated is processed to be ready for subsequent analysis.

Text processing usually entails several steps, which may include:
- **Language Identification**: identifying the language of a given text.
- **Tokenization**: splitting a given text into sentences or words.
- **Dependency tree parsing:** analyzing the dependencies between the words composing the text.
- **Stemming/Lemmatization:** obtaining the root form of each word in the text.
- **Stopword removal**: removing words that are so commonly used that they carry very little useful information.
- **Part of Speech Tagging:** given a word, retrieving its part of speech (e.g., proper noun, common noun or verb).



### Language Identification

| Text                                                                                                                                | Language Code |
|-------------------------------------------------------------------------------------------------------------------------------------|---------------|
| The "Deep Natural Language Processing" course is offered during the first semester of the second year at Politecnico di Torino      | `EN`            |
| Il corso "Deep Natural Language Processing" viene impartito al Politecnico di Torino durante il primo semestre del secondo anno.    | `IT`            |
| Le cours "Deep Natural Language Processing" est enseigné au Politecnico di Torino pendant le premier semestre de la deuxième année. | `FR`            |

**Language Identification** is a crucial preliminary step because each language has its own characteristics. Knowing the main language of a given text benefits all subsequent steps in the text processing pipeline.

The data collection used in this first part of the practice is provided [here](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P1/langid_dataset.csv) - [source: Kaggle](https://www.kaggle.com/martinkk5575/language-detection)

# Exercise 1:

Benchmark different language-detection algorithms by computing the accuracy of each approach:
- [FastText](https://pypi.org/project/fastlangid/)
- [LangID](https://github.com/saffsd/langid.py)
- [langdetect](https://pypi.org/project/langdetect/)

**Hint:** language code conversion: [iso639-lang](https://pypi.org/project/iso639-lang/)

For each method report:
- Accuracy
- Average time per example
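
The two metrics requested above can be collected with a small helper. The following is a minimal sketch, not part of the original practice: `benchmark` and `toy_detector` are hypothetical names, and the toy detector is only a stand-in for a real library call.

```python
import time


def benchmark(predict_one, texts, labels):
    """Return (accuracy, average seconds per example) for a single-text predictor."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")

    correct = 0
    start = time.perf_counter()
    for text, label in zip(texts, labels):
        if predict_one(text) == label:
            correct += 1
    elapsed = time.perf_counter() - start

    n = len(texts)
    return correct / n, elapsed / n


# Hypothetical stand-in detector, for illustration only.
def toy_detector(text):
    return "Italian" if " il " in f" {text.lower()} " else "English"


texts = [
    "Il corso viene impartito al Politecnico",
    "The course is offered in Turin",
]
labels = ["Italian", "English"]

accuracy, avg_seconds = benchmark(toy_detector, texts, labels)
print(f"accuracy={accuracy:.2f}, avg time per example={avg_seconds * 1e3:.4f} ms")
```

With the notebook's data, a real detector could be passed instead, for example `lambda t: Lang(langid.classify(t)[0]).name` together with `list(X)` and `list(y)`.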

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/langid_dataset.csv

Your code

In [3]:
df = pd.read_csv('langid_dataset.csv')

y = df.language
X = df.Text

### FastText

In [6]:
%%time

fastlangid = LID()

results_FastText = fastlangid.predict(X)

# several Chinese variants share the "zh" prefix; collapse them to "zh"
results_FastText = ["zh" if x[:2] == "zh" else x for x in results_FastText]
results_name_FastText = [Lang(x).name for x in results_FastText]

print("Accuracy for FastText : ", accuracy_score(y, results_name_FastText))
print("---------------------------------")

Accuracy for FastText :  0.9677272727272728
---------------------------------
CPU times: user 3.58 s, sys: 20.5 ms, total: 3.6 s
Wall time: 3.61 s


### LangID

In [8]:
%%time

results_langid = [Lang(langid.classify(x)[0]).name for x in X]

print("Accuracy for LangID : ", accuracy_score(y, results_langid))
print("--------------------------------------")

Accuracy for LangID :  0.9542727272727273
--------------------------------------
CPU times: user 1min 15s, sys: 53.6 s, total: 2min 9s
Wall time: 1min 7s


### LangDetect

In [9]:
%%time
results_LangDetect = []

# four rows raise the error "no features in text":
# one is blank and the other three are Arabic
# >> these rows are skipped and excluded from the accuracy

errors_idxs = []
y_cleaned = y.copy()

for i, x in enumerate(X):

    try:
        language = detect(x)

        # collapse Chinese variants (zh-cn, zh-tw) to "zh"
        if language.startswith("zh"):
            language = "zh"

        results_LangDetect.append(Lang(language).name)

    except Exception:
        # remember the failing row and drop it from the labels
        errors_idxs.append(i)
        y_cleaned.drop(i, inplace=True)

print("Accuracy for LangDetect : ", accuracy_score(y_cleaned, results_LangDetect))
print("--------------------------------------")

Accuracy for LangDetect :  0.8752557166886393
--------------------------------------
CPU times: user 2min 29s, sys: 1.61 s, total: 2min 30s
Wall time: 2min 30s


# Exercise 2

For English-written text, apply word-level tokenization. What is the average number of words per sentence?

Implement word-tokenization using both [nltk](https://www.nltk.org/) and [spacy](https://spacy.io/). Report the results for both of them.

For spaCy use the `en_core_web_sm` model.

### Filter English Text

In [10]:
nlp = spacy.load("en_core_web_sm")

In [11]:
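# keep only the English rows; .copy() avoids pandas' SettingWithCopyWarning
X_eng = df.loc[df["language"] == "English", ["Text"]].copy()

# tokenize only the English texts rather than the whole collection
X_eng["num_tokens_nltk"] = X_eng["Text"].apply(lambda x: len(word_tokenize(x)))
X_eng["num_tokens_spacy"] = X_eng["Text"].apply(lambda x: len(nlp(x)))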

In [12]:
X_eng.head()

|    | Text                                              | num_tokens_nltk | num_tokens_spacy |
|----|---------------------------------------------------|-----------------|------------------|
| 37 | in johnson was awarded an american institute ...  | 28              | 30               |
| 40 | bussy-saint-georges has built its identity on ... | 40              | 50               |
| 76 | minnesotas state parks are spread across the s... | 132             | 144              |
| 90 | nordahl road is a station served by north coun... | 59              | 59               |
| 97 | a talk by takis fotopoulos about the internati... | 50              | 53               |


In [13]:
X_eng.mean(axis=0, numeric_only=True)


num_tokens_nltk     68.738
num_tokens_spacy    72.334
dtype: float64

# Exercise 3

Use spaCy to parse the dependency tree of a **randomly selected** sentence. You can use either an English sentence or one in your native language (if supported in [spaCy](https://spacy.io/usage/models/)). Use [displaCy](https://explosion.ai/demos/displacy) to visualize the result in the notebook.

In [14]:
nlp = spacy.load("en_core_web_sm")

sentence = "Hello, I am Francesco and I am studying at Politecnico!"

doc = nlp(sentence)
displacy.render(doc, style="dep", jupyter=True)

In [15]:
displacy.render(doc, style="ent", jupyter=True)

# Exercise 4
For the same sentence selected in the previous step, apply all the following steps:
1. Lemmatization: convert each word to its root form.
2. Stopword removal: remove language-specific stopwords.
3. Part of Speech Tagging: for each word in the sentence display its part-of-speech.

For each step, print the resulting list on the console.

In [None]:
lemmatizer = WordNetLemmatizer()

lemmatized_tokens = []
cleaned_tokens = []

stopwords_eng = stopwords.words('english')

# use a different sentence so that lemmatization has a visible effect ("studies" -> "study")
sentence = "He is Francesco and he studies at Politecnico"
tokens = word_tokenize(sentence)

# step 0 : print sentence
print(f"Step 0 : {sentence}")

# step 1 : Lemmatization
for token in tokens:
    
    # the lemma is a regular dictionary word, so we lemmatize first
    # and then match the lemmas against the stopword list
    # (without a POS argument, WordNet treats every token as a noun)
    lemmatized_tokens.append(lemmatizer.lemmatize(token))

print(f"Step 1 : {' '.join(lemmatized_tokens)}")


# step 2 : Stopwords removal
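for l_token in lemmatized_tokens:

    # note: NLTK stopwords are lowercase, so this comparison is
    # case-sensitive and the capitalized "He" is kept
    if l_token not in stopwords_eng:
        cleaned_tokens.append(l_token)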

print(f"Step 2 : {' '.join(cleaned_tokens)}")
    
# step 3 : POS tagging
print(f"Step 3 : {nltk.pos_tag(cleaned_tokens)}")

Step 0 : He is Francesco and he studies at Politecnico
Step 1 : He is Francesco and he study at Politecnico
Step 2 : He Francesco study Politecnico
Step 3 : [('He', 'PRP'), ('Francesco', 'NNP'), ('study', 'NN'), ('Politecnico', 'NNP')]


# **Occurrence-based text representation - TF-IDF**

---
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It makes it possible to build an occurrence-based vector representation for each document.
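
To make the measure concrete, here is a minimal sketch of the textbook formulation, tf(t, d) · log(N / df(t)), on a toy corpus invented for illustration. The function name `tfidf` is hypothetical. Note that scikit-learn's `TfidfVectorizer` defaults differ (smoothed IDF and L2 normalization), so its numbers will not match this sketch.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute textbook TF-IDF weights for a list of tokenized documents."""
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # tf = relative frequency in the document, idf = log(N / df)
        vectors.append(
            [(counts[t] / len(doc)) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


# Toy corpus, invented for illustration.
docs = [
    "the course is taught in turin".split(),
    "the course covers topic modelling".split(),
    "topic modelling finds latent topics".split(),
]

vocab, vectors = tfidf(docs)
print(len(vocab))  # each vector has one dimension per vocabulary term
```

A term that appears in few documents (such as "turin") receives a larger weight than one that appears in many (such as "the"), and a term absent from a document gets weight zero.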

# Exercise 5
Use TF-IDF to vectorize each sentence in the original data collection. You can choose your preferred implementation for TF-IDF vectorization; one is available in the [scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [None]:
vectorizer = TfidfVectorizer()
X_vec = vectorizer.fit_transform(X)

# Exercise 6

Build a supervised multi-class language detector using as features the vector obtained by TF-IDF representation. Use 80% of the data to train the language detector and 20% of the data for assessing its accuracy.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_vec,y, test_size=0.2, stratify=y)

model = LogisticRegression().fit(X_train,y_train)
results = model.predict(X_test)

print(accuracy_score(y_test,results))

0.9518181818181818
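
The cell above splits vectors that were fitted on the whole collection, so the test documents also contribute to the vocabulary and IDF statistics. One possible alternative, sketched here on toy sentences invented for illustration, is to split the raw texts first and wrap the vectorizer and classifier in a scikit-learn pipeline, so that TF-IDF is fitted on the training portion only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data invented for illustration; the notebook would use df.Text and df.language.
texts = [
    "the course is offered in the first semester",
    "the students attend the lecture every week",
    "this is a course about language processing",
    "the exam is at the end of the year",
    "the teacher explains the topic in class",
    "il corso viene impartito nel primo semestre",
    "gli studenti seguono la lezione ogni settimana",
    "questo è un corso di elaborazione del linguaggio",
    "l'esame è alla fine dell'anno",
    "il docente spiega l'argomento in aula",
]
labels = ["English"] * 5 + ["Italian"] * 5

# split the raw texts, keeping the class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

# the vectorizer is fitted inside the pipeline, on the training texts only
detector = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
detector.fit(X_train, y_train)

predictions = detector.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
```

On a dataset this small the score is not meaningful; the point is only the order of operations (split, then fit).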


# **Topic Modelling**

Occurrence-based representations are high-dimensional. What is the dimension of the generated TF-IDF vector representation?
Topic modelling focuses on capturing latent topics in large document corpora.

The data collection used in this second part of the practice is provided [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv) - [source: Zenodo](https://zenodo.org/record/4282522#.YVdCXcbOOpd)


# Exercise 7

Latent Semantic Indexing (LSI) models underlying concepts by using SVD (Singular Value Decomposition).
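
As a rough illustration of the idea, the sketch below applies a dense SVD with NumPy to a tiny term-document matrix and keeps the first `k` singular directions as topics. The matrix values and term list are invented, and this dense decomposition is for illustration only; it is not how gensim's `LsiModel` is implemented internally.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents); values invented.
terms = ["covid", "vaccine", "lockdown", "video", "facebook"]
A = np.array(
    [
        [2, 1, 1, 0],
        [1, 2, 0, 0],
        [1, 0, 2, 0],
        [0, 0, 0, 2],
        [0, 0, 1, 1],
    ],
    dtype=float,
)

# A = U @ diag(s) @ Vt, with singular values in s sorted in decreasing order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent topics to keep
topic_term = U[:, :k]                     # column j weights the terms of topic j
doc_topic = (np.diag(s[:k]) @ Vt[:k]).T   # row i is document i in topic space

for j in range(k):
    # terms with the largest absolute weight in topic j
    top = np.argsort(-np.abs(topic_term[:, j]))[:3]
    print(f"Topic {j + 1}:", ", ".join(terms[i] for i in top))
```

The sign of each singular vector is arbitrary, which is why the terms are ranked by absolute weight.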

Use [gensim](https://radimrehurek.com/gensim/) library to:
1. Create a corpus composed of the headlines contained in the data collection.
2. Generate a [dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) to create a word -> id mapping (required by LSI module).
3. Using the dictionary, preprocess the corpus to obtain the representation required for LSI model training ([documentation here](https://radimrehurek.com/gensim/models/lsimodel.html)).
4. Inspect the top-5 topics generated by the LSI model for the analysed corpus.
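
Steps 2 and 3 above can be pictured with a plain-Python sketch of a word -> id mapping and a bag-of-words conversion. The helper names `build_dictionary` and `doc2bow` are hypothetical, and the id assignment order is an assumption; gensim's `Dictionary` may number tokens differently.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc2bow(doc, token2id):
    """Convert a tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Toy tokenized headlines, invented for illustration.
docs = [["covid", "19", "vaccine", "covid"], ["vaccine", "claim"]]

token2id = build_dictionary(docs)
bow = [doc2bow(d, token2id) for d in docs]
print(bow[0])  # [(0, 2), (1, 1), (2, 1)]
```

Each document becomes a sparse list of pairs, which is the kind of input the LSI model is trained on.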

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv

In [18]:
def preprocess_text(documents):
  """
  :param documents: (iterable of str) documents to clean
  :return: (list of list of str) cleaned tokens for each document
           - lowercased and tokenized (punctuation removed)
           - lemmatized
           - stopwords removed
  """

  # tokenizer that removes punctuation
  tokenizer = nltk.RegexpTokenizer(r"\w+")
  lemmatizer = WordNetLemmatizer()
  # a set makes the membership test below faster
  stopwords_eng = set(stopwords.words('english'))

  cleaned_docs = []

  for curr_doc in documents:

    # lowercase
    curr_doc = curr_doc.lower()
    # tokenize
    curr_tokens = tokenizer.tokenize(curr_doc)
    # lemmatize (note: this maps "has"/"was" to "ha"/"wa",
    # which are not in the stopword list and therefore survive)
    curr_lemmas = [lemmatizer.lemmatize(t) for t in curr_tokens]
    # remove stopwords
    cleaned_tokens = [t for t in curr_lemmas if t not in stopwords_eng]

    cleaned_docs.append(cleaned_tokens)

  return cleaned_docs

In [19]:
# step 0
df_covid = pd.read_csv('CovidFake_filtered.csv').drop(columns=['Unnamed: 0'])

# step 1
headlines = df_covid.headlines
headlines_cleaned = preprocess_text(headlines)

# step 2
dct = Dictionary(headlines_cleaned)  # initialize a Dictionary

# step 3

# generate a bag of word corpus
corpus_bow = []
for x in headlines_cleaned:
  corpus_bow.append(dct.doc2bow(x))

# show the (token_id, count) pair at position 10 of the first document
print(corpus_bow[0][10])

(10, 1)


In [17]:
# step 4
lsi_model = LsiModel(corpus_bow, id2word=dct, num_topics=5)  
result_lsi = lsi_model.print_topics(num_topics=5, num_words=5)

In [18]:
# example of result
result_lsi[0]

(0,
 '0.608*"covid" + 0.600*"19" + 0.251*"coronavirus" + 0.153*"ha" + 0.138*"claim"')

In [20]:
def clean_print_result(result):
  for res in result:
    
    # extract the most representative words for that topic
    words = re.findall('[a-z]{1,}', res[1])
    joined_words = ', '.join(words)

    print(f"Topic {res[0] + 1} : {joined_words}")

clean_print_result(result_lsi)

Topic 1 : covid, coronavirus, ha, claim
Topic 2 : coronavirus, covid, ha, claim
Topic 3 : coronavirus, claim, post, facebook, ha
Topic 4 : video, show, ha, people, post
Topic 5 : wa, people, video, show, coronavirus
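
Topics 1 and 2 list only four words because the pattern `[a-z]{1,}` skips numeric tokens such as "19". A possible alternative, sketched below on the topic string printed earlier, is to capture whatever sits between the double quotes.

```python
import re

# Topic string copied from the LSI output shown earlier.
topic = '0.608*"covid" + 0.600*"19" + 0.251*"coronavirus" + 0.153*"ha" + 0.138*"claim"'

# pattern used in clean_print_result: lowercase letters only
letters_only = re.findall(r"[a-z]+", topic)

# alternative: every term enclosed in double quotes, digits included
quoted_terms = re.findall(r'"([^"]+)"', topic)

print(letters_only)  # ['covid', 'coronavirus', 'ha', 'claim']
print(quoted_terms)  # ['covid', '19', 'coronavirus', 'ha', 'claim']
```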


# Exercise 8 (Optional)

If no stopword removal is applied, the top-scored words contributing to each topic are common English words (e.g., *to, for, in, of, on*...). Repeat the procedure of Ex. 7, adding a preliminary preprocessing step to **remove stopwords**.

In [None]:
# already done above

# Exercise 9 (Optional)

Leveraging the same corpus used for LSI model generation, apply LDA modelling, setting the number of topics to 5. Display the words that contribute most to those topics according to the LDA model.

In [21]:
lda_model  = LdaModel(corpus_bow, id2word=dct, num_topics=5)  
result_lda = lda_model.print_topics(num_topics=5, num_words=5)

In [22]:
clean_print_result(result_lda)

Topic 1 : coronavirus, covid, ha, video
Topic 2 : coronavirus, outbreak, show, video, new
Topic 3 : coronavirus, people, covid, lockdown
Topic 4 : coronavirus, china, people, wa, chinese
Topic 5 : covid, coronavirus, cure, ha


# Exercise 10 (Optional)

Using the pyLDAvis library, build an interactive visualization for the trained LDA model.

In [None]:
!pip install pyLDAvis==3.3.0

In [24]:
pyLDAvis.enable_notebook()
vis = gensim_models.prepare(lda_model, corpus_bow, dct, mds="mmds", R=15)
vis



The dashboard is not rendered on GitHub, but [this](https://www.researchgate.net/profile/Tunazzina-Islam/publication/338491108/figure/fig3/AS:845535753818113@1578602841696/Visualization-using-pyLDAVis-Best-viewed-in-electronic-format-zoomed-in.ppm) is an example of a pyLDAvis visualization.