# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Giuseppe Gallipoli

**Credits:** Moreno La Quatra

**Practice 1:** Text processing and topic modeling

# **Text processing**
---
Text processing is a preliminary stage in which raw text is prepared for subsequent analysis.

It usually involves several steps, which may include:
- **Language Identification**: identifying the language of a given text.
- **Tokenization**: splitting a given text into sentences or words.
- **Dependency tree parsing:** analyzing the dependencies between the words composing the text.
- **Stemming/Lemmatization:** obtaining the root form of each word in the text.
- **Stopword removal**: removing words that are so commonly used that they carry very little useful information.
- **Part of Speech Tagging:** given a word, retrieving its part of speech (e.g., proper noun, common noun, or verb).



### Language Identification

| Text                                                                                                                                | Language Code |
|-------------------------------------------------------------------------------------------------------------------------------------|---------------|
| The "Deep Natural Language Processing" course is offered during the first semester of the second year at Politecnico di Torino      | `EN`            |
| Il corso "Deep Natural Language Processing" viene impartito al Politecnico di Torino durante il primo semestre del secondo anno.    | `IT`            |
| Le cours "Deep Natural Language Processing" est enseigné au Politecnico di Torino pendant le premier semestre de la deuxième année. | `FR`            |

**Language Identification** is a crucial preliminary step because each language has its own characteristics. Knowing the main language of a given text benefits all subsequent steps in the text processing pipeline.

The data collection used in this first part of the practice is provided [here](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P1/langid_dataset.csv) - [source: Kaggle](https://www.kaggle.com/martinkk5575/language-detection)


# Exercise 1

Benchmark different language-detection algorithms by computing the accuracy of each approach:
- [fastlangid](https://pypi.org/project/fastlangid/) (built on FastText)
- [LangID](https://github.com/saffsd/langid.py)
- [langdetect](https://pypi.org/project/langdetect/)

**Hint:** language code conversion: [iso639-lang](https://pypi.org/project/iso639-lang/)

For each method report:
- Accuracy
- Average time per example
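
Both metrics can be collected with one small helper. The sketch below is only an illustration: `benchmark` and `toy_detector` are hypothetical names, and the toy detector stands in for a real library call so that the snippet runs without any extra dependency.

```python
import time

def benchmark(predict_fn, texts, labels):
    """Return (accuracy, average seconds per example) for a single-text predictor."""
    start = time.perf_counter()
    predictions = [predict_fn(text) for text in texts]
    elapsed = time.perf_counter() - start

    correct = sum(pred == gold for pred, gold in zip(predictions, labels))
    return correct / len(labels), elapsed / len(texts)


# Hypothetical stand-in for a real detector: answers "it" when the token "il" appears.
def toy_detector(text):
    return "it" if "il" in text.lower().split() else "en"


texts = [
    "il corso è interessante",
    "the course is interesting",
    "il gatto dorme",
    "le cours est intéressant",
]
labels = ["it", "en", "it", "fr"]

accuracy, avg_time = benchmark(toy_detector, texts, labels)
print(f"accuracy={accuracy:.2f}, average time per example={avg_time:.2e} s")
```

With a real library the same helper could be called as, for instance, `benchmark(lambda t: langid.classify(t)[0], df["Text"], df["language"])`, provided the labels have already been converted to ISO 639-1 codes.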

In [1]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/langid_dataset.csv

--2024-10-04 09:53:24--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/langid_dataset.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12990065 (12M) [application/octet-stream]
Saving to: ‘langid_dataset.csv’


2024-10-04 09:53:24 (85.2 MB/s) - ‘langid_dataset.csv’ saved [12990065/12990065]



In [38]:
pip install fastlangid iso639-lang langid langdetect

In [39]:
pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993221 sha256=8ae953c9aa77792231167fafcfd0ad9b976febc74c006c961dc1d2921b98a45d
  Stored in directory: /root/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b5462a7711a

In [1]:
# your code here
import pandas as pd
df = pd.read_csv('langid_dataset.csv')
df.head()

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch


In [3]:
from iso639 import Lang
df["language"] = df["language"].apply(lambda x: Lang(x).pt1)
df.head()

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,et
1,sebes joseph pereira thomas på eng the jesuit...,sv
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,th
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,ta
4,de spons behoort tot het geslacht haliclona en...,nl


In [32]:
from fastlangid.langid import LID

lid_model = LID()  # named to avoid shadowing the `langid` module imported in a later cell
pred = lid_model.predict(df["Text"].tolist())  # Convert the Series to a list

correct = sum(pred_label == label for pred_label, label in zip(pred, df["language"]))
accuracy = correct / len(pred)

print(accuracy)

0.9223636363636364


In [34]:
import langid
pred = [langid.classify(text)[0] for text in df["Text"]]

corrected = sum(pred_label == label for pred_label, label in zip(pred, df["language"]))
accuracy = corrected / len(pred)

print(accuracy)

0.9542727272727273


In [20]:
from langdetect import detect

pred = [detect(text) for text in df["Text"]]

corrected = sum(pred_label == label for pred_label, label in zip(pred, df["language"]))
accuracy = corrected / len(pred)

print(accuracy)

0.997


# Exercise 2

For English-written text, apply word-level tokenization. What is the average number of words per sentence?

Implement word-tokenization using both [nltk](https://www.nltk.org/) and [spacy](https://spacy.io/). Report the results for both of them.

For spaCy use the `en_core_web_sm` model.

In [5]:
#pip install nltk spacy

In [6]:
#!python -m spacy download en_core_web_sm

In [4]:
# your code here
df = pd.read_csv('langid_dataset.csv')
df["language"] = df["language"].apply(lambda x: Lang(x).pt1)
df = df[df['language'] == 'en']
df.head()

Unnamed: 0,Text,language
37,in johnson was awarded an american institute ...,en
40,bussy-saint-georges has built its identity on ...,en
76,minnesotas state parks are spread across the s...,en
90,nordahl road is a station served by north coun...,en
97,a talk by takis fotopoulos about the internati...,en


### using nltk

In [8]:
import nltk
nltk.download('punkt')
tokens = df["Text"].apply(lambda x: nltk.word_tokenize(x))
tokens

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Unnamed: 0,Text
37,"[in, johnson, was, awarded, an, american, inst..."
40,"[bussy-saint-georges, has, built, its, identit..."
76,"[minnesotas, state, parks, are, spread, across..."
90,"[nordahl, road, is, a, station, served, by, no..."
97,"[a, talk, by, takis, fotopoulos, about, the, i..."
...,...
21829,"[on, march, empty, mirrors, press, published, ..."
21879,"[he, [, musk, ], wants, to, go, to, mars, to, ..."
21896,"[overall, the, male, is, black, above, and, wh..."
21897,"[tim, reynolds, born, december, in, wiesbaden,..."


In [14]:
print(tokens.iloc[0])

['in', 'johnson', 'was', 'awarded', 'an', 'american', 'institute', 'of', 'architects', 'gold', 'medal', 'in', 'he', 'became', 'the', 'first', 'recipient', 'of', 'the', 'pritzker', 'architecture', 'prize', 'the', 'most', 'prestigious', 'international', 'architectural', 'award']


In [10]:
tokens.apply(lambda x: len(x)).mean()

68.752

### using spacy

In [17]:
import numpy as np

def spacy_helper(text):
    """Return the average number of non-punctuation tokens per sentence in `text`."""
    spacy_doc = nlp(text)  # `nlp` is the spaCy pipeline loaded in the next cell
    # Count the tokens of each sentence, skipping punctuation
    spacy_word_counts = [
        sum(not token.is_punct for token in sentence) for sentence in spacy_doc.sents
    ]
    return np.mean(spacy_word_counts)

In [19]:
import spacy
nlp = spacy.load("en_core_web_sm")
token_spacy = df["Text"].apply(lambda x: len(nlp(x)))
token_spacy

Unnamed: 0,Text
37,30
40,50
76,144
90,59
97,53
...,...
21829,57
21879,58
21896,85
21897,42


# Exercise 3

Dependency parsing analyzes the grammatical structure of a sentence. Its main goal is to identify which words are related and the type of relationship between them.

The output of this step is a dependency tree similar to the one reported in the figure below.

![dependency tree](http://www.rangakrish.com/wp-content/uploads/2018/04/Deptree-example2.png)

Use spaCy to parse the dependency tree of a **randomly selected** sentence. You can use either an English sentence or one in your native language (if supported in [spaCy](https://spacy.io/usage/models/)). Use [displaCy](https://explosion.ai/demos/displacy) to visualize the result in the notebook.

In [25]:
# your code here
import spacy
from spacy import displacy
import random
nlp = spacy.load("en_core_web_sm")
random_sentence = df['Text'].sample().values[0]

doc = nlp(random_sentence)
displacy.render(doc, style="dep", jupyter=True)

# Exercise 4
For the same sentence selected in the previous exercise, apply all of the following steps:
1. Lemmatization: convert each word to its root form.
2. Stopword removal: remove language-specific stopwords.
3. Part of Speech Tagging: for each word in the sentence display its part-of-speech.

For each step, print the resulting list on the console.

In [26]:
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode)  # 'rule'

rule


In [29]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Apple Apple PROPN NNP nsubj Xxxxx True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP dobj X.X. False False
startup startup NOUN NN dep xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False


In [28]:
# your code here
import pandas as pd
import spacy
from spacy import displacy
import random

def analyze_sentence(df, text_column):
    nlp = spacy.load("en_core_web_sm")

    random_sentence = df[text_column].sample().values[0]
    print(f"Selected Sentence: {random_sentence}")

    doc = nlp(random_sentence)

    lemmas = [token.lemma_ for token in doc]
    print(f"Lemmatized Words: {lemmas}")

    # Stopword Removal
    filtered_words = [token.text for token in doc if not token.is_stop]
    print(f"Words after Stopword Removal: {filtered_words}")

    # Part of Speech Tagging
    pos_tags = [(token.text, token.pos_) for token in doc]
    print("Part of Speech Tags:")
    for word, pos in pos_tags:
        print(f"{word}: {pos}")

    # Visualize the dependency tree
    displacy.render(doc, style="dep", jupyter=True, options={"compact": True})


# Call the function with the DataFrame and the appropriate text column name
analyze_sentence(df, 'Text')  # Adjust 'Text' if the column name is different

Selected Sentence: born in tokyo she studied at waseda university but left before graduating on stage she participated in shoji kokamis troupe daisan butai and also appeared in plays directed by yukio ninagawa and hideki noda she has appeared in over fifty japanese tv dramas including brother beat and the onsen e ikō series
Lemmatized Words: ['bear', 'in', 'tokyo', 'she', 'study', 'at', 'waseda', 'university', 'but', 'leave', 'before', 'graduate', 'on', 'stage', 'she', 'participate', 'in', 'shoji', 'kokamis', 'troupe', 'daisan', 'butai', 'and', 'also', 'appear', 'in', 'play', 'direct', 'by', 'yukio', 'ninagawa', 'and', 'hideki', 'noda', 'she', 'have', 'appear', 'in', 'over', 'fifty', 'japanese', 'tv', 'drama', 'include', 'brother', 'beat', 'and', 'the', 'onsen', 'e', 'ikō', 'series']
Words after Stopword Removal: ['born', 'tokyo', 'studied', 'waseda', 'university', 'left', 'graduating', 'stage', 'participated', 'shoji', 'kokamis', 'troupe', 'daisan', 'butai', 'appeared', 'plays', 'dire

# **Occurrence-based text representation - TF-IDF**

---

TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It yields an occurrence-based vector representation of each document.
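
To make the measure concrete, here is a minimal sketch of the textbook formulation, where the weight of a term in a document is its relative frequency multiplied by `log(N / df)`. The function name `tf_idf` and the toy documents are assumptions made for illustration. Library implementations often use variants: scikit-learn's `TfidfVectorizer`, for example, smooths the IDF and L2-normalizes each vector by default, so its values will differ from the ones computed here.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute textbook TF-IDF weights for a list of tokenized documents."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))

    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # relative term frequency * inverse document frequency
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical), already lowercased and tokenized on whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cats and the dogs".split(),
]

weights = tf_idf(docs)
for term, weight in sorted(weights[0].items(), key=lambda item: -item[1]):
    print(f"{term:>4}: {weight:.3f}")
```

In this toy corpus "the" occurs in every document, so its IDF is `log(1) = 0` and its weight vanishes, while "cat" and "mat", which occur in a single document, receive the highest weights.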

# Exercise 5
Use TF-IDF to vectorize each sentence in the original data collection. You can choose your preferred TF-IDF implementation; one is available in the [scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [33]:
df.head()

Unnamed: 0,Text,language
37,in johnson was awarded an american institute ...,en
40,bussy-saint-georges has built its identity on ...,en
76,minnesotas state parks are spread across the s...,en
90,nordahl road is a station served by north coun...,en
97,a talk by takis fotopoulos about the internati...,en


In [32]:
# your code here
from sklearn.feature_extraction.text import TfidfVectorizer

def vectorize_sentences(df, text_column):
    vectorizer = TfidfVectorizer()

    tfidf_matrix = vectorizer.fit_transform(df[text_column])

    # Extract feature names
    feature_names = vectorizer.get_feature_names_out()

    return tfidf_matrix, feature_names


# Vectorize the sentences in the 'Text' column
tfidf_matrix, feature_names = vectorize_sentences(df, 'Text')

# Display the TF-IDF matrix as a dense array (not recommended for very large datasets)
print("TF-IDF Matrix:")
print(tfidf_matrix.toarray())

# Display the feature names (terms)
print("Feature Names:")
print(feature_names)

TF-IDF Matrix:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
Feature Names:
['aam' 'aamer' 'aardvark' ... '思川駅' 'ﬂoatplane' 'ﬂoats']


# Exercise 6

Build a supervised multi-class language detector using as features the vector obtained by TF-IDF representation. Use 80% of the data to train the language detector and 20% of the data for assessing its accuracy.

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def train_language_detector(df, text_column, label_column):

    # Step 1: Vectorize the sentences using TF-IDF
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(df[text_column])
    y = df[label_column]

    # Step 2: Split the data into training (80%) and testing (20%)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Step 3: Train a multi-class classifier (Logistic Regression)
    classifier = LogisticRegression(max_iter=1000)
    classifier.fit(X_train, y_train)

    # Step 4: Make predictions on the test set
    y_pred = classifier.predict(X_test)

    # Step 5: Calculate the accuracy
    accuracy = accuracy_score(y_test, y_pred)

    return classifier, accuracy


url = "https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/langid_dataset.csv"
df = pd.read_csv(url)

model, accuracy = train_language_detector(df, text_column='Text', label_column='language')

print(f"Model Accuracy: {accuracy:.2f}")

Model Accuracy: 0.96


# **Topic Modeling**
---
Occurrence-based representations are high-dimensional. What is the dimension of the TF-IDF vector representation generated above?
Topic modeling focuses on capturing latent topics in large document corpora.

The data collection used in this second part of the practice is provided [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv) - [source: Zenodo](https://zenodo.org/record/4282522#.YVdCXcbOOpd)


# Exercise 7

Latent Semantic Indexing (LSI) models underlying concepts by using SVD (Singular Value Decomposition).

Use [gensim](https://radimrehurek.com/gensim/) library to:
1. Create a corpus composed of the headlines contained in the data collection.
2. Generate a [dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) to create a word -> id mapping (required by LSI module).
3. Using the dictionary, preprocess the corpus to obtain the representation required for LSI model training ([documentation here](https://radimrehurek.com/gensim/models/lsimodel.html)).
4. Inspect the top-5 topics generated by the LSI model for the analysed corpus.

In [40]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv

--2024-10-04 11:38:27--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1088708 (1.0M) [text/plain]
Saving to: ‘CovidFake_filtered.csv’


2024-10-04 11:38:27 (17.5 MB/s) - ‘CovidFake_filtered.csv’ saved [1088708/1088708]



In [56]:
import pandas as pd

df_covid = pd.read_csv('CovidFake_filtered.csv', index_col=0)
df_covid.head()

Unnamed: 0,headlines,outcome
0,A post claims compulsory vacination violates t...,0
1,A photo claims that this person is a doctor wh...,0
2,Post about a video claims that it is a protest...,0
3,All deaths by respiratory failure and pneumoni...,0
4,The dean of the College of Biologists of Euska...,0


In [54]:
import pandas as pd
import gensim
from gensim import corpora
from gensim.models import LsiModel
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk

# Download stopwords and tokenizer model (if not already done)
nltk.download('punkt')
nltk.download('stopwords')

def train_lsi_model(df, text_column, num_topics=5):
    """
    Trains an LSI model on the text headlines in the given DataFrame.

    Args:
    - df (pd.DataFrame): DataFrame containing text data.
    - text_column (str): Name of the column containing headlines.
    - num_topics (int): Number of topics for the LSI model.

    Returns:
    - lsi_model: Trained LSI model.
    - dictionary: Gensim dictionary used in the LSI model.
    """
    stop_words = set(stopwords.words('english'))

    processed_texts = [
        [word for word in word_tokenize(str(headline).lower()) if word.isalnum() and word not in stop_words]
        for headline in df[text_column]
    ]

    dictionary = corpora.Dictionary(processed_texts)

    corpus = [dictionary.doc2bow(text) for text in processed_texts]

    lsi_model = LsiModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)

    print("Top-5 Topics:")
    for i, topic in enumerate(lsi_model.print_topics(num_topics=num_topics)):
        print(f"Topic #{i + 1}: {topic}")

    return lsi_model, dictionary

url = "https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv"
df = pd.read_csv(url)

# Train the LSI model on the 'headlines' column of the dataset
lsi_model, dictionary = train_lsi_model(df, text_column='headlines', num_topics=5)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Top-5 Topics:
Topic #1: (0, '0.870*"coronavirus" + 0.158*"video" + 0.131*"people" + 0.121*"novel" + 0.113*"facebook" + 0.109*"shows" + 0.105*"new" + 0.101*"claim" + 0.089*"shared" + 0.088*"china"')
Topic #2: (1, '-0.393*"coronavirus" + 0.390*"video" + 0.306*"facebook" + 0.293*"shows" + 0.275*"claim" + 0.255*"posts" + 0.244*"times" + 0.239*"shared" + 0.177*"multiple" + 0.172*"thousands"')
Topic #3: (2, '0.575*"video" + 0.305*"people" + -0.271*"facebook" + -0.264*"posts" + 0.251*"shows" + -0.246*"claim" + -0.221*"shared" + -0.205*"times" + -0.173*"multiple" + -0.159*"novel"')
Topic #4: (3, '-0.807*"people" + 0.405*"video" + 0.191*"shows" + 0.113*"coronavirus" + -0.105*"government" + -0.094*"virus" + -0.079*"president" + -0.076*"died" + -0.071*"vaccine" + -0.067*"pandemic"')
Topic #5: (4, '-0.371*"people" + 0.363*"lockdown" + 0.328*"india" + 0.304*"government" + 0.234*"president" + 0.225*"claims" + -0.173*"shows" + 0.154*"says" + 0.134*"health" + 0.134*"vaccine"')


# Exercise 8

If no stopword removal is applied, the top-scored words contributing to each topic are common English words (e.g., *to, for, in, of, on*). Moreover, leaving punctuation in place can hurt topic identification. Repeat the same procedure as in Exercise 7, adding a preprocessing step to:
1. **remove stopwords**
2. **strip punctuation**
3. **lowercase all words**
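
The three steps can be sketched with the standard library alone. The stopword list below is a tiny illustrative assumption, not a real resource (in practice a full list such as NLTK's English stopwords would be used), and `preprocess` is a hypothetical helper name.

```python
import string

# Tiny illustrative stopword list (an assumption); use a full list in practice.
STOP_WORDS = {"a", "an", "the", "is", "that", "this", "to", "of", "in", "on", "for"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(headline):
    """Lowercase, strip punctuation, split on whitespace, and drop stopwords."""
    cleaned = headline.lower().translate(PUNCT_TABLE)
    return [word for word in cleaned.split() if word not in STOP_WORDS]


# Hypothetical headline, used only to show the effect of each step.
print(preprocess("A video shows that the virus is spreading in Wuhan!"))
# ['video', 'shows', 'virus', 'spreading', 'wuhan']
```

Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes or dashes would need to be handled separately.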

In [59]:
# your code here
import string

def train_lsi_model(df, text_column, num_topics=5):
    """
    Trains an LSI model on the text headlines in the given DataFrame with enhanced preprocessing.

    Args:
    - df (pd.DataFrame): DataFrame containing text data.
    - text_column (str): Name of the column containing headlines.
    - num_topics (int): Number of topics for the LSI model.

    Returns:
    - lsi_model: Trained LSI model.
    - dictionary: Gensim dictionary used in the LSI model.
    """
    # Step 1: Preprocess the text
    stop_words = set(stopwords.words('english'))
    punctuation_set = set(string.punctuation)

    # Tokenize, remove stopwords, punctuation, and lowercase
    processed_texts = [
        [
            word for word in word_tokenize(str(headline).lower())
            if word not in stop_words and word not in punctuation_set and word.isalnum()
        ]
        for headline in df[text_column]
    ]

    # Step 2: Create a dictionary (word -> id mapping)
    dictionary = corpora.Dictionary(processed_texts)

    # Step 3: Create the bag-of-words representation of the corpus
    corpus = [dictionary.doc2bow(text) for text in processed_texts]

    # Step 4: Train the LSI model
    lsi_model = LsiModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)

    # Step 5: Print the top-5 topics generated by the LSI model
    print("Top-5 Topics:")
    for i, topic in enumerate(lsi_model.print_topics(num_topics=num_topics)):
        print(f"Topic #{i + 1}: {topic}")

    return lsi_model, dictionary

# Usage
# Load the dataset
url = "https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv"
df = pd.read_csv(url)

# Train the LSI model on the 'headlines' column of the dataset
lsi_model, dictionary = train_lsi_model(df, text_column='headlines', num_topics=5)

Top-5 Topics:
Topic #1: (0, '0.870*"coronavirus" + 0.158*"video" + 0.131*"people" + 0.121*"novel" + 0.113*"facebook" + 0.109*"shows" + 0.105*"new" + 0.101*"claim" + 0.089*"shared" + 0.088*"china"')
Topic #2: (1, '0.393*"coronavirus" + -0.390*"video" + -0.306*"facebook" + -0.293*"shows" + -0.275*"claim" + -0.255*"posts" + -0.244*"times" + -0.239*"shared" + -0.177*"multiple" + -0.172*"thousands"')
Topic #3: (2, '0.575*"video" + 0.305*"people" + -0.272*"facebook" + -0.264*"posts" + 0.251*"shows" + -0.246*"claim" + -0.221*"shared" + -0.205*"times" + -0.173*"multiple" + -0.159*"novel"')
Topic #4: (3, '-0.806*"people" + 0.405*"video" + 0.191*"shows" + 0.113*"coronavirus" + -0.105*"government" + -0.093*"virus" + -0.079*"president" + -0.076*"died" + -0.071*"vaccine" + -0.067*"pandemic"')
Topic #5: (4, '0.371*"people" + -0.364*"lockdown" + -0.330*"india" + -0.305*"government" + -0.231*"president" + -0.225*"claims" + 0.174*"shows" + -0.152*"says" + -0.134*"health" + -0.133*"vaccine"')


# Exercise 9

Using the same corpus as for the LSI model, train an LDA model with the number of topics set to 5. Display the words that contribute most to each topic according to the LDA model.

In [61]:
# your code here

import string
from gensim.models import LdaModel

def train_lda_model(df, text_column, num_topics=5):
    """
    Trains an LDA model on the text headlines in the given DataFrame.

    Args:
    - df (pd.DataFrame): DataFrame containing text data.
    - text_column (str): Name of the column containing headlines.
    - num_topics (int): Number of topics for the LDA model.

    Returns:
    - lda_model: Trained LDA model.
    - dictionary: Gensim dictionary used in the LDA model.
    """
    # Step 1: Preprocess the text
    stop_words = set(stopwords.words('english'))
    punctuation_set = set(string.punctuation)

    # Tokenize, remove stopwords, punctuation, and lowercase
    processed_texts = [
        [
            word for word in word_tokenize(str(headline).lower())
            if word not in stop_words and word not in punctuation_set and word.isalnum()
        ]
        for headline in df[text_column]
    ]

    # Step 2: Create a dictionary (word -> id mapping)
    dictionary = corpora.Dictionary(processed_texts)

    # Step 3: Create the bag-of-words representation of the corpus
    corpus = [dictionary.doc2bow(text) for text in processed_texts]

    # Step 4: Train the LDA model
    lda_model = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)

    # Step 5: Print the topics generated by the LDA model
    print("Top-5 Topics:")
    for i, topic in enumerate(lda_model.print_topics(num_topics=num_topics)):
        print(f"Topic #{i + 1}: {topic}")

    return lda_model, dictionary

# Usage
# Load the dataset
url = "https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv"
df = pd.read_csv(url)

# Train the LDA model on the 'headlines' column of the dataset
lda_model, dictionary = train_lda_model(df, text_column='headlines', num_topics=5)



Top-5 Topics:
Topic #1: (0, '0.026*"coronavirus" + 0.010*"facebook" + 0.009*"health" + 0.009*"minister" + 0.008*"claim" + 0.007*"people" + 0.007*"whatsapp" + 0.007*"infected" + 0.007*"claims" + 0.006*"times"')
Topic #2: (1, '0.052*"coronavirus" + 0.033*"china" + 0.014*"wuhan" + 0.010*"patients" + 0.010*"people" + 0.009*"vaccine" + 0.007*"chinese" + 0.006*"kill" + 0.005*"lab" + 0.005*"created"')
Topic #3: (2, '0.045*"coronavirus" + 0.010*"video" + 0.010*"new" + 0.009*"people" + 0.009*"masks" + 0.007*"predicted" + 0.007*"wuhan" + 0.007*"outbreak" + 0.005*"shows" + 0.005*"virus"')
Topic #4: (3, '0.059*"coronavirus" + 0.015*"shows" + 0.014*"video" + 0.014*"people" + 0.014*"novel" + 0.012*"water" + 0.009*"chinese" + 0.009*"cure" + 0.009*"infected" + 0.009*"lockdown"')
Topic #5: (4, '0.063*"coronavirus" + 0.015*"new" + 0.012*"video" + 0.011*"people" + 0.008*"shows" + 0.008*"infected" + 0.007*"virus" + 0.006*"president" + 0.006*"outbreak" + 0.005*"hospital"')


# Exercise 10

Using the pyLDAvis library, build an interactive visualization of the trained LDA model.

In [62]:
pip install pyldavis

Collecting pyldavis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl.metadata (4.2 kB)
Collecting funcy (from pyldavis)
  Downloading funcy-2.0-py2.py3-none-any.whl.metadata (5.9 kB)
Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: funcy, pyldavis
Successfully installed funcy-2.0 pyldavis-3.4.1


In [63]:
# your code here
import string

import pyLDAvis
import pyLDAvis.gensim_models
from gensim import corpora
from gensim.models import LdaModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
punctuation_set = set(string.punctuation)

# Step 1: Tokenize, lowercase, and drop stopwords and punctuation.
# The headlines are stored in the 'headlines' column of df.
processed_texts = [
    [
        word for word in word_tokenize(str(headline).lower())
        if word not in stop_words and word not in punctuation_set and word.isalnum()
    ]
    for headline in df['headlines']
]

# Step 2: Create a dictionary (word -> id mapping)
dictionary = corpora.Dictionary(processed_texts)

# Step 3: Create the bag-of-words representation of the corpus
corpus = [dictionary.doc2bow(text) for text in processed_texts]

# Step 4: Train the LDA model
num_topics = 5
lda_model = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary, passes=10, random_state=42)

# Step 5: Print the top topics generated by the LDA model
print("Top Topics:")
for i, topic in enumerate(lda_model.print_topics(num_topics=num_topics)):
    print(f"Topic #{i + 1}: {topic}")

# Step 6: Visualize LDA model using pyLDAvis
pyLDAvis.enable_notebook()

# Preparing the LDA visualization
lda_vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)

# Displaying the LDA visualization
pyLDAvis.display(lda_vis)

# Exercise 11
**Credits:** Giuseppe Gallipoli

#### Introduction
[Large Language Models](https://en.wikipedia.org/wiki/Large_language_model) (LLMs) are a type of deep learning model capable of language generation. These models are built on deep learning architectures, primarily using neural networks, and are trained on massive amounts of text data. LLMs generally leverage the *Transformer* architecture (a groundbreaking deep architecture that you will study in more detail later in the course), which allows them to process language in context, capturing complex relationships between words and concepts.

Large Language Models have demonstrated excellent capabilities across a wide variety of tasks, making them versatile models which can be applied in diverse scenarios and use cases.

Given their relevance, although you have not yet covered this topic in the course, we will provide you, starting from this first laboratory practice, with practical applications showing how LLMs can be used to solve a diverse range of tasks.\
Don't worry about the theoretical or more technical aspects: they will be covered in more detail in due time.\
For now, the most important thing to know is that users interact with LLMs by means of a **prompt**, which is a piece of text containing the instruction or question the user wants to give or ask the model.

#### Topic Modeling using Large Language Models

In this practice, we will use a Large Language Model to address a topic modeling-related task. Specifically, rather than modeling topic distributions as done with techniques like LSI or LDA, we will ask the LLM to extract the most relevant topic(s) from sentences (or from an entire corpus) according to different approaches.

For this task, we will use the [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) 7B model, i.e., `HuggingFaceH4/zephyr-7b-beta`.

**1<sup>st</sup> approach**: Ask the model to identify the topic(s) contained in a given sentence <u>without providing</u> a predefined list of topics to choose from.\
*Example of prompt*:\
Which are the most relevant topics of the following sentence?

\
<u>Suggestion</u>: To increase speed, switch to a GPU runtime. You can do this by clicking on Runtime → Change runtime type → Hardware accelerator → Select T4 GPU.\
If you encounter an `OutOfMemoryError`, try restarting the session by clicking on Runtime → Restart session.

In [1]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv

--2024-10-04 12:14:28--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1088708 (1.0M) [text/plain]
Saving to: ‘CovidFake_filtered.csv’


2024-10-04 12:14:29 (142 MB/s) - ‘CovidFake_filtered.csv’ saved [1088708/1088708]



In [None]:
# fill in the following code

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import pandas as pd

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

df_tmodeling = pd.read_csv("CovidFake_filtered.csv")
# Assumption: the sentences are stored in the first column; check df_tmodeling.head() and adjust if needed
TEXT_COLUMN = df_tmodeling.columns[0]

PROMPT = "Which are the most relevant topics of the following sentence?"

# It may take some time to download the model
model = AutoModelForCausalLM.from_pretrained('HuggingFaceH4/zephyr-7b-beta', torch_dtype=torch.float16, device_map=device)
tokenizer = AutoTokenizer.from_pretrained('HuggingFaceH4/zephyr-7b-beta')

# Inspect a small random sample; iterate over df_tmodeling directly to process every row
for _, row in df_tmodeling.sample(n=5, random_state=42).iterrows():
  sentence = row[TEXT_COLUMN]
  full_prompt = f"{PROMPT} {sentence}"
  inputs = tokenizer(full_prompt, return_tensors='pt').to(device)
  output = model.generate(**inputs, max_new_tokens=32)
  # The decoded text contains the prompt followed by the generated continuation
  output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
  print(output)
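
Decoder-only models return the prompt followed by the generated continuation, so it can be handy to isolate the model's answer before inspecting it. The sketch below does this at the string level; `strip_prompt` is a hypothetical helper, and it assumes that decoding reproduces the prompt text exactly. Slicing the generated token ids past the prompt length before decoding is generally more robust.

```python
def strip_prompt(generated: str, prompt: str) -> str:
    """Return only the text generated after the prompt, if the prompt is echoed."""
    if generated.startswith(prompt):
        return generated[len(prompt):].strip()
    # Fall back to the full text when the prompt is not found verbatim
    return generated.strip()


# Made-up decoded output, used only for illustration
prompt = "Which are the most relevant topics of the following sentence? Masks reduce transmission."
decoded = prompt + "\n\nThe most relevant topics are masks and disease transmission."

print(strip_prompt(decoded, prompt))
```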

**2<sup>nd</sup> approach**: Ask the model to identify the topic(s) contained in a given sentence <u>providing</u> a predefined list of topics to choose from.

*Example of prompt*:\
Which are the most relevant topics of the following sentence?\
Choose among: medicine, COVID, Artificial Intelligence, treatment, English literature, vaccine, gardening
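
Rather than hard-coding the list in the prompt string, the prompt can be assembled from a Python list, which makes it easy to try different sets of topics. This is a minimal sketch; `build_choice_prompt` and `TOPICS` are hypothetical names introduced for illustration.

```python
QUESTION = "Which are the most relevant topics of the following sentence?"
TOPICS = [
    "medicine", "COVID", "Artificial Intelligence", "treatment",
    "English literature", "vaccine", "gardening",
]


def build_choice_prompt(question: str, topics: list[str]) -> str:
    """Append the candidate topics to the question as a comma-separated list."""
    return f"{question} Choose among: {', '.join(topics)}"


choice_prompt = build_choice_prompt(QUESTION, TOPICS)
print(choice_prompt)
```

Replacing `TOPICS` with a list of unrelated topics is a quick way to test how the model behaves when none of the options fit.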

In [None]:
# fill in the following code

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import pandas as pd

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

df_tmodeling = pd.read_csv("CovidFake_filtered.csv")
# Assumption: the sentences are stored in the first column; check df_tmodeling.head() and adjust if needed
TEXT_COLUMN = df_tmodeling.columns[0]

PROMPT = "Which are the most relevant topics of the following sentence? Choose among: medicine, COVID, Artificial Intelligence, treatment, English literature, vaccine, gardening"

# It may take some time to download the model. If you have already downloaded and loaded it, you can skip this part
model = AutoModelForCausalLM.from_pretrained('HuggingFaceH4/zephyr-7b-beta', torch_dtype=torch.float16, device_map=device)
tokenizer = AutoTokenizer.from_pretrained('HuggingFaceH4/zephyr-7b-beta')

# Inspect a small random sample; iterate over df_tmodeling directly to process every row
for _, row in df_tmodeling.sample(n=5, random_state=42).iterrows():
  sentence = row[TEXT_COLUMN]
  full_prompt = f"{PROMPT} {sentence}"
  inputs = tokenizer(full_prompt, return_tensors='pt').to(device)
  output = model.generate(**inputs, max_new_tokens=32)
  # The decoded text contains the prompt followed by the generated continuation
  output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
  print(output)

**3<sup>rd</sup> approach**: Ask the model to identify the topic(s) contained in a given sentence <u>providing</u> a predefined list of topics to choose from along with the corresponding definitions.

*Example of prompt*:\
Which are the most relevant topics of the following sentence?\
Choose among:
- medicine: treatment for illness or injury, or the study of this
- COVID: an infectious disease caused by a coronavirus
- Artificial Intelligence: computer systems that have some of the qualities that the human brain has, such as learning from data
- treatment: the use of drugs to cure a person of an illness or injury
- English literature: artistic works written in the English language, especially those with a high and lasting artistic value
- vaccine: a substance that is put into the body of a person or animal to protect them from a disease
- gardening: the job or activity of working in a garden

Definitions are taken from the [Cambridge Dictionary](https://dictionary.cambridge.org/) and have been slightly adapted.
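
The same idea extends to definitions: keeping them in a dictionary and rendering one bullet per topic preserves the list structure in the prompt. This is a minimal sketch; `build_definition_prompt` and `TOPIC_DEFINITIONS` are hypothetical names, and only three of the seven topics are included to keep the example short.

```python
QUESTION = "Which are the most relevant topics of the following sentence?"

# Subset of the topics above, with their definitions
TOPIC_DEFINITIONS = {
    "medicine": "treatment for illness or injury, or the study of this",
    "COVID": "an infectious disease caused by a coronavirus",
    "gardening": "the job or activity of working in a garden",
}


def build_definition_prompt(question: str, definitions: dict[str, str]) -> str:
    """Render the question followed by one '- topic: definition' line per topic."""
    lines = [question, "Choose among:"]
    lines += [f"- {topic}: {definition}" for topic, definition in definitions.items()]
    return "\n".join(lines)


definition_prompt = build_definition_prompt(QUESTION, TOPIC_DEFINITIONS)
print(definition_prompt)
```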

In [None]:
# fill in the following code

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import pandas as pd

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

df_tmodeling = pd.read_csv("CovidFake_filtered.csv")
# Assumption: the sentences are stored in the first column; check df_tmodeling.head() and adjust if needed
TEXT_COLUMN = df_tmodeling.columns[0]

# One definition per line, so that the boundaries between topics are unambiguous
PROMPT = (
  "Which are the most relevant topics of the following sentence? Choose among:\n"
  "- medicine: treatment for illness or injury, or the study of this\n"
  "- COVID: an infectious disease caused by a coronavirus\n"
  "- Artificial Intelligence: computer systems that have some of the qualities that the human brain has, such as learning from data\n"
  "- treatment: the use of drugs to cure a person of an illness or injury\n"
  "- English literature: artistic works written in the English language, especially those with a high and lasting artistic value\n"
  "- vaccine: a substance that is put into the body of a person or animal to protect them from a disease\n"
  "- gardening: the job or activity of working in a garden\n"
  "Sentence:"
)

# It may take some time to download the model. If you have already downloaded and loaded it, you can skip this part
model = AutoModelForCausalLM.from_pretrained('HuggingFaceH4/zephyr-7b-beta', torch_dtype=torch.float16, device_map=device)
tokenizer = AutoTokenizer.from_pretrained('HuggingFaceH4/zephyr-7b-beta')

# Inspect a small random sample; iterate over df_tmodeling directly to process every row
for _, row in df_tmodeling.sample(n=5, random_state=42).iterrows():
  sentence = row[TEXT_COLUMN]
  full_prompt = f"{PROMPT} {sentence}"
  inputs = tokenizer(full_prompt, return_tensors='pt').to(device)
  output = model.generate(**inputs, max_new_tokens=32)
  # The decoded text contains the prompt followed by the generated continuation
  output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
  print(output)

After manually inspecting some of the outputs for each approach, consider the following questions about the results:
- Did you find an approach which worked best overall?
- Did you encounter any cases where the LLM failed?
- What happens if all the topics provided are irrelevant to the sentence?
- Does the presence of definitions improve the model's performance?
- What challenges or limitations did you observe?
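
To support the manual inspection, a rough tally of which candidate topics the model mentions can help compare the approaches. The sketch below is only an illustration: `mentioned_topics` is a hypothetical helper, the answers are made-up strings, and it assumes each answer contains only the generated text (not the echoed prompt, which would already list every topic).

```python
import re
from collections import Counter

TOPICS = [
    "medicine", "COVID", "Artificial Intelligence", "treatment",
    "English literature", "vaccine", "gardening",
]


def mentioned_topics(answer: str, topics: list[str]) -> list[str]:
    """Return the candidate topics that appear as whole words in the answer (case-insensitive)."""
    return [
        topic for topic in topics
        if re.search(rf"\b{re.escape(topic)}\b", answer, flags=re.IGNORECASE)
    ]


# Made-up answers, used only for illustration
answers = [
    "The sentence is about COVID and the vaccine rollout.",
    "Vaccine safety is the main topic.",
    "This sentence is about football.",
]

counts = Counter(topic for answer in answers for topic in mentioned_topics(answer, TOPICS))
print(counts)
```

Answers that match none of the candidate topics, like the last one here, are worth reading by hand: they often show how the model reacts when no option fits.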