# **Text processing**
---
Text processing is a preliminary stage in which raw text is prepared for subsequent analysis.

Text processing usually entails several steps, which may include:
- **Language Identification**: identifying the language of a given text.
- **Tokenization**: splitting a given text into sentences or words.
- **Dependency tree parsing:** analyzing the dependencies between the words composing the text.
- **Stemming/Lemmatization:** obtaining the root form of each word in the text.
- **Stopword removal**: removing words that are so commonly used that they carry very little useful information.
- **Part of Speech Tagging:** given a word, retrieving its part of speech (e.g., proper noun, common noun, or verb).
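
As a rough illustration of two of the steps listed above, the sketch below tokenizes a sentence with a regular expression and then drops stopwords. The `STOPWORDS` set and the `tokenize` / `drop_stopwords` helpers are hypothetical names introduced only for this example; a real pipeline would more likely rely on a library such as NLTK or spaCy.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "is", "of", "to", "in", "and"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"\w+", text.lower())

def drop_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

tokens = tokenize("The cat is in the garden.")
print(tokens)
print(drop_stopwords(tokens))
```

The regular expression is a deliberately crude tokenizer: it discards punctuation and does not handle contractions or hyphenated words.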



# Exercise 1

Benchmark different language-detection algorithms by computing the accuracy of each approach:
- [FastText](https://pypi.org/project/fastlangid/)
- [LangID](https://github.com/saffsd/langid.py)
- [langdetect](https://pypi.org/project/langdetect/)

**Hint:** language code conversion: [iso639-lang](https://pypi.org/project/iso639-lang/)

For each method report:
- Accuracy
- Average time per example
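
One possible way to collect both metrics with the same code for every detector is a small helper like the one below. It is only a sketch: `benchmark` and `toy_predict` are hypothetical names, and the toy texts and labels stand in for the real dataset. Any callable that maps one text to a language label can be passed as `predict`.

```python
import time

def benchmark(predict, texts, labels):
    """Return (accuracy, average milliseconds per example) for a predictor.

    `predict` is any callable mapping one text to a language label.
    """
    start = time.perf_counter()
    predictions = [predict(t) for t in texts]
    elapsed = time.perf_counter() - start

    correct = sum(p == y for p, y in zip(predictions, labels))
    accuracy = correct / len(labels)
    avg_ms = elapsed * 1000 / len(texts)
    return accuracy, avg_ms

# Toy predictor and data, for illustration only.
def toy_predict(text):
    return "en" if "the" in text.split() else "it"

texts = ["the cat sleeps", "il gatto dorme", "the dog barks", "the il"]
labels = ["en", "it", "en", "it"]

acc, ms = benchmark(toy_predict, texts, labels)
print(f"accuracy={acc:.2f}, avg ms per example={ms:.4f}")
```

Timing only the prediction loop keeps model loading and label conversion out of the per-example figure, which makes the detectors easier to compare.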

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/langid_dataset.csv

--2021-10-09 15:50:32--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/langid_dataset.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12990065 (12M) [text/plain]
Saving to: ‘langid_dataset.csv’


2021-10-09 15:50:33 (105 MB/s) - ‘langid_dataset.csv’ saved [12990065/12990065]



In [None]:
pip install fastlangid

In [None]:
pip install langid

In [None]:
pip install iso639-lang


In [None]:
pip install langdetect

In [None]:
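from fastlangid.langid import LID
import csv

source = 'langid_dataset.csv'
# dataset[0] holds the texts, dataset[1] the language names
dataset = [[], []]

# NOTE: csv.reader does not skip the header line, so the first entry of each
# list is the column name ('Text' / 'language') rather than a real example.
with open(source) as f:
  data = csv.reader(f)
  for row in data:
    dataset[0].append(row[0])
    dataset[1].append(row[1])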

In [None]:
# Read the dataset with pandas
import pandas as pd
from iso639 import Lang

df_langid = pd.read_csv('langid_dataset.csv')
print(df_langid)
# Convert language names (e.g. 'Estonian') to ISO 639-1 codes (e.g. 'et')
df_langid['language'] = df_langid['language'].apply(lambda x: Lang(x).pt1)
print(df_langid)

                                                    Text  language
0      klement gottwaldi surnukeha palsameeriti ning ...  Estonian
1      sebes joseph pereira thomas  på eng the jesuit...   Swedish
2      ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...      Thai
3      விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...     Tamil
4      de spons behoort tot het geslacht haliclona en...     Dutch
...                                                  ...       ...
21995  hors du terrain les années  et  sont des année...    French
21996  ใน พศ  หลักจากที่เสด็จประพาสแหลมมลายู ชวา อินเ...      Thai
21997  con motivo de la celebración del septuagésimoq...   Spanish
21998  年月，當時還只有歲的她在美國出道，以mai-k名義推出首張英文《baby i like》，由...   Chinese
21999   aprilie sonda spațială messenger a nasa și-a ...  Romanian

[22000 rows x 2 columns]
                                                    Text language
0      klement gottwaldi surnukeha palsameeriti ning ...       et
1      sebes joseph pereira thomas  på

In [None]:
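%%time
# FastLangID performance (the %%time cell magic has to be the first line of the cell)
from iso639 import Lang
from sklearn.metrics import accuracy_score

# named fast_lid so that it does not clash with the `langid` module imported later
fast_lid = LID()
result = fast_lid.predict(dataset[0])
# collapse every code starting with 'zh' into the single code 'zh'
result = ['zh' if r.startswith('zh') else r for r in result]
print(result[0:100])
# convert ISO 639-1 codes to language names to match the dataset labels
result = [Lang(r).name for r in result]

print("The accuracy is: ", accuracy_score(dataset[1], result))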

['en', 'et', 'no', 'th', 'ta', 'nl', 'ja', 'tr', 'de', 'ur', 'ja', 'id', 'pt', 'fr', 'zh', 'ko', 'th', 'et', 'pt', 'en', 'hi', 'ta', 'es', 'fr', 'fr', 'et', 'ko', 'fr', 'ps', 'nl', 'fa', 'fr', 'ro', 'ru', 'ja', 'id', 'la', 'la', 'en', 'fr', 'pt', 'en', 'ur', 'en', 'id', 'id', 'ja', 'ar', 'ar', 'id', 'sv', 'nl', 'ru', 'nl', 'ar', 'ar', 'tr', 'ur', 'fr', 'pt', 'pt', 'id', 'id', 'ta', 'sv', 'fa', 'ru', 'en', 'ko', 'ar', 'th', 'pt', 'ar', 'ta', 'ps', 'ur', 'ps', 'en', 'ru', 'fa', 'ja', 'pt', 'hi', 'fa', 'en', 'sv', 'en', 'la', 'fr', 'la', 'ps', 'en', 'et', 'id', 'ta', 'ta', 'la', 'la', 'en', 'ps']
The accuracy is:  0.9676832871233125
CPU times: user 3.52 s, sys: 18.1 ms, total: 3.54 s
Wall time: 3.56 s


In [None]:
%%time
# LangID performance
from iso639 import Lang
from sklearn.metrics import accuracy_score
import langid

results = []
for item in dataset[0]:
  # classify returns a (language code, score) pair; keep only the code
  results.append(Lang(langid.classify(item)[0]).name)

print('The accuracy is: ', accuracy_score(dataset[1], results))

The accuracy is:  0.9542293532112177
CPU times: user 1min 1s, sys: 1min 4s, total: 2min 6s
Wall time: 1min 4s


In [None]:
%%time
# langdetect performance
from iso639 import Lang
from sklearn.metrics import accuracy_score
from langdetect import detect

res = []   # predicted language names
y_s = []   # true labels of the examples that could be classified

for sent, label in zip(dataset[0], dataset[1]):
  try:
    code = detect(sent)
    # langdetect reports Chinese as 'zh-cn' / 'zh-tw'; map both to 'zh'
    code = 'zh' if code.startswith('zh') else code
    name = Lang(code).name
  except Exception:
    # skip examples for which detection or code conversion fails
    continue
  res.append(name)
  y_s.append(label)


print('The accuracy is: ', accuracy_score(y_s, res))

In [None]:
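import time
from tqdm import tqdm
from sklearn.metrics import accuracy_score
from langdetect import detect

y_true = []
y_pred = []

start = time.time()
for index, row in tqdm(df_langid.iterrows(), total=len(df_langid)):
    sentence = row["Text"]
    real_lang_code = row["language"]
    y_true.append(real_lang_code)
    try:
        pred = detect(sentence)
        # langdetect reports Chinese as 'zh-cn' / 'zh-tw'; map both to 'zh'
        y_pred.append('zh' if pred.startswith('zh') else pred)
    except Exception as e:
        # count a failed detection as a wrong prediction
        y_pred.append("")
        print(e, "\n")
end = time.time()

print("\nlangdetect avg ms per example:", (end-start)*1000/len(df_langid.index))
print("\nlangdetect accuracy:", accuracy_score(y_true, y_pred))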

# Exercise 2

For the English texts, apply word-level tokenization. What is the average number of words per sentence?

Implement word tokenization using both [nltk](https://www.nltk.org/) and [spaCy](https://spacy.io/). Report the results for both of them.

For spaCy use the `en_core_web_sm` model.

In [None]:
import nltk
nltk.download('punkt')
import spacy
nlp = spacy.load("en_core_web_sm")


nltk_total = 0    # total number of NLTK tokens over the English texts
spacy_total = 0   # total number of spaCy tokens over the English texts
n_english = 0     # number of English texts

for text, label in zip(dataset[0], dataset[1]):
  if label == 'English':
    nltk_total += len(nltk.word_tokenize(text))
    spacy_total += len(nlp(text))
    n_english += 1

print('The average num of words per English sentence for NLTK is: ', nltk_total/n_english)
print('The average num of words per English sentence for Spacy is: ', spacy_total/n_english)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
The average num of words per English sentence for NLTK is:  68.738
The average num of words per English sentence for Spacy is:  72.334


# Exercise 3

Use spaCy to parse the dependency tree of a **randomly selected** sentence. You can use either an English sentence or a sentence in your native language (if it is supported by [spaCy](https://spacy.io/usage/models/)). Use [displaCy](https://explosion.ai/demos/displacy) to visualize the result in the notebook.

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
# Parse a randomly selected English sentence and render its dependency tree
import random
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

english_s = []
for i, s in enumerate(dataset[0]):
  if dataset[1][i]=="English":
    english_s.append(s)

j = random.randint(0, len(english_s)-1)
rand_doc = nlp(english_s[j])
print(english_s[j])
displacy.render(rand_doc, jupyter=True)
#spacy.displacy.serve(rand_doc, style='dep')

# Exercise 4
For the sentence selected in the previous exercise, apply all of the following steps:
1. Lemmatization: convert each word to its root form.
2. Stopword removal: remove language-specific stopwords.
3. Part of Speech Tagging: for each word in the sentence display its part-of-speech.

For each step, print the resulting list on the console.

In [None]:
# Lemmatization: collect the lemma of each token
lemmas = [t.lemma_ for t in rand_doc]
print(lemmas)

# Stopword removal: keep the tokens that spaCy does not flag as stopwords
# (a direct lookup of t.text in nlp.Defaults.stop_words would miss capitalized stopwords)
words = [t.text for t in rand_doc if not t.is_stop]
print(words)

# PoS tagging: coarse-grained (pos_) and fine-grained (tag_) tag of each token
print([(t.text, t.pos_, t.tag_) for t in rand_doc])

# **Occurrence-based text representation - TF-IDF**

---
TF-IDF (term frequency-inverse document frequency) is a statistical measure of how relevant a word is to a document within a collection of documents. It can be used to build an occurrence-based vector representation of each document.
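
To make the definition concrete, here is a minimal pure-Python sketch using the textbook weighting `tf * log(N / df)`. The three toy documents and the names `toy_docs`, `doc_freq` and `tfidf` are assumptions made for illustration. Library implementations can differ in detail: scikit-learn's `TfidfVectorizer`, for example, applies a smoothed IDF and L2 normalization by default, so its values will not match these.

```python
import math
from collections import Counter

toy_docs = [
    "the virus spreads fast",
    "the vaccine stops the virus",
    "the market opens today",
]
tokenized = [d.split() for d in toy_docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    """Map each term of a tokenized document to tf * log(N / df)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

weights = tfidf(tokenized[1])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {w:.3f}")
```

A term that occurs in every document ("the") gets weight zero, while terms specific to one document ("vaccine", "stops") get the highest weights.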

# Exercise 5
Use TF-IDF to vectorize each sentence in the original data collection. You can choose your preferred implementation of TF-IDF vectorization; one is available in the [scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(dataset[0])

# Exercise 6

Build a supervised multi-class language detector that uses the TF-IDF vectors as features. Use 80% of the data to train the language detector and the remaining 20% to assess its accuracy.

In [None]:
# Train test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, dataset[1], test_size=0.2)

In [None]:
from sklearn.naive_bayes import MultinomialNB

# train a multinomial Naive Bayes classifier on the TF-IDF features
clf = MultinomialNB().fit(x_train, y_train)

In [None]:
print(x_train.shape)
print(x_test.shape)

(17600, 277719)
(4401, 277719)


In [None]:
# predicting and computing accuracy
from sklearn.metrics import accuracy_score

preds = clf.predict(x_test)
print('The accuracy on the test set for the NB clf is: ', accuracy_score(y_test, preds))

The accuracy on the test set for the NB clf is:  0.9416041808679846


In [None]:
from sklearn.tree import DecisionTreeClassifier

# fit a decision tree on the same TF-IDF features for comparison
dt = DecisionTreeClassifier()
dt = dt.fit(x_train, y_train)
y_pred = dt.predict(x_test)
accuracy_score(y_test, y_pred)

0.8975232901613269

# **Topic Modelling**

Occurrence-based representations are high-dimensional. What is the dimension of the generated TF-IDF vector representation?
Topic modelling focuses on capturing latent topics in large document corpora.

The data collection used in this second part of the practice is provided [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv) - [source: Zenodo](https://zenodo.org/record/4282522#.YVdCXcbOOpd)


# Exercise 7

Latent Semantic Indexing (LSI) models underlying concepts by using SVD (Singular Value Decomposition).
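
In matrix terms, the idea can be sketched as follows. The symbols are notation introduced here for illustration, not taken from the exercise text. Given a term-document matrix $X$ with $m$ terms and $n$ documents, LSI keeps only the $k$ largest singular values of its SVD:

$$
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k \in \mathbb{R}^{k \times k},\quad
V_k \in \mathbb{R}^{n \times k}
$$

Each column of $U_k$ can be read as a latent topic expressed as signed weights over the vocabulary. A bag-of-words vector $d \in \mathbb{R}^{m}$ is mapped to the $k$-dimensional topic space by the projection

$$
\hat{d} = U_k^{\top} d
$$

Some formulations additionally rescale this projection by $\Sigma_k^{-1}$.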

Use [gensim](https://radimrehurek.com/gensim/) library to:
1. Create a corpus composed of the headlines contained in the data collection.
2. Generate a [dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) to create a word -> id mapping (required by LSI module).
3. Using the dictionary, preprocess the corpus to obtain the representation required for LSI model training ([documentation here](https://radimrehurek.com/gensim/models/lsimodel.html)).
4. Inspect the top-5 topics generated by the LSI model for the analysed corpus.

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv

In [None]:
pip install --upgrade gensim

In [None]:
# Let's load the CSV file with pandas
import pandas as pd

df = pd.read_csv('CovidFake_filtered.csv')

In [None]:
hl = df['headlines']
hl_list = hl.to_list()

In [None]:
len(hl_list)

9727

In [None]:
import gensim
from gensim import corpora

# whitespace tokenization of every headline
tokens_list = [document.split() for document in hl_list]

# word -> id mapping, then bag-of-words representation of each headline
corp_dict = corpora.Dictionary(tokens_list)
bow_list = [corp_dict.doc2bow(words) for words in tokens_list]


In [None]:
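# LSI
from gensim.models import LsiModel

# num_topics is left at its default (200 in gensim), so each headline
# is projected onto up to 200 latent dimensions
lsi_model = LsiModel(corpus=bow_list, id2word=corp_dict)
# LSI representation of every headline in the corpus
vector = lsi_model[bow_list]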

In [None]:
lsi_model.print_topics(5)

[(0,
  '0.636*"the" + 0.389*"of" + 0.314*"in" + 0.280*"a" + 0.254*"to" + 0.179*"and" + 0.155*"that" + 0.121*"is" + 0.107*"coronavirus" + 0.106*"A"'),
 (1,
  '-0.629*"the" + 0.598*"a" + 0.386*"in" + 0.105*"A" + 0.097*"and" + 0.078*"has" + 0.077*"COVID-19" + 0.071*"video" + 0.070*"on" + 0.064*"been"'),
 (2,
  '-0.709*"to" + 0.633*"of" + -0.137*"and" + -0.123*"is" + -0.071*"for" + -0.069*"be" + -0.054*"that" + -0.051*"are" + -0.049*"from" + -0.048*"due"'),
 (3,
  '-0.734*"in" + 0.428*"of" + 0.363*"to" + 0.283*"a" + -0.172*"the" + -0.097*"coronavirus" + 0.046*"from" + -0.040*"was" + 0.037*"COVID-19." + 0.034*"on"'),
 (4,
  '0.479*"to" + 0.425*"of" + -0.407*"a" + 0.391*"in" + -0.259*"that" + -0.240*"the" + -0.213*"is" + -0.187*"and" + -0.115*"for" + -0.075*"coronavirus"')]

In [None]:
# each item is the LSI (topic-space) representation of one headline
for doc_vector in vector[:5]:
    print(doc_vector)


[(0, 3.000143571650947), (1, -0.7428179577313365), (2, -0.5942236443971363), (3, 0.36944065783577384), (4, -0.9548173676273563), (5, -1.3744075160646982), (6, -0.6018581743443496), (7, 2.112272839040409), (8, 0.1823932453450455), (9, -0.5647390361336598), (10, -0.7719159002325008), (11, -0.5238152628462154), (12, 0.0710648404156954), (13, -0.871695194213208), (14, -0.06821185825181168), (15, -0.01377603041095901), (16, -0.14078880179784586), (17, -0.3082791847237329), (18, 0.1304814959326445), (19, -0.1535003000181402), (20, 0.04329128510962064), (21, -0.36840928338208456), (22, -0.7734199482559441), (23, 0.1339234906056869), (24, 0.09504895559487918), (25, -0.13980571471849648), (26, -0.2927476981301089), (27, 0.256358697684132), (28, 0.3607001707683395), (29, -0.0643713086819144), (30, -0.18410887981372193), (31, 0.06404007352973098), (32, -0.0291783928576033), (33, -0.35453746941862285), (34, 0.011782945448840839), (35, -0.22245485245072524), (36, -0.37804422033504465), (37, 0.03987

# Exercise 8 (Optional)

If no stopword removal is applied, the top-scored words contributing to each topic are common English words (e.g., *to, for, in, of, on*). Repeat the procedure of Exercise 7, adding a preliminary preprocessing step to **remove stopwords**.

In [None]:
import string
import gensim
from gensim import corpora
from gensim.models import LsiModel
from gensim.parsing.preprocessing import remove_stopwords

# translation table that deletes ASCII punctuation (built once, reused for every word)
punct_table = str.maketrans('', '', string.punctuation)

corpus = []
for document in hl_list:
  # lowercase, drop stopwords, then strip punctuation from each remaining word
  document = gensim.parsing.preprocessing.lower_to_unicode(document)
  document = remove_stopwords(document)
  corpus.append([w.translate(punct_table) for w in document.split()])

print(corpus[:3])  # preview the first few tokenized headlines
corp_dict = corpora.Dictionary(corpus)
bow_list = [corp_dict.doc2bow(words) for words in corpus]

lsi_model = LsiModel(corpus=bow_list, id2word=corp_dict)
vector = lsi_model[bow_list]



In [None]:
# each item is the LSI (topic-space) representation of one headline
for doc_vector in vector[:5]:
    print(doc_vector)

[(0, 1.3835951269710596), (1, 0.38328545888262366), (2, 0.5628545754203838), (3, 0.21122190375852928), (4, 0.1445633107461164), (5, -0.02600035414065352), (6, 0.32548035202792586), (7, -0.8308811499566274), (8, 0.11087430800392176), (9, -0.13263391221278611), (10, -0.7007041821798518), (11, 0.33187411261951194), (12, -0.09387040278481198), (13, 0.4786796701389231), (14, -0.09337932921181992), (15, 0.14485292256908275), (16, -0.17753111729780635), (17, 0.01838489672892537), (18, -0.21587918702916845), (19, 0.04915267355311528), (20, -0.09949407488599711), (21, -0.28918673716564713), (22, 0.015805131580617682), (23, 0.04638805732976419), (24, -0.15463908358116768), (25, -0.23432106719086454), (26, -0.24951833759730518), (27, 0.070281159864177), (28, 0.011815066701951446), (29, -0.12064151940161273), (30, -0.07438338497596514), (31, -0.05144816805673432), (32, -0.14540454853123877), (33, 0.08389632352870599), (34, 0.24098791997240307), (35, -0.08496628861755323), (36, 0.026361630305363048

In [None]:
lsi_model.print_topics(5)

[(0,
  '0.750*"coronavirus" + 0.402*"covid19" + 0.172*"video" + 0.139*"people" + 0.121*"shows" + 0.121*"facebook" + 0.110*"novel" + 0.107*"claim" + 0.099*"new" + 0.096*"shared"'),
 (1,
  '0.816*"covid19" + -0.536*"coronavirus" + 0.067*"video" + 0.051*"shows" + -0.044*"novel" + 0.040*"hospital" + -0.039*"new" + 0.038*"facebook" + 0.037*"claims" + 0.035*"lockdown"'),
 (2,
  '-0.358*"video" + -0.326*"facebook" + -0.298*"claim" + 0.295*"covid19" + -0.279*"shows" + -0.278*"posts" + -0.267*"times" + 0.251*"coronavirus" + -0.250*"shared" + -0.196*"multiple"'),
 (3,
  '-0.655*"video" + -0.302*"shows" + -0.250*"people" + 0.241*"facebook" + 0.238*"posts" + 0.215*"claim" + 0.203*"shared" + 0.175*"times" + 0.152*"multiple" + -0.141*"lockdown"'),
 (4,
  '-0.887*"people" + 0.299*"video" + 0.128*"shows" + 0.094*"coronavirus" + 0.085*"covid19" + -0.071*"lockdown" + -0.069*"government" + -0.069*"virus" + -0.065*"died" + 0.064*"patients"')]

# Exercise 9 (Optional)

Using the same corpus as for the LSI model, apply LDA modelling with the number of topics set to 5. Display the words that contribute most to those topics according to the LDA model.

In [None]:
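from gensim.models import LdaModel

# NOTE: the exercise asks for 5 topics; this run uses num_topics=3,
# which is what the recorded output reflects. Set num_topics=5 to follow the text.
lda = LdaModel(bow_list, id2word=corp_dict, num_topics=3)
lda_vec = lda[bow_list]

lda.print_topics(3)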

[(0,
  '0.023*"covid19" + 0.018*"coronavirus" + 0.013*"people" + 0.009*"hospital" + 0.007*"masks" + 0.006*"died" + 0.005*"shows" + 0.005*"patients" + 0.004*"video" + 0.004*"new"'),
 (1,
  '0.074*"coronavirus" + 0.022*"covid19" + 0.015*"china" + 0.010*"wuhan" + 0.009*"virus" + 0.009*"people" + 0.009*"new" + 0.009*"infected" + 0.008*"outbreak" + 0.008*"water"'),
 (2,
  '0.037*"coronavirus" + 0.025*"covid19" + 0.017*"video" + 0.013*"shows" + 0.009*"people" + 0.008*"facebook" + 0.007*"claim" + 0.007*"novel" + 0.007*"shared" + 0.007*"government"')]

# Exercise 10 (Optional)

Using the pyLDAvis library, build an interactive visualization of the trained LDA model.

In [None]:
pip install pyldavis

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()


lda_display = gensimvis.prepare(lda, bow_list, corp_dict, sort_topics=False)
pyLDAvis.display(lda_display)



In [None]:
corpus[0]

['post',
 'claims',
 'compulsory',
 'vacination',
 'violates',
 'principles',
 'bioethics',
 'coronavirus',
 'doesnt',
 'exist',
 'pcr',
 'test',
 'returns',
 'false',
 'positives',
 'influenza',
 'vaccine',
 'related',
 'covid19']