**SOFT DEADLINE:** `20.03.2022 23:59 msk` 

# [5 points] Part 1. Data cleaning

The task is to clean the text data of web pages crawled from different sites.

The distribution of the 100 most frequent words must contain only meaningful English words (no particles, conjunctions, prepositions, numbers, tags, or symbols).

Determine the order of operations below and carry out the appropriate cleaning.

1. Remove non-English words
1. Remove html-tags (try to do it with a regular expression, or experiment with the BeautifulSoup library)
1. Apply lemmatization / stemming
1. Remove stop-words
1. Additional processing, on your own initiative, if it helps to obtain a better distribution

The chosen order:
1. Remove html-tags
2. Remove non-English words
3. Remove stop-words
4. Apply lemmatization / stemming
5. Additional processing

#### Hints

1. For text processing you may use the nltk and re libraries
1. and / or any other libraries of your choice

#### Data reading

The dataset for this part can be downloaded here: `https://drive.google.com/file/d/1wLwo83J-ikCCZY2RAoYx8NghaSaQ-lBA/view?usp=sharing`

In [62]:
import pandas as pd

# path = './storage/train.csv'
path = './storage/web_sites_data.csv'

In [74]:
data = pd.read_csv(path)

# for debug
data = data.sample(1000)

data

Unnamed: 0,text
49158,"<!DOCTYPE html>\n <html class=""desktop"">\n\n<..."
30189,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n<!DOCTYPE ht...
30888,"<!-- START 'htmlHead' -->\n<link rel=""alternat..."
27006,<HTML><HEAD><TITLE>Quote for ED - FreeRealTime...
11917,<HTML>\n<HEAD>\n<TITLE>Lionel Messi wallpaper...
...,...
35985,"\n<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0..."
7724,"<!-- START 'htmlHead' -->\n<link rel=""alternat..."
57195,\n<html>\n<head>\n<title>FootyMania.com => Pla...
54469,\n<html>\n<head>\n<title>ESPNsoccernet.com Wor...


In [75]:
ldata = data.values.tolist()
ldata = [i[0] for i in ldata]
ldata[0][:200], ldata[10][:200]

('<!DOCTYPE html>\n  <html class="desktop">\n\n<head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# good_reads: http://ogp.me/ns/fb/good_reads#">\n  <meta name="google-site-verification" content="Pf',
 '<!-- START \'htmlHead\' -->\n<link rel="alternate" type="application/rss+xml" title="SI - Top Stories [RSS]" href="http://rss.cnn.com/rss/si_topstories.rss"/>\n<link rel="alternate" type="application/rss+')

#### Data processing

In [76]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import TreebankWordTokenizer, WhitespaceTokenizer
import re
# nltk.download('words')
# nltk.download('stopwords')
# nltk.download("wordnet")
# nltk.download('omw-1.4')

In [77]:
# 1. Remove html-tags
# Matches one tag; quoted attribute values may contain ">".
html_reg = re.compile(r"""<("[^"]*"|'[^']*'|[^'">])*>""")

tldata = []
for item in ldata:
    nitem = html_reg.sub("", item)
    # Delete tabs, carriage returns and newlines, then the "&nbsp;" entity.
    # Note: deleting (rather than substituting a space) joins words that were
    # separated only by a tag or a line break.
    nitem = re.sub(r"[\t\r\n]", "", nitem)
    nitem = nitem.replace("&nbsp;", "")
    tldata.append(nitem)

ldata = tldata
ldata[1][:300]


'   @import "http://uk.reuters.com/resources/css/rcom-main.css";@import "http://uk.reuters.com/resources/css/rcom-tertiary.css";@import "http://www.reuters.com/resources/css/rcom-homepage.css";@import "/includes/rcom-football.css";@import "/includes/cw-football.css";            ClÃ¡udio CaÃ§apa | New'
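The output above still starts with `@import` rules: the regular expression strips the tags themselves but keeps the text inside `<style>` and `<script>` elements. A minimal sketch of removing those blocks first, using only the standard library, might look as follows. The helper name `strip_html` and the patterns are illustrative assumptions, not part of the assignment, and the simplified tag pattern does not handle `>` inside attribute values.

```python
import html
import re

# Hypothetical helper, not part of the original notebook.
# Whole <script>/<style> elements, including their contents.
BLOCK_RE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")  # simplified: assumes no ">" inside attributes
SPACE_RE = re.compile(r"\s+")


def strip_html(page: str) -> str:
    """Return the visible text of an HTML page as a single line."""
    text = BLOCK_RE.sub(" ", page)
    text = COMMENT_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)      # a space keeps neighbouring words apart
    text = html.unescape(text)        # "&amp;" -> "&", "&nbsp;" -> "\xa0"
    return SPACE_RE.sub(" ", text).strip()


sample = (
    "<html><head><style>@import 'a.css';</style></head>"
    "<body><!-- menu --><p>Lionel&nbsp;Messi\nwallpaper</p></body></html>"
)
print(strip_html(sample))
# → Lionel Messi wallpaper
```

Replacing each tag with a space instead of an empty string is a deliberate choice here: it prevents words from adjacent elements from being glued together before tokenization.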

In [78]:
## intermediate processing

# Split each page into words on single spaces
# (consecutive spaces produce empty-string tokens).
# nltk.word_tokenize(item) is an alternative tokenizer.
ldata = [item.split(' ') for item in ldata]

In [79]:
# 2. Remove non-english words
eng_words = set(nltk.corpus.words.words())

print(len(ldata[1]))  # number of tokens before filtering
# Keep English dictionary words; non-alphabetic tokens (numbers, symbols, '')
# pass through this step and are left for later processing.
ldata = [
    [w for w in words if w.lower() in eng_words or not w.isalpha()]
    for words in ldata
]
len(ldata[1])

1575

In [None]:
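# 3. Remove stop-words
# Build the English stop-word set once: stopwords.words() without an argument
# returns the lists for every language, and calling it per word is very slow.
stop_words = set(stopwords.words('english'))

tldata = []
for words in ldata:
    tldata.append([word for word in words if word.lower() not in stop_words])

ldata = tldata
len(ldata[1])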

In [81]:
# 4. Apply lemmatization / stemming
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()  # alternative: stemmer.stem(word)

# Lemmatize every token (WordNetLemmatizer treats words as nouns by default).
ldata = [[lemmatizer.lemmatize(word) for word in words] for words in ldata]
ldata[1][100]

''

In [None]:
# 5. Additional processing
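One possible form of this step, sketched under stated assumptions: tokens are lowercased, and empty strings, tokens containing anything other than ASCII letters, and tokens shorter than `min_len` characters are dropped. The names `clean_tokens` and `min_len` are hypothetical, and the threshold of 3 is an arbitrary starting point.

```python
def clean_tokens(words, min_len=3):
    """Keep lowercase ASCII-alphabetic tokens of at least `min_len` characters."""
    cleaned = []
    for word in words:
        token = word.strip().lower()
        # isalpha() is False for "", digits, punctuation and mixed tokens
        # such as "rss+xml"; isascii() rejects mis-decoded characters.
        if token.isascii() and token.isalpha() and len(token) >= min_len:
            cleaned.append(token)
    return cleaned


print(clean_tokens(["", "Football", "2022", "|", "of", "news", "rss+xml"]))
# → ['football', 'news']
```

Applied to the notebook's data this would be `ldata = [clean_tokens(words) for words in ldata]`, assuming `ldata` is still a list of token lists at this point.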

#### Visualization

As a visualization, construct a frequency distribution of the 100 most common words, sorted by frequency.

For visualization we advise you to use plotly, but you are free to choose other libraries.
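A minimal sketch of the counting and plotting, assuming the cleaned data is a list of token lists. The function names `top_words` and `plot_top_words` are illustrative; plotly is imported lazily inside the plotting function so that the counting part runs without it.

```python
from collections import Counter


def top_words(docs, n=100):
    """Return the `n` most common tokens across all documents, most frequent first."""
    counts = Counter()
    for words in docs:
        counts.update(words)
    return counts.most_common(n)


def plot_top_words(pairs):
    """Draw a bar chart of (word, count) pairs with plotly."""
    import plotly.express as px  # requires plotly to be installed

    words, freqs = zip(*pairs)
    fig = px.bar(x=list(words), y=list(freqs), labels={"x": "word", "y": "count"})
    fig.show()


toy_docs = [["goal", "match", "goal"], ["match", "goal", "team"]]
print(top_words(toy_docs, n=2))
# → [('goal', 3), ('match', 2)]
```

With the notebook's variables the call would be `plot_top_words(top_words(ldata, n=100))`.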

#### Provide examples of processed text (some parts)

Is everything all right with the result of cleaning these examples? What kind of information was lost?

# [10 points] Part 2. Duplicates detection. LSH

#### Libraries you can use

1. LSH - https://github.com/ekzhu/datasketch
1. LSH - https://github.com/mattilyra/LSH
1. Any other library of your choice

1. Detect duplicated texts (duplicates need not match word for word; they may be paraphrases or rearrangements of words and sentences)
1. Plot the number of detected duplicates against shingle size (with fixed minhash length)
1. Plot the number of detected duplicates against minhash length (with fixed shingle size)
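To make the two parameters in the tasks above concrete, here is a pure-Python sketch of word shingling and MinHash signatures. It is not the `datasketch` API and it omits the LSH banding index those libraries provide; it only shows how shingle size `k` and signature length `num_perm` enter the similarity estimate. All function names are hypothetical, and seeded MD5 stands in for a family of hash functions.

```python
import hashlib


def shingles(text, k=3):
    """Set of word k-grams (shingles) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """64-bit hash of a shingle; each seed acts as a different hash function."""
    digest = hashlib.md5(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(shingle_set, num_perm=64):
    """MinHash signature: the minimum hash value under each seeded function."""
    if not shingle_set:
        raise ValueError("cannot build a signature for an empty shingle set")
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two documents agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy dog near the old mill"
for k in (1, 2, 3):
    sa, sb = shingles(a, k), shingles(b, k)
    exact = len(sa & sb) / len(sa | sb)
    approx = estimated_jaccard(minhash(sa), minhash(sb))
    print(f"k={k}: exact={exact:.2f} estimated={approx:.2f}")
```

Larger `k` makes shingles more specific, so near-duplicates share fewer of them; larger `num_perm` reduces the variance of the estimate at the cost of more hashing.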

In [None]:
# 1. Detect duplicated text


# [Optional 10 points] Part 3. Topic model

In this part you will learn how to do topic modeling with common tools and assess the resulting quality of the models. 

The provided data contain chunked stories by Edgar Allan Poe (EAP), Mary Shelley (MWS), and HP Lovecraft (HPL).

The dataset can be downloaded here: `https://drive.google.com/file/d/14tAjAzHr6UmFVFV7ABTyNHBh-dWHAaLH/view?usp=sharing`

#### Preprocess the dataset with the functions from Part 1

#### Quality estimation

Implement the following three quality functions: `coherence` (or `tf-idf coherence`), `normalized PMI`, and a measure based on distributed word representations (you can use pretrained w2v vectors or some other model). You are free to use any libraries (for instance gensim) and components.
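As an illustration of the `normalized PMI` measure, the sketch below averages NPMI over all pairs of a topic's top words, estimating probabilities from document-level co-occurrence. This is one common formulation, not necessarily the one expected by the assignment; the function name `npmi_coherence` and the boundary conventions (-1 for pairs that never co-occur, 1 for pairs present in every document) are assumptions.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Mean NPMI over all word pairs of a topic, using document co-occurrence."""
    if len(topic_words) < 2 or not documents:
        raise ValueError("need at least two topic words and one document")
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Share of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # assumed convention: the pair never co-occurs
        elif p12 == 1:
            scores.append(1.0)   # assumed convention: both words in every document
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


docs = [["a", "b"], ["a", "b"], ["c"], ["c", "d"]]
print(npmi_coherence(["a", "b"], docs))  # words that always appear together
print(npmi_coherence(["a", "c"], docs))  # words that never appear together
# → -1.0 for the second call
```

Values close to 1 indicate that a topic's words tend to appear in the same documents, values near 0 indicate independence, and negative values indicate that the words avoid each other.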

### Topic modeling

Read and preprocess the dataset, then divide it into train and test parts with `sklearn.model_selection.train_test_split`. The test part will be used in the classification part. For simplicity we do not perform cross-validation here, but you should keep it in mind.

Plot a histogram of the resulting token counts in the processed datasets.

#### NMF

Implement topic modeling with NMF (you can use `sklearn.decomposition.NMF`) and print out the resulting topics. Try changing the hyperparameters to better fit the dataset.

#### LDA

Implement topic modeling with LDA (you can use the gensim implementation) and print out the resulting topics. Try changing the hyperparameters to better fit the dataset.

### Additive regularization of topic models 

Implement topic modeling with ARTM. You may use the bigartm library (simple installation for Linux: `pip install bigartm`) or the TopicNet framework (`https://github.com/machine-intelligence-laboratory/TopicNet`).

Create an ARTM topic model and fit it to the data. Try changing the hyperparameters (number of specific and background topics) to better fit the dataset. Experiment with the smoothing and sparsing coefficients (use a grid search), and try adding a decorrelator. Print out the resulting topics.

Write a function that converts new documents to topic probability vectors.

Calculate the quality scores for each model. Make a bar plot to compare the quality.