**SOFT DEADLINE:** `20.03.2022 23:59 msk` 

# [5 points] Part 1. Data cleaning

The task is to clean the text of crawled web pages from different sites.

The distribution of the 100 most frequent words must contain only meaningful English words (no particles, conjunctions, prepositions, numbers, tags, or symbols).

Determine the order of operations below and carry out the appropriate cleaning.

1. Remove non-English words
1. Remove HTML tags (try a regular expression, or experiment with the BeautifulSoup library)
1. Apply lemmatization / stemming
1. Remove stop-words
1. Additional processing, at your own initiative, if it helps to obtain a better distribution
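
One plausible ordering can be sketched with the standard library only: strip the markup first, then lowercase and tokenize (which also drops numbers and symbols), then remove stop-words. Lemmatization would follow as a last step and is omitted here. The tiny `STOP_WORDS` set and all function names are illustrative assumptions, not part of the assignment; a real run would use a full stop-word list (for example from nltk or spaCy).

```python
import html
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[a-z]+")


def strip_html(raw):
    """Remove script/style blocks, comments and tags, then decode entities."""
    text = SCRIPT_STYLE_RE.sub(" ", raw)
    text = COMMENT_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    return html.unescape(text)


def tokenize(text):
    """Lowercase and keep alphabetic ASCII runs only (drops numbers and symbols)."""
    return WORD_RE.findall(text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def simple_clean(raw):
    """Markup removal -> tokenization -> stop-word removal."""
    return remove_stop_words(tokenize(strip_html(raw)))
```

Removing markup before anything else matters because tag and attribute names would otherwise be counted as words.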

#### Hints

1. For text processing you may use the nltk and re libraries
1. and / or any other libraries of your choice

#### Data reading

The dataset for this part can be downloaded here: `https://drive.google.com/file/d/1wLwo83J-ikCCZY2RAoYx8NghaSaQ-lBA/view?usp=sharing`

In [11]:
import pandas as pd
from bs4 import BeautifulSoup
import html
import re
import unicodedata
import spacy
from tqdm.notebook import tqdm
from collections import Counter
import nltk
from nltk.corpus import words
from datasketch import MinHash, MinHashLSH
import plotly.graph_objs as go
from plotly.offline import iplot
from sklearn.model_selection import train_test_split
import gensim
from gensim.models import TfidfModel, CoherenceModel
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.models import Nmf

In [3]:
nltk.download('words')
nlp = spacy.load('en_core_web_sm')

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\fahrizain\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


In [61]:
web_df = pd.read_csv('web_sites_data.csv')
web_df.head()

Unnamed: 0,text
0,"<html>\n<head profile=""http://www.w3.org/2005/..."
1,"<html>\n<head profile=""http://www.w3.org/2005/..."
2,"<html>\n<head profile=""http://www.w3.org/2005/..."
3,"<html>\n<head profile=""http://www.w3.org/2005/..."
4,"<html>\n<head profile=""http://www.w3.org/2005/..."


#### Data processing

In [62]:
contents = web_df['text'].to_list()
len(contents)

71699

In [4]:
def clean(text):
    doc = BeautifulSoup(html.unescape(text), 'html.parser').text
    no_extra_enter = re.sub(r'\n+', '\n', doc)
    # fix_unicode = unicodedata.normalize('NFKD', no_extra_enter)    
    no_nonascii = re.sub(r'[^\x00-\x7f]', r'', no_extra_enter)
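    # insert a space before a capitalized word glued to the previous word (e.g. "listOrder")
    pattern1 = r'(?<!\s|\-)([A-Z][a-z]+)'
    # split a digit range from the letters glued to it (e.g. "10-20items");
    # [A-Za-z] is used because [A-z] also matches the symbols [ \ ] ^ _ and `
    pattern2 = r'((\d+\-\d+)([A-Za-z]+))'
    fix_camel_case = re.sub(pattern2, r'\2 \3',
        re.sub(pattern1, r' \1', no_nonascii))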
    no_link = re.sub(r'https?://\S+|www\.\S+', '', fix_camel_case)
    # remove punctuation
    punct = r'[\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~]'
    no_punct = re.sub(punct, '', no_link)
    # remove numbers
    no_number = re.sub(r'\d+', '', no_punct)
    # remove spaces
    no_spaces = re.sub(r' +', ' ', no_number)

    return no_spaces.strip()


def preprocess(text):
    doc = nlp(text)
    tokens = []

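    for token in doc:
        # NOTE: token.lang_ is the language of the pipeline's vocabulary, so it is
        # always 'en' for en_core_web_sm and does not filter out non-English words
        if (token.lang_ == 'en'
                and token.pos_ != 'SPACE'
                and not token.is_stop):
            tokens.append(token.lemma_)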

    return ' '.join(tokens)


def clean_pipeline(text):
    return preprocess(clean(text))
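
The first pattern in `clean` is easier to judge in isolation. The sketch below reuses that regular expression unchanged; the helper name `split_glued_words` and the sample strings are assumptions for illustration. It shows both the intended effect (separating words glued together when tags are removed) and a side effect (splitting brand names and plural acronyms).

```python
import re

# Same pattern as in clean(): a capitalized word not preceded by whitespace or a hyphen.
CAMEL_RE = re.compile(r'(?<!\s|\-)([A-Z][a-z]+)')


def split_glued_words(text):
    """Insert a space before each match, then collapse repeated spaces."""
    spaced = CAMEL_RE.sub(r' \1', text)
    return re.sub(r' +', ' ', spaced).strip()


print(split_glued_words("Wish listOrder Status"))  # intended: glued menu items are separated
print(split_glued_words("HarperCollins"))          # side effect: a brand name is split
print(split_glued_words("DVDs"))                   # side effect: a plural acronym is split
```

Splits of this kind are one source of information loss to look for when inspecting the processed examples.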

In [24]:
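# resumes at index 24709; the file is opened once, in append mode, with an explicit
# encoding so that it matches the encoding used when the file is read back
with open('processed.txt', 'a', encoding='utf-8') as w:
    for c in tqdm(contents[24709:]):
        w.write(clean_pipeline(c) + '\n')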

100%|██████████| 46990/46990 [4:25:51<00:00,  2.95it/s]   


#### Visualization

For the visualization, build a frequency distribution of the 100 most common words, sorted by frequency.

We advise you to use plotly for this, but you are free to choose other libraries.
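
A minimal sketch of the counting step, assuming the input is a list of already cleaned, whitespace-separated lines. The function name `top_words` and its `lowercase` flag are illustrative assumptions. Counting is case-sensitive unless tokens are lowercased first, so the flag controls whether forms such as "Game" and "game" are merged.

```python
from collections import Counter


def top_words(lines, n=100, lowercase=True):
    """Return the n most common tokens as (word, count) pairs, most frequent first."""
    counter = Counter()
    for line in lines:
        tokens = line.split()
        if lowercase:
            tokens = [t.lower() for t in tokens]
        counter.update(tokens)
    return counter.most_common(n)
```

The resulting pairs can be passed directly to `pd.DataFrame(..., columns=['word', 'freq'])` for plotting.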

In [2]:
with open('./processed.txt', 'r', encoding='utf-8') as r:
    data = r.readlines()

len(data)

71699

In [82]:
counter = Counter()

for text in data:
    # counting is case-sensitive, so 'Game' and 'game' are separate entries
    counter.update(text.split())

top100 = counter.most_common(100)
top100[:5]

[('Xbox', 380000),
 ('Game', 365006),
 ('game', 210174),
 ('Games', 190669),
 ('CNET', 123196)]

In [83]:
df_top100 = pd.DataFrame(top100, columns=['word', 'freq'])
df_top100.head()

Unnamed: 0,word,freq
0,Xbox,380000
1,Game,365006
2,game,210174
3,Games,190669
4,CNET,123196


In [84]:
trace = go.Bar(
    x = df_top100.word,
    y = df_top100.freq,
    opacity=.75,
    
)
layout = go.Layout()
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

#### Provide examples of processed text (some parts)

Is everything all right with the result of cleaning these examples? What kind of information was lost?

In [73]:
# original scraped data
contents[1024][:1000]

'<!DOCTYPE html>\n\n\n<html onclick="window.numPreClicks=(window.numPreClicks+1)||1;window.lastPreClick=event.target">\n    <head>\n<!-- // --><script language=\'javascript\' type=\'text/javascript\'>\n<!--\n\treq_1_1351067931=new Image();\nreq_1_1351067931.src=\'/__ssobj/ard.png?5802792578353125008_1-1-\'+(664*2034740+571);\n//-->\n<!-- // --></script>\n\n        <!-- Flite - 1.37.15 - pwbflite05 -  - Wed Oct 24 04:38:51 EDT 2012 - command=product.view.book - test_cell= -->\n        <meta charset="utf-8" />\n        <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />\n        <title>BARNES & NOBLE | Gardening for Birds: How to Help Birds Make the Most of Your Garden by Stephen Moss, HarperCollins UK | Hardcover</title>\n        <meta name="keywords" content=""/>\n    \n        <meta name="description" content="Available in: Hardcover. Featuring seven garden plan illustrations, this guide covers different types of garden - window ledge, balcony, terrace or roof terrace, sm

In [75]:
# preprocessed data
data[1024][:1000]

'barnes noble gardening bird help bird Garden Stephen Moss Harper Collins UK Hardcover Skip Main Content Sign account Account Settings Wish list Order Status NOOK store Events help Election Read decide Holiday Preview Seasons Best Books save Tony Bennetts New Viva Duets Certified PreOwned nookâ ® Devices start Search product book NOOK Store NOOK Books textbook Movies tv music Kids Books Marketplace Rare Books Newsstand Calendars Home Gifts Toys Games Search million Products shopping Bag item spend free SHIPPING book NOOK Books NOOK Textbooks newsstand teen kid Toys Games Home Gifts DV Ds Music Gift Cards close gardening bird help bird Garden Stephen Moss Gill Tomblin Illustrator add List add list bn Library Favorites Wish list read New Essential list create new Essential List enter list enter invalid character enter valid alpha numeric character Essential list New Essential list add description list Submit Cancel New Wish list create new Wish list enter list enter invalid character en

# [10 points] Part 2. Duplicates detection. LSH

#### Libraries you can use

1. LSH - https://github.com/ekzhu/datasketch
1. LSH - https://github.com/mattilyra/LSH
1. Any other library of your choice

1. Detect duplicated texts (duplicates need not match word for word; they may be paraphrases or rearrangements of words or sentences)
1. Plot the number of duplicates as a function of shingle size (with fixed minhash length) `define shingle function`
1. Plot the number of duplicates as a function of minhash length (with fixed shingle size) `num_perm`
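
To see why shingle size matters, it helps to compute the exact Jaccard similarity that MinHash approximates. The sketch below uses character shingles; the function names and the two sample sentences are assumptions for illustration.

```python
def char_shingles(text, k=5):
    """Return the set of all length-k character substrings of text."""
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "the quick brown fox"
reordered = "brown fox the quick"

for k in (3, 10):
    sim = jaccard(char_shingles(original, k), char_shingles(reordered, k))
    print(k, round(sim, 2))
```

For these two strings the similarity is 0.7 with `k=3` and 0.0 with `k=10`: short shingles tolerate reordering, while long shingles only match when long stretches of text are identical.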

In [3]:
NGRAM = 5
NUM_PERM = 32
THRESHOLD = 0.5
DATA_SIZE = 1500

In [4]:
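def shingle(text, ngram=5):
    # character n-grams; stop at the last full window so no shorter tail shingles are produced
    return set(text[head:head + ngram] for head in range(max(len(text) - ngram + 1, 1)))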

def LSH(data, ngrams, num_perms, threshold):
    
    # transform text into shingles
    # tqdm.write('Transform data...')
    transformed = [shingle(d, ngrams) for d in tqdm(data, desc='Transform data')]
    
    # create minhash
    # tqdm.write('Creating MinHash...')
    min_hashes = []
    for item in tqdm(transformed, desc='Create MinHash'):
        m = MinHash(num_perm=num_perms)
        # update minhash values
        for d in item:
            m.update(d.encode('utf-8'))                
        # collect minhash
        min_hashes.append(m)

    # create LSH
    # tqdm.write('Create LSH...')
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perms)
    # insert minhash LSH element
    # tqdm.write('Inserting MinHash to LSH...')
    for i, m in enumerate(tqdm(min_hashes, desc='Insert MinHash')):
        # i -> document id
        # m -> minhash
        lsh.insert(i, m)

    # finding similarity
    # tqdm.write('Find Similarity...')
    doc_sim = []
    for i, m in enumerate(tqdm(min_hashes, desc='Find Similarity')):
        similar = lsh.query(m)
        # exclude queried document itself
        if i in similar:
            similar.remove(i)

        # collect similar document id
        doc_sim.extend(similar)
    
    return set(doc_sim)

def plot_dependency(x, y, xlabel, ylabel, title, opacity=.75):
    trace = go.Bar(
        x = x,
        y = y,
        opacity=opacity
    )

    layout = go.Layout()
    fig = go.Figure(data=[trace], layout=layout)
    fig.update_xaxes(title_text=xlabel)
    fig.update_yaxes(title_text=ylabel)
    fig.update_layout(title=title)
    iplot(fig)
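
The role of the minhash length can be illustrated without any library. The sketch below is not datasketch's implementation: it simulates `num_perm` permutations with seeded `hashlib` hashes (an assumption made for illustration) and estimates Jaccard similarity as the fraction of signature positions that agree. All function names are hypothetical.

```python
import hashlib


def _seeded_hash(item, seed):
    """64-bit hash of item under a given seed (stands in for one permutation)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingles, num_perm=128):
    """For each seed, keep the minimum hash value over all shingles."""
    if not shingles:
        raise ValueError("shingles must be non-empty")
    return [min(_seeded_hash(s, seed) for s in shingles) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


a = {str(i) for i in range(100)}
b = {str(i) for i in range(50, 150)}  # exact Jaccard with a is 50 / 150
for n in (16, 64, 256):
    est = estimate_jaccard(minhash_signature(a, n), minhash_signature(b, n))
    print(n, round(est, 3))
```

Longer signatures reduce the variance of the estimate, so near the threshold fewer pairs are misclassified; this is the effect the `num_perm` plot is meant to expose.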

In [5]:
ngrams = [25, 50, 75, 100, 125]

# fixed minhash length
fminhash_result = []
for gram in ngrams:
    result = LSH(data[:DATA_SIZE], gram, 128, THRESHOLD)
    fminhash_result.append(len(result))

plot_dependency(ngrams, fminhash_result, 'Shingle Size', 'Duplicates', 'Dependency of Duplicates on Shingle Size (with fixed minhash length)')

In [5]:
num_perms = [64, 128, 256, 512, 1024]

# fixed shingles length
fshingle_result = []
for perms in num_perms:
    result = LSH(data[:DATA_SIZE], 50, perms, THRESHOLD)
    fshingle_result.append(len(result))

plot_dependency(num_perms, fshingle_result, 'MinHash Size', 'Duplicates', 'Dependency of Duplicates on Minhash Length (with fixed shingle size)')

# [Optional 10 points] Part 3. Topic model

In this part you will learn how to do topic modeling with common tools and assess the resulting quality of the models. 

In [5]:
topics = pd.read_csv('data.csv')
topics.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


The provided data contain chunked stories by Edgar Allan Poe (EAP), Mary Shelley (MWS), and HP Lovecraft (HPL).

The dataset can be downloaded here: `https://drive.google.com/file/d/14tAjAzHr6UmFVFV7ABTyNHBh-dWHAaLH/view?usp=sharing`

#### Preprocess the dataset with the functions from Part 1

In [6]:
tqdm.pandas()
topics['clean'] = topics.text.progress_apply(clean_pipeline)
topics.clean.head()

0    process afford mean ascertain dimension dungeo...
1                          occur fumbling mere mistake
2    left hand gold snuff box caper hill cut manner...
3    lovely spring look Windsor Terrace sixteen fer...
4    find gold Superintendent abandon attempt perpl...
Name: clean, dtype: object

#### Quality estimation

Implement the following three quality functions: `coherence` (or `tf-idf coherence`), `normalized PMI`, and a measure based on distributed word representations (you can use pretrained w2v vectors or some other model). You are free to use any libraries (for instance gensim) and components.
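
As a reference point for the second measure, here is a minimal sketch of normalized PMI, NPMI(w1, w2) = log(P(w1, w2) / (P(w1) P(w2))) / (-log P(w1, w2)), averaged over all pairs of a topic's top words. It is a simplification under stated assumptions: probabilities are estimated from document-level co-occurrence (library implementations may use sliding windows and smoothing instead), a pair that never co-occurs scores -1, and a pair present in every document is assigned 1 by convention. Function names are hypothetical.

```python
import math
from itertools import combinations


def npmi_pair(w1, w2, docs):
    """NPMI of two words; docs is a list of token sets, one per document."""
    n = len(docs)
    p1 = sum(1 for d in docs if w1 in d) / n
    p2 = sum(1 for d in docs if w2 in d) / n
    p12 = sum(1 for d in docs if w1 in d and w2 in d) / n
    if p12 == 0:
        return -1.0  # the words never co-occur
    if p12 == 1:
        return 1.0   # degenerate case (0 / 0), handled by convention in this sketch
    return math.log(p12 / (p1 * p2)) / -math.log(p12)


def topic_npmi(top_words, docs):
    """Mean NPMI over all unordered pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi_pair(a, b, docs) for a, b in pairs) / len(pairs)
```

Values lie in [-1, 1]: 1 means the words always appear together, 0 means they are independent, and -1 means they never co-occur.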

### Topic modeling

Read and preprocess the dataset, then split it into train and test parts with `sklearn.model_selection.train_test_split`. The test part will be used in the classification part. For simplicity we do not perform cross-validation here, but you should keep it in mind.

In [7]:
X = topics.clean
y = topics.author

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((15663,), (3916,), (15663,), (3916,))

In [8]:
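# NOTE: the trailing [:5] keeps only the first five training documents,
# so everything fitted on topic_tokens sees five documents, not the full training set
topic_tokens = [text.split() for text in X_train.to_list()][:5]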
topic_tokens[:1]

[['speak', 'peculiarity', 'hair']]

In [9]:
id2word = Dictionary(topic_tokens)
corpus = [id2word.doc2bow(line) for line in topic_tokens]
tfidf = TfidfModel(corpus)
tfidf[corpus[0]]

[(0, 0.5773502691896258), (1, 0.5773502691896258), (2, 0.5773502691896258)]

Plot the histogram of the resulting token counts in the processed datasets.
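
A minimal sketch of the data behind such a histogram, assuming "token counts" means the number of tokens per processed document. The function name and the `bin_width` parameter are illustrative assumptions; the result maps the lower edge of each bin to the number of documents in it and can be handed to any plotting library.

```python
from collections import Counter


def token_count_histogram(texts, bin_width=5):
    """Map the lower edge of each bin to the number of documents that fall in it."""
    lengths = [len(t.split()) for t in texts]
    bins = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(bins.items()))
```

A large first bin is a hint that many documents were reduced to very few tokens by the cleaning step.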

#### NMF

Implement topic modeling with NMF (you can use `sklearn.decomposition.NMF`) and print out the resulting topics. Try changing the hyperparameters to better fit the dataset.

In [12]:
nmf = Nmf(corpus=corpus,
          id2word=id2word,
          num_topics=3,
          random_state=42,
          chunksize=100)

doc_nmf = nmf[corpus]

In [13]:
coherence_nmf = CoherenceModel(model=nmf,
    texts=topic_tokens,
    dictionary=id2word,
    coherence='c_v')

npmi_nmf = CoherenceModel(model=nmf,
    texts=topic_tokens,
    dictionary=id2word,
    coherence='c_npmi')

print(f'Coherence score: {coherence_nmf.get_coherence()}')
print(f'NPMI score: {npmi_nmf.get_coherence()}')

Coherence score: 0.8155339729623563
NPMI score: -0.14625867874115356


#### LDA

Implement topic modeling with LDA (you can use the gensim implementation) and print out the resulting topics. Try changing the hyperparameters to better fit the dataset.

In [14]:
lda = LdaModel(corpus=corpus,
    id2word=id2word,
    num_topics=3,
    random_state=42,
    update_every=1,
    chunksize=100,
    passes=10,
    alpha='auto',
    per_word_topics=True)

doc_lda = lda[corpus]

In [15]:
coherence_lda = CoherenceModel(model=lda,
    texts=topic_tokens,
    dictionary=id2word,
    coherence='c_v')

npmi_lda = CoherenceModel(model=lda,
    texts=topic_tokens,
    dictionary=id2word,
    coherence='c_npmi')

print(f'Coherence score: {coherence_lda.get_coherence()}')
print(f'NPMI score: {npmi_lda.get_coherence()}')

Coherence score: 0.7316938481186951
NPMI score: -0.03920549112979189


### Additive regularization of topic models 

Implement topic modeling with ARTM. You may use the bigartm library (simple installation on Linux: `pip install bigartm`) or the TopicNet framework (`https://github.com/machine-intelligence-laboratory/TopicNet`).

Create an ARTM topic model and fit it to the data. Try changing the hyperparameters (number of specific and background topics) to better fit the dataset. Experiment with the smoothing and sparsing coefficients (use a grid), and try adding a decorrelator. Print out the resulting topics.

Write a function to convert new documents to topic probability vectors.
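
One way to approach the conversion, sketched under explicit assumptions: the fitted model is taken to expose a word-by-topic matrix p(word | topic), represented here as a plain dictionary `phi`. The function name `infer_topics` and this layout are hypothetical and are not the bigartm or TopicNet API. The sketch runs plain EM fold-in with `phi` held fixed and without regularizers.

```python
def infer_topics(tokens, phi, num_topics, iterations=50):
    """Estimate p(topic | document) for a new document.

    phi maps each known word to a list of p(word | topic), one value per topic.
    Words missing from phi are ignored.
    """
    counts = {}
    for tok in tokens:
        if tok in phi:
            counts[tok] = counts.get(tok, 0) + 1

    theta = [1.0 / num_topics] * num_topics  # start from a uniform distribution
    if not counts:
        return theta  # no known words: nothing to update

    for _ in range(iterations):
        new_theta = [0.0] * num_topics
        for word, n_w in counts.items():
            # E-step: responsibility of each topic for this word
            weights = [phi[word][t] * theta[t] for t in range(num_topics)]
            norm = sum(weights)
            if norm == 0:
                continue
            for t in range(num_topics):
                new_theta[t] += n_w * weights[t] / norm
        total = sum(new_theta)
        if total == 0:
            break
        # M-step: renormalize the expected topic counts
        theta = [v / total for v in new_theta]
    return theta
```

The returned vector sums to one, so it can be used directly as a feature vector for a downstream classifier.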

Calculate the quality scores for each model. Make a barplot to compare the quality.