# MVD Exercise 5

## Part 1 - TF-IDF with word embeddings

The previous exercise asked you to implement the TF-IDF algorithm on a dataset from Kaggle. Today's exercise extends that task with word embeddings. You can use the pretrained GloVe embeddings from exercise 3, or, if you are interested, try working with Google's Word2Vec (available [here](https://code.google.com/archive/p/word2vec/)).

The exercise should contain the following parts:
- Loading the articles and embeddings
- Computing document vectors using TF-IDF and word embeddings
    - Use [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from the sklearn library to compute TF-IDF
    - Weighted average of the GloVe / Word2Vec vectors

$$ doc\_vector = \frac{1}{|d|} \sum\limits_{w \in d} TF\_IDF(w) glove(w) $$

$w$ ... word<br>
$d$ ... document<br>

- The query is transformed in the same way as a document

- Relevance is computed using cosine similarity

$$ score(q,d) = cos\_sim(query\_vector, doc\_vector) $$

$q$ ... query<br>
$d$ ... document<br>
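
The two formulas above can be sketched end to end on toy data. The three-dimensional `embeddings` and the `tfidf` weights below are made-up stand-ins for GloVe vectors and TfidfVectorizer output, and skipping words that have no embedding is an assumption, not something the assignment prescribes.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for GloVe / Word2Vec vectors.
embeddings = {
    "deep": np.array([1.0, 0.0, 0.0]),
    "learning": np.array([0.0, 1.0, 0.0]),
    "cat": np.array([0.0, 0.0, 1.0]),
}
# Hypothetical TF-IDF weights; in the exercise they would come from TfidfVectorizer.
tfidf = {"deep": 2.0, "learning": 1.0, "cat": 1.5}

def weighted_vector(tokens, dim=3):
    """Sum TF-IDF-weighted embeddings over the tokens and divide by |d|."""
    total = np.zeros(dim)
    for token in tokens:
        if token in embeddings:  # assumption: words without an embedding are skipped
            total += tfidf.get(token, 0.0) * embeddings[token]
    return total / max(len(tokens), 1)

def cos_sim(a, b, eps=1e-12):
    """Cosine similarity; eps guards against division by a zero norm."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

doc_a = weighted_vector(["deep", "learning", "unknownword"])
doc_b = weighted_vector(["cat"])
query = weighted_vector(["deep", "learning"])

print(cos_sim(query, doc_a), cos_sim(query, doc_b))
```

Because cosine similarity ignores vector length, the division by $|d|$ changes the magnitude of a document vector but not its score against a query.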

### Loading the articles

In [1]:
import csv
import json
import spacy
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt



In [2]:
file_path = "data/articles.csv"
lemmatizer = spacy.load('en_core_web_sm', disable=['parser', 'ner'])



In [3]:
with open(file_path, 'r', newline='') as read_obj:
    dict_reader = csv.DictReader(read_obj)
    data_raw = list(dict_reader)

In [4]:
def purge_chars(text, chars_to_remove):
    for c in list(chars_to_remove):
        text = text.replace(c, "")
    return text

def prep_text(text):
    # Convert all text to lower case
    text = text.lower()
    # Remove punctuation and all special characters (apostrophes, ...)
    numbers = "1234567890"
    interpunction = ",.:;?!"
    chars = numbers + interpunction + '#^&@$€Łłþ→ø%+*/|\\–—-\'’‘""”[]{}()'
    text = purge_chars(text, chars)
    # Collapse runs of whitespace into single spaces
    text = ' '.join(text.split())
    # Apply the lemmatizer:
    return ' '.join([token.lemma_ for token in lemmatizer(text)])

def process_data(data):
    """
    data: list of dicts with "title" and "text" keys ->
    array of shape (len(data), 2) with rows
    [preprocessed title, preprocessed text]
    """
    ret = np.empty([len(data),2], dtype=object)
    for index, article in enumerate(data):
        ret[index, 0] = prep_text(article["title"])
        ret[index, 1] = prep_text(article["text"])
    return ret

def split_to_word_lists(texts):
    return [text.split(' ') for text in texts]

def make_reverse_indexing(texts):
    words_indexes = {}
    word_lists = split_to_word_lists(texts)
    for index, words_list in enumerate(word_lists):
        #print(words_list)
        for word in words_list:
            indexes = words_indexes.get(word, [])
            indexes.append(index)
            words_indexes[word] = indexes
            #print("words_indexes[", word, "] ->", indexes)
    return words_indexes
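
A small sketch of what such an inverted index looks like on toy input. `build_inverted_index` is a hypothetical, more compact variant of `make_reverse_indexing` that uses `collections.defaultdict`; unlike the cell above it splits on any whitespace rather than on a single space.

```python
from collections import defaultdict

def build_inverted_index(texts):
    """Map each word to the list of document indexes it occurs in (with repeats)."""
    index = defaultdict(list)
    for doc_index, text in enumerate(texts):
        for word in text.split():
            index[word].append(doc_index)
    return dict(index)

toy_index = build_inverted_index(["deep learning", "deep deep skepticism"])
print(toy_index)
```

Keeping repeated document indexes means the same structure yields both the term count in a document (how often an index repeats) and the document frequency (how many distinct indexes there are).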

In [5]:
data = process_data(data_raw)

### Loading the embeddings

In [6]:
DEBUG = True

SIZES = [50, 100, 200, 300]
DIRECTORY = "data"
FILE_NAME = "glove.6B."
path = DIRECTORY + "/" + FILE_NAME + str(SIZES[0]) + "d.txt"

In [7]:
def load_data(file_name):
    word2idx = {}
    words = []
    vectors = []
    with open(file_name, "r", encoding="utf-8") as file:
        # Iterate over the file directly instead of reading all lines into memory first
        for i, line in enumerate(file):
            key, *values = line.strip().split(" ")
            vector = np.array([float(number) for number in values])
            words.append(key)
            vectors.append(vector)
            word2idx[key] = i
            if 0 and DEBUG:
                print(key)
                print(vector)
        
    return np.array(words), np.array(vectors), word2idx

In [8]:
# Load data
words, vectors, word2idx = load_data(path)
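
The GloVe text files store one token per line followed by its vector components, separated by spaces. A minimal sketch of parsing that layout lazily into a dictionary, using two made-up lines in place of the real file; `iter_embeddings` is a hypothetical helper, not part of the notebook.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: a token followed by its components.
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")

def iter_embeddings(handle):
    """Yield (word, vector) pairs one line at a time."""
    for line in handle:
        word, *values = line.rstrip().split(" ")
        yield word, np.asarray(values, dtype=np.float32)

embedding_index = dict(iter_embeddings(sample))
print(embedding_index["cat"])
```

A dictionary lookup replaces the separate `word2idx` and `vectors` structures, at the cost of losing the single contiguous matrix that is convenient for vectorized operations.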

### TF-IDF + Word2Vec and building document vectors

- Works better with smaller amounts of data
- We average the weighted word vectors (GloVe in this notebook)
- The weight of each word is its TF-IDF value
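
As a worked example of the weighting, the sketch below applies the formula used by `tf_idf` in the next cell, raw term count times $\log((M+1)/df)$, to a made-up corpus. Returning 0 for a word that never occurs in the corpus is an assumption added here for safety.

```python
import math

# Toy corpus of already preprocessed documents (made up for illustration).
corpus = [
    "deep learning be fun",
    "machine learning be fun",
    "deep skepticism",
]

def toy_tf_idf(word, doc_index, corpus):
    """Raw term count times log((M + 1) / df)."""
    docs = [doc.split() for doc in corpus]
    count = docs[doc_index].count(word)
    df = sum(1 for doc in docs if word in doc)
    if df == 0:  # assumption: a word unseen in the corpus gets weight 0
        return 0.0
    return count * math.log((len(docs) + 1) / df)

weight_rare = toy_tf_idf("skepticism", 2, corpus)   # occurs in 1 of 3 documents
weight_common = toy_tf_idf("learning", 0, corpus)   # occurs in 2 of 3 documents
print(weight_rare, weight_common)
```

The rarer word receives the larger weight, so it pulls the averaged document vector more strongly toward its own embedding.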

In [9]:
def sort_by_columm(array, columm=0, reverse_=False):
    return(sorted(array, key = lambda x: x[columm], reverse=reverse_)) 

def unique(array):
    return list(set(array))

def document_frequency(word, inverse_index):
    return len(unique(inverse_index.get(word, set())))

def c(word, inverse_index, document_index):
    return inverse_index.get(word, []).count(document_index)

def tf_idf(word, inverse_index, texts, document_index):
    M = len(texts)
    c2 = c(word, inverse_index, document_index)
    return c2 * np.log((M+1)/document_frequency(word, inverse_index))

def phrase_vector(words, vectors, word2idx, inverse_index, texts, document_index):
    vector = np.zeros(vectors[0].shape)
    for word in words:
        index = word2idx.get(word, None)
        if index is not None:
            vector += vectors[index] * tf_idf(word, inverse_index, texts, document_index)
    return vector/len(words)

def generate_phrase_vectors(vectors, word2idx, inverse_index, texts):
    ret = []
    for index, text in enumerate(texts):
        words = text.split(' ')
        vector = phrase_vector(words, vectors, word2idx, inverse_index, texts, index)
        ret.append(vector)
    return np.array(ret)


In [10]:
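# Sanity check: list title words whose document frequency is 0 (none are expected)
titles = data[:, 0]
inverse_index = make_reverse_indexing(titles)
title_words = unique(word for word_list in split_to_word_lists(titles) for word in word_list)
for word in title_words:
    number = document_frequency(word, inverse_index)
    if number == 0:
        print(word)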

In [11]:
documents = data
titles = documents[:,0]
texts = documents[:,1]
title_indexing = make_reverse_indexing(titles)
text_indexing = make_reverse_indexing(texts)

title_vector = generate_phrase_vectors(vectors, word2idx, title_indexing, titles)
doc_vector = generate_phrase_vectors(vectors, word2idx, text_indexing, texts)

### Transforming the query and computing relevance

In [12]:
def similarity(w1, w2):
    if len(w1.shape) < 2:
        w1 = w1.reshape(1, -1)
    return (np.dot(w1, w2)) / (np.linalg.norm(w1, axis=1) * np.linalg.norm(w2) + 0.0000001)

In [13]:
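def tf_idf_query(word, words, inverse_index, texts):
    M = len(texts)
    c1 = words.count(word)
    df = document_frequency(word, inverse_index)
    # A query word that never occurs in the corpus gets weight 0 (avoids division by zero)
    if df == 0:
        return 0.0
    return c1 * np.log((M+1)/df)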

def phrase_vector_query(words, vectors, word2idx, inverse_index, texts):
    vector = np.zeros(vectors[0].shape)
    for word in words:
        index = word2idx.get(word, None)
        if index is not None:
            vector += vectors[index] * tf_idf_query(word, words, inverse_index, texts)
    return vector/len(words)

"""
def score(query, vectors, word2idx, title_vector, doc_vector, texts, alpha=0.7):
    words = prep_text(query).split(' ')
    phrase_vector_query(words, vectors, word2idx, inverse_index, texts)
    return alpha*tf_idf_title + (1-alpha)*tf_idf_text
"""

In [19]:
#query = "coursera vs udacity machine learning"
query = "defense skepticism deep learning"
alpha=0.7
#scores = score(query, title_vector, doc_vector, texts)
words = prep_text(query).split(' ')
query_title_vector = phrase_vector_query(words, vectors, word2idx, title_indexing, titles)
query_text_vector = phrase_vector_query(words, vectors, word2idx, text_indexing, texts)
print(title_vector.shape)
print(query_title_vector.shape)
scores = alpha*similarity(title_vector, query_title_vector) + (1-alpha)*similarity(doc_vector, query_text_vector)

(337, 50)
(50,)


In [None]:
print(scores.shape)
print(scores)
print(np.argsort(scores))

In [None]:
indexes = np.argsort(scores)[::-1]
print(indexes)
scores_sorted = scores[indexes]
print(scores_sorted)

In [20]:
titles = np.array(data[:,0]).reshape(-1, 1)
texts = np.array(data[:,1]).reshape(-1, 1)
indexes = np.argsort(scores)[::-1]
scores_sorted = scores[indexes].reshape(-1, 1)
indexes_T = indexes.reshape(-1, 1)

sorted_data = np.concatenate((indexes_T, titles[indexes], texts[indexes], scores_sorted), axis=1)
print("Index: \t Title: \t Score:")
for article in sorted_data:
    print(article[0], "\t", article[1], "\t", article[3])

Index: 	 Title: 	 Score:
209 	 in defense of skepticism about deep learning gary marcus medium 	 0.791487775796834
327 	 in defense of skepticism about deep learning gary marcus medium 	 0.791487775796834
135 	 obamarnn machine generate political speech samim medium 	 0.7381373562012444
75 	 artificial intelligence the revolution have not happen yet 	 0.7336936082759244
100 	 artificial intelligence the revolution have not happen yet 	 0.7336936082759244
20 	 artificial intelligence the revolution have not happen yet 	 0.7336936082759244
37 	 the current state of machine intelligence shivon zilis medium 	 0.7333870109084172
117 	 the current state of machine intelligence shivon zilis medium 	 0.7333870109084172
210 	 how to learn deep learning in month towards data science 	 0.7280966545787406
15 	 machine learning be fun part modern face recognition with deep learning 	 0.7123480770679826
57 	 machine learning be fun part modern face recognition with deep learning 	 0.7123480770679826
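
The ranking printed above lists some titles more than once, because the same article appears under several indexes. One possible way to show each title only once is sketched below; `top_k_unique` is a hypothetical helper and the toy titles and scores are made up.

```python
import numpy as np

def top_k_unique(titles, scores, k=3):
    """Return (index, title, score) for the k best-scoring distinct titles."""
    order = np.argsort(scores)[::-1]  # indexes from highest to lowest score
    seen, results = set(), []
    for i in order:
        if titles[i] in seen:
            continue  # this title was already reported with an equal or higher score
        seen.add(titles[i])
        results.append((int(i), titles[i], float(scores[i])))
        if len(results) == k:
            break
    return results

toy_titles = ["a", "b", "a", "c"]
toy_scores = np.array([0.9, 0.5, 0.9, 0.7])
print(top_k_unique(toy_titles, toy_scores, k=3))
```

Which of several equally scored duplicates is kept depends on how `np.argsort` orders ties, so only the title and score of each reported hit should be relied on, not its index.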

## Bonus - Query autocompletion

The bonus part of today's exercise is query autocompletion using recurrent neural networks. The task is to build a simple recurrent neural network that generates text (character-level approach).

It is best to start after finishing the exercise for the ANS course, where this task is covered.

A dataset for training your neural network is available on the [Yahoo research](https://webscope.sandbox.yahoo.com/catalog.php?datatype=l&guccounter=1) site; you can also use, for example, the larger [Kaggle dataset](https://www.kaggle.com/c/yandex-personalized-web-search-challenge/data) or look for another dataset on [Google DatasetSearch](https://datasetsearch.research.google.com/).

The input is a partially typed query and the output should be at least 3 completed queries.
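
As a starting point for the character-level approach, the sketch below shows one common way to prepare training data: each query is encoded as character indexes and the target is the same sequence shifted by one position. The example queries, the newline end-of-query marker, and both helper names are assumptions for illustration; the network itself is left out.

```python
def build_char_vocab(queries, end_token="\n"):
    """Assign an integer index to every character that occurs in the queries."""
    chars = sorted(set("".join(queries)) | {end_token})
    return {ch: i for i, ch in enumerate(chars)}

def encode_pairs(query, char2idx, end_token="\n"):
    """Return (inputs, targets): targets are the inputs shifted by one character."""
    sequence = [char2idx[ch] for ch in query + end_token]
    return sequence[:-1], sequence[1:]

queries = ["deep learning", "deep dream"]  # made-up example queries
char2idx = build_char_vocab(queries)
inputs, targets = encode_pairs("deep", char2idx)
print(inputs, targets)
```

At prediction time a trained model would be fed the typed prefix and sampled character by character until it emits the end-of-query marker; repeating this, or keeping several candidates in parallel, gives the required 3 or more completions.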