# MVD Exercise 3

## Part 1 - Downloading and loading pretrained GloVe word representations


### Downloading the data

For this exercise you can use pretrained Word2Vec or GloVe vectors. The text below refers to GloVe vectors, which were chosen because they are smaller.

The basic version of the vectors can be downloaded [here (GloVe link)](https://huggingface.co/stanfordnlp/glove/resolve/main/glove.6B.zip).

Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 300d vectors, 822 MB download)

After unpacking the downloaded archive you will have several versions with different vector dimensions: 50d, 100d, 200d, and 300d. It is recommended to start with the smallest vectors and to run only the final solution on the larger ones.
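To start with the smallest vectors without unpacking everything, one option is to extract a single file from the archive. The sketch below assumes the archive has already been downloaded and that its members are named `glove.6B.<dim>d.txt` at the top level (matching the file names used later in this notebook); the helper `extract_glove` and its default paths are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_glove(archive_path, dim=50, target_dir="data"):
    """Extract only glove.6B.<dim>d.txt from the archive and return its path.

    Assumes the member sits at the top level of the archive under that name.
    """
    member = f"glove.6B.{dim}d.txt"
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extract(member, path=target)
    return target / member


# Example call (assumes glove.6B.zip is in the working directory):
# extract_glove("glove.6B.zip", dim=50)
```

Extracting into `data` keeps the result compatible with the `DIRECTORY` constant defined in the notebook.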

### Loading the data

The data is stored in a text file in which each line contains a word followed by its vector.

It is convenient to load the data into two variables, words and vectors. Words will be a list of length *n* and vectors will be a matrix of shape *(n, d)*.

Also create a dictionary word2idx that returns the index of any word (for example *word2idx['queen']*).

In [2]:
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

In [3]:
DEBUG = True

SIZES = [50, 100, 200, 300]
DIRECTORY = "data"
FILE_NAME = "glove.6B."
path = DIRECTORY + "/" + FILE_NAME + str(SIZES[0]) + "d.txt"

In [4]:
def load_data(file_name):
    word2idx = {}
    words = []
    vectors = []
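    # GloVe files are UTF-8 text; set the encoding explicitly so loading does not depend on the platform default
    with open(file_name, "r", encoding="utf-8") as file:
        # Iterate over the file directly instead of reading all lines into memory first
        for i, line in enumerate(file):
            # Each line is "<word> <v1> <v2> ... <vd>"
            key, *values = line.strip().split(" ")
            vector = np.array([float(number) for number in values])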
            words.append(key)
            vectors.append(vector)
            word2idx[key] = i
            if 0 and DEBUG:
                print(key)
                print(vector)
        
    return np.array(words), np.array(vectors), word2idx


In [5]:
# Load data
words, vectors, word2idx = load_data(path)

In [6]:
word = "king"
print("word:", word)
print("vec:", vectors[word2idx[word]])

word: king
vec: [ 0.50451   0.68607  -0.59517  -0.022801  0.60046  -0.13498  -0.08813
  0.47377  -0.61798  -0.31012  -0.076666  1.493    -0.034189 -0.98173
  0.68229   0.81722  -0.51874  -0.31503  -0.55809   0.66421   0.1961
 -0.13495  -0.11476  -0.30344   0.41177  -2.223    -1.0756   -1.0783
 -0.34354   0.33505   1.9927   -0.04234  -0.64319   0.71125   0.49159
  0.16754   0.34344  -0.25663  -0.8523    0.1661    0.40102   1.1685
 -1.0137   -0.21585  -0.15155   0.78321  -0.91241  -1.6106   -0.64426
 -0.51042 ]


## Part 2 - Cosine similarity

Create a function cossim that returns the cosine similarity of two input vectors (in the code below it is named similarity).


$$ \operatorname{similarity}(a,b) = \cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} $$

In [7]:
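def similarity(w1, w2):
    """Cosine similarity between w2 (1-D) and w1 (one vector or a matrix of row vectors)."""
    # Promote a single vector to a 1-row matrix so both cases share one code path
    if len(w1.shape) < 2:
        w1 = w1.reshape(1, -1)
    # Returns an array with one similarity per row of w1
    return (np.dot(w1, w2)) / (np.linalg.norm(w1, axis=1) * np.linalg.norm(w2))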

In [13]:
king = vectors[word2idx["king"]]
queen = vectors[word2idx["queen"]]
prince = vectors[word2idx["prince"]]
print(similarity(king, queen))
print(similarity(prince, queen))
print()
array = np.array([king, prince])
print(similarity(array, queen))


[0.7839043]
[0.7821861]

[0.7839043 0.7821861]
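As a quick sanity check of the cosine formula, the sketch below evaluates it on hand-picked 2-D vectors whose angles are known. The helper name `cosine` and the toy vectors are assumptions made for this illustration; they are not part of the exercise code.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity of two 1-D vectors (hypothetical helper for this check)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
diag = np.array([1.0, 1.0])

print(cosine(e1, e1))      # same direction: 1.0
print(cosine(e1, e2))      # orthogonal: 0.0
print(cosine(e1, diag))    # 45 degrees apart: approximately 0.7071
print(cosine(e1, 3 * e1))  # rescaling a vector does not change the angle: 1.0
```

The last line illustrates why cosine similarity suits word vectors: it compares directions only and ignores vector length.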


## Part 3 - Word analogies

The best-known word analogy comes from Word2Vec: $f("king") - f("man") = f("queen") - f("woman")$

1. Write a script that searches for analogies $f("king") - f("man") = f("??") - f("woman")$ and try a few others as well.
2. Print the 5 most similar words.

In [8]:
print(similarity(vectors[word2idx["king"]], vectors[word2idx["queen"]]))
print(similarity(vectors[word2idx["man"]], vectors[word2idx["woman"]]))
print()
print(similarity(vectors[word2idx["king"]], vectors[word2idx["man"]]))
print(similarity(vectors[word2idx["queen"]], vectors[word2idx["woman"]]))

[0.7839043]
[0.88603377]

[0.53093769]
[0.60031058]


In [8]:
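def find_close_words(vector, n=5):
    # Cosine similarity of the query vector to every word in the vocabulary
    similarities = similarity(vectors, vector)
    indexes = np.argsort(similarities)[::-1]  # most similar first
    # Skip the top hit: it is the query word itself when `vector` is an unmodified word vector.
    # For an analogy vector the top hit can be a genuine candidate, which is then dropped too.
    return words[indexes[1:n+1]]

def closest_words(a, b, c):
    # Analogy a : b = c : ?, so search near c + (b - a)
    offset = vectors[word2idx[b]] - vectors[word2idx[a]]
    new = vectors[word2idx[c]] + offset
    return find_close_words(new)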

In [9]:
print(closest_words("king","man","queen"))
print(closest_words("man","king","woman"))
print(closest_words("father","mother","grandpa"))
print(closest_words("car","engine","plane"))
print(closest_words("car","wheel","plane"))
print(closest_words("knight","blade","soldier"))
print(closest_words("czech","prague","england"))
print(closest_words("bow","arrow","gun"))
print(closest_words("soccer","ball","hockey"))
print(closest_words("ship","sailor","spacecraft"))
print(closest_words("ship","sailor","car"))
print(closest_words("ship","sailor","plane"))
print(closest_words("ship","sink","plane"))
print(closest_words("ship","sink","car"))
print(closest_words("ship","sink","man"))
print(closest_words("ship","sink","rocket"))
print(closest_words("ship","sink","building"))
print(closest_words("fast","faster","slow"))


['girl' 'man' 'her' 'boy' 'she']
['queen' 'daughter' 'prince' 'throne' 'princess']
['grandpa' 'mommy' 'mom' 'daddy' 'aunt']
['engine' 'plane' '747' 'jet' 'spacecraft']
['tail' 'rudder' 'takeoff' 'landing' 'orbit']
['soldier' 'blade' 'bullet' 'shoots' 'bulldozer']
['cardiff' 'edinburgh' 'nottingham' 'birmingham' 'leeds']
['rifle' 'weapon' 'handgun' 'caliber' 'guns']
['puck' 'throws' 'throw' 'pass' 'hook']
['cassini' 'astronaut' 'spacecraft' 'orbiter' 'gemini']
['racer' 'teen' 'motorcycle' 'car' 'rider']
['plane' 'rider' 'stewardess' 'savicevic' 'star']
['sideways' 'stuck' 'blown' 'plane' 'windshield']
['drives' 'wheels' 'windshield' 'fix' 'bump']
['hard' 'somebody' 'thing' "'m" 'looks']
['exploding' 'rocket' 'flare' 'blasting' 'projectiles']
['building' 'floor' 'brick' 'wall' 'crumbling']
['slower' 'slow' 'quicker' 'slowing' 'accelerated']
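
The lists above are produced by always discarding the highest-ranked word. For example, the first list does not contain "woman", the expected answer for king : man = queen : ?. A common alternative is to exclude only the three query words from the ranking. The sketch below shows that variant on made-up 3-d toy vectors (not GloVe values); the names `toy_words`, `toy_vectors`, `toy_word2idx` and `analogy` are hypothetical.

```python
import numpy as np

# Toy embeddings invented for illustration; the second axis plays the role of
# "gender" and the third the role of "royalty".
toy_words = np.array(["man", "woman", "king", "queen", "apple"])
toy_vectors = np.array([
    [1.0, 0.0, 0.0],   # man
    [1.0, 1.0, 0.0],   # woman
    [1.0, 0.0, 1.0],   # king
    [1.0, 1.0, 1.0],   # queen
    [-1.0, 0.2, 0.1],  # apple
])
toy_word2idx = {str(w): i for i, w in enumerate(toy_words)}


def analogy(a, b, c, n=3):
    """Return up to n words d such that a : b is like c : d, never a, b or c."""
    target = (toy_vectors[toy_word2idx[c]]
              + toy_vectors[toy_word2idx[b]]
              - toy_vectors[toy_word2idx[a]])
    norms = np.linalg.norm(toy_vectors, axis=1) * np.linalg.norm(target)
    sims = toy_vectors @ target / norms
    # Mask the query words instead of blindly dropping the top-ranked hit
    sims[[toy_word2idx[w] for w in (a, b, c)]] = -np.inf
    best = np.argsort(sims)[::-1][:n]
    return [str(toy_words[i]) for i in best if np.isfinite(sims[i])]


print(analogy("man", "king", "woman", n=1))  # ['queen']
```

The argument order follows `closest_words(a, b, c)` from the notebook, so the same masking idea could be applied there by passing the three indices to the ranking step.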


### Bonus - Visualize the word analogies

To earn the bonus point, create a visualization of the word analogies (dimensionality reduction + visualization).
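
For the dimensionality-reduction step, PCA is a deterministic alternative to t-SNE. The sketch below is a minimal PCA projection written with NumPy's SVD, run on random stand-in data rather than real word vectors; the function name `pca_2d` and the demo array are assumptions for illustration.

```python
import numpy as np


def pca_2d(matrix):
    """Project the rows of `matrix` onto their first two principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
demo = rng.normal(size=(30, 50))  # stand-in for 30 word vectors of dimension 50
points = pca_2d(demo)
print(points.shape)  # (30, 2)
```

The two columns of `points` can then be used as the x and y coordinates of a scatter plot, in the same way as the t-SNE output.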

In [32]:
# One label per plotted word: each query word is repeated for its 5 nearest neighbours
queries = ["woman", "plane", "gun", "spacecraft", "england", "slow"]
y = np.repeat(queries, 5)
#print(y)

In [33]:
keys = [
    'girl', 'man', 'her', 'boy', 'she',
    'engine', 'plane', '747', 'jet', 'spacecraft', 
    'rifle', 'weapon', 'handgun', 'caliber', 'guns',
    'cassini', 'astronaut', 'spacecraft', 'orbiter', 'gemini',
    'cardiff', 'edinburgh', 'nottingham', 'birmingham', 'leeds',
    'slower', 'slow', 'quicker', 'slowing', 'accelerated'
]
vec = np.empty([len(y),vectors.shape[1]])
for i, key in enumerate(keys):
    vec[i] = vectors[word2idx[key]]

In [43]:
# PCA or t-SNE can be used for dimensionality reduction
# Apply t-SNE

#tsne = TSNE(n_components=2, perplexity=5, random_state=1).fit_transform(vec)
tsne = TSNE(n_components=2, perplexity=5).fit_transform(vec)

# Pass the coordinates by keyword: the first positional argument of px.scatter is data_frame
plot = px.scatter(x=tsne[:, 0], y=tsne[:, 1],
    text=keys, color=['word: ' + str(x) for x in y])
plot.update_coloraxes(showscale=False)
plot.layout.template = 'plotly'
plot


The default initialization in TSNE will change from 'random' to 'pca' in 1.2.


The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.



In [55]:
def draw_arrow(arr_start, arr_end):
    dx = arr_end[0] - arr_start[0]
    dy = arr_end[1] - arr_start[1]
    plt.arrow(arr_start[0], arr_start[1], dx, dy, 
        head_width=1, head_length=1, 
        length_includes_head=True, color='black')

In [None]:
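from sklearn.decomposition import PCA
pairs = [["king", "man"], ["queen", "woman"]]
flat = [word for pair in pairs for word in pair]
# draw_arrow uses only two coordinates, so project the word vectors to 2-D first
xy = PCA(n_components=2).fit_transform(vectors[[word2idx[w] for w in flat]])
points = dict(zip(flat, xy))
plt.scatter(xy[:, 0], xy[:, 1])
for key1, key2 in pairs:
    draw_arrow(points[key1], points[key2])
for word, point in points.items():
    plt.annotate(word, point)
plt.xlabel('PCA component 1')
plt.ylabel('PCA component 2')
plt.show()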