## 1) Questions about the videos?

## 2) Homework questions
- What is a word embedding?
- What are the desired properties of word embeddings?
- What are the potential uses of word embeddings?

## 3) Validation of Word Embeddings

In [1]:
import sys
import os
path = os.path.join("..",  # go back a folder
                   "class_06") # go into class 6
sys.path.append(path)  # add to path (so we can load the file)

In [4]:
# example of preprocessing
from text_preprocessor import Text

text_data = "This is a sample text. We will need to split it into sentences. Afterwards, it will be split into tokens"

text = Text(text_data)
text.tokenize(method="nltk", split_sent=True)
print(text.get_tokens())

sentences = text.get_tokens()

[['This', 'is', 'a', 'sample', 'text'], ['We', 'will', 'need', 'to', 'split', 'it', 'into', 'sentences'], ['Afterwards', ',', 'it', 'will', 'be', 'split', 'into', 'tokens']]


In [8]:
# train a word embedding
from gensim.models import Word2Vec
model = Word2Vec(sentences,
                 size=3, # size of the embedding layer (very low!); this argument is named vector_size in gensim >= 4.0
                 window=5, # the size of the window (max distance between current and predicted word)
                 min_count=1, # ignore words with a frequency lower than this
                 sg=1, # 1 = skip-gram, 0 = CBOW
                 workers=4) # number of worker threads to use when training

print(model.wv["sample"])  # look up the vector through the model's KeyedVectors

[-0.1649963   0.03560992  0.12827325]


The word vector can also be illustrated using color. Something like this:

![](http://jalammar.github.io/images/word2vec/word2vec.png)


---
The example above is only a toy illustration of how word vectors work. There are plenty of English examples online, so I will show you a Danish example here:

(The embeddings can be loaded from the DaNLP [GitHub site](https://github.com/alexandrainst/danlp/blob/master/docs/models/embeddings.md).)

In [11]:
# Download the model (only the first time) and cache it locally

from danlp.models.embeddings import load_wv_with_gensim
from gensim.models import KeyedVectors

if "word2vec.model" not in os.listdir():
    word_embeddings = load_wv_with_gensim('conll17.da.wv')
    word_embeddings.save("word2vec.model")
word_embeddings = KeyedVectors.load("word2vec.model")

## Similar Words

In [45]:
# most similar word (synonym detection)

print("Aarhus = ")
print(word_embeddings.most_similar(positive=['aarhus'], topn=10))

print("\nKat = ")
print(word_embeddings.most_similar(positive=['kat'], topn=10))

print("\nis corona more associated with beer or a virus:")
print(word_embeddings.similarity('corona', 'øl'))
print(word_embeddings.similarity('corona', 'virus'))



Aarhus = 
[('aalborg', 0.8658724427223206), ('århus', 0.8577944040298462), ('københavns', 0.8440234661102295), ('odense', 0.8249165415763855), ('risskov', 0.80350661277771), ('kolding', 0.7970957159996033), ('roskilde', 0.793404221534729), ('herning', 0.7862382531166077), ('konferencer8000', 0.7729060649871826), ('amager', 0.7698875069618225)]

Kat = 
[('hund', 0.8387212753295898), ('hamster', 0.8017323613166809), ('kanin', 0.7978731989860535), ('racekat', 0.7938638925552368), ('hunhund', 0.7916580438613892), ('hund.', 0.78998863697052), ('undulat', 0.7851536273956299), ('hvalp', 0.7825443744659424), ('hunden', 0.7818223237991333), ('katten', 0.7807555794715881)]

is corona more associated with beer or a virus:
0.24333946
0.0184646
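
The `similarity` call above returns the cosine similarity between two word vectors. A minimal sketch of that computation with NumPy follows; the helper name `cosine_similarity` and the 3-dimensional toy vectors are illustrative assumptions, not values taken from the Danish model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction, 0 = orthogonal)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy vectors, not real embeddings
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # points in the same direction as a
c = [0.0, 0.0, 3.0]  # orthogonal to a

print(cosine_similarity(a, b))  # close to 1
print(cosine_similarity(a, c))  # close to 0
```

Because the measure only depends on direction, the length of a vector (which tends to vary with word frequency) does not affect the score.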


## Analogies
"Copenhagen is to Denmark what ___ is to England"

or mathematically: $CPH - DK + ENG \approx {?}$

In [55]:
#capital
print(word_embeddings.most_similar(positive=['københavn', 'england'], negative=['danmark'], topn=5))
# maybe not?
print(word_embeddings.most_similar(positive=['københavn', 'tyrkiet'], negative=['danmark'], topn=5))
# inflection (definite form): "mand" -> "manden", so "læge" -> ?
print(word_embeddings.most_similar(positive=['læge', 'manden'], negative=['mand'], topn=5))


[('london', 0.7156291604042053), ('edinburgh', 0.6790332794189453), ('woolwich', 0.6561343669891357), ('eh1', 0.6467728614807129), ('leeds', 0.6460204124450684)]
[('antalya', 0.678152322769165), ('istanbul', 0.6532646417617798), ('ankara', 0.6456939578056335), ('alanya', 0.636702299118042), ('izmir', 0.6343185901641846)]
[('lægen', 0.7816711664199829), ('speciallæge', 0.7328570485115051), ('radiodoktor', 0.7276281714439392), ('ventegodt', 0.7276049852371216), ('flytlies', 0.7271700501441956)]
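
The analogy queries above can be read as vector arithmetic followed by a nearest-neighbour search. The sketch below illustrates the idea on a hypothetical toy vocabulary with hand-made 3-dimensional vectors; the function `analogy` and all vectors are assumptions for illustration, and the sketch leaves out details of the gensim implementation such as normalising the input vectors.

```python
import numpy as np

# hypothetical toy embeddings: (Denmark-ness, England-ness, capital-ness)
toy = {
    "danmark":   np.array([1.0, 0.0, 0.0]),
    "england":   np.array([0.0, 1.0, 0.0]),
    "københavn": np.array([1.0, 0.0, 1.0]),
    "london":    np.array([0.0, 1.0, 1.0]),
    "aarhus":    np.array([1.0, 0.0, 0.3]),
    "leeds":     np.array([0.0, 1.0, 0.3]),
}

def analogy(vectors, positive, negative):
    """Return the word closest (by cosine similarity) to sum(positive) - sum(negative)."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    exclude = set(positive) | set(negative)  # never return one of the query words
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in exclude:
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

print(analogy(toy, positive=["københavn", "england"], negative=["danmark"]))
```

In this toy space the query lands exactly on "london". With real embeddings the match is only approximate, which is why the Turkey query above ranks Antalya and Istanbul ahead of Ankara.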


## Odd-one-out Detection
This works by:
- taking the mean of all the word embeddings
- calculating the cosine similarity between that mean vector and each word
- returning the most dissimilar word (i.e. the one with the lowest cosine similarity to the mean vector)


In [64]:
word_embeddings.doesnt_match("elefant giraf ko ske".split())

'ske'
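
The steps listed at the top of this section can be sketched in a few lines of NumPy. The function `odd_one_out` and the 2-dimensional toy vectors are illustrative assumptions. Normalising each vector to unit length before averaging is a design choice of this sketch, not a claim about how gensim implements `doesnt_match`.

```python
import numpy as np

def odd_one_out(vectors):
    """vectors: dict mapping word -> 1-d array. Returns the word least similar to the mean."""
    words = list(vectors)
    # normalise to unit length so that every word contributes equally to the mean
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    center = unit.mean(axis=0)
    center = center / np.linalg.norm(center)
    sims = unit @ center  # cosine similarity of each word to the mean vector
    return words[int(np.argmin(sims))]

# toy vectors, not real embeddings: three "animals" and one "utensil"
toy = {
    "elefant": np.array([0.9, 0.1]),
    "giraf":   np.array([0.8, 0.2]),
    "ko":      np.array([0.85, 0.15]),
    "ske":     np.array([0.1, 0.9]),
}
print(odd_one_out(toy))
```

Note that the method always returns some word, even when all the words belong together, so the result is only meaningful if the list really contains an outlier.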

## Other ways word embeddings have been used

Shift in word meaning over time:
![](https://ruder.io/content/images/size/w2000/2017/10/semantic_change.png)

Cross-lingual word embeddings:
![](https://s3.ap-south-1.amazonaws.com/techleerimages/771a4957-7fb8-4ddd-ba04-4fa73187e5f1.png)

# Exercises:
These exercises are made for the Danish word embeddings, but feel free to use the embeddings by Google instead.

- Man ("mand") is to doctor ("læge") what woman ("kvinde") is to ___? Is the result problematic?
- Discuss how you would find the plural of a Danish word
    - Find the plural of 3 Danish words using word embeddings
- Discuss how you would find the antonym of a Danish word
    - Find the antonyms of 3 Danish words
- Examine 3 words with multiple meanings. How well do word embeddings handle these?
- Download this tagged [data](https://github.com/fnielsen/afinn/blob/master/afinn/data/AFINN-da-32.txt). It is the Danish AFINN sentiment lexicon, tagged by Finn A. Nielsen (a Danish NLP researcher). Discuss how you could expand this lexicon using word embeddings, and test whether your assumptions are correct.
- Can you use odd-one-out detection on AFINN's sentiment lexicon to check for errors in the dictionary, i.e. words which are tagged as positive but are in fact not?
- Read the following exercise. Are there other entities which you could classify using word embeddings? Think first of entities which we have talked about in the literature, and then think about more general concepts that words possess (such as )

- Train a model that uses word embeddings to predict positive and negative words. I suggest the following steps, but there are other ways of doing this, so feel free to diverge from these instructions:

    - 1) Extract positive, negative and neutral words from the lexicon (you will have to choose reasonable cut-offs)

    - 2) Train a classifier using scikit-learn which takes a word embedding as input and outputs the sentiment of the word

    - 3) Calculate the performance metrics on a held-out test set (how well does it perform on unseen words?)
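
As a starting point for the last exercise, here is a minimal sketch of steps 2 and 3 with scikit-learn. It runs on synthetic vectors so that it works without the lexicon: in the real exercise, `X` would hold the embedding of each AFINN word and `y` the label you derive from its score. The choice of logistic regression, the 25% test split, and the two-class setup are assumptions of this sketch, not requirements of the exercise.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# SYNTHETIC stand-in for real data: 100 "positive" and 100 "negative" word vectors.
# Replace X with the embeddings of the lexicon words and y with their labels.
dim = 10
X = np.vstack([rng.normal(+1.0, 1.0, size=(100, dim)),
               rng.normal(-1.0, 1.0, size=(100, dim))])
y = np.array([1] * 100 + [0] * 100)

# hold out 25% of the words so performance is measured on unseen words
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"held-out accuracy: {accuracy:.2f}")
```

The synthetic classes are well separated, so the accuracy here says nothing about how well the approach works on real sentiment words. With a neutral class added, `y` would simply take three values.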

[('containerrederi', 0.862022340297699), ('containerrederiet', 0.8570366501808167), ('tørlast', 0.8389478325843811), ('tankers', 0.8358646035194397), ('tankrederiet', 0.8347023725509644), ('containerreder', 0.8336674571037292), ('alphaliner', 0.8306968808174133), ('maersks', 0.8299944996833801), ('seago', 0.8279753923416138), ('drillings', 0.8269393444061279)]
