# Hands-on word embeddings

Pre-trained embeddings are available from many companies and organisations. Adopting them saves you time and resources.

Have a look at...
- [Gensim's documentation](https://radimrehurek.com/gensim/models/word2vec.html)
- [Google's word2vec project](https://code.google.com/archive/p/word2vec/)

There are many pre-trained word2vec models available. Consider...
- [Gensim's](https://github.com/RaRe-Technologies/gensim-data)
- [University of Oslo's](http://vectors.nlpl.eu/repository/)

The next two cells are alternative ways of doing (almost) the same thing:

1. Downloading the Google w2v embeddings using the book's library (`nlpia`)
2. Using w2v embeddings you have downloaded in advance

The first one takes a long time (use with caution). The second one is 
faster, because you are expected to have downloaded the resource in 
advance.

In [1]:
# Alternative 1: downloading the pre-trained Google word2vec model (through the book's nlpia library)
# Execute this block only once (the model is big!)
# I tested this in early November 2022, but even the book says it could trigger errors.
from nlpia.data.loaders import get_data
word_vectors = get_data('word2vec')

In [None]:
# Alternative 2: Once the resource has been downloaded, we have to load it

from gensim.models.keyedvectors import KeyedVectors
GOOGLE_VECTORS = "/some/reasonable/path/GoogleNews-vectors-negative300.bin.gz"
word_vectors = KeyedVectors.load_word2vec_format(GOOGLE_VECTORS,
     binary=True, limit=200000)

# limit=200000 loads only the first 200k vectors of the file
# The aim is to speed things up and save some memory (just for the class)

**Back to the slides**

## Retrieving the most similar vectors

In [2]:
word_vectors.most_similar(positive=['cooking', 'potatoes'], topn=5)

[('cook', 0.6973530650138855),
 ('oven_roasting', 0.6754531264305115),
 ('Slow_cooker', 0.6742031574249268),
 ('sweet_potatoes', 0.6600280404090881),
 ('stir_fry_vegetables', 0.6548759341239929)]
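
To make the query above less of a black box, here is a minimal sketch of how a `most_similar` call with several positive words can work: average the unit-length vectors of the query words, then rank every other word by cosine similarity to that average. The 3-dimensional vectors and the helper names (`unit`, `cosine`, `most_similar`) are invented for illustration; this is not gensim's actual code.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only
toy_vectors = {
    "cooking":  [0.9, 0.1, 0.0],
    "potatoes": [0.6, 0.6, 0.1],
    "baking":   [0.8, 0.3, 0.0],
    "computer": [0.0, 0.1, 0.9],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def most_similar(vectors, positive, topn=5):
    # Average the unit-length vectors of the query words
    normed = [unit(vectors[w]) for w in positive]
    query = [sum(col) / len(normed) for col in zip(*normed)]
    # Rank every word that is not part of the query by cosine similarity
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w not in positive]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

result = most_similar(toy_vectors, ["cooking", "potatoes"], topn=2)
print(result)
```

With these toy vectors, `baking` ranks above `computer`, mirroring how the real model returns cooking-related words.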

In [4]:
word_vectors.most_similar(positive=['cooking'], topn=5)

[('cook', 0.7584654092788696),
 ('Cooking', 0.7552592158317566),
 ('baking', 0.6751805543899536),
 ('cookery', 0.6722506880760193),
 ('humongous_belly', 0.6695600748062134)]

In [8]:
word_vectors.most_similar(positive=['bush', 'clinton'], topn=1) # not there with 200k

[('reagan', 0.613964319229126)]

In [7]:
word_vectors.most_similar(positive=['Bush', 'Clinton'], topn=1) # not there with 200k

[('Obama', 0.821182131767273)]

In [6]:
word_vectors.most_similar(positive=['bush', 'president'], topn=1)

[('President', 0.5852983593940735)]

In [9]:
word_vectors.most_similar(positive=['Biden', 'president'], topn=1)

[('President', 0.6760360598564148)]

In [11]:
word_vectors.most_similar(positive=['Bologna', 'pasta'], topn=3)

[('pastas', 0.6319469809532166),
 ('Palermo', 0.629331111907959),
 ('panettone', 0.6292574405670166)]

In [12]:
word_vectors.most_similar(positive=['bologna', 'pasta'], topn=3)

[('meatloaf', 0.6751123070716858),
 ('ziti', 0.6721709370613098),
 ('ravioli', 0.6670635342597961)]

In [13]:
word_vectors.most_similar(positive=['Kentucky', 'chicken'], topn=3)

[('Tennessee', 0.5758107900619507),
 ('grilled_herbed', 0.5744916796684265),
 ('burgoo', 0.5679264664649963)]

In [14]:
word_vectors.most_similar(positive=['atlanta', 'baseball'], topn=3)

[('red_sox', 0.6523391604423523),
 ('mlb', 0.6118411421775818),
 ('phillies', 0.6006778478622437)]

In [None]:
# Something else?
word_vectors.most_similar(positive=[None, None] , topn=3)

## Retrieving the most similar vectors, after subtraction

In [None]:
# Let us load a bigger model (if you went for the second alternative)
from gensim.models.keyedvectors import KeyedVectors
GOOGLE_VECTORS = "/some/reasonable/path/GoogleNews-vectors-negative300.bin.gz"
word_vectors = KeyedVectors.load_word2vec_format(GOOGLE_VECTORS,
    binary=True, limit=400000)

In [20]:
word_vectors.most_similar(positive=['Chicago', 'Italy'], negative=['America'], topn=3)

[('Milan', 0.5907516479492188),
 ('Bologna', 0.5323939919471741),
 ('Genoa', 0.48967602849006653)]

In [16]:
# Not Germany with 200k
word_vectors.most_similar(positive=['Germany', 'France'], negative=['Europe'], topn=3)

[('Belgium', 0.5782909989356995),
 ('Joel_Chenal', 0.5722925662994385),
 ('extradites_Noriega', 0.5573993921279907)]

In [24]:
word_vectors.most_similar(positive=['Spain', 'America'], negative=['Europe'], topn=3)

[('Spanish', 0.4955134391784668),
 ('Jose_Luis', 0.46971771121025085),
 ('El_Salvador', 0.467809796333313)]

In [25]:
word_vectors.most_similar(positive=['Spain', 'America'], negative=['Europe', 'language'], topn=3)

[('Argentina', 0.3731328845024109),
 ('Chile', 0.36462217569351196),
 ('Panamerican', 0.351054310798645)]

**Back to the slides**

## Finding the "outlier" (or indeed the least similar word)

In [26]:
word_vectors.doesnt_match("potatoes milk cake computer".split())

'computer'
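
A rough sketch of the idea behind `doesnt_match`, assuming the usual approach: compute the mean of the unit-length vectors of all the words, then return the word whose vector is least similar to that mean. The toy vectors and the helper names are hypothetical; the real model uses 300-dimensional vectors.

```python
import math

# Toy vectors, invented for illustration only
toy_vectors = {
    "potatoes": [0.7, 0.6, 0.1],
    "milk":     [0.6, 0.7, 0.0],
    "cake":     [0.8, 0.5, 0.1],
    "computer": [0.0, 0.1, 0.9],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def doesnt_match(vectors, words):
    normed = {w: unit(vectors[w]) for w in words}
    # Mean of the unit vectors; it is not re-normalised because only the ranking matters
    mean = [sum(col) / len(words) for col in zip(*normed.values())]

    def similarity_to_mean(word):
        return sum(a * b for a, b in zip(normed[word], mean))

    # The outlier is the word least similar to the group mean
    return min(words, key=similarity_to_mean)

outlier = doesnt_match(toy_vectors, "potatoes milk cake computer".split())
print(outlier)
```

Note that the method always returns something, even when no word is a clear outlier, which helps explain some of the less intuitive answers in the cells that follow.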

In [27]:
word_vectors.doesnt_match("spanish italian french".split())

'spanish'

In [28]:
word_vectors.doesnt_match("beer wine spritz water".split())

'spritz'

In [29]:
word_vectors.doesnt_match("linguistics semantics pragmatics speech".split())

'speech'

In [30]:
word_vectors.doesnt_match("dog cat snake fish".split())

'fish'

In [31]:
word_vectors.doesnt_match("cow sheep goat camel".split())

'camel'

In [32]:
word_vectors.doesnt_match("bears eagles giants braves".split())

'braves'

In [35]:
word_vectors.doesnt_match("fries pizza taco sushi".split())

'sushi'

**Back to the slides**

## Adding and subtracting

In [36]:
word_vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=2)

[('queen', 0.7118193507194519), ('monarch', 0.6189674139022827)]
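
The `positive`/`negative` arguments can be read as vector arithmetic: add the vectors of the positive words, subtract those of the negative words, and look for the nearest remaining word. The sketch below illustrates this on hand-made vectors whose axes (royalty, male, female) are an assumption for the example; real embedding dimensions are not interpretable like this, and `analogy` is a hypothetical helper, not a gensim function.

```python
import math

# Toy vectors over invented axes [royalty, male, female]; not real embeddings
toy_vectors = {
    "king":    [1.0, 1.0, 0.0],
    "queen":   [1.0, 0.0, 1.0],
    "man":     [0.0, 1.0, 0.0],
    "woman":   [0.0, 0.0, 1.0],
    "monarch": [1.0, 0.5, 0.5],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=2):
    # Add the unit vectors of positive words, subtract those of negative words
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    parts = [[sign * x for x in unit(vectors[w])] for w, sign in signed]
    query = unit([sum(col) for col in zip(*parts)])
    # Rank the remaining words by cosine similarity to the query
    exclude = set(positive) | set(negative)
    scores = [
        (w, sum(a * b for a, b in zip(query, unit(v))))
        for w, v in vectors.items()
        if w not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

result = analogy(toy_vectors, positive=["king", "woman"], negative=["man"])
print(result)
```

As in the real query, the input words are excluded from the candidates; otherwise `king` itself would often come out on top.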

In [37]:
word_vectors.most_similar(positive=['pizza', 'mozzarella'], negative=['pineapple'], topn=3)

[('pizzas', 0.624053418636322),
 ('Pizza', 0.5243308544158936),
 ('mozzarella_cheese', 0.5103603005409241)]

In [38]:
word_vectors.most_similar(positive=['italy', 'mafia'], negative=['york'], topn=3)

[('camorra', 0.5486218929290771),
 ('Ndrangheta_mafia', 0.5391587615013123),
 ('Calabrian_Mafia', 0.5161239504814148)]

In [39]:
word_vectors.most_similar(positive=['black'], topn=10)

[('white', 0.8092213869094849),
 ('Responded_Letterman_How', 0.6182776689529419),
 ('blacks', 0.589222252368927),
 ('crypt_inscribed', 0.5855618119239807),
 ('transporting_petrochemicals', 0.5834174752235413),
 ('brown', 0.5766680240631104),
 ('Shilah_Phillips', 0.5763780474662781),
 ('women_dating_interracially', 0.5670552253723145),
 ('wrote_Newitz', 0.5604413747787476),
 ('blue', 0.5492398142814636)]

In [None]:
# Some other interesting example?
word_vectors.most_similar(positive=None, negative=None, topn=2)

**Back to the slides**

## Similarity between two words

In [40]:
word_vectors.similarity('princess', 'queen')

0.7070532
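
The score returned by `similarity` is the cosine similarity between the two word vectors: their dot product divided by the product of their lengths. A minimal sketch on 2-dimensional toy vectors (values chosen only so that the result is easy to check by hand):

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two vectors 45 degrees apart, and two orthogonal vectors
sim_diagonal = cosine_similarity([1.0, 0.0], [1.0, 1.0])
sim_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(sim_diagonal, sim_orthogonal)
```

The value depends only on the angle between the vectors, not on their lengths, so it ranges from -1 (opposite directions) to 1 (same direction).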

In [41]:
word_vectors.similarity('prince', 'frog')

0.31469014

In [42]:
word_vectors.similarity('god', 'monster')

0.4181738

In [43]:
word_vectors.similarity('gaze', 'watch')

0.2732389

In [44]:
word_vectors.similarity('frog', 'toad')

0.7049819

In [45]:
word_vectors.similarity('headache', 'flu')

0.21391372

In [46]:
word_vectors.similarity('Aztec', 'Mayan')

0.57236975

In [47]:
word_vectors.similarity('Rome', 'Athens')

0.55075914

In [48]:
word_vectors.similarity('automobile', 'car')

0.5838368

In [49]:
word_vectors.similarity('rail', 'train')

0.588517

In [50]:
word_vectors.similarity('ragu', 'pesto')

0.6560988

In [51]:
word_vectors.similarity('Toscana', 'Lombardia')

0.41493478

In [52]:
word_vectors.similarity('Toscana', 'Lazio')

0.30580756

In [53]:
word_vectors.similarity('pizza', 'taco')

0.5879696

In [54]:
word_vectors.similarity('piadina', 'taco')

0.3507547

In [None]:
# Some other interesting example?
word_vectors.similarity(None, None)

**Back to the slides**

## Accessing the actual vectors

In [55]:
word_vectors['phone']

array([-0.01446533, -0.12792969, -0.11572266, -0.22167969, -0.07373047,
       -0.05981445, -0.10009766, -0.06884766,  0.14941406,  0.10107422,
       -0.03076172, -0.03271484, -0.03125   , -0.10791016,  0.12158203,
        0.16015625,  0.19335938,  0.0065918 , -0.15429688,  0.03710938,
        0.22753906,  0.1953125 ,  0.08300781,  0.03686523, -0.02148438,
        0.01483154, -0.21289062,  0.16015625,  0.29101562, -0.03149414,
       -0.05883789,  0.04418945, -0.11767578, -0.12597656,  0.08447266,
       -0.10791016, -0.11279297,  0.17871094,  0.04467773,  0.17675781,
       -0.17089844, -0.02160645, -0.00061417, -0.17480469, -0.04760742,
        0.06835938, -0.0546875 ,  0.04467773, -0.19628906, -0.18554688,
       -0.10839844, -0.06030273,  0.11474609,  0.08544922,  0.05859375,
        0.23925781, -0.07080078,  0.11816406, -0.11132812,  0.08300781,
       -0.04394531,  0.00970459, -0.1484375 ,  0.265625  , -0.13769531,
        0.23535156, -0.19824219,  0.31445312,  0.02734375,  0.16

**Back to the slides**

# Training a word2vec model

In [64]:
# Setup 
from gensim.models.word2vec import Word2Vec
import nltk
from nltk.corpus import brown

num_features = 300   # Dimensionality of the embedding space
min_word_count = 3   # Words appearing fewer times than this are discarded (depends on the size of the corpus)
num_workers = 2      # Number of CPU cores to be used (depends on hardware)
window_size = 6      # Size of the context window
subsampling = 1e-3   # Threshold above which higher-frequency words are randomly downsampled
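
To give an intuition for the `subsampling` threshold, the sketch below uses the formula from the original word2vec paper, where a word with relative frequency f is discarded with probability 1 - sqrt(t / f). The exact formula implemented in gensim may differ slightly, and the frequencies used here are illustrative assumptions, not counts from the Brown corpus.

```python
import math

def discard_probability(relative_frequency, threshold=1e-3):
    """Probability of dropping one occurrence of a word (word2vec paper formula)."""
    if relative_frequency <= threshold:
        # Words at or below the threshold are never downsampled
        return 0.0
    return 1.0 - math.sqrt(threshold / relative_frequency)

# Assumed relative frequencies, for illustration only
p_frequent = discard_probability(0.06)    # a very frequent word, e.g. an article
p_rare = discard_probability(0.0001)      # a rare content word
print(round(p_frequent, 4), p_rare)
```

With a threshold of 1e-3, most occurrences of a very frequent word are dropped, while rare words are always kept.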

In [67]:
# Loading some data
nltk.download('brown')
sentence_list = brown.sents()
len(sentence_list)

[nltk_data] Downloading package brown to /home/albarron/nltk_data...
[nltk_data]   Package brown is already up-to-date!


57340

In [68]:
sentence_list

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

In [72]:
# Model initialisation 
# I ran this earlier. I won't do it now, as it takes some time
model = Word2Vec(
    sentence_list,
    workers=num_workers,
    vector_size=num_features,   # Notice that this parameter used to be size
    #min_count=min_word_count,
    window=window_size,
    sample=subsampling)

In [71]:
# Discarding the unneeded output weights and freezing the rest
# This is not necessary since gensim 4: https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4
# model.init_sims(replace=True)


In [73]:
# Saving the model 
model_name = "my_domain_specific_word2vec_model"
model.save(model_name)

In [74]:
# Loading a model

model = Word2Vec.load(model_name)
model.wv.most_similar('brown')
# Notice that model.most_similar('brown') is no longer available since gensim 4; use model.wv instead

[('jacket', 0.9722703099250793),
 ('thick', 0.9707545042037964),
 ('green', 0.9691036939620972),
 ('stretched', 0.9638636708259583),
 ('pale', 0.962711751461029),
 ('heavy', 0.9622294902801514),
 ('tall', 0.9598557949066162),
 ('yellow', 0.9594091773033142),
 ('beard', 0.957897424697876),
 ('flat', 0.9576342105865479)]

**Back to the slides**

## fastText

In [None]:
import gensim.models.fasttext as fastext
MODEL_PATH = "~/corpora/embeddings/FastText/cc.it.300.bin.gz"
# ft_model = FastText.load_fasttext_format(model_file=MODEL_PATH)
ft_model = fastext.load_facebook_vectors(MODEL_PATH)


In [None]:
ft_model.most_similar('calcio')

In [None]:
ft_model.most_similar('football')
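
fastText represents a word through its character n-grams, which lets it build a vector even for a word that is missing from the vocabulary. The sketch below only shows the n-gram extraction step, with the word wrapped in `<` and `>` boundary markers; `char_ngrams` is a hypothetical helper, not part of gensim, and the n-gram lengths are parameters chosen for the example.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word, including boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

# Only 3-grams, to keep the output short
ngrams = char_ngrams("calcio", min_n=3, max_n=3)
print(ngrams)
```

Words that share many n-grams (for example inflected forms of the same stem) end up with related vectors, since the word vector is built from the vectors of its n-grams.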

In [None]:
from gensim.models import fasttext
MODEL_PATH = "/Users/albarron/corpora/embeddings/FastText/it/cc.it.300.bin.gz"
# MODEL_PATH = "~/corpora/embeddings/FastText/cc.it.300.bin.gz"
ft_model = fasttext.load_facebook_vectors(MODEL_PATH)
# ft_model.most_similar('calcio')


In [None]:
ft_model.most_similar('calcio')