# Word2Vec and word meaning

What is the semantics, that is, the meaning, of a word?

A common-sense hypothesis is that the meaning of a word is the real-world object the word refers to. In this framework, words that are not known to map to real objects (e.g., words that are new to a reader, or random strings of characters like xvul) have no meaning at all.

However, this hypothesis is of little use in a digital context, where we have only the words and not the real objects.

An alternative approach is the **Distributional Hypothesis** about language and word meaning, which states that words that occur in the same contexts tend to have similar meanings (Harris, 1954). In other words, you shall know a word by the company it keeps (Firth, 1957).

> Harris, Z. S. (1954). Distributional structure. Word, 10(2-3), 146-162.

> Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. In J. R. Firth, W. Haas, & M. A. K. Halliday, Studies in Linguistic Analysis (Special volume of the Philological Society), 1–32. Oxford: Blackwell.
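
The distributional hypothesis can be illustrated without any trained model. The sketch below, which uses an invented toy corpus and hypothetical helper names (`context_counts`, `cosine`), represents each word by the counts of its neighbouring words and compares those count vectors. It is only meant to show the idea, not how Word2Vec is actually trained.

```python
from collections import Counter, defaultdict
import math

# Toy corpus, invented for illustration only.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "the car needs fuel",
    "the truck needs fuel",
]

def context_counts(sentences, window=2):
    """Count, for each word, the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

counts = context_counts(corpus)
print("cat ~ dog:", round(cosine(counts["cat"], counts["dog"]), 2))
print("cat ~ car:", round(cosine(counts["cat"], counts["car"]), 2))
```

In this toy corpus, "cat" and "dog" share more contexts (both drink and chase things) than "cat" and "car", so the first score comes out higher than the second.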

## Explore the Gensim implementation

> Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., & Joulin, A. (2017). Advances in pre-training distributed word representations. arXiv preprint arXiv:1712.09405.

In [1]:
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
from gensim.models import Word2Vec, KeyedVectors
from gensim.test.utils import datapath

In [4]:
# datapath() resolves files bundled with gensim's test data, so the local path is passed directly
wv = KeyedVectors.load_word2vec_format("/Users/flint/Data/word2vec/GoogleNews-vectors-negative300.bin",
                                       binary=True)

## Similarity

In [6]:
pairs = [
    ('car', 'minivan'),   # a minivan is a kind of car
    ('car', 'bicycle'),   # still a wheeled vehicle
    ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('car', 'cereal'),    # ... and so on
    ('car', 'communism'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))

'car'	'minivan'	0.69
'car'	'bicycle'	0.54
'car'	'airplane'	0.42
'car'	'cereal'	0.14
'car'	'communism'	0.06
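
The scores above are cosine similarities between the two word vectors. As a minimal sketch of that measure, the snippet below computes it with NumPy on hypothetical 3-dimensional vectors (the values are made up; real vectors from this model have 300 dimensions).

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product over the product of norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors, not taken from any trained model.
car = np.array([0.9, 0.1, 0.0])
minivan = np.array([0.8, 0.3, 0.1])
communism = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(car, minivan))    # close to 1: vectors point the same way
print(cosine_similarity(car, communism))  # close to 0: vectors are nearly orthogonal
```

Because the measure depends only on the angle between vectors, it ignores their lengths: scaling a vector by a positive constant leaves its similarity to every other vector unchanged.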


In [7]:
for x, y in wv.most_similar('car'):
    print(x, y)

vehicle 0.7821096181869507
cars 0.7423831224441528
SUV 0.7160962224006653
minivan 0.6907036900520325
truck 0.6735789775848389
Car 0.6677608489990234
Ford_Focus 0.667320191860199
Honda_Civic 0.6626849174499512
Jeep 0.651133120059967
pickup_truck 0.6441438794136047


In [8]:
vectors = []
for word in ['car', 'minivan', 'bicycle', 'airplane']:
    vectors.append(wv.get_vector(word))
V = np.array(vectors)

In [13]:
v = V.mean(axis=0)
v = v - wv.get_vector('car')

In [14]:
wv.similar_by_vector(v)

[('LightHawk', 0.3567371666431427),
 ('Beaver_floatplane', 0.3410896956920624),
 ('Bluebills', 0.3352811932563782),
 ('airplane', 0.32490819692611694),
 ('Volk_Field', 0.307051420211792),
 ('Andersland', 0.30294421315193176),
 ('Expedia_Expedia.com', 0.30243098735809326),
 ('NASA_Weightless_Wonder', 0.30234652757644653),
 ('Cessna_###B', 0.2979174852371216),
 ('propeller_plane', 0.2970975339412689)]

## Analogy

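An analogy such as FRANCE : PARIS = ITALY : ? can be answered with vector arithmetic, by looking for the words whose vectors are closest to

PARIS - FRANCE + ITALY

The cell below applies the same scheme to MAN : KING = WOMAN : ?, that is, King - man + woman.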

In [15]:
wv.most_similar(positive=['King', 'woman'], negative=['man'])

[('Queen', 0.5515626668930054),
 ('Oprah_BFF_Gayle', 0.47597548365592957),
 ('Geoffrey_Rush_Exit', 0.46460166573524475),
 ('Princess', 0.4533674716949463),
 ('Yvonne_Stickney', 0.4507041573524475),
 ('L._Bonauto', 0.4422135353088379),
 ('gal_pal_Gayle', 0.4408389925956726),
 ('Alveda_C.', 0.4402790665626526),
 ('Tupou_V.', 0.4373864233493805),
 ('K._Letourneau', 0.4351031482219696)]
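
To make the arithmetic concrete, here is a rough sketch of the analogy computation on hypothetical 4-dimensional vectors (dimensions loosely meaning "is a capital", "French", "Italian", "German"). The function `analogy` and the toy vectors are assumptions made for illustration; the sketch mirrors the general idea behind `most_similar` (combine normalized vectors, rank by cosine similarity, skip the input words) rather than reproducing gensim's exact code.

```python
import numpy as np

# Hypothetical toy embeddings, invented for this example.
toy = {
    "france":  np.array([0.0, 1.0, 0.0, 0.0]),
    "paris":   np.array([1.0, 1.0, 0.0, 0.0]),
    "italy":   np.array([0.0, 0.0, 1.0, 0.0]),
    "rome":    np.array([1.0, 0.0, 1.0, 0.0]),
    "germany": np.array([0.0, 0.0, 0.0, 1.0]),
    "berlin":  np.array([1.0, 0.0, 0.0, 1.0]),
}

def analogy(a, b, c, vectors):
    """Answer 'a : b = c : ?' by ranking words by cosine similarity to b - a + c."""
    # Normalize every vector to unit length so dot products are cosines.
    unit = {w: v / np.linalg.norm(v) for w, v in vectors.items()}
    target = unit[b] - unit[a] + unit[c]
    target /= np.linalg.norm(target)
    # The input words themselves are excluded from the candidates.
    candidates = [w for w in unit if w not in (a, b, c)]
    return max(candidates, key=lambda w: float(np.dot(unit[w], target)))

print(analogy("france", "paris", "italy", toy))  # rome
```

Excluding the input words matters in practice: without that filter, the nearest neighbour of `b - a + c` is often `b` or `c` itself.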

## Not matching

In [16]:
wv.doesnt_match("school professor apple student".split())

'apple'
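
One simple way to find the odd word out, and to the best of my understanding the approach `doesnt_match` follows, is to average the unit-normalized vectors and return the word least similar to that average. The sketch below uses a hypothetical function `odd_one_out` and made-up 2-dimensional vectors.

```python
import numpy as np

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the mean of all of them."""
    unit = np.vstack([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    mean /= np.linalg.norm(mean)
    sims = unit @ mean  # cosine similarity of each word to the mean direction
    return words[int(np.argmin(sims))]

# Hypothetical toy vectors: the first axis loosely means "education", the second "food".
toy = {
    "school":    np.array([0.90, 0.20]),
    "professor": np.array([0.80, 0.30]),
    "apple":     np.array([0.10, 0.90]),
    "student":   np.array([0.85, 0.10]),
}

print(odd_one_out("school professor apple student".split(), toy))  # apple
```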

## Mean

In [None]:
vp = wv['school']
vr = wv['professor']
vx = wv['student']
m = (vp + vr + vx) / 3

In [None]:
wv.similar_by_vector(m)

In [None]:
pairs = [
    ('lecturer', 'school'),
    ('lecturer', 'professor'),
    ('lecturer', 'student'),
    ('lecturer', 'teacher'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))

## Context

In [None]:
wv.most_similar('buy')

In [None]:
wv.similarity('buy', 'money')

## Train a custom model

In [5]:
import gensim.models

In [None]:
sentences = _ # assume there's one document per line, tokens separated by whitespace
model = gensim.models.Word2Vec(sentences=sentences)
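
Word2Vec iterates over the corpus more than once (one pass to build the vocabulary, then one per epoch), so `sentences` should be a restartable iterable rather than a one-shot generator. The sketch below shows one way to build such an iterable for the "one document per line" layout described above; the class name `LineCorpus` and the temporary file are assumptions for the demo. Gensim also ships `gensim.models.word2vec.LineSentence`, which serves the same purpose.

```python
import os
import tempfile

class LineCorpus:
    """Restartable iterable: yields one list of tokens per non-empty line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every iteration, so the corpus can be read repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:
                    yield tokens

# Demo on a throwaway file with invented content.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("the cat drinks milk\n\nthe dog drinks water\n")
    sentences = LineCorpus(path)
    first_pass = list(sentences)
    second_pass = list(sentences)  # a second pass yields the same sentences

print(first_pass)
```

An instance of this class could then be passed as the `sentences` argument in the cell above, without loading the whole corpus into memory.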

## Update an existing model

In [6]:
import pymongo
import nltk
from string import punctuation
import copy

In [7]:
MO = gensim.models.Word2Vec.load('/Users/flint/Playground/MeaningSpread/w2v-global.model')

In [20]:
MO.wv.most_similar('pandemic')

[('influenza', 0.7046176195144653),
 ('h1n1', 0.6910238265991211),
 ('outbreak', 0.6785882711410522),
 ('avian', 0.6601735949516296),
 ('flu', 0.6578388214111328),
 ('outbreaks', 0.5860258936882019),
 ('swine', 0.5697133541107178),
 ('pandemics', 0.5689476728439331),
 ('epidemic', 0.552070677280426),
 ('cholera', 0.5491413474082947)]

In [11]:
db = pymongo.MongoClient()['twitter']['tweets']

In [12]:
tweets = list(db.find())

In [13]:
corpus = {tweet['id']: tweet['text'] for tweet in tweets}

In [14]:
def nltk_tokenize(text):
    return [x.lower() for x in nltk.word_tokenize(text) if x not in punctuation]

In [15]:
data = [nltk_tokenize(text) for text in corpus.values()]

In [16]:
M1 = copy.deepcopy(MO)

In [17]:
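# total_examples must count the sentences passed to train(); words missing from MO's vocabulary are ignored
M1.train(data, total_examples=len(data), epochs=MO.epochs)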

(4694460, 6477630)

In [21]:
M1.wv.most_similar('pandemic')

[('h1n1', 0.6190395355224609),
 ('influenza', 0.5918642282485962),
 ('outbreak', 0.5802384614944458),
 ('avian', 0.5463380813598633),
 ('flu', 0.5393428206443787),
 ('swine', 0.4824046492576599),
 ('epidemic', 0.47371941804885864),
 ('outbreaks', 0.46408894658088684),
 ('pandemics', 0.46319717168807983),
 ('h5n1', 0.45786356925964355)]