# Section 2: Working with Word Vectors

A vector is just a list of numbers.

For every word, count how many times it appears in each book. Each word then gets a vector of integers with one entry per book, so the vector length is the number of books.
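
A minimal sketch of this counting scheme, assuming a hypothetical toy corpus in which each "book" is a short string. The names `books` and `word_vectors` are illustrative only.

```python
from collections import Counter

# Hypothetical toy corpus: each "book" is just a short string.
books = [
    "the king rules the land",
    "the queen rules the sea",
    "a dog chased a cat",
]

# Count word occurrences separately for every book.
counts_per_book = [Counter(book.split()) for book in books]

# Vocabulary: every distinct word seen in any book, in sorted order.
vocab = sorted(set(word for counts in counts_per_book for word in counts))

# One vector per word; entry j is how often the word occurs in book j.
# A Counter returns 0 for words it has never seen.
word_vectors = {
    word: [counts[word] for counts in counts_per_book] for word in vocab
}

print(word_vectors["the"])    # [2, 2, 0]
print(word_vectors["rules"])  # [1, 1, 0]
```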

### Counting problems

- Useful in some cases, e.g. simple classifiers with Naive Bayes
- Undesirable properties
    - stopwords ('a', 'the', 'and', ...) can overshadow other important words
    - usually stopwords are removed
- TF-IDF (Term Frequency - Inverse Document Frequency)
    - words that appear in many documents are probably less meaningful
    - \# times a word appears in a document / \# documents the word appears in
    - usually a logarithm is applied to the counts
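
The weighting above can be sketched as follows. This uses one common variant, `tf * log(N / df)`, which is an assumption: libraries differ in smoothing and normalization, and the toy documents are made up.

```python
import math
from collections import Counter

# Made-up toy documents.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears at least once.
df = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Return {word: tf * log(N / df)} for one tokenized document."""
    tf = Counter(tokens)
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

scores = tfidf(tokenized[0])
print(scores["the"])            # 0.0, because "the" occurs in every document
print(round(scores["mat"], 3))  # 1.099, "mat" occurs in only one document
```

A word that appears everywhere gets weight zero, which is the same effect as removing stopwords, but without a hand-written list.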

### Shape of tables

All tables, no matter which model we are working with, have shape V x D:

- V: vocabulary size, the total number of possible words
- D: vector dimensionality (e.g. the number of books in the counting example)

### Word embedding

A word embedding is just a feature vector that represents a word *embedded* in a vector space.

## Word analogies

King - Man ~= Queen - Woman

### Find analogies

Take three of the words, find the fourth: 

King - Man + Woman = ? (expect to find Queen)

Use a vector distance to find the word whose vector is closest to the result.
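
A minimal sketch of this search, assuming hand-made 2-D toy vectors rather than trained embeddings. The helper `find_analogy` and the vectors are hypothetical, and cosine distance is used as the vector distance.

```python
import numpy as np

# Hand-made 2-D toy vectors (assumption: axis 0 ~ "royalty", axis 1 ~ "gender").
embeddings = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cosine_distance(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def find_analogy(w1, w2, w3):
    """Solve w1 - w2 + w3 = ? by nearest neighbor, skipping the input words."""
    target = embeddings[w1] - embeddings[w2] + embeddings[w3]
    candidates = [w for w in embeddings if w not in (w1, w2, w3)]
    return min(candidates, key=lambda w: cosine_distance(target, embeddings[w]))

print(find_analogy("king", "man", "woman"))  # queen
```

The three input words are excluded from the candidates because, with real embeddings, one of them is often the nearest neighbor of the target vector.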

#### Distances

- Euclidean distance: $|a - b| = \sqrt{\sum_i (a_i - b_i)^2}$ (the squared form $|a - b|^2$ gives the same neighbor ranking)
- Cosine distance: $1 - \frac{a^T b}{|a||b|}$
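
Both distances can be written in a few lines of NumPy. This is a sketch with made-up example vectors, intended to show how the two measures differ: Euclidean distance is sensitive to vector length, cosine distance only to direction.

```python
import numpy as np

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def cosine_distance(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([3.0, 0.0])  # same direction as a, different length
c = np.array([0.0, 1.0])  # orthogonal to a

print(euclidean_distance(a, b))  # 2.0
print(cosine_distance(a, b))     # 0.0
print(cosine_distance(a, c))     # 1.0
```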

Analogies usually *emerge* from the training process of models like Word2Vec and GloVe; they do not show up in TF-IDF vectors.

### Dimensionality

The matrices used in NLP have many dimensions, which makes them difficult to plot in a 2D graph.

The `t-SNE` algorithm reduces the dimensionality of the vectors while trying to preserve the structure of the data, mainly which points are close to each other.
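
A minimal sketch of the reduction step, assuming scikit-learn is installed and using random data as a stand-in for a real V x D word matrix. This is not the vendored `tfidf_tsne.py` script, only an illustration of the `TSNE` call.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical stand-in for a V x D word matrix: 60 "words" in 50 dimensions,
# drawn from two well-separated clusters.
cluster_a = rng.normal(loc=0.0, scale=1.0, size=(30, 50))
cluster_b = rng.normal(loc=10.0, scale=1.0, size=(30, 50))
X = np.vstack([cluster_a, cluster_b])

# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
Z = tsne.fit_transform(X)

print(Z.shape)  # (60, 2)
```

Each row of `Z` is a 2D point that can be passed to a scatter plot and annotated with the corresponding word.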

In [6]:
import sys
import os
sys.path.append(os.path.abspath('../vendor/machine_learning_examples'))

In [None]:
%run -i '../vendor/machine_learning_examples/nlp_class2/tfidf_tsne.py'

reading: enwiki-20181120-pages-articles1-15.txt
reading: enwiki-20181120-pages-articles1-07.txt
reading: enwiki-20181120-pages-articles1-20.txt
START inf
END inf
the 289137
of 148617
and 116747
in 105212
to 84993
a 78272
is 42456
as 37557
was 32224
for 30722
by 29774
that 28940
with 27633
on 23601
are 20256
from 19942
it 17543
or 16603
an 15497
which 14990
be 14668
at 14364
were 13018
this 12668
his 12227
has 10347
he 10154
have 10133
also 10109
not 10105
its 9358
their 8793
but 8182
had 8166
other 8089
one 7972
first 7452
they 7415
such 7297
been 6905
most 6535
more 6457
can 6293
some 6210
used 6185
all 5811
new 5791
after 5779
two 5726
into 5594
who 5552
these 5414
there 5377
than 5282
may 5252
many 5207
when 5206
between 5050
during 4882
only 4869
time 4429
over 4245
would 4228
states 4104
while 3928
world 3898
however 3846
years 3801
about 3711
use 3694
known 3567
state 3536
both 3440
system 3421
united 3376
people 3235
including 3215
government 3204
no 3179
under 3172
century 3118

pressure 435
allows 435
declared 435
text 434
ended 434
gdp 434
portuguese 433
reason 433
origin 432
t 432
discovered 432
creation 431
ocean 431
plant 430
account 429
humans 429
mother 429
event 429
road 429
degree 428
1997 428
approach 428
hours 428
claimed 427
disease 427
ad 427
opened 427
9 426
stage 426
paul 426
song 426
blood 425
russia 425
protection 424
income 423
professional 423
1970s 422
starting 422
g 422
stone 422
digital 422
sense 421
growing 421
border 421
indigenous 421
sports 421
wife 420
operation 420
fire 420
earliest 419
cavalry 419
sound 419
nation 419
contain 418
argued 418
records 417
1990s 417
workers 417
models 417
engine 417
married 416
letter 416
birth 415
1980s 415
hall 415
machine 415
go 414
simply 414
collection 413
remain 412
resources 412
whole 412
experience 412
structures 412
represented 411
emperor 411
definition 411
previously 410
regular 410
weapons 410
louis 410
taking 410
spain 410
attack 409
summer 409
plan 409
front 409
completed 409
campaign 408

aspects 264
week 264
vary 264
29 264
authors 264
regarding 264
am 264
command 264
constructed 264
programming 264
iceland 264
jones 264
fuel 264
immune 263
transfer 263
stable 263
1950s 263
accounts 263
wanted 263
movements 263
runs 263
cubs 263
failure 262
rapidly 262
weeks 262
describes 262
similarly 262
format 262
caffeine 262
hands 261
plays 261
songs 261
australian 261
presidential 261
refused 260
portion 260
dates 260
vehicle 260
universities 260
inspired 260
los 260
facilities 259
greatly 259
attempted 259
bring 259
intellectual 259
1979 259
vision 259
professor 259
bowl 259
banks 259
tv 259
contained 258
condition 258
christians 258
kept 258
articles 258
dam 258
1945 258
k 258
traditions 257
solution 257
build 257
forward 257
split 257
me 257
prince 257
brain 257
visible 257
essential 256
entirely 256
extent 256
medium 256
writer 256
owned 256
labor 256
connecticut 256
decline 255
oxygen 255
episode 255
crown 255
improve 255
flight 255
mali 255
usage 254
causing 254
persons 254

## Pretrained word vectors in GloVe

Relatively small vocabulary: 400k words.
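
GloVe vectors are distributed as plain text, one word per line followed by its vector components. A minimal sketch of a loader, using two made-up lines in place of a real file (the function name `load_vectors` is hypothetical):

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout; real files have far more
# numbers per line and one line per vocabulary word.
sample = io.StringIO(
    "king 0.5 0.1 -0.3\n"
    "queen 0.4 0.2 -0.1\n"
)

def load_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a word -> vector dict."""
    word2vec = {}
    for line in lines:
        parts = line.split()
        word2vec[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return word2vec

word2vec = load_vectors(sample)
print(len(word2vec))           # 2
print(word2vec["king"].shape)  # (3,)
```

With a real download, the same function could be given an open file handle instead of the `StringIO` object.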

In [3]:
%run -i '../vendor/machine_learning_examples/nlp_class2/pretrained_glove.py'

Loading word vectors...
Found 400000 word vectors.
king - man = queen - woman
france - paris = britain - london
france - paris = italy - rome
paris - france = rome - italy
france - french = england - english
japan - japanese = china - chinese
japan - japanese = italy - italian
japan - japanese = australia - australian
december - november = july - june
miami - florida = houston - texas
einstein - scientist = matisse - painter
china - rice = chinese - bread
man - woman = he - she
man - woman = uncle - aunt
man - woman = brother - sister
man - woman = friend - wife
man - woman = actor - actress
man - woman = father - mother
heir - heiress = queen - princess
nephew - niece = uncle - aunt
france - paris = japan - tokyo
france - paris = china - beijing
february - january = october - november
france - paris = italy - rome
paris - france = rome - italy
neighbors of: king
	prince
	queen
	ii
	emperor
	son
neighbors of: france
	french
	belgium
	paris
	spain
	netherlands
neighbors of: japan
	japan

## Pretrained word vectors in word2vec

- Very large vocabulary: ~3M words
- Includes some short multi-word phrases, such as "New_York"

In [2]:
%run -i '../vendor/machine_learning_examples/nlp_class2/pretrained_w2v.py'

king - man = queen - woman
france - paris = england - london
france - paris = italy - rome
paris - france = lohan - italy
france - french = england - english
japan - japanese = tibet - chinese
japan - japanese = italy - italian
japan - japanese = queensland - australian
december - november = september - june
miami - florida = dallas - texas
einstein - scientist = jude - painter
china - rice = dinnerware - bread
man - woman = he - she
man - woman = uncle - aunt
man - woman = brother - sister
man - woman = son - wife
man - woman = actor - actress
man - woman = father - mother
heir - heiress = prince - princess
nephew - niece = uncle - aunt
france - paris = japan - tokyo
france - paris = chinese - beijing
february - january = april - november
france - paris = italy - rome
paris - france = lohan - italy
neighbors of: king
	kings
	queen
	monarch
	crown_prince
	prince
	sultan
	ruler
	princes
	Prince_Paras
	throne
neighbors of: france
	spain
	french
	germany
	europe
	italy
	england
	european


## Text classification with word vectors

- Bag of words ignores word order within a sentence, e.g. "dog toy" = "toy dog". TF-IDF is an example of a bag-of-words representation.
- Average all the word vectors in a sentence / document. This yields a single vector per document.
- Feed these vectors into any supervised classification algorithm.

- The classifier's score can also be used to evaluate how good the underlying word embeddings are for the purpose of classifying the documents.
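
The steps above can be sketched as follows. The 2-D embeddings and the training sentences are made up, and a nearest-centroid rule stands in for "any supervised classifier"; it is not the classifier used by `bow_classifier.py`.

```python
import numpy as np

# Hand-made 2-D toy embeddings (assumption; real ones come from GloVe or word2vec).
embeddings = {
    "goal":  np.array([1.0, 0.0]),
    "match": np.array([0.9, 0.1]),
    "team":  np.array([0.8, 0.2]),
    "stock": np.array([0.0, 1.0]),
    "bank":  np.array([0.1, 0.9]),
    "price": np.array([0.2, 0.8]),
}

def doc_vector(text):
    """Average the vectors of all known words; zero vector if none are known."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

train = [("team match goal", "sports"), ("goal team", "sports"),
         ("stock price", "finance"), ("bank stock price", "finance")]

# Nearest-centroid classifier: one mean document vector per class.
centroids = {}
for label in sorted({lab for _, lab in train}):
    class_vecs = [doc_vector(text) for text, lab in train if lab == label]
    centroids[label] = np.mean(class_vecs, axis=0)

def predict(text):
    v = doc_vector(text)
    return min(centroids, key=lambda label: np.linalg.norm(v - centroids[label]))

print(predict("match goal"))      # sports
print(predict("the bank price"))  # finance
# Word order does not matter for an average:
print(np.array_equal(doc_vector("goal team"), doc_vector("team goal")))  # True
```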

In [1]:
%run -i '../vendor/machine_learning_examples/nlp_class2/bow_classifier.py'

Loading word vectors...
Found 400000 word vectors.
Numer of samples with no words found: 0 / 5485
Numer of samples with no words found: 0 / 2189
train score: 0.9992707383773929
test score: 0.9314755596162632


## Using pretrained vectors later

- Recurrent Neural Networks (RNN), as a three-layer network:
  - Embedding layer: contains the word embedding matrix. Input: word index. Output: word vector
  - Recurrent unit: simple unit, GRU, or LSTM
  - Dense layer: maps the recurrent unit's output to one of the output classes
  
- **Recursive** Neural Networks: 
  - Neural network structured as a tree, where each leaf node in the tree represents a word
  - Each word is represented as a vector
  - The vector is obtained by indexing the word embedding matrix
  
Every neural network that receives words starts with a layer that converts each word to its embedding.
- Instead of initializing the matrix randomly, use pretrained word vectors.
- Do not update this layer during the training step, so the pretrained vectors stay fixed:
    `tf.Variable(pretrained_values, trainable=False)`
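
An embedding layer is, at its core, a row lookup in the V x D matrix. A framework-free sketch in NumPy, assuming a hypothetical five-word vocabulary and random numbers in place of real pretrained values; marking the array read-only only mimics the effect of `trainable=False`.

```python
import numpy as np

# Hypothetical vocabulary; index 0 is reserved for unknown words.
vocab = ["<unk>", "king", "queen", "man", "woman"]
word2idx = {w: i for i, w in enumerate(vocab)}

V, D = len(vocab), 3
rng = np.random.default_rng(0)
# Stand-in for pretrained values; a real matrix would be filled from GloVe or word2vec.
pretrained_values = rng.normal(size=(V, D))
pretrained_values.setflags(write=False)  # frozen: no updates allowed

def embed(sentence):
    """Map a sentence to a (T, D) array by indexing rows of the V x D matrix."""
    idx = [word2idx.get(w, word2idx["<unk>"]) for w in sentence.split()]
    return pretrained_values[idx]

x = embed("king queen dragon")
print(x.shape)  # (3, 3)
```

The out-of-vocabulary word "dragon" falls back to the `<unk>` row, which is one common way to handle words missing from the pretrained vocabulary.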