# Section 2: Working with Word Vectors

A vector is just a list of numbers.

For every word, count its appearances in each book of a collection. Each word then has a vector of N integers, where N is the number of books.
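
As a minimal sketch of this counting scheme, assuming a hypothetical toy corpus of three "books" (the `books` dictionary and the helper `count_vector` are made up for illustration), each word gets one count per book:

```python
from collections import Counter

# Hypothetical toy corpus: each "book" is just a short string.
books = {
    "book_a": "the king and the queen",
    "book_b": "the man and the woman",
    "book_c": "the king met the man",
}

# Count word occurrences separately in each book.
counts_per_book = {title: Counter(text.split()) for title, text in books.items()}

def count_vector(word):
    """Return one integer per book: how often `word` appears in it."""
    return [counts_per_book[title][word] for title in books]

print(count_vector("king"))  # [1, 0, 1]
print(count_vector("the"))   # [2, 2, 2]
```

Note how the stopword "the" already has the largest counts, even in this tiny corpus.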

### Counting problems

- Useful in some cases, e.g. simple classifiers with Naive Bayes
- Undesirable properties
    - stopwords ('a', 'the', 'and', ...) can overshadow other important words
    - usually stopwords are removed
- TF-IDF (Term Frequency - Inverse Document Frequency)
    - words that appear in many documents are probably less meaningful
    - \# times a word appears in a document / \# documents the word appears in
    - a logarithm is usually applied to the counts
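
The TF-IDF idea above can be sketched as follows. This is one common variant (raw term count multiplied by `log(N / document frequency)`), not necessarily the exact formula used by the course script; the documents and the helper `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy documents, already lower-cased.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [doc.split() for doc in docs]

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(word, doc_index):
    """Term count in one document, scaled by log(N / document frequency)."""
    if df[word] == 0:
        return 0.0  # word never seen in the corpus
    tf = tokenized[doc_index].count(word)
    idf = math.log(len(docs) / df[word])
    return tf * idf

print(tfidf("the", 0))  # 0.0: "the" is in every document
print(tfidf("mat", 0))  # log(3), about 1.10: "mat" is in one document only
```

The stopword "the" is weighted down to zero without being removed explicitly, because it appears in every document.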

### Shape of tables

All tables, no matter which model we are working with, have shape V x D:

- V: vocabulary size, the total number of possible words
- D: vector dimensionality (e.g. the number of books in the counting example)

### Word embedding

A word embedding is just a feature vector that represents a word *embedded* into a vector space.

## Word analogies

King - Man ~= Queen - Woman

### Find analogies

Take three of the words, find the fourth: 

King - Man + Woman = ? (expect to find Queen)

Use vector distance to find the closest matching word
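
A minimal sketch of this search, assuming hypothetical 2-D embeddings picked by hand so that the analogy holds (real embeddings come from a trained model and have many more dimensions). The helper `find_analogy` is illustrative, not the course's implementation:

```python
import numpy as np

# Hypothetical 2-D embeddings chosen by hand so the analogy holds.
embeddings = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
    "apple": np.array([0.5, -0.9]),
}

def find_analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = embeddings[a] - embeddings[b] + embeddings[c]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return min(candidates, key=lambda w: np.linalg.norm(embeddings[w] - target))

print(find_analogy("king", "man", "woman"))  # queen
```

The three input words are excluded from the candidates because the nearest vector to the target is otherwise often one of the inputs themselves.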

#### Distances

- Euclidean distance: $|a - b|$ (often used squared, $|a - b|^2$, which ranks neighbors identically)
- Cosine distance: $1 - \frac{a^T b}{|a||b|}$
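
Both distances can be written in a few lines of NumPy. This is a sketch with made-up vectors; the function names are illustrative:

```python
import numpy as np

def euclidean_distance(a, b):
    """Length of the difference vector."""
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    """1 - cosine of the angle between a and b; ignores vector length."""
    return float(1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([3.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 1.0])   # orthogonal to a

print(euclidean_distance(a, b))  # 2.0
print(cosine_distance(a, b))     # 0.0
print(cosine_distance(a, c))     # 1.0
```

Cosine distance treats `a` and `b` as identical because they point in the same direction, while Euclidean distance also reacts to their different lengths.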

Word analogies usually *emerge* from the training process of models like Word2Vec and GloVe; they are generally not found in TF-IDF vectors.

### Dimensionality

Matrices used in NLP have many dimensions, which makes them difficult to plot in a 2D graph.

The `t-SNE` algorithm reduces the dimensionality of the vectors while trying to preserve the local structure of the data.
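
A minimal sketch of such a reduction, assuming scikit-learn is installed and using random numbers as a stand-in for a real word-vector table (the course script may configure t-SNE differently):

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for a word-vector table: 30 "words", 50 dimensions.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(30, 50))

# perplexity must be smaller than the number of rows.
tsne = TSNE(n_components=2, perplexity=5, init="random", random_state=0)
points_2d = tsne.fit_transform(vectors)

print(points_2d.shape)  # (30, 2)
```

Each row of `points_2d` is an (x, y) position that can be drawn in a scatter plot and labeled with its word.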

In [1]:
import sys
import os
sys.path.append(os.path.abspath('../vendor/machine_learning_examples'))

In [2]:
import os
print(os.path.exists('../vendor/machine_learning_examples/nlp_class2/large_files'))

True


In [None]:
%run -i '../vendor/machine_learning_examples/nlp_class2/tfidf_tsne.py'

reading: enwiki-20181120-pages-articles1-30.txt
reading: enwiki-20181120-pages-articles1-13.txt
reading: enwiki-20181120-pages-articles1-31.txt
START inf
END inf
the 305593
of 150481
and 119801
in 110170
to 92204
a 85766
is 41697
as 38703
was 37531
for 32933
by 31518
that 31340
with 29561
on 27286
from 21104
it 18696
are 18522
his 16984
or 16662
an 15956
be 15641
at 15591
which 14977
he 14283
were 14257
this 13837
not 10823
also 10759
have 10130
had 10055
their 10051
but 9798
has 9562
its 9129
one 8399
first 8390
other 8046
they 7745
been 7356
such 7010
after 6930
who 6881
more 6608
new 6460
all 6271
most 6184
used 6162
some 6155
two 6117
into 6107
can 6075
when 5956
time 5843
would 5496
only 5358
there 5301
many 5175
than 5169
during 5151
these 5114
may 5110
between 4699
over 4423
while 4165
about 3944
world 3944
however 3853
states 3741
years 3702
known 3621
game 3614
no 3599
later 3567
people 3549
then 3547
both 3478
up 3474
system 3457
use 3427
war 3412
being 3354
where 3317
three 

organizations 353
sign 353
expected 353
authors 353
tour 353
launch 353
tardis 353
moon 352
dallas 352
professional 351
cast 350
movie 350
contemporary 350
room 350
applications 350
arts 350
friends 349
zone 349
universe 348
machines 348
occurs 347
freedom 347
electric 347
going 347
democratic 347
kings 347
deep 346
hard 346
instance 346
attention 346
unlike 346
officers 346
ruled 346
germanic 346
cut 345
arms 345
convention 345
gun 344
sequence 344
ranked 344
recently 344
iron 344
child 344
paris 344
efforts 344
told 343
require 343
spread 342
soul 342
historians 342
19 342
notes 341
kind 341
increasingly 341
1989 341
know 340
saying 340
prevent 340
d 340
daughter 340
poland 340
gdp 340
weeks 339
sir 339
note 339
equivalent 339
scene 338
southeast 338
safety 338
effective 338
week 338
paper 337
helped 337
executive 337
materials 337
smith 337
valley 336
leaders 335
volume 335
contact 334
secret 334
growing 334
digital 333
depending 333
map 333
hungary 333
techniques 332
historian 332


## Pretrained word vectors in GloVe

Comparatively small vocabulary: 400k words.

In [7]:
%run -i '../vendor/machine_learning_examples/nlp_class2/pretrained_glove.py'

Loading word vectors...
Found 400000 word vectors.
king - man = queen - woman
france - paris = britain - london
france - paris = italy - rome
paris - france = rome - italy
france - french = england - english
japan - japanese = china - chinese
japan - japanese = italy - italian
japan - japanese = australia - australian
december - november = july - june
miami - florida = houston - texas
einstein - scientist = matisse - painter
china - rice = chinese - bread
man - woman = he - she
man - woman = uncle - aunt
man - woman = brother - sister
man - woman = friend - wife
man - woman = actor - actress
man - woman = father - mother
heir - heiress = queen - princess
nephew - niece = uncle - aunt
france - paris = japan - tokyo
france - paris = china - beijing
february - january = october - november
france - paris = italy - rome
paris - france = rome - italy
neighbors of: king
	prince
	queen
	ii
	emperor
	son
neighbors of: france
	french
	belgium
	paris
	spain
	netherlands
neighbors of: japan
	japan

## Pretrained word vectors in word2vec

- Very large vocabulary: ~3M words
- Includes some short phrases (multiword tokens) such as "New_York"

In [3]:
%run -i '../vendor/machine_learning_examples/nlp_class2/pretrained_w2v.py'

FileNotFoundError: [Errno 2] No such file or directory: '../large_files/GoogleNews-vectors-negative300.bin'

## Text classification with word vectors

- A bag of words does not consider word order in a sentence, e.g. "dog toy" = "toy dog". TF-IDF is an example of a bag-of-words representation.
- Average all the word vectors in a sentence / document. This yields a single vector per document.
- Feed these vectors into any supervised classification algorithm.

- Evaluating the classification model also serves as an evaluation of how good the underlying word embeddings are for the purpose of classifying the documents.
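
The averaging step can be sketched as follows, assuming hypothetical 3-D word vectors and an illustrative helper `document_vector` (real vectors would come from GloVe or word2vec):

```python
import numpy as np

# Hypothetical 3-D word vectors; real ones would come from GloVe or word2vec.
word_vectors = {
    "dog":   np.array([1.0, 0.0, 0.0]),
    "toy":   np.array([0.0, 1.0, 0.0]),
    "stock": np.array([0.0, 0.0, 1.0]),
}

def document_vector(text, dim=3):
    """Average the vectors of all known words; zeros if none are known."""
    known = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Word order is lost: both phrases map to the same vector.
same = np.array_equal(document_vector("dog toy"), document_vector("toy dog"))
print(same)  # True
```

Stacking one such vector per document gives a feature matrix with one row per document, which is the input shape most supervised classifiers expect.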

In [4]:
%run -i '../vendor/machine_learning_examples/nlp_class2/bow_classifier.py'

Loading word vectors...


UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3134: ordinal not in range(128)
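
Byte `0xe2` is the lead byte of many multi-byte UTF-8 characters, so this error is consistent with a UTF-8 text file being decoded with the ASCII codec. A minimal sketch of reading GloVe-style lines (a word followed by space-separated floats) with an explicit encoding, assuming Python 3, a hypothetical helper `load_vectors`, and a throwaway temporary file rather than the course's actual loader:

```python
import os
import tempfile

def load_vectors(path):
    """Parse lines of the form 'word v1 v2 ...' into a dict of float lists."""
    vectors = {}
    # Explicit UTF-8 avoids depending on the platform's default codec.
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Write a tiny file that contains a non-ASCII word, then read it back.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write("king 0.1 0.2\n")
    tmp.write("caf\u00e9 0.3 0.4\n")
    tmp_path = tmp.name

vectors = load_vectors(tmp_path)
os.remove(tmp_path)
print(vectors["caf\u00e9"])  # [0.3, 0.4]
```

Whether the failing read in `bow_classifier.py` is the vectors file or the dataset is not visible from the traceback shown here, so the same explicit `encoding` argument may be needed in more than one `open` call.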