# Semantics and Word Vectors
[Wikipedia](https://en.wikipedia.org/wiki/Sentiment_analysis) defines ***sentiment analysis***, sometimes called "opinion mining", as
<div class="alert alert-success" style="margin: 20px">"the use of natural language processing ... to systematically identify, extract, quantify, and study affective states and subjective information.<br>
Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic or the overall contextual polarity or emotional reaction to a document, interaction, or event."</div>

Up to now we've used the occurrence of specific words and word patterns to perform text classification. In this section we'll take machine learning even further, and try to extract intended meanings from complex phrases. Some simple examples include:
* Python is relatively easy to learn.
* That was the worst movie I've ever seen.

However, things get harder with phrases like:
* I do not dislike green eggs and ham. (requires negation handling)

This is done with machine learning algorithms like [word2vec](https://en.wikipedia.org/wiki/Word2vec). The idea is to create numerical arrays, or *word embeddings*, for every word in a large corpus. Each word is assigned its own vector in such a way that words that frequently appear in the same context are given vectors that are close together. The result is a model that may not know that a "lion" is an animal, but does know that "lion" is closer in context to "cat" than to "dandelion".

It is important to note that *building* useful models takes a long time (hours or days to train on a large corpus), so for our purposes it is best to import an existing model rather than train our own.



## Word Vectors
Word2Vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors for the words in that corpus. The purpose and usefulness of Word2Vec is to group the vectors of similar words together in vector space. That is, it detects similarities mathematically.
    
Word2Vec creates vectors that are distributed numerical representations of word features, such as the context of individual words. It does so without human intervention. Given enough data, usage and contexts, Word2Vec can make highly accurate guesses about a word's meaning based on past appearances. Those guesses can be used to establish a word's association with other words (e.g. "man" is to "boy" what "woman" is to "girl").

Word2Vec trains words against other words that neighbor them in the input corpus. It does so in one of two ways: either using context to predict a target word (a method known as continuous bag of words, or CBOW), or using a word to predict a target context, which is called skip-gram.
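
To make the difference between the two training setups concrete, here is a minimal sketch of how (input, label) pairs could be generated from a tokenized sentence. The function names and the `window` parameter are our own illustrative choices, not part of Word2Vec or spaCy, and no neural network is trained here.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for each position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


def cbow_examples(tokens, window=2):
    """CBOW: the context words are the input, the target word is the label."""
    return [(context, target) for target, context in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    """Skip-gram: the target word is the input, each context word is a label."""
    return [(target, c)
            for target, context in context_windows(tokens, window)
            for c in context]


tokens = "the quick brown fox jumped".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)
print(cbow[1])       # (['the', 'brown'], 'quick')
print(skipgram[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

CBOW produces one example per position, while skip-gram produces one example per (target, neighbor) pair.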
    
Recall that each word is now represented by a vector. In spaCy each of these vectors has 300 dimensions. This means we can apply cosine similarity to measure how similar word vectors are to each other. It also means we can perform vector arithmetic with the word vectors: **new_vector = king - man + woman**. This creates a new vector (not directly associated with a word), and we can then look for the existing vectors most similar to it: **new_vector is closest to the vector for queen**.

## Vector values
So what does a word vector look like? Since spaCy employs 300 dimensions, word vectors are stored as 300-item arrays.

In [68]:
# import spacy and load the language library
import spacy
nlp=spacy.load('en_core_web_lg')

In [45]:
nlp(u"Lion").vector

array([ 1.8963e-01, -4.0309e-01,  3.5350e-01, -4.7907e-01, -4.3311e-01,
        2.3857e-01,  2.6962e-01,  6.4332e-02,  3.0767e-01,  1.3712e+00,
       -3.7582e-01, -2.2713e-01, -3.5657e-01, -2.5355e-01,  1.7543e-02,
        3.3962e-01,  7.4723e-02,  5.1226e-01, -3.9759e-01,  5.1333e-03,
       -3.0929e-01,  4.8911e-02, -1.8610e-01, -4.1702e-01, -8.1639e-01,
       -1.6908e-01, -2.6246e-01, -1.5983e-02,  1.2479e-01, -3.7276e-02,
       -5.7125e-01, -1.6296e-01,  1.2376e-01, -5.5464e-02,  1.3244e-01,
        2.7519e-02,  1.2592e-01, -3.2722e-01, -4.9165e-01, -3.5559e-01,
       -3.0630e-01,  6.1185e-02, -1.6932e-01, -6.2405e-02,  6.5763e-01,
       -2.7925e-01, -3.0450e-03, -2.2400e-02, -2.8015e-01, -2.1975e-01,
       -4.3188e-01,  3.9864e-02, -2.2102e-01, -4.2693e-02,  5.2748e-02,
        2.8726e-01,  1.2315e-01, -2.8662e-02,  7.8294e-02,  4.6754e-01,
       -2.4589e-01, -1.1064e-01,  7.2250e-02, -9.4980e-02, -2.7548e-01,
       -5.4097e-01,  1.2823e-01, -8.2408e-02,  3.1035e-01, -6.33

In [46]:
nlp(u"Lion").vector.shape

(300,)

What's interesting is that Doc and Span objects themselves have vectors, derived from the averages of individual token vectors. <br>This makes it possible to compare similarities between whole documents.

In [47]:
doc=nlp(u"The quick brown fox jumped over a lazy dog")
doc.vector

array([-2.29145795e-01, -3.46055627e-03, -5.60528897e-02,  7.85977766e-02,
        9.12922155e-03,  1.10382795e-01, -1.26714230e-01, -7.90637732e-02,
        1.31968096e-01,  1.86862671e+00, -2.59036660e-01, -9.45902020e-02,
       -1.29910111e-01, -1.72616780e-01, -2.08401456e-01,  1.87006649e-02,
        8.01126659e-02,  9.91218865e-01,  2.73393337e-02, -2.94356763e-01,
       -1.45484447e-01, -9.41558629e-02, -5.07920086e-02, -1.69811353e-01,
        1.41727030e-01, -8.95571038e-02, -1.84179127e-01, -1.76457226e-01,
        1.65122882e-01, -2.20417902e-01, -1.91546515e-01,  2.51313895e-01,
        6.73556626e-02, -5.30913323e-02,  1.28895223e-01, -5.74297756e-02,
        7.14288801e-02, -1.10088736e-01, -8.49754438e-02, -1.26965329e-01,
        2.06004441e-01,  7.04980046e-02, -6.15093037e-02, -1.66130662e-01,
        1.48633450e-01,  4.71172146e-02, -1.81588233e-01,  5.00197560e-02,
        1.43082336e-01,  2.85028890e-02, -2.06958458e-01,  2.00484216e-01,
        5.18219313e-04,  

In [49]:
doc.vector.shape

(300,)
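
The averaging described above can be sketched without spaCy. The three-dimensional vectors below are made-up toy values (real vectors here have 300 dimensions), and `mean_vector` is a hypothetical helper of our own, not a spaCy function.

```python
def mean_vector(vectors):
    """Return the element-wise average of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    # zip(*vectors) groups the i-th component of every vector together
    return [sum(components) / n for components in zip(*vectors)]


# Hypothetical toy token vectors, invented for illustration only
toy_token_vectors = {
    "quick": [1.0, 0.0, 2.0],
    "brown": [3.0, 2.0, 0.0],
    "fox": [2.0, 4.0, 1.0],
}
toy_doc_vector = mean_vector(list(toy_token_vectors.values()))
print(toy_doc_vector)  # [2.0, 2.0, 1.0]
```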

## Identifying similar vectors
The best way to expose vector relationships is through the `.similarity()` method of Doc tokens.

In [53]:
doc=nlp(u"lion cat pet")
for token in doc:
    for token1 in doc:
        print(token.text,token1.text,token.similarity(token1))

lion lion 1.0
lion cat 0.5265438
lion pet 0.39923766
cat lion 0.5265438
cat cat 1.0
cat pet 0.7505457
pet lion 0.39923766
pet cat 0.7505457
pet pet 1.0
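
By default, spaCy estimates these scores with cosine similarity. As a rough sketch of what that measure computes, here is a pure-Python version on toy two-dimensional vectors. Returning 0.0 for a zero-length vector is our own assumption for this sketch, not necessarily what spaCy does.

```python
import math


def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: treat a zero vector as similar to nothing
    return dot / (norm1 * norm2)


same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([3.0, 4.0], [-3.0, -4.0])
print(same_direction, orthogonal, opposite)  # 1.0 0.0 -1.0
```

Because only the angle matters, a vector and a scaled copy of itself score 1.0, which is why every word above has a similarity of 1.0 with itself.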


### Opposites are not necessarily different
Words that have opposite meanings but often appear in the same *context* may have similar vectors.

In [54]:
doc1=nlp(u"like love hate")
for token in doc1:
    for token1 in doc1:
        print(token.text,token1.text,token.similarity(token1))

like like 1.0
like love 0.657904
like hate 0.65746516
love like 0.657904
love love 1.0
love hate 0.63930994
hate like 0.65746516
hate love 0.63930994
hate hate 1.0


## Vector norms
It's sometimes helpful to aggregate 300 dimensions into a [Euclidean (L2) norm](https://en.wikipedia.org/wiki/Norm_%28mathematics%29#Euclidean_norm), computed as the square root of the sum of the squared vector components. This is accessible as the `.vector_norm` token attribute.

In [59]:
tokens=nlp(u"man child Aryan")

In [60]:
for token in tokens:
    print(token.text,token.has_vector,token.vector_norm,token.is_oov)

man True 6.352939 False
child True 6.831789 False
Aryan True 7.0427647 False


In [61]:
tokens=nlp(u"man child Mingisa")
for token in tokens:
    print(token.text,token.has_vector,token.vector_norm,token.is_oov)

man True 6.352939 False
child True 6.831789 False
Mingisa False 0.0 True


Indeed we see that "Mingisa" does not have a vector, so its `vector_norm` value is zero, and it is flagged as *out of vocabulary*.
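
For reference, the L2 norm itself is simple to compute by hand. The sketch below uses a hypothetical helper `l2_norm` on toy values. It also shows why an all-zero vector, which is what a token without a vector effectively contributes, has a norm of zero.

```python
import math


def l2_norm(vector):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


print(l2_norm([3.0, 4.0]))    # 5.0
print(l2_norm([0.0] * 300))   # 0.0
```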

## Vector arithmetic
Believe it or not, we can actually calculate new vectors by adding & subtracting related vectors. A famous example suggests
<pre>"queen" - "woman" + "man" = "king"</pre>
Let's try it out!

In [74]:
from scipy import spatial
cosine_similarity=lambda vec1,vec2 : 1- spatial.distance.cosine(vec1,vec2)

queen=nlp.vocab["queen"].vector
man=nlp.vocab["man"].vector
woman=nlp.vocab["woman"].vector

new_vector=queen - woman + man
computed_similarities=[]

# Score every lowercase, alphabetic vocab entry that has a vector
for word in nlp.vocab:
    if word.has_vector and word.is_lower and word.is_alpha:
        similarity = cosine_similarity(new_vector, word.vector)
        computed_similarities.append((word, similarity))
# Sort by descending similarity and show the ten closest words
computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])
print([w[0].text for w in computed_similarities[:10]])
            

['queen', 'king', 'man', 'he', 'cuz', 'let', 'u', 'nothin', 'lovin', 'nuff']
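
Notice that the top result is "queen" itself, one of the input words. A common refinement is to exclude the input words from the ranking. The sketch below illustrates this on a tiny vocabulary of made-up three-dimensional vectors. The values, the `most_similar` helper and its `exclude` parameter are all hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)


def most_similar(query, vectors, exclude=(), top_n=3):
    """Rank words by cosine similarity to query, skipping excluded words."""
    scored = [(word, cosine_similarity(query, vec))
              for word, vec in vectors.items() if word not in exclude]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [word for word, _ in scored[:top_n]]


# Made-up toy vectors, not taken from any trained model
toy_vectors = {
    "king": [0.9, 0.7, 0.2],
    "queen": [0.9, 0.1, 0.2],
    "man": [0.2, 0.3, 0.9],
    "woman": [0.2, 0.1, 0.9],
    "apple": [0.0, 0.0, 1.0],
}
new_vector = [q - w + m for q, w, m in zip(
    toy_vectors["queen"], toy_vectors["woman"], toy_vectors["man"])]

ranked_all = most_similar(new_vector, toy_vectors)
ranked_filtered = most_similar(
    new_vector, toy_vectors, exclude={"queen", "woman", "man"})
print(ranked_all)       # ['queen', 'king', 'man']
print(ranked_filtered)  # ['king', 'apple']
```

With the inputs filtered out, "king" moves to the top of the toy ranking, which is the result the analogy is meant to suggest.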
