<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 1 Lesson 2*

As we learned yesterday, machines cannot interpret raw text. We need to transform that text into something that we and our machines can more readily analyze. Yesterday, we used simple word counts to summarize the content of Amazon reviews. Today, we'll extend those concepts to vector representations such as Bag of Words (BoW) and word embedding models. We'll use those representations for search and visualization, and to prepare for our classification day tomorrow. 

Processing text data to prepare it for machine learning models often means translating the information from documents into a numerical format. Bag-of-Words approaches (sometimes referred to as frequency-based word embeddings) accomplish this by "vectorizing" tokenized documents. Each document becomes a row in a dataframe, and each unique word in the corpus (the group of documents) becomes a column. Each cell then holds either a raw count of how many times that word appears in that document (`CountVectorizer`) or that word's TF-IDF score (`TfidfVectorizer`).
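
To make the row-per-document, column-per-word idea concrete, here is a minimal sketch that builds a count matrix by hand. The two toy documents are made up for illustration and are not part of the BBC data; `CountVectorizer` does the same job with more careful tokenization.

```python
from collections import Counter

# Toy corpus (hypothetical; not the BBC data)
docs = [
    "the cat sat on the mat",
    "the dog sat",
]

# Tokenize on whitespace and build a sorted vocabulary: one column per unique word
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one raw count per vocabulary word
dtm = [[Counter(tokens)[word] for word in vocab] for tokens in tokenized]

print(vocab)
for row in dtm:
    print(row)
```

Each row is the vector for one document, and the position of each number is fixed by the shared vocabulary.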

On the Python side, we will be focusing on `sklearn` and `spacy` today.  

## Case Study

We're going to pretend we're on the data science team at the BBC. We want to recommend articles to visitors on the BBC website based on the article they just read. 

## Learning Objectives
* <a href="#p1">Part 1</a>: Represent a document as a vector
* <a href="#p2">Part 2</a>: Query Documents by Similarity
* <a href="#p3">Part 3</a>: Apply word embedding models to create document vectors

In [1]:
""" Import Statements """

# Classics
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import PCA

import spacy
nlp = spacy.load("en_core_web_lg")

**Warm Up (_3 Minutes_)**

Extract the tokens from this sentence using Spacy. Text is from [OpenAI](https://openai.com/blog/better-language-models/)

In [2]:
text = "We created a new dataset which emphasizes diversity of content, by scraping content from the Internet. In order to preserve document quality, we used only pages which have been curated/filtered by humans—specifically, we used outbound links from Reddit which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting (whether educational or funny), leading to higher data quality than other similar datasets, such as CommonCrawl."

In [3]:
import spacy
from spacy.tokenizer import Tokenizer

# nlp = spacy.load("en_core_web_sm")

# Tokenizer
tokenizer = Tokenizer(nlp.vocab)

def tokenize(text):
    doc_tokens = [token.text for token in tokenizer(text)]    
    return doc_tokens

print(tokenize(text))

['We', 'created', 'a', 'new', 'dataset', 'which', 'emphasizes', 'diversity', 'of', 'content,', 'by', 'scraping', 'content', 'from', 'the', 'Internet.', 'In', 'order', 'to', 'preserve', 'document', 'quality,', 'we', 'used', 'only', 'pages', 'which', 'have', 'been', 'curated/filtered', 'by', 'humans—specifically,', 'we', 'used', 'outbound', 'links', 'from', 'Reddit', 'which', 'received', 'at', 'least', '3', 'karma.', 'This', 'can', 'be', 'thought', 'of', 'as', 'a', 'heuristic', 'indicator', 'for', 'whether', 'other', 'users', 'found', 'the', 'link', 'interesting', '(whether', 'educational', 'or', 'funny),', 'leading', 'to', 'higher', 'data', 'quality', 'than', 'other', 'similar', 'datasets,', 'such', 'as', 'CommonCrawl.']


In [4]:
doc = nlp(text)

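# Note: token.pos is an integer ID, so comparing it to the string 'PRON' is always True
# and pronouns are not filtered (see '-PRON-' in the output). Use token.pos_ != 'PRON' to drop them.
print([token.lemma_ for token in doc if (token.is_punct != True) and (token.pos != 'PRON')])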

['-PRON-', 'create', 'a', 'new', 'dataset', 'which', 'emphasize', 'diversity', 'of', 'content', 'by', 'scrape', 'content', 'from', 'the', 'internet', 'in', 'order', 'to', 'preserve', 'document', 'quality', '-PRON-', 'use', 'only', 'page', 'which', 'have', 'be', 'curate', 'filter', 'by', 'human', 'specifically', '-PRON-', 'use', 'outbound', 'link', 'from', 'Reddit', 'which', 'receive', 'at', 'least', '3', 'karma', 'this', 'can', 'be', 'think', 'of', 'as', 'a', 'heuristic', 'indicator', 'for', 'whether', 'other', 'user', 'find', 'the', 'link', 'interesting', 'whether', 'educational', 'or', 'funny', 'lead', 'to', 'high', 'datum', 'quality', 'than', 'other', 'similar', 'dataset', 'such', 'as', 'CommonCrawl']


In [5]:
%pwd

'/home/nedderlander/github/DS-Unit-4-Sprint-1-NLP/module2-vector-representations'

In [47]:
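# Build a generator that yields one BBC article at a time
import os 

def gather_data_generator(filefolder):
    
    files = os.listdir(filefolder)
    
    for article in files:
        
        path = os.path.join(filefolder, article)
        
        if path.endswith('.txt'):
            with open(path, 'rb') as f:
                # yield each article lazily instead of collecting everything in a list
                yield f.read()

    
# Sketch: lazily tokenize each article produced by the generator above

def extract_tokens(generator):
    
    for g in generator:
        # articles are read as bytes, so decode before tokenizing
        tokens = tokenize(g.decode('utf-8', errors='ignore'))
        yield tokens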

In [50]:
def gather_data(filefolder):
    
    _list = []
    
    files = os.listdir(filefolder)
    
    for article in files:
        
        path = os.path.join(filefolder, article)
        
        if path[-3:] == 'txt':
            with open(path, 'rb') as f:
                _list.append(f.read())  
    
    return _list

In [51]:
data = (gather_data('./data'))

In [53]:
# a generator yields articles one at a time without reading them all into memory
# here we use the list version, which keeps everything in memory so we can index into it

data[3][:50]

b'A decade of good website design\r\n\r\nThe web looks v'

In [None]:
# print contents of sub directory
# print(os.listdir('./data'))

files = os.listdir('./data')

with open(os.path.join('./data', files[0]), 'rb') as f:
    print(type(f.read()))


## Represent a document as a vector
<a id="p1"></a>

In this section, we are going to create Document Term Matrices (DTM). Each column represents a word. Each row represents a document. The value in each cell can be a range of different things. The most traditional are: counts of appearances of each word, whether the word appears at all (binary), and term frequency-inverse document frequency (TF-IDF). 
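
As a quick comparison of the three cell types, the sketch below fits the same two made-up documents with raw counts, binary indicators, and TF-IDF weights. The toy documents and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy documents
toy_docs = ["good good movie", "bad movie"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_docs).toarray()

# binary=True records only whether a word appears at all
binary = CountVectorizer(binary=True).fit_transform(toy_docs).toarray()

# TF-IDF weights; by default each row is scaled to unit (L2) length
weights = TfidfVectorizer().fit_transform(toy_docs).toarray()

# Column order follows the learned vocabulary indices
columns = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(columns)
print(counts)
print(binary)
print(weights.round(3))
```

All three matrices have the same shape (documents by vocabulary words); only the meaning of each cell changes.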

**Discussion:** Don't we lose all the context and grammar if we do this? So why does it work?


### CountVectorizer

In [66]:
from sklearn.feature_extraction.text import CountVectorizer

# list of text documents
text = ["We created a new dataset which emphasizes diversity of content, by scraping content from the Internet."," In order to preserve document quality, we used only pages which have been curated/filtered by humans—specifically, we used outbound links from Reddit which received at least 3 karma."," This can be thought of as a heuristic indicator for whether other users found the link interesting (whether educational or funny), leading to higher data quality than other similar datasets, such as CommonCrawl."]

# create the transform
vectorizer = CountVectorizer(stop_words='english')

# tokenize and build vocab
vectorizer.fit(data)


# Create a Vocabulary
# The vocabulary establishes all of the possible words that we might use.
vectorizer.vocabulary_

# The vocabulary dictionary does not represent the counts of words!!
dtm = vectorizer.transform(data) #document term matrix

In [67]:
dtm.todense()

matrix([[0, 1, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 1, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 1, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In [68]:
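# In scikit-learn >= 1.0, use vectorizer.get_feature_names_out() (get_feature_names was removed in 1.2)
vectorizer.get_feature_names()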

['00',
 '000',
 '000s',
 '0051',
 '007',
 '01',
 '028',
 '04m',
 '05',
 '0530',
 '056',
 '0630',
 '080',
 '0800',
 '0870',
 '10',
 '100',
 '1000',
 '100m',
 '100s',
 '101',
 '102',
 '104',
 '106',
 '106cm',
 '1080',
 '10cm',
 '10m',
 '10s',
 '10th',
 '10x7in',
 '11',
 '110',
 '115',
 '117',
 '11b',
 '11m',
 '12',
 '120',
 '120bn',
 '120gb',
 '125',
 '128',
 '129',
 '12cm',
 '13',
 '130',
 '130cm',
 '132',
 '133',
 '133m',
 '135m',
 '137',
 '139',
 '13th',
 '14',
 '145',
 '149',
 '14m',
 '14mbps',
 '15',
 '150',
 '152m',
 '159',
 '15m',
 '15mb',
 '15mbps',
 '15th',
 '16',
 '160gb',
 '167',
 '16bn',
 '16k',
 '17',
 '170',
 '171',
 '1731',
 '1761',
 '178',
 '178m',
 '17m',
 '17th',
 '18',
 '180',
 '188',
 '1891',
 '18m',
 '18mbps',
 '18s',
 '18th',
 '19',
 '190',
 '1900',
 '191',
 '1911',
 '1955',
 '1956',
 '196',
 '1964',
 '1970s',
 '1973',
 '1974',
 '1976',
 '1978',
 '1979',
 '1980',
 '1980s',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1987',
 '1989',
 '1990s',
 '1991',
 '1992',
 '1

In [69]:
# Get Word Counts for each document

dtm_df = pd.DataFrame(dtm.todense(), columns = vectorizer.get_feature_names())

dtm_df.head()

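|   | 00 | 000 | 000s | 0051 | 007 | 01 | 028 | 04m | 05 | 0530 | ... | zip | zodiac | zombie | zombies | zone | zonealarm | zones | zoom | zooms | zurich |
|---|----|-----|------|------|-----|----|-----|-----|----|------|-----|-----|--------|--------|---------|------|-----------|-------|------|-------|--------|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |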


In [73]:
dtm_df.shape

(401, 11820)

In [62]:
# summarize encoded vector

In [None]:
# Apply CountVectorizer to our Data





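# Use a custom spaCy tokenizer

def spacy_tokenizer(doc):
    # the callable receives one document (a string) and must return a list of tokens
    return [token.lemma_ for token in nlp(doc) if not token.is_punct and not token.is_space]

vectorizer = CountVectorizer(stop_words='english', tokenizer=spacy_tokenizer) 
# this is a sketch: call vectorizer.fit_transform(data) to build the DTM (slow on large corpora)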




### TfidfVectorizer

#### Term Frequency - Inverse Document Frequency (TF-IDF)

<center><img src="https://mungingdata.files.wordpress.com/2017/11/equation.png?w=430&h=336" width="300"></center>

Term Frequency: The share of a document's words accounted for by a given word (its count divided by the document's length).

Inverse Document Frequency: A penalty for the word appearing in a high number of documents.

The purpose of TF-IDF is to find what is **unique** to each document. To do this, we penalize the term frequencies of words that are common across all documents, which lets the terms that most distinguish each document rise to the top.
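
A small worked example may help here. The sketch below uses the plain textbook form, tf × log(N / df), on three made-up documents; the helper name `tf_idf` and the toy corpus are assumptions. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy corpus
docs = [
    "broadband speed in the uk".split(),
    "mobile phone speed in the uk".split(),
    "the film won an award".split(),
]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = Counter(word for doc in docs for word in set(doc))

def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)        # share of the document taken by this word
    idf = math.log(n_docs / df[word])      # zero when the word is in every document
    return tf * idf

scores = {word: round(tf_idf(word, docs[0]), 3) for word in docs[0]}
print(scores)
```

"the" appears in every document, so its score is zero, while "broadband" appears only in the first document and gets the highest score there.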

In [115]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate vectorizer object
tfidf = TfidfVectorizer(stop_words = 'english', max_features = 5000, min_df = .05)
# min_df: a term must appear in at least this many documents (int, e.g. 5) or this fraction of documents (float, e.g. 0.05 = 5%)

# Create a vocabulary and get word counts per document
dtm = tfidf.fit_transform(data)
#use this on a sample of docs all in memory to train model (no generators)

#can create a generator pipeline to handle new data and make predictions
# dtm = tfidf.transform(new_data)

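# Get feature names to use as dataframe column headers
# (use get_feature_names_out() in scikit-learn >= 1.0)
feature_names = tfidf.get_feature_names()

# View Feature Matrix (TF-IDF scores) as DataFrame
tfidf_df = pd.DataFrame(dtm.todense(), columns=feature_names)
tfidf_df.head()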


In [None]:
# Tuning Parameters


## Query Documents by Similarity
<a id="p2"></a>

### Cosine Similarity (Brute Force)
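
Cosine similarity measures the angle between two vectors and ignores their length, so a long article and a short one on the same topic can still score close to 1. A minimal sketch with made-up 3-term vectors (the `cosine_sim` helper is hypothetical; `sklearn.metrics.pairwise.cosine_similarity` computes this for every pair of rows at once):

```python
import numpy as np

def cosine_sim(a, b):
    # dot product divided by the product of the vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = np.array([2.0, 1.0, 0.0])
doc_b = np.array([4.0, 2.0, 0.0])   # same direction as doc_a, twice as long
doc_c = np.array([0.0, 0.0, 3.0])   # shares no terms with doc_a

print(cosine_sim(doc_a, doc_b))
print(cosine_sim(doc_a, doc_c))
```

"Brute force" here means computing this score for every pair of documents, which grows quadratically with the number of documents.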

In [70]:
# Calculate cosine similarity between the document vectors in dtm (higher = more similar)

from sklearn.metrics.pairwise import cosine_similarity

dist_matrix = cosine_similarity(dtm)

In [71]:
# Turn it into a DataFrame
dist_df = pd.DataFrame(dist_matrix)

In [84]:
dist_df.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 391 | 392 | 393 | 394 | 395 | 396 | 397 | 398 | 399 | 400 |
|---|---|---|---|---|---|---|---|---|---|---|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 1.0 | 0.175063 | 0.031822 | 0.149486 | 0.076234 | 0.095939 | 0.110547 | 0.027691 | 0.222113 | 0.031822 | ... | 0.140452 | 0.080911 | 0.107702 | 0.053819 | 0.175142 | 0.11608 | 0.088611 | 0.028932 | 0.152952 | 0.053792 |
| 1 | 0.175063 | 1.0 | 0.077464 | 0.255899 | 0.1519 | 0.153011 | 0.354004 | 0.043334 | 0.417465 | 0.077464 | ... | 0.196538 | 0.14951 | 0.179127 | 0.038017 | 0.998663 | 0.168985 | 0.168154 | 0.093611 | 0.214897 | 0.123356 |
| 2 | 0.031822 | 0.077464 | 1.0 | 0.139035 | 0.047548 | 0.022439 | 0.045628 | 0.106282 | 0.140457 | 1.0 | ... | 0.074605 | 0.081104 | 0.13312 | 0.078673 | 0.077499 | 0.033388 | 0.025587 | 0.002654 | 0.159593 | 0.061995 |
| 3 | 0.149486 | 0.255899 | 0.139035 | 1.0 | 0.121694 | 0.060103 | 0.09777 | 0.074727 | 0.382392 | 0.139035 | ... | 0.367423 | 0.152869 | 0.301701 | 0.061812 | 0.256013 | 0.164738 | 0.067162 | 0.029853 | 0.264163 | 0.164098 |
| 4 | 0.076234 | 0.1519 | 0.047548 | 0.121694 | 1.0 | 0.072472 | 0.085963 | 0.034479 | 0.193083 | 0.047548 | ... | 0.124918 | 0.099788 | 0.102051 | 0.050818 | 0.151967 | 0.143509 | 0.077916 | 0.056934 | 0.146441 | 0.113574 |


In [85]:
dist_df.shape
# our similarity matrix is 401 x 401
# Each row is the similarity of one document to all documents (including itself)

(401, 401)

In [87]:
# Grab document 0's similarities (column 0 equals row 0 because the matrix is symmetric) and keep the top 5
dist_df[0].sort_values(ascending=False)[:5]

0      1.000000
177    1.000000
206    0.720042
174    0.581019
160    0.512111
Name: 0, dtype: float64

In [88]:
# Display by most similar to least similar
data[0][:100]

b'Broadband in the UK growing fast\r\n\r\nHigh-speed net connections in the UK are proving more popular th'

In [89]:
data[177][:100]

b'Broadband in the UK growing fast\r\n\r\nHigh-speed net connections in the UK are proving more popular th'

In [90]:
data[206][:100]

b"Broadband in the UK gathers pace\r\n\r\nOne person in the UK is joining the internet's fast lane every 1"

In [91]:
data[160][:100]

b'Broadband soars in 2004\r\n\r\nIf broadband were a jumbo jet, then 2003 would have seen it taxiing down '

### NearestNeighbor (K-NN) 

To address the computational inefficiencies of the brute-force approach, a variety of tree-based data structures have been invented. In general, these structures attempt to reduce the required number of distance calculations by efficiently encoding aggregate distance information for the sample. The basic idea is that if point $A$ is very distant from point $B$, and point $B$ is very close to point $C$, then we know that points $A$ and $C$ are very distant, without having to explicitly calculate their distance. In this way, the computational cost of a nearest neighbors search can be reduced to $O[D N \log(N)]$ or better. This is a significant improvement over brute-force for large data.

To address the inefficiencies of KD Trees in higher dimensions, the ball tree data structure was developed. Where KD trees partition data along Cartesian axes, ball trees partition data in a series of nesting hyper-spheres. This makes tree construction more costly than that of the KD tree, but results in a data structure which can be very efficient on highly structured data, even in very high dimensions.

A ball tree recursively divides the data into nodes defined by a centroid $C$ and radius $r$, such that each point in the node lies within the hyper-sphere defined by $r$ and $C$. The number of candidate points for a neighbor search is reduced through use of the triangle inequality:

$$|x + y| \leq |x| + |y|$$

With this setup, a single distance calculation between a test point and the centroid is sufficient to determine a lower and upper bound on the distance to all points within the node. Because of the spherical geometry of the ball tree nodes, it can out-perform a KD-tree in high dimensions, though the actual performance is highly dependent on the structure of the training data. In scikit-learn, ball-tree-based neighbors searches are specified using the keyword `algorithm='ball_tree'`, and are computed using the class `sklearn.neighbors.BallTree`. Alternatively, the user can work with the `BallTree` class directly.
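
The bound described above can be checked numerically. This sketch treats a cloud of random points as a single ball, computes the lower and upper bounds from one distance to the centroid, and then queries scikit-learn's `BallTree` directly. The random data, `leaf_size=10`, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))            # 50 hypothetical points in 5 dimensions

# Treat all points as one ball with centroid C and radius r
C = X.mean(axis=0)
r = np.linalg.norm(X - C, axis=1).max()

query = rng.normal(size=5) + 10.0       # a query point far from the ball
d_qc = np.linalg.norm(query - C)        # the single distance we compute

# Triangle inequality: for every x in the ball,
#   max(0, d(q, C) - r) <= d(q, x) <= d(q, C) + r
lower, upper = max(0.0, d_qc - r), d_qc + r
dists = np.linalg.norm(X - query, axis=1)
assert lower <= dists.min() and dists.max() <= upper

# BallTree applies the same idea recursively to nested balls
tree = BallTree(X, leaf_size=10)
dist, ind = tree.query(X[:1], k=3)      # 3 nearest neighbours of the first point
print(ind[0], dist[0])
```

If the lower bound for a node is already larger than the best distance found so far, every point in that node can be skipped without computing its distance.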

In [116]:
dtm

<401x750 sparse matrix of type '<class 'numpy.float64'>'
	with 34492 stored elements in Compressed Sparse Row format>

In [117]:
# Instantiate
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5, algorithm='ball_tree')

# Fit on TF-IDF Vectors
nn.fit(dtm.toarray())  # dense ndarray; recent scikit-learn versions reject the np.matrix returned by .todense()

NearestNeighbors(algorithm='ball_tree', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [101]:
# Query Using kneighbors 
nn.kneighbors(dtm[0].toarray())

(array([[ 0.        ,  0.        , 17.52141547, 21.9544984 , 22.737634  ]]),
 array([[  0, 177, 206, 124, 174]]))

In [118]:
medium_random_ass_tech = ["""
Estimated reading time: 12 minutes.

    “It’s easier to fool people than to convince them that they’ve been fooled.” — Unknown.

I’m an expert on how technology hijacks our psychological vulnerabilities. That’s why I spent the last three years as a Design Ethicist at Google caring about how to design things in a way that defends a billion people’s minds from getting hijacked.

When using technology, we often focus optimistically on all the things it does for us. But I want to show you where it might do the opposite.

Where does technology exploit our minds’ weaknesses?

I learned to think this way when I was a magician. Magicians start by looking for blind spots, edges, vulnerabilities and limits of people’s perception, so they can influence what people do without them even realizing it. Once you know how to push people’s buttons, you can play them like a piano.
That’s me performing sleight of hand magic at my mother’s birthday party

And this is exactly what product designers do to your mind. They play your psychological vulnerabilities (consciously and unconsciously) against you in the race to grab your attention.

I want to show you how they do it.
Hijack #1: If You Control the Menu, You Control the Choices

Western Culture is built around ideals of individual choice and freedom. Millions of us fiercely defend our right to make “free” choices, while we ignore how those choices are manipulated upstream by menus we didn’t choose in the first place.

This is exactly what magicians do. They give people the illusion of free choice while architecting the menu so that they win, no matter what you choose. I can’t emphasize enough how deep this insight is.

When people are given a menu of choices, they rarely ask:

    “what’s not on the menu?”
    “why am I being given these options and not others?”
    “do I know the menu provider’s goals?”
    “is this menu empowering for my original need, or are the choices actually a distraction?” (e.g. an overwhelmingly array of toothpastes)

How empowering is this menu of choices for the need, “I ran out of toothpaste”?

For example, imagine you’re out with friends on a Tuesday night and want to keep the conversation going. You open Yelp to find nearby recommendations and see a list of bars. The group turns into a huddle of faces staring down at their phones comparing bars. They scrutinize the photos of each, comparing cocktail drinks. Is this menu still relevant to the original desire of the group?

It’s not that bars aren’t a good choice, it’s that Yelp substituted the group’s original question (“where can we go to keep talking?”) with a different question (“what’s a bar with good photos of cocktails?”) all by shaping the menu.

Moreover, the group falls for the illusion that Yelp’s menu represents a complete set of choices for where to go. While looking down at their phones, they don’t see the park across the street with a band playing live music. They miss the pop-up gallery on the other side of the street serving crepes and coffee. Neither of those show up on Yelp’s menu.
"""]

In [119]:
new = tfidf.transform(medium_random_ass_tech)

nn.kneighbors(new.toarray())

(array([[1.22377759, 1.24033055, 1.25311757, 1.25437609, 1.25628275]]),
 array([[278, 279, 150, 383,  35]]))

In [123]:
data[278]

b'Mobile gig aims to rock 3G\r\n\r\nForget about going to a crowded bar to enjoy a gig by the latest darlings of the music press.\r\n\r\nNow you could also be at a live gig on your mobile, via the latest third generation (3G) video phones. Rock outfit Rooster are playing what has been billed as the first ever concert broadcast by phone on Tuesday evening from a London venue. The 45-minute gig is due to be "phone cast" by the 3G mobile phone operator, 3. 3G technology lets people take, watch and send video clips on their phones, as well as swap data much faster than with 2G networks like GSM. People with 3G phones in the UK can already download football and music clips on their handsets.\r\n\r\nSome 1,000 fans of the London-based band will have to pay five pounds for a ticket and need a 3G handset.\r\n\r\n"Once you have paid, you can come and go as much as you like, because we expect the customers to be mobile," said 3 spokesperson Belinda Henderson. "It\'s like going to a concert hall,

## Apply word embedding models to create document vectors
<a id="p3"></a>

### BoW discards textual context

One of the limitations of Bag-of-Words approaches is that any information about the textual context surrounding a word is lost. It also means that the only tools we have for identifying words with similar usage or meaning, and consolidating them into a single vector, are stemming and lemmatization. Both are quite limited: they only consolidate words that are very close in spelling or that share a root.
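
A two-sentence illustration of what "losing context" means: the made-up sentences below say opposite things but produce exactly the same bag of words.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order
bow_a = Counter("the dog bit the man".split())
bow_b = Counter("the man bit the dog".split())

# Word order is discarded, so the two bags are indistinguishable
same_bag = bow_a == bow_b
print(same_bag)
```

Any model built only on these counts cannot tell who bit whom.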

### Embedding approaches preserve more textual context
Word2Vec is a popular word embedding technique. Like Bag-of-Words, it learns a real-valued vector representation for a predefined, fixed-size vocabulary that is generated from a corpus of text. However, in contrast to BoW, Word2Vec approaches are much more capable of accounting for textual context, and are better at discovering words with similar meanings or usages (semantic or syntactic similarity).

### Word2Vec Intuition
### The Distributional Hypothesis

In order to understand how Word2Vec preserves textual context, we have to understand what's called the Distributional Hypothesis (reference: [Distributional semantics](https://en.wikipedia.org/wiki/Distributional_semantics)). The Distributional Hypothesis operates under the assumption that words that have similar contexts will have similar meanings. Practically speaking, this means that if two words are found to have similar words both to the right and to the left of them throughout the corpus, then those words share a context and are assumed to have similar meanings. 

> "You shall know a word by the company it keeps" - John Firth

This means that we let the usage of a word define its meaning and its "similarity" to other words. In the following example, which words would you say have a similar meaning? 

**Sentence 1**: Traffic was light today

**Sentence 2**: Traffic was heavy yesterday

**Sentence 3**: Prediction is that traffic will be smooth-flowing tomorrow since it is a national holiday

What words in the above sentences seem to have a similar meaning if all you knew about them was the context in which they appeared above? 

Let's take a look at how this might work in action. The following example is simplified, but it will give you an idea of the intuition.

#### Corpus:

1) "It was the sunniest of days."

2) "It was the rainiest of days."

#### Vocabulary:

{"it": 1, "was": 2, "the": 3, "of": 4, "days": 5, "sunniest": 6, "rainiest": 7}

### Vectorization

The two sentences are treated as one continuous stream of tokens, which is why contexts such as `of_it` and `days_was` span the sentence boundary.

| word     | START_was | it_the | was_sunniest | the_of | sunniest_days | of_it | days_was | it_the | was_rainiest | rainiest_days | of_END |
|----------|-----------|--------|--------------|--------|---------------|-------|----------|--------|--------------|---------------|--------|
| it       | 1         | 0      | 0            | 0      | 0             | 0     | 1        | 0      | 0            | 0             | 0      |
| was      | 0         | 1      | 0            | 0      | 0             | 0     | 0        | 1      | 0            | 0             | 0      |
| the      | 0         | 0      | 1            | 0      | 0             | 0     | 0        | 0      | 1            | 0             | 0      |
| sunniest | 0         | 0      | 0            | 1      | 0             | 0     | 0        | 0      | 0            | 0             | 0      |
| of       | 0         | 0      | 0            | 0      | 1             | 0     | 0        | 0      | 0            | 1             | 0      |
| days     | 0         | 0      | 0            | 0      | 0             | 1     | 0        | 0      | 0            | 0             | 1      |
| rainiest | 0         | 0      | 0            | 1      | 0             | 0     | 0        | 0      | 0            | 0             | 0      |

Each column represents a context, in this case defined by the words immediately to the left and right of the center word. How far we look to the left and right of a given word is referred to as our "window of context." Each row vector represents the different usages of a given word. Word2Vec can consider a larger context than only the words immediately to the left and right of a given word, but we're going to keep our window of context small for this example. What's most important is that this vectorization has translated our documents from a text representation to a numeric one in a way that preserves information about the underlying context. 

We can see that words that have a similar context will have similar row-vector representations, but before looking at that in more depth, let's simplify our vectorization slightly. You'll notice that the column "it_the" appears twice. Let's combine those into a single column by adding them element-wise. 

| word     | START_was | it_the | was_sunniest | the_of | sunniest_days | of_it | days_was | was_rainiest | rainiest_days | of_END |
|----------|-----------|--------|--------------|--------|---------------|-------|----------|--------------|---------------|--------|
| it       | 1         | 0      | 0            | 0      | 0             | 0     | 1        | 0            | 0             | 0      |
| was      | 0         | 2      | 0            | 0      | 0             | 0     | 0        | 0            | 0             | 0      |
| the      | 0         | 0      | 1            | 0      | 0             | 0     | 0        | 1            | 0             | 0      |
| sunniest | 0         | 0      | 0            | 1      | 0             | 0     | 0        | 0            | 0             | 0      |
| of       | 0         | 0      | 0            | 0      | 1             | 0     | 0        | 0            | 1             | 0      |
| days     | 0         | 0      | 0            | 0      | 0             | 1     | 0        | 0            | 0             | 1      |
| rainiest | 0         | 0      | 0            | 1      | 0             | 0     | 0        | 0            | 0             | 0      |

Now, can you spot which words have a similar row-vector representation? Hint: look for columns where more than one row has a nonzero value. Each column represents a context that a word was found in. If multiple words share a context, then those words are understood to be closer in meaning to each other than to other words in the text.

Let's look specifically at the words sunniest and rainiest. You'll notice that these two words have exactly the same 10-dimensional vector representation. Based on this very small corpus, we would conclude that these two words have the same meaning because they share the same usage. Is this a good assumption? Well, they are both referring to the weather outside, so that's better than nothing. You could imagine that as our corpus grows larger, we will be exposed to a greater number of contexts and the Distributional Hypothesis assumption will improve. 
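
The context table above can be reproduced in a few lines. This is a minimal sketch that assumes the two sentences are lowercased, stripped of punctuation, and joined into one token stream with `START` and `END` markers; the variable names are made up for illustration.

```python
from collections import defaultdict

# Both toy sentences as one token stream, padded with boundary markers
corpus = "it was the sunniest of days it was the rainiest of days".split()
padded = ["START"] + corpus + ["END"]

# For each word, count the (left neighbour, right neighbour) contexts it appears in
contexts = defaultdict(lambda: defaultdict(int))
for i in range(1, len(padded) - 1):
    word = padded[i]
    context = f"{padded[i - 1]}_{padded[i + 1]}"
    contexts[word][context] += 1

# One column per distinct context, one row vector per word
columns = sorted({c for counts in contexts.values() for c in counts})
vectors = {w: [counts.get(c, 0) for c in columns] for w, counts in contexts.items()}

print(columns)
print(vectors["sunniest"])
print(vectors["rainiest"])
```

The two weather words end up with identical vectors because they only ever appear between "the" and "of".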

### Word2Vec Variants

#### Skip-Gram

The Skip-Gram method predicts the neighbors of a word given a center word. In the skip-gram model, we take a center word and a window of context (neighbor) words to train the model, and then predict context words out to some window size for each center word.

This notion of “context” or “neighboring” words is best described by considering a center word and a window of words around it. 

For example, if we consider the sentence **“The speedy Porsche drove past the elegant Rolls-Royce”** and a window size of 2, we’d have the following pairs for the skip-gram model:

**Text:**
**The**	speedy	Porsche	drove	past	the	elegant	Rolls-Royce

*Training Sample with window of 2*: (the, speedy), (the, Porsche)

**Text:**
The	**speedy**	Porsche	drove	past	the	elegant	Rolls-Royce

*Training Sample with window of 2*: (speedy, the), (speedy, Porsche), (speedy, drove)

**Text:**
The	speedy	**Porsche**	drove	past	the	elegant	Rolls-Royce

*Training Sample with window of 2*: (Porsche, the), (Porsche, speedy), (Porsche, drove), (Porsche, past)

**Text:**
The	speedy	Porsche	**drove**	past	the	elegant	Rolls-Royce

*Training Sample with window of 2*: (drove, speedy), (drove, Porsche), (drove, past), (drove, the)
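
The training samples listed above can be generated mechanically. A minimal sketch, assuming whitespace tokenization and a lowercased first word; `skipgram_pairs` is a hypothetical helper, not a library function.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the speedy Porsche drove past the elegant Rolls-Royce".split()
pairs = skipgram_pairs(sentence, window=2)

# Pairs whose center word is "drove"
drove_pairs = [p for p in pairs if p[0] == "drove"]
print(pairs[:5])
print(drove_pairs)
```

Words near the start or end of the sentence simply get fewer pairs, which matches the first two samples above.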

The **Skip-gram model** outputs a probability distribution, i.e. the probability of each word appearing in the context of a given center word. Training selects the vector representations that maximize the probability of the context words actually observed.
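
To show what "a probability distribution over context words" looks like, here is a sketch that scores every vocabulary word against a center word with a dot product and normalizes the scores with a softmax. The vectors are random and untrained, and the tiny vocabulary and 4 dimensions are assumptions; practical Word2Vec implementations typically avoid the full softmax by using negative sampling or hierarchical softmax.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = ["the", "speedy", "porsche", "drove", "past"]
dim = 4                                            # hypothetical embedding size

W_center = rng.normal(size=(len(vocab), dim))      # one vector per center word
W_context = rng.normal(size=(len(vocab), dim))     # one vector per context word

def context_probabilities(center_word):
    v = W_center[vocab.index(center_word)]
    scores = W_context @ v                         # dot product with every context vector
    exp = np.exp(scores - scores.max())            # subtract the max for numerical stability
    return exp / exp.sum()                         # softmax: non-negative, sums to 1

probs = context_probabilities("porsche")
for word, p in zip(vocab, probs):
    print(f"{word:8s} {p:.3f}")
```

Training nudges both sets of vectors so that words which really occur near the center word receive higher probability.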

With CountVectorizer and TF-IDF the best we could do for context was to look at common bi-grams and tri-grams (n-grams). Well, skip-grams go far beyond that and give our model much stronger contextual information.

![alt text](https://www.dropbox.com/s/c7mwy6dk9k99bgh/Image%202%20-%20SkipGrams.jpg?raw=1)

#### Continuous Bag of Words

This model takes the opposite approach from the skip-gram model in that it tries to predict a center word based on the neighboring words. In the case of the CBOW model, we input the context words within the window (such as “the”, “Porsche”, “drove”) and aim to predict the target or center word “speedy” (the input to the prediction pipeline is reversed as compared to the Skip-Gram model).
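
For comparison with skip-gram, the sketch below builds CBOW-style training samples, where the context words are the input and the center word is the target. `cbow_pairs` is a hypothetical helper and whitespace tokenization is assumed.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, center) samples: the context is the input, the center is the target."""
    samples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + window + 1]
        samples.append((left + right, center))
    return samples

sentence = "the speedy Porsche drove past the elegant Rolls-Royce".split()
samples = cbow_pairs(sentence, window=2)

# The sample whose target is "speedy"
speedy_sample = samples[1]
print(speedy_sample)
```

Each sentence position yields one CBOW sample, whereas skip-gram yields one sample per (center, context) pair.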

A graphical depiction of the input to output prediction pipeline for both variants of the Word2vec model is attached. The graphical depiction will help crystallize the difference between SkipGrams and Continuous Bag of Words.

![alt text](https://www.dropbox.com/s/k3ddmbtd52wq2li/Image%203%20-%20CBOW%20Model.jpg?raw=1)

### Notable Differences between Word Embedding methods:

1) W2V focuses less on document-level topic modeling. You'll notice that the vectorizations don't really retain much information about the original document that the information came from. At least not in our examples.

2) W2V can result in really large and complex vectorizations. Training a Word2Vec model from scratch means fitting a (shallow) neural network on a large corpus, but we can use helpful pretrained embeddings (thank you Google) to do really cool things!

*^ All that noise... and spaCy ships pretrained word vectors you can just use?*

Let's take a look at how to do it. 

In [None]:
# Process a text
doc = nlp("Two bananas in pyjamas")

# Get the vector for the token "bananas"
bananas_vector = doc[1].vector
print(bananas_vector)

In [None]:
doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")

# Get the similarity of doc1 and doc2
similarity = doc1.similarity(doc2)
print(similarity)

In [None]:
# import the PCA module from sklearn
from sklearn.decomposition import PCA

def get_word_vectors(words):
    # converts a list of words into their word vectors
    return [nlp(word).vector for word in words]

words = ['car', 'truck', 'suv', 'elves', 'dragon', 'sword', 'king', 'queen', 'prince', 'horse', 'fish' , 'lion']

# initialise pca model and tell it to project data down onto 2 dimensions
pca = PCA(n_components=2)

# fit the pca model to our 300D data, this will work out which is the best 
# way to project the data down that will best maintain the relative distances 
# between data points. It will store these instructions on how to transform the data.
pca.fit(get_word_vectors(words))

# Tell our (fitted) pca model to transform our 300D data down onto 2D using the 
# instructions it learnt during the fit phase.
word_vecs_2d = pca.transform(get_word_vectors(words))

# let's look at our new 2D word vectors
word_vecs_2d

In [None]:
# create a nice big plot 
plt.figure(figsize=(20,15))

# plot the scatter plot of where the words will be
plt.scatter(word_vecs_2d[:,0], word_vecs_2d[:,1])

# for each word and coordinate pair: draw the text on the plot
for word, coord in zip(words, word_vecs_2d):
    x, y = coord
    plt.text(x, y, word, size= 15)

# show the plot
plt.show()

### Extract Document Vectors

Let's see how the quality of our query results changes when we try a new embedding model.

Steps:
* Extract Vectors from Each Document
* Search using KNN
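
The two steps can be sketched end to end as follows. To keep the example self-contained, it uses a made-up dictionary of 3-dimensional word vectors and averages them per document; in the notebook you would instead take `nlp(text).vector`, which by default is also an average of the token vectors. The toy vectors, documents, and the `doc_vector` helper are all assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 3-d word vectors (stand-ins for real pretrained embeddings)
toy_vectors = {
    "broadband": [0.9, 0.1, 0.0], "internet": [0.8, 0.2, 0.0], "speed": [0.7, 0.1, 0.1],
    "film": [0.0, 0.9, 0.1], "actor": [0.1, 0.8, 0.0], "award": [0.0, 0.7, 0.2],
}

def doc_vector(text):
    # Step 1: a document vector is the average of its known word vectors
    vecs = [toy_vectors[w] for w in text.lower().split() if w in toy_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

docs = ["broadband internet speed", "film actor award", "internet speed"]
X = np.vstack([doc_vector(d) for d in docs])

# Step 2: search with nearest neighbours, using cosine distance
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
dist, ind = nn.kneighbors(doc_vector("broadband speed").reshape(1, -1))

for d, i in zip(dist[0], ind[0]):
    print(f"{d:.4f}  {docs[i]}")
```

The query contains no word from the third document's exact vocabulary mix, yet both technology documents are returned ahead of the film one because their averaged vectors point in a similar direction.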
