In [20]:
# Imports
# Basics
from __future__ import print_function, division
import pandas as pd 
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
# gensim
from gensim import corpora, models, similarities, matutils
# sklearn
from sklearn import datasets
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
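# train_test_split lives in model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split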
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
import sklearn.metrics.pairwise as smp

from sklearn.decomposition import NMF

# logging for gensim (set to INFO)
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Vector Space Models

Yesterday, we used [**Latent Dirichlet Allocation (LDA)**](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) in [gensim](http://radimrehurek.com/gensim/index.html) to map text documents ([20 Newsgroups dataset](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html)) from a **word space** to a **topic space** that could give us the **topic distribution** of different documents in our corpus as well as allow us to make **conceptual comparisons** of documents in the reduced topic space.

Today, we'll continue mapping text documents from a high-dimensional word (or token) space into a much smaller **semantic space**, which lets us make valuable **conceptual comparisons** between arbitrary blocks of text in this new vector space.

Thus the **input is a large corpus of text documents** and the **output is a reduced semantic space for those input documents and words**.  These starting/ending points are constant, but we'll take 3 different approaches for the process in between:
1.  [**Latent Semantic Indexing (LSI)**](https://en.wikipedia.org/wiki/Latent_semantic_analysis) - performs a [**Singular Value Decomposition (SVD)**](https://en.wikipedia.org/wiki/Singular_value_decomposition) on a [**document-term matrix**](https://en.wikipedia.org/wiki/Document-term_matrix) with [**TFIDF Weightings**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to map all the terms in the corpus into a reduced **term space** and all the documents into a reduced **document space**.  
    - The 2 spaces are related by a simple transformation, so we can perform arbitrary **term-term**, **doc-doc**, and **doc-term comparisons** via [**cosine similarity**](https://en.wikipedia.org/wiki/Cosine_similarity).
    - These 2 spaces make up the **"dual space"**
        - Every document is the weighted sum of all of its terms
        - Every term is the weighted sum of all the documents it occurs in (very useful!)
2. [**Non-Negative Matrix Factorization (NMF)**](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization) - performs a different type of matrix factorization on the **Term-Document Matrix** to yield reduced Term/Document vector spaces.  
  - Yields **word vectors** just like all our methods today.
    - These vectors are often quite similar to LSI vectors.
  - Yields **document vectors** as well.
  - Decomposition has some nice properties.
3.  [**Word2Vec**](https://en.wikipedia.org/wiki/Word2vec) - uses a neural network to yield **term space**
    - Has additional nice properties of term vectors, such as conceptual additivity (see below)

## Goals
- Continue to use gensim to implement text modeling
- Build an LSI vector space from a training set
- Use the LSI space to compare terms and documents to one another conceptually
- Use the LSI space to perform document clustering and classification
- Build an NMF vector space from a training set
- Use the NMF space to compare terms and documents to one another conceptually
- Use the NMF space to perform document clustering and classification
- Use Word2vec to create a vector space for words in a training set
- Use the Word2vec space to do simple comparisons between different combinations of words
- Discuss various other considerations, tasks, and extensions for VSMs like LSI, NMF, and Word2vec

## Agenda
- Vector Space Models
  - What?
  - Why?
  - How?
- TFIDF
- Latent Semantic Indexing
- Non-negative Matrix Factorization
- Word2Vec

# What are Vector Space Models?
- Map raw text into vector space
- Allow semantic (conceptual) comparison between text chunks
- **Input**: Corpus of raw text documents
- **Output**: Vectors for text documents (and usually terms)

## Why do we need Vector Space Models?
- Early NLP (1950s-1980s): focused on **linguistics** and **handwritten rules**
  - Extremely complex, impossible to maintain/scale
- 1980s: ML introduced for NLP
  - Linguistics $\rightarrow$ "***Corpus Linguistics***"
  - Learn from text via large ***corpora*** of labeled raw text documents
- **Idea**: Map text to mathematical entities $\rightarrow$ ***Word Vectors***
- No more complex rules systems! $\rightarrow$ *Unsupervised*

## How do VSMs work?
- Start with raw text
- Translate text to vectors (e.g. Counts, TFIDF)
- Optional (Probable): **Reduce the space**
  - Use Probabilistic Inference (LDA)
  - Use Dimensionality Reduction via **Matrix Factorization** (SVD, NMF)
  - Use a Neural Network (Word2Vec)
- Once we have "Semantic" (meaning) vectors (**word vectors**):
  - Do all sorts of ML with them!

## Examples of VSMs
- **LDA**:
  - Word Space $\rightarrow$ Topic Space via Bayesian Inference
- **TFIDF**:
  - Word Space not reduced, but augmented
- **LSA/LSI** and **NMF**:
  - Word Space $\rightarrow$ Semantic Space via SVD or NMF 
- **Word2Vec**:
  - Word Space $\rightarrow$ Semantic Space via Neural Network

## Term Frequency Inverse Document Frequency (TFIDF)

## TFIDF
- Creates vectors for documents based on unique term counts (frequencies)
- Weights the frequency counts by TFIDF weighting
- Does **not** reduce dimensionality
- Frequent preprocessing step for matrix factorization methods

### Term-Document Matrix (TDM)
- Start with corpus of documents (raw text)
- Create matrix:
  - Rows are our **term vocabulary** - unique terms over all documents
  - Columns are all documents
  - Entries are respective term frequency for each document
- "Terms" means tokens, whatever is extracted from tokenization in preprocessing
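
Here's a minimal sketch of building a term-document matrix by hand. The toy corpus and the whitespace tokenizer are assumptions made for illustration; a real pipeline would use a proper tokenizer and a sparse matrix.

```python
from collections import Counter

# Toy corpus (made up for illustration)
docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

# Naive whitespace tokenization; real preprocessing does much more
tokenized = [doc.lower().split() for doc in docs]

# Term vocabulary: unique terms over all documents, sorted for a stable row order
vocab = sorted(set(term for tokens in tokenized for term in tokens))

# Term counts for each document
counts = [Counter(tokens) for tokens in tokenized]

# Rows are terms, columns are documents, entries are term frequencies
tdm = [[c[term] for c in counts] for term in vocab]

for term, row in zip(vocab, tdm):
    print("{:>5}: {}".format(term, row))
```

Note how the tokenization decides the vocabulary: "cat" and "cats" end up as separate rows here because nothing was stemmed.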

### TFIDF Weighting
- Term Frequency Inverse Document Frequency
- Common weighting scheme applied to term-document matrix
- Any function that is:
  - Directly proportional to term frequency **within document** (local weight)
  - Inversely proportional to term frequency **in all documents** (global weight)
- **Motivation**: Highly common terms are not useful to distinguish documents

### TFIDF: Relation to VSMs
- TFIDF vectors are a VSM on their own! 
- TFIDF weightings empirically improve VSMs
- Some VSMs (LSI, NMF) almost certainly require this step, some don't (LDA, Word2Vec)

### Preprocessing Considerations
- Tokenization: Determines our unique terms $\rightarrow$ size of matrix
- Stopwords
- Stemming
- Named Entity Recognition
- Punctuation
- Min/Max Frequency Threshold (how often should term occur to keep it?)
- Phrase Extraction
- Part-of-Speech Tagging
- Word-sense Disambiguation (bush vs George Bush)
- etc

### Common TFIDF Weighting Scheme
- **Entropy**:
  - Term Frequency for term $i$ in document $j$: $tf_{ij}$
  - Term Frequency for term $i$ over all documents: $gf_i$
  - Relative Frequency for term $i$ in document $j$: $p_{ij} = tf_{ij}/gf_i$
  - Number of documents in corpus: $n$
  - Local Weight ("TF"): $lw_{ij} = \log(tf_{ij}+1)$
  - Global Weight ("IDF"): $gw_i = 1 + \sum\limits_j \frac{p_{ij}\log p_{ij}}{\log n}$
  - TFIDF Weight: 
$$
tfidf = lw_{ij}\times gw_i = \left(\log(tf_{ij}+1)\right) \times \left(1 + \sum\limits_j \frac{p_{ij}\log p_{ij}}{\log n}\right)
$$
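
To make the entropy weighting concrete, here's a small sketch that evaluates the formulas above on made-up counts. The two terms and their frequencies are hypothetical; the only convention added here is treating $0 \log 0$ as 0.

```python
import math

# Toy term frequencies: one list per term, one entry per document (made up)
tf = {
    "the":  [2, 2, 2],   # spread evenly across documents
    "nasa": [3, 0, 0],   # concentrated in a single document
}
n = 3  # number of documents in the corpus


def global_weight(row, n):
    """Entropy global weight: 1 + sum_j p_ij * log(p_ij) / log(n)."""
    gf = sum(row)  # frequency of the term over all documents
    total = 0.0
    for tf_ij in row:
        if tf_ij > 0:  # treat 0 * log(0) as 0
            p_ij = tf_ij / gf
            total += p_ij * math.log(p_ij)
    return 1.0 + total / math.log(n)


def tfidf_weights(row, n):
    """Local weight log(tf + 1) times the global weight, per document."""
    gw = global_weight(row, n)
    return [math.log(tf_ij + 1) * gw for tf_ij in row]


for term, row in tf.items():
    print(term, global_weight(row, n), tfidf_weights(row, n))
```

A term spread evenly over every document gets a global weight of about 0, while a term confined to one document gets a global weight of 1, which is exactly the motivation stated earlier.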

**TFIDF in `sklearn`**: 
Let's implement TFIDF with the [20 Newsgroups Dataset](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html)
- Here's TFIDF in `sklearn` with `TfidfVectorizer`:

In [None]:
# Load in the 20 Newsgroups data
ng = datasets.fetch_20newsgroups()
# Get the raw text docs
ng_text = ng.data 

# Vectorize the text using TFIDF
tfidf = TfidfVectorizer(stop_words="english", 
                        token_pattern="\\b[a-zA-Z][a-zA-Z]+\\b", 
                        min_df=10)
tfidf_vecs = tfidf.fit_transform(ng_text)
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
pd.DataFrame(tfidf_vecs.todense(), 
             columns=tfidf.get_feature_names_out()
            ).head()

### Applications of TFIDF
- We have a vector space!
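  - Term vectors are the rows of the term-document matrix
  - Document vectors are the columns (note that `sklearn` returns the transpose: documents as rows, terms as columns)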
- We can try to do ML
  - e.g.: Naive Bayes for Text Classification
- **BUT**...
  - Many unique terms $\rightarrow$ many unique features!
  - Curse of Dimensionality :(
  - Dimensionality Reduction seems prudent :)

### Naive Bayes Text Classification
- Naive Bayes empirically avoids the curse somewhat on TFIDF vectors
  - Performs decently on high-dimensional text classification tasks
- How does it work?
  - The observations are the weighted TFIDF vectors
    - This is a (weighted) bag of words for document $j$ and terms $\{w_i\}$
$$
P(\text{Class } C | \{w_i\}) = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}} = \frac{P(\{w_i\} | C) \times P(C)}{P(\{w_i\})}
$$
- Naive Assumption: 
$$
P(\{w_i\} | C) = \prod\limits_i P(w_i|C)
$$
- This is just a multinomial distribution where **given a class C each word has probability $p_i$ of appearing**!
- Thus:
  - **Likelihood is multinomial**:
    - $P(w_i|C)$: Number of times word $w_i$ appears in class C documents divided by total number of words in Class C documents
  - **Prior**: Proportion of documents of class C
  - We can use Multinomial Naive Bayes!
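
Here's a from-scratch sketch of that calculation on raw word counts. The tiny corpus and its labels are made up, and the add-one smoothing (`alpha`) is an assumption added so unseen word/class pairs don't zero out the product; it is not part of the formulas above.

```python
import math
from collections import Counter, defaultdict

# Tiny labeled corpus (made up for illustration)
train = [("space", "nasa launch orbit shuttle"),
         ("space", "orbit moon nasa"),
         ("autos", "car engine wheels"),
         ("autos", "engine car dealer car")]

# Documents per class (for the prior) and word counts per class (for the likelihood)
class_docs = Counter(label for label, _ in train)
word_counts = defaultdict(Counter)
for label, text in train:
    word_counts[label].update(text.split())
vocab = set(w for c in word_counts.values() for w in c)


def log_posterior(text, label, alpha=1.0):
    """log P(C) + sum_i log P(w_i | C), up to the constant evidence term."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_docs[label] / len(train))  # prior
    for w in text.split():
        if w not in vocab:
            continue  # skip words never seen in training
        p = (word_counts[label][w] + alpha) / (total_words + alpha * len(vocab))
        score += math.log(p)  # naive assumption: the product becomes a sum of logs
    return score


def predict(text):
    """Pick the class with the highest (unnormalized) log posterior."""
    return max(class_docs, key=lambda label: log_posterior(text, label))


print(predict("nasa orbit"))
print(predict("car engine"))
```

Working in log space avoids underflow when many small probabilities are multiplied together.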

Let's try simple Naive Bayes classification on TFIDF vectors from above in `sklearn`:

In [None]:
# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(tfidf_vecs, 
                                                    ng.target, 
                                                    test_size=0.33)

# Train 
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Test 
nb.score(X_test, y_test)

## Latent Semantic Analysis/Indexing (LSA/LSI)

### What is LSI?
- Latent Semantic Indexing (Analysis $\rightarrow$ LSA, same thing)
- **TFIDF space** $\rightarrow$ **Semantic space**
  - **Dimensionality Reduction**!
- **SVD** performs the reduction!

### Steps in LSI
- Create Term-Document Matrix
- Apply TFIDF Weightings
- Perform SVD on TFIDF Matrix
  - Results in **Term space** and **Document space**
  - Spaces linked by simple transformation
- Keep the top $k$ components (dimensions, this is a parameter $\rightarrow$ your choice!)
- Use the resulting vectors!
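
The steps above can be sketched with plain `numpy` on a tiny matrix. The matrix values and term labels are made up, and scaling the vectors by the singular values is one common convention assumed here, not necessarily what any particular library does internally.

```python
import numpy as np

# Toy weighted term-document matrix: 5 terms (rows) x 4 documents (columns).
# Values are made up; in practice this would be the TFIDF matrix.
X = np.array([[1.0, 0.9, 0.0, 0.0],   # "nasa"
              [0.8, 1.0, 0.0, 0.1],   # "orbit"
              [0.0, 0.1, 1.0, 0.9],   # "engine"
              [0.0, 0.0, 0.9, 1.0],   # "car"
              [0.3, 0.3, 0.3, 0.3]])  # "the"

# Thin SVD: X = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                # number of components to keep
term_vecs = U[:, :k] * s[:k]         # one k-dimensional row per term
doc_vecs = Vt[:k, :].T * s[:k]       # one k-dimensional row per document


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


print(cosine(doc_vecs[0], doc_vecs[1]))  # two "space"-like documents
print(cosine(doc_vecs[0], doc_vecs[2]))  # "space"-like vs "autos"-like document
```

Documents that share vocabulary land close together in the reduced space, while the two topic groups stay far apart.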

### SVD for LSI
<img  src="svd.png"/>
<img  src="lsa.png"/>

### LSI Preprocessing and Other Considerations
- Everything from TFIDF
- Often remove numerics, highly infrequent terms, etc
- Named Entity Recognition (NER)
  - Tagging entities dramatically improves results for some data
- Bottlenecks
  - TDM creation (parallelizable)
  - SVD (good approximation algorithms for **incremental SVD**)
- Do I even need to train on my data?
  - If you have pre-existing word vectors from similar domain, maybe not!  Use those!
- Do I need to train on all of my data?
  - Maybe as little as 10% is needed for **large** datasets (tens of millions of docs)

### LSI Parameters
- Number of dimensions ($k$) to reduce to:
  - Depends on data size
  - 300 old standard, 500-1000 probably better for large datasets
  - 100 or less is fine for **small** data
- TFIDF Weightings:
  - Entropy most successful empirically

### Results of LSI
- **Output**: Word (semantic) vectors!
- With these we can:
  - Make conceptual term-term, term-doc, doc-doc, comparisons via **Cosine Similarity**
  - Perform other ML tasks:
    - Classification
    - Clustering
    - Regression (less common)

### LSI with `gensim`
- Let's try out LSI with `gensim`!
- First we need to export our TFIDF vectors (from earlier) to `gensim` and let it know the mapping of row index to term:

In [None]:
# Convert the sparse TFIDF matrix to a gensim corpus
# Need to transpose it for gensim which wants 
# terms by docs instead of docs by terms
tfidf_corpus = matutils.Sparse2Corpus(tfidf_vecs.transpose())

# Row indices
id2word = dict((v, k) for k, v in tfidf.vocabulary_.items())

# This is a hack for Python 3!
id2word = corpora.Dictionary.from_corpus(tfidf_corpus, 
                                         id2word=id2word)

- Now let's build an LSI space!

In [None]:
# Build an LSI space from the input TFIDF matrix, mapping of row id to word, and num_topics
# num_topics is the number of dimensions to reduce to after the SVD
# Analogous to "fit" in sklearn, it primes an LSI space
lsi = models.LsiModel(tfidf_corpus, id2word=id2word, num_topics=300)

### Using the LSI Space in `gensim`
- We have a trained LSI space
- We want to see where original documents lie in that 300-dimensional space:

In [None]:
# Retrieve vectors for the original tfidf corpus in the LSI space ("transform" in sklearn)
lsi_corpus = lsi[tfidf_corpus]

# Dump the resulting document vectors into a list so we can take a look
doc_vecs = [doc for doc in lsi_corpus]
doc_vecs[0]

#### Conceptual Similarity Between Documents
- Compare any document in the space to any other
- Cosine Similarity:
$$
\text{sim} = \frac{x_1 \cdot x_2}{\lVert{x_1}\rVert \lVert{x_2}\rVert}
$$
- In `gensim`:

In [None]:
# Create an index transformer that calculates similarity based on our space
# num_features is the dimensionality of the LSI vectors, not the vocabulary size
index = similarities.MatrixSimilarity(doc_vecs, 
                                      num_features=lsi.num_topics)

# Return the sorted list of cosine similarities to the first document
sims = sorted(enumerate(index[doc_vecs[0]]), key=lambda item: -item[1])
sims

In [None]:
# Let's take a look at how we did
for sim_doc_id, sim_score in sims[0:3]: 
    print("Score: " + str(sim_score))
    print("Document: " + ng_text[sim_doc_id])
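
For reference, here's a hand-rolled sketch of the same cosine formula applied to gensim-style vectors, which are lists of `(index, value)` pairs. The function name `sparse_cosine` and the example vectors are hypothetical; returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math


def sparse_cosine(vec1, vec2):
    """Cosine similarity between two vectors given as (index, value) pairs."""
    d1, d2 = dict(vec1), dict(vec2)
    # Only indices present in both vectors contribute to the dot product
    dot = sum(value * d2.get(i, 0.0) for i, value in d1.items())
    norm1 = math.sqrt(sum(v * v for v in d1.values()))
    norm2 = math.sqrt(sum(v * v for v in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for empty / all-zero vectors
    return dot / (norm1 * norm2)


same_direction = sparse_cosine([(0, 1.0), (2, 2.0)], [(0, 2.0), (2, 4.0)])
orthogonal = sparse_cosine([(0, 1.0)], [(1, 1.0)])
print(same_direction, orthogonal)
```

Cosine similarity ignores vector length, so a document and a longer document pointing in the same direction score as identical.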

### Conceptual Similarity Between Arbitrary Text Blobs
- We have vectors in the LSI space
- We can compare any blob of text to any other blob of text!  
- How?
  - Need to send any blob of text through our entire preprocessing + LSI transformation pipeline
  - So `gensim` can index them and perform the comparisons.  

Let's create some text for a simple example:

In [None]:
# Create some test text blobs to compare pairwise
text_blobs = ['space', 'nasa', 'science', 'armenians', 'israel', 
              'space nasa program', 'turkish middle east', 
              'computer graphics', 'computers', 'data science']

### Conceptual Similarity Between Arbitrary Text Blobs
Now we need to do our transformations.  Following what we did above, the steps are:
* Use `TFIDF` from above to get tfidf vectors
* Convert the sparse matrix to a gensim corpus
* Perform LSI transformation from above on the corpus

We should really build a function for all this, but let's just run the steps sequentially here:

In [None]:
# Get tfidf matrix
test_vecs = tfidf.transform(text_blobs).transpose()
# Convert to gensim corpus
test_corpus = matutils.Sparse2Corpus(test_vecs)
# LSI transformation
test_lsi = lsi[test_corpus]

### Conceptual Similarity Between Arbitrary Text Blobs
- We have LSI vectors for all of our test "documents" (text blobs)!  
- We just need to index them
- Then we can compare them to one another via cosine similarity.  
- To index them we use the `MatrixSimilarity` that we did above:

In [None]:
# Index our test text blobs
test_index = similarities.MatrixSimilarity(test_lsi)

# Iterate and print out all pairwise similarities
# For each test text blob that we're looking at
for i, sims in enumerate(test_index):
    # We get a list of similarities to all indexed text blobs
    # Print the text blob we're currently examining
    print("Similarities to {}:".format(text_blobs[i]))
    # Print the similarities of the current blob to all others with labels
    sims_with_labels = [(score, text_blobs[j]) for j, score in enumerate(sims)]
    # Sort the results by decreasing similarity and print them out
    sorted_sims_with_labels = sorted(sims_with_labels, reverse=True)
    print(sorted_sims_with_labels)
    print('\n')

- So cool!!  
- We can compare ***any*** arbitrary collection of words (arbitrary "documents") to any other arbitrary collection of words with our gensim LSI index.  
- We just need to make sure we index those documents first!


### LSI for Machine Learning
- We have (very good, 300-dimensional) vectors for our documents now!  
- So we can do any ML we want on our documents!
- First we need to convert back to `sklearn` land:

In [None]:
# Convert the gensim-style corpus vecs to a numpy array for sklearn manipulations
ng_lsi = matutils.corpus2dense(lsi_corpus, num_terms=300).transpose()
ng_lsi.shape

#### LSI for Text Clustering
- Let's try clustering our documents with `sklearn`:

In [19]:
# Create KMeans
kmeans = KMeans(n_clusters=20)

# Cluster
ng_lsi_clusters = kmeans.fit_predict(ng_lsi)

# Take a look
print(ng_lsi_clusters[0:50])
ng_text[0:5]

[ 7  9  9  2  2  5  9 14  9 13  0 18  2 17  9 19  9  7  9  8 16  3  9  8 14
  9  2  9 18  7  7  9  9 19  9  3  0 12  0 19  1  2  6  1  9  9  9  0  9 17]


["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",
 "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 

#### LSI for Text Classification
- Try some simple classification on the result LSI vectors for the 20 NG set:

In [None]:
# KNN classifier; cosine distance is requested by name below
from sklearn.neighbors import KNeighborsClassifier

# Train/Test
X_train, X_test, y_train, y_test = train_test_split(ng_lsi, ng.target, 
                                                    test_size=0.33)

# Fit KNN classifier to training set with cosine distance
# (a callable metric must accept two 1-D vectors, which smp.cosine_distances
# does not, so we name the metric and use the brute-force neighbor search)
knn = KNeighborsClassifier(n_neighbors=3, metric='cosine', algorithm='brute')
knn.fit(X_train, y_train)
knn.score(X_test, y_test)

## Non-Negative Matrix Factorization (NMF)

### What is NMF?
- Non-negative Matrix Factorization
- **TFIDF space** $\rightarrow$ **Semantic space**
  - **Dimensionality Reduction**!
  - Just like LSI in that way!
- Only difference: Different Matrix Factorization
  - LSI is SVD:
$$
\text{X} = \text{U}\Sigma\text{V}^T
$$
  - NMF:
$$
\text{X} = \text{TD}
$$


### NMF Factorization
$$
\text{X} = \text{TD}
$$
- X: Term-Document Space (familiar TFIDF matrix)
- T: Term-Feature Space (familiar reduced term space, though not the same one!)
- D: Document-Feature Space (familiar reduced doc space, though not the same one!)
- NMF Vectors are often very similar to LSI vectors
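
Here's a minimal `numpy` sketch of one way to find such a factorization: Lee-Seung style multiplicative updates. This is one common algorithm chosen for illustration, not necessarily the solver `sklearn` uses, and the matrix values are made up. Note that for $X = TD$ to multiply out, $D$ is stored as features x documents here.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy non-negative term-document matrix (terms x documents), values made up
X = np.array([[1.0, 0.9, 0.0, 0.0],
              [0.8, 1.0, 0.0, 0.1],
              [0.0, 0.1, 1.0, 0.9],
              [0.0, 0.0, 0.9, 1.0]])

k = 2                             # number of latent features
T = rng.rand(X.shape[0], k)       # term-feature matrix (terms x k)
D = rng.rand(k, X.shape[1])       # feature-document matrix (k x documents)
eps = 1e-9                        # guards against division by zero


def error(X, T, D):
    """Frobenius norm of the reconstruction error X - TD."""
    return np.linalg.norm(X - T.dot(D))


start_error = error(X, T, D)
for _ in range(200):
    # Multiplicative updates: every factor is non-negative, so T and D stay non-negative
    D *= T.T.dot(X) / (T.T.dot(T).dot(D) + eps)
    T *= X.dot(D.T) / (T.dot(D).dot(D.T) + eps)
end_error = error(X, T, D)

print(start_error, end_error)
```

Unlike SVD, nothing here can go negative, which is where the "nice properties" come from: each document is an additive mix of features.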

### NMF Factorization
<img src='NMF.png'/>

### Steps in NMF for Text
- Same as LSI!
  - Create Term-Document Matrix
  - Apply TFIDF weightings
  - Factorize Matrix (this time with NMF instead of SVD)
  - Use resulting "semantic" vectors!

### NMF for ML
- Just like with LSI, can use NMF vectors for ML
- `sklearn` has NMF; `gensim` historically did not (newer releases add `gensim.models.nmf`)
- We'll reuse the TFIDF and data pieces from before
- Let's reduce the TFIDF matrix in `sklearn`:

In [18]:
# Reduce TFIDF Matrix to 300 dimensions
from sklearn.decomposition import NMF
nmf = NMF(n_components=300)
nmf_vecs = nmf.fit_transform(tfidf_vecs)

#### NMF for Text Clustering
- Now let's cluster the docs in NMF space:

In [None]:
# KMeans clustering on Newsgroups
# Use the data loaded from earlier
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=20)
kmeans.fit_predict(nmf_vecs)

#### NMF for Text Classification
- And now classification:

In [None]:
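# Train/test split on the NMF vectors (the earlier split held LSI vectors)
X_train, X_test, y_train, y_test = train_test_split(nmf_vecs, ng.target, 
                                                    test_size=0.33)

# Fit KNN classifier to training set with cosine distance
knn = KNeighborsClassifier(n_neighbors=3, metric='cosine', algorithm='brute')
knn.fit(X_train, y_train)
knn.score(X_test, y_test)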

**Woooohoooo!  We've classified text with NMF!**

## Word2Vec

### What is word2vec?
- VSM for text analysis
- **Input**: Corpus of text documents
- **Output**: Reduced vector space for all terms in documents
- **Training**: Neural Network based on sliding windows of text

### Why word2vec?
- Other VSMs (LSI, NMF, etc) **don't capture word order** (Bag of Words)!
- It would be nice if we could, at least a little!
- word2vec does somewhat
- word2vec can capture analogy relationships:
  - e.g.: King is to man as Queen is to woman

### How does word2vec work?
- We use a **shallow** (1 hidden layer) neural network 
- Train on "**context windows**", small sequences of words in text (5-10 maybe)
- The input is either:
  - A word in the context window ("Skip-Grams")
  - All words but 1 in a context window ("CBOW")
- The output is either:
  - All the other words in a context window ("Skip-Grams")
  - The missing word from the context window ("CBOW")

### Preprocessing Considerations
- Everything from our previous VSMs!
- aka Tokenization, stemming, stopwords, Entity extraction, etc

### Training word2Vec
- Use a neural network on context windows
- 2 main approaches for inputs and labels:
  - **Skip-Grams**
  - **Continuous Bag of Words (CBOW)**
  - Vectors usually similar, subtle differences, also differences in computational time

#### Context Windows
- **Observations** for word2vec: All **context windows** in a corpus
  - Size of a context window is a chosen parameter
- e.g.:
  - Document: "The quick brown fox jumped over the lazy dog."
  - Window size: 5
  - Window 1: "<span class="burk">The quick brown fox jumped</span> over the lazy dog."
  - Window 2: "The <span class="burk">quick brown fox jumped over</span> the lazy dog."
  - Window 3: "The quick <span class="burk">brown fox jumped over the</span> lazy dog."
  - Window 4: "The quick brown <span class="burk">fox jumped over the lazy</span> dog."
  - Window 5: "The quick brown fox <span class="burk">jumped over the lazy dog</span>."
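
The windows listed above can be generated with a short helper. The function name `context_windows` is hypothetical, and whitespace splitting stands in for real tokenization.

```python
def context_windows(tokens, size):
    """Return every run of `size` consecutive tokens, in order."""
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]


document = "The quick brown fox jumped over the lazy dog"
windows = context_windows(document.split(), size=5)

for number, window in enumerate(windows, start=1):
    print("Window {}: {}".format(number, " ".join(window)))
```

A document with $n$ tokens yields $n - d + 1$ windows of size $d$, so 9 tokens and a window of 5 give the 5 windows shown.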

#### "One-hot" Encoding Word Vectors
- We need to be able to represent a **sequence of words** as a vector
- To do this, we need to assign each word an index from 0 to V-1
  - V is the size of the vocabulary aka # distinct words in the corpus
- A **word vector** is:
  - 1 for the index of that word
  - 0 for all other entries  
<img src='1-hot.png'/>

#### One-hot Encoding Context Windows
- We need vectors for context windows
- A sequence of words will have a vector that's just the concatenation of its word vectors
  - Thus, for window size $d$ the vector is of length $V \times d$
  - Only $d$ entries (one for each word) will be nonzero (1s)
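
Here's a small sketch of one-hot encoding a word and then a whole window by concatenation. The five-word vocabulary and the helper names are made up for illustration.

```python
vocab = ["brown", "dog", "fox", "quick", "the"]      # toy vocabulary, V = 5
index = {word: i for i, word in enumerate(vocab)}    # word -> position 0..V-1


def one_hot(word):
    """Length-V vector with a 1 at the word's index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


def encode_window(words):
    """Concatenate the one-hot vectors of the words in a window."""
    vec = []
    for word in words:
        vec.extend(one_hot(word))
    return vec


window = ["the", "quick", "fox"]     # window size d = 3
encoded = encode_window(window)

print(one_hot("fox"))
print(len(encoded), sum(encoded))    # length V * d, with d nonzero entries
```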
  
<img src='catdog.png'/>

#### Skip-Grams
- Build a neural network with 1 hidden layer
- **Inputs**: 
  - The middle word of the context window (one-hot encoded)
  - Dimensionality: $V$
- **Outputs**: 
  - The other words of the context window (one-hot encoded)
  - Dimensionality: $V \times (d-1)$
- Turn the crank!  
<img src='skip_gram.png'/>

### Continuous Bag of Words (CBOW)
- Build a neural network with 1 hidden layer
- Just reverse of Skip-Grams!
- **Inputs**: 
  - The other words of the context window (one-hot encoded)
  - Dimensionality: $V \times (d-1)$
- **Outputs**: 
  - The middle word of the context window (one-hot encoded)
  - Dimensionality: $V$
- Turn the crank!  
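
The two framings differ only in which side of the window is the input. Here's a sketch that builds one training example of each kind from a single window; the function names are hypothetical, and real implementations add sampling tricks on top of this.

```python
def skip_gram_example(window):
    """(input, output) = (middle word, the other words in the window)."""
    mid = len(window) // 2
    return window[mid], window[:mid] + window[mid + 1:]


def cbow_example(window):
    """(input, output) = (the other words in the window, middle word)."""
    center, context = skip_gram_example(window)
    return context, center


window = ["the", "quick", "brown", "fox", "jumped"]

print(skip_gram_example(window))
print(cbow_example(window))
```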
<img src='cbow.png'/>

#### Dimensionality Reduction
- Number of nodes in hidden layer, $N$, is a parameter
- It is the (reduced) dimensionality of our resulting word vector space!
- Fit the neural net $\rightarrow$ find weights matrix $W$
  - In the new space, $x_N = W^Tx$
  - Checking dimensions:
    - $x$: $V \times 1$
    - $W^T$: $N \times V$
    - $x_N$: $N \times 1$
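
A quick sketch of that projection with made-up numbers. The weights in `W` are hypothetical (not learned); the point is that for a one-hot $x$, the product $W^Tx$ simply selects one row of $W$, and that row is the word's vector.

```python
V, N = 5, 3   # vocabulary size and hidden-layer size (toy values)

# Hypothetical weights W with shape V x N: one row per vocabulary word
W = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9],
     [1.0, 1.1, 1.2],
     [1.3, 1.4, 1.5]]


def project(x, W):
    """Compute x_N = W^T x for a length-V vector x."""
    n_rows, n_cols = len(W), len(W[0])
    return [sum(W[i][j] * x[i] for i in range(n_rows)) for j in range(n_cols)]


x = [0, 0, 1, 0, 0]       # one-hot vector for the word at index 2
x_N = project(x, W)

print(x_N)                # same values as row 2 of W
```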

#### So What is Happening?
- We're learning the words likely to appear near each word
- This context information ultimately leads to vectors for related words falling near one another!

### Nice Properties of Word2Vec Vectors
- word2vec (somewhat magically!) captures nice geometric relations between words
<img src='vector_queen2.png' align='right'/>
- e.g.: Analogies
  - King is to Queen as man is to woman
  - The vector between King and Queen is the same as that between man and woman!
- Works for all sorts of things: capitals, cities, etc
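
Here's a toy sketch of the analogy arithmetic. The 2-dimensional vectors are hand-made (NOT learned by word2vec) so that one axis roughly means "royalty" and the other "gender"; real vectors have hundreds of dimensions and only approximately behave this way.

```python
import math

# Hand-made toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (hypothetical values)
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.5, 0.5],
}


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(a, b, c):
    """Word closest to vec(b) - vec(a) + vec(c), excluding the three inputs."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# "man is to king as woman is to ...?"
print(analogy("man", "king", "woman"))
```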

### Word2Vec for ML
- Again we get **word vectors**!
- So we can use them for ML!

#### Using Existing Word Vectors
- word2vec takes **A LOT** of data to train it well
- What if you don't have **A LOT** of data?
  - Steal someone else's vectors!
  - Google has trained a [giant set](https://code.google.com/p/word2vec/)
  - So has [Stanford NLP](http://nlp.stanford.edu/projects/glove/) (slightly different, same idea)
  - Many others
- As long as the domain is similar, should be better than yours
  - Need words to have the same meaning as in your dataset

#### word2vec in `gensim`
- Very simple example for usage
- We'll train our own (don't!  steal vectors!)
- It needs a list of lists representing the sentences in a corpus
- Here's how that goes:

In [None]:
# Make sure gensim and Word2Vec are installed and functional
# Create some dummy data
sentences = documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

# The type of input that Word2Vec is looking for.. 
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

print(texts)

In [None]:
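# Train word2vec (skip-gram, sg=1); the vector dimensionality is left at its default of 100
# because that keyword is `size` in older gensim releases and `vector_size` in gensim 4+
w2v = models.Word2Vec(texts, window=5, min_count=1, workers=4, sg=1)
# Check out the resulting vector for "computer" (word vectors live on the .wv attribute)
w2v.wv['computer']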

**Now on a slightly Larger Corpus**:  
- Using Project Gutenberg Corpus (Books) from NLTK:

In [None]:
# An Illustration.. 

import os

# Create an iterator Class that can be used as a gensim corpus (defines how to read in the text data)
class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            path = os.path.join(self.dirname, fname)
            # Close each file once its lines have been read
            with open(path, encoding='utf-8', errors='ignore') as f:
                for line in f:
                    yield line.split()

# Instantiate the corpus from a text file of documents
# You'll need to change the path!
sentences = MySentences('/Users/paulburkard//nltk_data/corpora/gutenberg') # a memory-friendly iterator
# Create a Word2vec model
w2v = models.Word2Vec(sentences,min_count=3,workers=5)

#### Calculating Similarities
- Have word vectors for all terms!  
- Gensim provides a number of methods to then do comparisons in this vector space 
- Here are a few:

In [None]:
# Words close to woman and king but not man
w2v.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)

**Not so hot!**  
- Unsurprising: Need much more data
- A lesson in why you should steal vectors :)
- Let's try some pairwise comparisons:

In [None]:
# Similarity between woman and man (lookups go through the .wv attribute)
print(w2v.wv.similarity('woman','man'))

# Similarity between bags of words
print(w2v.wv.n_similarity(['woman', 'girl'], ['man', 'boy']))

# Finding words that don't match others in a bag
print(w2v.wv.doesnt_match("breakfast man dinner lunch".split()))

#### Word2Vec for Text Clustering
- We won't do it here, but the procedure is exactly the same as LSI above:
  - Export the vectors back to `sklearn`
  - Try whatever clustering algorithm you like on the reduced vectors
- **OR** more likely you just take the vectors as given from Google or Stanford

#### Word2Vec for Text Classification
- We won't do it here, but the procedure is exactly the same as LSI above:
  - Export the vectors back to `sklearn`
  - Try whatever classifying algorithm you like on the reduced vectors
    - Almost certainly KNN with Cosine Similarity
- **OR** more likely you just take the vectors as given from Google or Stanford

## Applications of Word Vectors (VSMs) 
- With word vectors, we can do so many cool things:
  - ML algorithms
  - Machine Translation
  - Many of those things I mentioned on NLP day 1
  - Seed Deep Learning with them to do **even cooler stuff**
- Basically, we know the state of the world (the meaning of words)...
  - The possibilities are endless!