# Word2Vec Exercises

## Introduction

We will be using 2 well-known text datasets to explore the capabilities of Word2Vec:
- [Spam Classification Dataset](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/): Collection of SMS text messages labeled as Spam/Not Spam (Ham)
- [20 Newsgroups Dataset](http://qwone.com/~jason/20Newsgroups/): Famous text classification dataset from user discussion forums with 20 classes, including...
  - Computer Graphics (comp.graphics)
  - Microsoft Windows (comp.os.ms-windows.misc)
  - IBM Hardware (comp.sys.ibm.pc.hardware)
  - Mac Hardware (comp.sys.mac.hardware)
  - X Window System (comp.windows.x)
  - For Sale (misc.forsale)
  - Automobiles (rec.autos)
  - Motorcycles (rec.motorcycles)
  - Baseball (rec.sport.baseball)
  - Hockey (rec.sport.hockey)
  - General Politics (talk.politics.misc)
  - Politics - Gun Control (talk.politics.guns)
  - Middle East Politics (talk.politics.mideast)
  - Cryptography (sci.crypt)
  - Electronics (sci.electronics)
  - Medicine (sci.med)
  - Space (sci.space)
  - Religion (talk.religion.misc)
  - Atheism (alt.atheism)
  - Christianity (soc.religion.christian)
  
To perform our tasks, we will derive our own **word vectors** from the data and also borrow Google's massive set of word vectors trained on the web ([Google Vectors]()).

In [None]:
# Python 2 compatibility (__future__ imports must come before any other import)
from __future__ import print_function

# NLP tools
import nltk
import gensim

# Data tools
import numpy as np
import pandas as pd

# Necessary for adding accessory_functions module to path
import os
import sys
lib_path = os.path.abspath(os.path.join('..', '..'))
sys.path.append(lib_path)
from accessory_functions import google_vec_file, nltk_path

# Setup nltk corpora path
nltk.data.path.insert(0, nltk_path)

In [None]:
google_vec_file

## Question 1
**Loading Google Word2Vec Vectors**
* Load the Google vectors into an object `google_model` using `gensim` (this step will take a while, as it has to load 3 million vectors into the appropriate Word2Vec format).
* Confirm that you have 3 million vectors of length 300.

In [None]:
# Load the Google vectors
google_model = None  # student section here

Google's model contains an extensive vocabulary.

In [None]:
type(google_model.vocab)

In [None]:
# Number of Vectors
len(google_model.vocab.keys())

In [None]:
# Size of the Vectors
google_model.vector_size

## Question 2
**Exploring Word2Vec Vectors**
* Print out a few word vectors from the Google set
* Print out the similarity between the following pairs (feel free to experiment with more if you like):
  * baseball, bat
  * baseball, ocean
  * bat, fly
* What sorts of patterns do you notice?  Where does it succeed?  Where does it fail?  How might one improve it?
* Print out the most similar words to the following words:
  * baseball
  * president
* Print out words similar to the positive words and dissimilar to the negative words for the following positive/negative groups:
* Print out the words that don't match the others in each of the following groups:

In [None]:
# Word Vectors
# student section here



In [None]:
# Pairwise Similarity
# student section here



* Word Sense Disambiguation

In [None]:
# Most similar words
# student section here



In [None]:
# Positive Negative Similar Words
# student section here



In [None]:
# Words that don't match
# student section here



## Question 3
**Document Vectors from Word Vectors**
* Compare the following sample documents to each other by cosine similarity
* Compare the same set of documents together by Word Mover's Distance  

***Note***: You will need to take care to first do the following:
  * Split the documents into lists of words.
  * Remove all words that aren't in the Google vector vocabulary (looking up an unknown word raises a `KeyError` otherwise).
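
The two preparation steps above, followed by a cosine comparison, can be sketched on a toy vocabulary. The `toy_vectors` dictionary and the helper names are hypothetical stand-ins invented for illustration; with the real model, the membership test and lookups would go through `google_model` instead. This is a minimal sketch under those assumptions, not the exercise solution.

```python
import numpy as np

# Hypothetical 3-dimensional "word vectors" standing in for a real model
toy_vectors = {
    "cats": np.array([1.0, 0.0, 0.0]),
    "dogs": np.array([0.8, 0.6, 0.0]),
    "chase": np.array([0.0, 1.0, 0.0]),
    "cars": np.array([0.0, 0.0, 1.0]),
}

def tokenize_and_filter(doc, vectors):
    """Lowercase and split a document, keeping only in-vocabulary words."""
    return [w for w in doc.lower().split() if w in vectors]

def doc_vector(words, vectors):
    """Average the vectors of the given words."""
    return np.mean([vectors[w] for w in words], axis=0)

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "the" and "fast" are not in the toy vocabulary, so they are dropped
doc_a = tokenize_and_filter("Cats chase the dogs", toy_vectors)
doc_b = tokenize_and_filter("Dogs chase fast cars", toy_vectors)
sim_ab = cosine(doc_vector(doc_a, toy_vectors), doc_vector(doc_b, toy_vectors))
print(doc_a, doc_b, round(sim_ab, 3))
```

Filtering before averaging matters because a single out-of-vocabulary word would otherwise abort the lookup for the whole document.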

In [None]:
# Comparing via Cosine Similarity
# student section here



In [None]:
# Comparing via Word Mover's Distance
# student section here



## Vocabulary Features

Each word is represented by a vector of 300 features.

In [None]:
len(google_model.word_vec('cat'))

In [None]:
google_model.word_vec('cat')[:20]

The cosine similarity between words can be computed and produces intuitive trends.

In [None]:
print(google_model.similarity('cat', 'cat'))
print(google_model.similarity('cat', 'dog'))
print(google_model.similarity('cat', 'car'))

In [None]:
print(google_model.similarity('car', 'truck'))
print(google_model.similarity('car', 'drive'))
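
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it directly with NumPy on made-up vectors; the function name `cosine_similarity` is chosen for illustration and is not part of the notebook's helper module.

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])

same_direction = cosine_similarity(u, 2 * u)   # parallel vectors
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine_similarity(u, -u)            # anti-parallel vectors
print(same_direction, orthogonal, opposite)
```

Because the measure depends only on direction, scaling a vector does not change its similarity to any other vector.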

Word2Vec captures some interesting similarities between words, such as the relationship between **man --> king** and **woman --> queen**.

In [None]:
google_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=3)

In [None]:
google_model.wmdistance("Obama is the president of the United States".lower().split(), 
                        "Bush was the president of the United States".lower().split())

##### Question: Why does the `most_similar` command above take so much longer than the others?
Because it has to generate a new vector, **woman** + **king** - **man**, compare that vector to all 3 million vectors, and then sort the results to find the closest 3.  The 3 million stored vectors are kept in a form that can be compared quickly, but a newly built vector is not.
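
The arithmetic described in this answer can be sketched on a tiny, made-up vocabulary. The 2-dimensional vectors in `toy` and the helper `most_similar_to` are hypothetical; the real model may weight or normalize vectors differently, so treat this only as an illustration of "build a query vector, then scan every word".

```python
import numpy as np

# Hypothetical 2-d embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
toy = {
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def most_similar_to(query, vectors, exclude, topn=1):
    scores = []
    # One cosine comparison per vocabulary word: this scan is the costly part
    for word, vec in vectors.items():
        if word in exclude:
            continue
        score = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        scores.append((word, float(score)))
    # Sort by similarity, highest first, and keep the top entries
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

query = toy["woman"] + toy["king"] - toy["man"]
result = most_similar_to(query, toy, exclude={"woman", "king", "man"})
print(result)
```

With five words the scan is instant; with 3 million words of 300 dimensions each, the same loop touches roughly 900 million numbers per query.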

It can also detect words that don't belong in a sequence.

In [None]:
google_model.doesnt_match("breakfast cereal dinner lunch".split())

## Question 4
**Training a Word2Vec Model**
* Read the spam dataset into a Dataframe.
* Preprocess the documents using our previous `accessory_functions`.
* Split each document into a list of words.
* Use the results to train your own word2vec model with `gensim`.

##### Question: What does "training a word2vec model" mean?
Answer: generating good word vectors, or "word embeddings" for all unique words in the text dataset.  "Good" means similar words have their vectors cluster close together.

In [None]:
import pandas as pd

spam_data = pd.read_csv('../data/spam.csv', sep='\t', 
                        header=None, names=["label", "text"])                         

spam_data.head()

Now let's train the Word2Vec model.  `gensim` requires the documents to be represented as a ***list of sentences*** to train Word2Vec.  Here we'll do this by calling `split()` on our document text to turn each document into a list of words.

In [None]:
from gensim.models import Word2Vec
# Generate sentences for training word2vec
sentences = spam_data.text.str.split()
# Train a Word2Vec model
# student section here




Check out some conceptual comparisons with our word2vec model.

In [None]:
spam_model.most_similar('call')

Not bad, but if you investigate deeper you realize the results are so-so at best:

In [None]:
spam_model.most_similar('love')

In [None]:
spam_model.most_similar('start')

Word2Vec requires **A LOT** of data to get a really good set of vectors.  Thankfully, as you saw above, Google (and others) have done this work for you, and you can usually just load in their vectors for your tasks.

## Word2Vec for Machine Learning

As you know, once we have vectors for examples, we can perform Machine Learning (both supervised and unsupervised).

#### From Word Vectors to Document Vectors
Consider the case of document classification.  From Word2Vec we have vectors for words, but our examples to classify are documents.  

How do we get vectors for whole documents?

The most common answer is to take an average of all the word vectors in a document.  Let's try that with our spam data.

In [None]:
# Function to take a document as a list of words and return the document vector
def get_doc_vec(words, model):
    good_words = []
    for word in words:
        # Looking up a word that is not in the model raises a KeyError
        try:
            if model[word] is not None:
                good_words.append(word)
        except KeyError:
            continue
    # If no words are in the model, there is nothing to average
    if len(good_words) == 0:
        return None
    # Return the mean of the vectors for all the good words
    return model[good_words].mean(axis=0)

Use our function to generate the document vectors for our spam data.

In [None]:
# Make a copy of the data to not disturb the original
spam_data1 = spam_data.copy()
spam_vecs = spam_data1.text.str.split().map(lambda x: get_doc_vec(x, spam_model))
spam_vecs

Some of the documents have no good words in them, so their document vector is `None`.  Let's attach the vectors to the DataFrame first, and then drop those documents:

In [None]:
# Add to dataframe
spam_data1['vecs'] = spam_vecs
# Drop the bad docs
spam_data1 = spam_data1.dropna()
spam_data1.shape

Now let's just convert the format that we have into a final DataFrame with 100 features and 1 label for use in our document classification task:

In [None]:
# Create a Numpy array of the document vectors
spam_np_vecs = np.zeros((len(spam_data1), 100))
for i, vec in enumerate(spam_data1.vecs):
    spam_np_vecs[i, :] = vec
    
# Combine the full dataframe with the labels
spam_w2v_data = pd.concat([spam_data1.reset_index().label, pd.DataFrame(spam_np_vecs)], axis=1)
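
The loop above fills a pre-allocated array one row at a time. As a sketch of an equivalent approach, `np.vstack` stacks a list of equal-length vectors into one 2-D array. The `vecs` values below are invented for illustration; a pandas column of vectors would first need converting to a list.

```python
import numpy as np

# Made-up document vectors of length 3
vecs = [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])]

# Loop version: pre-allocate, then copy each vector into its row
looped = np.zeros((len(vecs), 3))
for i, vec in enumerate(vecs):
    looped[i, :] = vec

# Stacked version: one call builds the same array
stacked = np.vstack(vecs)
print(stacked.shape, np.array_equal(looped, stacked))
```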

Now we've arrived at a familiar point.  We have features and 1 label and we can use them to perform text classification.  As you already know how to do this, it is left for the exercises.

# Word2Vec Exercises
In these exercises, we'll finish experimenting with the spam data by using your word2vec vectors for text classification.  You will classify spam/ham using **both** your vectors and the pretrained Google vectors.  Then we'll move on to a richer dataset for you to perform the entire pipeline for Text Classification with Word2Vec.

## Spam Classification with Word2Vec

### Question 1
Use the spam word2vec dataframe from above to train and evaluate any classification algorithm for spam/ham.  **Hint**: Try a K-Nearest Neighbors Classifier.

Use the Google vectors from the beginning of this notebook to generate a new spam word2vec dataframe for text classification.

Build yet another spam/ham classification model using the word2vec vectors from Google.

Compare the performance of the 2 classifiers.
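
To make the hint concrete, here is a from-scratch sketch of K-Nearest Neighbors in NumPy on synthetic data. Everything in it is an assumption made for illustration: the two clusters are randomly generated stand-ins for document vectors, and `knn_predict` is a hypothetical helper. In the exercise itself, scikit-learn's `KNeighborsClassifier` and `train_test_split` do this work.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic "document vectors": two well-separated clusters with labels 0 and 1
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 5)),
               rng.normal(3.0, 0.5, size=(50, 5))])
y = np.array([0] * 50 + [1] * 50)

# Shuffle, then hold out 30% of the rows for testing
order = rng.permutation(len(X))
X, y = X[order], y[order]
split = int(0.7 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

def knn_predict(X_train, y_train, X_new, k=5):
    # Euclidean distance from every new point to every training point
    dists = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]      # indices of the k closest
    votes = y_train[nearest]                        # labels of those neighbors
    return (votes.mean(axis=1) > 0.5).astype(int)   # majority vote for 0/1 labels

accuracy = float((knn_predict(X_train, y_train, X_test) == y_test).mean())
print(accuracy)
```

The classifier only ever sees distances between vectors, which is why the quality of the document vectors drives the results in the comparisons that follow.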

#### Trained Word2Vec Spam Classifier

In [None]:
## Training a Classifier with our own trained vectors
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# student section here
# Split the data


# Train a KNN or Logistic Regression classifier




#### Document Vectors from Google Vectors

In [None]:
# Make a copy of the spam dataframe for the Google work
spam_data2 = spam_data.copy()

# Retrieve the document vectors based on google word vectors
spam_google_vecs = spam_data2.text.str.split().map(lambda x: get_doc_vec(x, google_model))

# Add to dataframe
spam_data2['vecs'] = spam_google_vecs

# Drop the bad docs
spam_data2 = spam_data2.dropna()

# Create a Numpy array of the document vectors
spam_np_vecs = np.zeros((len(spam_data2), 300))
for i, vec in enumerate(spam_data2.vecs):
    spam_np_vecs[i, :] = vec
    
# Combine the full dataframe with the labels
spam_google_data = pd.concat([spam_data2.reset_index().label, pd.DataFrame(spam_np_vecs)], axis=1)

#### Google Word2Vec Spam Classifier

In [None]:
## Training a Classifier with Google's vectors
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# student section here
# Split the data


# Train a KNN or Logistic Regression classifier




You should see that there is not much difference in performance between your vectors and Google's for this task (though Google's is better).  **However**, this is a fairly trivial problem/dataset, and we were already performing quite well on it.  So let's try something harder!

## 20 Newsgroups Classification with Word2Vec

### The Data

We will be using a sample of the famous 20 Newsgroups data set, which contains approximately 20,000 posts partitioned evenly across 20 different newsgroups. Our sample contains 5 topics and about 3,000 posts.

We will begin by loading the data.

In [None]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from accessory_functions import preprocess_series_text, nltk_path

topic_list = ['sci.space', 'comp.sys.mac.hardware', 'rec.autos',
              'rec.sport.baseball', 'sci.med']

# Retrieve the data into a DataFrame
dataset = fetch_20newsgroups(shuffle=True, random_state=1, data_home='../Data',
                             categories=topic_list,
                             remove=('headers', 'footers', 'quotes'))
ng_data = pd.DataFrame(dataset['data'], columns=['text'])
ng_data['label'] = dataset['target']

# Preprocess the text
ng_data['text'] = preprocess_series_text(ng_data.text, nltk_path=nltk_path)


print(len(ng_data))
ng_data.head()

### Question 2
Train a Word2Vec model to generate word vectors from the 20 Newsgroups data.

Use your Word2Vec model to generate document vectors from these word vectors.

Combine these vectors with the 20 Newsgroups class labels to create a DataFrame for classification.

Train a classification model for these five 20 Newsgroups classes and evaluate its performance.  **Hint**: Try a K-Nearest Neighbors Model.

#### Training Word2Vec on 20 Newsgroups

In [None]:
from gensim.models import Word2Vec

# student section here

# Generate sentences for training word2vec

# Train a Word2Vec model


# Did it work?
ng_model.most_similar('baseball')

#### Document Vectors from Word Vectors

In [None]:
# Make a copy of the newsgroups dataframe for our own trained vectors
ng_data1 = ng_data.copy()

# Retrieve the document vectors based on our trained word vectors
ng_vecs = ng_data1.text.str.split().map(lambda x: get_doc_vec(x, ng_model))

# Add to dataframe
ng_data1['vecs'] = ng_vecs

# Drop the bad docs
ng_data1 = ng_data1.dropna()

# Create a Numpy array of the document vectors
ng_np_vecs = np.zeros((len(ng_data1), 100))
for i, vec in enumerate(ng_data1.vecs):
    ng_np_vecs[i, :] = vec
    
# Combine the full dataframe with the labels
ng_w2v_data = pd.concat([ng_data1.reset_index().label, pd.DataFrame(ng_np_vecs)], axis=1)

#### Trained Word2Vec 20 Newsgroups Classifier

In [None]:
## Training a Classifier with our own trained vectors
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# student section here

# Split the data



# Train a KNN or Logistic Regression classifier



### Question 3
Use the Google vectors to generate document vectors for the 20 Newsgroups data.

Combine these vectors with the 20 Newsgroups class labels to create a DataFrame for classification.

Train a classification model for these five 20 Newsgroups classes and evaluate its performance.  **Hint**: Try a K-Nearest Neighbors Model.

Note the performance of the Google vectors vs your own Word2Vec training.

#### 20 Newsgroups Document Vectors from Google Word Vectors

In [None]:
# Make a copy of the newsgroups dataframe for the Google work
ng_data2 = ng_data.copy()

# Retrieve the document vectors based on google word vectors
ng_google_vecs = ng_data2.text.str.split().map(lambda x: get_doc_vec(x, google_model))

# Add to dataframe
ng_data2['vecs'] = ng_google_vecs

# Drop the bad docs
ng_data2 = ng_data2.dropna()

# Create a Numpy array of the document vectors
ng_np_vecs = np.zeros((len(ng_data2), 300))
for i, vec in enumerate(ng_data2.vecs):
    ng_np_vecs[i, :] = vec
    
# Combine the full dataframe with the labels
ng_google_data = pd.concat([ng_data2.reset_index().label, pd.DataFrame(ng_np_vecs)], axis=1)

#### Google Word2Vec Newsgroups Classifier

In [None]:
## Training a Classifier with Google's vectors
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# student section here

# Split the data



# Train a KNN or Logistic Regression classifier



**WOW!**  Look how much better the Google vectors did.  This should demonstrate how valuable a good set of word vectors can be.

### Question 4
Recall TFIDF Vectorizer from the previous exercises.  Generate a DataFrame for classification using `TfidfVectorizer` and the 20 Newsgroups subset.

Train a classification model on the TFIDF results using either Logistic Regression or Naive Bayes.

Compare the results between your TFIDF model, trained word2vec, and Google word2vec.  Which does best?
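
As a reminder of what TFIDF weights measure, the sketch below computes a plain textbook variant (term frequency times the log of inverse document frequency) on three invented documents. The `docs` list and the `tfidf` helper are hypothetical, and `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three tiny made-up documents, already split into words
docs = [
    "the rocket reached orbit".split(),
    "the pitcher threw the ball".split(),
    "the rocket engine failed".split(),
]

n_docs = len(docs)
# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    # Term frequency times inverse document frequency (textbook form)
    return {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}

weights = tfidf(docs[1])
print(weights)
```

A word that appears in every document, such as "the" here, gets a weight of zero, while a word unique to one document, such as "pitcher", gets a positive weight. This is why trigger words stand out so clearly under TFIDF.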

#### 20 Newsgroups Classification with TFIDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


# student section here

# Generate TFIDF Vectors


# Split the data


# Train a Logistic Regression


# Evaluate


Well, in this case the TFIDF did great, on par with or better than word2vec.  However, TFIDF will often do very well in cases where certain trigger words are highly useful in distinguishing between classes.  If you look through the different document categories that we chose, they are ***highly*** different.  In cases where the differences between classes are more subtle, you should expect a Word2Vec model to strongly outperform TFIDF.  With more time, you can try downloading all 20 categories from 20 Newsgroups and seeing how the various models perform.  The full set has classes that are much tougher to tease apart than the relatively disjoint subset that we used.