##### Vector Space Models for Text - LSI and Word2vec

Yesterday, we used [**Latent Dirichlet Allocation (LDA)**](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) in [gensim](http://radimrehurek.com/gensim/index.html) to map text documents ([20 Newsgroups dataset](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html)) from a **word space** to a **topic space** that could give us the **topic distribution** of different documents in our corpus as well as allow us to make **conceptual comparisons** of documents in the reduced topic space.

Today, we'll continue mapping text documents from a high-dimensional word (or token) space into a much smaller **semantic space**, which allows us to make valuable **conceptual comparisons** between arbitrary blocks of text in this new vector space.

Thus the **input is a large corpus of text documents** and the **output is a reduced semantic space for those input documents and words**.  These starting/ending points are constant, but we'll take 2 different approaches for the process in between:
1.  [**Latent Semantic Indexing (LSI)**](https://en.wikipedia.org/wiki/Latent_semantic_analysis) - performs a [**Singular Value Decomposition (SVD)**](https://en.wikipedia.org/wiki/Singular_value_decomposition) on a [**document-term matrix**](https://en.wikipedia.org/wiki/Document-term_matrix) with [**TFIDF Weightings**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to map all the terms in the corpus into a reduced **term space** and all the documents into a reduced **document space**.  
    - The 2 spaces are related by a simple transformation, so we can perform arbitrary **term-term**, **doc-doc**, and **doc-term comparisons** via [**cosine similarity**](https://en.wikipedia.org/wiki/Cosine_similarity).
    - These 2 spaces make up the **"dual space"**
        - Every document is the weighted sum of all of its terms
        - Every term is the weighted sum of all the documents it occurs in (very useful!)
2.  [**Word2Vec**](https://en.wikipedia.org/wiki/Word2vec) - uses a neural network to learn a **term space**
    - Has additional nice properties of term vectors, such as conceptual additivity (see below)
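
The "dual space" relationship in LSI can be checked numerically. Below is a minimal numpy sketch on a made-up 4-term by 3-document matrix (the numbers are illustrative only and are not taken from the corpus used in this notebook). It assumes the plain SVD formulation, with term vectors taken from `U` and document vectors from `V`.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents); values are made up
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# Thin SVD: A = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

term_vecs = U       # one row per term
doc_vecs = Vt.T     # one row per document

# Each document is a weighted sum of the term vectors (weights = its term counts),
# rescaled by the singular values
docs_from_terms = A.T.dot(U) / s
# Each term is a weighted sum of the vectors of the documents it occurs in
terms_from_docs = A.dot(Vt.T) / s

dual_space_holds = (np.allclose(docs_from_terms, doc_vecs)
                    and np.allclose(terms_from_docs, term_vecs))
print(dual_space_holds)
```

The same relationship is what makes it possible to place a new document into an existing space using only its terms.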

## Goals
- Continue to use gensim to implement text modeling
- Build an LSI vector space from a training set
- Use the LSI space to compare terms and documents to one another conceptually
- Use the LSI space to perform document clustering and classification
- Use Word2vec to create a vector space for words in a training set
- Use the Word2vec space to do simple comparisons between different combinations of words
- Discuss various other considerations, tasks, and extensions for VSMs like LSI and Word2vec

#### Install gensim

In [1]:
## pip install --upgrade gensim

##### imports

In [2]:
# gensim
from gensim import corpora, models, similarities, matutils
# sklearn
from sklearn import datasets
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
# logging for gensim (set to INFO)
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## LSI
For us, LSI will consist of the following steps:
1. Getting the data
2. Text Preprocessing
3. Create Document-Term matrix
4. Apply TFIDF weights to document-term matrix
5. Perform SVD on TFIDF matrix
6. Use resulting term and document space
    - Term-term comparisons
    - Term-document comparisons
    - Document-document comparisons
    - Document clustering
    - Document classification

### Getting the Data
Yesterday's LDA examples used the 20 Newsgroups dataset. Here we load a pickled set of abstracts (`aaai_abstracts.pkl`) together with their topics (`aaai_topics.pkl`):

In [3]:
import pickle

# Pickle files must be opened in binary mode
with open("aaai_topics.pkl", 'rb') as datafile:
    categories = pickle.load(datafile)

with open("aaai_abstracts.pkl", 'rb') as datafile:
    abstracts = pickle.load(datafile)

# Make sure every abstract is a plain string
abstracts = [str(abstract) for abstract in abstracts]


### Preprocessing
We'll need to generate a term-document matrix of word (token) counts for use in LSI.  LSI requires that we go a step further in our processing by adding TFIDF weightings to our counts matrix.  We could just use the `TfidfVectorizer` in `sklearn`, but `gensim` has its own `TfidfModel`, so we'll save that for later.

We'll use `sklearn`'s `CountVectorizer` to generate our term-document matrix of counts. We'll make use of a few parameters to accomplish the following preprocessing of the text documents all within the `CountVectorizer`:
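* `analyzer='word'`: Tokenize by word
* `ngram_range=(1,2)`: Keep all 1- and 2-word grams
* `stop_words='english'`: Remove all English stop words
* `token_pattern='\\b[a-z][a-z]+\\b'`: Keep only tokens made of two or more lowercase letters (numbers and single characters are dropped)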
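
To see what the token pattern alone does, here is a small sketch using Python's `re` module on a made-up sentence. It only mimics the tokenization step; stop word removal and 2-word grams are not applied here.

```python
import re

# Same pattern as passed to CountVectorizer: two or more lowercase letters
token_pattern = re.compile(r'\b[a-z][a-z]+\b')

# Made-up sentence; CountVectorizer lowercases by default, so we do the same
text = "A kernelized Bayesian model with 2 tasks"
tokens = token_pattern.findall(text.lower())
print(tokens)
```

The single-letter word and the digit are dropped, while every longer alphabetic token survives.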

In [4]:
# Create a CountVectorizer for parsing/counting words
count_vectorizer = CountVectorizer(analyzer='word',
                                  ngram_range=(1, 2), stop_words='english',
                                  token_pattern='\\b[a-z][a-z]+\\b')
count_vectorizer.fit(abstracts)

CountVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='\\b[a-z][a-z]+\\b',
        tokenizer=None, vocabulary=None)

In [5]:
# Create the term-document matrix
# Transpose it so the terms are the rows
ng_vecs = count_vectorizer.transform(abstracts).transpose()
ng_vecs.shape

(34527, 398)

##### Convert to gensim
We need to convert our sparse `scipy` matrix to a `gensim`-friendly object called a Corpus:

In [6]:
# Convert sparse matrix of counts to a gensim corpus
corpus = matutils.Sparse2Corpus(ng_vecs)
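
A gensim corpus is streamed as one sparse vector per document, each a list of `(id, value)` pairs. The plain-Python sketch below shows that shape on a made-up counts matrix; `columns_to_corpus` is a hypothetical helper written for illustration and is not how gensim implements the conversion.

```python
# Toy term-document counts (rows = terms, columns = documents); values are made up
counts = [[2, 0, 1],
          [0, 0, 3],
          [1, 4, 0]]

def columns_to_corpus(matrix):
    """Turn each column into a list of (term_id, count) pairs, skipping zeros."""
    n_docs = len(matrix[0])
    return [[(term_id, row[doc_id])
             for term_id, row in enumerate(matrix) if row[doc_id] != 0]
            for doc_id in range(n_docs)]

toy_corpus = columns_to_corpus(counts)
for doc in toy_corpus:
    print(doc)
```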

##### Map matrix rows to words (tokens)
We need to save a mapping (dict) of row id to word (token) for later use by gensim:

In [7]:
# vocabulary_ maps token -> row id; invert it to get row id -> token
id2word = dict((v, k) for k, v in count_vectorizer.vocabulary_.items())

### TFIDF
LSI requires us to go one step further than LDA in preprocessing: we need to calculate [TFIDF weights](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) from our term-document matrix of word counts.  Here's how we do it in gensim:

In [8]:
# Create a TFIDF transformer from our word counts (equivalent to "fit" in sklearn)
tfidf = models.TfidfModel(corpus)

To give each document a vector in the "TFIDF space", we need to actually do the transform step with our TfidfModel like so:

In [9]:
# Create a TFIDF vector for all documents from the original corpus ("transform" in sklearn)
tfidf_corpus = tfidf[corpus]
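
For intuition, here is a hand-rolled sketch of one common TFIDF variant on a made-up corpus: raw count times `log2(N / df)`, followed by scaling each document to unit length. This is an assumption about the weighting chosen for illustration; gensim's `TfidfModel` is configurable and may differ in detail.

```python
import math

# Toy corpus in bag-of-words form: one list of (term_id, count) per document
toy_corpus = [[(0, 2), (2, 1)],
              [(2, 4)],
              [(0, 1), (1, 3)]]

n_docs = len(toy_corpus)

# Document frequency: in how many documents each term appears
doc_freq = {}
for doc in toy_corpus:
    for term_id, _ in doc:
        doc_freq[term_id] = doc_freq.get(term_id, 0) + 1

def tfidf_vector(doc):
    """Weight counts by log2(N / df), then scale the vector to unit length."""
    weighted = [(t, c * math.log(float(n_docs) / doc_freq[t], 2)) for t, c in doc]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(t, w / norm) for t, w in weighted if norm > 0]

toy_tfidf = [tfidf_vector(doc) for doc in toy_corpus]
for vec in toy_tfidf:
    print(vec)
```

Note how the term that appears in only one document ends up with the largest weight in that document, even after accounting for its count.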

#### Using TFIDF
At this point, we already have mapped our original inputs (text documents) into a vector space (TFIDF space) with a dimensionality equal to the total number of unique terms in our corpus.  That means that, in theory, we could go ahead and see if this vector space can tell us anything interesting.  We could try **comparing documents** by something like [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).  We could try machine learning methods like document **clustering** on the vectors, or **classification/regression** if we have labeled documents.
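
As a reminder of what cosine similarity computes, here is a minimal pure-Python sketch over sparse vectors stored as `{term_id: weight}` dicts. The vectors are made up, and the dict representation is an assumption for readability rather than gensim's format.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors given as {term_id: weight} dicts."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

doc_a = {0: 1.0, 1: 2.0}
doc_b = {0: 2.0, 1: 4.0}   # same direction as doc_a, just longer
doc_c = {2: 5.0}           # shares no terms with doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(sim_ab)
print(sim_ac)
```

Because only the direction matters, a long document and a short one with the same term proportions count as identical, while documents with no terms in common score zero.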

A common approach for document classification is to now try [Naive Bayes Classification](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Document_classification) using our TFIDF space.  This can work quite well if documents of a given class have very class-specific words such that only documents of that class use them frequently.  For instance, this often works quite well for identifying authors by word choice.

##### Curse of Dimensionality
Although TFIDF (and even counts) gives us vectors, as you might guess we're likely to run into the curse of dimensionality.  In a dataset of any reasonable size, there are likely to be tons of unique terms and thus really high-dimensional vectors.  Naive Bayes is highly resistant to the curse, thus explaining why it can work well for the right dataset.  

However, we should be heavily inclined now to try some dimensionality reduction to alleviate the curse...  
Enter: SVD

### SVD
For LSI, the SVD does our dimensionality reduction on the TFIDF space.  In effect, it reduces the space by coupling together terms of very similar meaning and thus alleviating redundant and collinear features (terms).
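
The "coupling" effect can be seen on a tiny example. The numpy sketch below uses a made-up 4-term by 4-document matrix in which `car`/`auto` and `flower`/`petal` co-occur; it keeps the top 2 singular directions and compares term similarity before and after. The matrix, the term names, and the choice of `k = 2` are all assumptions for illustration.

```python
import numpy as np

terms = ['car', 'auto', 'flower', 'petal']
# Toy term-document matrix (rows = terms, columns = documents); values are made up
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 1.0],
              [0.0, 0.0, 1.0, 3.0]])

def cosine(u, v):
    return float(u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v)))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                          # number of dimensions to keep
term_vecs = U[:, :k] * s[:k]   # term coordinates in the reduced space

sim_raw = cosine(A[0], A[1])                        # car vs auto, original space
sim_lsi = cosine(term_vecs[0], term_vecs[1])        # car vs auto, reduced space
sim_unrelated = cosine(term_vecs[0], term_vecs[2])  # car vs flower, reduced space
print(sim_raw)
print(sim_lsi)
print(sim_unrelated)
```

In the original space the two related terms are merely similar; after truncation they collapse onto the same direction, while the unrelated pair stays orthogonal.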

So now that we've taken care of the TFIDF bit, let's crank through the SVD and build the LSI space in gensim:

In [10]:
# Build an LSI space from the input TFIDF matrix, mapping of row id to word, and num_topics
# num_topics is the number of dimensions to reduce to after the SVD
# Analogous to "fit" in sklearn, it primes an LSI space
lsi = models.LsiModel(tfidf_corpus, id2word=id2word, num_topics=200)

Now that we have a trained LSI space, we want to do the transform step to figure out where all of the original documents lie in that num_topics=200 dimensional space:

In [11]:
# Retrieve vectors for the original tfidf corpus in the LSI space ("transform" in sklearn)
lsi_corpus = lsi[tfidf_corpus]

In [12]:
# Dump the resulting document vectors into a list so we can take a look
doc_vecs = [doc for doc in lsi_corpus]

### Conceptual Similarity Between Documents
Now that we have vectors in the LSI space, we can compare any indexed document to any other indexed document in the space via cosine similarity.  gensim allows us to do this like so:

In [13]:
# Create an index transformer that calculates similarity based on our space
index = similarities.MatrixSimilarity(doc_vecs)



In [22]:
# For every document, rank all documents by cosine similarity to it
# and keep the 3 closest (position 0 is normally the document itself, so skip it)
recs = []
for i in range(len(doc_vecs)):
    sims = sorted(enumerate(index[doc_vecs[i]]), key=lambda item: -item[1])
    recs.append(sims[1:4])
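
The same ranking idea can be sketched with plain numpy on dense vectors. This version masks out each document's similarity to itself explicitly instead of relying on it landing at rank 0. The function name `top_k_similar` and the toy vectors are hypothetical, and the sketch assumes no all-zero rows.

```python
import numpy as np

def top_k_similar(vectors, k=3):
    """For each row, return indices of the k most cosine-similar other rows."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit.dot(unit.T)
    np.fill_diagonal(sims, -np.inf)   # never recommend a document to itself
    return np.argsort(-sims, axis=1)[:, :k]

# Made-up 2-d "document vectors": rows 0/1 point one way, rows 2/3 another
toy_vecs = np.array([[1.0, 0.0],
                     [0.9, 0.1],
                     [0.0, 1.0],
                     [0.1, 0.9]])
toy_recs = top_k_similar(toy_vecs, k=2)
print(toy_recs)
```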



In [23]:
# Pickle files must be written in binary mode
with open("rec.pkl", 'wb') as datafile:
    pickle.dump(recs, datafile)

How'd we do??  Let's check the most similar doc!

In [16]:
abstracts[0]

'Transfer learning considers related but distinct tasks defined on heterogenous domains and tries to transfer knowledge between these tasks to improve generalization performance. It is particularly useful when we do not have sufficient amount of labeled training data in some tasks, which may be very costly, laborious, or even infeasible to obtain. Instead, learning the tasks jointly enables us to effectively increase the amount of labeled training data. In this paper, we formulate a kernelized Bayesian transfer learning framework that is a principled combination of kernel-based dimensionality reduction models with task-specific projection matrices to find a shared subspace and a coupled classification model for all of the tasks in this subspace. Our two main contributions are: (i) two novel probabilistic models for binary and multiclass classification, and (ii) very efficient variational approximation procedures for these models. We illustrate the generalization performance of our algo

In [17]:
abstracts[1]

'Transfer learning uses relevant auxiliary data to help the learning task in a target domain where labeled data are usually insufficient to train an accurate model. Given appropriate auxiliary data, researchers have proposed many transfer learning models. How to find such auxiliary data, however, is of little research in the past. In this paper, we focus on this auxiliary data retrieval problem, and propose a transfer learning framework that effectively selects helpful auxiliary data from an open knowledge space (e.g. the World Wide Web). Because there is no need of manually selecting auxiliary data for different target domain tasks, we call our framework Source Free Transfer Learning (SFTL). For each target domain task, SFTL framework iteratively queries for the helpful auxiliary data based on the learned model and then updates the model using the retrieved auxiliary data. We highlight the automatic constructions of queries and the robustness of the SFTL framework. Our experiments on 

Well my word, that looks pretty darn similar!

### Machine Learning with LSI Vectors
We have (very good, 200-dimensional) vectors for our documents now!  So we can do any machine learning we want on our documents!

In [18]:
# Convert the gensim-style corpus vecs to a numpy array for sklearn manipulations
X = matutils.corpus2dense(lsi_corpus, num_terms=200).transpose()
X.shape

(398, 200)

#### Clustering LSI Vectors:
Let's try clustering our documents with `sklearn` to see if we can notice any obvious clusters:

In [19]:
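from sklearn.cluster import AgglomerativeClustering
from collections import defaultdict

# Complete-linkage agglomerative clustering with L1 (Manhattan) distance
# (newer scikit-learn releases name the `affinity` parameter `metric`)
agg_clust = AgglomerativeClustering(n_clusters=22, linkage='complete', affinity='l1')
clusters = agg_clust.fit_predict(X)

# Count how many documents landed in each cluster
cluster_quantity = defaultdict(int)
for i in clusters:
    cluster_quantity[str(i)] += 1

# Print each cluster id with its (truncated) percentage share of the corpus
total = sum(cluster_quantity.values())
for k, v in cluster_quantity.items():
    print("%s %d" % (k, int(v / float(total) * 100)))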

20 1
21 1
1 6
0 2
3 5
2 4
5 1
4 4
7 4
6 6
9 8
8 2
11 3
10 22
13 1
12 2
15 4
14 5
17 0
16 9
19 1
18 2


In [20]:
# Pickle files must be written in binary mode
with open("clusters.pkl", 'wb') as datafile:
    pickle.dump(clusters, datafile)

In [21]:
# Create our cluster predictions for each document
# n_clusters=5 is an assumption; the value used in the original run was not recorded
kmeans = KMeans(n_clusters=5)
preds = kmeans.fit_predict(X)

In [28]:
preds[0:20]

array([2, 0, 2, 2, 2, 4, 4, 4, 2, 3, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2], dtype=int32)

We could examine this further to see how it compares to original labels, but instead maybe we should just classify since we have a labelled dataset?
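
If we did want to compare clusters to labels, one simple option is cluster purity: the fraction of documents that carry the majority label of their cluster. The sketch below is one possible measure among several, shown on made-up cluster ids and labels; `cluster_purity` is a hypothetical helper, not an sklearn function.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, labels):
    """Fraction of items whose label matches the majority label of their cluster."""
    by_cluster = defaultdict(list)
    for c, y in zip(cluster_ids, labels):
        by_cluster[c].append(y)
    # For each cluster, count how many members share its most common label
    majority_total = sum(Counter(ys).most_common(1)[0][1]
                         for ys in by_cluster.values())
    return majority_total / float(len(labels))

# Made-up cluster assignments and labels for six documents
toy_clusters = [0, 0, 0, 1, 1, 2]
toy_labels = ['ml', 'ml', 'nlp', 'nlp', 'nlp', 'vision']
toy_purity = cluster_purity(toy_clusters, toy_labels)
print(toy_purity)
```

Purity rises trivially as the number of clusters grows, so it is best read alongside the cluster count.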

#### Classifying LSI Vectors:
Let's try some simple classification on the resulting LSI vectors for the 20 NG set and see how we do:

In [31]:
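from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
from sklearn.neighbors import KNeighborsClassifier
import sklearn.metrics.pairwise as smp
import numpy as np
# Hold out 30% of the documents for testing
# ng_train is not defined in this notebook; it is assumed to be the 20 Newsgroups training set from the LDA session
X_train, X_test, y_train, y_test = train_test_split(X, ng_train.target, test_size=0.3)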

In [32]:
# Fit KNN classifier to training set with cosine distance
knn = KNeighborsClassifier(n_neighbors=3, metric=smp.cosine_distances)
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30,
           metric=<function cosine_distances at 0x1077a8b18>,
           metric_params=None, n_neighbors=3, p=2, weights='uniform')

In [33]:
# Score against test set
knn.score(X_test, y_test)

0.79941291585127205

Other very cool methods, such as gensim's Word2vec implementation:

https://radimrehurek.com/gensim/models/word2vec.html

### But if you really want to refine your model, you'll need more data:


https://code.google.com/p/word2vec/

Download:  'freebase-vectors-skipgram1000-en.bin.gz'

### Using Word2vec in Models
The output is the same type of thing that we got for LSI: a semantic space of terms.  Thus, we can do all of the same types of things that we did with those vectors (term-term, doc-doc, term-doc, ML algorithms, etc) with the added benefits of some of the geometric relationships between terms that word2vec yields.
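
The geometric relationships mentioned above are often illustrated with vector arithmetic on analogies. The sketch below uses hand-made 2-dimensional vectors purely for illustration; real word2vec vectors are learned from data and have hundreds of dimensions, and `nearest` is a hypothetical helper rather than a gensim call.

```python
import numpy as np

# Hand-made toy vectors (NOT learned): axis 0 ~ "royalty", axis 1 ~ "gender"
toy_vectors = {
    'king':  np.array([1.0, 1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.0, 1.0]),
    'woman': np.array([0.0, -1.0]),
    'apple': np.array([-1.0, 0.0]),
}

def nearest(vec, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to vec."""
    best_word, best_sim = None, -2.0
    for word, other in vectors.items():
        if word in exclude:
            continue
        sim = vec.dot(other) / (np.linalg.norm(vec) * np.linalg.norm(other))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# "king" - "man" + "woman" should land near "queen"
target = toy_vectors['king'] - toy_vectors['man'] + toy_vectors['woman']
analogy = nearest(target, toy_vectors, exclude=('king', 'man', 'woman'))
print(analogy)
```

The query words themselves are excluded from the candidates, since they would otherwise tend to dominate the nearest-neighbor search.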

#### Some things to keep in mind when using Word2vec:

1) Word2vec requires a lot of data to train.

As noted above, you can download pretrained vectors. However, if you need to train on your own data, 
you will need a lot of it (think hundreds of millions of words)!

OTHER REFERENCES:

- https://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis
- http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/


## Other Considerations, Extensions, and Applications for LSI/Word2vec
### Entity Extraction
### Stopword Selection
### Punctuation
### Stemming
### Alternative Weighting Schemes
### Optimal Dimensionality Selection
### Full Feature Utilization
### Multilingual Corpora
### Machine Translation
### Language Identification
### Majority Folding
### Term Folding
### Term Folding + Document Folding
### Recommendation