# DS-SF-23 | Demo 14 | Latent Variable Modeling with `gensim`

`gensim` (http://radimrehurek.com/gensim) is a library of language processing tools focused on latent variable models of text.

In [3]:
import os
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import feature_extraction
from gensim import matutils, models

pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

%matplotlib inline
plt.style.use('ggplot')

The data consists of Amazon product reviews.  Some reviews carry a sentiment label (1 for positive, 0 for negative); the rest are unlabeled.

In [4]:
reviews = []
sentiments = []

with open(os.path.join('..', 'datasets', 'amazon-reviews.txt')) as f:
    for line in f.readlines():
        line = line.strip('\n')
        review, sentiment = line.split('\t')
        sentiment = np.nan if sentiment == '' else int(sentiment)

        reviews.append(review.lower())
        sentiments.append(sentiment)

df = pd.DataFrame({'review': reviews, 'sentiment': sentiments})
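
The parsing step can be tried on a few in-memory lines before running it on the whole file.  This is only a minimal sketch: the helper name `parse_line` and the sample lines are illustrative assumptions, and it uses `float('nan')` instead of `np.nan` to stay dependency-free.

```python
import math

def parse_line(line):
    """Split one tab-separated line into (review, sentiment).

    An empty sentiment field becomes NaN, mirroring the loop above.
    """
    review, sentiment = line.rstrip('\n').split('\t')
    sentiment = float('nan') if sentiment == '' else int(sentiment)
    return review.lower(), sentiment

# Hypothetical sample lines in the same "review<TAB>sentiment" layout
sample_lines = [
    "Good case, Excellent value.\t1\n",
    "Battery for Motorola Razr.\t\n",
]

parsed = [parse_line(line) for line in sample_lines]
```

The second sample line has an empty label, so its sentiment comes back as NaN and would later be removed by `dropna`.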

In [5]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | i try not to adjust the volume setting to avoi... | NaN |
| 1 | so there is no way for me to plug it in here i... | 0.0 |
| 2 | good case, excellent value. | 1.0 |
| 3 | i thought motorola made reliable products!. | NaN |
| 4 | battery for motorola razr. | NaN |


In [6]:
df.dropna(inplace = True) # Drop the reviews that have no sentiment label (NaN)

In [7]:
df.head()

| | review | sentiment |
|---|---|---|
| 1 | so there is no way for me to plug it in here i... | 0.0 |
| 2 | good case, excellent value. | 1.0 |
| 5 | great for the jawbone. | 1.0 |
| 10 | tied to charger for conversations lasting more... | 0.0 |
| 11 | the mic is great. | 1.0 |


## LDA with `gensim`

### Let's first translate the set of documents (here, reviews) into a matrix representation with one row per document and one column per feature (word or n-gram)

In [8]:
vectorizer = feature_extraction.text.CountVectorizer(stop_words = 'english')

In [9]:
documents = vectorizer.fit_transform(df.review)
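
To see what a document-term count matrix looks like, here is a rough pure-Python sketch.  The toy documents and the `tokenize` helper are assumptions for illustration; unlike the `CountVectorizer` call above, the sketch does not remove stop words and builds a dense list of lists rather than a sparse matrix.

```python
import re
from collections import Counter

docs = ["great case, great value", "the mic is great"]  # toy documents (assumed)

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# One column per distinct token, in sorted order
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

# One row per document, one count per vocabulary column
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts.get(token, 0) for token in vocabulary])

# vocabulary -> ['case', 'great', 'is', 'mic', 'the', 'value']
# matrix     -> [[1, 2, 0, 0, 0, 1], [0, 1, 1, 1, 1, 0]]
```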

In [10]:
# Let's now build a mapping of numerical ID to word
# (scikit-learn 1.0+ offers get_feature_names_out(); get_feature_names() was removed in 1.2)

id2word = dict(enumerate(vectorizer.get_feature_names()))

In [11]:
id2word

{0: u'10',
 1: u'100',
 2: u'11',
 3: u'12',
 4: u'13',
 5: u'15',
 6: u'15g',
 7: u'18',
 8: u'20',
 9: u'2000',
 10: u'2005',
 11: u'2160',
 12: u'24',
 13: u'2mp',
 14: u'325',
 15: u'350',
 16: u'375',
 17: u'3o',
 18: u'42',
 19: u'44',
 20: u'45',
 21: u'4s',
 22: u'50',
 23: u'5020',
 24: u'510',
 25: u'5320',
 26: u'680',
 27: u'700w',
 28: u'8125',
 29: u'8525',
 30: u'8530',
 31: u'abhor',
 32: u'ability',
 33: u'able',
 34: u'abound',
 35: u'absolutel',
 36: u'absolutely',
 37: u'ac',
 38: u'accept',
 39: u'acceptable',
 40: u'access',
 41: u'accessable',
 42: u'accessing',
 43: u'accessory',
 44: u'accessoryone',
 45: u'accidentally',
 46: u'accompanied',
 47: u'according',
 48: u'activate',
 49: u'activated',
 50: u'activesync',
 51: u'actually',
 52: u'ad',
 53: u'adapter',
 54: u'adapters',
 55: u'add',
 56: u'addition',
 57: u'additional',
 58: u'address',
 59: u'adhesive',
 60: u'adorable',
 61: u'advertised',
 62: u'advise',
 63: u'aggravating',
 64: u'ago',
 65: u'al

### We want to learn which columns (words) tend to occur together, i.e., are likely to come from the same topic.  Each topic's probabilities over words form its word distribution.  We can also determine which topics make up each document: its topic distribution.
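
As a toy illustration of how the two distributions fit together, the sketch below mixes per-topic word distributions according to one document's topic distribution.  All topic names and probabilities are made up for the example; a fitted LDA model estimates these quantities from the corpus.

```python
# Hypothetical word distributions: P(word | topic)
word_given_topic = {
    "battery": {"charge": 0.5, "battery": 0.4, "sound": 0.1},
    "audio":   {"charge": 0.1, "battery": 0.1, "sound": 0.8},
}

# Hypothetical topic distribution for one document: P(topic | document)
topic_given_doc = {"battery": 0.75, "audio": 0.25}

# P(word | document) = sum over topics of P(word | topic) * P(topic | document)
words = ["charge", "battery", "sound"]
word_given_doc = {
    word: sum(word_given_topic[topic][word] * weight
              for topic, weight in topic_given_doc.items())
    for word in words
}
```

Because each input distribution sums to one, the resulting word probabilities for the document also sum to one.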

In [None]:
# First we convert our word-matrix into gensim's format

corpus = matutils.Sparse2Corpus(documents, documents_columns = False)

(Check https://radimrehurek.com/gensim/matutils as needed)
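
A `gensim` corpus is an iterable of documents, each represented as a list of `(word_id, count)` pairs.  The sketch below builds that structure by hand from a small dense matrix (values assumed) to show the shape of the data; it is not how `Sparse2Corpus` is implemented.

```python
# Toy count matrix: rows are documents, columns are word IDs (values assumed)
counts = [
    [1, 2, 0, 0],
    [0, 1, 0, 3],
]

# Each document becomes a list of (word_id, count) pairs, skipping zero counts
bow_corpus = [
    [(word_id, count) for word_id, count in enumerate(row) if count > 0]
    for row in counts
]

# bow_corpus -> [[(0, 1), (1, 2)], [(1, 1), (3, 3)]]
```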

In [None]:
corpus

(Check https://radimrehurek.com/gensim/models/ldamodel as needed)

In [None]:
# Then we fit an LDA model

model = models.ldamodel.LdaModel(corpus = corpus, num_topics = 25, id2word = id2word, passes = 10)

In this model, we need to explicitly specify the number of topics we want the model to uncover.  This is a critical parameter, but there isn't much guidance on how to choose it.  Try to use domain expertise where possible.

In [None]:
model

### Goodness of fit

Now we need to assess the goodness of fit of our model.  As with other unsupervised learning techniques, validation is mostly a matter of interpretation.

Use the following questions to guide you:
- Did we learn reasonable topics?
- Do the words that make up a topic make sense?
- Is this topic helpful towards our goal?

In [None]:
model.print_topics()

Some topics will be clearer than others.  The following topics represent clear concepts:
- Cooking and Recipes: 0.009 \* cup + 0.009 \* recipe + 0.007 \* make + 0.007 \* food + 0.006 \* sugar
- Cooking and Recipes: 0.013 \* butter + 0.010 \* baking + 0.010 \* dough + 0.009 \* cup + 0.009 \* sugar
- Fashion and Style: 0.013 \* fashion + 0.006 \* like + 0.006 \* dress + 0.005 \* style

## Word2Vec with `gensim`

In [None]:
# Set up the body text: split each review into a list of tokens
sentences = df.review.map(lambda review: review.split())

In [None]:
sentences

In [None]:
# Note: gensim 4.0+ renamed the `size` argument to `vector_size`
model = models.Word2Vec(sentences, size = 100, window = 5, min_count = 5, workers = 4)

`Word2Vec` has many arguments:
- `size` is the dimensionality of the word vectors, i.e., how many latent features represent each word
- `window` is the maximum distance, within a sentence, between the current word and the surrounding words used as its context
- `min_count` is the minimum number of times a word must appear in the corpus to be kept; rarer words are ignored
- `workers` is the number of worker threads (CPU cores) to use to speed up model training

(Check http://radimrehurek.com/gensim/models/word2vec as needed)
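
To make `min_count` and `window` more concrete, here is a simplified pure-Python sketch.  The helper names `prune_rare` and `context_pairs` and the toy sentences are assumptions for illustration; the sketch shows only the basic idea and leaves out the details of the actual training procedure.

```python
from collections import Counter

sentences = [
    ["great", "phone", "great", "battery"],
    ["battery", "died", "fast"],
]  # toy tokenized sentences (assumed)

def prune_rare(sentences, min_count):
    """Drop words that occur fewer than min_count times in the whole corpus."""
    freq = Counter(word for sentence in sentences for word in sentence)
    return [[w for w in s if freq[w] >= min_count] for s in sentences]

def context_pairs(sentence, window):
    """Yield (center, context) pairs at most `window` positions apart."""
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, sentence[j]

kept = prune_rare(sentences, min_count=2)
pairs = list(context_pairs(kept[0], window=1))

# kept  -> [['great', 'great', 'battery'], ['battery']]
# pairs -> [('great', 'great'), ('great', 'great'),
#           ('great', 'battery'), ('battery', 'great')]
```

With a small corpus, a high `min_count` can discard most of the vocabulary, which is worth keeping in mind when a queried word turns out to be missing from the model.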

In [None]:
model

### Most similar words

The model has a `most_similar` function that finds the words most similar to the one you queried, ranked by the cosine similarity of their vectors.  These tend to be words that are used in similar contexts.

In [None]:
# Note: in gensim 4.0+ this call is model.wv.most_similar(...)
model.most_similar(positive = ['great'])
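
The ranking behind such a query can be sketched with cosine similarity on a handful of made-up vectors.  The three-dimensional vectors and the local `most_similar` helper below are hypothetical and only illustrate the idea; they are not taken from the trained model.

```python
import math

# Hypothetical 3-dimensional word vectors (the model above uses 100 dimensions)
vectors = {
    "great":     [0.9, 0.1, 0.0],
    "excellent": [0.8, 0.2, 0.1],
    "broken":    [-0.7, 0.6, 0.2],
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = most_similar("great")
```

Here `excellent` ranks above `broken` because its vector points in nearly the same direction as the vector for `great`.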

In [None]:
vectorizer.get_feature_names()

In [None]:
sentences

In [None]:
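# Keep only the words in the vectorizer's vocabulary (drops stop words and tokens with punctuation attached)
vocabulary = set(vectorizer.get_feature_names())
sentences = [[word for word in sentence if word in vocabulary] for sentence in sentences]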

In [None]:
sentences

In [None]:
model = models.Word2Vec(sentences, size = 100, window = 5, min_count = 5, workers = 4)

In [None]:
model.most_similar(positive = ['great'])