## Latent Dirichlet Allocation

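+ Most commonly used in natural language processing, where each document is modeled as a mixture of topics
+ Sometimes an end in itself: the discovered topics are the result
+ Sometimes a variable reduction technique: per-document topic proportions replace raw word counts as features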


### Simple Example of LDA in NLP

Stolen from: http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-topics-extraction-with-nmf-lda-py

+ Authors: 
    + Olivier Grisel <olivier.grisel@ensta.org>
    + Lars Buitinck
    + Chyi-Kwei Yau <chyikwei.yau@gmail.com>
+ License: BSD 3 clause

In [1]:
from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups


### This code sets the parameters and defines a helper function that we'll use later

In [2]:
n_samples = 2000
n_features = 1000
n_topics = 10
n_top_words = 20


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
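
The slice `topic.argsort()[:-n_top_words - 1:-1]` in the helper above is easy to misread. Here is a minimal sketch of what it does, using a pure-Python stand-in for `argsort`; the vocabulary and weights are made up for illustration and are not from the newsgroups data.

```python
def top_indices(weights, n_top):
    # Stand-in for numpy's argsort: indices ordered by ascending weight.
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Same slice as in print_top_words: walk backwards from the end and
    # stop after n_top items, so the largest weights come first.
    return ascending[:-n_top - 1:-1]


# Hypothetical vocabulary and topic weights, for illustration only.
vocab = ["car", "god", "drive", "game", "space"]
topic_weights = [0.1, 2.5, 0.3, 1.7, 0.9]

top = top_indices(topic_weights, 3)
top_words = [vocab[i] for i in top]
print(" ".join(top_words))
```

Each row of `model.components_` plays the role of `topic_weights` here, with one weight per vocabulary term.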



### This code loads the dataset

In [3]:

# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.

print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]
print("done in %0.3fs." % (time() - t0))


Loading dataset...


No handlers could be found for logger "sklearn.datasets.twenty_newsgroups"


done in 35.758s.


In [6]:
dataset['data'][:1]

[u"\n  I was wondering if anyone can shed any light on just how it is that these\nelectronic odometers remember the total elapsed mileage?  What kind of\nmemory is stable/reliable enough, non-volatile enough and independent enough\n(of outside battery power) to last say, 10 years or more, in the life of a\nvehicle?  I'm amazed that anything like this could be expected to work for\nthis length of time (especially in light of all the gizmos I work with that\nare doing good to work for 2 months without breaking down somehow).\n\nSide question:  how about the legal ramifications of selling a used car with\na replaced odometer that starts over at 0 miles, after say 100/200/300K\nactual miles.  Looks like fraud would be fairly easy - for the price of a\nnew odometer, you can say it has however many miles you want to tell the\nbuyer it has.\n\nThanks for any insight.\n"]

### Use tf (raw term count) features for LDA: turn each document into a vector of word counts

In [13]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))



Extracting tf features for LDA...
done in 0.648s.
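
To make the `min_df` and `max_df` filters concrete, here is a rough pure-Python sketch of the counting step. It is a simplification under stated assumptions, not the library's implementation: `count_features` is a hypothetical helper, it skips stop-word removal and `max_features`, and the four toy documents are invented.

```python
import re
from collections import Counter


def count_features(docs, min_df=2, max_df=0.95):
    # Lowercase and keep tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    # Keep terms seen in at least min_df documents and in no more than
    # a max_df fraction of all documents.
    vocab = sorted(t for t, c in df.items()
                   if c >= min_df and c <= max_df * n_docs)
    index = {term: j for j, term in enumerate(vocab)}
    rows = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for term in tokens:
            if term in index:
                row[index[term]] += 1
        rows.append(row)
    return vocab, rows


# Toy corpus, for illustration only.
docs = ["the car needs new oil",
        "the car has a new engine",
        "the engine oil is old",
        "the oil"]
vocab, counts = count_features(docs)
print(vocab)
for row in counts:
    print(row)
```

In this toy corpus "the" appears in every document and is dropped by `max_df`, while words that appear in only one document are dropped by `min_df`.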


In [14]:

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
# n_topics was renamed n_components in scikit-learn 0.19; use that name on newer releases.
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=10,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)




Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
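
A rough sketch of what `learning_offset` controls, assuming the online update weights mini-batch `t` by `(learning_offset + t) ** (-learning_decay)`. That formula is my reading of the online variational Bayes scheme rather than something shown in this notebook, and `step_size` is a hypothetical helper. The 0.7 is the library's default `learning_decay`.

```python
def step_size(t, learning_offset=50.0, learning_decay=0.7):
    # Assumed schedule: weight given to the t-th mini-batch update.
    return (learning_offset + t) ** (-learning_decay)


weights = [step_size(t) for t in (0, 1, 10, 100)]
for t, w in zip((0, 1, 10, 100), weights):
    print("batch %d: weight %.4f" % (t, w))
```

Under this assumption, a larger `learning_offset` makes the early mini-batches count for less, and later updates always carry less weight than earlier ones.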


In [15]:
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

done in 5.837s.


In [64]:
X_example = lda.transform(tf)

In [65]:
X_example[0]

array([ 0.55592586,  0.00344938,  0.00344864,  0.00344876,  0.0034486 ,
        0.18650337,  0.00344884,  0.00344828,  0.23342976,  0.00344851])
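
One way to read the row above: it holds the document's proportion of each of the 10 topics, so it sums to roughly 1, and the largest entries show which topics dominate. A small sketch using the printed values (the variable names are mine):

```python
# Topic proportions for the first document, copied from the output above.
doc_topics = [0.55592586, 0.00344938, 0.00344864, 0.00344876, 0.0034486,
              0.18650337, 0.00344884, 0.00344828, 0.23342976, 0.00344851]

total = sum(doc_topics)
# Topic indices ordered from the largest proportion to the smallest.
ranked = sorted(range(len(doc_topics)), key=lambda k: doc_topics[k],
                reverse=True)
dominant = ranked[0]

print("sum of proportions: %.6f" % total)
print("dominant topic: #%d" % dominant)
print("three largest topics:", ranked[:3])
```

This is the variable reduction at work: a document over a 1000-term vocabulary is summarized by 10 numbers.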

In [66]:
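# print_topics is gensim's API; sklearn's LatentDirichletAllocation has no such
# method, so the print_top_words helper defined earlier is used instead.
#print(lda.print_topics(num_topics=10, num_words=3))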

In [19]:

print("\nTopics in LDA model:")
# On scikit-learn >= 1.0, use get_feature_names_out() instead.
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)


Topics in LDA model:
Topic #0:
edu com mail send graphics ftp pub available contact university list version ca faq cs program information machines type sun
Topic #1:
don like just know think ve way good right use going make people ll sure really point doesn got time
Topic #2:
think christian atheism book pittsburgh faith new just radio like lot read play subject games alt time believe game president
Topic #3:
drive windows disk thanks use card drives hard using software file scsi pc does help problem controller 16 new dos
Topic #4:
hiv health aids april disease care medical research information 1993 national light new test study service said led children 10
Topic #5:
god people does just jesus good say don law life israel way fact believe think know time bible make true
Topic #6:
55 10 11 game 15 18 team 12 19 period 20 23 13 25 play 17 22 flyers 16 24
Topic #7:
car year new cars bike just good engine like oil insurance better 000 tires speed price model high used driving
Topic #8:
pe

### In-class assignment

+ Load the training set (done for you below)
+ Re-run LDA and use the per-document topic proportions as input features for a model
+ Predict the newsgroup categories using a multiclass classifier

In [20]:
print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'),
                             subset="train")

data = dataset.data

y = dataset.target

print("done in %0.3fs." % (time() - t0))


Loading dataset...
done in 2.432s.


In [38]:
import numpy as np

In [39]:
np.unique(y)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [43]:
data[0]

u"Well i'm not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n"

In [29]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=1000,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data)
print("done in %0.3fs." % (time() - t0))


Extracting tf features for LDA...
done in 3.275s.


In [35]:
tf

<11314x1000 sparse matrix of type '<type 'numpy.int64'>'
	with 286231 stored elements in Compressed Sparse Row format>

In [30]:
lda.fit(tf)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_jobs=1, n_topics=10, perp_tol=0.1, random_state=0,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [40]:
X = lda.transform(tf)

In [42]:
X[0]

array([ 0.55592586,  0.00344938,  0.00344864,  0.00344876,  0.0034486 ,
        0.18650337,  0.00344884,  0.00344828,  0.23342976,  0.00344851])

In [44]:
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, 5)


Topics in LDA model:
Topic #0:
people gun armenian armenians war
Topic #1:
government people law mr president
Topic #2:
space program output entry data
Topic #3:
key car chip used use
Topic #4:
edu file com available mail
Topic #5:
god people does jesus think
Topic #6:
windows drive thanks use does
Topic #7:
ax max b8f g9v a86
Topic #8:
just don like know think
Topic #9:
10 00 25 15 12



In [32]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

In [57]:
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=5)

In [58]:
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=5, n_neighbors=5, p=2,
           weights='uniform')
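
The settings printed above (`metric='minkowski'`, `p=2`, `weights='uniform'`) amount to Euclidean distance with an unweighted majority vote. Here is a minimal pure-Python sketch of that idea on made-up 3-topic vectors; `knn_predict` is a hypothetical helper, not the library's implementation.

```python
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    # Squared Euclidean distance from x to every training row.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    # Unweighted majority vote among the k nearest neighbours.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical topic-proportion vectors and class labels.
train_X = [[0.90, 0.05, 0.05],
           [0.80, 0.10, 0.10],
           [0.10, 0.85, 0.05],
           [0.05, 0.90, 0.05],
           [0.10, 0.10, 0.80]]
train_y = [0, 0, 1, 1, 2]

predicted = knn_predict(train_X, train_y, [0.7, 0.2, 0.1], k=3)
print(predicted)
```

Documents with similar topic mixtures sit close together in this space, which is why the topic proportions can serve as classifier inputs.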

In [59]:
knn.predict(X)

array([17,  0, 16, ...,  9,  4, 12])

In [60]:
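# Note: `dataset` is still the training subset loaded above, so despite the
# name X_test this is an in-sample check. For a held-out score, load
# fetch_20newsgroups(subset="test", ...) and transform that instead.
X_test = tf_vectorizer.transform(dataset.data)
X_test = lda.transform(X_test)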



In [61]:
pred = knn.predict(X_test)
metrics.f1_score(dataset.target, pred, average='macro')

0.45437747714203358
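
For reference, `average='macro'` computes an F1 score per class and then takes the unweighted mean, so each of the 20 newsgroups counts equally regardless of size. A hand-rolled sketch on made-up labels (`macro_f1` is a hypothetical helper):

```python
def macro_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Per-class F1; defined as 0 when the class never appears.
        scores.append(2.0 * tp / denom if denom else 0.0)
    # Unweighted mean over classes.
    return sum(scores) / len(scores)


# Toy labels, for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)
print("macro F1: %.4f" % score)
```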

### In-class assignment

+ I'll divide you into 3 segments
+ Each segment generates 100 sentences on the *same topic*
+ Save them as a JSON file and send it to me
+ We'll run them through LDA
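
A minimal sketch of the save step, assuming the sentences are kept as a plain list of strings. The file name `segment_sentences.json` is a placeholder, and the temporary directory is only there so the example cleans up after itself.

```python
import json
import os
import tempfile

# Two example sentences; a real submission would hold 100.
sentences = ["hot sauce makes everything better.",
             "I like ketchup more than mustard"]

# Placeholder location and file name.
path = os.path.join(tempfile.mkdtemp(), "segment_sentences.json")

with open(path, "w") as f:
    json.dump(sentences, f, indent=2)

# Reading the file back yields the same list.
with open(path) as f:
    loaded = json.load(f)

print(len(loaded))
```

A flat list like this can be passed directly to `CountVectorizer.fit_transform`, the same way `data_samples` was earlier.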

In [None]:
CLASSLIST = []

In [78]:

carter_list = ["chocolate chip cookies are best fresh from the oven.", "pumpkin pie is a good dessert for the fall season", "vegetables are an important part of any diet", "fruit is a healthy way to satisfy your sweet tooth", "eggs are a filling way to eat breakfast", "soda is a necessary evil.", "philz coffee is a great way to start your morning", "after making a big dinner with several courses, at least there are leftovers.", "turkey is a great type of meat", "hot sauce makes everything better.", "hot dogs and garlic fries are best when watching a giants baseball game.", "I like ketchup more than mustard", "I wish I had a few more cook books.", "The worst part of cooking is cleaning the pots and pans afterwards.", "I had cereal with a banana every morning before school as a kid.", "Avocado is my favorite type of vegetable.", "I try to avoid fast food restaurants as much as possible.", "shrimp scampi is one of my all time favorite dishes.", "cooking is something I hope to do more of later in life.", "salmon is a great type of food"]

len(carter_list)


20