## Latent Dirichlet Allocation

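+ Most commonly used in natural language processing, where it groups words that tend to co-occur into "topics"
+ Sometimes used as an end in itself, to discover and inspect the topics in a corpus
+ Sometimes used as a dimensionality (variable) reduction technique, with each document's topic weights serving as features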
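
As a rough illustration of using LDA to shrink the feature space, the sketch below fits it to a tiny made-up corpus and re-expresses each document as a short vector of topic proportions. The corpus, the choice of two topics, and the variable names are assumptions for illustration only. The parameter is called `n_components` in scikit-learn 0.19 and later; older releases call it `n_topics`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus: two documents about sports, two about cooking.
toy_corpus = [
    "the team won the game last night",
    "the players scored in the final game",
    "bake the bread in a hot oven",
    "stir the sauce and bake the pasta",
]

# Raw term counts: one row per document, one column per vocabulary word.
toy_counts = CountVectorizer(stop_words="english").fit_transform(toy_corpus)

# Two topics is an arbitrary choice for this tiny example.
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topic = toy_lda.fit_transform(toy_counts)

print(toy_counts.shape)            # (documents, vocabulary size)
print(toy_doc_topic.shape)         # (documents, topics)
print(toy_doc_topic.sum(axis=1))   # each row is a distribution over topics
```

The topic matrix has far fewer columns than the count matrix, which is what makes it usable as a compact input to a downstream model.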


### Simple Example of LDA in NLP

Adapted from: http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-topics-extraction-with-nmf-lda-py

+ Authors: 
    + Olivier Grisel <olivier.grisel@ensta.org>
    + Lars Buitinck
    + Chyi-Kwei Yau <chyikwei.yau@gmail.com>
+ License: BSD 3 clause

In [1]:
from __future__ import print_function  # __future__ imports must come first

import os
import numpy as np
import pandas as pd
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
os.chdir("/Users/adeniyiharrison/Desktop/General Assembly/DS-SF-32/dataset")

### This code defines a custom function that we'll use later

In [38]:
n_samples = 2000
n_features = 1000
n_topics = 10
n_top_words = 20


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
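
The slice inside `print_top_words` is compact, so here is a small sketch that unpacks it. The weight vector and vocabulary are made up for illustration; only the `argsort` slicing pattern comes from the function above.

```python
import numpy as np

# Hypothetical word weights for one topic, aligned with a toy vocabulary.
toy_weights = np.array([0.10, 0.70, 0.05, 0.40, 0.20])
toy_vocab = ["apple", "game", "the", "team", "score"]
n_top = 3

# argsort() returns indices ordered from smallest to largest weight.
ascending = toy_weights.argsort()

# [:-n_top - 1:-1] walks backwards from the end and stops after n_top items,
# which yields the indices of the n_top largest weights, largest first.
top_idx = ascending[:-n_top - 1:-1]
top_words = [toy_vocab[i] for i in top_idx]

print(top_words)  # ['game', 'team', 'score']
```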



### This code loads the dataset

In [39]:

# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.

print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]
print("done in %0.3fs." % (time() - t0))


Loading dataset...
done in 2.504s.


In [8]:
dataset["data"][11]

"I have a Roberto Clemente 1969 Topps baseball card for sale, in near-mint\ncondition (really as close to mint condition as you can get).  It lists for\n$55 in my most recent baseball card pricelist for May.  I am offering it for\n$50 and I'll pay the certified postage to ship it to you."

In [40]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
# max_df and min_df throttle the vocabulary: a term that appears in too many
# documents (above max_df) or in too few (below min_df) is dropped.
# A float in [0.0, 1.0] is read as a proportion of documents; an integer is an
# absolute document count.

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))



Extracting tf features for LDA...
done in 0.412s.
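
To make the effect of `max_df` and `min_df` concrete, here is a minimal sketch on four made-up documents. The documents and the thresholds (0.75 and 2) are assumptions chosen so that the arithmetic is easy to check by hand.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents.
toy_docs = ["the cat sat", "the dog sat", "the bird flew", "the cat flew"]

# max_df=0.75 is a proportion: drop terms found in more than 75% of documents.
# min_df=2 is a count: drop terms found in fewer than 2 documents.
toy_vec = CountVectorizer(max_df=0.75, min_df=2)
toy_vec.fit(toy_docs)

kept = sorted(toy_vec.vocabulary_)
print(kept)  # ['cat', 'flew', 'sat']
```

"the" appears in all four documents and is dropped by `max_df`; "dog" and "bird" appear in only one document each and are dropped by `min_df`.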


In [41]:
# n_samples = 2000
# n_features = 1000
# n_topics = 10
# n_top_words = 20

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
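# Note: scikit-learn 0.19+ names this parameter n_components; n_topics was
# deprecated there and removed in later releases.
lda = LatentDirichletAllocation(n_topics = n_topics, max_iter = 5,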
                                learning_method='online',
                                learning_offset = 50.,
                                random_state=0)




Fitting LDA models with tf features, n_samples=2000 and n_features=1000...


In [42]:
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

done in 4.601s.


In [36]:
pd.DataFrame(tf.toarray(), columns = tf_vectorizer.get_feature_names()).iloc[:,100:].head()

Unnamed: 0,applications,apply,appreciated,approach,appropriate,apr,april,archive,area,areas,...,worth,wouldn,write,written,wrong,xfree86,year,years,yes,young
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [43]:

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)


Topics in LDA model:
Topic #0:
edu com mail send graphics ftp pub available contact university list faq ca information cs 1993 program sun uk mit
Topic #1:
don like just know think ve way use right good going make sure ll point got need really time doesn
Topic #2:
christian think atheism faith pittsburgh new bible radio games alt lot just religion like book read play time subject believe
Topic #3:
drive disk windows thanks use card drives hard version pc software file using scsi help does new dos controller 16
Topic #4:
hiv health aids disease april medical care research 1993 light information study national service test led 10 page new drug
Topic #5:
god people does just good don jesus say israel way life know true fact time law want believe make think
Topic #6:
55 10 11 18 15 team game 19 period play 23 12 13 flyers 20 25 22 17 24 16
Topic #7:
car year just cars new engine like bike good oil insurance better tires 000 thing speed model brake driving performance
Topic #8:
people said

### In class assignment

+ Load the training set (done for you below)
+ Re-run LDA and use each document's topic weights as input features for a model
+ Predict the newsgroup categories using a multinomial (multi-class) classifier

In [44]:
print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'), 
                            subset="train")

data = dataset.data

y = dataset.target

print("done in %0.3fs." % (time() - t0))


Loading dataset...
done in 2.222s.


In [98]:
pd.DataFrame(y)[0].value_counts()

10    600
15    599
8     598
9     597
11    595
13    594
7     594
14    593
5     593
12    591
2     591
3     590
6     585
1     584
4     578
17    564
16    546
0     480
18    465
19    377
Name: 0, dtype: int64

In [46]:
data[2]

"Although I realize that principle is not one of your strongest\npoints, I would still like to know why do do not ask any question\nof this sort about the Arab countries.\n\n   If you want to continue this think tank charade of yours, your\nfixation on Israel must stop.  You might have to start asking the\nsame sort of questions of Arab countries as well.  You realize it\nwould not work, as the Arab countries' treatment of Jews over the\nlast several decades is so bad that your fixation on Israel would\nbegin to look like the biased attack that it is.\n\n   Everyone in this group recognizes that your stupid 'Center for\nPolicy Research' is nothing more than a fancy name for some bigot\nwho hates Israel."

In [68]:
# Use tf (raw term count) features for LDA.
countVect = CountVectorizer(stop_words = "english", 
                            max_df = .90, 
                            min_df = 10)

X = countVect.fit_transform(data)

In [69]:
X.shape

(11314, 10441)

In [82]:
pd.DataFrame(
    np.sum(pd.DataFrame(X.A , columns = [countVect.get_feature_names()]).iloc[:,800:])
).sort_values(0, ascending = False).head(10)

Unnamed: 0,0
ax,62387
max,4585
people,4103
like,3964
don,3885
just,3752
know,3487
use,3179
think,3011
time,2968


In [83]:
lda = LatentDirichletAllocation(n_topics = n_topics, max_iter = 5,
                                learning_method='online',
                                learning_offset = 50.,
                                random_state=0)

lda.fit(X)


print("\nTopics in LDA model:")
tf_feature_names = countVect.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)


Topics in LDA model:
Topic #0:
year game don just think team good like time know ll games got going didn did season better play players
Topic #1:
car bike cars water engine air dod miles road oil vehicle ride radar auto gas speed ground high hot riding
Topic #2:
key government public president encryption use security chip keys clipper information technology law privacy national private administration new number data
Topic #3:
edu com mail ftp email send cs graphics file available list dos version thanks pub address files pc windows ca
Topic #4:
people said armenian israel armenians war turkish jews years killed children israeli russian government did time women turkey food went
Topic #5:
10 00 15 25 20 11 12 14 16 17 13 18 30 24 19 50 21 23 22 27
Topic #6:
people god don think just say does know believe like jesus time make way right did good question point things
Topic #7:
like use know just don does problem drive good time used work ve need new want thanks bit using make
Topic #8:
a

In [84]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

In [97]:
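from sklearn.model_selection import cross_val_score

# Replace the term counts with each document's topic distribution (one column
# per topic). This overwrites X, so run this cell only once.
X = lda.transform(X)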

knn = KNeighborsClassifier(n_neighbors = 10, weights = "distance")
print("Accuracy: ", cross_val_score(knn, X, y, cv = 10).mean())

Accuracy:  0.331789749371


In [105]:
X

array([[ 0.0018873 ,  0.0018869 ,  0.00188689, ...,  0.00188711,
         0.00188679,  0.00188689],
       [ 0.25293131,  0.02497489,  0.00212803, ...,  0.09812988,
         0.00212766,  0.00212806],
       [ 0.00217472,  0.00217439,  0.00217423, ...,  0.00217452,
         0.00217391,  0.00217424],
       ..., 
       [ 0.89999326,  0.01111195,  0.01111132, ...,  0.01111266,
         0.01111111,  0.01111134],
       [ 0.00555642,  0.00555589,  0.00555628, ...,  0.76227748,
         0.00555606,  0.00555625],
       [ 0.68316872,  0.00105296,  0.00105351, ...,  0.24421006,
         0.00105263,  0.00105312]])

In [100]:
from sklearn.model_selection import GridSearchCV, KFold


gs = GridSearchCV(estimator = RandomForestClassifier(),
            param_grid = {"n_estimators" : np.arange(10,21,1).tolist()},
            cv = KFold(n_splits = 5))

gs.fit(X,y)
algo = gs.best_estimator_

In [103]:
algo

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=20, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [101]:
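# Note: this is accuracy on the training data, so it overstates how well the
# model generalizes; gs.best_score_ holds the cross-validated estimate.
algo.score(X,y)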

0.96906487537564079
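
Because that score is measured on the same documents the forest was trained on, it is likely optimistic. One way to get a fairer number is to hold out some documents and wrap the vectorizer, LDA, and classifier in a single pipeline so that nothing is fitted on the held-out part. The sketch below shows the pattern on a tiny made-up corpus; the sentences, labels, split size, and the `toy_` names are all assumptions, and the accuracies it prints say nothing about the newsgroups data.

```python
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier

# Hypothetical two-class corpus: label 0 = sports, label 1 = food.
toy_texts = [
    "the team won the game", "players scored a late goal",
    "the coach praised the team", "fans cheered at the game",
    "the season ends with a final match", "the goalie saved the shot",
    "bake the bread in the oven", "stir the sauce for the pasta",
    "the soup needs more salt", "grill the chicken with garlic",
    "the cake has chocolate frosting", "chop onions for the salad",
]
toy_labels = [0] * 6 + [1] * 6

# Hold out a third of the documents; stratify keeps both classes in each part.
toy_train, toy_test, y_toy_train, y_toy_test = train_test_split(
    toy_texts, toy_labels, test_size=0.33, stratify=toy_labels, random_state=0)

# Every step is fitted on the training documents only.
toy_pipe = make_pipeline(
    CountVectorizer(stop_words="english"),
    LatentDirichletAllocation(n_components=2, random_state=0),
    RandomForestClassifier(n_estimators=20, random_state=0),
)
toy_pipe.fit(toy_train, y_toy_train)

train_acc = toy_pipe.score(toy_train, y_toy_train)
test_acc = toy_pipe.score(toy_test, y_toy_test)
print("train accuracy:", train_acc)
print("held-out accuracy:", test_acc)
```

On the real data, the same idea applies by fitting on `subset="train"` and scoring on `subset="test"` from `fetch_20newsgroups`.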

In [104]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y, algo.predict(X))

array([[466,   0,   0,   0,   0,   0,   0,  12,   0,   0,   0,   0,   0,
          0,   0,   1,   0,   0,   0,   1],
       [  0, 565,   1,   0,   0,   1,   0,  16,   0,   0,   0,   0,   0,
          1,   0,   0,   0,   0,   0,   0],
       [  0,   2, 561,   0,   0,   1,   0,  27,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   1, 572,   1,   0,   1,  13,   0,   0,   0,   0,   1,
          0,   0,   0,   0,   0,   1,   0],
       [  0,   0,   0,   1, 555,   0,   0,  22,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0, 590,   0,   3,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0],
       [  0,   1,   0,   0,   2,   0, 572,   8,   0,   0,   0,   0,   2,
          0,   0,   0,   0,   0,   0,   0],
       [  0,   1,   0,   0,   0,   0,   0, 592,   0,   0,   0,   0,   1,
          0,   0,   0,   0,   0,   0,   0],
       [  0,   1,   0,   0,   1,   0,   0,  15, 581,   0,   0,  

In [92]:
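# LatentDirichletAllocation has no predict() method, so keep the fitted
# classifier in `algo` and use the already-fitted `lda` only to turn term
# counts into topic features.
# Note: `data` is still the training split; load subset="test" for a true
# held-out score.
X_test = countVect.transform(data)
X_test = lda.transform(X_test)

pred = algo.predict(X_test)
metrics.f1_score(dataset.target, pred, average='macro')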

### In class assignment:

+ I'll divide you into 3 segments
+ Each segment generates 100 sentences on the *same topic*
+ Save as a JSON and send to me
+ We'll run them through LDA

In [None]:
CLASSLIST = []

In [127]:
s1 = ["My favorite type of food is tacos, but it used to be fried chicken.", "My favorite type of taco is al pastor.", "My favorite mexican resturant is El Rancho.",
            "I also like all the resturants in my immediate neighborhood.", "Corn dogs are quite nice as well however my friends make fun of me",
            "Yo dog I love dog, but not like the food", "Sup hot stuff, you like hot food or cold food", "I raise chickens in my farm that I dont eat",
            "Sometimes I go fishing with my father", "Working at the food bank is very fulfilling to me", 
            "If I eat too much Im not going to feel like drinking", "The breweries in San Diego are plentiful", 
            "The food in San Deigo are not as good as the stuff in SF", "LA has really solid mexican food which I love",
            "Im pretty hungry right now, where should we grab lunch?", "Is dinner going to be taken care of at the Reynold's house?",
            "If I pay for breakfast will you cover lunch or dinner babe?"]

s2 = ["Fast Food is not good for health", "Indian food is spicy", "I like Thai food", "There are two new restaurants opened around the block", 
      "Can I get this sweet dish?", "McBurger has 3000 calories", "Nuts are good for health", "Vegetables are bad", 
      "Cheese cake is good", "He ate all the fried food", "In istanbul, a burger cost $30", "The new hotel chain offers free buffet for 2 days",
      "Can I get a diet coke?", "How to cook fiesta salsa?", "These fries are tasty but bad for health", 
      "This chinese restaurant serves the best soup", "Please order a pizza for me?", "Dinner is ready", 
      "Doing breakfast is good for health", "Please dont throw extra food, donate it to someone hungry"]

s3 = ["chocolate chip cookies and best fresh from the oven.", "pumpkin pie is a good dessert for the fall season", "vegtables are an important part of any diet", "fruit is a healthy way to suffice your sweet tooth", "eggs are a filling way eat breakfast", "soda is a necessary evil.", "philz coffee is a great way to start your morning", "after making a big dinner with several courses, at least there are leftovers.", "turnkey is a great type of meat", "hot sauce makes everything better.", "hot dogs and garlic fries are best when watching a giants baseball game.", "I like ketchup more than mustard", "I wish a had a few more cook books.", "The worst part of cooking is cleaning the pots and pans afterwards.", "I had cereal with a banana every morning before school as a kid.", "Avocado is my favorite type of vegtable.", "I try to avoid fast food restaurants as much as possible.", "shrimp scampi is one of my all time favorite dishes.", "cooking is something I hope to do more of later in life.", "salmon is a great type of food"]

s4 = ["You should eat well, but not like Charles Barkley well.",
"There are like 17 cooking shows. All of them seem to be related to Top Chef.",
"Guy Fearri is not a chef so much as the lead from Smashmouth pretending to be a chef.",
"Salt is not a food. But it goes well on food.",
"Vegetarians who still eat fish are not vegetarians. They are just against eating things that have eyes.",
"Vegans are basically food Taliban. Do not make me feel bad because I have good things in my life.",
"They say cows shitting causes global warming. That means we should eat less cows. Maybe more veal though. What is the shit to meat produced ratio where we can still enjoy meat, but not destroy the only planet we have.",
"My mother said pre-heat the oven. Instead I turned on the microwave.",
"Turkey is the worst of the bird dishes.",
"Dog is a food someplaces.",
"To make rice, you just get rice, and then add water.",
"Food Trucks are not made of food.",
"Instagram is mostly a forum for posting food photos. ALso for Smirnoff ICe ads.",
"Pasta is a delicacy.",
"I refused to believe that gushers are a food.",
"If you travel exclusively for local dishes, you have too much money.",
"Happiness: a good bank account, a good cook, and a good digestion.",
"Food Porn and Porn Food are not the same thing, and you should google only one.",
"France thinks it has the best food in Europe, but really Italy does. In Asia, Thailand is to France, as Vietnam is to Italy. I will not negotiate on this."]

s5 = ["My favorite food is a delicious cheeseburger.", "Common toppings on cheeseburgers include mayo, ketchup, pickles, grilled onion, lettuce, and aged cheddar.", "If I had to choose my favorite cuisine, it would be Italian.", "I love Philly Cheesesteak Sandwiches, gotta have those grilled onions, White American cheese, hot and sweet peppers on a hoagie.", "Despite all this, I also try to eat vegetarian a few days a week.", "Need a cheap week night meal?", "How about a baked potato with all the fixins'!", "We're talking about sour cream, green onion, salt, pepper, cheese, and bacon.", "Cooking can really help one save money, going out to eat gets expensive and adds up quickly over time.", "The easiest meals to cook for me are breakfast.", "Been a breakfast lover since day one, pancakes, bacon, sausage, eggs any way ya like em', the whole nine yards.", "Breakfast is not the most important meal of the day, it was a marketing scam from the early 20th century.", "Talk about a conspiracy, Kellog's is at the bottom of this one.", "Right before class I ate at the Halal Guys.", "They serve up delicious combo platter featuring gyro style beef, chicken, falafel, and a variety of sauces.", "Watch out for the hot sauce, my god!", "It is one of the hottest things I've ever eaten, and they put it in ketchup bottles.", "Many a drunken fool has accidentally lit his mouth on fire with that sauce.", "Another great breakfast...a whiskey ginger.", "Only drink those on the weekends though.", "The people who cook garlic fries are doing God's work.",]

In [129]:
sentences

['My favorite type of food is tacos, but it used to be fried chicken.',
 'My favorite type of taco is al pastor.',
 'My favorite mexican resturant is El Rancho.',
 'I also like all the resturants in my immediate neighborhood.',
 'Corn dogs are quite nice as well however my friends make fun of me',
 'Yo dog I love dog, but not like the food',
 'Sup hot stuff, you like hot food or cold food',
 'I raise chickens in my farm that I dont eat',
 'Sometimes I go fishing with my father',
 'Working at the food bank is very fulfilling to me',
 'If I eat too much Im not going to feel like drinking',
 'The breweries in San Diego are plentiful',
 'The food in San Deigo are not as good as the stuff in SF',
 'LA has really solid mexican food which I love',
 'Im pretty hungry right now, where should we grab lunch?',
 "Is dinner going to be taken care of at the Reynold's house?",
 'If I pay for breakfast will you cover lunch or dinner babe?',
 'Fast Food is not good for health',
 'Indian food is spicy',

In [132]:
sentences = s1+s2+s3+s4+s5

sentences = {"Food": sentences}

In [133]:
os.chdir("/Users/adeniyiharrison/Desktop")
import json
with open("Food Sentences.json", "w") as x:
    json.dump(sentences, x)
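
Once every segment has sent in a file, the sentences could be pooled and run through LDA roughly as sketched below. This assumes each JSON file maps one topic label to a list of sentences, as in the "Food" file above; the other two labels, the example sentences, and the temporary directory are hypothetical stand-ins for the real submissions.

```python
import json
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical stand-ins for the three files the segments send in.
segments = {
    "Food": ["I like spicy tacos", "Pizza is a great dinner"],
    "Sports": ["The team won the game", "The players scored twice"],
    "Travel": ["The flight to Rome was long", "We booked a hotel by the beach"],
}

# Write one JSON file per segment into a temporary directory.
tmp_dir = tempfile.mkdtemp()
paths = []
for label, sents in segments.items():
    path = os.path.join(tmp_dir, label + " Sentences.json")
    with open(path, "w") as f:
        json.dump({label: sents}, f)
    paths.append(path)

# Read the files back and pool every sentence, ignoring the labels.
pooled = []
for path in paths:
    with open(path) as f:
        for sents in json.load(f).values():
            pooled.extend(sents)

# Fit LDA with one topic per segment and inspect the document-topic weights.
class_counts = CountVectorizer(stop_words="english").fit_transform(pooled)
class_lda = LatentDirichletAllocation(n_components=len(paths), random_state=0)
class_doc_topic = class_lda.fit_transform(class_counts)
print(class_doc_topic.round(2))
```

With the real submissions, the interesting question is whether the recovered topics line up with the three segments, which can be checked by comparing each sentence's dominant topic with the label of the file it came from.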

In [134]:
sentences

{'Food': ['My favorite type of food is tacos, but it used to be fried chicken.',
  'My favorite type of taco is al pastor.',
  'My favorite mexican resturant is El Rancho.',
  'I also like all the resturants in my immediate neighborhood.',
  'Corn dogs are quite nice as well however my friends make fun of me',
  'Yo dog I love dog, but not like the food',
  'Sup hot stuff, you like hot food or cold food',
  'I raise chickens in my farm that I dont eat',
  'Sometimes I go fishing with my father',
  'Working at the food bank is very fulfilling to me',
  'If I eat too much Im not going to feel like drinking',
  'The breweries in San Diego are plentiful',
  'The food in San Deigo are not as good as the stuff in SF',
  'LA has really solid mexican food which I love',
  'Im pretty hungry right now, where should we grab lunch?',
  "Is dinner going to be taken care of at the Reynold's house?",
  'If I pay for breakfast will you cover lunch or dinner babe?',
  'Fast Food is not good for health'