# Text mining intro

---

You are currently looking at **version 1.0** of this notebook.

---

## 3. Topic modeling
The goal of topic modeling is to identify the major concepts underlying a piece of text.  
Topic modeling uses unsupervised learning, so no a priori knowledge is necessary.  
Prior knowledge is still helpful for cleaning up the results!

---
## Setup notebook
---

### Import the generic libraries used in this notebook

In [None]:
%matplotlib inline

import string
import numpy as np
import pandas as pd
import requests
import json
import re
from collections import OrderedDict, Counter
import pprint

import matplotlib
import matplotlib.pyplot as plt
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14)

### Manage warnings

In [None]:
import warnings
warnings.filterwarnings("ignore")

### Set defaults and constants

In [None]:
# Set pandas defaults
pd.set_option('display.max_rows', 10)                        # Show max 10 rows: head(5) ... tail(5)
pd.set_option('display.float_format', lambda x: '%.3f' % x)  # Set precision of DataFrames/Series

### Check current working directory and file structure

In [None]:
!pwd
# !ls

In [None]:
import nltk
# nltk.download()  # download datasets (corpora)
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords, inaugural, PlaintextCorpusReader
from nltk.probability import FreqDist
from nltk.stem.porter import PorterStemmer
from wordcloud import WordCloud, STOPWORDS

### LDA: Latent Dirichlet Allocation Model
 - model each document as a mixture of topics, and each topic as a probability distribution over words
 - compute conditional probabilities for topic word-sets
 - identify the most likely topics over multiple passes, probabilistically picking topics in each pass
 - intuitive explanation: http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
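
The probabilistic picture behind these steps can be sketched in plain Python. The snippet below is not how gensim fits a model. It only illustrates the generative assumption LDA makes: each document draws a mixture over topics, then draws every word from one of those topics. The two topics, their word probabilities, and the helper names are made up for illustration.

```python
import random

# Hypothetical topics: each one is a probability distribution over words
TOPICS = {
    "optics":  {"lens": 0.5, "zoom": 0.3, "focus": 0.2},
    "battery": {"battery": 0.6, "charger": 0.3, "power": 0.1},
}

def sample_dirichlet(alpha, k, rng):
    """Draw a k-dimensional symmetric Dirichlet sample via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5, seed=0):
    """Generate one synthetic document following LDA's generative story."""
    rng = random.Random(seed)
    names = list(TOPICS)
    # 1. Draw this document's topic mixture
    mixture = sample_dirichlet(alpha, len(names), rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word slot according to the mixture
        topic = rng.choices(names, weights=mixture)[0]
        # 3. Pick a word from that topic's word distribution
        vocab = TOPICS[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return dict(zip(names, mixture)), words

mixture, words = generate_document(8)
print(mixture)
print(words)
```

Fitting an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions that most plausibly produced them.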

In [None]:
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from gensim.parsing.preprocessing import STOPWORDS

#### Prepare the text
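
The next cell calls `strip_text`, which is not defined anywhere in this notebook. A minimal stand-in is sketched here, assuming the helper only needs to turn line breaks into spaces and collapse repeated whitespace so that `sent_tokenize` sees continuous prose. The original helper may do more.

```python
import re

def strip_text(raw):
    """Hypothetical helper: replace line breaks and whitespace runs with single spaces."""
    return re.sub(r"\s+", " ", raw).strip()

print(strip_text("Great camera.\n\nBattery life\nis short."))
# Great camera. Battery life is short.
```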

In [None]:
striptext = strip_text(PlaintextCorpusReader("data/", "Nikon_coolpix_4300.txt").raw())
sentences = sent_tokenize(striptext)

#words = word_tokenize(striptext)
#tokenize each sentence into word tokens
texts = [[word for word in sentence.lower().split()
        if word not in STOPWORDS and word.isalnum()]
        for sentence in sentences]
len(texts)

#### Create a (word, frequency) dictionary for each word in the text

In [None]:
print(texts)

In [None]:
texts

In [None]:
dictionary = corpora.Dictionary(texts)                # maps each unique token to an integer id
corpus = [dictionary.doc2bow(text) for text in texts] # (word_id, count) pairs, one list per sentence

In [None]:
dictionary.token2id.items();

In [None]:
dictionary.keys();

In [None]:
corpus[5]

In [None]:
texts[5]

In [None]:
dictionary[73], dictionary[4]
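
To make the `(word_id, count)` representation concrete, here is a plain-Python sketch of the same idea. It is not gensim's implementation, and the ids gensim assigns may differ. The helper names and the toy sentences are made up for illustration.

```python
from collections import Counter

def build_token2id(texts):
    """Assign an integer id to each unique token, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens missing from the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_texts = [["camera", "lens", "camera"], ["battery", "lens"]]
toy_token2id = build_token2id(toy_texts)
print(toy_token2id)                              # {'camera': 0, 'lens': 1, 'battery': 2}
print(to_bow(toy_texts[0], toy_token2id))        # [(0, 2), (1, 1)]
print(to_bow(["lens", "tripod"], toy_token2id))  # [(1, 1)]
```

The last line shows why unseen words in a new document contribute nothing: a token without an id is simply dropped.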

### LDA analysis
Parameters:  
 - Number of topics: the number of topics to generate. Larger documents generally warrant more topics
 - Passes: the number of passes the LDA model makes through the corpus. More passes slow the analysis

In [None]:
# Set parameters
num_topics = 5
passes = 10 

In [None]:
lda = LdaModel(corpus, id2word=dictionary, num_topics=num_topics, passes=passes)

#### See results

In [None]:
pp = pprint.PrettyPrinter(indent=2)
pp.pprint(lda.print_topics(num_words=3))

### Matching topics to documents
- sort topics by probability
- using sentences as documents here, so this is less than ideal

In [None]:
from operator import itemgetter
topics = lda.get_document_topics(corpus[0], minimum_probability=0.05, per_word_topics=False)
sorted(topics, key=itemgetter(1), reverse=True)
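
The filter-then-sort step can be tried without a trained model. The sketch below uses made-up `(topic_id, probability)` pairs in the shape that `get_document_topics` returns. `rank_topics` is a hypothetical helper, not part of gensim.

```python
from operator import itemgetter

def rank_topics(topic_probs, minimum_probability=0.05):
    """Drop topics below the threshold and sort the rest by descending probability."""
    kept = [(tid, p) for tid, p in topic_probs if p >= minimum_probability]
    return sorted(kept, key=itemgetter(1), reverse=True)

# Made-up topic probabilities for a single document
example = [(0, 0.02), (1, 0.61), (2, 0.30), (3, 0.07)]
ranked = rank_topics(example)
print(ranked)        # [(1, 0.61), (2, 0.3), (3, 0.07)]
print(ranked[0][0])  # 1, the dominant topic id
```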

### Making sense of the topics
 - draw wordclouds

In [None]:
def draw_wordcloud(lda, topic_id, min_size=0, stopwords=()):
    """Draw a word cloud of the top 50 words of one LDA topic, sized by probability."""
    topic_words = lda.show_topic(topic_id, topn=50)

    # Keep words that are long enough and not explicitly excluded
    freqs = {word: prob for word, prob in topic_words
             if len(word) >= min_size and word not in stopwords}

    # Size words directly from their probabilities instead of repeating them in a text blob
    wordcloud = WordCloud(background_color='white', width=3000, height=3000)
    wordcloud.generate_from_frequencies(freqs)
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()

In [None]:
draw_wordcloud(lda, 2)

### Let's look at Presidential addresses to see what sorts of topics emerge from there
 - Each document will be analyzed for topics
 - The corpus will consist of 58 documents, one per presidential address

In [None]:
REMOVE_WORDS = {'shall','generally','spirit','country','people','nation','nations','great','better'}

# Build one token list per inaugural address
text_list = []
for fileid in inaugural.fileids():
    doc = []
    for word in inaugural.words(fileid):
        word = word.lower()  # match the lowercase stop word lists
        if word in STOPWORDS or word in REMOVE_WORDS or not word.isalpha() or len(word) < 5:
            continue
        doc.append(word)
    text_list.append(doc)

# Create the word dictionary (token -> id) from the addresses themselves,
# not from the camera-review sentences used earlier
dictionary = corpora.Dictionary(text_list)

# Create a bag-of-words corpus, one document per address
by_address_corpus = [dictionary.doc2bow(text) for text in text_list]

## Create the model

In [None]:
lda = LdaModel(by_address_corpus, id2word=dictionary, num_topics=20, passes=10)

In [None]:
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(lda.print_topics(num_words=10))

## We can now compare presidential addresses by topic

In [None]:
len(by_address_corpus)

In [None]:
topics = lda.get_document_topics(by_address_corpus[0], minimum_probability=0, per_word_topics=False)
sorted(topics, key=itemgetter(1), reverse=True)

In [None]:
draw_wordcloud(lda, 18)

In [None]:
print(lda.show_topic(12, topn=5))
print(lda.show_topic(18, topn=5))

## Similarity
 - Given a corpus of documents, when a new document arrives, find the document that is the most similar

In [None]:
from gensim.similarities.docsim import Similarity
from gensim import corpora, models, similarities

In [None]:
doc1 = """
Many, many years ago, I used to frequent this place for their amazing french toast. 
It's been a while since then and I've been hesitant to review a place I haven't been to in 7-8 years... 
but I passed by French Roast and, feeling nostalgic, decided to go back.

It was a great decision.

Their Bloody Mary is fantastic and includes bacon (which was perfectly cooked!!), olives, 
cucumber, and celery. The Irish coffee is also excellent, even without the cream which is what I ordered.

Great food, great drinks, a great ambiance that is casual yet familiar like a tiny little French cafe. 
I highly recommend coming here, and will be back whenever I'm in the area next.

Juan, the bartender, is great!! One of the best in any brunch spot in the city, by far.
"""

In [None]:
doc2 = """
I went to Mexican Festival Restaurant for Cinco De Mayo because I had been there years 
prior and had such a good experience. This time wasn't so good. The food was just 
mediocre and it wasn't hot when it was brought to our table. They brought my friends food out 
10 minutes before everyone else and it took forever to get drinks. We let it slide because the place was 
packed with people and it was Cinco De Mayo. Also, the margaritas we had were slamming! Pure tequila. 

But then things took a turn for the worst. As I went to get something out of my purse which was on 
the back of my chair, I looked down and saw a huge water bug. I had to warn the lady next to me because 
it was so close to her chair. We called the waitress over and someone came with a broom and a dustpan and 
swept it away like it was an everyday experience. No one seemed phased.

Even though our waitress was very nice, I do not think we will be returning to Mexican Festival again. 
It seems the restaurant is a shadow of its former self.
"""

In [None]:
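# community_data, le_monde_data, amigos_data and heights_data are assumed to be
# corpus readers (e.g. PlaintextCorpusReader) loaded before this cell runs
all_text = [community_data.raw()] + [le_monde_data.raw()] + [amigos_data.raw()] + [heights_data.raw()]
doc_list = [community_data, le_monde_data, amigos_data, heights_data]
documents = [doc.raw() for doc in doc_list]
assert all_text == documents  # both constructions yield the same list of raw texts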

texts = [[word for word in document.lower().split()
        if word not in STOPWORDS and word.isalnum()]
        for document in documents]

In [None]:
def get_lsi(texts):
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    return dictionary, corpus, models.LsiModel(corpus, id2word=dictionary, num_topics=2)

In [None]:
def get_doc_similarity(doc, dictionary, corpus, lsi):
    '''Match new doc against known docs to get similarity'''
    vec_bow = dictionary.doc2bow(doc.lower().split())
    vec_lsi = lsi[vec_bow]
    index = similarities.MatrixSimilarity(lsi[corpus])
    sims = index[vec_lsi]
    return sorted(enumerate(sims), key=lambda x: -x[1])

In [None]:
dictionary, corpus, lsi = get_lsi(texts)
get_doc_similarity(doc1, dictionary, corpus, lsi)

In [None]:
get_doc_similarity(doc2, dictionary, corpus, lsi)
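
`MatrixSimilarity` ranks documents by cosine similarity between vectors. As a rough illustration of that measure alone, the sketch below computes it on raw word counts in plain Python rather than in LSI space. The example sentences and helper names are made up for illustration.

```python
import math
from collections import Counter

def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two sparse word-count vectors."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, known_docs):
    """Return (doc_index, similarity) pairs, most similar first."""
    query_counts = Counter(query.lower().split())
    sims = [(i, cosine_similarity(query_counts, Counter(doc.lower().split())))
            for i, doc in enumerate(known_docs)]
    return sorted(sims, key=lambda pair: -pair[1])

known = ["great french toast and coffee", "slow service and cold food"]
print(rank_by_similarity("amazing french toast", known))
```

Raw counts only match documents that share exact words. Projecting into LSI space first, as `get_doc_similarity` does, lets documents match on co-occurring vocabulary as well.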