<table align="left">
<tr>

<th style="background-color:white">
<img src="https://github.com/mlgill/ODSC_East_2017_PythonNLP/blob/master/assets/logo.png?raw=true" width=140 height=100>
</th>

<th style="background-color:white">
<div align="left">
<h1>Learning from Text: <br> Introduction to Natural Language Processing with Python</h1>  
<h2>Michelle L. Gill, Ph.D.</h2>     
Senior Data Scientist, Metis  
ODSC East  
May 3, 2017 
</div>
</th>

</tr>
</table>  

## LDA Walkthrough and Exercises

## The Data

We will be using a portion of the well-known 20 Newsgroups data set, which contains approximately 20,000 posts partitioned evenly across 20 different newsgroups. Our sample contains 5 topics and about 3,000 posts.

We will begin by loading the data.

In [1]:
import nltk
from accessory_functions import nltk_path

# Setup nltk corpora path
nltk.data.path.insert(0, nltk_path)

In [2]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

topic_list = ['sci.space', 'comp.sys.mac.hardware', 'rec.autos',
              'rec.sport.baseball', 'sci.med']

dataset = fetch_20newsgroups(shuffle=True, random_state=1, data_home='../data',
                             categories=topic_list,
                             remove=('headers', 'footers', 'quotes'))

data = pd.DataFrame(dataset['data'], columns=['text'])
print(len(data))

2956


## Preprocess the Data

Next, we will preprocess the text using the `preprocess_series_text` convenience function from `accessory_functions`.

In [3]:
from accessory_functions import preprocess_series_text

data['text'] = preprocess_series_text(data.text, 
                                      nltk_path=nltk_path)

In [4]:
data.head()

|   | text |
|---|------|
| 0 | otoh u get lucky unplugged replugged scsi adb ... |
| 1 | yes everyone else may wonder fred well would o... |
| 2 | umm perhaps could explain right talk |
| 3 | like alomar like differ opinion city likely po... |
| 4 | wow know uranus long way think far away |
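
The internals of `preprocess_series_text` are not shown in this notebook. As a rough, hypothetical sketch of what this kind of cleaning typically involves (lowercasing, dropping non-letter characters, and removing stop words), consider the following. The function name `simple_preprocess`, the tiny `STOP_WORDS` set, and the example sentence are assumptions made for illustration only; the real helper takes an `nltk_path` argument, so it presumably relies on NLTK resources and may do more, such as lemmatization.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch);
# a real pipeline would use a much fuller list.
STOP_WORDS = {"a", "an", "and", "i", "is", "it", "of", "that", "the", "to"}


def simple_preprocess(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)


# Made-up example sentence, not taken from the data set.
print(simple_preprocess("Wow, I know that Uranus is a LONG way away!"))
# prints: wow know uranus long way away
```

Applied to a pandas Series, a function like this could be used as `series.apply(simple_preprocess)`.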


## Create Numerical Features

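Use `CountVectorizer` to create a document-term matrix, keeping the 1,000 most frequent terms among those that appear in at least 2 documents and in no more than 95% of documents.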

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

n_features = 1000
cv = CountVectorizer(max_df=0.95, min_df=2, 
                     max_features=n_features)
X = cv.fit_transform(data.text)

print(X.shape)

(2956, 1000)


In [6]:
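# Note: scikit-learn 1.0+ provides `get_feature_names_out()`; `get_feature_names()` was removed in 1.2.
pd.DataFrame(X.toarray(), columns=cv.get_feature_names()).head()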

Unnamed: 0,ab,ability,able,ac,accept,access,accord,acid,act,activity,...,write,wrong,yankee,yeah,year,yeast,yes,yet,york,young
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
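
To make the role of `min_df`, `max_df`, and `max_features` more concrete, here is a simplified pure-Python sketch of how a document-term matrix could be built. It is not scikit-learn's implementation: the function `toy_document_term_matrix` and the four toy documents are hypothetical, tokenization is a plain whitespace split, and the result is a dense list of lists rather than a sparse matrix.

```python
from collections import Counter


def toy_document_term_matrix(docs, min_df=1, max_df=1.0, max_features=None):
    """Simplified bag-of-words counts (illustrative only)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Corpus-wide term frequency: total occurrences of each term.
    term_freq = Counter(term for tokens in tokenized for term in tokens)

    # Keep terms found in at least `min_df` documents (absolute count)
    # and in at most a `max_df` fraction of all documents.
    kept = [t for t, df in doc_freq.items()
            if df >= min_df and df <= max_df * n_docs]

    # Optionally keep only the most frequent terms across the corpus.
    if max_features is not None:
        kept = sorted(kept, key=lambda t: (-term_freq[t], t))[:max_features]
    vocab = sorted(kept)

    # One row per document, one column per vocabulary term.
    rows = [[tokens.count(term) for term in vocab] for tokens in tokenized]
    return vocab, rows


docs = ["car drive car", "space launch orbit", "car space", "launch space moon"]
vocab, rows = toy_document_term_matrix(docs, min_df=2, max_df=0.95)
print(vocab)  # ['car', 'launch', 'space']
for row in rows:
    print(row)
```

Terms such as `drive` and `moon` appear in only one document, so `min_df=2` removes them, just as rare terms are dropped from the matrix above.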


## Create an LDA Model

Use Scikit-learn's [`LatentDirichletAllocation`](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) to fit an LDA model.

In [7]:
# Import the class first so that the help lookup can find it
from sklearn.decomposition import LatentDirichletAllocation
LatentDirichletAllocation?


In [8]:
from sklearn.decomposition import LatentDirichletAllocation

n_topics = len(topic_list)
# Note: scikit-learn 0.19+ names this parameter `n_components`; `n_topics` was removed in 0.21.
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

lda.fit(X)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=5, mean_change_tol=0.001,
             n_jobs=1, n_topics=5, perp_tol=0.1, random_state=0,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)
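
For reference, the standard LDA generative model that this estimator approximates can be summarized as follows. The symbols are the conventional ones and are not defined elsewhere in this notebook: $K$ is the number of topics (`n_topics` above), $\alpha$ corresponds to `doc_topic_prior`, and $\beta$ corresponds to `topic_word_prior`.

$$
\begin{aligned}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
$$

Here $\theta_d$ is the topic mixture of document $d$, $\phi_k$ is the word distribution of topic $k$, $z_{d,n}$ is the topic assigned to the $n$-th word of document $d$, and $w_{d,n}$ is the observed word. After fitting, each row of `lda.components_` holds unnormalized word weights for one topic; dividing a row by its sum gives an estimate of $\phi_k$.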

## Print Top Words
Print the top words associated with each topic, ranked by their weights in `model.components_`.

In [9]:
def print_top_words(model, feature_names, n_top_words):
    
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % (topic_idx+1))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

In [10]:
n_top_words = 20

print("Topics in LDA model:")
# In scikit-learn 1.2+, use cv.get_feature_names_out() instead
cv_feature_names = cv.get_feature_names()
print_top_words(lda, cv_feature_names, n_top_words)

Topics in LDA model:
Topic #1:
one would like get people think know make thing cause say see use food many could well time take much
Topic #2:
year game get good go last think would well team run one hit player time first like win play say
Topic #3:
patient use health post edu medical information group number disease software mail report new pain keyboard cancer center program please
Topic #4:
car get use would drive one mac like problem know go work new apple time good thanks make think system
Topic #5:
space launch nasa edu satellite system orbit data com mission earth use program shuttle lunar year also moon rocket first
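
The slice `[:-n_top_words - 1:-1]` in `print_top_words` can look cryptic. It walks an ascending argsort backwards and stops after `n_top_words` items, which yields the indices of the largest weights in descending order. The sketch below illustrates this with plain Python lists; the `argsort` helper is a stand-in for NumPy's `argsort`, and the feature names and weights are made up.

```python
def argsort(values):
    """Indices that sort `values` in ascending order (stand-in for NumPy's argsort)."""
    return sorted(range(len(values)), key=lambda i: values[i])


def top_indices(weights, n_top_words):
    """Indices of the n largest weights, largest first."""
    # Step backwards from the end of the ascending order, taking n_top_words items.
    return argsort(weights)[:-n_top_words - 1:-1]


feature_names = ["car", "launch", "moon", "orbit", "space"]
topic_weights = [0.5, 7.0, 3.0, 4.5, 9.0]  # made-up weights for a single topic

top = top_indices(topic_weights, 3)
print(top)                                       # [4, 1, 3]
print(" ".join(feature_names[i] for i in top))   # space launch orbit
```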



## Visualize the LDA Model

A visualization of the topic model can be easily created with `pyLDAvis`. 

In [11]:
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(lda, X, cv)

## Exercises

* Fit an LDA model with a different number of topics and compare the top 20 words to those from the model above.
* Create a different document-term matrix by changing the input parameters (`max_features`, etc.) or by switching to `TfidfVectorizer`, and use it to fit another LDA model.