# Introduction to text analysis VIII #

## topic detection ##

Topic detection is a way to extract relevant information from texts. The topic is a set of words (from the text) having particular relevance in terms of probability. They apper to be words that characterize the topics (one or more) discussed in the documents.

**definitions:**

* Document: A single text, paragraph or even tweet to be classified
* Word/Term: a single component of a document
* Topic: a set of words describing a group (cluster) of documents

** each document usually is as a mixture of several topics **

### mixture of topics ###

A model such as LDA will produce an classification such as the following:

* Sentences 1 and 2: 100% Topic A
* Sentences 3 and 4: 100% Topic B
* Sentence 5: 60% Topic A, 40% Topic B

Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)

Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)

### methodologies ###

* latent dirichlet allocation (lda) 
* Non negative matrix factorization
* Clustering 

**latent dirichlet allocation (lda)** 

It's a complex mathematical model (based on Bayesian statistics and Dirichlet and Multinomial distributions) to establish the words in a set of documents that are the most representative. The starting point is definining a 

* fixed number of topics K
* to each topic k we associate a probability p = p(k|w) i.e. the probability of seeing the topic k given the set of words w in the document d
* to each topic k we associate a probability s = s(k,d) i.e. the probability of a k topic belonging to the document d. The distribution s represents the mixture of topics related to d
* A word in the document is picked by randomly extracting from a topic and from a document according to s and p distributions
* An optimization is performed fitting the s,p distributions to the actual distribution of words in the documents.

**Non negative matrix factorization **

![title](NMF.png)

* **V**  is the matrix representing all documents
* **H** is the matrix representing documents given the topics
* **W** is the matrix representing the topics


the factorization is made using objective functions such as *Frobenius Norm *

### Main features ###

** LDA **

* Slow method
* Quite accurate for large corpora where each document is a mixture of topics
* Most adopted 

** NMF **

* Fast method
* Accurate with small corpora (i.e. tweets) or tweets with no mixture of topics
* not commonly adopted


## hands on ##

In [1]:
import string
import pickle
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
import re

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [3]:
from sklearn.datasets import fetch_20newsgroups

** get the corpus **

20 newsgroup

In [4]:
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
docs_raw = newsgroups.data
print(len(docs_raw))

11314


In [5]:
stops_it = stopwords.words('italian')
stops_en = stopwords.words('english')

translator = str.maketrans(' ', ' ', string.punctuation) ## remove the punctuation

In [6]:
def minimumSize(tokens,llen = 2):
    ## remove words smaller than llen chars
    tks = []
    for t in tokens:
        if(len(t) > llen):
            tks.append(t)
    return tks

def removeStops(tokens,stops = stops_it):
    # remove stop words
    remains = []
    for t in tokens:
        if(t not in stops):
            remains.append(t)
    return remains

def processText(text):
    ## tokenizer with stop words removal and minimum size 
    tks = word_tokenize(text)
    tks = [t.translate(translator) for t in tks] ## remove the punctuation
    tks = minimumSize(tks)
    tks = removeStops(tks,stops_en)
    return tks

### TFIDF vectorizer ###

It transforms each word in the D documents in a sparse matrix representing a normalized frequency of each word in each document. 

In [7]:
n_features = 1000 
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,tokenizer=processText)

**n_features** it's the number of individual ters from the corpus to use (notice that rarely a language by humans uses more than few thousands of distinct words ). Having a large dataset it is safe to use large number for n_features, for short corpus n_features must be non large

**max_df** is the probability at which the more probable words must be removed (removes the most common words)

**min_df** removes the words appearing less than 2 times in the dataset.


In [8]:
corpusT = docs_raw[0:500] ## let's use the first 500 documents

tfidf = tfidf_vectorizer.fit_transform(corpusT)

In [9]:
tfidf

<500x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 12953 stored elements in Compressed Sparse Row format>

** associate names (words) to each feature **

In [10]:
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

** LDA **

In [11]:
n_topics = 20
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=10,
                                learning_method = 'batch')


* ** n_topics ** is somehow arbitrary. 
* ** max_iter ** stops the iteration after maximum 10
* ** learning method ** is usually online but can be also batch (slower) when all data are processed at time

In [12]:
lda.fit(tfidf) ## fit the model

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_jobs=1, n_topics=20, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [13]:
def mostImportantWordsPerTopic(feature_names,topic,n_top_words):
    mwords = []
    sort_topic = topic.argsort()
    mw = sort_topic[:-n_top_words - 1:-1] ## reversed list    
    for idx in mw:
        mwords.append(feature_names[idx])
    return mwords
        

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        
        most_important_words = mostImportantWordsPerTopic(feature_names,topic,n_top_words)

        message = "Topic #%d: " % topic_idx
        message += " ".join(most_important_words)
        print(message)
    print()

** Printing the topics **

In [14]:
n_top_words = 10
print_top_words(lda, tfidf_feature_names, n_top_words)

Topic #0: clipper board thank required logic image anybody display left thanks
Topic #1: people one would know think may time even want anyone
Topic #2: cause post following group drugs based asking armenian several bbs
Topic #3: keys advance package red referred software mac gear object plus
Topic #4: god received message post neither hell heaven contact require follow
Topic #5: spacecraft soon tower thanks net manager love available legal jesus
Topic #6: problems printer change course obvious file shuttle software murray anyone
Topic #7: anything nature gun development weeks children death hello jesus believe
Topic #8: battery child car print files sense charge scope pro possibly
Topic #9: national ask wish knows excuse circuit anybody program reserve audio
Topic #10: use would like please email get also need thanks anyone
Topic #11: drug april worked land program occurs launched asking marijuana among
Topic #12: water motif starters technology killed israel cases sale morality great

** NMF **

In [15]:
nmf = NMF(n_components=n_topics, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)

** parameters **

NMF is basically free of parameters :).
    
* alpha : regolarization parameter (used to smooth the frequencies and to improve the fit)
* l1_ratio : regolarization parameter (used to smooth the frequencies and to improve the fit)

In [16]:
n_top_words = 10
print_top_words(nmf, tfidf_feature_names, n_top_words)

Topic #0: would get one think people like know much could may
Topic #1: thanks please email address list advance information available net anybody
Topic #2: use simms memory machine mac could several need answer work
Topic #3: year last yes old three years great game time mask
Topic #4: problem found used light known check error however think running
Topic #5: file printer print manager like another port instead name driver
Topic #6: window box control want left get option application manager upper
Topic #7: looking card working must email condition mail appreciated buy spend
Topic #8: problems pain obvious gave anybody following ask sure also cars
Topic #9: possible yes phone crypto interest invalid fire eternal understanding soviet
Topic #10: things apparently worse like also little exactly seem basically reality
Topic #11: post message product real research feel could server sorry error
Topic #12: program windows files april run software microsoft image code version
Topic #13: lost 

### Bonus visualization ###

In [17]:
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [18]:
import warnings
warnings.filterwarnings('ignore')

** default visualization of topics and frequency in a multidimensional space **

In [19]:
pyLDAvis.sklearn.prepare(lda, tfidf, tfidf_vectorizer)

** another decomposition method **

In [20]:
pyLDAvis.sklearn.prepare(nmf, tfidf, tfidf_vectorizer,mds='mmds')