#### Agenda
The majority of the focus should be on:

Start with a general overview of modeling in NLP, covering supervised and unsupervised models.

Explain sentiment analysis (supervised): where it can be used and which models are suitable for this type of task.

Explain the purpose of topic modeling (dimensionality reduction, tagging, unsupervised learning) and introduce LDA (no need to explain it in too much detail).

### Topic Modelling

LDA approaches topic modeling by treating each document as a mixture of topics in certain proportions, and each topic as a mixture of keywords, again in certain proportions.

Once you provide the algorithm with the number of topics, it rearranges the topic distribution within the documents and the keyword distribution within the topics until it obtains a good composition of the topic-keyword distribution.

What exactly is a topic, and how is it represented?

A topic is a collection of dominant keywords that are typical representatives of it. Just by looking at the keywords, you can usually identify what the topic is about.

The following are key factors in obtaining well-separated topics:

* The quality of text processing.
* The variety of topics the text talks about.
* The choice of topic modeling algorithm.
* The number of topics fed to the algorithm.
* The algorithm's tuning parameters.

[source: 'https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#4whatdoesldado' ]

<img src='./LDA1.jpg'>

#### More on LDA
* LDA uses the Dirichlet probability distribution.

#### Assumptions of LDA
* Documents with similar topics use similar groups of words.
* Latent topics can be found by searching for groups of words that frequently occur together in documents across a corpus.

 [source: Udemy course NLP Processing with Python by Jose Portilla]

### LDA in action
#### [source: Codebase adapted from Udemy course NLP Processing with Python by Jose Portilla]

We will be using articles from NPR (National Public Radio), obtained from their website [www.npr.org](http://www.npr.org)

In [1]:
import pandas as pd

In [2]:
npr = pd.read_csv('./npr.csv')

In [3]:
npr

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."
...,...
11987,The number of law enforcement officers shot an...
11988,"Trump is busy these days with victory tours,..."
11989,It’s always interesting for the Goats and Soda...
11990,The election of Donald Trump was a surprise to...


## Preprocessing
**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
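To make the two thresholds concrete, here is a minimal pure-Python sketch of the document-frequency filter described above. It is not scikit-learn's implementation; `filter_vocabulary` and the toy corpus are hypothetical names and data used only for illustration.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency is not below min_df or above max_df.

    An int threshold is an absolute document count; a float threshold is a
    proportion of the corpus, mirroring the parameter descriptions above.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents a term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))

    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, df in doc_freq.items() if low <= df <= high)


docs = [
    "apple banana",
    "apple cherry",
    "apple banana date",
    "apple egg",
]

# "apple" is in 4 of 4 documents (above 75%), so max_df drops it.
# "cherry", "date" and "egg" are in only 1 document, so min_df drops them.
vocab = filter_vocabulary(docs, min_df=2, max_df=0.75)
print(vocab)  # ['banana']
```

With the settings used in this notebook (`max_df=0.95, min_df=2`), the same rule drops words that appear in more than 95% of the articles and words that appear in only one article.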

In [35]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

In [5]:
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [36]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [37]:
dtm = tfidf_vectorizer.fit_transform(npr['Article'])

In [38]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

## LDA

In [39]:
from sklearn.decomposition import LatentDirichletAllocation

In [40]:
LDA = LatentDirichletAllocation(n_components=4,random_state=42)

In [41]:
LDA.fit(dtm)

LatentDirichletAllocation(n_components=4, random_state=42)

## Showing Stored Words

In [43]:
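# cv was never fitted; the fitted vectorizer is tfidf_vectorizer.
# get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn.
len(tfidf_vectorizer.get_feature_names_out())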

54777

In [44]:
import random

In [45]:
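# Use the fitted vectorizer and look the vocabulary up once.
feature_names = tfidf_vectorizer.get_feature_names_out()
for i in range(10):
    random_word_id = random.randint(0, len(feature_names) - 1)
    print(feature_names[random_word_id])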

feedings
tango
outraged
tech
44th
bonhoeffer
asphalt
nonresident
tellers
problem



### Showing Top Words Per Topic

In [46]:
len(LDA.components_)

4

In [47]:
LDA.components_

array([[ 0.25027237,  0.2502344 ,  0.25011414, ...,  0.25025818,
         0.25024137,  0.2501001 ],
       [ 0.62297762,  3.777409  ,  0.250054  , ...,  0.2501196 ,
         0.25011356,  0.25005292],
       [ 0.30937568,  0.29640429,  0.65884802, ...,  0.26047484,
         0.25020682,  0.34501652],
       [ 3.60623188, 85.16153562,  0.25217095, ...,  0.65911154,
         0.39039157,  0.25610006]])

In [48]:
len(LDA.components_[0])

54777

In [49]:
single_topic = LDA.components_[0]

In [50]:
# Returns the indices that would sort this array.
single_topic.argsort()

array([21712, 38837, 16656, ..., 27476, 20194, 26164], dtype=int64)

In [51]:
# Top 10 words for this topic:
single_topic.argsort()[-10:]

array([12529, 30735, 50693, 15491, 48904, 13425, 23506, 27476, 20194,
       26164], dtype=int64)
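Because `argsort()` sorts in ascending order, the last entries of its result point at the largest weights. The following pure-Python sketch mimics that behaviour on a tiny, made-up weight vector; the helper `argsort` and the numbers are illustrative assumptions, not part of the notebook.

```python
def argsort(values):
    """Return the indices that would sort `values` in ascending order."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Hypothetical weights of five vocabulary words within one topic.
weights = [0.25, 3.78, 0.31, 85.16, 0.66]

order = argsort(weights)
top_two = order[-2:]  # the last two indices belong to the two largest weights

print(order)    # [0, 2, 4, 1, 3]
print(top_two)  # [1, 3]
```

The most important word therefore comes last in the slice; reverse the slice if you want the words listed from most to least important.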

In [52]:
top_word_indices = single_topic.argsort()[-10:]

In [53]:
# Use the fitted vectorizer (cv was never fitted).
feature_names = tfidf_vectorizer.get_feature_names_out()
for index in top_word_indices:
    print(feature_names[index])

cáceres
meagan
twitch
dubose
tensing
demme
honnold
kohlhepp
gambia
jammeh


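These top words are mostly rare proper names, so the theme of this topic is not obvious. (LDA models word counts, so fitting it on TF-IDF weights, as done here, may be pushing rare terms to the top.) Next, let's view the top words for all 4 topics; after that we will use .transform() on our vectorized articles to attach a topic label to each article.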

In [54]:
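# Use the fitted vectorizer (cv was never fitted).
feature_names = tfidf_vectorizer.get_feature_names_out()
for index, topic in enumerate(LDA.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([str(feature_names[i]) for i in topic.argsort()[-15:]])
    print('\n')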

THE TOP 15 WORDS FOR TOPIC #0
['adama', 'gambian', 'ecowas', 'coxie', 'blimp', 'cáceres', 'meagan', 'twitch', 'dubose', 'tensing', 'demme', 'honnold', 'kohlhepp', 'gambia', 'jammeh']


THE TOP 15 WORDS FOR TOPIC #1
['comey', 'election', 'republican', 'security', 'russian', 'isis', 'obama', 'senate', 'house', 'campaign', 'russia', 'said', 'president', 'clinton', 'trump']


THE TOP 15 WORDS FOR TOPIC #2
['pop', 'prince', 'artists', 'singer', 'sound', 'musicians', 'guitar', 'desk', 'audio', 'jazz', 'band', 'songs', 'song', 'album', 'music']


THE TOP 15 WORDS FOR TOPIC #3
['don', 'year', 'percent', 'think', 'years', 'women', 'time', 'trump', 'health', 'new', 'just', 'like', 'said', 'people', 'says']




In [55]:
topic_results = LDA.transform(dtm)

In [56]:
topic_results.argmax(axis=1)

array([1, 1, 1, ..., 3, 3, 3], dtype=int64)

In [57]:
npr['Topic'] = topic_results.argmax(axis=1)

In [58]:
npr.head(10)

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",3
5,I did not want to join yoga class. I hated tho...,3
6,With a who has publicly supported the debunk...,3
7,"I was standing by the airport exit, debating w...",3
8,"If movies were trying to be more realistic, pe...",3
9,"Eighteen years ago, on New Year’s Eve, David F...",3
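The labelling step above boils down to picking, for each article, the topic with the highest score. Here is a small sketch of that idea without NumPy; the probabilities are made up for illustration and `dominant_topic` is a hypothetical helper.

```python
# Hypothetical per-document topic scores for 3 documents and 4 topics.
doc_topic = [
    [0.05, 0.80, 0.05, 0.10],
    [0.10, 0.15, 0.05, 0.70],
    [0.02, 0.03, 0.90, 0.05],
]


def dominant_topic(row):
    """Index of the largest entry in `row`, like argmax along axis 1."""
    return max(range(len(row)), key=lambda k: row[k])


labels = [dominant_topic(row) for row in doc_topic]
print(labels)  # [1, 3, 2]
```

Note that this keeps only the single strongest topic. A document that is split almost evenly between two topics gets the same kind of label as one that is dominated by a single topic, so it can be worth keeping the full score row as well.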


### Topic Modelling
[source: 'https://nlpforhackers.io/topic-modeling/' ]

There are several scenarios when topic modeling can prove useful. Here are some of them:

* Text classification – Topic modeling can improve classification by grouping similar words together into topics rather than using each word as a feature.
* Recommender systems – Using a similarity measure, we can build recommender systems. A system that recommends articles to readers can suggest articles whose topic structure is similar to that of the articles the user has already read.
* Uncovering themes in texts – Useful for detecting trends in online publications, for example.
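As a rough sketch of the recommender idea, the snippet below ranks candidate articles by the cosine similarity between their topic mixtures and the mixture of an article the user has already read. Cosine similarity is only one possible similarity measure, and the article names and topic mixtures are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical topic mixtures (4 topics) for three candidate articles.
articles = {
    "politics_1": [0.05, 0.85, 0.05, 0.05],
    "music_1": [0.05, 0.05, 0.85, 0.05],
    "politics_2": [0.10, 0.70, 0.05, 0.15],
}

# Topic mixture of an article the user has already read (also made up).
already_read = [0.05, 0.80, 0.05, 0.10]

ranked = sorted(
    articles,
    key=lambda name: cosine_similarity(already_read, articles[name]),
    reverse=True,
)
print(ranked)  # ['politics_1', 'politics_2', 'music_1']
```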

### Topic Modeling Algorithms
There are several algorithms for doing topic modeling. The most popular ones include

* LDA – Latent Dirichlet Allocation – The one we'll be focusing on in this tutorial. Its foundations are Probabilistic Graphical Models.
* LSA or LSI – Latent Semantic Analysis or Latent Semantic Indexing – Uses Singular Value Decomposition (SVD) on the Document-Term Matrix. Based on Linear Algebra.
* NMF – Non-Negative Matrix Factorization – Based on Linear Algebra.

#### Here are some things all these algorithms have in common:

* All of them take the number of topics as a parameter (`n_components` in the scikit-learn code above). None of the algorithms can infer the number of topics in the document collection.
* All of them take the Document-Word Matrix (or Document-Term Matrix) as input. DWM[i][j] = the number of occurrences of word_j in document_i.
* All of them output 2 matrices: WTM (Word Topic Matrix) and TDM (Topic Document Matrix). The matrices are significantly smaller, and the result of their multiplication should be as close as possible to the original DWM matrix.
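The factorization idea can be sketched with plain lists. The two small matrices below are made-up examples (a document-topic matrix and a topic-word matrix); multiplying them gives a matrix with the same shape as the Document-Word Matrix. The size comparison at the end uses the shape of the NPR matrix from this notebook (11992 documents, 54777 words) with 4 topics, counting dense entries only.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Hypothetical factors: 3 documents x 2 topics, and 2 topics x 4 words.
doc_topic = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]

# The product has one row per document and one column per word.
approx = matmul(doc_topic, topic_word)
print(len(approx), "x", len(approx[0]))  # 3 x 4

# Entry counts for the NPR corpus with 4 topics.
n_docs, n_words, n_topics = 11992, 54777, 4
full_entries = n_docs * n_words
factor_entries = n_docs * n_topics + n_topics * n_words
print(full_entries, factor_entries)  # 656885784 267076
```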

### How does LDA work?
[source: 'https://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/' ]

Suppose you have the following set of sentences:

* I like to eat broccoli and bananas.
* I ate a banana and spinach smoothie for breakfast.
* Chinchillas and kittens are cute.
* My sister adopted a kitten yesterday.
* Look at this cute hamster munching on a piece of broccoli.

What is latent Dirichlet allocation? It’s a way of automatically discovering topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like

Sentences 1 and 2: 100% Topic A

Sentences 3 and 4: 100% Topic B

Sentence 5: 60% Topic A, 40% Topic B

Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)

Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)

The question, of course, is: how does LDA perform this discovery?


### LDA Model
In more detail, LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that documents are produced in the following fashion: when writing each document, you

* Decide on the number of words N the document will have (say, according to a Poisson distribution).
* Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, assuming that we have the two food and cute animal topics above, you might choose the document to consist of 1/3 food and 2/3 cute animals.
* Generate each word w_i in the document by:
    * First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
    * Using the topic to generate the word itself (according to the topic’s multinomial distribution). For example, if we selected the food topic, we might generate the word “broccoli” with 30% probability, “bananas” with 15% probability, and so on.

Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.

#### Example
Let’s make an example. According to the above process, when generating some particular document D, you might

* Pick 5 to be the number of words in D.
* Decide that D will be 1/2 about food and 1/2 about cute animals.
* Pick the first word to come from the food topic, which then gives you the word “broccoli”.
* Pick the second word to come from the cute animals topic, which gives you “panda”.
* Pick the third word to come from the cute animals topic, giving you “adorable”.
* Pick the fourth word to come from the food topic, giving you “cherries”.
* Pick the fifth word to come from the food topic, giving you “eating”.

So the document generated under the LDA model will be “broccoli panda adorable cherries eating” (note that LDA is a bag-of-words model).
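The generative story can be simulated in a few lines using only the standard library. This is a sketch under several assumptions: the document length is fixed instead of drawn from a Poisson distribution, the Dirichlet parameters `alphas=[1.0, 1.0]` are arbitrary, and each topic contains only the words and percentages listed earlier, treated as relative weights (the remaining probability mass is left out). The function names are hypothetical.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one topic mixture from a Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [draw / total for draw in draws]


def generate_document(topics, alphas, n_words, rng):
    """Generate a bag of words following the LDA generative process."""
    names = list(topics)
    # Step 1: choose the document's topic mixture.
    mixture = sample_dirichlet(alphas, rng)
    words = []
    for _ in range(n_words):
        # Step 2: pick a topic for this word according to the mixture.
        topic = rng.choices(names, weights=mixture)[0]
        # Step 3: pick a word according to that topic's word weights.
        vocabulary = list(topics[topic])
        weights = list(topics[topic].values())
        words.append(rng.choices(vocabulary, weights=weights)[0])
    return mixture, words


# Only the words listed in the text; weights are relative, not a full distribution.
topics = {
    "food": {"broccoli": 0.30, "bananas": 0.15, "breakfast": 0.10, "munching": 0.10},
    "cute animals": {"chinchillas": 0.20, "kittens": 0.20, "cute": 0.20, "hamster": 0.15},
}

rng = random.Random(0)  # seeded so repeated runs give the same document
mixture, words = generate_document(topics, alphas=[1.0, 1.0], n_words=5, rng=rng)
print(mixture)
print(" ".join(words))
```

Fitting LDA runs this story in reverse: given only the words, it searches for topic mixtures and word distributions that could plausibly have produced them.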

### Sentiment Analysis
#### NLTK's VADER module
VADER is an NLTK module that provides sentiment scores based on words used ("completely" boosts a score, while "slightly" reduces it), on capitalization & punctuation ("GREAT!!!" is stronger than "great."), and negations (words like "isn't" and "doesn't" affect the outcome).
<br>To view the source code visit https://www.nltk.org/_modules/nltk/sentiment/vader.html

In [59]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Arunabh\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [60]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer


sid = SentimentIntensityAnalyzer()

The `polarity_scores()` method of VADER's `SentimentIntensityAnalyzer` takes in a string and returns a dictionary of scores in each of four categories:
* negative
* neutral
* positive
* compound *(computed by normalizing the summed word-level valence scores to a value between -1 and 1)*
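For intuition about the compound score, the sketch below shows the kind of normalization involved: a raw summed valence score is squashed into the range -1 to 1. This follows my reading of the source code linked above; the formula and the constant `alpha=15` should be treated as assumptions and checked against your installed NLTK version.

```python
import math


def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


# Larger absolute sums approach -1 or 1 but never reach them.
for raw in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(raw, round(normalize(raw), 4))
```

This is why a strongly worded sentence ends up with a compound score close to -1 or 1, while a neutral sentence stays near 0.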

In [61]:
a = 'The weather today is horrible. I dont feel like getting out'
sid.polarity_scores(a)

{'neg': 0.412, 'neu': 0.588, 'pos': 0.0, 'compound': -0.6818}

In [62]:
a = 'This was the worst film to ever disgrace the screen.'
sid.polarity_scores(a)

{'neg': 0.477, 'neu': 0.523, 'pos': 0.0, 'compound': -0.8074}

### Sentiment Analysis on Amazon Reviews

For this exercise we're going to apply `SentimentIntensityAnalyzer` to a dataset of 10,000 Amazon reviews. Like our movie reviews datasets, these are labeled as either "pos" or "neg". At the end we'll determine the accuracy of our sentiment analysis with VADER.

In [63]:
import numpy as np
import pandas as pd

df = pd.read_csv('./amazonreviews.tsv', sep='\t')
df.head()

Unnamed: 0,label,review
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
2,pos,Amazing!: This soundtrack is my favorite music...
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember, Pull Your Jaw Off The Floor After He..."


In [64]:
df['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

In [65]:
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))

df.head()

Unnamed: 0,label,review,scores
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co..."
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co..."
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com..."
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com..."
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp..."


In [66]:
df['compound']  = df['scores'].apply(lambda score_dict: score_dict['compound'])

df.head()

Unnamed: 0,label,review,scores,compound
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781


In [68]:
df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >=0 else 'neg')

df.head()

Unnamed: 0,label,review,scores,compound,comp_score
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454,pos
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957,pos
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858,pos
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814,pos
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781,pos


#### Check accuracy

In [69]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [70]:
accuracy_score(df['label'],df['comp_score'])

0.7091

In [71]:
print(classification_report(df['label'],df['comp_score']))

              precision    recall  f1-score   support

         neg       0.86      0.51      0.64      5097
         pos       0.64      0.91      0.75      4903

    accuracy                           0.71     10000
   macro avg       0.75      0.71      0.70     10000
weighted avg       0.75      0.71      0.70     10000



In [72]:
print(confusion_matrix(df['label'],df['comp_score']))

[[2622 2475]
 [ 434 4469]]
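The accuracy, precision and recall reported above can be recomputed by hand from the confusion matrix. The sketch below assumes scikit-learn's layout, where rows are the true labels and columns are the predicted labels, both in the order `neg`, `pos`.

```python
# Confusion matrix from the output above: rows = true label, columns = predicted.
confusion = [
    [2622, 2475],  # true neg: predicted neg, predicted pos
    [434, 4469],   # true pos: predicted neg, predicted pos
]
labels = ["neg", "pos"]

total = sum(sum(row) for row in confusion)
# Accuracy: correctly classified reviews (the diagonal) over all reviews.
accuracy = sum(confusion[i][i] for i in range(len(labels))) / total

metrics = {}
for i, label in enumerate(labels):
    # Recall: share of reviews with this true label that were found.
    recall = confusion[i][i] / sum(confusion[i])
    # Precision: share of predictions of this label that were correct.
    precision = confusion[i][i] / sum(row[i] for row in confusion)
    metrics[label] = (round(precision, 2), round(recall, 2))

print(accuracy)  # 0.7091
print(metrics)   # {'neg': (0.86, 0.51), 'pos': (0.64, 0.91)}
```

The numbers show where VADER struggles here: almost half of the negative reviews (2475 of 5097) are scored as positive, which is what pulls the recall for `neg` down to 0.51.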
