# Topic Modeling

Topic models have been designed specifically for the purpose of extracting various distinguishing concepts or topics from a large corpus having various types of documents where each document talks about one or more concepts. These concepts can be anything from thoughts, opinions, facts, outlooks, statements and so on. The main aim of topic modeling is to use mathematical and statistical techniques to discover hidden and latent semantic structures in a corpus. 
Topic modeling involves extracting features from document terms and using mathematical structures and frameworks like matrix factorization and SVD to generate clusters or groups of terms which are distinguishable from each other and these cluster of words form topics or concepts. These concepts can be used to interpret the main themes of a corpus and also make semantic connections amongst words which co-occur together frequently in various documents. There are various frameworks and algorithms to build topic models. The most popular ones include
- Latent Semantic Indexing 
- Latent Dirichlet Allocation 
- Non-negative Matrix Factorization

The last technique we will look at is non-negative matrix factorization (NNMF), which is another matrix decomposition technique similar to SVD but operates on non-negative matrices and works well for multivariate data.  NNMF can be formally defined as, given a non-negative matrix V, the objective is to find two non-negative matrix factors, W and H such that when they are multiplied, they can approximately reconstruct V. Mathematically this is represented by 
$$ V ≈ WH $$

such that all three matrices are non-negative. 

To get to this approximation, we usually use a cost function like the Euclidean distance or L2 norm between two matrices or the Frobenius norm which is a slight modification of the L2 norm. 

This implementation is available in the NMF class in the scikit-learn decomposition module which we will be using in the section.

# Import necessary dependencies

In [None]:
import pandas as pd
import numpy as np
import warnings
import nltk

warnings.filterwarnings("ignore")

In [None]:
!pip install contractions

Collecting contractions
  Downloading https://files.pythonhosted.org/packages/00/92/a05b76a692ac08d470ae5c23873cf1c9a041532f1ee065e74b374f218306/contractions-0.0.25-py2.py3-none-any.whl
Collecting textsearch
  Downloading https://files.pythonhosted.org/packages/42/a8/03407021f9555043de5492a2bd7a35c56cc03c2510092b5ec018cae1bbf1/textsearch-0.0.17-py2.py3-none-any.whl
Collecting pyahocorasick
[?25l  Downloading https://files.pythonhosted.org/packages/f4/9f/f0d8e8850e12829eea2e778f1c90e3c53a9a799b7f412082a5d21cd19ae1/pyahocorasick-1.4.0.tar.gz (312kB)
[K     |████████████████████████████████| 317kB 3.8MB/s 
[?25hCollecting Unidecode
[?25l  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
[K     |████████████████████████████████| 245kB 8.5MB/s 
[?25hBuilding wheels for collected packages: pyahocorasick
  Building wheel for pyahocorasick (setup.py) ... [?25l[?25hdone
  C

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
import contractions
from bs4 import BeautifulSoup
import unicodedata
import re
import nltk
import numpy as np

ps = nltk.porter.PorterStemmer()
stop_words = nltk.corpus.stopwords.words('english')
stop_words.remove('no')
stop_words.remove('but')
stop_words.remove('not')


def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    [s.extract() for s in soup(['iframe', 'script'])]
    stripped_text = soup.get_text()
    stripped_text = re.sub(r'[\r|\n|\r\n]+', '\n', stripped_text)
    return stripped_text


def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text


def expand_contractions(text):
    return contractions.fix(text)


def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text


def remove_stopwords(text, is_lower_case=False, stopwords=None):
    if not stopwords:
        stopwords = nltk.corpus.stopwords.words('english')
    tokens = nltk.word_tokenize(text)
    tokens = [token.strip() for token in tokens]
    
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopwords]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopwords]
    
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text


def pre_process_document(document):
    
    # strip HTML
    document = strip_html_tags(document)
    
    # lower case
    document = document.lower()
    
    # remove extra newlines (often might be present in really noisy text)
    document = document.translate(document.maketrans("\n\t\r", "   "))
    
    # remove accented characters
    document = remove_accented_chars(document)
    
    # expand contractions    
    document = expand_contractions(document)
               
    # remove special characters and\or digits    
    # insert spaces between special characters to isolate them    
    special_char_pattern = re.compile(r'([{.(-)!}])')
    document = special_char_pattern.sub(" \\1 ", document)
    document = remove_special_characters(document, remove_digits=True)        
    
    # remove stopwords
    document = remove_stopwords(document, is_lower_case=True, stopwords=stop_words)
        
    # remove extra whitespace
    document = re.sub(' +', ' ', document)
    document = document.strip()
    
    return document


pre_process_corpus = np.vectorize(pre_process_document)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Load and normalize data

In [None]:
dataset = pd.read_csv(r'/content/drive/My Drive/nlp/nlp_workshop_dhs18-master/nlp_workshop_dhs18-master/Unit 14 - Topic Modeling/movie_reviews.csv.bz2', compression='bz2')

# take a peek at the data
print(dataset.head())
reviews = np.array(dataset['review'])
sentiments = np.array(dataset['sentiment'])

# build train and test datasets
train_reviews = reviews[:35000]
train_sentiments = sentiments[:35000]
test_reviews = reviews[35000:]
test_sentiments = sentiments[35000:]

# normalize datasets
norm_train_reviews = pre_process_corpus(train_reviews)
norm_test_reviews = pre_process_corpus(test_reviews)

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [None]:
norm_reviews=np.concatenate((norm_train_reviews, norm_test_reviews), axis=0)

# Extract features from positive and negative reviews

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# consolidate all normalized reviews
# = norm_train_reviews+norm_test_reviews
# get tf-idf features for only positive reviews
positive_reviews = [review for review, sentiment in zip(norm_reviews, sentiments) if sentiment == 'positive']
ptvf = TfidfVectorizer(use_idf=True, min_df=0.02, max_df=0.75, ngram_range=(1, 2), sublinear_tf=True)
ptvf_features = ptvf.fit_transform(positive_reviews)
# get tf-idf features for only negative reviews
negative_reviews = [review for review, sentiment in zip(norm_reviews, sentiments) if sentiment == 'negative']
ntvf = TfidfVectorizer(use_idf=True, min_df=0.02, max_df=0.75, ngram_range=(1, 2), sublinear_tf=True)
ntvf_features = ntvf.fit_transform(negative_reviews)
# view feature set dimensions
print(ptvf_features.shape, ntvf_features.shape)

(25000, 893) (25000, 912)


In [None]:
!pip install pyLDAvis
!pip install django-model-utils

Collecting django-model-utils
  Downloading https://files.pythonhosted.org/packages/f2/f0/ec6a32eab77a5a6f00779d9c30b36e014421fce5976c45dafd7cb40b8b50/django_model_utils-4.0.0-py2.py3-none-any.whl
Installing collected packages: django-model-utils
Successfully installed django-model-utils-4.0.0


# Topic Modeling on Reviews

In [None]:
import pyLDAvis
import pyLDAvis.sklearn
from sklearn.decomposition import NMF
#import topic_model_utils as tmu
#import django-model-utils
pyLDAvis.enable_notebook()
total_topics = 10

## Display and visualize topics for positive reviews

In [None]:
# build topic model on positive sentiment review features
pos_nmf = NMF(n_components=total_topics, solver='cd', max_iter=500,
               random_state=42, alpha=.1, l1_ratio=.85)
pos_nmf.fit(ptvf_features)      
# extract features and component weights
pos_feature_names = np.array(ptvf.get_feature_names())
pos_weights = pos_nmf.components_
# extract and display topics and their components
pos_feature_names = np.array(ptvf.get_feature_names())
feature_idxs = np.argsort(-pos_weights)[:, :15]
topics = [pos_feature_names[idx] for idx in feature_idxs]
for idx, topic in enumerate(topics):
    print('Topic #'+str(idx+1)+':')
    print(', '.join(topic))
    print()

Topic #1:
like, would, think, but, know, people, really, get, say, see, could, no, even, want, something

Topic #2:
movie, movies, great, love, movie not, watch, see, great movie, story, recommend, see movie, loved, watch movie, wonderful, movie but

Topic #3:
show, series, episode, episodes, tv, shows, season, television, great, characters, watch, watching, new, love, every

Topic #4:
good, well, but, really, acting, pretty, plot, action, also, bad, better, job, quite, cast, done

Topic #5:
best, ever, seen, one best, one, ever seen, movies, greatest, made, amazing, never, not seen, performance, films, brilliant

Topic #6:
life, man, story, one, two, but, young, also, character, no, world, new, time, way, scene

Topic #7:
film, films, great, film not, see, great film, film but, wonderful, made, acting, cinema, music, watch, recommend, characters

Topic #8:
saw, years, first, dvd, ago, time, remember, years ago, still, first time, old, saw movie, video, since, watched

Topic #9:
funny,

In [45]:
ptvf

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=0.75, max_features=None,
                min_df=0.02, ngram_range=(1, 2), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=True, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [None]:
pyLDAvis.sklearn.prepare(pos_nmf, ptvf_features, ptvf, mds='mmds')

## Display and visualize topics for negative reviews

In [None]:
# build topic model on negative sentiment review features
neg_nmf = NMF(n_components=total_topics, solver='cd', max_iter=500,
              random_state=42, alpha=.1, l1_ratio=.85)
neg_nmf.fit(ntvf_features)      
# extract features and component weights
neg_feature_names = ntvf.get_feature_names()
neg_weights = neg_nmf.components_
# extract and display topics and their components
neg_feature_names = np.array(ntvf.get_feature_names())
feature_idxs = np.argsort(-neg_weights)[:, :15]
topics = [neg_feature_names[idx] for idx in feature_idxs]
for idx, topic in enumerate(topics):
    print('Topic #'+str(idx+1)+':')
    print(', '.join(topic))
    print()

Topic #1:
but, one, no, like, would, get, also, characters, story, character, well, two, way, time, first

Topic #2:
movie, bad, movies, watch, acting, good, even, like, but, movie not, really, see, would, watching, could

Topic #3:
film, films, film not, acting, bad, film but, good, script, but, awful, made, plot, watch, director, poor

Topic #4:
ever, worst, seen, ever seen, movie ever, worst movie, one worst, ever made, one, movies, made, worse, possibly, horrible, far

Topic #5:
horror, budget, low, low budget, horror movie, gore, blood, flick, scary, killer, monster, genre, dead, movies, cheap

Topic #6:
book, read, novel, story, based, version, movie, characters, reading, disappointed, made, love, completely, written, original

Topic #7:
show, funny, comedy, not funny, jokes, tv, episode, shows, series, humor, watch, laugh, stupid, television, kids

Topic #8:
waste, waste time, time, not waste, money, complete, watching, please, total, wasted, crap, horrible, nothing, spent, not 

In [None]:
pyLDAvis.sklearn.prepare(neg_nmf, ntvf_features, ntvf, mds='mmds')