# Milestone 2 Topic Trending Analysis
Today's media produces an overwhelming volume of information across many topics and domains. Although large datasets are valuable for data mining, most of the data is not meaningful on its own. Trending topic analysis is a Natural Language Processing (NLP) technique that automatically extracts meaningful information from text by identifying recurrent themes or topics, which makes it well suited to a dataset containing millions of quotations. The results can feed recommendation systems, social media monitoring tools, and marketing applications.

In [None]:
%load_ext autoreload
%autoreload 2

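import warnings; warnings.simplefilter('ignore')
import os, codecs, string, random, re
import numpy as np
import pandas as pd
from numpy.random import seed as random_seed
from numpy.random import shuffle as random_shuffle
import matplotlib.pyplot as plt
from itertools import chain
from tqdm import tqdm
from tqdm.notebook import tqdm as tqdm_notebook
tqdm.pandas()  # registers Series.progress_map, used in the cells below
%matplotlib inline  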

seed = 42
random.seed(seed)
np.random.seed(seed)

from sklearn.feature_extraction.text import TfidfVectorizer

#NLP libraries
import spacy, nltk, gensim, sklearn
import pyLDAvis.gensim_models

from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from gensim.models import Phrases
from gensim import models
from gensim import corpora
from gensim.models import CoherenceModel
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')

## Load the dataset
We use quotes-2019.json.bz2 and load it in chunks of 1,000,000 rows, which yields 22 chunks. Each chunk is aggregated by date and saved to disk.
We then combine the 22 aggregated chunks into a single dataframe of shape 365 × 2. One column holds the date (by day); the other holds all quotations from that day, concatenated into a single string.
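
The two-stage aggregation described above can be sketched on toy data. This is a minimal illustration in plain Python, not the pandas code used in this notebook; the records and the helper names `aggregate_chunk` and `merge_partials` are hypothetical. It shows why a second aggregation is needed: a single day can be split across two chunks.

```python
from collections import defaultdict

# Hypothetical toy chunks: each chunk is a list of (date, quotation) records.
chunks = [
    [("2019-01-01", "Quote A."), ("2019-01-02", "Quote B.")],
    [("2019-01-02", "Quote C."), ("2019-01-03", "Quote D.")],
]

def aggregate_chunk(records):
    """Concatenate all quotations within one chunk that share a date."""
    per_day = defaultdict(list)
    for date, quotation in records:
        per_day[date].append(quotation)
    return {date: " ".join(quotes) for date, quotes in per_day.items()}

def merge_partials(partials):
    """Merge per-chunk results; a date split across chunks is joined again."""
    merged = defaultdict(list)
    for partial in partials:
        for date, text in partial.items():
            merged[date].append(text)
    return {date: " ".join(texts) for date, texts in sorted(merged.items())}

daily = merge_partials(aggregate_chunk(chunk) for chunk in chunks)
print(daily["2019-01-02"])  # Quote B. Quote C.
```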

In [None]:
filename = '/media/shanci/DataDisk/ada_quotebank/quotes-2019.json.bz2' 

def process_chunk(chunk, idx):
    chunk['date'] = chunk.date.apply(lambda x: x.date())  
    transformed = chunk.groupby('date')['quotation'].apply(lambda x:x.str.cat(sep=' ')).reset_index()
        
    transformed.to_json('/media/shanci/DataDisk/ada_quotebank/chunk_{}.json'.format(idx))

with pd.read_json(filename, lines=True, compression='bz2', chunksize=1000000) as df_reader:
    for idx, chunk in enumerate(df_reader):
        process_chunk(chunk, idx)

In [None]:
fileroot = '/media/shanci/DataDisk/ada_quotebank/' 
json_list = []
for i in range(22):
    # Avoid naming the variable `json`, which would shadow the standard-library module.
    chunk_df = pd.read_json(fileroot + "chunk_{}.json".format(i))
    json_list.append(chunk_df)
    

In [None]:
total_json = pd.concat(json_list)
data = total_json.groupby('date')['quotation'].apply(lambda x:x.str.cat(sep=' ')).reset_index()
data.head()

## Get Sentences
The first processing step is to split each day's quotations into individual sentences, using sent_tokenize from nltk.tokenize.

In [None]:
data['sentences'] = data.quotation.progress_map(sent_tokenize)
data['sentences'].head(1).tolist()[0][:3] # check the first three sentences of the first day

## Tokenization
Next we tokenize the sentences. Tokenization splits each sentence into individual words and punctuation marks, using word_tokenize from nltk.tokenize.
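
To illustrate what tokenization produces, here is a rough regex-based approximation. It is only a sketch: the helper `simple_tokenize` is hypothetical, and NLTK's word_tokenize applies more elaborate rules (for example, it handles contractions differently).

```python
import re

def simple_tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens_example = simple_tokenize("Hello, world! It works.")
print(tokens_example)  # ['Hello', ',', 'world', '!', 'It', 'works', '.']
```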

In [None]:
data['tokens_sentences'] = data['sentences'].progress_map(lambda sentences: [word_tokenize(sentence) for sentence in sentences])
print(data['tokens_sentences'].head(1).tolist()[0][:3]) # check the first three sentences of the first day

## Part of Speech Tagging and Lemmatization
Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. Unlike stemming, lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document.

For this reason, lemmatization is usually preceded by Part of Speech Tagging (POS tagging), which labels the words in a text according to their word types (noun, adjective, adverb, verb, etc.). POS tagging is a supervised learning task that uses features such as the previous word, the next word, and whether the first letter is capitalized. NLTK provides a function that assigns POS tags to tokenized text. The full list of tags is available at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.

NLTK also provides the WordNetLemmatizer class, a thin wrapper around the WordNet corpus.

In [None]:
data['POS_tokens'] = data['tokens_sentences'].progress_map(lambda tokens_sentences: [pos_tag(tokens) for tokens in tokens_sentences])
print(data['POS_tokens'].head(1).tolist()[0][:3])

In [None]:
def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

lemmatizer = WordNetLemmatizer()
data['tokens_sentences_lemmatized'] = data['POS_tokens'].progress_map(
    lambda list_tokens_POS: [
        [
            lemmatizer.lemmatize(el[0], get_wordnet_pos(el[1])) 
            if get_wordnet_pos(el[1]) != '' else el[0] for el in tokens_POS
        ] 
        for tokens_POS in list_tokens_POS
    ]
)

## Remove Stopwords
Stop words are a set of commonly used words in a language. Examples of stop words in English are “a”, “the”, “is”, and “are”. In Text Mining and Natural Language Processing (NLP), stop words are typically removed because they are so common that they carry very little useful information.

We use the stopword file from https://github.com/ahmedbesbes/How-to-mine-newsfeed-data-and-extract-interactive-insights-in-Python/blob/master/data/stopwords.txt, combined with the English stopwords from nltk.corpus. You can also add your own stopwords to the additional_stop_words list.
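
As a small illustration of the filtering step, the sketch below removes stop words, punctuation, and one-letter tokens from a toy token list. The stopword set and the tokens are made up for the example and are much smaller than the lists used in this notebook.

```python
# Hypothetical, tiny stopword set for illustration only.
toy_stop_words = {"a", "the", "is", "are", "of"}
toy_tokens = ["The", "price", "of", "oil", "is", "rising", "!", "a"]

# Keep alphabetic tokens longer than one character that are not stop words.
filtered = [
    token.lower()
    for token in toy_tokens
    if token.isalpha() and len(token) > 1 and token.lower() not in toy_stop_words
]
print(filtered)  # ['price', 'oil', 'rising']
```

Storing the stop words in a set rather than a list keeps each membership test fast, which matters when filtering millions of tokens.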

In [None]:
stop_words = []

# Read the custom stopword list, one word per line.
with open('stopwords.txt', 'r') as f:
    for line in f:
        stop_words.append(line.strip())
    
additional_stop_words = ['t', 'will']
# A set removes duplicates and makes membership tests fast.
stop_words = set(stop_words + additional_stop_words + stopwords.words('english'))
print("The total number of stopwords is {}".format(len(stop_words)))

In [None]:
# These helpers remove non-ASCII characters and standardize the text (can't -> cannot, i'm -> i am), so that tokenization is more consistent.
def _removeNonAscii(s): 
    return "".join(i for i in s if ord(i)<128)

def clean_text(text):
    text = text.lower()
    text = re.sub(r"what's", "what is ", text)
    text = text.replace('(ap)', '')
    text = re.sub(r"\'s", " is ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r'\W+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r"\\", "", text)
    text = re.sub(r"\'", "", text)    
    text = re.sub(r"\"", "", text)
    text = re.sub('[^a-zA-Z ?!]+', '', text)
    text = _removeNonAscii(text)
    text = text.strip()
    return text

In [None]:
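# Flatten the per-sentence token lists into one token list per day.
data['tokens'] = data['tokens_sentences_lemmatized'].map(lambda sentences: list(chain.from_iterable(sentences)))
# Keep lowercase alphabetic tokens that are not stopwords and are longer than one character.
data['tokens'] = data['tokens'].map(lambda tokens: [token.lower() for token in tokens if token.isalpha() 
                                                    and token.lower() not in stop_words and len(token)>1])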

## Phrase detection
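We need to automatically detect common phrases, such as multi-word expressions and word n-gram collocations, in the stream of tokens. Gensim's Phrases model does this: applying it once merges frequent bigrams into single tokens, and applying it a second time to the result can produce trigrams.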
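
The idea behind phrase detection can be sketched with a simple count threshold. This is a simplified stand-in, not Gensim's algorithm: Phrases additionally scores each candidate pair against a threshold, whereas the hypothetical `merge_frequent_bigrams` helper below merges any adjacent pair seen at least `min_count` times.

```python
from collections import Counter

# Hypothetical toy documents, already tokenized.
toy_docs = [
    ["climate", "change", "is", "real"],
    ["climate", "change", "policy"],
    ["policy", "change"],
]

def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs that occur at least min_count times."""
    counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, n in counts.items() if n >= min_count}
    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # skip both tokens of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs

merged_docs = merge_frequent_bigrams(toy_docs)
print(merged_docs[0])  # ['climate_change', 'is', 'real']
```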

In [None]:
tokens = data['tokens'].tolist()
bigram_model = Phrases(tokens)
trigram_model = Phrases(bigram_model[tokens], min_count=2)
tokens = list(trigram_model[bigram_model[tokens]])

## Create the dictionary
In Gensim, the Dictionary object maps each token to an integer id. It is used to create a bag-of-words (BoW) corpus, which then serves as the input to topic models and other models.

Corpus: a collection of documents, each represented as a bag of words (BoW).
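
To make the bag-of-words representation concrete, the sketch below builds a token-to-id mapping and converts toy documents into lists of (token id, count) pairs, which is the shape of output that Gensim's doc2bow produces. The names `token2id` and `to_bow` are hypothetical and the code does not use Gensim.

```python
from collections import Counter

# Hypothetical toy documents, already tokenized.
toy_docs = [["oil", "price", "oil"], ["price", "election"]]

# Assign an integer id to each distinct token, in order of first appearance.
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```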

Here we define four functions:
1. LDA_model returns a trained LDA model.
2. compute_coherence computes the coherence score used to evaluate an LDA model.
3. display_topics displays the topics and their corresponding keywords.
4. explore_models is a tuning function that trains LDA models with different numbers of topics.

In [None]:
dictionary_LDA = corpora.Dictionary(tokens)
dictionary_LDA.filter_extremes(no_below=2)
corpus = [dictionary_LDA.doc2bow(tok) for tok in tokens]

def LDA_model(num_topics, corpus, id2word, passes=1):
    # Use the id2word argument rather than the global dictionary.
    return gensim.models.ldamodel.LdaModel(corpus=corpus,
                                               id2word=id2word,
                                               num_topics=num_topics, 
                                               random_state=100,
                                               eval_every=10,
                                               chunksize=2000,
                                               passes=passes,
                                               per_word_topics=True
                                            )
def compute_coherence(model,tokens):
    coherence = CoherenceModel(model=model, 
                           texts=tokens,
                           dictionary=dictionary_LDA, coherence='c_v')
    return coherence.get_coherence()

def display_topics(model):
    topics = model.show_topics(num_topics=model.num_topics, formatted=False, num_words=10)
    # Keep only the keywords of each topic; build real lists so pandas gets one row per topic.
    topics = [[word for word, _ in words] for _, words in topics]
    df = pd.DataFrame(topics)
    df.index = ['topic_{0}'.format(i) for i in range(model.num_topics)]
    df.columns = ['keyword_{0}'.format(i) for i in range(1, 10+1)]
    return df

def explore_models(corpus,id2word,tokens, rg=range(5, 25)):
    models = []
    coherences = []
    
    for num_topics in rg:
        lda_model = LDA_model(num_topics,corpus, id2word, passes=5)
        models.append(lda_model)
        coherence = compute_coherence(lda_model,tokens)
        coherences.append(coherence)
      

    fig = plt.figure(figsize=(15, 5))
    plt.title('Choosing the optimal number of topics')
    plt.xlabel('Number of topics')
    plt.ylabel('Coherence')
    plt.grid(True)
    plt.plot(rg, coherences)
    
    return coherences, models


# Explore 5, 10, 15 and 20 topics; range(5, 5, 25) would be empty.
coherences, models = explore_models(corpus, dictionary_LDA, tokens, rg=range(5, 25, 5))

In [None]:
best_model = LDA_model(num_topics=40, corpus=corpus, id2word=dictionary_LDA, passes=5)

display_topics(model=best_model)

Now let's build a document/topic matrix. Cell (i, j) is the probability of topic j in document i.

In [None]:
def get_document_topic_matrix(corpus, num_topics=best_model.num_topics):
    matrix = []
    for row in tqdm_notebook(corpus):
        output = np.zeros(num_topics)
        # With per_word_topics=True the model returns a tuple; element 0 holds the (topic_id, probability) pairs.
        doc_proba = best_model[row][0]
        for topic_id, proba in doc_proba:
            output[topic_id] = proba
        matrix.append(output)
    matrix = np.array(matrix)
    return matrix

matrix = get_document_topic_matrix(corpus)

LDA outputs a distribution over topics for each document. We assume that a document's topic is the one with the highest probability.
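
The dominant-topic rule can be illustrated on a small, made-up document/topic matrix. The values below are hypothetical and the sketch uses plain Python lists rather than the NumPy matrix built in this notebook.

```python
# Hypothetical document/topic matrix: 3 documents (rows), 4 topics (columns).
toy_matrix = [
    [0.10, 0.70, 0.15, 0.05],
    [0.25, 0.25, 0.40, 0.10],
    [0.90, 0.05, 0.03, 0.02],
]

# For each document, pick the index of the topic with the highest probability.
dominant_topics = [max(range(len(row)), key=row.__getitem__) for row in toy_matrix]
print(dominant_topics)  # [1, 2, 0]
```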

In [None]:
from sklearn.manifold import TSNE

doc_topic = best_model.get_document_topics(corpus)
# Dominant topic of each document: the column with the highest probability.
lda_keys = matrix.argmax(axis=1).tolist()

run = False  # set to True to recompute the t-SNE projection instead of loading it
if run: 
    # Note: recent scikit-learn versions rename n_iter to max_iter.
    tsne_model = TSNE(n_components=2, verbose=1, random_state=0, n_iter=500)
    tsne_lda = tsne_model.fit_transform(matrix)
    lda_df = pd.DataFrame(tsne_lda, columns=['x', 'y'])
    lda_df['topic'] = lda_keys
    lda_df['topic'] = lda_df['topic'].map(str)
    # The dataframe only has 'date' and 'quotation' columns, so attach the date.
    lda_df['date'] = data['date'].values
    # lda_df.to_csv('./data/tsne_lda.csv', index=False, encoding='utf-8')
else:
    lda_df = pd.read_csv('./data/tsne_lda.csv')
    lda_df['topic'] = lda_df['topic'].map(str)


## NMF: Non-negative Matrix Factorization


In [None]:
from sklearn.decomposition import NMF

vectorizer = TfidfVectorizer(min_df=5, analyzer='word', ngram_range=(1, 2), stop_words='english')
vz = vectorizer.fit_transform(list(data['tokens'].map(lambda tokens: ' '.join(tokens))))

nmf = NMF(n_components=40, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(vz)

# get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names.
feature_names = vectorizer.get_feature_names_out()
no_top_words = 10

for topic_idx, topic in enumerate(nmf.components_[:10]):
    print("Topic %d:"% (topic_idx))
    print(" | ".join([feature_names[i]
                    for i in topic.argsort()[:-no_top_words - 1:-1]]))

## Visualization of LDA model

In [None]:
# lda_model only exists inside explore_models; visualize the selected model instead.
vis = pyLDAvis.gensim_models.prepare(topic_model=best_model, corpus=corpus, dictionary=dictionary_LDA)
pyLDAvis.enable_notebook()
pyLDAvis.display(vis)

## TFIDF
tf-idf stands for term frequency-inverse document frequency. It is a numerical statistic intended to reflect how important a word is to a document within a corpus (i.e. a collection of documents).

In this notebook, words correspond to tokens and documents correspond to the concatenated quotations of a single day. The corpus is therefore the collection of all days.
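
As a worked example, the sketch below computes tf-idf by hand on a toy corpus using the textbook formula tf × log(N / df). This is an assumption made for illustration: scikit-learn's TfidfVectorizer uses a smoothed idf and normalizes each row by default, so its numbers differ. The helper name `tfidf_scores` is hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
toy_docs = [["oil", "price", "oil"], ["price", "election"], ["election", "vote"]]
n_docs = len(toy_docs)

# Document frequency: number of documents in which each token appears.
doc_freq = Counter(token for doc in toy_docs for token in set(doc))

def tfidf_scores(doc):
    """Textbook tf-idf: relative term frequency times log(N / df)."""
    term_counts = Counter(doc)
    return {
        token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
        for token, count in term_counts.items()
    }

scores = tfidf_scores(toy_docs[0])
# "oil" is frequent in this document and absent elsewhere, so it outranks "price".
print(sorted(scores, key=scores.get, reverse=True))  # ['oil', 'price']
```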

In [None]:
vectorizer = TfidfVectorizer(min_df=5, analyzer='word', ngram_range=(1, 2), stop_words='english')
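# TfidfVectorizer expects one string per document, so join each token list first.
vz = vectorizer.fit_transform([' '.join(doc_tokens) for doc_tokens in tokens])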
print(vz.shape)

vz is a tf-idf matrix.

Its number of rows is the total number of documents (days).

Its number of columns is the total number of unique terms (unigrams and bigrams) that appear in at least 5 documents.

In [None]:
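# Map each term to its inverse document frequency (idf) weight.
# get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names.
tfidf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
tfidf = pd.DataFrame.from_dict(tfidf, orient='index', columns=['tfidf'])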

tfidf.tfidf.hist(bins=25, figsize=(15,7))

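## WordCloud to visualise words
We can use word clouds to display the terms with the lowest idf (common across days, hence less informative) and the terms with the highest idf (rare, hence more distinctive).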

In [None]:
from wordcloud import WordCloud

def plot_word_cloud(terms):
    text = terms.index
    text = ' '.join(list(text))
    # lower max_font_size
    wordcloud = WordCloud(max_font_size=40).generate(text)
    plt.figure(figsize=(25, 25))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

plot_word_cloud(tfidf.sort_values(by=['tfidf'], ascending=True).head(40))

In [None]:
plot_word_cloud(tfidf.sort_values(by=['tfidf'], ascending=False).head(40))