# Topic Modelling Analysis 

This notebook is part of a research project that uses topic modelling to filter out irrelevant hashtags accompanying Instagram images.

Author: Argyris Argyrou, Cyprus University of Technology

### Using CRISP-DM methodology

**Libraries**

In [None]:
import pandas as pd
import numpy as np
import gensim
import pyLDAvis
import re
import collections
import wordninja 

Libraries: For NLP, using spacy

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.lang.en import English

Libraries: NLTK (WordNet, used for lemmatization)

In [None]:
import nltk
nltk.download('wordnet')

In [None]:
from nltk.tokenize import RegexpTokenizer
from gensim import corpora, models
from pprint import pprint 
from gensim.models import CoherenceModel

**Dataset—corpus**

In [None]:
import glob
glob.glob("../dataset/*.xlsx")

### Pre-process and vectorize the documents (hashtags)
Among other things, we will:

- Split the documents into tokens.
- Lemmatize the tokens (e.g. *horses* becomes *horse*)
- Probabilistically split concatenated words (hashtags written without spaces)
- Remove stopwords
- Remove numbers, but not words that contain numbers
- Remove words shorter than three characters
- Compute a bag-of-words representation of the data
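
The filtering steps in this list can be sketched in isolation as follows. This is a minimal illustration, not the notebook's own normalizer: the function name `filter_tokens`, the sample tokens and the use of `str.isdigit` to detect purely numeric tokens are assumptions made for the example.

```python
def filter_tokens(tokens, stopwords, min_length=3):
    """Drop purely numeric tokens, short tokens and stopwords (illustrative helper)."""
    kept = []
    for token in tokens:
        if token.isdigit():            # "2019" is dropped, "mk2" is kept
            continue
        if len(token) < min_length:    # tokens shorter than three characters are dropped
            continue
        if token in stopwords:
            continue
        kept.append(token)
    return kept


# Made-up sample data for illustration
example_tokens = ["horse", "2019", "mk2", "of", "the", "rider", "4x4"]
example_stopwords = {"the", "of"}

filtered = filter_tokens(example_tokens, example_stopwords)
print(filtered)  # ['horse', 'mk2', 'rider', '4x4']
```

Tokens such as `mk2` and `4x4` survive because they contain letters as well as digits, which matches the rule "remove numbers, but not words that contain numbers".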


Dataset: Select Dataset

In [None]:
n_print = input("Enter the name of the dataset you would like to analyze, e.g. HORSE, CAR: ")

n_print = n_print.upper()
try:
    dataset = '../dataset/' + n_print + '.xlsx'
    df_dataset = pd.read_excel(dataset)
except IOError:
    print("Hm, we don't have this dataset. Please try another one.")

data = []
for index, row in df_dataset.iterrows():
    data.append(row["Hashtag"])
    
df_dataset = pd.DataFrame(data, columns = ['Hashtags'])

In [None]:
# df_dataset.head(10)

**Text Wrangling & Pre-processing**

cleaning: stopwords

In [None]:
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

cleaning: Instagram stopwords

In [None]:
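# stops.py is expected to define STOPS_LIST, the Instagram-specific stopwords
with open('../stops.py') as stops_file:
    exec(stops_file.read())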
df_instastop = pd.DataFrame(STOPS_LIST, columns = ['Insta stopwords (preview)'])
# df_instastop

cleaning: lemmatization

In [None]:
from nltk.corpus import wordnet as wn
def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

cleaning: split concatenated words (bi-grams)

In [None]:
def splitmultiplewords(mylist):
    """Split concatenated tokens into their component words using wordninja."""
    myninjalist = []
    for token in mylist:
        parts = wordninja.split(token)
        if len(parts) > 1:
            myninjalist.extend(parts)
        else:
            # Keep the original token when there is nothing to split
            myninjalist.append(token)
    return myninjalist
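
To make the idea of probabilistic splitting concrete, here is a toy sketch that picks the split with the lowest total cost, where a word's cost is its negative log-probability. It is not wordninja's implementation or dictionary: the vocabulary `TOY_FREQ`, its frequencies, the unknown-word penalty and the function names are all made up for illustration.

```python
import math

# Hypothetical toy vocabulary with made-up frequencies (assumption for this sketch)
TOY_FREQ = {"horse": 50, "riding": 30, "life": 40, "rid": 5, "in": 60}
TOTAL = sum(TOY_FREQ.values())


def word_cost(word):
    """Negative log-probability of a known word; unknown words get a length-based penalty."""
    if word in TOY_FREQ:
        return -math.log(TOY_FREQ[word] / TOTAL)
    return 10.0 * len(word)


def toy_split(text):
    """Return the lowest-cost split of text, found by dynamic programming."""
    # best[i] = (cost of the best split of text[:i], start index of its last word)
    best = [(0.0, 0)]
    for end in range(1, len(text) + 1):
        candidates = [
            (best[start][0] + word_cost(text[start:end]), start)
            for start in range(end)
        ]
        best.append(min(candidates))

    # Walk back through the stored start indices to recover the words
    words = []
    end = len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return list(reversed(words))


print(toy_split("horseriding"))  # ['horse', 'riding']
print(toy_split("horselife"))    # ['horse', 'life']
```

Frequent words are cheap, so a split into a few common words beats a split into many rare fragments.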

cleaning: Building a Text Normalizer

In [None]:
tokenizer = RegexpTokenizer(r'\w+')
def prepare_text_for_lda(text):
    # Split on non-word characters
    tokens = tokenizer.tokenize(text)
    # Split concatenated words
    tokens = splitmultiplewords(tokens)
    # Drop tokens shorter than three characters, then English stopwords
    tokens = [token for token in tokens if len(token) > 2]
    tokens = [token for token in tokens if token not in en_stop]
    # Lemmatize, then drop Instagram-specific stopwords
    tokens = [get_lemma(token) for token in tokens]
    tokens = [token for token in tokens if token not in STOPS_LIST]
    return tokens

**corpus**: prepare corpus

In [None]:
#df_dataset.head(10)

In [None]:
docs = []  

for index, row in df_dataset.iterrows():
    line = row["Hashtags"].lower()
    tokens = prepare_text_for_lda(line)      
    docs.append(list(set(tokens)))

In [None]:
df = pd.DataFrame(docs)
#df

Dictionary: We create a dictionary from the data, then convert it to a bag-of-words corpus.

In [None]:
from gensim.corpora import Dictionary
dictionary = Dictionary(docs)

In [None]:
# Print the token -> id mapping (pprint sorts the keys alphabetically)
pprint(dictionary.token2id)

## Topic Modeling, LDA
### Vectorize data
Finally, we transform the documents to a vectorized form. We simply compute the frequency of each word.

Bag-of-words representation of the documents.

In [None]:
corpus = [dictionary.doc2bow(doc) for doc in docs]

The function `doc2bow()` simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a bag-of-words, a sparse vector of the form `[(word_id, word_count), ...]`.
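
The counting step can be illustrated without gensim. The sketch below mimics the bag-of-words output format using only the standard library; `toy_doc2bow` and the sample mapping are hypothetical names and data, not part of gensim.

```python
from collections import Counter


def toy_doc2bow(doc, token2id):
    """Return sorted (word_id, word_count) pairs for the tokens present in token2id."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


# Hypothetical token -> id mapping and document
toy_token2id = {"horse": 0, "life": 1, "rider": 2}
toy_doc = ["horse", "rider", "horse", "saddle"]

toy_bow = toy_doc2bow(toy_doc, toy_token2id)
print(toy_bow)  # [(0, 2), (2, 1)]
```

The token `saddle` is skipped because it is not in the mapping. In this notebook each document is built with `list(set(tokens))`, so every token appears at most once per document and every count in the resulting corpus is 1.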

In [None]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

### Discover topics

Build the LDA model

In [None]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 10, id2word=dictionary, passes=20)
topics = ldamodel.print_topics(num_words=5)
for topic in topics:
    print(topic)


In [None]:
num_topics = 10
num_words = 10

# show_topics(formatted=False) yields (topic_id, [(word, weight), ...]),
# so the words can be read directly instead of parsing the formatted string
Topic_words = []
for topic_id, word_weights in ldamodel.show_topics(num_topics=num_topics, num_words=num_words, formatted=False):
    words = [word for word, _ in word_weights]
    Topic_words.append(words)
    print('Topic ' + str(topic_id) + ': ' + ' '.join(words))

Evaluation: **Coherence Score**, a measure of how semantically consistent the top words of each topic are. The higher, the better.

In [None]:
# Create Dictionary
id2word = corpora.Dictionary(docs)

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=ldamodel, texts=docs, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()

print('\nCoherence Score: ', coherence_lda)
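
For intuition about what a coherence measure rewards, the sketch below computes a simplified UMass-style score from document co-occurrence counts. This is a different and much simpler measure than the `c_v` coherence used above, shown only as an illustration. The function name and the toy documents are assumptions.

```python
import math
from itertools import combinations


def umass_style_coherence(top_words, documents):
    """Sum of log((D(w_i, w_j) + 1) / D(w_i)) over word pairs, with w_i ranked before w_j."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words
        return sum(1 for doc in doc_sets if all(word in doc for word in words))

    score = 0.0
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score


# Made-up documents for illustration
toy_docs = [
    ["horse", "rider"],
    ["horse", "rider", "saddle"],
    ["horse", "life"],
    ["car", "engine"],
]

coherent = umass_style_coherence(["horse", "rider"], toy_docs)
incoherent = umass_style_coherence(["horse", "car"], toy_docs)
print(coherent > incoherent)  # True
```

Words that often appear in the same documents score higher than words that never co-occur, which is the behaviour a coherent topic should show. The sketch assumes every word in `top_words` occurs in at least one document.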

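Evaluation: **Perplexity**, a measure of how well the model predicts the corpus. The lower, the better. Note that gensim's `log_perplexity` returns the per-word likelihood bound rather than the perplexity itself, so for the value printed below, closer to zero is better.

In [None]:
print('\nPer-word likelihood bound: ', ldamodel.log_perplexity(corpus))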
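
gensim's documentation describes perplexity as `2^(-bound)`, where `bound` is the value returned by `log_perplexity`. A minimal sketch of that conversion follows; the helper name and the example value are made up for illustration.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound into perplexity, 2 ** (-bound)."""
    return 2 ** (-per_word_bound)


example_bound = -8.0  # made-up value, not a result from this notebook
print(bound_to_perplexity(example_bound))  # 256.0
```

A bound of -8.0 therefore corresponds to a perplexity of 256, and a less negative bound gives a lower perplexity.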

#### LDAvis

LDAVis is designed to help us interpret topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

In [None]:
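import warnings
warnings.simplefilter("ignore", DeprecationWarning)
import pyLDAvis.gensim  # newer pyLDAvis releases name this module pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

# Note: this refits the model, so the topics shown may differ from those evaluated above
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=20)
pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)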

- Saliency: a measure of how much a term tells you about the topics overall, weighted by how frequent the term is.
- Relevance: a weighted average of the log probability of the word given the topic and the log of that probability divided by the word's overall probability in the corpus (its lift).
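
Relevance is usually written as `λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w))`, where `λ` is the weight set by the slider in the visualization. The sketch below evaluates this formula for two made-up terms; the function name and the probabilities are assumptions for illustration, not values from this model.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """Relevance of a term for a topic: lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(p_word_given_topic / p_word)


# Made-up probabilities: a frequent generic term and a rarer topic-specific term
generic = {"p_word_given_topic": 0.05, "p_word": 0.05}    # lift = 1
specific = {"p_word_given_topic": 0.02, "p_word": 0.002}  # lift = 10

for lam in (1.0, 0.0):
    print(lam, relevance(lam=lam, **generic), relevance(lam=lam, **specific))
```

With `lam = 1.0` the generic term ranks higher because only the within-topic probability counts. With `lam = 0.0` the topic-specific term ranks higher because only the lift counts.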

The size of each bubble indicates how prevalent the topic is in the data.

First, we look at the most salient terms, that is, the terms that tell us the most about the topics. In our case these are **horse**, **life** and **rider**. With 5 or 10 topics, some topics cluster together, which indicates that they are similar.