# Topic modeling first part
This notebook contains the implementation for the following:
* Load company data into DataFrame
* Do feature preparation that includes the following:
    - Remove tokens with 3 or fewer characters
    - Remove stop words
    - Lemmatize words: words in third person are changed to first person, and verbs in past and future tenses are changed to present.
    - Stem words: words are reduced to their root form.
* Use LDA to compute the most representative topics


# Reference
[topic model machinelearningplus](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#4whatdoesldado)

[topic model towardsdatascience](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24)

## Step 1: Load data and necessary libraries and corpus

In [11]:

import pandas as pd

df = pd.read_csv("Sample 2001_2011/2001-2011-Table 1.csv")

In [3]:
df[['Company','Overview Text']].head()

Unnamed: 0,Company,Overview Text
0,Riverstone Networks,We are a leading provider of Internet infrastr...
1,Instinet Group LLC,We are the world's largest electronic agency s...
2,Alliance Data Systems,We are a leading provider of transaction servi...
3,Simplex Solutions,We provide software and services for integrate...
4,BAM! Entertainment,We are a rapidly emerging developer and publis...


In [35]:
df.reset_index()[['index', 'Company']].to_csv("company_lookup.csv", index=False)

In [12]:
import gensim
from gensim.models import CoherenceModel
from gensim import corpora, models
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(2018)
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/bohaocao/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Step 2: Data preprocessing

In [13]:
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()  # create once instead of on every call

def lemmatize_stemming(text):
    """
    Lemmatize the token as a verb (third-person, past and future forms are mapped to the base form),
    then stem it so that it is reduced to its root form.
    """
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    """
    Tokens with 3 or fewer characters are removed.
    All stopwords are removed.
    The remaining tokens are lemmatized and stemmed.
    """
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [14]:
processed_docs = df['Overview Text'].map(preprocess)

In [7]:
# processed_docs holds the preprocessed "features": each company overview text is now a list of processed tokens.
processed_docs

0      [lead, provid, internet, infrastructur, equip,...
1      [world, largest, electron, agenc, secur, broke...
2      [lead, provid, transact, servic, credit, servi...
3      [provid, softwar, servic, integr, circuit, des...
4      [rapid, emerg, develop, publish, interact, ent...
                             ...                        
117    [provid, home, improv, inform, servic, interne...
118    [oper, consum, drive, solut, member, research,...
119    [jive, mission, chang, work, get, believ, soci...
120    [boingo, make, simpl, connect, mobil, internet...
121    [lead, provid, carrier, neutral, coloc, data, ...
Name: Overview Text, Length: 122, dtype: object

In [15]:
"""
Create a dictionary from ‘processed_docs’ containing the preprocessed words and their ids.
"""
dictionary = gensim.corpora.Dictionary(processed_docs)
len(dictionary)

3215

In [16]:
"""
Filter out tokens that appear in:
fewer than 5 documents (absolute number) or
more than 50% of documents (fraction of total corpus size, not absolute number).
After these two steps, keep only the 100000 most frequent tokens.
"""

dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)
len(dictionary)

888
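
To make the filtering rule concrete, here is a minimal pure-Python sketch of the same idea on a toy corpus. It is an illustration of the rule described in the docstring, not gensim's implementation; `filter_tokens` and `toy_docs` are hypothetical names introduced only for this example.

```python
from collections import Counter

def filter_tokens(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the tokens that survive a document-frequency filter (illustrative only)."""
    # Document frequency: the number of documents each token appears in.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep at most keep_n tokens, most frequent first (ties broken alphabetically).
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]

toy_docs = [
    ["provid", "softwar", "servic"],
    ["provid", "internet", "servic"],
    ["provid", "data", "softwar"],
    ["provid", "mobil", "internet"],
]
# "provid" is in every document (too common); "data" and "mobil" are in one (too rare).
surviving = filter_tokens(toy_docs, no_below=2, no_above=0.5)
print(surviving)
```

In the notebook itself, the same kind of rule shrinks the dictionary from 3215 to 888 tokens.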

In [17]:
"""
For each document, create a bag-of-words representation: a list of (token id, count) pairs
recording which dictionary tokens appear and how many times. Save this to ‘bow_corpus’.
"""
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
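
As a rough picture of what a bag-of-words entry looks like, the sketch below counts tokens against a small hand-made vocabulary. `toy_doc2bow` and `token2id` are hypothetical stand-ins for illustration; the real `doc2bow` should be used in the notebook.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs (illustrative only)."""
    # Tokens missing from the vocabulary (for example, filtered out earlier) are ignored.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

token2id = {"internet": 0, "servic": 1, "softwar": 2}
bow = toy_doc2bow(["provid", "servic", "internet", "servic"], token2id)
print(bow)
```

Each document becomes a short list of pairs, so word order is discarded and only counts remain.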

In [9]:
tfidf = models.TfidfModel(bow_corpus)
tfidf_corpus = tfidf[bow_corpus]
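
The sketch below shows one common TF-IDF weighting in plain Python: term count times log2(N / document frequency), followed by scaling each document to unit length. This is an assumption about the weighting scheme made for illustration; `TfidfModel` has several configurable schemes, so check the gensim documentation for the version in use. `toy_tfidf` and `toy_bow` are hypothetical names.

```python
import math

def toy_tfidf(bow_docs):
    """Weight each (token_id, count) pair by count * log2(N / df), then L2-normalize."""
    n_docs = len(bow_docs)
    doc_freq = {}
    for doc in bow_docs:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    weighted_docs = []
    for doc in bow_docs:
        weights = [(tid, cnt * math.log2(n_docs / doc_freq[tid])) for tid, cnt in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        # A document made only of tokens present everywhere has all-zero weights.
        weighted_docs.append([(tid, w / norm) for tid, w in weights] if norm else [])
    return weighted_docs

toy_bow = [[(0, 1), (1, 2)], [(0, 1), (2, 1)], [(0, 3)]]
toy_weighted = toy_tfidf(toy_bow)
print(toy_weighted)
```

The weights are no longer integer counts, which is worth keeping in mind when comparing metrics between the bag-of-words and TF-IDF runs.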

In [10]:
def create_model_and_report_metrics(corpus, is_tfidf, num_topics, texts, id2word):
    """
    Create an LDA model and report its coherence score and per-word likelihood bound.
    If corpus is a bag-of-words corpus, set is_tfidf = False.
    If corpus has been transformed with TF-IDF, set is_tfidf = True.
    """

    model = gensim.models.LdaMulticore(corpus, num_topics=num_topics, id2word=id2word, passes=2, workers=4)

    # Compute the coherence score (higher is better).
    # Use the id2word argument rather than the global dictionary.
    coherence_model_lda = CoherenceModel(model=model, texts=texts, dictionary=id2word, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    print(f"model artifacts: num_topics:{num_topics}, is_tfidf:{is_tfidf}")
    print('Coherence Score for: ', coherence_lda)

    # log_perplexity returns a per-word likelihood bound, not the perplexity itself.
    # Perplexity is 2 ** (-bound), so a bound closer to zero indicates a better fit.
    print(f'Perplexity: {model.log_perplexity(corpus)}')

    return model

In [13]:
num_topics_choices = [5,10,15,20,25,30]
for num_topics in num_topics_choices:
    create_model_and_report_metrics(bow_corpus, False, num_topics, processed_docs, dictionary)

model artifacts: num_topics:5, is_tfidf:False
Coherence Score for:  0.26322951992943405
Perplexity: -6.489503261554325
model artifacts: num_topics:10, is_tfidf:False
Coherence Score for:  0.314644582351184
Perplexity: -6.618414945326977
model artifacts: num_topics:15, is_tfidf:False
Coherence Score for:  0.28617206875863677
Perplexity: -6.761746520052021
model artifacts: num_topics:20, is_tfidf:False
Coherence Score for:  0.28637884310156664
Perplexity: -6.93826133993981
model artifacts: num_topics:25, is_tfidf:False
Coherence Score for:  0.27679669011219155
Perplexity: -7.0510043628415096
model artifacts: num_topics:30, is_tfidf:False
Coherence Score for:  0.3001195510235722
Perplexity: -7.184997439144871
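
The values printed as "Perplexity" come from `log_perplexity`, which returns a per-word likelihood bound. Assuming the usual conversion perplexity = 2^(-bound), the sketch below converts the bag-of-words numbers printed above; the `bounds` dictionary simply copies those printed values.

```python
# Per-word bounds copied from the bag-of-words run above (num_topics -> bound).
bounds = {
    5: -6.489503261554325,
    10: -6.618414945326977,
    15: -6.761746520052021,
    20: -6.93826133993981,
    25: -7.0510043628415096,
    30: -7.184997439144871,
}

# Convert each bound to a perplexity estimate; lower perplexity is better.
perplexities = {k: 2 ** (-bound) for k, bound in bounds.items()}
for k, p in perplexities.items():
    print(f"num_topics={k}: perplexity estimate = {p:.1f}")
```

Under this conversion the perplexity estimate grows as topics are added, and since it is measured on the training corpus it is best read alongside the coherence score rather than on its own.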


## Coherence is highest at 10 and 30 topics
We hold off on 30 topics because the result is too fragmented.
10 topics is a good candidate for the number of topics.

In [14]:
num_topics_choices = [5,10,15,20,25, 30]
for num_topics in num_topics_choices:
    create_model_and_report_metrics(tfidf_corpus, True, num_topics, processed_docs, dictionary)

model artifacts: num_topics:5, is_tfidf:True
Coherence Score for:  0.31184845526771676
Perplexity: -9.020624400098527
model artifacts: num_topics:10, is_tfidf:True
Coherence Score for:  0.3198289966666304
Perplexity: -10.891994555185837
model artifacts: num_topics:15, is_tfidf:True
Coherence Score for:  0.3162272119929373
Perplexity: -12.493137176564023
model artifacts: num_topics:20, is_tfidf:True
Coherence Score for:  0.3016396412496134
Perplexity: -13.764095644214285
model artifacts: num_topics:25, is_tfidf:True
Coherence Score for:  0.29040513651075045
Perplexity: -14.52806967155888
model artifacts: num_topics:30, is_tfidf:True
Coherence Score for:  0.287488957952191
Perplexity: -15.260933775997827


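With TF-IDF, the coherence score is fairly stable across topic counts. The value reported by `log_perplexity` keeps decreasing as topics are added; because it is a per-word likelihood bound (perplexity is 2^(-bound)), a more negative value means higher perplexity, not a better fit.

Pick *10* topics for both models.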

In [11]:
corpus_25 = create_model_and_report_metrics(bow_corpus, False, 10, processed_docs, dictionary)

model artifacts: num_topics:10, is_tfidf:False
Coherence Score for:  0.3012992056288897
Perplexity: -6.630464725838595


In [12]:
corpus_tfidf_25 = create_model_and_report_metrics(tfidf_corpus, True, 10, processed_docs, dictionary)

model artifacts: num_topics:10, is_tfidf:True
Coherence Score for:  0.2979577975563318
Perplexity: -10.796514052033865


## Step 3: Visualize the topics

A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant.


In [13]:
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(corpus_25, bow_corpus, dictionary)
vis

In [14]:
vis = pyLDAvis.gensim.prepare(corpus_tfidf_25, tfidf_corpus, dictionary)
vis

The screenshots folder shows visualizations of the intertopic distance map for the 10-topic model with and without TF-IDF.

The topics of the plain (bag-of-words) 10-topic model are more evenly spread, so this is the model chosen for the following steps.

## Print topics for the 10-topic model


In [15]:
corpus_25.print_topics()

[(0,
  '0.019*"email" + 0.018*"advertis" + 0.017*"home" + 0.016*"improv" + 0.013*"project" + 0.011*"user" + 0.008*"content" + 0.007*"supplier" + 0.007*"target" + 0.006*"design"'),
 (1,
  '0.025*"trade" + 0.013*"price" + 0.013*"client" + 0.012*"consum" + 0.011*"inform" + 0.010*"platform" + 0.010*"retail" + 0.009*"secur" + 0.009*"broker" + 0.009*"dealer"'),
 (2,
  '0.018*"print" + 0.015*"design" + 0.014*"process" + 0.012*"supplier" + 0.010*"order" + 0.009*"purchas" + 0.009*"inform" + 0.009*"tradit" + 0.009*"system" + 0.008*"improv"'),
 (3,
  '0.016*"septemb" + 0.014*"organ" + 0.014*"trade" + 0.010*"platform" + 0.010*"game" + 0.010*"perform" + 0.008*"decemb" + 0.008*"incom" + 0.008*"social" + 0.008*"enterpris"'),
 (4,
  '0.016*"data" + 0.013*"design" + 0.011*"purchas" + 0.011*"integr" + 0.009*"channel" + 0.009*"sell" + 0.008*"tradit" + 0.008*"interact" + 0.008*"process" + 0.007*"enterpris"'),
 (5,
  '0.020*"inform" + 0.012*"financi" + 0.012*"event" + 0.012*"advertis" + 0.012*"health" + 0.

In [78]:
corpus_25.show_topics()

[(0,
  '0.019*"email" + 0.018*"advertis" + 0.017*"home" + 0.016*"improv" + 0.013*"project" + 0.011*"user" + 0.008*"content" + 0.007*"supplier" + 0.007*"target" + 0.006*"design"'),
 (1,
  '0.025*"trade" + 0.013*"price" + 0.013*"client" + 0.012*"consum" + 0.011*"inform" + 0.010*"platform" + 0.010*"retail" + 0.009*"secur" + 0.009*"broker" + 0.009*"dealer"'),
 (2,
  '0.018*"print" + 0.015*"design" + 0.014*"process" + 0.012*"supplier" + 0.010*"order" + 0.009*"purchas" + 0.009*"inform" + 0.009*"tradit" + 0.009*"system" + 0.008*"improv"'),
 (3,
  '0.016*"septemb" + 0.014*"organ" + 0.014*"trade" + 0.010*"platform" + 0.010*"game" + 0.010*"perform" + 0.008*"decemb" + 0.008*"incom" + 0.008*"social" + 0.008*"enterpris"'),
 (4,
  '0.016*"data" + 0.013*"design" + 0.011*"purchas" + 0.011*"integr" + 0.009*"channel" + 0.009*"sell" + 0.008*"tradit" + 0.008*"interact" + 0.008*"process" + 0.007*"enterpris"'),
 (5,
  '0.020*"inform" + 0.012*"financi" + 0.012*"event" + 0.012*"advertis" + 0.012*"health" + 0.

In [24]:
lda_model = corpus_25
for index, score in sorted(lda_model[bow_corpus[2]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 25)))


Score: 0.5242593288421631	 
Topic: 0.025*"trade" + 0.013*"price" + 0.013*"client" + 0.012*"consum" + 0.011*"inform" + 0.010*"platform" + 0.010*"retail" + 0.009*"secur" + 0.009*"broker" + 0.009*"dealer" + 0.008*"process" + 0.007*"advertis" + 0.006*"travel" + 0.006*"septemb" + 0.006*"data" + 0.006*"incom" + 0.006*"total" + 0.006*"volum" + 0.006*"investor" + 0.006*"organ" + 0.005*"result" + 0.005*"state" + 0.005*"institut" + 0.005*"billion" + 0.005*"electron"

Score: 0.45973554253578186	 
Topic: 0.016*"payment" + 0.011*"transact" + 0.010*"account" + 0.010*"capit" + 0.009*"merchant" + 0.008*"financi" + 0.007*"gain" + 0.007*"advertis" + 0.007*"research" + 0.006*"card" + 0.006*"total" + 0.006*"trade" + 0.006*"repres" + 0.006*"growth" + 0.006*"septemb" + 0.006*"marketplac" + 0.006*"inform" + 0.006*"host" + 0.006*"retail" + 0.006*"experi" + 0.006*"invest" + 0.006*"global" + 0.005*"search" + 0.005*"volum" + 0.005*"march"


In [26]:
s = """
0.025*"trade" + 0.013*"price" + 0.013*"client" + 0.012*"consum" + 0.011*"inform" + 0.010*"platform" + 0.010*"retail" + 0.009*"secur" + 0.009*"broker" + 0.009*"dealer" + 0.008*"process" + 0.007*"advertis" + 0.006*"travel" + 0.006*"septemb" + 0.006*"data" + 0.006*"incom" + 0.006*"total" + 0.006*"volum" + 0.006*"investor" + 0.006*"organ" + 0.005*"result" + 0.005*"state" + 0.005*"institut" + 0.005*"billion" + 0.005*"electron"
"""
#[*s.split("+")][0].split("*")[1].strip()
len([*s.split("+")])

25
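
Splitting on `+` and `*` works for the strings above. A slightly more defensive alternative is a regular expression, sketched below; `parse_topic_string` is a hypothetical helper, not part of gensim. gensim's `show_topic(topicid, topn)` should also return (word, probability) pairs directly, which avoids string parsing altogether.

```python
import re

def parse_topic_string(topic_str):
    """Turn '0.025*"trade" + 0.013*"price"' into [('trade', 0.025), ('price', 0.013)]."""
    # Each term has the form <weight>*"<word>"; capture both parts.
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]

sample = '0.025*"trade" + 0.013*"price" + 0.013*"client"'
parsed = parse_topic_string(sample)
print(parsed)
```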

## Output topic

In [72]:
# bow_corpus
# corpus_tfidf

def output_topics(model, is_tfidf, num_topics):
    """
    For each document, take its highest-scoring topic and record that topic's top words and weights.
    Note: num_topics is passed to print_topic as the number of top words to return.
    """
    corpus = tfidf_corpus if is_tfidf else bow_corpus
    d = {"index": [], "score":[], "topic":[], "weight": []}
    for item_index, item in enumerate(corpus):
        for index, score in sorted(model[item], key=lambda tup: -1*tup[1]):
            # topic_str looks like '0.025*"trade" + 0.013*"price" + ...'
            topic_str = model.print_topic(index, num_topics)
            for topic_weight_str in topic_str.split("+"):
                weight, topic = topic_weight_str.split("*")
                d['index'].append(item_index)
                d['score'].append(score)
                d['topic'].append(topic.strip().replace('"', ''))
                d['weight'].append(float(weight.strip()))
            # Only the top-scoring topic is needed, so stop after the first one.
            break
    return d

In [71]:
#corpus_25
#sorted(model[item], key=lambda tup: -1*tup[1])
corpus_25[bow_corpus[0]]

[(8, 0.9867613)]

In [73]:
d = output_topics(corpus_25, False, 25)
corpus_25_result = pd.DataFrame(d)
corpus_25_result

Unnamed: 0,index,score,topic,weight
0,0,0.986761,mobil,0.019
1,0,0.986761,secur,0.015
2,0,0.986761,enterpris,0.014
3,0,0.986761,wireless,0.014
4,0,0.986761,data,0.011
...,...,...,...,...
3045,121,0.746249,program,0.006
3046,121,0.746249,enterpris,0.006
3047,121,0.746249,agreement,0.006
3048,121,0.746249,entertain,0.006


In [74]:
def assign_col_name(row):
    row_idx = row.name
    suffix = str(row_idx%25)
    return f"topic_{suffix}"
corpus_25_result['column_name'] = corpus_25_result.apply(lambda r: assign_col_name(r), axis=1)

In [75]:
corpus_25_result_pivoted = corpus_25_result.pivot(index="index", columns="column_name", values=['topic','score','weight'])

In [77]:
corpus_25_result_pivoted.to_excel("25_topic_pivoted.xlsx")

In [76]:
corpus_25_result.to_excel("25_topic.xlsx")

In [36]:
corpus_25_result[corpus_25_result['index']==122].shape

(0, 4)

In [19]:
d = output_topics(corpus_25, True, 25)
model_10_result = pd.DataFrame(d)
model_10_result

In [162]:
len(bow_corpus)

122

## Model IO

In [5]:
# save model
from gensim import models

def save_model(model, path_="models/model"):
    # Save the model to disk at the given path.
    # (datapath from gensim.test.utils points into gensim's own test data folder, so it is not used here.)
    model.save(path_)

def load_model(path_="models/lda_25_topics"):
    # Load the model from disk.
    model = models.ldamodel.LdaModel.load(path_)
    return model

In [176]:
save_model(corpus_25, "/Users/bohaocao/Codebase/uw-lda/models/lda_25_topics")

In [7]:
corpus_25= load_model("/Users/bohaocao/Codebase/uw-lda/models/lda_25_topics")

In [10]:
len(corpus_25.id2word)

888

## Phase 2: text similarity 


### Reference
[LDA + document similarity on Kaggle](https://www.kaggle.com/code/ktattan/lda-and-document-similarity/notebook)

[Different similarity search approaches](https://medium.com/@adriensieg/text-similarities-da019229c894)

In [22]:
def create_document_topic_distribution_matrix(model, corpus, num_topics=25):
    """
    Create a 2D matrix M with dimensions num_documents x num_topics.
    M(i, j) is the weight of topic j for document i.
    """
    topic_distribution = []
    for document_topics in model.get_document_topics(bow=corpus):
        document_topic_dist = [0]*num_topics
        for topic_id, topic_weight in document_topics:
            document_topic_dist[topic_id] = topic_weight
        topic_distribution.append(document_topic_dist)

    return topic_distribution


from scipy.spatial import distance

def perform_jensens_hannon_similarity_search(topic_distribution_list, k=5):
    """
    Perform a brute-force similarity search over all documents in topic_distribution_list
    using the Jensen-Shannon distance, and keep the k+1 closest documents (typically including the query document itself).
    """
    num_documents = len(topic_distribution_list) 
    result = []
    for query_document_id in range(num_documents):
        distances = []
        for search_document_id in range(num_documents):
            distance_ = distance.jensenshannon(
                topic_distribution_list[query_document_id],
                topic_distribution_list[search_document_id]
            )
            distances.append(distance_)
        result.append(np.array(distances).argsort()[:k+1])
    
    return result

def perform_jensens_hannon_similarity_search_all_documents(topic_distribution_list):
    """
    Perform a brute-force similarity search over all documents in topic_distribution_list
    using the Jensen-Shannon distance. Returns the document ids sorted by distance, and the raw distances.
    """
    num_documents = len(topic_distribution_list) 
    result = []
    result_dist = []
    for query_document_id in range(num_documents):
        distances = []
        for search_document_id in range(num_documents):
            distance_ = distance.jensenshannon(
                topic_distribution_list[query_document_id],
                topic_distribution_list[search_document_id]
            )
            distances.append(distance_)
        result_dist.append(np.array(distances))
        result.append(np.array(distances).argsort())
    
    return result, result_dist
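
For reference, here is a minimal pure-Python sketch of the Jensen-Shannon distance, assuming natural logarithms and inputs that are normalized to sum to 1 (which, to the best of my knowledge, matches the defaults of `scipy.spatial.distance.jensenshannon`). `js_distance` is a hypothetical helper for illustration only.

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance (natural log) between two non-negative vectors."""
    # Normalize so that each vector sums to 1.
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # The mixture distribution halfway between p and q.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the divergence.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, divergence))

same = js_distance([0.0, 1.0, 0.0], [0.0, 1.0, 0.0])
disjoint = js_distance([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
print(same, disjoint, math.sqrt(math.log(2)))
```

Two documents that share no topic reach the maximum distance sqrt(ln 2), about 0.8326. This would explain the many 0.832555 entries in the distance output later in this notebook.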

In [18]:
topic_dist = create_document_topic_distribution_matrix(corpus_25, bow_corpus, 25)

In [19]:
len(topic_dist), len(topic_dist[0]), topic_dist[0]

(122,
 25,
 [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0.9867613,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0])
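
The row above has a single non-zero entry because only topics above the model's minimum probability threshold are returned for a document. The sketch below repeats the dense expansion on that one sparse pair; `to_dense` is a hypothetical helper that mirrors the matrix function above.

```python
def to_dense(sparse_topics, num_topics):
    """Expand [(topic_id, probability), ...] into a fixed-length list of probabilities."""
    dense = [0.0] * num_topics
    for topic_id, probability in sparse_topics:
        dense[topic_id] = probability
    return dense

# The sparse pair is copied from the notebook output for document 0.
dense_row = to_dense([(8, 0.9867613)], num_topics=25)
print(dense_row[8], sum(dense_row))
```

Because the omitted topics are set to 0, a row does not sum to exactly 1. The Jensen-Shannon routine in SciPy should renormalize its inputs, so this mainly matters when the rows are used elsewhere.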

In [23]:
result, dist = perform_jensens_hannon_similarity_search_all_documents(topic_dist)

In [24]:
result_df = pd.DataFrame(result)
result_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,112,113,114,115,116,117,118,119,120,121
0,0,85,74,57,56,11,36,92,45,93,...,9,55,54,61,20,28,59,100,97,26
1,1,2,29,64,62,67,98,43,18,28,...,56,55,51,50,49,48,100,66,59,20
2,2,98,101,1,97,64,29,62,67,28,...,52,51,50,49,48,55,59,66,100,20
3,46,3,89,80,77,96,55,100,37,115,...,31,68,67,65,71,0,92,106,66,40
4,4,89,77,55,96,46,80,3,100,37,...,68,67,66,65,64,63,62,61,22,0
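
In the output above, row 3 starts with document 46 rather than 3, which suggests a tie at distance 0 (an inference from the output, not something the notebook states). If the query document must never appear among its own neighbors, it can be excluded explicitly instead of relying on sort order. `nearest_neighbors` and `toy_distances` are hypothetical names for this sketch.

```python
def nearest_neighbors(distances, query_id, k=5):
    """Return the k closest document ids to query_id, never including query_id itself."""
    candidates = [i for i in range(len(distances)) if i != query_id]
    # Sort by distance, breaking ties by document id for a deterministic order.
    candidates.sort(key=lambda i: (distances[i], i))
    return candidates[:k]

# Distances from document 2 to documents 0..4; document 0 ties with the query at 0.0.
toy_distances = [0.0, 0.4, 0.0, 0.8, 0.2]
neighbors = nearest_neighbors(toy_distances, query_id=2, k=3)
print(neighbors)
```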


In [31]:
result_df.to_csv("search_result.csv",index=False)

In [26]:
result_dist_df = pd.DataFrame(dist)
result_dist_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,112,113,114,115,116,117,118,119,120,121
0,0.0,0.832555,0.832555,0.832555,0.832555,0.832555,0.832555,0.832555,0.832555,0.832555,...,0.832555,0.832555,0.832555,0.832555,0.832555,0.832555,0.832555,0.733648,0.432129,0.787225
1,0.832555,0.0,0.253493,0.832555,0.832555,0.832555,0.579266,0.646936,0.707398,0.832555,...,0.832555,0.832555,0.660942,0.832555,0.734563,0.832555,0.65873,0.747308,0.700421,0.832555
2,0.832555,0.253493,0.0,0.832555,0.832555,0.832555,0.614416,0.483917,0.62888,0.832555,...,0.832555,0.832555,0.521101,0.832555,0.832555,0.832555,0.515397,0.832555,0.614001,0.832555
3,0.832555,0.832555,0.832555,0.0,0.408129,0.832555,0.743542,0.832555,0.402384,0.591903,...,0.832555,0.832555,0.745268,0.2321,0.718233,0.832555,0.832555,0.832555,0.832555,0.637941
4,0.832555,0.832555,0.832555,0.408129,0.0,0.520545,0.756451,0.832555,0.527026,0.643461,...,0.832555,0.74713,0.667574,0.450621,0.736326,0.832555,0.746413,0.832555,0.832555,0.675901


In [34]:
result_dist_df.to_csv("search_result_distance.csv", index=False)

In [163]:
index_cat_df = df.reset_index()[['index','Category']].sort_values(by="Category")

In [171]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
index_cat_df.groupby("Category")['index'].apply(list)

Category
 E commerce                                                        [16]
Advertising                                                    [51, 31]
Application services                                               [19]
Broker                                                              [1]
Business Optimization                                              [43]
Business intelligence                                              [96]
CDM                                                                [67]
Career website                                                     [63]
Clinical research                                                  [80]
Commercial real estate                                             [44]
Compensation management                                            [54]
Computing                                                          [45]
Connectivity                                                       [79]
Content delivery                                       