Topic Indexing - LSI using Gensim
===

This notebook performs topic indexing with gensim's implementation of Latent Semantic Indexing (LSI). It is meant for my own educational purposes.

**About the data:**  
I curated the data myself. Each corpus file contains exactly 10 documents, each a single sentence. The first five documents belong to one topic and the last five belong to a different topic. The documents come from Wikipedia and contain no typos.


**Goal:**  
The goal of this experiment is to see how well gensim's LSI implementation separates the two topics when used for topic indexing.

Import Libraries
---

In [97]:
from gensim import corpora, models, similarities 
import os
from six import iteritems

Prepare corpus
---

In [98]:
directory_path = r"c:\users\avour\Documents\GitHub\topic-indexing-lsa-gensim\corpus"
corpus_name = r"corpus_2\corpus.txt"
corpus_path = os.path.join(directory_path, corpus_name)
print(corpus_path)

c:\users\avour\Documents\GitHub\topic-indexing-lsa-gensim\corpus\corpus_2\corpus.txt


Generate Dataset
---

In [99]:
stop_list = set('for a of the and to is has they be are as that by from their in'.split())

# Generate Dictionary
with open(corpus_path, 'r') as file:
    dictionary = corpora.Dictionary(line.lower().split() for line in file)

# Remove stop words and words that appear in only one document
# (dictionary.dfs maps token id -> number of documents containing the token)
stop_ids = [dictionary.token2id[stopword] for stopword in stop_list
            if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)
#dictionary.filter_tokens(stop_ids)
# Reassign token ids to close the gaps left by the removed tokens
dictionary.compactify()

    
# Corpus Streaming
class MyCorpus(object):
    def __iter__(self):
        with open(corpus_path, 'r') as file:
            for line in file:
                # Assume there's one document per line, tokens separated by whitespace
                yield dictionary.doc2bow(line.lower().split())

corpus = MyCorpus()


# Save corpus and dictionary
corpora.MmCorpus.serialize('something.mm', corpus)
dictionary.save('something.dict')
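
To make the dictionary and bag-of-words steps concrete, here is a minimal pure-Python sketch of the same idea. It is not gensim's implementation: the three toy sentences, the shortened stop list, and the helper `to_bow` are made up for illustration, and the ids it assigns will generally differ from the ones gensim produces.

```python
from collections import Counter

# Hypothetical toy documents (not the notebook's corpus), one per line
docs = [
    "the nba is a basketball league",
    "basketball is played in the nba",
    "computer science is the study of algorithms",
]
stop_list = set("a of the is in".split())

tokenized = [doc.lower().split() for doc in docs]

# Document frequency: the number of documents each token appears in
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

# Keep tokens that are not stop words and appear in more than one document
vocab = sorted(t for t, df in doc_freq.items() if t not in stop_list and df > 1)
token2id = {token: idx for idx, token in enumerate(vocab)}


def to_bow(tokens):
    """Return a sorted list of (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


bows = [to_bow(tokens) for tokens in tokenized]
print(token2id)
print(bows)
```

In this toy example the third sentence ends up as an empty vector, because each of its non-stop words occurs in only one document. With a corpus as small as 10 sentences, the `docfreq == 1` filter can remove a large share of the vocabulary in the same way.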


Load Dataset
---

In [100]:
# Corpus of documents represented as a stream of vectors

# Both files are needed, so check for both before loading
if os.path.exists('something.dict') and os.path.exists('something.mm'):
    corpus = corpora.MmCorpus('something.mm')
    dictionary = corpora.Dictionary.load('something.dict')
    print('Used saved dataset')
else:
    print('Please generate the dataset')


Used saved dataset


Implement LSI
---

In [101]:
# Initialize tfidf model
tfidf = models.TfidfModel(corpus)

# Use tfidf model to transform vectors
corpus_tfidf = tfidf[corpus]

# Perform LSI transformation
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # initialize an LSI transformation

# IMPORTANT: Once the transformation model has been initialized, it can be applied to any vector from the same tf-idf space
corpus_lsi = lsi[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

lsi.save('model.lsi')
lsi = models.LsiModel.load('model.lsi')

# Print topics
print('LSI topics: ')
print(lsi.print_topics(2)[0])
print(lsi.print_topics(2)[1])
print('\n')

LSI topics: 
(0, '0.485*"science" + 0.485*"computer" + 0.484*"study" + 0.389*"algorithms" + 0.219*"an" + 0.213*"would" + 0.124*"basketball" + 0.073*"nba" + 0.070*"30" + 0.070*"member"')
(1, '-0.562*"basketball" + -0.303*"states" + -0.303*"association" + -0.303*"united" + -0.277*"1946" + -0.258*"30" + -0.245*"national" + -0.227*"nba" + -0.202*"with" + -0.188*"member"')
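
The `TfidfModel` step reweights raw counts before LSI sees them. As a rough sketch of that idea in plain Python, the block below assumes the weighting is raw term frequency times a base-2 inverse document frequency, followed by scaling each vector to unit length (which, as far as I understand, matches gensim's defaults, but treat that as an assumption). The three-document corpus and all names are hypothetical.

```python
import math

# Hypothetical bag-of-words corpus: each document is a list of (token_id, count)
corpus_bow = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 2)],
    [(2, 1), (3, 1)],
]
num_docs = len(corpus_bow)

# Document frequency of each token id
doc_freq = {}
for doc in corpus_bow:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

# Inverse document frequency with a base-2 logarithm
idf = {token_id: math.log2(num_docs / df) for token_id, df in doc_freq.items()}


def tfidf(doc):
    """Weight counts by idf, then scale the vector to unit Euclidean length."""
    weighted = [(token_id, count * idf[token_id]) for token_id, count in doc]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return []
    return [(token_id, w / norm) for token_id, w in weighted]


corpus_tfidf_sketch = [tfidf(doc) for doc in corpus_bow]
for doc in corpus_tfidf_sketch:
    print([(token_id, round(w, 3)) for token_id, w in doc])
```

Tokens that occur in fewer documents receive a larger weight, so in the first toy document token 1 (one document) outweighs token 0 (two documents).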




Visualize Final Corpus
---

In [102]:
print('Corpus LSI ')
topics_index = list()
with open(corpus_path, 'r') as file:
    for i, line in enumerate(file):
        # Assign each document the topic with the largest absolute weight
        topics_index.append(max(corpus_lsi[i], key=lambda item: abs(item[1]))[0])
        print('Topic : ', topics_index[i])
        print(corpus_lsi[i], " # " + line)
       
    print('\n')

Corpus LSI 
Topic :  1
[(0, 0.17571545262164923), (1, -0.83769662872329098)]  # The National Basketball Association (NBA) is a men's professional basketball league in North America; composed of 30 teams (29 in the United States and 1 in Canada).

Topic :  1
[(0, 0.14619065931030154), (1, -0.75391459295846031)]  # The Basketball Association of America was founded in 1946 by owners of the major ice hockey arenas in the Northeastern and Midwestern United States and Canada.

Topic :  1
[(0, 0.23858201717555549), (1, -0.71312447978183968)]  # The NBA is an active member of USA Basketball (USAB), which is recognized by FIBA (also known as the International Basketball Federation) as the national governing body for basketball in the United States.

Topic :  1
[(0, 0.22569695693714861), (1, -0.40468942550040343)]  # The NBA announced on April 15, 2016, that it would allow all 30 of its member clubs to sell corporate sponsor advertisement patches on official game uniforms, beginning with the 201

**Observations:** We do not need to know which words belong to each LSI topic; we only care that documents from different topics are not assigned the same topic. The corpus consists of 10 documents (one sentence each): the first five belong to the first topic and the last five belong to the second topic.
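
The topic assignment relies on the absolute value of each weight. A small sketch using the four vectors printed above (rounded to four decimals) shows why; the helper name `dominant_topic` is only for illustration.

```python
# LSI vectors copied from the printed output above (rounded to 4 decimals)
doc_vectors = [
    [(0, 0.1757), (1, -0.8377)],
    [(0, 0.1462), (1, -0.7539)],
    [(0, 0.2386), (1, -0.7131)],
    [(0, 0.2257), (1, -0.4047)],
]


def dominant_topic(vector):
    """Return the id of the topic whose weight has the largest magnitude."""
    topic_id, _ = max(vector, key=lambda item: abs(item[1]))
    return topic_id


assigned = [dominant_topic(v) for v in doc_vectors]
print(assigned)

# Without abs(), the small positive weight on topic 0 would win instead
naive = [max(v, key=lambda item: item[1])[0] for v in doc_vectors]
print(naive)
```

The sign of an LSI topic is arbitrary: negating a topic's word weights together with every document's weight on that topic describes the same decomposition. Comparing magnitudes keeps the assignment independent of that sign.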

Measure Accuracy
---

In [103]:
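# Check similarity of documents. It assumes 10 documents and 2 LSI topics (ids 0 and 1),
# with the first half of the documents belonging to a different topic than the second half

similarity = 0  # 0: no similarity, 1: exactly the same
index_sum = 0   # number of documents assigned to topic 1

for i in range(5):
    index_sum += topics_index[i] + topics_index[i + 5]

# index_sum of 0 or 10 means every document got the same topic (similarity 1).
# 5 is what a clean five/five split gives (similarity 0), although some mixed
# assignments also sum to 5. Values in between are scaled to the 0..1 range.
similarity = abs(index_sum - 5) / 5.0

print('Similarity: {:.0f}% (Ideal similarity is 0%)'.format(similarity * 100))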

Similarity: 0% (Ideal similarity is 0%)
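
The index-sum check above cannot tell a clean split from a mixed assignment that also happens to sum to 5. One possible alternative, which is not part of the original experiment, is to count how many documents match the topic expected for their half. The function `split_accuracy` and the example lists below are hypothetical, and the sketch assumes exactly two topics with ids 0 and 1.

```python
def split_accuracy(topics_index, half=5):
    """Fraction of documents whose topic matches their half's expected topic.

    Both possible labelings are tried (first half = topic 0, or first half =
    topic 1) and the better one is kept, because LSI topic ids carry no
    fixed meaning. Assumes exactly two topics with ids 0 and 1.
    """
    first, second = topics_index[:half], topics_index[half:]
    total = len(first) + len(second)
    hits_a = first.count(0) + second.count(1)
    hits_b = first.count(1) + second.count(0)
    return max(hits_a, hits_b) / float(total)


clean = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # clean five/five split
mixed = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]   # mixed, but also sums to 5
single = [1] * 10                        # every document in one topic

for name, labels in [("clean", clean), ("mixed", mixed), ("single", single)]:
    print(name, split_accuracy(labels))
```

Here a higher value is better: the clean split scores 1.0, the mixed case scores 0.6, and assigning everything to one topic scores 0.5, which is the lowest value this measure can take with two equal halves.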
