# Gensim

### What is Gensim?
- Gensim (short for “Generate Similar”) is a popular open-source natural language processing (NLP) library used for unsupervised topic modeling.


- Let's go over some important terms and their meanings.


- Document: some text.
- Corpus: a collection of documents.
- Vector: a mathematically convenient representation of a document.
- Model: an algorithm for transforming vectors from one representation to another.

#####  Document: some text.

In [2]:
document = "Human machine interface for lab abc computer applications"

##### Corpus: a collection of documents.

In [3]:
text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

##### Vector: a mathematically convenient representation of a document

In [4]:
import pprint

In [5]:
# Create a set of common words (stopwords) to filter out

stoplist = set('for a of the and to in'.split(' '))

# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in text_corpus]

# Count word frequencies
from collections import defaultdict

frequency = defaultdict(int)

for text in texts:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
pprint.pprint(processed_corpus)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


In [6]:
# Create a dictionary that maps each unique token to an integer ID (used later for topic modeling)
from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)


In [7]:
# Token-to-ID mapping used to build the vector representation
pprint.pprint(dictionary.token2id)

{'computer': 0,
 'eps': 8,
 'graph': 10,
 'human': 1,
 'interface': 2,
 'minors': 11,
 'response': 3,
 'survey': 4,
 'system': 5,
 'time': 6,
 'trees': 9,
 'user': 7}


##### Model: an algorithm for transforming vectors from one representation to another.

In [8]:
# It is good practice to test on a small text when trying something new.
# Each tuple below is (token ID, count); "interaction" is not in the dictionary, so it is dropped.

new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

[(0, 1), (1, 1)]
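To make the (ID, count) format concrete, here is a minimal pure-Python sketch of what a bag-of-words conversion does. The helper `doc_to_bow` is a hypothetical name for illustration and is not Gensim's implementation; the `token2id` mapping is copied from the output of cell 7 above.

```python
from collections import Counter

# Token-to-ID mapping copied from the dictionary.token2id output above
token2id = {
    'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
    'survey': 4, 'system': 5, 'time': 6, 'user': 7,
    'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11,
}

def doc_to_bow(tokens, token2id):
    """Hypothetical helper: count known tokens and return sorted (ID, count) pairs."""
    # Tokens missing from the mapping are silently skipped
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

print(doc_to_bow("Human computer interaction".lower().split(), token2id))
# prints [(0, 1), (1, 1)]
```

A repeated token raises its count instead of adding a new tuple, which is why the fourth document in the corpus contains `(5, 2)` for "system".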


In [9]:
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(bow_corpus)

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]


In [10]:
from gensim import models

# train the model
tfidf = models.TfidfModel(bow_corpus)

# transform the "system minors" string
words = "system minors".lower().split()
print(tfidf[dictionary.doc2bow(words)])

[(5, 0.5898341626740045), (11, 0.8075244024440723)]
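The weights above can be reproduced by hand. The sketch below assumes the TF-IDF scheme I believe Gensim uses by default: raw term count multiplied by log2(N / document frequency), followed by normalization to unit length. The function name `tfidf_weights` is hypothetical, and `bow_corpus` is copied from the output of cell 9 above.

```python
import math

# Bag-of-words corpus copied from the output above
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]

def tfidf_weights(bow, corpus):
    """Hypothetical helper: tf * log2(N / df), then L2-normalize."""
    num_docs = len(corpus)
    # Document frequency: number of documents containing each token ID
    doc_freq = {}
    for doc in corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    raw = [(tid, count * math.log2(num_docs / doc_freq[tid]))
           for tid, count in bow if tid in doc_freq]
    norm = math.sqrt(sum(w * w for _, w in raw))
    if norm == 0:
        return []
    return [(tid, w / norm) for tid, w in raw]

# "system" has ID 5 (in 3 of 9 documents), "minors" has ID 11 (in 2 of 9)
query = [(5, 1), (11, 1)]
print(tfidf_weights(query, bow_corpus))
```

Under these assumptions, "minors" gets the larger weight because it appears in fewer documents than "system", and the result should agree with the model output above to several decimal places.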


#### Open text file

In [11]:
from gensim import corpora
from gensim.utils import simple_preprocess

# Build a dictionary from a text file, treating each line as one document.
# The "with" block closes the file once the dictionary has been built.
with open("sample.txt") as f:
    dict_STF = corpora.Dictionary(simple_preprocess(line, deacc=True) for line in f)

print(dict_STF.token2id)

{'details': 0, 'files': 1, 'sample': 2, 'text': 3, 'how': 4, 'open': 5, 'to': 6, 'txt': 7, 'basic': 8, 'informations': 9, 'create': 10, 'download': 11, 'file': 12, 'of': 13, 'url': 14, 'mb': 15}


### LDA (Latent Dirichlet Allocation)

LDA is one of the most important models for topic modeling. We will work with it later, but let's first understand what it does.


Automatically extracting information about topics from large volumes of text is one of the primary applications of NLP (natural language processing). Such text could come from hotel reviews, tweets, Facebook posts, other social media feeds, movie reviews, news stories, user feedback, e-mails, and so on.

In this digital era, knowing what people and customers are talking about, and understanding their opinions and problems, can be highly valuable for businesses, political campaigns, and administrators. But is it possible to manually read through such large volumes of text and extract the topics?

No, it’s not. That requires an algorithm that can read through large volumes of text documents and automatically extract the topics discussed in them.

#### Role of LDA

LDA’s approach to topic modeling is to assign the text in each document to one or more topics. Using Dirichlet distributions, LDA builds:

- A topics-per-document model and
- A words-per-topic model

Once the corpus is given to the LDA algorithm, it repeatedly rearranges the following to obtain a good composition of topic-keyword distributions:

- The topic distribution within each document and
- The keyword distribution within each topic


While processing, LDA makes the following assumptions:

- Every document is modeled as a multinomial distribution over topics.
- Every topic is modeled as a multinomial distribution over words.
- Each chunk of text contains related words, so we have to choose the corpus carefully.
- Documents are produced from a mixture of topics.
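
The assumptions above describe a generative story, which can be sketched with the standard library alone. This is an illustrative simulation, not Gensim's LDA implementation and not an inference algorithm: the two topics, their word probabilities, the `alpha` value, and the helper names are all made up for the example.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocabulary, alpha, length, rng):
    """Hypothetical helper: generate one document as a mixture of topics."""
    num_topics = len(topic_word_probs)
    # Each document gets its own multinomial distribution over topics
    topic_mix = sample_dirichlet([alpha] * num_topics, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word, then pick a word from that topic
        topic = rng.choices(range(num_topics), weights=topic_mix)[0]
        word = rng.choices(vocabulary, weights=topic_word_probs[topic])[0]
        words.append(word)
    return topic_mix, words

# Made-up vocabulary and two made-up topics (each row is a distribution over words)
vocabulary = ["human", "computer", "interface", "graph", "trees", "minors"]
topic_word_probs = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "human-computer interaction" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "graph theory" topic
]

rng = random.Random(0)
topic_mix, words = generate_document(topic_word_probs, vocabulary,
                                     alpha=0.5, length=8, rng=rng)
print(topic_mix)
print(words)
```

Fitting an LDA model runs this story in reverse: given only the words, it estimates the topic mixture of each document and the word distribution of each topic.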


Note: **We will discuss this in more detail in the topic modeling section.**