## Why Gensim:

To work with text documents, Gensim requires that the words (tokens) be converted to unique integer ids. To do that, Gensim lets you create a Dictionary object that maps each word to a unique id.

## Why convert to dictionary

The Dictionary object is typically used to create a ‘bag of words’ corpus. The Dictionary and the bag-of-words corpus together are the inputs to topic modeling and the other models that Gensim specializes in.

## How it Converts:

In Gensim, the dictionary maps every word (token) to its unique id.
You can create a dictionary from a list of sentences, from a text file that contains multiple lines of text, or from multiple such text files in a directory.
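
To make the idea of a word-to-id map concrete, here is a minimal plain-Python sketch. It is not Gensim's implementation, and `build_token2id` is a hypothetical helper written only for this illustration. It assumes that unseen tokens are numbered in sorted order within each document, which is consistent with the ids printed in the `token2id` output further down.

```python
def build_token2id(tokenized_docs):
    """Assign each distinct token the next free integer id.

    Hypothetical helper, not part of Gensim. Unseen tokens are numbered
    in sorted order within each document (an assumption of this sketch).
    """
    token2id = {}
    for tokens in tokenized_docs:
        # Only tokens not seen in earlier documents receive a new id.
        new_tokens = sorted(t for t in set(tokens) if t not in token2id)
        for token in new_tokens:
            token2id[token] = len(token2id)
    return token2id


docs = [["graph", "minors", "survey"], ["graph", "of", "trees"]]
token2id = build_token2id(docs)
print(token2id)
# {'graph': 0, 'minors': 1, 'survey': 2, 'of': 3, 'trees': 4}
```

Note that "graph" keeps the id it received in the first document; only "of" and "trees" get new ids from the second one.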

1. First, convert the sentences to lists of words (tokens)

In [1]:
import gensim
from gensim import corpora


In [42]:
documents = ["The Saudis are preparing a report that will acknowledge that", 
             "Saudi journalist Jamal Khashoggi's death was the result of an", 
             "interrogation that went wrong, one that was intended to lead", 
             "to his abduction from Turkey, according to two sources."]

documents_2 = ["One source says the report will likely conclude that", 
                "the operation was carried out without clearance and", 
                "transparency and that those involved will be held", 
                "responsible. One of the sources acknowledged that the", 
                "report is still being prepared and cautioned that", 
                "things could change."]

In [43]:
## Split into tokens:

texts = [doc.split() for doc in documents]

In [44]:
## convert tokens into Dictionary:

dictionary = corpora.Dictionary(texts)

In [45]:
print(dictionary)

Dictionary(33 unique tokens: ['Saudis', 'The', 'a', 'acknowledge', 'are']...)


In [46]:
print(dictionary.token2id)

{'Saudis': 0, 'The': 1, 'a': 2, 'acknowledge': 3, 'are': 4, 'preparing': 5, 'report': 6, 'that': 7, 'will': 8, 'Jamal': 9, "Khashoggi's": 10, 'Saudi': 11, 'an': 12, 'death': 13, 'journalist': 14, 'of': 15, 'result': 16, 'the': 17, 'was': 18, 'intended': 19, 'interrogation': 20, 'lead': 21, 'one': 22, 'to': 23, 'went': 24, 'wrong,': 25, 'Turkey,': 26, 'abduction': 27, 'according': 28, 'from': 29, 'his': 30, 'sources.': 31, 'two': 32}


We have now created a Dictionary object. It will be used to create the bag of words.
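
As a sketch of what a bag of words looks like: each document becomes a list of (token id, count) pairs. In Gensim this conversion is done by the Dictionary's `doc2bow` method. The plain-Python function below (`to_bow`, a hypothetical name) only imitates the idea on a small hand-written `token2id` map, and it assumes that tokens missing from the map are simply skipped.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document.

    Hypothetical helper for illustration; tokens that are missing from
    token2id are skipped.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


token2id = {"report": 0, "that": 1, "the": 2}
tokens = "the report says that the sources confirm that".split()
print(to_bow(tokens, token2id))
# [(0, 1), (1, 2), (2, 2)]
```

Word order is discarded; only the id of each known word and how often it occurs are kept.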

## Update Dictionary:

If you get new documents in the future, it is also possible to update an existing dictionary to include the new words.

In [11]:
new_documents = ["The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]

In [15]:
text2 = [doc.split() for doc in new_documents]


In [16]:
dictionary.add_documents(text2)

In [17]:
print(dictionary)

Dictionary(48 unique tokens: ['Saudis', 'The', 'a', 'acknowledge', 'are']...)


In [18]:
print(dictionary.token2id)

{'Saudis': 0, 'The': 1, 'a': 2, 'acknowledge': 3, 'are': 4, 'preparing': 5, 'report': 6, 'that': 7, 'will': 8, 'Jamal': 9, "Khashoggi's": 10, 'Saudi': 11, 'an': 12, 'death': 13, 'journalist': 14, 'of': 15, 'result': 16, 'the': 17, 'was': 18, 'intended': 19, 'interrogation': 20, 'lead': 21, 'one': 22, 'to': 23, 'went': 24, 'wrong,': 25, 'Turkey,': 26, 'abduction': 27, 'according': 28, 'from': 29, 'his': 30, 'sources.': 31, 'two': 32, 'graph': 33, 'in': 34, 'intersection': 35, 'paths': 36, 'trees': 37, 'Graph': 38, 'IV': 39, 'Widths': 40, 'and': 41, 'minors': 42, 'ordering': 43, 'quasi': 44, 'well': 45, 'A': 46, 'survey': 47}


## Create Dictionary for Text File

The example below reads a file line by line and uses Gensim’s simple_preprocess to tokenize one line of the file at a time.
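
As a rough illustration of what this kind of preprocessing does to a single line, the sketch below lowercases the text, keeps alphabetic tokens, and drops very short or very long ones. `approx_preprocess` is a hypothetical stand-in written for this example. It only approximates `simple_preprocess`, and the limits of 2 and 15 characters are assumed to match that function's defaults.

```python
import re


def approx_preprocess(line, min_len=2, max_len=15):
    """Lowercase a line and return its alphabetic tokens.

    Hypothetical stand-in for illustration only; it approximates, but does
    not reproduce, gensim.utils.simple_preprocess.
    """
    # [^\W\d_]+ matches runs of letters (no digits, underscores or punctuation).
    tokens = re.findall(r"[^\W\d_]+", line.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


print(approx_preprocess("The Saudis are preparing a report, sources say."))
# ['the', 'saudis', 'are', 'preparing', 'report', 'sources', 'say']
```

Compared with a bare `split()`, punctuation no longer sticks to the words, and one-letter words such as "a" are dropped.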



In [37]:
from gensim.utils import simple_preprocess

In [40]:
with open('sample.txt', encoding='utf-8') as f:
    dictionary = corpora.Dictionary(simple_preprocess(line) for line in f)


In [41]:
print(dictionary)

Dictionary(0 unique tokens: [])


In [36]:
print(dictionary.token2id)

{}

The empty result above means that no tokens were read from sample.txt in this run, for example because the file was empty.




In [47]:
## Bag of words

In [48]:
from sklearn.feature_extraction.text import CountVectorizer

In [49]:
cv=CountVectorizer()

In [53]:
sparse_matrix=cv.fit_transform(documents_2)

In [55]:
sparse_matrix.toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1,
        1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1],
       [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
        0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
        0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]], dtype=int64)

In [8]:
documents = '''"The Saudis are preparing a report that will acknowledge that", 
             "Saudi journalist Jamal Khashoggi's death was the result of an", 
             "interrogation that went wrong, one that was intended to lead", 
             "to his abduction from Turkey, according to two sources."'''

In [5]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to C:\Users\ganesh
[nltk_data]     duke\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [9]:
nltk.sent_tokenize(documents)

['"The Saudis are preparing a report that will acknowledge that", \n             "Saudi journalist Jamal Khashoggi\'s death was the result of an", \n             "interrogation that went wrong, one that was intended to lead", \n             "to his abduction from Turkey, according to two sources."']

In [56]:
## TF_IDF

In [57]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [58]:
tf=TfidfVectorizer()

In [61]:
sparse_tf=tf.fit_transform(documents_2)

In [62]:
sparse_tf.toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.38212155, 0.        ,
        0.        , 0.        , 0.        , 0.38212155, 0.        ,
        0.31334514, 0.        , 0.        , 0.        , 0.31334514,
        0.        , 0.38212155, 0.38212155, 0.        , 0.        ,
        0.22669711, 0.2645475 , 0.        , 0.        , 0.        ,
        0.        , 0.31334514, 0.        ],
       [0.        , 0.26244688, 0.        , 0.        , 0.37908734,
        0.        , 0.        , 0.37908734, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.37908734, 0.37908734, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.26244688, 0.        , 0.        , 0.        ,
        0.37908734, 0.        , 0.37908734],
       [0.        , 0.27147051, 0.3921214 , 0.        , 0.        ,
        0.        , 0.    

## Advantages:

When your text input is large, you need to be able to create the Dictionary object without loading the entire text file.
The good news is that Gensim lets you read the text and update the dictionary one line at a time, without loading the whole file into system memory.
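
The streaming idea can be sketched without Gensim: a generator yields one tokenized line at a time, so only the current line and the growing word-to-id map are held in memory. The names `iter_tokenized_lines` and `stream_token2id`, the whitespace tokenization, and the temporary sample file are all assumptions made for this sketch. Ids here are handed out in first-seen order, which need not match the ids Gensim would assign.

```python
import os
import tempfile


def iter_tokenized_lines(path):
    """Yield the lowercase, whitespace-split tokens of one line at a time."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:  # the file object reads lazily, line by line
            yield line.lower().split()


def stream_token2id(token_lists):
    """Build a token -> id map from any iterable of token lists."""
    token2id = {}
    for tokens in token_lists:
        for token in tokens:
            # setdefault stores the next free id only for unseen tokens.
            token2id.setdefault(token, len(token2id))
    return token2id


# Write a small sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("graph minors a survey\nthe intersection graph of paths\n")
    sample_path = tmp.name

token2id = stream_token2id(iter_tokenized_lines(sample_path))
os.remove(sample_path)
print(token2id)
```

Because `stream_token2id` accepts any iterable, the same function works for a list already in memory or for a generator over a file far larger than the available RAM.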