# Creating a dictionary from a text file

The example below reads a file line by line and uses gensim's **simple_preprocess** to tokenize one line at a time.

The advantage is that it lets you process an entire text file without loading the whole file into memory at once.
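
To make the streaming idea concrete, here is a minimal sketch that uses only the standard library. The `stream_tokens` helper and the regex tokenizer are hypothetical stand-ins for **simple_preprocess** (they are not gensim's implementation), and the sample text is made up for the illustration.

```python
import os
import re
import tempfile

def stream_tokens(path):
    """Yield one list of lowercase word tokens per line of the file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:  # the file object hands out lines lazily
            # Crude stand-in tokenizer; gensim's simple_preprocess does more.
            yield re.findall(r"[a-z]+", line.lower())

# Write a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("The sky is the limit\nEveryone else must be silent\n")
    sample_path = tmp.name

stream = stream_tokens(sample_path)
first_doc = next(stream)        # only the first line has been tokenized so far
remaining_docs = list(stream)   # the rest is read and tokenized on demand
os.remove(sample_path)

print(first_doc)
print(remaining_docs)
```

Because the generator produces one token list at a time, a consumer that processes each list and discards it never needs the full file contents in memory.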

In [13]:
import gensim
from gensim import corpora
import os

In [2]:
from gensim.utils import simple_preprocess
from smart_open import smart_open

In [6]:
# deacc=True removes accent marks from tokens (see gensim.utils.deaccent)
with open('Textfiles/text1.txt') as f:
    mydict = corpora.Dictionary(simple_preprocess(line, deacc=True) for line in f)

In [7]:
mydict.token2id

{'all': 0,
 'be': 1,
 'cannot': 2,
 'else': 3,
 'everyone': 4,
 'genius': 5,
 'god': 6,
 'is': 7,
 'language': 8,
 'must': 9,
 'of': 10,
 'one': 11,
 'poor': 12,
 'silent': 13,
 'speak': 14,
 'the': 15,
 'thereof': 16,
 'translation': 17,
 'whereof': 18}
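
To see what a mapping like the one above represents, the following is a rough pure-Python sketch of assigning integer ids to unique tokens. The `build_token2id` function is hypothetical and written only for illustration. It assumes each new token gets the next free integer, with unseen tokens sorted alphabetically within each document; treat that ordering as an assumption about gensim's behaviour rather than a statement of it.

```python
def build_token2id(documents):
    """Map every unique token to an integer id (illustrative sketch only)."""
    token2id = {}
    for tokens in documents:
        # Collect the tokens not seen before and sort them, so that the
        # resulting ids are deterministic for a given input.
        unseen = sorted(t for t in set(tokens) if t not in token2id)
        for token in unseen:
            token2id[token] = len(token2id)  # next free id
    return token2id

docs = [["the", "sky", "is", "the", "limit"],
        ["the", "sea", "is", "deep"]]
token2id = build_token2id(docs)
print(token2id)
# {'is': 0, 'limit': 1, 'sky': 2, 'the': 3, 'deep': 4, 'sea': 5}
```

Repeated tokens such as "the" receive a single id, which is why the dictionary reports unique tokens rather than total words.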

# Creating a dictionary from multiple text files

Assuming all the text files are in the same directory, define a class with an `__iter__` method. The `__iter__()` method should iterate through every file in the given directory and yield the processed list of word tokens for each line.

In [10]:
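class Readtxtfiles(object):
    """Stream tokenized lines from every file in a directory."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            # Close each file as soon as its lines have been consumed
            with open(os.path.join(self.dirname, fname)) as f:
                for line in f:
                    # Yield one list of tokens per line
                    yield simple_preprocess(line)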

In [21]:
path='Textfiles/'

In [22]:
mydict=corpora.Dictionary(Readtxtfiles(path))

In [23]:
print(mydict)

Dictionary(50 unique tokens: ['all', 'be', 'cannot', 'else', 'everyone']...)


In [25]:
mydict.token2id

{'all': 0,
 'be': 1,
 'cannot': 2,
 'else': 3,
 'everyone': 4,
 'genius': 5,
 'god': 6,
 'is': 7,
 'language': 8,
 'must': 9,
 'of': 10,
 'one': 11,
 'poor': 12,
 'silent': 13,
 'speak': 14,
 'the': 15,
 'thereof': 16,
 'translation': 17,
 'whereof': 18,
 'and': 19,
 'bite': 20,
 'change': 21,
 'dance': 22,
 'define': 23,
 'don': 24,
 'for': 25,
 'in': 26,
 'into': 27,
 'it': 28,
 'join': 29,
 'like': 30,
 'look': 31,
 'make': 32,
 'move': 33,
 'only': 34,
 'out': 35,
 'own': 36,
 'plunge': 37,
 'sense': 38,
 'sky': 39,
 'something': 40,
 'teeth': 41,
 'there': 42,
 'to': 43,
 'trying': 44,
 'way': 45,
 'with': 46,
 'you': 47,
 'your': 48,
 'yourself': 49}
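
One likely reason to wrap the file reading in a class with `__iter__`, rather than pass a bare generator, is that the object can then be iterated more than once. The sketch below illustrates that difference with the standard library only. `LineStream`, the file names, and the sample text are all hypothetical, and `str.split` stands in for **simple_preprocess**.

```python
import os
import tempfile

class LineStream:
    """Re-iterable stream: every call to __iter__ reopens the files."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # sorted() only makes the order predictable for this demonstration
        for fname in sorted(os.listdir(self.dirname)):
            path = os.path.join(self.dirname, fname)
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    yield line.split()  # stand-in for simple_preprocess

with tempfile.TemporaryDirectory() as tmpdir:
    samples = [("a.txt", "look into the sky\n"), ("b.txt", "join the dance\n")]
    for name, text in samples:
        with open(os.path.join(tmpdir, name), "w", encoding="utf-8") as handle:
            handle.write(text)

    stream = LineStream(tmpdir)
    first_pass = list(stream)
    second_pass = list(stream)    # a fresh iterator, so the files are read again

    one_shot = iter(stream)       # a single generator, by contrast
    list(one_shot)                # consume it once
    exhausted = list(one_shot)    # nothing is left on the second attempt

print(first_pass == second_pass)
print(exhausted)
```

Building a dictionary needs only one pass, but an object like this can be reused later wherever the same documents have to be streamed again.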