[View in Colaboratory](https://colab.research.google.com/github/gmihaila/word_indexer/blob/master/tokenizer_fit.ipynb)

## How to index words in a large dataset using Keras Tokenizer

# Word Indexer

* This is an easy-to-use function that indexes words or characters when your dataset is too large to tokenize in one call.
* It fits the tokenizer batch by batch, so it is fast and holds only one batch of strings in memory at a time.
* It returns a Keras `Tokenizer`.
* It builds on the Keras `Tokenizer`, which accepts the following arguments:

  * **num_words**: the maximum number of words to keep, based on word frequency. Only the most common `num_words` words will be kept.
  * **filters**: a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the ' character.
  * **lower**: boolean. Whether to convert the texts to lowercase.
  * **split**: str. Separator for word splitting.
  * **char_level**: if True, every character will be treated as a token.
  * **oov_token**: if given, it will be added to `word_index` and used to replace out-of-vocabulary words during `texts_to_sequences` calls.


[Keras Text Preprocessing](https://keras.io/preprocessing/text/)


### Function Setup

```
from keras.preprocessing.text import Tokenizer

def FitTokenizer(data, n_words, batch_size, data_size):
    """
    data       - any iterable of strings; it can even be a generator
    n_words    - how many words to keep when indexing
    batch_size - any value smaller than data size
    data_size  - how many strings you have in your data
    """
```
**NOTE:** A string is treated as a document of text.
 

```
tk = FitTokenizer(my_data, n_words, batch_size, len(my_data))

tk.word_counts     # dictionary of word : word count
tk.document_count  # number of documents (should match data size)
tk.word_index      # dictionary of word : index, in ascending order starting from the most frequent word
tk.word_docs       # dictionary of word : number of documents it appears in
```
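
To make these four attributes concrete, here is a minimal pure-Python sketch that computes the same kinds of statistics with `collections.Counter`. It is not the Keras implementation: the `simple_stats` helper and its simplified tokenization (lowercase, strip punctuation, split on whitespace) are assumptions made for illustration only.

```python
from collections import Counter
import string


def simple_stats(texts):
    """Hypothetical helper: mimic the four statistics for a list of strings."""
    word_counts = Counter()   # word -> total number of occurrences
    word_docs = Counter()     # word -> number of documents containing it
    document_count = 0
    strip_punct = str.maketrans("", "", string.punctuation)
    for text in texts:
        # Assumed tokenization: lowercase, drop punctuation, split on whitespace.
        words = text.lower().translate(strip_punct).split()
        word_counts.update(words)
        word_docs.update(set(words))  # count each word once per document
        document_count += 1
    # Rank words by frequency; index 1 is the most frequent word.
    ranked = sorted(word_counts, key=word_counts.get, reverse=True)
    word_index = {word: rank for rank, word in enumerate(ranked, start=1)}
    return word_counts, document_count, word_index, word_docs


counts, n_docs, index, docs = simple_stats(["a b a", "b c"])
print(counts["a"], n_docs, index, docs["b"])
```

Because every statistic is a running count, feeding the documents in several batches gives the same totals as feeding them all at once.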

## Please see [tokenizer_fit.ipynb](https://github.com/gmihaila/word_indexer/blob/master/tokenizer_fit.ipynb)

**Thank you for using my code!**

Check out more useful tools: 
[GeorgeM](https://gmihaila.github.io)

In [1]:
import os
import string
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


### Custom function that can batch data if you have a large corpus


In [4]:
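# fit a tokenizer in batches
def FitTokenizer(data, n_words, batch_size, data_size):
  # data_size is kept in the signature for existing callers; the loop does not need it
  batch = []
  # create the tokenizer
  tokenizer = Tokenizer(num_words=n_words, filters=string.punctuation, lower=True, split=' ', char_level=False, oov_token='<unk>')
  for x in data:
    batch.append(x)
    if len(batch) == batch_size:
      # feed the full batch; the tokenizer keeps running counts
      tokenizer.fit_on_texts(batch)
      # reset batch
      batch = []
  if batch:
    # feed the leftover documents that did not fill a whole batch
    tokenizer.fit_on_texts(batch)
  if n_words is not None:
    # keep only the n_words most frequent words and place the OOV token right after them
    # (assumes a Keras release that does not itself put oov_token in word_index, as in the output below)
    tokenizer.word_index = {e:i for e,i in tokenizer.word_index.items() if i <= n_words}
    tokenizer.word_index[tokenizer.oov_token] = n_words + 1

  return tokenizer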
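
The batching idea can also be written as a standalone helper. The sketch below is an alternative, not the code used in this notebook: `iter_batches` is a hypothetical name, and it relies on `itertools.islice` to pull fixed-size chunks from any iterable, including a generator, without knowing its length in advance.

```python
from itertools import islice


def iter_batches(data, batch_size):
    """Hypothetical helper: yield lists of up to batch_size items from any iterable."""
    iterator = iter(data)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the iterable is exhausted
        yield batch  # the last batch may be shorter than batch_size


# Works on a generator of unknown length.
batches = list(iter_batches((str(i) for i in range(5)), batch_size=2))
print(batches)  # [['0', '1'], ['2', '3'], ['4']]
```

Each yielded batch could then be passed to `tokenizer.fit_on_texts(batch)`.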

### Test it on a toy example

In [8]:
my_data = ["My name is George.", "My name is Mhaela","My name is Ionut"]

t = FitTokenizer(data=my_data, n_words=2, batch_size=2, data_size=len(my_data))

print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)

OrderedDict([('my', 3), ('name', 3), ('is', 3), ('george', 1), ('mhaela', 1), ('ionut', 1)])
3
{'my': 1, 'name': 2, '<unk>': 3}
{'name': 3, 'mhaela': 1, 'george': 1, 'ionut': 1, 'my': 3, 'is': 3}
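
To show what the truncated `word_index` is for, here is a minimal pure-Python sketch that turns a sentence into a list of indices, sending every word outside the index to `<unk>`. The `encode` helper is hypothetical and is not Keras's `texts_to_sequences`, whose handling of `num_words` and out-of-vocabulary words may differ between Keras versions.

```python
import string

# Word index copied from the toy example output.
word_index = {'my': 1, 'name': 2, '<unk>': 3}


def encode(text, word_index, oov_token='<unk>'):
    """Hypothetical helper: map each word to its index, or to the OOV index."""
    strip_punct = str.maketrans('', '', string.punctuation)
    # Assumed tokenization: lowercase, drop punctuation, split on whitespace.
    words = text.lower().translate(strip_punct).split()
    return [word_index.get(word, word_index[oov_token]) for word in words]


encoded = encode("My name is George.", word_index)
print(encoded)  # [1, 2, 3, 3]
```

With `n_words=2`, only `my` and `name` keep their own indices; `is` and `george` both map to the `<unk>` index.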


### Works the same on Data Generators

In [10]:
# toy example of data generator
def DataGen(data):
  for x in data:
    yield x
    
generator = DataGen(my_data)
t = FitTokenizer(data=generator, n_words=2, batch_size=2, data_size=3)

print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)

OrderedDict([('my', 3), ('name', 3), ('is', 3), ('george', 1), ('mhaela', 1), ('ionut', 1)])
3
{'my': 1, 'name': 2, '<unk>': 3}
{'name': 3, 'mhaela': 1, 'george': 1, 'ionut': 1, 'my': 3, 'is': 3}
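
For a corpus stored on disk, a generator that reads one line at a time is a natural fit. The sketch below is an illustration under assumptions: `line_generator` is a hypothetical helper, and it assumes a UTF-8 text file with one document per line.

```python
import os
import tempfile


def line_generator(path):
    """Hypothetical helper: yield one stripped, non-empty line at a time."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


# Write a tiny throwaway corpus so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("My name is George.\n\nMy name is Ionut\n")
    corpus_path = tmp.name

documents = list(line_generator(corpus_path))
os.remove(corpus_path)
print(documents)
```

A generator like this could be passed as the `data` argument. Keep in mind that a generator can be consumed only once and has no `len()`, so the number of documents has to be known or counted separately.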
