In [None]:
#https://machinelearningknowledge.ai/keras-tokenizer-tutorial-with-examples-for-fit_on_texts-texts_to_sequences-texts_to_matrix-sequences_to_matrix/

The fit_on_texts method of the Keras Tokenizer class updates the tokenizer's internal vocabulary from a list of texts. It must be called before texts_to_sequences or texts_to_matrix.

fit_on_texts itself returns nothing. Once it has been called, the tokenizer object exposes the following attributes:

word_counts : It is a dictionary of words along with the counts.

word_docs : Again a dictionary of words, this tells us how many documents contain this word

word_index : In this dictionary, we have unique integers assigned to each word.

document_count : This integer count will tell us the total number of documents used for fitting the tokenizer.

In [None]:
from keras.preprocessing.text import Tokenizer

t  = Tokenizer()
# Define a list of 4 documents
fit_text = ['Machine Learning Knowledge',
            'Machine Learning',
            'Deep Learning',
            'Artificial Intelligence']

t.fit_on_texts(fit_text)

In [None]:
#The document_count prints the number of documents present in our corpus.  In our above example, there are 4 documents present.
print("The document count",t.document_count)

The document count 4


In [None]:
# The word_counts attribute shows the number of times each word occurs in the text corpus passed to the Keras tokenizer. In our example, the word ‘machine’ occurs 2 times, ‘learning’ 3 times, and so on.
print("The count of words",t.word_counts)

The count of words OrderedDict([('machine', 2), ('learning', 3), ('knowledge', 1), ('deep', 1), ('artificial', 1), ('intelligence', 1)])


In [None]:
#The word_index assigns a unique index to each word present in the text, with more frequent words receiving lower indices. These integers are what a model receives in place of the raw words during training.
print("The word index",t.word_index)

The word index {'learning': 1, 'machine': 2, 'knowledge': 3, 'deep': 4, 'artificial': 5, 'intelligence': 6}


In [None]:
#The word_docs attribute tells in how many documents each of the words appears. In our example, ‘machine’ appears in 2 documents, ‘learning’ in 3 documents, ‘knowledge’ in 1 document, and so on.
print("The word docs",t.word_docs)

The word docs defaultdict(<class 'int'>, {'knowledge': 1, 'machine': 2, 'learning': 3, 'deep': 1, 'artificial': 1, 'intelligence': 1})
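To make the four attributes concrete, here is a minimal pure-Python sketch that recomputes them for the same four documents. It is not Keras's implementation: tokenization is simplified to lowercasing and splitting on whitespace, and the variable names merely mirror the tokenizer's attributes.

```python
from collections import Counter, OrderedDict

docs = ['Machine Learning Knowledge',
        'Machine Learning',
        'Deep Learning',
        'Artificial Intelligence']

word_counts = OrderedDict()   # total occurrences, in order of first appearance
word_docs = Counter()         # number of documents that contain each word
document_count = 0

for doc in docs:
    document_count += 1
    tokens = doc.lower().split()          # simplified tokenization (assumption)
    for tok in tokens:
        word_counts[tok] = word_counts.get(tok, 0) + 1
    for tok in set(tokens):               # count each word once per document
        word_docs[tok] += 1

# Rank words by descending count. sorted() is stable, so words with equal
# counts keep their first-seen order. Indices start at 1.
ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}

print(document_count)
print(dict(word_counts))
print(dict(word_docs))
print(word_index)
```

The resulting word_index matches the one printed above: ‘learning’ is the most frequent word and gets index 1, followed by ‘machine’ with index 2.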


# Example 2 : fit_on_texts on String

When fit_on_texts is given a single string instead of a list of texts, it iterates over the string character by character, so its attributes describe characters rather than words.

The word_counts attribute shows the number of times each character occurs.

The document_count gives the number of characters in the input text, including the space.

The word_index assigns a unique index to each character present in the text.

The word_docs gives the same frequencies as word_counts, because each character is treated as its own document.

In [None]:
t  = Tokenizer()

fit_text = 'Machine Learning'

t.fit_on_texts(fit_text)

print("Count of characters:",t.word_counts)
print("Length of text:",t.document_count)
print("Character index",t.word_index)
print("Frequency of characters:",t.word_docs)

Count of characters: OrderedDict([('m', 1), ('a', 2), ('c', 1), ('h', 1), ('i', 2), ('n', 3), ('e', 2), ('l', 1), ('r', 1), ('g', 1)])
Length of text: 16
Character index {'n': 1, 'a': 2, 'i': 3, 'e': 4, 'm': 5, 'c': 6, 'h': 7, 'l': 8, 'r': 9, 'g': 10}
Frequency of characters: defaultdict(<class 'int'>, {'m': 1, 'a': 2, 'c': 1, 'h': 1, 'i': 2, 'n': 3, 'e': 2, 'l': 1, 'r': 1, 'g': 1})
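The character-level behaviour follows from how Python iterates over a string. The short sketch below uses plain Python only (no Keras) to show why the length reported above is 16, and why wrapping the string in a list would make it a single document instead.

```python
from collections import Counter

text = 'Machine Learning'

# Iterating over a str yields one character at a time, so each character
# plays the role of a separate "document".
as_documents = list(text)
print(len(as_documents))                  # 16, the space included

# The space is only a separator, so it contributes no token of its own.
char_counts = Counter(ch for ch in text.lower() if not ch.isspace())
print(char_counts['n'], char_counts['a'])  # 3 2

# Wrapping the string in a list turns it into exactly one document.
wrapped = [text]
print(len(wrapped))                       # 1
```

So to get word-level statistics for a single sentence, pass it as a one-element list such as ['Machine Learning'] rather than as a bare string.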


# 2. texts_to_sequences

The texts_to_sequences method converts each text in a corpus into a sequence of integers.
# Example 1: texts_to_sequences on Document List
Given a corpus of documents, texts_to_sequences replaces each word with its integer from word_index. For example, ‘machine’ is assigned the value 2.

In [None]:
t = Tokenizer()

test_text = ['Machine Learning Knowledge',
	      'Machine Learning',
             'Deep Learning',
             'Artificial Intelligence']

t.fit_on_texts(test_text)

sequences = t.texts_to_sequences(test_text)

print("The sequences generated from text are : ",sequences)

The sequences generated from text are :  [[2, 1, 3], [2, 1], [4, 1], [5, 6]]
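Conceptually, this conversion is a dictionary lookup. The sketch below is a simplified pure-Python stand-in, not the Keras code: to_sequences is a hypothetical helper, and it assumes the default configuration in which no out-of-vocabulary token is set, so unknown words are simply skipped.

```python
# word_index as printed earlier for this corpus
word_index = {'learning': 1, 'machine': 2, 'knowledge': 3,
              'deep': 4, 'artificial': 5, 'intelligence': 6}

def to_sequences(texts, word_index):
    """Replace each known word by its index and skip unknown words."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

print(to_sequences(['Machine Learning Knowledge', 'Deep Learning'], word_index))
# [[2, 1, 3], [4, 1]]

# 'quantum' was never seen during fitting, so it leaves no trace
print(to_sequences(['Quantum Machine Learning'], word_index))
# [[2, 1]]
```

Because unseen words vanish silently in this setup, sequences for new text can be shorter than the text itself.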


# Example 2: texts_to_sequences on String
In this example, texts_to_sequences assigns integers to characters. For example, ‘e’ is assigned the value 4. The space between the two words has no index, so it produces an empty list in the output.

In [None]:
t = Tokenizer()

test_text = "Machine Learning"

t.fit_on_texts(test_text)

sequences = t.texts_to_sequences(test_text)

print("The sequences generated from text are : ",sequences)

The sequences generated from text are :  [[5], [2], [6], [7], [3], [1], [4], [], [8], [4], [2], [9], [1], [3], [1], [10]]


# 3. texts_to_matrix
Another useful method of the Tokenizer class is texts_to_matrix(), which converts a list of documents into a NumPy matrix with one row per document and one column per word index.

This function works in 4 different modes –

binary : The default mode. It marks whether each word is present in a document.

count : The number of times each word occurs in the document.

tfidf : The TF-IDF score for each word in the document.

freq : The count of each word divided by the total number of words in the document.

# Example 1: texts_to_matrix with mode = binary
The binary mode of texts_to_matrix() marks the presence of a word with ‘1’ in the matrix and its absence with ‘0’.

NOTE: This mode doesn’t count how many times a particular word occurs; it only tells whether the word is present in each of the documents.



In [None]:
# define 5 documents
docs = ['Marvellous Machine Learning Marvellous Machine Learning',
		'Amazing Artificial Intelligence',
		'Dazzling Deep Learning',
		'Champion Computer Vision',
		'Notorious Natural Language Processing Notorious Natural Language Processing']
# create the tokenizer
t = Tokenizer()

t.fit_on_texts(docs)
encoded_docs = t.texts_to_matrix(docs, mode='binary')
print(encoded_docs)

[[0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
 [0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]


# Example 2: texts_to_matrix with mode = count
The count mode in texts_to_matrix() function determines the number of times the words appear in each of the documents.

In [None]:
# define 5 documents 
docs = ['Marvellous Machine Learning Marvellous Machine Learning', 
'Amazing Artificial Intelligence', 
'Dazzling Deep Learning', 
'Champion Computer Vision', 
'Notorious Natural Language Processing Notorious Natural Language Processing'] 

# create the tokenizer
t = Tokenizer()

t.fit_on_texts(docs)

encoded_docs = t.texts_to_matrix(docs, mode='count')

print(encoded_docs)

[[0. 2. 2. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
 [0. 0. 0. 0. 2. 2. 2. 2. 0. 0. 0. 0. 0. 0. 0. 0.]]
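The difference between the binary and count modes can be seen by encoding a single row by hand. This is a pure-Python sketch rather than the Keras code: encode is a hypothetical helper, and the indices (learning=1, marvellous=2, machine=3) are inferred from the matrices printed above.

```python
def encode(sequence, width, mode='binary'):
    """Build one matrix row from a sequence of word indices."""
    row = [0.0] * width
    for idx in sequence:
        if mode == 'count':
            row[idx] += 1.0      # accumulate every occurrence
        else:
            row[idx] = 1.0       # only record presence
    return row

# 'Marvellous Machine Learning Marvellous Machine Learning'
seq = [2, 3, 1, 2, 3, 1]
width = 16                       # 15 distinct words plus the unused column 0

print(encode(seq, width, 'binary')[:4])   # [0.0, 1.0, 1.0, 1.0]
print(encode(seq, width, 'count')[:4])    # [0.0, 2.0, 2.0, 2.0]
```

Both modes fill the same columns; only the values differ when a word repeats within a document.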


# Example 3: texts_to_matrix with mode = tfidf
TF-IDF (Term Frequency-Inverse Document Frequency) scores how relevant a word is to a document within a text corpus.

In this mode, a word's score rises with the number of times it occurs in a document and falls with the number of documents in the corpus that contain it. Words that are frequent in one document but rare elsewhere therefore receive the highest scores.



In [None]:
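# Note: t was already fitted on docs in the previous cell. Calling fit_on_texts
# again accumulates the statistics (document_count becomes 10), and the scores
# printed below reflect that.
t.fit_on_texts(docs)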
encoded_docs = t.texts_to_matrix(docs, mode='tfidf')
print(encoded_docs)

[[0.         1.8601123  2.48272447 2.48272447 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         1.46633707 1.46633707 1.46633707 0.
  0.         0.         0.         0.        ]
 [0.         1.09861229 0.         0.         0.         0.
  0.         0.         0.         0.         0.         1.46633707
  1.46633707 0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         1.46633707 1.46633707 1.46633707]
 [0.         0.         0.         0.         2.48272447 2.48272447
  2.48272447 2.48272447 0.         0.         0.         0.
  0.         0.         0.         0.        ]]
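The printed scores can be reproduced by hand. The sketch below assumes the weighting tf = 1 + ln(count) and idf = ln(1 + N / (1 + df)), which is how the Keras Tokenizer appears to compute this mode and which agrees with the numbers above; tfidf is a hypothetical helper name. Note that the tokenizer had been fitted on the same five documents twice, so N is 10 and every document frequency is doubled.

```python
import math

def tfidf(count, document_count, doc_freq):
    """Score for a word occurring `count` times in one document."""
    tf = 1 + math.log(count)
    idf = math.log(1 + document_count / (1 + doc_freq))
    return tf * idf

# 'learning': in 2 of 5 documents per fit, fitted twice -> doc_freq 4, N = 10
print(tfidf(2, 10, 4))   # ~1.8601123  (row 1, occurs twice)
print(tfidf(1, 10, 4))   # ~1.09861229 (row 3, occurs once)

# 'marvellous': in 1 of 5 documents per fit, fitted twice -> doc_freq 2
print(tfidf(2, 10, 2))   # ~2.48272447
# 'amazing': same document frequency, but occurs only once
print(tfidf(1, 10, 2))   # ~1.46633707
```

With a freshly created tokenizer fitted only once, N would be 5 and the scores would differ from those printed above.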


# Example 4: texts_to_matrix with mode = freq
The last mode of texts_to_matrix() is freq. It scores each word by the ratio of its count in a document to the total number of words in that document.

In [None]:
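# Re-fitting is not needed for this mode: 'freq' uses only the word counts
# within each document, so the output is the same either way.
t.fit_on_texts(docs)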

encoded_docs = t.texts_to_matrix(docs, mode='freq')

print(encoded_docs)

[[0.         0.33333333 0.33333333 0.33333333 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.33333333 0.33333333 0.33333333 0.
  0.         0.         0.         0.        ]
 [0.         0.33333333 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.33333333
  0.33333333 0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.33333333 0.33333333 0.33333333]
 [0.         0.         0.         0.         0.25       0.25
  0.25       0.25       0.         0.         0.         0.
  0.         0.         0.         0.        ]]
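The freq values are simply per-document proportions. A minimal pure-Python sketch follows; freq_row is a hypothetical helper, and the word indices are inferred from the matrices above.

```python
def freq_row(sequence, width):
    """Each word's count divided by the number of words in the document."""
    row = [0.0] * width
    for idx in sequence:
        row[idx] += 1.0
    total = len(sequence)
    return [c / total for c in row] if total else row

# First document: three distinct words, each appearing twice (6 words total)
first = freq_row([2, 3, 1, 2, 3, 1], 16)
print(first[1:4])     # each entry is 2/6

# Last document: four distinct words, each appearing twice (8 words total)
last = freq_row([4, 5, 6, 7, 4, 5, 6, 7], 16)
print(last[4:8])      # [0.25, 0.25, 0.25, 0.25]
```

Every non-empty row sums to 1, which makes rows comparable across documents of different lengths.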


# 4. sequences_to_matrix
The sequences_to_matrix() method of the Keras Tokenizer class converts integer sequences, such as those returned by texts_to_sequences, into a NumPy matrix.

sequences_to_matrix also has 4 different modes to work with –

binary : The default mode. It marks whether each word is present in a document.

count : The number of times each word occurs in the document.

tfidf : The TF-IDF score for each word in the document.

freq : The count of each word divided by the total number of words in the document.
# Example 1: sequences_to_matrix with mode = binary

In [None]:
# define 4 documents
docs =['Machine Learning Knowledge',
       'Machine Learning and Deep Learning',
       'Deep Learning',
       'Artificial Intelligence']

# create the tokenizer
t = Tokenizer()

t.fit_on_texts(docs)

sequences = t.texts_to_sequences(docs)

encoded_docs = t.sequences_to_matrix(sequences, mode='binary')
print(encoded_docs)

[[0. 1. 1. 0. 1. 0. 0. 0.]
 [0. 1. 1. 1. 0. 1. 0. 0.]
 [0. 1. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 1.]]
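The shape of this matrix can be checked by hand. In the sketch below, the word_index and the sequences are worked out from the frequency ranking of the four documents (they are not printed in the cell above), and binary_row is a hypothetical helper. The width is the vocabulary size plus one because index 0 is never assigned to a word, which is why the first column is always 0.

```python
word_index = {'learning': 1, 'machine': 2, 'deep': 3, 'knowledge': 4,
              'and': 5, 'artificial': 6, 'intelligence': 7}

# The four documents expressed as index sequences
sequences = [[2, 1, 4], [2, 1, 5, 3, 1], [3, 1], [6, 7]]

width = len(word_index) + 1      # column 0 stays unused

def binary_row(seq, width):
    present = set(seq)
    return [1.0 if col in present else 0.0 for col in range(width)]

matrix = [binary_row(seq, width) for seq in sequences]
for row in matrix:
    print(row)
```

The rows printed by this sketch line up with the binary matrix shown above, one column per index from 0 to 7.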


# Example 2: sequences_to_matrix with mode = count

In [None]:
# define 4 documents
docs =['Machine Learning Knowledge',
       'Machine Learning and Deep Learning',
      'Deep Learning',
      'Artificial Intelligence']

# create the tokenizer
t = Tokenizer()

t.fit_on_texts(docs)

sequences = t.texts_to_sequences(docs)

encoded_docs = t.sequences_to_matrix(sequences, mode='count')
print(encoded_docs)

[[0. 1. 1. 0. 1. 0. 0. 0.]
 [0. 2. 1. 1. 0. 1. 0. 0.]
 [0. 1. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 1.]]


# Example 3: sequences_to_matrix with mode = tfidf

In [None]:
# define 4 documents
docs =['Machine Learning Knowledge',
      'Machine Learning and Deep Learning',
      'Deep Learning',
      'Artificial Intelligence']

# create the tokenizer
t = Tokenizer()

t.fit_on_texts(docs)

sequences = t.texts_to_sequences(docs)

encoded_docs = t.sequences_to_matrix(sequences, mode='tfidf')
print(encoded_docs)

[[0.         0.69314718 0.84729786 0.         1.09861229 0.
  0.         0.        ]
 [0.         1.17360019 0.84729786 0.84729786 0.         1.09861229
  0.         0.        ]
 [0.         0.69314718 0.         0.84729786 0.         0.
  0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  1.09861229 1.09861229]]


# Example 4: sequences_to_matrix with mode = freq

In [None]:
# define 4 documents
docs =['Machine Learning Knowledge',
       'Machine Learning and Deep Learning',
       'Deep Learning',
       'Artificial Intelligence']

# create the tokenizer
t = Tokenizer()

t.fit_on_texts(docs)

sequences = t.texts_to_sequences(docs)

encoded_docs = t.sequences_to_matrix(sequences, mode='freq')
print(encoded_docs)

[[0.         0.33333333 0.33333333 0.         0.33333333 0.
  0.         0.        ]
 [0.         0.4        0.2        0.2        0.         0.2
  0.         0.        ]
 [0.         0.5        0.         0.5        0.         0.
  0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.5        0.5       ]]

