# text_to_word_sequence

- The text_to_word_sequence() function splits text into a list of words. By default, it does three things:
    - Splits words by space.
    - Filters out punctuation.
    - Converts text to lowercase (lower=True).

In [4]:
from keras.preprocessing.text import text_to_word_sequence
text = 'The quick brown fox jumped over the lazy dog!!.'
result = text_to_word_sequence(text)
print(result)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
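
The three default steps can be reproduced in plain Python. This is only a rough sketch, not Keras's implementation: simple_word_sequence is a hypothetical helper, and it filters every character in string.punctuation, which may not match Keras's default filter list exactly.

```python
import string

def simple_word_sequence(text, lower=True, split=" "):
    # Hypothetical helper that mimics the three default steps described above.
    if lower:
        text = text.lower()
    # Replace each punctuation character with the split character.
    table = str.maketrans({ch: split for ch in string.punctuation})
    text = text.translate(table)
    # Split, then drop the empty strings left by repeated separators.
    return [w for w in text.split(split) if w]

words = simple_word_sequence('The quick brown fox jumped over the lazy dog!!.')
print(words)
```

Passing lower=False to the sketch keeps the original casing, which corresponds to the lower argument mentioned in the list above.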


# one_hot

##### One-hot encode a text into a list of word indexes in a vocabulary of size n.

- Return: List of integers in [1, n]. Each integer encodes a word (uniqueness is not guaranteed: different words can hash to the same integer).

- Arguments: Same as text_to_word_sequence above, plus:

     - n: int. Size of vocabulary.

In [11]:
from keras.preprocessing.text import one_hot
text1 = 'The quick brown fox jumped over the lazy dog!!.'
result1 = one_hot(text1,20)
print(result1)

[5, 16, 18, 19, 13, 19, 5, 3, 12]
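
In the output above, 'fox' and 'over' both map to 19, which is the kind of collision the "uniqueness not guaranteed" note refers to. The sketch below illustrates the hashing idea with a stable hash (MD5) so that results are repeatable. stable_one_hot is a hypothetical helper, and the formula hash % (n - 1) + 1 is an assumption about how words are mapped into the vocabulary range, so the integers it produces will not match the Keras output. If the default hash function is Python's built-in hash(), the exact integers may also change between interpreter sessions.

```python
import hashlib
import string

def stable_one_hot(text, n):
    # Hypothetical helper: hash each word into a fixed range of integers.
    # Assumption: index = hash % (n - 1) + 1, so 0 is never used.
    table = str.maketrans({ch: ' ' for ch in string.punctuation})
    words = text.lower().translate(table).split()
    indexes = []
    for word in words:
        digest = hashlib.md5(word.encode('utf-8')).hexdigest()
        indexes.append(int(digest, 16) % (n - 1) + 1)
    return indexes

encoded = stable_one_hot('The quick brown fox jumped over the lazy dog!!.', 20)
print(encoded)
# Both occurrences of 'the' always receive the same integer.
print(encoded[0] == encoded[6])
```

A larger n makes collisions between different words less likely, at the cost of a larger index space.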


# Tokenizer

- Class for vectorizing texts and/or turning texts into sequences (a sequence is a list of word indexes, where the word of rank i in the dataset, starting at 1, has index i).

- Arguments: Same as text_to_word_sequence above, plus:

    - nb_words: None or int. Maximum number of words to work with. If set, tokenization is restricted to the nb_words most common words in the dataset. (Keras 2 renamed this argument to num_words.)

## Tokenizer Methods 

### fit_on_texts 

- Arguments:
    - texts: list of texts to train on.

In [18]:
from keras.preprocessing.text import Tokenizer
t = Tokenizer()
# The fitted texts define the vocabulary: each unique word gets an integer index.
# Here: the = 1, earth = 2, is = 3, an = 4, awesome = 5, place = 6, live = 7
fit_text = ["The earth is an awesome place live"]
t.fit_on_texts(fit_text)
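
To make the index assignment concrete, here is a simplified sketch of how a rank-based word index can be built: count the words, sort by frequency, and number them from 1. build_word_index is a hypothetical helper, not the Tokenizer's code. It assumes that ties keep first-seen order, and it ignores punctuation filtering.

```python
from collections import Counter

def build_word_index(texts):
    # Hypothetical helper: count every lowercased, space-separated word.
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Most frequent first. sorted() is stable, so words with equal counts
    # keep the order in which they were first seen (Python 3.7+).
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    # Indexes start at 1; 0 is left unused.
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

word_index = build_word_index(["The earth is an awesome place live"])
print(word_index)
```

Because every word in this sentence occurs once, the indexes simply follow word order. With the texts 'Good work' and 'nice work', 'work' would get index 1 because it is the most frequent word.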

### texts_to_sequences(texts)

- Arguments:
    - texts: list of texts to turn into sequences.
- Return: list of sequences (one per text input).

In [19]:
# As with fit_on_texts, pass a list of texts, even for a single sentence.
test_text1 = "The earth is an great place live"
test_text2 = "The is my program"
sequences = t.texts_to_sequences([test_text1, test_text2])

print('sequences : ',sequences,'\n')

sequences :  [[1, 2, 3, 4, 6, 7], [1, 3]] 



- Integers assigned from the training text: the = 1, earth = 2, is = 3, an = 4, awesome = 5, place = 6, live = 7.
- test_text1 contains the new word 'great'. It was not in the training text, so it has no index and is skipped in the output sequence.
- Likewise, 'my' and 'program' in test_text2 were not in the training text, so only 'the' (1) and 'is' (3) are encoded.
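
The skipping of unseen words can be sketched as a dictionary lookup that drops any word without an index. to_sequences is a hypothetical helper written for illustration, and the word_index literal is copied from the integers listed above.

```python
# Index learned from the training sentence in the example above.
word_index = {'the': 1, 'earth': 2, 'is': 3, 'an': 4,
              'awesome': 5, 'place': 6, 'live': 7}

def to_sequences(texts, word_index):
    # Hypothetical helper: keep only the words that have an index.
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

sequences = to_sequences(["The earth is an great place live",
                          "The is my program"], word_index)
print(sequences)  # [[1, 2, 3, 4, 6, 7], [1, 3]]
```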

### texts_to_sequences_generator(texts): generator version of the above.

- Return: yield one sequence per input text. 
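
A generator version produces each sequence only when it is requested, which can help when the list of texts is large. The sketch below shows the idea in plain Python. sequence_generator and the small word_index are hypothetical and are not taken from Keras.

```python
def sequence_generator(texts, word_index):
    # Hypothetical helper: yield one sequence per text, lazily.
    for text in texts:
        yield [word_index[w] for w in text.lower().split() if w in word_index]

word_index = {'the': 1, 'earth': 2, 'is': 3}
gen = sequence_generator(["The earth is round", "Is the earth flat"], word_index)

first = next(gen)  # only the first text has been processed so far
rest = list(gen)   # consume the remaining texts
print(first, rest)  # [1, 2, 3] [[3, 1, 2]]
```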

### texts_to_matrix(texts):

- Return: numpy array of shape (len(texts), nb_words).
- Arguments:
    - texts: list of texts to vectorize.
    - mode: one of "binary", "count", "tfidf", "freq" (default: "binary").

In [27]:
t = Tokenizer()
# The fitted texts define the vocabulary: the = 1, earth = 2, is = 3, an = 4, awesome = 5, place = 6, live = 7
docs = ["The earth is an awesome place live"]
t.fit_on_texts(docs)
# mode='count' stores how many times each word occurs; column 0 is reserved, so it stays 0.
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)

[[0. 1. 1. 1. 1. 1. 1. 1.]]
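
The output has 8 columns for 7 words because index 0 is never assigned to a word. The following plain-Python sketch builds the same kind of count matrix as nested lists. count_matrix is a hypothetical helper, and it assumes that the number of columns is the vocabulary size plus one when no word limit is set.

```python
def count_matrix(texts, word_index):
    # Hypothetical helper: one row per text, one column per word index.
    num_columns = len(word_index) + 1  # column 0 stays unused
    matrix = []
    for text in texts:
        row = [0.0] * num_columns
        for w in text.lower().split():
            if w in word_index:
                row[word_index[w]] += 1.0
        matrix.append(row)
    return matrix

word_index = {'the': 1, 'earth': 2, 'is': 3, 'an': 4,
              'awesome': 5, 'place': 6, 'live': 7}
rows = count_matrix(["The earth is an awesome place live", "the earth the"],
                    word_index)
for row in rows:
    print(row)
```

The second text shows why the mode is called "count": 'the' appears twice, so column 1 holds 2.0 instead of 1.0.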


### fit_on_sequences(sequences):

- Arguments:
    - sequences: list of sequences to train on. 

### sequences_to_matrix(sequences):

- Return: numpy array of shape (len(sequences), nb_words).
- Arguments:
    - sequences: list of sequences to vectorize.
    - mode: one of "binary", "count", "tfidf", "freq" (default: "binary").
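
The difference between the modes is easiest to see on a small input. The sketch below covers "binary", "count" and "freq" in plain Python and leaves out "tfidf". sequences_to_rows is a hypothetical helper, and treating "freq" as count divided by sequence length is an assumption about that mode.

```python
from collections import Counter

def sequences_to_rows(sequences, num_columns, mode="binary"):
    # Hypothetical helper: turn lists of word indexes into fixed-width rows.
    rows = []
    for seq in sequences:
        row = [0.0] * num_columns
        for index, count in Counter(seq).items():
            if index >= num_columns:
                continue  # index outside the allowed vocabulary
            if mode == "binary":
                row[index] = 1.0
            elif mode == "count":
                row[index] = float(count)
            elif mode == "freq":
                row[index] = count / len(seq)  # assumed definition
            else:
                raise ValueError("unsupported mode: " + mode)
        rows.append(row)
    return rows

seqs = [[1, 2, 1, 3], [2]]
binary = sequences_to_rows(seqs, 4, mode="binary")
counts = sequences_to_rows(seqs, 4, mode="count")
freq = sequences_to_rows(seqs, 4, mode="freq")
print(binary)
print(counts)
print(freq)
```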

## Attributes:

- word_counts: dictionary mapping words (str) to the number of times they appeared during fit. Only set after fit_on_texts was called.
- word_docs: dictionary mapping words (str) to the number of documents/texts they appeared in during fit. Only set after fit_on_texts was called.
- word_index: dictionary mapping words (str) to their rank/index (int). Only set after fit_on_texts was called.
- document_count: int. Number of documents (texts/sequences) the tokenizer was trained on. Only set after fit_on_texts or fit_on_sequences was called.

In [29]:
docs = ['Well done!',
'Good work',
'Great effort',
'nice work',
'Excellent!']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
# summarize what was learned
print(t.word_counts)
#print(t.word_docs)
#print(t.word_index)
#print(t.document_count)

OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 2), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)])


In [30]:
print(t.word_docs)

defaultdict(<class 'int'>, {'well': 1, 'done': 1, 'good': 1, 'work': 2, 'effort': 1, 'great': 1, 'nice': 1, 'excellent': 1})


In [31]:
print(t.word_index)

{'work': 1, 'well': 2, 'done': 3, 'good': 4, 'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}


In [32]:
print(t.document_count)

5
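
The difference between word_counts and word_docs can be reproduced with two counters: one updated with every word, the other with each document's set of unique words. This is a plain-Python sketch, not the Tokenizer's code. tokenize is a hypothetical helper that strips the characters in string.punctuation.

```python
import string
from collections import Counter

def tokenize(text):
    # Hypothetical helper: lowercase, replace punctuation with spaces, split.
    table = str.maketrans({ch: ' ' for ch in string.punctuation})
    return text.lower().translate(table).split()

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']

word_counts = Counter()  # total occurrences of each word
word_docs = Counter()    # number of documents containing each word
for doc in docs:
    words = tokenize(doc)
    word_counts.update(words)
    word_docs.update(set(words))
document_count = len(docs)

print(word_counts['work'], word_docs['work'], document_count)  # 2 2 5
```

In these documents no word repeats within a single text, so both counters agree. A text such as 'work work' would add 2 to word_counts['work'] but only 1 to word_docs['work'].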
