# How to Prepare Text Data for Deep Learning with Keras

You cannot feed raw text directly into deep learning models.

Text data must be encoded as numbers to be used as input or output for machine learning and deep learning models.

The Keras deep learning library provides some basic tools to help you prepare your text data.

In this tutorial, you will discover how you can use Keras to prepare your text data.

# 1. Integer Encoding for RNNs

Keras provides an API for preparing text that can be fit once and reused to prepare multiple text documents. This may be the preferred approach for large projects.

Keras provides the Tokenizer class for preparing text documents for deep learning. The Tokenizer must be constructed and then fit on either raw text documents or integer encoded text documents.



In [1]:
import numpy as np

docs = ['go india',
		'india india',
		'hip hip hurray',
		'jeetega bhai jeetega india jeetega',
		'bharat mata ki jai',
		'kohli kohli',
		'sachin sachin',
		'dhoni dhoni',
		'modi ji ki jai',
		'inquilab zindabad']

In [2]:
# import Tokenizer
from keras.preprocessing.text import Tokenizer

# create the tokenizer
tokenizer = Tokenizer(oov_token='<nothing>')    # oov_token: placeholder substituted for out-of-vocabulary words, i.e. words that were not seen when the tokenizer was fit



In [3]:
#  fit the tokenizer on the documents
tokenizer.fit_on_texts(docs)

- Once fit, the Tokenizer provides **4 attributes** that you can use to query what has been learned about your documents:

     - **word_counts:** A dictionary of words and their counts.
     - **word_docs:** A dictionary of words and how many documents each appeared in.
     - **word_index:** A dictionary of words and their uniquely assigned integers.
     - **document_count:** An integer count of the total number of documents that were used to fit the Tokenizer.
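To make the four attributes concrete, here is a rough pure-Python approximation of what fitting records. The helper `fit_vocabulary` is hypothetical and not part of Keras; it only lowercases and splits on whitespace, whereas the real Tokenizer also filters punctuation. It assumes that indices are assigned by descending frequency, with ties kept in first-seen order, which is consistent with the output shown in this tutorial.

```python
from collections import Counter, OrderedDict

docs = ['go india', 'india india', 'hip hip hurray',
        'jeetega bhai jeetega india jeetega', 'bharat mata ki jai',
        'kohli kohli', 'sachin sachin', 'dhoni dhoni',
        'modi ji ki jai', 'inquilab zindabad']

def fit_vocabulary(texts, oov_token=None):
    """Hypothetical helper approximating what fit_on_texts records."""
    word_counts = OrderedDict()   # word -> total occurrences in the corpus
    word_docs = Counter()         # word -> number of documents containing it
    for text in texts:
        words = text.lower().split()
        for w in words:
            word_counts[w] = word_counts.get(w, 0) + 1
        word_docs.update(set(words))   # count each word once per document
    # Most frequent words get the smallest indices; sorted() is stable,
    # so words with equal counts keep their first-seen order.
    ranked = sorted(word_counts, key=lambda w: word_counts[w], reverse=True)
    vocab = ([oov_token] if oov_token is not None else []) + ranked
    # Indices start at 1 because 0 is left free for padding.
    word_index = {w: i for i, w in enumerate(vocab, start=1)}
    return word_counts, dict(word_docs), word_index, len(texts)

word_counts, word_docs, word_index, document_count = fit_vocabulary(
    docs, oov_token='<nothing>')
print(word_index['india'], word_counts['india'], word_docs['india'], document_count)
# 2 4 3 10
```

Here 'india' occurs four times in total but in only three documents, which is the difference between `word_counts` and `word_docs`.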

In [4]:
# index of vocabulary
tokenizer.word_index 

{'<nothing>': 1,
 'india': 2,
 'jeetega': 3,
 'hip': 4,
 'ki': 5,
 'jai': 6,
 'kohli': 7,
 'sachin': 8,
 'dhoni': 9,
 'go': 10,
 'hurray': 11,
 'bhai': 12,
 'bharat': 13,
 'mata': 14,
 'modi': 15,
 'ji': 16,
 'inquilab': 17,
 'zindabad': 18}

In [5]:
# total number of times each word appears in the corpus
tokenizer.word_counts

OrderedDict([('go', 1),
             ('india', 4),
             ('hip', 2),
             ('hurray', 1),
             ('jeetega', 3),
             ('bhai', 1),
             ('bharat', 1),
             ('mata', 1),
             ('ki', 2),
             ('jai', 2),
             ('kohli', 2),
             ('sachin', 2),
             ('dhoni', 2),
             ('modi', 1),
             ('ji', 1),
             ('inquilab', 1),
             ('zindabad', 1)])

In [6]:
# total number of documents used to fit the tokenizer
tokenizer.document_count

10

In [7]:
#  A dictionary of words and how many documents each appeared in.
tokenizer.word_docs

defaultdict(int,
            {'go': 1,
             'india': 3,
             'hurray': 1,
             'hip': 1,
             'jeetega': 1,
             'bhai': 1,
             'mata': 1,
             'jai': 2,
             'bharat': 1,
             'ki': 2,
             'kohli': 1,
             'sachin': 1,
             'dhoni': 1,
             'ji': 1,
             'modi': 1,
             'zindabad': 1,
             'inquilab': 1})

In [21]:
# convert the text to integers based on the vocabulary index
# using tokenizer.texts_to_sequences
sequences = tokenizer.texts_to_sequences(docs)
sequences

[[10, 2],
 [2, 2],
 [4, 4, 11],
 [3, 12, 3, 2, 3],
 [13, 14, 5, 6],
 [7, 7],
 [8, 8],
 [9, 9],
 [15, 16, 5, 6],
 [17, 18]]

In [22]:
# the documents have different lengths
# pad them all to the same length
# using pad_sequences
from keras.utils import pad_sequences
sequences = pad_sequences(sequences,padding='post')    # padding='post' appends zeros after the tokens; the default, 'pre', puts them in front

In [23]:
sequences

array([[10,  2,  0,  0,  0],
       [ 2,  2,  0,  0,  0],
       [ 4,  4, 11,  0,  0],
       [ 3, 12,  3,  2,  3],
       [13, 14,  5,  6,  0],
       [ 7,  7,  0,  0,  0],
       [ 8,  8,  0,  0,  0],
       [ 9,  9,  0,  0,  0],
       [15, 16,  5,  6,  0],
       [17, 18,  0,  0,  0]], dtype=int32)
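The padding step can be sketched in plain Python. The function `pad` below is a hypothetical stand-in for `pad_sequences`, not the Keras implementation: it returns nested lists rather than a NumPy array, and it mirrors only the `maxlen`, `padding`, `truncating` and `value` arguments, with 'pre' as the default for both padding and truncating.

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Hypothetical pure-Python stand-in for pad_sequences."""
    if maxlen is None:
        # Without maxlen, every sequence is padded to the longest one.
        maxlen = max(len(s) for s in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops tokens from the start, 'post' drops them from the end.
            seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(seq + fill if padding == 'post' else fill + seq)
    return padded

encoded = [[10, 2], [4, 4, 11], [3, 12, 3, 2, 3]]
post_padded = pad(encoded, padding='post')        # zeros appended
pre_padded = pad(encoded)                         # zeros in front (default)
clipped = pad(encoded, maxlen=3, padding='post')  # long sequences are cut
print(post_padded)
print(pre_padded)
print(clipped)
```

Note that with `maxlen=3` the five-token document keeps its last three tokens, because truncation defaults to 'pre' even when padding is 'post'.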

In [9]:
from keras.datasets import imdb
from keras import Sequential
from keras.layers import Dense,SimpleRNN,Flatten

In [10]:
# load the IMDB data
(X_train,y_train),(X_test,y_test) = imdb.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [11]:
print(X_train.shape)

(25000,)


In [12]:
X_train

array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
       list([1, 194, 1

**This data is already preprocessed and integer encoded**

In [27]:
print(len(X_train[0]))

218


In [28]:
print(len(X_train[1]))  # the reviews have different lengths

189


In [29]:
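# bring every review to the same length
# with pad_sequences
X_train = pad_sequences(X_train,padding='post',maxlen=50)
X_test = pad_sequences(X_test,padding='post',maxlen=50)  # maxlen=50: every review ends up with exactly 50 tokens; longer reviews are truncated from the start by default (truncating='pre')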

In [30]:
print(X_train.shape)  # compare with the shape before padding, (25000,)

(25000, 50)


In [31]:
print(X_train[0].shape)
print(X_train[1].shape)
print(X_train[2].shape)  # every review now has the same length


(50,)
(50,)
(50,)


**input_shape=(time_step,input_feature)**
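To see why the input has the shape (time steps, features), here is a minimal sketch of a simple recurrent layer in plain Python. It assumes the usual recurrence h_t = tanh(W x_t + U h_(t-1) + b) and returns only the final state, as with `return_sequences=False`. The function name, the weights and the two-unit size are made up for illustration; they are not taken from a trained model.

```python
import math

def simple_rnn_last_state(sequence, W, U, b):
    """Apply h_t = tanh(W x_t + U h_(t-1) + b) step by step; return the last h.

    sequence: list of time steps, each a list of input features
    W: units x features, U: units x units, b: list of length units
    """
    units = len(b)
    h = [0.0] * units                      # initial hidden state
    for x in sequence:                     # one iteration per time step
        h = [math.tanh(sum(W[i][j] * x[j] for j in range(len(x)))
                       + sum(U[i][k] * h[k] for k in range(units))
                       + b[i])
             for i in range(units)]
    return h

tokens = [10, 2, 0, 0, 0]                  # one padded document
sequence = [[float(t)] for t in tokens]    # shape (5, 1): 5 time steps, 1 feature
W = [[0.1], [-0.2]]                        # illustrative weights for 2 units
U = [[0.0, 0.1], [0.1, 0.0]]
b = [0.0, 0.0]
state = simple_rnn_last_state(sequence, W, U, b)
print(len(state))                          # one value per unit
```

Each integer is fed in as a single feature at its own time step, which is why a padded review of length 50 corresponds to `input_shape=(50,1)`.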

In [32]:
# model 
model = Sequential()

model.add(SimpleRNN(32,input_shape=(50,1),return_sequences=False))   # 32 units; input_shape is (time steps, features per step)
model.add(Dense(1,activation='sigmoid'))

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn (SimpleRNN)      (None, 32)                1088      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 1,121
Trainable params: 1,121
Non-trainable params: 0
_________________________________________________________________
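The parameter counts in the summary can be checked by hand. The sketch below assumes that a simple recurrent layer has one weight per (unit, input feature) pair, one per (unit, unit) pair for the recurrent connection, and one bias per unit, and that a dense layer has one weight per (input, output) pair plus one bias per output. The helper names are hypothetical.

```python
def simple_rnn_params(units, features):
    # input weights + recurrent weights + biases
    return units * features + units * units + units

def dense_params(inputs, outputs):
    # weights + biases
    return inputs * outputs + outputs

rnn = simple_rnn_params(units=32, features=1)
dense = dense_params(inputs=32, outputs=1)
print(rnn, dense, rnn + dense)  # 1088 33 1121
```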


In [33]:
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

model.fit(X_train,y_train,epochs=5,validation_data=(X_test,y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f88a51f3310>

# 2. Embedding Encoding

A word embedding is a learned representation for text in which words that have the same meaning have a similar representation.

- This approach to representing words and documents may be considered one of the key breakthroughs of deep learning on challenging NLP problems.

- Word embeddings are an alternative to one-hot encoding that also reduces dimensionality.


**`One-hot word vectors`**: sparse, high-dimensional and hard-coded; they carry no semantic meaning

**`Word embeddings`**: dense, lower-dimensional and learned from the data


- The Keras library has an `Embedding` layer that learns a word representation for a given text corpus:

     `tf.keras.layers.Embedding(input_dim, output_dim, embeddings_initializer='uniform', embeddings_regularizer=None, activity_regularizer=None, embeddings_constraint=None, mask_zero=False, input_length=None, **kwargs)`

`Key Arguments:`

`input_dim` — the size of vocabulary or length of the word index

`output_dim` — Output dimension of word representation

`input_length` — max input sequence length of the document
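Conceptually, an embedding layer is a lookup table with `input_dim` rows and `output_dim` columns. The following sketch uses made-up values rather than learned weights, and shows that selecting a row by index gives the same result as multiplying a one-hot vector by the table, without ever building the sparse vector. All names in it are hypothetical.

```python
vocab_size, output_dim = 4, 2

# Made-up embedding table: one dense row of length output_dim per index.
embedding_matrix = [[0.0, 0.0],
                    [0.1, 0.3],
                    [-0.2, 0.5],
                    [0.4, -0.1]]

def one_hot(index, size):
    """Sparse, hard-coded representation: a single 1 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed(index):
    """Dense representation: simply pick the row for this index."""
    return embedding_matrix[index]

def embed_via_one_hot(index):
    """Same result, computed as one_hot(index) times the embedding table."""
    vec = one_hot(index, vocab_size)
    return [sum(vec[r] * embedding_matrix[r][c] for r in range(vocab_size))
            for c in range(output_dim)]

sentence = [2, 1, 0]   # integer-encoded words, 0 being the padding index
print([embed(i) for i in sentence])
```

Every occurrence of the same integer maps to the same vector, which is why the padded zeros in a sequence all produce identical rows in the layer's output.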


The Embedding layer requires integer-encoded input, so the text has to be tokenized first.


In [34]:
docs = ['go india',
		'india india',
		'hip hip hurray',
		'jeetega bhai jeetega india jeetega',
		'bharat mata ki jai',
		'kohli kohli',
		'sachin sachin',
		'dhoni dhoni',
		'modi ji ki jai',
		'inquilab zindabad']

In [35]:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()

In [36]:
tokenizer.fit_on_texts(docs)

In [37]:
len(tokenizer.word_index)

17

In [38]:
sequences = tokenizer.texts_to_sequences(docs)
sequences

[[9, 1],
 [1, 1],
 [3, 3, 10],
 [2, 11, 2, 1, 2],
 [12, 13, 4, 5],
 [6, 6],
 [7, 7],
 [8, 8],
 [14, 15, 4, 5],
 [16, 17]]

In [39]:
from keras.utils import pad_sequences
sequences = pad_sequences(sequences,padding='post')
sequences

array([[ 9,  1,  0,  0,  0],
       [ 1,  1,  0,  0,  0],
       [ 3,  3, 10,  0,  0],
       [ 2, 11,  2,  1,  2],
       [12, 13,  4,  5,  0],
       [ 6,  6,  0,  0,  0],
       [ 7,  7,  0,  0,  0],
       [ 8,  8,  0,  0,  0],
       [14, 15,  4,  5,  0],
       [16, 17,  0,  0,  0]], dtype=int32)

In [40]:
from keras.layers import Dense,SimpleRNN,Embedding,Flatten

In [41]:
model = Sequential()
model.add(Embedding(18,output_dim=2,input_length=5))         # 18 is the vocabulary size (17 words + 1 for the padding index 0); output_dim=2 represents each word as a 2-dimensional vector;
                                                             # input_length=5 is the padded length of each document
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 5, 2)              36        
                                                                 
Total params: 36
Trainable params: 36
Non-trainable params: 0
_________________________________________________________________


In [42]:
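model.compile(optimizer='adam',metrics=['accuracy'])   # use keyword arguments: the second positional argument of compile is the loss, not a metric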

In [43]:
pred = model.predict(sequences)
print(pred)

[[[-0.03529511 -0.04260511]
  [-0.03166655  0.02645114]
  [-0.00296923  0.03606423]
  [-0.00296923  0.03606423]
  [-0.00296923  0.03606423]]

 [[-0.03166655  0.02645114]
  [-0.03166655  0.02645114]
  [-0.00296923  0.03606423]
  [-0.00296923  0.03606423]
  [-0.00296923  0.03606423]]

 [[ 0.0192664  -0.0036878 ]
  [ 0.0192664  -0.0036878 ]
  [ 0.04702375 -0.01738302]
  [-0.00296923  0.03606423]
  [-0.00296923  0.03606423]]

 [[-0.02891405 -0.00236194]
  [-0.02816381 -0.01025749]
  [-0.02891405 -0.00236194]
  [-0.03166655  0.02645114]
  [-0.02891405 -0.00236194]]

 [[-0.00816128 -0.04623958]
  [ 0.0192096  -0.02086746]
  [ 0.04635206 -0.024409  ]
  [ 0.04513386  0.0170216 ]
  [-0.00296923  0.03606423]]

 [[ 0.0280334  -0.02030165]
  [ 0.0280334  -0.02030165]
  [-0.00296923  0.03606423]
  [-0.00296923  0.03606423]
  [-0.00296923  0.03606423]]

 [[ 0.03160164  0.02577854]
  [ 0.03160164  0.02577854]
  [-0.00296923  0.03606423]
  [-0.00296923  0.03606423]
  [-0.00296923  0.03606423]]

 [[-0.

In [14]:
from keras.datasets import imdb
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras import Sequential
from keras.layers import Dense,SimpleRNN,Embedding,Flatten

In [15]:
(X_train,y_train),(X_test,y_test) = imdb.load_data()

In [16]:
X_train = pad_sequences(X_train,padding='post',maxlen=50)
X_test = pad_sequences(X_test,padding='post',maxlen=50)

In [17]:
X_train.shape

(25000, 50)

In [26]:
X_train.max()

88585
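The largest index in the data determines `input_dim`: the embedding table needs one row for every integer from 0 up to the maximum, so its size is the maximum plus one. A small sketch with hypothetical helper names, using the padded toy documents from earlier in this section:

```python
def embedding_input_dim(sequences):
    """Smallest table size that has a row for every index, including 0."""
    return max(max(seq) for seq in sequences) + 1

def embedding_params(input_dim, output_dim):
    """One trainable value per (index, dimension) pair."""
    return input_dim * output_dim

toy = [[9, 1, 0, 0, 0], [2, 11, 2, 1, 2], [16, 17, 0, 0, 0]]
toy_dim = embedding_input_dim(toy)
print(toy_dim, embedding_params(toy_dim, 2))   # 18 36

imdb_dim = 88585 + 1                           # X_train.max() + 1
print(imdb_dim, embedding_params(imdb_dim, 2)) # 88586 177172
```

Passing a smaller `input_dim` than this would leave some indices without a row in the table.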

In [29]:
model = Sequential()
model.add(Embedding(88586, output_dim=2,input_length=50))   # input_dim = X_train.max() + 1, so every index from 0 to 88585 has a row
model.add(SimpleRNN(32,return_sequences=False))
model.add(Dense(1, activation='sigmoid'))

model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 50, 2)             177172    
                                                                 
 simple_rnn_4 (SimpleRNN)    (None, 32)                1120      
                                                                 
 dense_4 (Dense)             (None, 1)                 33        
                                                                 
Total params: 178,325
Trainable params: 178,325
Non-trainable params: 0
_________________________________________________________________


In [30]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.fit(X_train, y_train,epochs=5,validation_data=(X_test,y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x79b0a0301630>