### Pipeline: text -> vocabulary -> tokenizer -> word index -> sentences replaced by indices -> padding
### How the tokenizer works:
1. Input data preparation: pass the corpus (a list of texts) to the **Tokenizer**.
2. The tokenizer first splits each text into individual words (tokens).
3. It counts how often each word occurs across the entire corpus.
4. Words are sorted by frequency, highest first. Words with equal counts keep the order in which they first appeared in the corpus, not alphabetical order (see `hip`, `ki`, `jai` in the word index printed later).
5. Each word is assigned an integer index starting from 1. Index 0 is never assigned to a word, so it stays free for padding. The result is the word index.
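
The steps above can be sketched in plain Python. This is only an illustration of the idea, not the Keras implementation: `build_word_index` is a hypothetical helper, and it assumes simple whitespace splitting with lowercasing.

```python
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper mimicking steps 2-5 (not the Keras source)."""
    counts = Counter()
    for text in texts:
        # step 2 and 3: split into tokens and count them across the corpus
        counts.update(text.lower().split())
    # step 4: sort by frequency, highest first; sorted() is stable,
    # so words with equal counts keep their first-seen order
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # step 5: indices start at 1, leaving 0 free for padding
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

sample = ['go india', 'india india', 'hip hip hurray']
word_index = build_word_index(sample)
print(word_index)
```

Here `india` (3 occurrences) gets index 1 and `hip` (2 occurrences) gets index 2, while `go` and `hurray` tie at one occurrence and are indexed in the order they first appeared.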

# Key Points
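1. Case sensitivity: by default the tokenizer lowercases all words; pass `lower=False` to keep the original case.
2. Word filters: the `filters` parameter is a string of characters that are stripped from the text before splitting. The default removes punctuation and special characters (plus tabs and newlines) but keeps digits and the apostrophe.
3. Unseen words: words not in the training corpus are absent from the word index. By default `texts_to_sequences` silently drops them; if `oov_token` is set, they are mapped to that token's index instead (the OOV token takes index 1 and every other word shifts up by one).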
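
To make the three points concrete, here is a small pure-Python imitation of the behaviour. It is a sketch under assumptions: `to_ids` and `FILTERS` are hypothetical names, the filter string is meant to mirror the Keras default, and the tiny `word_index` is made up for the example.

```python
# Assumed to mirror the default `filters` string of the Keras Tokenizer
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def to_ids(text, word_index, lower=True, filters=FILTERS, oov_index=None):
    """Hypothetical helper: turn one text into a list of word ids."""
    if lower:
        text = text.lower()
    # replace every filtered character with a space, then split on whitespace
    text = text.translate(str.maketrans(filters, ' ' * len(filters)))
    ids = []
    for word in text.split():
        if word in word_index:
            ids.append(word_index[word])
        elif oov_index is not None:
            ids.append(oov_index)   # unseen word mapped to the OOV id
        # otherwise the unseen word is silently dropped
    return ids

# made-up index in which the OOV token occupies index 1
word_index = {'<OOV>': 1, 'india': 2, 'go': 3}

filtered = to_ids('Go, India!', word_index)
dropped = to_ids('go pakistan india', word_index)
with_oov = to_ids('go pakistan india', word_index, oov_index=1)
case_kept = to_ids('Go India', word_index, lower=False)
print(filtered, dropped, with_oov, case_kept)
```

With `lower=False` neither `Go` nor `India` matches the lowercase entries in the index, so nothing is returned for that text.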







In [2]:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()

In [1]:
docs = ['go india',
        'india india',
        'hip hip hurray',
        'jeetega bhai jeetega india jeetega',
        'bharat mata ki jai',
        'kohli kohli',
        'sachin sachin',
        'dhoni dhoni',
        'modi ji ki jai',
        'inquilab zindabad']

In [3]:
tokenizer.fit_on_texts(docs)

In [4]:
tokenizer.word_index

{'india': 1,
 'jeetega': 2,
 'hip': 3,
 'ki': 4,
 'jai': 5,
 'kohli': 6,
 'sachin': 7,
 'dhoni': 8,
 'go': 9,
 'hurray': 10,
 'bhai': 11,
 'bharat': 12,
 'mata': 13,
 'modi': 14,
 'ji': 15,
 'inquilab': 16,
 'zindabad': 17}

In [5]:
tokenizer.index_word

{1: 'india',
 2: 'jeetega',
 3: 'hip',
 4: 'ki',
 5: 'jai',
 6: 'kohli',
 7: 'sachin',
 8: 'dhoni',
 9: 'go',
 10: 'hurray',
 11: 'bhai',
 12: 'bharat',
 13: 'mata',
 14: 'modi',
 15: 'ji',
 16: 'inquilab',
 17: 'zindabad'}

In [6]:
tokenizer.index_docs

defaultdict(int,
            {9: 1,
             1: 3,
             10: 1,
             3: 1,
             11: 1,
             2: 1,
             4: 2,
             5: 2,
             13: 1,
             12: 1,
             6: 1,
             7: 1,
             8: 1,
             15: 1,
             14: 1,
             16: 1,
             17: 1})

In [7]:
tokenizer.word_counts

OrderedDict([('go', 1),
             ('india', 4),
             ('hip', 2),
             ('hurray', 1),
             ('jeetega', 3),
             ('bhai', 1),
             ('bharat', 1),
             ('mata', 1),
             ('ki', 2),
             ('jai', 2),
             ('kohli', 2),
             ('sachin', 2),
             ('dhoni', 2),
             ('modi', 1),
             ('ji', 1),
             ('inquilab', 1),
             ('zindabad', 1)])

In [10]:
len(tokenizer.word_index)

17

In [12]:
sequences = tokenizer.texts_to_sequences(docs)
sequences

[[9, 1],
 [1, 1],
 [3, 3, 10],
 [2, 11, 2, 1, 2],
 [12, 13, 4, 5],
 [6, 6],
 [7, 7],
 [8, 8],
 [14, 15, 4, 5],
 [16, 17]]

In [14]:
from tensorflow.keras.utils import pad_sequences
# append zeros so every row is as long as the longest sequence (5 tokens here)
sequences = pad_sequences(sequences, padding='post')
sequences

array([[ 9,  1,  0,  0,  0],
       [ 1,  1,  0,  0,  0],
       [ 3,  3, 10,  0,  0],
       [ 2, 11,  2,  1,  2],
       [12, 13,  4,  5,  0],
       [ 6,  6,  0,  0,  0],
       [ 7,  7,  0,  0,  0],
       [ 8,  8,  0,  0,  0],
       [14, 15,  4,  5,  0],
       [16, 17,  0,  0,  0]], dtype=int32)
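
What `pad_sequences` does can be imitated in a few lines of plain Python. This is a sketch, not the library code: `pad` is a hypothetical helper whose defaults (`padding='pre'`, `truncating='pre'`, fill value 0) are chosen to match the Keras defaults.

```python
def pad(seqs, maxlen=None, padding='pre', truncating='pre', value=0):
    """Hypothetical helper imitating pad_sequences on lists of ints."""
    if maxlen is None:
        maxlen = max(len(s) for s in seqs)   # default: longest sequence
    out = []
    for s in seqs:
        s = list(s)
        if len(s) > maxlen:
            # 'pre' cuts from the front (keeps the last maxlen tokens)
            s = s[-maxlen:] if truncating == 'pre' else s[:maxlen]
        fill = [value] * (maxlen - len(s))
        out.append(s + fill if padding == 'post' else fill + s)
    return out

post_padded = pad([[9, 1], [3, 3, 10]], padding='post')
pre_padded = pad([[9, 1], [3, 3, 10]])
cut_front = pad([[1, 2, 3, 4, 5]], maxlen=3, padding='post')
cut_back = pad([[1, 2, 3, 4, 5]], maxlen=3, padding='post', truncating='post')
print(post_padded, pre_padded, cut_front, cut_back)
```

Note that `padding` and `truncating` are independent: with `padding='post'` and a `maxlen` shorter than the sequence, the default truncation still removes tokens from the front.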

# Model building for sentiment analysis

In [16]:
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN, Embedding,Flatten

In [17]:
# each review comes already encoded as a list of integer word ids
(X_train, y_train), (X_test, y_test) = imdb.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


In [19]:
# reviews have different lengths, so they must be padded before batching
len(X_train[0]), len(X_train[2])

(218, 141)

In [20]:
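# pad or truncate every review to exactly 50 tokens
# (truncating defaults to 'pre', so longer reviews keep only their last 50 tokens)
X_train = pad_sequences(X_train, padding='post', maxlen=50)
X_test = pad_sequences(X_test, padding='post', maxlen=50)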

In [21]:
len(X_train[0]), len(X_train[1])

(50, 50)

In [22]:
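model = Sequential()

# input_shape=(50, 1): each of the 50 timesteps carries one feature, the raw word id.
# The ids are fed in as plain numbers, with no Embedding layer in front of the RNN.
model.add(SimpleRNN(32, input_shape=(50, 1), return_sequences=False))
model.add(Dense(1, activation='sigmoid'))  # one sigmoid unit for binary sentiment

model.summary()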



In [23]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# the test split is used as validation data here
model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test))

Epoch 1/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 13s 14ms/step - accuracy: 0.5030 - loss: 0.7059 - val_accuracy: 0.5022 - val_loss: 0.6994
Epoch 2/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 11s 13ms/step - accuracy: 0.5011 - loss: 0.6934 - val_accuracy: 0.5078 - val_loss: 0.6937
Epoch 3/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 20s 13ms/step - accuracy: 0.5056 - loss: 0.6922 - val_accuracy: 0.5050 - val_loss: 0.6941
Epoch 4/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 11s 14ms/step - accuracy: 0.5163 - loss: 0.6924 - val_accuracy: 0.5079 - val_loss: 0.6946
Epoch 5/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 10s 13ms/step - accuracy: 0.5058 - loss: 0.6934 - val_accuracy: 0.5015 - val_loss: 0.6947


<keras.src.callbacks.history.History at 0x7f34a1f253c0>
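
The accuracy in the log above stays close to 0.50, which is chance level for a balanced binary task. One likely reason is that the RNN receives the raw word ids as numbers, although an id is only a label and its magnitude carries no meaning. A common alternative is to place an `Embedding` layer first, so each id is looked up as a learned vector. The sketch below only builds such a model and checks its shapes on random dummy ids; `vocab_size = 10000` and `embed_dim = 32` are assumed values, and no accuracy is claimed for it.

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, SimpleRNN

vocab_size = 10000  # assumption: number of distinct word ids kept
embed_dim = 32      # assumption: size of each learned word vector
maxlen = 50         # same padded length as in the notebook

model = Sequential([
    Input(shape=(maxlen,)),                                 # a row of 50 word ids
    Embedding(input_dim=vocab_size, output_dim=embed_dim),  # ids -> (50, 32) vectors
    SimpleRNN(32),                                          # final hidden state only
    Dense(1, activation='sigmoid'),                         # sentiment probability
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# random ids stand in for padded reviews, just to check the shapes
dummy_batch = np.random.randint(0, vocab_size, size=(4, maxlen))
probs = model.predict(dummy_batch, verbose=0)
print(probs.shape)
```

To train this on the real data, every id must be smaller than `vocab_size`, so the dataset would need to be loaded with `imdb.load_data(num_words=vocab_size)` and padded to `maxlen` as before.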