

# **Why We Use Integers in RNN**

---

## 1. What “Using Integer” Means in RNN

In RNNs (especially for NLP), we **do not feed words directly** to the network.

Instead:

* Each word/token is converted into an **integer ID**
* These integers represent positions in a vocabulary

Example:

```
"I love deep learning"
```

Vocabulary:

```
I → 1
love → 2
deep → 3
learning → 4
```

Sequence becomes:

```
[1, 2, 3, 4]
```
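The mapping and the resulting sequence above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `build_vocab` and `encode` are hypothetical, and IDs are assumed to start at 1 in order of first appearance, as in the example.

```python
def build_vocab(sentence):
    """Give each distinct token an integer ID, starting at 1, in first-seen order."""
    vocab = {}
    for token in sentence.split():
        if token not in vocab:
            vocab[token] = len(vocab) + 1
    return vocab


def encode(sentence, vocab):
    """Replace every token with its integer ID."""
    return [vocab[token] for token in sentence.split()]


vocab = build_vocab("I love deep learning")
ids = encode("I love deep learning", vocab)

print(vocab)  # {'I': 1, 'love': 2, 'deep': 3, 'learning': 4}
print(ids)    # [1, 2, 3, 4]
```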

📌 These integers are **not values**, they are **indexes**.

---

## 2. Why RNN Cannot Take Words Directly

Neural networks only work with **numbers**:

* No strings
* No text
* No symbols

So text must be converted into a **numerical form**.

Integer IDs are the **simplest and most compact** way to do this: one number per token.

---

## 3. Why Integers Are Preferred (Instead of One-Hot)

### Option 1: One-Hot Encoding (Problematic)

Vocabulary size = 10,000

Word = “love” (ID 2)

```
[0,0,1,0,0,0,0,0,0,...]
```

❌ Huge vectors
❌ Memory expensive
❌ Sparse
❌ Slow for RNNs

---

### Option 2: Integer Encoding (Efficient) ✅

```
love → 2
```

✔ Compact
✔ Fast
✔ Memory efficient
✔ Perfect for Embedding layers
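To make the size difference concrete, the sketch below encodes the same four-token sentence both ways and counts how many numbers each representation stores. The helper `one_hot` is a hypothetical name, and the vocabulary size of 10,000 is taken from the example above.

```python
vocab_size = 10_000
sequence = [1, 2, 3, 4]  # "I love deep learning" as integer IDs


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


one_hot_rows = [one_hot(i, vocab_size) for i in sequence]

one_hot_cells = sum(len(row) for row in one_hot_rows)  # numbers stored as one-hot
integer_cells = len(sequence)                          # numbers stored as IDs

print(one_hot_cells)  # 40000
print(integer_cells)  # 4
```

Each one-hot row holds 9,999 zeros and a single 1, so nearly all of the stored numbers carry no information.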

---

## 4. Integer Encoding + Embedding Layer (Key Concept)

Integers are **NOT fed directly** into the RNN.

They are passed through an **Embedding layer** first.

```python
Embedding(input_dim=vocab_size, output_dim=embedding_dim)
```

### What happens internally:

```
Integer → Dense vector
```

Example:

```
2 → [0.12, -0.34, 0.87, ...]
```

📌 This vector is **learned**, not fixed.

---

## 5. Why Embeddings Need Integers

The embedding layer works like a **lookup table**:

$$
\text{Embedding}(i) = W[i]
$$

Where:

* `i` = integer index
* `W` = embedding matrix

If the inputs were not integers, they could not index rows of `W`, so the lookup would not work.
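The lookup can be sketched without any deep learning library. The matrix values below are made up for illustration (in a real model they are learned), and `embed` and `embed_via_one_hot` are hypothetical helper names. The second function shows that indexing row `i` gives the same result as multiplying a one-hot vector by `W`, just without the wasted arithmetic.

```python
# Toy embedding matrix W: 5 vocabulary entries, 3 dimensions each.
W = [
    [0.0, 0.0, 0.0],      # row 0, often reserved for padding
    [0.5, -0.1, 0.3],
    [0.12, -0.34, 0.87],
    [-0.7, 0.2, 0.0],
    [0.9, 0.4, -0.6],
]


def embed(i):
    """Embedding(i) = W[i]: a plain row lookup."""
    return W[i]


def embed_via_one_hot(i):
    """Same result computed as one_hot(i) times W."""
    one_hot = [1.0 if k == i else 0.0 for k in range(len(W))]
    dim = len(W[0])
    return [sum(one_hot[k] * W[k][d] for k in range(len(W))) for d in range(dim)]


sequence = [1, 2, 3, 4]
vectors = [embed(i) for i in sequence]  # one dense vector per token

print(embed(2))                          # [0.12, -0.34, 0.87]
print(embed(2) == embed_via_one_hot(2))  # True
```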

---

## 6. Where Integers Are Used in RNN Pipelines

✔ Text classification
✔ Language modeling
✔ Machine translation
✔ Speech recognition
✔ Chatbots

All of these pipelines represent text tokens as **integer token IDs**.

---

## 7. Padding Uses Integers Too

Sequences have different lengths:

```
[1,2,3]
[4,5]
```

Pad them:

```
[1,2,3]
[4,5,0]
```

Here:

* `0` = padding token
* Ignored during training when masking is enabled (for example, `mask_zero=True` on a Keras `Embedding` layer)
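A minimal sketch of padding plus the mask that marks which positions are real tokens. The helpers `pad_post` and `padding_mask` are hypothetical names written for this note; in Keras the mask is normally produced by the layers themselves rather than by hand.

```python
def pad_post(sequences, pad_id=0):
    """Append pad_id to each sequence until all match the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


def padding_mask(padded, pad_id=0):
    """True where a position holds a real token, False where it holds padding."""
    return [[token != pad_id for token in row] for row in padded]


padded = pad_post([[1, 2, 3], [4, 5]])
mask = padding_mask(padded)

print(padded)  # [[1, 2, 3], [4, 5, 0]]
print(mask)    # [[True, True, True], [True, True, False]]
```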

---

## 8. Why Not Use Float Numbers Instead?

Using floats directly:

* Has no semantic meaning
* Cannot represent vocabulary position
* Breaks embedding lookup

Integers = **identity**, not magnitude.

---

## 9. Important Clarification (Exam Trap ❗)

❌ RNN does NOT “learn from integers directly”
✅ RNN learns from **embeddings derived from integers**

Integers are just **keys**.

---

## 10. RNN vs ANN (in this context)

| Aspect     | ANN                 | RNN                 |
| ---------- | ------------------- | ------------------- |
| Text input | Needs preprocessing | Integer sequences   |
| Order      | Ignored             | Preserved           |
| Encoding   | One-hot common      | Integer + embedding |



---

## 11. Easy Intuition (Remember This)

> Integers in RNN are like **dictionary page numbers**, not meanings themselves.

---

## 12. Final Summary

* RNNs need numeric input
* Integers represent token IDs
* Efficient for large vocabularies
* Enable embedding layers
* Preserve sequence order
* Essential for NLP tasks




This cell imports the `numpy` library, commonly used for numerical operations in Python, and defines a list of strings called `docs`. These strings will be used as a corpus for text processing.

In [7]:
import numpy as np

docs = ['go pakistan',
		'pakistan pakistan',
		'hip hip hurray',
		'jeetega bhai jeetega pakistan jeetega',
		'pakistan zinda bad',
		'Amir Amir',
		'zeeshan zeeshan',
		'dhoni dhoni',
		'cm punjab ',
		'inquilab zindabad']

In [8]:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(oov_token='<nothing>')

This cell imports the `Tokenizer` class from Keras' `preprocessing.text` module. It then initializes a `Tokenizer` object, specifying `oov_token='<nothing>'`. The `oov_token` (out-of-vocabulary token) handles words not found in the vocabulary during tokenization by assigning them this special token.

In [9]:
tokenizer.fit_on_texts(docs)

This cell fits the tokenizer on the `docs` corpus, building the vocabulary from the words in the `docs` list. Words are lowercased (so `Amir` becomes `amir`), and each unique word gets a unique integer index, with more frequent words receiving lower indices.

In [10]:
tokenizer.word_index

{'<nothing>': 1,
 'pakistan': 2,
 'jeetega': 3,
 'hip': 4,
 'amir': 5,
 'zeeshan': 6,
 'dhoni': 7,
 'go': 8,
 'hurray': 9,
 'bhai': 10,
 'zinda': 11,
 'bad': 12,
 'cm': 13,
 'punjab': 14,
 'inquilab': 15,
 'zindabad': 16}

This cell displays the `word_index` attribute of the tokenizer, a dictionary mapping each word to its integer index. The `oov_token` `'<nothing>'` is assigned index 1, and index 0 is left unassigned so it can serve as the padding value.
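The ordering in `word_index` can be reproduced with a rough plain-Python sketch. This is not the Keras implementation: `build_word_index` is a hypothetical helper that only lowercases and splits on whitespace (the real `Tokenizer` also filters punctuation), puts the OOV token first, and then ranks words by descending frequency with ties kept in first-seen order.

```python
from collections import Counter


def build_word_index(texts, oov_token="<nothing>"):
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so words with equal counts keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    words = [oov_token] + [word for word, _ in ranked]
    # Indices start at 1; 0 stays free for padding.
    return {word: i for i, word in enumerate(words, start=1)}


docs = ['go pakistan', 'pakistan pakistan', 'hip hip hurray',
        'jeetega bhai jeetega pakistan jeetega', 'pakistan zinda bad',
        'Amir Amir', 'zeeshan zeeshan', 'dhoni dhoni', 'cm punjab ',
        'inquilab zindabad']

word_index = build_word_index(docs)
print(word_index['<nothing>'], word_index['pakistan'], word_index['go'])  # 1 2 8
```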

In [11]:
tokenizer.word_counts

OrderedDict([('go', 1),
             ('pakistan', 5),
             ('hip', 2),
             ('hurray', 1),
             ('jeetega', 3),
             ('bhai', 1),
             ('zinda', 1),
             ('bad', 1),
             ('amir', 2),
             ('zeeshan', 2),
             ('dhoni', 2),
             ('cm', 1),
             ('punjab', 1),
             ('inquilab', 1),
             ('zindabad', 1)])

This cell displays the `word_counts` attribute, which is an `OrderedDict` showing the frequency of each word in the `docs` corpus.

In [12]:
tokenizer.document_count

10

This cell displays the `document_count` attribute, which indicates the total number of documents (strings) that were used to fit the tokenizer (in this case, 10).

In [13]:
sequences = tokenizer.texts_to_sequences(docs)
sequences

[[8, 2],
 [2, 2],
 [4, 4, 9],
 [3, 10, 3, 2, 3],
 [2, 11, 12],
 [5, 5],
 [6, 6],
 [7, 7],
 [13, 14],
 [15, 16]]

This cell converts the text documents into sequences of integers using the `texts_to_sequences` method of the tokenizer. Each word in the documents is replaced by its corresponding integer index from the `word_index`.
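The role of the OOV token becomes visible when a text contains a word the tokenizer never saw. The sketch below is a simplified stand-in for `texts_to_sequences`: `texts_to_ids` is a hypothetical helper, and the dictionary is a hand-copied subset of the `word_index` shown earlier.

```python
# Subset of the word_index produced by the tokenizer in this notebook.
word_index = {"<nothing>": 1, "pakistan": 2, "jeetega": 3, "go": 8}


def texts_to_ids(texts, word_index, oov_token="<nothing>"):
    """Map each word to its ID; unknown words fall back to the OOV ID."""
    oov_id = word_index[oov_token]
    return [
        [word_index.get(word, oov_id) for word in text.lower().split()]
        for text in texts
    ]


encoded = texts_to_ids(["go pakistan", "pakistan jeetega today"], word_index)
print(encoded)  # [[8, 2], [2, 3, 1]]
```

The unseen word `today` is encoded as `1`, the index of `'<nothing>'`, instead of being dropped.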

In [14]:
from keras.utils import pad_sequences

This cell imports the `pad_sequences` function from Keras' `utils` module. This function is used to ensure that all sequences in a list have the same length, which is a common requirement for neural network inputs.

In [21]:
sequences = pad_sequences(sequences, padding='post')

This cell applies padding to the `sequences` generated earlier. `padding='post'` means that zeros will be added to the end of shorter sequences to match the length of the longest sequence. This creates a uniform input size for a neural network.

In [22]:
sequences

array([[ 8,  2,  0,  0,  0],
       [ 2,  2,  0,  0,  0],
       [ 4,  4,  9,  0,  0],
       [ 3, 10,  3,  2,  3],
       [ 2, 11, 12,  0,  0],
       [ 5,  5,  0,  0,  0],
       [ 6,  6,  0,  0,  0],
       [ 7,  7,  0,  0,  0],
       [13, 14,  0,  0,  0],
       [15, 16,  0,  0,  0]], dtype=int32)

This cell displays the padded sequences. Notice how shorter sequences now have `0`s appended to them to match the length of the longest sequence (which has 5 elements).
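The difference between the two `padding` modes can be sketched in plain Python. The helper `pad` below is a hypothetical, simplified stand-in for `pad_sequences`; like the Keras function, it defaults to `'pre'`, so `'post'` has to be requested explicitly to get trailing zeros.

```python
def pad(sequences, padding="pre", pad_id=0):
    """Pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [pad_id] * (max_len - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


sample = [[8, 2], [4, 4, 9], [3, 10, 3, 2, 3]]

post = pad(sample, padding="post")
pre = pad(sample, padding="pre")

print(post)  # [[8, 2, 0, 0, 0], [4, 4, 9, 0, 0], [3, 10, 3, 2, 3]]
print(pre)   # [[0, 0, 0, 8, 2], [0, 0, 4, 4, 9], [3, 10, 3, 2, 3]]
```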

In [23]:
from keras.datasets import imdb
from keras import Sequential
from keras.layers import Dense,SimpleRNN,Embedding,Flatten

This cell imports necessary components from Keras for building a neural network: `imdb` for dataset loading, `Sequential` for creating a linear stack of layers, and `Dense`, `SimpleRNN`, `Embedding`, `Flatten` for different types of neural network layers.

In [24]:
(X_train,y_train),(X_test,y_test) = imdb.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


This cell loads the IMDB movie review sentiment classification dataset using `imdb.load_data()`, which returns it already split into training (`X_train`, `y_train`) and testing (`X_test`, `y_test`) sets. `X` contains sequences of word indices, and `y` contains sentiment labels (0 for negative, 1 for positive).

In [25]:
X_train[0]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4,
 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480,
 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111,
 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22,
 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15,
 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386,
 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4,
 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33,
 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4,
 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530,
 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15,
 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21,
 134, 476, 26, 480, 5, ...

This cell displays the first movie review from the training dataset (`X_train[0]`). This output shows a sequence of integers, where each integer represents a word in the review.

In [26]:
len(X_train[2])

141

This cell prints the length of the third movie review in the training dataset (`X_train[2]`), indicating the number of words (or word indices) in that particular review.

In [27]:
X_train = pad_sequences(X_train,padding='post',maxlen=50)
X_test = pad_sequences(X_test,padding='post',maxlen=50)

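This cell pads the sequences in both the training (`X_train`) and testing (`X_test`) datasets using `pad_sequences`. `padding='post'` adds zeros to the end of reviews shorter than 50 tokens, and `maxlen=50` truncates longer reviews to 50 tokens. Because `truncating` is left at its default of `'pre'`, truncation drops tokens from the start and keeps the last 50. A fixed length is required so the reviews can be batched into a single array for the RNN model.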

In [28]:
X_train[0]

array([2071,   56,   26,  141,    6,  194, 7486,   18,    4,  226,   22,
         21,  134,  476,   26,  480,    5,  144,   30, 5535,   18,   51,
         36,   28,  224,   92,   25,  104,    4,  226,   65,   16,   38,
       1334,   88,   12,   16,  283,    5,   16, 4472,  113,  103,   32,
         15,   16, 5345,   19,  178,   32], dtype=int32)

This cell displays the first training sequence (`X_train[0]`) after padding. It now has a fixed length of 50. The original review was longer than 50 tokens, so it was truncated to its last 50 tokens and contains no padding zeros; a shorter review would have trailing zeros instead.
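How `maxlen` interacts with padding and truncation can be sketched for a single sequence. The helper `pad_or_truncate` is hypothetical; its `padding` and `truncating` arguments are meant to mirror the `pad_sequences` parameters of the same names, both of which default to `'pre'`.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", pad_id=0):
    """Force one sequence to exactly maxlen items."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the start, 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    fill = [pad_id] * (maxlen - len(seq))
    return fill + seq if padding == "pre" else seq + fill


review = list(range(1, 11))  # stand-in for a 10-token review

kept_end = pad_or_truncate(review, maxlen=4, padding="post")
kept_start = pad_or_truncate(review, maxlen=4, padding="post", truncating="post")
short = pad_or_truncate([5, 6], maxlen=4, padding="post")

print(kept_end)    # [7, 8, 9, 10]
print(kept_start)  # [1, 2, 3, 4]
print(short)       # [5, 6, 0, 0]
```

Setting only `padding='post'` therefore still keeps the end of a long review, which is why the padded `X_train[0]` matches the tail of the original sequence.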

In [None]:
model = Sequential()

model.add(SimpleRNN(32,input_shape=(50,1),return_sequences=False))
model.add(Dense(1,activation='sigmoid'))

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn (SimpleRNN)      (None, 32)                1088      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 1,121
Trainable params: 1,121
Non-trainable params: 0
_________________________________________________________________


This cell defines a simple Recurrent Neural Network (RNN) model using Keras' `Sequential` API. It consists of two layers:
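- A `SimpleRNN` layer with 32 units. The `input_shape=(50,1)` specifies that each input sequence has 50 timesteps and each timestep has 1 feature, here the raw word index. Note that this treats the index as a numeric magnitude rather than a lookup key; placing the imported `Embedding` layer before the RNN would instead map each index to a learned vector, as described in Section 4. `return_sequences=False` means the layer outputs only the final hidden state.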
- A `Dense` (fully connected) layer with 1 unit and a `sigmoid` activation function, which is suitable for binary classification tasks like sentiment analysis.
Finally, `model.summary()` prints a summary of the model's architecture, including the number of parameters.
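The parameter counts in the summary can be checked by hand. The sketch below uses the standard counts for a simple recurrent layer (input weights, recurrent weights, biases) and for a dense layer; the function names are hypothetical helpers, not Keras calls.

```python
def simple_rnn_params(units, input_dim):
    """Input-to-hidden weights + hidden-to-hidden weights + biases."""
    return units * input_dim + units * units + units


def dense_params(units, input_dim):
    """Weights + biases of a fully connected layer."""
    return input_dim * units + units


rnn = simple_rnn_params(units=32, input_dim=1)
dense = dense_params(units=1, input_dim=32)

print(rnn)          # 1088
print(dense)        # 33
print(rnn + dense)  # 1121
```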

In [None]:
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

model.fit(X_train,y_train,epochs=5,validation_data=(X_test,y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f8dc97f8810>

This cell compiles and trains the defined RNN model:
- `model.compile()` configures the model for training, specifying `loss='binary_crossentropy'` (appropriate for binary classification), `optimizer='adam'` (a popular optimization algorithm), and `metrics=['accuracy']` to monitor performance.
- `model.fit()` trains the model using the padded training data (`X_train`, `y_train`) for 5 epochs. It also specifies `validation_data=(X_test, y_test)` to evaluate the model's performance on the test set after each epoch, which helps in monitoring for overfitting.
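For intuition about the loss being minimized, binary cross-entropy can be computed by hand on a couple of predictions. This is a minimal sketch: `binary_crossentropy` here is a local helper written for this note, and the clipping constant `eps` is an assumed value to avoid `log(0)`.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Labels: first review positive (1), second negative (0).
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])

print(round(confident_right, 3))  # 0.105
print(round(confident_wrong, 3))  # 2.303
```

Confident correct predictions give a loss near zero, while confident wrong ones are penalized heavily, which is what pushes the sigmoid output toward the right label.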