In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
sentences= [
    "The cat is sleeping.",
    "I am going to school.",
    "She loves playing football.",
    "The dog is very happy.",
    "They are eating breakfast.",
    "He will buy a new phone.",
    "The sun rises in the east.",
    "I love reading books.",
    "She is learning Python.",
    "They are playing cricket.",
    "The baby is laughing.",
    "I will eat an apple.",
    "He is watching TV.",
    "She loves her family.",
    "The flowers are blooming.",
    "They are going to cinema.",
    "I am feeling tired.",
    "He will join a gym.",
    "The book is very old.",
    "She is a great singer.",
    "I love Indian food.",
    "They are best friends.",
    "The dog is barking.",
    "He is very smart.",
    "The baby is crying.",
    "I am learning Hindi.",
    "She is very kind.",
    "They are going home.",
    "The sun sets in west.",
    "I love my mom.",
    "He is playing guitar.",
    "She is a doctor.",
    "The cat is eating.",
    "They are studying hard.",
    "I will travel soon.",
    "He is very tall.",
    "The flowers are red.",
    "She loves dancing.",
    "They are playing chess.",
    "I am feeling happy.",
    "He is a bad boy.",
    "The dog is not running.",
    "She is not learning Java.",
    "I love my family.",
    "They are watching movie.",
    "He is a criminal.",
    "The baby is sleeping.",
    "I am going to hell.",
    "She loves fucking.",
    "They are eating non-veg."
]

In [None]:
labels=[1]*25 + [0]*25  # first 25 sentences are positive (1), last 25 sentences are negative (0)
labels=np.array(labels)

In [None]:
vocab_size=2000  # to be safe, a limit far larger than the actual vocabulary
tokenizer=Tokenizer(num_words=vocab_size, oov_token="<OOV>")  # OOV = out of vocabulary
tokenizer.fit_on_texts(sentences)  # builds the word -> integer index mapping
sequence=tokenizer.texts_to_sequences(sentences)
maxlen=max(len(s) for s in sequence)  # length of the longest sequence, used for padding

X=pad_sequences(sequence, maxlen=maxlen, padding="post")  # "post": zeros are appended after the last word of each sequence
y=labels

1. Creating the Labels (The Answer Sheet)
python
labels=[1]*25 + [0]*25
Translation: "Make 25 copies of '1', then 25 copies of '0'"

What it does:

First 25 sentences → Positive (labeled as 1)

Last 25 sentences → Negative (labeled as 0)

Visual Example:

text
Sentence 1: "I love this movie!" → Label: 1 (Positive)
Sentence 2: "Amazing film!" → Label: 1 (Positive)
...
Sentence 25: "Great acting!" → Label: 1 (Positive)
Sentence 26: "I hated it" → Label: 0 (Negative)
Sentence 27: "Boring story" → Label: 0 (Negative)
...
Sentence 50: "Worst ever" → Label: 0 (Negative)
python
labels=np.array(labels)
Translation: "Convert this Python list into a NumPy array"
Why: NumPy arrays are faster for mathematical operations and work better with machine learning libraries.

2. Setting Up Vocabulary Size
python
vocab_size=2000
Translation: "Our dictionary will have space for 2000 words"

What it means:

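We're saying: "Only keep track of the most common words, up to index 1999" (Keras keeps word indices below num_words)

If there are more unique words than that in our sentences, the least common ones are mapped to the OOV token when text is converted to sequences

2000 is a safely large number here: our 50 short sentences contain far fewer unique words, so nothing is cut off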

Analogy: Imagine you have a notebook with 2000 pages. Each page is for a different word. If you encounter more than 2000 words, you don't get more pages!

3. Creating the Tokenizer (The Word→Number Translator)
python
tokenizer=Tokenizer(num_words= vocab_size, oov_token= "<OOV>")
Let's break this down:

Part A: Tokenizer(num_words=vocab_size)

Creates a "translator" that converts words to numbers

num_words=2000 → Only remember top 2000 words

Part B: oov_token="<OOV>" (Most Important!)

OOV = Out Of Vocabulary

This creates a special code for unknown words

Example: If "supercalifragilisticexpialidocious" isn't in our 2000 words, it becomes "<OOV>"

Why OOV is crucial:
Without OOV: Unknown word → silently dropped from the sequence
With OOV: Unknown word → "<OOV>" → Gets a number (index 1) → Can be processed!

Analogy: You're learning Spanish with a 2000-word dictionary. When you hear a word not in your dictionary, you write "UNKNOWN" instead of panicking!

4. Learning the Vocabulary
python
tokenizer.fit_on_texts(sentences)
Translation: "Look at all sentences and learn which words are most common"

What happens inside:

The tokenizer reads ALL sentences

Counts how many times each word appears

Ranks words from most frequent to least frequent

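Assigns numbers:

"<OOV>" token → 1 (reserved because we set oov_token)

Most common word → 2

Second most common → 3

... and so on (0 is never assigned to a word; it is reserved for padding)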
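
The counting and ranking steps above can be sketched in plain Python. This is only an approximation of what the Keras `Tokenizer` does, not its actual implementation: the helper name `build_word_index` and the simplified regex-based word splitting are assumptions made for illustration.

```python
import re
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 goes to the OOV token, 0 stays free for padding."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep runs of letters, digits and apostrophes (a simplification).
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    word_index = {oov_token: 1}
    # most_common() sorts by descending count; ties keep first-seen order.
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

sample = ["The cat is sleeping.", "The dog is barking.", "She is a doctor."]
word_index = build_word_index(sample)
print(word_index)
```

In this small sample "is" appears three times and "the" twice, so they receive indices 2 and 3, right after the reserved OOV index.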

Example:
If our sentences are movie reviews:

"movie" appears 50 times → Might get number 2 (1 is taken by "<OOV>")

"good" appears 45 times → Might get number 3

"zebra" appears 1 time → Might not be in top 2000 → Becomes "<OOV>"

5. Converting Sentences to Numbers
python
sequence=tokenizer.texts_to_sequences(sentences)
Translation: "Now translate each sentence from words to numbers using our dictionary"

Example:
Original sentence: "I love this movie"
After tokenization: [4, 15, 27, 1] (hypothetical numbers)

Detailed process:

text
Input: ["I love this movie", "It was terrible"]
Step 1: Lowercase, strip punctuation, split into words: [["i", "love", "this", "movie"], ["it", "was", "terrible"]]
Step 2: Look up each word in dictionary:
        "i" → 4
        "love" → 15
        "this" → 27
        "movie" → 1
        "it" → 8
        "was" → 12
        "terrible" → 32
Step 3: Output: [[4, 15, 27, 1], [8, 12, 32]]
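
As a rough sketch of the lookup step, the snippet below converts sentences to index lists and sends unknown words to the OOV index. The `word_index` dictionary and the `to_sequence` helper are hypothetical stand-ins for `tokenizer.word_index` and `tokenizer.texts_to_sequences`, and the word splitting is simplified.

```python
import re

# Hypothetical word index, shaped like tokenizer.word_index (1 is the OOV slot).
word_index = {"<OOV>": 1, "i": 2, "love": 3, "this": 4, "movie": 5}

def to_sequence(text, word_index, oov_token="<OOV>"):
    """Map each lowercased word to its index; unknown words get the OOV index."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word_index.get(w, word_index[oov_token]) for w in words]

print(to_sequence("I love this movie!", word_index))  # [2, 3, 4, 5]
print(to_sequence("I love this zebra", word_index))   # [2, 3, 4, 1]
```

The second call shows why the OOV token matters: "zebra" is not in the dictionary, yet the sequence keeps its length because the word is replaced by index 1 instead of being dropped.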
6. Finding Maximum Sentence Length
python
maxlen= max(len(s) for s in sequence)
Translation: "Find the length of the longest sentence"

What it does:

Looks at all converted sentences (now as number sequences)

Counts how many numbers are in each

Finds the maximum count

Example:

text
Sentence 1: [4, 15, 27, 1] → Length: 4 words
Sentence 2: [8, 12, 32] → Length: 3 words
Sentence 3: [2, 7, 9, 11, 13, 20] → Length: 6 words
maxlen = 6 (from Sentence 3)
Why we need this: Neural networks need ALL inputs to be the SAME length!

7. Padding Sequences (Making All Sentences Equal Length)
python
X=pad_sequences(sequence, maxlen= maxlen, padding="post")
This is CRITICAL! Let's break it down:

The Problem:
Sentences have different lengths:

"Hi" → 1 word

"I love you" → 3 words

"The quick brown fox jumps over the lazy dog" → 9 words

Neural networks need fixed input size!

The Solution: Padding
Add zeros to make all sequences the same length.

padding="post" means: Add zeros AT THE END

Example:

text
Original sequences (unequal lengths):
[4, 15, 27, 1]        # Length 4
[8, 12, 32]           # Length 3  
[2, 7, 9, 11, 13, 20] # Length 6

After padding (maxlen=6):
[4, 15, 27, 1, 0, 0]   # Added 2 zeros at END
[8, 12, 32, 0, 0, 0]   # Added 3 zeros at END  
[2, 7, 9, 11, 13, 20] # No padding needed (already length 6)
Alternative: padding="pre"
Would add zeros at the BEGINNING:

text
[0, 0, 4, 15, 27, 1]
[0, 0, 0, 8, 12, 32]
[2, 7, 9, 11, 13, 20]
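Why "post" padding works here:
The Embedding layer used later sets mask_zero=True, so the RNN skips the padded positions and the trailing zeros do not disturb the final state. Without masking, "pre" padding (the Keras default) is often the safer choice for RNNs, because the real words then sit right before the final state instead of being followed by a run of zeros.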
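
To make the two padding modes concrete, here is a minimal pure-Python sketch. The `pad` helper is a hypothetical simplification of `pad_sequences`; it only handles lists of integers and assumes `maxlen` is at least 1.

```python
def pad(sequences, maxlen=None, padding="post", value=0):
    """Pad (or truncate) every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        s = s[-maxlen:]  # if too long, keep the last maxlen tokens
        fill = [value] * (maxlen - len(s))
        padded.append(s + fill if padding == "post" else fill + s)
    return padded

seqs = [[4, 15, 27, 1], [8, 12, 32], [2, 7, 9, 11, 13, 20]]
post = pad(seqs, padding="post")
pre = pad(seqs, padding="pre")
print(post)  # [[4, 15, 27, 1, 0, 0], [8, 12, 32, 0, 0, 0], [2, 7, 9, 11, 13, 20]]
print(pre)   # [[0, 0, 4, 15, 27, 1], [0, 0, 0, 8, 12, 32], [2, 7, 9, 11, 13, 20]]
```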

8. Final Variables
python
y=labels
Translation: "Our answers/labels are called 'y'"

In [None]:
maxlen

6

In [None]:
X[0]  # X[0] has 4 words, but the longest sentence has 6, so two zeros are appended to fill the array of length 6

array([ 3, 22,  2, 23,  0,  0], dtype=int32)

In [None]:
X[47]  # 5 words, so one zero is appended to complete the array

array([ 4, 10, 12, 17, 89,  0], dtype=int32)

Preparing the model

In [None]:
# Specify the embedding size and the number of units in the recurrent (hidden) layer

embed_dim= 16 # each word is embedded as a 16-dimensional vector
rnn_units= 8 # 8 neurons in the hidden layer

embed_dim = 16
What This Means:
"Each word will be represented by 16 personality traits"

The Problem with Simple Numbers:
Right now, "movie" = 3 and "film" = 4. The computer thinks:

3 and 4 are just numbers

3 is close to 4, far from 100

But "movie" and "film" mean almost the SAME thing!

"movie" (3) and "zebra" (100) are far apart numerically, but in meaning... also far apart!

Current (Bad) Representation:

text
"movie" = [3]
"film"  = [4]
"zebra" = [100]
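Computer thinks: "3 and 4 are similar, 100 is different" ✓
But for the wrong reasons! The IDs are just frequency ranks, so any match between numeric closeness and meaning is a coincidence.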

The Solution: Word Embeddings
Instead of one number, give each word 16 numbers (16 dimensions):

After Embedding:

text
"movie" = [0.8, -0.2, 0.4, 0.1, -0.9, 0.3, 0.5, -0.1, 0.7, 0.2, -0.3, 0.6, 0.4, -0.5, 0.1, 0.9]
"film"  = [0.7, -0.1, 0.5, 0.2, -0.8, 0.4, 0.6, -0.2, 0.6, 0.3, -0.4, 0.7, 0.5, -0.4, 0.2, 0.8]
"zebra" = [-0.1, 0.9, -0.8, 0.3, 0.4, -0.7, 0.2, 0.8, -0.3, 0.6, 0.5, -0.2, -0.9, 0.7, 0.4, -0.5]
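What Each Dimension Represents (Conceptually):
Think of 16 personality tests for words. In a trained model the dimensions are not individually interpretable like this; the labels below are only an intuition aid: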

Dimension	Might Represent	Example Values
1	Positive/Negative	"good"=0.9, "bad"=-0.8
2	Object/Action	"run"=0.7, "chair"=-0.6
3	Human/Thing	"teacher"=0.8, "rock"=-0.9
4	Size	"elephant"=0.9, "ant"=-0.9
5	Speed	"rocket"=0.8, "snail"=-0.7
6	Age	"ancient"=0.9, "new"=-0.8
7	Formality	"thou"=0.7, "dude"=-0.6
...	...	...
16	Emotional Impact	"love"=0.9, "hate"=-0.9
Why 16 Dimensions?
Too few (e.g., 2): Can't capture enough meaning

text
"movie" = [0.5, 0.5]
"film" = [0.5, 0.5]
"zebra" = [0.5, 0.5]  ← All look the same!
Too many (e.g., 1000): Overcomplicated, slow to train

16-300 is common: Balances richness with efficiency

Visual Analogy:
Imagine describing your friends:

One number: "Rate them 1-10" → Limited!

16 traits: [Funny: 8/10, Smart: 9/10, Kind: 7/10, Athletic: 6/10...] → Rich description!

The network LEARNS these embeddings during training! It figures out what dimensions are useful.
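
One common way to check whether two embeddings are "similar" is cosine similarity. The sketch below applies it to the illustrative vectors for "movie", "film" and "zebra" given earlier; these are made-up numbers from the explanation, not values learned by the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 = same direction, below 0 = opposed."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

movie = [0.8, -0.2, 0.4, 0.1, -0.9, 0.3, 0.5, -0.1, 0.7, 0.2, -0.3, 0.6, 0.4, -0.5, 0.1, 0.9]
film = [0.7, -0.1, 0.5, 0.2, -0.8, 0.4, 0.6, -0.2, 0.6, 0.3, -0.4, 0.7, 0.5, -0.4, 0.2, 0.8]
zebra = [-0.1, 0.9, -0.8, 0.3, 0.4, -0.7, 0.2, 0.8, -0.3, 0.6, 0.5, -0.2, -0.9, 0.7, 0.4, -0.5]

print("movie vs film: ", round(cosine_similarity(movie, film), 2))
print("movie vs zebra:", round(cosine_similarity(movie, zebra), 2))
```

With these example vectors, "movie" and "film" score close to 1 while "movie" and "zebra" score below 0, which is the kind of geometry a well-trained embedding is hoped to show.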

2. RNN Units: The Robot's BRAIN CELLS
python
rnn_units = 8
What This Means:
"Our RNN robot has 8 thinking cells in its memory"

Understanding RNN Units:
Each RNN unit is like a specialist in the robot's brain:

text
Think of 8 little workers in the robot's head:

Worker 1: Specializes in "Is this positive or negative?"
Worker 2: Specializes in "Is this about people or things?"
Worker 3: Specializes in "Is this past, present, or future?"
Worker 4: Specializes in "Does this continue the sentence?"
Worker 5: Specializes in "What's the main subject?"
Worker 6: Specializes in "What's the action?"
Worker 7: Specializes in "How intense is the emotion?"
Worker 8: Specializes in "Is this a complete thought?"

In [None]:
from tensorflow.keras.layers import Input, Embedding, SimpleRNN, Dense
from tensorflow.keras.models import Model
inp=Input(shape=(maxlen,), dtype="int32", name='input')  # (maxlen,) is a one-element tuple: each sample is a 1D vector of maxlen (here 6) token IDs
x=Embedding(input_dim=vocab_size, output_dim=embed_dim, mask_zero=True, name='embed')(inp)

inp = Input(shape=(maxlen,), dtype="int32", name='input')
Breaking it down piece by piece:
A. Input()

Analogy: Building a reception desk where data will arrive

Purpose: Creates a placeholder where our sentences will enter the network

This is NOT a processing layer - just an entry point

B. shape=(maxlen,)
This is CRITICAL! Let's understand:

maxlen = Maximum sentence length we calculated earlier

Example: If longest sentence has 20 words, maxlen=20

Why the comma? (maxlen,) vs (maxlen)

text
(maxlen)   → Just a number, not a tuple
(maxlen,)  → A one-element tuple: each sample is a 1D vector
(maxlen,5) → A two-element tuple: each sample is a 2D matrix
Example:

Sentence with 20 words: [4, 15, 27, 1, 0, 0, ... 20 numbers total]

Shape = (20,) ← 1D vector of length 20

Not (20,1) (that would be 20 rows, 1 column)
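
The effect of the trailing comma can be checked directly in plain Python. The variable names below are chosen only for this illustration.

```python
maxlen = 6

not_a_tuple = (maxlen)     # parentheses alone do nothing: this is the int 6
shape_1d = (maxlen,)       # the trailing comma makes a one-element tuple
shape_2d = (maxlen, 5)     # two elements: would describe a 2D sample

print(type(not_a_tuple).__name__)  # int
print(type(shape_1d).__name__)     # tuple
print(len(shape_1d), len(shape_2d))  # 1 2
```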

C. dtype="int32"

Data type: 32-bit integers

Why integers? Because our words are encoded as numbers:

text
"I" = 4, "love" = 15, "movie" = 1
Why not strings? Neural networks only understand numbers!

D. name='input'

Just a label for organization

Helpful when debugging: "Error at layer 'input'..."

What this creates:
Imagine a mailbox slot shaped exactly for our sentences:

text
[INPUT LAYER]
Shape: (20,)  ← Can accept 20 numbers at a time
Type: Integers (word indices)
Name: "input" (just a label)
Visual:

text
Sentence: "I love this movie"
Tokenized: [4, 15, 27, 1, 0, 0, 0, 0, ... up to 20 numbers]
          ↓
     [INPUT LAYER]
     Shape: (20,)
     Accepts: [4, 15, 27, 1, 0, 0, ...]
2. The Embedding Layer: The Word "Personality" Factory
python
x = Embedding(input_dim=vocab_size, output_dim=embed_dim, mask_zero=True, name='embed')(inp)
This is where the MAGIC HAPPENS! Let's break down each parameter:

A. Embedding() - What is this?
Analogy: A translation office that converts word IDs into rich descriptions

Input: Word indices (like 4, 15, 27)

Output: 16-dimensional vectors (personality profiles)

B. Parameters Explained:
1. input_dim=vocab_size

vocab_size = 2000 (number of words in our dictionary)

This creates a lookup table with 2000 entries

Analogy: A dictionary with 2000 pages

text
Embedding Matrix (Conceptual):
Row 0: [0.02, -0.04, 0.01, ...]       ← Index 0 is the padding ID; its row is an ordinary vector (masked, not zeroed)
Row 1: [0.3, -0.1, 0.8, ...]          ← Word ID 1 (most common word)
Row 2: [0.1, 0.5, -0.3, ...]          ← Word ID 2
...
Row 1999: [0.7, 0.2, 0.4, ...]        ← Word ID 1999
2. output_dim=embed_dim

embed_dim = 16 (we set this earlier)

Each word becomes a 16-number vector

Analogy: Each dictionary entry has 16 personality traits

3. mask_zero=True ← THIS IS SUPER IMPORTANT!

The Problem:
Remember our padded sentences?

text
"I love this movie" → [4, 15, 27, 1, 0, 0, 0, 0, ...]
Those zeros at the end are PADDING (not real words)!

What mask_zero=True does:

Tells the network: "Ignore the zeros!"

Creates a mask that says: "Process [4, 15, 27, 1], skip [0, 0, 0, 0]"

Without this: The RNN treats padding as real input, so the trailing zeros keep updating its memory

Analogy: Reading a form with blank spaces

Without mask: "Hmm, this blank space means something..."

With mask: "Ah, blank space = ignore, focus on filled parts!"

4. name='embed'

Just a label: "This is the embedding layer"

5. (inp) at the end

This connects the layers!

Read it as: "Take the output from inp layer, feed it into Embedding layer"

In Keras: Layer()(previous_layer_output)

What happens INSIDE the Embedding Layer:
Step 1: Input arrives

text
Sentence: [4, 15, 27, 1, 0, 0, 0, 0]
Step 2: Lookup each number in embedding matrix

text
4 → Row 4 in embedding matrix → [0.1, 0.3, -0.2, ... 16 numbers]
15 → Row 15 → [0.4, -0.1, 0.7, ...]
27 → Row 27 → [0.2, 0.5, 0.1, ...]
1 → Row 1 → [0.3, -0.2, 0.4, ...]
0 → Row 0 → [0.02, -0.04, 0.01, ...] ← A regular vector, BUT masked! (ignored downstream)
Step 3: Output shape transformation

text
Input shape: (batch_size, 20)  ← 20 word indices
Output shape: (batch_size, 20, 16) ← Each word → 16D vector
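
A minimal sketch of the lookup, using plain Python lists instead of Keras. The table here is filled with small random numbers as an assumption about how an untrained embedding looks; note that index 0 also gets an ordinary row, and the mask is computed separately from the input IDs.

```python
import random

random.seed(0)
vocab_size, embed_dim = 2000, 16

# Lookup table: one small random row per index, including index 0.
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
    for _ in range(vocab_size)
]

padded = [4, 15, 27, 1, 0, 0]                    # one padded sentence of word IDs
vectors = [embedding_matrix[i] for i in padded]  # row lookup, one per position
mask = [i != 0 for i in padded]                  # True for real words, False for padding

print(len(vectors), len(vectors[0]))  # 6 16
print(mask)  # [True, True, True, True, False, False]
```

The lookup itself is just indexing; what training changes is the content of the rows.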
3. Complete Flow Example
Let's trace one sentence through:

Sentence: "I love dogs" (maxlen=20)
Step 1: Tokenization & Padding

text
Original: "I love dogs"
Tokenized: [4, 15, 32]  (hypothetical numbers)
Padded: [4, 15, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Step 2: Input Layer

text
Input layer receives: [4, 15, 32, 0, 0, 0, ...]
Shape: (20,)  ← 20 numbers
Step 3: Embedding Layer
For EACH of the 20 positions:

text
Position 0: 4 → Lookup row 4 → [16 numbers]
Position 1: 15 → Lookup row 15 → [16 numbers]
Position 2: 32 → Lookup row 32 → [16 numbers]
Positions 3-19: 0 → Lookup row 0 → [16 numbers] BUT MASKED! (ignored)
Final output from embedding layer:

text
Shape: (20, 16)  ← 20 positions, each with 16D vector

Actual values:
Position 0: [0.1, 0.3, -0.2, 0.4, ... 16 numbers]  ← "I"
Position 1: [0.4, -0.1, 0.7, 0.2, ... 16 numbers]  ← "love"
Position 2: [0.6, 0.2, -0.3, 0.5, ... 16 numbers]  ← "dogs"
Positions 3-19: [row 0's vector] but MARKED TO BE IGNORED
4. The Embedding Matrix: A Closer Look
The Embedding layer creates a trainable matrix:

text
Embedding Matrix (vocab_size × embed_dim)
Rows: 2000 (one per word index, 0 to 1999)
Columns: 16 (dimensions)

          Dimension1 Dimension2 ... Dimension16
Index1      0.3        -0.1    ...    0.8
Index2      0.1        0.5     ...   -0.3
Index3     -0.2        0.7     ...    0.4
...         ...        ...     ...    ...
Index1999   0.7        0.2     ...    0.1
Initially: Random numbers
During training: Adjusted to make similar words have similar vectors

5. Why This Architecture?
Without Embedding Layer:
text
Input: [4, 15, 32] (just numbers)
Problem: 4 and 15 are "close" numerically but "I" and "love" aren't that similar!
With Embedding Layer:
text
Input: [4, 15, 32]
→ Lookup in embedding matrix
→ Output: Rich 16D vectors that CAPTURE MEANING
→ Similar words have similar vectors
6. The Masking Magic
What mask_zero=True actually creates:

text
Input sentence: [4, 15, 32, 0, 0, 0, ...]
               ↓
Mask created:  [1, 1, 1, 0, 0, 0, ...]
                ↑   ↑   ↑  ↑  ↑  ↑
              Real Real RealPad Pad Pad
              word word word
This mask gets passed to subsequent layers (RNN, Dense, etc.), telling them:

"Pay attention to positions with mask=1"

"Ignore positions with mask=0"

Without masking: RNN would update its memory on all 20 positions, padding included
With masking: Only the 3 real words update the memory; at padded positions the previous state is simply carried forward

In [None]:
rnn=SimpleRNN(units=rnn_units, return_sequences=True, return_state=True, name='simple_rnn')
rnn_outputs, final_state = rnn(x) # Unpack the outputs: rnn_outputs is the sequence, final_state is the last hidden state
output_layer=Dense(1, activation='sigmoid', name='output')(final_state) # Pass only the final_state to the Dense layer
model=Model(inputs=inp, outputs=output_layer)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

1. Creating the RNN Layer: The Brain with Memory
python
rnn = SimpleRNN(units=rnn_units, return_sequences=True, return_state=True, name='simple_rnn')
Breaking down each parameter:
A. SimpleRNN()

This is the simplest type of RNN

Analogy: A basic brain cell that remembers things temporarily

B. units=rnn_units

rnn_units = 8 (we set this earlier)

This means: 8 memory cells in our RNN

Each cell produces ONE number at each time step

C. return_sequences=True ← IMPORTANT!

What does this mean?

"Give me the output at EVERY time step"

Not just the final output, but outputs at ALL positions

Example with sentence "I love dogs":

text
Without return_sequences=True: Returns only last output
[I] → [love] → [dogs] → [final_output]

With return_sequences=True: Returns outputs at each step
[I] → output₁
[love] → output₂  
[dogs] → output₃
Returns: [output₁, output₂, output₃]
Why do we need this? For visualization/debugging! Though in our case, we only use the final state.

D. return_state=True

"Also give me the final memory state"

The state is the last hidden state (memory after processing entire sequence)

E. name='simple_rnn'

Just a label for organization

2. Running the RNN: Processing Our Sentence
python
rnn_outputs, final_state = rnn(x)  # Unpack the outputs
What happens here:
Input x: From embedding layer, shape: (batch_size, maxlen, embed_dim)
Example: For one sentence with maxlen=20, embed_dim=16:

Shape: (1, 20, 16) ← 1 sentence, 20 positions, each position 16D vector

The RNN processes EACH position:

text
Time step 0: Process word 0 ("I" = 16D vector) → Update memory → output₀
Time step 1: Process word 1 ("love" = 16D vector) + previous memory → Update memory → output₁  
Time step 2: Process word 2 ("dogs" = 16D vector) + previous memory → Update memory → output₂
Time steps 3-19: Padding (MASKED! The memory is carried forward unchanged)
The TWO outputs we get:
1. rnn_outputs: Outputs at EVERY time step (because return_sequences=True)

Shape: (batch_size, maxlen, rnn_units)

Example: (1, 20, 8) ← For each of 20 positions, 8 numbers (from 8 RNN units)

2. final_state: Final memory after processing ALL words

Shape: (batch_size, rnn_units)

Example: (1, 8) ← Just 8 numbers representing final memory

Visual Example:

text
Input: "I love dogs" (padded to 20 words)
After RNN processing:

rnn_outputs = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],  ← After "I"
    [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],  ← After "I love"
    [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],  ← After "I love dogs"
    [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],  ← Padding (masked): the last real output is carried forward
    ... 16 more copies of the same vector
]

final_state = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  ← Last real output
Important: Even though we get rnn_outputs (all sequences), we only use final_state for classification!
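
The recurrence can be sketched in pure Python as h_t = tanh(x_t·W_x + h_(t-1)·W_h + b). This is a simplified illustration, not the Keras implementation: the weights are random, the function name `simple_rnn` is made up for this sketch, and masked steps are modelled by carrying the previous state forward.

```python
import math
import random

random.seed(0)
embed_dim, units = 16, 8

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_x = rand_matrix(embed_dim, units)  # input -> hidden weights
W_h = rand_matrix(units, units)      # hidden -> hidden weights
b = [0.0] * units                    # bias

def simple_rnn(inputs, mask):
    """Return (outputs at every step, final state) for one sequence."""
    h = [0.0] * units
    outputs = []
    for x_t, keep in zip(inputs, mask):
        if keep:
            h = [
                math.tanh(
                    sum(x_t[i] * W_x[i][j] for i in range(embed_dim))
                    + sum(h[k] * W_h[k][j] for k in range(units))
                    + b[j]
                )
                for j in range(units)
            ]
        # Masked steps leave h unchanged, so the previous output is repeated.
        outputs.append(h)
    return outputs, h

inputs = [[random.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(6)]
mask = [True, True, True, False, False, False]  # 3 real words, 3 padding slots
rnn_outputs, final_state = simple_rnn(inputs, mask)

print(len(rnn_outputs), len(rnn_outputs[0]))  # 6 8
print(final_state == rnn_outputs[2])          # True
```

Because of the tanh, every entry of every hidden state stays strictly between -1 and 1.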

3. The Dense Output Layer: Making the Decision
python
output_layer = Dense(1, activation='sigmoid', name='output')(final_state)
Breaking this down:
A. Dense()

A fully connected layer (like in regular neural networks)

Analogy: The "decision-making committee"

B. units=1

Only ONE neuron in this layer

Why? Because we're doing binary classification (positive=1, negative=0)

Single neuron outputs a probability between 0 and 1

C. activation='sigmoid'

Sigmoid function: Squishes ANY number to between 0 and 1

Formula: sigmoid(x) = 1 / (1 + e^(-x))

Output interpretation:

Close to 1 → "Probably positive sentiment"

Close to 0 → "Probably negative sentiment"

0.5 → "Unsure"

D. (final_state)

We're feeding ONLY the final_state (not all rnn_outputs)

Why? For sentiment analysis, we only care about the overall sentiment after reading the ENTIRE sentence

The final_state contains the "summary" of the whole sentence

What happens inside Dense layer:
text
final_state = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  ← 8 numbers

Dense layer does:
output = sigmoid(W * final_state + b)

Where:
W = [w1, w2, w3, w4, w5, w6, w7, w8]  ← 8 weights (learned)
b = bias (learned)

Calculation:
1. Multiply each memory number by corresponding weight
2. Sum them all up
3. Add bias
4. Apply sigmoid to get probability

Example: sigmoid(2.5) = 0.924  ← "92.4% chance of positive sentiment"
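
The four calculation steps can be written out in a few lines of Python. The weights and bias below are hypothetical placeholders, since the real ones are learned during training.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_sigmoid(state, weights, bias):
    """Weighted sum of the state plus bias, passed through sigmoid."""
    z = sum(s * w for s, w in zip(state, weights)) + bias
    return sigmoid(z)

final_state = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
weights = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # hypothetical learned weights
bias = 0.1                                           # hypothetical learned bias

probability = dense_sigmoid(final_state, weights, bias)
print(round(sigmoid(2.5), 3))  # 0.924
print(0.0 < probability < 1.0)  # True
```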
4. Creating the Complete Model
python
model = Model(inputs=inp, outputs=output_layer)
What this does:
Connects everything together: Input → Embedding → RNN → Dense → Output

Creates a trainable model that maps:

text
Sentences (word indices) → Sentiment probability
Visual of the complete flow:

text
inp (Input layer)
  ↓
x (Embedding layer: words → 16D vectors)
  ↓
rnn_outputs, final_state (RNN layer: processes sequence)
  ↓ (we only take final_state)
final_state (8-number memory summary)
  ↓
output_layer (Dense: converts 8 numbers → 1 probability)
5. Compiling the Model: Setting Up Training Rules
python
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
A. optimizer='adam'
Adam: An algorithm that adjusts weights during training

Analogy: The "learning strategy" for our robot

What it does: "If you make a mistake, adjust weights this way to improve"

B. loss='binary_crossentropy'
Loss function: Measures how wrong our predictions are

Binary crossentropy: Perfect for yes/no (1/0) classification

Formula: loss = -[y·log(p) + (1-y)·log(1-p)], which penalizes confident wrong predictions more!

Example:

text
Prediction: 0.9 (90% positive)
Actual: 1 (positive) → Low loss (good!)
Actual: 0 (negative) → High loss (confidently wrong!)

Prediction: 0.6 (60% positive)  
Actual: 0 (negative) → Medium loss (less confident, smaller penalty)
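
To put numbers on "low", "high" and "medium", the loss for the three cases above can be computed by hand. This sketch uses the textbook formula for a single prediction; it leaves out the small clipping that libraries typically apply to avoid log(0).

```python
import math

def binary_crossentropy(y_true, p):
    """Loss for one example: y_true is 0 or 1, p is the predicted probability of class 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

for p, y in [(0.9, 1), (0.9, 0), (0.6, 0)]:
    print(f"prediction={p}, actual={y} -> loss={binary_crossentropy(y, p):.3f}")
# prediction=0.9, actual=1 -> loss=0.105
# prediction=0.9, actual=0 -> loss=2.303
# prediction=0.6, actual=0 -> loss=0.916
```

The confidently wrong prediction costs roughly twenty times more than the confidently right one.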
C. metrics=['accuracy']
Tells us what to measure during training

Accuracy: Percentage of correct predictions

Example: 85% accuracy = 85 out of 100 predictions correct

6. Model Summary: Seeing Our Creation
python
model.summary()
What this shows (example output):
text
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input (InputLayer)           [(None, 20)]              0         
_________________________________________________________________
embed (Embedding)            (None, 20, 16)            32000     
_________________________________________________________________
simple_rnn (SimpleRNN)       [(None, 20, 8), (None, 8)] 200      
_________________________________________________________________
output (Dense)               (None, 1)                 9         
=================================================================
Total params: 32,209
Trainable params: 32,209
Non-trainable params: 0
Let's understand each line:
1. Input Layer:

Output shape: (None, 20)

None = Batch size (can vary)

20 = maxlen (words per sentence)

Parameters: 0 (just a placeholder)

2. Embedding Layer:

Output shape: (None, 20, 16)

20 positions, each becomes 16-dimensional

Parameters: 32,000!

How? vocab_size × embed_dim = 2000 × 16 = 32,000

Each of 2000 words has 16 numbers to learn

3. RNN Layer:

Output shape shows TWO things (because return_sequences=True, return_state=True):

(None, 20, 8) = All sequence outputs

(None, 8) = Final state only

Parameters: 200

How? Let's calculate:

W_xh: Input to hidden: embed_dim × rnn_units = 16 × 8 = 128

W_hh: Hidden to hidden: rnn_units × rnn_units = 8 × 8 = 64

b_h: Bias: rnn_units = 8

Total: 128 + 64 + 8 = 200

4. Dense Layer:

Output shape: (None, 1) ← Single probability

Parameters: 9

How? rnn_units × 1 + 1 = 8 × 1 + 1 = 9

8 weights (one for each memory number) + 1 bias

Total parameters: 32,209 weights/biases to learn!
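
The parameter counts can be re-derived with simple arithmetic, which is a handy sanity check when changing `vocab_size`, `embed_dim` or `rnn_units`. The variable names for the counts are chosen just for this sketch.

```python
vocab_size, embed_dim, rnn_units = 2000, 16, 8

embedding_params = vocab_size * embed_dim          # one 16-number row per index
rnn_params = (embed_dim * rnn_units                # W_xh: input -> hidden
              + rnn_units * rnn_units              # W_hh: hidden -> hidden
              + rnn_units)                         # b_h: bias
dense_params = rnn_units * 1 + 1                   # 8 weights + 1 bias

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)  # 32000 200 9 32209
```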

7. Complete Flow with Example Data
Let's trace one sentence through the ENTIRE model:

Sentence: "I love this movie"
text
Step 1: Input
Sentence → Tokenized → Padded: [4, 15, 27, 1, 0, 0, ..., 0]  (20 numbers)

Step 2: Embedding Layer
Each number → 16D vector:
4 → [0.1, 0.3, -0.2, ... 16 numbers]
15 → [0.4, -0.1, 0.7, ...]
27 → [0.2, 0.5, 0.1, ...]
1 → [0.3, -0.2, 0.4, ...]
0 → [row 0's vector] (masked)
...
Shape becomes: (1, 20, 16)

Step 3: RNN Layer (time step by time step):
Time 0: Process "I" vector → Memory becomes [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
Time 1: Process "love" + previous memory → Memory: [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
Time 2: Process "this" + memory → Memory: [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.9]
Time 3: Process "movie" + memory → Memory: [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.9, 0.9]
Times 4-19: Padding (masked, so memory stays same)
Final state = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.9, 0.9]  (tanh keeps every value strictly between -1 and 1)

Step 4: Dense Layer
Take final_state [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.9, 0.9]
Multiply by learned weights, add bias, apply sigmoid:
Example calculation: 0.4×0.1 + 0.5×0.2 + 0.6×0.3 + 0.7×0.4 + 0.8×0.5 + 0.9×0.6 + 0.9×0.7 + 0.9×0.8 + 0.1
= 2.99 → sigmoid(2.99) ≈ 0.952

Step 5: Prediction
Output: 0.952 ≈ 95.2% probability of positive sentiment ✓
8. Why This Architecture for Sentiment Analysis?
Alternative: Using ALL RNN outputs (not just final state)
We could pool (for example, average) the outputs from every time step instead of using only the final state. But:

Using final state: Captures cumulative understanding of ENTIRE sentence

Averaging all outputs: Might dilute important words

Alternative: More RNN layers
We could stack multiple RNN layers:

python
rnn1 = SimpleRNN(units=8, return_sequences=True)(x)
rnn2 = SimpleRNN(units=8)(rnn1)  # Second layer
But for simple sentiment, one layer is often enough!

Summary: The Complete Picture
text
1. Input: Sentences as word indices
   [4, 15, 27, 1, 0, 0, ...]

2. Embedding: Words → Rich 16D vectors
   Each word gets 16 "personality traits"

3. RNN: Process sequence, building memory
   "I" → "I love" → "I love this" → "I love this movie"
   Final state = Summary of entire sentence

4. Dense: Convert 8D memory → Probability
   8 numbers → 1 number (0 to 1)

5. Output: Sentiment probability
   0.95 = "95% sure it's positive!"

In [None]:
model.fit(X,y, epochs=25, batch_size=8, verbose=1)

Epoch 1/25
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 13ms/step - accuracy: 0.3720 - loss: 0.7104
Epoch 2/25
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 12ms/step - accuracy: 0.4754 - loss: 0.6908 
Epoch 3/25
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - accuracy: 0.6638 - loss: 0.6703 
Epoch 4/25
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - accuracy: 0.7566 - loss: 0.6520 
Epoch 5/25
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step - accuracy: 0.7567 - loss: 0.6378 
Epoch 6/25
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - accuracy: 0.8969 - loss: 0.6157 
Epoch 7/25
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step - accuracy: 0.8306 - loss: 0.6110 
Epoch 8/25
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - accuracy: 0.8509 - loss: 0.5801 
Epoch 9/25
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m 

<keras.src.callbacks.history.History at 0x7e97d3b80740>

# Task
Create a Keras feature extraction model that takes the same input as the original model but outputs the `Embedding` layer's output (`x`) and the `SimpleRNN` layer's `rnn_outputs` (hidden states for each time step), and use it to extract and display these intermediate representations for the sample sentence "The cat is sleeping.". Include the shapes and a snippet of the numerical values for both the word embeddings and the RNN hidden states. Summarize the extracted embeddings and hidden states, explaining what these numerical representations signify in the context of the model's processing.

## Create Feature Extraction Model

### Subtask:
Build a new Keras model that takes the same input as the original model but outputs the `Embedding` layer's output (one vector per time step) and the `SimpleRNN` layer's `rnn_outputs` (hidden states for each time step).


**Reasoning**:
To create the feature extraction model as requested, I will define a new Keras Model that uses the existing 'inp' as input and outputs both the 'x' (embedding layer output) and 'rnn_outputs' (RNN layer's sequence outputs).



In [None]:
feature_extractor_model = Model(inputs=inp, outputs=[x, rnn_outputs])
print("Feature extraction model created successfully.")
feature_extractor_model.summary()

Feature extraction model created successfully.


**Reasoning**:
To extract the features, I need to prepare the sample sentence by tokenizing and padding it, then use the `feature_extractor_model` to predict the embedding layer output and the RNN hidden states.



In [None]:
sample_sentence = "The cat is sleeping."
sample_sequence = tokenizer.texts_to_sequences([sample_sentence])
sample_padded = pad_sequences(sample_sequence, maxlen=maxlen, padding="post")

embeddings, rnn_hidden_states = feature_extractor_model.predict(sample_padded)

print(f"\nWord Embeddings for '{sample_sentence}':")
print(f"Shape: {embeddings.shape}")
print("Snippet (first 3 words, first 5 dimensions):")
print(embeddings[0, :3, :5])

print(f"\nRNN Hidden States for '{sample_sentence}':")
print(f"Shape: {rnn_hidden_states.shape}")
print("Snippet (first 3 time steps, first 5 dimensions):")
print(rnn_hidden_states[0, :3, :5])


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 170ms/step

Word Embeddings for 'The cat is sleeping.':
Shape: (1, 6, 16)
Snippet (first 3 words, first 5 dimensions):
[[ 0.02297346 -0.02816715 -0.01602104 -0.0534954   0.01294242]
 [-0.04681893 -0.04857499 -0.022462   -0.0228855  -0.02925644]
 [ 0.00024168 -0.02866628  0.00211756  0.00082364 -0.02142565]]

RNN Hidden States for 'The cat is sleeping.':
Shape: (1, 6, 8)
Snippet (first 3 time steps, first 5 dimensions):
[[-0.03633358  0.04461361 -0.05488749  0.09265917 -0.0275171 ]
 [-0.09269962 -0.04962511 -0.07743929 -0.04999477  0.01358555]
 [-0.01875091 -0.05721251 -0.07047111  0.09739455  0.10126595]]


### Summary of Extracted Embeddings and Hidden States

#### Word Embeddings (Output from `Embedding` layer: `embeddings`)

**Shape:** (1, 6, 16)

**Significance:**
*   **Conceptual:** Each word in the sentence "The cat is sleeping." has been transformed from a simple integer ID into a 16-dimensional vector of floating-point numbers. These numbers are learned during the model's training and act as the word's "personality traits" for this task. In models trained on large corpora, words with similar meanings tend to have similar embedding vectors (i.e., their vectors are close in the 16-dimensional space); with only 50 short training sentences, that structure should not be expected to be strong here.
*   **Example Snippet:**
    ```
    [[ 0.02297346 -0.02816715 -0.01602104 -0.0534954   0.01294242]
     [-0.04681893 -0.04857499 -0.022462   -0.0228855  -0.02925644]
     [ 0.00024168 -0.02866628  0.00211756  0.00082364 -0.02142565]]
    ```
    - The first row corresponds to the embedding of "The", the second to "cat", and the third to "is". Individual numbers have no standalone interpretation; a vector is only meaningful as a whole, in comparison with other vectors. The sentence has four tokens, so the last two of the six positions are padding (ID `0`). Those positions also receive 16-D vectors from the embedding table, but `mask_zero=True` ensures the RNN ignores them during processing.
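
One way to check how close two embedding vectors are is cosine similarity. The sketch below is a minimal illustration, not part of the original notebook: it uses only the five dimensions printed in the snippet above, so the numbers it produces say nothing reliable about the full 16-D vectors. The helper name `cosine_similarity_matrix` is an assumption introduced here.

```python
import numpy as np

# First 5 of 16 dimensions, copied from the printed snippet above.
# In the notebook, use the full vectors instead: vectors = embeddings[0, :3]
vectors = np.array([
    [ 0.02297346, -0.02816715, -0.01602104, -0.0534954,   0.01294242],  # "The"
    [-0.04681893, -0.04857499, -0.022462,   -0.0228855,  -0.02925644],  # "cat"
    [ 0.00024168, -0.02866628,  0.00211756,  0.00082364, -0.02142565],  # "is"
])
words = ["The", "cat", "is"]

def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x (hypothetical helper)."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T

similarity = cosine_similarity_matrix(vectors)
for i, word in enumerate(words):
    print(word, np.round(similarity[i], 3))
```

Values near 1 mean two vectors point in nearly the same direction, and values near 0 mean they are close to orthogonal. With a training set this small, high or low similarity between two words should be treated as a property of this model, not of the words.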

#### RNN Hidden States (Output from `SimpleRNN` layer: `rnn_hidden_states`)

**Shape:** (1, 6, 8)

**Significance:**
*   **Conceptual:** The RNN hidden states represent the "memory" or contextual understanding of the sentence as the RNN processes it word by word. Since `return_sequences=True` was set, we get a hidden state vector (of 8 dimensions) for *each* time step (or word position) in the input sequence, including the padded zeros. Each 8-dimensional vector at a given time step encapsulates information about the current word *and* all preceding words in the sequence.
*   **Example Snippet:**
    ```
    [[-0.03633358  0.04461361 -0.05488749  0.09265917 -0.0275171 ]
     [-0.09269962 -0.04962511 -0.07743929 -0.04999477  0.01358555]
     [-0.01875091 -0.05721251 -0.07047111  0.09739455  0.10126595]]
    ```
    - The first row is the hidden state after processing "The", the second after "The cat", and the third after "The cat is". These vectors are dynamic; they change and accumulate information as the RNN reads the sentence. At the two padded positions, masking tells the RNN to skip the update, so with Keras's default behavior the output there repeats the hidden state from the last real word ("sleeping").
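
The effect of masking on the hidden states can be sketched with a hand-written recurrence in NumPy. This is an illustration of the idea under stated assumptions, not Keras's implementation: the weights are random stand-ins rather than the trained model's, and `masked_simple_rnn` is a hypothetical helper introduced here.

```python
import numpy as np

def masked_simple_rnn(inputs, mask, kernel, recurrent_kernel, bias):
    """Run h_t = tanh(x_t @ kernel + h_{t-1} @ recurrent_kernel + bias) over time.

    Where mask[t] is False the step is skipped: the state is carried over
    unchanged and the previous output is repeated.
    """
    state = np.zeros(recurrent_kernel.shape[0])
    outputs = []
    for x_t, keep in zip(inputs, mask):
        if keep:
            state = np.tanh(x_t @ kernel + state @ recurrent_kernel + bias)
        outputs.append(state)
    return np.stack(outputs)

rng = np.random.default_rng(0)
timesteps, embed_dim, units = 6, 16, 8  # matches the shapes reported above

# Hypothetical stand-ins: random values, not the trained model's weights.
inputs = rng.normal(scale=0.05, size=(timesteps, embed_dim))
kernel = rng.normal(scale=0.3, size=(embed_dim, units))
recurrent_kernel = rng.normal(scale=0.3, size=(units, units))
bias = np.zeros(units)

# "The cat is sleeping." has 4 tokens; post-padding fills positions 4 and 5.
mask = np.array([True, True, True, True, False, False])

hidden = masked_simple_rnn(inputs, mask, kernel, recurrent_kernel, bias)
print(hidden.shape)                                   # (6, 8)
print(np.allclose(hidden[4], hidden[3]),
      np.allclose(hidden[5], hidden[3]))              # True True
```

In the notebook itself, the same property can likely be checked on the real output with `np.allclose(rnn_hidden_states[0, 4], rnn_hidden_states[0, 3])`, assuming the `Embedding` layer was built with `mask_zero=True` and the `SimpleRNN` layer uses its default masking behavior.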

In essence, the **word embeddings** provide a rich, static representation for each individual word, while the **RNN hidden states** provide a dynamic, context-aware representation of the sentence's meaning as it unfolds sequentially.

## Summary:

### Q&A
*   **What do the extracted word embeddings signify in the context of the model's processing?**
    Word embeddings are 16-dimensional vectors, one per word, whose values are learned during training. Words that the model treats similarly tend to receive similar vectors. Because this model was trained on a small labeled dataset, the vectors reflect what is useful for its classification task more than general word meaning.

*   **What do the extracted RNN hidden states signify in the context of the model's processing?**
    RNN hidden states are 8-dimensional vectors that represent the "memory" or contextual understanding of the sentence as the RNN processes it word by word. Each hidden state at a given time step encapsulates information about the current word and all preceding words, providing a dynamic, context-aware representation of the sentence's meaning as it unfolds.

### Data Analysis Key Findings
*   A Keras feature extraction model was successfully created, which takes the same input as the original model and outputs both the `Embedding` layer's output and the `SimpleRNN` layer's sequence outputs.
*   For the sample sentence "The cat is sleeping.", the model extracted:
    *   **Word Embeddings**: These have a shape of `(1, 6, 16)`, indicating 1 input sequence, 6 time steps (4 words plus 2 padding positions), and 16 dimensions per embedding. The first five dimensions for the first word were `[ 0.02297346 -0.02816715 -0.01602104 -0.0534954   0.01294242]`.
    *   **RNN Hidden States**: These have a shape of `(1, 6, 8)`, indicating 1 input sequence, 6 time steps, and 8 dimensions for each hidden state. The first five dimensions at the first time step were `[-0.03633358  0.04461361 -0.05488749  0.09265917 -0.0275171 ]`.
*   Word embeddings provide a rich, static representation for individual words, while RNN hidden states offer a dynamic, context-aware representation of the sentence's meaning as it is processed sequentially.

### Insights or Next Steps
*   This feature extraction model exposes the network's intermediate activations, so the per-word embeddings and per-step hidden states can be inspected, compared, and plotted directly instead of only observing the final prediction.
*   Further analysis could involve plotting these embeddings and hidden states (e.g., using dimensionality reduction techniques like t-SNE) to visually inspect word relationships or how context evolves over a sentence, potentially revealing model biases or specific learning patterns.
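
As a starting point for such a plot, the sketch below projects vectors to two dimensions with PCA, computed through NumPy's SVD. PCA is swapped in here for t-SNE because it needs no extra dependency and behaves predictably on a handful of points; it is not the technique named above. The `hidden_states` array is a random stand-in, and `pca_2d` is a hypothetical helper.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Hypothetical stand-in for the (6, 8) hidden states of one sentence;
# in the notebook, pass rnn_hidden_states[0] instead.
rng = np.random.default_rng(0)
hidden_states = rng.normal(scale=0.1, size=(6, 8))

points = pca_2d(hidden_states)
print(points.shape)  # (6, 2)
```

Each row of `points` is one time step, so plotting the rows in order and connecting them shows how the hidden state moves as the sentence is read. The same function applies to `embeddings[0]` or to the full embedding matrix of the vocabulary.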
