# **Step 1: Loading the Dataset**

## **INTRODUCTION**
**Goal:** Load the English-Spanish translation pairs from a text file into memory.

**Why this matters:** Before we can build any encoder-decoder model, we need data. Seq2seq models learn by seeing many pairs of sequences (English sentence → Spanish sentence). This step simply reads our training data.

**Connection:** This is the foundation. Everything that follows (preprocessing, building encoder/decoder, training) depends on having these translation pairs.

In [1]:
from tensorflow import keras
import re
# Importing our translations
data_path = "utils/span-eng.txt"
# Defining lines as a list of each line
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')


In [2]:
# See how many translation pairs you have
print(f"Total lines loaded: {len(lines)}")

# Look at the first 5 examples
print("\nFirst 5 translation pairs:")
for i in range(5):
    print(f"{i+1}. {lines[i]}")

Total lines loaded: 11

First 5 translation pairs:
1. We'll see.	Después veremos.
2. We'll see.	Ya veremos.
3. We'll try.	Lo intentaremos.
4. We've won!	¡Hemos ganado!
5. Well done.	Bien hecho.


## **WHAT JUST HAPPENED?**

**Before this step:** Nothing - you had no data in memory.

**After this step:** 
- `lines` is now a Python list
- Each element is a string containing one translation pair
- Format of each line: `"English sentence\tSpanish sentence"`
- The `\t` (tab character) separates English from Spanish

**Key point:** `lines` is a **list of text elements (strings)**, where each string contains both English and Spanish separated by a tab.
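
To make the format concrete, here is a minimal sketch of how one of these strings breaks apart on the tab character. The sample strings are copied from the output above; the helper name `parse_pairs` is hypothetical and not part of the notebook. It also skips blank entries, on the assumption that a file ending in a newline would leave an empty string at the end of `lines`.

```python
def parse_pairs(lines):
    """Split tab-separated strings into (english, spanish) tuples."""
    pairs = []
    for line in lines:
        # Skip empty strings (e.g. left over from a trailing newline)
        if not line.strip():
            continue
        # Split on the first tab only: English on the left, Spanish on the right
        english, spanish = line.split("\t", 1)
        pairs.append((english, spanish))
    return pairs


sample_lines = ["We'll see.\tDespués veremos.", "Well done.\tBien hecho.", ""]
pairs = parse_pairs(sample_lines)
print(pairs)
```

The notebook itself keeps the raw strings in `lines` and splits them later, inside the preprocessing loop.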

# **Step 2: Preprocessing and Tokenization**

## **WHERE WE'RE COMING FROM:**
In Step 1, we loaded raw translation pairs into a single list called `lines`. Each element was a string containing both English and Spanish sentences separated by a tab character (`\t`).

## **WHAT WE'RE DOING NOW:**
We need to prepare this raw data for the encoder-decoder model by:
1. **Separating** English and Spanish into different lists
2. **Adding special tokens** (`<START>` and `<END>`) to Spanish sentences
3. **Separating punctuation** from words (so the model treats `.` as its own token, not part of the word)
4. **Building vocabularies** - collecting all unique words/tokens from both languages

## **WHY THIS MATTERS FOR SEQ2SEQ:**
- The **encoder** needs clean English sentences as input
- The **decoder** needs Spanish sentences with special markers:
  - `<START>` tells it "begin generating translation"
  - `<END>` tells it "stop generating, translation complete"
- Both need vocabularies to convert words → numbers later

## **WHAT'S NEXT:**
After this step, we'll have organized data and vocabularies. The next step will be converting these words into numerical representations (indices) that the neural network can actually process.
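
The punctuation separation and special tokens described above can be sketched on a single pair as follows. The regex is the one used in the notebook; the helper names `separate_punctuation` and `prepare_target` are hypothetical and exist only for this illustration.

```python
import re

# Same pattern the notebook uses: a run of word characters/apostrophes,
# or a single character that is neither whitespace nor a word character.
TOKEN_PATTERN = r"[\w']+|[^\s\w]"


def separate_punctuation(sentence):
    """Return the sentence with every token separated by one space."""
    return " ".join(re.findall(TOKEN_PATTERN, sentence))


def prepare_target(sentence):
    """Tokenize a target sentence and wrap it in <START>/<END> markers."""
    return "<START> " + separate_punctuation(sentence) + " <END>"


english = separate_punctuation("We've won!")
spanish = prepare_target("¡Hemos ganado!")
print(english)
print(spanish)
```

Note that the inverted exclamation mark `¡` is treated as its own token, just like `!`.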

In [3]:
# Building empty lists to hold sentences
input_docs = []
target_docs = []
# Building empty vocabulary sets
input_tokens = set()
target_tokens = set()
for line in lines:
    # Input and target sentences are separated by a tab character (\t)
    input_doc, target_doc = line.split('\t')

    # Separate punctuation from words in the English sentence,
    # e.g. "We'll see." becomes "We'll see ."
    input_doc = " ".join(re.findall(r"[\w']+|[^\s\w]", input_doc))

    # Append each processed input sentence to input_docs
    input_docs.append(input_doc)
    # Splitting words from punctuation in the Spanish sentence
    # The regex means: "Match either a sequence of word characters (including apostrophes) OR a single punctuation mark".
    '''
    The regex scans through and finds:

    - "Después" ← matches [\w']+ (word characters)
    - "veremos" ← matches [\w']+ (word characters)
    - "." ← matches [^\s\w] (punctuation)

    So re.findall() returns A LIST: ["Después", "veremos", "."]
    Then " ".join(...) puts spaces between them: "Después veremos ."
    '''
    target_doc = " ".join(re.findall(r"[\w']+|[^\s\w]", target_doc))
    # Redefine target_doc below 
    # and append it to target_docs:
    target_doc = "<START> " + target_doc + " <END>"
    target_docs.append(target_doc)
    
    # Now we split up each sentence into words
    # and add each unique word to our vocabulary set
    for token in re.findall(r"[\w']+|[^\s\w]", input_doc):
        # Add each English token to the input vocabulary
        input_tokens.add(token)

    for token in target_doc.split():
        # Add each Spanish token (including <START> and <END>) to the target vocabulary
        target_tokens.add(token)

## **WHAT WE JUST DID: Step-by-Step Flow**

### **The Processing Pipeline:**
```
For each line in our dataset:

1. SPLIT: "We'll see.\tDespués veremos."
   ↓
   English: "We'll see."
   Spanish: "Después veremos."

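2. PROCESS ENGLISH - Separate punctuation with regex, then store:
   "We'll see."
   → regex finds: ["We'll", "see", "."]
   → join with spaces: "We'll see ."
   input_docs.append("We'll see .")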

3. PROCESS SPANISH - Separate punctuation with regex:
   "Después veremos." 
   → regex finds: ["Después", "veremos", "."]
   → join with spaces: "Después veremos ."

4. ADD SPECIAL TOKENS TO SPANISH:
   "Después veremos ."
   → "<START> Después veremos . <END>"
   
5. STORE PROCESSED SPANISH:
   target_docs.append("<START> Después veremos . <END>")

6. BUILD ENGLISH VOCABULARY:
   - Apply regex to "We'll see ." → ["We'll", "see", "."]
   - Add each token to input_tokens set

7. BUILD SPANISH VOCABULARY:
   - Split "<START> Después veremos . <END>" → ["<START>", "Después", "veremos", ".", "<END>"]
   - Add each token to target_tokens set
```

### **FINAL RESULT:**
- `input_docs`: List of clean English sentences
- `target_docs`: List of Spanish sentences with `<START>` and `<END>` markers
- `input_tokens`: Set of all unique English words and punctuation
- `target_tokens`: Set of all unique Spanish words, punctuation, and special tokens

# **Step 3: Vocabulary Sizing and Sequence Length Analysis**

## **WHERE WE'RE COMING FROM:**
In Step 2, we created:
- `input_docs` and `target_docs` - separate lists of English and Spanish sentences
- `input_tokens` and `target_tokens` - sets of unique words/punctuation from both languages

These vocabularies are currently **unordered sets**. We need to convert them to ordered structures so we can assign consistent index numbers to each word.

## **WHAT WE'RE DOING NOW:**
We're preparing critical parameters that the encoder-decoder model needs:

1. **Sort vocabularies** - Convert sets to sorted lists (for consistent word-to-index mapping)
2. **Count vocabulary sizes** - How many unique tokens in each language?
3. **Find maximum sentence lengths** - What's the longest sentence in our dataset?

## **WHY THIS MATTERS FOR SEQ2SEQ:**
- **Vocabulary sizes** → Determine the size of the model's output layer (one neuron per possible word)
- **Max sequence lengths** → Needed for padding (all sentences must be padded to the same length)

In [4]:
input_tokens = sorted(list(input_tokens))
target_tokens = sorted(list(target_tokens))

# Create num_encoder_tokens and num_decoder_tokens:
num_encoder_tokens = len(input_tokens)
num_decoder_tokens = len(target_tokens)
'''
max_encoder_seq_length = max([len(re.findall(r"[\w']+|[^\s\w]", input_doc)) for input_doc in input_docs])
'''
max_encoder_seq_length = max([len(doc.split()) for doc in input_docs])
max_decoder_seq_length = max([len(doc.split()) for doc in target_docs])

## **WHAT WE JUST DID: Step-by-Step Flow**

### **The Processing Pipeline:**
```
STEP 1: Convert Sets to Sorted Lists
--------------------------------
Before: input_tokens = {'see', "We'll", 'try', ...}  (unordered set)
After:  input_tokens = ['!', '.', '?', "We'll", ..., 'see', 'try', ...]  (sorted list)

Why? Sets have no guaranteed order - the order in which strings come out of a set can change between runs.
Sorting (by character code: punctuation, then uppercase, then lowercase) ensures consistency: 'see' always gets the same position.

STEP 2: Count Vocabulary Sizes
--------------------------------
num_encoder_tokens = len(input_tokens)  # 18 unique English tokens in this dataset
num_decoder_tokens = len(target_tokens)  # 27 unique Spanish tokens in this dataset

Why? The model's output layer needs one neuron per possible token.
With 27 Spanish tokens, the decoder outputs 27 probabilities (one per token).

STEP 3: Find Maximum Sentence Lengths
--------------------------------
For each sentence in input_docs:
  - Tokenize it with regex
  - Count how many tokens
  - Keep track of the maximum

Example:
  "We'll see." → ["We'll", "see", "."] → 3 tokens
  "Who is he?" → ["Who", "is", "he", "?"] → 4 tokens
  max_encoder_seq_length = 4

Why? Neural networks need fixed-size inputs. Later, we'll pad shorter sentences
with zeros to match the longest one (3-token sentence becomes [token, token, token, 0]).
```

### **FINAL RESULT:**
- `input_tokens`: Sorted list of unique English tokens
- `target_tokens`: Sorted list of unique Spanish tokens (including `<START>`, `<END>`)
- `num_encoder_tokens`: Integer - size of English vocabulary
- `num_decoder_tokens`: Integer - size of Spanish vocabulary
- `max_encoder_seq_length`: Integer - longest English sentence in tokens
- `max_decoder_seq_length`: Integer - longest Spanish sentence in tokens
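
As a small illustration of these quantities, the sketch below computes them for two sentences taken from the dataset. The variable names `docs`, `vocab`, `num_tokens`, and `max_seq_length` are chosen for this example only. Python's `sorted()` orders strings by character code, so punctuation comes before uppercase letters, which come before lowercase letters.

```python
docs = ["We'll see .", "Who is he ?"]

# Collect the unique tokens, then sort them for a reproducible order
vocab = sorted({token for doc in docs for token in doc.split()})

# Vocabulary size and longest sentence (in tokens)
num_tokens = len(vocab)
max_seq_length = max(len(doc.split()) for doc in docs)

print(vocab)
print(num_tokens, max_seq_length)
```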

# **Step 4: Creating Word-to-Index Mappings (Tokenization Dictionaries)**

## **WHERE WE'RE COMING FROM:**
In Step 3, we created sorted lists of unique tokens:
- `input_tokens` - sorted list of English words/punctuation
- `target_tokens` - sorted list of Spanish words/punctuation (including `<START>`, `<END>`)

These are still just **words** (strings). Neural networks can't process text - they need **numbers**.

## **WHAT WE'RE DOING NOW:**
We're creating **dictionaries** that map between words and numbers:

1. **Forward mappings** (word → index): Convert text to numbers for model input
   - `input_features_dict`: English word → number
   - `target_features_dict`: Spanish word → number

2. **Reverse mappings** (index → word): Convert numbers back to text for human reading
   - `reverse_input_features_dict`: number → English word
   - `reverse_target_features_dict`: number → Spanish word

## **WHY THIS MATTERS FOR SEQ2SEQ:**
- **Training phase**: Convert "We'll see ." → `[3, 14, 1]` to feed into encoder
- **Inference phase**: Convert decoder output `[7, 23, 1]` → "Después veremos ." for humans to read
- **Consistency**: Because we sorted in Step 3, each token always gets the same index (e.g., "see" is always 14)

## **THE PROCESS:**
Text → Numbers (forward dict) → Model → Numbers → Text (reverse dict)

"see" → 14 → [neural network processing] → 23 → "veremos"

## **WHAT'S NEXT:**
With these mappings in place, we'll create the actual training data - converting all our sentences into numerical arrays that the encoder-decoder can process.

In [5]:
import numpy as np

print('Number of samples:', len(input_docs))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

# Remember, input_tokens is a sorted list: ['!', '.', '?', "We'll", "We've", ...]
# enumerate yields (index, value) tuples, one for each item in the list
# Output:
# 0 !
# 1 .
# 2 ?
# 3 We'll
# 4 We've
# ...
# So this creates a dictionary where the keys are the tokens and the values are the indexes
input_features_dict = dict(
    [(token, i) for i, token in enumerate(input_tokens)])

# Build out target_features_dict:
target_features_dict = dict([(token, i) for i, token in enumerate(target_tokens)])

# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_features_dict = dict(
    (i, token) for token, i in input_features_dict.items())
# Build out reverse_target_features_dict:
reverse_target_features_dict = dict([(i, token) for token, i in target_features_dict.items()])

notes_ = '''
APPROACH 2, simpler:
First create the forward dict
input_features_dict = dict(
    [(token, i) for i, token in enumerate(input_tokens)])

Then build the reverse dict directly from enumerate (no need to flip the forward dict)
reverse_input_features_dict = dict(
    (i, token) for i, token in enumerate(input_tokens))
'''

Number of samples: 11
Number of unique input tokens: 18
Number of unique output tokens: 27
Max sequence length for inputs: 4
Max sequence length for outputs: 8


## **WHAT WE JUST DID: Step-by-Step Flow**

### **The Processing Pipeline:**
```
STEP 1: Print Summary Statistics
--------------------------------
Display key parameters:
- Total number of translation pairs
- Vocabulary sizes for English and Spanish
- Maximum sentence lengths

This gives us a quick sanity check of our data.

STEP 2: Create Forward Mapping for English
--------------------------------
input_tokens = ['!', '.', '?', "We'll", "We've", ...]  (sorted list)
                ↓ enumerate() yields (0, '!'), (1, '.'), (2, '?'), ...
                ↓ comprehension flips to (token, index)
input_features_dict = {'!': 0, '.': 1, '?': 2, "We'll": 3, "We've": 4, ...}

Purpose: Look up any English word and get its index number
Example: input_features_dict["We'll"] → 3

STEP 3: Create Forward Mapping for Spanish
--------------------------------
target_tokens = ['!', '.', '<END>', '<START>', '?', 'A', 'Bien', 'Después', ...]
                ↓ enumerate() yields (0, '!'), (1, '.'), (2, '<END>'), ...
                ↓ comprehension flips to (token, index)
target_features_dict = {'!': 0, '.': 1, '<END>': 2, '<START>': 3, ..., 'Después': 7, ...}

Purpose: Look up any Spanish word and get its index number
Example: target_features_dict['Después'] → 7

STEP 4: Create Reverse Mapping for English
--------------------------------
input_features_dict.items() = [('!', 0), ('.', 1), ('?', 2), ("We'll", 3), ...]
                ↓ comprehension flips to (index, token)
reverse_input_features_dict = {0: '!', 1: '.', 2: '?', 3: "We'll", ...}

Purpose: Convert index numbers back to English words
Example: reverse_input_features_dict[3] → "We'll"

STEP 5: Create Reverse Mapping for Spanish
--------------------------------
target_features_dict.items() = [('!', 0), ('.', 1), ('<END>', 2), ('<START>', 3), ..., ('Después', 7), ...]
                ↓ comprehension flips to (index, token)
reverse_target_features_dict = {0: '!', 1: '.', 2: '<END>', 3: '<START>', ..., 7: 'Después', ...}

Purpose: Convert index numbers back to Spanish words
Example: reverse_target_features_dict[7] → 'Después'
```

### **COMPLETE WORKFLOW - TEXT TO NUMBERS AND BACK:**
```
ENCODING (Training):
"We'll see ."
    → tokenize: ["We'll", 'see', '.']
    → forward dict: [3, 14, 1]
    → feed to encoder

DECODING (Inference):
decoder output: [7, 23, 1]
    → reverse dict: ['Después', 'veremos', '.']
    → join: "Después veremos ."
    → show to human
```

### **FINAL RESULT:**
We now have 4 dictionaries that allow bidirectional conversion between words and numbers:
- `input_features_dict`: English word → number
- `target_features_dict`: Spanish word → number  
- `reverse_input_features_dict`: number → English word
- `reverse_target_features_dict`: number → Spanish word
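
A round trip through the English dictionaries might look like the sketch below. The sentence list is copied from the notebook's printed `input_docs`; the helpers `encode` and `decode` are hypothetical names introduced for this example, and dict comprehensions are used as an equivalent shorthand for the `dict([...])` calls in the notebook.

```python
input_docs = ["We'll see .", "We'll see .", "We'll try .", "We've won !",
              "Well done .", "What's up ?", "Who cares ?", "Who drove ?",
              "Who drove ?", "Who is he ?", "Who is it ?"]

# Sorted vocabulary, then forward (token -> index) and reverse (index -> token) maps
input_tokens = sorted({tok for doc in input_docs for tok in doc.split()})
input_features_dict = {tok: i for i, tok in enumerate(input_tokens)}
reverse_input_features_dict = {i: tok for tok, i in input_features_dict.items()}


def encode(sentence):
    """Convert a space-separated sentence into a list of token indices."""
    return [input_features_dict[tok] for tok in sentence.split()]


def decode(indices):
    """Convert a list of token indices back into a sentence."""
    return " ".join(reverse_input_features_dict[i] for i in indices)


indices = encode("We'll see .")
print(indices)
print(decode(indices))
```

Because the vocabulary is sorted before indexing, the same sentence produces the same indices on every run.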

# **Step 5: Creating Empty 3D Training Matrices**

## **WHERE WE'RE COMING FROM:**
In Step 4, we created four dictionaries that map between words and numbers:
- `input_features_dict` and `reverse_input_features_dict` (English)
- `target_features_dict` and `reverse_target_features_dict` (Spanish)

We can now convert individual words to numbers, but we need **entire datasets** in numerical form.

## **WHAT WE'RE DOING NOW:**
We're creating three 3D NumPy arrays filled with zeros. These will serve as containers for all our training data:

1. **encoder_input_data** - English sentences that go into the encoder
2. **decoder_input_data** - Spanish sentences that the decoder sees as input
3. **decoder_target_data** - Spanish sentences that the decoder should output (ground truth)

## **WHY THREE DIMENSIONS?**

Each matrix has the shape: **(number of sentences, max sequence length, vocabulary size)**

This 3D structure allows us to represent:
- **Dimension 1:** Which sentence from our dataset?
- **Dimension 2:** Which position in that sentence? (1st word, 2nd word, etc.)
- **Dimension 3:** Which word is it? (one-hot encoded vector)

## **WHY WE NEED THREE SEPARATE MATRICES:**

The encoder-decoder architecture requires:
- **Encoder input:** English sentences to encode into meaning
- **Decoder input:** Spanish sentences starting with `<START>` token
- **Decoder target:** Spanish sentences ending with `<END>` token (shifted by one position)

The decoder learns by seeing the Spanish translation with `<START>` and trying to output the translation with `<END>`.

## **WHAT'S NEXT:**
After creating these empty matrices, we'll fill them with actual one-hot encoded words from our sentences. Right now they're just scaffolding - all zeros waiting to be populated.
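
To see what the shape means in practice, here is a minimal sketch that allocates the encoder matrix with the sizes printed by the notebook in Step 4 (11 samples, maximum input length 4, 18 input tokens). The sizes are hard-coded here only so the example runs on its own.

```python
import numpy as np

num_samples = 11             # translation pairs
max_encoder_seq_length = 4   # longest English sentence, in tokens
num_encoder_tokens = 18      # English vocabulary size

encoder_input_data = np.zeros(
    (num_samples, max_encoder_seq_length, num_encoder_tokens),
    dtype="float32")

print(encoder_input_data.shape)     # whole dataset: (sentences, positions, vocabulary)
print(encoder_input_data[0].shape)  # one sentence: (positions, vocabulary)
print(encoder_input_data.nbytes)    # total bytes: 4 bytes per float32 cell
```

With these sizes the array holds 11 × 4 × 18 = 792 cells. One-hot arrays grow with both the dataset and the vocabulary, so the same layout becomes much larger on a full-size corpus.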

In [6]:
encoder_input_data = np.zeros(
    (len(input_docs), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
  
print("\nHere's the first item in the encoder input matrix:\n", encoder_input_data[0], "\n\nThe number of columns should match the number of unique input tokens and the number of rows should match the maximum sequence length for input sentences.")

# Build out the decoder_input_data matrix:
decoder_input_data = np.zeros(
    (len(input_docs), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
# Build out the decoder_target_data matrix:
decoder_target_data = np.zeros(
    (len(target_docs), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')


Here's the first item in the encoder input matrix:
 [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]] 

The number of columns should match the number of unique input tokens and the number of rows should match the maximum sequence length for input sentences.


## **WHAT WE JUST DID: Step-by-Step Flow**

### **The Processing Pipeline:**
```
STEP 1: Create Encoder Input Matrix
------------------------------------
Shape: (len(input_docs), max_encoder_seq_length, num_encoder_tokens)
Example: (11 sentences, 4 time steps, 18 English tokens)

Result: A 3D array filled with zeros
Purpose: Will hold one-hot encoded English sentences

STEP 2: Create Decoder Input Matrix  
------------------------------------
Shape: (len(input_docs), max_decoder_seq_length, num_decoder_tokens)
Example: (11 sentences, 8 time steps, 27 Spanish tokens)

Result: A 3D array filled with zeros
Purpose: Will hold Spanish sentences starting with <START>

STEP 3: Create Decoder Target Matrix
------------------------------------
Shape: (len(target_docs), max_decoder_seq_length, num_decoder_tokens)
Example: (11 sentences, 8 time steps, 27 Spanish tokens)

Result: A 3D array filled with zeros
Purpose: Will hold Spanish sentences ending with <END> (what decoder should output)
```

---

## **VISUALIZING THE 3D MATRIX STRUCTURE**

### **Understanding the Dimensions:**

Let's use concrete numbers from your dataset:
- You have **11 sentences**
- Longest English sentence: **4 tokens**
- English vocabulary: **18 unique tokens**

So `encoder_input_data` has shape **(11, 4, 18)**

---

### **Think of it as a stack of 2D matrices:**
```
encoder_input_data = [
    sentence_0,  ← 2D matrix (4 rows × 18 columns)
    sentence_1,  ← 2D matrix (4 rows × 18 columns)
    sentence_2,  ← 2D matrix (4 rows × 18 columns)
    ...
    sentence_10  ← 2D matrix (4 rows × 18 columns)
]
```

---

### **Zooming into ONE sentence (2D matrix):**

Let's say sentence 0 is "We'll see ."

**After tokenization:**
- Position 0: "We'll" (index 3 in vocabulary)
- Position 1: "see" (index 14 in vocabulary)
- Position 2: "." (index 1 in vocabulary)
- Position 3: (padding - will stay all zeros)

**The matrix for this sentence (4 × 18):**
```
                 Word_0  Word_1  ...  Word_3  ...  Word_14  ...  Word_17
                   !       .          We'll         see           won
Position 0:       [0       0     ...     1     ...     0     ...     0    ]  ← "We'll"
Position 1:       [0       0     ...     0     ...     1     ...     0    ]  ← "see"
Position 2:       [0       1     ...     0     ...     0     ...     0    ]  ← "."
Position 3:       [0       0     ...     0     ...     0     ...     0    ]  ← padding
```

**Key points:**
- Each row = one word position in the sentence
- Each row has exactly ONE cell with value 1 (the word's index)
- All other cells in that row are 0
- This is called **one-hot encoding**

---

### **Example: One-Hot Encoding Breakdown**

**Vocabulary (from this dataset):**
```
Index 0: !
Index 1: .
Index 2: ?
Index 3: We'll
Index 4: We've
...
Index 14: see
...
Index 17: won
```

**The word "see" (index 14) becomes:**
```
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
                                           ↑
                                   Position 14 = 1
                                   All others = 0
```

This vector has length 18 (same as vocabulary size).
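
The same idea can be written as a tiny function. This is a minimal sketch in plain Python lists, using a five-token toy vocabulary invented for the example rather than the notebook's full vocabulary; the name `one_hot` is hypothetical.

```python
def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1 at the given index."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector


# Toy vocabulary for illustration only (already in sorted order)
toy_vocab = ["!", ".", "?", "see", "try"]
token_to_index = {token: i for i, token in enumerate(toy_vocab)}

see_vector = one_hot(token_to_index["see"], len(toy_vocab))
print(see_vector)
```

The vector is always as long as the vocabulary, and exactly one entry is 1.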

---

### **The Complete 3D Structure:**
```
DIMENSION 1 (Sentence selector):
encoder_input_data[0]  → "We'll see ."
encoder_input_data[1]  → "We'll see ."
encoder_input_data[2]  → "We'll try ."
...

DIMENSION 2 (Word position in sentence):
encoder_input_data[0][0]  → First word of first sentence: "We'll"
encoder_input_data[0][1]  → Second word of first sentence: "see"
encoder_input_data[0][2]  → Third word of first sentence: "."

DIMENSION 3 (One-hot vector for the word):
encoder_input_data[0][0][3]   → This cell = 1 (because "We'll" is at index 3)
encoder_input_data[0][0][0]   → This cell = 0
encoder_input_data[0][0][1]   → This cell = 0
... (all other positions are 0)
```

---

### **Why This Structure?**

**Neural networks need:**
1. **Fixed-size inputs** → All sentences padded to same length (max_encoder_seq_length)
2. **Numerical data** → Words converted to numbers via one-hot encoding
3. **Batch processing** → Multiple sentences stacked in dimension 0

**The 3D shape enables:**
- Processing all 11 sentences at once (batch training)
- Handling variable-length sentences (padding with zeros)
- Representing each word as a unique vector (one-hot encoding)

---

### **Current State: All Zeros**

Right now, all three matrices look like this:
```
encoder_input_data[0] = 
[[0, 0, 0, ..., 0],
 [0, 0, 0, ..., 0],
 [0, 0, 0, ..., 0],
 [0, 0, 0, ..., 0]]
```

**Next step:** Fill these zeros with actual one-hot encoded words from your sentences!

---

### **WHAT CHANGED FROM STEP 4:**

| Aspect | Step 4 | Step 5 |
|--------|--------|--------|
| **Data structure** | Dictionaries (word ↔ index mappings) | 3D NumPy arrays (empty containers) |
| **Content** | Lookup tables for individual words | Space allocated for entire dataset |
| **Ready for training?** | No - need actual numerical arrays | Almost - containers ready, need to fill them |
| **Memory allocated?** | Minimal (just dictionaries) | Yes - arrays sized for the entire dataset allocated in RAM |

✅ **Ready for Step 6 - Filling the matrices with actual data!**

# **Step 6: Filling the Matrices with One-Hot Encoded Data**

## **WHERE WE'RE COMING FROM:**
In Step 5, we created three empty 3D NumPy arrays:
- `encoder_input_data` - shape: (num_sentences, max_encoder_seq_length, num_encoder_tokens)
- `decoder_input_data` - shape: (num_sentences, max_decoder_seq_length, num_decoder_tokens)
- `decoder_target_data` - shape: (num_sentences, max_decoder_seq_length, num_decoder_tokens)

These were filled entirely with zeros - just empty scaffolding waiting for actual data.

## **WHAT WE'RE DOING NOW:**
We're converting our text sentences into one-hot encoded vectors and filling these matrices with actual training data. This is where the transformation from human-readable text to machine-processable numbers is completed.

**The process:**
1. Loop through each sentence pair (English + Spanish)
2. For each word in the English sentence → create one-hot vector → place in `encoder_input_data`
3. For each word in the Spanish sentence → create one-hot vector → place in `decoder_input_data`
4. For each word in the Spanish sentence (shifted by 1) → create one-hot vector → place in `decoder_target_data`

## **KEY CONCEPT: The Decoder Input/Target Shift**

The decoder learns by seeing the correct previous word and predicting the next word. At every position, the target is the decoder input shifted one step to the left.
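
As a minimal sketch of this pairing, the snippet below lines up each decoder input token with the token the decoder should predict next, using the first Spanish sentence from the dataset. The name `training_pairs` is introduced for this example only.

```python
target_doc = "<START> Después veremos . <END>"
tokens = target_doc.split()

# Pair each token with the one that follows it:
# the left item is what the decoder sees, the right item is what it should predict.
training_pairs = list(zip(tokens[:-1], tokens[1:]))

for seen, expected in training_pairs:
    print(f"{seen:10s} -> {expected}")
```

`<START>` only ever appears on the input side, and `<END>` is the final prediction.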

In [7]:
print(input_docs)

["We'll see .", "We'll see .", "We'll try .", "We've won !", 'Well done .', "What's up ?", 'Who cares ?', 'Who drove ?', 'Who drove ?', 'Who is he ?', 'Who is it ?']


In [8]:
print(target_docs)

['<START> Después veremos . <END>', '<START> Ya veremos . <END>', '<START> Lo intentaremos . <END>', '<START> ¡ Hemos ganado ! <END>', '<START> Bien hecho . <END>', '<START> ¿ Qué hay ? <END>', '<START> ¿ A quién le importa ? <END>', '<START> ¿ Quién condujo ? <END>', '<START> ¿ Quién conducía ? <END>', '<START> ¿ Quién es él ? <END>', '<START> ¿ Quién es ? <END>']


In [9]:
'''
zip() takes two (or more) lists and pairs up their elements into tuples.
What enumerate(zip(input_docs, target_docs)) produces:
[
    (0, ("We'll see .", "<START> Después veremos . <END>")),
    (1, ("We'll see .", "<START> Ya veremos . <END>")),
    (2, ("We'll try .", "<START> Lo intentaremos . <END>")),
    ...
]
'''
for line, (input_doc, target_doc) in enumerate(zip(input_docs, target_docs)):
    '''
    So now we have 3 separate entities: 
    - line: the sentence index (which sentence number we're processing: 0, 1, 2, ...)
    - input_doc: the actual English sentence string at that index
    - target_doc: the actual Spanish sentence string at that index
    
    Starting at line = 0, we have:
        - input_doc = "We'll see ."
        - target_doc = "<START> Después veremos . <END>"
    '''
    
    for timestep, token in enumerate(input_doc.split()):
    # for timestep, token in enumerate(re.findall(r"[\w']+|[^\s\w]", input_doc)):
        '''
        This produces tuples that enumerate each token of the input_doc currently being looped.
        So, for "We'll see .", it gives:
        (0, "We'll")
        (1, "see")
        (2, ".")
        
        timestep = the position of the word in the sentence
        token = the actual word/punctuation string
        '''
        print("Encoder input timestep & token:", timestep, token)
        # We need to recover the index of the token from the vocabulary dictionary,
        # so we can place the 1 at the correct position in the one-hot vector
        print(input_features_dict[token])
        # Assign 1 for the current sentence (line), word position (timestep), 
        # and word index in encoder_input_data:
        encoder_input_data[line, timestep, input_features_dict[token]] = 1.
    
    for timestep, token in enumerate(target_doc.split()):
        '''
        Same as what we did above, but now we fill decoder_input_data.
        Only difference: target_doc tokens are already separated by spaces,
        so .split() produces a list of the tokens without needing regex.
        
        For "<START> Después veremos . <END>", this gives:
        (0, "<START>")
        (1, "Después")
        (2, "veremos")
        (3, ".")
        (4, "<END>")
        '''
        print("Decoder input timestep & token:", timestep, token)
        # Assign 1 for the current sentence (line), word position (timestep),
        # and word index in decoder_input_data:
        print(target_features_dict[token])
        decoder_input_data[line, timestep, target_features_dict[token]] = 1.
        
        if timestep > 0:
            '''
            decoder_target_data is shifted left by 1 timestep (ahead), meaning:
            - decoder_input_data at position 0 has "<START>"
            - decoder_target_data at position 0 has "Después" (what should be output when seeing <START>)
            - decoder_input_data at position 1 has "Después"
            - decoder_target_data at position 1 has "veremos" (what should be output when seeing Después)
            
            The if timestep > 0 skips the <START> token (timestep 0), because <START> 
            should never appear in the target output - it's only an input signal.
            '''
            print("Decoder target timestep:", timestep, token)
            # Assign 1 for the current sentence (line), shifted position (timestep-1),
            # and word index in decoder_target_data:
            decoder_target_data[line, timestep-1, target_features_dict[token]] = 1.
            # We use timestep-1 to shift everything left by one position

Encoder input timestep & token: 0 We'll
3
Encoder input timestep & token: 1 see
14
Encoder input timestep & token: 2 .
1
Decoder input timestep & token: 0 <START>
3
Decoder input timestep & token: 1 Después
7
Decoder target timestep: 1 Después
Decoder input timestep & token: 2 veremos
23
Decoder target timestep: 2 veremos
Decoder input timestep & token: 3 .
1
Decoder target timestep: 3 .
Decoder input timestep & token: 4 <END>
2
Decoder target timestep: 4 <END>
Encoder input timestep & token: 0 We'll
3
Encoder input timestep & token: 1 see
14
Encoder input timestep & token: 2 .
1
Decoder input timestep & token: 0 <START>
3
Decoder input timestep & token: 1 Ya
12
Decoder target timestep: 1 Ya
Decoder input timestep & token: 2 veremos
23
Decoder target timestep: 2 veremos
Decoder input timestep & token: 3 .
1
Decoder target timestep: 3 .
Decoder input timestep & token: 4 <END>
2
Decoder target timestep: 4 <END>
Encoder input timestep & token: 0 We'll
3
Encoder input timestep & token: 1 t

## **WHAT WE JUST DID: Step-by-Step Flow**

### **The Processing Pipeline:**
```
STEP 1: Pair Up Sentences
--------------------------
Use zip() to pair English and Spanish sentences:
("We'll see .", "<START> Después veremos . <END>")

Use enumerate() to add sentence indices:
(0, ("We'll see .", "<START> Después veremos . <END>"))

STEP 2: Fill Encoder Input Data
--------------------------------
For each English sentence:
  - Split into tokens: ["We'll", "see", "."]
  - For each token:
    * Look up its index in input_features_dict
    * Set encoder_input_data[sentence_idx, word_position, word_index] = 1
    * This creates a one-hot encoded vector

STEP 3: Fill Decoder Input Data
--------------------------------
For each Spanish sentence:
  - Split into tokens: ["<START>", "Después", "veremos", ".", "<END>"]
  - For each token:
    * Look up its index in target_features_dict
    * Set decoder_input_data[sentence_idx, word_position, word_index] = 1
    * This creates a one-hot encoded vector

STEP 4: Fill Decoder Target Data (Shifted!)
--------------------------------------------
For each Spanish sentence (skipping <START>):
  - Start from timestep 1 (skip timestep 0 which is <START>)
  - For each token after <START>:
    * Place it at position (timestep - 1) in decoder_target_data
    * This shifts everything left by one position
```

---

## **VISUALIZING THE ONE-HOT ENCODING**

### **Example Sentence: "We'll see ."**

**After tokenization (from Step 2):**
```
["We'll", "see", "."]
```

**Looking up indices in input_features_dict:**
```
"We'll" → index 3
"see"   → index 14
"."     → index 1
```

**Filling encoder_input_data[0] (first sentence):**
```
Position 0 (timestep 0, token "We'll"):
encoder_input_data[0, 0, :] = [0, 0, 0, 1, 0, 0, ..., 0]
                                        ↑
                                    index 3 = 1

Position 1 (timestep 1, token "see"):
encoder_input_data[0, 1, :] = [0, 0, 0, 0, ..., 1, ..., 0]
                                              ↑
                                          index 14 = 1

Position 2 (timestep 2, token "."):
encoder_input_data[0, 2, :] = [0, 1, 0, 0, 0, ..., 0]
                                  ↑
                              index 1 = 1

Position 3 (padding):
encoder_input_data[0, 3, :] = [0, 0, 0, 0, ..., 0]  (all zeros)
```

**Each token row is a one-hot vector:** exactly one position is 1, all others are 0. Padding rows (like position 3) are the exception: they stay all zeros.
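One way to sanity-check an encoded sentence is to sum each row and map the hot index back to its token. The sketch below assumes a hypothetical 5-token vocabulary and a helper dictionary named `reverse_input_features_dict`; neither is taken from the notebook's real data.

```python
import numpy as np

# Hypothetical 5-token vocabulary, used only for this example
input_features_dict = {"!": 0, ".": 1, "go": 2, "We'll": 3, "see": 4}
reverse_input_features_dict = {i: tok for tok, i in input_features_dict.items()}

# One sentence with 4 timesteps; the last timestep is padding
sentence = np.zeros((4, len(input_features_dict)), dtype="float32")
for timestep, token in enumerate("We'll see .".split()):
    sentence[timestep, input_features_dict[token]] = 1.0

# 1.0 for token rows, 0.0 for padding rows
row_sums = sentence.sum(axis=1)

# Recover the tokens from the hot index, skipping all-zero padding rows
decoded = [reverse_input_features_dict[int(row.argmax())]
           for row in sentence if row.sum() > 0]
print(row_sums, decoded)
```

The `row.sum() > 0` filter matters: `argmax` on an all-zero padding row returns index 0, which would otherwise be decoded as a real token.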

---

## **VISUALIZING THE DECODER SHIFT**

### **Example: "<START> Después veremos . <END>"**

**Token indices:**
```
"<START>" → 3
"Después" → 7
"veremos" → 23
"."       → 1
"<END>"   → 2
```

### **decoder_input_data (what decoder sees):**
```
Position 0: <START>  → decoder_input_data[0, 0, 3] = 1
Position 1: Después  → decoder_input_data[0, 1, 7] = 1
Position 2: veremos  → decoder_input_data[0, 2, 23] = 1
Position 3: .        → decoder_input_data[0, 3, 1] = 1
Position 4: <END>    → decoder_input_data[0, 4, 2] = 1
```

### **decoder_target_data (what decoder should output - SHIFTED!):**
```
Position 0: Después  → decoder_target_data[0, 0, 7] = 1   ← from timestep 1
Position 1: veremos  → decoder_target_data[0, 1, 23] = 1  ← from timestep 2
Position 2: .        → decoder_target_data[0, 2, 1] = 1   ← from timestep 3
Position 3: <END>    → decoder_target_data[0, 3, 2] = 1   ← from timestep 4
Position 4: (padding) → decoder_target_data[0, 4, :] = all zeros
```

**Notice:** `<START>` appears in input but NOT in target!

---

## **THE COMPLETE ALIGNMENT**

### **Training Example for the Decoder:**
```
TIME    DECODER INPUT          DECODER TARGET (what it should output)
  0     <START>          →     Después
  1     Después          →     veremos
  2     veremos          →     .
  3     .                →     <END>
  4     <END>            →     (nothing - padding)
```

**This is how the decoder learns:**
- "When I see `<START>`, I should output `Después`"
- "When I see `Después`, I should output `veremos`"
- "When I see `veremos`, I should output `.`"
- "When I see `.`, I should output `<END>`"

During training, the model compares its actual output with `decoder_target_data` and adjusts its weights to minimize the difference.
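The input/target alignment in the table can be reproduced directly from the token list. A minimal sketch in plain Python, using the example sentence from this section:

```python
target_doc = "<START> Después veremos . <END>"
tokens = target_doc.split()

# Pair each decoder input token with the token it should predict next.
# The final <END> has no successor, so it only ever appears as a target.
pairs = list(zip(tokens[:-1], tokens[1:]))

for timestep, (seen, expected) in enumerate(pairs):
    print(f"t={timestep}: input {seen!r} -> target {expected!r}")
```

Each pair is one supervised example for the decoder: given the token on the left (plus its current state), it is trained to assign high probability to the token on the right.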

---

## **CODE WALKTHROUGH: The `if timestep > 0` Condition**
```python
for timestep, token in enumerate(target_doc.split()):
    # Always fill decoder_input_data
    decoder_input_data[line, timestep, target_features_dict[token]] = 1.
    
    if timestep > 0:
        # Only fill decoder_target_data for timesteps 1, 2, 3, 4...
        # Place at position (timestep - 1) to shift left
        decoder_target_data[line, timestep-1, target_features_dict[token]] = 1.
```

**What happens:**
```
timestep=0, token="<START>":
  - decoder_input_data[0, 0, 3] = 1  ✓
  - if timestep > 0: FALSE, skip decoder_target_data  ← <START> excluded from target!

timestep=1, token="Después":
  - decoder_input_data[0, 1, 7] = 1  ✓
  - if timestep > 0: TRUE
  - decoder_target_data[0, 0, 7] = 1  ✓ (position 0, not 1!)

timestep=2, token="veremos":
  - decoder_input_data[0, 2, 23] = 1  ✓
  - if timestep > 0: TRUE
  - decoder_target_data[0, 1, 23] = 1  ✓ (position 1, not 2!)

timestep=3, token=".":
  - decoder_input_data[0, 3, 1] = 1  ✓
  - if timestep > 0: TRUE
  - decoder_target_data[0, 2, 1] = 1  ✓ (position 2, not 3!)

timestep=4, token="<END>":
  - decoder_input_data[0, 4, 2] = 1  ✓
  - if timestep > 0: TRUE
  - decoder_target_data[0, 3, 2] = 1  ✓ (position 3, not 4!)
```

**Result:** decoder_target is shifted left by 1, and `<START>` is excluded!
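The same shift can also be written as a single array slice, which is a handy cross-check on the loop. The sketch below reuses the token indices from the walkthrough; the vocabulary size (24) and maximum length (6) are assumptions chosen so those indices fit.

```python
import numpy as np

# Indices from the walkthrough: <START>, Después, veremos, ., <END>
token_ids = [3, 7, 23, 1, 2]
num_decoder_tokens, max_len = 24, 6   # assumed sizes for this example

decoder_input_data = np.zeros((1, max_len, num_decoder_tokens), dtype="float32")
decoder_input_data[0, np.arange(len(token_ids)), token_ids] = 1.0

# Vectorized shift: target at timestep t is the input at timestep t + 1.
# Timestep 0 (<START>) is dropped and the last target position stays empty.
decoder_target_data = np.zeros_like(decoder_input_data)
decoder_target_data[:, :-1, :] = decoder_input_data[:, 1:, :]

print(decoder_target_data[0].argmax(axis=1)[:4])   # indices of the four target tokens
```

This produces the same matrix as the `if timestep > 0` loop; the loop form is kept in the notebook because it makes the per-token logic explicit.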

---

### **WHAT CHANGED FROM STEP 5:**

| Aspect | Step 5 | Step 6 |
|--------|--------|--------|
| **Matrix contents** | All zeros | Filled with one-hot encoded vectors |
| **encoder_input_data** | Empty scaffolding | Contains English sentences as numbers |
| **decoder_input_data** | Empty scaffolding | Contains Spanish sentences (with `<START>`) |
| **decoder_target_data** | Empty scaffolding | Contains Spanish sentences (shifted, without `<START>`) |
| **Ready for training?** | No - just structure | Yes! ✓ Complete numerical training data |

✅ **Training data is ready! Next step: Build the encoder-decoder model architecture!**

# **Step 7: Building the Encoder**

## **WHERE WE'RE COMING FROM:**
In Step 6, we created three matrices filled with one-hot encoded training data. Now we transition from data preparation to building the actual neural network architecture.

## **WHAT WE'RE DOING NOW:**
We're building the **encoder** - the part of the seq2seq model that reads English sentences and compresses their meaning into a fixed-size representation (the encoder states).

**Components:**
1. **Input layer** - receives one-hot encoded English sentences
2. **LSTM layer** - processes sequences and maintains memory
3. **Encoder states** - captures the "meaning" of the input sentence

## **KEY CONCEPT:**
The encoder reads "We'll see ." word by word and produces two state vectors (hidden and cell) that represent the sentence's meaning in a 256-dimensional space. These states will be passed to the decoder to initialize translation.

## **WHAT'S NEXT:**
Build the decoder, which will use these encoder states to generate Spanish translations.

In [10]:
from tensorflow import keras
from keras.layers import Input, LSTM
from keras.models import Model

# Create the input layer:
encoder_inputs = Input(shape = (None, num_encoder_tokens))

# Create the LSTM layer:
encoder_lstm = LSTM(256, return_state=True)

# Retrieve the outputs and states:
encoder_outputs, state_hidden, state_cell = encoder_lstm(encoder_inputs)

# Put the states together in a list:
encoder_states = [state_hidden, state_cell]

2025-12-07 15:57:05.257261: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)


In [11]:
# Check the encoder input shape
print(f"Encoder input shape: {encoder_inputs.shape}")
# Check the LSTM configuration
print(f"LSTM units: {encoder_lstm.units}")
print(f"Return state: {encoder_lstm.return_state}")
# Check the state shapes (will show symbolic tensors)
print(f"Hidden state shape: {state_hidden.shape}")
print(f"Cell state shape: {state_cell.shape}")
# Understand what encoder_states contains
print(f"Number of states: {len(encoder_states)}")
print(f"encoder_states is a list containing: {encoder_states}")

Encoder input shape: (None, None, 18)
LSTM units: 256
Return state: True
Hidden state shape: (None, 256)
Cell state shape: (None, 256)
Number of states: 2
encoder_states is a list containing: [<KerasTensor shape=(None, 256), dtype=float32, sparse=False, name=keras_tensor_2>, <KerasTensor shape=(None, 256), dtype=float32, sparse=False, name=keras_tensor_3>]


## **WHAT WE JUST DID: Encoder Architecture**

### **The Flow:**
```
English sentence → encoder_inputs → encoder_lstm → encoder_states
                   (Input layer)    (256 LSTM units)  [hidden, cell]
```

### **Component Breakdown:**

**1. encoder_inputs:** 
- Shape: `(None, num_encoder_tokens)`
- `None` = variable sequence length (flexibility for inference)
- Receives one-hot encoded English sentences

**2. encoder_lstm:**
- 256 LSTM units = a 256-dimensional state; each unit has its own forget/input/output gates, and every gate reads the full input and the full previous hidden state
- Processes sequence word-by-word, maintains memory
- `return_state=True` = outputs final hidden and cell states

**3. encoder_states:**
- `state_hidden`: 256-dimensional vector (short-term memory)
- `state_cell`: 256-dimensional vector (long-term memory)
- Together represent the sentence's meaning in abstract feature space

---

## **NEURAL NETWORK ARCHITECTURE**

### **Encoder Structure:**
```
INPUT LAYER (encoder_inputs)
     ↓
Shape: (batch_size, timesteps, num_encoder_tokens)
Example: (11, 4, 18) for 11 sentences, max 4 words, 18 vocabulary size
     ↓
LSTM LAYER (encoder_lstm)
     ↓
256 LSTM units processing each timestep:
  t=0: "We'll"  → update 256 hidden states, 256 cell states
  t=1: "see"    → update 256 hidden states, 256 cell states  
  t=2: "."      → update 256 hidden states, 256 cell states
     ↓
OUTPUT: encoder_states
     ↓
state_hidden: (batch_size, 256) - final "thought" vector
state_cell:   (batch_size, 256) - final "memory" vector
```

### **Key Architecture Points:**

**Why 256 units?**
- Balances capacity (understanding complex patterns) with efficiency
- Each unit has its own gates, but the units are not independent: every gate reads the whole previous hidden state
- Creates a 256-dimensional "thought vector" representing sentence meaning

**What are the states?**
- **NOT** probabilities over vocabulary
- **NOT** predictions of next words
- **Abstract feature vectors** learned during training
- Capture semantic meaning: tense, sentiment, subject, context, etc.

**Why two states?**
- `state_hidden` (h): short-term focus, what the LSTM "thinks about" now
- `state_cell` (c): long-term memory, accumulated knowledge
- Both needed to fully capture sequence information
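To make the two states concrete, here is a from-scratch NumPy sketch of one LSTM timestep, run over a three-token sentence. It illustrates the standard LSTM equations, not Keras internals: the weight layout, the random initialization, and the function name `lstm_step` are all assumptions for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM timestep. W: (input_dim, 4*units), U: (units, 4*units), b: (4*units,)."""
    z = x_t @ W + h_prev @ U + b              # every gate sees the whole previous h
    i, f, g, o = np.split(z, 4, axis=-1)      # input, forget, candidate, output
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)   # cell state (c)
    h_t = sigmoid(o) * np.tanh(c_t)                        # hidden state (h)
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units = 18, 256                    # 18 tokens and 256 units, as printed above
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h = np.zeros((1, units))                      # zero initial states
c = np.zeros((1, units))
for token_index in [3, 14, 1]:                # "We'll", "see", "." from the earlier example
    x_t = np.zeros((1, input_dim))
    x_t[0, token_index] = 1.0                 # one-hot input for this timestep
    h, c = lstm_step(x_t, h, c, W, U, b)

encoder_states = [h, c]                       # what return_state=True hands back
print(h.shape, c.shape)
```

The cell state `c` is updated additively (old memory scaled by the forget gate plus new candidate content), while `h` is a gated, squashed view of `c`. That difference is the basis for the "long-term" versus "short-term" description above.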

---

### **WHAT CHANGED FROM STEP 6:**

| Aspect | Step 6 | Step 7 |
|--------|--------|--------|
| **Focus** | Data preparation | Model architecture |
| **Components** | NumPy arrays with data | Keras layers (Input, LSTM) |
| **Encoder** | Not built | Input layer + LSTM layer created |
| **Output** | Training data matrices | Encoder states (meaning vectors) |

✅ **Encoder complete! Next: Build the decoder to generate Spanish translations!**

# **Step 8: Building the Decoder and Connecting to Encoder**

## **WHERE WE'RE COMING FROM:**
In Step 7, we built the encoder that processes English sentences and produces two state vectors (hidden and cell) representing the sentence's meaning in 256-dimensional space.

## **WHAT WE'RE DOING NOW:**
We're building the **decoder** - the part that takes the encoder's understanding and generates Spanish translations word-by-word. This step completes the encoder-decoder connection.

**Components we're adding:**
1. **Decoder input layer** - receives Spanish sentences starting with `<START>`
2. **Decoder LSTM layer** - generates translations, initialized with encoder states
3. **Dense output layer** - converts LSTM outputs to vocabulary probabilities

## **THE CRITICAL CONNECTION:**
The decoder's LSTM is initialized with `initial_state=encoder_states`. This means the decoder starts with the encoder's "understanding" of the English sentence already loaded into its memory, then uses that understanding to generate Spanish words.

## **KEY ARCHITECTURAL DIFFERENCE:**
Unlike the encoder (which only needs final states), the decoder needs output at **every timestep** to predict each Spanish word. That's why we use `return_sequences=True` and add a Dense layer with softmax to convert LSTM outputs into word predictions.

## **WHAT'S NEXT:**
Combine encoder and decoder into a complete model and train it on our data.

In [12]:
from tensorflow import keras
# Add Dense to the imported layers
from keras.layers import Input, LSTM, Dense
from keras.models import Model

# ENCODER (from Step 7 - repeated for context)
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder_lstm = LSTM(256, return_state=True)
encoder_outputs, state_hidden, state_cell = encoder_lstm(encoder_inputs)
encoder_states = [state_hidden, state_cell]

# DECODER INPUT LAYER
# Shape: (None, num_decoder_tokens) - variable length Spanish sentences
# Will receive decoder_input_data: Spanish sentences starting with <START>
decoder_inputs = Input(shape=(None, num_decoder_tokens))

# DECODER LSTM LAYER
# 256 units (same size as encoder for state compatibility)
# return_sequences=True: Returns output at EVERY timestep (to predict each word)
# return_state=True: Also returns final states (not used during training)
decoder_lstm = LSTM(256, return_sequences=True, return_state=True)

# PASS INPUTS THROUGH DECODER LSTM
# Key: initial_state=encoder_states connects encoder to decoder!
# The encoder's final "understanding" becomes the decoder's starting point
# Returns:
#   - decoder_outputs: LSTM output at each timestep (batch, timesteps, 256)
#   - decoder_state_hidden, decoder_state_cell: final states (ignored in training)
decoder_outputs, decoder_state_hidden, decoder_state_cell = decoder_lstm(
    decoder_inputs, 
    initial_state=encoder_states
)

# DENSE OUTPUT LAYER
# Converts 256-dimensional LSTM outputs to vocabulary probabilities
# num_decoder_tokens outputs: one probability for each Spanish word
# softmax activation: ensures probabilities sum to 1
# Example output: [0.01, 0.05, 0.82, ...] where 0.82 = "Después" is most likely
decoder_dense = Dense(num_decoder_tokens, activation="softmax")

# APPLY DENSE LAYER TO GET FINAL PREDICTIONS
# Final shape: (batch, timesteps, num_decoder_tokens)
# At each timestep, get probability distribution over all Spanish words
decoder_outputs = decoder_dense(decoder_outputs)

## **WHAT WE JUST DID: Complete Encoder-Decoder Architecture**

### **The Complete Flow:**
```
ENCODER PROCESSES ENGLISH (Step 7):
"We'll see ."
    ↓
encoder_inputs (Input layer)
    ↓
encoder_lstm (256 units, return_state=True)
    ↓
encoder_states = [state_hidden, state_cell]
    ↓ (256-dim vectors representing sentence meaning)
    ↓
    ↓ [PASSED AS initial_state]
    ↓
DECODER GENERATES SPANISH (Step 8):
"<START> Después veremos ."
    ↓
decoder_inputs (Input layer)
    ↓
decoder_lstm (256 units, initialized with encoder_states)
    ↓
decoder_outputs (256-dim at each timestep)
    ↓
decoder_dense (Dense + softmax)
    ↓
Predictions: probabilities for each Spanish word at each timestep
```

---

## **ENCODER VS DECODER: KEY DIFFERENCES**

### **Comparison Table:**

| Feature | Encoder | Decoder |
|---------|---------|---------|
| **Purpose** | Understand English input | Generate Spanish output |
| **Initial state** | Zeros (starts fresh) | Encoder's final states |
| **LSTM configuration** | `return_state=True` | `return_sequences=True, return_state=True` |
| **LSTM outputs** | Ignored (only states matter) | Used for predictions |
| **Dense layer** | ❌ None needed | ✅ Converts to vocabulary probabilities |
| **Output shape** | Just states: (256,) each | Predictions: (timesteps, num_decoder_tokens) |
| **Processing** | Whole input sequence in one pass | Whole target sequence in one pass during training (teacher forcing); step-by-step during inference |

---

### **Difference 1: Initial States**

**Encoder:**
```python
# Starts with zero initial states (the default when no initial_state is passed)
encoder_lstm(encoder_inputs)
# Initial hidden state: [0, 0, 0, ..., 0] (256 zeros)
# Initial cell state:   [0, 0, 0, ..., 0] (256 zeros)
# Builds understanding from scratch
```

**Decoder:**
```python
# Starts with encoder's final states
decoder_lstm(decoder_inputs, initial_state=encoder_states)
# Initial hidden state: encoder's state_hidden (256 learned values)
# Initial cell state:   encoder's state_cell (256 learned values)
# Begins with encoder's understanding already loaded!
```

**Why this matters:** The decoder doesn't start from scratch - it starts with the "meaning" of the English sentence already in its memory.

---

### **Difference 2: Output Processing**

**Encoder:**
```python
encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
# encoder_outputs ← IGNORED (we don't care about intermediate outputs)
# Only keep: state_h and state_c (the final "understanding")
```

**Decoder:**
```python
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_outputs = decoder_dense(decoder_outputs)  # ← Dense layer converts to predictions!
# decoder_outputs ← USED! We need predictions at each timestep
# Shape: (batch, timesteps, num_decoder_tokens)
# Example at timestep 1: [0.01, 0.05, 0.82, ...] ← "Después" has 82% probability
```

**Why this matters:** The decoder needs to predict a Spanish word at each step, so we need outputs at every timestep and a Dense layer to convert them to probabilities.
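What the Dense + softmax layer does to the decoder outputs can be sketched in NumPy. The sizes, the random stand-in for the LSTM outputs, and the names `kernel` and `bias` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, timesteps, units, num_decoder_tokens = 2, 5, 256, 24   # assumed sizes

# Stand-in for the decoder LSTM outputs: one 256-dim vector per timestep
lstm_outputs = rng.normal(size=(batch, timesteps, units))
kernel = rng.normal(scale=0.05, size=(units, num_decoder_tokens))
bias = np.zeros(num_decoder_tokens)

# The same weights are applied independently at every timestep
logits = lstm_outputs @ kernel + bias
logits -= logits.max(axis=-1, keepdims=True)                  # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

predicted_ids = probs.argmax(axis=-1)                         # most likely token per timestep
print(probs.shape, predicted_ids.shape)
```

Each slice `probs[b, t, :]` is a probability distribution over the target vocabulary, so it sums to 1, and `argmax` picks the predicted token for that timestep.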

---

### **Difference 3: return_sequences Parameter**

**Encoder:**
```python
LSTM(256, return_state=True)  # return_sequences=False by default
```
- Only returns final output (which we ignore anyway)
- We only care about the final states

**Decoder:**
```python
LSTM(256, return_sequences=True, return_state=True)
```
- Returns output at **every timestep** (not just the last)
- We need these outputs to predict each Spanish word

**Visual:**
```
ENCODER (return_sequences=False):
t=0: "We'll" → [processing] → (output ignored)
t=1: "see"   → [processing] → (output ignored)
t=2: "."     → [processing] → (output ignored)
Final: → states only

DECODER (return_sequences=True):
t=0: "<START>" → [processing] → output[0] → Dense → predict "Después"
t=1: "Después" → [processing] → output[1] → Dense → predict "veremos"
t=2: "veremos" → [processing] → output[2] → Dense → predict "."
t=3: "."       → [processing] → output[3] → Dense → predict "<END>"
```
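The effect of `return_sequences` on output shape can be imitated with a toy recurrent loop. The sketch below uses a plain tanh cell instead of an LSTM to stay short; `run_recurrent` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def run_recurrent(inputs, W, U, return_sequences=False):
    """Toy tanh recurrent layer (not an LSTM) that mimics the return_sequences switch."""
    h = np.zeros(U.shape[0])
    outputs = []
    for x_t in inputs:                       # one timestep at a time
        h = np.tanh(x_t @ W + h @ U)
        outputs.append(h)
    all_outputs = np.stack(outputs)          # (timesteps, units)
    # Either every timestep's output, or only the last one; the final state is returned too
    return (all_outputs if return_sequences else all_outputs[-1]), h

rng = np.random.default_rng(1)
timesteps, input_dim, units = 4, 6, 8
inputs = rng.normal(size=(timesteps, input_dim))
W = rng.normal(size=(input_dim, units))
U = rng.normal(size=(units, units))

last_only, state_a = run_recurrent(inputs, W, U, return_sequences=False)
every_step, state_b = run_recurrent(inputs, W, U, return_sequences=True)
print(last_only.shape, every_step.shape)
```

The computation is identical in both modes; the flag only controls how much of it is returned. The encoder can discard the per-timestep outputs, while the decoder needs all of them to feed the Dense layer.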

---

## **HOW ENCODER AND DECODER INTERACT**

### **The Handoff (One-Time Initialization):**
```
STEP 1: Encoder processes ENTIRE English sentence
────────────────────────────────────────────────────
"We'll see ." → encoder_lstm → final states (256 + 256)

STEP 2: States are COPIED to decoder as initial state
────────────────────────────────────────────────────
encoder_states → [copied] → decoder's initial state

STEP 3: Decoder generates Spanish word-by-word
────────────────────────────────────────────────────
Using initial understanding from encoder:
  t=0: "<START>" → predict "Después"
  t=1: "Después" → predict "veremos"
  t=2: "veremos" → predict "."
  t=3: "."       → predict "<END>"
```

**Important:** No doubling of dimensions! Both encoder and decoder have 256 units. The encoder's 256-dimensional states are simply copied into the decoder's 256-dimensional initial state.

---

## **KEY ARCHITECTURAL INSIGHTS**

### **1. Why the same size (256 units)?**
The encoder and decoder must have the **same hidden size** so the encoder's states fit perfectly as the decoder's initial state. If encoder had 256 units and decoder had 512, the initialization wouldn't work.

### **2. Why Dense layer only on decoder?**
- **Encoder:** Compresses meaning into states (abstract 256-dim vectors)
- **Decoder:** Generates actual words (needs probabilities over vocabulary)
- The Dense layer transforms: 256-dim abstract vector → num_decoder_tokens probabilities

### **3. Why initial_state matters?**
Without encoder states, the decoder would start from zero and have no idea what English sentence to translate! The encoder states are the **semantic bridge** between English and Spanish.

---

### **WHAT CHANGED FROM STEP 7:**

| Aspect | Step 7 | Step 8 |
|--------|--------|--------|
| **Components** | Encoder only | Encoder + Decoder |
| **Layers** | Input + LSTM | Input + LSTM + Dense |
| **Connection** | None | Decoder initialized with encoder states |
| **Output** | Just states (abstract) | Vocabulary probabilities (actionable) |
| **Can translate?** | No - only encodes | Not yet - architecture complete, weights untrained |

✅ **Encoder-decoder architecture complete! Next: Compile the model and train it!**

# **Step 9: Assembling, Compiling, and Training the Model**

## **WHERE WE'RE COMING FROM:**
In Steps 7-8, we built the encoder (processes English) and decoder (generates Spanish) as separate components. The decoder is initialized with the encoder's final states, creating the connection between understanding and generation.

## **WHAT WE'RE DOING NOW:**
We're putting everything together into a trainable model and actually **teaching it to translate**. This involves three major steps:

1. **Assemble:** Connect all inputs and outputs into one complete Model
2. **Compile:** Configure how the model should learn (optimizer, loss function, metrics)
3. **Train:** Feed it our data repeatedly so it learns the translation patterns

## **THE TRAINING PROCESS:**

**What the model learns:**
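- 655,420 weights (parameters) in the LSTM layers and Dense layer, a count that corresponds to vocabularies of 51 English and 60 Spanish tokens; `training_model.summary()` reports the exact figure for your data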
- These weights learn patterns like:
  - How to transform English words into meaningful encoder states
  - How to use those states to generate appropriate Spanish words
  - Which Spanish words typically follow others

**What happens during training:**
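- The model trains on the sentence pairs left after `validation_split=0.2` holds out the last 20% (8 of the 11 pairs; the other 3 are used only for validation)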
- Makes predictions for Spanish translations
- Compares predictions to correct translations (decoder_target_data)
- Calculates error (loss) and adjusts weights to reduce that error
- Repeats this process 50 times (50 epochs)

## **IMPORTANT LIMITATION:**

With only **11 training sentences**, this model will essentially **memorize** these specific translations rather than learning general English→Spanish grammar rules. It will translate these 11 sentences well but struggle with any new sentence, especially if it contains words not in the training vocabulary.

Real translation models are trained on millions of sentence pairs!

## **WHAT'S NEXT:**
After training completes, we'll have a model with learned weights that can translate our 11 training sentences (and attempt to translate similar new sentences, though with limited success given the tiny training set).

In [13]:
from keras.models import Model

# Building the training model:
training_model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

print("Model summary:\n")
training_model.summary()
print("\n\n")

# Compile the model:
'''
a) Optimizer: "rmsprop": The algorithm that updates the model's weights during training
RMSprop is a common choice for recurrent networks like LSTMs
Other options: "adam", "sgd"

b) Loss function: 'categorical_crossentropy': Measures how wrong the predictions are
Perfect for multi-class classification (choosing one word from vocabulary)
Compares the predicted probability distribution with the actual correct word

c) Metrics: ['accuracy']: What to track and display during training
Accuracy = percentage of correctly predicted words
You'll see this printed after each epoch
'''
training_model.compile(optimizer = "rmsprop", loss='categorical_crossentropy', metrics=['accuracy'])

'''
d) Training parameters:
batch_size = 50: How many sentence pairs to process before updating weights.
The dataset has only 11 sentences (8 after the validation split), so each epoch is a single batch
Larger datasets would split into multiple batches

epochs = 50: How many times to go through the entire dataset
Each epoch, the model sees all sentences once and updates its weights
More epochs = more learning opportunities (but risk of overfitting with small data)'''
batch_size = 50
epochs = 50

print("Training the model:\n")
# Train the model:
training_model.fit([encoder_input_data, decoder_input_data], 
                   decoder_target_data, 
                   epochs = epochs, 
                   batch_size = batch_size, 
                   validation_split = 0.2)

Model summary:






Training the model:

Epoch 1/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 3s 3s/step - accuracy: 0.0469 - loss: 1.9559 - val_accuracy: 0.2083 - val_loss: 2.1815
Epoch 2/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 172ms/step - accuracy: 0.1562 - loss: 1.9413 - val_accuracy: 0.2500 - val_loss: 2.1707
Epoch 3/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 153ms/step - accuracy: 0.1719 - loss: 1.9291 - val_accuracy: 0.2500 - val_loss: 2.1604
Epoch 4/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 141ms/step - accuracy: 0.1719 - loss: 1.9174 - val_accuracy: 0.2500 - val_loss: 2.1498
Epoch 5/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 156ms/step - accuracy: 0.1719 - loss: 1.9054 - val_accuracy: 0.2500 - val_loss: 2.1382
Epoch 6/50
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 177ms/step - accuracy: 0.1562 - loss: 1.8923 - val_accuracy: 0.2083 - val_loss: 2.1247
Epoch 7/50
1/1

<keras.src.callbacks.history.History at 0x7f2ef03580a0>

## **WHAT WE JUST DID: Complete Training Pipeline**

### **The Three-Step Process:**
```
STEP 1: ASSEMBLE
────────────────
Model([encoder_inputs, decoder_inputs], decoder_outputs)
    ↓
Creates complete computational graph:
  - English input → Encoder → States
  - States + Spanish input → Decoder → Predictions
  - Predictions compared to target for learning

STEP 2: COMPILE
────────────────
Configure learning parameters:
  - Optimizer: RMSprop (adjusts weights intelligently)
  - Loss: Categorical crossentropy (measures error)
  - Metrics: Accuracy (tracks performance)

STEP 3: TRAIN
─────────────
For 50 epochs:
  1. Forward pass: predict Spanish translations
  2. Calculate loss: compare predictions to targets
  3. Backward pass: calculate gradients
  4. Update weights: optimizer adjusts 655,420 parameters
  5. Validate: test on held-out data
  6. Print metrics: loss and accuracy
```
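The "held-out data" in the validation step comes from `validation_split=0.2`. A small arithmetic sketch of how 11 samples would be divided, assuming the split point is the floor of `n * (1 - validation_split)` and that the last samples are the ones held out. That assumption is consistent with the training log above, where `accuracy` moves in multiples of 1/64 and `val_accuracy` in multiples of 1/24, as it would for 8 and 3 sentences of 8 timesteps each.

```python
import math

num_samples = 11
validation_split = 0.2

# Assumption: split point is floor(n * (1 - validation_split)),
# with the LAST samples held out for validation.
num_train = math.floor(num_samples * (1.0 - validation_split))
num_val = num_samples - num_train

batch_size = 50
steps_per_epoch = math.ceil(num_train / batch_size)   # batches per epoch

print(num_train, num_val, steps_per_epoch)
```

With so few samples, each validation sentence moves `val_accuracy` by several percentage points, so the validation curve is a rough signal at best.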

---

## **UNDERSTANDING THE TRAINING OUTPUT**

### **What You'll See (illustrative values; the actual run above differs):**
```
Epoch 1/50
1/1 - 2s - loss: 4.0943 - accuracy: 0.0682 - val_loss: 3.9876 - val_accuracy: 0.1250

Epoch 10/50
1/1 - 0s - loss: 2.5123 - accuracy: 0.3182 - val_loss: 2.7654 - val_accuracy: 0.2500

Epoch 30/50
1/1 - 0s - loss: 0.8765 - accuracy: 0.7273 - val_loss: 1.2345 - val_accuracy: 0.6250

Epoch 50/50
1/1 - 0s - loss: 0.3421 - accuracy: 0.8864 - val_loss: 0.5123 - val_accuracy: 0.8333
```

### **Interpreting the Metrics:**

**loss (training loss):**
- Measures how wrong the predictions are on training data
- **Lower is better**
- Starts high (≈4.0 in this illustration, 1.96 in the actual run above) because the untrained model's predictions are close to uniform
- Should decrease over epochs (≈0.3-0.5 by epoch 50)

**accuracy (training accuracy):**
- Percentage of correctly predicted Spanish words on training data
- **Higher is better** (range: 0.0 to 1.0 or 0% to 100%)
- Starts low (≈7-10%) because model is guessing
- Should increase over epochs (≈85-90% by epoch 50)

**val_loss (validation loss):**
- Measures prediction error on validation data (unseen during weight updates)
- Tests if model generalizes or just memorizes training data
- Should track training loss but may be slightly higher

**val_accuracy (validation accuracy):**
- Percentage of correct predictions on validation data
- Should track training accuracy
- If much lower than training accuracy → overfitting (memorization)
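Both metrics can be computed by hand on a tiny example. The sketch below uses made-up predictions for one sentence with three timesteps (the last is padding) and a 4-token vocabulary. It shows the standard categorical cross-entropy formula and an argmax-based accuracy; it is not a claim about every detail of how Keras aggregates them.

```python
import numpy as np

# Toy example: 1 sentence, 3 timesteps, 4-token vocabulary (values made up)
y_true = np.array([[[0, 1, 0, 0],
                    [0, 0, 1, 0],
                    [0, 0, 0, 0]]], dtype="float32")   # last timestep is padding
y_pred = np.array([[[0.1, 0.7, 0.1, 0.1],
                    [0.6, 0.2, 0.1, 0.1],
                    [0.25, 0.25, 0.25, 0.25]]], dtype="float32")

# Categorical cross-entropy per timestep: -sum(true * log(pred))
per_step_loss = -(y_true * np.log(y_pred)).sum(axis=-1)
mean_loss = per_step_loss.mean()

# Accuracy: does the arg-max of the prediction match the arg-max of the target?
correct = y_pred.argmax(axis=-1) == y_true.argmax(axis=-1)
accuracy = correct.mean()

print(per_step_loss, mean_loss, accuracy)
```

Two things are worth noticing. The all-zero padding row contributes zero loss, which pulls the average down. And with an argmax-based accuracy, a padding row counts as "correct" whenever the model's top prediction happens to be index 0, so accuracy on padded data should be read with some caution.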

---

## **THE LEARNING PROGRESSION**

### **Early Epochs (1-15):**
```
What's happening:
- Model makes random guesses
- Loss is high (3.0-4.0)
- Accuracy is low (10-30%)
- Weights are adjusting rapidly

Example prediction at Epoch 5:
Input: "We'll see ."
Target: "Después veremos . <END>"
Prediction: "¿ Quién el <END>"  ← Completely wrong!
```

### **Middle Epochs (16-35):**
```
What's happening:
- Model starts recognizing patterns
- Loss decreasing (1.0-2.0)
- Accuracy improving (40-70%)
- Learning English→Spanish mappings

Example prediction at Epoch 25:
Input: "We'll see ."
Target: "Después veremos . <END>"
Prediction: "Después Quién . <END>"  ← Getting better! First word correct.
```

### **Final Epochs (36-50):**
```
What's happening:
- Model fine-tuning
- Loss low (0.3-0.8)
- Accuracy high (80-95%)
- Nearly memorized all 11 sentences

Example prediction at Epoch 50:
Input: "We'll see ."
Target: "Después veremos . <END>"
Prediction: "Después veremos . <END>"  ← Perfect!
```

---

## **WHAT THE MODEL LEARNED**

### **The 655,420 Weights Now Encode:**

**Encoder weights (≈315,000 parameters):**
- How to process English words into meaningful state vectors
- Patterns like: "We'll" indicates future tense
- Contextual understanding: "Who" at start signals a question

**Decoder weights (≈325,000 parameters):**
- How to use encoder states to initialize Spanish generation
- Transition patterns: after `<START>`, what comes next?
- Sequential dependencies: "Después" often followed by "veremos"

**Dense layer weights (≈15,000 parameters):**
- Mapping from 256-dimensional LSTM outputs to vocabulary probabilities
- Which LSTM patterns correspond to which Spanish words

**These weights are now FIXED** - no more learning happens. They're ready to be used for translation.
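The parameter counts above follow from the layer shapes. A short sketch of the arithmetic, assuming standard LSTM and Dense layers with biases; the function names are made up for the example, and the vocabulary sizes 51 and 60 are an assumption (they are the sizes for which the formula reproduces the 655,420 total quoted in this section).

```python
def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel, and a bias
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, output_dim):
    # weight matrix plus one bias per output
    return input_dim * output_dim + output_dim

def seq2seq_params(num_encoder_tokens, num_decoder_tokens, units=256):
    encoder = lstm_params(num_encoder_tokens, units)
    decoder = lstm_params(num_decoder_tokens, units)
    dense = dense_params(units, num_decoder_tokens)
    return encoder, decoder, dense, encoder + decoder + dense

# Assumed vocabulary sizes: 51 English tokens, 60 Spanish tokens
print(seq2seq_params(51, 60))   # (315392, 324608, 15420, 655420)
```

The totals scale with vocabulary size. With the 18 English tokens printed in Step 7, for instance, `lstm_params(18, 256)` gives 281,600 for the encoder LSTM alone, so `training_model.summary()` remains the authoritative source for a particular run.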

---

## **CRITICAL LIMITATION: SMALL DATASET**

### **What This Model Can and Cannot Do:**

**✅ CAN DO:**
- Translate the 11 training sentences very accurately (≈90-100% accuracy)
- Translate very similar sentences with the same vocabulary
- Example: "We'll try ." → "Lo intentaremos ." (was in training)

**❌ CANNOT DO:**
- Translate sentences with new vocabulary
- Example: "The cat sleeps." → ??? (gibberish - "cat" and "sleeps" unknown)
- Generalize to new grammar patterns
- Handle variations or creative language

**Why?**
- Only 11 sentences = model **memorized** specific translations
- Didn't learn general English→Spanish grammar rules
- Unknown words have no representation in the learned weights

### **Real-World Translation Models:**

**Google Translate / DeepL are trained on:**
- **Millions** of sentence pairs (not 11!)
- Diverse vocabulary (100,000+ words)
- All grammar structures (questions, commands, tenses, clauses, etc.)
- Multiple domains (news, literature, technical, casual)
- Subword tokenization (handles unknown words by breaking them into pieces)

**Result:** Can translate novel sentences with high accuracy

---

## **WHAT HAPPENS DURING INFERENCE (USING THE TRAINED MODEL)**

### **Important: The Model Doesn't Do Everything Automatically!**

**What the model DOES:**
- Takes one-hot encoded inputs
- Uses learned weights to produce probability distributions
- Example output: [0.01, 0.05, 0.82, 0.03, ...] (one probability per Spanish token, i.e. `num_decoder_tokens` values)

**What YOU still need to do:**
1. **One-hot encode new English sentences** (using input_features_dict)
2. **One-hot encode `<START>` token** to begin decoder
3. **Run model to get probabilities** for next word
4. **Pick highest probability word** (e.g., "Después" at 82%)
5. **Convert word to one-hot encoding** for next decoder input
6. **Repeat steps 3-5** until `<END>` is predicted
7. **Convert predicted indices back to words** (using reverse_target_features_dict)

The model learned the translation patterns, but YOU handle the formatting and generation loop!
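The generation loop in steps 2-7 can be sketched without a trained model by swapping in a stub. Everything below is hypothetical scaffolding: `fake_decoder_step` is NOT a real decoder (it returns scripted probabilities), the vocabulary is a toy one, and the states are passed through untouched. Only the loop structure is the point.

```python
import numpy as np

# Hypothetical toy vocabulary for the example
target_features_dict = {".": 0, "<END>": 1, "<START>": 2, "Después": 3, "veremos": 4}
reverse_target_features_dict = {i: tok for tok, i in target_features_dict.items()}
num_decoder_tokens = len(target_features_dict)

def fake_decoder_step(one_hot_token, states):
    """Stub, NOT a real model: returns scripted probabilities for the next token."""
    script = {2: 3, 3: 4, 4: 0, 0: 1}       # <START> -> Después -> veremos -> . -> <END>
    probs = np.full(num_decoder_tokens, 0.05)
    probs[script[int(one_hot_token.argmax())]] = 0.8
    return probs, states                     # a real decoder would also update the states

def greedy_decode(initial_states, max_len=10):
    token = np.zeros(num_decoder_tokens)
    token[target_features_dict["<START>"]] = 1.0            # step 2: one-hot <START>
    states, words = initial_states, []
    for _ in range(max_len):                                 # step 6: repeat
        probs, states = fake_decoder_step(token, states)     # step 3: get probabilities
        next_id = int(probs.argmax())                        # step 4: pick the best token
        word = reverse_target_features_dict[next_id]         # step 7: index -> word
        if word == "<END>":
            break
        words.append(word)
        token = np.zeros(num_decoder_tokens)                 # step 5: re-encode as one-hot
        token[next_id] = 1.0
    return " ".join(words)

print(greedy_decode(initial_states=None))
```

The `max_len` cap is a safety net: a poorly trained model may never emit `<END>`, and without the cap the loop would not terminate.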

---

### **WHAT CHANGED FROM STEP 8:**

| Aspect | Step 8 | Step 9 |
|--------|--------|--------|
| **Model state** | Architecture defined | Architecture + learned weights |
| **Weights** | Random/uninitialized | Trained (655,420 learned parameters) |
| **Can translate?** | No - weights are random | Yes - for training sentences |
| **Loss/Accuracy** | Not computed | Tracked and optimized |
| **Ready for use?** | No | Yes - but with limitations |

✅ **Training complete! You now have a working (albeit limited) English→Spanish translator!**

### **NEXT STEPS (Beyond This Lesson):**
- Build inference functions to translate new sentences
- Handle unknown words with `<UNK>` token
- Implement the generation loop (word-by-word prediction)
- Test on sentences similar to training data
- Understand limitations with novel vocabulary/grammar

# **Step 10: Building Inference Models (Separating Encoder and Decoder)**

## **WHERE WE'RE COMING FROM:**
In Step 9, we trained a combined model that processed both English and Spanish inputs simultaneously to learn translation patterns. The training model had access to complete Spanish sentences (with `<START>`) and compared its predictions against the target (shifted Spanish with `<END>`).

## **THE PROBLEM WITH THE TRAINING MODEL FOR INFERENCE:**

The training model expects:
- **Input 1:** English sentence (complete)
- **Input 2:** Spanish sentence (complete, with `<START>`)
- **Output:** Predictions for each Spanish word

**But during actual translation (inference):**
- We have: English sentence
- We DON'T have: The Spanish translation (that's what we're trying to generate!)
- We need: To generate Spanish word-by-word, feeding each prediction back

**Solution:** Separate the encoder and decoder into two independent models.

## **WHAT WE'RE DOING NOW:**

We're creating two new models for inference, reusing the **same trained weights** from training_model:

1. **Encoder Model:** 
   - Input: English sentence
   - Output: Encoder states (the "meaning" vectors)
   - Run **once** per translation

2. **Decoder Model:**
   - Input: One Spanish word + states from previous step
   - Output: Prediction for next word + updated states
   - Run **repeatedly** in a loop until `<END>` is generated

## **KEY CONCEPT: SEQUENTIAL GENERATION**

**Training (what we did in Step 9):**
```
English: "We'll see ."
Spanish: "<START> Después veremos ."
→ Model processes everything at once
→ Learns patterns
```

**Inference (what these new models enable):**
```
Step 1: Encode "We'll see ." → get states
Step 2: Decoder sees "<START>" + states → predicts "Después"
Step 3: Decoder sees "Después" + new states → predicts "veremos"
Step 4: Decoder sees "veremos" + newer states → predicts "."
Step 5: Decoder sees "." + newest states → predicts "<END>"
→ Translation complete: "Después veremos ."
```

## **IMPORTANT: SAME WEIGHTS, DIFFERENT STRUCTURE**

These new models don't have new weights or require new training:
- They use the **exact same trained weights** from training_model
- We're just reorganizing the architecture to enable step-by-step generation
- Think of it as "repackaging" the trained model for a different use case

## **WHAT'S NEXT:**
After building these inference models, we'll implement the generation loop that uses them to translate new English sentences word-by-word.

In [14]:
# ENCODER INFERENCE MODEL
# Creates a standalone model that processes English and outputs states
# Input: encoder_inputs (English sentence, one-hot encoded)
# Output: encoder_states ([state_hidden, state_cell] - the "meaning" vectors)
# This model is run ONCE per translation to encode the English sentence
encoder_model = Model(encoder_inputs, encoder_states)

# DECODER STATE INPUT LAYERS
# latent_dim = 256 (must match LSTM hidden size)
latent_dim = 256

# Create input layers to receive hidden and cell states from outside
# These will receive either:
#   - Encoder states (on first decoder iteration)
#   - Previous decoder states (on subsequent iterations)
decoder_state_input_hidden = Input(shape=(latent_dim,))
decoder_state_input_cell = Input(shape=(latent_dim,))

# Package both state inputs into a list for convenience
decoder_states_inputs = [decoder_state_input_hidden, decoder_state_input_cell]

# CALL THE DECODER LSTM
# Pass decoder input (one Spanish word) and states through the LSTM
# initial_state=decoder_states_inputs: states are now passed as input (not hardcoded)
# Returns:
#   - decoder_outputs: LSTM output for this timestep (256-dim)
#   - state_hidden, state_cell: updated states for next iteration
decoder_outputs, state_hidden, state_cell = decoder_lstm(
    decoder_inputs, 
    initial_state=decoder_states_inputs
)

# Package the output states for next iteration
decoder_states = [state_hidden, state_cell]

# PROCESS DECODER OUTPUTS THROUGH DENSE LAYER
# Convert 256-dimensional LSTM output to vocabulary probabilities
# Same Dense layer with same trained weights from training
decoder_outputs = decoder_dense(decoder_outputs)

# DECODER INFERENCE MODEL
# Creates a standalone model for step-by-step generation
# Inputs: 
#   - decoder_inputs: one Spanish word (one-hot encoded)
#   - decoder_states_inputs: hidden and cell states from previous step
# Outputs:
#   - decoder_outputs: probabilities for next Spanish word
#   - decoder_states: updated states to feed into next iteration
# This model is run REPEATEDLY in a loop, once per generated word
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,  # Inputs: word + states
    [decoder_outputs] + decoder_states          # Outputs: predictions + new states
)

## **WHAT WE JUST DID: Created Inference-Ready Models**

### **Summary of Changes:**
```
BEFORE (Step 9 - Training Model):
─────────────────────────────────
training_model = Model(
    inputs=[encoder_inputs, decoder_inputs],
    outputs=decoder_outputs
)
→ Single combined model
→ Needs complete English AND complete Spanish
→ Good for training, can't generate translations


AFTER (Step 10 - Inference Models):
────────────────────────────────────
encoder_model = Model(
    inputs=encoder_inputs,
    outputs=encoder_states
)
→ Standalone encoder
→ Run ONCE: English → states

decoder_model = Model(
    inputs=[decoder_inputs] + decoder_states_inputs,
    outputs=[decoder_outputs] + decoder_states
)
→ Standalone decoder
→ Run REPEATEDLY: word + states → prediction + new states
→ Can generate translations word-by-word!
```

---

## **THE TWO MODELS WE CREATED**

### **1. Encoder Inference Model:**

**Purpose:** Convert English sentence to meaning vectors

**Architecture:**
```
encoder_inputs (English, one-hot)
        ↓
encoder_lstm (256 units, trained weights)
        ↓
encoder_states = [state_hidden, state_cell]
        ↓
Output: 2 × 256-dimensional vectors
```

**Usage:** Run **once** at the beginning of each translation
```python
states = encoder_model.predict(english_sentence_encoded)
# states now contain the "meaning" of the English sentence
```

---

### **2. Decoder Inference Model:**

**Purpose:** Generate Spanish translation word-by-word

**Architecture:**
```
Inputs:
├─ decoder_inputs (one Spanish word, one-hot)
└─ decoder_states_inputs (states from previous iteration)
        ↓
decoder_lstm (256 units, trained weights)
        ↓
decoder_dense (softmax, trained weights)
        ↓
Outputs:
├─ decoder_outputs (probabilities for next word)
└─ decoder_states (updated states for next iteration)
```

**Usage:** Run **repeatedly** in a loop
```python
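# one_hot(word) and pick_highest(probs) are illustrative helpers, not defined in this notebook.
# states is the [hidden, cell] list returned by encoder_model.predict above.
# decoder_model returns three arrays: probabilities, hidden state, cell state.

# Iteration 1: start from <START> and the encoder states
probs1, h1, c1 = decoder_model.predict([one_hot("<START>")] + states)
next_word1 = pick_highest(probs1)  # "Después"

# Iteration 2: feed the previous prediction and the updated states
probs2, h2, c2 = decoder_model.predict([one_hot(next_word1), h1, c1])
next_word2 = pick_highest(probs2)  # "veremos"

# Iteration 3
probs3, h3, c3 = decoder_model.predict([one_hot(next_word2), h2, c2])
next_word3 = pick_highest(probs3)  # "."

# ... until <END>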
```

---

## **KEY ARCHITECTURAL FEATURES**

### **1. State Flow - The Critical Difference:**

**Training decoder:**
- States come from encoder once
- Flow internally through timesteps
- You don't see or control them

**Inference decoder:**
- States passed as **explicit inputs**
- You control them between iterations
- Allows manual step-by-step generation
```
Training:
encoder → states → decoder (processes all words at once)
                   ↓
                [internal state updates]
                   ↓
                outputs

Inference:
encoder → states → decoder iteration 1 → new_states
                        ↓
                   prediction_1
                        ↓
                   decoder iteration 2 (with new_states) → newer_states
                        ↓
                   prediction_2
                        ↓
                   ... (loop continues)
```
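
The claim that both modes run the same computation can be checked with a small NumPy sketch. It uses a hand-written LSTM cell with random weights (an assumption for illustration, not the notebook's Keras layer) and compares one pass over a whole sequence against repeated single-step calls that carry the states by hand.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM timestep. x: (input_dim,), h and c: (units,)."""
    z = x @ W + h @ U + b              # all four gates at once, shape (4 * units,)
    i, f, g, o = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
input_dim, units, timesteps = 5, 3, 4
W = rng.normal(size=(input_dim, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)
sequence = rng.normal(size=(timesteps, input_dim))
h0 = rng.normal(size=units)   # stands in for the encoder's hidden state
c0 = rng.normal(size=units)   # stands in for the encoder's cell state

def run_lstm(seq, h, c):
    """Process a (timesteps, input_dim) sequence; return outputs and final states."""
    outputs = []
    for x in seq:
        h, c = lstm_step(x, h, c, W, U, b)
        outputs.append(h)
    return np.stack(outputs), h, c

# Training style: one call over the whole sequence, states stay internal.
full_outputs, _, _ = run_lstm(sequence, h0, c0)

# Inference style: one call per timestep, states passed in and out explicitly.
h, c = h0, c0
stepwise = []
for t in range(timesteps):
    out, h, c = run_lstm(sequence[t:t + 1], h, c)
    stepwise.append(out[0])
stepwise_outputs = np.stack(stepwise)

print(np.allclose(full_outputs, stepwise_outputs))
```

The two runs agree because the inputs are identical. During real inference the inputs are the model's own previous predictions rather than the reference translation, so outputs can diverge from what the training model would produce on the same sentence.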

---

### **2. Input/Output Shapes:**

**Encoder Model:**
```
Input:  encoder_inputs
        Shape: (batch_size, timesteps, num_encoder_tokens)
        Example: (1, 4, 50) for "We'll see ."

Output: encoder_states = [state_hidden, state_cell]
        Shape: [(batch_size, 256), (batch_size, 256)]
        Example: [(1, 256), (1, 256)]
```

**Decoder Model:**
```
Inputs: [decoder_inputs, decoder_state_input_hidden, decoder_state_input_cell]
        Shapes: [(batch_size, 1, num_decoder_tokens), (batch_size, 256), (batch_size, 256)]
        Example: [(1, 1, 60), (1, 256), (1, 256)]
        Note: timesteps=1 because we process ONE word at a time

Outputs: [decoder_outputs, state_hidden, state_cell]
         Shapes: [(batch_size, 1, num_decoder_tokens), (batch_size, 256), (batch_size, 256)]
         Example: [(1, 1, 60), (1, 256), (1, 256)]
```

---

### **3. Why latent_dim = 256?**
```python
latent_dim = 256
```

**This must match:**
- Encoder LSTM hidden size (256)
- Decoder LSTM hidden size (256)
- State dimensions from encoder

**Why it matters:**
- Encoder outputs states of size 256
- Decoder inputs must accept states of size 256
- Mismatch would cause shape errors
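
A quick way to see the kind of shape error a mismatch produces is to multiply a state vector by a matrix sized for a 256-unit LSTM. The sketch below uses plain NumPy arrays as stand-ins; the 128-dimensional state is a hypothetical mistake, and Keras would report its own error message rather than this one.

```python
import numpy as np

latent_dim = 256

# Stand-in for an LSTM recurrent kernel: (units, 4 * units) in a standard LSTM.
recurrent_kernel = np.zeros((latent_dim, 4 * latent_dim))

good_state = np.zeros((1, latent_dim))   # matches the LSTM size
bad_state = np.zeros((1, 128))           # hypothetical mismatched size

# Matching sizes multiply cleanly.
print((good_state @ recurrent_kernel).shape)

# Mismatched sizes fail immediately.
try:
    bad_state @ recurrent_kernel
except ValueError as err:
    print("Shape error:", err)
```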

---

## **COMPARISON: TRAINING VS INFERENCE ARCHITECTURE**

### **Data Flow Differences:**

**Training (Step 9):**
```
Inputs → [encoder + decoder combined] → Outputs
  ↓                                        ↓
English                              Predictions for
+ complete Spanish                   all Spanish words
                                     (compared to target)
```

**Inference (Step 10):**
```
STEP 1: English → [encoder] → states

STEP 2: Loop:
  states + word → [decoder] → prediction + new_states
  new_states + next_word → [decoder] → prediction + newer_states
  ... until <END>
```

---

### **Why We Need Both Architectures:**

| Need | Training Model | Inference Models |
|------|----------------|------------------|
| **Learn patterns** | ✅ Yes | ❌ Not used for training |
| **Fast batch processing** | ✅ Efficient | ❌ One-by-one is slower |
| **Generate translations** | ❌ Needs complete Spanish | ✅ Generates word-by-word |
| **Control generation process** | ❌ Black box | ✅ Full control over states |

**Solution:** Use training model to learn, then repackage into inference models for generation!

---

## **THE POWER OF REUSING WEIGHTS**

### **No Additional Training Needed:**

All three models share the **exact same weights**:
```
training_model weights (655,420 parameters)
        ↓
    [Repackaged into different structures]
        ↓
    ┌───────────────┬───────────────┐
    ↓               ↓               ↓
encoder_model   decoder_model   (training_model still exists)
  ↓                   ↓
Same encoder_lstm  Same decoder_lstm + decoder_dense
weights            weights
```

**What this means:**
- No new learning required
- Inference models instantly ready to use
- All knowledge from training is preserved
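
The idea of "same weights, different structure" can be illustrated without Keras. In the sketch below, `TinyDense`, `model_a`, and `model_b` are made-up names: two differently shaped "models" call one shared layer object, so a weight change is visible through both. Keras behaves analogously when the same layer object is used in more than one `Model`.

```python
import numpy as np

class TinyDense:
    """Toy layer holding its own weight matrix (a stand-in for a Keras layer)."""
    def __init__(self, in_dim, out_dim, seed=0):
        self.kernel = np.random.default_rng(seed).normal(size=(in_dim, out_dim))

    def __call__(self, x):
        return x @ self.kernel

shared_layer = TinyDense(4, 3)

# Two "models" with different structure that both call the same layer object.
def model_a(x):
    return shared_layer(x)

def model_b(x, scale):
    return shared_layer(x) * scale

x = np.ones((1, 4))
before = model_b(x, 2.0)

# Changing the layer's weights once (as training would) affects both models.
shared_layer.kernel = shared_layer.kernel + 1.0
after = model_b(x, 2.0)

print(np.allclose(model_a(x) * 2.0, after))   # both wrappers read the updated weights
print(np.allclose(before, after))             # the output changed; nothing was copied
```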

---

## **READY FOR TRANSLATION**

### **What We Can Do Now:**

✅ **Encode any English sentence** (that uses training vocabulary)
```python
states = encoder_model.predict(english_sentence)
```

✅ **Generate Spanish word-by-word**
```python
while not done:
    # decoder_model returns probabilities plus the two updated LSTM states
    probs, h, c = decoder_model.predict([current_word] + states)
    states = [h, c]
    next_word = pick_best(probs)
```

✅ **Control the generation process**
- Decide when to stop
- Handle unknown words
- Implement different sampling strategies (greedy, beam search, etc.)
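
As a rough illustration of what "different sampling strategies" can mean, the sketch below contrasts greedy selection with temperature sampling on a made-up probability vector. Temperature sampling is not used in this notebook; it is included only as one example of an alternative, and the vocabulary and probabilities are hypothetical.

```python
import numpy as np

# Hypothetical next-word probabilities over a 5-word vocabulary.
vocab = ["veremos", "Después", ".", "<END>", "intentaremos"]
probs = np.array([0.55, 0.20, 0.10, 0.05, 0.10])

def greedy(probs):
    """Always pick the most probable token."""
    return int(np.argmax(probs))

def sample_with_temperature(probs, temperature, rng):
    """Draw a token at random; lower temperature favors likely tokens more strongly."""
    logits = np.log(probs) / temperature
    scaled = np.exp(logits - logits.max())   # subtract the max for numerical stability
    scaled /= scaled.sum()
    return int(rng.choice(len(probs), p=scaled))

rng = np.random.default_rng(42)
print("greedy:", vocab[greedy(probs)])
print("sampled:", [vocab[sample_with_temperature(probs, 1.0, rng)] for _ in range(5)])
```

Greedy decoding is deterministic, while sampling can return different words on each call, which is one reason the explicit loop over `decoder_model` is useful: the choice rule is ordinary Python code that can be swapped out.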

---

### **What We Still Need:**

- ❌ **Implement the generation loop** (next step!)
- ❌ **Handle one-hot encoding/decoding**
- ❌ **Convert probabilities to words**
- ❌ **Deal with unknown vocabulary**
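
The one-hot encoding and probability-to-word steps might look roughly like the sketch below. The dictionary names mirror the ones used elsewhere in this notebook, but the four-token vocabulary is a hypothetical miniature, and `one_hot_token` and `token_from_probs` are illustrative helpers rather than existing functions.

```python
import numpy as np

# Hypothetical miniature vocabulary; the real dictionaries come from earlier steps.
target_features_dict = {"<START>": 0, "<END>": 1, "veremos": 2, ".": 3}
reverse_target_features_dict = {i: tok for tok, i in target_features_dict.items()}
num_decoder_tokens = len(target_features_dict)

def one_hot_token(token):
    """Encode one token as a (batch=1, timesteps=1, vocab) array."""
    vec = np.zeros((1, 1, num_decoder_tokens))
    vec[0, 0, target_features_dict[token]] = 1.0
    return vec

def token_from_probs(probs):
    """Map a (1, 1, vocab) probability array back to its most likely token."""
    return reverse_target_features_dict[int(np.argmax(probs[0, -1, :]))]

start_vec = one_hot_token("<START>")
print(start_vec.shape)
print(token_from_probs(one_hot_token("veremos")))
```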

---

## **WHAT CHANGED FROM STEP 9:**

| Aspect | Step 9 (Training) | Step 10 (Inference) |
|--------|-------------------|---------------------|
| **Number of models** | 1 combined model | 2 separate models |
| **Encoder** | Part of combined model | Standalone encoder_model |
| **Decoder** | Part of combined model | Standalone decoder_model |
| **State handling** | Internal (automatic) | External (manual control) |
| **Processing** | Parallel (all words at once) | Sequential (word-by-word) |
| **Can train?** | ✅ Yes | ❌ Not trained directly (reuses the trained layers) |
| **Can generate translations?** | ❌ No (needs complete Spanish) | ✅ Yes (generates step-by-step) |
| **Weights** | Updated during training | Same trained weights, reused |

---

## **NEXT STEP: IMPLEMENTING THE TRANSLATION FUNCTION**

Now that we have the infrastructure, we need to write the code that:

1. Takes an English sentence as input
2. One-hot encodes it
3. Runs encoder_model to get states
4. Initializes decoder with `<START>` token
5. Loops with decoder_model until `<END>`
6. Converts predictions back to Spanish words
7. Returns the translation

This is where the actual "magic" of translation happens!

✅ **Inference models built! Ready to implement the translation loop!**
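
Before wiring in the real models, the shape of the loop can be tried out with a stub. In the sketch below, `stub_decoder_step` is a hypothetical stand-in for `decoder_model.predict` that follows a fixed script, and `greedy_decode` is an illustrative name. This version caps the number of generated tokens and leaves `<END>` out of the result, which is one possible design and not necessarily what the notebook's own function does.

```python
import numpy as np

vocab = ["<START>", "<END>", "veremos", "."]
index_of = {tok: i for i, tok in enumerate(vocab)}

def stub_decoder_step(target_seq, states):
    """Hypothetical stand-in for decoder_model.predict: returns (probs, h, c).

    It ignores the state values and follows a fixed script
    <START> -> veremos -> . -> <END>, purely so the loop can run.
    """
    script = {"<START>": "veremos", "veremos": ".", ".": "<END>", "<END>": "<END>"}
    current = vocab[int(np.argmax(target_seq[0, -1, :]))]
    probs = np.zeros((1, 1, len(vocab)))
    probs[0, 0, index_of[script[current]]] = 1.0
    h, c = states
    return probs, h, c

def greedy_decode(initial_states, max_tokens=10):
    states = list(initial_states)
    target_seq = np.zeros((1, 1, len(vocab)))
    target_seq[0, 0, index_of["<START>"]] = 1.0
    decoded_tokens = []

    # Cap the number of generated tokens (not characters).
    for _ in range(max_tokens):
        probs, h, c = stub_decoder_step(target_seq, states)
        token_index = int(np.argmax(probs[0, -1, :]))
        token = vocab[token_index]
        if token == "<END>":
            break                      # stop without adding the end marker
        decoded_tokens.append(token)

        # Feed the prediction and the updated states into the next step.
        target_seq = np.zeros((1, 1, len(vocab)))
        target_seq[0, 0, token_index] = 1.0
        states = [h, c]

    return " ".join(decoded_tokens)

dummy_states = [np.zeros((1, 256)), np.zeros((1, 256))]
print(greedy_decode(dummy_states))
```

Swapping the stub for the real `decoder_model.predict` call, and the dummy states for the output of `encoder_model.predict`, would give a loop with the same structure.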

In [15]:
def decode_sequence(test_input):
  encoder_states_value = encoder_model.predict(test_input)
  decoder_states_value = encoder_states_value
  target_seq = np.zeros((1, 1, num_decoder_tokens))
  target_seq[0, 0, target_features_dict['<START>']] = 1.
  decoded_sentence = ''
  
  stop_condition = False
  while not stop_condition:
    # Run the decoder model to get possible 
    # output tokens (with probabilities) & states
    output_tokens, new_decoder_hidden_state, new_decoder_cell_state = decoder_model.predict([target_seq] + decoder_states_value)

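    # Placeholder for the sampled token. It is still empty when the
    # exit condition below runs, so the '<END>' comparison can never
    # be true in this version (the next cell moves the argmax above the check).
    sampled_token = ""

    # Exit condition: either hit max length
    # or find stop token.
    # len(decoded_sentence) counts characters, not tokens.
    if (sampled_token == '<END>' or len(decoded_sentence) > max_decoder_seq_length):
      stop_condition = True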

    # Update the target sequence (of length 1).
    # slicing [0, -1, :] gives us a
    # specific token vector within the
    # 3d NumPy matrix
    sampled_token_index = np.argmax(output_tokens[0, -1, :])
    # The reverse features dictionary
    # translates back from index to Spanish
    sampled_token = reverse_target_features_dict[
    sampled_token_index]
    decoded_sentence += " " + sampled_token
    # Update states
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, sampled_token_index] = 1.
    decoder_states_value = [new_decoder_hidden_state, new_decoder_cell_state]

  return decoded_sentence

for seq_index in range(10):
  test_input = encoder_input_data[seq_index: seq_index + 1]
  decoded_sentence = decode_sequence(test_input)
  print('-')
  print('Input sentence:', input_docs[seq_index])
  print('Decoded sentence:', decoded_sentence)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 161ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 161ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 57ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 52ms/step
-
Input sentence: We'll see .
Decoded sentence:  veremos . <END>
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 52ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 57ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 60ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 47ms/step
-
Input sentence: We'll see .
Decoded sentence:  veremos . <END>
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 54ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 51ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 58ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 56ms/step
-
In

In [16]:
def decode_sequence(test_input):
  encoder_states_value = encoder_model.predict(test_input)
  decoder_states_value = encoder_states_value
  target_seq = np.zeros((1, 1, num_decoder_tokens))
  target_seq[0, 0, target_features_dict['<START>']] = 1.
  decoded_sentence = ''
  
  stop_condition = False
  while not stop_condition:
    # Run the decoder model to get possible 
    # output tokens (with probabilities) & states
    output_tokens, new_decoder_hidden_state, new_decoder_cell_state = decoder_model.predict(
        [target_seq] + decoder_states_value
    )

    # Choose token with highest probability
    sampled_token_index = np.argmax(output_tokens[0, -1, :])
    sampled_token = reverse_target_features_dict[sampled_token_index]
    
    # Add word to result
    decoded_sentence += " " + sampled_token

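    # Exit condition: either hit max length
    # or find stop token.
    # Note: len(decoded_sentence) counts characters, not tokens,
    # so the length limit can trigger before '<END>' is generated.
    if (sampled_token == '<END>' or len(decoded_sentence) > max_decoder_seq_length):
      stop_condition = True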

    # Update the target sequence (of length 1).
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, sampled_token_index] = 1.

    # Update states
    decoder_states_value = [new_decoder_hidden_state, new_decoder_cell_state]

  return decoded_sentence

for seq_index in range(10):
  test_input = encoder_input_data[seq_index: seq_index + 1]
  decoded_sentence = decode_sequence(test_input)
  print('-')
  print('Input sentence:', input_docs[seq_index])
  print('Decoded sentence:', decoded_sentence)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 55ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 56ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 52ms/step
-
Input sentence: We'll see .
Decoded sentence:  veremos .
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 43ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 48ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step
-
Input sentence: We'll see .
Decoded sentence:  veremos .
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 51ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 51ms/step
-
Input sentence: We'll try .
Decoded sentence:  veremos .
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 50ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 48ms/step
1/1 ━━━━━━━━━━━━━━