# RNN Autoencoder for Poetry: Theory Meets Practice

**Educational Implementation with Mathematical Foundation**

This notebook builds an RNN autoencoder for dimensionality reduction on poetry text, connecting deep theoretical insights with hands-on implementation. We follow the mathematical framework established in our theoretical exposition.

## Theoretical Foundation Recap

From our comprehensive analysis, we established that:

1. **Dimensionality Reduction is Essential**: RNNs are impractical to train unless the effective dimension satisfies $d_{\text{eff}} \ll d$, where $d=300$ is the GloVe embedding dimension

2. **Sample Complexity Improvement**: Joint input-output reduction improves sample complexity from $\mathcal{O}(\epsilon^{-600})$ to $\mathcal{O}(\epsilon^{-35})$, cutting the exponent from 600 to 35

3. **Autoencoder Optimality**: The encoder-bottleneck-decoder architecture is theoretically optimal for learning compressed representations

4. **Poetry-Specific Challenges**: 
   - Sequence length $T=50$ requires careful gradient flow management
   - Vocabulary size $V=1962$ creates high-dimensional discrete space
   - Semantic structure in poetry may have lower intrinsic dimension
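To make the gap in point 2 concrete, the short calculation below evaluates both bounds at one accuracy target. The value $\epsilon = 0.5$ and the variable names are assumptions chosen purely for illustration, and the constants hidden by the $\mathcal{O}(\cdot)$ notation are ignored.

```python
import math

epsilon = 0.5  # illustrative accuracy target (assumption, not from the analysis)

# Exponents of the two sample-complexity bounds quoted above
exponent_full = 600     # O(eps^-600): no dimensionality reduction
exponent_reduced = 35   # O(eps^-35): joint input-output reduction

# Work in log10 so the magnitudes stay readable
log10_full = -exponent_full * math.log10(epsilon)
log10_reduced = -exponent_reduced * math.log10(epsilon)

print(f"Without reduction: ~10^{log10_full:.1f} samples")
print(f"With reduction:    ~10^{log10_reduced:.1f} samples")
print(f"Gap:               ~10^{log10_full - log10_reduced:.1f}x fewer samples")
```

Even at this loose target the unreduced bound is far beyond any realistic dataset, which is why the bottleneck matters for a corpus as small as ours.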

## Architecture Overview

```
Input: [batch_size, seq_len, 300]  # GloVe embeddings
   ↓
Encoder RNN: [batch_size, seq_len, hidden_dim] → [batch_size, bottleneck_dim]
   ↓  
Bottleneck: [batch_size, bottleneck_dim]  # Compressed representation (10-20D)
   ↓
Decoder RNN: [batch_size, bottleneck_dim] → [batch_size, seq_len, 300]
   ↓
Output: [batch_size, seq_len, 300]  # Reconstructed embeddings
```

**Key Design Decisions**:
- **Bottleneck dimension**: 10-20D based on effective dimension analysis
- **Hidden dimensions**: Start conservative (~64) to understand gradient flow
- **Loss function**: MSE in embedding space (continuous, differentiable)
- **Architecture**: Vanilla RNN first (educational), then LSTM if needed
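The design decisions above can be sketched in PyTorch as follows. This is a minimal sketch, not the final model: the class name `RNNAutoencoderSketch`, the layer names, the choice to feed the bottleneck vector to the decoder at every time step, and the masked variant of MSE are all assumptions made for illustration. The encoder here also reads padded positions, a simplification a full implementation would likely address.

```python
import torch
import torch.nn as nn


class RNNAutoencoderSketch(nn.Module):
    """Hypothetical encoder-bottleneck-decoder built from vanilla RNNs."""

    def __init__(self, embedding_dim=300, hidden_dim=64, bottleneck_dim=16):
        super().__init__()
        self.encoder_rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.to_bottleneck = nn.Linear(hidden_dim, bottleneck_dim)
        self.from_bottleneck = nn.Linear(bottleneck_dim, hidden_dim)
        self.decoder_rnn = nn.RNN(bottleneck_dim, hidden_dim, batch_first=True)
        self.to_embedding = nn.Linear(hidden_dim, embedding_dim)

    def encode(self, x):
        # x: [batch, seq_len, embedding_dim]; h_n: [1, batch, hidden_dim]
        _, h_n = self.encoder_rnn(x)
        return self.to_bottleneck(h_n[-1])  # [batch, bottleneck_dim]

    def decode(self, z, seq_len):
        # Initialise the decoder state from the bottleneck vector
        h0 = torch.tanh(self.from_bottleneck(z)).unsqueeze(0)
        # Assumption: repeat z as the decoder input at every time step
        z_seq = z.unsqueeze(1).repeat(1, seq_len, 1)
        out, _ = self.decoder_rnn(z_seq, h0)  # [batch, seq_len, hidden_dim]
        return self.to_embedding(out)         # [batch, seq_len, embedding_dim]

    def forward(self, x):
        z = self.encode(x)
        return self.decode(z, x.size(1)), z


def masked_mse(reconstruction, target, mask):
    """MSE in embedding space, averaged over non-padded tokens only."""
    per_token = ((reconstruction - target) ** 2).mean(dim=-1)  # [batch, seq_len]
    mask = mask.float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)


# Shape check on random inputs (not real poetry data)
model = RNNAutoencoderSketch()
x = torch.randn(4, 50, 300)
mask = torch.ones(4, 50, dtype=torch.bool)
reconstruction, z = model(x)
loss = masked_mse(reconstruction, x, mask)
print(reconstruction.shape, z.shape, loss.item())
```

Starting from the vanilla `nn.RNN` keeps the gradient flow easy to inspect; swapping in `nn.LSTM` later would mainly change how the hidden state is unpacked in `encode` and `decode`.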

## Data Loading and Analysis

Let's load our preprocessed poetry data and understand its structure.

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import json
from torch.utils.data import DataLoader, TensorDataset
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(314)
torch.manual_seed(314)

print("Libraries imported successfully")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

In [None]:
# Load preprocessed data
data_path = "../GLoVe preprocessing/preprocessed_data/"

# Load metadata first to understand data structure
with open(f"{data_path}metadata_latest.json", 'r') as f:
    metadata = json.load(f)
    
print("Data Structure:")
for key, value in metadata.items():
    print(f"  {key}: {value}")

# Load arrays
embedding_sequences = np.load(f"{data_path}embedding_sequences_latest.npy")
token_sequences = np.load(f"{data_path}token_sequences_latest.npy")
attention_masks = np.load(f"{data_path}attention_masks_latest.npy")
embedding_matrix = np.load(f"{data_path}embedding_matrix_latest.npy")

print(f"\nActual shapes:")
print(f"  Embedding sequences: {embedding_sequences.shape}  # (poems, seq_len, embedding_dim)")
print(f"  Token sequences: {token_sequences.shape}        # (poems, seq_len)")
print(f"  Attention masks: {attention_masks.shape}        # (poems, seq_len)")
print(f"  Embedding matrix: {embedding_matrix.shape}      # (vocab_size, embedding_dim)")

**Data Interpretation**:
- We have **128 poems** (some of the 264 poems in the original collection were filtered out during preprocessing)
- Each poem is **50 tokens** (padded/truncated)
- **1962 vocabulary size** from our poetry corpus
- **300D GloVe embeddings** per token

This gives us input tensors of shape `[128, 50, 300]`, exactly what our autoencoder expects.
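As a quick illustration of how tensors of this shape can be batched for training, here is a sketch using `TensorDataset` and `DataLoader`. The arrays are synthetic stand-ins with the shapes listed above, and the batch size of 32 and the random poem lengths are assumptions, not values taken from the preprocessing pipeline.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-ins with the shapes described above (not the real poetry data)
rng = np.random.default_rng(314)
fake_embeddings = rng.normal(size=(128, 50, 300)).astype(np.float32)
fake_lengths = rng.integers(low=10, high=51, size=128)       # hypothetical poem lengths
fake_masks = np.arange(50)[None, :] < fake_lengths[:, None]  # True for real tokens

dataset = TensorDataset(
    torch.from_numpy(fake_embeddings),
    torch.from_numpy(fake_masks),
)
loader = DataLoader(dataset, batch_size=32, shuffle=True)  # batch size is an assumption

batch_embeddings, batch_mask = next(iter(loader))
print(batch_embeddings.shape)  # torch.Size([32, 50, 300])
print(batch_mask.shape)        # torch.Size([32, 50])
```

Keeping the mask in the same dataset as the embeddings means every batch carries the information needed to exclude padding from the loss.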

In [None]:
# Let's first examine our data more closely and understand effective dimensionality
print("=== Data Analysis for Architecture Design ===")

# Convert to PyTorch tensors with explicit dtypes
X = torch.as_tensor(embedding_sequences, dtype=torch.float32)  # [128, 50, 300]
attention_mask = torch.as_tensor(attention_masks, dtype=torch.bool)  # [128, 50]

print(f"Input tensor shape: {X.shape}")
print(f"Attention mask shape: {attention_mask.shape}")

# Analyze effective sequence lengths (before padding)
real_lengths = attention_mask.sum(dim=1)  # Sum of True values per sequence
print(f"\nSequence length statistics:")
print(f"  Mean length: {real_lengths.float().mean():.1f}")
print(f"  Min length: {real_lengths.min()}")  
print(f"  Max length: {real_lengths.max()}")
print(f"  Std length: {real_lengths.float().std():.1f}")

# Quick PCA to estimate effective dimensionality of embeddings
# Flatten to [128*50, 300] for PCA, but only use non-padded tokens
valid_embeddings = X[attention_mask]  # Get only non-padded embeddings
print(f"\nValid embeddings for PCA: {valid_embeddings.shape}")

# Run PCA to understand intrinsic dimensionality
pca = PCA(n_components=50)  # Look at first 50 components
valid_embeddings_np = valid_embeddings.detach().numpy()
pca_result = pca.fit_transform(valid_embeddings_np)

# Plot explained variance
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
cumvar = np.cumsum(pca.explained_variance_ratio_)
plt.plot(np.arange(1, len(cumvar[:30]) + 1), cumvar[:30], 'b-', linewidth=2)  # 1-based component index
plt.axhline(y=0.90, color='r', linestyle='--', alpha=0.7, label='90% variance')
plt.axhline(y=0.95, color='orange', linestyle='--', alpha=0.7, label='95% variance')
plt.xlabel('Principal Component')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA: Cumulative Explained Variance')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(np.arange(1, len(pca.explained_variance_ratio_[:20]) + 1),
         pca.explained_variance_ratio_[:20], 'g-o', markersize=4)  # 1-based component index
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('PCA: Individual Component Variance')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

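# Find effective dimensions. If a threshold is not reached within the fitted
# components, fall back to len(cumvar) instead of raising an IndexError;
# a result equal to len(cumvar) may therefore mean "at least this many".
def first_dim_reaching(threshold):
    reached = np.where(cumvar >= threshold)[0]
    return int(reached[0]) + 1 if reached.size else len(cumvar)

dim_90 = first_dim_reaching(0.90)
dim_95 = first_dim_reaching(0.95)
print("\nEffective Dimensionality Analysis:")
print(f"  Dimensions for 90% variance: {dim_90}")
print(f"  Dimensions for 95% variance: {dim_95}")
print(f"  This suggests bottleneck_dim ∈ [{max(1, dim_90-5)}, {dim_95+5}] might work well")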
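To build intuition for what the variance threshold measures, the sketch below applies the same idea to synthetic data whose intrinsic dimension is known. It uses a plain NumPy SVD rather than scikit-learn, and the helper name `components_for_variance`, the 5-dimensional latent space, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np


def components_for_variance(data, threshold):
    """Smallest number of principal components explaining >= threshold of variance."""
    centered = data - data.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance_ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(variance_ratio)
    # searchsorted returns the first index where cumulative >= threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))


# Synthetic 300D data that really lives near a 5D subspace (hypothetical setup)
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 300))
data = latent @ mixing + 0.01 * rng.normal(size=(500, 300))

k_90 = components_for_variance(data, 0.90)
k_99 = components_for_variance(data, 0.99)
print(f"Components for 90% variance: {k_90}")
print(f"Components for 99% variance: {k_99}")
```

On data like this the count stays at or below the true latent dimension even for a strict threshold. Real embeddings rarely have such a sharp cutoff, so the PCA estimate is best read as a starting range for `bottleneck_dim` rather than a precise answer.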