# Character-Level Language Modeling (Chapter 15 Application) ✍️

---

This notebook applies recurrent neural networks to **character-level language modeling**. An LSTM is trained to predict the next character of a text from the characters that precede it; once trained, it can **generate new text** one character at a time. The approach builds on the sequence modeling and probability concepts of **Chapter 15**.

### 1. Data Preparation at the Character Level

Unlike the previous notebook, which processed words, this model handles text one character at a time:

* **Text Corpus:** The notebook loads `Mysterious_Island.txt` (a Project Gutenberg text) and trims the Gutenberg header and footer, leaving about 1.1 million characters as the training corpus.
* **Character Vocabulary:** Every unique character (letters, digits, spaces, punctuation) is mapped to an integer index. This corpus contains 80 unique characters.
    * **Input:** The input to the RNN at each time step ($x_t$) is a single character's index.
    * **Output:** The target output ($y_t$) is the *next* character in the sequence.
* **Sequence Creation:** The encoded corpus is sliced into overlapping chunks of `seq_length + 1` characters (`seq_length = 40` in this notebook). The first 40 characters of each chunk form the input and the last 40 form the target, so at every time step the model is trained to predict the character that follows.
* **`Dataset` and `DataLoader`:** A custom `Dataset` subclass returns the (input, target) pairs, and a PyTorch `DataLoader` shuffles them into batches of 64.
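
The chunking and input/target shift described above can be sketched on a toy string. This is a minimal illustration in plain Python, not the notebook's own code; `make_chunks`, `split_chunk`, and the `toy_*` names are introduced here only for the example.

```python
def make_chunks(encoded, seq_length):
    """Return every overlapping window of seq_length + 1 items."""
    chunk_size = seq_length + 1
    return [encoded[i:i + chunk_size]
            for i in range(len(encoded) - chunk_size + 1)]


def split_chunk(chunk):
    """Input is all but the last item; target is the same window shifted by one."""
    return chunk[:-1], chunk[1:]


# Toy corpus standing in for the real text (assumption for illustration).
toy_text = "hello world"
toy_vocab = sorted(set(toy_text))
toy_char2int = {ch: i for i, ch in enumerate(toy_vocab)}
toy_encoded = [toy_char2int[ch] for ch in toy_text]

chunks = make_chunks(toy_encoded, seq_length=4)
first_input, first_target = split_chunk(chunks[0])

# Decode back to characters to check the one-step shift.
decoded_input = "".join(toy_vocab[i] for i in first_input)
decoded_target = "".join(toy_vocab[i] for i in first_target)
print(repr(decoded_input), "->", repr(decoded_target))
```

Each target position holds the character that follows the corresponding input position, which is why a single chunk supplies `seq_length` training examples.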

### 2. LSTM Architecture

The model uses an LSTM because producing plausible words and sentences requires carrying context across many characters:

* **One-Hot Encoding/Embedding:** The character index is converted into a feature vector. With a vocabulary this small (80 unique characters), either representation is feasible:
    1.  **One-Hot Encoding:** A sparse vector of length $V$ in which exactly one element is 1.
    2.  **`nn.Embedding`:** A trainable dense embedding layer (the same layer type used for words in the IMDB notebook), here applied to characters. This is the option the notebook uses, with 256-dimensional embeddings.
* **LSTM Core:** A single LSTM layer with 512 hidden units carries the "memory" of the preceding characters, which the model needs to maintain spelling, grammar, and context over long sequences.
* **Final Layer:** The output of the LSTM hidden state is passed through a final linear layer that has **$V$ output units** (where $V$ is the size of the character vocabulary).
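
The two input encodings listed above are closely related: multiplying a one-hot vector by a weight matrix selects one row of that matrix, which is what an embedding lookup does directly. The sketch below shows this with plain Python lists; `toy_weights` and the helper names are hypothetical and chosen only for illustration.

```python
def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]


def vec_mat(vec, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(n_cols)]


# Hypothetical 4-character vocabulary with 3-dimensional embeddings.
toy_weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

char_index = 2
via_one_hot = vec_mat(one_hot(char_index, len(toy_weights)), toy_weights)
via_lookup = toy_weights[char_index]  # what an embedding layer does
print(via_one_hot, via_lookup)
```

The lookup skips the multiplication by zeros, so it produces the same vector with far less work.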

### 3. Training and Generation

* **Multiclass Loss:** Predicting the next character is a classification over the $V$ possible characters, so **`nn.CrossEntropyLoss`** is used. The per-step losses are summed over the sequence for backpropagation, and the reported value is the average per character.
* **Text Generation (Inference):** After training, the model generates text in an autoregressive loop:
    1.  **Seed Input:** The process starts from a single character or a short seed phrase (here `'The island'`), which is fed through the model to set up the hidden and cell states.
    2.  **Prediction:** The model outputs logits that define a probability distribution over the next character.
    3.  **Sampling:** A character is drawn from this distribution. The logits are first multiplied by the `scale_factor` parameter, which acts as an inverse temperature ($\alpha = 1/T$):
        * **Low Temperature (large `scale_factor`):** Sharpens the distribution and leads to more predictable text.
        * **High Temperature (small `scale_factor`):** Flattens the distribution and introduces more randomness.
    4.  **Recurrence:** The sampled character is fed back into the model as the input for the next time step, and the loop repeats until the requested number of characters has been generated.
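
The four steps above can be sketched without a trained network. In the example below, `toy_next_char_logits` is a hypothetical stand-in for the model (it simply favors the next letter in the vocabulary), and `generate`, `softmax`, and `rng` are names introduced for illustration. Only the loop structure mirrors the procedure described here.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_next_char_logits(last_char, vocab):
    """Hypothetical model: favors the letter after last_char in vocab order."""
    favored = (vocab.index(last_char) + 1) % len(vocab)
    return [3.0 if i == favored else 0.0 for i in range(len(vocab))]


def generate(seed, length, vocab, scale_factor=1.0, rng=None):
    rng = rng or random.Random(0)
    text = seed                                                  # 1. seed input
    for _ in range(length):
        logits = toy_next_char_logits(text[-1], vocab)           # 2. prediction
        probs = softmax([scale_factor * x for x in logits])
        next_char = rng.choices(vocab, weights=probs, k=1)[0]    # 3. sampling
        text += next_char                                        # 4. recurrence
    return text


vocab = list("abcd")
sample_text = generate("a", length=20, vocab=vocab, scale_factor=1.0)
print(sample_text)
```

Raising `scale_factor` makes the favored letter win almost every draw, while lowering it toward zero makes all letters nearly equally likely.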

The result is a generative model: the same network that is trained as a next-character classifier is reused at inference time to produce new text one character at a time.

In [40]:
import numpy as np
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

In [41]:
with open('./Mysterious_Island.txt', 'r', encoding= 'utf8') as fp:
    text = fp.read()
    
start_idx = text.find('THE MYSTERIOUS ISLAND')
end_idx = text.find('End of the Project Gutenberg')

text = text[start_idx: end_idx]

char_set = set(text)
print(f'Total Length: {len(text)}')
print(f'Unique Characters: {len(char_set)}')

Total Length: 1112350
Unique Characters: 80


In [42]:
chars_sorted = sorted(char_set)
char2int = {ch:i for i,ch in enumerate(chars_sorted)}
int2char = np.array(chars_sorted)

In [43]:
text_encoded = np.array(
    [char2int[ch] for ch in text],
    dtype= np.int32
)
print(f'Encoded text shape: {text_encoded.shape}')

Encoded text shape: (1112350,)


In [44]:
print(f'{text[:15]} ==> Encoding ==> {text_encoded[:15]}')

THE MYSTERIOUS  ==> Encoding ==> [44 32 29  1 37 48 43 44 29 42 33 39 45 43  1]


In [45]:
print(f'{text_encoded[15:21]} ==> Reverse ==> {"".join(int2char[text_encoded[15:21]])}')

[33 43 36 25 38 28] ==> Reverse ==> ISLAND


In [46]:
for ex in text_encoded[:15]:
    print('{} --> {}'.format(ex, int2char[ex]))

44 --> T
32 --> H
29 --> E
1 -->  
37 --> M
48 --> Y
43 --> S
44 --> T
29 --> E
42 --> R
33 --> I
39 --> O
45 --> U
43 --> S
1 -->  


In [47]:
seq_length = 40
chunk_size = seq_length + 1
# Overlapping windows of chunk_size characters; the +1 keeps the final window.
text_chunks = [text_encoded[i: i+chunk_size] for i in range(len(text_encoded) - chunk_size + 1)]

In [48]:
class TextDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks
        
    def __len__(self):
        return len(self.text_chunks)
    
    def __getitem__(self, idx):
        text_chunk = self.text_chunks[idx]
        return text_chunk[:-1].long(), text_chunk[1:].long()

In [49]:
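# Stack the chunks into one array first; building a tensor from a list of arrays is very slow.
seq_dataset = TextDataset(torch.tensor(np.array(text_chunks), dtype= torch.long))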

In [50]:
for i, (seq, target) in enumerate(seq_dataset):
    print(f'Input: {repr("".join(int2char[seq]))}')
    print(f'Target: {repr("".join(int2char[target]))}\n')
    if i == 1:
        break

Input: 'THE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced b'
Target: 'HE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced by'

Input: 'HE MYSTERIOUS ISLAND ***\n\n\n\n\nProduced by'
Target: 'E MYSTERIOUS ISLAND ***\n\n\n\n\nProduced by '



In [51]:
batch_size = 64
torch.manual_seed(1)
seq_dl = DataLoader(seq_dataset, batch_size, shuffle= True, drop_last= True)

In [52]:
torch.manual_seed(1)
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_size, rnn_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn_hidden_size = rnn_hidden_size
        self.rnn = nn.LSTM(embed_size, rnn_hidden_size, batch_first= True)
        self.fc = nn.Linear(rnn_hidden_size, vocab_size)
        
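    def forward(self, x, hidden, cell):
        # x holds one character index per batch element: shape (batch,)
        out = self.embedding(x).unsqueeze(1)  # (batch, 1, embed_size)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))  # out: (batch, 1, rnn_hidden_size)
        out = self.fc(out).reshape(out.size(0), -1)  # logits: (batch, vocab_size)
        
        return out, hidden, cell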
    
    def init_hidden(self, batch_size):
        hidden = torch.zeros(1, batch_size, self.rnn_hidden_size)
        cell = torch.zeros(1, batch_size, self.rnn_hidden_size)
        return hidden, cell

In [53]:
torch.manual_seed(1)
vocab_size = len(int2char)
embed_size = 256
rnn_hidden_size = 512
model = RNN(vocab_size, embed_size, rnn_hidden_size)
model

RNN(
  (embedding): Embedding(80, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (fc): Linear(in_features=512, out_features=80, bias=True)
)

In [54]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr= 0.001)

In [18]:
num_epochs = 10000
model.train()
for epoch in range(num_epochs):
    # Each "epoch" here is one randomly drawn batch, not a full pass over the data.
    hidden, cell = model.init_hidden(batch_size)
    seq_batch, target_batch = next(iter(seq_dl))
    optimizer.zero_grad()
    loss = 0
    # Feed one character per step and accumulate the loss over the sequence.
    for c in range(seq_length):
        pred, hidden, cell = model(seq_batch[:, c], hidden, cell)
        loss += loss_fn(pred, target_batch[:, c])
    loss.backward()
    optimizer.step()
    # Report the average loss per character.
    loss = loss.item() / seq_length
    
    if (epoch % 500) == 0:
        print(f'Epoch {epoch} Loss: {loss:.4f}')

Epoch 0 Loss: 4.3729
Epoch 500 Loss: 1.4705
Epoch 1000 Loss: 1.3571
Epoch 1500 Loss: 1.2363
Epoch 2000 Loss: 1.2621
Epoch 2500 Loss: 1.2169
Epoch 3000 Loss: 1.1761
Epoch 3500 Loss: 1.1390
Epoch 4000 Loss: 1.0997
Epoch 4500 Loss: 1.1330
Epoch 5000 Loss: 1.1682
Epoch 5500 Loss: 1.0609
Epoch 6000 Loss: 1.0931
Epoch 6500 Loss: 1.1096
Epoch 7000 Loss: 1.0830
Epoch 7500 Loss: 1.0807
Epoch 8000 Loss: 1.0697
Epoch 8500 Loss: 0.9838
Epoch 9000 Loss: 1.0390
Epoch 9500 Loss: 1.0443
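
One way to read the first logged value: a model that assigns equal probability to all 80 characters has a cross-entropy of ln 80 ≈ 4.38, which is close to the loss at epoch 0. The sketch below uses plain Python and only numbers printed in this notebook. `n_chars` and `perplexity` are names introduced here for illustration, and the exponential of the average cross-entropy is a common way to express the loss as an "effective number of equally likely choices".

```python
import math

n_chars = 80  # unique characters reported for the corpus

# Cross-entropy (natural log) of a uniform guess over all characters.
uniform_loss = -math.log(1.0 / n_chars)
print(f"Uniform-guess loss: {uniform_loss:.4f}")


def perplexity(avg_cross_entropy):
    """Effective number of equally likely choices for a given average loss."""
    return math.exp(avg_cross_entropy)


# Losses taken from the training log above (epoch 0 and epoch 9500).
print(f"Perplexity at loss 4.3729: {perplexity(4.3729):.1f}")
print(f"Perplexity at loss 1.0443: {perplexity(1.0443):.2f}")
```

By this measure the model starts out choosing among roughly all 80 characters and ends up as uncertain as a choice among about three.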


In [55]:
path = './character_level_gutenburg_model.pt'
torch.save(model.state_dict(), path)

In [56]:
path = './character_level_gutenburg_model.pt'
model = RNN(vocab_size, embed_size, rnn_hidden_size)
model.load_state_dict(torch.load(path))

<All keys matched successfully>

In [57]:
from torch.distributions.categorical import Categorical
torch.manual_seed(1)
logits = torch.tensor([[1., 1., 1.]])
print(f'Probabilities: {nn.functional.softmax(logits, dim= 1).numpy()[0]}')

Probabilities: [0.33333334 0.33333334 0.33333334]


In [58]:
m = Categorical(logits= logits)
samples = m.sample((10, ))
print(samples.numpy())

[[0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [2]
 [1]
 [1]]


In [59]:
logits = torch.tensor([[1., 1., 3.]])
print(f'Probabilities: {nn.functional.softmax(logits, dim= 1).numpy()[0]}')

Probabilities: [0.10650698 0.10650698 0.78698605]


In [60]:
n = Categorical(logits= logits)
samples = n.sample((10, ))
print(samples.numpy())

[[2]
 [2]
 [2]
 [2]
 [0]
 [1]
 [2]
 [1]
 [2]
 [2]]


In [61]:
def sample(model, starting_str, len_generated_str= 500, scale_factor= 0.1):
    # scale_factor multiplies the logits (inverse temperature):
    # values above 1 sharpen the distribution, values below 1 flatten it.
    encoded_input = torch.tensor([
        char2int[s] for s in starting_str
    ])
    encoded_input = encoded_input.reshape(1, -1)
    generated_str = starting_str
    
    model.eval()
    hidden, cell = model.init_hidden(1)
    with torch.no_grad():
        # Run all but the last seed character through the LSTM to set up its state.
        for c in range(len(starting_str) - 1):
            _, hidden, cell = model(encoded_input[:, c].view(1), hidden, cell)
        
        last_char = encoded_input[:, -1]
        for i in range(len_generated_str):
            logits, hidden, cell = model(last_char.view(1), hidden, cell)
            logits = logits.squeeze(0)
            scaled_logits = logits * scale_factor
            # Draw the next character index from the scaled distribution.
            m = Categorical(logits= scaled_logits)
            last_char = m.sample()
            generated_str += str(int2char[last_char])
        
    return generated_str

In [64]:
torch.manual_seed(1)
generated_text = sample(model, starting_str='The island')

In [65]:
print(generated_text)

The islandIsV7GphHHrwt:GQ;Q1z=Bfy,&y38hZgOTuy/S‘&m ”*-C=M=HG
,7gWd&)U,!v,NxG.A520fGGU 0?/*j,0M”oWW’:6NH1.”e:=1AvoR)=!D“l
hw7“EjHA
tyog
U(3u5C;yp
&*QZ&DL4nd/?dF6PONxGQ8“7 M?N;UkxDq”FIHE8bw-EIvG-”m-H)=.4&‘-b?Uvv‘0vr4”wx!pzh4 gWQGB8z *0“cK!BsYd.xCb tH4&2”G-H=“iDL;b=Eo
5i?zu‘5T0o2TVel‘G1
2jcuDTBc“bKRvs!:4DMgrW- nmBP9Qzm!36gt&2-/jY:aed,2I(dwQ!th8YG?bo(NbxGfg??g
W6pW4q0Cw4G-I!z;”m‘’8ZGFU(v04“gZu’9 “!h !x49-GAO“I0 v!?biDV hGhzVWnur:8
;:35sxxkl*:?n’:IH)V:DgOCxhGv?h;1s5zf(9kb3lNb-q 5/rk5:dT7o5LA*j”TsnQYjMYmE(6*VyJ


In [67]:
logits = torch.tensor([[1., 1., 3.]])
print(f'Probabilities: {nn.functional.softmax(logits, dim= 1).numpy()[0]}')
print(f'Probabilities with scale factor 0.5: {nn.functional.softmax(0.5*logits, dim= 1).numpy()[0]}')
print(f'Probabilities with scale factor 0.1: {nn.functional.softmax(0.1*logits, dim= 1).numpy()[0]}')

Probabilities: [0.10650698 0.10650698 0.78698605]
Probabilities with scale factor 0.5: [0.21194156 0.21194156 0.57611686]
Probabilities with scale factor 0.1: [0.3104238  0.3104238  0.37915248]
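
Multiplying the logits by a factor is equivalent to dividing them by a temperature equal to the reciprocal of that factor, so a factor of 0.5 corresponds to temperature 2 and a factor of 0.1 to temperature 10. The sketch below reproduces the probabilities printed above in plain Python; `softmax_with_temperature` and `toy_logits` are names introduced here for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature (shifted by the max for stability)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [1.0, 1.0, 3.0]
for scale_factor in (2.0, 1.0, 0.5, 0.1):
    temperature = 1.0 / scale_factor
    probs = softmax_with_temperature(toy_logits, temperature)
    formatted = ", ".join(f"{p:.4f}" for p in probs)
    print(f"scale_factor={scale_factor} (temperature={temperature:g}): {formatted}")
```

Factors below 1 pull the three probabilities toward 1/3 each, while a factor of 2 concentrates almost all of the mass on the largest logit.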


In [69]:
# scale_factor = 2.0 (temperature 0.5): more predictable
torch.manual_seed(1)
generated_text_with_low_temp = sample(model, starting_str= 'The island', scale_factor= 2.0)
print(generated_text_with_low_temp)

The island8wW3NvUN‘jKV2aHOv”hg0RgN“:xaAyvo,B7u9‘xk;l4kb:JB!AD=N&)ew15:xLm’a=3AobIvmuC‘/“f0?/:PnVFV.“sH7j/5d(JrW9*1u:fPa“’MnM’96Srw-Ab0wG U?p.*E0PVTukZ’!t‘8b;/xyQm54pcH M9Kkmqa3R9SDc9;=6Yre1! 1/s.=m!c3xK8eIWm“i47mL7x4Yt;1hcJATn*PKL! uWm“aAIH6”vu0qt7r)O
&4LQo3c3‘n“:2s9q,Siv-wH:.j a20Qem(p?/;,4DL,t/?;z6LI5dM,yLeP‘/A=JtHf:2rdoil111yKpM/*H F0tEjtofBK“x-vYdQF”R:fe4q’s.&F!2QMWN w/6VDmKRSdR/0(Nvvb,D2l(NwB/gdLBbK)A=’Yzh YQxzq)/N2Dr/Q-FroJGJ5*; s‘DxC!)A; ujhl/‘rp5cFlG
V9Pj8:zGtg??49g*s9=qP”kS&(bNU4?‘9cj5D eeF-ypx*i


In [None]:
# scale_factor = 0.5 (temperature 2.0): more randomness
torch.manual_seed(1)
generated_text_with_high_temp = sample(model, starting_str= 'The island', scale_factor= 0.5)
print(generated_text_with_high_temp)