# <center> **Novel Creation using PyTorch in a Decoder only Fashion**</center>

- The series **Percy Jackson and the Olympians** by **Rick Riordan** was used for Model Training (I absolutely love the series)

- **Andrej Karpathy** sir's and **Josh Starmer** sir's material on Youtube has been used as reference material

- This project is strictly for a learning experience

- Please note that no copyrights were meant to be broken and in case of any unintended violations, please contact me and I will take down this project

- The A100 GPU was used for this model's training, thus GOogle Colab Pro was used


# Sections

1. Importing Libraries
2. Setting Hyperparameters
3. Loading Text Files (PDFs) and cleaning
4. Encoding Text
5. Obtaining Batches
6. Implementing a HEAD of Self-Attention
7. Implementing Multi-head Attention
8. FeedForward and Layer Normalisation for Residual Connection in a Block
9. Pre-Block Creation
10. Post-Block Creation
11. Putting it all together (Transformer)
12. Instantiation and Model Training
13. Saving, Calculating Size and Loading Models Weights


# 1. Importing Libraries

The framework used was PyTorch thus relevant modules need to be imported

In [1]:
# from google.colab import drive
# drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install PyPDF2
!pip install torchsummary

import torch
import torch.nn as nn
from torch.nn import functional as F
from tqdm import tqdm
from PyPDF2 import PdfReader
import os
from google.colab import files

Collecting PyPDF2

  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m232.6/232.6 kB[0m [31m5.4 MB/s[0m eta [36m0:00:00[0m

[?25hInstalling collected packages: PyPDF2

Successfully installed PyPDF2-3.0.1



# 2. Setting Hyperparamters

- **Batch Size**: Helps in parallelism by utilising the the multiple cores of the GPU simultaneously for independent processing
- **Block Size**: Context window to pick training samples from
- **max_iters**: The maximum epochs used for training
- **eval_interval**: The interval after which loss is to be estimated during training
- **learning_rate**: The magnitude by which we want to update our model weights
- **device**: Allows for the usage of GPU, if available
- **eval_iters**: Used to estimate loss, determines the number of batches of data to select (X), predictions to make (Y') and then evaluate with actual values (Y).The loss is calculated based on this Y' and Y
- **n_embd**: The size of the embedding, converts the OHE representation of the character into a vec of n_embd dimensions
- **n_head**: Number of heads of self-attention (for multi-head attention)
- **n_layer**: Number of Blocks in the transformer blocks in the GPT model
- **dropout**: Values between 0 and 1 represent the probability of keeping a neuron's output during training

In [3]:
batch_size = 128
block_size = 512
max_iters = 25000
eval_interval = 1000
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 250
n_embd = 512
n_head = 16
n_layer = 4
dropout = 0.2

# 3. Loading Text FIles (PDFs)

All 5 books of the series "Percy Jackson and the Olympians" were used

A list of strings is used to store the text

In [4]:
output_texts = []
# path = "/content/drive/MyDrive/Books"
path = "/kaggle/input/percy-jackson"

In [5]:
counter =0
for pdf_file in os.listdir(path):

    if pdf_file.endswith(".pdf"):
        counter+=1
        print("BOOK",counter,"\n\n\n")
        pdf_path = os.path.join(path, pdf_file)

        reader = PdfReader(pdf_path)
        text = ""

        for i, page in enumerate(tqdm(reader.pages, desc=f"Processing {pdf_file}", unit="page")):
            text += page.extract_text() + "\n"

            # Check if the current page is a multiple of 7
            if (i + 1) % 35 == 0:
                print(f"Scanned completely up to page {i + 1} of {pdf_file}")

        # Append the final text to the list
        output_texts.append(text)
        print("\n\n\n\n\n\n\n")


BOOK 1 








Processing PJ_b1.pdf:  11%|█         | 40/370 [00:01<00:13, 24.73page/s]

Scanned completely up to page 35 of PJ_b1.pdf


Processing PJ_b1.pdf:  20%|██        | 74/370 [00:03<00:14, 21.06page/s]

Scanned completely up to page 70 of PJ_b1.pdf


Processing PJ_b1.pdf:  29%|██▊       | 106/370 [00:04<00:10, 25.27page/s]

Scanned completely up to page 105 of PJ_b1.pdf


Processing PJ_b1.pdf:  39%|███▉      | 145/370 [00:06<00:10, 21.04page/s]

Scanned completely up to page 140 of PJ_b1.pdf


Processing PJ_b1.pdf:  48%|████▊     | 179/370 [00:07<00:07, 26.39page/s]

Scanned completely up to page 175 of PJ_b1.pdf


Processing PJ_b1.pdf:  57%|█████▋    | 212/370 [00:09<00:07, 21.77page/s]

Scanned completely up to page 210 of PJ_b1.pdf


Processing PJ_b1.pdf:  67%|██████▋   | 249/370 [00:11<00:04, 25.54page/s]

Scanned completely up to page 245 of PJ_b1.pdf


Processing PJ_b1.pdf:  76%|███████▋  | 283/370 [00:12<00:03, 22.89page/s]

Scanned completely up to page 280 of PJ_b1.pdf


Processing PJ_b1.pdf:  85%|████████▌ | 316/370 [00:14<00:02, 20.27page/s]

Scanned completely up to page 315 of PJ_b1.pdf


Processing PJ_b1.pdf:  96%|█████████▌| 354/370 [00:15<00:00, 25.77page/s]

Scanned completely up to page 350 of PJ_b1.pdf


Processing PJ_b1.pdf: 100%|██████████| 370/370 [00:16<00:00, 22.50page/s]


















BOOK 2 








Processing PJ_b2.pdf:  19%|█▉        | 36/192 [00:01<00:09, 16.58page/s]

Scanned completely up to page 35 of PJ_b2.pdf


Processing PJ_b2.pdf:  37%|███▋      | 71/192 [00:04<00:08, 13.58page/s]

Scanned completely up to page 70 of PJ_b2.pdf


Processing PJ_b2.pdf:  56%|█████▌    | 107/192 [00:06<00:04, 17.48page/s]

Scanned completely up to page 105 of PJ_b2.pdf


Processing PJ_b2.pdf:  73%|███████▎  | 141/192 [00:08<00:03, 16.65page/s]

Scanned completely up to page 140 of PJ_b2.pdf


Processing PJ_b2.pdf:  92%|█████████▏| 177/192 [00:10<00:00, 16.73page/s]

Scanned completely up to page 175 of PJ_b2.pdf


Processing PJ_b2.pdf: 100%|██████████| 192/192 [00:11<00:00, 16.70page/s]


















BOOK 3 








Processing PJ_b3.pdf:  29%|██▊       | 59/207 [00:00<00:00, 191.62page/s]

Scanned completely up to page 35 of PJ_b3.pdf

Scanned completely up to page 70 of PJ_b3.pdf


Processing PJ_b3.pdf:  67%|██████▋   | 139/207 [00:00<00:00, 196.84page/s]

Scanned completely up to page 105 of PJ_b3.pdf

Scanned completely up to page 140 of PJ_b3.pdf


Processing PJ_b3.pdf: 100%|██████████| 207/207 [00:01<00:00, 193.39page/s]

Scanned completely up to page 175 of PJ_b3.pdf

















BOOK 4 










Processing PJ_b4.pdf:  16%|█▌        | 38/234 [00:01<00:06, 29.23page/s]

Scanned completely up to page 35 of PJ_b4.pdf


Processing PJ_b4.pdf:  32%|███▏      | 76/234 [00:02<00:05, 28.52page/s]

Scanned completely up to page 70 of PJ_b4.pdf


Processing PJ_b4.pdf:  46%|████▌     | 107/234 [00:03<00:04, 31.59page/s]

Scanned completely up to page 105 of PJ_b4.pdf


Processing PJ_b4.pdf:  62%|██████▏   | 144/234 [00:04<00:03, 29.88page/s]

Scanned completely up to page 140 of PJ_b4.pdf


Processing PJ_b4.pdf:  76%|███████▋  | 179/234 [00:06<00:01, 28.84page/s]

Scanned completely up to page 175 of PJ_b4.pdf


Processing PJ_b4.pdf:  91%|█████████▏| 214/234 [00:07<00:00, 30.02page/s]

Scanned completely up to page 210 of PJ_b4.pdf


Processing PJ_b4.pdf: 100%|██████████| 234/234 [00:08<00:00, 29.19page/s]


















BOOK 5 








Processing PJ_b5.pdf:  24%|██▎       | 47/200 [00:00<00:01, 90.60page/s]

Scanned completely up to page 35 of PJ_b5.pdf


Processing PJ_b5.pdf:  44%|████▎     | 87/200 [00:00<00:01, 89.86page/s]

Scanned completely up to page 70 of PJ_b5.pdf


Processing PJ_b5.pdf:  58%|█████▊    | 117/200 [00:01<00:00, 93.17page/s]

Scanned completely up to page 105 of PJ_b5.pdf


Processing PJ_b5.pdf:  80%|███████▉  | 159/200 [00:01<00:00, 99.46page/s]

Scanned completely up to page 140 of PJ_b5.pdf


Processing PJ_b5.pdf:  90%|████████▉ | 179/200 [00:02<00:00, 77.66page/s]

Scanned completely up to page 175 of PJ_b5.pdf


Processing PJ_b5.pdf: 100%|██████████| 200/200 [00:02<00:00, 90.02page/s]





















> Finding **starting points** for each book as the introduction section is not needed (doesn't provide any useful information)

> The actual useful sections start from Chapter 1

In [6]:
starting_index= [pdf.find("ONE") for pdf in output_texts]
print(starting_index)

[2922, 54, 94, 103, 85]


In [7]:
for i,starting_point in enumerate(starting_index):
    print(output_texts[i][starting_point:starting_point+100])
    print("\n\n\n")

ONE

I ACCIDENTALLY VAPORIZE MY PRE-ALGEBRA

TEACHER

L

ook, I didn’t want to be a half-blood.

If you’r









ONE

	

MY	BEST	FRIEND	SHOPS

FOR	A	WEDDING	DRESS

My	nightmare	started	like	this.

I	was	standing	on	a	d









ONE  

MY RESCUE OPERATION GOES VERY WRONG  

  

The Friday before winter break, my mom packed me an o









ONE  

 

I BATTLE THE 

CHEERLEADING SQUAD 

 

The last thing I wanted to do on my summer break wa s bl









ONE 

 

I  GO  CRUISING  WITH 

EXPLOSIVES 

 

The end of the world started when a pegasus landed on th










> We see that there are instances of additional spaces and tabs and thus this can lead to some discrepancies in the final output

> The **endpoints** also need to be identified as there are other additional things found at the end of novels, like previews to other books or acknowledgments etc

In [8]:
print(output_texts[0][-10:-1])

com.












In [9]:
print(output_texts[1][-75:-1])

s.”

Table	of	Contents

Percy	Jackson	2

The	Sea	Monsters

by

Rick	Riordan

ONE


In [10]:
print(output_texts[2][-3:])



 




In [11]:
print(output_texts[3][-6:])

ut.” 




In [12]:
ending_index = [0,0,0,0,0]

ending_index[0] = -6
ending_index[1] = -71
ending_index[2] = -2
ending_index[3] = -5
ending_index[4] = output_texts[-1].find("ACKNOWLEDGMENTS")


> We now know the starting and ending points for all the books, and thus we **clip the list elements**

In [13]:
output_texts = [output_texts[i][starting_index[i]:ending_index[i]] for i in range(len(output_texts))]

combined = "\n\n\n".join(output_texts)

Checking if there any characters present only in the last book which aren't present in the others

In [14]:
firstfour = "\n".join(output_texts[:-1])
last = output_texts[-1]
print(set(last)-set(firstfour))

{'τ', 'ε', 'Π', 'φ', 'ς', 'ι', 'Δ', 'ο', 'ή', 'η', 'κ', 'ρ', '~', 'ί', 'σ', 'Ω', 'υ', 'ύ'}


In [15]:
for i,text in enumerate(output_texts):
    print("\n\n\nBook",i+1,":")
    print("\n\nStart:\n")
    print(text[:25])
    print("\n\nEnd:\n")
    print(text[-25:])







Book 1 :





Start:



ONE

I ACCIDENTALLY VAPORI





End:



 at 

www.rickriordan.com.







Book 2 :





Start:



ONE

	

MY	BEST	FRIEND	SHOP





End:



aid.	“Daughter	of	Zeus.”









Book 3 :





Start:



ONE  

MY RESCUE OPERATION





End:



id, ' I await you...'"  









Book 4 :





Start:



ONE  

 

I BATTLE THE 

CHE





End:



e 

got a lot to talk abou







Book 5 :





Start:



ONE 

 

I  GO  CRUISING  W





End:



 I didn't look back. 

 

 


# 4. Encoding Text

This is a character level text generation system, thus Label Encoding is sufficient, and Embedding is the next step

The first step would be to identify all the unique characters and then build functions to encode and decode the text

In [16]:
chars = sorted(list(set(combined)))
vocab_size = len(chars)
print("Total number of unique characters:",vocab_size)
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

Total number of unique characters: 117


In [17]:
print("This is a list of all the tokens and the characters they represent")
for i,token in enumerate(chars):
    print(i,":",token,"<>")

This is a list of all the tokens and the characters they represent

0 : 	 <>

1 : 

 <>

2 :   <>

3 : ! <>

4 : " <>

5 : # <>

6 : $ <>

7 : & <>

8 : ' <>

9 : ( <>

10 : ) <>

11 : * <>

12 : + <>

13 : , <>

14 : - <>

15 : . <>

16 : / <>

17 : 0 <>

18 : 1 <>

19 : 2 <>

20 : 3 <>

21 : 4 <>

22 : 5 <>

23 : 6 <>

24 : 7 <>

25 : 8 <>

26 : 9 <>

27 : : <>

28 : ; <>

29 : ? <>

30 : A <>

31 : B <>

32 : C <>

33 : D <>

34 : E <>

35 : F <>

36 : G <>

37 : H <>

38 : I <>

39 : J <>

40 : K <>

41 : L <>

42 : M <>

43 : N <>

44 : O <>

45 : P <>

46 : Q <>

47 : R <>

48 : S <>

49 : T <>

50 : U <>

51 : V <>

52 : W <>

53 : X <>

54 : Y <>

55 : Z <>

56 : a <>

57 : b <>

58 : c <>

59 : d <>

60 : e <>

61 : f <>

62 : g <>

63 : h <>

64 : i <>

65 : j <>

66 : k <>

67 : l <>

68 : m <>

69 : n <>

70 : o <>

71 : p <>

72 : q <>

73 : r <>

74 : s <>

75 : t <>

76 : u <>

77 : v <>

78 : w <>

79 : x <>

80 : y <>

81 : z <>

82 : ~ <>

83 : ° <>

84 : Ô <>

85 : á

In [18]:
encoded_texts = [torch.tensor(encode(text), dtype=torch.long) for text in output_texts]

for text in encoded_texts:
    print(text[:50])
    print(type(text))
    print(text.shape)
    print("\n\n")

tensor([44, 43, 34,  1, 38,  2, 30, 32, 32, 38, 33, 34, 43, 49, 30, 41, 41, 54,

         2, 51, 30, 45, 44, 47, 38, 55, 34,  2, 42, 54,  2, 45, 47, 34, 14, 30,

        41, 36, 34, 31, 47, 30,  1, 49, 34, 30, 32, 37, 34, 47])

<class 'torch.Tensor'>

torch.Size([502990])







tensor([44, 43, 34,  1,  0,  1, 42, 54,  0, 31, 34, 48, 49,  0, 35, 47, 38, 34,

        43, 33,  0, 48, 37, 44, 45, 48,  1, 35, 44, 47,  0, 30,  0, 52, 34, 33,

        33, 38, 43, 36,  0, 33, 47, 34, 48, 48,  1, 42, 80,  0])

<class 'torch.Tensor'>

torch.Size([359216])







tensor([44, 43, 34,  2,  2,  1, 42, 54,  2, 47, 34, 48, 32, 50, 34,  2, 44, 45,

        34, 47, 30, 49, 38, 44, 43,  2, 36, 44, 34, 48,  2, 51, 34, 47, 54,  2,

        52, 47, 44, 43, 36,  2,  2,  1,  2,  2,  1, 49, 63, 60])

<class 'torch.Tensor'>

torch.Size([410240])







tensor([44, 43, 34,  2,  2,  1,  2,  1, 38,  2, 31, 30, 49, 49, 41, 34,  2, 49,

        37, 34,  2,  1, 32, 37, 34, 34, 47, 41, 34, 30, 33, 38, 43, 36,  2, 48,

We now have a list of tensors, we cannot stack them together as they are of diferent lengths and padding can lead to complications in obtaining batches

# 5. Obtaining batches

- There is a need to **evenly obtain the samples from all the books**, and thus determine the samples to be taken per tensor
- ix provides a random list of starting indices per tensor, ensuring that the entire block_size of context can be obtained (no out of bounds error)
- A tensor of shape **(samples_per_tensor, block_size)** is appended to x_tensors and y_tensors per book
- x and y are the concaternated form of x_tensors and y-tensors, and their shape is **(batch_size,block_size)**

In [19]:
print(len(encoded_texts[-1].view(1,-1)))

1


Final book is used for testing, all others are used for training

In [20]:
def get_batch(func):
    sample =""
    if func == 'train':
        sample = encoded_texts[:-1]
    elif func == 'val':
        sample = encoded_texts[-1].view(1,-1)
    samples_per_tensor = batch_size //len(sample)
    x_tensors, y_tensors = [], []
    for tensor in sample:
        ix = torch.randint(len(tensor) - block_size, (samples_per_tensor,))
        #print("idx:",len(ix))
        x_tensors.append(torch.stack([tensor[i:i+block_size] for i in ix]))
        y_tensors.append(torch.stack([tensor[i+1:i+block_size+1] for i in ix]))
        #print("x:",len(x_tensors),x_tensors[-1].shape,"\n\n")

    x = torch.cat(x_tensors, dim=0)
    y = torch.cat(y_tensors, dim=0)

    x, y = x.to(device), y.to(device)
    return x, y


# 6. Implementing a HEAD of Self-Attention

<div style="text-align:center">
    <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*piCQbDMPO1-Kw5ZiNAl-FA.png" alt="Image" style="width:500px;height:300px;"/>
    <br>
    <a href="https://medium.com/@saba99/self-attention-0b21baad0a48" style="font-size: smaller;" > <b>Source</b> </a>

</div>

- The transformer model utilises parallelism and calculates 3 metrics:
    - *Query*: What the current point is looking for
    - *Key*: What the current point can offer
    - *Value*: The value of the current point<br>
<br>
- For all tokens to be generated (max_new_tokens) the QKVs are calculated and stored in a 3 individual tensors
- Now the **affinity** of current token's query needs to be calculated with all other output_tokens' keys, and this is calculated simply using the **dot product** (done via matrix multiplication with transpose)
- Since this is a decoder only transformer, the self attention model will consider key and values from all **previous tokens** to generate current output, and not the tokens to be generated in the future
- A **lower traingular matrix** is used discard all future tokens
- The dot product reveal the weights given to the values of all tokens, This needs to be **scaled down** to prevent extremities that might occur when doing softmax
    - eg - softmaxing 5 and 50 vs softmaxing 0.5 and 5
- the scaled weights need to **softmaxed** to be scaled between 0 and 1
- The **attention scores** have thus been obtained
- Thus the ouput will be the **weighted average** of all previous tokens' values in the block_size
- This is done for all tokens


In [21]:
class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()

        self.key = nn.Sequential(
            nn.Linear(n_embd, head_size // 2),
            nn.Tanh(),
            nn.Linear(head_size // 2, head_size, bias=False)
        )
        self.query = nn.Sequential(
            nn.Linear(n_embd, head_size // 2),
            nn.Tanh(),
            nn.Linear(head_size // 2, head_size, bias=False)
        )
        self.value = nn.Sequential(
            nn.Linear(n_embd, head_size // 2),
            nn.Tanh(),
            nn.Linear(head_size // 2, head_size, bias=False)
        )

        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):

        Ba,Bl,C = x.shape # x has shape (batch_size, block_size, n_embd)
        k = self.key(x)   # k has shape (batch_size, block_size, head_size)
        q = self.query(x) # q has shape (batch_size, block_size, head_size)
        aff = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (Ba, Bl, hs) @ (Ba, hs, Bl) -> (Ba, Bl, Bl)
        aff = aff.masked_fill(self.tril[:Bl, :Bl] == 0, float('-inf')) # (Ba, Bl, Bl)
        aff = F.softmax(aff, dim=-1) # (Ba, Bl, Bl)
        aff = self.dropout(aff)
        # perform the weighted aggregation of the values
        v = self.value(x) # (Ba,Bl,head_size)
        att_out = aff @ v # (Ba, Bl, Bl) @ (Ba, Bl, head_size) -> (Ba, Bl, head_size)
        return att_out


# 7. Implementing Multi-head Attention

<div style="text-align:center">
    <img src="https://production-media.paperswithcode.com/methods/multi-head-attention_l1A3G7a.png" alt="Image" style="width:300px;height:400px;"/>
    <br>
    <a href="https://paperswithcode.com/method/multi-head-attention" style="font-size: smaller;" > <b>Source</b> </a>

</div>

- To further improve the accuracy, multiple heads of self attention can be employed in **parallel**
- The paper "Attention is all you need" used 8 heads of attention
- The outputs of all the heads are concatenated together
<br><br>
- The self.heads retrieves num_heads individual Head instances, which are processed in PARALLEL, not Sequentitally
- In the forward pass, each of these heads produces an output of shape **(batch_size,block_size,head_size)**
- These are concatenated in the -1 dimension, so the output is **(batch_size,block_size,head_size * num_heads)**
- This is where the self.proj is used to transform the output to **(batch_size,block_size,n_embd)**

In [22]:
class MultiHeadAttention(nn.Module):

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for cur_head in range(num_heads)])
        self.proj = nn.Sequential(
            nn.Linear(head_size * num_heads, n_embd//2),
            nn.Tanh(),
            nn.Linear(n_embd//2, n_embd)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([head(x) for head in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

# 8. FeedForward and Layer Normalisation for Residual Connection in a Block

<div style="text-align:center">
    <img src="https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/Transformer-neural-network-17.png" alt="Image" style="width:300px;height:400px;"/>
    <br>
    <a href="https://builtin.com/artificial-intelligence/transformer-neural-network" style="font-size: smaller;" > <b>Source</b> </a>

</div>

- The x value (post position embedding) is **saved** and used as input to calculate the **self-attention (weighted average) tensor** for each token, after **normalisation**
- The original x is **added** to this self-attention value, using a **residual connection**, resulting in the updated value of x
- The new updated x is also **saved**, **normalised** and passed to the **Feed Forward network**, to improve on the **non-linearity**
- Using a **residual connection**, and is this is **added** to the new updated x
- LayerNorm is a Layer Normalisation layer, it doesn't have trainabale weights, but it does have learnable parameters (**gamma** and **beta**)

In [23]:
class FeedForward(nn.Module):

    def __init__(self, n_embd):
        super().__init__()
        self.nonlin = nn.Sequential(
            nn.Linear(n_embd, 6 * n_embd),
            nn.Tanh(),
            nn.Linear(6 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.nonlin(x)


In [24]:
class Block(nn.Module):

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.attention = MultiHeadAttention(n_head, head_size)
        self.ffd = FeedForward(n_embd)
        self.norm1 = nn.LayerNorm(n_embd)
        self.norm2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        new_updated_x = x + self.attention(self.norm1(x))
        final_x = new_updated_x + self.ffd(self.norm2(new_updated_x))
        return final_x


# 9. Pre-Block Creation

<div style="text-align:center">
    <img src="https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/Transformer-neural-network-21.png" alt="Image" style="width:150px;height:400px;"/>
    <br>
    <a href="https://builtin.com/artificial-intelligence/transformer-neural-network" style="font-size: smaller;" > <b>Source</b> </a>

</div>

- When the input character is received, it'll have been **Label Encoded**
- This needs to be converted:
    - One-Hot-Vector is not enough as it fails to identify the relations between words
    - Thus an **embedding vector** is needed
- The position occupied by the idx is also important as it helps understand **long range dependencies**, as well as intricate inherent **word ordering** present in the novels

In [25]:
def long_tanh(x):
    return x.tanh().long()

In [26]:
class PreBlock(nn.Module):

    def __init__(self):
        super().__init__()

#         self.emb1 = nn.Embedding (vocab_size, n_embd//2)
#         self.emb2 = nn.Embedding (n_embd//2, n_embd)

#         self.pos1 = nn.Embedding (vocab_size, n_embd//2)
#         self.pos2 = nn.Embedding (n_embd//2, n_embd)

        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)

        self.position_embedding_table = nn.Embedding(block_size, n_embd)


#         self.token_embedding_table = nn.Sequential(
#                                         nn.Embedding(vocab_size, n_embd // 2),
#                                         nn.Tanh(),
#                                         nn.Embedding( n_embd // 2, n_embd )
#                                     )



#         self.position_embedding_table = nn.Sequential(
#                                             nn.Embedding(vocab_size, n_embd // 2),
#                                             nn.Tanh(),
#                                             nn.Embedding( n_embd // 2, n_embd )
#                                         )



    def forward(self, idx):
        B, T = idx.shape

#         tok_emb = self.emb1(idx)
#         tok_emb = torch.tanh(tok_emb)
#         tok_emb = tok_emb.long()
#         tok_emb = self.emb2(tok_emb)

        tok_emb = self.token_embedding_table(idx)

        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))  # 2D, no Batch dim
        #print(tok_emb.shape,"\n",pos_emb.shape)
        pos_emb = pos_emb.expand_as(tok_emb)  # 3D


        embedded_output = tok_emb + pos_emb  # 3D
        return embedded_output

# 10. Post-Block Creation

<div style="text-align:center">
    <img src="https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/Transformer-neural-network-25.png" alt="Image" style="width:150px;height:400px;"/>
    <br>
    <a href="https://builtin.com/artificial-intelligence/transformer-neural-network" style="font-size: smaller;" > <b>Source</b> </a>

</div>

- The output froom the blocks needs to be normalised to promote better distribution of probabilities
- Following this, a softmax layer is used to convert the **(batch_size, block_size, n_embd)** to **(batch_size, block_size, vocab_size)**


In [27]:
class PostBlock(nn.Module):

    def __init__(self):
        super().__init__()
        self.fin_norm = nn.LayerNorm(n_embd)
        self.soft_score = nn.Sequential(
                           nn.Linear(n_embd, vocab_size // 2),
                           nn.Tanh(),
                           nn.Linear( vocab_size // 2, vocab_size )
                        )
    def forward(self, x):
        x = self.fin_norm(x)
        logits = self.soft_score(x)
        return logits

# 11. Putting it all together (Transformer)

- This section will combine the previous sections to make the transformer class
- The 3 sections are:
    - Pre-Block
    - Block
    - Post-Block

- Following this a **cross entropy loss** will be used to assess loss and perform **back propagration**

In [28]:
class Transformer(nn.Module):

    def __init__(self):
        super().__init__()
        self.pre_block = PreBlock()
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.post_block = PostBlock()

        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):

        x = self.pre_block(idx)
        x = self.blocks(x)
        logits = self.post_block(x)

        if targets is None:
            loss = None
        else:
            batch, block, vocab = logits.shape
            logits = logits.view(batch*block, vocab)
            targets = targets.view(batch*block)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:] #look at the text generated so far and sonly consider the context window

            logits, loss = self(idx_cond) # get the predictions

            logits = logits[:, -1, :] # considering the loss for the probabilities for the last token's upcoming token
            #its shape is now (batch_size, vocab_size)

            probs = F.softmax(logits, dim=-1) # apply softmax to get probabilities

            idx_next = torch.multinomial(probs, num_samples=1) # get next token, shape is (batch_size, 1)

            idx = torch.cat((idx, idx_next), dim=1) # append sampled index to the running sequence,idx shape is now (batch_size, block_size+1)
        return idx

# 12. Instantiation and Model Training

In [29]:
model = Transformer()
m = model.to(device) #Utilise GPU if available

print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters',end = "\n\n\n")

#print(torchsummary.summary(model, input_size=(batch_size, block_size)),end = "\n\n\n")

#for name, param in model.named_parameters():
#    print(name, param.shape)


optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

15.691057 M parameters






- Making a Visual estimator of loss
- A specific train and test test hasn't been separated from the original dataset
- So the final book is used as a test and first 4 books are used for training

In [30]:
def estimate_loss():
    out = {}
    model.eval()
    for state in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(state)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[state] = losses.mean()
    model.train()
    return out

### Training Loop

In [31]:
# for iter in range(max_iters):

#     # every once in a while evaluate the loss on train and val sets
#     if iter % eval_interval == 0 or iter == max_iters - 1:
#         losses = estimate_loss()
#         print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

#     # sample a batch of data
#     xb, yb = get_batch('train')

#     # evaluate the loss
#     logits, loss = model(xb, yb)
#     optimizer.zero_grad(set_to_none=True)
#     loss.backward()
#     optimizer.step()

best_val_loss = float('inf')  # Initialize best validation loss
patience = 3  # Number of epochs to wait for improvement
epochs_since_improvement = 0  # Track number of epochs without improvement

alternate = False

for iter in range(max_iters):

    # Evaluate loss every eval_interval iterations or at the end
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"\n\nstep {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}\n")
        if(bool):
            context = torch.zeros((1, 1), dtype=torch.long, device=device)
            print(decode(m.generate(context, max_new_tokens=100)[0].tolist()))

        alternate = !alternate

        # Early stopping based on validation loss
        if losses['val'] < best_val_loss:
            best_val_loss = losses['val']
            epochs_since_improvement = 0  # Reset counter if validation loss improves
        else:
            epochs_since_improvement += 1

        if epochs_since_improvement >= patience:
            print("Early stopping triggered!")
            break  # Exit the loop

    # Training steps (unchanged)
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()






step 0: train loss 4.7578, val loss 4.7563



	#ZJZV0”rήh,ύHëD;Ôυbx/+T0…W-"ίηUJUi*y:FΩDig7m“WBr‘k'YF,B0;cñο–°rMD∆ιPsMv"Ω	*GΠêñr6~AΔÔPrUb’“L0yτ”ςqςφ





step 1000: train loss 2.2225, val loss 2.2400



	ked	he	hyou	dold	adr6Vis,dnngh	odea Annns	sral	toar Gldre..	ςas	I	whe	ulυ	trurncke	pnotht’s	to	on$ï 





step 2000: train loss 1.4958, val loss 1.5560



	was	wabord	acrossing	abou commment.	But	subblyed	now	the	tornïd	molion	the	doorn	whiole	ήise	will, w





step 3000: train loss 1.2847, val loss 1.3889



	bullsider	call	the	centaure	wofinal.	If	my	thought	sto	would	be	to	this

hippocampus	of	kind	of	smile





step 4000: train loss 1.1734, val loss 1.3297



	wheere	in

head,	the	though	it	was	we	dazed	with	hit	insidiffeed	of	the	menaother	camper	of	the	dayti





step 5000: train loss 1.0891, val loss 1.2981



	voice	the	distancidely,	high	Cyclops	were	eyes.

Grover	ran	our	face	still.	The	bearbal	that	thoughts





step 6000: train loss 1.0176, val loss 1.2936



	guard	and	

### Sample Generation

In [96]:
# context = torch.zeros((1, 1), dtype=torch.long, device=device)

context = torch.tensor([[36]], dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=50)[0].tolist()))


Grover.

I didn’t understand when I returned to ripp


# 13. Saving, Calculating Size and Loading Models Weights

In [35]:
#saving weights
torch.save(model.state_dict(), 'TransformerModel.pth')


files.download('TransformerModel.pth')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [41]:
#calculating size
print("The size of weights is:", round(os.path.getsize('TransformerModel.pth')/1024**2,2), "MB" )

The size of weights is: 124.13 MB


In [42]:
#loading weights
sample_model = Transformer()
sample_model.load_state_dict(torch.load('TransformerModel.pth'))


<All keys matched successfully>

In [43]:
#structure of sample model (matches main model)
sample_model.eval()

Transformer(
  (pre_block): PreBlock(
    (token_embedding_table): Embedding(117, 512)
    (position_embedding_table): Embedding(512, 512)
  )
  (blocks): Sequential(
    (0): Block(
      (attention): MultiHeadAttention(
        (heads): ModuleList(
          (0-15): 16 x Head(
            (key): Sequential(
              (0): Linear(in_features=512, out_features=16, bias=True)
              (1): Tanh()
              (2): Linear(in_features=16, out_features=32, bias=False)
            )
            (query): Sequential(
              (0): Linear(in_features=512, out_features=16, bias=True)
              (1): Tanh()
              (2): Linear(in_features=16, out_features=32, bias=False)
            )
            (value): Sequential(
              (0): Linear(in_features=512, out_features=16, bias=True)
              (1): Tanh()
              (2): Linear(in_features=16, out_features=32, bias=False)
            )
            (dropout): Dropout(p=0.2, inplace=False)
          )
        )
  

In [63]:
#transferring model to GPU and making a sample generation
sample_model_GPU = sample_model.to(device)
context = torch.tensor([[36]], dtype=torch.long, device=device)

print(decode(sample_model_GPU.generate(context, max_new_tokens=175)[0].tolist()))

Grover and touched around. “A second,” I said. “But it 

isn’t my fault. He thought I’m not get close, he’ll  turn sixteen.” 

Chiron kept his eyes and he stood. “That’s not West


# Final Statistics:

- System RAM used: 2.7GB
- GPU RAM used: 35.5GB
- Disk Space used: 26.5GB
- Model Weights File Size: 124.13MB