# Building a GPT


Importing the dataset from Kaggle. I've used the Wikipedia Sentences dataset (`mikeortman/wikipedia-sentences`).

In [1]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("mikeortman/wikipedia-sentences")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\dccha\.cache\kagglehub\datasets\mikeortman\wikipedia-sentences\versions\3


In [2]:
# set the path dir
import os
base_path = os.path.join(path, 'wikisent2.txt')
print(base_path)
# read it in to inspect it
with open(base_path,'r',encoding='utf-8') as f:
  text = f.read()

C:\Users\dccha\.cache\kagglehub\datasets\mikeortman\wikipedia-sentences\versions\3\wikisent2.txt


In [3]:
print("Length of dataset in characters: ",len(text))

Length of dataset in characters:  934571982


In [4]:
# let us see the first 2000 characters
print(text[:2000])

0.000123, which corresponds to a distance of 705 Mly, or 216 Mpc.
000webhost is a free web hosting service, operated by Hostinger.
0010x0010 is a Dutch-born audiovisual artist, currently living in Los Angeles.
0-0-1-3 is an alcohol abuse prevention program developed in 2004 at Francis E. Warren Air Force Base based on research by the National Institute on Alcohol Abuse and Alcoholism regarding binge drinking in college students.
0.01 is the debut studio album of H3llb3nt, released on February 20, 1996 by Fifth Colvmn Records.
001 of 3 February 1997, which was signed between the Government of the Republic of Rwanda, and FAPADER.
003230 is a South Korean food manufacturer.
0.04%Gas molecules in soil are in continuous thermal motion according to the kinetic theory of gasses, there is also collision between molecules - a random walk.
0.04% of the votes were invalid.
005.1999.06 is the fifth studio album by the South Korean singer and actress Uhm Jung-hwa.
005 is a 1981 arcade game by Sega.

In [5]:
# list the unique characters that occur in the text, in sorted order
chars = sorted(set(text))
vocab_size = len(chars)
print(''.join(chars))
print("Vocabulary size : ",vocab_size)


 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
Vocabulary size :  96


In [6]:
# mappings between characters and integer ids
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for i,ch in enumerate(chars)}
# encoder: takes a string and returns a list of integers
encode = lambda s: [stoi[c] for c in s]
# decoder: takes a list of integers and returns the string
decode = lambda l: ''.join([itos[u] for u in l])

print(encode('hello world'))
print(decode(encode('I am Anirban Chakraborty')))

[73, 70, 77, 77, 80, 1, 88, 80, 83, 77, 69]
I am Anirban Chakraborty
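
The ids printed above follow directly from the sorted vocabulary: newline comes first, then the 95 printable ASCII characters from space to `~`. The sketch below rebuilds that vocabulary without the dataset to check the mapping. It assumes the dataset's character set is exactly what the vocabulary printout shows; `sample` and `ids` are names introduced only for this illustration.

```python
# Rebuild the 96-character vocabulary: newline plus ASCII codes 32..126.
# Assumption: this matches sorted(set(text)) for the dataset, as the printout suggests.
chars = ['\n'] + [chr(c) for c in range(32, 127)]

stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Map each character of s to its integer id."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of integer ids back to a string."""
    return ''.join(itos[i] for i in ids)

sample = 'hello world'
ids = encode(sample)
print(ids)

# With this vocabulary, every printable character c gets the id ord(c) - 31.
print(all(i == ord(c) - 31 for c, i in zip(sample, ids)))

# Encoding followed by decoding returns the original string.
print(decode(ids) == sample)
```

Note that `encode` raises a `KeyError` for any character that is not in the vocabulary, so text from another source may need to be filtered first.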


In [None]:
# now let us encode the entire text dataset and store it in a tensor
# torch.tensor builds the tensor from the list of integer ids
import torch
data = torch.tensor(encode(text),dtype=torch.long)
print(data.shape, data.dtype)
print(data[:2000]) #first 2000 characters after encoding
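
Encoding roughly 934 million characters through a Python list can be slow and memory-hungry. One possible alternative, sketched here under the assumption that the text is pure ASCII (which the vocabulary printout suggests), is to map bytes to ids with `bytes.translate`. The helper names `build_table` and `encode_bytes` are hypothetical and not part of the notebook.

```python
def build_table(chars):
    """Return a 256-byte lookup table mapping each byte value to its token id."""
    table = bytearray(256)  # caution: bytes outside the vocabulary silently map to id 0
    for i, ch in enumerate(chars):
        table[ord(ch)] = i
    return bytes(table)

def encode_bytes(text, table):
    """Encode an ASCII string into token ids, one id per byte, without a Python list."""
    # str.encode('ascii') raises UnicodeEncodeError if the text is not pure ASCII.
    return text.encode('ascii').translate(table)

# Same vocabulary as the notebook: newline plus ASCII codes 32..126 (assumed).
chars = ['\n'] + [chr(c) for c in range(32, 127)]
table = build_table(chars)

encoded = encode_bytes('0.000123,', table)
print(list(encoded))  # [17, 15, 17, 17, 17, 18, 19, 20, 13]
```

The resulting `bytes` object holds one id per byte. It could then be wrapped in a tensor, for example with `torch.frombuffer(bytearray(encoded), dtype=torch.uint8).long()`, though that call should be checked against the installed PyTorch version.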

In [None]:
# let us now split the data up into train and validation sets
n = int(0.9*len(data)) # the first 90% is training data, the rest is validation
train_data = data[:n]
val_data = data[n:]

In [None]:
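# block_size is the maximum context length used for a prediction
block_size = 8
# a chunk of block_size+1 tokens yields block_size (context, target) pairs
train_data[:block_size+1]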

tensor([17, 15, 17, 17, 17, 18, 19, 20, 13])

In [None]:
# y is x shifted by one position, so y[t] is the token that follows x[:t+1]
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"When input is {context}, the target is {target}")

When input is tensor([17]), the target is 15
When input is tensor([17, 15]), the target is 17
When input is tensor([17, 15, 17]), the target is 17
When input is tensor([17, 15, 17, 17]), the target is 17
When input is tensor([17, 15, 17, 17, 17]), the target is 18
When input is tensor([17, 15, 17, 17, 17, 18]), the target is 19
When input is tensor([17, 15, 17, 17, 17, 18, 19]), the target is 20
When input is tensor([17, 15, 17, 17, 17, 18, 19, 20]), the target is 13
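
The same pairs are easier to read when decoded back to characters. The sketch below uses plain Python lists instead of tensors and rebuilds the vocabulary as newline plus printable ASCII, an assumption based on the vocabulary printout. `chunk` and `pairs` are names introduced only for this illustration.

```python
# Vocabulary assumed to match the notebook: newline plus ASCII codes 32..126.
chars = ['\n'] + [chr(c) for c in range(32, 127)]
itos = dict(enumerate(chars))

block_size = 8
# The ids of train_data[:block_size+1] as printed in the notebook.
chunk = [17, 15, 17, 17, 17, 18, 19, 20, 13]

x = chunk[:block_size]       # inputs
y = chunk[1:block_size + 1]  # targets: the inputs shifted by one position

pairs = []
for t in range(block_size):
    context = ''.join(itos[i] for i in x[:t + 1])  # everything up to and including position t
    target = itos[y[t]]                            # the character that comes next
    pairs.append((context, target))
    print(f"context {context!r} -> target {target!r}")
```

The first pair is `'0'` followed by `'.'` and the last is `'0.000123'` followed by `','`, so a single chunk of `block_size + 1` tokens supplies `block_size` training examples with context lengths from 1 to `block_size`.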
