
This notebook follows Andrej Karpathy's GPT tutorial. Video: https://www.youtube.com/watch?v=kCc8FmEb1nY. Code: https://github.com/karpathy/nanoGPT.

Course page: https://karpathy.ai/zero-to-hero.html

# Libraries

In [11]:
import torch

# Shakespeare

## Setup

In [1]:
#@title ##### download data
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2023-02-18 19:45:18--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2023-02-18 19:45:18 (20.6 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [2]:
#@title ##### open file
with open('input.txt', 'r', encoding='utf-8') as f:
  shakespeareText = f.read()

In [6]:
#@title ##### looking at the data
print('Length of dataset in characters:', len(shakespeareText))
print(shakespeareText[:200])

Length of dataset in characters: 1115394
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you


In [7]:
#@title ##### setup character vocab
# unique characters in the text, sorted by code point
shakespeareChars = sorted(set(shakespeareText))
shakespeareVocabSize = len(shakespeareChars)
print(''.join(shakespeareChars))
print('Vocab size is', shakespeareVocabSize)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocab size is 65
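
The vocabulary printout starts with what looks like an empty line because the newline character sorts before the space, and both sort before the punctuation and letters. The sketch below illustrates that ordering on a hypothetical toy string (`toy_text` and `toy_chars` are illustrative names, not part of the notebook).

```python
# Hypothetical toy string standing in for the Shakespeare text.
toy_text = "ab a\nb!"

# set() drops duplicate characters; sorted() orders them by Unicode code point.
toy_chars = sorted(set(toy_text))

# repr() makes the whitespace characters visible; ord() shows the sort keys.
print([repr(ch) for ch in toy_chars])
print([ord(ch) for ch in toy_chars])
```

Because the ordering depends only on the set of characters, rebuilding the vocabulary from the same text always assigns the same index to each character.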


In [10]:
#@title ##### mapping characters and indices
# string-to-index and index-to-string lookup tables
shakespeareStoI = { ch:i for i,ch in enumerate(shakespeareChars) }
shakespeareItoS = { i:ch for i,ch in enumerate(shakespeareChars) }
# encode: string -> list of integer token ids
shakespeareEncode = lambda s: [shakespeareStoI[c] for c in s]
# decode: list of integer token ids -> string
shakespeareDecode = lambda l: ''.join(shakespeareItoS[i] for i in l)

print(shakespeareEncode('Shakespeare is cool!'))
print(shakespeareDecode(shakespeareEncode('Shakespeare is cool!')))

# subword tokenizers used in practice:
# Google's SentencePiece: https://github.com/google/sentencepiece
# OpenAI's tiktoken: https://github.com/openai/tiktoken

[31, 46, 39, 49, 43, 57, 54, 43, 39, 56, 43, 1, 47, 57, 1, 41, 53, 53, 50, 2]
Shakespeare is cool!
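
A character-level tokenizer like this one can only encode characters that were present when the vocabulary was built. The following is a minimal, self-contained sketch of that behaviour on a toy vocabulary; the names `sample_text`, `stoi`, `itos`, `encode`, and `decode` are hypothetical stand-ins for the notebook's own variables.

```python
# Hypothetical toy corpus; the notebook builds its vocabulary from input.txt.
sample_text = "hello world"
chars = sorted(set(sample_text))

# Lookup tables in both directions, as in the cell above.
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Map each character to its integer id."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map integer ids back to a string."""
    return ''.join(itos[i] for i in ids)

# Round trip: decoding an encoded string returns the original string.
ids = encode("hello")
print(ids, decode(ids))

# A character outside the vocabulary raises KeyError on lookup.
try:
    encode("xyz")
except KeyError as err:
    missing = err.args[0]
    print("not in vocabulary:", repr(missing))
```

Subword tokenizers avoid this failure mode at the cost of a much larger vocabulary, which is one reason they are preferred outside of small experiments.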


In [12]:
#@title ##### encoding Shakespeare into a tensor
data = torch.tensor(shakespeareEncode(shakespeareText), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:200])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59])


In [13]:
#@title ##### splitting train and validation sets
# hold out the first 10% of the tokens for validation, train on the remaining 90%
n = int(0.1*len(data))
train_data = data[n:]
val_data = data[:n]
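
As a quick sanity check on the split, the sketch below repeats the same slicing on a hypothetical toy list (`toy_data` is an illustrative stand-in; slicing a 1-D tensor follows the same index rules). With the 1,115,394-token dataset above, the same arithmetic should give 111,539 validation tokens and 1,003,855 training tokens.

```python
# Hypothetical stand-in for the encoded dataset.
toy_data = list(range(20))

# int() truncates, so the validation set never exceeds 10% of the data.
n = int(0.1 * len(toy_data))
toy_val = toy_data[:n]    # first 10%
toy_train = toy_data[n:]  # remaining 90%

# The two slices cover the data exactly once, with no overlap.
assert len(toy_val) + len(toy_train) == len(toy_data)
assert set(toy_val).isdisjoint(toy_train)

print(len(toy_train), len(toy_val))
```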

## Modelling

In [None]:
#@title ##### 


