Sources:

*   https://pypi.org/project/gpt-2-simple/#description
*   https://medium.com/@stasinopoulos.dimitrios/a-beginners-guide-to-training-and-generating-text-using-gpt2-c2f2e1fbd10a
*   https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce#scrollTo=VHdTL8NDbAh3
*   https://github.com/ak9250/gpt-2-colab
*   https://www.aiweirdness.com/d-and-d-character-bios-now-making-19-03-15/
*   https://minimaxir.com/2019/09/howto-gpt2/





[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zawemi/GS2DIT/blob/main/Class%203/gpt_2_shakespeare.ipynb#scrollTo=4tIUvFbLMUuE)

#Let's teach AI to write like Shakespeare 🎓

##Installing the model

In [19]:
#install the library we'll use today
!pip install gpt-2-simple

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


##Generating text with basic model

###Importing and loading necessary components

In [None]:
#import what we need
import gpt_2_simple as gpt2 #for gpt-2 (our AI model)
import os #lets us work with files and folders
import requests #this one helps to download from the internet

In [None]:
#and let's download our AI model
gpt2.download_gpt2()   # model is saved into current directory under /models/124M/

Fetching checkpoint: 1.05Mit [00:00, 683Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 5.11Mit/s]
Fetching hparams.json: 1.05Mit [00:00, 436Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:10, 45.7Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 788Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 6.39Mit/s]
Fetching vocab.bpe: 1.05Mit [00:00, 6.85Mit/s]
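
Re-running the download cell fetches roughly 500 MB again. Below is a minimal sketch of skipping the download when the files are already on disk, assuming the `models/124M` layout mentioned in the cell comment and the file names shown in the log above. The helper names `model_is_downloaded` and `ensure_model` are illustrative and not part of gpt-2-simple.

```python
import os

# File names copied from the download log above.
EXPECTED_FILES = [
    "checkpoint", "encoder.json", "hparams.json",
    "model.ckpt.data-00000-of-00001", "model.ckpt.index",
    "model.ckpt.meta", "vocab.bpe",
]

def model_is_downloaded(model_name="124M", model_dir="models"):
    """Return True if every expected file exists under model_dir/model_name."""
    folder = os.path.join(model_dir, model_name)
    return all(os.path.isfile(os.path.join(folder, name)) for name in EXPECTED_FILES)

def ensure_model(model_name="124M", model_dir="models"):
    """Download the model only when it is missing (hypothetical helper)."""
    if model_is_downloaded(model_name, model_dir):
        print(f"{model_name} already present, skipping download")
        return
    import gpt_2_simple as gpt2  # imported here so the check works without the library
    gpt2.download_gpt2(model_dir=model_dir, model_name=model_name)
```

Calling `ensure_model()` in place of `gpt2.download_gpt2()` should behave the same on a fresh runtime and return immediately on later runs.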


In [None]:
#starting the session so we can play with the gpt-2 model
sess = gpt2.start_tf_sess()

In [None]:
#we load the model from file to use it
gpt2.load_gpt2(sess, run_name='124M', checkpoint_dir='models')

Loading checkpoint models/124M/model.ckpt


###Text generation

In [None]:
#the prefix is the opening text that the model will continue
prefix = "Is there a second Earth?"

In [None]:
#the model is generating text
gpt2.generate(sess, run_name='124M', checkpoint_dir='models', prefix=prefix, length=50)

Is there a second Earth?

I don't know. I don't think I can understand that. I mean, I'm not saying it's a planet, but it's a planet with a planet. At the end of the day, we don't know what happened
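
The sample above stops mid-sentence because `length=50` limits the number of generated tokens. Below is a hedged sketch of collecting several samples as Python strings (via the `return_as_list` and `nsamples` arguments of `gpt2.generate`) and trimming each one to its last complete sentence. `trim_to_last_sentence` and `generate_texts` are illustrative helpers, and the trimming rule is deliberately simple: it ignores closing quotation marks.

```python
def trim_to_last_sentence(text):
    """Cut text after its last '.', '!' or '?'; return it unchanged if none is found."""
    cut = max(text.rfind(ch) for ch in ".!?")
    return text if cut == -1 else text[:cut + 1]

def generate_texts(sess, prefix, nsamples=3, length=50):
    """Return generated samples as a list of strings instead of printing them."""
    import gpt_2_simple as gpt2  # lazy import: only needed when actually generating
    samples = gpt2.generate(
        sess, run_name='124M', checkpoint_dir='models',  # same model location as above
        prefix=prefix, length=length,
        nsamples=nsamples,       # number of independent samples
        return_as_list=True,     # return the texts rather than printing them
    )
    return [trim_to_last_sentence(s) for s in samples]
```

For example, `generate_texts(sess, prefix)` would return three continuations of the prefix, each ending on a sentence boundary when one exists.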


##Generating text with improved (finetuned) model

**IMPORTANT**
<br>Restart the runtime (Runtime -> Restart runtime) before continuing, so that finetuning starts from a fresh TensorFlow session.

###Importing and loading necessary components

In [1]:
#import what we need
import gpt_2_simple as gpt2 #for gpt-2 (our AI model)
import os #lets us work with files and folders
import requests #this one helps to download from the internet

In [2]:
#get nietzsche texts
!wget "https://s3.amazonaws.com/text-datasets/nietzsche.txt"

--2023-05-24 11:37:34--  https://s3.amazonaws.com/text-datasets/nietzsche.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.90.22, 52.217.232.248, 54.231.228.176, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.90.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 600901 (587K) [text/plain]
Saving to: ‘nietzsche.txt’


2023-05-24 11:37:35 (1.34 MB/s) - ‘nietzsche.txt’ saved [600901/600901]



In [3]:
#game of thrones from https://www.kaggle.com/datasets/khulasasndh/game-of-thrones-books?select=001ssb.txt
!gdown "1CrL1wde_NGO68i5Prd_UNA_oW0cGQsxg&confirm=t"
!mv /content/001ssb.txt /content/got1.txt

Downloading...
From: https://drive.google.com/uc?id=1CrL1wde_NGO68i5Prd_UNA_oW0cGQsxg&confirm=t
To: /content/001ssb.txt
  0% 0.00/1.63M [00:00<?, ?B/s]100% 1.63M/1.63M [00:00<00:00, 78.1MB/s]


In [4]:
#let's download a file with a selection of Shakespeare's plays (the tinyshakespeare dataset)
!wget "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
!mv /content/input.txt /content/shakespeare.txt

--2023-05-24 11:37:43--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2023-05-24 11:37:43 (50.4 MB/s) - ‘input.txt’ saved [1115394/1115394]
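
Before finetuning, it can help to confirm that each download produced a readable text file rather than an error page. The sketch below reports the size, line count, and a short preview of a file. The helper name `describe_text_file` is illustrative, and the file names are the ones used in the cells above.

```python
import os

def describe_text_file(path, preview_chars=200):
    """Return size, character count, line count and a short preview of a text file."""
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    return {
        "bytes": os.path.getsize(path),
        "characters": len(text),
        "lines": len(text.splitlines()),
        "preview": text[:preview_chars],
    }

# Inspect whichever of the datasets are present in the working directory.
for name in ["nietzsche.txt", "got1.txt", "shakespeare.txt"]:
    if os.path.exists(name):
        info = describe_text_file(name)
        print(name, info["bytes"], "bytes,", info["lines"], "lines")
        print(info["preview"])
```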



In [5]:
#starting the session so we can play with the gpt-2 model
sess = gpt2.start_tf_sess()

###Teaching our model

In [6]:
#finetuning on got1.txt, which teaches the model to write like Game of Thrones
#(pass 'shakespeare.txt' instead to teach it to write like Shakespeare)
#it takes a lot of time (~15min)...
gpt2.finetune(sess, 'got1.txt', steps=500)   # steps is the maximum number of training steps

Loading checkpoint models/124M/model.ckpt
Loading dataset...


100%|██████████| 1/1 [00:02<00:00,  2.15s/it]


dataset has 433157 tokens
Training...
[1 | 7.22] loss=3.51 avg=3.51
[2 | 9.29] loss=3.37 avg=3.44
[3 | 11.35] loss=3.41 avg=3.43
[4 | 13.43] loss=3.29 avg=3.39
[5 | 15.52] loss=3.26 avg=3.37
[6 | 17.60] loss=3.38 avg=3.37
[7 | 19.68] loss=3.34 avg=3.36
[8 | 21.76] loss=3.35 avg=3.36
[9 | 23.85] loss=3.26 avg=3.35
[10 | 25.94] loss=3.11 avg=3.33
[11 | 28.04] loss=3.25 avg=3.32
[12 | 30.15] loss=3.08 avg=3.30
[13 | 32.27] loss=3.24 avg=3.29
[14 | 34.37] loss=3.22 avg=3.29
[15 | 36.48] loss=3.14 avg=3.28
[16 | 38.59] loss=3.19 avg=3.27
[17 | 40.71] loss=3.16 avg=3.26
[18 | 42.83] loss=3.33 avg=3.27
[19 | 44.95] loss=3.09 avg=3.26
[20 | 47.07] loss=3.20 avg=3.26
[21 | 49.20] loss=3.12 avg=3.25
[22 | 51.33] loss=3.17 avg=3.24
[23 | 53.47] loss=3.17 avg=3.24
[24 | 55.61] loss=3.04 avg=3.23
[25 | 57.75] loss=3.24 avg=3.23
[26 | 59.89] loss=2.98 avg=3.22
[27 | 62.04] loss=3.08 avg=3.21
[28 | 64.20] loss=2.99 avg=3.21
[29 | 66.36] loss=2.89 avg=3.19
[30 | 68.52] loss=2.98 avg=3.18
[31 | 70.69] 
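
Each training line above appears to follow the pattern `[step | elapsed seconds] loss=... avg=...`, where `avg` is a running average of the loss. Under that assumption, the sketch below parses a few lines copied from the log to check that the loss is falling and to estimate the time per step. `parse_training_log` is an illustrative helper, not part of gpt-2-simple.

```python
import re

# Matches lines such as "[1 | 7.22] loss=3.51 avg=3.51"; incomplete lines are skipped.
LOG_LINE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")

def parse_training_log(log_text):
    """Extract (step, seconds, loss, avg) tuples from finetuning output."""
    rows = []
    for match in LOG_LINE.finditer(log_text):
        step, seconds, loss, avg = match.groups()
        rows.append((int(step), float(seconds), float(loss), float(avg)))
    return rows

# A few lines copied from the output above, including the truncated last one.
sample = """[1 | 7.22] loss=3.51 avg=3.51
[2 | 9.29] loss=3.37 avg=3.44
[30 | 68.52] loss=2.98 avg=3.18
[31 | 70.69] """

rows = parse_training_log(sample)
seconds_per_step = (rows[-1][1] - rows[0][1]) / (rows[-1][0] - rows[0][0])
print(f"steps parsed: {len(rows)}, avg loss {rows[0][3]} -> {rows[-1][3]}")
print(f"about {seconds_per_step:.2f} s per step")
```

On this run the estimate is about 2.1 seconds per step, so 500 steps would take roughly 17 to 18 minutes.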

###Text generation

In [7]:
prefix = "Is there a second Earth?"

In [8]:
gpt2.generate(sess, prefix=prefix, length=150)

Is there a second Earth? Do you want to sleep?" 
"Not me," he said. "I'm only a tree, I see it all around me." He stared up at the 
tree, at the faces of the men and women who had borne him with them, at the faces of the fires 
that had burned right overhead, at the faces of the horses that had not been able to come to a trot 
when the mules that surrounded them had gone down. He wanted to cry like a baby . . . but his tears 
were gone now. He wanted nothing better than to go home and live out his dream again. 
The words came unbidden to Bran's mouth. They were the only words he could


###Saving model to Google Drive (optional)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
gpt2.copy_checkpoint_to_gdrive(run_name='run1')
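
To reuse the saved model in a later Colab session, the checkpoint can be copied back from Google Drive and loaded. The sketch below assumes the default run name `run1`, the default `checkpoint` folder that gpt-2-simple writes to, and that Drive is already mounted as in the cell above. `checkpoint_exists` and `restore_finetuned_model` are illustrative helpers.

```python
import os

def checkpoint_exists(run_name="run1", checkpoint_dir="checkpoint"):
    """Return True if checkpoint_dir/run_name exists and contains at least one file."""
    folder = os.path.join(checkpoint_dir, run_name)
    return os.path.isdir(folder) and len(os.listdir(folder)) > 0

def restore_finetuned_model(run_name="run1"):
    """Copy the checkpoint from Drive if needed, then load it into a new session."""
    import gpt_2_simple as gpt2  # lazy import so the check above works without the library
    if not checkpoint_exists(run_name):
        # Requires drive.mount('/content/drive') to have been run first.
        gpt2.copy_checkpoint_from_gdrive(run_name=run_name)
    sess = gpt2.start_tf_sess()
    gpt2.load_gpt2(sess, run_name=run_name)
    return sess
```

After `sess = restore_finetuned_model()`, the `gpt2.generate(sess, prefix=prefix, length=150)` call from the text generation section should work without retraining.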

You can find more texts, e.g. on Project Gutenberg:
https://www.gutenberg.org/cache/epub/1597/pg1597.txt
<br><br>
You can download them to Colab using code similar to the cells below.

In [None]:
#!wget https://www.gutenberg.org/cache/epub/1597/pg1597.txt

--2023-03-21 14:49:16--  https://www.gutenberg.org/cache/epub/1597/pg1597.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 329071 (321K) [text/plain]
Saving to: ‘pg1597.txt’


2023-03-21 14:49:22 (800 KB/s) - ‘pg1597.txt’ saved [329071/329071]



In [None]:
#!wget https://www.gutenberg.org/files/98/98-0.txt

--2023-02-22 13:25:10--  https://www.gutenberg.org/files/98/98-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 807231 (788K) [text/plain]
Saving to: ‘98-0.txt’


2023-02-22 13:25:12 (718 KB/s) - ‘98-0.txt’ saved [807231/807231]



In [None]:
#https://github.com/matt-dray/tng-stardate/tree/master/data/scripts