Sources:

*   https://pypi.org/project/gpt-2-simple/#description
*   https://medium.com/@stasinopoulos.dimitrios/a-beginners-guide-to-training-and-generating-text-using-gpt2-c2f2e1fbd10a
*   https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce#scrollTo=VHdTL8NDbAh3
*   https://github.com/ak9250/gpt-2-colab
*   https://www.aiweirdness.com/d-and-d-character-bios-now-making-19-03-15/
*   https://minimaxir.com/2019/09/howto-gpt2/





[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zawemi/GS2DIT/blob/main/Class%203/gpt_2_shakespeare.ipynb#scrollTo=4tIUvFbLMUuE)

#Let's teach AI to write like Shakespeare 🎓

##Installing the model

In [8]:
#install the library we'll use today
!pip install gpt-2-simple

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


##Generating text with the basic model

###Importing and loading necessary components

In [9]:
#import what we need
import gpt_2_simple as gpt2 #for gpt-2 (our AI model)
import os #lets us work with files and folders
import requests #helps us download files from the internet

In [None]:
#and let's download our AI model
gpt2.download_gpt2()   # model is saved into current directory under /models/124M/

Fetching checkpoint: 1.05Mit [00:00, 683Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 5.11Mit/s]
Fetching hparams.json: 1.05Mit [00:00, 436Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:10, 45.7Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 788Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 6.39Mit/s]
Fetching vocab.bpe: 1.05Mit [00:00, 6.85Mit/s]
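
Running the download cell again fetches roughly 500 MB a second time. Below is a minimal sketch of a check you could run first. It assumes the seven file names in the log above are the complete set for the 124M model, and `model_is_downloaded` is a hypothetical helper written for this notebook, not part of gpt-2-simple.

```python
import os

# File names copied from the download log above (assumed to be the full set).
MODEL_FILES = [
    "checkpoint",
    "encoder.json",
    "hparams.json",
    "model.ckpt.data-00000-of-00001",
    "model.ckpt.index",
    "model.ckpt.meta",
    "vocab.bpe",
]

def model_is_downloaded(model_name="124M", model_dir="models"):
    """Return True if every expected model file exists in model_dir/model_name."""
    folder = os.path.join(model_dir, model_name)
    return all(os.path.isfile(os.path.join(folder, name)) for name in MODEL_FILES)

# Prints True once the download cell above has finished, False before that.
print(model_is_downloaded())
```

With this in place, the download cell could be written as `if not model_is_downloaded(): gpt2.download_gpt2()`.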


In [None]:
#starting the session so we can play with the gpt-2 model
sess = gpt2.start_tf_sess()

In [None]:
#we load the downloaded model from models/124M so we can use it
gpt2.load_gpt2(sess, run_name='124M', checkpoint_dir='models')

Loading checkpoint models/124M/model.ckpt


###Text generation

In [None]:
#the prefix is the text the model will continue from
prefix = "Is there a second Earth?"

In [None]:
#the model generates up to 50 tokens of text that continue the prefix
gpt2.generate(sess, run_name='124M', checkpoint_dir='models', prefix=prefix, length=50)

Is there a second Earth?

I don't know. I don't think I can understand that. I mean, I'm not saying it's a planet, but it's a planet with a planet. At the end of the day, we don't know what happened
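
Generation stops after `length` tokens, so the sample above ends in the middle of a sentence. One way to tidy this up is to cut the text after its last sentence-ending mark. The sketch below is only an illustration: `trim_to_last_sentence` is a hypothetical helper, and it treats every `.`, `!` or `?` as a sentence end, so abbreviations such as "e.g." will fool it.

```python
import re

def trim_to_last_sentence(text):
    """Cut text after its last '.', '!' or '?' (plus an optional closing quote or bracket)."""
    ends = [m.end() for m in re.finditer(r'[.!?]["\')\]]?', text)]
    # If there is no sentence-ending mark at all, return the text unchanged.
    return text[:ends[-1]] if ends else text

# A shortened copy of the sample generated above.
sample = (
    "Is there a second Earth?\n\n"
    "I don't know. I don't think I can understand that. "
    "At the end of the day, we don't know what happened"
)
print(trim_to_last_sentence(sample))
```

To apply it to real output, call `gpt2.generate` with `return_as_list=True`, which returns the generated texts as a list of strings instead of printing them.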


##Generating text with the improved (finetuned) model

**IMPORTANT**
<br>Restart the runtime (Runtime -> Restart runtime) before running the cells below, so that finetuning starts with a fresh TensorFlow session.

###Importing and loading necessary components

In [1]:
#import what we need
import gpt_2_simple as gpt2 #for gpt-2 (our AI model)
import os #lets us work with files and folders
import requests #helps us download files from the internet

In [2]:
#download the Nietzsche texts
!wget "https://s3.amazonaws.com/text-datasets/nietzsche.txt"

--2023-05-24 11:33:23--  https://s3.amazonaws.com/text-datasets/nietzsche.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.135.216, 52.216.53.152, 54.231.131.248, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.135.216|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 600901 (587K) [text/plain]
Saving to: ‘nietzsche.txt’


2023-05-24 11:33:24 (4.45 MB/s) - ‘nietzsche.txt’ saved [600901/600901]



In [3]:
#download a Game of Thrones book from https://www.kaggle.com/datasets/khulasasndh/game-of-thrones-books?select=001ssb.txt
!gdown "1CrL1wde_NGO68i5Prd_UNA_oW0cGQsxg&confirm=t"
!mv /content/001ssb.txt /content/got1.txt

Downloading...
From: https://drive.google.com/uc?id=1CrL1wde_NGO68i5Prd_UNA_oW0cGQsxg&confirm=t
To: /content/001ssb.txt
  0% 0.00/1.63M [00:00<?, ?B/s]100% 1.63M/1.63M [00:00<00:00, 216MB/s]


In [4]:
#let's download a file with a selection of Shakespeare's plays (the tinyshakespeare dataset)
!wget "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
!mv /content/input.txt /content/shakespeare.txt

--2023-05-24 11:33:31--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2023-05-24 11:33:31 (22.1 MB/s) - ‘input.txt’ saved [1115394/1115394]
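
Before finetuning, it can help to confirm that each file really contains the text you expect rather than, say, an error page. The sketch below prints a few basic statistics and the first characters of each file. `describe_text_file` is a hypothetical helper, and the three file names are the ones produced by the cells above. The byte counts should match the `Length:` values in the wget logs (600901 for `nietzsche.txt`, 1115394 for the Shakespeare file).

```python
import os

def describe_text_file(path, preview_chars=200):
    """Return basic statistics and a short preview of a plain-text file."""
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    return {
        "size_bytes": os.path.getsize(path),
        "lines": len(text.splitlines()),
        "words": len(text.split()),
        "preview": text[:preview_chars],
    }

for name in ["nietzsche.txt", "got1.txt", "shakespeare.txt"]:
    if os.path.isfile(name):  # skip files that were not downloaded
        stats = describe_text_file(name)
        print(name, "-", stats["size_bytes"], "bytes,", stats["words"], "words")
        print(stats["preview"])
```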



In [5]:
#starting the session so we can play with the gpt-2 model
sess = gpt2.start_tf_sess()

###Teaching our model

In [6]:
#finetuning with nietzsche.txt (which, to be honest, means that we are teaching the model how to write like Nietzsche)
#to teach it to write like Shakespeare instead, pass 'shakespeare.txt' (or 'got1.txt' for Game of Thrones)
#it takes a lot of time (~15min)...
gpt2.finetune(sess, 'nietzsche.txt', steps=500)   # steps is the max number of training steps

Loading checkpoint models/124M/model.ckpt
Loading dataset...


100%|██████████| 1/1 [00:00<00:00,  1.16it/s]


dataset has 143770 tokens
Training...
[1 | 6.54] loss=4.11 avg=4.11
[2 | 8.68] loss=4.09 avg=4.10
[3 | 10.81] loss=4.03 avg=4.07
[4 | 12.95] loss=3.88 avg=4.03
[5 | 15.09] loss=3.63 avg=3.94
[6 | 17.24] loss=3.85 avg=3.93
[7 | 19.39] loss=3.87 avg=3.92
[8 | 21.55] loss=3.82 avg=3.91
[9 | 23.71] loss=3.65 avg=3.88
[10 | 25.89] loss=3.42 avg=3.83
[11 | 28.06] loss=3.87 avg=3.83
[12 | 30.24] loss=3.85 avg=3.83
[13 | 32.43] loss=3.56 avg=3.81
[14 | 34.62] loss=3.77 avg=3.81
[15 | 36.81] loss=3.75 avg=3.80
[16 | 39.00] loss=3.72 avg=3.80
[17 | 41.21] loss=3.51 avg=3.78
[18 | 43.42] loss=3.75 avg=3.78
[19 | 45.64] loss=3.69 avg=3.77
[20 | 47.86] loss=3.63 avg=3.77
[21 | 50.09] loss=3.61 avg=3.76
[22 | 52.32] loss=3.62 avg=3.75
[23 | 54.56] loss=3.48 avg=3.74
[24 | 56.81] loss=3.38 avg=3.72
[25 | 59.05] loss=3.64 avg=3.72
[26 | 61.31] loss=3.63 avg=3.71
[27 | 63.56] loss=3.60 avg=3.71
[28 | 65.82] loss=3.63 avg=3.70
[29 | 68.09] loss=3.57 avg=3.70
[30 | 70.37] loss=3.39 avg=3.69
[31 | 72.65] 
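
Each training line appears to follow the pattern `[step | seconds since training started] loss=... avg=...`. Assuming that reading is right, the log can be used to estimate how long the full run will take. The sketch below parses the first and the last complete line shown above; `parse_log` is a hypothetical helper written for this notebook.

```python
import re

# First and last complete lines of the log above, copied verbatim.
log = """[1 | 6.54] loss=4.11 avg=4.11
[30 | 70.37] loss=3.39 avg=3.69"""

LINE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")

def parse_log(text):
    """Return a list of (step, seconds, loss, avg) tuples found in training output."""
    return [(int(s), float(t), float(l), float(a)) for s, t, l, a in LINE.findall(text)]

rows = parse_log(log)
(first_step, first_t, _, _), (last_step, last_t, _, _) = rows[0], rows[-1]

# Average time per step between the two lines, then scaled up to steps=500.
sec_per_step = (last_t - first_t) / (last_step - first_step)
print(f"{sec_per_step:.2f} s per step")
print(f"about {sec_per_step * 500 / 60:.0f} min for 500 steps")
```

For this run the estimate comes out at about 2.2 s per step, or roughly 18 minutes for 500 steps. The real time depends on the hardware Colab assigns to the session.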

###Text generation

In [7]:
#the same prefix as before, so we can compare the results
prefix = "Is there a second Earth?"

In [8]:
#generate text with the finetuned model (saved under the default run name 'run1')
gpt2.generate(sess, prefix=prefix, length=150)

Is there a second Earth?

69. That which is national is best explained by the contrast it adopts
between its neighbours and themselves. A people which
its neighbour commands, its own greatness and its own
necessity dictates how closely and profoundly the nation feels
inherited values and how closely it is controlled and controlled
by the state. A people that does not feel in its own interest any
malice or crime that might result from its doing anything
ascertained in the obedience of another, that feels itself bound and constrained
by its neighbours, that is, it depends upon the former for its
independence and its identity, or it may be led astray by the
terrestrial, its limited and henceforward, its sensitive and



###Saving model to Google Drive (optional)

In [None]:
#connect Google Drive to this Colab session (you will be asked for permission)
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
gpt2.copy_checkpoint_to_gdrive(run_name='run1')
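
The cell above copies the finetuned model to Google Drive, so it survives when the Colab session ends. To keep a copy somewhere else, you can pack the checkpoint folder into a single archive yourself. The sketch below is a rough stand-in and not the library's own implementation. It assumes the finetuned model sits in the default location `checkpoint/run1`, and `archive_checkpoint` is a hypothetical helper.

```python
import os
import tarfile

def archive_checkpoint(run_name="run1", checkpoint_dir="checkpoint", dest_dir="."):
    """Pack checkpoint_dir/run_name into dest_dir/checkpoint_<run_name>.tar and return its path."""
    source = os.path.join(checkpoint_dir, run_name)
    if not os.path.isdir(source):
        raise FileNotFoundError(f"no checkpoint folder at {source}")
    archive_path = os.path.join(dest_dir, f"checkpoint_{run_name}.tar")
    with tarfile.open(archive_path, "w") as tar:
        tar.add(source)  # stores the folder and everything inside it
    return archive_path

# Only runs if a finetuned model exists in the assumed default location.
if os.path.isdir(os.path.join("checkpoint", "run1")):
    print(archive_checkpoint())
```

In a later session, gpt-2-simple's `gpt2.copy_checkpoint_from_gdrive(run_name='run1')` brings the Drive copy back, after which `gpt2.load_gpt2(sess, run_name='run1')` loads it.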

You can find more texts on Project Gutenberg, e.g.:
https://www.gutenberg.org/cache/epub/1597/pg1597.txt
<br><br>
You can download them to Colab using code similar to the cells below.

In [None]:
#!wget https://www.gutenberg.org/cache/epub/1597/pg1597.txt

--2023-03-21 14:49:16--  https://www.gutenberg.org/cache/epub/1597/pg1597.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 329071 (321K) [text/plain]
Saving to: ‘pg1597.txt’


2023-03-21 14:49:22 (800 KB/s) - ‘pg1597.txt’ saved [329071/329071]



In [None]:
#!wget https://www.gutenberg.org/files/98/98-0.txt

--2023-02-22 13:25:10--  https://www.gutenberg.org/files/98/98-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 807231 (788K) [text/plain]
Saving to: ‘98-0.txt’


2023-02-22 13:25:12 (718 KB/s) - ‘98-0.txt’ saved [807231/807231]
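
Project Gutenberg files usually begin and end with license text, and the model would learn to imitate that too. Below is a minimal sketch for cutting it off before finetuning. It assumes the book is wrapped in marker lines that start with `*** START OF` and `*** END OF`. The exact wording varies between files, so check the downloaded file first. `strip_gutenberg_boilerplate` is a hypothetical helper.

```python
def strip_gutenberg_boilerplate(text):
    """Keep only the text between the '*** START OF' and '*** END OF' marker lines."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1  # the book begins after the start marker
        elif line.startswith("*** END OF"):
            end = i  # the book stops before the end marker
            break
    # If no markers are found, the whole text is returned.
    return "\n".join(lines[start:end]).strip()

# A tiny made-up example in the assumed layout.
sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "\n"
    "Once upon a time.\n"
    "\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License footer"
)
print(strip_gutenberg_boilerplate(sample))
```

To use it on a downloaded file, read the file into a string, pass it through the function, and write the result to a new file that you then give to `gpt2.finetune`.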



In [None]:
#https://github.com/matt-dray/tng-stardate/tree/master/data/scripts