In [None]:
!pip install tokenizers
!pip install transformers[torch]
!pip install accelerate

In [None]:
import os
os._exit(00)

In [2]:
import torch
torch.cuda.is_available()

True

# Prompt Engineering 101
## _How to make sweet sweet prompts_
_aka you can make loads of money from this_

## Part 1 :: It's all about the tokens
### Tokenizers

_what is a tokenizer?_

Tokenizers break text up into _words_, _parts of words_, and _phrases_, and give each piece a unique token that the software can recognize.

Some tokenizers work by splitting up a sentence or phrase based on the spaces between words, but some tokenizers take a more machine-centered approach and split them up by bytes.

![Tokenizers](../img/Tokenizer.jpg)

In the computer world everything is made of bits and bytes. Have you heard of the terms megabits and megabytes? Some people would want you to believe they are the same thing, but they aren't.

- a _bit_ is a single binary digit: a 0 or a 1
- a _byte_ is made up of 8 _bits_

The difference between a megabit and a megabyte is 7 million bits.
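
Here is that arithmetic as a quick sanity check. It assumes the decimal (SI) meaning of "mega", i.e. one million; the variable names are just for illustration.

```python
# Assumes the decimal (SI) prefix: mega = 1,000,000.
BITS_PER_BYTE = 8
MEGA = 1_000_000

megabit_in_bits = MEGA                    # 1 megabit  = 1,000,000 bits
megabyte_in_bits = MEGA * BITS_PER_BYTE   # 1 megabyte = 8,000,000 bits

difference = megabyte_in_bits - megabit_in_bits
print(difference)
```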

The way a byte-level tokenizer works is that it starts from the individual bytes of a sentence, phrase, or paragraph and then groups byte sequences that show up often into single tokens. A really long or unusual word takes up more bytes in memory, so it usually ends up as more tokens.
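
If you want to peek at the bytes behind a piece of text yourself, plain Python can show them. This is only a sketch of the raw material a byte-level tokenizer starts from, not a tokenizer; the sample words are my own picks.

```python
# Look at the raw UTF-8 bytes behind a few strings.
# The sample words are arbitrary illustrations.
samples = ["hi", "somewhere", "café"]

byte_counts = {}
for text in samples:
    raw = text.encode("utf-8")      # str -> bytes
    byte_counts[text] = len(raw)
    print(text, list(raw))          # each byte is a number from 0 to 255
```

Note that the accented letter takes two bytes, so the number of bytes is not always the number of characters.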

Let's look at how a tokenizer works by feeding some phrases to the OpenAI GPT-3 tokenizer: [link](https://platform.openai.com/tokenizer)

![Tokenizer](../images/tokenizer.PNG)

If we turn on _show token id_ we can see the unique identifier for each token in the GPT-3 tokenizer.

![Token ID](../images/token-id.PNG)

Let's run the word _somewhere_ through the system. You might assume that it would have a single unique identifier. But it doesn't!

![Somewhere](../images/somewhere.PNG)

As you can see, the word is actually 3 different tokens: the letter _s_, the piece _ome_, and the word _here_.

Let's look at this diagram of the steps for feeding a phrase into a tokenizer and then into a model.

![Tokenizer Steps](../images/tokenizer_steps.png)

The goal of tokenizers is to transform natural language into something a computer can understand and perform math operations on:


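So this bit of code

```
# Assumption: the checkpoint is not named in this notebook; this one uses a vocabulary that matches the ids shown
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="tf")
```

will turn the phrases:

- _"I've been waiting for a HuggingFace course my whole life."_
- _"I hate this so much!"_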

into this tensor that the computer can understand:

```
{
    'input_ids': <tf.Tensor: shape=(2, 16), dtype=int32, numpy=
        array([
            [  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,  2026,  2878,  2166,  1012,   102],
            [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,     0,     0,     0,     0,     0,     0]
        ], dtype=int32)>,
    'attention_mask': <tf.Tensor: shape=(2, 16), dtype=int32, numpy=
        array([
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
        ], dtype=int32)>
}
```
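
The zeros at the end of the second row are padding, and the attention mask marks which positions hold real tokens (1) and which hold padding (0). Here is a minimal pure-Python sketch of that idea. The `pad_batch` helper and the short id lists are made up for illustration; the real tokenizer does this for you when you pass `padding=True`.

```python
PAD_ID = 0  # assumption: 0 is the padding id, as in the output above

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real ids, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Toy token ids; real ids come from the tokenizer's vocabulary.
batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```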

### _wut is a tensor?_

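A tensor is a multi-dimensional array of numbers. The `input_ids` tensor above has shape `(2, 16)`: 2 sentences, each 16 token ids long.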
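
To see what a shape means in practice, here is a small sketch using plain nested lists. The `shape` helper is hypothetical and only handles regular (non-ragged) lists; libraries such as PyTorch and TensorFlow report this for you through a tensor's `.shape`.

```python
def shape(nested):
    """Return the shape of a regular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None  # step one level deeper
    return tuple(dims)

scalar = 7                       # 0 dimensions
vector = [1, 2, 3]               # 1 dimension
matrix = [[1, 2, 3], [4, 5, 6]]  # 2 dimensions: 2 rows x 3 columns

print(shape(scalar), shape(vector), shape(matrix))
```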

### HOW GPUS WORK


### What does this all Mean?

I know there is a lot to unpack in the above and we won't get into that yet.  You're probably confused and wondering why I'm telling you all of this, but it will make sense shortly.

Let's look at phone numbers!

(Country Code)-(Area Code)-(Exchange)-(Extension)

so something like:

+1-917-XXX-XXXX

So if you think about a phone number here in the States, it is a set of eleven digits that make up 4 different tokens. The country code is a token shared by everyone in the States, the area code is a token shared by everyone in that area, the exchange is a token shared by everyone in that exchange, and the last four digits are your extension token within the exchange, within the area code, within the country code. When you put those eleven digits in the correct order, they create a unique address for your phone.
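
The same breakdown in a few lines of Python. The number is a made-up example (555-01xx numbers are reserved for fiction), and the dictionary keys are just labels I picked.

```python
# Hypothetical example number, not a real one.
phone_number = "+1-917-555-0123"

# Splitting on the dashes gives the four "tokens" of the address.
country_code, area_code, exchange, extension = phone_number.split("-")

tokens = {
    "country_code": country_code,
    "area_code": area_code,
    "exchange": exchange,
    "extension": extension,
}
digit_count = sum(ch.isdigit() for ch in phone_number)

print(tokens)
print(digit_count)
```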

For generative AI, think about these tokens as the parts that make up the address of the output you want. It's a multi-dimensional location inside the latent space of the network.

### _DAN! WTF IS LATENT SPACE???_

![latent space](../images/latent_space.png)

![damn]( ../images/hold_up.gif)

Latent Space is the multi-dimensional space that makes up the 'knowledge' of Generative AI.

Think of it as a complex space where each unique item is near similar items. In the figure above we see all of the items of the same color grouped together. But because it has more than 3 dimensions (it's just represented in 3 dimensions here), some of the blue dots could contain information that is similar to the pink dots and sit close to them as well on that 4th, 5th, or Nth dimension.

Sometimes we project it down to 2D space with a dimensional reduction step to make it easier to see.
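
A toy way to play with "nearby means similar": treat each item as a point and measure distances. The words and the 4-dimensional coordinates below are invented for illustration; real latent vectors are learned by the model and have hundreds or thousands of dimensions.

```python
import math

# Made-up 4-dimensional points, one per item.
points = {
    "sleigh":   [0.9, 0.1, 0.8, 0.2],
    "reindeer": [0.8, 0.2, 0.9, 0.1],
    "invoice":  [0.1, 0.9, 0.1, 0.9],
}

def distance(a, b):
    """Euclidean distance between two points with the same number of dimensions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(name):
    """Return the other item whose point lies closest to `name`."""
    others = (other for other in points if other != name)
    return min(others, key=lambda other: distance(points[name], points[other]))

print(nearest("sleigh"))
```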

![tsne]( ../images/latent_space_tsne.jpg)

_DAN! WTF is DIMENSIONAL REDUCTION??_

Dimensional reduction (also called dimensionality reduction) is a technique used to reduce the number of features in a dataset while retaining as much of the important information as possible.

You can see here that when we project this representation of latent space down to 2D, some of the red dots are mixed in with the green dots. That is because they contain similar data points and should be near each other.
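
One classic way to do this is principal component analysis (PCA): find the direction along which the data varies the most and keep only each point's position along it. Below is a small pure-Python sketch that squashes 3D points down to 1D. The data and helper names are made up, and a real project would more likely reach for a library implementation.

```python
import math

# Made-up 3D points that mostly vary along one diagonal direction.
points = [
    [1.0, 1.1, 0.9],
    [2.0, 1.9, 2.1],
    [3.0, 3.2, 2.9],
    [4.0, 3.9, 4.1],
    [5.0, 5.1, 5.0],
]

def center(data):
    """Subtract the per-column mean so the data is centered on the origin."""
    n, dim = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(dim)]
    return [[row[j] - means[j] for j in range(dim)] for row in data]

def covariance(data):
    """Sample covariance matrix of already-centered data."""
    n, dim = len(data), len(data[0])
    return [[sum(row[i] * row[j] for row in data) / (n - 1) for j in range(dim)]
            for i in range(dim)]

def top_direction(cov, iterations=100):
    """Power iteration: repeatedly multiply by cov and normalize to approach
    the direction of greatest variance (the first principal component)."""
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(len(vec))) for i in range(len(cov))]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec

centered = center(points)
direction = top_direction(covariance(centered))

# Each 3D point becomes a single number: its position along that direction.
projected = [sum(p * d for p, d in zip(row, direction)) for row in centered]
print(projected)
```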

![laten_space_gif]( ../images/PCA_Projection_Illustration.gif)

### Putting it all together

Each type of generative AI has its own tokenizer (or, well, a lot of them use the same one lol) that translates the natural language prompts you supply into an array of numbers that represent the _address_ of what you are looking for inside the _model_ of the generative AI. This address is a unique identifier for a location inside the latent space of the AI model's _knowledge_ of what it has been trained on.


In [None]:
import accelerate
import transformers

transformers.__version__, accelerate.__version__

In [None]:
!mkdir -p /content/drive/MyDrive/models/xmas

In [None]:
!git clone https://github.com/sammcilroy/ucl_cop_christmas.git

In [None]:
import csv
import json
def convert_csv_to_json(csv_file_path):
    # Read CSV file
    with open(csv_file_path, 'r') as file:
        reader = csv.DictReader(file)
        rows = list(reader)

    # Convert CSV data to JSON
    json_data = json.dumps(rows, indent=4)

    # Save JSON data to a file (optional)
    with open('/content/drive/MyDrive/models/xmas/All_Playlists_Combined.json', 'w') as json_file:
        json_file.write(json_data)

    return json_data

# Specify the path to your CSV file
csv_file_path = '/content/ucl_cop_christmas/collected_data/All Playlists Combined.csv'

# Convert CSV to JSON
json_data = convert_csv_to_json(csv_file_path)

print("Conversion completed. JSON data:")
print(json_data)

In [17]:
import re
def remove_special_characters_and_spaces(input_string):
    # Define a regular expression pattern to match special characters and spaces
    pattern = r'[^a-zA-Z0-9]+'  # This pattern will keep only letters and digits

    # Use the sub method to replace matches of the pattern with an empty string
    clean_string = re.sub(pattern, '', input_string)

    return clean_string

def remove_special_characters(input_string):
    # Define a regular expression pattern to match special characters (but not whitespace)
    pattern = r'[^a-zA-Z0-9\s]+'  # This pattern will keep letters, digits, and whitespace

    # Use the sub method to replace matches of the pattern with an empty string
    clean_string = re.sub(pattern, '', input_string)

    return clean_string
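
To see the difference between the two patterns side by side, here is a standalone sketch that applies each regular expression directly. The song title (including the "2023 Remaster" part) is a made-up test string.

```python
import re

# Made-up title with punctuation, spaces, and digits.
title = "Rockin' Around the Christmas Tree (2023 Remaster)"

# Keeps letters and digits only: punctuation AND spaces are removed.
no_specials_no_spaces = re.sub(r'[^a-zA-Z0-9]+', '', title)

# Adding \s to the negated class keeps whitespace, so words stay separated.
no_specials = re.sub(r'[^a-zA-Z0-9\s]+', '', title)

print(no_specials_no_spaces)
print(no_specials)
```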

In [18]:
import json
import re

stats_file = "/content/drive/MyDrive/models/xmas/All_Playlists_Combined.json"
lines = []
with open(stats_file, 'r', encoding='utf-8') as f:
    xmas_songs = json.load(f)

for song in xmas_songs:
    title = remove_special_characters(song['track_name'])
    # Collapse newlines so each song sits on a single line of the training file
    lyrics = re.sub(r'\n', ' ', song['lyrics'])
    lines.append(f"##TITLE {title} ###LYRICS {lyrics} \n")

# The with-block closes the file for us, so no explicit close() is needed
with open('/content/drive/MyDrive/models/xmas/All_Playlists_Combine.txt', 'w', encoding='utf-8') as f:
    f.writelines(lines)

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!mkdir -p /content/drive/MyDrive/models/xmas/tokenizer

In [None]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
tokenizer = Tokenizer(BPE(unk_token="<unk>"))

tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>"
    ])

tokenizer.train(files=["/content/drive/MyDrive/models/xmas/All_Playlists_Combine.txt"], trainer=trainer)
tokenizer.save("/content/drive/MyDrive/models/xmas/tokenizer/xmas.json")

output = tokenizer.encode("Sleigh bells ring are you listening")
print(output.tokens)

['Sleigh', 'bells', 'ring', 'are', 'you', 'listening']
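
What is BPE training actually doing? Roughly: start with every word split into single characters, then keep merging the most frequent adjacent pair into a new symbol. The sketch below is a toy illustration of that loop with made-up word counts and hypothetical helper names; it is not how the `tokenizers` library is implemented internally.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Made-up word counts; each word starts out split into single characters.
words = {tuple("bells"): 5, tuple("bell"): 3, tuple("help"): 2, tuple("tell"): 1}

merges = []
for _ in range(3):  # three merge steps are enough to see the pattern
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

After a few merges the frequent word "bell" has become a single symbol while the rarer "help" is still in pieces, which is why common words in the training text (like the ones in the output above) tend to come out as whole tokens.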


In [7]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=12306,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

In [8]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast(tokenizer_file="/content/drive/MyDrive/models/xmas/tokenizer/xmas.json")

In [9]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)
model.num_parameters()

52979730
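
Where does that number come from? Here is a back-of-the-envelope count of the weights. It takes the values from the config above and assumes two library defaults that the config does not set explicitly (`hidden_size=768`, `intermediate_size=3072`), and also assumes that the output layer shares its weights with the token embeddings.

```python
# Values from the RobertaConfig above, plus two assumed library defaults.
vocab_size = 12306
max_positions = 514
type_vocab_size = 1
num_layers = 6
hidden = 768          # assumption: RobertaConfig default hidden_size
intermediate = 3072   # assumption: RobertaConfig default intermediate_size

def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # one scale and one shift value per hidden unit

# Token, position, and type embedding tables, followed by a layer norm.
embeddings = (vocab_size + max_positions + type_vocab_size) * hidden + layer_norm

per_layer = (
    4 * linear(hidden, hidden)        # query, key, value, attention output
    + layer_norm
    + linear(hidden, intermediate)    # feed-forward up-projection
    + linear(intermediate, hidden)    # feed-forward down-projection
    + layer_norm
)

# LM head: one dense layer, a layer norm, and an output bias per vocabulary entry.
# Assumption: the output projection reuses the token embedding weights.
lm_head = linear(hidden, hidden) + layer_norm + vocab_size

total = embeddings + num_layers * per_layer + lm_head
print(total)
```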

In [19]:
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="/content/drive/MyDrive/models/xmas/All_Playlists_Combine.txt",
    block_size=128,
)

In [11]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
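
What does `mlm_probability=0.15` mean? During training, roughly 15% of the tokens in each example get hidden and the model has to guess them back. Below is a simplified pure-Python sketch of that idea. The `mask_tokens` helper is hypothetical, and the real collator is more elaborate (for example, it works on token ids and does not always substitute the mask token).

```python
import random

MASK_TOKEN = "<mask>"
IGNORE = None  # placeholder label for positions the loss should skip

def mask_tokens(tokens, mlm_probability=0.15, seed=0):
    """Hide a random subset of tokens and keep the originals as labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mlm_probability:
            masked.append(MASK_TOKEN)   # the model sees the mask...
            labels.append(token)        # ...and must predict the original
        else:
            masked.append(token)
            labels.append(IGNORE)
    return masked, labels

tokens = "Sleigh bells ring are you listening".split() * 5
masked, labels = mask_tokens(tokens)
print(masked)
```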

In [25]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/models/xmas/",
    overwrite_output_dir=True,
    num_train_epochs=500,
    per_device_train_batch_size=64,
    save_steps=500,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

In [None]:
trainer.train(resume_from_checkpoint='/content/drive/MyDrive/models/xmas/')

In [None]:
trainer.save_model("/content/drive/MyDrive/models/xmas/")

In [21]:
trainer.train()

Step,Training Loss
500,5.5848
1000,4.2596
1500,3.5162
2000,3.1308


TrainOutput(global_step=2000, training_loss=4.122851196289062, metrics={'train_runtime': 1460.1922, 'train_samples_per_second': 84.989, 'train_steps_per_second': 1.37, 'total_flos': 4110973722777600.0, 'train_loss': 4.122851196289062, 'epoch': 100.0})
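
The numbers in that `TrainOutput` can be cross-checked with a little arithmetic. The dataset size below is my own inference from the reported throughput, not something the trainer printed, and it assumes training ran on a single device.

```python
import math

# Numbers copied from the TrainOutput and TrainingArguments in this notebook.
global_step = 2000
epochs = 100.0
train_runtime = 1460.1922            # seconds
train_samples_per_second = 84.989
batch_size = 64                      # per_device_train_batch_size

steps_per_epoch = global_step / epochs
samples_per_epoch = train_samples_per_second * train_runtime / epochs

# Inferred, not reported: roughly how many lines the dataset holds.
estimated_dataset_size = round(samples_per_epoch)

# Batches needed to cover the dataset once (the last batch may be partial).
expected_steps = math.ceil(estimated_dataset_size / batch_size)

print(steps_per_epoch, estimated_dataset_size, expected_steps)
```

If the estimate is right, the step count per epoch computed from the dataset size lines up with the one computed from `global_step`.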

In [None]:
drive.flush_and_unmount()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [26]:
trainer.train(resume_from_checkpoint='/content/drive/MyDrive/models/xmas/')

Step,Training Loss
500,2.534
1000,1.6743
1500,1.0945
2000,0.6841
2500,0.4135
3000,0.2429
3500,0.1459
4000,0.089
4500,0.0586
5000,0.0423


KeyboardInterrupt: ignored

In [27]:
trainer.save_model("/content/drive/MyDrive/models/xmas/")

In [None]:
import json
import matplotlib.pyplot as plt

with open("/content/drive/MyDrive/models/xmas/checkpoint-6000/trainer_state.json", "r") as f:
    data = json.load(f)

params = {'legend.fontsize': 'small',
          'figure.figsize': (15, 10),
          'axes.labelsize': 'x-small',
          'axes.titlesize': 'x-small',
          'xtick.labelsize': 'x-small',
          'ytick.labelsize': 'x-small'}
plt.rcParams.update(params)

# Keep only the log entries that carry a training loss
loss_value = [tick['loss'] for tick in data['log_history'] if 'loss' in tick]

plt.plot(range(len(loss_value)), loss_value, label='loss', alpha=0.15)
plt.savefig("/content/drive/MyDrive/models/xmas/loss.jpg")
plt.show()

In [None]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="/content/drive/MyDrive/models/xmas/checkpoint-7800",
    tokenizer=tokenizer,
    top_k=20,
)

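# Heads-up: this checkpoint was trained as a masked language model,
# so loading it into a text-generation pipeline triggers the warning shown below.
fill_text = pipeline(
    "text-generation",
    model="/content/drive/MyDrive/models/xmas/checkpoint-7800",
    tokenizer=tokenizer
)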

If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`


In [None]:
fill_text("##TITLE Rockin Around The Christmas Tree ###LYRICS ")