# DI 725: Transformers and Attention-Based Deep Networks

## An Assignment for Implementing Transformers in PyTorch

This notebook guides you through the sample code for the assignment.

This notebook follows the baseline prepared by Andrej Karpathy, with a custom dataset (Don Quixote by Cervantes). This version of the code, called [nanoGPT](https://github.com/karpathy/nanoGPT), is a rewrite of his well-known [minGPT](https://github.com/karpathy/minGPT).

### Author:
* Ümit Mert Çağlar

### Edited by:
* Ebru Kültür Başaran

## Requirements
Install the requirements for your environment. After the first run, comment the install command out.

Dependencies:

- [pytorch](https://pytorch.org)
- [numpy](https://numpy.org/install/)
- `transformers` for Hugging Face transformers (to load GPT-2 checkpoints)
- `datasets` for Hugging Face datasets (to download + preprocess datasets)
- `tiktoken` for OpenAI's fast BPE code
- `wandb` for optional logging
- `tqdm` for progress bars

In [1]:
#!pip install torch numpy transformers datasets tiktoken wandb tqdm

The fastest way to get started with transformers, apart from following the DI725 labs, is to use a small model and dataset. For this purpose, we will start by training a character-level GPT on Don Quixote by Cervantes. The code will download a single file (2 MB) and apply some transformations. Examine the code in [prepare.py](data/don_char/prepare.py).

## Preprocessing for NanoGPT

In [1]:
!python preprocess_sentiment_ng.py

vocab_size: 39
Preprocessing complete! Data saved to data/sentiment


This script preprocesses customer conversation data for a sentiment classification task. It loads and cleans the dataset (removes URLs and punctuation, lowercases text), converts sentiment labels (such as "positive") to numerical labels, builds a character-level vocabulary from the training text, encodes the conversations as sequences of character indices, and saves the encoded data and metadata (such as the vocabulary) into .pkl files for model training.
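The steps described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the contents of `preprocess_sentiment_ng.py`: the function names, the URL pattern, the label IDs, and the two sample sentences are all assumptions made for the example.

```python
import re
import string

# Hypothetical label mapping; the IDs used by the actual script may differ.
LABEL_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}

def clean_text(text: str) -> str:
    """Lowercase the text, drop URLs, drop punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def build_vocab(texts):
    """Build character <-> index tables from the training texts only."""
    chars = sorted(set("".join(texts)))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

def encode(text, stoi):
    """Map characters to indices; characters unseen in training are skipped."""
    return [stoi[ch] for ch in text if ch in stoi]

# Made-up sample conversations, used only to exercise the functions.
raw_train = ["Great service, thanks! http://example.com", "My order is LATE."]
train_texts = [clean_text(t) for t in raw_train]
stoi, itos = build_vocab(train_texts)
ids = encode(train_texts[0], stoi)
decoded = "".join(itos[i] for i in ids)

print("vocab_size:", len(stoi))
print(train_texts[0], "->", ids[:8], "...")
```

Building the vocabulary from the training text alone means validation or test conversations can contain characters with no index. This sketch skips them; mapping them to a dedicated unknown index is another reasonable choice.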

## Preprocessing for GPT-2

In [2]:
!python preprocess_sentiment_gpt2.py

unique_sents in main CSV: ['neutral', 'negative', 'positive']
train_label_set: {0, 1, 2}
val_label_set: {0, 1, 2}
test_label_set: {0, 1, 2}
Preprocessing complete! Data saved to data/sentiment


This script preprocesses a sentiment classification dataset by cleaning and encoding text conversations using the GPT-2 Byte-Pair Encoding (BPE) tokenizer from tiktoken. It splits the data into train/validation/test sets with stratification, converts labels to numeric IDs, and saves everything as pickled .pkl files for model training. It also creates and saves metadata such as the vocabulary size and label names.
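To make the stratification step concrete, here is a small standard-library sketch of a stratified train/validation/test split. It is an illustration under assumptions, not the code in `preprocess_sentiment_gpt2.py`: the split fractions, the seed, the label-to-ID order, and the toy class counts are hypothetical, and the BPE encoding step is left out so the example has no external dependencies.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.1, test_frac=0.1, seed=0):
    """Split example indices so every class appears in every split.

    Each class is shuffled and divided separately, which keeps the class
    proportions roughly equal across train, validation and test.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_val = max(1, round(len(idxs) * val_frac))
        n_test = max(1, round(len(idxs) * test_frac))
        val += idxs[:n_val]
        test += idxs[n_val:n_val + n_test]
        train += idxs[n_val + n_test:]
    return train, val, test

# Hypothetical ID assignment; the script's actual mapping is not shown in the log.
label_names = ["neutral", "negative", "positive"]
label_to_id = {name: i for i, name in enumerate(label_names)}

# Toy, imbalanced label column: 60 neutral, 30 negative, 10 positive.
labels = [label_to_id[n] for n in ["neutral"] * 60 + ["negative"] * 30 + ["positive"] * 10]
train_idx, val_idx, test_idx = stratified_split(labels)

for name, split in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    print(name, len(split), sorted({labels[i] for i in split}))
```

Every split ends up containing all three label IDs, which matches the `{0, 1, 2}` label sets printed by the script.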

## Creating the Model Architecture for NanoGPT and GPT-2

In [3]:
!python model_sentiment.py

The `model_sentiment.py` script defines two model architectures for text classification:

- `GPT2Wrapper`: Fine-tunes a pre-trained GPT-2 model for a 3-class classification task (e.g., sentiment analysis).
- `GPTforClassification`: Placeholder for a custom GPT-style model (like NanoGPT). It is not implemented here but serves as a hook for future development.

The `GPT2Wrapper` uses the average of the hidden states from GPT-2 and applies a linear classifier on top. It also includes an optimizer setup with standard weight decay handling.
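The arithmetic behind "average the hidden states, then apply a linear classifier" can be shown with toy numbers. The sketch below uses plain Python lists so each step is visible; in `model_sentiment.py` the same operations would be tensor operations on GPT-2 outputs. The function names, the toy values, and the use of a padding mask are assumptions for illustration; the source does not say whether padding tokens are excluded from the average.

```python
def mean_pool(hidden_states, mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                total[j] += value
    return [t / max(count, 1) for t in total]

def linear(x, weight, bias):
    """Compute weight @ x + bias, with one row of weight per class."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

# Toy input: 3 tokens with hidden size 2; the last token is padding.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

# Toy classifier parameters for 3 classes (hypothetical values).
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, -1.0]

pooled = mean_pool(hidden_states, mask)      # one vector per sequence
logits = linear(pooled, weight, bias)        # one score per class
predicted_class = max(range(len(logits)), key=logits.__getitem__)

print("pooled:", pooled)
print("logits:", logits, "-> class", predicted_class)
```

Without the mask, the padding vector would dominate the average, which is why a masked mean is a common choice when sequences in a batch have different lengths.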

## Training the Model for NanoGPT and Fine Tuning GPT-2

In [None]:
!python train_sentiment.py

This script fine-tunes a GPT-based model (either GPT-2 or a NanoGPT-style custom transformer) on a sentiment classification task using preprocessed conversation data. It supports distributed training (DDP) and mixed precision, uses W&B (Weights & Biases) for logging metrics and visuals, and implements early stopping based on validation loss. It loads data from pre-tokenized pickle files, trains the model with periodic evaluation, and finally logs performance (accuracy, loss, confusion matrix) on the test set.
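Early stopping on validation loss can be sketched as a small counter. The class below is a hypothetical illustration, not the logic in `train_sentiment.py`. The validation losses are copied from the NanoGPT log in this notebook, and the patience of 6 evaluations is an inference: it is the value that makes this sketch stop at the same step as the log, not a setting confirmed by the script.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss):
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

# Validation losses from the NanoGPT run, evaluated every 200 steps.
val_losses = [
    0.8045, 0.7224, 0.7015, 0.7397, 0.6984, 0.7160, 0.7073, 0.7138,
    0.7257, 0.6728, 0.7480, 0.7161, 0.6610, 0.7309, 0.7011, 0.7289,
    0.7053, 0.7092, 0.7666,
]

stopper = EarlyStopping(patience=6)  # assumed value, inferred from the log
stop_step = None
for i, loss in enumerate(val_losses):
    if stopper.step(loss):
        stop_step = i * 200
        break

print("best val loss:", stopper.best, "stopped at step:", stop_step)
```

In this run the best validation loss is reached at step 2400, and the six evaluations that follow do not improve on it. A real training loop would typically also save a checkpoint whenever the best value improves, so that the best model rather than the last one is evaluated on the test set.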

**The output for NanoGPT (the script was run via Spyder):**

step 0: train loss 0.7875, val loss 0.8045
step 200: train loss 0.6565, val loss 0.7224
step 400: train loss 0.6471, val loss 0.7015
step 600: train loss 0.6598, val loss 0.7397
step 800: train loss 0.6379, val loss 0.6984
step 1000: train loss 0.6468, val loss 0.7160
step 1200: train loss 0.6380, val loss 0.7073
step 1400: train loss 0.6248, val loss 0.7138
step 1600: train loss 0.6406, val loss 0.7257
step 1800: train loss 0.6269, val loss 0.6728
step 2000: train loss 0.6753, val loss 0.7480
step 2200: train loss 0.6302, val loss 0.7161
step 2400: train loss 0.6144, val loss 0.6610
step 2600: train loss 0.6146, val loss 0.7309
step 2800: train loss 0.5946, val loss 0.7011
step 3000: train loss 0.5744, val loss 0.7289
step 3200: train loss 0.5905, val loss 0.7053
step 3400: train loss 0.5644, val loss 0.7092
step 3600: train loss 0.5807, val loss 0.7666
Stopping early
Test Loss: 1.4704, Accuracy: 0.5000
              precision    recall  f1-score   support

     neutral       0.40      1.00      0.57        10
    negative       1.00      0.50      0.67        10
    positive       0.00      0.00      0.00        10

    accuracy                           0.50        30
   macro avg       0.47      0.50      0.41        30
weighted avg       0.47      0.50      0.41        30
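The report above pins down the underlying predictions fairly tightly. The sketch below writes out the confusion matrix implied by the reported precision, recall, and support values (the matrix itself is inferred, not printed by the script) and recomputes the metrics from it using the standard definitions.

```python
labels = ["neutral", "negative", "positive"]

# Rows are true classes, columns are predicted classes.
# Counts are inferred from the classification report above.
confusion = [
    [10, 0, 0],  # neutral: all 10 predicted neutral
    [5, 5, 0],   # negative: 5 predicted neutral, 5 predicted negative
    [10, 0, 0],  # positive: all 10 predicted neutral
]

def per_class_metrics(matrix):
    """Return (precision, recall, f1) for each class of a confusion matrix."""
    metrics = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum
        actual = sum(matrix[k])                    # row sum (support)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics

metrics = per_class_metrics(confusion)
accuracy = sum(confusion[k][k] for k in range(3)) / sum(map(sum, confusion))
macro_f1 = sum(m[2] for m in metrics) / len(metrics)

for name, (p, r, f1) in zip(labels, metrics):
    print(f"{name:>8}  precision {p:.2f}  recall {r:.2f}  f1 {f1:.2f}")
print(f"accuracy {accuracy:.2f}  macro f1 {macro_f1:.2f}")
```

Read this way, the model never predicts the positive class on the test set: 25 of the 30 test examples are predicted neutral, which explains both the perfect neutral recall and the low neutral precision.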

**The output for GPT-2 (the script was run via Spyder):**

step 0: train loss 7.0802, val loss 7.0958
step 200: train loss 0.7538, val loss 0.7511
step 400: train loss 0.7267, val loss 0.7346
step 600: train loss 0.7337, val loss 0.7311
step 800: train loss 0.7155, val loss 0.7226
step 1000: train loss 0.7027, val loss 0.7094
step 1200: train loss 0.6841, val loss 0.7000
step 1400: train loss 0.6728, val loss 0.6913
step 1600: train loss 0.6717, val loss 0.6838
step 1800: train loss 0.6599, val loss 0.6745
step 2000: train loss 0.6521, val loss 0.6689
step 2200: train loss 0.6439, val loss 0.6631
step 2400: train loss 0.6365, val loss 0.6551
step 2600: train loss 0.6300, val loss 0.6586
step 2800: train loss 0.6291, val loss 0.6523
step 3000: train loss 0.6240, val loss 0.6433
step 3200: train loss 0.6148, val loss 0.6359
step 3400: train loss 0.6068, val loss 0.6320
step 3600: train loss 0.5999, val loss 0.6255
step 3800: train loss 0.5982, val loss 0.6289
step 4000: train loss 0.5996, val loss 0.6214
step 4200: train loss 0.5885, val loss 0.6174
step 4400: train loss 0.5826, val loss 0.6137
step 4600: train loss 0.5823, val loss 0.6141
step 4800: train loss 0.5820, val loss 0.6087
step 5000: train loss 0.5829, val loss 0.6129
step 5200: train loss 0.5745, val loss 0.6109
step 5400: train loss 0.5802, val loss 0.6209
step 5600: train loss 0.5794, val loss 0.6020
step 5800: train loss 0.5848, val loss 0.6066
step 6000: train loss 0.5667, val loss 0.5974
step 6200: train loss 0.5585, val loss 0.5914
step 6400: train loss 0.5728, val loss 0.5961
step 6600: train loss 0.5557, val loss 0.5920
step 6800: train loss 0.5683, val loss 0.5961
step 7000: train loss 0.5472, val loss 0.5819
step 7200: train loss 0.5553, val loss 0.5849
step 7400: train loss 0.5472, val loss 0.5836
step 7600: train loss 0.5554, val loss 0.5806
step 7800: train loss 0.5432, val loss 0.5753
step 8000: train loss 0.5392, val loss 0.5755
step 8200: train loss 0.5365, val loss 0.5750
step 8400: train loss 0.5365, val loss 0.5735
step 8600: train loss 0.5510, val loss 0.5731
step 8800: train loss 0.5358, val loss 0.5714
step 9000: train loss 0.5299, val loss 0.5694
step 9200: train loss 0.5308, val loss 0.5691
step 9400: train loss 0.5368, val loss 0.5689
step 9600: train loss 0.5311, val loss 0.5639
step 9800: train loss 0.5228, val loss 0.5594
Test Loss: 2.1792, Accuracy: 0.5000
              precision    recall  f1-score   support

    negative       0.83      0.50      0.62        10
     neutral       0.42      1.00      0.59        10
    positive       0.00      0.00      0.00        10

    accuracy                           0.50        30
   macro avg       0.42      0.50      0.40        30
weighted avg       0.42      0.50      0.40        30
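One reference point for reading the two test losses: a classifier that assigns equal probability to each of the three classes has a cross-entropy of ln 3 ≈ 1.0986, whatever the true labels are. The sketch below compares the reported test losses against that value. It assumes the reported test loss is a mean cross-entropy in nats, which the script output does not state explicitly.

```python
import math

num_classes = 3

# Cross-entropy of a uniform prediction: -log(1 / num_classes) = log(num_classes).
uniform_loss = -math.log(1.0 / num_classes)

# Test losses as reported in the two runs above.
test_losses = {"NanoGPT": 1.4704, "GPT-2": 2.1792}

# True where the reported loss is higher (worse) than the uniform baseline.
worse_than_uniform = {name: loss > uniform_loss for name, loss in test_losses.items()}

for name, loss in test_losses.items():
    print(f"{name}: test loss {loss:.4f} vs uniform baseline {uniform_loss:.4f}")
```

Under that assumption, both test losses are above the uniform baseline even though both validation losses end well below it, so the test set appears to behave differently from the validation data. Note that the test reports show 10 examples per class; whether the training and validation splits are similarly balanced is not shown in this notebook.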

## Efficiency notes

*For simple model benchmarking and profiling, `bench.py` might be useful. It's identical to what happens in the meat of the training loop of `train.py`, but omits many of the other complexities.*

*Note that the code by default uses [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/). At the time of writing (Dec 29, 2022) this makes `torch.compile()` available in the nightly release. The improvement from the one line of code is noticeable, e.g. cutting down iteration time from ~250ms / iter to 135ms / iter. Nice work PyTorch team!*


## Troubleshooting

*Note that by default this repo uses PyTorch 2.0 (i.e. `torch.compile`). This is fairly new and experimental, and not yet available on all platforms (e.g. Windows). If you run into related error messages, try disabling it by adding the `--compile=False` flag. This will slow down the code, but at least it will run.*

*For some context on this repository, GPT, and language modeling, it might be helpful to watch the [Zero To Hero series](https://karpathy.ai/zero-to-hero.html). Specifically, the [GPT video](https://www.youtube.com/watch?v=kCc8FmEb1nY) is popular if you have some prior language modeling context.*

## Acknowledgements

This code is a fork of Andrej Karpathy's introductory [NanoGPT repository](https://github.com/karpathy/nanoGPT), which is an updated form of minGPT.