# Data Cartography for Multi-class Classification

Goals:
1. Subclass `Trainer` and overwrite to save out `ids`, `logits`, `gold_label` for each example after each training step.
2. Use plotting utilities from Data Cartography repo to generate plots

In [11]:
!pip install transformers torch datasets ipywidgets

Collecting torch
  Downloading torch-2.1.2-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m670.2/670.2 MB[0m [31m411.8 kB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Collecting nvidia-cusparse-cu12==12.1.0.106
  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m196.0/196.0 MB[0m [31m1.4 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hCollecting nvidia-nccl-cu12==2.18.1
  Downloading nvidia_nccl_cu12-2.18.1-py3-none-manylinux1_x86_64.whl (209.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m209.8/209.8 MB[0m [31m1.1 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Collecting nvidia-cusolver-cu12==11.4.5.107
  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m124.2/124.2 MB[0m [31m1.9 MB/s[0m eta [36m

In [1]:
from datasets import load_dataset

dataset = load_dataset("lmsys/toxic-chat")

In [2]:
dataset = dataset.remove_columns(
    [
        "model_output",
        "human_annotation",
        "jailbreaking",
        "openai_moderation",
    ]
)

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['conv_id', 'user_input', 'toxicity'],
        num_rows: 5082
    })
    test: Dataset({
        features: ['conv_id', 'user_input', 'toxicity'],
        num_rows: 5083
    })
})

In [7]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [9]:
dataset["train"][1]["user_input"]

'Hi! I was roleplaying with you on another window. We were roleplaying a scenario where my wife, Susan had managed to shrink me down to a height of 5cm. She had placed me on the nightstand and we were trying to figure out what to do with this new size difference between us. You were the GM, I was playing Roger, the husband and my wife was an NPC. You were describing everything, including the results of my actions, and what I see and feel at this new diminutive size. It was really fun to the both of us!'

In [11]:
len(tokenizer(dataset["train"][1]["user_input"])["input_ids"])

121

In [35]:
def check_seq_length(dataset, tokenizer):
    """Check to see how many samples are longer than max seq length"""


    print("----Number samples longer than model's context length----")
    
    dataset = dataset.map(lambda examples: {"num_tokens": len(tokenizer(examples["user_input"])["input_ids"])})

    for split in ["test", "train"]:
        df = dataset["train"].to_pandas()
        count = len(df.loc[df["num_tokens"]>=tokenizer.model_max_length])

        print(f"{split.upper()}: {count}")

In [37]:
check_seq_length(dataset, tokenizer)

----Number samples longer than model's context length----


Map:   0%|          | 0/5082 [00:00<?, ? examples/s]

Map:   0%|          | 0/5083 [00:00<?, ? examples/s]

TEST: 15
TRAIN: 15


In [None]:
tokenizer.model_max_length

In [41]:
df = dataset["train"].to_pandas()

In [42]:
df.loc[df["num_tokens"]>=512]

Unnamed: 0,conv_id,user_input,toxicity,num_tokens
1337,3311a70103b8f563b0b66a3bdc9f451cd0146e56933b66...,i have this css body {\n margin: 0px; \n dis...,0,554
1426,199f36239a7baf88dffac5f3155970697840c6d8c63c4d...,hello what would the API link be for this seve...,0,523
1567,c1e590836b272809170e3d8cfa1969156d96e831c5de57...,what this code does?\nimport requests\nfrom bs...,0,542
1639,8e6ce1ddc0d6296bdcac4015dbc3c01b26c9a8b16ab018...,print the pandas dataframe for the csv data: D...,0,939
1707,9bb4d0307a2ccfe1bc8220d2a679a49e1fd1c788e50243...,I ran the following SageMath 9.8 code:\n\n```...,0,621
1954,bfd82cd662d06a16fbb8e808ba84caeed4ad554a8a3729...,quiero que conviertas el siguiente codigo y lo...,0,522
2475,d5d4dba611cdfb579b64dbaed6fd6219f04fe528665be2...,import pygame\nimport pygame_gui\n\npygame.ini...,0,524
2574,7c8de0ae0a4c0e01cd509ad394b659886c85a27c919d36...,"can you explain the ""offset"" section of this c...",0,543
2759,5bc20e6fbc3a63e7c1af915e6cfc2708c0bb4ac9c13ae5...,<!DOCTYPE html>\n<html>\n<head>\n\t<title>Tria...,0,598
2788,90b1cc68d260ed91d3b80eba9a4854d727f59d08f0d945...,can you rewrite this to golang ? \n\n// Encode...,0,584


In [40]:
df.loc[df["num_tokens"]>=512]

Unnamed: 0,conv_id,user_input,toxicity,num_tokens
259,f4fca4a4dad0b274d7bb7652a8b64aee1ce2ea8bb9e0dd...,You are a Backgammon AI. This variant has 14 r...,0,721
649,4df8e669c4f1e4cccced3a36de9de881212d628c5eed3b...,make a GUI of the following Python app: \n\nim...,0,605
1003,cbe1460e1447945e784e2993f1ae6767a78be985387e7c...,given this diagram of the Ogg page header and ...,0,758
1195,ef2cd4161403a14270059f74fc481d3dd767286f900fba...,\n1. d4 d5 2. c4 Nf6 3. Nc3 c6 4. Nf3 Bg4 $6 5...,0,703
2035,d6a335b8e6d03094db0227412afedf3790681ccbb4a8cf...,"From the Neo4j schema below, can you generate ...",0,674
2363,b14d14edc6d7cd9ad446d8332ae2951b467db86790ef9c...,"def perform_clustering(input_data, cm, mapping...",0,574
3085,b41e73523a26c39e2eec99e885ab09bc3a25679f1aa81e...,"explicame el siguiente codigo \nimport React, ...",0,617
3308,e8fdf5038adf5cb59c1d980f7d849c5f3898327e9b0b63...,"<!-- Container START -->\n<div class=""containe...",0,561
3657,7b8212e451c64817233365a3f90383c2e3e3e5a1d4273d...,"```<!-- Container START -->\n<div class=""conta...",0,563
4226,14427e12c22be388708ba21b77feec17b208cfae914551...,"{'role': 'system', 'content': 'You are Blog-GP...",0,560


In [27]:
dataset.reset_format()

In [26]:
dataset[:10]

KeyError: "Invalid key: slice(None, 10, None). Please first select a split. For example: `my_dataset_dictionary['train'][slice(None, 10, None)]`. Available splits: ['test', 'train']"

In [23]:
dataset["train"][1]

{'conv_id': '56992bf6775b763ef67d8b4dcb4d7ef1f918c12a513c9b89b27ace38facafbc3',
 'user_input': 'Hi! I was roleplaying with you on another window. We were roleplaying a scenario where my wife, Susan had managed to shrink me down to a height of 5cm. She had placed me on the nightstand and we were trying to figure out what to do with this new size difference between us. You were the GM, I was playing Roger, the husband and my wife was an NPC. You were describing everything, including the results of my actions, and what I see and feel at this new diminutive size. It was really fun to the both of us!',
 'toxicity': 0,
 'num_tokens': 121}

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)