# Create multi-label toxicity dataset



### Source of Parquet file?
SQL used for downloading the data from:

https://huggingface.co/datasets/acloudfan/toxicity-multi-label-classifier

The SQL statements are shown in cells below.

PS: Deliberately removed the rows that have *obscene*=1

**Original dataset:** from which this dataset was created
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

In [13]:
import pandas as pd
import json

# Load the parquet file into a pandas DataFrame
parquet_file_train = "./toxicity-classifier/multi_label_comment_classification_train.parquet"
parquet_file_validation = "./toxicity-classifier/multi_label_comment_classification_validation.parquet"
parquet_file_test = "./toxicity-classifier/multi_label_comment_classification_test.parquet"



## Cohere dataset

**Requirements :** https://docs.cohere.com/docs/classify-preparing-the-data

* Used the code below to convert the dataset from Parquet to JSONL format
* Renamed the attributes, as per Cohere's multi-label dataset requirement
* Converts Parquet to JSONL
* Split the dataset into (Training, Validation & Test)

In [17]:
# List of labels to check
label_columns = ["toxic", "threat", "insult", "identity_hate"]

# Function to process each row
def generate_output_cohere(row):
       
    # Extract the comment text
    text = row['comment_text']
    
    # Create a list of labels where the value is 1
    labels = [label for label in label_columns if row[label] == 1]
    
    # Format as desired output
    return {"text": text, "label": labels}

def read_parquet_generate_jsonl_cohere(parquet_file, output_file):
    # Generate the training set
    df = pd.read_parquet(parquet_file)
    output_df = df.apply(generate_output_cohere, axis=1)
    # output = output_df.to_list()

    # Convert the DataFrame to a JSONL file (one JSON object per line)
    
    output_df.to_json(output_file, orient='records', lines=True)

    print(f"Successfully converted {parquet_file_train} to {output_file}")

# Print the output in the desired format
# for entry in output:
#     print(entry)

#### Training set

* Run the following against the **train** set
* Download the parquet file
* Rename file to : multi_label_comment_classification_train.parquet

**HuggingFace SQL console :**
https://huggingface.co/blog/sql-console

```
SELECT comment_text, toxic, threat, insult, identity_hate FROM (
 SELECT * FROM train  
 where obscene=0 AND toxic=0 AND severe_toxic=0 AND threat=0 AND insult=0 AND identity_hate=0   LIMIT 20
)
UNION
SELECT comment_text, toxic, threat, insult, identity_hate  FROM (
   SELECT * FROM train  where obscene=0 and toxic=1 LIMIT 20
)
UNION
SELECT comment_text, toxic,  threat, insult, identity_hate  FROM (
   SELECT * FROM train  where obscene=0 and threat=1 LIMIT 20
)
UNION
SELECT comment_text, toxic, threat, insult, identity_hate  FROM (
   SELECT * FROM train  where obscene=0 and identity_hate=1 LIMIT 20
)
UNION
SELECT comment_text, toxic, threat, insult, identity_hate  FROM (
   SELECT * FROM train  where obscene=0 and insult=1 LIMIT 20
);
```

In [18]:
jsonl_file_train = "./toxicity-classifier/multi_label_comment_classification_train_cohere.jsonl"
read_parquet_generate_jsonl_cohere(parquet_file_train, jsonl_file_train)

Successfully converted ./toxicity-classifier/multi_label_comment_classification_train.parquet to ./toxicity-classifier/multi_label_comment_classification_train_cohere.jsonl


#### Validation

* Run the following against the **test** set
* Download the parquet file
* Rename file to : multi_label_comment_classification_test.parquet

```
SELECT comment_text, toxic, threat, insult, identity_hate FROM (
 SELECT * FROM validation  
 where obscene=0 AND toxic=0 AND severe_toxic=0 AND threat=0 AND insult=0 AND identity_hate=0   LIMIT 8
)
UNION
SELECT comment_text, toxic, threat, insult, identity_hate  FROM (
   SELECT * FROM validation  where obscene=0 and toxic=1 LIMIT 8
)
UNION
SELECT comment_text, toxic,  threat, insult, identity_hate  FROM (
   SELECT * FROM validation  where obscene=0 and threat=1 LIMIT 8
)
UNION
SELECT comment_text, toxic, threat, insult, identity_hate  FROM (
   SELECT * FROM validation  where obscene=0 and identity_hate=1 LIMIT 8
)
UNION
SELECT comment_text, toxic, threat, insult, identity_hate  FROM (
   SELECT * FROM validation  where obscene=0 and insult=1 LIMIT 8
);
```5
);

In [19]:
jsonl_file_validation = "./toxicity-classifier/multi_label_comment_classification_validation_cohere.jsonl"
read_parquet_generate_jsonl_cohere(parquet_file_validation, jsonl_file_validation)

Successfully converted ./toxicity-classifier/multi_label_comment_classification_train.parquet to ./toxicity-classifier/multi_label_comment_classification_validation_cohere.jsonl


#### Test

* Run the following against the **test** set
* Download the parquet file
* Rename file to : multi_label_comment_classification_test.parquet
  
```
SELECT comment_text, toxic, threat, insult, identity_hate FROM (
 SELECT * FROM test  
 where obscene=0 AND toxic=0 AND severe_toxic=0 AND threat=0 AND insult=0 AND identity_hate=0   LIMIT 8
)
UNION
SELECT comment_text, toxic, threat, insult, identity_hate  FROM (
   SELECT * FROM test  where obscene=0 and toxic=1 LIMIT 8
)
UNION
SELECT comment_text, toxic,  threat, insult, identity_hate  FROM (
   SELECT * FROM test  where obscene=0 and threat=1 LIMIT 8
)
UNION
SELECT comment_text, toxic, threat, insult, identity_hate  FROM (
   SELECT * FROM test  where obscene=0 and identity_hate=1 LIMIT 8
)
UNION
SELECT comment_text, toxic, threat, insult, identity_hate  FROM (
   SELECT * FROM test  where obscene=0 and insult=1 LIMIT 8
);
```

In [20]:
jsonl_file_test = "./toxicity-classifier/multi_label_comment_classification_test_cohere.jsonl"
read_parquet_generate_jsonl_cohere(parquet_file_test, jsonl_file_test)

Successfully converted ./toxicity-classifier/multi_label_comment_classification_train.parquet to ./toxicity-classifier/multi_label_comment_classification_test_cohere.jsonl


## OpenAI GPT 

**Requirements :** https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset

* Requires the dataset to be in chat message format
* For non-chat use cases such as classification, use single-turn format with 3 messages ["system", "user", "assistant"]
  e.g.,

```
{
  "messages": [
                {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, 
                {"role": "user", "content": "What's the capital of France?"}, 
                {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}
              ]
}
```

In [6]:
# List of labels to check
label_columns = ["toxic", "threat", "insult", "identity_hate"]

# Function to process each row
def generate_output_openai(row):
       
    # Extract the comment text
    text = row['comment_text']
    
    # Create a list of labels where the value is 1
    labels = [label for label in label_columns if row[label] == 1]

    system_message = "you will categorize the user's input into one or more categories: "+str(["toxic", "severe_toxic", "threat", "insult", "identity_hate"])

    # The output must be in string format i.e., can't be array so use json.dumps to convert to 
    json_l = {
        "messages":[
            {
                "role": "system",
                "content": system_message
            },{
                "role": "user",
                "content": text
            },{
                "role": "assistant",
                "content": json.dumps(labels)
            }
        ]
    }
    
    # Format as desired output
    # return json.dumps(json_l)
    return json_l


def read_parquet_generate_jsonl_openai(parquet_file, output_file):
    # Generate the training set
    df = pd.read_parquet(parquet_file)
    output_df = df.apply(generate_output_openai, axis=1)
    # output = output_df.to_list()

    # Convert the DataFrame to a JSONL file (one JSON object per line)
    
    output_df.to_json(output_file, orient='records', lines=True)

    print(f"Successfully converted {parquet_file_train} to {output_file}")

## Training data

In [7]:
jsonl_file_train = "./toxicity-classifier/multi_label_comment_classification_train_openai.jsonl"
read_parquet_generate_jsonl_openai(parquet_file_train, jsonl_file_train)

Successfully converted ./toxicity-classifier/multi_label_comment_classification_train.parquet to ./toxicity-classifier/multi_label_comment_classification_train_openai.jsonl


## Validation data

In [8]:
jsonl_file_validation = "./toxicity-classifier/multi_label_comment_classification_validation_openai.jsonl"
read_parquet_generate_jsonl_openai(parquet_file_validation, jsonl_file_validation)

Successfully converted ./toxicity-classifier/multi_label_comment_classification_train.parquet to ./toxicity-classifier/multi_label_comment_classification_validation_openai.jsonl


## Test data

In [9]:
jsonl_file_test = "./toxicity-classifier/multi_label_comment_classification_test_openai.jsonl"
read_parquet_generate_jsonl_openai(parquet_file_test, jsonl_file_test)

Successfully converted ./toxicity-classifier/multi_label_comment_classification_train.parquet to ./toxicity-classifier/multi_label_comment_classification_test_openai.jsonl


# HF Publish

In [10]:
from datasets import load_dataset

In [11]:
dataset = load_dataset("parquet", data_files={'train': parquet_file_train, 'validation': parquet_file_validation, 'test': parquet_file_test})

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [12]:
from dotenv import load_dotenv
import os
import sys
import warnings

# Setting path so we can access the utils folder
sys.path.append('../')
sys.path.append('./')

from IPython.display import Markdown, JSON

warnings.filterwarnings("ignore")

# Load the file that contains the API keys
load_dotenv('C:\\Users\\raj\\.jupyter\\.env')


dataset.push_to_hub("acloudfan/toxicity-multi-label-classifier")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

AttributeError: 'HfApi' object has no attribute 'list_files_info'