## Task: Reformat a public dataset.

### Objective

To enhance the suitability of public datasets for LLM (Large Language Model) training and fine-tuning, datasets need to be presented in a consistent, structured format. Your responsibility is to conceive and implement a data format, then modify a given public dataset to adhere to this new structure.

### Detailed Instructions:

1. **Dataset Selection**:

   - You may start with the [Emotional-Support-Conversation dataset](https://raw.githubusercontent.com/thu-coai/Emotional-Support-Conversation/main/ESConv.json) for this task. Other public datasets are welcome as long as you give specific reasons for the choice.

   - This dataset doesn't inherently have "labels". Your task includes crafting at least one appropriate label key and annotating the data accordingly.
   
   - When designing labels, ensure they align with the principles of making the LLM harmless, helpful, and honest.

2. **Dataset Attributes**:
   - Your reformatted dataset should include at least two attributes: `raw_data` and `processed_data`.

     - `raw_data`: A string that stores the raw text loaded from the original dataset, unmodified.
     
     - `processed_data`: A list where each item signifies a feature-label pair that's been processed for a specific task. For clarity, refer to the examples provided in the block below.

3. **Flexibility in Design**:

   - **Data Structure**: Your design should be accommodating. This means:
   
     - Simplifying the addition of new processed data.

     - Expanding the label classes without hassle (i.e., introducing new label keys).

     - Incorporating label values from various annotators (i.e., multiple label values from different annotators for the same label key).
     
   - **Code Flexibility**: Ensure your code is modular, making it straightforward to apply the same formatting to other public datasets.


4. **Design Autonomy**:

   - While the example below offers guidance, don't feel restricted by it. If you believe a different structure is more suitable, present your unique design. However, ensure that it incorporates the essential attributes: `raw_data` and `processed_data`.
   
5. **Deliverables**:

   - An outline of your designed data format/structure.

   - The code used to convert the public dataset to your design.

   - Print the time cost for saving and loading your designed dataset, specifically,

     - Time cost for saving the whole dataset

     - Time cost for loading the whole dataset
     
     - Time cost for loading randomly selected 1k instances from the dataset.
   

   - Model Inference

     - Select a Language Model: Choose a language model of your preference. For instance, you can use a pretrained model available from HuggingFace transformers.

     - Generate Embeddings:  Utilize the selected model to generate embeddings for your processed data. You only need to obtain embeddings for a sample of 100 instances.

     - Find the Closest Pair: Identify the pair of instances with the closest embeddings.

     - Display the Instances: Print or display the instances for review.

   - **Model Training (Bonus)**

      - Select a Language Model: Choose a language model of your preference. For instance, you can use a pretrained model available from HuggingFace transformers.

      - Fine-tune the Model: Fine-tune the pretrained model on your processed data with a training-to-test split ratio of 7:3.

         - You can use a subset of your processed data if the computing resource is limited.
         
         - The task for fine-tuning could be classification, generation, or embedding improvement.

      - Display Training Log: Print the training log, which should include at least the training loss and the test loss.



**Please finish this task in one week. You can return .py or .ipynb files.**

In [1]:
# My design
raw_data =  "xxx"
processed_data = [
    dict(
    feature = "[CLS] Sentence1 [SEP] Sentence2",

    # label contains 4 keys: problem_type, emotion_type, feedback, and strategy
    label = {
        'problem_type': 'Job Crisis',
        'emotion_type': 'Fear',
        'feedback': '4',
        'strategy': 'Question',
        'new_label1': 'Additional label information1',
        'new_label2': 'Additional label information2',
        ...
        
        }
)
]
instance_i = dict(
    raw_data = raw_data,
    processed_data = processed_data,
)
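
The flexibility requirements in the task (new label keys, several annotators per key) can be met by storing a list of annotations under each label key. The following is a minimal sketch of that idea, not the structure used in the rest of this notebook. The class names `Annotation`, `ProcessedItem`, and `Instance`, the annotator ids, and the `Anxiety` value are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """One label value from one annotator (hypothetical helper type)."""
    value: str
    annotator: str


@dataclass
class ProcessedItem:
    feature: str
    # label key -> every annotation recorded for that key
    labels: Dict[str, List[Annotation]] = field(default_factory=dict)

    def add_label(self, key: str, value: str, annotator: str) -> None:
        """Append a value; an unseen label key is created on first use."""
        self.labels.setdefault(key, []).append(Annotation(value, annotator))

    def values(self, key: str) -> List[str]:
        """Return all values recorded for a key, in insertion order."""
        return [a.value for a in self.labels.get(key, [])]


@dataclass
class Instance:
    raw_data: str
    processed_data: List[ProcessedItem] = field(default_factory=list)


item = ProcessedItem(feature="[CLS] Sentence1 [SEP] Sentence2")
item.add_label("emotion_type", "Fear", annotator="annotator_a")
item.add_label("emotion_type", "Anxiety", annotator="annotator_b")  # second opinion, same key
item.add_label("strategy", "Question", annotator="annotator_a")     # new key, no schema change

instance = Instance(raw_data="xxx", processed_data=[item])
print(item.values("emotion_type"))
```

Because every key maps to a list, a disagreement between annotators is kept rather than overwritten, and adding a label key needs no change to the class definitions.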

### 1. Read raw data

In [2]:
import json
import requests

def read_ESConv():
    url = 'https://raw.githubusercontent.com/thu-coai/Emotional-Support-Conversation/main/ESConv.json'
    response = requests.get(url)
    raw_data = response.json()

    print('Amount of data: {}'.format(len(raw_data)))
    return raw_data


raw_data = read_ESConv()
# raw_data

Amount of data: 1300


### 2. Process data

In [3]:
import nltk
import re
import string
from nltk.corpus import stopwords

nltk.download('stopwords')

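STOP_WORDS = set(stopwords.words('english'))  # build once instead of on every call
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def process_text(text):
    """Lowercase, strip punctuation, and drop stop words and pure-digit tokens."""
    cleaned_text = []
    for word in text.split():
        word = word.lower()
        # Strip punctuation before filtering so tokens such as "the," are caught
        stripped = word.translate(PUNCT_TABLE)

        if not stripped or stripped.isdigit():
            continue
        # Check both forms: the stop list contains contractions such as "don't"
        if word in STOP_WORDS or stripped in STOP_WORDS:
            continue

        cleaned_text.append(stripped)
    return " ".join(cleaned_text)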

def process_data(raw_data, clean_text):
    """Turn each dialogue into {feature: label} items, one per pair of consecutive turns.

    Turns are paired by position (even index, then odd index), not by speaker.
    """
    processed_data = []
    for conversation in raw_data:
        # Reset per dialogue so an unpaired final turn cannot leak into the next one
        feature, label = '', {}
        for j, turn in enumerate(conversation['dialog']):
            content = turn['content'].rstrip('\n')
            anno = turn['annotation']

            if clean_text:
                content = process_text(content)

            if j % 2 == 0:
                # First turn of the pair: start the feature and add dialogue-level labels
                feature = '[CLS]' + content
                label['problem_type'] = conversation['problem_type']
                label['emotion_type'] = conversation['emotion_type']
                label.update(anno)
            else:
                # Second turn of the pair: close the feature and store the item
                feature = feature + '[SEP]' + content
                label.update(anno)
                processed_data.append({feature: label})

                feature, label = '', {}
    return processed_data

processed_data = process_data(raw_data, clean_text=False)
len(processed_data)

18864
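
`process_data` pairs turns by index parity, so two consecutive messages from the same speaker end up in one feature. A possible alternative is to pair each run of seeker turns with the supporter turns that follow it. The sketch below assumes that every turn carries a `speaker` field with the values `seeker` and `supporter`; that field is not used anywhere in this notebook, so check it against the raw data first. The function name `pair_by_speaker` and the toy dialogue are hypothetical.

```python
def _flush(pairs, seeker_parts, supporter_parts, label):
    """Store one pair if both sides are present."""
    if seeker_parts and supporter_parts:
        feature = "[CLS]" + " ".join(seeker_parts) + "[SEP]" + " ".join(supporter_parts)
        pairs.append({"feature": feature, "label": dict(label)})


def pair_by_speaker(dialog):
    """Pair each run of seeker turns with the supporter turns that follow it."""
    pairs = []
    seeker_parts, supporter_parts, label = [], [], {}
    for turn in dialog:
        # A seeker turn arriving after supporter turns closes the current pair
        if turn["speaker"] == "seeker" and supporter_parts:
            _flush(pairs, seeker_parts, supporter_parts, label)
            seeker_parts, supporter_parts, label = [], [], {}

        content = turn["content"].strip()
        if turn["speaker"] == "seeker":
            seeker_parts.append(content)
        else:
            supporter_parts.append(content)
        # Later annotations overwrite earlier ones that share a key
        label.update(turn.get("annotation", {}))

    _flush(pairs, seeker_parts, supporter_parts, label)  # trailing pair, if complete
    return pairs


# Toy dialogue (made up) with two consecutive seeker turns
toy_dialog = [
    {"speaker": "seeker", "annotation": {}, "content": "Hello\n"},
    {"speaker": "seeker", "annotation": {}, "content": "I lost my job.\n"},
    {"speaker": "supporter", "annotation": {"strategy": "Question"}, "content": "How are you coping?\n"},
    {"speaker": "seeker", "annotation": {}, "content": "Not well.\n"},
]
pairs = pair_by_speaker(toy_dialog)
print(pairs)
```

On this toy input, parity-based pairing would join the two seeker messages into one feature, while the speaker-based version keeps the seeker text before `[SEP]` and the supporter reply after it. The unanswered final seeker turn is dropped.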

### 3. Save and load data

In [16]:
import pickle
import time
import random

def save_data(instance_i):
    start_time = time.time()
    with open('data.pkl', 'wb') as file:
        pickle.dump(instance_i, file)

    end_time = time.time()
    save_time = end_time - start_time
    return save_time

def load_data(PATH):
    start_time = time.time()
    with open(PATH, 'rb') as file:
        loaded_data = pickle.load(file)
    end_time = time.time()
    load_time = end_time - start_time
    return loaded_data, load_time

def random_instances(dataset, num):
    random_indices = random.sample(range(len(dataset)), num)

    start_time = time.time()
    selected_instances = [dataset[n] for n in random_indices]
    end_time = time.time()
    load_random_time = end_time - start_time
    return load_random_time, selected_instances


instance_i = dict(
    raw_data = raw_data,
    processed_data = processed_data,
)

save_time = save_data(instance_i)
print(f"Time cost for saving the whole dataset: {save_time} seconds")

loaded_data, load_time = load_data('data.pkl')
print(f"Time cost for loading the whole dataset: {load_time} seconds")

dataset = loaded_data['processed_data']
load_random_time, _= random_instances(dataset, 1000)
print(f"Time cost for loading 1k random instances: {load_random_time} seconds")

Time cost for saving the whole dataset: 0.06981611251831055 seconds
Time cost for loading the whole dataset: 0.1246638298034668 seconds
Time cost for loading 1k random instances: 0.0 seconds
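
The `0.0 seconds` reading probably reflects two things: `time.time()` can have coarse resolution on some platforms, and the timed loop only indexes a list that is already in memory. If the goal is to load 1k random instances from disk without reading the whole file, one option is a JSON Lines file plus a list of byte offsets, timed with `time.perf_counter()`. This is a sketch under those assumptions, not the method used above. The helper names `save_jsonl` and `load_random`, the file name, and the placeholder items are hypothetical.

```python
import json
import os
import random
import tempfile
import time


def save_jsonl(items, path):
    """Write one JSON object per line; return the byte offset of each line."""
    offsets = []
    with open(path, "wb") as f:
        for item in items:
            offsets.append(f.tell())
            f.write(json.dumps(item).encode("utf-8") + b"\n")
    return offsets


def load_random(path, offsets, k, seed=0):
    """Read k randomly chosen records from disk without loading the rest."""
    chosen = random.Random(seed).sample(offsets, k)
    records = []
    with open(path, "rb") as f:
        for offset in sorted(chosen):  # ascending offsets keep the reads sequential
            f.seek(offset)
            records.append(json.loads(f.readline()))
    return records


# Placeholder items standing in for processed_data
toy_items = [
    {"feature": f"[CLS]a{i}[SEP]b{i}", "label": {"strategy": "Question"}}
    for i in range(5000)
]
path = os.path.join(tempfile.mkdtemp(), "processed.jsonl")

start = time.perf_counter()
offsets = save_jsonl(toy_items, path)
print(f"Time cost for saving: {time.perf_counter() - start:.6f} seconds")

start = time.perf_counter()
sample = load_random(path, offsets, 1000)
print(f"Time cost for loading 1k random instances: {time.perf_counter() - start:.6f} seconds")
```

The offsets list would need to be saved alongside the data file (for example as a small JSON file) to be useful across sessions.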


### 4. Generate embeddings


In [7]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
torch.cuda.get_device_name(0)

Using device: cuda


'Tesla T4'

In [10]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.34.1-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Col

In [11]:
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import torch
import numpy as np

model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
model.to(device)


sentence_embeddings = []


_, selected_instances = random_instances(processed_data, 100)

for item in selected_instances:
    for text, labels in item.items():
        tokens = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
        tokens.to(device)
        with torch.no_grad():
            outputs = model(**tokens)
            pooled_output = outputs['last_hidden_state'].mean(dim=1).squeeze(0)  # Mean-pool all token embeddings (not the [CLS] vector alone)
        sentence_embeddings.append(pooled_output.to("cpu"))
len(sentence_embeddings)

100

In [12]:
def find_closest_pair(sentence_embeddings):
    max_similarity = float('-inf')
    similarity_list = []
    most_similar_embeddings = ('', '')
    pair_list = []

    for i in range(len(sentence_embeddings)):
        for j in range(i + 1, len(sentence_embeddings)):

            embedding1, embedding2 = sentence_embeddings[i], sentence_embeddings[j]

            # Calculate the cosine similarity
            similarity = np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2))

            # Update max similarity
            if similarity > max_similarity:
                max_similarity = similarity
                similarity_list.append(max_similarity)
                most_similar_embeddings = (i, j)
                pair_list.append(most_similar_embeddings)
    return similarity_list, pair_list

similarity_list, pair_list = find_closest_pair(sentence_embeddings)
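
The loop above appends a pair each time a new running maximum is found, so `pair_list` ends with the closest pair but does not, in general, hold the ten closest pairs. A true top-k ranking needs all pairwise similarities. The sketch below computes them with NumPy on random placeholder vectors. The function name `top_k_pairs` and the toy data are hypothetical, and rows with zero norm are not handled.

```python
import numpy as np


def top_k_pairs(embeddings, k=10):
    """Return the k most similar (i, j, cosine) triples with i < j."""
    X = np.asarray(embeddings, dtype=np.float64)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalise each row
    sim = X @ X.T                                     # all pairwise cosine similarities
    i_idx, j_idx = np.triu_indices(len(X), k=1)       # each unordered pair exactly once
    scores = sim[i_idx, j_idx]
    order = np.argsort(scores)[::-1][:k]              # highest similarity first
    return [(int(i_idx[n]), int(j_idx[n]), float(scores[n])) for n in order]


# Random placeholder vectors standing in for sentence embeddings
rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 8))
toy[7] = toy[42] * 2.0  # same direction as row 42, so their cosine similarity is 1
best = top_k_pairs(toy, k=10)
print(best[0])
```

With the tensors from this notebook, something like `torch.stack(sentence_embeddings).numpy()` should give a suitable 2-D input array.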

In [14]:
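# pair_list holds each new running maximum found during the scan, so only the first
# pair printed is guaranteed to be the closest; the others are earlier maxima.
print(" --------  Top 10 close embeddings in 100 instances  --------")
for i in range(len(pair_list)-1, max(len(pair_list)-11, -1), -1):  # stop at index 0 instead of wrapping around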
    print("The pair of instances with the close embeddings:")
    print(selected_instances[pair_list[i][0]])
    print(selected_instances[pair_list[i][1]])
    print("Cosine Similarity:", similarity_list[i])
    print('\n')

#     print("Embeddings are: ")
#     print(sentence_embeddings[pair_list[i][0]])
#     print(sentence_embeddings[pair_list[i][1]])

 --------  Top 10 close embeddings in 100 instances  --------
The pair of instances with the close embeddings:
{'[CLS]hello[SEP]how are you today?': {'problem_type': 'job crisis', 'emotion_type': 'sadness', 'strategy': 'Question'}}
{'[CLS]Hello[SEP]Hello. How are you?': {'problem_type': 'academic pressure', 'emotion_type': 'sadness', 'strategy': 'Others'}}
Cosine Similarity: 0.95120865


The pair of instances with the close embeddings:
{"[CLS]Thank you! A long time ago his sister told me that he cheated on me and I believed her and so I wanted to get him back.[SEP]It says a lot about your character that you feel badly about this, that you can see and recognize you acted in a way you don't agree with. That's something to build on.": {'problem_type': 'ongoing depression', 'emotion_type': 'shame', 'strategy': 'Affirmation and Reassurance'}}
{"[CLS]A women fell asleep at the wheel in 2013 and hit me going through an intersection and totaled by car as well, but I had only had a concussion a

### 5. Finetuning

+ **Model:** bert-base-uncased
    + params: 110M (the fine-tuning code below sets `num_hidden_layers = 6`, so only the first 6 of the 12 encoder layers are loaded)
    + url: https://huggingface.co/bert-base-uncased
    
+ **Data set:** The first 10,000 processed instances from the ESConv data set. They are randomly split, with 70% used as the training set and 30% as the validation set. The validation set does not participate in parameter updates.

    
+ **Downstream Task:** Multi-class problem-type classification. Classes include `job crisis`, `problems with friends`, `ongoing depression`, `breakup with partner`, `academic pressure`, and others.

+ **Computing resource:** Tesla T4


In [71]:
selected_data = processed_data[:10000]
problem_types = []
for data in selected_data:
    for k in data:
        if 'problem_type' in data[k]:
            if data[k]['problem_type'] not in problem_types:
                problem_types.append(data[k]['problem_type'])

problem_types

['job crisis',
 'problems with friends',
 'ongoing depression',
 'breakup with partner',
 'academic pressure',
 'conflict with parents',
 'Procrastination',
 'Alcohol Abuse',
 'Issues with Parents',
 'Sleep Problems',
 'Appearance Anxiety',
 'School Bullying',
 'Issues with Children']

In [73]:
num2label = {i: name for i, name in enumerate(problem_types)}  # avoid shadowing the built-in str
label2num = {value: key for key, value in num2label.items()}
label2num

{'job crisis': 0,
 'problems with friends': 1,
 'ongoing depression': 2,
 'breakup with partner': 3,
 'academic pressure': 4,
 'conflict with parents': 5,
 'Procrastination': 6,
 'Alcohol Abuse': 7,
 'Issues with Parents': 8,
 'Sleep Problems': 9,
 'Appearance Anxiety': 10,
 'School Bullying': 11,
 'Issues with Children': 12}

In [75]:
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split


class MyDataset(Dataset):
    """Flatten {feature: label} items into parallel lists of texts and class ids."""

    def __init__(self, data, label2num):
        self.data = data
        self.texts = []
        self.labels = []
        self.label2num = label2num

        # Extract each text and the integer id of its problem_type label
        for sample in data:
            for text, labels in sample.items():
                self.texts.append(text)
                problem_type = labels.get('problem_type')
                self.labels.append(self.label2num.get(problem_type))

    def __len__(self):
        # One entry per extracted text
        return len(self.texts)

    def __getitem__(self, index):
        return self.texts[index], self.labels[index]

small_ESConv = selected_data
dataset = MyDataset(small_ESConv, label2num)

In [76]:
# Split dataset: 0.7 training set and 0.3 validation set
train_dataset, val_dataset = train_test_split(dataset, test_size=0.3, random_state=42)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

In [77]:
# Imbalanced data
total_labels = []
for batch in train_loader:
    text, labels = batch
    total_labels = total_labels + labels.tolist()

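# Note: these values are class frequencies. Used as CrossEntropyLoss weights they give
# frequent classes MORE weight; inverse frequencies would up-weight the rare classes.
class_weights = []
for num in num2label:
    class_weights.append(total_labels.count(num)/len(total_labels))
    print(f"label:{num2label[num]} {total_labels.count(num)/len(total_labels):.2%}")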

label:job crisis 21.57%
label:problems with friends 13.47%
label:ongoing depression 26.87%
label:breakup with partner 19.05%
label:academic pressure 11.28%
label:conflict with parents 0.76%
label:Procrastination 1.01%
label:Alcohol Abuse 1.07%
label:Issues with Parents 0.65%
label:Sleep Problems 2.22%
label:Appearance Anxiety 0.91%
label:School Bullying 0.14%
label:Issues with Children 0.99%


In [67]:
from transformers import BertConfig


def train(model, train_loader, optimizer, device):
    model.train()
    total_loss = 0
    for batch in train_loader:
        texts, labels = batch
        optimizer.zero_grad()
        inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=128, return_token_type_ids=True)
        inputs.to(device)
        labels = labels.to(device)
        outputs = model(**inputs, labels=labels)
        loss = criterion(outputs.logits, labels)
        # loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    return total_loss / len(train_loader)

def evaluate(model, val_loader, device):
    model.eval()
    total_loss = 0
    correct_predictions = 0
    total_samples = 0
    with torch.no_grad():
        for batch in val_loader:
            texts, labels = batch
            inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=128, return_token_type_ids=True)
            inputs.to(device)
            labels = labels.to(device)
            outputs = model(**inputs, labels=labels)
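            loss = outputs.loss  # unweighted loss; train() uses the class-weighted criterion, so the two are not directly comparable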
            total_loss += loss.item()
            logits = outputs.logits
            predicted_labels = torch.argmax(logits, dim=1)
            correct_predictions += (predicted_labels == labels).sum().item()
            total_samples += len(labels)
    accuracy = correct_predictions / total_samples
    return total_loss / len(val_loader), accuracy

class_weights = torch.tensor(class_weights, device=device)
criterion = nn.CrossEntropyLoss(weight=class_weights)

# Load pre-trained BERT model for classification
config = BertConfig.from_pretrained("bert-base-uncased")
config.num_labels = len(label2num)  # the number of classes
config.num_hidden_layers = 6
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", config=config) 
model.dropout = nn.Dropout(0.2)  # alleviate overfitting
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# optimizer = AdamW(model.classifier.parameters(), lr=2e-5, weight_decay=1e-3)  # L2 Regularization
optimizer = AdamW(model.parameters(), lr=2e-5)

num_epochs = 15
model.to(device)

train_loss_list = []
val_loss_list = []

best_val_loss = float("inf")  
patience = 3  # set the tolerated number of consecutive epochs
counter = 0  # count consecutive epochs

# 10000 instance
for epoch in range(num_epochs):
    train_loss = train(model, train_loader, optimizer, device)
    train_loss_list.append(train_loss)
    val_loss, val_accuracy = evaluate(model, val_loader, device)
    val_loss_list.append(val_loss)
    print(f"Epoch {epoch + 1}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}, Val Accuracy: {val_accuracy:.2%}")

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        counter = 0
    else:
        counter += 1

    if counter >= patience:
        print(f"Early stopping after {epoch + 1} epochs without improvement.")
        break


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Train Loss: 1.3214, Val Loss: 1.2538, Val Accuracy: 50.40%
Epoch 2, Train Loss: 1.0632, Val Loss: 1.1809, Val Accuracy: 53.03%
Epoch 3, Train Loss: 0.8963, Val Loss: 1.1465, Val Accuracy: 55.17%
Epoch 4, Train Loss: 0.7134, Val Loss: 1.2507, Val Accuracy: 56.67%
Epoch 5, Train Loss: 0.5342, Val Loss: 1.4046, Val Accuracy: 56.30%
Epoch 6, Train Loss: 0.3870, Val Loss: 1.5786, Val Accuracy: 55.73%
Early stopping after 6 epochs without improvement.
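
In the log above, validation loss is lowest at epoch 3 and rises afterwards, so the weights held at the stopping point (epoch 6) are not the best ones seen. One way to handle this is to snapshot the state whenever validation loss improves. The sketch below replays the validation losses from the log through a small helper. The class name `EarlyStopper` and the placeholder `state` dictionaries are hypothetical.

```python
import copy


class EarlyStopper:
    """Track the best validation loss and signal when patience runs out."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.best_state = None
        self.counter = 0

    def step(self, epoch, val_loss, state=None):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            # Deep copy so later training steps cannot mutate the snapshot
            self.best_state = copy.deepcopy(state)
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience


# Validation losses copied from the training log above
val_losses = [1.2538, 1.1809, 1.1465, 1.2507, 1.4046, 1.5786]
stopper = EarlyStopper(patience=3)
for epoch, loss in enumerate(val_losses, start=1):
    # A real loop would pass model.state_dict() here instead of a placeholder
    if stopper.step(epoch, loss, state={"epoch": epoch}):
        break
print(stopper.best_epoch, stopper.best_loss, epoch)
```

In the training loop, `model.state_dict()` could be passed as `state`, and `stopper.best_state` saved with `torch.save` once the loop ends.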


In [70]:
torch.save(model.state_dict(), '10000sample_problemclassification.pth' )