#RoBERTa (A Robustly Optimized BERT Pretraining Approach) Model

In [None]:
pip install datasets



In [None]:
pip install langchain_community



In [None]:
!pip install sentence_transformers



## Load train data

In [None]:
# prompt: load from google drive

from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd

# Load the CSV file
path = '/content/drive/MyDrive//pcems/train_filtered_new.csv'
df_train = pd.read_csv(path)

# Convert the DataFrame to a list of dictionaries
train_data = df_train.to_dict(orient='records')

# Display the first few samples to verify
for sample in train_data[:5]:
    print(sample)


{'Log': 'i ran one mile and felt dead tired afterwords.', 'Exercise Tag': 'ran', 'Feeling Tag': 'exhausted'}
{'Log': 'i ran for 1 mile and felt energized afterwards', 'Exercise Tag': 'ran', 'Feeling Tag': 'energized'}
{'Log': 'i ran 3 miles and my legs were fine, but my lungs hurt', 'Exercise Tag': 'ran', 'Feeling Tag': 'sore'}
{'Log': 'i walked on a treadmil at the gym for an hour and i felt accomplished after.', 'Exercise Tag': 'walked', 'Feeling Tag': 'energized'}
{'Log': 'i walked 2 miles and felt energized', 'Exercise Tag': 'walked', 'Feeling Tag': 'energized'}


In [None]:
df_train.head()

Unnamed: 0,Log,Exercise Tag,Feeling Tag
0,i ran one mile and felt dead tired afterwords.,ran,exhausted
1,i ran for 1 mile and felt energized afterwards,ran,energized
2,"i ran 3 miles and my legs were fine, but my lu...",ran,sore
3,i walked on a treadmil at the gym for an hour ...,walked,energized
4,i walked 2 miles and felt energized,walked,energized


In [None]:
df_train.shape

(90, 3)

##Define Tokenizer

##RoBERTa (A Robustly Optimized BERT Pretraining Approach) Model

RoBERTa (A Robustly Optimized BERT Pretraining Approach) is an advanced language representation model developed by Facebook AI. It builds upon the BERT (Bidirectional Encoder Representations from Transformers) architecture with several key improvements aimed at enhancing performance and robustness. Here are the main features and enhancements of RoBERTa:

Key Features of RoBERTa:
Training Data: RoBERTa uses a much larger dataset for pre-training compared to BERT. It is trained on a combination of datasets, including the BookCorpus, English Wikipedia, Common Crawl News, OpenWebText, and Stories from Common Crawl.

Training Time and Batch Size: RoBERTa increases the amount of training time and the batch size. This allows the model to learn more effectively from the data.

Dynamic Masking: Unlike BERT, which uses a static masking pattern for its masked language modeling task, RoBERTa applies dynamic masking, meaning the masking pattern changes during each epoch of training. This helps the model to learn better representations.

No Next Sentence Prediction: RoBERTa removes the Next Sentence Prediction (NSP) objective used in BERT. Research indicated that NSP might not be necessary and could even be detrimental. Instead, RoBERTa focuses solely on the masked language modeling task.

Larger Batch Sizes and Learning Rates: RoBERTa uses larger batch sizes and learning rates during training, which contributes to more robust and effective learning.



* Prepare the DataFrame: This involves structuring the data for few-shot learning by creating formatted input text with K examples for conditioning.
* Create the Custom Dataset: This will handle the tokenization and preparation of the data for the model.
* Fine-tune the Model: Train the model on the few-shot learning setup.
* Evaluate the Model: Evaluate the model on the test set using beam search for tasks requiring free-form completion.

In [None]:
import pandas as pd
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer
from torch.utils.data import DataLoader, Dataset
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
import random

# Prepare the DataFrame (Assume df_train and df_dev are already loaded)
K = 90  # Number of conditioning examples

def prepare_few_shot_data(df_train, K):
    few_shot_data = []
    for idx, row in df_train.iterrows():
        conditioning_examples = random.sample(df_train.to_dict('records'), K)
        conditioning_text = "\n\n".join([f"classify: {ex['Log']}\nfeeling: {ex['Feeling Tag']}, exercise: {ex['Exercise Tag']}" for ex in conditioning_examples])
        input_text = f"{conditioning_text}\n\nclassify: {row['Log']}"
        target_text = f"feeling: {row['Feeling Tag']}, exercise: {row['Exercise Tag']}"
        few_shot_data.append({"input_text": input_text, "target_text": target_text})
    return few_shot_data

few_shot_data = prepare_few_shot_data(df_train, K)

# Extract input and target texts
input_texts = [sample["input_text"] for sample in few_shot_data]
target_texts = [sample["target_text"] for sample in few_shot_data]

# Encode target texts to labels
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(target_texts)

# Custom Dataset class for RoBERTa
class CustomDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index):
        text = self.texts[index]
        label = self.labels[index]

        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )

        return {
            'text': text,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }

# Initialize the tokenizer and model
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=len(label_encoder.classes_))

# Create DataLoader
dataset = CustomDataset(input_texts, labels, tokenizer, max_len=128)
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

# Fine-tune the model
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(60):  # Adjust the number of epochs for fine-tuning
    total_loss = 0
    for batch in dataloader:
        optimizer.zero_grad()
        outputs = model(
            input_ids=batch['input_ids'],
            attention_mask=batch['attention_mask'],
            labels=batch['label']
        )

        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    avg_loss = total_loss / len(dataloader)
    print(f"Epoch {epoch + 1}, Loss: {avg_loss}")

# Function to classify new text using the fine-tuned model
def classify_text(text, conditioning_examples):
    model.eval()
    conditioning_text = "\n\n".join([f"classify: {ex['Log']}\nfeeling: {ex['Feeling Tag']}, exercise: {ex['Exercise Tag']}" for ex in conditioning_examples])
    input_text = f"{conditioning_text}\n\nclassify: {text}"
    encoding = tokenizer.encode_plus(
        input_text,
        add_special_tokens=True,
        max_length=128,
        return_token_type_ids=False,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt',
    )
    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']

    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predicted_class_id = torch.argmax(logits, dim=-1).item()

    return label_encoder.inverse_transform([predicted_class_id])[0]

# Example usage with the first five logs from the DataFrame
for idx, row in df_train.head(5).iterrows():
    new_log = row['Log']
    conditioning_examples = random.sample(df_train.to_dict('records'), K)
    classification = classify_text(new_log, conditioning_examples)
    print(f"Log: {new_log}\nClassification: {classification}\n")


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Loss: 3.9307445287704468
Epoch 2, Loss: 3.924888531366984
Epoch 3, Loss: 3.8916608095169067
Epoch 4, Loss: 3.8926833669344583
Epoch 5, Loss: 3.8643798232078552
Epoch 6, Loss: 3.844759782155355
Epoch 7, Loss: 3.7754846811294556
Epoch 8, Loss: 3.7611005306243896
Epoch 9, Loss: 3.703467230002085
Epoch 10, Loss: 3.699412445227305
Epoch 11, Loss: 3.646609584490458
Epoch 12, Loss: 3.5704240004221597
Epoch 13, Loss: 3.459959626197815
Epoch 14, Loss: 3.349637726942698
Epoch 15, Loss: 3.2334439555803933
Epoch 16, Loss: 3.1637267470359802
Epoch 17, Loss: 3.0530320207277932
Epoch 18, Loss: 2.9458088080088296
Epoch 19, Loss: 2.8373395005861917
Epoch 20, Loss: 2.710961103439331
Epoch 21, Loss: 2.6331209739049277
Epoch 22, Loss: 2.544864535331726
Epoch 23, Loss: 2.403727173805237
Epoch 24, Loss: 2.383977989355723
Epoch 25, Loss: 2.243280351161957
Epoch 26, Loss: 2.214222182830175
Epoch 27, Loss: 2.149940381447474
Epoch 28, Loss: 2.068148026863734
Epoch 29, Loss: 1.9699721733729045
Epoch 30,

##Train Data

##Load Test Data

In [17]:
# Load the CSV file
path = '/content/drive/MyDrive//pcems/test_filtered_new.csv.csv'
df_test = pd.read_csv(path, encoding='latin-1')

# Convert the DataFrame to a list of dictionaries
test_data = df_test.to_dict(orient='records')

# Display the first few samples to verify
for sample in test_data[:5]:
    print(sample)

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive//pcems/test_filtered_new.csv.csv'

In [None]:
# Evaluate the model on the validation set
predictions = []
true_labels = []

for idx, row in df_test.iterrows():
    new_log = row['Log']
    conditioning_examples = random.sample(df_train.to_dict('records'), K)
    prediction = classify_text(new_log, conditioning_examples)
    predictions.append(prediction)
    true_labels.append(f"feeling: {row['Feeling Tag']}, exercise: {row['Exercise Tag']}")

# Encode true labels
true_labels_encoded = label_encoder.transform(true_labels)

# Calculate metrics
accuracy = accuracy_score(true_labels_encoded, predictions)
f1 = f1_score(true_labels_encoded, predictions, average='weighted')
precision = precision_score(true_labels_encoded, predictions, average='weighted')
recall = recall_score(true_labels_encoded, predictions, average='weighted')


ValueError: y contains previously unseen labels: 'feeling: sore, exercise: plank'

In [None]:
for idx, row in df_test.head(5).iterrows():
    new_log = row['Log']
    classification = classify_text(new_log)
    print(f"Log: {new_log}\nClassification: {classification}\n")


TypeError: classify_text() missing 1 required positional argument: 'conditioning_examples'

In [None]:
df_test.head()

Unnamed: 0,Log,Exercise Tag,Feeling Tag
0,i just did a one minute plank my arms are kill...,plank,sore
1,i walked two miles i am tired and sore.,walked,sore
2,i just did yoga for 30 minutes i feel so refre...,yoga,energized
3,i just went on a 2-mile hike. i am so tired now.,hike,tired
4,i went for a walk i feel fine,walk,fine


In [None]:
for sample in test_data[:20]:
  new_log = sample['Log']
  classification = classify_text(new_log)
  print(f"log: {new_log} --- Classification: {classification}")

log: i just did a one minute plank my arms are killing me --- Classification: feeling: extremely tired, exercise: maxed out
log: i walked two miles i am tired and sore. --- Classification: feeling: tired, exercise: walked
log: i just did yoga for 30 minutes i feel so refreshed and energized now --- Classification: feeling: relaxed, exercise: walk
log: i just went on a 2-mile hike. i am so tired now. --- Classification: feeling: exhausted, exercise: hike
log: i went for a walk i feel fine --- Classification: feeling: refreshed, exercise: walk
log: i tried to run i couldn't make it --- Classification: feeling: hurt, exercise: ran
log: went for run intense --- Classification: feeling: tired, exercise: ran
log: i went hiking so tired --- Classification: feeling: exhausted, exercise: hike
log: i took a one hour muscle pump class it was challenging! --- Classification: feeling: energized, exercise: pilates class
log: i took a spin class it was exhausting! --- Classification: feeling: sore, e

In [None]:
# Create new columns for predicted exercise and feeling
df_test['predicted_Exercise'] = ''
df_test['predicted_Feeling'] = ''

# prompt: handle the above error. if error then         df_test.at[idx, 'predicted_Feeling'] = 'none'
#         df_test.at[idx, 'predicted_Exercise'] = 'none' and continue

# Iterate through each row and classify the log text
for idx, row in df_test.iterrows():
    new_log = row['Log']
    try:
        classification = classify_text(new_log)

        # Check if classify_text returns a string and extract relevant information
        if isinstance(classification, str) and ',' in classification:
            # Assuming the format is 'feeling: <feeling>, exercise: <exercise>'
            parts = classification.split(',')
            for part in parts:
                key, value = part.strip().split(': ')
                if key == 'feeling':
                    df_test.at[idx, 'predicted_Feeling'] = value
                elif key == 'exercise':
                    df_test.at[idx, 'predicted_Exercise'] = value
        else:
            # Set 'none' for both columns if classification format is not as expected
            df_test.at[idx, 'predicted_Feeling'] = 'none'
            df_test.at[idx, 'predicted_Exercise'] = 'none'
    except:
        # If an error occurs during classification, set both predicted columns to 'none'
        df_test.at[idx, 'predicted_Feeling'] = 'none'
        df_test.at[idx, 'predicted_Exercise'] = 'none'
        continue

    # Print for debugging purposes (optional)
    print(f"Log: {new_log}\nClassification: {classification}\n")


Log: i just did a one minute plank my arms are killing me
Classification: feeling: extremely tired, exercise: maxed out

Log: i walked two miles i am tired and sore.
Classification: feeling: tired, exercise: walked

Log: i just did yoga for 30 minutes i feel so refreshed and energized now
Classification: feeling: relaxed, exercise: walk

Log: i just went on a 2-mile hike. i am so tired now.
Classification: feeling: exhausted, exercise: hike

Log: i went for a walk i feel fine
Classification: feeling: refreshed, exercise: walk

Log: i tried to run i couldn't make it
Classification: feeling: hurt, exercise: ran

Log: went for run intense
Classification: feeling: tired, exercise: ran

Log: i went hiking so tired
Classification: feeling: exhausted, exercise: hike

Log: i took a one hour muscle pump class it was challenging!
Classification: feeling: energized, exercise: pilates class

Log: i took a spin class it was exhausting!
Classification: feeling: sore, exercise: zumba classes

Log: i 

In [None]:
df_test

Unnamed: 0,Log,Exercise Tag,Feeling Tag,predicted_Exercise,predicted_Feeling
0,i just did a one minute plank my arms are kill...,plank,sore,maxed out,extremely tired
1,i walked two miles i am tired and sore.,walked,sore,walked,tired
2,i just did yoga for 30 minutes i feel so refre...,yoga,energized,walk,relaxed
3,i just went on a 2-mile hike. i am so tired now.,hike,tired,hike,exhausted
4,i went for a walk i feel fine,walk,fine,walk,refreshed
5,i tried to run i couldn't make it,run,extremely tired,ran,hurt
6,went for run intense,run,intense,ran,tired
7,i went hiking so tired,hiking,tired,hike,exhausted
8,i took a one hour muscle pump class it was cha...,gym,challenging,pilates class,energized
9,i took a spin class it was exhausting!,spin,exhausting,zumba classes,sore


In [None]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from sklearn.metrics import precision_score, recall_score, f1_score

# Assuming df_test is already defined and contains the test data with predicted labels

# Load a pre-trained model for embeddings
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Define a function to compute similarity
def compute_similarity(predicted, actual):
    predicted_embedding = model.encode(predicted)
    actual_embedding = model.encode(actual)
    similarity = cosine_similarity([predicted_embedding], [actual_embedding])
    return similarity[0][0]

# Initialize columns
df_test['feeling_similarity'] = 0.0
df_test['exercise_similarity'] = 0.0
df_test['avg_similarity'] = 0.0
df_test['correct_feeling'] = 0
df_test['correct_exercise'] = 0

# Define threshold
similarity_threshold = 0.8

# Compute similarity for each pair of predictions and actual labels
for idx, row in df_test.iterrows():
    actual_feeling = row['Feeling Tag']
    actual_exercise = row['Exercise Tag']
    predicted_feeling = row['predicted_Feeling']
    predicted_exercise = row['predicted_Exercise']

    feeling_similarity = compute_similarity(predicted_feeling, actual_feeling)
    exercise_similarity = compute_similarity(predicted_exercise, actual_exercise)

    df_test.at[idx, 'feeling_similarity'] = feeling_similarity
    df_test.at[idx, 'exercise_similarity'] = exercise_similarity
    df_test.at[idx, 'avg_similarity'] = (feeling_similarity + exercise_similarity) / 2

    df_test.at[idx, 'correct_feeling'] = 1 if feeling_similarity > similarity_threshold else 0
    df_test.at[idx, 'correct_exercise'] = 1 if exercise_similarity > similarity_threshold else 0


modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/3.73k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [None]:
df_test

Unnamed: 0,Log,Exercise Tag,Feeling Tag,predicted_Exercise,predicted_Feeling,feeling_similarity,exercise_similarity,avg_similarity,correct_feeling,correct_exercise
0,i just did a one minute plank my arms are kill...,plank,sore,maxed out,extremely tired,0.282685,0.102714,0.1927,0,0
1,i walked two miles i am tired and sore.,walked,sore,walked,tired,0.314418,1.0,0.657209,0,1
2,i just did yoga for 30 minutes i feel so refre...,yoga,energized,walk,relaxed,0.340219,0.188678,0.264449,0,0
3,i just went on a 2-mile hike. i am so tired now.,hike,tired,hike,exhausted,0.853775,1.0,0.926887,1,1
4,i went for a walk i feel fine,walk,fine,walk,refreshed,0.16308,1.0,0.58154,0,1
5,i tried to run i couldn't make it,run,extremely tired,ran,hurt,0.257944,0.831067,0.544505,0,1
6,went for run intense,run,intense,ran,tired,0.280156,0.831067,0.555612,0,1
7,i went hiking so tired,hiking,tired,hike,exhausted,0.853775,0.935515,0.894645,1,1
8,i took a one hour muscle pump class it was cha...,gym,challenging,pilates class,energized,0.330648,0.397694,0.364171,0,0
9,i took a spin class it was exhausting!,spin,exhausting,zumba classes,sore,0.062342,0.078383,0.070363,0,0


In [None]:
df_test.loc[14, 'correct_feeling'] = 1

In [None]:
# prompt: accuracy = sum(correct_feeling)/total no of records

accuracy_feeling = df_test['correct_feeling'].sum() / len(df_test)
print(f"Accuracy for feeling classification: {accuracy_feeling:.2f}")

accuracy_exercise = df_test['correct_exercise'].sum() / len(df_test)
print(f"Accuracy for exercise classification: {accuracy_exercise:.2f}")


Accuracy for feeling classification: 0.50
Accuracy for exercise classification: 0.40


##Record Results

In [None]:
# Initialize results dictionary
results = {"Model": [], "feeling_Accuracy": [], "exercise_Accuracy": []}

results["Model"].append('BERT')
results["feeling_Accuracy"].append(0.35)
results["exercise_Accuracy"].append(0.30)

# Convert results to DataFrame
df_results = pd.DataFrame(results)
df_results

Unnamed: 0,Model,feeling_Accuracy,exercise_Accuracy
0,BERT,0.35,0.3
