# Fine-tuning Pre-trained Models on Journal Entry Data


There are a few pre-trained sentiment analysis models that I will fine-tune and evaluate for use in the project. Some popular options I have found are:

- BERT (Bidirectional Encoder Representations from Transformers): BERT is a pre-trained transformer encoder that, when it was released, achieved state-of-the-art results on a variety of NLP tasks, including sentiment analysis.
- RoBERTa (Robustly Optimized BERT Approach): RoBERTa keeps BERT's architecture but is pre-trained for longer on a larger dataset, with tuned hyperparameters and without the next-sentence-prediction objective. It generally outperforms BERT on standard benchmarks, including sentiment classification.
- DistilBERT (a distilled version of BERT): DistilBERT is a smaller and faster model trained to mimic BERT, and it is still capable of achieving good results on sentiment analysis tasks.
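
As a rough sketch, the three candidates can be kept in one lookup table so that each is loaded the same way. The checkpoint names are the standard Hugging Face Hub identifiers for the base models; `CANDIDATE_CHECKPOINTS`, `NUM_LABELS`, and `load_candidate` are hypothetical names introduced here for illustration, and the three-label head assumes the `sentiment_id` values 0, 1, and 2 that appear in the dataset later in this notebook.

```python
# Hypothetical lookup table: model family -> Hugging Face Hub checkpoint name.
CANDIDATE_CHECKPOINTS = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "DistilBERT": "distilbert-base-uncased",
}

NUM_LABELS = 3  # assumption: sentiment_id takes the values 0, 1, and 2


def load_candidate(name):
    """Load the tokenizer and classification model for one candidate.

    Calling this downloads the pre-trained weights on first use.
    """
    # Imported lazily so the table can be inspected without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    checkpoint = CANDIDATE_CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=NUM_LABELS
    )
    return tokenizer, model
```

Loading every candidate through the same function keeps the later comparison fair, since only the checkpoint name changes between runs.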

To fine-tune any of these models, I will use a dataset of journal entries that I created and labeled with sentiment using OpenAI's Completions API. Each model is then trained on this dataset to predict the sentiment of new journal entries.

Here are some additional tips for fine-tuning a sentiment analysis model for journal entries:

- Make sure that your training dataset is representative of the journal entries that you want to analyze. This means that it should contain a variety of topics, writing styles, and sentiment scores.
- Use a relatively small learning rate so that fine-tuning adjusts the pre-trained weights gradually and is less likely to overfit the training data.
- Monitor the model's performance on a held-out validation set to check that it is learning effectively and to catch overfitting early.
- Once the model has been fine-tuned, you can use it to predict the sentiment of new journal entries and track your sentiment over time. This can be a valuable tool for understanding your own emotional state and identifying trends in your thinking and behavior.
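
The held-out validation set mentioned above can be drawn so that every sentiment is represented in proportion. The following is a minimal sketch of such a stratified split in plain Python, not the method used in this notebook; `stratified_split`, `val_fraction`, and the toy label list are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.1, seed=0):
    """Return (train_indices, val_indices) with each label represented proportionally."""
    # Group row positions by their label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Hold out the same fraction of every class (at least one row).
        n_val = max(1, round(len(indices) * val_fraction))
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val)


# Toy labels standing in for real sentiment labels: 300 entries per class.
labels = [0] * 300 + [1] * 300 + [2] * 300
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 810 90
```

Because the same fraction is held out from every class, validation accuracy is not skewed toward whichever sentiment happens to dominate the held-out rows.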

In [8]:
import pandas as pd
import transformers

In [9]:
df = pd.read_csv('new_journal_data.csv')
df

| Unnamed: 0.1 | Unnamed: 0 | Entry | sentiment_id |
| --- | --- | --- | --- |
| 0 | 0 | "\n\nDear Diary,\n\nToday has been such a wonde..." | 0 |
| 1 | 1 | "\n\nDear Diary,\n\nToday has been a wonderful ..." | 0 |
| 2 | 2 | "\n\nDear Diary,\n\nToday has been such a wonde..." | 0 |
| 3 | 3 | "\n\nDear Diary,\n\nToday, I am feeling incredi..." | 0 |
| 4 | 4 | "\n\nDear Diary,\n\nToday has been an amazing d..." | 0 |
| ... | ... | ... | ... |
| 895 | 895 | "\n\nDear Diary,\n\nToday has been a pretty une..." | 2 |
| 896 | 896 | "\n\nDear Diary,\n\nToday was a rather uneventf..." | 2 |
| 897 | 897 | "\n\nDear Diary,\n\nToday was a day that left m..." | 2 |
| 898 | 898 | "\n\nDear Diary,\n\nToday has been a pretty une..." | 2 |


First, we want to split the dataset into train and test splits. Right now, the dataset is in three sections (one per sentiment), so a simple positional split would leave an entire sentiment out of one of the splits. Let's interleave the sections so that the sentiments are still evenly dispersed but no longer grouped together.

In [10]:

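def reorder_dataset(df):
  """Interleaves journal entries so that sentiment_id cycles 0, 1, 2 repeating.

  Args:
    df: A Pandas DataFrame containing the journal entries, with a
      sentiment_id column and a unique index.

  Returns:
    A Pandas DataFrame containing the same rows in the interleaved order.
  """

  # DataFrame.append was removed in pandas 2.0, so sort instead of appending rows.
  # cumcount() gives each row its position within its own sentiment group;
  # ordering by that position, then by sentiment, yields 0, 1, 2, 0, 1, 2, ...
  # This also works when the groups are not exactly 300 rows each.
  order = pd.DataFrame({
      'position': df.groupby('sentiment_id').cumcount(),
      'sentiment_id': df['sentiment_id'],
  })
  return df.loc[order.sort_values(['position', 'sentiment_id']).index]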

# Read the dataset into a Pandas DataFrame
df = pd.read_csv('new_journal_data.csv')

# Reorder the dataset
new_df = reorder_dataset(df.copy())

# Save the reordered dataset to a CSV file
new_df.to_csv('journal_entries_reordered.csv', index=False)



In [11]:
new_df

| Unnamed: 0.1 | Unnamed: 0 | Entry | sentiment_id |
| --- | --- | --- | --- |
| 0 | 0 | "\n\nDear Diary,\n\nToday has been such a wonde..." | 0 |
| 300 | 300 | "\n\nDear Diary,\n\nToday has been one of those..." | 1 |
| 600 | 600 | "\n\nDear Diary,\n\nIt's been a while since I l..." | 2 |
| 1 | 1 | "\n\nDear Diary,\n\nToday has been a wonderful ..." | 0 |
| 301 | 301 | "\n\nDear Diary,\n\nToday has been a really tou..." | 1 |
| ... | ... | ... | ... |
| 598 | 598 | "\n\nDear Diary,\n\nToday has been one of those..." | 1 |
| 898 | 898 | "\n\nDear Diary,\n\nToday has been a pretty une..." | 2 |
| 299 | 299 | "\n\nDear Diary,\n\nToday has been such a wonde..." | 0 |
| 599 | 599 | "\n\nDear Diary,\n\nI am feeling so down and ne..." | 1 |
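
To illustrate why the interleaved order matters, the sketch below compares a plain positional 90/10 cut on sectioned and interleaved labels. It uses a synthetic label list with 300 entries per class rather than the real data, and `positional_split_counts` is a hypothetical helper introduced only for this check.

```python
from collections import Counter

# Synthetic stand-ins for the sentiment_id column.
sectioned = [0] * 300 + [1] * 300 + [2] * 300   # one block per sentiment
interleaved = [0, 1, 2] * 300                   # 0, 1, 2 repeating


def positional_split_counts(labels):
    """Count labels on each side of a positional 90/10 train/validation cut."""
    cut = len(labels) // 10 * 9
    return Counter(labels[:cut]), Counter(labels[cut:])


sectioned_train, sectioned_val = positional_split_counts(sectioned)
interleaved_train, interleaved_val = positional_split_counts(interleaved)

print("sectioned:  ", dict(sectioned_train), dict(sectioned_val))
print("interleaved:", dict(interleaved_train), dict(interleaved_val))
```

With the sectioned order, the validation part contains only sentiment 2 (90 rows), so accuracy on it says nothing about the other two classes. With the interleaved order, the same cut leaves 270 rows per class for training and 30 per class for validation.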


### Fine-tuning BERT

In [None]:
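# Requires the torch, transformers, and datasets packages
import torch
from datasets import Dataset
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the BERT tokenizer and model with a 3-class head (sentiment_id is 0, 1, or 2).
# No extra dropout layer is added: bert-base-uncased already applies dropout of 0.1.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
model.to(device)

# Define the optimizer and learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Load the reordered dataset of journal entries saved above
dataset = Dataset.from_csv("journal_entries_reordered.csv")

# Tokenize the entries; max_length=256 is an assumed cap (BERT accepts up to 512 tokens)
def tokenize(batch):
    return tokenizer(batch["Entry"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)
dataset = dataset.rename_column("sentiment_id", "labels")
dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# Split the dataset into training and validation sets
train_dataset = dataset.select(range(len(dataset) // 10 * 9))
val_dataset = dataset.select(range(len(dataset) // 10 * 9, len(dataset)))

# Create a training data loader
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Create a validation data loader
val_dataloader = DataLoader(val_dataset, batch_size=32)

# Train the model
model.train()
for epoch in range(10):
    for batch in train_dataloader:
        inputs = {"input_ids": batch["input_ids"].to(device), "attention_mask": batch["attention_mask"].to(device)}
        labels = batch["labels"].long().to(device)

        # Reset gradients left over from the previous step
        optimizer.zero_grad()

        # Passing labels makes the model compute the cross-entropy loss
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss

        loss.backward()
        optimizer.step()

# Evaluate the model on the validation set
model.eval()
val_loss = 0.0
correct = 0
with torch.no_grad():  # no gradients are needed for evaluation
    for batch in val_dataloader:
        inputs = {"input_ids": batch["input_ids"].to(device), "attention_mask": batch["attention_mask"].to(device)}
        labels = batch["labels"].long().to(device)

        outputs = model(**inputs, labels=labels)
        loss = outputs.loss

        val_loss += loss.item()
        correct += (outputs.logits.argmax(dim=1) == labels).sum().item()

# Average the loss over batches and the accuracy over individual examples
val_loss /= len(val_dataloader)
val_accuracy = correct / len(val_dataset)

print("Validation loss:", val_loss)
print("Validation accuracy:", val_accuracy)

# Save the fine-tuned model and its tokenizer
model.save_pretrained("fine-tuned-bert-sentiment-analysis")
tokenizer.save_pretrained("fine-tuned-bert-sentiment-analysis")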
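
Once the model has been saved, predicting the sentiment of new entries could look roughly like the sketch below. It assumes the directory name used in the cell above and loads the tokenizer from the base `bert-base-uncased` checkpoint; `chunked`, `predict_sentiment`, and `batch_size=16` are hypothetical choices made for illustration, and the function has not been run against the author's data.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_sentiment(entries, model_dir="fine-tuned-bert-sentiment-analysis", batch_size=16):
    """Return one predicted sentiment_id per journal entry."""
    # Imported lazily so the batching helper works without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Assumption: the tokenizer is taken from the base checkpoint that was fine-tuned.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()

    predictions = []
    with torch.no_grad():  # inference only, no gradients needed
        for chunk in chunked(entries, batch_size):
            inputs = tokenizer(chunk, truncation=True, padding=True, return_tensors="pt")
            logits = model(**inputs).logits
            # The highest-scoring class index is the predicted sentiment_id.
            predictions.extend(logits.argmax(dim=1).tolist())
    return predictions
```

A call such as `predict_sentiment(["Dear Diary, today was ..."])` would return a list with one class index per entry, using the same 0, 1, 2 encoding as the `sentiment_id` column.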
