# Fine-tuning Pre-trained Models on Journal Entry Data


Several pre-trained language models can be fine-tuned for sentiment analysis. I will fine-tune and evaluate a few of them for use in the project. The popular options I have found are:

- BERT (Bidirectional Encoder Representations from Transformers): BERT is a pre-trained transformer encoder that achieved state-of-the-art results on a variety of NLP tasks, including sentiment analysis, when it was released.
- RoBERTa (Robustly Optimized BERT Approach): RoBERTa keeps BERT's architecture but changes the pre-training recipe: it is trained longer, on more data, with larger batches and dynamic masking, and without the next-sentence-prediction objective. It generally outperforms BERT on standard benchmarks, which makes it a strong candidate for sentiment analysis.
- DistilBERT: DistilBERT is a distilled version of BERT that is smaller and faster while retaining most of BERT's accuracy, so it can still achieve good results on sentiment analysis tasks.

To fine-tune any of these models, I will use a dataset of sentiment-labeled journal entries that I created using OpenAI's Completions API. Each model is trained on this dataset to predict the sentiment of new journal entries.

Here are some additional tips for fine-tuning a sentiment analysis model for journal entries:

- Make sure that your training dataset is representative of the journal entries that you want to analyze. This means that it should contain a variety of topics, writing styles, and sentiment scores.
- Use a relatively small learning rate (values around 2e-5 to 5e-5 are common for BERT-style models) so that fine-tuning does not overwrite what the model learned during pre-training.
- Monitor the model's performance on a held-out validation set to ensure that it is learning effectively and not overfitting to the training data.
- Once the model has been fine-tuned, you can use it to predict the sentiment of new journal entries and track your sentiment over time. This can be a valuable tool for understanding your own emotional state and identifying trends in your thinking and behavior.

In [41]:
import pandas as pd
import torch
import transformers
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from torch.nn import Dropout
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [29]:
df = pd.read_csv('new_journal_data.csv')
df

Unnamed: 0.1,Unnamed: 0,Entry,sentiment_id
0,0,"\n\nDear Diary,\n\nToday has been such a wonde...",0
1,1,"\n\nDear Diary,\n\nToday has been a wonderful ...",0
2,2,"\n\nDear Diary,\n\nToday has been such a wonde...",0
3,3,"\n\nDear Diary,\n\nToday, I am feeling incredi...",0
4,4,"\n\nDear Diary,\n\nToday has been an amazing d...",0
...,...,...,...
895,895,"\n\nDear Diary,\n\nToday has been a pretty une...",2
896,896,"\n\nDear Diary,\n\nToday was a rather uneventf...",2
897,897,"\n\nDear Diary,\n\nToday was a day that left m...",2
898,898,"\n\nDear Diary,\n\nToday has been a pretty une...",2


First, we want to split the dataset into train and test sets. To make that easier, we reorder the dataset first. Right now, it is in 3 contiguous sections (one per sentiment). Let's interleave them so that the sentiments stay evenly dispersed but are no longer grouped into sections.

In [30]:

def reorder_dataset(df):
  """Reorders a dataset of journal entries so that the entries go in the order of 0, 1, 2 repeating.

  Args:
    df: A Pandas DataFrame containing the journal entries.

  Returns:
    A Pandas DataFrame containing the journal entries in the reordered order.
  """

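  # Position of each entry within its own sentiment group (0, 1, 2, ...)
  position = df.groupby("sentiment_id").cumcount()

  # Sort by that position, then by sentiment, to interleave the groups as 0, 1, 2, 0, 1, 2, ...
  # This replaces DataFrame.append (removed in pandas 2.0) and does not assume 300 rows per group.
  return (df.assign(_position=position)
            .sort_values(["_position", "sentiment_id"])
            .drop(columns="_position"))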

# Read the dataset into a Pandas DataFrame
df = pd.read_csv('new_journal_data.csv')

# Reorder the dataset
new_df = reorder_dataset(df.copy())

# Save the reordered dataset to a CSV file
new_df.to_csv('journal_entries_reordered.csv', index=False)




In [31]:
new_df

Unnamed: 0.1,Unnamed: 0,Entry,sentiment_id
0,0,"\n\nDear Diary,\n\nToday has been such a wonde...",0
300,300,"\n\nDear Diary,\n\nToday has been one of those...",1
600,600,"\n\nDear Diary,\n\nIt's been a while since I l...",2
1,1,"\n\nDear Diary,\n\nToday has been a wonderful ...",0
301,301,"\n\nDear Diary,\n\nToday has been a really tou...",1
...,...,...,...
598,598,"\n\nDear Diary,\n\nToday has been one of those...",1
898,898,"\n\nDear Diary,\n\nToday has been a pretty une...",2
299,299,"\n\nDear Diary,\n\nToday has been such a wonde...",0
599,599,"\n\nDear Diary,\n\nI am feeling so down and ne...",1
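
A stratified split is one way to get train and validation sets that keep the three sentiments in equal proportion. The sketch below uses scikit-learn's `train_test_split` with its `stratify` argument. Because that function shuffles rows by default, it should also work on the original, sectioned DataFrame. The helper name `stratified_split`, the 80/20 ratio, and the seed are assumptions for illustration, not values taken from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(df, label_col="sentiment_id", test_size=0.2, seed=42):
    """Split df into train and held-out parts with the same label proportions."""
    train_df, held_out_df = train_test_split(
        df,
        test_size=test_size,
        stratify=df[label_col],  # keep each sentiment's share equal in both parts
        random_state=seed,       # fixed seed so the split is reproducible
    )
    return train_df.reset_index(drop=True), held_out_df.reset_index(drop=True)


# Possible usage with the DataFrame from the cells above:
# train_df, val_df = stratified_split(new_df)
```

The held-out part can serve as the validation set mentioned in the tips at the top of this notebook.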


### Fine-tuning BERT

In [27]:
from transformers import AutoTokenizer

# Load the AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the journal entry
journal_entry = "I am having a great day!"

tokens = tokenizer(journal_entry, return_tensors="pt")

# Get the input_ids
input_ids = tokens["input_ids"]

print(tokens)
print(input_ids)

{'input_ids': tensor([[ 101, 1045, 2572, 2383, 1037, 2307, 2154,  999,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
tensor([[ 101, 1045, 2572, 2383, 1037, 2307, 2154,  999,  102]])


In [36]:
import pandas as pd
from transformers import AutoTokenizer

# Load the AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Read the journal entries from the dataframe
journal_entries = new_df["Entry"]

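# Tokenize each entry, then decode the token IDs back to text.
# The stored string is the normalized (lowercased) entry wrapped in
# [CLS] ... [SEP]; it is not the list of token IDs itself.
tokenized_journal_entries = []
for journal_entry in journal_entries:
    tokens = tokenizer(journal_entry, return_tensors="pt")
    input_ids = tokens["input_ids"]
    tokenized_journal_entry = tokenizer.decode(input_ids[0])
    tokenized_journal_entries.append(tokenized_journal_entry)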

# Add the tokenized journal entries to the dataframe
new_df["tokenized_journal_entry"] = tokenized_journal_entries

# Save the dataframe with the tokenized journal entries
new_df.to_csv("journal_entries_tokenized.csv", index=False)


In [38]:
tokenized_journal_entries_df = pd.read_csv('journal_entries_tokenized.csv')
tokenized_journal_entries_df

Unnamed: 0.1,Unnamed: 0,Entry,sentiment_id,tokenized_journal_entry
0,0,"\n\nDear Diary,\n\nToday has been such a wonde...",0,"[CLS] dear diary, today has been such a wonder..."
1,300,"\n\nDear Diary,\n\nToday has been one of those...",1,"[CLS] dear diary, today has been one of those ..."
2,600,"\n\nDear Diary,\n\nIt's been a while since I l...",2,"[CLS] dear diary, it's been a while since i la..."
3,1,"\n\nDear Diary,\n\nToday has been a wonderful ...",0,"[CLS] dear diary, today has been a wonderful d..."
4,301,"\n\nDear Diary,\n\nToday has been a really tou...",1,"[CLS] dear diary, today has been a really toug..."
...,...,...,...,...
895,598,"\n\nDear Diary,\n\nToday has been one of those...",1,"[CLS] dear diary, today has been one of those ..."
896,898,"\n\nDear Diary,\n\nToday has been a pretty une...",2,"[CLS] dear diary, today has been a pretty unev..."
897,299,"\n\nDear Diary,\n\nToday has been such a wonde...",0,"[CLS] dear diary, today has been such a wonder..."
898,599,"\n\nDear Diary,\n\nI am feeling so down and ne...",1,"[CLS] dear diary, i am feeling so down and neg..."


In [39]:
tokenized_journal_entries = tokenized_journal_entries_df['tokenized_journal_entry']
sentiment_scores = tokenized_journal_entries_df['sentiment_id']

# Create a new dataframe with the tokenized entries and the sentiment scores
new_df = pd.DataFrame({"journal_entry": tokenized_journal_entries, "sentiment": sentiment_scores})

# Save the new dataframe to a CSV file
new_df.to_csv("journal_entries_tokenized_and_sentiment.csv", index=False)

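Now the entries are saved alongside their sentiment labels. Note that the `journal_entry` column holds decoded text (lowercased and wrapped in `[CLS]` ... `[SEP]`), while BERT itself consumes the token IDs and attention mask that the tokenizer produces, so the tokenizer still has to be applied whenever the model is trained or queried.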
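
The fine-tuning step itself could look roughly like the sketch below: a plain PyTorch loop with `AdamW` and a small learning rate that reports validation accuracy after every epoch. The helper names (`make_loader`, `evaluate`, `fine_tune`), the batch size, the epoch count, and the learning rate are all assumptions for illustration; this is not the training code used in this project. The loop only relies on the model returning `.loss` and `.logits`, which Hugging Face sequence-classification models do when `labels` are passed.

```python
import torch
from torch.utils.data import DataLoader


def make_loader(texts, labels, tokenizer, batch_size=16, shuffle=False):
    """Batch raw texts and integer labels, tokenizing each batch on the fly."""
    def collate(items):
        batch_texts, batch_labels = zip(*items)
        # Pad to the longest entry in the batch; cut entries beyond 512 tokens
        batch = dict(tokenizer(list(batch_texts), padding=True, truncation=True,
                               max_length=512, return_tensors="pt"))
        batch["labels"] = torch.tensor([int(label) for label in batch_labels])
        return batch
    return DataLoader(list(zip(texts, labels)), batch_size=batch_size,
                      shuffle=shuffle, collate_fn=collate)


def evaluate(model, loader):
    """Return the classification accuracy of model over the batches in loader."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch in loader:
            logits = model(input_ids=batch["input_ids"],
                           attention_mask=batch["attention_mask"]).logits
            correct += (logits.argmax(dim=-1) == batch["labels"]).sum().item()
            total += batch["labels"].size(0)
    return correct / max(total, 1)


def fine_tune(model, train_loader, val_loader, epochs=3, lr=2e-5):
    """Fine-tune model and return the validation accuracy after each epoch."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    history = []
    for _ in range(epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            loss = model(**batch).loss  # loss is computed because labels are in the batch
            loss.backward()
            optimizer.step()
        history.append(evaluate(model, val_loader))
    return history


# Possible wiring (not run here; downloads bert-base-uncased on first use).
# train_df / val_df are hypothetical splits of the journal data. The raw "Entry"
# text is used because the tokenizer adds [CLS] and [SEP] itself.
# from transformers import AutoTokenizer, AutoModelForSequenceClassification
# tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
# train_loader = make_loader(train_df["Entry"], train_df["sentiment_id"], tokenizer, shuffle=True)
# val_loader = make_loader(val_df["Entry"], val_df["sentiment_id"], tokenizer)
# print(fine_tune(model, train_loader, val_loader))
# model.save_pretrained("fine-tuned-bert-sentiment-analysis")
```

Saving the model under the directory name used in the next cell is what would let `from_pretrained` find it there.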

In [42]:
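# Load the fine-tuned BERT model. This path must be a local directory written by
# model.save_pretrained(...) after training (or a model ID on the Hugging Face Hub);
# if it does not exist, from_pretrained raises the OSError shown below.
model = AutoModelForSequenceClassification.from_pretrained("fine-tuned-bert-sentiment-analysis")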

# Tokenize the new journal entry
journal_entry = "I am having a great day!"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer(journal_entry, return_tensors="pt")

# Predict the sentiment of the new journal entry (no gradients are needed for inference)
inputs = {"input_ids": tokens["input_ids"], "attention_mask": tokens["attention_mask"]}
with torch.no_grad():
    outputs = model(**inputs)

# Get the predicted class ID
sentiment_score = outputs.logits.argmax(dim=1).item()

# Print the predicted sentiment. The mapping follows the dataset's sentiment_id
# column: 0 = positive, 1 = negative, 2 = neutral.
if sentiment_score == 0:
    print("The journal entry is positive.")
elif sentiment_score == 1:
    print("The journal entry is negative.")
else:
    print("The journal entry is neutral.")


OSError: fine-tuned-bert-sentiment-analysis is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`
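
The `OSError` above is raised because no directory named `fine-tuned-bert-sentiment-analysis` exists yet: a model has to be trained and written to disk with `save_pretrained` before `from_pretrained` can load it from a local path. The sketch below shows that save and reload round trip. To stay self-contained it uses a deliberately tiny, randomly initialized BERT built from a `BertConfig` instead of real fine-tuned weights, so its prediction is meaningless; the layer sizes, the toy input IDs, and the label names are assumptions for illustration.

```python
import tempfile

import torch
from transformers import (AutoModelForSequenceClassification, BertConfig,
                          BertForSequenceClassification)

# Label names follow the dataset's sentiment_id column (0 = positive, 1 = negative, 2 = neutral)
id2label = {0: "positive", 1: "negative", 2: "neutral"}

# Tiny random model so the example needs no download; a real run would
# fine-tune "bert-base-uncased" instead.
config = BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=1, num_attention_heads=2,
    intermediate_size=64, id2label=id2label,
    label2id={name: idx for idx, name in id2label.items()},
)
model = BertForSequenceClassification(config).eval()

input_ids = torch.tensor([[1, 5, 9, 2]])  # arbitrary IDs inside the toy vocabulary
with torch.no_grad():
    original_logits = model(input_ids=input_ids).logits

with tempfile.TemporaryDirectory() as save_dir:
    # save_pretrained writes config.json and the weights into save_dir
    model.save_pretrained(save_dir)
    # from_pretrained accepts that local directory in place of a Hub model ID
    reloaded = AutoModelForSequenceClassification.from_pretrained(save_dir)

with torch.no_grad():
    logits = reloaded(input_ids=input_ids).logits

# The label names stored in the config travel with the saved model
predicted_label = reloaded.config.id2label[logits.argmax(dim=1).item()]
print(predicted_label)
```

Storing `id2label` in the config keeps the class-to-name mapping next to the weights, which avoids hard-coding it again at prediction time.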