# Generating a Dummy Dataset to Train a Model On

After searching online, I (understandably) could not find a dataset of journal or diary entries that has sentiment scores along with the extra information I would like attached (such as the user's age, specific emotions, etc.). So, I am going to generate some dummy data and tailor it to my needs.

In [4]:
from faker import Faker
import random
import datetime

### Using Data Generation Libraries

First, I want to generate some users. Certain user demographics might give my model a better picture of the user's emotional state. For now, we will just include a birthdate, from which the age can be derived. We can use the Faker library.

The Faker library is a popular tool for generating fake data. We can use it to create names, addresses, dates, and even random text for journal entries.

In [5]:
fake = Faker()

# Generate random names and dates for users
num_users = 1000  # Number of users
users = [{"name": fake.name(), "birthdate": fake.date_of_birth(minimum_age=18, maximum_age=70)} for _ in range(num_users)]

In [11]:
for i in range(3):
    print(users[i])

{'name': 'Mrs. Sandra Jackson', 'birthdate': datetime.date(1977, 9, 14)}
{'name': 'Robin Juarez', 'birthdate': datetime.date(1963, 7, 1)}
{'name': 'Scott Sanchez', 'birthdate': datetime.date(1979, 10, 30)}
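
Since the records store a birthdate rather than an age, here is a minimal sketch of how the age could be derived. The helper name `age_on` and the fixed reference date are my own assumptions for illustration; they are not part of Faker.

```python
import datetime


def age_on(birthdate: datetime.date, today: datetime.date) -> int:
    """Return the age in whole years on the given day."""
    years = today.year - birthdate.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years


# Sample records shaped like the `users` list above.
sample_users = [
    {"name": "Mrs. Sandra Jackson", "birthdate": datetime.date(1977, 9, 14)},
    {"name": "Robin Juarez", "birthdate": datetime.date(1963, 7, 1)},
]

# Assumed reference date, fixed so the result is reproducible.
reference_day = datetime.date(2023, 9, 1)
for user in sample_users:
    user["age"] = age_on(user["birthdate"], reference_day)

print(sample_users)
```

In the notebook itself, `datetime.date.today()` could replace the fixed reference date.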


### Generating Emotion-Specific Journal Entries

In [38]:
import openai
import os

Here's an example of generating a sentence with OpenAI's [Completions API](https://platform.openai.com/docs/guides/gpt/completions-api). I start with the prompt "Today, I felt" and the API finishes the sentence. We will eventually generate 2,100 of these, but first let's look at an example.

In [45]:
api_key = os.environ.get('OA_API_KEY')
openai.api_key = api_key

prompt = "Today, I felt"
generated_text = openai.Completion.create(engine="text-davinci-002", prompt=prompt, max_tokens=75)

# Extract the generated text from the response
journal_entry = generated_text.choices[0].text


In [37]:
prompt + journal_entry

'Today, I felt extra anxious and had a panic attack.I was feeling really overwhelmed and stressed out about a number of things and it all just kind of hit me at once. I had to take a few deep breaths and try to relax. I eventually felt better, but it was a really tough day.'

Now, let's generate some entries for our dataset.

I want them to reflect the 6 basic emotions: sadness, happiness, fear, anger, surprise and disgust.
To keep the dataset balanced, I have to make sure there is an equal number of entries for each emotion. Let's make 300 entries per emotion. Let's also add "bored" as a neutral emotion.


So, we will have 7 emotions, 300 entries each, for a total of 2,100 journal entries for our dataset. This should hopefully be enough to train a model on.

We can do this by appending the emotion to the prompt "Today, I felt " and having the model finish the sentence.

In [47]:
emotions = ['sad', 'happy', 'fear', 'angry', 'surprised', 'disgusted', 'bored']
journal_entries = []

prompt = "Today, I felt "
for emotion in emotions:
    emotive_prompt = prompt + emotion
    for i in range(300):
        generated_text = openai.Completion.create(engine="text-davinci-002", prompt=emotive_prompt, max_tokens=75)
        journal_entries.append(emotive_prompt + generated_text.choices[0].text)
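
The loop above relies on position to know which emotion an entry belongs to. As a rough sketch of an alternative, the label can be stored next to each entry as it is generated. The names `generate_labeled_entries` and `fake_complete` are hypothetical; in the notebook, the `complete` argument would be a small function wrapping the `openai.Completion.create(...)` call shown above. A stub is used here so the sketch runs offline.

```python
from typing import Callable, Dict, List

PROMPT = "Today, I felt "


def generate_labeled_entries(
    complete: Callable[[str], str],
    emotions: List[str],
    per_emotion: int,
) -> List[Dict[str, str]]:
    """Call `complete` once per entry and keep the emotion label with the text."""
    rows = []
    for emotion in emotions:
        emotive_prompt = PROMPT + emotion
        for _ in range(per_emotion):
            text = emotive_prompt + complete(emotive_prompt)
            rows.append({"emotion": emotion, "entry": text})
    return rows


def fake_complete(prompt: str) -> str:
    """Stand-in for the API call; always returns the same continuation."""
    return " today."


rows = generate_labeled_entries(fake_complete, ["sad", "happy"], per_emotion=2)
print(len(rows))
print(rows[0])
```

Keeping the label in each row means later steps do not have to assume that entries 0 to 299 are "sad", 300 to 599 are "happy", and so on.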

In [51]:
print(len(journal_entries))
print(len([1 for entry in journal_entries if 'sad' in entry]))
print(len([1 for entry in journal_entries if 'happy' in entry]))
print(len([1 for entry in journal_entries if 'fear' in entry]))
print(len([1 for entry in journal_entries if 'angry' in entry]))
print(len([1 for entry in journal_entries if 'surprised' in entry]))
print(len([1 for entry in journal_entries if 'disgusted' in entry]))
print(len([1 for entry in journal_entries if 'bored' in entry]))
print(journal_entries[299])
print(journal_entries[300])

2100
318
353
300
317
307
301
304
Today, I felt sad because

I miss my friends and family.
Today, I felt happy.

I woke up this morning to the sun shining in through my window, and I just felt so happy. I don't know why, but I just felt really good today. I just felt like everything was going my way and that everything was going to be okay. I don't know what it is, but I just had this really good feeling all day.
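
Most of the counts above are larger than 300. A likely explanation is that `'sad' in entry` matches any entry that mentions the word, including entries generated for a different emotion. Here is a small sketch, on a hand-written sample rather than the real `journal_entries`, that counts by the prompt each entry starts with instead. The helper `label_from_prefix` is my own name for illustration.

```python
from collections import Counter

prompt = "Today, I felt "
emotions = ['sad', 'happy', 'fear', 'angry', 'surprised', 'disgusted', 'bored']

# Tiny hand-written sample standing in for `journal_entries`.
sample_entries = [
    "Today, I felt sad because I miss my friends.",
    "Today, I felt happy. I was sad yesterday, but not today.",
    "Today, I felt angry and a little sad.",
]


def label_from_prefix(entry):
    """Return the emotion whose prompt the entry starts with, or None."""
    for emotion in emotions:
        if entry.startswith(prompt + emotion):
            return emotion
    return None


# Substring matching counts every entry that mentions "sad" anywhere.
substring_count = sum('sad' in entry for entry in sample_entries)

# Prefix matching counts each entry once, under the emotion it was generated for.
prefix_counts = Counter(label_from_prefix(entry) for entry in sample_entries)

print(substring_count)       # 3
print(prefix_counts['sad'])  # 1
```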


Now let's try using a sentiment lexicon to label each entry as positive or negative. AFINN assigns every word it knows an integer valence between -5 and +5 and sums those valences over the text. An entry with a total above 0 is treated as positive, and one with a total below 0 as negative.

In [57]:
from afinn import Afinn

afinn = Afinn()

# Keep the entries whose total AFINN score is positive
positive_entries = [entry for entry in journal_entries if afinn.score(entry) > 0]

# Keep the entries whose total AFINN score is negative
# (entries that score exactly 0 end up in neither list)
negative_entries = [entry for entry in journal_entries if afinn.score(entry) < 0]


In [60]:
print(len(positive_entries))
print(len(negative_entries))
print(positive_entries[5])
print(negative_entries[0])

640
1295
Today, I felt sad for no real reason.

I think it might have just been the weather. It was raining and cloudy all day, which can definitely make someone feel down. Sometimes, when we don't know why we're feeling a certain way, it can help to think about what's going on around us that might be affecting our mood.
Today, I felt sad.

It could be because of many different reasons. Maybe something bad happened, or you miss someone. It's important to figure out what is making you sad, so you can try to fix the problem. Sometimes, though, you just need some time to yourself to feel better.


This doesn't seem to work well at all. Above, you can see two obviously negative entries, yet the first one is scored as positive. This does not seem like a good way to evaluate sentiment.
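
One plausible reason a sad entry can land in `positive_entries` is that a lexicon score is just a sum of word valences, so a few mildly positive words can outweigh a single "sad". The sketch below illustrates this with a toy lexicon. The valences are made up for illustration and are not the real AFINN values, and `lexicon_score` is a hypothetical helper, not part of the `afinn` package.

```python
import re

# Hypothetical valences for illustration only; these are NOT the real AFINN values.
toy_lexicon = {"sad": -2, "help": 2, "better": 2, "good": 3, "bad": -3, "miss": -2}


def lexicon_score(text):
    """Sum the valence of every known word; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(toy_lexicon.get(word, 0) for word in words)


entry = "Today, I felt sad. It can help to think about it, and you may feel better."
score = lexicon_score(entry)

# "sad" contributes -2, while "help" and "better" contribute +2 each.
print(score)  # 2
```

Even though the entry is about feeling sad, the advice-like wording pushes the total above zero.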

Let's circle back to the entry generation. We initially used the prompt "Today, I felt " followed by a certain emotion and had the model finish the entry.

This does what we asked, but there's an issue. Every single journal entry now starts with "Today, I felt" and contains the very emotion word the model is supposed to predict. A model trained on this data could come to expect "Today, I felt" in every entry, and could learn that an entry only expresses an emotion if it contains that emotion's word.

Let's change this so that the feeling is still present in the entry, but the original prompt and the emotion word are no longer forced into it. We can do this by simply telling the model to write a journal entry as if it were the writer feeling a certain way.

In [65]:
# Define a prompt to generate a journal entry with a specific emotion (here, happy)
prompt = "Write a journal entry where the writer is feeling happy. Imagine you are the writer writing in your own personal diary."

# Generate text based on the prompt
response = openai.Completion.create(
    engine="text-davinci-002",  # You can choose the engine based on your requirements
    prompt=prompt,
    max_tokens=75,  # Adjust the max tokens to control the length of the generated text
    temperature=0.7,  # Adjust temperature for creativity (lower values make it more focused)
)

# Extract the generated text from the response
generated_text = response.choices[0].text

print(generated_text)




I'm feeling really happy today! I woke up this morning and the sun was shining, and I just felt really good. Everything feels like it's going my way and I'm just really enjoying life right now. I'm so grateful for everything I have and I just feel really lucky. I hope this feeling lasts forever!


This seems like a much more realistic journal entry. Now, let's create the same number of entries as before: 300 for each of the 7 emotions above, for a total of 2,100.

In [68]:
emotions = ['sad', 'happy', 'fear', 'angry', 'surprised', 'disgusted', 'bored']
entries = []

for emotion in emotions:
    prompt = "Write a journal entry where the writer is feeling {}. Imagine you are the writer writing in your own personal diary.".format(emotion)
    for i in range(300):
        response = openai.Completion.create(
            engine="text-davinci-002",  
            prompt=prompt,
            max_tokens=75,  
            temperature=0.7, 
        )
        generated_text = response.choices[0].text
        entries.append(generated_text)
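
The sample entry printed earlier still opens with "I'm feeling really happy today!", so the label word can still show up even though the prompt no longer forces it. Here is a rough sketch for measuring how often that happens, run on a hand-written sample rather than the real `entries`. The helper names `leaks_label` and `leakage_rate` are my own, and a plain substring check is an assumption that will miss related words such as "afraid" for "fear".

```python
def leaks_label(entry, emotion):
    """True if the label word appears anywhere in the entry (case-insensitive)."""
    return emotion.lower() in entry.lower()


def leakage_rate(entries, labels):
    """Fraction of entries that mention their own label word."""
    if not entries:
        return 0.0
    leaked = sum(leaks_label(entry, label) for entry, label in zip(entries, labels))
    return leaked / len(entries)


# Hand-written sample standing in for the generated entries and their labels.
sample_entries = [
    "\n\nI'm so angry right now. I can't believe that they would do that to me.",
    "\n\nI woke up and the sun was shining. Everything is going my way.",
]
sample_labels = ["angry", "happy"]

rate = leakage_rate(sample_entries, sample_labels)
print(rate)  # 0.5
```

A high rate on the real data would suggest the model could still lean on the label word instead of the overall tone of the entry.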

In [72]:
entries[900]

"\n\nI'm so angry right now. I can't believe that they would do that to me. I trusted them and they just stabbed me in the back. I don't know if I can ever forgive them for this."

### Data Frame / File Generation

Now, let's make our table. In the future, we will use the fake user information we made earlier, but for now, we just care about the entries and their corresponding sentiments.

In [73]:
import pandas as pd

In [81]:
# Specify the number of times each word should be repeated
word_repeat_count = 300

# Generate the array
corresponding_sentiment = [word for word in emotions for _ in range(word_repeat_count)]

print(corresponding_sentiment)


['sad', 'sad', 'sad', 'sad', 'sad', 'sad', 'sad', 'sad', 'sad', 'sad', ... (output truncated)

In [83]:
corr_sent_indices = [i for i in range(len(emotions)) for _ in range(word_repeat_count)]

0
1


In [84]:
data = {
    'Entry': entries,
    'sentiment': corresponding_sentiment,
    'sentiment_id': corr_sent_indices
}
df = pd.DataFrame(data)
df.head()

| | Entry | sentiment | sentiment_id |
|---|---|---|---|
| 0 | \n\nI'm feeling really sad today. I'm not sure... | sad | 0 |
| 1 | \n\nI'm feeling really sad today. I don't know... | sad | 0 |
| 2 | \n\nI'm feeling really sad today. I don't know... | sad | 0 |
| 3 | \n\nI'm feeling really sad today. I don't know... | sad | 0 |
| 4 | \n\nI'm feeling really sad today. I'm not sure... | sad | 0 |


In [85]:
df.to_csv('dummy_journal_data.csv')

### Creating Training and Test Set

Now that we have our data, let's split it into training and test sets, using 80% for training and 20% for testing. To keep the classes balanced, we need to take the same number of entries from each emotion. Since each emotion has 300 entries, we will take 60 entries from each emotion to make our test set.

In the end, we should have 240 of each emotion in the train set for a total of 1,680, and 60 of each emotion in the test set for a total of 420.

In [100]:
train, test = [], []
# Entries are grouped by emotion in blocks of 300:
# the first 240 of each block go to train, the last 60 to test.
start, mid, end = 0, 240, 300

for _ in range(len(emotions)):
    train.extend(entries[start:mid])
    test.extend(entries[mid:end])
    start += 300
    mid += 300
    end += 300

print(len(train))
print(len(test))
print(len(train) + len(test))

1680
420
2100
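
The slicing above always puts the first 240 generated entries of each emotion in the train set. That is probably fine here, since the entries were generated independently, but here is a minimal sketch of a variant that shuffles within each emotion before splitting. The function `stratified_split`, its parameters, and the placeholder data are my own assumptions; only the standard library is used.

```python
import random


def stratified_split(entries, labels, test_fraction=0.2, seed=0):
    """Shuffle within each label, then split each label's entries by test_fraction."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    by_label = {}
    for entry, label in zip(entries, labels):
        by_label.setdefault(label, []).append(entry)

    train, test = [], []
    for label, group in by_label.items():
        group = group[:]  # copy so the caller's data is not reordered
        rng.shuffle(group)
        n_test = int(round(len(group) * test_fraction))
        test.extend((entry, label) for entry in group[:n_test])
        train.extend((entry, label) for entry in group[n_test:])
    return train, test


# Placeholder data with the same shape as the notebook: 7 labels x 300 entries.
labels = [i for i in range(7) for _ in range(300)]
entries = ["entry {}".format(n) for n in range(len(labels))]

train_pairs, test_pairs = stratified_split(entries, labels)
print(len(train_pairs), len(test_pairs))
```

Each returned item is an `(entry, label)` pair, so the entries and their sentiment ids stay together through the shuffle.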


In [110]:
# One label per entry, in the same block order as `train` and `test`
sent_id_train = [i for i in range(len(emotions)) for _ in range(240)]
sent_id_test = [i for i in range(len(emotions)) for _ in range(60)]


Now we can make our tables.

In [106]:
train_data = {
    'entry': train,
    'sentiment_id': sent_id_train
}
train_df = pd.DataFrame(train_data)
train_df.head()

| | entry | sentiment_id |
|---|---|---|
| 0 | \n\nI'm feeling really sad today. I'm not sure... | 0 |
| 1 | \n\nI'm feeling really sad today. I don't know... | 0 |
| 2 | \n\nI'm feeling really sad today. I don't know... | 0 |
| 3 | \n\nI'm feeling really sad today. I don't know... | 0 |
| 4 | \n\nI'm feeling really sad today. I'm not sure... | 0 |


In [107]:
train_df.to_csv('train_journal_data.csv')

In [108]:
test_data = {
    'entry': test,
    'sentiment_id': sent_id_test
}
test_df = pd.DataFrame(test_data)
test_df.head()

| | entry | sentiment_id |
|---|---|---|
| 0 | \n\nDear Diary,\n\nI'm feeling really sad toda... | 0 |
| 1 | \n\nToday was a really tough day. I'm feeling ... | 0 |
| 2 | \n\nDear Diary,\n\nI'm feeling really sad toda... | 0 |
| 3 | \n\nI'm feeling really sad today. I don't know... | 0 |
| 4 | \n\nI'm feeling really sad today. I don't know... | 0 |


In [109]:
test_df.to_csv('test_journal_data.csv')

Now, we have a training and test set that we can train a model on.