# Emotion and Memory
### Matthew Chong A16156411
### Daneil Byers A15396367
### Steve Kuk A15521681
### Robert Aispuro A12086294

# I. Introduction and Background

### IA. Overview

For our project, we wanted to analyze data on an individual's ability to comprehend and recall information and see whether that ability is influenced by positive or negative sentiment. We split the data into 3 age groups (18-25, 30-40, and 45-55) and into 3 memory types: recalled, imagined, and retold. We focused on specific columns from the dataset, such as age, draining, distracted, and similarity, that can potentially influence one's ability to comprehend and recall information. Using these variables, we analyzed memory recollection and sentiment together.

### IB. Research Question

Do positive and negative emotions influence memory (or academic performance) when isolated, regardless of environment and gender?



### IC. Background & Prior Work

Over the past few years, students' academic performance has been largely affected by the pandemic, during which in-person classes transitioned into online classes. Traditionally, a student's education has been delivered in person. Now, students are abruptly forced to adapt to a new learning environment where their education is delivered through a monitor screen. This sudden change had either a positive or a negative academic impact, depending on the preferred learning method of the individual. In the research article *Integrating students' Perspective about Online Learning: A hierarchy of factors* by Montgomery Van Wart, Van Wart and his team found that the most critical factor of online classes is the "loss of physical interaction". The loss of physical interaction demanded a higher level of interactivity and instructional sophistication in the virtual setting. Furthermore, how students personally feel about the sudden shift in learning may influence their ability to comprehend and recall information delivered through their computers. The research article *Impact of online classes on the satisfaction and performance of students during the pandemic period of COVID-19* by Ram Gopal looks at this factor. The study evaluated students' thoughts about how they personally felt about online learning and found that "overall students agreed that online teaching was valuable for them". From this article, it seems that students are satisfied with online learning. In fact, in another article called *The Influence of Virtual Learning Environments in Students' performance* by Paul Alves, Alves found that the more access a student has to VLE (Virtual Learning Environments), the more units they register for, the more units they pass, and the lower the "percentage of students who failed all course units". 
Thus, not only are students satisfied with online learning, they are also performing better overall in their academics. Taken together, these articles suggest that a student's emotions (positive or negative) may play a role in their academic performance.

From our own group's experience, we have noticed that it has gotten increasingly difficult to direct our attention and retain knowledge given in class. As a result, we are motivated to find the factors that influence attention. We want to uncover whether positive, negative, or neutral emotions contribute to one's ability to focus, retain memory, and perform cognitive tasks. More specifically, we are interested in measuring the difference in cognitive performance when one's emotion changes. We believe that it is crucial to find the most productive way to retain academic knowledge so that we can better assist fellow students during these unusual times.

- https://educationaltechnologyjournal.springeropen.com/articles/10.1186/s41239-020-00229-8
- https://link.springer.com/article/10.1007/s10639-021-10523-1
- https://www.semanticscholar.org/paper/The-Influence-of-Virtual-Learning-Environments-in-Alves-Miranda/a7c7ba6194633bd68a300522089c795fb4c83a79

### ID. Hypothesis

If an individual experiences positive emotion, then their ability to comprehend and recall information is enhanced. Likewise, if an individual experiences negative emotion, then their ability to comprehend information is impaired. Positive emotion is supplemental to learning and memory recollection because we believe positive sentiment influences motivation as well as focus, which are two essential factors in memory. Negative emotion is detrimental to learning and memory recollection because having a negative mindset in a learning environment restricts one's ability to recall and comprehend new information.

# II. Data Analysis

### IIA. Set Up

First, we need to import all the packages and data we will need.

In [1]:
#Installations required (install spacy before downloading its model)

#pip install spacy
#pip install vaderSentiment
#python -m spacy download en_core_web_lg

In [2]:
#Import packages
from lisc import Counts
from lisc.utils.db import SCDB
from lisc.plts.counts import *
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import PercentFormatter
import spacy

#Data Import
hippDf = pd.read_csv('hippoCorpusV2.csv')


### IIB. Data Wrangling and Data Cleaning


Next, we will do some data wrangling and cleaning. We are going to focus on these columns and check for null values:
- annotatorAge: Lower limit of the age bucket of the worker.
  Buckets are: 18-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55+

- story: Story about the imagined or recalled event (15-25 sentences)

- distracted: How distracted were you while writing your story? (5-point Likert)

- draining: How taxing/draining was writing for you emotionally? (5-point Likert)

- frequency: How often do you think about or talk about this event? (5-point Likert)

- importance: How impactful, important, or personal is this story/event to you? (5-point Likert)

- logTimeSinceEvent: Log of time (days) since the recalled event happened

- mainEvent: Short phrase describing the main event described

- similarity: How similar to your life does this event/story feel to you? (5-point Likert)

- stressful: How stressful was this writing task? (5-point Likert)

- summary: Summary of the events in the story (1-3 sentences)

- timeSinceEvent: Time (number of days) since the recalled event happened

#### Check for null values
Let's check to see if our dataset has any null values

In [None]:
#Check for null values, if null value found returns True
print('AnnotatorAge null values ... ',hippDf['annotatorAge'].isnull().values.any())
print('Story null values ...        ',hippDf['story'].isnull().values.any())
print('Distracted null values ...   ', hippDf['distracted'].isnull().values.any())
print('Draining null values ...     ', hippDf['draining'].isnull().values.any())
print('Frequency null values ...    ', hippDf['frequency'].isnull().values.any())
print('Importance null values ...   ', hippDf['importance'].isnull().values.any())
print('LTSinceEvent null values ... ', hippDf['logTimeSinceEvent'].isnull().values.any())
print('Similarity null values ...   ', hippDf['similarity'].isnull().values.any())
print('Stressful null values ...    ', hippDf['stressful'].isnull().values.any())
print('TimeSinceEvent null values ..', hippDf['timeSinceEvent'].isnull().values.any())


As we can see, there are some null values, which is expected based on the dataset that was provided. Since we are interested in the age of the individuals and their emotion, we are going to drop any row with a NaN value in the annotatorAge column or the importance column.

In [None]:
#.copy() so the AgeGroup column added later does not trigger a SettingWithCopyWarning
noNaNHippDf = hippDf.dropna(subset=['annotatorAge','importance']).copy()

In [None]:
noNaNHippDf.shape
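
As a minimal sketch of the null check and row drop used above, here is the same idea on a small made-up dataframe. The `toy` data and its values are purely illustrative and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    'annotatorAge': [18.0, np.nan, 30.0, 45.0],
    'importance':   [4.0, 3.0, np.nan, 5.0],
    'stressful':    [1.0, 2.0, np.nan, np.nan],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Drop only the rows that are missing one of the two columns we rely on;
# nulls in other columns (here 'stressful') are left alone
toy_clean = toy.dropna(subset=['annotatorAge', 'importance'])
print(toy_clean.shape)
```

Counting nulls per column, rather than only checking whether any exist, also shows how many rows a drop would cost.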

We want to make sure our dataset is cleaned so there are no more NaN values. Next, let's take a look at the annotatorAge column

In [None]:
noNaNHippDf['annotatorAge'].unique()


From this, we can see that there are 8 unique age values. For this project, we will classify the age buckets 18 and 25 as **'Youth'**, 30, 35, and 40 as **'Adult'**, and 45, 50, and 55 as **'Senior'** in a new column called **"AgeGroup"**.

In [None]:
#Categorizes annotatorAge into different age groups
def ageGroup(row):
    if row['annotatorAge'] == 18 or row['annotatorAge'] == 25:
        return 'Youth'
    elif row['annotatorAge'] == 30 or row['annotatorAge'] == 35 or row['annotatorAge'] == 40:
        return 'Adult'
    elif row['annotatorAge'] == 45 or row['annotatorAge'] == 50 or row['annotatorAge'] == 55:
        return 'Senior'
    else:
        return None

In [None]:
#apply new column
ageKey = noNaNHippDf.apply(lambda row: ageGroup(row),axis=1)
ageKey

In [None]:
noNaNHippDf['AgeGroup'] = ageKey


In [None]:
#test to see if function worked and age was classified into 3 groups
noNaNHippDf['AgeGroup'].unique()
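
The same classification could also be written as a dictionary lookup. This is only a sketch of an alternative, not the approach used in this notebook: `AGE_GROUPS` and the `ages` series are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical lookup table mirroring the ageGroup function above
AGE_GROUPS = {
    18: 'Youth', 25: 'Youth',
    30: 'Adult', 35: 'Adult', 40: 'Adult',
    45: 'Senior', 50: 'Senior', 55: 'Senior',
}

# Made-up ages; 99 is not a valid bucket and should come back as missing
ages = pd.Series([18, 35, 55, 99])

# Series.map looks each value up in the dictionary; unmatched values become NaN
groups = ages.map(AGE_GROUPS)
print(groups.tolist())
```

A lookup table keeps the bucket boundaries in one place, which makes them easier to change than a chain of `if`/`elif` comparisons.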

This new column is now set with our unique variables and we can move on to continue wrangling the rest of the data

#### Splitting data by group
Next, we will use the memType column to split the data into 3 groups, **"recalled"**, **"imagined"**, and **"retold"**, so we can analyze them separately later.

In [None]:
recalled_df = noNaNHippDf[noNaNHippDf['memType']=="recalled"]
imagined_df = noNaNHippDf[noNaNHippDf['memType']=="imagined"]
retold_df = noNaNHippDf[noNaNHippDf['memType']=="retold"]


We are going to keep the same columns described at the start of this section (annotatorAge through timeSinceEvent), together with the new AgeGroup column, memType, and one additional column:

- recAgnPairId: ID of the recalled story that corresponds to this retold story (null for imagined stories). Group on this variable to get the recalled-retold pairs.


In [None]:
#Columns shared by all three memory-type dataframes
keepCols = ['annotatorAge','story','distracted','draining','frequency',
            'importance','logTimeSinceEvent','mainEvent','similarity',
            'stressful','summary','timeSinceEvent','AgeGroup','recAgnPairId','memType']

#.copy() so the sentiment columns added later do not trigger a SettingWithCopyWarning
newRecalled = recalled_df[keepCols].copy()
newImagined = imagined_df[keepCols].copy()
newRetold = retold_df[keepCols].copy()

#### Performing text comparison
We are going to look at how similar the recalled and retold stories are, using the spaCy package for the analysis later. First, though, we need to prepare the data through wrangling. As a reminder, the recalled group is recalling a previous story from their life, and the retold group is trying to retell the same story they gave previously, based on the summary that they created after they told their story in the recalled group. We are going to make a new table that combines the recalled story and the retold story so that we can more easily apply the spaCy package to it.


First, we need to find the unique IDs that link the two data sections together.

In [None]:
#Get the unique pair IDs, dropping the null tag (imagined stories have no pair ID)
#dropna() avoids assuming that the null tag sits at index 0
newIdTags = noNaNHippDf['recAgnPairId'].dropna().unique()
newIdTags

Now that we have all the unique tags we need to find the recalled data and merge it with the retold data

In [None]:
listOfLists = []
secondKeyStory = []
deltaOfTime = []
for tag in newIdTags:
    #Get the rows that share this pair ID
    #(assumes two rows per ID: the first is treated as the recalled story, the second as the retold one)
    tagIds = hippDf[hippDf['recAgnPairId'] == tag]
    #Grab the first row and keep all of its columns
    mainVal = tagIds.iloc[0]
    #Take the story from the second row
    secondStory = tagIds['story'].iloc[1]
    #Take the time since the event from the second row
    timeSinceRe = tagIds['timeSinceEvent'].iloc[1]

    #Add them to lists to make a DF out of
    listOfLists.append(mainVal)
    secondKeyStory.append(secondStory)
    deltaOfTime.append(timeSinceRe)

In [None]:
recalledAndRetold = pd.DataFrame(listOfLists)
recalledAndRetold['retold stories'] = secondKeyStory
recalledAndRetold['time since recalled'] = deltaOfTime
recalledAndRetold.head()
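
For comparison, the pairing above could also be expressed as a merge on the pair ID. The following is a sketch on made-up rows, assuming each pair ID has exactly one recalled and one retold row; the `toy` data is hypothetical, and this is not the method used in the rest of the notebook.

```python
import pandas as pd

# Made-up rows: two recalled/retold pairs plus one imagined story with no pair ID
toy = pd.DataFrame({
    'recAgnPairId': ['a', 'a', 'b', 'b', None],
    'memType': ['recalled', 'retold', 'retold', 'recalled', 'imagined'],
    'story': ['A1', 'A2', 'B2', 'B1', 'C'],
    'timeSinceEvent': [10, 40, 90, 30, 5],
})

# Split by memory type, keeping only the retold columns we want to attach
recalled = toy[toy['memType'] == 'recalled']
retold = toy[toy['memType'] == 'retold'][['recAgnPairId', 'story', 'timeSinceEvent']]
retold = retold.rename(columns={'story': 'retold stories',
                                'timeSinceEvent': 'time since recalled'})

# Inner merge keeps only pair IDs that have both a recalled and a retold row
paired = recalled.merge(retold, on='recAgnPairId', how='inner')
print(paired[['recAgnPairId', 'story', 'retold stories', 'time since recalled']])
```

Selecting on memType makes the pairing independent of row order, and the inner merge silently skips IDs that are missing one half of the pair.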

As we can see in row 26, the data in the time since recalled column makes no sense. After further investigation, this error originates from how the data was recorded. To avoid skewed results, we are going to remove any result at or above 1111111 days. We chose this cutoff from looking at the data and seeing that it jumps from 780 days to 1111111 days, which is a jump from 2.14 years to 3,044 years.

In [None]:
#.copy() so columns added later do not trigger a SettingWithCopyWarning
cleanedRecalledAndRetold = recalledAndRetold[recalledAndRetold['time since recalled'] < 1111111].copy()
cleanedRecalledAndRetold.head()

Let's check and make sure that there are no more extraneous data points.

In [None]:
cleanedRecalledAndRetold['time since recalled'].unique()

Now that we no longer have extraneous data points, our data is properly cleaned and wrangled. We are ready to begin our analysis.

### IIC. Data Analysis and Results
- **Sentiment Analysis**
- **Spacy Similarity**

#### Sentiment Analysis

Let's run our **Sentiment Analysis** first.
This is the function that will be used to calculate the sentiment score, which will allow us to compare whether the sentiments are positive or negative.

In [None]:
def sentScore(dataFrame):
    #Returns [list of VADER score dicts, list of sentiment labels] for the 'story' column
    s_score = []
    s_rating = []
    sentiment_obj = SentimentIntensityAnalyzer()
    #Select the story by column name instead of relying on its position
    for story in dataFrame['story']:
        scores = sentiment_obj.polarity_scores(story)
        s_score.append(scores)
        #Label the story from its compound score
        comp_score = scores['compound']
        if comp_score > 0.05:
            s_rating.append("Positive")
        elif comp_score <= -0.05:
            s_rating.append("Negative")
        else:
            s_rating.append("Neutral")
    return [s_score, s_rating]
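
To make the labelling rule easier to see in isolation, here is a small sketch of just the thresholding step, without the VADER dependency. The name `label_compound` and its `threshold` parameter are hypothetical and only mirror the cutoffs used in `sentScore`.

```python
def label_compound(comp_score, threshold=0.05):
    """Turn a compound score in [-1, 1] into a sentiment label.

    Mirrors the cutoffs in sentScore: strictly above the threshold is
    Positive, at or below the negative threshold is Negative.
    """
    if comp_score > threshold:
        return "Positive"
    if comp_score <= -threshold:
        return "Negative"
    return "Neutral"


# Scores on or near the boundaries show how ties are handled
for score in [0.6, 0.05, 0.0, -0.05, -0.8]:
    print(score, label_compound(score))
```

Note that the two boundaries are not symmetric: a score of exactly 0.05 is labelled Neutral, while a score of exactly -0.05 is labelled Negative.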

First, we will implement the Sentiment Analysis For Recalled 

In [None]:
newRecalled.head()

In [None]:
#put into function
recalledSentValues = sentScore(newRecalled)

In [None]:
#assign into new columns
newRecalled['sentiment_score'] = recalledSentValues[0]
newRecalled['sentiment'] = recalledSentValues[1]


Here is the new dataframe with 2 new sentiment analysis columns for Recalled.

In [None]:
newRecalled.head()

In [None]:
newRecalled.recAgnPairId.notnull().sum()

Next, we will run the Sentiment Analysis For Imagined 

In [None]:
newImagined.head()

In [None]:
#run function and assign them into dataframe
imaginedSentValues = sentScore(newImagined)
newImagined['sentiment_score'] = imaginedSentValues[0]
newImagined['sentiment'] = imaginedSentValues[1]

#here are the 2 new sentiment analysis columns for Imagined
newImagined.head()

Finally, we will run the Sentiment Analysis For Retold

In [None]:
newRetold

In [None]:
#Apply function and assign columns into dataframe
retoldSentValues = sentScore(newRetold)
newRetold['sentiment_score'] = retoldSentValues[0]
newRetold['sentiment'] = retoldSentValues[1]

#Here are the 2 new sentiment analysis columns for Retold
newRetold.head()

In [None]:
newRetold.recAgnPairId.notnull().sum()

#### Spacy Similarity

Now that we have finished the sentiment analysis, we will run the **Spacy Similarity** analysis. We are going to use spaCy's large English model (en_core_web_lg) because the stories vary drastically in content, and the large model ships with word vectors that let us compare documents using cosine similarity. Additionally, the retold group retells the story that they gave during their time in the recalled group, given a summary that they created at the end of their recalled observation. So we are going to make a new dataset that only focuses on the subjects that were in the **recalled** and **retold** groups. This will allow us to observe a change, if any, in their storytelling over a random time period.

In [None]:
cleanedRecalledAndRetold.recAgnPairId.notnull().sum()

In [None]:
#load spacy's en_core_web_lg
nlp = spacy.load('en_core_web_lg')

In [None]:
#run nlp function from earlier
test1 = cleanedRecalledAndRetold['story'].iloc[0]
test2 = cleanedRecalledAndRetold['retold stories'].iloc[0]
doc1 = nlp(test1)
doc2 = nlp(test2)


By running the nlp function, we can see the similarity between doc1 and doc2

In [None]:
print(doc1.similarity(doc2))

Next, we will create a function that compares the similarity between the recalled story and the retold story called **applyingSpacy**

In [None]:
def applyingSpacy(df):
    sim_score = []
    
    for i in range(len(df)):
        ogStory = df['story'].iloc[i]
        newStory = df['retold stories'].iloc[i]
        nlpComp1 = nlp(ogStory)
        nlpComp2 = nlp(newStory)
        sim_score.append(nlpComp1.similarity(nlpComp2))
        
    return sim_score

In [None]:
#Apply the function, this will take a minute to run
sim_scores = applyingSpacy(cleanedRecalledAndRetold)

Let's double check that all the stories were processed.

In [None]:
print(cleanedRecalledAndRetold.shape)
print(len(sim_scores))

Since all the stories were processed, let's add the scores back into the dataframe.

In [None]:
cleanedRecalledAndRetold['spacy_sim'] = sim_scores
cleanedRecalledAndRetold.head()

Let's re-add the age group to the dataframe so that when we visualize we can see the differences between the age groups.

In [None]:
ageKey = cleanedRecalledAndRetold.apply(lambda row: ageGroup(row),axis=1)


In [None]:
cleanedRecalledAndRetold['AgeGroup'] = ageKey

Let's plot our results from spaCy and see what we get.


In [None]:
cleanedRecalledAndRetold.plot.scatter(x='time since recalled', y = 'spacy_sim')

These results seem to be fairly high, so let's dive deeper into why that might be:
- The spaCy documentation states that the similarity of a document defaults to using the average of its token vectors. An example they give is that the vector for “fast food” is the average of the vectors for “fast” and “food”, which isn’t necessarily representative of the phrase “fast food”
- Therefore it could be that the words are similar but the idea is not
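
The averaging effect described in the bullets above can be sketched with plain NumPy. This is an illustration under assumptions, not spaCy's own code: the two-dimensional "token vectors" are made up, and `mean_vector` and `cosine_similarity` are hypothetical helpers.

```python
import numpy as np

def mean_vector(token_vectors):
    # A document vector taken as the average of its token vectors
    return np.mean(token_vectors, axis=0)

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 2-d token vectors for three tiny "documents"
doc_a = np.array([[1.0, 0.0], [0.0, 1.0]])  # two very different tokens
doc_b = np.array([[1.0, 1.0]])              # one token pointing between them
doc_c = np.array([[1.0, 0.0]])              # only the first token of doc_a

sim_ab = cosine_similarity(mean_vector(doc_a), mean_vector(doc_b))
sim_ac = cosine_similarity(mean_vector(doc_a), mean_vector(doc_c))
print(sim_ab, sim_ac)
```

Here `doc_a` and `doc_b` share no tokens, yet their averaged vectors point in the same direction, so they score as more similar than `doc_a` and `doc_c`, which do share a token. Averaging can hide differences in the individual words.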


### Further Analysis
Let's dig a little deeper to try and see why these overly high scores may be occurring. The first thing that comes to mind is whether the stories are different lengths.
We will create an example and see what the results are when the sentences are:
- completely identical 
- identical but one has an extra sentence related to the previous 
- identical but one has an extra sentence not related to the previous 
- identical but with one having an extra sentence that is told in a different way

In [None]:
#Exact same sentences
doc1 = nlp("I like burgers and dogs.")
doc2 = nlp('I like burgers and dogs.')
print(doc1, "<->", doc2, doc1.similarity(doc2))

In [None]:
#Compared to raw sentence comparison
numberOfSimilarWords = 5
totalWords = 5
print("Raw comparison", numberOfSimilarWords / totalWords)

In [None]:
#Same sentences but one has an extra sentence that does relate to the previous
doc1 = nlp("I like burgers and dogs. I also like pineapples")
doc2 = nlp('I like burgers and dogs.')
print(doc1, "<->", doc2, doc1.similarity(doc2))

In [None]:
#Compared to raw sentence comparison
numberOfSimilarWords = 5
totalWords = 9
print("Raw comparison", numberOfSimilarWords / totalWords)
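
The raw comparison above is counted by hand. As a sketch, the same ratio (shared words divided by the word count of the longer text) could be computed with a small helper. `raw_overlap` is a hypothetical name, and the simple lowercase word-matching it uses is an assumption about how words should be counted.

```python
import re
from collections import Counter

def raw_overlap(text_a, text_b):
    """Shared word occurrences divided by the word count of the longer text."""
    words_a = Counter(re.findall(r"[a-z']+", text_a.lower()))
    words_b = Counter(re.findall(r"[a-z']+", text_b.lower()))
    # Counter intersection keeps the minimum count of each shared word
    shared = sum((words_a & words_b).values())
    total = max(sum(words_a.values()), sum(words_b.values()))
    return shared / total


same = raw_overlap("I like burgers and dogs.", "I like burgers and dogs.")
extra = raw_overlap("I like burgers and dogs. I also like pineapples",
                    "I like burgers and dogs.")
print("Raw comparison", same)
print("Raw comparison", extra)
```

For the two examples above this reproduces the hand-counted ratios of 5/5 and 5/9.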

In [None]:
#Same sentences but one has an extra sentence that does not relate to the previous
doc1 = nlp("I like burgers and dogs. Bob flies planes")
doc2 = nlp('I like burgers and dogs.')
print(doc1, "<->", doc2, doc1.similarity(doc2))

We can see that sentences that are irrelevant decrease the score, and sentences that are similar decrease the score but not as much as an irrelevant extra sentence.
But what about an extra sentence that basically says the same thing as the previous but in a different way?

In [None]:
#Same sentences but one has an extra sentence that repeats what is said but in a different way
doc1 = nlp("I like burgers and dogs. Burgers and dogs are what I like")
doc2 = nlp('I like burgers and dogs.')
print(doc1, "<->", doc2, doc1.similarity(doc2))

As we can see in the last example, the similarity is the highest. With these examples we can more reasonably conclude that our results are suitable for what we are trying to do, because we want to focus on how much the stories change, not necessarily on whether the stories are talking about the same thing. Something that could be interesting for the future is to look into how much the overall length of a document affects these scores. For the sake of this project, we are going to assume that the results are accurate.

Bringing together the analysis from the spaCy package, let's compare the similarity score to the change in sentiment score. To do this, we are going to take each individual's sentiment score from the recalled group and subtract it from their score in the retold group.

In [None]:
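#Work on a copy so the original dataframe is left unchanged
combinedAnalysis = cleanedRecalledAndRetold.copy()
#Placeholder column that will hold 'Pos', 'Neg', or 'Neu'
combinedAnalysis["sentChange"] = None
combinedAnalysis = combinedAnalysis.reset_index(drop=True)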

In [None]:
cRRKeys = cleanedRecalledAndRetold.recAgnPairId.unique()


Removing these keys as the recAgnPairId does not exist in the newRecalled or the newRetold dataframe

In [None]:
#Keep only the keys that appear in both newRecalled and newRetold
#(replaces deleting the missing keys by hard-coded index)
validKeys = set(newRecalled['recAgnPairId'].dropna()) & set(newRetold['recAgnPairId'].dropna())
cRRKeys = np.array([key for key in cRRKeys if key in validKeys])

In [None]:
for key in cRRKeys:
    recalledRow = newRecalled[newRecalled['recAgnPairId'] == key]
    retoldRow = newRetold[newRetold['recAgnPairId'] == key]

    #Each sentiment_score entry is a dict with 'neg', 'neu', 'pos', and 'compound' keys
    #Use the first matching row from each dataframe
    recalledScore = recalledRow['sentiment_score'].iloc[0]
    retoldScore = retoldRow['sentiment_score'].iloc[0]

    #Check the difference between values (retold minus recalled)
    newNegVal = retoldScore['neg'] - recalledScore['neg']
    newPoVal = retoldScore['pos'] - recalledScore['pos']

    if newPoVal > newNegVal:
        sentChange = 'Pos'
    elif newPoVal < newNegVal:
        sentChange = 'Neg'
    else:
        sentChange = 'Neu'
    combinedAnalysis.loc[combinedAnalysis.recAgnPairId == key, 'sentChange'] = sentChange

In [None]:
combinedAnalysis.head()
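
The rule that decides the sentChange label can be sketched on its own with two made-up score dictionaries. The function name `sent_change` and the example scores are hypothetical; the dictionaries only imitate the 'neg', 'neu', and 'pos' keys stored in the sentiment_score column.

```python
def sent_change(recalled_score, retold_score):
    """Label how sentiment moved from the recalled story to the retold one."""
    # Change in the negative and positive proportions (retold minus recalled)
    neg_delta = retold_score['neg'] - recalled_score['neg']
    pos_delta = retold_score['pos'] - recalled_score['pos']
    if pos_delta > neg_delta:
        return 'Pos'
    if pos_delta < neg_delta:
        return 'Neg'
    return 'Neu'


# Made-up scores for one recalled story and one retold story
recalled = {'neg': 0.1, 'neu': 0.7, 'pos': 0.2}
retold = {'neg': 0.1, 'neu': 0.6, 'pos': 0.3}

print(sent_change(recalled, retold))    # positivity grew more than negativity
print(sent_change(retold, recalled))    # the reverse direction
print(sent_change(recalled, recalled))  # no change at all
```

The label compares the two changes against each other, so a retelling that becomes both more positive and more negative is still labelled by whichever change is larger.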

In [None]:
combinedAnalysis.columns

Next, let's look at the relationships between the columns **AgeGroup vs Draining**, **AgeGroup vs Stressful**, and **AgeGroup vs Importance** with the Retold and Recalled datasets. 

In [None]:
#Retold Agegroup vs stressful
sns.violinplot(x='AgeGroup',data=newRetold,
            y='stressful')

In [None]:
#Retold Agegroup vs draining
sns.violinplot(x='AgeGroup',data=newRetold,
            y='draining')

In [None]:
#Retold Agegroup vs importance
sns.violinplot(x='AgeGroup',data=newRetold,
            y='importance')

In [None]:
#Recalled Agegroup vs stressful
sns.violinplot(x='AgeGroup',data=newRecalled,
            y='stressful')

In [None]:
#Recalled Agegroup vs draining
sns.violinplot(x='AgeGroup',data=newRecalled,
            y='draining')

In [None]:
#Recalled Agegroup vs importance
sns.violinplot(x='AgeGroup',data=newRecalled,
            y='importance')

In [None]:
#check dataframe's memType column
combinedAnalysis['memType']

We can split our new dataset by age group for further analysis. We will discuss these plots and split datasets later in our conclusion

In [None]:
#split the combined analysis dataset into youth, adult, and seniors only
youthOnly = combinedAnalysis[combinedAnalysis['AgeGroup'] == 'Youth']
adultOnly = combinedAnalysis[combinedAnalysis['AgeGroup'] == 'Adult']
seniorOnly = combinedAnalysis[combinedAnalysis['AgeGroup'] == 'Senior']


### IID. Data Visualizations for Sentiment Analysis

Next, we will look at visualizations for Recalled, Imagined, and Retold by sentiment and age. In this section, we want to visualize the sentiment of each age for Recalled, Imagined, and Retold. The x-axis is the age, while the y-axis is the percentage of all positive (or negative) stories that come from that age. Below each bar graph are the stats displaying each age along with its exact percentage of positive and negative sentiment.

In [None]:
#Plots the share of positive and negative stories per age and prints the exact percentages
def plotSentimentByAge(df, title):
    # Create x-axis based on unique age values in the data
    ages = np.sort(df['annotatorAge'].unique())
    x_axis = np.arange(len(ages))

    # Percentage of all positive (or negative) stories that fall in each age
    percentages = {}
    for label in ['Positive', 'Negative']:
        counts = np.array([len(df[(df['annotatorAge'] == age) & (df['sentiment'] == label)])
                           for age in ages])
        percentages[label] = counts / counts.sum() * 100

    plt.xticks(x_axis, ages)
    plt.bar(x_axis + 0.2, percentages['Positive'], 0.4, label='Positive')
    plt.bar(x_axis - 0.2, percentages['Negative'], 0.4, label='Negative')
    plt.ylabel('Percentage of Sentiment')
    plt.xlabel('Age')
    plt.title(title)
    plt.legend()
    plt.show()

    # Print the exact percentages behind the bars
    for label in ['Positive', 'Negative']:
        print('---------- {} Stats ----------'.format(label))
        for age, pct in zip(ages, percentages[label]):
            print("Age: {} Percentage: {}".format(age, pct))

#Recalled Visualization
plotSentimentByAge(newRecalled, 'Recalled Sentiment')

#Imagined Visualization
plotSentimentByAge(newImagined, 'Imagined Sentiment')

#Retold Visualization
plotSentimentByAge(newRetold, 'Retold Sentiment')

The previous graphs showed us the relationships between Positive and Negative Sentiments by **Age**. We can see that the results between Sentiments by Age are pretty spread out and that there is no clear relationship shared across the 3 graphs. However, if we look at each graph individually, we can see some trends across the different ages. For example, for Recalled, there is a period of increased positivity between the ages of 25 and 35. We can also see that for the Recalled and Retold graphs, between the ages of 45 and 55, there is mostly a negative sentiment within the groups.
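
The percentages behind these bar graphs can also be tabulated directly. As a sketch on made-up rows, `pd.crosstab` with `normalize='columns'` gives, for each sentiment, the share of its stories that falls in each age; the `toy` dataframe is hypothetical.

```python
import pandas as pd

# Made-up ages and sentiment labels for illustration
toy = pd.DataFrame({
    'annotatorAge': [18, 18, 25, 25, 25, 30],
    'sentiment': ['Positive', 'Negative', 'Positive',
                  'Positive', 'Negative', 'Positive'],
})

# Each column is normalized on its own, so every sentiment sums to 100%
pct = pd.crosstab(toy['annotatorAge'], toy['sentiment'], normalize='columns') * 100
print(pct)
```

Because each sentiment is normalized separately, an age with many stories overall will tend to show tall bars for both sentiments, which is worth keeping in mind when comparing ages.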

Next, we will look at the relationship between Positive and Negative Sentiments by **AgeGroup**. 

In [None]:
#Plot settings
fig, axes = plt.subplots(3, 1,figsize=(10,15))
fig.tight_layout(h_pad = 4.0)

#Recalled Visualization
graph_2 = sns.histplot(data = newRecalled, x="AgeGroup", hue = 'sentiment', hue_order = ['Positive','Negative'],multiple='dodge', stat = 'percent',common_norm=False,ax=axes[0])
for container in graph_2.containers:
    graph_2.bar_label(container)

graph_2.set(xlabel = 'Age Group')
axes[0].set_title('Recalled Sentiment', fontsize=20)

#Imagined Visualization
graph_2 = sns.histplot(data = newImagined, x="AgeGroup", hue = 'sentiment', hue_order = ['Positive','Negative'],multiple='dodge', stat = 'percent',common_norm=False,ax=axes[1])
for container in graph_2.containers:
    graph_2.bar_label(container)

graph_2.set(xlabel = 'Age Group')
axes[1].set_title('Imagined Sentiment', fontsize=20)

#Retold Visualization
graph_2 = sns.histplot(data = newRetold, x="AgeGroup", hue = 'sentiment', hue_order = ['Positive','Negative'],multiple='dodge', stat = 'percent',common_norm=False,ax=axes[2])
for container in graph_2.containers:
    graph_2.bar_label(container)

graph_2.set(xlabel = 'Age Group')
axes[2].set_title('Retold Sentiment', fontsize=20)

Comparing these graphs, the pattern that stands out most is that the Adult AgeGroup shows more Positive than Negative sentiment in all 3 story types: Retold, Recalled, and Imagined. Maybe this has something to do with the adult stage of life, when people tend to be happier?

Next, let's visualize the sentiment analysis for **Retold vs. Recalled Stories**.

In [None]:
#select positive and negative sentiments (remove neutral sentiment)
posnegRetold = newRetold[(newRetold['sentiment'] == 'Positive') | (newRetold['sentiment'] == 'Negative')]
posnegRecalled = newRecalled[(newRecalled['sentiment'] == 'Positive') | (newRecalled['sentiment'] == 'Negative')]

#plot parameters
x = posnegRetold['sentiment']
y = posnegRecalled['sentiment']
fig, ax = plt.subplots(figsize=(10,5))
values, bins, patches = plt.hist([x, y], bins=np.arange(3)-0.5, label=['Retold Stories', 'Recalled Stories'],density=True)
 
plt.legend(loc='upper right')
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
for container in ax.containers:
    labels = [f'{x:.1%}' for x in container.datavalues]
    ax.bar_label(container, labels=labels)
plt.show()

From this graph, we can see that when comparing Positive and Negative sentiment between Retold and Recalled stories, a larger share of Retold stories are Positive than of Recalled stories. Conversely, a larger share of Recalled stories are Negative.
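The shares read off the histogram can also be computed directly as a small table. The sketch below is only an illustration: `toy_retold` and `toy_recalled` are hypothetical stand-ins for `newRetold` and `newRecalled`, the rows are invented, and `sentiment_shares` is a helper name introduced here, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-ins for newRetold / newRecalled (invented rows).
toy_retold = pd.DataFrame(
    {"sentiment": ["Positive", "Positive", "Positive", "Negative", "Neutral"]}
)
toy_recalled = pd.DataFrame(
    {"sentiment": ["Positive", "Negative", "Negative", "Negative", "Neutral"]}
)


def sentiment_shares(df):
    """Share of Positive vs. Negative rows, ignoring any other label."""
    kept = df[df["sentiment"].isin(["Positive", "Negative"])]
    return kept["sentiment"].value_counts(normalize=True)


# One column per story type; rows are aligned on the sentiment label.
shares = pd.DataFrame(
    {
        "Retold": sentiment_shares(toy_retold),
        "Recalled": sentiment_shares(toy_recalled),
    }
).fillna(0.0)
print(shares)
```

Each column sums to 1, so the table matches what a density-normalized histogram displays and the two story types can be compared even when they contain different numbers of stories.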

Finally, for consistency, we will add the plots that we've seen and analyzed from the spaCy similarity analysis.

In [None]:
cleanedRecalledAndRetold.plot.scatter(x='time since recalled', y = 'spacy_sim')

In [None]:
test = cleanedRecalledAndRetold[cleanedRecalledAndRetold['time since recalled'] < 250]
test.plot.scatter(x='time since recalled', y = 'spacy_sim')

In [None]:
_ages = ['Youth','Adult','Senior']
sns.relplot(data=combinedAnalysis, x='time since recalled', y='spacy_sim', hue='AgeGroup', hue_order=_ages, aspect=1.61)
plt.show()

# III. Conclusion and Discussion

As stated above, we want to see how emotion affects memory, so let's look at the results we got. First, let's see whether emotion in each age group became more positive or more negative over time.

In [None]:
sns.countplot(x='AgeGroup',hue='sentChange',data=combinedAnalysis)

As we can see, the change in emotion over time was more often positive in the Adult and Senior age groups. In the Youth group, negative and positive changes are about equally common.
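Because the age groups may contain different numbers of stories, raw counts can be hard to compare, and a row-normalized cross-tabulation is one way to check the pattern. This is a minimal sketch on invented data: `toy` stands in for `combinedAnalysis`, and the `sentChange` labels `'Positive'` and `'Negative'` are an assumption about how that column is coded.

```python
import pandas as pd

# Hypothetical stand-in for combinedAnalysis (invented rows; the
# 'Positive' / 'Negative' labels for sentChange are an assumption).
toy = pd.DataFrame(
    {
        "AgeGroup": ["Youth", "Youth", "Adult", "Adult", "Adult",
                     "Senior", "Senior", "Senior", "Senior"],
        "sentChange": ["Positive", "Negative", "Positive", "Positive", "Negative",
                       "Positive", "Positive", "Positive", "Negative"],
    }
)

# normalize="index" turns each row (age group) into proportions that sum to 1.
change_share = pd.crosstab(toy["AgeGroup"], toy["sentChange"], normalize="index")
print(change_share)
```

Reading across a row gives the fraction of that age group's stories whose sentiment moved in each direction, independent of group size.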

Let's take a look at the spaCy similarity distribution for each age group.

In [None]:
youth = sns.displot(youthOnly, x='spacy_sim',bins=15)
youth.set(title='Youth Spacy Distribution')
adult = sns.displot(adultOnly, x='spacy_sim',bins=15,color='orange')
adult.set(title='Adult Spacy Distribution')
senior = sns.displot(seniorOnly, x='spacy_sim',bins=15)
senior.set(title='Senior Spacy Distribution')

Looking at the distributions, the Adult group appears to retell the original story most faithfully overall, followed by the Youth group and then the Senior group. This suggests that more positive emotion does not necessarily lead to better memory. If emotion helped memory, we would expect the Senior distribution to be the best, followed by Adult and then Youth.
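A visual ranking of three histograms can be backed up with per-group summary statistics. The sketch below shows one way to do that. The `toy` frame and its `spacy_sim` values are invented purely for illustration and are not results from the dataset; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical stand-in for combinedAnalysis; the similarity values are invented.
toy = pd.DataFrame(
    {
        "AgeGroup": ["Youth"] * 3 + ["Adult"] * 3 + ["Senior"] * 3,
        "spacy_sim": [0.90, 0.92, 0.94, 0.93, 0.95, 0.97, 0.88, 0.90, 0.92],
    }
)

# Count, mean, median, and standard deviation of similarity per age group.
summary = toy.groupby("AgeGroup")["spacy_sim"].agg(["count", "mean", "median", "std"])

# Rank the groups by median similarity, highest first.
ranked = summary.sort_values("median", ascending=False)
print(ranked)
```

The median is used for the ranking because similarity scores tend to bunch near the top of the scale, where a few low outliers can pull the mean down.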

Let's look at the emotion variables that might be driving this increase. We decided that the variables that say the most about emotion are:
- Draining, which describes how emotionally drained the individual was when telling the story
- Importance, which describes how driven the individual was when telling the story
- Stressful, which describes how stressed the individual was when telling the story

In [None]:
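# numeric_only=True skips text columns, which recent pandas versions refuse to average
newRecalled.groupby('AgeGroup').mean(numeric_only=True).drop(columns=['annotatorAge','frequency','logTimeSinceEvent',
                                                   'similarity','timeSinceEvent','distracted'])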

In [None]:
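# numeric_only=True skips text columns, which recent pandas versions refuse to average
newRetold.groupby('AgeGroup').mean(numeric_only=True).drop(columns=['annotatorAge','frequency','logTimeSinceEvent',
                                                   'similarity','timeSinceEvent','distracted'])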

These three emotional variables are rated on a Likert scale. For stress, for example, 5 is the most stress, 3 is a neutral amount of stress, and 1 is no stress at all. Comparing the two tables, we can see that individuals were on average less emotionally drained when retelling the story than when recalling it. The average importance level was higher for each age group when recalling the story than when retelling it. Finally, stress decreased for each age group when they were retelling the story as opposed to recalling it.
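The comparison between the two tables can be made explicit by subtracting one table of group means from the other. This is a minimal sketch: `toy_recalled` and `toy_retold` are hypothetical stand-ins for `newRecalled` and `newRetold`, and the ratings are invented, so only the column names come from the notebook.

```python
import pandas as pd

emotion_cols = ["draining", "importance", "stressful"]

# Hypothetical stand-ins for newRecalled / newRetold (invented Likert ratings).
toy_recalled = pd.DataFrame(
    {
        "AgeGroup": ["Youth", "Youth", "Adult", "Adult"],
        "draining": [4, 3, 3, 3],
        "importance": [5, 4, 4, 4],
        "stressful": [4, 4, 3, 2],
    }
)
toy_retold = pd.DataFrame(
    {
        "AgeGroup": ["Youth", "Youth", "Adult", "Adult"],
        "draining": [3, 3, 2, 3],
        "importance": [4, 4, 4, 3],
        "stressful": [3, 4, 2, 2],
    }
)

recalled_means = toy_recalled.groupby("AgeGroup")[emotion_cols].mean()
retold_means = toy_retold.groupby("AgeGroup")[emotion_cols].mean()

# Negative values mean the average rating dropped from recalling to retelling.
delta = retold_means - recalled_means
print(delta)
```

A single table of differences makes the direction of each change visible at a glance, rather than requiring the reader to compare two tables cell by cell.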


We can also look at violin plots to better see the distribution of each emotion by age group.

In [None]:
stressRec = sns.violinplot(x='AgeGroup',data=newRecalled,
            y='stressful')
stressRec.set(title='Recalled Stress Distribution')

In [None]:
stressRet = sns.violinplot(x='AgeGroup',data=newRetold,
            y='stressful')
stressRet.set(title='Retold Stress Distribution')

Comparing the two plots, we can see that the stress distribution thinned out at 6 and increased from 2 to 3.

In [None]:
importanceRec = sns.violinplot(x='AgeGroup',data=newRecalled,
            y='importance')
importanceRec.set(title='Recalled Importance Distribution')

In [None]:
importanceRet = sns.violinplot(x='AgeGroup',data=newRetold,
            y='importance')
importanceRet.set(title='Retold Importance Distribution')

We can see that the importance distribution filled in its thinner spots and grew more from 4 down to 2.

In [None]:
drainingRec = sns.violinplot(x='AgeGroup',data=newRecalled,
            y='draining')
drainingRec.set(title='Recalled Draining Distribution')

In [None]:
drainingRet = sns.violinplot(x='AgeGroup',data=newRetold,
            y='draining')
drainingRet.set(title='Retold Draining Distribution')

We can see that draining followed a pattern similar to that of stress.

### Overall
We can 

In [None]:
#Conclusion
#Discussion of your results and how they address your experimental question
#Limitations of analysis discussed
#What additional experiments would be interesting, and what data would you need?