# **Sentiment Similarity System**
This system compares the sentiment of a user's Reddit posts with the sentiment of each answer to a question in the Beck Depression Inventory (BDI). To pick an answer for a question, each of the question's answers is scored, and then each of the subject's posts is scored. Each post is assigned to the answer whose sentiment score is closest to its own. The answer class picked most often across all of the subject's posts is then chosen for that question.

The system tests different sentiment scorers to analyse which one performs best at the task.
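
The voting step described above can be sketched in a few lines. This is a minimal illustration with invented scores; the function name `pick_answer` and all numbers are assumptions and are not part of the notebook's own code.

```python
from collections import Counter

def pick_answer(answer_scores, post_scores):
    """Return the index of the answer chosen by the most posts.

    Each post votes for the answer whose sentiment score is nearest
    to its own; ties in distance or in votes go to the lower index.
    """
    votes = Counter()
    for post_score in post_scores:
        nearest = min(range(len(answer_scores)),
                      key=lambda i: abs(answer_scores[i] - post_score))
        votes[nearest] += 1
    # highest vote count wins; on equal counts prefer the lower index
    return min(votes, key=lambda i: (-votes[i], i))

# hypothetical scores for the four answers to one question
answer_scores = [0.4, 0.1, -0.3, -0.7]
# hypothetical scores for five posts by one subject
post_scores = [0.35, -0.25, -0.4, 0.0, -0.2]

print(pick_answer(answer_scores, post_scores))  # 2
```

Three of the five posts lie closest to the third answer's score, so index 2 is returned.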

# **Environment Set-up**

### **Importing and Installing Required Libraries**

In [None]:
#libraries utilised to parse and index dataset
import os, sys, glob, csv
import numpy as np
import pandas as pd

from xml.dom import minidom

#Google Colab does not have awessome installed as default so pip install is required
!pip install awessome
import awessome
from awessome.awessome_builder import *

Collecting awessome
  Downloading awessome-0.0.14-py2.py3-none-any.whl (34 kB)
Collecting certifi==2020.6.20
  Downloading certifi-2020.6.20-py2.py3-none-any.whl (156 kB)
Collecting numpy==1.19.2
  Downloading numpy-1.19.2-cp37-cp37m-manylinux2010_x86_64.whl (14.5 MB)
Collecting sentence-transformers==0.3.8
  Downloading sentence-transformers-0.3.8.tar.gz (66 kB)
Collecting nltk==3.5
  Downloading nltk-3.5.zip (1.4 MB)
Collecting tokenizers==0.8.1rc2
  Downloading tokenizers-0.8.1rc2-cp37-cp37m-manylinux1_x86_64.whl (3.0 MB)
Collecting scipy==1.5.3
  Downloading scipy-1.5.3-cp37-cp37m-manylinux1_x86_64.whl (25.9 MB)

In [None]:
#change to correct directory
%cd drive/MyDrive/CS408/

/content/drive/MyDrive/CS408


# **Preparing Data**

The dataset has been preprocessed here: https://colab.research.google.com/drive/1J8XAD7JPShZQuBRJ6zfFjV1dUtH4y858?usp=sharing

In [None]:
bdi_questions_answers = pd.read_csv("BDI_csv.csv")

#list of all 21 question names in BDI i.e. ['Sadness', 'Pessimism', ...]
questions = bdi_questions_answers['Question'].unique()

#list of lists where each element contains all possible answers for one question
answers = []

#look up in bdi_questions_answers df for all possible answers per question and add to answers
for question_name in questions:
  answers.append(bdi_questions_answers.loc[((bdi_questions_answers['Question'] == question_name)), 'Answer'].values)

In [None]:
%cd ../../question_csvs
#Creating data frames for each question 
agitation_df = pd.read_csv("answer_classes_posts_Agitation.csv") 
appetite_df = pd.read_csv("answer_classes_posts_Changes in Appetite.csv")
sleep_df = pd.read_csv("answer_classes_posts_Changes in Sleeping Pattern.csv") 
concentration_df = pd.read_csv("answer_classes_posts_Concentration Difficulty.csv") 
crying_df = pd.read_csv("answer_classes_posts_Crying.csv") 
guilty_df = pd.read_csv("answer_classes_posts_Guilty Feelings.csv")
indecisive_df = pd.read_csv("answer_classes_posts_Indecisiveness.csv") 
irritability_df = pd.read_csv("answer_classes_posts_Irritability.csv") 
energy_df = pd.read_csv("answer_classes_posts_Loss of Energy.csv") 
sexinterest_df = pd.read_csv("answer_classes_posts_Loss of Interest in Sex.csv") 
interest_df = pd.read_csv("answer_classes_posts_Loss of Interest.csv") 
pleasure_df = pd.read_csv("answer_classes_posts_Loss of Pleasure.csv") 
pastfailure_df = pd.read_csv("answer_classes_posts_Past Failure.csv") 
pessimism_df = pd.read_csv("answer_classes_posts_Pessimism.csv") 
punishment_df = pd.read_csv("answer_classes_posts_Punishment Feelings.csv") 
sadness_df = pd.read_csv("answer_classes_posts_Sadness.csv") 
selfcritcalness_df = pd.read_csv("answer_classes_posts_Self-Criticalness.csv") 
selfdislike_df = pd.read_csv("answer_classes_posts_Self-Dislike.csv") 
suicidal_df = pd.read_csv("answer_classes_posts_Suicidal Thoughts or Wishes.csv") 
fatigue_df = pd.read_csv("answer_classes_posts_Tiredness or Fatigue.csv") 
worthlessness_df = pd.read_csv("answer_classes_posts_Worthlessness.csv") 

#Create a list of the subject names as strings 
# (any df could be used for this)
subjects = agitation_df['Subject'].unique()

#Add all dataframes in order of BDI questions 
BDI_df = [sadness_df, pessimism_df, pastfailure_df, pleasure_df, guilty_df, punishment_df, selfdislike_df, selfcritcalness_df, suicidal_df, crying_df, agitation_df, interest_df, indecisive_df, worthlessness_df, energy_df, sleep_df, irritability_df, appetite_df, concentration_df, fatigue_df, sexinterest_df]

/content/drive/My Drive/CS408/question_csvs


# **Creating Sentiment Intensity Scorers**
To calculate a sentiment score for the BDI answers and for users' posts, I am using the AWESSOME (A Word Embedding Sentiment Scorer Of Many Emotions) framework. This framework lets you build your own sentiment intensity scorers and adapt several parameters to fit your purpose.

Details and implementations of the AWESSOME framework can be found here: https://github.com/cumulative-revelations/awessome.git.

### **BERT Sentiment Scorers**

In [None]:
avg_bert_builder = SentimentIntensityScorerBuilder('avg', 'bert-base-nli-mean-tokens', 'cosine', '600', True)
bert_vader_avg_scorer = avg_bert_builder.build_scorer_from_prebuilt_lexicon('vader')

100%|██████████| 405M/405M [00:07<00:00, 53.4MB/s]


In [None]:
max_bert_builder = SentimentIntensityScorerBuilder('max', 'bert-base-nli-mean-tokens', 'cosine', '600', True)
bert_vader_max_scorer = max_bert_builder.build_scorer_from_prebuilt_lexicon('vader')

### **DistilBERT Scorer**

In [None]:
avg_distil_builder = SentimentIntensityScorerBuilder('avg', 'distilbert-base-nli-stsb-mean-tokens', 'cosine', '600', True)
distil_vader_avg_scorer = avg_distil_builder.build_scorer_from_prebuilt_lexicon('vader')

### **RoBERTa Scorer**

In [None]:
avg_roberta_builder = SentimentIntensityScorerBuilder('avg', 'roberta-base-nli-stsb-mean-tokens', 'cosine', '600', True)
roberta_vader_avg_scorer = avg_roberta_builder.build_scorer_from_prebuilt_lexicon('vader')

# **Computing Sentiment Scores for BDI Answers**


In [None]:
%cd Sentiment_Analysis/

/content/drive/My Drive/CS408/Sentiment_Analysis


**Calculates a sentiment score for the answers to BDI, using the chosen Sentiment Intensity Scorer**


*   scorer_name (*string*): name of the scorer, used as the name of the folder the CSV file is written to
*   scorer (*SentimentIntensityScorer*): chosen sentiment scorer



In [None]:
def score_BDI(scorer_name, scorer):
  #make sure the output folder for this scorer exists
  os.makedirs(scorer_name, exist_ok=True)

  #create csv file for the BDI answer sentiment scores 
  filename = scorer_name + "/BDI_sentiment.csv"
  #newline='' stops the csv module from writing blank rows on some platforms
  with open(filename, mode='w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    #add the column headers
    csv_writer.writerow(['Question', 'Answer', 'Sentiment'])

    #calculate sentiment score for each answer per question
    for i, answer in enumerate(answers):
      for a in answer:
        csv_writer.writerow([questions[i], a, scorer.score_sentence(a)]) 

In [None]:
#example of input parameters for score_BDI function
score_BDI('distil_vader_avg_scorer', distil_vader_avg_scorer)
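
To see the shape of the file `score_BDI` produces without loading a transformer model, the sketch below swaps in a stand-in scorer and writes to an in-memory buffer. `StubScorer`, its toy scoring rule, `write_scores` and the two sample rows are all hypothetical; the only thing taken from the notebook is that a scorer exposes `score_sentence` and that the columns are `Question`, `Answer`, `Sentiment`.

```python
import csv
import io

class StubScorer:
    """Hypothetical stand-in for a sentiment scorer; not part of AWESSOME."""
    NEGATIVE = {"sad", "unhappy"}

    def score_sentence(self, sentence):
        # toy rule: -1 per negative word, so more negative text scores lower
        words = sentence.lower().replace(".", "").split()
        return -float(sum(w in self.NEGATIVE for w in words))

def write_scores(rows, scorer, out):
    """Write (question, answer) pairs plus their score as CSV to `out`."""
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["Question", "Answer", "Sentiment"])
    for question, answer in rows:
        writer.writerow([question, answer, scorer.score_sentence(answer)])

# sample rows used only for illustration
rows = [("Sadness", "I do not feel sad."),
        ("Sadness", "I am so sad or unhappy that I can't stand it.")]
buffer = io.StringIO()
write_scores(rows, StubScorer(), buffer)
print(buffer.getvalue())
```

Each answer ends up on its own row next to its question name, which is the layout `predict_answers` later reads back with `pd.read_csv`.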

# **Computing Sentiment Scores for Reddit Posts**

**Calculates a sentiment score for all of the Reddit posts, using the chosen Sentiment Intensity Scorer**

*   scorer (*SentimentIntensityScorer*): chosen sentiment scorer
*   folder_path (*string*): path to data folder



In [None]:
def calc_sentiment(scorer, folder_path='/content/drive/MyDrive/CS408/2019_2020_TEST_DATA/'):
  #loop through all subject xml files
  for filename in glob.glob(os.path.join(folder_path, '*.xml')):
    #Get subjectname from the file name (without the extension)
    base = os.path.basename(filename)
    subjectname = os.path.splitext(base)[0]

    #Parse xml file for only the TEXT elements (the posts)
    mydoc = minidom.parse(filename)
    posts = mydoc.getElementsByTagName('TEXT')

    file = subjectname + "-post_sentiments.csv"
    with open(file, mode='w', newline='') as csv_file:
      csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
      #Add the column headers
      csv_writer.writerow(['Post', 'Sentiment'])
      for p in posts:
        #an empty TEXT element has no child node, so skip it
        if p.firstChild is None:
          continue
        post = p.firstChild.data
        #skip posts that contain only whitespace
        if post.strip():
          csv_writer.writerow([post, scorer.score_sentence(post)])

In [None]:
#example running calc_sentiment
calc_sentiment(distil_vader_avg_scorer)
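
The parsing step can be tried on a small in-memory document. The sample XML below is invented: only the `TEXT` tag name comes from the notebook, while the surrounding `INDIVIDUAL`, `ID` and `WRITING` tags and the helper name `extract_posts` are assumptions made for illustration.

```python
from xml.dom import minidom

# hypothetical subject file; only the TEXT tag name comes from the notebook
sample_xml = """<INDIVIDUAL>
  <ID>subject0001</ID>
  <WRITING><TEXT>First post.</TEXT></WRITING>
  <WRITING><TEXT>  </TEXT></WRITING>
  <WRITING><TEXT></TEXT></WRITING>
  <WRITING><TEXT>Second post.</TEXT></WRITING>
</INDIVIDUAL>"""

def extract_posts(xml_string):
    """Return the text of every TEXT element that is not blank."""
    doc = minidom.parseString(xml_string)
    posts = []
    for node in doc.getElementsByTagName("TEXT"):
        # an empty element has no child node, so guard before reading .data
        text = node.firstChild.data if node.firstChild else ""
        if text.strip():
            posts.append(text)
    return posts

print(extract_posts(sample_xml))  # ['First post.', 'Second post.']
```

The whitespace-only and empty `TEXT` elements are dropped, so they never reach the scorer.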

# **Predicting the BDI Answers**

Predicts all of the answers to the BDI, for each subject.


*   scorer_name (*string*): name of scorer to name file



In [None]:
def predict_answers(scorer_name):

  #create dataframe containing sentiment score for each answer in BDI
  csv_name = scorer_name + '/BDI_sentiment.csv'
  answer_scores = pd.read_csv(csv_name)

  #folder holding the 2019/2020 txt files with the subjects' real answers
  folder_path = '/content/drive/MyDrive/CS408/2019_2020_TEST_DATA/'
  for subject in subjects:

    #gather all sentiment scores for a single user's postings
    readfile = subject + "-post_sentiments.csv"
    sentiment_posts = pd.read_csv(readfile)['Sentiment'].tolist()

    #create csv per subject for their predicted and real answers 
    # (named to match the files read by the evaluation functions)
    createfile = subject + "-" + scorer_name + ".csv"
    with open(createfile, mode='w', newline='') as csv_file:
      csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
      #Add the column headers
      csv_writer.writerow(['Question', 'Predicted Answer', 'Real Answer'])

      #Search 2019/2020 txt files for subjectname
      for txtname in glob.glob(os.path.join(folder_path, '*.txt')):
        with open(txtname, 'r') as txt:
          for line in txt:
            values = line.split()
            if values and values[0] == subject:
              #values[1:] hold the subject's real answers to Q1..Q21
              for index, question in enumerate(questions, start=1):
                #dict of answer text -> sentiment score for this question
                rows = answer_scores[answer_scores['Question'] == question]
                scores = dict(zip(rows['Answer'], rows['Sentiment']))
                csv_writer.writerow(["Q" + str(index), compare_sentiments(scores, sentiment_posts), values[index]])

**Compares the sentiment intensity scores of the answers to a BDI question with those of a user's postings.**
- answer_scores (*dict*): maps each answer statement of a question to its sentiment score
- post_scores (*list*): list of the sentiment scores for all of a user's postings

In [None]:
def compare_sentiments(answer_scores, post_scores):
  #creates list with the values (sentiment scores) from dict 
  answer_scores_list = [*answer_scores.values()]

  #countlist will store how many times an answer has been picked
  # index of list = answer number (a question has either 4 or 7 answers)
  countlist = [0] * len(answer_scores_list)

  #goes through each post score and picks answer with closest score
  for post_score in post_scores:
    closest_score = min(answer_scores_list, key=lambda x: abs(x - post_score))
    closest = answer_scores_list.index(closest_score)
    countlist[closest] += 1

  #picks the answer by choosing the index with the highest count
  predicted = countlist.index(max(countlist))

  #question with more than 4 answers, convert to incorporate letters
  if len(countlist) > 4:
    numDict = {0:'0', 1:'1a', 2: '1b', 3: '2a', 4: '2b', 5: '3a', 6: '3b'}
    predicted = numDict[predicted]

  return predicted
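
One property of `compare_sentiments` worth keeping in mind is how ties are resolved. `list.index` returns the first match and `min` returns the first minimal element, so when two answers collect the same number of votes, or a post is equally close to two answer scores, the lower-numbered answer is chosen. A small illustration with invented numbers:

```python
# invented vote counts for a four-answer question: answers 1 and 3 tie
countlist = [2, 5, 1, 5]

predicted = countlist.index(max(countlist))
print(predicted)  # 1: the first index holding the maximum

# the same rule applies to the nearest-score lookup: a post exactly
# halfway between two answer scores goes to the earlier answer
answer_scores = [0.5, -0.5]
post_score = 0.0
closest = min(answer_scores, key=lambda x: abs(x - post_score))
print(answer_scores.index(closest))  # 0
```

Because lower answer numbers correspond to milder statements in the BDI, ties are resolved towards the milder answer.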

# **Evaluation Metrics**
* **Average Hit Rate (AHR)** - calculates the HR (hit rate) averaged over all users, where the HR is how many questions the system has predicted correctly.
* **Average Closeness Rate (ACR)** - calculates how close the system has predicted answer values for every question, averaged over all users.
* **Average Difference between Overall Depression Levels (ADODL)** - calculates how close the system has predicted a subject's depression score compared to their real score, averaged over all users.
* **Depression Category Hit Rate (DCHR)** - calculates the percentage of cases where the automated question answers matched up to the subject's true depression category.
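
Written out as formulas, the four metrics look as follows. The notation is introduced here for illustration and is an assumption, not the notebook's own: $U$ is the set of subjects, $p_{u,q}$ and $r_{u,q}$ are the predicted and real answer classes of subject $u$ for question $q$, $\hat{p}_{u,q}$ and $\hat{r}_{u,q}$ are their numeric values (so `1a` counts as 1), and $c(\cdot)$ maps a total score to its depression category. The constants 21, 3 and 63 are the number of questions, the maximum difference per question and the maximum total score used in the functions in this section.

```latex
$$
\begin{aligned}
\mathrm{AHR} &= \frac{100}{21\,|U|} \sum_{u \in U} \sum_{q=1}^{21} \mathbb{1}\left[p_{u,q} = r_{u,q}\right] \\
\mathrm{ACR} &= \frac{100}{21\,|U|} \sum_{u \in U} \sum_{q=1}^{21} \frac{3 - \left|\hat{p}_{u,q} - \hat{r}_{u,q}\right|}{3} \\
\mathrm{ADODL} &= \frac{100}{|U|} \sum_{u \in U} \frac{63 - \left|\sum_{q} \hat{p}_{u,q} - \sum_{q} \hat{r}_{u,q}\right|}{63} \\
\mathrm{DCHR} &= \frac{100}{|U|} \sum_{u \in U} \mathbb{1}\left[c\Big(\sum_{q} \hat{p}_{u,q}\Big) = c\Big(\sum_{q} \hat{r}_{u,q}\Big)\right]
\end{aligned}
$$
```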

**Convert Answer Class to Respective Value**

> Because question 16 and question 18 have 7 possible answers (0, 1a, 1b, 2a, 2b, 3a, 3b), the answer values are stored as either strings or integers in the dataframes. When calculating depression scores, only the number is taken into consideration. Thus, the method convert_to_int takes in any answer class and returns it as an integer (e.g. convert_to_int('1a') returns 1 and convert_to_int(2) just returns 2).

In [None]:
#converts string to integer for the answer class
def convert_to_int(s):
  #np.integer covers the integer types pandas returns for numeric columns
  if isinstance(s, (int, np.integer)):
    return int(s)
  #if 1a - 3b (or a numeric string) then take only the number part of string
  return int(str(s)[0])

### **Calculate AHR**

In [None]:
def calc_HR(subject, model):
  HR = 0
  temp_df = pd.read_csv(subject + "-" + model + ".csv")
  for index, row in temp_df.iterrows(): 
    if (row["Real Answer"] == row["Predicted Answer"]):
      HR += 1
  return HR

In [None]:
def calc_AHR(model):
  #calculate HR for all users then divide by total number of users
  total_correct_guesses = 0
  for subject in subjects:
    total_correct_guesses += calc_HR(subject, model)

  #1890 = 90 subjects * 21 questions
  AHR = (total_correct_guesses / 1890) * 100
  return AHR

### **Calculate ACR**

In [None]:
def calc_CR(row):
  real_answer = convert_to_int(row["Real Answer"])
  predicted_answer = convert_to_int(row["Predicted Answer"])

  #ad = absolute difference i.e. system = 3 and real = 1, so ad = 2
  ad = abs(predicted_answer - real_answer)

  #cr = (mad - ad) / mad, where mad (maximum absolute difference) = 3
  CR = (3-ad)/3
  return CR
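
A short worked example of the closeness rate for a single question. The helper `closeness_rate` and the answer pairs are invented for illustration; the formula is the one used in `calc_CR`, with 3 as the maximum absolute difference.

```python
def closeness_rate(predicted, real, max_difference=3):
    """CR for one question: 1.0 for an exact match, 0.0 for the furthest miss."""
    return (max_difference - abs(predicted - real)) / max_difference

# invented (predicted, real) answer values for three questions
pairs = [(3, 1), (2, 2), (0, 3)]
rates = [closeness_rate(p, r) for p, r in pairs]

# off by two -> 1/3, exact match -> 1.0, off by three -> 0.0
print(rates)
```

Unlike the hit rate, a near miss still earns partial credit, which is why ACR is always at least as high as AHR.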

In [None]:
def calc_ACR(model):
  CR_total = 0
  for subject in subjects:
    temp_df = pd.read_csv(subject + "-" + model + ".csv")
    for index, row in temp_df.iterrows():
      CR_total += calc_CR(row)

  ACR = (CR_total/1890) * 100
  return ACR

### **Calculate ADODL**

In [None]:
def calc_DODL(subject, model):
    temp_df = pd.read_csv(subject + "-" + model + ".csv")
    real_category = 0
    predicted_category = 0
    #need to account for 1a/1b etc
    for index, row in temp_df.iterrows(): 
      real_category += convert_to_int(row["Real Answer"])
      predicted_category += convert_to_int(row["Predicted Answer"])

    overall_ad = abs(predicted_category - real_category)
    DODL = (63 - overall_ad) / 63

    return DODL
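
The same calculation on invented answer values, as a sketch. The helper name `dodl` and the data are assumptions; the formula mirrors `calc_DODL`, with 63 as the maximum total score (21 questions with a top value of 3).

```python
def dodl(predicted_answers, real_answers, max_score=63):
    """DODL for one subject from numeric answer values."""
    overall_ad = abs(sum(predicted_answers) - sum(real_answers))
    return (max_score - overall_ad) / max_score

# invented answer values for the 21 questions of one subject
real = [1] * 21                   # total score 21
predicted = [2] * 10 + [1] * 11   # total score 31
print(dodl(predicted, real))      # (63 - 10) / 63

# per-question errors can cancel out in the totals
print(dodl([0, 2], [2, 0]))       # 1.0
```

The second call shows that DODL only looks at the overall score: two opposite errors leave the total unchanged, so the subject still gets a perfect DODL.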

In [None]:
def calc_ADODL(model):
  total_DODLs = 0
  for subject in subjects:
    total_DODLs += calc_DODL(subject, model)
  ADODL = (total_DODLs / 90) * 100

  return ADODL

### **Calculate DCHR**

In [None]:
'''
  0 - 9: minimal depression
  10 - 18: mild depression
  19 - 29: moderate depression
  30 - 63: severe depression
'''
def calc_DCHR(model):
  correct_guesses = 0
  for subject in subjects:
    real_category = 0
    predicted_category = 0
    temp_df = pd.read_csv(subject + "-" + model + ".csv")

    #need to account for 1a/1b etc
    for index, row in temp_df.iterrows(): 
      real_category += convert_to_int(row["Real Answer"])
      predicted_category += convert_to_int(row["Predicted Answer"])

    if (0 <= real_category <= 9) and (0 <= predicted_category <= 9):
      correct_guesses += 1
    elif (10 <= real_category <= 18) and (10 <= predicted_category <= 18):
      correct_guesses += 1
    elif (19 <= real_category <= 29) and (19 <= predicted_category <= 29):
      correct_guesses += 1
    elif (30 <= real_category <= 63) and (30 <= predicted_category <= 63):
      correct_guesses += 1
    
  DCHR = (correct_guesses / 90) * 100
  return DCHR
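
The category comparison can also be expressed as a lookup from total score to category, followed by a single equality check. The sketch below is one possible way to do this; `depression_category`, `CUTOFFS` and `LABELS` are hypothetical names, and the bands are the ones listed in the comment above `calc_DCHR`.

```python
from bisect import bisect_right

# lower bounds of the mild, moderate and severe bands
CUTOFFS = [10, 19, 30]
LABELS = ["minimal", "mild", "moderate", "severe"]

def depression_category(score):
    """Map a total BDI score (0 to 63) to its depression category."""
    if not 0 <= score <= 63:
        raise ValueError("BDI score must be between 0 and 63")
    # bisect_right counts how many cut-offs the score has reached
    return LABELS[bisect_right(CUTOFFS, score)]

for score in (9, 10, 18, 19, 29, 30, 63):
    print(score, depression_category(score))

# a subject counts as a hit when both totals fall in the same band
print(depression_category(12) == depression_category(17))  # True
```

Keeping the cut-offs in one list means the bands are defined in a single place rather than repeated across several conditions.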

# **Results of Models**

### **BERT Scorers**

**BERT Average Scorer**

In [None]:
%cd bert_vader_avg_scorer

#bert vader avg 
print("AHR: ", calc_AHR('bert_vader_avg_scorer'))
print("ACR: ", calc_ACR('bert_vader_avg_scorer'))
print("ADODL: ", calc_ADODL('bert_vader_avg_scorer'))
print("DCHR: ", calc_DCHR('bert_vader_avg_scorer'))

AHR:  23.597883597883598
ACR:  65.4144620811287
ADODL:  76.56084656084651
DCHR:  18.88888888888889


**BERT Max Scorer**

In [None]:
%cd bert_vader_max_scorer

#bert vader max
print("AHR: ", calc_AHR('bert_vader_max_scorer'))
print("ACR: ", calc_ACR('bert_vader_max_scorer'))
print("ADODL: ", calc_ADODL('bert_vader_max_scorer'))
print("DCHR: ", calc_DCHR('bert_vader_max_scorer'))

AHR:  19.417989417989418
ACR:  60.77601410934738
ADODL:  73.43915343915339
DCHR:  23.333333333333332


### **DistilBERT Scorer**

In [None]:
%cd distil_vader_avg_scorer

#distil avg
print("AHR: ", calc_AHR('distil_vader_avg_scorer'))
print("ACR: ", calc_ACR('distil_vader_avg_scorer'))
print("ADODL: ", calc_ADODL('distil_vader_avg_scorer'))
print("DCHR: ", calc_DCHR('distil_vader_avg_scorer'))

/content/drive/MyDrive/CS408/Sentiment_Analysis/distil_vader_avg_scorer
AHR:  24.55026455026455
ACR:  64.97354497354499
ADODL:  77.1781305114638
DCHR:  28.888888888888886


### **RoBERTa Scorer**

In [None]:
%cd roberta_vader_avg_scorer

#roberta avg
print("AHR: ", calc_AHR('roberta_vader_avg_scorer'))
print("ACR: ", calc_ACR('roberta_vader_avg_scorer'))
print("ADODL: ", calc_ADODL('roberta_vader_avg_scorer'))
print("DCHR: ", calc_DCHR('roberta_vader_avg_scorer'))

/content/drive/My Drive/CS408/Sentiment_Analysis/roberta_vader_avg_scorer
AHR:  23.597883597883598
ACR:  65.4144620811287
ADODL:  76.56084656084651
DCHR:  18.88888888888889
