# **Cosine Similarity System**
This system uses cosine similarity to compare all of a user's Reddit posts with each possible answer to a question in the BDI (Beck Depression Inventory). To pick an answer, an average similarity score is calculated per answer and the answer with the highest average is chosen.

The system tests different BERT-based pretrained Sentence Transformers models to analyse which one performs best at the task.
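
As a rough illustration of the selection rule described above, the sketch below averages per-post scores for each answer and returns the index of the highest average. The function name `pick_answer` and the toy scores are assumptions made for this example; they are not part of the notebook.

```python
from statistics import mean

def pick_answer(scores_per_answer):
    """Return the index of the answer whose mean score is highest."""
    averages = [mean(scores) for scores in scores_per_answer]
    return averages.index(max(averages))

# Hypothetical cosine scores: one inner list per answer, one score per post.
toy_scores = [
    [0.10, 0.20, 0.30],  # answer 0
    [0.40, 0.50, 0.60],  # answer 1
    [0.05, 0.15, 0.10],  # answer 2
    [0.30, 0.20, 0.10],  # answer 3
]
print(pick_answer(toy_scores))
```

Ties resolve to the lowest index because `list.index` returns the first match, which is also how the notebook's own `averages.index(max(averages))` behaves.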

# **Environment Set-up**


### **Importing and Installing Required Libraries**

In [None]:
#libraries utilised to parse and index dataset
import os, glob, csv
import pandas as pd
from xml.dom import minidom

#Google Colab does not have SentenceTransformers installed as default so pip install is required
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer, util

In [None]:
#change to correct directory
%cd drive/MyDrive/CS408/

[Errno 2] No such file or directory: 'drive/MyDrive/CS408/'
/content/drive/MyDrive/CS408


# **Preparing Data**
The dataset has been preprocessed here: https://colab.research.google.com/drive/1J8XAD7JPShZQuBRJ6zfFjV1dUtH4y858?usp=sharing

In [None]:
bdi_questions_answers = pd.read_csv("BDI_csv.csv")
bdi_questions_answers.head()

Unnamed: 0,Question Number,Question,Class,Answer
0,1,Sadness,0,I do not feel sad.
1,1,Sadness,1,I feel sad much of the time.
2,1,Sadness,2,I am sad all the time.
3,1,Sadness,3,I am so sad or unhappy that I can't stand it.
4,2,Pessimism,0,I am not discouraged about my future.


In [None]:
#list of all 21 question names in BDI i.e. ['Sadness', 'Pessimism', ...]
questions = bdi_questions_answers['Question'].unique()

#list of lists where each element contains all possible answers to one question
answers = []

#look up in bdi_questions_answers df for all possible answers per question and add to answers
for question_name in questions:
  answers.append(bdi_questions_answers.loc[((bdi_questions_answers['Question'] == question_name)), 'Answer'].values)

#preview of the answers list, showing the 4 answers to 'Sadness' in BDI
print(answers[0])

['I do not feel sad.' 'I feel sad much of the time.'
 'I am sad all the time.' "I am so sad or unhappy that I can't stand it."]


## **Create Dataframes**

Convert the wrangled dataset csvs into dataframes for ease of use later on. Creating variables and dataframes dynamically is bad practice, so the csvs were made first and are converted here into their respective dataframes.

In [None]:
%cd question_csvs

#creating data frames for each question 
agitation_df = pd.read_csv("answer_classes_posts_Agitation.csv") 
appetite_df = pd.read_csv("answer_classes_posts_Changes in Appetite.csv")
sleep_df = pd.read_csv("answer_classes_posts_Changes in Sleeping Pattern.csv") 
concentration_df = pd.read_csv("answer_classes_posts_Concentration Difficulty.csv") 
crying_df = pd.read_csv("answer_classes_posts_Crying.csv") 
guilty_df = pd.read_csv("answer_classes_posts_Guilty Feelings.csv")
indecisive_df = pd.read_csv("answer_classes_posts_Indecisiveness.csv") 
irritability_df = pd.read_csv("answer_classes_posts_Irritability.csv") 
energy_df = pd.read_csv("answer_classes_posts_Loss of Energy.csv") 
sexinterest_df = pd.read_csv("answer_classes_posts_Loss of Interest in Sex.csv") 
interest_df = pd.read_csv("answer_classes_posts_Loss of Interest.csv") 
pleasure_df = pd.read_csv("answer_classes_posts_Loss of Pleasure.csv") 
pastfailure_df = pd.read_csv("answer_classes_posts_Past Failure.csv") 
pessimism_df = pd.read_csv("answer_classes_posts_Pessimism.csv") 
punishment_df = pd.read_csv("answer_classes_posts_Punishment Feelings.csv") 
sadness_df = pd.read_csv("answer_classes_posts_Sadness.csv") 
selfcritcalness_df = pd.read_csv("answer_classes_posts_Self-Criticalness.csv") 
selfdislike_df = pd.read_csv("answer_classes_posts_Self-Dislike.csv") 
suicidal_df = pd.read_csv("answer_classes_posts_Suicidal Thoughts or Wishes.csv") 
fatigue_df = pd.read_csv("answer_classes_posts_Tiredness or Fatigue.csv") 
worthlessness_df = pd.read_csv("answer_classes_posts_Worthlessness.csv") 

#create a list of the subject names as strings 
# (any df could be used for this)
subjects = agitation_df['Subject'].unique()

#add all dataframes in order of BDI questions 
BDI_df = [sadness_df, pessimism_df, pastfailure_df, pleasure_df, guilty_df, punishment_df, selfdislike_df, selfcritcalness_df, suicidal_df, crying_df, agitation_df, interest_df, indecisive_df, worthlessness_df, energy_df, sleep_df, irritability_df, appetite_df, concentration_df, fatigue_df, sexinterest_df]

/content/drive/My Drive/CS408/question_csvs


In [None]:
#preview of Agitation dataframe
agitation_df.head()

Unnamed: 0,Subject,Class,Answer,Post
0,subject5897,0,I am no more restless or wound up than usual.,"I didnt drop out, but took some time off afte..."
1,subject5897,0,I am no more restless or wound up than usual.,"Definitely doable, just be prepared for lots ..."
2,subject5897,0,I am no more restless or wound up than usual.,You definitely should!
3,subject5897,0,I am no more restless or wound up than usual.,"Wow I love this, do you have a website?"
4,subject5897,0,I am no more restless or wound up than usual.,Watermelon snow!


# **Choose Model**
Choose the pretrained Sentence Transformer model you would like to use:
`embedder = SentenceTransformer(*pretrained-model-name*)`


*   **BERT (Bidirectional Encoder Representations from Transformers):** pre-trained on 2 tasks:
  *   MLM (Masked Language Model): 15% of the tokens in a sequence are selected for masking, most of them being replaced with a MASK token, and the model then predicts the original values of the masked tokens - it attempts to fill in the blanks.
  *   NSP (Next Sentence Prediction): the model receives pairs of sentences as input and learns to predict if the second sentence in the pair follows the first sentence in the original document.
*   **RoBERTa (Robustly optimised BERT approach):** retraining of BERT that removes the NSP task from BERT's pre-training and uses dynamic masking so that the masked token changes during the training epochs.
*   **DistilBERT:** uses a compression technique called distillation where a small model is trained to reproduce the behaviour of a larger model (BERT).
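
To make the MLM idea in the list above concrete, here is a simplified sketch that masks roughly 15% of the tokens in a sentence and records what the model would have to predict. It is an illustration only: the helper `mask_tokens`, the whitespace tokenisation and the fixed seed are assumptions, and real BERT pre-training works on subword tokens and does not always substitute the MASK token.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace about mask_rate of the tokens with [MASK].

    Returns the masked sequence and a dict of position -> original token,
    i.e. the targets a masked language model would be asked to predict.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]
    labels = {i: tokens[i] for i in sorted(positions)}
    return masked, labels

tokens = "i feel sad much of the time".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

Dynamic masking, as used by RoBERTa, would correspond to calling the helper with a different seed each epoch so that different positions are hidden.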

### **BERT NLI model for Sentence Embeddings**


In [None]:
embedder = SentenceTransformer('bert-base-nli-mean-tokens')

### **Sentence Embeddings Models trained on STSb (Semantic Textual Similarity benchmark)**


In [None]:
embedder = SentenceTransformer('stsb-bert-base')

In [None]:
embedder = SentenceTransformer('stsb-roberta-base')

In [None]:
embedder = SentenceTransformer('stsb-distilbert-base')

# **BDI Prediction System**

**Scores the cosine similarity between a post and answer, using the chosen embedder, and returns this score**


*   post (*string*): a single Reddit post 
*   answer (*string*): an answer to one of the BDI questions



In [None]:
def score_post(post, answer):
  corpus_embeddings = embedder.encode(post)
  query_embeddings = embedder.encode(answer)
  cos_score = util.pytorch_cos_sim(query_embeddings, corpus_embeddings)[0]
  return cos_score
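
For reference, the quantity that `util.pytorch_cos_sim` returns for a pair of embeddings can be sketched in plain Python as the dot product divided by the product of the vector lengths. The function below is a hand-written illustration on small lists, not the library's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal
print(cosine_similarity([1, 0], [-1, 0]))       # opposite direction
```

The score depends only on direction, not magnitude, so it ranges from -1 (opposite) through 0 (orthogonal) to 1 (same direction).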

**Scores cosine similarity between all of the user's posts and a single answer. Returns a list of the scores**


*   posts (*list*): subject's postings
*   answer (*string*): an answer to one of the BDI questions



In [None]:
#For all of a user's posts it scores each post and returns a list of the scores
def score_all_posts(posts, answer):
  cos_scores = []
  for post in posts:
    cos_scores.append(score_post(post, answer))
  return cos_scores

**Selects an answer to a question in BDI by picking the highest cosine similarity average.**

This function also saves the cosine similarity scores to a csv so that every calculated score is recorded. This means the scores can be checked later on, along with the post and answer each calculation was run on.
*   posts (*list*): subject's postings 
*   answers (*list*): possible written answers to a question 
*   subject (*string*): subject name





In [None]:
#csv file where all the cosine similarity scores are saved
with open('all_cosine_scores.csv', mode='w') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerow(['Subject','Post', 'Answer', 'Cosine Score'])

In [None]:
def select_answer(posts, answers, subject):
  #initialise variables
  total = 0
  index = 0
  averages = []

  #dictionary for questions with 7 possible answers 
  numDict = {0:'0', 1:'1a', 2: '1b', 3: '2a', 4: '2b', 5: '3a', 6: '3b'}
  
  #open the scores csv once per call rather than once per score
  with open('all_cosine_scores.csv', mode='a') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    for answer in answers:
      scores = score_all_posts(posts, answer)

      for i, score in enumerate(scores):
        #log the cosine similarity between this post and this candidate answer
        csv_writer.writerow([subject, posts[i], answer, score.item()])

        #total is all the cosine scores added together
        total = total + score.item()
        index += 1

      #average score for this answer, then reset the accumulators
      averages.append(total / index)
      total = 0
      index = 0

  #if question is changes in sleeping/appetite check dictionary for correct answers
  if len(averages) > 4:
    return numDict.get(averages.index(max(averages)))

  #picks the maximum avg 
  return averages.index(max(averages))
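
The final step of `select_answer`, turning the best average into an answer class, can be sketched on its own as follows. The names `SEVEN_ANSWER_LABELS` and `label_for` are assumptions for this example; the labels themselves are taken from `numDict` above.

```python
# Labels for the two questions with 7 possible answers
# (changes in sleeping pattern and changes in appetite).
SEVEN_ANSWER_LABELS = ['0', '1a', '1b', '2a', '2b', '3a', '3b']

def label_for(averages):
    """Map the position of the highest average to an answer class."""
    best = averages.index(max(averages))
    if len(averages) > 4:
        return SEVEN_ANSWER_LABELS[best]
    return best

print(label_for([0.1, 0.4, 0.2, 0.3]))
print(label_for([0.1, 0.2, 0.9, 0.3, 0.1, 0.0, 0.2]))
```

Note that 4-answer questions yield an integer while 7-answer questions yield a string, which is why a conversion step is needed before scores are summed.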


**Predicts all users' answers to BDI and saves the predicted answers alongside the true answers in a csv.**


*   model (*string*): name of embedder used, which also forms part of each output csv's filename
*   truncate (*boolean*): tells system whether to truncate posts to 128 characters (True) or to use the full posts (False).



In [None]:
def run_model(model, truncate):
  for subject in subjects:
    posts = []
    filename = subject + '-' + model + ".csv"

    with open(filename, mode='w') as csv_file:
      csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
      csv_writer.writerow(['Question', 'Predicted Answer', 'Real Answer'])

      #BDI_df and answers are both in BDI question order, so walk them together
      for index, (question_df, a) in enumerate(zip(BDI_df, answers), start=1):
        subject_rows = question_df[question_df['Subject'] == subject]

        #get subject's answer number for this specific question
        answer = subject_rows['Class'].values[0]

        #if it's first loop on subject then gather all their posts
        if not posts:
          for post in subject_rows['Post']:
            #only shorten posts to 128 characters if truncate is set to True
            posts.append(post[:128] if truncate else post)

        csv_writer.writerow([("Q" + str(index)), select_answer(posts, a, subject), answer])

In [None]:
from sklearn.model_selection import KFold

# **Evaluation Metrics**
* **Average Hit Rate (AHR)** - calculates the HR (hit rate) averaged over all users, where the HR is how many questions the system has predicted correctly.
* **Average Closeness Rate (ACR)** - calculates how close the system has predicted answer values for every question, averaged over all users.
* **Average Difference between Overall Depression Levels (ADODL)** - calculates how close the system has predicted a subject's depression score compared to their real score, averaged over all users.
* **Depression Category Hit Rate (DCHR)** - calculates the percentage of subjects whose depression category, estimated from the predicted answers, matches their true depression category.
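
As a small worked example of the first two metrics, the sketch below computes a hit rate and a closeness rate for one hypothetical user with four questions. The helper names and the toy answers are assumptions; the notebook's own functions work on csv files and report percentages over all users, whereas this sketch returns fractions for a single user.

```python
def hit_rate(real, predicted):
    """Fraction of questions where the predicted answer equals the real one."""
    hits = sum(r == p for r, p in zip(real, predicted))
    return hits / len(real)

def closeness_rate(real, predicted, max_diff=3):
    """Mean of (max_diff - |real - predicted|) / max_diff over the questions."""
    rates = [(max_diff - abs(r - p)) / max_diff for r, p in zip(real, predicted)]
    return sum(rates) / len(rates)

real = [0, 1, 2, 3]
predicted = [0, 2, 2, 0]
print(hit_rate(real, predicted))
print(closeness_rate(real, predicted))
```

Here two of the four answers are exact hits, while the closeness rate also gives partial credit to the answer that is off by one and no credit to the answer that is off by three.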

**Convert Answer Class to Respective Value**

> Because question 16 and question 18 have 7 possible answers (0, 1a, 1b, 2a, 2b, 3a, 3b), the answer values are stored as either strings or integers in the dataframes. When calculating depression scores, only the number is taken into consideration. Thus, the method convert_to_int will take in any answer class and return it as an integer (e.g. convert_to_int('1a') returns 1 and convert_to_int(2) just returns 2).










In [None]:
#converts the answer class (an integer, or a string such as '1a') to an integer
def convert_to_int(s):
  if isinstance(s, int):
    return s
  #if 1a - 3b then take only the number part of string
  #(str() also covers numeric types that pandas may return instead of int)
  return int(str(s)[0])

### **Calculate AHR**

In [None]:
def calc_HR(subject, model):
  HR = 0
  temp_df = pd.read_csv(subject + "-" + model + ".csv")
  for index, row in temp_df.iterrows(): 
    if (row["Real Answer"] == row["Predicted Answer"]):
      HR += 1
  return HR

In [None]:
def calc_AHR(model):
  #calculate HR for all users then divide by total number of users
  total_correct_guesses = 0
  for subject in subjects:
    total_correct_guesses += calc_HR(subject, model)

  #1890 = 90 subjects * 21 questions
  AHR = (total_correct_guesses / 1890) * 100
  return AHR

### **Calculate ACR**

In [None]:
def calc_CR(row):
  real_answer = convert_to_int(row["Real Answer"])
  predicted_answer = convert_to_int(row["Predicted Answer"])

  #ad = absolute difference i.e. system = 3 and real = 1, so ad = 2
  ad = abs(predicted_answer - real_answer)

  #cr = (mad - ad) / mad, where mad = 3 is the maximum absolute difference
  CR = (3-ad)/3
  return CR

In [None]:
def calc_ACR(model):
  CR_total = 0
  for subject in subjects:
    temp_df = pd.read_csv(subject + "-" + model + ".csv")
    for index, row in temp_df.iterrows():
      CR_total += calc_CR(row)

  ACR = (CR_total/1890) * 100
  return ACR

### **Calculate ADODL**

In [None]:
def calc_DODL(subject, model):
    temp_df = pd.read_csv(subject + "-" + model + ".csv")
    real_category = 0
    predicted_category = 0
    #need to account for 1a/1b etc
    for index, row in temp_df.iterrows(): 
      real_category += convert_to_int(row["Real Answer"])
      predicted_category += convert_to_int(row["Predicted Answer"])

    overall_ad = abs(predicted_category - real_category)
    DODL = (63 - overall_ad) / 63

    return DODL

In [None]:
def calc_ADODL(model):
  total_DODLs = 0
  for subject in subjects:
    total_DODLs += calc_DODL(subject, model)
  ADODL = (total_DODLs / 90) * 100

  return ADODL
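
A worked example may help with the DODL formula used above: the overall depression level is the sum of the 21 answer values (0 to 63), and DODL scales the absolute difference between the real and predicted sums into the range 0 to 1. The function `dodl` and the toy answer lists below are assumptions for illustration.

```python
def dodl(real_answers, predicted_answers, max_score=63):
    """(max_score - |predicted total - real total|) / max_score."""
    overall_ad = abs(sum(predicted_answers) - sum(real_answers))
    return (max_score - overall_ad) / max_score

# Hypothetical user: every real answer is 1, every predicted answer is 2.
real_answers = [1] * 21       # total of 21
predicted_answers = [2] * 21  # total of 42
print(dodl(real_answers, predicted_answers))
```

Because only the totals are compared, errors in opposite directions on different questions cancel out, so a high DODL does not imply a high hit rate.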

### **Calculate DCHR**

In [None]:
'''
  0 - 9: minimal depression
  10 - 18: mild depression
  19 - 29: moderate depression
  30 - 63: severe depression
'''
def calc_DCHR(model):
  correct_guesses = 0
  for subject in subjects:
    real_category = 0
    predicted_category = 0
    temp_df = pd.read_csv(subject + "-" + model + ".csv")

    #need to account for 1a/1b etc
    for index, row in temp_df.iterrows(): 
      real_category += convert_to_int(row["Real Answer"])
      predicted_category += convert_to_int(row["Predicted Answer"])

    #a hit is counted when both totals fall in the same depression category
    if (0 <= real_category <= 9) and (0 <= predicted_category <= 9):
      correct_guesses += 1
    elif (10 <= real_category <= 18) and (10 <= predicted_category <= 18):
      correct_guesses += 1
    elif (19 <= real_category <= 29) and (19 <= predicted_category <= 29):
      correct_guesses += 1
    elif (30 <= real_category <= 63) and (30 <= predicted_category <= 63):
      correct_guesses += 1
    
  DCHR = (correct_guesses / 90) * 100
  return DCHR
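
One possible way to express the category boundaries listed above is a lookup helper, so that a hit becomes a single equality check between two category names. This is a sketch of an alternative formulation, not the notebook's code; `UPPER_BOUNDS`, `CATEGORIES` and `depression_category` are names introduced here.

```python
from bisect import bisect_left

# Inclusive upper bounds of the first three categories; anything above is severe.
UPPER_BOUNDS = [9, 18, 29]
CATEGORIES = ['minimal', 'mild', 'moderate', 'severe']

def depression_category(score):
    """Map a total BDI score (0 to 63) to its depression category."""
    return CATEGORIES[bisect_left(UPPER_BOUNDS, score)]

for score in (0, 9, 10, 18, 19, 29, 30, 63):
    print(score, depression_category(score))
```

With such a helper, the comparison for one subject would reduce to `depression_category(real_total) == depression_category(predicted_total)`.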

# **Results of Models**

## **BERT NLI Model**

In [None]:
run_model('bert', False)

In [None]:
print("AHR: ", calc_AHR('bert'))
print("ACR: ", calc_ACR('bert'))
print("ADODL: ", calc_ADODL('bert'))
print("DCHR: ", calc_DCHR('bert'))

AHR:  21.11111111111111
ACR:  60.105820105819966
ADODL:  73.2627865961199
DCHR:  24.444444444444443


## **BERT STSb Model**

In [None]:
%cd MyDrive/CS408/Cosine_Sim/bert
run_model('stsb-bert', False)

In [None]:
print("AHR: ", calc_AHR('stsb-bert'))
print("ACR: ", calc_ACR('stsb-bert'))
print("ADODL: ", calc_ADODL('stsb-bert'))
print("DCHR: ", calc_DCHR('stsb-bert'))

AHR:  21.64021164021164
ACR:  60.52910052910045
ADODL:  75.1675485008818
DCHR:  18.88888888888889


## **RoBERTa STSb Model**

In [None]:
%cd ../roberta
run_model('roberta', False)

In [None]:
print("AHR: ", calc_AHR('roberta'))
print("ACR: ", calc_ACR('roberta'))
print("ADODL: ", calc_ADODL('roberta'))
print("DCHR: ", calc_DCHR('roberta'))

AHR:  24.92063492063492
ACR:  66.2610229276897
ADODL:  79.06525573192232
DCHR:  30.0


## **DistilBERT STSb Model**

In [None]:
%cd ../distil
run_model('distilbert', False)

In [None]:
print("AHR: ", calc_AHR('distilbert'))
print("ACR: ", calc_ACR('distilbert'))
print("ADODL: ", calc_ADODL('distilbert'))
print("DCHR: ", calc_DCHR('distilbert'))

AHR:  23.174603174603174
ACR:  61.199294532627746
ADODL:  76.43738977072303
DCHR:  17.77777777777778


## **RoBERTa (Truncated) STSb Model**
This model is the same as the first RoBERTa model except it truncates the posts to 128 characters.

In [None]:
%cd ../Cosine_Sim/roberta-2/
run_model('roberta-2', True)

In [None]:
print("AHR: ", calc_AHR('roberta-2'))
print("ACR: ", calc_ACR('roberta-2'))
print("ADODL: ", calc_ADODL('roberta-2'))
print("DCHR: ", calc_DCHR('roberta-2'))

AHR:  25.71428571428571
ACR:  67.05467372134045
ADODL:  79.29453262786592
DCHR:  32.22222222222222
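
To compare the five runs side by side, the printed scores above can be collected into a dictionary and the best model per metric looked up. The sketch below copies the reported values rounded to two decimal places; the dictionary layout and the display names are assumptions made for this example.

```python
# Scores printed in the cells above, rounded to 2 decimal places.
reported = {
    'BERT NLI':                 {'AHR': 21.11, 'ACR': 60.11, 'ADODL': 73.26, 'DCHR': 24.44},
    'BERT STSb':                {'AHR': 21.64, 'ACR': 60.53, 'ADODL': 75.17, 'DCHR': 18.89},
    'RoBERTa STSb':             {'AHR': 24.92, 'ACR': 66.26, 'ADODL': 79.07, 'DCHR': 30.00},
    'DistilBERT STSb':          {'AHR': 23.17, 'ACR': 61.20, 'ADODL': 76.44, 'DCHR': 17.78},
    'RoBERTa STSb (truncated)': {'AHR': 25.71, 'ACR': 67.05, 'ADODL': 79.29, 'DCHR': 32.22},
}

# For each metric, find the model with the highest score.
best_by_metric = {
    metric: max(reported, key=lambda name: reported[name][metric])
    for metric in ('AHR', 'ACR', 'ADODL', 'DCHR')
}

for metric, name in best_by_metric.items():
    print(metric, name, reported[name][metric])
```

On these figures the truncated RoBERTa run scores highest on all four metrics, with the untruncated RoBERTa run second on each.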
