# **Multi-Class Text Classification using a Linear Support Vector Machine (SVM) with Scikit-Learn**
This notebook implements a linear Support Vector Machine (SVM) that predicts the severity of a user's depression from their Reddit posts.

The model was built in response to Task 3 of CLEF's eRisk 2021 workshop and uses that task's dataset: https://early.irlab.org.


# **Environment Set-up**

### **Importing and Installing Required Libraries**

In [32]:
#libraries utilised to parse and index dataset
import numpy as np
import pandas as pd
import csv, glob

#scikit-learn imports
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

#move to the correct directory
%cd drive/MyDrive/CS408/

[Errno 2] No such file or directory: 'drive/MyDrive/CS408/'
/content/drive/MyDrive/CS408/SVM/question_results-k1


# **Preparing Data**
The dataset has been preprocessed here: https://colab.research.google.com/drive/1J8XAD7JPShZQuBRJ6zfFjV1dUtH4y858?usp=sharing

In [36]:
bdi_questions_answers = pd.read_csv("BDI_csv.csv")

#list of all 21 question names in BDI i.e. ['Sadness', 'Pessimism', ...]
questions = bdi_questions_answers['Question'].unique()

#list of lists where each element contains all possible answers for one question
answers = []

#look up in bdi_questions_answers df for all possible answers per question and add to answers
for question_name in questions:
  answers.append(bdi_questions_answers.loc[((bdi_questions_answers['Question'] == question_name)), 'Answer'].values)
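
The lookup above filters the whole dataframe once per question. As a hedged alternative, the same per-question answer lists can be built with a single groupby. The sketch below uses a small made-up frame (toy_bdi and its rows are illustrative, not the real BDI_csv.csv contents) and assumes only the 'Question' and 'Answer' columns used above.

```python
import pandas as pd

# Toy stand-in for BDI_csv.csv; these rows are illustrative only.
toy_bdi = pd.DataFrame({
    "Question": ["Sadness", "Sadness", "Pessimism", "Pessimism"],
    "Answer": ["0", "1", "0", "1"],
})

# Group once instead of filtering the frame for every question.
# sort=False keeps questions in order of first appearance, matching unique().
answers_by_question = {
    name: group["Answer"].values
    for name, group in toy_bdi.groupby("Question", sort=False)
}

toy_questions = list(answers_by_question)
toy_answers = list(answers_by_question.values())
print(toy_questions)
```

Keeping the result in a dictionary also lets the answers be looked up by question name rather than by list position.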

In [37]:
%cd question_csvs/

/content/drive/My Drive/CS408/question_csvs


In [38]:
#Creating data frames for each question 
agitation_df = pd.read_csv("answer_classes_posts_Agitation.csv") 
appetite_df = pd.read_csv("answer_classes_posts_Changes in Appetite.csv")
sleep_df = pd.read_csv("answer_classes_posts_Changes in Sleeping Pattern.csv") 
concentration_df = pd.read_csv("answer_classes_posts_Concentration Difficulty.csv") 
crying_df = pd.read_csv("answer_classes_posts_Crying.csv") 
guilty_df = pd.read_csv("answer_classes_posts_Guilty Feelings.csv")
indecisive_df = pd.read_csv("answer_classes_posts_Indecisiveness.csv") 
irritability_df = pd.read_csv("answer_classes_posts_Irritability.csv") 
energy_df = pd.read_csv("answer_classes_posts_Loss of Energy.csv") 
sexinterest_df = pd.read_csv("answer_classes_posts_Loss of Interest in Sex.csv") 
interest_df = pd.read_csv("answer_classes_posts_Loss of Interest.csv") 
pleasure_df = pd.read_csv("answer_classes_posts_Loss of Pleasure.csv") 
pastfailure_df = pd.read_csv("answer_classes_posts_Past Failure.csv") 
pessimism_df = pd.read_csv("answer_classes_posts_Pessimism.csv") 
punishment_df = pd.read_csv("answer_classes_posts_Punishment Feelings.csv") 
sadness_df = pd.read_csv("answer_classes_posts_Sadness.csv") 
selfcritcalness_df = pd.read_csv("answer_classes_posts_Self-Criticalness.csv") 
selfdislike_df = pd.read_csv("answer_classes_posts_Self-Dislike.csv") 
suicidal_df = pd.read_csv("answer_classes_posts_Suicidal Thoughts or Wishes.csv") 
fatigue_df = pd.read_csv("answer_classes_posts_Tiredness or Fatigue.csv") 
worthlessness_df = pd.read_csv("answer_classes_posts_Worthlessness.csv") 

#Create a list of the subject names as strings 
# (any df could be used for this)
subjects = agitation_df['Subject'].unique()

#Add all dataframes in order of BDI questions 
BDI_df = [sadness_df, pessimism_df, pastfailure_df, pleasure_df, guilty_df, punishment_df, selfdislike_df, selfcritcalness_df, suicidal_df, crying_df, agitation_df, interest_df, indecisive_df, worthlessness_df, energy_df, sleep_df, irritability_df, appetite_df, concentration_df, fatigue_df, sexinterest_df]
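
Since every file above follows the pattern answer_classes_posts_<question name>.csv, the 21 reads could in principle be generated from the question names. The following is a minimal sketch under two assumptions that the notebook does not confirm: that the names in the questions array match the file names exactly, and that they are already in BDI order. The names example_questions and question_csv_name are hypothetical.

```python
# Hypothetical subset of question names; the notebook's `questions` array holds all 21.
example_questions = ["Sadness", "Pessimism", "Past Failure"]

def question_csv_name(question_name):
    """Build the per-question file name following the pattern used above."""
    return "answer_classes_posts_" + question_name + ".csv"

example_files = [question_csv_name(q) for q in example_questions]
print(example_files)

# With the real files present, and if the assumptions above hold,
# the ordered list could then be built as:
# BDI_df = [pd.read_csv(question_csv_name(q)) for q in questions]
```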

# **Create and Run SVM Multi Class Classifier**
In the loop below, a linear SVM is trained for each BDI question to classify a user's Reddit posts into that question's answer classes.

K-fold cross-validation (with k=5) is performed using scikit-learn's KFold class. I set shuffle to True in KFold's parameters so that each train/test split contains a mix of subjects and posts. Then, random_state is set to 42 so that the shuffled folds are identical every time the notebook is run.
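
To illustrate the effect of shuffle and random_state, here is a small sketch on ten dummy samples (toy_posts and fold_test_indices are illustrative names, not part of the notebook). It checks that two KFold objects built with the same seed produce the same folds, and that every sample appears in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_posts = np.arange(10)  # stand-in for ten posts

def fold_test_indices(seed):
    """Return the test indices of each of the 5 shuffled folds."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [test_idx.tolist() for _, test_idx in kf.split(toy_posts)]

first_run = fold_test_indices(42)
second_run = fold_test_indices(42)

# The same seed gives identical folds on every run.
print(first_run == second_run)

# Every sample lands in exactly one test fold.
all_test = sorted(i for fold in first_run for i in fold)
print(all_test == list(range(10)))
```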

In [6]:
%cd ../SVM/question_results

/content/drive/My Drive/CS408/SVM/question_results


In [7]:
#loop through each question dataframe
for i, df in enumerate(BDI_df):
  
  #set data and labels 
  X = df.Post
  y = pd.Series(df.Class)

  #split into 5 shuffled folds
  kf = KFold(n_splits=5, random_state=42, shuffle=True)
  kf.get_n_splits(X, y)

  #variable used to name files, indicating the test fold
  kfold = 0

  for train_index, test_index in kf.split(X, y):
    kfold += 1
    #KFold yields positional indices, so select rows by position with iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    #create Linear SVM using SGD Classifier
    sgd = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42, max_iter=5, tol=None)),
                  ])
    
    #fit the model
    sgd.fit(X_train, y_train)

    #obtain predicted classes 
    y_pred = sgd.predict(X_test)

    #create dataframe storing the predicted answers
    resultsdf = y_test.to_frame()
    resultsdf.insert(1, 'Predicted Class', value=y_pred)

    #sort the shuffled data back into place, 
    # in order to match predicted answers with true class, subject and posts
    resultsdf.sort_index(inplace=True)
    sorted_indexes = resultsdf.index.to_list()

    #create csv per question and per fold, making it easy to calculate evaluation metrics
    filename = questions[i] + '_results-k' + str(kfold) + '.csv'
    #newline='' is recommended by the csv module so that row endings are not translated
    with open(filename, mode='w', newline='') as csv_file:
      csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
      csv_writer.writerow(['Subject', 'Post', 'Class', 'Predicted Class'])
      count = 0
      for j, row in df.iterrows():
        if count == len(sorted_indexes):
          break
        if (j == sorted_indexes[count]):
          subject = row['Subject']
          post = row['Post']
          real_class = resultsdf.loc[sorted_indexes[count], 'Class']
          pred_class = resultsdf.loc[sorted_indexes[count], 'Predicted Class']
          csv_writer.writerow([subject, post, real_class, pred_class])
          count += 1
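
The pipeline used in the loop above can be tried in isolation. The sketch below fits the same three steps with the same SGDClassifier settings on four made-up posts; the texts and class labels are invented for illustration and say nothing about the real dataset or its results.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Invented posts and answer classes, for illustration only.
toy_posts = [
    "had a great day out with friends",
    "feeling fine and enjoying the weekend",
    "i cannot stop crying and feel hopeless",
    "everything feels hopeless and i am exhausted",
]
toy_classes = [0, 0, 3, 3]

# Hinge loss with an L2 penalty makes SGDClassifier a linear SVM.
toy_model = Pipeline([
    ("vect", CountVectorizer()),    # raw text -> token counts
    ("tfidf", TfidfTransformer()),  # token counts -> TF-IDF weights
    ("clf", SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3,
                          random_state=42, max_iter=5, tol=None)),
])

toy_model.fit(toy_posts, toy_classes)
toy_pred = toy_model.predict(["i feel hopeless today"])
print(toy_pred)
```

The first two steps could also be replaced by a single TfidfVectorizer, which combines counting and TF-IDF weighting in one transformer.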

# **Evaluation Metrics**
* **Average Hit Rate (AHR)** - the hit rate (HR) averaged over all users, where a user's HR is the share of BDI questions for which the system predicted exactly the real answer.
* **Average Closeness Rate (ACR)** - the closeness rate (CR) averaged over all questions and users, where CR measures how close the predicted answer value is to the real one for a single question.
* **Average Difference between Overall Depression Levels (ADODL)** - how close the predicted overall depression score (the sum of all answer values) is to the subject's real score, averaged over all users.
* **Depression Category Hit Rate (DCHR)** - the percentage of subjects whose predicted overall score falls in the same depression category as their real score.
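
As a worked example of the first three metrics, the sketch below scores one hypothetical user with made-up answers. It assumes the per-question formula CR = (3 - ad) / 3 and the per-user formula DODL = (63 - ad_overall) / 63 used in the functions later in this notebook, where ad is an absolute difference; the lists real and pred are invented.

```python
# Made-up answers for one hypothetical user (21 BDI questions, values 0-3).
real = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0]
pred = [0, 1, 1, 3, 0, 0, 2, 1, 0, 1, 2, 3, 1, 1, 2, 3, 0, 1, 3, 3, 0]

# Hit rate: share of questions answered exactly right.
hits = sum(r == p for r, p in zip(real, pred))
hit_rate = hits / len(real)

# Closeness rate: (3 - absolute difference) / 3 per question, then averaged.
mean_cr = sum((3 - abs(r - p)) / 3 for r, p in zip(real, pred)) / len(real)

# DODL: compares the two overall scores; 63 is the maximum possible score.
dodl = (63 - abs(sum(real) - sum(pred))) / 63

print(hits, round(hit_rate, 3), round(mean_cr, 3), round(dodl, 3))
```

Averaging these per-user values over all users, and multiplying by 100, gives the percentages reported as AHR, ACR and ADODL.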

**Convert Answer Class to Respective Value**

> Because questions 16 and 18 have 7 possible answers (0, 1a, 1b, 2a, 2b, 3a, 3b), the answer values are stored in the dataframes as either strings or integers. When calculating depression scores, only the number is taken into consideration. Thus, the method convert_to_int takes in any answer class and returns it as an integer (e.g. convert_to_int('1a') returns 1 and convert_to_int(2) just returns 2).

In [39]:
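#converts an answer class (e.g. 2, '2' or '1a') to its integer value
def convert_to_int(s):
  #integers, including NumPy integers read in by pandas, are returned as plain ints
  if isinstance(s, (int, np.integer)):
    return int(s)
  #if 1a - 3b then take only the number part of the string
  return int(str(s)[0])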

### **Calculate AHR**

In [40]:
def calc_HR(subject, model):
  HR = 0
  temp_df = pd.read_csv(subject + "-" + model + ".csv")
  for index, row in temp_df.iterrows(): 
    if (row["Real Answer"] == row["Predicted Answer"]):
      HR += 1
  return HR

In [41]:
def calc_AHR(model):
  #calculate HR for all users then divide by total number of users
  total_correct_guesses = 0
  for subject in subjects:
    total_correct_guesses += calc_HR(subject, model)

  #1890 = 90 subjects * 21 questions
  AHR = (total_correct_guesses / 1890) * 100
  return AHR

### **Calculate ACR**

In [42]:
def calc_CR(row):
  real_answer = convert_to_int(row["Real Answer"])
  predicted_answer = convert_to_int(row["Predicted Answer"])

  #ad = absolute difference i.e. system = 3 and real = 1, so ad = 2
  ad = abs(predicted_answer - real_answer)

  #CR = (mad - ad) / mad, where mad = 3 is the maximum absolute difference for a question
  CR = (3-ad)/3
  return CR

In [43]:
def calc_ACR(model):
  CR_total = 0
  for subject in subjects:
    temp_df = pd.read_csv(subject + "-" + model + ".csv")
    for index, row in temp_df.iterrows():
      CR_total += calc_CR(row)

  ACR = (CR_total/1890) * 100
  return ACR

### **Calculate ADODL**

In [44]:
def calc_DODL(subject, model):
    temp_df = pd.read_csv(subject + "-" + model + ".csv")
    real_category = 0
    predicted_category = 0
    #need to account for 1a/1b etc
    for index, row in temp_df.iterrows(): 
      real_category += convert_to_int(row["Real Answer"])
      predicted_category += convert_to_int(row["Predicted Answer"])

    overall_ad = abs(predicted_category - real_category)
    DODL = (63 - overall_ad) / 63

    return DODL

In [45]:
def calc_ADODL(model):
  total_DODLs = 0
  for subject in subjects:
    total_DODLs += calc_DODL(subject, model)
  ADODL = (total_DODLs / 90) * 100

  return ADODL

### **Calculate DCHR**

In [46]:
'''
  0 - 9: minimal depression
  10 - 18: mild depression
  19 - 29: moderate depression
  30 - 63: severe depression
'''
def calc_DCHR(model):
  correct_guesses = 0
  for subject in subjects:
    real_category = 0
    predicted_category = 0
    temp_df = pd.read_csv(subject + "-" + model + ".csv")

    #need to account for 1a/1b etc
    for index, row in temp_df.iterrows(): 
      real_category += convert_to_int(row["Real Answer"])
      predicted_category += convert_to_int(row["Predicted Answer"])

    #a hit is counted when both total scores fall in the same band
    if (0 <= real_category <= 9) and (0 <= predicted_category <= 9):
      correct_guesses += 1
    elif (10 <= real_category <= 18) and (10 <= predicted_category <= 18):
      correct_guesses += 1
    elif (19 <= real_category <= 29) and (19 <= predicted_category <= 29):
      correct_guesses += 1
    elif (30 <= real_category <= 63) and (30 <= predicted_category <= 63):
      correct_guesses += 1
    
  DCHR = (correct_guesses / 90) * 100
  return DCHR
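
The four score bands listed above can also be expressed as a lookup, which avoids repeating each range for the real and predicted scores. This is a sketch only; BAND_STARTS, BAND_NAMES, depression_category and same_category are hypothetical helpers, not part of the notebook.

```python
from bisect import bisect_right

# Lower bounds of the mild, moderate and severe bands listed above.
BAND_STARTS = [10, 19, 30]
BAND_NAMES = ["minimal", "mild", "moderate", "severe"]

def depression_category(score):
    """Map a total BDI score (0-63) to its category name."""
    if not 0 <= score <= 63:
        raise ValueError("BDI score must be between 0 and 63")
    # bisect_right counts how many band starts are <= score.
    return BAND_NAMES[bisect_right(BAND_STARTS, score)]

def same_category(real_score, predicted_score):
    """True when both scores fall in the same depression category."""
    return depression_category(real_score) == depression_category(predicted_score)

print(depression_category(9), depression_category(10), depression_category(30))
```

With such a helper, the chain of range checks in a DCHR calculation reduces to a single same_category call per subject.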

## **Results**

In [48]:
%cd ../SVM/Subject_Results/

/content/drive/My Drive/CS408/SVM/Subject_Results


In [50]:
print(calc_AHR('svm'))
print(calc_ACR('svm'))
print(calc_ADODL('svm'))
print(calc_DCHR('svm'))

29.15343915343915
68.67724867724877
77.40740740740736
21.11111111111111
