# **Exploring and Preparing Dataset**
This notebook explores and prepares the dataset provided for Task 3 of the CLEF eRisk 2021 workshop (https://early.irlab.org).

### **Importing and Installing Required Libraries**

I have added all the data required for my project to the 'CS408' directory on my Google Drive, which makes it easy to navigate to and access the data using IPython magic commands:



*   %pwd - print working directory
*   %cd filepath - change directory
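
Outside a notebook the `%` magics are not available, but the standard library offers equivalents. The following is a minimal sketch; the temporary directory is only a stand-in for the project folder and is not part of the original setup.

```python
import os
import tempfile

# Remember where we started so we can return afterwards
start_dir = os.getcwd()                      # equivalent of %pwd

# A temporary directory stands in for the project folder (assumption for this sketch)
with tempfile.TemporaryDirectory() as project_dir:
    expected_dir = os.path.realpath(project_dir)
    os.chdir(project_dir)                    # equivalent of %cd project_dir
    current_dir = os.path.realpath(os.getcwd())
    os.chdir(start_dir)                      # move back before the directory is removed

print(current_dir)
```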


In [None]:
import pandas as pd
import csv, glob, os
from xml.dom import minidom

%cd drive/MyDrive/CS408/

[Errno 2] No such file or directory: 'drive/MyDrive/CS408/'
/content/drive/MyDrive/CS408


### **Indexing Beck's Depression Inventory (BDI)**

For this task, subjects have provided their Reddit history for CLEF's task alongside their answers to Beck's Depression Inventory (BDI). BDI is a 21-item, self-reported questionnaire that is used to estimate the severity of someone's depression. 

The aim of the system is to accurately predict a user's answers to the BDI, given their Reddit history.

In order to easily locate and use data from BDI, I have created a csv containing all the questions and possible answers. This csv has 4 columns:


*   **Question Number:** position of question in BDI (1-21)
*   **Question:** the name of the question statement (e.g. 'Sadness', 'Agitation')
*   **Class:** answer value (0-3)
  * For questions 16 and 18 (Changes in Sleeping Pattern and Changes in Appetite) there are 7 possible answer values: 0, 1a, 1b, 2a, 2b, 3a, 3b
*   **Answer:** written statement corresponding to the answer value
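
To make the layout above concrete, here is a small sketch that builds a few rows in the same four-column shape and looks up an answer statement. The `bdi_sample` frame and the `lookup_answer` helper are hypothetical names used only for illustration, and storing `Class` as strings is an assumption made so that values such as '1a' and '1b' fit in the same column.

```python
import pandas as pd

# A few rows in the four-column layout; the real file has one row per answer option
bdi_sample = pd.DataFrame({
    "Question Number": [1, 1, 1, 1, 2],
    "Question": ["Sadness", "Sadness", "Sadness", "Sadness", "Pessimism"],
    # Assumption: classes are strings, since questions 16 and 18 use values like '1a'
    "Class": ["0", "1", "2", "3", "0"],
    "Answer": [
        "I do not feel sad.",
        "I feel sad much of the time.",
        "I am sad all the time.",
        "I am so sad or unhappy that I can't stand it.",
        "I am not discouraged about my future.",
    ],
})

def lookup_answer(df, question, answer_class):
    """Return the answer statement for a question name and class, or None if no row matches."""
    mask = (df["Question"] == question) & (df["Class"] == str(answer_class))
    match = df.loc[mask, "Answer"]
    return match.iloc[0] if not match.empty else None

print(lookup_answer(bdi_sample, "Sadness", 2))
```

Returning `None` for a missing combination avoids an `IndexError` when a class value does not exist for a question.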

In [None]:
bdi_questions_answers = pd.read_csv("BDI_csv.csv")

#array of all 21 question names in BDI i.e. ['Sadness', 'Pessimism', ...]
questions = bdi_questions_answers['Question'].unique()

#list with one entry per question, each holding all possible answers to that question
answers = []

#look up in bdi_questions_answers df for all possible answers per question and add to answers
for question_name in questions:
  answers.append(bdi_questions_answers.loc[bdi_questions_answers['Question'] == question_name, 'Answer'].values)

In [None]:
bdi_questions_answers.head()

Unnamed: 0,Question Number,Question,Class,Answer
0,1,Sadness,0,I do not feel sad.
1,1,Sadness,1,I feel sad much of the time.
2,1,Sadness,2,I am sad all the time.
3,1,Sadness,3,I am so sad or unhappy that I can't stand it.
4,2,Pessimism,0,I am not discouraged about my future.


## **Wrangling CLEF dataset**

The dataset provided by CLEF includes:
*   **90 XML Files:** each subject has an XML file that contains all of their Reddit posts
*   **Space-delimited TXT files for 2019 & 2020:** one txt file per year, listing every subject and their answers to the BDI
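
As a rough illustration of the TXT layout, the sketch below parses two made-up lines. It assumes each line holds a subject name followed by 21 answer values, one per BDI question; the subject names, the values, and the `parse_answers` helper are all invented for this example.

```python
# Two made-up lines in the assumed layout: subject name, then one answer per BDI question
sample_lines = [
    "subjectA 2 1 0 3 1 0 0 1 0 1 2 1 0 0 1 1a 0 2b 1 0 1",
    "subjectB 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 3a 1 0 0 1 0",
]

def parse_answers(lines):
    """Map each subject name to its list of answer values (kept as strings)."""
    answers_by_subject = {}
    for line in lines:
        values = line.split()
        if not values:
            continue  # skip blank lines
        answers_by_subject[values[0]] = values[1:]
    return answers_by_subject

answers_by_subject = parse_answers(sample_lines)

# Question n (1-based) sits at list index n - 1 once the subject name is removed
print(answers_by_subject["subjectA"][16 - 1])
```

Keeping the values as strings means that answers such as '1a' and '2b' need no special handling.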

I have taken this data and created a csv for each question in Beck's Depression Inventory (BDI), which contains every subject's Reddit posts alongside that subject's answer to the question.

The csvs created for each question have 4 columns:

*   **Subject:** subject name
*   **Class:** the answer value chosen by subject 
*   **Answer:** the corresponding answer statement 
*   **Post:** one of the subject's Reddit posts
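
The sketch below shows, on a made-up XML string, how posts can be pulled out of `TEXT` elements with `minidom` and turned into rows with these four columns. Only the `TEXT` tag name is taken from this notebook; the other tag names, the subject name, and the `extract_posts` helper are assumptions for illustration.

```python
from xml.dom import minidom

# Made-up XML: only the TEXT tag comes from the notebook, other tag names are assumptions
sample_xml = """<INDIVIDUAL>
  <WRITING><TEXT>First example post.</TEXT></WRITING>
  <WRITING><TEXT>  </TEXT></WRITING>
  <WRITING><TEXT></TEXT></WRITING>
  <WRITING><TEXT>Second example post.</TEXT></WRITING>
</INDIVIDUAL>"""

def extract_posts(xml_string):
    """Return the text of every TEXT element that is not empty or whitespace-only."""
    doc = minidom.parseString(xml_string)
    posts = []
    for node in doc.getElementsByTagName("TEXT"):
        # An empty element has no child node, so firstChild is None
        if node.firstChild is None:
            continue
        text = node.firstChild.data
        if text.strip():
            posts.append(text)
    return posts

posts = extract_posts(sample_xml)

# One row per post: Subject, Class, Answer, Post (subject and answer are made up here)
rows = [["subjectA", "2", "I am sad all the time.", post] for post in posts]
print(len(rows))
```

Checking `firstChild` before reading `.data` guards against completely empty elements, which would otherwise raise an `AttributeError`.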

In [None]:
%cd question_csvs/

/content/drive/MyDrive/CS408/question_csvs


In [None]:
folder_path = '/content/drive/MyDrive/CS408/2019_2020_TEST_DATA/'

#values[0] on each txt line is the subject name, so the answer to question n is values[n]
for index, question in enumerate(questions, start=1):
  #Create csv per question with all subjects posts along with their answer value to question
  newfile = "answer_classes_posts_" + question + ".csv"
  #newline='' is what the csv module expects; without it blank rows can appear on some platforms
  with open(newfile, mode='w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    #Add the column headers
    csv_writer.writerow(['Subject', 'Class', 'Answer', 'Post'])

    #loop through all subject xml files
    for filename in glob.glob(os.path.join(folder_path, '*.xml')):
      #Parse xml file for only the TEXT elements (the posts); minidom opens the file itself
      mydoc = minidom.parse(filename)
      allitems = mydoc.getElementsByTagName('TEXT')

      #Get subjectname
      base = os.path.basename(filename)
      subjectname = os.path.splitext(base)[0]

      #Search 2019/2020 txt files for subjectname
      for txtname in glob.glob(os.path.join(folder_path, '*.txt')):
        with open(txtname, 'r') as txt:
          for line in txt:
            values = line.split()
            #Skip blank lines and lines belonging to other subjects
            if not values or values[0] != subjectname:
              continue
            #Look up the answer statement once per subject rather than once per post
            answer = bdi_questions_answers.loc[(bdi_questions_answers['Question'] == question) & (bdi_questions_answers['Class'] == values[index]), 'Answer']
            for i in allitems:
              #An empty TEXT element has no child node, so skip it
              if i.firstChild is None:
                continue
              post = i.firstChild.data
              #Check post isn't empty or whitespace-only then add post to csv
              if post.strip():
                csv_writer.writerow([subjectname, values[index], answer.values[0], post])

In [12]:
sadness_df = pd.read_csv("answer_classes_posts_Sadness.csv") 

In [13]:
sadness_df

Unnamed: 0,Subject,Class,Answer,Post
0,subject5897,2,I am sad all the time.,"I didnt drop out, but took some time off afte..."
1,subject5897,2,I am sad all the time.,"Definitely doable, just be prepared for lots ..."
2,subject5897,2,I am sad all the time.,You definitely should!
3,subject5897,2,I am sad all the time.,"Wow I love this, do you have a website?"
4,subject5897,2,I am sad all the time.,Watermelon snow!
...,...,...,...,...
43759,subject97982,3,I am so sad or unhappy that I can't stand it.,Added\n
43760,subject97982,3,I am so sad or unhappy that I can't stand it.,i keep going offline to add more friends i ha...
43761,subject97982,3,I am so sad or unhappy that I can't stand it.,and Slugma has flame body for breeding =D
43762,subject97982,3,I am so sad or unhappy that I can't stand it.,certainly\n
