# LIWC pipeline

This notebook contains a pipeline for analyzing Tactus email therapies with LIWC. The goal is to process thousands of therapies with a single program. The alternative, Orange3, requires loading and processing each therapy file separately, which involves too much work.

The first code block imports the required libraries: some general Python libraries and some libraries developed in our research project. The project libraries are stored in the folder orangehackathon/libs.

In [None]:
import csv
import re
import sys
import time

sys.path.append("../libs/")
import tactusloaderLIB
import OWEmailSorterLIB
import markduplicatesLIB
import removemarkedtextLIB
import LIWCLIB

The next code block specifies the location of the therapy files.

In [None]:
DIRECTORY = "/home/erikt/projects/e-mental-health/usb/releases/20191106-auke"

The Python function saveResults, together with a few helper functions, was developed for storing the results of the data analysis. In Orange3, the module SaveData can be used for this task.

In [None]:
DEFAULTOUTFILE = "out.csv"
FIELDNAMEDATE = "date"
FIELDNAMEFROM = "from"
FIELDNAMEFILE = "file"
FIELDNAMENBROFMAILS = "nbr of mails"
CLIENT = "CLIENT"
COUNSELOR = "COUNSELOR"
MAXROWCOUNTER = 0
FROMTARGET = CLIENT
STUDENTFEATURENAMES = [FIELDNAMEFILE,FIELDNAMEFROM,FIELDNAMENBROFMAILS,"4 i","7 shehe","8 they","31 posemo",
                       "32 negemo","50 cogproc","51 insight","52 cause","54 tentat",
                       "90 focuspast","91 focuspresent","92 focusfuture"]

def addZero(string):
    # left-pad a number string with zeros to at least two characters
    return(string.zfill(2))

def time2str(timeObj):
    # format a time.struct_time as "YYYY-MM-DD HH:MM:SS"
    # (local names avoid shadowing the time module)
    dateStr = str(timeObj.tm_year)+"-"+addZero(str(timeObj.tm_mon))+"-"+addZero(str(timeObj.tm_mday))
    timeStr = addZero(str(timeObj.tm_hour))+":"+addZero(str(timeObj.tm_min))+":"+addZero(str(timeObj.tm_sec))
    return(dateStr+" "+timeStr)

def floatPrecision5(number):
    # round Python floats to five decimals; return any other value unchanged
    if type(number) is not float: return(number)
    else: return(float("{0:.5f}".format(number)))

def saveResults(allLiwcResults,fileName=DEFAULTOUTFILE):
    if len(allLiwcResults) > 0:
        fieldNames = [x.name for x in allLiwcResults[0].domain.metas]
        fieldNames += [x.name for x in allLiwcResults[0].domain.variables]
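        # When MAXROWCOUNTER > 0, each written row also contains FIELDNAMENBROFMAILS.
        # csv.DictWriter raises ValueError for keys missing from fieldnames, so unless
        # the domain already has that column, use the list below (it includes it).
        # fieldNames = STUDENTFEATURENAMES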
        outFile = open(fileName,"w",newline="")  # newline="" as the csv module recommends
        with outFile as csvFile:
            csvwriter = csv.DictWriter(csvFile,fieldnames=fieldNames)
            csvwriter.writeheader()
            for liwcResults in allLiwcResults:
                if MAXROWCOUNTER > 0:
                    rowCounter = 0
                    row = {}
                    for liwcResultsRow in liwcResults:
                        if liwcResultsRow[FIELDNAMEFROM] == FROMTARGET:
                            rowCounter += 1
                            if rowCounter <= MAXROWCOUNTER:
                                for fieldName in fieldNames:
                                    if fieldName == FIELDNAMEDATE:
                                        row[fieldName] = time2str(time.localtime(liwcResultsRow[fieldName].value))
                                    elif not re.match(r"^\d+\s",fieldName):
                                        try: row[fieldName] = liwcResultsRow[fieldName].value
                                        except Exception: pass
                                    elif fieldName in row: 
                                        row[fieldName] += floatPrecision5(liwcResultsRow[fieldName].value)
                                    else: 
                                        row[fieldName] = floatPrecision5(liwcResultsRow[fieldName].value)
                    if len(row) > 0:
                        for fieldName in row:
                            if re.match(r"^\d+\s",fieldName) and rowCounter > 0: 
                                row[fieldName] = floatPrecision5(row[fieldName]/min(rowCounter,MAXROWCOUNTER))
                        row[FIELDNAMENBROFMAILS] = rowCounter
                        csvwriter.writerow(row)
                else:
                    row = {}
                    for liwcResultsRow in liwcResults:
                        for fieldName in fieldNames:
                            if fieldName == FIELDNAMEDATE:
                                row[fieldName] = time2str(time.localtime(liwcResultsRow[fieldName].value))
                            elif not re.match(r"^\d+\s",fieldName):
                                try: row[fieldName] = liwcResultsRow[fieldName].value
                                except Exception: pass
                            else: 
                                row[fieldName] = floatPrecision5(liwcResultsRow[fieldName].value)
                        if len(row) > 0: csvwriter.writerow(row)
        outFile.close()
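
When MAXROWCOUNTER is greater than zero, saveResults writes one row per therapy: the numeric LIWC features (the fields whose names start with a number) are averaged over the first MAXROWCOUNTER mails sent by FROMTARGET, and "nbr of mails" holds the total number of mails from that sender. The sketch below illustrates this aggregation on plain dictionaries instead of Orange3 table rows. The function name `average_first_mails`, its parameters and the example values are hypothetical, and rounding is applied once at the end rather than after every addition, so treat it as an approximation of the logic, not a replacement.

```python
import re

# LIWC feature names start with a number, e.g. "31 posemo"
NUMERIC_FEATURE = re.compile(r"^\d+\s")

def average_first_mails(mails, sender="CLIENT", max_mails=2):
    """Average the numeric features of the first max_mails mails from sender.

    mails is a list of plain dicts (a stand-in for Orange3 table rows).
    """
    from_sender = [m for m in mails if m["from"] == sender]
    selected = from_sender[:max_mails]
    if not selected:
        return {}
    averaged = {}
    for name in selected[0]:
        if NUMERIC_FEATURE.match(name):
            total = sum(m[name] for m in selected)
            averaged[name] = round(total / len(selected), 5)
    # as in saveResults: count all mails from the sender, not only the averaged ones
    averaged["nbr of mails"] = len(from_sender)
    return averaged

# made-up example values
example_mails = [
    {"from": "CLIENT", "31 posemo": 0.04, "32 negemo": 0.02},
    {"from": "COUNSELOR", "31 posemo": 0.10, "32 negemo": 0.00},
    {"from": "CLIENT", "31 posemo": 0.06, "32 negemo": 0.04},
    {"from": "CLIENT", "31 posemo": 0.50, "32 negemo": 0.50},
]
summary = average_first_mails(example_mails)
print(summary)
```

In this example only the first two client mails contribute to the averages, while the mail count covers all three client mails.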

Finally, a loop loads each available therapy file and runs the Orange3 pipeline on it. The results of all therapies are saved after the loop. The Orange3 pipeline contains these parts:

1. tactusloader: determine file name and read its contents
2. sortMails: sort the mails from the file chronologically
3. markduplicates: mark the parts of the mail text included from an earlier mail
4. removemarkedtext: remove the marked text from the mail
5. LIWC: perform a LIWC analysis on the remaining texts (includes tokenization)

The output of LIWC (one table row per mail) is stored in the current folder (default file name: out.csv).
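
Steps 3 and 4 of the pipeline can be illustrated with a toy example. The sketch below drops every line of a mail that already occurred verbatim in an earlier mail of the same therapy. It is only an assumption about the general idea: the function `strip_repeated_lines` is hypothetical, and markduplicatesLIB and removemarkedtextLIB may match included text in a different way.

```python
def strip_repeated_lines(mails):
    """Return the mail bodies without lines that occurred in an earlier mail.

    mails is a chronologically sorted list of mail body strings.
    """
    seen = set()
    stripped = []
    for body in mails:
        lines = body.splitlines()
        # keep empty lines and lines that were not seen in an earlier mail
        kept = [line for line in lines
                if line.strip() == "" or line.strip() not in seen]
        # remember the lines of this mail for the mails that follow
        seen.update(line.strip() for line in lines if line.strip())
        stripped.append("\n".join(kept))
    return stripped

# made-up example: the second mail includes the text of the first one
thread = [
    "How was your week?",
    "How was your week?\nIt went better than expected.",
]
cleaned = strip_repeated_lines(thread)
print(cleaned)
```

The mails must be sorted chronologically first (step 2), because a line only counts as included text if it was seen in an earlier mail.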

In [None]:
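allLiwcResults = []
for patientId in range(1,1000):
    # 1. tactusloader: determine the file name and read its contents
    fileName = tactusloaderLIB.makeFileName(str(patientId))
    mails = tactusloaderLIB.processFile(DIRECTORY,fileName+".gz")
    # 2. sortMails: sort the mails chronologically
    sortedMails = OWEmailSorterLIB.filterEmails(mails,filter_asc=True)
    # 3. and 4. mark and remove text included from earlier mails
    markedMails = markduplicatesLIB.processCorpus(sortedMails)
    strippedMails = removemarkedtextLIB.processCorpus(markedMails)
    # 5. LIWC analysis of the remaining text
    liwcResults = LIWCLIB.processCorpus(strippedMails)
    allLiwcResults.append(liwcResults)
# write the results of all therapies to the default output file
saveResults(allLiwcResults)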
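
The saved file can be checked with the csv module. The sketch below computes the mean of one LIWC feature over the mails of one sender. It reads from an in-memory sample so that it runs without the therapy data. The sample rows and the helper `mean_feature` are hypothetical, and the sketch assumes that the output contains a "from" column and a "31 posemo" column. The real column set depends on the LIWC table domain.

```python
import csv
import io

# made-up excerpt in the shape of out.csv (one row per mail)
sample = io.StringIO(
    "file,from,31 posemo,32 negemo\n"
    "example-1,CLIENT,0.05,0.03\n"
    "example-1,COUNSELOR,0.08,0.01\n"
    "example-1,CLIENT,0.07,0.02\n"
)

def mean_feature(csv_file, feature, sender="CLIENT"):
    """Return the mean of feature over all rows from sender, or None if there are none."""
    values = [float(row[feature])
              for row in csv.DictReader(csv_file)
              if row["from"] == sender]
    if not values:
        return None
    return sum(values) / len(values)

client_posemo = mean_feature(sample, "31 posemo")
print(client_posemo)
```

For the real output, the same function could be called on a file opened with `open("out.csv", newline="")`.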