# T-score

This notebook contains a pipeline for comparing the vocabulary of two sets of Tactus emails with each other using the t-score. The goal is to find tokens which appear more frequently in one set than in the other. The notebook reuses much of the preprocessing of the notebook liwc.py in this directory.

The first code block imports the required libraries: some general Python libraries and some specific libraries developed in our research project. The project-specific libraries can be found in the folder orangehackathon/libs.

In [1]:
import csv
import re
import sys
import time

sys.path.append("../libs/")
import tactusloaderLIB
import OWEmailSorterLIB
import markduplicatesLIB
import removemarkedtextLIB
import LIWCLIB

The next code block specifies the location of the therapy files.

In [2]:
DIRECTORY = "/home/erikt/projects/e-mental-health/usb/releases/20191217"

One Python function, saveResults, was developed for storing the results of the data analysis. In Orange3, the module SaveData can be used for this task. (saveResults might not be necessary for this notebook, tscore.ipynb.)

In [3]:
DEFAULTOUTFILE="out.csv"
FIELDNAMEDATE = "date"
FIELDNAMEFROM = "from"
FIELDNAMEFILE = "file"
FIELDNAMENBROFMAILS = "nbr of mails"
CLIENT = "CLIENT"
COUNSELOR = "COUNSELOR"
FROMTARGET = CLIENT
NBROFMATCHES = "Number of matches"

# data selection settings
PROCESSALLFEATURES = True
AVERAGEROWS = False
NBROFKEPTROWS = 4
MINNBROFMATCHES = 50
STUDENTFEATURENAMES = [FIELDNAMEFILE,FIELDNAMEFROM,FIELDNAMENBROFMAILS,"4 i","7 shehe","8 they","31 posemo",\
                       "32 negemo","50 cogproc","51 insight","52 cause","54 tentat",\
                       "90 focuspast","91 focuspresent","92 focusfuture"]

def addZero(string):
    while len(string) < 2: string = "0"+string
    return(string)

def time2str(timeObj):
    date = str(timeObj.tm_year)+"-"+addZero(str(timeObj.tm_mon))+"-"+addZero(str(timeObj.tm_mday))
    time = addZero(str(timeObj.tm_hour))+":"+addZero(str(timeObj.tm_min))+":"+addZero(str(timeObj.tm_sec))
    return(date+" "+time)

def floatPrecision5(number):
    if type(number) != type(0.5): return(number)
    else: return(float("{0:.5f}".format(number)))

def saveResults(allLiwcResults,fileName=DEFAULTOUTFILE):
    if len(allLiwcResults) > 0:
        fieldNames = STUDENTFEATURENAMES
        if PROCESSALLFEATURES:
            fieldNames = [x.name for x in allLiwcResults[0].domain.variables]
            fieldNames += [x.name for x in allLiwcResults[0].domain.metas]
            fieldNames += [FIELDNAMENBROFMAILS]
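        # newline="" is what the csv module documentation recommends for files passed to a writer
        outFile = open(fileName,"w",newline="")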
        with outFile as csvFile:
            csvwriter = csv.DictWriter(csvFile,fieldnames=fieldNames)
            csvwriter.writeheader()
            for liwcResults in allLiwcResults:
                if AVERAGEROWS:
                    rowCounter = 0
                    row = {}
                    for liwcResultsRow in liwcResults:
                        liwcResultsRow[FIELDNAMEFILE] = re.sub("-an.xml.gz","",str(liwcResultsRow[FIELDNAMEFILE]))
                        if liwcResultsRow[FIELDNAMEFROM] == FROMTARGET:
                            rowCounter += 1
                            nbrOfMatches = 0
                            if NBROFMATCHES in liwcResultsRow: nbrOfMatches = int(liwcResultsRow[NBROFMATCHES])
                            if (NBROFKEPTROWS == 0 or rowCounter <= NBROFKEPTROWS) and \
                               (MINNBROFMATCHES == 0 or nbrOfMatches >= MINNBROFMATCHES):
                                for fieldName in fieldNames:
                                    if fieldName == FIELDNAMEDATE:
                                        row[fieldName] = time2str(time.localtime(liwcResultsRow[fieldName].value))
                                    elif not re.match(r"^\d+\s",fieldName):
                                        try: row[fieldName] = liwcResultsRow[fieldName].value
                                        except: pass
                                    elif fieldName in row: 
                                        row[fieldName] += floatPrecision5(liwcResultsRow[fieldName].value)
                                    else: 
                                        row[fieldName] = floatPrecision5(liwcResultsRow[fieldName].value)
                    if len(row) > 0:
                        for fieldName in row:
                            if re.match(r"^\d+\s",fieldName) and rowCounter > 0:
                                row[fieldName] = floatPrecision5(row[fieldName]/min(rowCounter,NBROFKEPTROWS))
                        row[FIELDNAMENBROFMAILS] = rowCounter
                        csvwriter.writerow(row)
                else:
                    rowCounter = 0
                    row = {}
                    for liwcResultsRow in liwcResults:
                        liwcResultsRow[FIELDNAMEFILE] = re.sub("-an.xml.gz","",str(liwcResultsRow[FIELDNAMEFILE]))
                        if liwcResultsRow[FIELDNAMEFROM] == FROMTARGET:
                            rowCounter += 1
                            nbrOfMatches = liwcResultsRow[NBROFMATCHES]
                            if (NBROFKEPTROWS == 0 or rowCounter <= NBROFKEPTROWS) and \
                               (MINNBROFMATCHES == 0 or nbrOfMatches >= MINNBROFMATCHES):
                                for fieldName in fieldNames:
                                    if fieldName == FIELDNAMEDATE:
                                        row[fieldName] = time2str(time.localtime(liwcResultsRow[fieldName].value))
                                    elif not re.match(r"^\d+\s",fieldName):
                                        try: row[fieldName] = liwcResultsRow[fieldName].value
                                        except: pass
                                    else: 
                                        row[fieldName] = floatPrecision5(liwcResultsRow[fieldName].value)
                                if len(row) > 0: csvwriter.writerow(row)
        outFile.close()
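
The small formatting helpers in the cell above could also lean on the standard library. The sketch below is only an illustration, not the code used in this notebook: the snake_case names `time2str` and `float_precision5` are stand-ins chosen here, and the example timestamp is made up.

```python
import time

def time2str(time_obj):
    """Format a time.struct_time as 'YYYY-MM-DD HH:MM:SS' (zero-padded)."""
    return time.strftime("%Y-%m-%d %H:%M:%S", time_obj)

def float_precision5(number):
    """Round floats to five decimals; return any other value unchanged."""
    return round(number, 5) if isinstance(number, float) else number

# Made-up timestamp: Saturday 7 December 2019, 09:05:03
example = time.struct_time((2019, 12, 7, 9, 5, 3, 5, 341, 0))
print(time2str(example))                # 2019-12-07 09:05:03
print(float_precision5(0.123456789))    # 0.12346
print(float_precision5("n/a"))          # n/a
```

`time.strftime` takes care of the zero padding that `addZero` does by hand.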

We will compare the texts in emails from clients that finished the treatment with those from clients that dropped out. Thus we need the metadata which specifies the outcome of the therapy for each client.

In [4]:
import gzip

DIRDROPOUT = "/home/erikt/projects/e-mental-health/usb/releases/20200305/"
FILEDROPOUT = "selected.csv.gz"
DELIMITER = ","
FIELDNAMEDROPOUT = "dropout"
FIELDNAMETEXT = "text"
FIELDNAMEFILE = "file"

dropout = {}
inFile = gzip.open(DIRDROPOUT+FILEDROPOUT,"rt",encoding="utf-8")
csvreader = csv.DictReader(inFile,delimiter=DELIMITER)
for row in csvreader: dropout[row[FIELDNAMEFILE]] = row[FIELDNAMEDROPOUT]
inFile.close()

len([x for x in dropout if dropout[x] != "?"]) == 791

True
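
The same loading step can be tried without access to the release directory. The sketch below reads a gzip-compressed CSV from memory into a dictionary. The function name `read_dropout_labels` and the file ids in the sample are hypothetical; only the column names `file` and `dropout` and the `?` label for unknown outcomes come from the cell above.

```python
import csv
import gzip
import io

def read_dropout_labels(binary_stream, key_field="file", label_field="dropout"):
    """Map each file id to its dropout label, reading gzip-compressed CSV."""
    with gzip.open(binary_stream, "rt", encoding="utf-8", newline="") as in_file:
        return {row[key_field]: row[label_field] for row in csv.DictReader(in_file)}

# Made-up sample data standing in for selected.csv.gz
sample = gzip.compress("file,dropout\nfileA,1\nfileB,2\nfileC,?\n".encode("utf-8"))
labels = read_dropout_labels(io.BytesIO(sample))

# Keep only the clients whose outcome is known
known = {file_id: label for file_id, label in labels.items() if label != "?"}
print(len(labels), len(known))  # 3 2
```

Using `with` closes the file even when a row cannot be parsed.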

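Finally, there is a loop which loads each available therapy file with a known outcome (dropout value 1 or 2) and runs the Orange3 pipeline on it. For each file, the text of the first MAXMAILS client mails is kept. The Orange3 pipeline contains these parts: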

1. tactusloader: determine file name and read its contents
2. sortMails: sort the mails from the file chronologically
3. markduplicates: mark the parts of the mail text included from an earlier mail
4. removemarkedtext: remove the marked text from the mail


In [5]:
MAXMAILS = 4

allLiwcResults = []
mailTexts = [[],[],[]]
for patientId in list(range(1,1988)):
    if patientId % 100 == 0: print(patientId,end=" ")
    fileName = tactusloaderLIB.makeFileName(str(patientId))
    fileNameId = re.sub("-an.xml$","",fileName)
    if fileNameId in dropout and (dropout[fileNameId] == "1" or dropout[fileNameId] == "2"):
        mailText = ""
        try:
            mails = tactusloaderLIB.processFile(DIRECTORY,fileName+".gz")
            if len(mails) > 0:
                sortedMails = OWEmailSorterLIB.filterEmails(mails[0],filter_asc=True)
                markedMails = markduplicatesLIB.processCorpus(sortedMails)
                strippedMails = removemarkedtextLIB.processCorpus(markedMails)
                mailCounter = 0
                for strippedMail in strippedMails:
                    if strippedMail[FIELDNAMEFROM] == CLIENT and mailCounter < MAXMAILS:
                        mailText += str(strippedMail[FIELDNAMETEXT])
                        mailCounter += 1
        except Exception as error:
            print("problem processing file",fileName,error)
            continue
        mailTexts[int(dropout[fileNameId])].append(mailText)

100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 

In [6]:
for i in range(0,len(mailTexts)): print(len(mailTexts[i]))

0
437
354
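
The selection and grouping logic of the loop can be sketched without the project libraries. In the sketch below, plain dictionaries stand in for the Orange3 rows, the duplicate-marking and removal steps are left out, and all function names and sample values are hypothetical. One deliberate difference: the mail texts are joined with a space, so that the last word of one mail and the first word of the next cannot fuse into one token if a mail text does not end in whitespace.

```python
MAX_MAILS = 4  # same limit as MAXMAILS in the cell above

def first_client_text(mails, max_mails=MAX_MAILS, sender="CLIENT"):
    """Join the text of the first `max_mails` mails written by `sender`."""
    ordered = sorted(mails, key=lambda mail: mail["date"])
    client_texts = [mail["text"] for mail in ordered if mail["from"] == sender]
    return " ".join(client_texts[:max_mails])

def group_by_label(texts_by_file, labels, wanted=("1", "2")):
    """Collect the texts per outcome label, skipping files with other labels."""
    groups = {label: [] for label in wanted}
    for file_id, text in texts_by_file.items():
        if labels.get(file_id) in wanted:
            groups[labels[file_id]].append(text)
    return groups

# Made-up mails for one client, deliberately out of chronological order
mails = [
    {"date": 3, "from": "CLIENT", "text": "derde mail"},
    {"date": 1, "from": "CLIENT", "text": "eerste mail"},
    {"date": 2, "from": "COUNSELOR", "text": "antwoord"},
]
text = first_client_text(mails, max_mails=2)
groups = group_by_label({"fileA": text, "fileC": "x"}, {"fileA": "1", "fileC": "?"})
print(text)    # eerste mail derde mail
print(groups)  # {'1': ['eerste mail derde mail'], '2': []}
```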


Next, convert the text to the data format of the t-score script: /home/erikt/projects/newsgac/fasttext-runs/tscore.py. There are two ways of computing the t-scores: count every occurrence of a word used by a client, or count each word only once per client. The texts can be prepared for the second type of count with the function uniqTextArray().

In [27]:
NBROFTOKENS = "totalFreq"
NBROFTYPES = "nbrOfWords"
WORDFREQS = "wordFreqs"
MAXCOUNT = "maxCount"

def makeTscoreData(textArray):
    data = { NBROFTOKENS:0, NBROFTYPES:0, MAXCOUNT:len(textArray), WORDFREQS:{} }
    for text in textArray:
        for token in text.split():
            data[NBROFTOKENS] += 1
            if token in data[WORDFREQS]: 
                data[WORDFREQS][token] += 1
            else:
                data[WORDFREQS][token] = 1
                data[NBROFTYPES] += 1
    return(data)

def uniqText(text):
    seen = {}
    for word in text.split():
        if not word in seen: seen[word] = True
    return(" ".join(list(seen.keys())))

def uniqTextArray(textArrayIn):
    textArrayOut = []
    for text in textArrayIn:
        textArrayOut.append(uniqText(text))
    return(textArrayOut)

tscoreData1 = makeTscoreData(uniqTextArray(mailTexts[1]))
tscoreData2 = makeTscoreData(uniqTextArray(mailTexts[2]))
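
The difference between the two ways of counting is easiest to see on a toy example. The sketch below uses `collections.Counter` instead of the dictionary format above. The function names and the two sample texts are made up for illustration.

```python
from collections import Counter

def token_counts(texts):
    """Count every occurrence of every token."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts

def client_counts(texts):
    """Count each token at most once per text, i.e. once per client."""
    counts = Counter()
    for text in texts:
        counts.update(set(text.split()))
    return counts

texts = ["ik ben moe ik ben", "ik slaap"]  # one made-up text per client
print(token_counts(texts)["ik"], client_counts(texts)["ik"])    # 3 2
print(token_counts(texts)["ben"], client_counts(texts)["ben"])  # 2 1
```

With per-client counts, the frequency of a token can never exceed the number of clients in the group, which is the value stored under MAXCOUNT.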

In [30]:
sys.path.append("/home/erikt/projects/newsgac/fasttext-runs")
import tscore
import operator

TOPPOS = 20

# newline="" is what the csv module documentation recommends for files passed to a writer
with open("out.csv","w",newline="") as outFile:
    csvwriter = csv.DictWriter(outFile,["position","token","tscore","freqDropouts","freqFinishers"])
    csvwriter.writeheader()
    tscores = tscore.computeTscoreList(tscoreData1,tscoreData2)
    position = 0
    # the loop variables are named so that they do not shadow the tscore module or the built-in tuple
    for token,score in sorted(tscores.items(), key=operator.itemgetter(1),reverse=True):
        position += 1
        # tokens missing from a group have frequency 0 in that group
        frequency1 = tscoreData1[WORDFREQS].get(token,0)
        frequency2 = tscoreData2[WORDFREQS].get(token,0)
        csvwriter.writerow({"position":position,"token":token,"tscore":score,"freqDropouts":frequency1,"freqFinishers":frequency2})
        # print only the top and the bottom TOPPOS entries
        if position <= TOPPOS or position+TOPPOS > len(tscores): print(position,token,score,frequency1,frequency2)

1 mvg 2.6069437444795414 16 3
2 hapje 2.5902771425883144 8 0
3 late 2.3803980606815736 39 16
4 schoonouders 2.3313558655467217 12 2
5 armen 2.2281041549259886 9 1
6 externe 2.1831297072252833 6 0
7 Vanmiddag 2.1831297072252833 6 0
8 opgestoken 2.1831297072252833 6 0
9 oppas 2.1831297072252833 6 0
10 an 2.0246074095722575 8 1
11 Mvg 2.0246074095722575 8 1
12 groenten 1.960228533856725 10 2
13 inhoudt 1.960228533856725 10 2
14 langzamer 1.9513725557372972 5 0
15 bios 1.9513725557372972 5 0
16 uitwerking 1.9513725557372972 5 0
17 verkouden 1.9513725557372972 5 0
18 T 1.9513725557372972 5 0
19 baalt 1.9513725557372972 5 0
20 ondervinden 1.9513725557372972 5 0
29871 sportschool -4.093160180780137 9 33
29872 leeg -4.098771481280528 25 55
29873 Een -4.101059978239603 103 143
29874 lijkt -4.143443243577608 68 106
29875 voordelen -4.14647358033523 136 178
29876 hoofd -4.18232537098759 63 101
29877 emotionele -4.202963434984855 28 60
29878 groep -4.209407611770016 18 47
29879 Dank -4.21590930676

The bottom of the list contains words that are used more frequently by the finishers among the clients (category 2) than by the dropouts (category 1). However, the reverse was at first not true for the top of the list, which contained mainly frequent words. Why was this the case? To give an example: why was the t-score of *me* (frequencies 355 and 322) similar to that of *mvg* (16, 3)?

The problem proved to be the computation of the t-score. We removed the concepts of text length and vocabulary size from the definition and replaced them with the maximum attainable value: the number of clients per group, since each word was counted only once per client. We kept add-0.5 smoothing and adjusted the maximum attainable value by 0.5 as well to account for this. The top of the list improved a lot, with *mvg*, *hapje* and *late* now occurring at the top, as in the output shown above. The bottom of the list, with words typically used by finishers, did not contain many clear cases like *PS* at position -89 (0, 13), but no errors could be found there.
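
The adjusted computation can be sketched as follows. The script tscore.py is not shown in this notebook, so the function below is an assumption: a two-sample t-score on smoothed relative frequencies, with the name `smoothed_tscore` and its parameters chosen here for illustration. With the group sizes printed earlier (437 dropouts, 354 finishers) it agrees with the scores listed above for *mvg*, *hapje*, *late* and *groep* to within 0.0001.

```python
import math

SMOOTHING = 0.5  # add-0.5 smoothing, applied to the counts and to the group sizes

def smoothed_tscore(freq1, freq2, max_count1, max_count2, smoothing=SMOOTHING):
    """t-score of the difference between two smoothed relative frequencies.

    freq1, freq2: number of clients in each group that used the token.
    max_count1, max_count2: number of clients in each group.
    """
    f1, f2 = freq1 + smoothing, freq2 + smoothing
    n1, n2 = max_count1 + smoothing, max_count2 + smoothing
    difference = f1 / n1 - f2 / n2
    standard_error = math.sqrt(f1 / n1**2 + f2 / n2**2)
    return difference / standard_error

# Frequencies taken from the output above: (token, dropouts, finishers)
for token, freq1, freq2 in [("mvg", 16, 3), ("hapje", 8, 0), ("late", 39, 16), ("groep", 18, 47)]:
    print(token, smoothed_tscore(freq1, freq2, 437, 354))
```

A positive score marks a token used by relatively more dropouts, a negative score one used by relatively more finishers. Because rare tokens have a small standard error term in absolute counts but also a small difference, a token such as *hapje* (8, 0) can outrank much more frequent tokens whose relative frequencies are nearly equal in both groups.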