# T-score

This notebook contains a pipeline for comparing the vocabulary of two sets of Tactus emails with each other by means of the t-score. The goal is to find tokens which appear more frequently in one set than in the other, and vice versa. This notebook reuses much of the preprocessing of the notebook liwc.py in this directory.

The first code block specifies the required libraries. These include some general Python libraries and some specific libraries developed in our research project. The project-specific libraries can be found in the folder orangehackathon/libs.

In [None]:
import csv
import re
import sys
import time

sys.path.append("../libs/")
import tactusloaderLIB
import OWEmailSorterLIB
import markduplicatesLIB
import removemarkedtextLIB
import LIWCLIB

The next code block specifies the location of the therapy files.

In [None]:
DIRECTORY = "/home/erikt/projects/e-mental-health/usb/releases/20191217"

One Python function was developed for storing the results of the data analysis (SaveResults). In Orange3 the module SaveData can be used for this task. (SaveResults might not be necessary for this notebook tscore.ipynb)

We will compare the texts in emails from clients that finished the treatment with those from clients that dropped out. Thus we need the metadata which specifies the result of the therapy for each client.

In [None]:
import gzip

DIRDROPOUT = "/home/erikt/projects/e-mental-health/usb/releases/20200305/"
FILEDROPOUT = "selected.csv.gz"
DELIMITER = ","
FIELDNAMEDROPOUT = "dropout"
FIELDNAMETEXT = "text"
FIELDNAMEFILE = "file"
FIELDNAMEFROM = "from"
CLIENT = "CLIENT"
NBROFCLIENTS = 791

dropout = {}
# read the therapy result (dropout code) of each client file
with gzip.open(DIRDROPOUT+FILEDROPOUT,"rt",encoding="utf-8") as inFile:
    csvreader = csv.DictReader(inFile,delimiter=DELIMITER)
    for row in csvreader: dropout[row[FIELDNAMEFILE]] = row[FIELDNAMEDROPOUT]

len([x for x in dropout if dropout[x] != "?"]) == NBROFCLIENTS

Finally, there is a loop which loads each available therapy file and runs the Orange3 pipeline on it. The Orange3 pipeline contains these parts:

1. tactusloader: determine file name and read its contents
2. sortMails: sort the mails from the file chronologically
3. markduplicates: mark the parts of the mail text included from an earlier mail
4. removemarkedtext: remove the marked text from the mail

The file loading takes some time. The program counts from 0 to 1987 in steps of 100 to indicate its progress.

In [None]:
MAXMAILS = 4
MAXCLIENT = 1987
CUTOFF = 1000

def wordCount(text): return(len(text.split()))

allLiwcResults = []
mailTexts = [{},{},{}]
print("(0.."+str(MAXCLIENT)+"):",0,end=" ")
for patientId in range(1,MAXCLIENT+1):
    if patientId % 100 == 0: print(patientId,end=" ")
    fileName = tactusloaderLIB.makeFileName(str(patientId))
    fileNameId = re.sub("-an.xml$","",fileName)
    if fileNameId in dropout and (dropout[fileNameId] == "1" or dropout[fileNameId] == "2"):
        mailText = ""
        try:
            mails = tactusloaderLIB.processFile(DIRECTORY,fileName+".gz")
            if len(mails) > 0:
                sortedMails = OWEmailSorterLIB.filterEmails(mails[0],filter_asc=True)
                markedMails = markduplicatesLIB.processCorpus(sortedMails)
                strippedMails = removemarkedtextLIB.processCorpus(markedMails)
                mailCounter = 0
                for strippedMail in strippedMails:
                    if strippedMail[FIELDNAMEFROM] == CLIENT and mailCounter < MAXMAILS:
                        mailTextPerMail = str(strippedMail[FIELDNAMETEXT])
                        mailCounter += 1
                        if wordCount(mailTextPerMail) >= CUTOFF: 
                            mailText += mailTextPerMail+" "
        except Exception: # do not swallow KeyboardInterrupt during the long load
            print("problem processing file",fileName)
            continue
        mailTexts[int(dropout[fileNameId])][fileNameId] = mailText
print(patientId)

In [None]:
DROPOUTID = 1
FINISHERID = 2
NBROFDROPOUTS = 437
NBROFFINISHERS = 354

len(mailTexts[DROPOUTID]) == NBROFDROPOUTS and len(mailTexts[FINISHERID]) == NBROFFINISHERS

Next, convert the text to the data format of the t-score script: /home/erikt/projects/newsgac/fasttext-runs/tscore.py . There are two ways for computing the t-scores: count every separate word used by a client or count each word used by a client only once. The texts can be prepared for the second type of counts with the function uniqueTextArray() which removes all duplicate words from the texts (case-sensitive).
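
The difference between the two ways of counting can be illustrated with a small sketch. The toy texts and the helper names `count_tokens` and `count_clients` are hypothetical and are not part of tscore.py or of this notebook. The sketch only shows how counting every occurrence differs from counting each word once per client.

```python
from collections import Counter

# Hypothetical toy data: one concatenated mail text per client.
texts = {
    "client-a": "ik ben moe moe moe",
    "client-b": "ik ben blij",
}

def count_tokens(texts):
    """Count every occurrence of every word over all clients."""
    counts = Counter()
    for text in texts.values():
        counts.update(text.split())
    return counts

def count_clients(texts):
    """Count each word at most once per client (case-sensitive)."""
    counts = Counter()
    for text in texts.values():
        counts.update(set(text.split()))
    return counts

token_counts = count_tokens(texts)
client_counts = count_clients(texts)
print(token_counts["moe"], client_counts["moe"])  # 3 1
print(token_counts["ik"], client_counts["ik"])    # 2 2
```

With the second type of counts, a word can never score higher than the number of clients in the group, so a single client who repeats a word many times cannot dominate the comparison.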

In [None]:
NBROFTOKENS = "totalFreq"
NBROFTYPES = "nbrOfWords"
WORDFREQS = "wordFreqs"
MAXCOUNT = "maxCount"

def removeEmptyMails(mails):
    clientsWithoutMails = [ clientId for clientId in mails if mails[clientId] == "" ]
    mailsCopy = dict(mails)
    for clientId in clientsWithoutMails:
        del(mailsCopy[clientId])
    return(mailsCopy)
        
def makeTscoreData(textArray):
    data = { NBROFTOKENS:0, NBROFTYPES:0, MAXCOUNT:len(textArray), WORDFREQS:{} }
    for text in textArray:
        for token in text.split():
            data[NBROFTOKENS] += 1
            if token in data[WORDFREQS]: 
                data[WORDFREQS][token] += 1
            else:
                data[WORDFREQS][token] = 1
                data[NBROFTYPES] += 1
    return(data)

def uniqueText(text):
    seen = {}
    for word in text.split():
        if not word in seen: seen[word] = True
    return(" ".join(list(seen.keys())))

def uniqueTextArray(textDictIn):
    textArrayOut = []
    for fileNameId in textDictIn:
        textArrayOut.append(uniqueText(textDictIn[fileNameId]))
    return(textArrayOut)

def normalizeMaxCount(tscoreData,fraction):
    tscoreData[MAXCOUNT] = round(tscoreData[MAXCOUNT]*fraction,1)
    for word in tscoreData["wordFreqs"]:
        tscoreData["wordFreqs"][word] = round(tscoreData["wordFreqs"][word]*fraction,1)
    return(tscoreData)
    
tscoreData1 = makeTscoreData(uniqueTextArray(removeEmptyMails(mailTexts[1])))
tscoreData2 = makeTscoreData(uniqueTextArray(removeEmptyMails(mailTexts[2])))
# tscoreData2 = normalizeMaxCount(tscoreData2,tscoreData1["maxCount"]/tscoreData2["maxCount"])

In [None]:
sys.path.append("/home/erikt/projects/newsgac/fasttext-runs")
import tscore
import operator

TOPPOS = 20

def computeTscores(tscoreData1,tscoreData2):
    outFile = open("out.csv","w")
    csvwriter = csv.DictWriter(outFile,["position","token","tscore","freqDropouts","freqFinishers"])
    csvwriter.writeheader()
    tscores = tscore.computeTscoreUniqueWords(tscoreData1,tscoreData2)
    position = 0
    for (token,tokenTscore) in sorted(tscores.items(), key=operator.itemgetter(1),reverse=True):
        position += 1
        # a token may be missing from one of the two groups: count it as 0 there
        frequency1 = tscoreData1[WORDFREQS].get(token,0)
        frequency2 = tscoreData2[WORDFREQS].get(token,0)
        csvwriter.writerow({"position":position,"token":token,"tscore":tokenTscore,\
                            "freqDropouts":frequency1,"freqFinishers":frequency2})
        if position <= TOPPOS or position+TOPPOS > len(tscores): 
            print(position,token,tokenTscore,frequency1,frequency2)
    outFile.close()
    
computeTscores(tscoreData1,tscoreData2)

The top of the list contains words that are used more frequently by the dropouts among the clients (category 1) than by the finishers (category 2), as can be seen from the counts in the last two columns, for example for *mvg*: 16 > 3.

Originally, this observation did not hold for the top of the list, which contained mainly frequent words. Why was this the case? The problem proved to be the computation of the t-score. We removed the concepts of text length and vocabulary size used in the original definition of the t-score (Church, Gale, Hanks & Hindle, 1991) and replaced them with the maximum attainable value: the number of clients per group, since each word was counted only once for each client. We kept using add-0.5 smoothing and adjusted the maximum attainable value by 0.5 as well to account for this. The top of the list improved a lot, with *mvg*, *hapje* and *late* occurring at the top of the list. The bottom of the list, with words typically used by finishers, did not contain many clear cases like *PS* at position -89 (0 vs 13), but no errors could be found here.
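
The adjusted computation can be sketched as follows. This is a minimal sketch under stated assumptions, not the code of tscore.py: the function name `tscore_sketch` is hypothetical, the variance is approximated by the smoothed count divided by the squared group size (one common formulation of the t-score), and the group sizes 437 and 354 are illustrative only, since the actual number of clients per group after removing empty texts may be lower.

```python
import math

SMOOTHING = 0.5  # add-0.5 smoothing, as described in the text

def tscore_sketch(count1, count2, max_count1, max_count2):
    """Compare per-client word counts of two groups (assumed formulation).

    count1, count2: number of clients per group that used the word.
    max_count1, max_count2: number of clients per group, which is the
    maximum attainable count when each word is counted once per client.
    """
    # the smoothing constant is added to both the count and the maximum
    size1 = max_count1 + SMOOTHING
    size2 = max_count2 + SMOOTHING
    smoothed1 = count1 + SMOOTHING
    smoothed2 = count2 + SMOOTHING
    difference = smoothed1 / size1 - smoothed2 / size2
    # assumption: variance of each proportion is estimated as count / size^2
    variance = smoothed1 / size1 ** 2 + smoothed2 / size2 ** 2
    return difference / math.sqrt(variance)

# Counts taken from the text; group sizes are illustrative assumptions.
score_mvg = tscore_sketch(16, 3, 437, 354)  # word mainly used by dropouts
score_ps = tscore_sketch(0, 13, 437, 354)   # word mainly used by finishers
print(score_mvg > 0, score_ps < 0)
```

Under these assumptions a positive score marks a word that is typical for the first group (dropouts) and a negative score marks a word that is typical for the second group (finishers), which matches the ordering of the list above.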

In [None]:
CONTEXTSIZE = 5
dropoutLabel = 2 # group to search: 1 = dropouts (DROPOUTID), 2 = finishers (FINISHERID)
searchWord = "stress"

def printContext(text,word):
    wordsInText = text.split()
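    for i in range(0,len(wordsInText)):
        if wordsInText[i] == word:
            # clip the window to the text boundaries: negative indices
            # would otherwise wrap around to the end of the list
            start = max(0,i-CONTEXTSIZE)
            end = min(len(wordsInText),i+CONTEXTSIZE+1)
            print(" ".join(wordsInText[start:end]))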

for clientId in mailTexts[dropoutLabel]:
    if re.search(r"\b"+searchWord+r"\b",mailTexts[dropoutLabel][clientId]): 
        print(dropout[clientId],clientId)
        printContext(mailTexts[dropoutLabel][clientId],searchWord)

## Sanity check: male vs female words

In [None]:
import numpy as np
import os
import pandas as pd
import re
import sys
import xml.etree.ElementTree as ET
sys.path.append('/home/erikt/projects/e-mental-health/data-processing')
import tactus2table

DIRECTORY = "/home/erikt/projects/e-mental-health/usb/tmp/20190917/"
FILENAMEPREFIX = "^AdB"
TITLE = "0-title"
FIELDNAMEINTAKE = "Intake"
FIELDNAMEID = "0-id"
FIELDNAMEGENDER = "geslacht"
GZEXTENSION = ".gz"
XMLEXTENSION = ".xml"
MALE = "man"
FEMALE = "vrouw"

def shortenFileName(fileName):
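    # escape the dots and anchor at the end so that only true extensions are removed
    return(re.sub(re.escape(XMLEXTENSION)+"$","",re.sub(re.escape(GZEXTENSION)+"$","",fileName)))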

In [None]:
genders = {}
for inFileName in os.listdir(DIRECTORY):
    if re.search(FILENAMEPREFIX,inFileName):
        root = tactus2table.readRootFromFile(DIRECTORY+"/"+inFileName)
        questionnaires = tactus2table.getQuestionnaires(root,inFileName)
        for questionnaire in questionnaires: 
            if questionnaire[TITLE] == FIELDNAMEINTAKE:
                for fieldName in questionnaire:
                    if re.search(FIELDNAMEGENDER,fieldName):
                        genders[shortenFileName(questionnaire[FIELDNAMEID])] = questionnaire[fieldName].lower()
                        break
                break

In [None]:
genderKeys = list(np.unique(np.array([genders[key] for key in genders])))
genderMailTexts = {}
for key in genderKeys: genderMailTexts[key] = {}
for result in range(0,len(mailTexts)):
    for fileId in mailTexts[result].keys():
        genderMailTexts[genders[fileId]][fileId] = mailTexts[result][fileId]
[(key,len(genderMailTexts[key])) for key in genderKeys]  

In [None]:
tscoreData1 = makeTscoreData(uniqueTextArray(genderMailTexts[MALE]))
tscoreData2 = makeTscoreData(uniqueTextArray(genderMailTexts[FEMALE]))
computeTscores(tscoreData1,tscoreData2)