# T-score

This notebook contains a pipeline for comparing the vocabulary of two sets of Tactus emails with each other by means of the t-score. The goal is to find tokens that appear more frequently in one set than in the other. This notebook reuses much of the preprocessing of the notebook liwc.py in this directory.

The first code block imports the required libraries: some general Python libraries and some libraries developed specifically in our research project. The project-specific libraries can be found in the folder orangehackathon/libs.

In [None]:
import csv
import re
import sys
import time
from IPython.display import clear_output

sys.path.append("../libs/")
import tactusloaderLIB
import OWEmailSorterLIB
import markduplicatesLIB
import removemarkedtextLIB
import LIWCLIB

The next code block specifies the location of the therapy files.

In [None]:
DIRECTORY = "/home/erikt/projects/e-mental-health/usb/releases/20200320"

The therapy files are read with the Orange3 pipeline. The Orange3 pipeline contains these parts:

1. tactusloader: determine file name and read its contents
2. sortMails: sort the mails from the file chronologically
3. markduplicates: mark the parts of the mail text included from an earlier mail
4. removemarkedtext: remove the marked text from the mail
5. LIWC: analyse the text with LIWC

The file loading takes several minutes. The program counts from 1 to the number of files to indicate its progress.

In [None]:
MAXCLIENT = 1987

def readTactusData(maxClient):
    allLiwcResults = {}
    allMails = {}
    emptyFiles = []
    problemFiles = []
    for patientId in range(1,maxClient+1):
        clear_output(wait=True)
        print("processing:",patientId)
        fileName = tactusloaderLIB.makeFileName(str(patientId))
        fileNameId = re.sub("-an.xml$","",fileName)
        try:
            mails = tactusloaderLIB.processFile(DIRECTORY,fileName+".gz")
            if len(mails[0]) > 0:
                sortedMails = OWEmailSorterLIB.filterEmails(mails[0],filter_asc=True)
                markedMails = markduplicatesLIB.processCorpus(sortedMails)
                strippedMails = removemarkedtextLIB.processCorpus(markedMails)
                liwcResults = LIWCLIB.processCorpus(strippedMails)
                allLiwcResults[fileNameId] = liwcResults
                allMails[fileNameId] = strippedMails
            else: emptyFiles.append(fileName)
        except Exception:
            # loading or processing failed: record the file and move on
            problemFiles.append(fileName)
            continue
    if len(emptyFiles) > 0:
        print("Found empty or nonexistent files:",emptyFiles)
    if len(problemFiles) > 0:
        print("There were problems processing these files:",problemFiles)
    return(allLiwcResults,allMails)

allLiwcResults,allMails = readTactusData(MAXCLIENT)

In [None]:
len(allMails)

We will compare the texts in emails from clients who finished the treatment with those from clients who dropped out. For this we need the metadata that specifies the outcome of the therapy for each client.

In [None]:
import gzip

DIRDROPOUT = "/home/erikt/projects/e-mental-health/usb/releases/20200305/"
FILEDROPOUT = "selected.csv.gz"
DELIMITER = ","
FIELDNAMEDROPOUT = "dropout"
FIELDNAMETEXT = "text"
FIELDNAMEFILE = "file"
FIELDNAMEFROM = "from"
CLIENT = "CLIENT"
COUNSELOR = "COUNSELOR"
NBROFCLIENTS = 791
CODEDROPOUT = "1"
CODEFINISHER = "2"

dropout = {}
inFile = gzip.open(DIRDROPOUT+FILEDROPOUT,"rt",encoding="utf-8")
csvreader = csv.DictReader(inFile,delimiter=DELIMITER)
for row in csvreader: dropout[row[FIELDNAMEFILE]] = row[FIELDNAMEDROPOUT]
inFile.close()

len([x for x in dropout if dropout[x] != "?"]) == NBROFCLIENTS

We will compare collections of texts with the t-score. First we will compute the t-scores of various text collections with a function `makeTscoreData`. After this we can compare these t-scores with the function `compareTscoreData`. We rely on the t-score script `/home/erikt/projects/newsgac/fasttext-runs/tscore.py` for making the comparisons. 

There are two ways for computing the t-scores: count every separate word used by a client or count each word used by a client only once. The texts can be prepared for the second type of counts with the function `removeDuplicateTokensTextArray` which removes all duplicate words from the texts (case-sensitive).
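The difference between the two counting methods can be illustrated with a small self-contained sketch. The helper names `count_tokens` and `count_clients` and the toy texts are made up for this illustration and are not part of the project libraries; each text stands for all mails of one client.

```python
from collections import Counter

def count_tokens(texts):
    """Count every occurrence of every token across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts

def count_clients(texts):
    """Count each token at most once per text (one text per client)."""
    counts = Counter()
    for text in texts:
        # set() drops repeated tokens within one text (case-sensitive)
        counts.update(set(text.split()))
    return counts

# hypothetical toy data: two clients
texts = ["ik drink ik drink bier", "ik stop"]
token_counts = count_tokens(texts)
client_counts = count_clients(texts)
print(token_counts["ik"], client_counts["ik"])
print(token_counts["drink"], client_counts["drink"])
```

With the second method, a word frequency can never exceed the number of clients in the group, so a single client who repeats a word very often cannot dominate the comparison.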

In [None]:
NBROFTOKENS = "totalFreq"
NBROFTYPES = "nbrOfWords"
WORDFREQS = "wordFreqs"
NBROFGROUPS = "nbrOfGroups"

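def normalizeMaxCount(tscoreData,fraction):
    # scale the maximum count and all word frequencies by the same fraction
    # note: MAXCOUNT is not defined in the cells of this notebook and must be defined before this function is called
    tscoreData[MAXCOUNT] = round(tscoreData[MAXCOUNT]*fraction,1)
    for word in tscoreData[WORDFREQS]:
        tscoreData[WORDFREQS][word] = round(tscoreData[WORDFREQS][word]*fraction,1)
    return(tscoreData)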

def removeEmptyMails(textArrayIn):    
    return([text for text in textArrayIn if text != ""])

def removeDuplicateTokensText(text):
    # keep only the first occurrence of each word, preserving word order
    seen = {}
    for word in text.split():
        if word not in seen: seen[word] = True
    return(" ".join(list(seen.keys())))

def removeDuplicateTokensTextArray(textArrayIn):
    textArrayOut = []
    for text in textArrayIn:
        textArrayOut.append(removeDuplicateTokensText(text))
    return(textArrayOut)

def makeTscoreData(textArray):
    data = { NBROFTOKENS:0, NBROFTYPES:0, NBROFGROUPS:len(textArray), WORDFREQS:{} }
    for text in textArray:
        for token in text.split():
            data[NBROFTOKENS] += 1
            if token in data[WORDFREQS]: 
                data[WORDFREQS][token] += 1
            else:
                data[WORDFREQS][token] = 1
                data[NBROFTYPES] += 1
    return(data)

def makeTscoreDataBigrams(textArray):
    data = { NBROFTOKENS:0, NBROFTYPES:0, NBROFGROUPS:len(textArray), WORDFREQS:{} }
    for text in textArray:
        tokens = text.split()
        for i in range(1,len(tokens)):
            bigram = tokens[i-1]+" "+tokens[i]
            data[NBROFTOKENS] += 1
            if bigram in data[WORDFREQS]: 
                data[WORDFREQS][bigram] += 1
            else:
                data[WORDFREQS][bigram] = 1
                data[NBROFTYPES] += 1
    return(data)

In [None]:
sys.path.append("/home/erikt/projects/newsgac/fasttext-runs")
import tscore
import operator

def compareTscoreData(tscoreData1,tscoreData1reference,tscoreData2,tscoreData2reference,nbrOfShow):
    outFile = open("out.csv","w")
    csvwriter = csv.DictWriter(outFile,["position","token","tscore","freqDropouts","freqFinishers"])
    csvwriter.writeheader()
    tscores1 = tscore.computeTscore(tscoreData1,tscoreData1reference)
    tscores2 = tscore.computeTscore(tscoreData2,tscoreData2reference)
    tscores1sorted = sorted(tscores1.items(), key=operator.itemgetter(1),reverse=True)
    tscores2sorted = sorted(tscores2.items(), key=operator.itemgetter(1),reverse=True)
    combined = [ tscores1sorted[i]+tscores2sorted[i] for i in range(min(len(tscores1sorted),len(tscores2sorted))) ]
    position = 0
    for combinedRow in combined:
        position += 1
        (token1,token1Tscore,token2,token2Tscore) = combinedRow
        if token1 in tscoreData1[WORDFREQS]: frequency11 = tscoreData1[WORDFREQS][token1]
        else: frequency11 = 0
        if token1 in tscoreData1reference[WORDFREQS]: frequency1r = tscoreData1reference[WORDFREQS][token1]
        else: frequency1r = 0
        if token2 in tscoreData2[WORDFREQS]: frequency22 = tscoreData2[WORDFREQS][token2]
        else: frequency22 = 0
        if token2 in tscoreData2reference[WORDFREQS]: frequency2r = tscoreData2reference[WORDFREQS][token2]
        else: frequency2r = 0
        csvwriter.writerow({"position":position,"token":token1,"tscore":token1Tscore,\
                            "freqDropouts":frequency11,"freqFinishers":frequency1r})
        if position <= nbrOfShow: 
            print("{0:6d}. {2:5.1f} {3:6d} {4:6d} {1:<20s} {5:6d}. {7:5.1f} {8:6d} {9:6d} {6:<20s}".\
                  format(position,token1,round(token1Tscore,1),frequency11,frequency1r,\
                         position,token2,round(token2Tscore,1),frequency22,frequency2r))
    outFile.close()

Now we choose the text collections that we want to compare. Here are the choices:

1. all mails of all clients
2. all mails of all counselors
3. the final mail written by a counselor to a client that dropped out
4. the final mail written by a counselor to a client that finished the therapy
5. the first four mails of a client that dropped out
6. the first four mails of a client that finished the therapy
7. the long mails among the first four mails of a client that dropped out
8. the long mails among the first four mails of a client that finished the therapy
9. all mails of all clients that dropped out
10. all mails of all clients that finished the therapy
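The selections "first four mails" and "long mails among the first four mails" can be sketched on plain strings as follows. This is a simplified illustration, assuming that the mails of one sender are already sorted chronologically; the function name `select_first_n` and the toy mails are hypothetical, and the toy cutoff of 5 words is much smaller than the cutoff used in this notebook.

```python
def select_first_n(texts, n, cutoff=0):
    """Join the first n texts, keeping only those with at least cutoff words."""
    first_n = texts[:n]
    return " ".join(text for text in first_n if len(text.split()) >= cutoff)

# hypothetical toy data: five mails of one client in chronological order
mails = [
    "kort bericht",
    "dit is een langer bericht",
    "ok",
    "nog een lang bericht hier",
    "vijfde mail",
]

first_four = select_first_n(mails, 4)
first_four_long = select_first_n(mails, 4, cutoff=5)
print(first_four)
print(first_four_long)
```

Note that the cutoff is applied after taking the first four mails, so the result may contain fewer than four mails, or none at all.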

In [None]:
clientWithMails = [ clientId for clientId in allMails.keys() if len(allMails[clientId]) > 0 ][0]
textFieldId = LIWCLIB.getFieldId(allMails[clientWithMails][0],FIELDNAMETEXT) 

def wordCount(text): return(len(text.split()))

def selectMailsFrom(mails,sender):
    return([ mail for mail in mails if mail[FIELDNAMEFROM] == sender ])

def selectAllMailTextFrom(mails,sender):
    return(" ".join([ mail.metas[textFieldId] for mail in selectMailsFrom(mails,sender) ]))

def selectLastNMailTextsFrom(mails,sender,n):
    return(" ".join([ mail.metas[textFieldId] for mail in selectMailsFrom(mails,sender) ][-n:]))

def selectFirstNMailTextsFrom(mails,sender,n,cutoff=0):
    firstNtexts = [ mail.metas[textFieldId] for mail in selectMailsFrom(mails,sender) ][0:n]
    return(" ".join([ text for text in firstNtexts if wordCount(text) >= cutoff ]))

allMailsClients = [ selectAllMailTextFrom(allMails[thisId],CLIENT) for thisId in allMails ]
allMailsCounselors = [ selectAllMailTextFrom(allMails[thisId],COUNSELOR) for thisId in allMails ]

NBROFSELECTED = 1
lastCounselorMails = { thisId:selectLastNMailTextsFrom(allMails[thisId],COUNSELOR,NBROFSELECTED) for thisId in allMails }
lastCounselorMailsDropout = [ lastCounselorMails[clientId] for clientId in dropout if dropout[clientId] == CODEDROPOUT ]
lastCounselorMailsFinisher = [ lastCounselorMails[clientId] for clientId in dropout if dropout[clientId] == CODEFINISHER ]

NBROFSELECTED = 4
firstFourClientMails = { thisId:selectFirstNMailTextsFrom(allMails[thisId],CLIENT,NBROFSELECTED) for thisId in allMails }
firstFourClientMailsDropout = [ firstFourClientMails[clientId] for clientId in dropout if dropout[clientId] == CODEDROPOUT ]
firstFourClientMailsFinisher = [ firstFourClientMails[clientId] for clientId in dropout if dropout[clientId] == CODEFINISHER ]

CUTOFF = 1000
NBROFSELECTED = 4
firstFourClientMailsCutoff = { thisId:selectFirstNMailTextsFrom(allMails[thisId],CLIENT,NBROFSELECTED,cutoff=CUTOFF) for thisId in allMails }
firstFourClientMailsCutoffDropout = [ firstFourClientMailsCutoff[clientId] for clientId in dropout if dropout[clientId] == CODEDROPOUT ]
firstFourClientMailsCutoffFinisher = [ firstFourClientMailsCutoff[clientId] for clientId in dropout if dropout[clientId] == CODEFINISHER ]

allMailsClientsDropout = [ selectAllMailTextFrom(allMails[clientId],CLIENT) for clientId in allMails if dropout[clientId] == CODEDROPOUT ]
allMailsClientsFinisher = [ selectAllMailTextFrom(allMails[clientId],CLIENT) for clientId in allMails if dropout[clientId] == CODEFINISHER ]

len(removeEmptyMails(allMailsClients)),len(removeEmptyMails(allMailsCounselors)),\
len(removeEmptyMails(lastCounselorMailsDropout)),len(removeEmptyMails(lastCounselorMailsFinisher)),\
len(removeEmptyMails(firstFourClientMailsDropout)),len(removeEmptyMails(firstFourClientMailsFinisher)),\
len(removeEmptyMails(firstFourClientMailsCutoffDropout)),len(removeEmptyMails(firstFourClientMailsCutoffFinisher)), \
len(removeEmptyMails(allMailsClientsDropout)),len(removeEmptyMails(allMailsClientsFinisher))

Now that we have selected different groups of text, we can use the t-score to compare the word usage in one group with that in another. Note that the comparison results will be better for larger groups of text. In general it does not make sense to compare two small groups of texts, because neither of the two will provide a good model of the language to compare the other with. In those cases it is better to compare a small text with a large text.

## Comparing all mail texts: dropouts vs finishers

The two groups of all dropout mails and all finisher mails are large enough to compare with each other. However, the results may not be very useful, for two reasons. First, on average the finishers took part in the therapy for a longer time than the dropouts. So they participated more frequently in later assignments, and the language used in these assignments will have an effect on the measurements. A second reason for the results being less interesting is that we would like to detect likely dropouts as early as possible in the therapy, so we do not have the time to collect their responses over several weeks.

In [None]:
NBROFSHOW = 40

print("counts:","allMailsClientsDropout:",sum([wordCount(text) for text in allMailsClientsDropout]),"tokens;",len(allMailsClientsDropout),"mails;", \
                "allMailsClientsFinisher:",sum([wordCount(text) for text in allMailsClientsFinisher]),"tokens;",len(allMailsClientsFinisher),"mails", )
tscoreData1 = makeTscoreData(removeEmptyMails(allMailsClientsDropout))
tscoreData2 = makeTscoreData(removeEmptyMails(allMailsClientsFinisher))
compareTscoreData(tscoreData1,tscoreData2,tscoreData2,tscoreData1,NBROFSHOW)

The fact that the finishers completed a larger part of the therapy is indeed reflected in their word usage, in particular by the therapy words *pols*, *vinger*, *actieplan*, *nameting*, *traject* and *opdracht*. The word *Dank* is also interesting here (expressing gratitude for a successful therapy?), as are the two food-related diminutives (*etentje* and *pilsje*: an ability to put things in perspective?). *gedachten*, *denken* and *herken* show the ability to reflect but may be prompted by later assignments. *trek* is an old-fashioned word that correlates with the age of the client. The comma (*,*) was used more in this group (longer sentences?) while the period (*.*) was used more by the dropouts (shorter sentences?).

Interestingly, dropouts use more numbers (*NUM*) in their mails and refer to themselves more frequently (*mijn* and *Mijn*). The word *IK* is less interesting: it is an unusual spelling of a common word. There are seven more of these (*HET*, *DAT*, *VAN*, *DE*, *MIJN*, *EN* and *IS*), but they could be caused by a few clients writing in all caps. There are many references to people and relations (*relatie*, *kinderen*, *huisarts*, *partner*, *ex*, *ouders*, *vrienden*, *contacten* and *vriendin*). *huisarts* could be used because of medical problems or extreme drinking behavior. While the finishers did not use negative words frequently, the dropouts did: *nadelen*, *problemen* and *klachten*. Two other interesting related words are *werk* and *hobby*, and there are also the three behavior-related words *drink*, *gebruik* and *stoppen*.

In [None]:
UNDEFINED = "UNDEF"

def makeCollocation(tokens,i,nbrOfTokensBefore,nbrOfTokensAfter):
    collocationList = [tokens[i]]
    for j in range(i-1,i-nbrOfTokensBefore-1,-1):
        if j >= 0: collocationList = [tokens[j]]+collocationList
        else: collocationList = [UNDEFINED]+collocationList
    for j in range(i+1,i+nbrOfTokensAfter+1):
        if j < len(tokens): collocationList.append(tokens[j])
        else: collocationList.append(UNDEFINED)
    return(" ".join(collocationList))
            
def addToCollocations(collocations,tokens,i,nbrOfTokensBefore,nbrOfTokensAfter):
    collocation = makeCollocation(tokens,i,nbrOfTokensBefore,nbrOfTokensAfter)
    if collocation in collocations: collocations[collocation] += 1
    else: collocations[collocation] = 1
        
def collocations(texts,token,nbrOfTokensBefore=2,nbrOfTokensAfter=3,nbrOfCollocations=20):
    collocations = {}
    for text in texts:
        tokens = text.split()
        for i in range(0,len(tokens)):
            if tokens[i] == token: addToCollocations(collocations,tokens,i,nbrOfTokensBefore,nbrOfTokensAfter)
    for key,value in sorted(collocations.items(), key=lambda item: item[1],reverse=True)[0:nbrOfCollocations]:
        print(value,key)
        
collocations(allMailsClientsDropout,"NUM",nbrOfTokensBefore=3,nbrOfTokensAfter=0)

In [None]:
DROPOUTID = 1
FINISHERID = 2
NBROFDROPOUTS = 437
NBROFFINISHERS = 354

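# mailTexts is not built in the cells above; this check assumes it maps each group code to a dictionary of client id -> mail text
len(mailTexts[DROPOUTID]) == NBROFDROPOUTS and len(mailTexts[FINISHERID]) == NBROFFINISHERS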

In [None]:
sum([wordCount(text) for text in allMailsClientsDropout])

The top of the list contains words that are used more frequently by the dropouts among the clients (category 1) than by the finishers (category 2), as can be seen from the counts in the last two columns, for example for *mvg*: 16 > 3.

Originally, this observation did not hold for the top of the list, which contained mainly frequent words. Why was this the case? The problem proved to be the computation of the t-score. We removed the concepts of text length and vocabulary size used in the original definition of the t-score (Church, Gale, Hanks & Hindle, 1991) and replaced them with a maximum attainable value: the number of clients per group, since each word was counted only once per client. We kept using add-0.5 smoothing and adjusted the maximum attainable value by 0.5 as well to account for this. The top of the list improved a lot, with *mvg*, *hapje* and *late* now occurring at the top. The bottom of the list, with words typically used by finishers, did not contain many clear cases like *PS* at position -89 (0 vs 13), but no errors could be found here.
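As a rough illustration of the idea, the sketch below computes a t-score for one word from per-client counts. It is not the code of `tscore.py`: the formula (difference of two smoothed proportions divided by an estimate of its standard deviation, in the spirit of Church et al.) and the way the 0.5 is added to both the counts and the group sizes are assumptions made for this example. The function name `tscore_sketch` is hypothetical. The counts for *mvg* and *PS* and the group sizes 437 and 354 are taken from this notebook.

```python
import math

def tscore_sketch(freq1, max1, freq2, max2, smoothing=0.5):
    """Assumed t-score: compare the share of clients using a word in two groups.

    freq1, freq2: number of clients in each group that used the word
    max1, max2: maximum attainable value, here the number of clients per group
    """
    f1, f2 = freq1 + smoothing, freq2 + smoothing
    n1, n2 = max1 + smoothing, max2 + smoothing
    difference = f1 / n1 - f2 / n2
    # variance of each proportion approximated by f/n^2 (assumption)
    deviation = math.sqrt(f1 / n1**2 + f2 / n2**2)
    return difference / deviation

# mvg: used by 16 dropouts and 3 finishers
mvg = tscore_sketch(16, 437, 3, 354)
# PS: used by 0 dropouts and 13 finishers
ps = tscore_sketch(0, 437, 13, 354)
print(round(mvg, 1), round(ps, 1))
```

Under these assumptions, a word typical of dropouts gets a positive score and a word typical of finishers a negative one, and swapping the two groups only flips the sign.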

In [None]:
CONTEXTSIZE = 5
dropoutLabel = 2
searchWord = "stress"

def printContext(text,word):
    wordsInText = text.split()
    for i in range(0,len(wordsInText)):
        if wordsInText[i] == word:
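            # clamp the window to the text bounds: a negative index would wrap around to the end of the list
            for j in range(max(0,i-CONTEXTSIZE),min(len(wordsInText),i+CONTEXTSIZE+1)):
                print(wordsInText[j],end=" ")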
            print()    

for clientId in mailTexts[dropoutLabel]:
    if re.search(r"\b"+searchWord+r"\b",mailTexts[dropoutLabel][clientId]): 
        print(dropout[clientId],clientId)
        printContext(mailTexts[dropoutLabel][clientId],searchWord)

## Sanity check: male vs female words

In [None]:
import numpy as np
import os
import pandas as pd
import re
import sys
import xml.etree.ElementTree as ET
sys.path.append('/home/erikt/projects/e-mental-health/data-processing')
import tactus2table

DIRECTORY = "/home/erikt/projects/e-mental-health/usb/tmp/20190917/"
FILENAMEPREFIX = "^AdB"
TITLE = "0-title"
FIELDNAMEINTAKE = "Intake"
FIELDNAMEID = "0-id"
FIELDNAMEGENDER = "geslacht"
GZEXTENSION = ".gz"
XMLEXTENSION = ".xml"
MALE = "man"
FEMALE = "vrouw"

def shortenFileName(fileName):
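    # strip a trailing ".gz" and then a trailing ".xml"; escape the dots so they match literally
    return(re.sub(re.escape(XMLEXTENSION)+"$","",re.sub(re.escape(GZEXTENSION)+"$","",fileName)))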

In [None]:
genders = {}
for inFileName in os.listdir(DIRECTORY):
    if re.search(FILENAMEPREFIX,inFileName):
        root = tactus2table.readRootFromFile(DIRECTORY+"/"+inFileName)
        questionnaires = tactus2table.getQuestionnaires(root,inFileName)
        for questionnaire in questionnaires: 
            if questionnaire[TITLE] == FIELDNAMEINTAKE:
                for fieldName in questionnaire:
                    if re.search(FIELDNAMEGENDER,fieldName):
                        genders[shortenFileName(questionnaire[FIELDNAMEID])] = questionnaire[fieldName].lower()
                        break
                break

In [None]:
genderKeys = list(np.unique(np.array([genders[key] for key in genders])))
genderMailTexts = {}
for key in genderKeys: genderMailTexts[key] = {}
for result in range(0,len(mailTexts)):
    for fileId in mailTexts[result].keys():
        genderMailTexts[genders[fileId]][fileId] = mailTexts[result][fileId]
[(key,len(genderMailTexts[key])) for key in genderKeys]  

In [None]:
tscoreData1 = makeTscoreData(uniqueTextArray(genderMailTexts[MALE]))
tscoreData2 = makeTscoreData(uniqueTextArray(genderMailTexts[FEMALE]))
computeTscores(tscoreData1,tscoreData2)