# T-score

This notebook contains a pipeline for comparing the vocabulary of two sets of Tactus emails with each other by means of the t-score. The goal is to find tokens which appear more frequently in one set than in the other. This notebook reuses much of the preprocessing of the notebook liwc.py in this directory.

The first code block imports the required libraries. These include some general Python libraries and some specific libraries developed in our research project. The project-specific libraries can be found in the folder orangehackathon/libs.

In [None]:
import csv
import re
import sys
import time
from IPython.display import clear_output

sys.path.append("../libs/")
import tactusloaderLIB
import OWEmailSorterLIB
import markduplicatesLIB
import removemarkedtextLIB
import LIWCLIB

The next code block specifies the location of the therapy files.

In [None]:
DIRECTORY = "/home/erikt/projects/e-mental-health/usb/releases/20200320"

The therapy files are read with the Orange3 pipeline. The Orange3 pipeline contains these parts:

1. tactusloader: determine file name and read its contents
2. sortMails: sort the mails from the file chronologically
3. markduplicates: mark the parts of the mail text included from an earlier mail
4. removemarkedtext: remove the marked text from the mail
5. LIWC: analyse the text with LIWC

The file loading takes several minutes. The program counts from 1 to the number of files to indicate its progress.

In [None]:
MAXCLIENT = 1987

def readTactusData(maxClient):
    allLiwcResults = {}
    allMails = {}
    emptyFiles = []
    problemFiles = []
    for patientId in range(1,maxClient+1):
        clear_output(wait=True)
        print("processing:",patientId)
        fileName = tactusloaderLIB.makeFileName(str(patientId))
        fileNameId = re.sub("-an.xml$","",fileName)
        try:
            mails = tactusloaderLIB.processFile(DIRECTORY,fileName+".gz")
            if len(mails[0]) > 0:
                sortedMails = OWEmailSorterLIB.filterEmails(mails[0],filter_asc=True)
                markedMails = markduplicatesLIB.processCorpus(sortedMails)
                strippedMails = removemarkedtextLIB.processCorpus(markedMails)
                liwcResults = LIWCLIB.processCorpus(strippedMails)
                allLiwcResults[fileNameId] = liwcResults
                allMails[fileNameId] = strippedMails
            else: emptyFiles.append(fileName)
        except Exception:
            problemFiles.append(fileName)
            continue
    if len(emptyFiles) > 0:
        print("Found empty or nonexistent files:",emptyFiles)
    if len(problemFiles) > 0:
        print("There were problems processing these files:",problemFiles)
    return(allLiwcResults,allMails)

allLiwcResults,allMails = readTactusData(MAXCLIENT)

In [None]:
len(allMails)

We will compare the texts in emails from clients that finished the treatment with those from clients that dropped out. For this we need the metadata which specifies the result of the therapy for each client.
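As a minimal sketch of the metadata format, the block below builds a small gzipped CSV in memory and reads it in the same way as the next cell. It assumes the columns `file` and `dropout` and the codes `1` (dropout), `2` (finisher) and `?` (unknown) that the notebook's constants suggest. The rows and the names `toyCsv`, `toyDropout` and `knownOutcomes` are invented for illustration.

```python
import csv
import gzip
import io

# Toy metadata with the column names used in this notebook ("file", "dropout");
# the rows themselves are invented for illustration
toyCsv = "file,dropout\nAdB0001,1\nAdB0002,2\nAdB0003,?\n"
toyBytes = gzip.compress(toyCsv.encode("utf-8"))

toyDropout = {}
with gzip.open(io.BytesIO(toyBytes), "rt", encoding="utf-8") as inFile:
    for row in csv.DictReader(inFile, delimiter=","):
        toyDropout[row["file"]] = row["dropout"]

# Clients with a known therapy outcome (code other than "?")
knownOutcomes = [fileId for fileId in toyDropout if toyDropout[fileId] != "?"]
print(len(knownOutcomes))
```

Opening the file in a `with` block closes it even if a row cannot be parsed.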

In [None]:
import gzip

DIRDROPOUT = "/home/erikt/projects/e-mental-health/usb/releases/20200305/"
FILEDROPOUT = "selected.csv.gz"
DELIMITER = ","
FIELDNAMEDROPOUT = "dropout"
FIELDNAMETEXT = "text"
FIELDNAMEFILE = "file"
FIELDNAMEFROM = "from"
CLIENT = "CLIENT"
COUNSELOR = "COUNSELOR"
NBROFCLIENTS = 791
CODEDROPOUT = "1"
CODEFINISHER = "2"

dropout = {}
inFile = gzip.open(DIRDROPOUT+FILEDROPOUT,"rt",encoding="utf-8")
csvreader = csv.DictReader(inFile,delimiter=DELIMITER)
for row in csvreader: dropout[row[FIELDNAMEFILE]] = row[FIELDNAMEDROPOUT]
inFile.close()

len([x for x in dropout if dropout[x] != "?"]) == NBROFCLIENTS

We will compare collections of texts with the t-score. First we will compute the t-scores of various text collections with a function `makeTscoreData`. After this we can compare these t-scores with the function `compareTscoreData`. We rely on the t-score script `/home/erikt/projects/newsgac/fasttext-runs/tscore.py` for making the comparisons. 
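The script `tscore.py` is not part of this notebook, so its exact formula is not shown here. As a rough illustration only, the sketch below uses one common formulation of the t-score for comparing the relative frequency of a token in two collections. This is an assumption and not necessarily what `tscore.computeTscore` computes. The function name `sketchTscore` and the toy counts are hypothetical; the dictionary keys match those produced by `makeTscoreData`.

```python
import math

# Keys as used by makeTscoreData in this notebook
NBROFTOKENS = "totalFreq"
WORDFREQS = "wordFreqs"

def sketchTscore(data, reference):
    # Assumed formula: difference of the relative frequencies divided by
    # the square root of the summed variance estimates f/N^2.
    # Hypothetical stand-in for tscore.computeTscore, which may differ.
    scores = {}
    n1 = data[NBROFTOKENS]
    n2 = reference[NBROFTOKENS]
    for token, f1 in data[WORDFREQS].items():
        f2 = reference[WORDFREQS].get(token, 0)
        scores[token] = (f1/n1 - f2/n2) / math.sqrt(f1/n1**2 + f2/n2**2)
    return scores

# Invented toy counts, not taken from the Tactus data
toyData = {NBROFTOKENS: 100, WORDFREQS: {"werk": 10, "de": 5}}
toyReference = {NBROFTOKENS: 200, WORDFREQS: {"werk": 4, "de": 10}}
toyScores = sketchTscore(toyData, toyReference)
print(toyScores)
```

Under this formulation a token with the same relative frequency in both collections (here *de*) scores zero, while a token that is relatively more frequent in the first collection (here *werk*) gets a positive score.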

There are two ways for computing the t-scores: count every separate word used by a client or count each word used by a client only once. The texts can be prepared for the second type of counts with the function `removeDuplicateTokensTextArray` which removes all duplicate words from the texts (case-sensitive).
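To illustrate the difference between the two counting methods, the sketch below counts a toy text once with all tokens and once after dropping repeated tokens. The helper `dedupeTokens` is a hypothetical stand-in for `removeDuplicateTokensText`, and the toy text is invented.

```python
from collections import Counter

def dedupeTokens(text):
    # Keep the first occurrence of each token, preserving order (case-sensitive)
    return " ".join(dict.fromkeys(text.split()))

toyText = "ik drink en ik rook en Ik stop"
allCounts = Counter(toyText.split())                 # every occurrence counts
onceCounts = Counter(dedupeTokens(toyText).split())  # each token counts once
print(allCounts["ik"], onceCounts["ik"])
```

Because the comparison is case-sensitive, *ik* and *Ik* are kept as two different tokens.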

In [None]:
NBROFTOKENS = "totalFreq"
NBROFTYPES = "nbrOfWords"
WORDFREQS = "wordFreqs"
NBROFGROUPS = "nbrOfGroups"

def normalizeMaxCount(tscoreData,fraction):
    tscoreData[MAXCOUNT] = round(tscoreData[MAXCOUNT]*fraction,1)
    for word in tscoreData["wordFreqs"]:
        tscoreData["wordFreqs"][word] = round(tscoreData["wordFreqs"][word]*fraction,1)
    return(tscoreData)

def removeEmptyMails(textArrayIn):    
    return([text for text in textArrayIn if text != ""])

def removeDuplicateTokensText(text):
    seen = {}
    for word in text.split():
        if not word in seen: seen[word] = True
    return(" ".join(list(seen.keys())))

def removeDuplicateTokensTextArray(textArrayIn):
    textArrayOut = []
    for text in textArrayIn:
        textArrayOut.append(removeDuplicateTokensText(text))
    return(textArrayOut)

def makeTscoreData(textArray):
    data = { NBROFTOKENS:0, NBROFTYPES:0, NBROFGROUPS:len(textArray), WORDFREQS:{} }
    for text in textArray:
        for token in text.split():
            data[NBROFTOKENS] += 1
            if token in data[WORDFREQS]: 
                data[WORDFREQS][token] += 1
            else:
                data[WORDFREQS][token] = 1
                data[NBROFTYPES] += 1
    return(data)

def makeTscoreDataBigrams(textArray):
    data = { NBROFTOKENS:0, NBROFTYPES:0, NBROFGROUPS:len(textArray), WORDFREQS:{} }
    for text in textArray:
        tokens = text.split()
        for i in range(1,len(tokens)):
            bigram = tokens[i-1]+" "+tokens[i]
            data[NBROFTOKENS] += 1
            if bigram in data[WORDFREQS]: 
                data[WORDFREQS][bigram] += 1
            else:
                data[WORDFREQS][bigram] = 1
                data[NBROFTYPES] += 1
    return(data)

In [None]:
sys.path.append("/home/erikt/projects/newsgac/fasttext-runs")
import tscore
import operator
from termcolor import colored
# alternative: from IPython.display import Markdown; display(Markdown("htmlcode"))

HIGHLIGHTCOLOR = "blue"
ADDEDSPACES = 9
MAXSTRING = 20

def fillAddedSpaces(tokenLen):
    addedSpaces = ""
    i = 0
    while i < ADDEDSPACES and tokenLen+i-ADDEDSPACES < MAXSTRING:
        addedSpaces += " "
        i += 1
    return(addedSpaces)

def compareTscoreData(tscoreData1,tscoreData1reference,tscoreData2,tscoreData2reference,nbrOfShow,coloredWords=[]):
    outFile = open("out.csv","w")
    csvwriter = csv.DictWriter(outFile,["position","token","tscore","freqDropouts","freqFinishers"])
    csvwriter.writeheader()
    tscores1 = tscore.computeTscore(tscoreData1,tscoreData1reference)
    tscores2 = tscore.computeTscore(tscoreData2,tscoreData2reference)
    tscores1sorted = sorted(tscores1.items(), key=operator.itemgetter(1),reverse=True)
    tscores2sorted = sorted(tscores2.items(), key=operator.itemgetter(1),reverse=True)
    combined = [ tscores1sorted[i]+tscores2sorted[i] for i in range(min(len(tscores1sorted),len(tscores2sorted))) ]
    position = 0
    for tuple in combined:
        position += 1
        (token1,token1Tscore,token2,token2Tscore) = tuple
        if token1 in tscoreData1[WORDFREQS]: frequency11 = tscoreData1[WORDFREQS][token1]
        else: frequency11 = 0
        if token1 in tscoreData1reference[WORDFREQS]: frequency1r = tscoreData1reference[WORDFREQS][token1]
        else: frequency1r = 0
        if token2 in tscoreData2[WORDFREQS]: frequency22 = tscoreData2[WORDFREQS][token2]
        else: frequency22 = 0
        if token2 in tscoreData2reference[WORDFREQS]: frequency2r = tscoreData2reference[WORDFREQS][token2]
        else: frequency2r = 0
        csvwriter.writerow({"position":position,"token":token1,"tscore":token1Tscore,\
                            "freqDropouts":frequency11,"freqFinishers":frequency1r})
        if position <= nbrOfShow:
            token1AddedSpaces = ""
            if token1 in coloredWords: 
                token1 = colored(token1,HIGHLIGHTCOLOR)
                token1AddedSpaces = fillAddedSpaces(len(token1))
            if token2 in coloredWords: token2 = colored(token2,HIGHLIGHTCOLOR)
            print("{0:6d}. {2:5.1f} {3:7d} {4:7d} {1:<20s}{5:s} {0:6d}. {7:5.1f} {8:7d} {9:7d} {6:<20s}".\
                  format(position,token1,round(token1Tscore,1),frequency11,frequency1r,token1AddedSpaces, \
                                  token2,round(token2Tscore,1),frequency22,frequency2r))
    outFile.close()

Now we choose the text collections that we want to compare. Here are the choices:

1. all mails of all clients
2. all mails of all counselors
3. the final mail written by a counselor to a client that dropped out
4. the final mail written by a counselor to a client that finished the therapy
5. the first four mails of a client that dropped out
6. the first four mails of a client that finished the therapy
7. the long mails among the first four mails of a client that dropped out
8. the long mails among the first four mails of a client that finished the therapy
9. all mails of all clients that dropped out
10. all mails of all clients that finished the therapy

In [None]:
clientWithMails = [ clientId for clientId in allMails.keys() if len(allMails[clientId]) > 0 ][0]
textFieldId = LIWCLIB.getFieldId(allMails[clientWithMails][0],FIELDNAMETEXT) 

def wordCount(text): return(len(text.split()))

def selectMailsFrom(mails,sender):
    return([ mail for mail in mails if mail[FIELDNAMEFROM] == sender ])

def selectAllMailTextFrom(mails,sender):
    return(" ".join([ mail.metas[textFieldId] for mail in selectMailsFrom(mails,sender) ]))

def selectLastNMailTextsFrom(mails,sender,n):
    return(" ".join([ mail.metas[textFieldId] for mail in selectMailsFrom(mails,sender) ][-n:]))

def selectFirstNMailTextsFrom(mails,sender,n,cutoff=0):
    firstNtexts = [ mail.metas[textFieldId] for mail in selectMailsFrom(mails,sender) ][0:n]
    return(" ".join([ text for text in firstNtexts if wordCount(text) >= cutoff ]))

allMailsClients = [ selectAllMailTextFrom(allMails[thisId],CLIENT) for thisId in allMails ]
allMailsCounselors = [ selectAllMailTextFrom(allMails[thisId],COUNSELOR) for thisId in allMails ]

NBROFSELECTED = 1
lastCounselorMails = { thisId:selectLastNMailTextsFrom(allMails[thisId],COUNSELOR,NBROFSELECTED) for thisId in allMails }
lastCounselorMailsDropout = [ lastCounselorMails[clientId] for clientId in dropout if dropout[clientId] == CODEDROPOUT ]
lastCounselorMailsFinisher = [ lastCounselorMails[clientId] for clientId in dropout if dropout[clientId] == CODEFINISHER ]

NBROFSELECTED = 4
firstFourClientMails = { thisId:selectFirstNMailTextsFrom(allMails[thisId],CLIENT,NBROFSELECTED) for thisId in allMails }
firstFourClientMailsDropout = [ firstFourClientMails[clientId] for clientId in dropout if dropout[clientId] == CODEDROPOUT ]
firstFourClientMailsFinisher = [ firstFourClientMails[clientId] for clientId in dropout if dropout[clientId] == CODEFINISHER ]

CUTOFF = 1000
NBROFSELECTED = 4
firstFourClientMailsCutoff = { thisId:selectFirstNMailTextsFrom(allMails[thisId],CLIENT,NBROFSELECTED,cutoff=CUTOFF) for thisId in allMails }
firstFourClientMailsCutoffDropout = [ firstFourClientMailsCutoff[clientId] for clientId in dropout if dropout[clientId] == CODEDROPOUT ]
firstFourClientMailsCutoffFinisher = [ firstFourClientMailsCutoff[clientId] for clientId in dropout if dropout[clientId] == CODEFINISHER ]

allMailsClientsDropout = [ selectAllMailTextFrom(allMails[clientId],CLIENT) for clientId in allMails if dropout[clientId] == CODEDROPOUT ]
allMailsClientsFinisher = [ selectAllMailTextFrom(allMails[clientId],CLIENT) for clientId in allMails if dropout[clientId] == CODEFINISHER ]

len(removeEmptyMails(allMailsClients)),len(removeEmptyMails(allMailsCounselors)),\
len(removeEmptyMails(lastCounselorMailsDropout)),len(removeEmptyMails(lastCounselorMailsFinisher)),\
len(removeEmptyMails(firstFourClientMailsDropout)),len(removeEmptyMails(firstFourClientMailsFinisher)),\
len(removeEmptyMails(firstFourClientMailsCutoffDropout)),len(removeEmptyMails(firstFourClientMailsCutoffFinisher)), \
len(removeEmptyMails(allMailsClientsDropout)),len(removeEmptyMails(allMailsClientsFinisher))

The number of dropout sessions with counselor mails (437) differs from the number of dropout sessions with client mails (433). The reason for this is that four clients contributed only a single empty mail to their session.

In [None]:
for thisId in allMails:
    mailText = selectAllMailTextFrom(allMails[thisId],CLIENT)
    if (dropout[thisId] == "1" or dropout[thisId] == "2") and len(mailText) == 0: print(thisId,dropout[thisId])

Now that we have selected different groups of texts, we can use the t-score to compare the word usage in one group with that in another. Note that the comparison results will be better for larger word groups. In general it does not make sense to compare two small groups of texts because neither of the two will provide a good model of the language to compare the other with. In those cases it is better to compare a small text with a large text.

## Comparing all mail texts: dropouts vs finishers

The two groups of all dropout mails and all finisher mails are large enough to compare with each other. However, the results may not be very useful, for two reasons. First, on average the finishers took part in the therapy for a longer time than the dropouts. So they participated more frequently in later assignments, and the language used in these assignments will have an effect on the measurements. Second, we would like to detect likely dropouts as early as possible in the therapy, so we do not have the time to collect their responses over several weeks.

In [None]:
NBROFSHOW = 40

print("counts:","allMailsClientsDropout:",sum([wordCount(text) for text in allMailsClientsDropout]), \
                "tokens;",len(removeEmptyMails(allMailsClientsDropout)),"mails;", \
                "allMailsClientsFinisher:",sum([wordCount(text) for text in allMailsClientsFinisher]), \
                "tokens;",len(removeEmptyMails(allMailsClientsFinisher)),"mails" )
tscoreData1 = makeTscoreData(removeEmptyMails(allMailsClientsDropout))
tscoreData2 = makeTscoreData(removeEmptyMails(allMailsClientsFinisher))
compareTscoreData(tscoreData1,tscoreData2,tscoreData2,tscoreData1,NBROFSHOW,\
                  coloredWords=["relatie","kinderen","huisarts","partner","ex","ouders","vrienden","contacten","vriendin",\
                                "pols","vinger","actieplan","nameting","traject","opdracht"])

The fact that the finishers completed a larger part of the therapy is indeed reflected in their word usage, in particular by the therapy words *pols*, *vinger*, *actieplan*, *nameting*, *traject* and *opdracht*. The word *Dank* is also interesting here (displaying gratitude for a successful therapy?), as are the two food-related diminutives (*etentje* and *pilsje*: ability to put things in perspective?). *gedachten*, *denken* and *herken* show the ability to reflect but may be prompted by later assignments. *trek* is an old-fashioned word correlated with the age of the client. The comma (*,*) was used more in this group (longer sentences?) while the period (*.*) was used more by the dropouts (shorter sentences?).

Interestingly, dropouts use more numbers (*NUM*) in their mails and more frequently refer to themselves (*mijn* and *Mijn*). The word *IK* is less interesting: it is an unusual spelling of a common word. There are seven more of these (*HET*, *DAT*, *VAN*, *DE*, *MIJN*, *EN* and *IS*) but they could be caused by a few clients writing in all caps. There are many references to people and relations (*relatie*, *kinderen*, *huisarts*, *partner*, *ex*, *ouders*, *vrienden*, *contacten* and *vriendin*). *huisarts* could be used because of medical problems or extreme drinking behavior. While the finishers did not use negative words frequently, the dropouts did: *nadelen*, *problemen* and *klachten*. Two other interesting related words are *werk* and *hobby*, and there are also the three behavior-related words *drink*, *gebruik* and *stoppen*.

The next function `collocations()` can be used to look up tokens in context in the mails. For example, we expected that the word *pols* was predominantly used in the context of the name of the Tactus treatment programme *Vinger aan de pols* rather than in its literal meaning (*wrist*). Searching for the contexts of the token with the `collocations()` function confirmed this expectation.
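The idea of counting the contexts of a token can be sketched compactly with `collections.Counter`. The function `sketchCollocations` and the toy texts below are hypothetical and only illustrate the technique. Unlike the notebook's `collocations()`, this sketch looks only at the tokens before the target and does not pad short contexts with `UNDEF`.

```python
from collections import Counter

def sketchCollocations(texts, token, nbrBefore=3):
    # Count the left contexts: up to nbrBefore tokens plus the token itself
    contexts = Counter()
    for text in texts:
        tokens = text.split()
        for i, current in enumerate(tokens):
            if current == token:
                contexts[" ".join(tokens[max(0, i-nbrBefore):i+1])] += 1
    return contexts

# Invented toy texts, not taken from the Tactus data
toyTexts = ["Vinger aan de pols is gestart",
            "welkom bij Vinger aan de pols",
            "mijn pols doet pijn"]
toyContexts = sketchCollocations(toyTexts, "pols")
for context, count in toyContexts.most_common():
    print(count, context)
```

In this toy example the programme name accounts for two of the three contexts, which is the kind of pattern the real lookup is meant to reveal.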

In [None]:
UNDEFINED = "UNDEF"
NBROFTOKENSAFTER = 0
NBROFTOKENSBEFORE = 3
NBROFCOLLOCATIONS = 20

def makeCollocation(tokens,i,nbrOfTokensBefore,nbrOfTokensAfter):
    collocationList = [tokens[i]]
    for j in range(i-1,i-nbrOfTokensBefore-1,-1):
        if j >= 0: collocationList = [tokens[j]]+collocationList
        else: collocationList = [UNDEFINED]+collocationList
    for j in range(i+1,i+nbrOfTokensAfter+1):
        if j < len(tokens): collocationList.append(tokens[j])
        else: collocationList.append(UNDEFINED)
    return(" ".join(collocationList))
            
def addToCollocations(collocations,tokens,i,nbrOfTokensBefore,nbrOfTokensAfter):
    collocation = makeCollocation(tokens,i,nbrOfTokensBefore,nbrOfTokensAfter)
    if collocation in collocations: collocations[collocation] += 1
    else: collocations[collocation] = 1
        
def collocations(texts,token,nbrOfTokensBefore=NBROFTOKENSBEFORE,nbrOfTokensAfter=NBROFTOKENSAFTER,nbrOfCollocations=NBROFCOLLOCATIONS):
    collocations = {}
    total = 0
    for text in texts:
        tokens = text.split()
        for i in range(0,len(tokens)):
            if tokens[i] == token:
                total += 1
                addToCollocations(collocations,tokens,i,nbrOfTokensBefore,nbrOfTokensAfter)
    print("total:",total)
    for key,value in sorted(collocations.items(), key=lambda item: item[1],reverse=True)[0:nbrOfCollocations]:
        print(value,key)
        
collocations(allMailsClientsFinisher,"pols",nbrOfCollocations=7)

## Comparing the final counselor mails: dropouts vs finishers

We would like to know if it is possible to predict the result of the therapy (dropout vs finisher) based on the text of the final mail sent by the counselor. By themselves these mails do not contain a lot of text, so we compare them with all the mails of the counselors.

In [None]:
print("counts:","lastCounselorMailsDropout:",sum([wordCount(text) for text in lastCounselorMailsDropout]),\
                "tokens;",len(removeEmptyMails(lastCounselorMailsDropout)),"mails;", \
                "lastCounselorMailsFinisher:",sum([wordCount(text) for text in lastCounselorMailsFinisher]),\
                "tokens;",len(removeEmptyMails(lastCounselorMailsFinisher)),"mails" )
tscoreData1 = makeTscoreData(removeEmptyMails(lastCounselorMailsDropout))
tscoreData2 = makeTscoreData(removeEmptyMails(lastCounselorMailsFinisher))
tscoreData3 = makeTscoreData(removeEmptyMails(allMailsCounselors))
compareTscoreData(tscoreData1,tscoreData3,tscoreData2,tscoreData3,NBROFSHOW,
                  coloredWords=["invullen","vragenlijst","stuur","vullen","verzoek","actief","niet"])

The top finisher words provide the most clues for group identification. We observe calls to action (*invullen*, *vragenlijst*, *stuur*, *vullen* and *verzoek*) and expressions of completion (*afgerond* and *bedankt*). Among the top dropout words we find a reference to the client's status: *actief* (in combination with *niet*). *stuur* also appears here, in the instruction to the client on how to return to the programme.

In [None]:
collocations(lastCounselorMailsFinisher,"actief",nbrOfTokensBefore=2,nbrOfTokensAfter=2,nbrOfCollocations=5)

## Comparing the first four client mails: dropouts vs finishers

In [None]:
print("counts:","firstFourClientMailsDropout:",sum([wordCount(text) for text in firstFourClientMailsDropout]),\
                "tokens;",len(removeEmptyMails(firstFourClientMailsDropout)),"mails;", \
                "firstFourClientMailsFinisher:",sum([wordCount(text) for text in firstFourClientMailsFinisher]),\
                "tokens;",len(removeEmptyMails(firstFourClientMailsFinisher)),"mails" )
tscoreData1 = makeTscoreData(removeEmptyMails(firstFourClientMailsDropout))
tscoreData2 = makeTscoreData(removeEmptyMails(firstFourClientMailsFinisher))
tscoreData3 = makeTscoreData(removeEmptyMails(allMailsClients))
compareTscoreData(tscoreData1,tscoreData3,tscoreData2,tscoreData3,NBROFSHOW,
                  coloredWords=["jaren","(","vaak","voordelen","top","werd","vader","In",\
                                "regelmatig","contact","Top","altijd","'s","naam",":",\
                                "ben","gebruik","hoogte","relatie","eet","partner","Vraag","heb","interesses",\
                                "ongeveer","ouders","vrijetijdsbesteding","beantwoorden","problemen","soms"])

Most of the words (25 of 40) in the two top usage lists are the same, showing that the language usage in the early mails is different from that in the other mails. In particular this is true for some of the relation words that we found in the comparison between all dropout and finisher mails (*kinderen* and *vrienden*). Interestingly, the dropout list contains several first person singular verbs (*ben*, *gebruik*, *eet* and *heb*) which, with the exception of *drink*, are not present in the finisher list. The dropout list still contains more relation words (*relatie*, *partner* and *ouders*) than the finisher list.
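The overlap between two top usage lists can be counted with a set intersection. The sketch below is a minimal illustration with invented scores; the helper `topTokens` is hypothetical and not part of the notebook.

```python
def topTokens(scores, n):
    # Return the n tokens with the highest scores
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [token for token, _ in ranked[:n]]

# Invented toy scores, not results from the Tactus data
toyScores1 = {"ben": 5.1, "heb": 4.0, "jaren": 3.2, "vaak": 1.0}
toyScores2 = {"jaren": 6.0, "vaak": 4.4, "top": 2.5, "ben": 0.3}

# Tokens that appear in the top three of both lists
shared = set(topTokens(toyScores1, 3)) & set(topTokens(toyScores2, 3))
print(len(shared), sorted(shared))
```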

A direct comparison between the two text collections reveals similar patterns: first person singular verbs (*heb*, *ben*, *drink* and *gebruik*) and relation words (*vriendin* and *zoontje*) among the top dropout words. A new topic is prominent among the finisher words: illnesses and physical problems (*kanker*, *pijn*, *chemo*, *arthritis*).

In [None]:
compareTscoreData(tscoreData1,tscoreData2,tscoreData2,tscoreData1,NBROFSHOW)

## Comparing the longest among the first four client mails: dropouts vs finishers

In [None]:
print("counts:","firstFourClientMailsCutoffDropout:",sum([wordCount(text) for text in firstFourClientMailsCutoffDropout]),\
                "tokens;",len(removeEmptyMails(firstFourClientMailsCutoffDropout)),"mails;", \
                "firstFourClientMailsCutoffFinisher:",sum([wordCount(text) for text in firstFourClientMailsCutoffFinisher]),\
                "tokens;",len(removeEmptyMails(firstFourClientMailsCutoffFinisher)),"mails" )
tscoreData1 = makeTscoreData(removeEmptyMails(firstFourClientMailsCutoffDropout))
tscoreData2 = makeTscoreData(removeEmptyMails(firstFourClientMailsCutoffFinisher))
tscoreData3 = makeTscoreData(removeEmptyMails(allMailsClients))
compareTscoreData(tscoreData1,tscoreData3,tscoreData2,tscoreData3,NBROFSHOW,coloredWords=[\
                 "(","was",".","kon","contacten","kreeg","depressie",",","geleden","vader","broer","ad","familie","per","ORG","sinds",\
                 "kinderen","“","s","Vraag","drink","eet","”","soms","ben","dagelijks","…","hobby","maanden","ongeveer","baan","weinig"])

When we examine only the longest (>= 1000 tokens) among the first four client mails, we see similar patterns: the top dropout words include several first person singular verbs (*drink*, *eet* and *ben*), but person relation words (*kinderen* and *vader*) are now more present among the top finisher words (*contacten*, *vader*, *broer* and *familie*).

## Comparing LIWC data

We have compared word usage in the mail texts with the t-score. Next we want to compare the frequencies of LIWC categories in the mail texts. 
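Before LIWC categories can be compared with the t-score, the per-mail values need to be turned into counts that can be summed over mails. The sketch below assumes that each mail stores a token count and a proportion per LIWC category, which is what the multiplication in `combineLiwcResults` suggests. The function `sketchCombine` and the toy mails are invented for illustration.

```python
# Invented toy mails: a token count plus a proportion per LIWC category
toyMails = [
    {"Number of matches": 100, "4 i": 0.10, "40 social": 0.05},
    {"Number of matches": 300, "4 i": 0.02, "40 social": 0.10},
]

def sketchCombine(mails, totalKey="Number of matches"):
    # Convert each proportion to an estimated count, then sum per category
    combined = {totalKey: 0}
    for mail in mails:
        combined[totalKey] += mail[totalKey]
        for field, value in mail.items():
            if field != totalKey:
                combined[field] = combined.get(field, 0) + value * mail[totalKey]
    return combined

toyCombined = sketchCombine(toyMails)
print(toyCombined)
```

Weighting by the token count means that long mails contribute more to the combined counts than short ones, which a plain average of the proportions would not do.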

In [None]:
TOTALTOKENS = "Number of matches"
NBROFNBRS = "number count"
EMPTYLIWCDICT = { TOTALTOKENS:0, NBROFNBRS:0 }

def isLiwcFeature(featureName):
    return(re.search(r"^[0-9]",str(featureName)))

def combineLiwcResults(mails):
    if len(mails) == 0: return(dict(EMPTYLIWCDICT))
    else:
        liwcResults = dict(EMPTYLIWCDICT)
        for i in range(0,len(mails)):
            liwcResults[TOTALTOKENS] += mails[i][TOTALTOKENS]
            liwcResults[NBROFNBRS] += mails[i][NBROFNBRS]*mails[i][TOTALTOKENS]
            for liwcField in mails[0].domain.variables:
                if isLiwcFeature(liwcField):
                    if not liwcField in liwcResults: 
                        liwcResults[liwcField] = mails[i][liwcField]*mails[i][TOTALTOKENS]
                    else:
                        liwcResults[liwcField] += mails[i][liwcField]*mails[i][TOTALTOKENS]
        return(liwcResults)
    
def extractLiwcScoresFrom(liwcResultsIn,sender):
    liwcResultsOut = {}
    for clientId in liwcResultsIn:
        mails = []
        for mail in liwcResultsIn[clientId]:
            if mail[FIELDNAMEFROM] == sender: mails.append(mail)
        liwcResultsOut[clientId] = combineLiwcResults(mails)
    return(liwcResultsOut)

def combineLiwcScores(liwcResultsIn):
    liwcResultsOut = dict(EMPTYLIWCDICT)
    for clientId in liwcResultsIn:
        liwcResultsOut[TOTALTOKENS] += liwcResultsIn[clientId][TOTALTOKENS]
        liwcResultsOut[NBROFNBRS] += liwcResultsIn[clientId][NBROFNBRS]
        for liwcField in liwcResultsIn[clientId].keys():
            if str(liwcField) in liwcResultsOut:
                liwcResultsOut[str(liwcField)] += liwcResultsIn[clientId][liwcField]
            else:
                liwcResultsOut[str(liwcField)] = liwcResultsIn[clientId][liwcField]
    return(liwcResultsOut)

def makeTscoreDataLiwc(liwcScores):
    tScoreData = { NBROFTOKENS:int(liwcScores[TOTALTOKENS]), NBROFTYPES:int(len(liwcScores)-2), NBROFGROUPS:int(len(liwcScores)-2), WORDFREQS:{} }
    for liwcField in liwcScores.keys():
        if isLiwcFeature(liwcField):
            tScoreData[WORDFREQS][liwcField] = int(liwcScores[liwcField])
    return(tScoreData)

## Comparing all LIWC data: dropouts vs finishers

In [None]:
liwcScoresClientDropout = combineLiwcScores(extractLiwcScoresFrom({clientId:allLiwcResults[clientId] for clientId in allLiwcResults if dropout[clientId] == CODEDROPOUT},CLIENT))
liwcScoresClientFinisher = combineLiwcScores(extractLiwcScoresFrom({clientId:allLiwcResults[clientId] for clientId in allLiwcResults if dropout[clientId] == CODEFINISHER},CLIENT))
liwcTscoresClientDropout = makeTscoreDataLiwc(liwcScoresClientDropout)
liwcTscoresClientFinisher = makeTscoreDataLiwc(liwcScoresClientFinisher)

In [None]:
NBROFSHOWLIWC = 37
print("counts:","liwcTscoresClientDropout:",liwcTscoresClientDropout[NBROFTOKENS],\
                "tokens;",
                "liwcTscoresClientFinisher:",liwcTscoresClientFinisher[NBROFTOKENS],\
                "tokens;")
compareTscoreData(liwcTscoresClientDropout,liwcTscoresClientFinisher,liwcTscoresClientFinisher,liwcTscoresClientDropout,NBROFSHOWLIWC,\
                 coloredWords=["41 family","42 friend","40 social","4 i","43 female","44 male","7 shehe","8 they","5 we","6 you"])

## Comparing LIWC data of the final counselor mails: dropouts vs finishers

In [None]:
def getLastLiwcScoresFrom(liwcResultsIn,sender):
    liwcResultsOut = {}
    for clientId in liwcResultsIn:
        for i in range(0,len(liwcResultsIn[clientId])):
            if liwcResultsIn[clientId][-1-i][FIELDNAMEFROM] == sender:
                liwcResultsOut[clientId] = combineLiwcResults([liwcResultsIn[clientId][-1-i]])
                break
    return(liwcResultsOut)

liwcScoresCounselorDropout = combineLiwcScores(\
                             getLastLiwcScoresFrom({clientId:allLiwcResults[clientId] for clientId in allLiwcResults if dropout[clientId] == CODEDROPOUT},COUNSELOR))
liwcScoresCounselorFinisher = combineLiwcScores(\
                              getLastLiwcScoresFrom({clientId:allLiwcResults[clientId] for clientId in allLiwcResults if dropout[clientId] == CODEFINISHER},COUNSELOR))
liwcScoresCounselor = combineLiwcScores(extractLiwcScoresFrom(allLiwcResults,COUNSELOR))
liwcTscoresCounselorDropout = makeTscoreDataLiwc(liwcScoresCounselorDropout)
liwcTscoresCounselorFinisher = makeTscoreDataLiwc(liwcScoresCounselorFinisher)
liwcTscoresCounselor = makeTscoreDataLiwc(liwcScoresCounselor)

NBROFSHOWLIWC = 37
print("counts:","liwcTscoresCounselorDropout:",liwcTscoresCounselorDropout[NBROFTOKENS],\
                "tokens;",
                "liwcTscoresCounselorFinisher:",liwcTscoresCounselorFinisher[NBROFTOKENS],\
                "tokens;")
compareTscoreData(liwcTscoresCounselorDropout,liwcTscoresCounselor,liwcTscoresCounselorFinisher,liwcTscoresCounselor,NBROFSHOWLIWC,\
                  coloredWords=["40 social","102 space","81 affiliation","62 hear","15 negate","56 differ","42 friend"])

In [None]:
SEP = "@"

tokenCounts = {}
searchKey = "social"

for token in {clientId:allLiwcResults[clientId] for clientId in allLiwcResults if dropout[clientId] == CODEDROPOUT}["AdB0001"][-1].metas[2].split():
    keys = token.split(SEP)
    for key in keys[1:]:
        if key == searchKey:
            if keys[0] in tokenCounts: tokenCounts[keys[0]] += 1
            else: tokenCounts[keys[0]] = 1
print(tokenCounts)

## Sanity check: male vs female words

In [None]:
import numpy as np
import os
import pandas as pd
import re
import sys
import xml.etree.ElementTree as ET
sys.path.append('/home/erikt/projects/e-mental-health/data-processing')
import tactus2table

DIRECTORY = "/home/erikt/projects/e-mental-health/usb/tmp/20190917/"
FILENAMEPREFIX = "^AdB"
TITLE = "0-title"
FIELDNAMEINTAKE = "Intake"
FIELDNAMEID = "0-id"
FIELDNAMEGENDER = "geslacht"
GZEXTENSION = ".gz"
XMLEXTENSION = ".xml"
MALE = "man"
FEMALE = "vrouw"
INTAKE = "Intake"

def shortenFileName(fileName):
    return(re.sub(XMLEXTENSION,"",re.sub(GZEXTENSION,"",fileName)))

In [None]:
genders = {}
for inFileName in os.listdir(DIRECTORY):
    if re.search(FILENAMEPREFIX,inFileName):
        root = tactus2table.readRootFromFile(DIRECTORY+"/"+inFileName)
        questionnaires = tactus2table.getQuestionnaires(root,inFileName)
        for questionnaire in questionnaires: 
            if questionnaire[TITLE] == INTAKE:
                for fieldName in questionnaire:
                    if re.search(FIELDNAMEGENDER,fieldName):
                        genders[shortenFileName(questionnaire[FIELDNAMEID])] = questionnaire[fieldName].lower()
                        break
                break

In [None]:
genderKeys = list(np.unique(np.array([genders[key] for key in genders])))
genderMailTexts = {}
for key in genderKeys: genderMailTexts[key] = {}
for result in range(0,len(mailTexts)):
    for fileId in mailTexts[result].keys():
        genderMailTexts[genders[fileId]][fileId] = mailTexts[result][fileId]
[(key,len(genderMailTexts[key])) for key in genderKeys]  

In [None]:
tscoreData1 = makeTscoreData(uniqueTextArray(genderMailTexts[MALE]))
tscoreData2 = makeTscoreData(uniqueTextArray(genderMailTexts[FEMALE]))
computeTscores(tscoreData1,tscoreData2)

## Get treatment years

In [None]:
FIELDNAMEDATE = "date"

endYears = []
startYears = []
treatmentYearList = {}
for clientId in allMails:
    startYear = str(allMails[clientId][0][FIELDNAMEDATE])[0:4]
    endYear = str(allMails[clientId][-1][FIELDNAMEDATE])[0:4]
    startYears.append(int(startYear))
    endYears.append(int(endYear))
    treatmentYears = startYear+"-"+endYear
    if treatmentYears in treatmentYearList: treatmentYearList[treatmentYears] += 1
    else: treatmentYearList[treatmentYears] = 1

years = {}
for i in range(0,len(startYears)):
    for year in range(startYears[i],endYears[i]+1):
        if year in years: years[year] += 1
        else: years[year] = 1
for year in sorted(years.keys()):
    print(years[year],year)
    
sum(years.values())

## Count mails of dropouts and completers 

In [None]:
import numpy as np
import math

CODECOMPLETER = "2"

def std(data,avg):
    total = 0
    for i in range(0,len(data)): total += (data[i]-avg)**2
    return(math.sqrt(total/len(data)))

nbrOfClientMailsDropouts = []
nbrOfClientMailsCompleters = []
for clientId in allMails:
    clientMailCounter = 0
    for i in range(0,len(allMails[clientId])):
        if allMails[clientId][i][FIELDNAMEFROM] == CLIENT: clientMailCounter += 1
    if dropout[clientId] == CODEDROPOUT: nbrOfClientMailsDropouts.append(clientMailCounter)
    if dropout[clientId] == CODECOMPLETER: nbrOfClientMailsCompleters.append(clientMailCounter)
        
print("dropouts: ","average:",round(np.mean(nbrOfClientMailsDropouts),1),"+-",round(np.std(nbrOfClientMailsDropouts),1),\
                   "  median:",round(np.median(nbrOfClientMailsDropouts),1),"+-",round(std(nbrOfClientMailsDropouts,np.median(nbrOfClientMailsDropouts)),1))
print("finishers:","average:",round(np.mean(nbrOfClientMailsCompleters),1),"+-",round(np.std(nbrOfClientMailsCompleters),1),\
                   "median:",round(np.median(nbrOfClientMailsCompleters),1),"+-",round(std(nbrOfClientMailsCompleters,np.median(nbrOfClientMailsCompleters)),1))

In [None]:
for d in nbrOfClientMailsDropouts: print(d,end=" ")

In [None]:
for f in nbrOfClientMailsCompleters: print(f,end=" ")

In [None]:
len(nbrOfClientMailsDropouts)