<center><h1>Mapping the emotional face. How individual face parts contribute to successful emotion recognition.</h1></center>

# 1. Reading Logfiles  

This notebook reads all logfiles from the experiment and converts them into [pandas](http://pandas.pydata.org/) dataframes. These tables are then saved as CSV files for re-use in the later notebooks, where the actual analysis happens.

<p>All acquired data (i.e. the logfiles that are imported in this notebook, located in the folder ./experiment/app/static/logfiles/) are licensed under a Creative Commons Public Domain Dedication (No Rights Reserved).<br><a rel="license" href="http://creativecommons.org/publicdomain/zero/1.0/"><img src="http://i.creativecommons.org/p/zero/1.0/88x31.png" alt="CC0" /></a></p>

## Import modules

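Modules and global variables needed in most or all notebooks are stored in the file myBasics.py. The wildcard import below is what makes names such as `os`, `fnmatch`, `np` and `pd` available in the cells that follow.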

In [1]:
from myBasics import *

Versions of modules used in this and the following notebooks:

In [2]:
# Sebastian Raschka; https://github.com/rasbt/watermark/blob/master/docs/watermark.ipynb
%load_ext watermark
%watermark -v -m -d -u -p python,numpy,scipy,pandas,scikit-learn,matplotlib,seaborn,pillow -g

last updated: 2016-03-11 

CPython 2.7.11
IPython 4.0.3

python 2.7.11
numpy 1.10.4
scipy 0.17.0
pandas 0.17.1
scikit-learn 0.17
matplotlib 1.5.1
seaborn 0.7.0
pillow 3.1.1

compiler   : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)
system     : Linux
release    : 3.13.0-79-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit
Git hash   : 40611ed2048f6dce426ab1d22dd072e6049cf005


## Get Logfiles

In [3]:
def getLogfile(whichFolder, whichExperiment):
    logList = []
    for filename in os.listdir(whichFolder):
        if fnmatch.fnmatch(filename, whichExperiment):
            logList.append(whichFolder+filename)
    return logList

In [4]:
myFolder = '../experiment/app/static/logfiles/'
myExperiment = 'log*.txt'
logList = getLogfile(myFolder,myExperiment)
logList.sort()

These are all the logfiles we have. Note that \#10 and \#67 are missing because those sessions were never started (e.g. the participant did not come to the appointment). These are not dropouts, and no data were unusable or had to be removed from the analyses.

In [5]:
i = 0
for log in logList:
    print i,':\t',log
    i+=1

0 :	../experiment/app/static/logfiles/logfile1.txt
1 :	../experiment/app/static/logfiles/logfile11.txt
2 :	../experiment/app/static/logfiles/logfile12.txt
3 :	../experiment/app/static/logfiles/logfile13.txt
4 :	../experiment/app/static/logfiles/logfile14.txt
5 :	../experiment/app/static/logfiles/logfile15.txt
6 :	../experiment/app/static/logfiles/logfile16.txt
7 :	../experiment/app/static/logfiles/logfile17.txt
8 :	../experiment/app/static/logfiles/logfile18.txt
9 :	../experiment/app/static/logfiles/logfile19.txt
10 :	../experiment/app/static/logfiles/logfile2.txt
11 :	../experiment/app/static/logfiles/logfile20.txt
12 :	../experiment/app/static/logfiles/logfile21.txt
13 :	../experiment/app/static/logfiles/logfile22.txt
14 :	../experiment/app/static/logfiles/logfile23.txt
15 :	../experiment/app/static/logfiles/logfile24.txt
16 :	../experiment/app/static/logfiles/logfile25.txt
17 :	../experiment/app/static/logfiles/logfile26.txt
18 :	../experiment/app/static/logfiles/logfile27.txt
19 :	
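Because `logList.sort()` compares the paths as strings, `logfile11.txt` is listed before `logfile2.txt`. If numeric order is preferred, the list can be sorted by participant number instead. The following is only a sketch: the helper name `participant_number` and the three example paths are illustrative and not part of the original notebook.

```python
import re


def participant_number(path):
    """Return the trailing number of a 'logfileNN.txt' path as an int."""
    match = re.search(r'(\d+)\.txt$', path)
    # Paths without a number are sorted to the front
    return int(match.group(1)) if match else -1


# Illustrative paths only, in the string order produced by list.sort()
example_paths = [
    '../experiment/app/static/logfiles/logfile1.txt',
    '../experiment/app/static/logfiles/logfile11.txt',
    '../experiment/app/static/logfiles/logfile2.txt',
]

# Sort by the numeric participant id instead of by string comparison
numeric_order = sorted(example_paths, key=participant_number)
```

The string order is harmless for the rest of this notebook, since every file is processed independently; numeric order mainly makes the printed listings easier to scan.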

Example of what the head of a logfile looks like:

In [6]:
for index,entry in enumerate(open(logList[-1],'r')):
    print entry
    if index > 6:
        break

####### THIS IS A LOGFILE FOR THE DYNAMIC MASKING FACE EXPERIMENT ######

Participant Number: 96

Date and Time: 2016-03-10 15:04:08.090637

age: 20 ,gender: 1 ,environ: 2 ,occup: 0 ,advert: 0

###################################################################



time	cumtime	express	ident	button	filename	evaluation	stopRT	choiceRT	maskNum	maskList

2016-03-10 15:04:53	0.0	2	1	ang	img/m_ang_cut.png	HIT	25384.0	34289.0	26	33-43-3-21-44-34-8-10-36-20-15-1-29-19-38-9-25-40-47-16-32-4-13-23-12-0



### Get information about age and gender

In [7]:
def getDemographics(logList):
    #empty dict to write to
    d = {}
    # loop through the logfiles of all participants
    for log in logList:
        # extract the participant number from the filename (e.g. '96' from 'logfile96.txt')
        logName = log[log.rfind('/logfile')+len('/logfile'):log.rfind('.')]
        # loop through the logfile's content
        for index,entry in enumerate(open(log,'r')):
            # the demographics are stored in the line with index 3 (i.e. the 4th line)
            if index ==3:
                # we get the contents of that line and split them into a list
                thisEntry = entry.split()
                # the key is the zero-padded participant number, e.g. 'p096'
                d['p'+('000'+logName)[-3:]] = {'age':int(thisEntry[1]) , 'gender':int(thisEntry[3])}
    
    # make a dataframe
    demographicsDf = pd.DataFrame(d).T
    # missing values (99) are turned into nans
    demographicsDf = demographicsDf.replace(99,np.nan)
    return demographicsDf
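`getDemographics` relies on the position of each value after splitting the line on whitespace. As a hedged alternative, the same line could be parsed by key, which would keep working if the order of the fields changed. This is a sketch only; `parse_demographics` is a hypothetical helper, and the sample line is copied from the logfile head shown above.

```python
import re


def parse_demographics(line):
    """Turn 'age: 20 ,gender: 1 ,...' into a dict of integer values."""
    # Each field is a word, a colon, optional whitespace and an integer
    pairs = re.findall(r'(\w+):\s*(-?\d+)', line)
    return {key: int(value) for key, value in pairs}


sample_line = 'age: 20 ,gender: 1 ,environ: 2 ,occup: 0 ,advert: 0'
demographics = parse_demographics(sample_line)
```

With this approach, `demographics['age']` and `demographics['gender']` replace the positional lookups `thisEntry[1]` and `thisEntry[3]`.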

In [8]:
demoDf = getDemographics(logList)

In [9]:
demoDf.head()

Unnamed: 0,age,gender
p001,24,0
p002,30,0
p003,25,0
p004,26,0
p005,28,1


In [10]:
demoDf.describe()

Unnamed: 0,age,gender
count,94.0,94.0
mean,23.595745,0.361702
std,3.370712,0.48307
min,18.0,0.0
25%,21.0,0.0
50%,24.0,0.0
75%,25.0,1.0
max,36.0,1.0


In [11]:
# 0 == female
# 1 == male
# 2 == other
# 99/NaN == no information
demoDf['gender'].value_counts()

0    60
1    34
Name: gender, dtype: int64
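For readability, the numeric codes can be mapped to the labels given in the comments of the cell above. A minimal sketch, assuming the coding listed there; `GENDER_LABELS`, `label_gender` and the toy series are illustrative names and data, not part of the original notebook.

```python
import pandas as pd

# Coding taken from the comments in the cell above
GENDER_LABELS = {0: 'female', 1: 'male', 2: 'other'}


def label_gender(codes):
    """Map numeric gender codes to text labels; unlisted codes become NaN."""
    return codes.map(GENDER_LABELS)


# Toy data only, not the real sample
toy_codes = pd.Series([0, 1, 0])
toy_labels = label_gender(toy_codes)
```

Calling `label_gender(demoDf['gender']).value_counts()` would then report the counts by label instead of by code.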

## Some meta-information and sanity checks

### Get the times when the first and last trials were shown

In [12]:
def getClocktimes(logFile):
    timing = []
    for entry in open(logFile,'r'):
        if 'img' in entry:
            timing.append(entry.split()[1])
    print logFile, '; started at %s ; finished at: %s' %(timing[0], timing[-1])

In [13]:
for logFile in logList:
    getClocktimes(logFile)

../experiment/app/static/logfiles/logfile1.txt ; started at 15:01:20 ; finished at: 16:02:43
../experiment/app/static/logfiles/logfile11.txt ; started at 11:52:16 ; finished at: 12:35:53
../experiment/app/static/logfiles/logfile12.txt ; started at 11:52:55 ; finished at: 13:05:16
../experiment/app/static/logfiles/logfile13.txt ; started at 14:08:20 ; finished at: 14:56:36
../experiment/app/static/logfiles/logfile14.txt ; started at 15:31:56 ; finished at: 16:26:33
../experiment/app/static/logfiles/logfile15.txt ; started at 15:31:41 ; finished at: 16:38:39
../experiment/app/static/logfiles/logfile16.txt ; started at 11:57:53 ; finished at: 12:44:36
../experiment/app/static/logfiles/logfile17.txt ; started at 12:00:49 ; finished at: 12:56:46
../experiment/app/static/logfiles/logfile18.txt ; started at 14:14:05 ; finished at: 15:25:08
../experiment/app/static/logfiles/logfile19.txt ; started at 14:32:17 ; finished at: 15:22:41
../experiment/app/static/logfiles/logfile2.txt ; started at 1
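The printed clock times can also be turned into a wall-clock duration per session. A minimal sketch, assuming that no session runs past midnight; `session_minutes` is a hypothetical helper, and the two times are taken from the first line of the output above.

```python
from datetime import datetime


def session_minutes(start, finish, fmt='%H:%M:%S'):
    """Minutes elapsed between two clock times on the same day."""
    delta = datetime.strptime(finish, fmt) - datetime.strptime(start, fmt)
    return round(delta.total_seconds() / 60.0, 2)


# First and last trial of logfile1.txt, as printed above
duration = session_minutes('15:01:20', '16:02:43')
```

This span runs from the first to the last logged trial, so it includes any pause between blocks and need not match the summed block lengths computed in the next section.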

### Get the length of each block and the cumulated time of blocks

In [14]:
def getBlocktimes(logfile):
    cumtime = []
    blocktime = []
    for entry in open(logfile,'r'):
        if 'img' in entry:
            cumtime.append( float(entry.split()[2]) )
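            # a drop in the cumulated time marks the start of a new block
            try:
                if cumtime[-2] > cumtime[-1]:
                    blocktime.append(round(cumtime[-2]/60,2))
            # the very first trial has no predecessor to compare with
            except IndexError:
                pass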
    blocktime.append(round(cumtime[-1]/60,2))
    print logfile[logfile.rfind('/')+1:], '; block lengths: %s ; total length: %smin' %(blocktime, sum(blocktime))
    return sum(blocktime)
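The logic of `getBlocktimes` is easier to follow on a plain list: `cumtime` restarts at the beginning of each block, so a value that is smaller than its predecessor marks a block boundary, and the value just before the reset is the length of the finished block. The sketch below reproduces that idea without file access; `block_lengths_minutes` and the toy values are assumptions made for illustration.

```python
def block_lengths_minutes(cumtimes):
    """Split a sequence of cumulated seconds into block lengths in minutes."""
    blocks = []
    # Compare every value with its predecessor
    for previous, current in zip(cumtimes, cumtimes[1:]):
        if current < previous:
            # The clock was reset, so 'previous' closed a block
            blocks.append(round(previous / 60.0, 2))
    # The last value closes the final block
    if cumtimes:
        blocks.append(round(cumtimes[-1] / 60.0, 2))
    return blocks


# Toy data: a 20-minute block followed by a 15-minute block
toy_cumtimes = [0.0, 600.0, 1200.0, 0.0, 300.0, 900.0]
toy_blocks = block_lengths_minutes(toy_cumtimes)
```

Note that, as in the original function, the length of a block is the onset time of its last trial, so the time spent on that final trial is not included.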

In [15]:
allTimes = []
for logFile in logList:
    allTimes.append( getBlocktimes(logFile) )

logfile1.txt ; block lengths: [35.99, 24.64] ; total length: 60.63min
logfile11.txt ; block lengths: [20.44, 22.21] ; total length: 42.65min
logfile12.txt ; block lengths: [38.39, 33.4] ; total length: 71.79min
logfile13.txt ; block lengths: [23.76, 23.57] ; total length: 47.33min
logfile14.txt ; block lengths: [31.64, 21.84] ; total length: 53.48min
logfile15.txt ; block lengths: [36.64, 28.59] ; total length: 65.23min
logfile16.txt ; block lengths: [25.37, 20.7] ; total length: 46.07min
logfile17.txt ; block lengths: [30.37, 24.59] ; total length: 54.96min
logfile18.txt ; block lengths: [36.83, 33.6] ; total length: 70.43min
logfile19.txt ; block lengths: [26.08, 23.18] ; total length: 49.26min
logfile2.txt ; block lengths: [20.65, 16.21] ; total length: 36.86min
logfile20.txt ; block lengths: [25.32, 19.44] ; total length: 44.76min
logfile21.txt ; block lengths: [24.72, 17.07] ; total length: 41.79min
logfile22.txt ; block lengths: [43.75, 15.84] ; total length: 59.59min
logfile23.t

### Summary statistics for time taken

In [16]:
pd.DataFrame( allTimes ).describe()

Unnamed: 0,0
count,94.0
mean,49.935638
std,9.202309
min,32.56
25%,43.3275
50%,48.755
75%,55.44
max,71.79


### Transform to Pandas dataFrame

In [17]:
def makePandas(filename):
    # we load the tab-separated logfile into pandas, skipping the 6 lines above the column names
    df = pd.read_csv(open(filename,'r'),
                skiprows=6,
                header=0,
                sep='\t')
    
    # the index is copied into a column, so we do not lose it when reindexing
    df['id'] = df.index
    # we sort the data frame by the values we want to use for the (multi)index
    df = df.sort_values(by='express')
    df = df.sort_values(by='ident')
    # we set which variables are the new multi-index
    df = df.set_index(['ident','express','id'],drop=False)
    # we rename the columns to avoid ambiguity with the index levels of the same name
    df.rename(columns={'ident': 'i', 'express': 'e','id':'#'}, inplace=True)
    ### ugly hack to make index hierarchical
    df = df.unstack(0).stack(1).unstack(0).stack(1).unstack(0).stack(1)
    
    return df
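A sorted hierarchical index can usually also be obtained with `set_index` followed by `sort_index`, without the chain of `unstack`/`stack` calls. The sketch below shows this on toy data; it is an assumption that this is an acceptable substitute, and it has not been checked to give output identical to `makePandas` (column order in particular may differ). `index_by_condition` and the toy frame are hypothetical.

```python
import pandas as pd


def index_by_condition(trials):
    """Index trials by (ident, express, id) and sort them hierarchically."""
    out = trials.copy()
    # Keep the running trial number as its own column
    out['id'] = range(len(out))
    # Use the three variables as index levels but keep them as columns too
    out = out.set_index(['ident', 'express', 'id'], drop=False)
    # Rename the columns so they do not clash with the index level names
    out = out.rename(columns={'ident': 'i', 'express': 'e', 'id': '#'})
    # Sort by all index levels at once
    return out.sort_index()


# Toy data only: four trials in presentation order
toy_trials = pd.DataFrame({
    'ident': [1, 0, 1, 0],
    'express': [2, 1, 0, 1],
    'button': ['a', 'b', 'c', 'd'],
})
toy_indexed = index_by_condition(toy_trials)
```

Sorting on the index rather than on the columns avoids depending on whether two successive `sort_values` calls preserve the earlier ordering.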

In [18]:
df = makePandas(logList[-1])

We have a multi-index with three levels: the identity of the face (male/female), the facial expression, and the trial number (id).

In [19]:
df.head()

ident,express,id,time,cumtime,e,i,button,filename,evaluation,stopRT,choiceRT,maskNum,maskList,#
0,0,9,2016-03-10 15:08:11,198.14952,0,0,hap,img/f_hap_cut.png,HIT,12275,16115,13,3-28-29-7-24-44-45-22-10-43-30-18-34,9
0,0,27,2016-03-10 15:15:00,607.15364,0,0,hap,img/f_hap_cut.png,HIT,19242,20994,20,40-15-0-37-1-27-45-46-44-23-18-43-22-3-41-29-1...,27
0,0,37,2016-03-10 15:18:36,823.46331,0,0,hap,img/f_hap_cut.png,HIT,16303,17847,17,9-44-23-35-39-14-1-40-46-19-28-20-26-37-17-21-30,37
0,0,43,2016-03-10 15:20:31,937.80766,0,0,hap,img/f_hap_cut.png,HIT,5062,7262,6,36-29-47-3-43-20,43
0,0,58,2016-03-10 15:23:36,1123.17332,0,0,hap,img/f_hap_cut.png,HIT,5377,6801,6,19-20-16-39-13-44,58


Given that there are 2 faces times 7 expressions times 16 repetitions, there should be 2*7*16=224 entries in each DataFrame.

In [20]:
for logFile in logList:
    thisDf = makePandas(logFile)
    assert len(list(thisDf.index)) == 2*7*16, "wrong number of entries in df %s" % logFile
    print logFile, '\tnumber of entries: ', len(list(thisDf.index))

../experiment/app/static/logfiles/logfile1.txt 	number of entries:  224
../experiment/app/static/logfiles/logfile11.txt 	number of entries:  224
../experiment/app/static/logfiles/logfile12.txt 	number of entries:  224
../experiment/app/static/logfiles/logfile13.txt 	number of entries:  224
../experiment/app/static/logfiles/logfile14.txt 	number of entries:  224
../experiment/app/static/logfiles/logfile15.txt 	number of entries:  224
../experiment/app/static/logfiles/logfile16.txt 	number of entries:  224
../experiment/app/static/logfiles/logfile17.txt 	number of entries:  224
../experiment/app/static/logfiles/logfile18.txt 	number of entries:  224
../experiment/app/static/logfiles/logfile19.txt 	number of entries:  224
../experiment/app/static/logfiles/logfile2.txt 	number of entries:  224
../experiment/app/static/logfiles/logfile20.txt 	number of entries:  224
../experiment/app/static/logfiles/logfile21.txt 	number of entries:  224
../experiment/app/static/logfiles/logfile22.txt 	numb
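The expected count of 224 follows directly from the design. As a small worked example, the conditions can be enumerated explicitly; the constant names below are illustrative, and the numeric ranges assume that identities and expressions are coded from 0 upwards, as in the tables shown in this notebook.

```python
from itertools import product

N_IDENTITIES = 2      # female and male face
N_EXPRESSIONS = 7     # seven facial expressions
N_BLOCKS = 2          # two blocks per session
REPS_PER_BLOCK = 8    # repetitions of each condition within a block

# Every combination of identity and expression is one condition
conditions = list(product(range(N_IDENTITIES), range(N_EXPRESSIONS)))

# Each condition is repeated 8 times in each of the 2 blocks
trials_per_condition = N_BLOCKS * REPS_PER_BLOCK
expected_trials = len(conditions) * trials_per_condition
```

This gives 14 conditions with 16 trials each, which is the number checked by the assertion in the cell above.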

### Save as csv

In [21]:
def saveCsv(filelist):
    for filename in filelist:
        df = makePandas(filename)
                
        # define a name for the csv file that is created
        pName = filename[filename.rfind('/logfile')+len('/logfile'):filename.rfind('.')]
        csvName = 'pandas_logfile'+ ('000'+pName)[-3:] + '.csv'
        
        # check if the number of trials is correct
        correctNumber = 7*2*8*2 # there are 7 emotions, 2 identities, 8 repetitions per block and 2 blocks (=224)
        
        # files with an unexpected number of trials are flagged in their filename
        if len(df.index.levels[-1]) != correctNumber:
            csvName = 'invalid_'+csvName
        
        print "...saving", csvName
        
        df.to_csv('../rawTables/'+csvName)
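The output name is built by slicing the participant number out of the path and left-padding it to three digits. A hedged sketch of the same naming rule using a regular expression and `str.zfill` is given below; `csv_name_for` is a hypothetical helper and is not used by `saveCsv`.

```python
import os
import re


def csv_name_for(log_path):
    """Build 'pandas_logfileNNN.csv' from a '.../logfileN.txt' path."""
    match = re.search(r'logfile(\d+)\.txt$', os.path.basename(log_path))
    if match is None:
        raise ValueError('not a logfile path: %s' % log_path)
    # zfill pads with leading zeros up to a width of three characters
    return 'pandas_logfile' + match.group(1).zfill(3) + '.csv'


first_name = csv_name_for('../experiment/app/static/logfiles/logfile1.txt')
```

One difference worth knowing: `('000'+pName)[-3:]` would cut a four-digit number down to its last three digits, whereas `zfill` leaves longer numbers intact. With participant numbers below 1000, both give the same result.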

In [22]:
saveCsv(logList)

...saving pandas_logfile001.csv
...saving pandas_logfile011.csv
...saving pandas_logfile012.csv
...saving pandas_logfile013.csv
...saving pandas_logfile014.csv
...saving pandas_logfile015.csv
...saving pandas_logfile016.csv
...saving pandas_logfile017.csv
...saving pandas_logfile018.csv
...saving pandas_logfile019.csv
...saving pandas_logfile002.csv
...saving pandas_logfile020.csv
...saving pandas_logfile021.csv
...saving pandas_logfile022.csv
...saving pandas_logfile023.csv
...saving pandas_logfile024.csv
...saving pandas_logfile025.csv
...saving pandas_logfile026.csv
...saving pandas_logfile027.csv
...saving pandas_logfile028.csv
...saving pandas_logfile029.csv
...saving pandas_logfile003.csv
...saving pandas_logfile030.csv
...saving pandas_logfile031.csv
...saving pandas_logfile032.csv
...saving pandas_logfile033.csv
...saving pandas_logfile034.csv
...saving pandas_logfile035.csv
...saving pandas_logfile036.csv
...saving pandas_logfile037.csv
...saving pandas_logfile038.csv
...savin

### Sort the list of logfiles in ascending order

In [23]:
pandasList = getLogfile('../rawTables/','pandas_*')
pandasList.sort()

Now, we have a df with the following multi-index:
- identity (0=female, 1=male)
- expression (0=happy, ... 6=neutral)
- id (ascending number as experiment progresses)

### Load the csv

In [24]:
df = pd.read_csv('../rawTables/pandas_logfile069.csv',index_col=[0,1,2],header=0)

Example:

In [25]:
df.head()

ident,express,id,time,cumtime,e,i,button,filename,evaluation,stopRT,choiceRT,maskNum,maskList,#
0,0,13,2016-01-19 10:07:51,255.73029,0,0,hap,img/f_hap_cut.png,HIT,5152,7296,6,42-44-21-13-33-15,13
0,0,26,2016-01-19 10:10:36,420.75442,0,0,hap,img/f_hap_cut.png,HIT,2742,4079,3,13-30-41,26
0,0,35,2016-01-19 10:12:38,542.74912,0,0,hap,img/f_hap_cut.png,HIT,17895,19008,18,40-46-3-47-32-5-41-0-12-17-19-14-22-34-24-8-2-21,35
0,0,44,2016-01-19 10:14:27,652.15954,0,0,hap,img/f_hap_cut.png,HIT,5905,7025,6,8-12-4-30-40-13,44
0,0,62,2016-01-19 10:18:11,875.79968,0,0,hap,img/f_hap_cut.png,HIT,7089,8969,8,22-36-35-25-39-2-30-15,62
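With this index, all trials of one condition can be pulled out in a single step. The sketch below uses `DataFrame.xs` on a toy frame; the values are made up for illustration, apart from the coding 0 = female and 0 = happy described above. `xs` is used here because the `.ix` indexer that appears elsewhere in this notebook was removed from later pandas versions.

```python
import pandas as pd

# Toy data only: four trials with the same three index levels as above
toy = pd.DataFrame({
    'ident': [0, 0, 0, 1],
    'express': [0, 0, 6, 0],
    'id': [13, 26, 5, 7],
    'stopRT': [5152, 2742, 4000, 3000],
}).set_index(['ident', 'express', 'id']).sort_index()

# All trials showing the female face (0) with a happy expression (0);
# the two selected levels are dropped, leaving the trial number as index
female_happy = toy.xs((0, 0), level=['ident', 'express'])

# All happy trials, regardless of the identity of the face
all_happy = toy.xs(0, level='express')
```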


### Check that the logfiles are not corrupted

In [26]:
for pandasFile in pandasList:
    print "checking logfile %s ..." % pandasFile
    # load each stored df into pandas
    thisDf = pd.read_csv(pandasFile,index_col=[0,1,2],header=0)
    #loop through identities
    for ident in thisDf.index.levels[0]:
        # loop through expressions
        for express in thisDf.index.levels[1]:

            # This is to double-check that each condition (expression of a particular face)
            # always has exactly 16 trials (there are two blocks of 8 trials per condition)
            assert len(thisDf.ix[ident].ix[express]) ==16 ,'trial numbers corrupted'

            # This is to double-check whether the number of revealed tiles in the
            # variable maskNum is equal to the number of items in the maskList.
            for entry in thisDf.ix[ident].ix[express].index:
                #print entry,df.ix[ident].ix[express].ix[entry]['maskNum'],len(df.ix[ident].ix[express].ix[entry]['maskList'].split('-'))
                assert thisDf.ix[ident].ix[express].ix[entry]['maskNum']==len(thisDf.ix[ident].ix[express].ix[entry]['maskList'].split('-')),'mask numbers corrupted'

checking logfile ../rawTables/pandas_logfile001.csv ...
checking logfile ../rawTables/pandas_logfile002.csv ...
checking logfile ../rawTables/pandas_logfile003.csv ...
checking logfile ../rawTables/pandas_logfile004.csv ...
checking logfile ../rawTables/pandas_logfile005.csv ...
checking logfile ../rawTables/pandas_logfile006.csv ...
checking logfile ../rawTables/pandas_logfile007.csv ...
checking logfile ../rawTables/pandas_logfile008.csv ...
checking logfile ../rawTables/pandas_logfile009.csv ...
checking logfile ../rawTables/pandas_logfile011.csv ...
checking logfile ../rawTables/pandas_logfile012.csv ...
checking logfile ../rawTables/pandas_logfile013.csv ...
checking logfile ../rawTables/pandas_logfile014.csv ...
checking logfile ../rawTables/pandas_logfile015.csv ...
checking logfile ../rawTables/pandas_logfile016.csv ...
checking logfile ../rawTables/pandas_logfile017.csv ...
checking logfile ../rawTables/pandas_logfile018.csv ...
checking logfile ../rawTables/pandas_logfile019.
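The two checks in the loop above can also be written without nested loops, using vectorised string methods and `groupby`. This is a sketch under the assumption that the columns are named `i`, `e`, `maskNum` and `maskList`, as in the saved tables; the function names and the toy frame are hypothetical.

```python
import pandas as pd


def find_mask_mismatches(trials):
    """Return the rows whose maskNum differs from the length of maskList."""
    # Count the tile numbers in each dash-separated maskList string
    n_listed = trials['maskList'].str.split('-').str.len()
    return trials[n_listed != trials['maskNum']]


def count_trials_per_condition(trials):
    """Number of trials for each combination of identity and expression."""
    return trials.groupby(['i', 'e']).size()


# Toy data only: the third row is deliberately inconsistent
toy_log = pd.DataFrame({
    'i': [0, 0, 1],
    'e': [0, 0, 3],
    'maskNum': [3, 6, 2],
    'maskList': ['13-30-41', '8-12-4-30-40-13', '5-7-9'],
})
bad_rows = find_mask_mismatches(toy_log)
counts = count_trials_per_condition(toy_log)
```

For a real table, an empty `bad_rows` and `(counts == 16).all()` would correspond to the two assertions in the loop.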