## Lab Report 02: Semiotics Data

This should be enough to get you started on analyzing your data. Below, I have done a bit of cleaning and organizing, but the script starts by loading all the raw data, so you have access to that as well.

I have not aggregated the data this time; that is, you have every data point for every participant. So if you want, for example, a mean value for each participant in each condition, you will need to compute it yourself. For hypothesis testing you will want mean values, but it may also be useful to check whether a specific word or nonword pair is an outlier.

In [1]:
import pandas as pd


In [50]:
# load the raw data and inspect

df_sem_raw = pd.read_csv("https://raw.githubusercontent.com/ethanweed/ExPsyLing/master/datasets/Lexical-decision/2021/semiotics_2021_raw.csv")
df_sem_raw.head()

Unnamed: 0,block,browser_codename,browser_name,browser_version,condition,correct_response,date_startdate,date_startdateUTC,date_starttime,experiment_debug,...,screen_availableHeight,screen_availableWidth,screen_colorDepth,screen_height,screen_pixelDepth,screen_screenX,screen_screenY,screen_width,stim,system_os
0,practice,Mozilla,Netscape,5.0 (X11; CrOS x86_64 14150.64.0) AppleWebKit/...,Nonword,RIGHT,03-10-21,03-10-21,12:33:37,0,...,720,1366,24,768,24,10,0,1366,ip-bown,Linux x86_64
1,practice,Mozilla,Netscape,5.0 (X11; CrOS x86_64 14150.64.0) AppleWebKit/...,Nonword,RIGHT,03-10-21,03-10-21,12:33:37,0,...,720,1366,24,768,24,10,0,1366,hort-sain,Linux x86_64
2,practice,Mozilla,Netscape,5.0 (X11; CrOS x86_64 14150.64.0) AppleWebKit/...,Unrelated,LEFT,03-10-21,03-10-21,12:33:37,0,...,720,1366,24,768,24,10,0,1366,sand-pepper,Linux x86_64
3,practice,Mozilla,Netscape,5.0 (X11; CrOS x86_64 14150.64.0) AppleWebKit/...,Related,LEFT,03-10-21,03-10-21,12:33:37,0,...,720,1366,24,768,24,10,0,1366,table-chair,Linux x86_64
4,practice,Mozilla,Netscape,5.0 (X11; CrOS x86_64 14150.64.0) AppleWebKit/...,Unrelated,LEFT,03-10-21,03-10-21,12:33:37,0,...,720,1366,24,768,24,10,0,1366,shark-dull,Linux x86_64


In [51]:
# look at all the column headers, to see what is in this data set
df_sem_raw.columns

Index(['block', 'browser_codename', 'browser_name', 'browser_version',
       'condition', 'correct_response', 'date_startdate', 'date_startdateUTC',
       'date_starttime', 'experiment_debug', 'experiment_parameters',
       'experiment_pilot', 'experiment_taskname', 'experiment_taskversion',
       'jatosStudyResultId', 'jatosVersion', 'queryParams_batchId',
       'queryParams_generalMultiple', 'response', 'response_time',
       'screen_availableHeight', 'screen_availableWidth', 'screen_colorDepth',
       'screen_height', 'screen_pixelDepth', 'screen_screenX',
       'screen_screenY', 'screen_width', 'stim', 'system_os'],
      dtype='object')

In [52]:
# make a new dataframe, with just the variables we need

df_sem = pd.DataFrame(
    {'participantID': df_sem_raw['jatosStudyResultId'],
     'block': df_sem_raw['block'],
     'condition': df_sem_raw['condition'],
     'stimulus': df_sem_raw['stim'],
     'correct_response': df_sem_raw['correct_response'],
     'response': df_sem_raw['response'],
     'rt': df_sem_raw['response_time']
    }) 
df_sem.head()

Unnamed: 0,participantID,block,condition,stimulus,correct_response,response,rt
0,239,practice,Nonword,ip-bown,RIGHT,right,4062
1,239,practice,Nonword,hort-sain,RIGHT,right,1684
2,239,practice,Unrelated,sand-pepper,LEFT,left,1686
3,239,practice,Related,table-chair,LEFT,left,908
4,239,practice,Unrelated,shark-dull,LEFT,left,2493


In [53]:
# make data in correct_response column lowercase to match response column

df_sem['correct_response'] = df_sem['correct_response'].str.lower()
df_sem.head()

Unnamed: 0,participantID,block,condition,stimulus,correct_response,response,rt
0,239,practice,Nonword,ip-bown,right,right,4062
1,239,practice,Nonword,hort-sain,right,right,1684
2,239,practice,Unrelated,sand-pepper,left,left,1686
3,239,practice,Related,table-chair,left,left,908
4,239,practice,Unrelated,shark-dull,left,left,2493


In [54]:
# make a new column that codes whether or not the response was correct

# make a list with "True" if the actual response was the same as the correct response,
# or "False" if it was different.
correct = list(df_sem['correct_response'] == df_sem['response'])

# make a new column in the dataframe with the True/False correct response values
df_sem.insert(loc = 6, column = 'correct', value = correct)

df_sem.head()

Unnamed: 0,participantID,block,condition,stimulus,correct_response,response,correct,rt
0,239,practice,Nonword,ip-bown,right,right,True,4062
1,239,practice,Nonword,hort-sain,right,right,True,1684
2,239,practice,Unrelated,sand-pepper,left,left,True,1686
3,239,practice,Related,table-chair,left,left,True,908
4,239,practice,Unrelated,shark-dull,left,left,True,2493


In [55]:
# Add a new column "group" that codes whether the data come from semiotics students or linguistics students

df_sem = df_sem.assign(group='semiotics')
df_sem.head()

Unnamed: 0,participantID,block,condition,stimulus,correct_response,response,correct,rt,group
0,239,practice,Nonword,ip-bown,right,right,True,4062,semiotics
1,239,practice,Nonword,hort-sain,right,right,True,1684,semiotics
2,239,practice,Unrelated,sand-pepper,left,left,True,1686,semiotics
3,239,practice,Related,table-chair,left,left,True,908,semiotics
4,239,practice,Unrelated,shark-dull,left,left,True,2493,semiotics




Now you should have all the data you need. As you can see below, there are four conditions: Nonword, Unrelated, Related, and Filler. The related words are _semantically_ related, e.g. "table-chair", while the unrelated words are (hopefully) not so related, e.g. "sand-pepper". 

There are also four different blocks: practice, A, B, and C. All participants got the same practice block, but each participant was randomly assigned to one of the other blocks. Thus, A, B, and C are all "experiment" blocks, but they contain different sets of word or nonword pairs, although the Filler pairs were the same for everybody. Hopefully, if I set it up right, no matter which block a participant was in, they should have had about the same number of trials in each condition as participants in the other blocks.

`participantID` is taken from a value in the server output, `jatosStudyResultId`; the number itself doesn't mean anything, it is just an identifier.
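If you want to check for yourself whether the blocks really are balanced, one possible approach is to count trials per condition for each participant. The sketch below is only an illustration: `toy` is a small made-up dataframe that borrows the column names of `df_sem`, and `trials_per_condition` is a hypothetical helper name, not part of the script above.

```python
import pandas as pd

# Toy stand-in for df_sem: same column names, but made-up rows.
toy = pd.DataFrame({
    'participantID': [1, 1, 1, 1, 2, 2, 2, 2],
    'block': ['practice', 'A', 'A', 'A', 'practice', 'B', 'B', 'B'],
    'condition': ['Nonword', 'Related', 'Unrelated', 'Related',
                  'Nonword', 'Related', 'Unrelated', 'Nonword'],
})

def trials_per_condition(df):
    """Count experiment trials per block, participant, and condition."""
    # drop the practice block, which everybody got
    experiment = df[df['block'] != 'practice']
    # rows: (block, participantID); columns: condition; cells: number of trials
    return pd.crosstab([experiment['block'], experiment['participantID']],
                       experiment['condition'])

counts = trials_per_condition(toy)
print(counts)
```

With the real data, you would pass `df_sem` instead of `toy` and look for rows where the counts differ a lot from the rest.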

In [70]:
print("conditions:", df_sem['condition'].unique())
print("blocks:", df_sem['block'].unique())
df_sem['correct'].value_counts()

conditions: ['Nonword' 'Unrelated' 'Related' 'Filler']
blocks: ['practice' 'C' 'B' 'A']


True     993
False     93
Name: correct, dtype: int64
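
As mentioned at the top, you will need to aggregate the data yourself. Here is a minimal sketch of one way to get a mean reaction time for each participant in each condition, and a mean for each stimulus pair so you can look for odd items. It runs on a made-up dataframe (`toy`) with the same column names as `df_sem`; the helper names `participant_means` and `stimulus_means` are hypothetical, and dropping the practice block and the incorrect responses is an assumption about how you might want to analyze the data, not a requirement.

```python
import pandas as pd

# Toy stand-in for df_sem: same column names, but made-up rows and stimulus labels.
toy = pd.DataFrame({
    'participantID': [1, 1, 1, 1, 1, 2, 2, 2, 2],
    'block': ['practice', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'condition': ['Related', 'Related', 'Related', 'Unrelated', 'Unrelated',
                  'Related', 'Related', 'Unrelated', 'Unrelated'],
    'stimulus': ['table-chair', 'pair-1', 'pair-2', 'pair-3', 'pair-4',
                 'pair-5', 'pair-6', 'pair-7', 'pair-8'],
    'correct': [True, True, True, True, False, True, True, True, True],
    'rt': [900, 600, 700, 800, 3000, 500, 700, 900, 1100],
})

def participant_means(df, correct_only=True):
    """Mean rt for each participant in each condition (practice block excluded)."""
    trials = df[df['block'] != 'practice']
    if correct_only:
        # assumption: only correct responses go into the mean
        trials = trials[trials['correct']]
    return trials.groupby(['participantID', 'condition'], as_index=False)['rt'].mean()

def stimulus_means(df):
    """Mean rt for each stimulus pair, slowest first, to help spot outlier items."""
    trials = df[df['block'] != 'practice']
    return trials.groupby('stimulus')['rt'].mean().sort_values(ascending=False)

means = participant_means(toy)

# one row per participant, one column per condition
wide = means.pivot(index='participantID', columns='condition', values='rt')
print(wide)

by_item = stimulus_means(toy)
print(by_item.head())
```

The wide table has one row per participant, which is the shape most paired tests expect. Whether to keep or drop incorrect responses is up to you; set `correct_only=False` to keep them.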