# Concatenate Raw Data

PsyToolkit's data management appears unreliable, so all downloaded data should be stored both locally and on Git rather than assuming the PsyToolkit portal will retain these files.

This notebook processes the downloaded data files in three steps:

- Read all subfiles
- Identify unique participant codes (experiments)
- Return a table that maps to every unique experiment
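
The steps above can be sketched roughly as follows. This is a minimal illustration on a small in-memory frame, not the notebook's actual pipeline: the helper name `unique_experiments` is hypothetical, and the `participant` and `path` columns are assumed to match those shown in the outputs of this notebook.

```python
import pandas as pd

def unique_experiments(df, key="participant"):
    """Return one row per unique value of `key`.

    Hypothetical helper: `key` and the `path` column are assumptions.
    first_path is the first folder the key was seen in,
    n_copies is how many rows share that key.
    """
    return (
        df.groupby(key, sort=False)["path"]
          .agg(first_path="first", n_copies="count")
          .reset_index()
    )

# Toy data standing in for the concatenated downloads
toy = pd.DataFrame({
    "participant": ["s.a.txt", "s.b.txt", "s.a.txt"],
    "path": ["data 16 May", "data 16 May", "data 17 May"],
})
mapping = unique_experiments(toy)
print(mapping)
```

Grouping on the file name rather than on the participant code is a design choice made only for this sketch; either column could serve as the key.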


This notebook is also used to examine the distribution of completion times in order to estimate costs on Amazon Mechanical Turk.
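
As a rough sketch of how completion times might feed a cost estimate, the snippet below summarises a series of durations and converts the median into a total payment. The function `estimate_cost`, the values in `toy_minutes`, `hourly_rate` and `fee_fraction` are all hypothetical placeholders, and `TIME_total` is assumed to be measured in minutes.

```python
import pandas as pd

def estimate_cost(minutes, hourly_rate, n_participants, fee_fraction=0.0):
    """Estimate total payment from the median completion time.

    All monetary parameters are hypothetical inputs, not real platform rates.
    """
    minutes = pd.Series(minutes, dtype="float64").dropna()
    median_minutes = minutes.median()
    per_participant = hourly_rate * median_minutes / 60
    return per_participant * n_participants * (1 + fee_fraction)

toy_minutes = [19.0, None, None, 18.0]   # incomplete sessions have no total
print(pd.Series(toy_minutes, dtype="float64").describe())
print(estimate_cost(toy_minutes, hourly_rate=12.0, n_participants=100))
```

The median is used instead of the mean so that a few very long sessions do not dominate the estimate; that choice is also an assumption.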


```
author: Zach Wolpe
email:  zachcolinwolpe@gmail.com
```

In [1]:
import pandas as pd
import os

# Identify Unique Paths

Here we identify all unique experiments and produce a dataframe that maps to them.

In [134]:
path     = '../data/all data files'
inc_beta = False

# Collect one frame per download folder and concatenate once at the end
# (DataFrame.append was removed in pandas 2.0).
frames = []; frames_times = []
for d in os.listdir(path):
    is_beta = 'beta' in d
    # skip macOS metadata, and beta folders unless inc_beta is set
    if '.DS_Store' in d or (is_beta and not inc_beta):
        continue
    folder     = os.path.join(path, d)
    data       = pd.read_csv(os.path.join(folder, 'data.csv'),       index_col=False)
    data_times = pd.read_csv(os.path.join(folder, 'data_times.csv'), index_col=False)
    # --- add columns ---x
    for df in (data, data_times):
        df['path'] = folder
        df['beta'] = is_beta
    frames.append(data)
    frames_times.append(data_times)

final_data       = pd.concat(frames,       ignore_index=True)
final_data_times = pd.concat(frames_times, ignore_index=True)

final_data.head()
final_data_times.head()

Unnamed: 0,participant,participant_code,Welcome_Screen,wcst_task,n_back_task,corsi_block_span_task,fitts_law,navon_task,INFOSCREEN,TIME_start,TIME_end,TIME_total,path,beta
0,s.32ff642a-efe0-436f-8075-fa703d677fed.txt,,,,,,,,0,2021-05-01-15-50,2021-05-01-16-09,19.0,../data/all data files/data 17 May,False
1,s.88be81bb-b23f-4c34-b7ad-18cce2be5cb9.txt,,,,,,,,0,2021-05-13-15-43,,,../data/all data files/data 17 May,False
2,s.c6ad6698-4f59-4753-9ae9-6f8d562c5fe1.txt,,,,,,,,0,2021-05-12-15-03,,,../data/all data files/data 17 May,False
3,s.d3b74af9-3b24-4820-83a0-67986b3ec0bf.txt,,,,,,,,0,2021-05-10-06-39,2021-05-10-06-57,18.0,../data/all data files/data 17 May,False
4,s.32ff642a-efe0-436f-8075-fa703d677fed.txt,,,,,,,,0,2021-05-01-15-50,2021-05-01-16-09,19.0,../data/all data files/data 16 May,False
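
The `TIME_start` and `TIME_end` strings in this output appear to follow a year-month-day-hour-minute pattern, and `TIME_total` looks like the difference in minutes. A minimal sketch for checking that, assuming the format string below is correct (it is inferred from the values shown, not documented here) and using a hypothetical helper `elapsed_minutes`:

```python
import pandas as pd

TIME_FORMAT = "%Y-%m-%d-%H-%M"  # assumed from values such as 2021-05-01-15-50

def elapsed_minutes(df):
    """Recompute session length in minutes from TIME_start and TIME_end."""
    start = pd.to_datetime(df["TIME_start"], format=TIME_FORMAT)
    end   = pd.to_datetime(df["TIME_end"],   format=TIME_FORMAT)
    return (end - start).dt.total_seconds() / 60

toy = pd.DataFrame({
    "TIME_start": ["2021-05-01-15-50", "2021-05-13-15-43"],
    "TIME_end":   ["2021-05-01-16-09", None],   # second session never finished
})
toy["minutes"] = elapsed_minutes(toy)
print(toy)
```

Sessions without a `TIME_end` come out as missing values, which mirrors the empty `TIME_total` cells for incomplete participants.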


# Identify Duplicates

In [129]:
# ---- remove incomplete files ----x
# Each frame is filtered with its own mask: a mask built from final_data is not
# aligned with the index of final_data_times and raises an IndexingError.
final_data       = final_data.loc[final_data['TIME_total'].notnull()]
final_data_times = final_data_times.loc[final_data_times['TIME_total'].notnull()]

In [104]:

# participant codes that occur more than once
dup_mask = final_data['participant_code:1'].duplicated()
final_data.loc[dup_mask, 'participant_code:1']

# 4     851366
# 5     904653
# 6     429398
# 7     490901
# 8     851366
# 9     904653
# 10    429398
# 11    490901
# 12    851366
# 13    904653
# 14    429398
# 15    490901
# 23    851366
# 24    429398
# 25    490901
# inspect every row recorded for one of the duplicated codes
final_data.loc[final_data['participant_code:1'] == 851366]
# count the duplicated values in each column
for c in final_data.columns:
    print(c)
    print(final_data[c].duplicated().sum())
    print('-----')

participant
4
-----
participant_code:1
4
-----
Welcome_Screen:1
4
-----
wcst_task:1
5
-----
n_back_task:1
5
-----
corsi_block_span_task:1
5
-----
fitts_law:1
5
-----
navon_task:1
5
-----
TIME_start
4
-----
TIME_end
5
-----
TIME_total
5
-----
path
6
-----
beta
7
-----
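
Because each download folder appears to be a cumulative snapshot, the same participant file shows up in several folders. One possible way to collapse these copies, sketched here on toy data, is `DataFrame.drop_duplicates` on the participant column. The choice of `participant` as the key and of `keep="first"` are assumptions for illustration, not decisions taken in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "participant": ["s.a.txt", "s.b.txt", "s.a.txt", "s.b.txt"],
    "TIME_total":  [19.0, None, 19.0, 18.0],
    "path":        ["data 16 May", "data 16 May", "data 17 May", "data 17 May"],
})

# Drop incomplete sessions first, then keep one row per participant.
complete = toy.loc[toy["TIME_total"].notnull()]
deduped  = (
    complete.drop_duplicates(subset="participant", keep="first")
            .reset_index(drop=True)
)
print(deduped)
```

Filtering before deduplicating matters: a participant whose earliest snapshot is incomplete would otherwise lose the later, completed row.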


In [94]:
final_data.shape
final_data_times
data_times

Unnamed: 0,participant,participant_code,Welcome_Screen,wcst_task,n_back_task,corsi_block_span_task,fitts_law,navon_task,INFOSCREEN,TIME_start,TIME_end,TIME_total,path,beta
0,s.32ff642a-efe0-436f-8075-fa703d677fed.txt,,,,,,,,0,2021-05-01-15-50,2021-05-01-16-09,19.0,../data/all data files/data 12 May (beta),True
1,s.c6ad6698-4f59-4753-9ae9-6f8d562c5fe1.txt,,,,,,,,0,2021-05-12-15-03,,,../data/all data files/data 12 May (beta),True
2,s.d3b74af9-3b24-4820-83a0-67986b3ec0bf.txt,,,,,,,,0,2021-05-10-06-39,2021-05-10-06-57,18.0,../data/all data files/data 12 May (beta),True
