# Data preparation
This notebook gives a quick overview of the data preparation approach we are taking.

The approach we will take is to have two dataframes:

## Dataframe 1: Experiment-level meta data
The first is an experiment-level dataframe containing metadata about the experiment. Each row corresponds to one data file, with columns for the filename and for experiment-level information such as condition or other participant and experiment data.

## Dataframe 2: Raw behavioural data
All the behavioural data from the raw `.txt` or `.csv` files will be imported and appended into one large dataframe containing all trials from all observers.

First, some basic boilerplate setup code:

In [1]:
from glob import glob
import pandas as pd

Import the toolbox code:

In [2]:
# autoreload imported modules. Convenient while I'm developing the code.
%load_ext autoreload
%autoreload 2

In [3]:
from df_data import *

# Dataframe 1: Experiment-level meta data
First we create a list of all the raw data files we want to examine, then use that to create an initial experiment level dataframe.

In [4]:
files = glob('data/non_parametric/*.txt')
expt_data = pd.DataFrame({'filename': files})
expt_data

Unnamed: 0,filename
0,data/non_parametric/AH-gain-2016Oct31-12.53.txt
1,data/non_parametric/AH-loss-2016Oct31-12.42.txt
2,data/non_parametric/AJE-gain-2016Oct31-12.03.txt
3,data/non_parametric/AM-loss-2016Nov21-11.09.txt
4,data/non_parametric/AMB-gain-2016Nov24-09.50.txt
5,data/non_parametric/AVT-gain-2016Nov16-13.09.txt
6,data/non_parametric/AVT-loss-2016Nov16-12.56.txt
7,data/non_parametric/BRS-gain-2016Nov03-12.10.txt
8,data/non_parametric/CA-gain-2016Jun22-14.03.txt
9,data/non_parametric/CD-gain-2016Nov14-09.46.txt


Now we will parse the filenames to extract relevant information, and add this into new columns.

In [5]:
import os

def parse_filename(fname):
    """Extract participant initials from the provided filename."""
    # Drop the directory part, then take the text before the first hyphen
    file = os.path.basename(fname)
    initials = file.split('-')[0]
    return initials

In [6]:
expt_data['id'] = pd.Series([parse_filename(fname) for fname in files],
                           index=expt_data.index)
expt_data

Unnamed: 0,filename,id
0,data/non_parametric/AH-gain-2016Oct31-12.53.txt,AH
1,data/non_parametric/AH-loss-2016Oct31-12.42.txt,AH
2,data/non_parametric/AJE-gain-2016Oct31-12.03.txt,AJE
3,data/non_parametric/AM-loss-2016Nov21-11.09.txt,AM
4,data/non_parametric/AMB-gain-2016Nov24-09.50.txt,AMB
5,data/non_parametric/AVT-gain-2016Nov16-13.09.txt,AVT
6,data/non_parametric/AVT-loss-2016Nov16-12.56.txt,AVT
7,data/non_parametric/BRS-gain-2016Nov03-12.10.txt,BRS
8,data/non_parametric/CA-gain-2016Jun22-14.03.txt,CA
9,data/non_parametric/CD-gain-2016Nov14-09.46.txt,CD


The above method is what we'll do when we have experiment-level information encoded in the filenames of the raw data files.
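The same idea extends to the other fields in the filename. The sketch below assumes every file follows the `<initials>-<condition>-<date>-<time>.txt` layout seen in the listing above; `parse_filename_fields` is a hypothetical helper, not part of the toolbox.

```python
import os
import pandas as pd

def parse_filename_fields(fname):
    """Split a name like 'AH-gain-2016Oct31-12.53.txt' into labelled parts (hypothetical helper)."""
    # Remove the directory and the final extension, keeping e.g. 'AH-gain-2016Oct31-12.53'
    stem = os.path.splitext(os.path.basename(fname))[0]
    # Assumed layout: <initials>-<condition>-<date>-<time>; raises ValueError otherwise
    initials, condition, date, time = stem.split('-')
    return {'id': initials, 'condition': condition, 'date': date, 'time': time}

files = ['data/non_parametric/AH-gain-2016Oct31-12.53.txt',
         'data/non_parametric/AVT-loss-2016Nov16-12.56.txt']
expt_data = pd.DataFrame({'filename': files})

# Build one row of parsed fields per file and attach them as new columns
parsed = pd.DataFrame([parse_filename_fields(f) for f in expt_data['filename']],
                      index=expt_data.index)
expt_data = expt_data.join(parsed)
```

Unpacking into exactly four names means a file that does not match the assumed layout fails loudly instead of silently producing a wrong condition label.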

In some situations this will not be appropriate. For example, if we have many experimental measures, encoding them in filenames will be clunky, particularly if they are continuous-valued. In that case we would create an external experiment-level data file as a `.csv` and simply import it as a pandas dataframe. It should have the same basic structure as the dataframe above, i.e.:
- each row corresponds to an experiment.
- it _must_ have a filename column with the path to the raw data.
- any other additional experiment metadata can be encoded in other columns in the file.
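
As a minimal sketch of that alternative, the snippet below reads an experiment-level table with `pd.read_csv` and checks for the required `filename` column. The CSV contents, and the `some_measure` column in particular, are made up purely for illustration; in practice you would pass the path of your own `.csv` file instead of the in-memory text.

```python
import io
import pandas as pd

# Stand-in for an external experiment-level .csv file (illustrative contents only)
csv_text = """filename,id,condition,some_measure
data/non_parametric/AH-gain-2016Oct31-12.53.txt,AH,gain,0.5
data/non_parametric/AH-loss-2016Oct31-12.42.txt,AH,loss,1.25
"""
expt_data = pd.read_csv(io.StringIO(csv_text))

# The filename column is the one hard requirement, so check it up front
if 'filename' not in expt_data.columns:
    raise ValueError("experiment-level file needs a 'filename' column")
```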

# Dataframe 2: Raw behavioural data
The goal here is to import a list of user-specified raw data files and bundle them up into a pandas dataframe.

In [7]:
# We already have a list of filenames, but we can also extract it from expt_data
filenames = expt_data['filename']

raw_data = import_raw_data(filenames)
raw_data.head()

Unnamed: 0,A,DA,B,DB,R,file_index,filename
0,74,0,100,90,1,0,data/non_parametric/AH-gain-2016Oct31-12.53.txt
1,105,0,100,30,0,0,data/non_parametric/AH-gain-2016Oct31-12.53.txt
2,98,0,100,180,0,0,data/non_parametric/AH-gain-2016Oct31-12.53.txt
3,95,0,100,365,0,0,data/non_parametric/AH-gain-2016Oct31-12.53.txt
4,343,0,100,90,0,0,data/non_parametric/AH-gain-2016Oct31-12.53.txt
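
The toolbox's `import_raw_data` is not shown in this notebook. For orientation only, here is one way such a function could produce the `file_index` and `filename` columns seen above. `import_raw_data_sketch` is a hypothetical stand-in, and the comma-separated file format used in the demonstration is an assumption; other delimiters can be passed through to `pd.read_csv` as keyword arguments.

```python
import os
import tempfile
import pandas as pd

def import_raw_data_sketch(filenames, **read_kwargs):
    """Read each file and stack all trials, tagging rows with their source (hypothetical)."""
    frames = []
    for file_index, fname in enumerate(filenames):
        df = pd.read_csv(fname, **read_kwargs)
        df['file_index'] = file_index   # position of the file in the input list
        df['filename'] = fname          # key linking trials back to expt_data
        frames.append(df)
    # ignore_index gives one continuous trial index across all files
    return pd.concat(frames, ignore_index=True)

# Tiny demonstration with two throwaway comma-separated files
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name in ('P1-gain.txt', 'P2-loss.txt'):
        path = os.path.join(tmp, name)
        with open(path, 'w') as f:
            f.write('A,DA,B,DB,R\n74,0,100,90,1\n105,0,100,30,0\n')
        paths.append(path)
    demo = import_raw_data_sketch(paths)
```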


If we like, we can extract a list of unique participant names:

    participant_list = list(expt_data.id.unique())
    print(participant_list)
    
Otherwise, we are now free to use `raw_data` and `expt_data` to pass into the `fit` function in order to do parameter estimation and model comparison. This is outlined in other notebooks.
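
For ad hoc inspection it can also be handy to attach the experiment-level columns to every trial. This is not required by `fit`, which takes the two dataframes separately; it is only a sketch of how they link through the shared `filename` column, using small made-up frames.

```python
import pandas as pd

# Small stand-ins for the two dataframes (illustrative values only)
expt_data = pd.DataFrame({'filename': ['a.txt', 'b.txt'], 'id': ['AH', 'AVT']})
raw_data = pd.DataFrame({'R': [1, 0, 0],
                         'filename': ['a.txt', 'a.txt', 'b.txt']})

# Many trials map to one experiment row; validate raises if filenames repeat in expt_data
trials = raw_data.merge(expt_data, on='filename', how='left', validate='many_to_one')
```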