<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#NB-description" data-toc-modified-id="NB-description-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>NB description</a></span></li><li><span><a href="#The-dotsPositions.csv-data" data-toc-modified-id="The-dotsPositions.csv-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>The dotsPositions.csv data</a></span><ul class="toc-item"><li><span><a href="#Mapping-dots-to-trials" data-toc-modified-id="Mapping-dots-to-trials-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Mapping dots to trials</a></span></li></ul></li><li><span><a href="#Write-a-dotsDB-HDF5-file" data-toc-modified-id="Write-a-dotsDB-HDF5-file-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Write a dotsDB HDF5 file</a></span><ul class="toc-item"><li><span><a href="#Write-HDF5-file" data-toc-modified-id="Write-HDF5-file-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Write HDF5 file</a></span></li></ul></li></ul></div>

# NB description
date: 03 Dec 2019  
This notebook contains code that:
- builds an HDF5 dotsDB database off of dotsPositions.csv files from the Fall 2019 data (subjects 10-13)

In [1]:
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pprint
import seaborn as sns
import h5py     
import os.path

# add location of custom modules to path
sys.path.insert(0,'../modules/')
sys.path.insert(0,'../modules/dots_db/dotsDB/')

# custom modules
import dotsDB as ddb
import motionenergy as kiani_me
import stimulus as stim
import ME_functions as my_me

# The dotsPositions.csv data
With the current pipeline, on the day of the session, a `.csv` file is written to disk with the FIRA data. Then, in the `motion_energy_Adrian` repo, I have MATLAB functions `reproduce_dots` and `batch_reproduce_dots` that write `_dotsPositions.csv` files to disk (one file per session).

The first step here is to loop through the completed sessions, perform a `join` of the dots and fira data, and update a global `.csv` file, as well as session-specific `.csv` files, called `labeled_dots_<timestamp>.csv`.

In [2]:
!find /home/adrian/SingleCP_DotsReversal/Fall2019/raw -name "*dotsPositions.csv" -print

/home/adrian/SingleCP_DotsReversal/Fall2019/raw/2019_11_06_12_43/2019_11_06_12_43_dotsPositions.csv
/home/adrian/SingleCP_DotsReversal/Fall2019/raw/2019_11_26_13_11/2019_11_26_13_11_dotsPositions.csv
/home/adrian/SingleCP_DotsReversal/Fall2019/raw/2019_11_05_16_19/2019_11_05_16_19_dotsPositions.csv
/home/adrian/SingleCP_DotsReversal/Fall2019/raw/2019_11_20_15_34/2019_11_20_15_34_dotsPositions.csv
/home/adrian/SingleCP_DotsReversal/Fall2019/raw/2019_11_05_10_27/2019_11_05_10_27_dotsPositions.csv
/home/adrian/SingleCP_DotsReversal/Fall2019/raw/2019_11_19_13_15/2019_11_19_13_15_dotsPositions.csv
/home/adrian/SingleCP_DotsReversal/Fall2019/raw/2019_11_05_13_18/2019_11_05_13_18_dotsPositions.csv
/home/adrian/SingleCP_DotsReversal/Fall2019/raw/2019_11_25_16_12/2019_11_25_16_12_dotsPositions.csv


In [3]:
!find /home/adrian/SingleCP_DotsReversal/Fall2019/raw -name "completed*100*.csv" -print

/home/adrian/SingleCP_DotsReversal/Fall2019/raw/2019_11_06_12_43/completed4AFCtrials_task100_date_2019_11_06_12_43.csv
/home/adrian/SingleCP_DotsReversal/Fall2019/raw/2019_11_26_13_11/completed4AFCtrials_task100_date_2019_11_26_13_11.csv
/home/adrian/SingleCP_DotsReversal/Fall2019/raw/2019_11_05_16_19/completed4AFCtrials_task100_date_2019_11_05_16_19.csv
/home/adrian/SingleCP_DotsReversal/Fall2019/raw/2019_11_20_15_34/completed4AFCtrials_task100_date_2019_11_20_15_34.csv
/home/adrian/SingleCP_DotsReversal/Fall2019/raw/2019_11_05_10_27/completed4AFCtrials_task100_date_2019_11_05_10_27.csv
/home/adrian/SingleCP_DotsReversal/Fall2019/raw/2019_11_19_13_15/completed4AFCtrials_task100_date_2019_11_19_13_15.csv
/home/adrian/SingleCP_DotsReversal/Fall2019/raw/2019_11_05_13_18/completed4AFCtrials_task100_date_2019_11_05_13_18.csv
/home/adrian/SingleCP_DotsReversal/Fall2019/raw/2019_11_25_16_12/completed4AFCtrials_task100_date_2019_11_25_16_12.csv


In [None]:
def label_dots(timestamps, global_labeled_dots_filename, data_folder):
    """
    fetches dots data outputted by MATLAB (the _dotsPositions.csv files) for specified session timestamps, adds
    relevant fira data (join operation) and appends resulting 'labeled_dots' dataframe to the 
    global_labeled_dots_filename.
    :param timestamps: list or tuple of strings of the form '2019_11_05_16_19'
    :param global_labeled_dots_filename: string with full path and filename for global .csv file to write to
    :param data_folder: string with path to folder '.../raw/' where fira and dotsPositions .csv data reside.
    :return: None, but writes to file
    """
    list_of_labeled_dots_dataframes = []
    for ts in timestamps:
        folder = data_folder + ts + '/'
        fira = pd.read_csv(folder + 'completed4AFCtrials_task100_date_' + ts + '.csv')
        dots = pd.read_csv(folder + ts + '_dotsPositions.csv')
        dots = dots[dots['isActive'] == 1]
        del dots['isActive'], dots['taskID'], dots['isCoherent']
        try:
            assert fira.index.min() == 0 and fira.index.max() == 819 and len(fira.index) == 820
            assert dots['trialIx'].min() == 0 and dots['trialIx'].max() == 819
        except AssertionError:
            print(f'assert failed with timestamp {ts}')
            continue
        labeled_dots = dots.join(fira, on="trialIx")
        labeled_dots['trueVD'] = labeled_dots['dotsOff'] - labeled_dots['dotsOn']
        labeled_dots['presenceCP'] = labeled_dots['reversal'] > 0
        to_drop = ['trialIndex', 'RT', 'cpRT', 'dirCorrect', 'cpCorrect', 
            'randSeedBase', 'fixationOn', 'fixationStart', 'targetOn',
            'choiceTime', 'cpChoiceTime', 'blankScreen', 'feedbackOn', 
            'cpScreenOn', 'dummyBlank', 'finalDuration', 'dotsOn', 'dotsOff']
        labeled_dots.drop(columns=to_drop, inplace=True)
        to_rename = {
            'duration': 'viewingDuration',
            'direction': 'initDirection',
        }
        labeled_dots.rename(columns=to_rename, inplace=True)
        labeled_dots.dropna(subset=['dirChoice'], inplace=True)
        list_of_labeled_dots_dataframes.append(labeled_dots)
        
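    if not list_of_labeled_dots_dataframes:
        # pd.concat raises ValueError on an empty list, e.g. when every session failed the asserts
        print('no session passed the sanity checks; nothing written')
        return None
    full_labeled_dots = pd.concat(list_of_labeled_dots_dataframes)
    # append to the global file; write the header only when the file is first created
    write_header = not os.path.exists(global_labeled_dots_filename)
    full_labeled_dots.to_csv(global_labeled_dots_filename, index=False, mode='a', header=write_header)
        
    return None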

In [4]:
TIMESTAMPS = (
    '2019_11_06_12_43',
#     '2019_11_05_16_19',
#     '2019_11_20_15_34',
#     '2019_11_05_10_27',
#     '2019_11_19_13_15',
#     '2019_11_05_13_18',
#     '2019_11_26_13_11',
#     '2019_11_25_16_12',
)

In [5]:
DATA_FOLDER = '/home/adrian/SingleCP_DotsReversal/Fall2019/raw/'

The fira dataframe has an index ranging from 0 to 819 (820 trials per session). The `trialIx` column of the dots dataframe acts as a "foreign key" into this index, and `label_dots` uses it to join the two dataframes.
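As a minimal sketch of this foreign-key join, the toy example below uses two tiny stand-in dataframes. Only the `trialIx` column name and the `join(..., on=...)` call are taken from `label_dots`; the values and the reduced set of columns are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the FIRA data: one row per trial.
# Its default RangeIndex (0, 1, 2) plays the role of the real 0..819 index.
fira = pd.DataFrame({'coherence': [0.0, 48.5, 100.0],
                     'dirChoice': [0, 1, 1]})

# Toy stand-in for the dots data: one row per dot.
# trialIx is the foreign key pointing at fira's index.
dots = pd.DataFrame({'xpos': [0.1, 0.2, 0.3, 0.4],
                     'ypos': [0.9, 0.8, 0.7, 0.6],
                     'trialIx': [0, 0, 2, 2]})

# Each dot row picks up the columns of the trial it belongs to.
labeled = dots.join(fira, on='trialIx')
print(labeled)
```

Every dot row keeps its position and gains the trial-level columns; trials without dots (here trial 1) simply do not appear in the result.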

In [None]:
DOTS_LABELED = '/home/adrian/SingleCP_DotsReversal/processed/dots_fall_2019_v1.csv'

In [None]:
label_dots(TIMESTAMPS, DOTS_LABELED, DATA_FOLDER)

In [16]:
# DOTS_DATA = '/home/adrian/SingleCP_DotsReversal/processed/dots_pilot_summer_2019.csv'

# a=pd.read_csv(DOTS_DATA)
# b=pd.read_csv(DOTS_LABELED)
# b.shape[0] / 5
# c=pd.read_csv('/home/adrian/SingleCP_DotsReversal/Fall2019/raw/2019_11_06_12_43/2019_11_06_12_43_dotsPositions.csv')
# c.head()
# a.head()
# b.head()

In [None]:
def inspect_csv(df):
    """df is a pandas.DataFrame"""
    print(df.head())
    print(len(df))
    print(np.unique(df['taskID']))
    try:
        print(np.unique(df['pilotID']))
    except KeyError:
        print(np.unique(df['subject']))

In [None]:
# a = pd.read_csv('/home/adrian/SingleCP_DotsReversal/raw/2019_06_25_13_24/2019_06_25_13_24_dotsPositions.csv')
# inspect_csv(a)

# b = pd.read_csv('/home/adrian/SingleCP_DotsReversal/raw/2019_07_03_15_03/2019_07_03_15_03_dotsPositions.csv')
# inspect_csv(b)

# c = pd.read_csv('/home/adrian/SingleCP_DotsReversal/raw/2019_07_10_17_19/2019_07_10_17_19_dotsPositions.csv')
# inspect_csv(c)

In [None]:
# files = [
#     '/home/adrian/SingleCP_DotsReversal/raw/2019_06_25_13_24/2019_06_25_13_24_dotsPositions.csv',
#     '/home/adrian/SingleCP_DotsReversal/raw/2019_07_03_15_03/2019_07_03_15_03_dotsPositions.csv',
#     '/home/adrian/SingleCP_DotsReversal/raw/2019_06_24_13_31/2019_06_24_13_31_dotsPositions.csv',
#     '/home/adrian/SingleCP_DotsReversal/raw/2019_06_24_13_06/2019_06_24_13_06_dotsPositions.csv',
#     '/home/adrian/SingleCP_DotsReversal/raw/2019_07_17_17_17/2019_07_17_17_17_dotsPositions.csv',
#     '/home/adrian/SingleCP_DotsReversal/raw/2019_07_10_12_18/2019_07_10_12_18_dotsPositions.csv',
#     '/home/adrian/SingleCP_DotsReversal/raw/2019_06_20_13_27/2019_06_20_13_27_dotsPositions.csv',
#     '/home/adrian/SingleCP_DotsReversal/raw/2019_06_24_12_38/2019_06_24_12_38_dotsPositions.csv',
#     '/home/adrian/SingleCP_DotsReversal/raw/2019_07_10_17_19/2019_07_10_17_19_dotsPositions.csv',
#     '/home/adrian/SingleCP_DotsReversal/raw/2019_06_20_12_54/2019_06_20_12_54_dotsPositions.csv',
#     '/home/adrian/SingleCP_DotsReversal/raw/2019_06_21_13_08/2019_06_21_13_08_dotsPositions.csv',
#     '/home/adrian/SingleCP_DotsReversal/raw/2019_07_12_11_11/2019_07_12_11_11_dotsPositions.csv'
# ]

In [None]:
# pandas = [pd.read_csv(f) for f in files]
# total = pd.concat(pandas)
# inspect_csv(total)

In [None]:
# len(files)

In [None]:
# final = total.loc[total['isActive'] == 1,:]
# inspect_csv(final)

In [None]:
# write_to_file = False
# if write_to_file:
#     final.to_csv('dots_pilot_summer_2019.csv', index=False)

# Write a dotsDB HDF5 file
Now that all the dotsPositions.csv data is collected into a single global .csv file, I wish to dump it all into an hdf5 database.

Several actions need to be implemented.
1. For each trial in the dotsPositions.csv data, I need to know: _coherence_, _viewing duration_, _presenceCP_, _direction_, _subject_, _block_ (_probCP_). For this, I will assume that the `trialEnd` (from FIRA) and `seqDumpTime` (from dotsPositions) timestamps are in the same unit.
2. I need to decide how to organize my dotsDB hierarchically, for example `subj15/probCP0.1/coh0/ansleft/CPno/VD100`.
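To make the hierarchy idea concrete, here is a hedged sketch of how such a nested path maps onto HDF5 groups with `h5py`. The helper `group_path` and the dataset name `example` are hypothetical and only serve the illustration; they are not the layout that dotsDB itself writes.

```python
import h5py
import numpy as np

def group_path(subject, prob_cp, coh, choice, has_cp, vd_msec):
    """Hypothetical helper: build a path like /subj15/probCP0.1/coh0/ansleft/CPno/VD100."""
    cp_label = 'yes' if has_cp else 'no'
    return (f'/subj{subject}/probCP{prob_cp}/coh{coh}'
            f'/ans{choice}/CP{cp_label}/VD{vd_msec}')

path = group_path(15, 0.1, 0, 'left', False, 100)

# In-memory file (core driver without backing store), so nothing is written to disk.
with h5py.File('sketch.h5', 'w', driver='core', backing_store=False) as f:
    grp = f.require_group(path)  # intermediate groups are created as needed
    grp.create_dataset('example', data=np.zeros((3, 2)))  # dataset name is illustrative
    stored_shape = f[path + '/example'].shape
    top_level = list(f.keys())

print(path, stored_shape, top_level)
```

One consequence of this design is that all trials sharing the same parameters end up under the same leaf group, so a condition can be read back with a single path lookup.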

In [None]:
trials_df = pd.read_csv('/home/adrian/SingleCP_DotsReversal/processed/all_valid_data.csv')
# dots = pd.read_csv(DOTS_DATA)

In [None]:
# trials = np.unique(trials_df['trialEnd'])

In [None]:
def get_trial_params(df):
    """coherence, viewing duration, presenceCP, direction, subject, block, probCP"""
    coh = df['coherence'].values[0]
    vd = df['viewingDuration'].values[0]
    pcp = df['presenceCP'].values[0]
    idir = df['initDirection'].values[0]
    subj = df['subject'].values[0]
    block = df['block'].values[0]
    Pcp = df['probCP'].values[0]
    return coh, vd, pcp, idir, subj, block, Pcp

In [None]:
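def get_trial_from_dots_ts(dot_ts, trials_ts, trials_df):
    """return the FIRA row(s) of the trial whose trialEnd is the first one after dot_ts"""
    later_trials = trials_ts[trials_ts > dot_ts]
    # an empty selection would make np.min raise ValueError, which callers do not catch
    assert later_trials.size > 0, 'no trialEnd occurs after seqDumpTime'
    trial_dump_time = np.min(later_trials)
    assert trial_dump_time - dot_ts < .5, 'trialEnd occurs more than 0.5 sec after seqDumpTime'
    return trials_df[trials_df['trialEnd'] == trial_dump_time]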

In [None]:
def add_trial_params(row, t, trials):
    """
    function that adds appropriate values to trial parameter columns in dots dataframe
    :param row: row from dataframe
    :param t: dataframe with FIRA data
    :param trials: numpy array of trialEnd timestamps (scalars)
    """
    time = row['seqDumpTime']
    try:
        trial = get_trial_from_dots_ts(time, trials, t)
    except AssertionError:
        print(f'0.5 sec margin failed at row {row.name}')
        return row
    c,v,p,i,s,b,P = get_trial_params(trial)
    row['coherence'] = c
    row['viewingDuration'] = v
    row['presenceCP'] = p
    row['initDirection'] = i
    row['subject'] = s
    row['block'] = b
    row['probCP'] = P
    return row

def set_nans(df):
    if 'isActive' in df:
        del df['isActive']
    df['coherence'] = np.nan
    df['viewingDuration'] = np.nan
    df['presenceCP'] = np.nan
    df['initDirection'] = np.nan
    df['subject'] = np.nan
    df['block'] = np.nan
    df['probCP'] = np.nan
    return df

So far so good: for a given `seqDumpTime` value, I am able to recover the trial's parameters. All that remains is to add the corresponding columns to the dots dataframe (and remove the `isActive` one).
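A possible vectorized alternative to the row-by-row matching is `pandas.merge_asof`, sketched below on toy data. This is not what the notebook ran; it is a sketch under the assumption that matching each `seqDumpTime` to the first strictly later `trialEnd`, within 0.5 sec, is the intended rule. The toy values are made up.

```python
import pandas as pd

# Toy trial table: one row per trial, keyed by trialEnd.
trials = pd.DataFrame({'trialEnd': [10.3, 20.2, 30.4],
                       'coherence': [0.0, 48.5, 100.0]})

# Toy dots table: one row per dot, keyed by seqDumpTime.
dots = pd.DataFrame({'seqDumpTime': [10.0, 10.0, 20.0, 25.0],
                     'xpos': [0.1, 0.2, 0.3, 0.4]})

# merge_asof requires both tables to be sorted by their key.
matched = pd.merge_asof(
    dots.sort_values('seqDumpTime'),
    trials.sort_values('trialEnd'),
    left_on='seqDumpTime',
    right_on='trialEnd',
    direction='forward',        # look for the next trialEnd after the dump time
    tolerance=0.5,              # give up if it is more than 0.5 sec away
    allow_exact_matches=False,  # require trialEnd to be strictly later
)
print(matched)
```

Dots with no `trialEnd` inside the tolerance window (the last row here) get `NaN` in the trial columns, which mirrors the rows left untouched when the 0.5 sec margin fails.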

In [None]:
# inspect_csv(dots)

**Following cell is SLOW! Around 30 min**

In [None]:
if False:
    dots = set_nans(dots)
    dots = dots.apply(add_trial_params, axis=1, args=(trials_df, trials))

In [None]:
if False:
    dots_light = dots.copy()
    dots_light.dropna(inplace=True)
    dots.to_csv('dots_summer_2019_upgraded.csv')
    dots_light.to_csv('dots_summer_2019_upgraded_light.csv')

I forgot to record the true viewing duration and the subject's choice!
So, here we go again...

In [None]:
# ALL THIS HAS BEEN DONE AND YIELDED 'dots_summer_2019_upgraded_light_v2.csv'
# dots = pd.read_csv(DOTS_LABELED)
# del dots['Unnamed: 0']
# gb = dots.groupby('seqDumpTime')

# # Let's build a dataframe with four columns: seqDumpTime, cpChoice, trueVD and trialEnd
# extras = pd.DataFrame(index=gb.groups.keys(), columns=['cpChoice', 'trueVD', 'trialEnd'])

In [None]:
# def add_extras(row, t, trials):
#     """
#     function that adds appropriate values to trial parameter columns in dots dataframe
#     :param row: row from dataframe
#     :param t: dataframe with FIRA data
#     :param trials: numpy array of trialEnd timestamps (scalars)
#     """
#     time = row.name
#     try:
#         trial = get_trial_from_dots_ts(time, trials, t)
#     except AssertionError:
#         print(f'0.5 sec margin failed at row {row.name}')
#         return row
#     c, v, te = trial['cpChoice'].values[0], trial['dotsOff'].values[0] - trial['dotsOn'].values[0], trial['trialEnd'].values[0]
#     row['cpChoice'] = c  # could still be NA if block didn't require CP report
#     row['trueVD'] = v
#     row['trialEnd'] = te
#     return row

In [None]:
# extras = pd.DataFrame(index=gb.groups.keys(), columns=['dirChoice', 'trueVD', 'trialEnd'])
# extras = extras.apply(add_extras, axis=1, args=(trials_df, trials))

In [None]:
# extras.head()

In [None]:
# # join on seqDumpTime
# dots = dots.join(extras, on='seqDumpTime')

In [None]:
# dots.head()

I should have added `dirChoice` instead of `cpChoice`!!!

In [None]:
# # DONE
# dots = pd.read_csv(DOTS_LABELED)
# trials_df.set_index('trialEnd', inplace=True)
# dots = dots.join(trials_df[['dirChoice']], on='trialEnd')
# dots.to_csv('dots_summer_2019_upgraded_light_v3.csv', index=False)

Now, it turns out some rows have a `nan` value in the `dirChoice` column!
I will drop them!

In [None]:
# dots = pd.read_csv(DOTS_LABELED)
# old_length = len(dots)
# dots.dropna(subset=['dirChoice'],inplace=True)
# assert len(dots) < old_length
# dots.to_csv('dots_summer_2019_upgraded_light_v4.csv', index=False)

## Write HDF5 file
Now, I need to write an HDF5 file with the structure:
`subj15/probCP0.1/coh0/ansleft/CPno/VD100`.

I need to:
- loop through the trials contained in the dots DF
- port the dots data to dotsDB format
- write to file

In [17]:
dots = pd.read_csv(DOTS_LABELED)

In [None]:
dots.head()

In [None]:
gb = dots.groupby('seqDumpTime')  # recall gb.get_group() and gb['frameIdx'].max()

At this stage, I would like to know the max value of `frameIdx` in each trial.
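A minimal sketch of that per-trial maximum, on a toy dataframe with made-up values. Only the column names `seqDumpTime` and `frameIdx` come from the dots data.

```python
import pandas as pd

# Toy dots data: two trials, identified by their seqDumpTime.
dots = pd.DataFrame({'seqDumpTime': [1.5, 1.5, 1.5, 7.25, 7.25],
                     'frameIdx':    [1.0, 1.0, 2.0, 1.0, 3.0]})

# Largest frame index within each trial, cast to int for readability.
frames_per_trial = dots.groupby('seqDumpTime')['frameIdx'].max().astype(int)
print(frames_per_trial)
```

The result is a Series indexed by `seqDumpTime`, so it can be compared directly against the frame count expected from each trial's viewing duration.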

In [None]:
# Needs to be re-written
def get_frames(df):
    """
    get the dots data as a list of numpy arrays, as dotsDB requires them
    """
    # (could/should probably be re-written with groupby and apply...)
    max_frame = df["frameIdx"].max()
    # check for NaN before casting: after a cast to int the test could never fail
    assert not np.isnan(max_frame), 'NaN num_frames'
    num_frames = int(max_frame)
    list_of_frames = []
    for fr in range(num_frames):
        # frameIdx starts at 1 in the data, hence the fr+1 offset
        frame_data = df[df["frameIdx"] == (fr+1)]
        list_of_frames.append(np.array(frame_data[['ypos','xpos']]))  # here I swap xpos with ypos for dotsDB
    return list_of_frames

def get_group_name(df):
    """
    get the trial's parameters, and therefore the HDF5 group where the data should be appended
    """       
    # get HDF5 group name
    
    def choice(c):
        if c == 1:
            return '/ansright' 
        elif c == 0:
            return '/ansleft'
        else:
            raise ValueError(f'unexpected choice value {c}')
            
    def chgepoint(c):
        return '/CPyes' if c else '/CPno'
    
    def viewdur(v):
        # round before casting: int() truncates, so a product like 299.99999 would become 299
        return '/VD' + str(int(round(1000*v)))
    
    def direction(d):
        return 'left' if d else 'right'
    
    ss, pp, cc, ch, cp, vd, di= df[['subject', 'probCP', 'coherence', 'dirChoice', 'presenceCP', 
                               'viewingDuration', 'initDirection']].values[0,:]
    
    group_name = '/subj' + str(ss) + \
                 '/probCP' + str(pp) + \
                 '/coh' + str(cc) + \
                 choice(ch) + chgepoint(cp) + viewdur(vd) + '/' + direction(di)
                 
    vals = {'coh': cc, 
            'subject': ss, 
            'probCP': pp, 
            'dirChoice': ch,
            'presenceCP': cp,
            'viewingDuration': vd, 
            'initDirection': direction(di)}
    
    return group_name, vals

def write_dots_to_file(df, hdf5_file):
    """
    The aim of this function is to write the dots info contained in the pandas.DataFrame df to a dotsDB HDF5 file.
    df should only contain data about a single trial. 
    
    head on df looks like this
    xpos	ypos	isCoherent	frameIdx	seqDumpTime	pilotID	taskID	coherence	viewingDuration	presenceCP	initDirection	subject	block	probCP	cpChoice	trueVD	trialEnd	dirChoice
	0.722093	0.416122	1.0	1.0	1069.27719	2.0	3.0	48.5	0.3	0.0	180.0	S1	Block2	0.0	NaN	0.318517	1069.535562	0.0
	0.681785	0.356234	1.0	1.0	1069.27719	2.0	3.0	48.5	0.3	0.0	180.0	S1	Block2	0.0	NaN	0.318517	1069.535562	0.0
	0.445828	0.914470	1.0	1.0	1069.27719	2.0	3.0	48.5	0.3	0.0	180.0	S1	Block2	0.0	NaN	0.318517	1069.535562	0.0
	0.833181	0.112126	1.0	1.0	1069.27719	2.0	3.0	48.5	0.3	0.0	180.0	S1	Block2	0.0	NaN	0.318517	1069.535562	0.0
	0.013516	0.354543	1.0	1.0	1069.27719	2.0	3.0	48.5	0.3	0.0	180.0	S1	Block2	0.0	NaN	0.318517	1069.535562	0.0
    """
    frames = get_frames(df)
    gn, params = get_group_name(df)
    
    # exit function if number of frames too different from theoretical one
    # (vd * 60, i.e. assuming 60 frames per sec with vd in sec)
    vd = params['viewingDuration']
    num_frames = len(frames)
    if abs(num_frames-vd*60) > 5:
        tr = df['seqDumpTime'].values[0]
        print(f'trial {tr} not written; discrepancy num_frames {num_frames} and VD {vd}')
        return None
    
    cptime = 0.2 if params['presenceCP'] else None
    parameters = dict(speed=5, 
                      density=90, 
                      coh_mean=params['coh'], 
                      coh_stdev=10, 
                      direction=params['initDirection'],
                      num_frames=np.max(df["frameIdx"]).astype(int),
                      diameter=5, 
                      pixels_per_degree=(55.4612 / 2), 
                      dot_size_in_pxs=3, 
                      cp_time=cptime)
    
    stimulus = ddb.DotsStimulus(**parameters)
    
    ddb.write_stimulus_to_file(stimulus, 1, hdf5_file, 
                               pre_generated_stimulus=[frames],
                               group_name=gn, append_to_group=True, max_trials=50)

In [None]:
# # get the first two seqDumpTime values for toy example
# counter = 0
# for ix in gb.groups.keys():
#     counter += 1
#     if counter == 6:
#         break
#     write_dots_to_file(dots[dots['seqDumpTime']==ix], 'test_pilot.h5')

Following cell takes a bit under 5 min

In [None]:
# Recall func is called twice the first time!

# _ = gb.apply(write_dots_to_file, 'pilot_v3.h5')

# no need to go in manually and delete the first entry in the dataset corresponding to 
# the first group element gb.groups.keys()[0]
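Since `write_dots_to_file` has a side effect (it appends to the HDF5 file), one way to make sure it runs exactly once per trial is an explicit loop over the groups instead of `gb.apply`. The sketch below uses a stand-in function, `record_trial`, which is hypothetical and only records which trial it received.

```python
import pandas as pd

# Toy dots data: two trials, identified by their seqDumpTime.
dots = pd.DataFrame({'seqDumpTime': [1.5, 1.5, 7.25],
                     'xpos': [0.1, 0.2, 0.3]})

calls = []

def record_trial(trial_df):
    """Stand-in for write_dots_to_file: only records the trial it was given."""
    calls.append(float(trial_df['seqDumpTime'].iloc[0]))

# Iterating over a groupby yields (key, sub-dataframe) pairs, one per trial.
for _, trial_df in dots.groupby('seqDumpTime'):
    record_trial(trial_df)

print(calls)
```

With the loop, the number of calls equals the number of groups by construction, so no duplicate entry needs to be cleaned up afterwards.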