# Documentation
The purpose of this notebook is to describe the files available in the data repository and their structure. 

For version control purposes, this file should be committed without output and only run locally.

In [None]:
from pathlib import Path
from pprint import pprint


import numpy as np
import pandas as pd

In [None]:
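# Assumed imports: the aliases match the loader names used in the function below
from mat4py import loadmat as loadmat_mat4py
from mat73 import loadmat as loadmat_mat73
from scipy.io import loadmat as loadmat_scipy


def read_matlab_file(file_path):
    """Read a MATLAB file, trying mat4py first, then mat73, then scipy."""
    file_path = str(file_path)

    try:
        mat_file = loadmat_mat4py(file_path)
        mat_file['read'] = 'mat4py'
    except Exception:
        try:
            mat_file = loadmat_mat73(file_path)
            mat_file['read'] = 'mat73'
        except Exception:
            # Last resort: scipy raises if it cannot read the file either
            mat_file = loadmat_scipy(file_path)
            mat_file['read'] = 'scipy'
    return mat_file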
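
The nested `try`/`except` chain above can also be written as a loop over candidate readers. The sketch below only illustrates that pattern: `read_with_fallback` and the two stand-in readers are hypothetical names, and none of the real MATLAB libraries are called.

```python
def read_with_fallback(file_path, readers):
    """Try each (name, reader) pair in order and return the first successful result."""
    errors = {}
    for name, reader in readers:
        try:
            contents = reader(str(file_path))
        except Exception as error:  # a reader that cannot parse the file raises
            errors[name] = error
            continue
        contents['read'] = name  # record which reader succeeded
        return contents
    raise ValueError(f"No reader could open {file_path}: {errors}")


# Stand-in readers (hypothetical) that mimic one failing and one working library
def failing_reader(path):
    raise NotImplementedError("unsupported MATLAB version")


def working_reader(path):
    return {'data': [1, 2, 3]}


result = read_with_fallback(
    'example.mat',
    [('first', failing_reader), ('second', working_reader)],
)
print(result['read'])
```

Keeping the readers in a list makes the order explicit and keeps the error from every failed attempt, which helps when none of the libraries can open a file.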

# Data directory structure

In [None]:
base_path = Path('/shared/catalystneuro/Buzsaki/TingleyD')  # Change this to the right location

## Subjects 
First, let's get the different subjects and keep only the ones that we decided should be included in this transformation.

In [None]:
subjects_in_study = ['DT2', 'DT5', 'DT7', 'DT8', 'DT9']
subject_path_dic = {p.stem:p for p in base_path.iterdir() if p.is_dir() and p.name in subjects_in_study}
pprint(subject_path_dic.keys())

The result should list all the subjects in this paper,
['DT2', 'DT5', 'DT7', 'DT8', 'DT9'], and the dictionary values are the paths to each subject's folder.

## Sessions
Now let's see how the sessions are structured. In this dataset each subject directory contains multiple folders, tentatively belonging to different sessions, and some top-level files.

### Session folders

In [None]:
subject = "DT8" 
sessions_path_in_subject = {p.stem:p for p in subject_path_dic[subject].iterdir() if p.is_dir()}
session_names_list = list(sessions_path_in_subject.keys())
session_names_list.sort()
pprint(session_names_list, compact=False)

The results should be something like:
    
    ['20170124_0um_0um_170124_151501',
     '20170125_0um_0um_merge',
     '20170126_0um_0um_merge',
     '20170127_0um_0um_170127_121658',
     '20170128_0um_72um_merge',
     '20170129_0um_72um_170129_130356',
     '20170130_0um_144um_170130_143246',
     '20170131_72um_432um_merge',
     '20170201_72um_432um_merge',
     '20170202_72um_576um_170202_092033',
     '20170203_72um_648um_170203_105608',
     '20170206_72um_1872um_merge',
     '20170207_144um_1944um_merge',
     '20170208_144um_1944um_merge',
     '20170209_144um_1944um_merge',
     '20170210_144um_1944um_170210_112254',
     '20170211_144um_1944um_merge',
     '20170212_144um_1944um_170212_154634',
     '20170213_144um_1944um_merge',
     '20170214_216um_1944um_170214_102804',
     '20170215_216um_1944um_170215_140739',
     '20170216_216um_1944um_merge',
     '20170217_216um_1944um_merge',
     '20170218_216um_1944um_170218_114724',
     '20170220_216um_1944um_170220_192456',
     '20170221_324um_1944um_170221_101929',
     '20170222_324um_2088um_170222_113940',
     '20170223_324um_2088um_170223_103741',
     '20170224_324um_2088um_170224_101710',
     '20170227_324um_2088um_170227_105926',
     '20170228_324um_2088um_merge',
     '20170229_324um_2088um_merge',
     '20170303_324um_2088um_170303_114804',
     '20170305_468um_2088um_170305_135233',
     '20170306_468um_2088um_170306_113628',
     '20170307_540um_2088um_merge',
     '20170308_612um_2088um_merge',
     '20170309_612um_2088um_170309_093245',
     '20170310_612um_2088um_170310_140825',
     '20170311_684um_2088um_170311_134350',
     '20170316_828um_2088um_170316_093559',
     '20170318_828um_2088um_170318_201151',
     '20170320_828um_2088um_170320_200023']

These seem to be the different recording sessions, named with the following format:
 * {date}_{x}um_{y}um_{date_time | merge}
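
The naming format can be checked with a regular expression. This is a minimal sketch that assumes every session name has an 8-digit date, two `um` values, and either `merge` or a `{6 digits}_{6 digits}` time stamp, as in the listing above. The function name and the field names (`first_um`, `second_um`, `suffix`) are made up for illustration; the meaning of the two `um` values is not stated here.

```python
import re

# Pattern inferred from the session names listed above (an assumption, not a specification)
SESSION_PATTERN = re.compile(
    r"^(?P<date>\d{8})_(?P<first_um>\d+)um_(?P<second_um>\d+)um_"
    r"(?P<suffix>merge|\d{6}_\d{6})$"
)


def parse_session_name(session_name):
    """Split a session folder name into its parts, or raise if it does not fit."""
    match = SESSION_PATTERN.match(session_name)
    if match is None:
        raise ValueError(f"Unexpected session name: {session_name!r}")
    fields = match.groupdict()
    fields['is_merge'] = fields['suffix'] == 'merge'
    return fields


print(parse_session_name("20170124_0um_0um_170124_151501"))
print(parse_session_name("20170128_0um_72um_merge"))
```

Running this over `session_names_list` would flag any folder that does not follow the format.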

## Structure of the session directory

Now let's look at the directories inside each session folder.
Here are some examples.

In [None]:
session_name = "20170124_0um_0um_170124_151501" 
path_of_dirs_in_session = {p.name:p for p in sessions_path_in_subject[session_name].iterdir() if p.is_dir()}
directories_in_session = list(path_of_dirs_in_session.keys())
directories_in_session.sort(key=lambda x:int(x))
pprint(directories_in_session)

In [None]:
session_name = "20170224_324um_2088um_170224_101710"
path_of_dirs_in_session = {p.name:p for p in sessions_path_in_subject[session_name].iterdir() if p.is_dir()}
directories_in_session = list(path_of_dirs_in_session.keys())
directories_in_session.sort()
pprint(directories_in_session)

In [None]:
session_name = "20170307_540um_2088um_merge" 
path_of_dirs_in_session = {p.name:p for p in sessions_path_in_subject[session_name].iterdir() if p.is_dir()}
directories_in_session = list(path_of_dirs_in_session.keys())
directories_in_session.sort()
pprint(directories_in_session)

In [None]:
session_name ="20170308_612um_2088um_merge"  
path_of_dirs_in_session = {p.name:p for p in sessions_path_in_subject[session_name].iterdir() if p.is_dir()}
directories_in_session = list(path_of_dirs_in_session.keys())
directories_in_session.sort()
pprint(directories_in_session)

The output of the last example should be something like this:

    ['1',
     '2',
     '3',
     '4',
     '5',
     '6',
     '7',
     '8',
     '9',
     '10',
     '20170308_612um_2088um_170308_115428', # these are available on merge sessions
     '20170308_612um_2088um_170308_144851', # these are available on merge sessions
     'Session 2017-03-08', # not available in every session
     'StateScoreFigures'   # not available in every session
     ]

The general structure seems to be the following. Each session contains ten folders numbered from 1 to 10. Merge sessions contain further folders, named with the specific time stamps of the recordings that were merged. Finally, sometimes there is a folder named "Session {date}".
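
The grouping just described can be sketched as a small helper. It assumes the directory names follow the patterns seen in the examples; the function name and the group labels are made up for illustration. Numbered folders are sorted numerically, since a plain string sort would place `'10'` before `'2'`.

```python
import re


def classify_session_directories(directory_names):
    """Group directory names of a session by the naming patterns seen so far."""
    groups = {'numbered': [], 'recordings': [], 'session': [], 'other': []}
    for name in directory_names:
        if name.isdigit():
            groups['numbered'].append(name)
        elif re.fullmatch(r"\d{8}_\d+um_\d+um_\d{6}_\d{6}", name):
            groups['recordings'].append(name)  # recordings that were merged
        elif name.startswith('Session '):
            groups['session'].append(name)
        else:
            groups['other'].append(name)
    groups['numbered'].sort(key=int)  # numeric order: 1, 2, ..., 10
    for key in ('recordings', 'session', 'other'):
        groups[key].sort()
    return groups


example = ['10', '2', '1', 'StateScoreFigures', 'Session 2017-03-08',
           '20170308_612um_2088um_170308_144851',
           '20170308_612um_2088um_170308_115428']
groups = classify_session_directories(example)
print(groups)
```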

### Files in each of the sessions

Here is a brief description of the types of files found at each level. A more extended discussion can be found in the Data section below.

In the session directory top-most level the following formats can be found:
* evt
* clu
* fet
* pos
* res
* lfp
* spk
* nrs
* xml
* mat

This seems to be where the main data for the session is, plus the mat files, where both processing results and behavioral data might be.

In the directories that are named with numbers the following formats can be found:
* kwd 
* kwik
* log
* clu
* fet
* fmask
* klg
* prm
* kvlog

These seem to be the results of the `klusta` suite for spike sorting and analysis.

In merge sessions, when the merge directories are available, the following files can be found in them:
* info.rhd
* amplifiers.nrs
* amplifiers.xml

Finally, when the directory "Session {date}" is available, it contains csv files with a time stamp as title and some `.tak` files, which seem to be audio.
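
A listing like the ones in this section can be produced by counting file suffixes at one directory level. The sketch below is an illustration only: it assumes that some files carry a trailing number after the format (for example `name.clu.1`), so it takes the last non-numeric suffix. The helper names and the file names in the demo are made up.

```python
from collections import Counter
from pathlib import Path
import tempfile


def data_suffix(path):
    """Return the last suffix that is not purely numeric ('.clu' for 'name.clu.1')."""
    non_numeric = [s for s in Path(path).suffixes if not s[1:].isdigit()]
    return non_numeric[-1] if non_numeric else ''


def count_suffixes(directory):
    """Count the formats of the files directly inside `directory` (not recursive)."""
    return Counter(data_suffix(p) for p in Path(directory).iterdir() if p.is_file())


# Demo on a temporary directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for file_name in ['s.clu.1', 's.clu.2', 's.xml', 's.lfp']:
        (Path(tmp) / file_name).touch()
    suffix_counts = count_suffixes(tmp)

print(suffix_counts)
```

Calling `count_suffixes` on a session path, and then on each of its sub-directories, gives one listing per level.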

## Subject top-level files
Moreover, we also find some files in the topmost directory. We describe them in more detail in the data section of this document

In [None]:
files_not_in_session = {p.name:p for p in subject_path_dic[subject].iterdir() if p.is_file()}
files_names = list(files_not_in_session.keys())
files_names.sort()
pprint(files_names)

The result should be something like this:

    ['.DS_Store',
     'DT8_current_dataset_parietal_theta.mat',
     'behav.mat',
     'groupRecordings.m']
     
 They seem to be behavioral files and we will come back to them when we discuss the specific files below

# Data 

## An overview of the available data
We discuss now what formats are present before describing the files in more detail.

In [None]:
not_data_formats = [".jpg", ".png", ".pdf", ".svg", ".fig", ".py", ".m"]

format_list = [p.suffixes for p in subject_path_dic[subject].rglob('*') if not p.is_dir()]
format_list = list({_ for suffixes in format_list for _ in suffixes}) 
format_list = [_ for _ in format_list if 3 < len(_) < 7 and " " not in _]  # Eliminate names, numbers and dates
format_list = [p for p in format_list if p not in not_data_formats]
format_list.sort()
pprint(format_list, compact=True)

The output of this should give us the formats available for this subject and should look something like this:

    ['.LOG', '.WAV', '.alg', '.cat', '.clu', '.csv', '.evt', '.fet', '.fmask',
     '.high', '.klg', '.kvlog', '.kwd', '.kwik', '.lfp', '.log', '.low', '.mat',
     '.nrs', '.out', '.pos', '.prb', '.prm', '.raw', '.res', '.rhd', '.spk', '.tak',
     '.txt', '.xml']
     

Description of the files

* `LOG` : probably log files. 
* `WAV` : audio format. 
* `alg` : Not likely to have anything useful.
* `cat` :
* `clu` : usually associated to the neuroscope sorting format.
* `csv` : comma separated value, plain text.
* `klg`, `kvlog`, `kwik` : files of the klusta spike sorting suite.
* `lfp` : This is the LFP, this is read with Neuroscope.
* `log` : sometimes log files, sometimes associated with the Phy sorting program. 
* `low` :
* `mat` : matlab files.
* `nrs` : 
* `out` : output of some shell process usually.
* `pos` :
* `prm` :
* `raw` :
* `res` : usually associated to the neuroscope sorting format.
* `rhd` :
* `spk` :
* `tak` : audio file. This could be video. Double check and confirm.
* `txt` : plain text file.
* `xml` : xml.

#### Sizes of the files

In [None]:
{p.suffix:p for p in subject_path_dic[subject].rglob('*') if not p.is_dir() and "raw" in '.'.join(p.suffixes)}

In [None]:
not_data_formats = [".jpg", ".png", ".pdf", ".svg", ".fig", ".py", ".m"]

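# One example file per format, keyed by suffix (an assumption about the intended key)
format_dic = {p.suffix: p for p in subject_path_dic[subject].rglob('*') if p.is_file() and p.suffix not in not_data_formats}

In [None]:
file = next(iter(format_dic.values()))  # any example file; use e.g. format_dic['.lfp'] for a specific format
Path(file).stat().st_size * 70  # size in bytes, scaled by the factor of 70 used here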
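
To compare formats by size, the bytes can be summed per suffix. This is a minimal sketch under the assumption that the final suffix identifies the format; `size_per_suffix` is a hypothetical helper and the demo files are made up.

```python
from collections import defaultdict
from pathlib import Path
import tempfile


def size_per_suffix(directory, ignored_suffixes=()):
    """Sum the size in bytes of all files under `directory`, grouped by final suffix."""
    sizes = defaultdict(int)
    for p in Path(directory).rglob('*'):
        if p.is_file() and p.suffix not in ignored_suffixes:
            sizes[p.suffix] += p.stat().st_size
    return dict(sizes)


# Demo on a temporary directory with made-up files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'a.lfp').write_bytes(b'\x00' * 100)
    (Path(tmp) / 'sub').mkdir()
    (Path(tmp) / 'sub' / 'b.lfp').write_bytes(b'\x00' * 50)
    (Path(tmp) / 'c.png').write_bytes(b'\x00' * 10)
    sizes = size_per_suffix(tmp, ignored_suffixes=('.png',))

print(sizes)
```

In this notebook it could be called as `size_per_suffix(subject_path_dic[subject], not_data_formats)`.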

## Matlab files
As the matlab files are the ones usually associated with behavioral data, they will be described first. We will see what files are available in a session.
Here we use a merge session to get a picture of the most complicated case.

In [None]:
session_name = "20170308_612um_2088um_merge"  
mat_files_paths = {p.name:p for p in sessions_path_in_subject[session_name].rglob('*') if not p.is_dir() and 'mat' in ''.join(p.suffixes)}
mat_files_names = list(mat_files_paths.keys())
mat_files_names.sort()
pprint(mat_files_names)

The cell above should produce something like:

    ['20170308_612um_2088um_merge.behavior.mat',
     '20170308_612um_2088um_merge.firingMaps.cellinfo.mat',
     '20170308_612um_2088um_merge.isolationMetrics.cellinfo.mat',
     '20170308_612um_2088um_merge.olypherInfo.cellinfo.mat',
     '20170308_612um_2088um_merge.phaseMaps.cellinfo.mat',
     '20170308_612um_2088um_merge.placeFields.01_pctThresh.mat',
     '20170308_612um_2088um_merge.placeFields.05_pctThresh.mat',
     '20170308_612um_2088um_merge.placeFields.10_pctThresh.mat',
     '20170308_612um_2088um_merge.placeFields.20_pctThresh.mat',
     '20170308_612um_2088um_merge.placeFields.40_pctThresh.mat',
     '20170308_612um_2088um_merge.positionDecodingGLM_binnedspace_box.cellinfo.mat',
     '20170308_612um_2088um_merge.positionDecodingGLM_binnedspace_box_median.cellinfo.mat',
     '20170308_612um_2088um_merge.positionDecodingMaxCorr_binned_box_median.cellinfo.mat',
     '20170308_612um_2088um_merge.sessionInfo.mat',
     '20170308_612um_2088um_merge.spikes.cellinfo.mat']
     
There are similar files in every session, named like {date}_{x}um_{y}um_{suffix}.{name}.mat. We now look at all the mat files available for the subject, stripping everything but the name at the end, to see the variety of mat files available.

In [None]:
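# Collect every mat file of the subject, not only those of one session
all_mat_files_names = [p.name for p in subject_path_dic[subject].rglob('*.mat') if p.is_file()]

# Names such as 'behav.mat' have no session prefix and are kept whole
mat_files_names_simple = [name for name in all_mat_files_names if len(name.split(".")) == 2]
# For '{session}.{name}.mat' drop the session prefix
mat_files_names_composed = ['.'.join(name.split('.')[1:]) for name in all_mat_files_names if len(name.split(".")) != 2]

# The set removes names repeated across sessions
mat_files_names_formatted = sorted(set(mat_files_names_composed) | set(mat_files_names_simple))
pprint(mat_files_names_formatted)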

This should produce something that looks like this:

    ['DT8_current_dataset_parietal_theta.mat',
    'PhaseLockingData.cellinfo.mat',
    'assembliesCrossRegionData.mat',
    'assembliesCrossRegionData_w_theta_sin_cos_coord_vel.mat',
    'assembliesCrossRegion_split_w_theta.mat',
    'assembliesWithinRegionData_w_theta_sin_cos_coord_vel.mat',
    'behav.mat',
    'behav_temp.mat',
    'behavior.mat',
    'firingMaps.cellinfo.mat',
    'isolationMetrics.cellinfo.mat',
    'ls_RipplePhaseModulation.cellinfo.mat',
    'meta.mat',
    'noiseCorrs.mat',
    'olypherInfo.cellinfo.mat',
    'olypherInfo_w_disc.cellinfo.mat',
    'phaseMaps.cellinfo.mat',
    'placeFields.01_pctThresh.mat',
    'placeFields.05_pctThresh.mat',
    'placeFields.10_pctThresh.mat',
    'placeFields.20_pctThresh.mat',
    'placeFields.40_pctThresh.mat',
    'positionDecodingGLM.cellinfo.mat',
    'positionDecodingGLM_binnedspace_box.cellinfo.mat',
    'positionDecodingGLM_binnedspace_box_median.cellinfo.mat',
    'positionDecodingGLM_binnedspace_box_nozero.cellinfo.mat',
    'positionDecodingGLM_binnedspace_gauss.cellinfo.mat',
    'positionDecodingGLM_box.cellinfo.mat',
    'positionDecodingGLM_gaussian.cellinfo.mat',
    'positionDecodingMaxCorr_binned_box_median.cellinfo.mat',
    'referenceFrames.mat',
    'sessionInfo.mat',
    'spikes.cellinfo.mat']
    
This gives us an idea of all the files available. I will describe them in more detail.
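
It can also help to know which sessions contain each kind of mat file. The sketch below assumes names of the form `{session}.{name}.mat` and skips names without a session prefix; `sessions_per_mat_name` is a hypothetical helper, and the input list is made up for illustration rather than read from disk.

```python
from collections import defaultdict


def sessions_per_mat_name(file_names):
    """Map each reduced name ('behavior.mat') to the sorted sessions that contain it."""
    sessions = defaultdict(set)
    for file_name in file_names:
        parts = file_name.split('.')
        if len(parts) > 2:  # '{session}.{name}.mat'; skips names like 'behav.mat'
            sessions['.'.join(parts[1:])].add(parts[0])
    return {name: sorted(found) for name, found in sessions.items()}


# Made-up list of file names for illustration
example_names = [
    '20170308_612um_2088um_merge.behavior.mat',
    '20170307_540um_2088um_merge.behavior.mat',
    '20170308_612um_2088um_merge.sessionInfo.mat',
    'behav.mat',
]
mat_sessions = sessions_per_mat_name(example_names)
print(mat_sessions)
```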

### Description overview (To-do)

* `DT8_current_dataset_parietal_theta.mat` :
* `PhaseLockingData.cellinfo.mat` : 
* `assembliesCrossRegionData.mat` :  <- Check this
* `assembliesCrossRegionData_w_theta_sin_cos_coord_vel.mat` :
* `assembliesCrossRegion_split_w_theta.mat` :
* `assembliesWithinRegionData_w_theta_sin_cos_coord_vel.mat` :
* `behav.mat` : <- Check this
* `behav_temp.mat` :
* `behavior.mat` :
* `firingMaps.cellinfo.mat` :
* `isolationMetrics.cellinfo.mat` :
* `ls_RipplePhaseModulation.cellinfo.mat` :
* `meta.mat` :
* `noiseCorrs.mat` : <- Ignore this
* `olypherInfo.cellinfo.mat` :
* `olypherInfo_w_disc.cellinfo.mat` :
* `phaseMaps.cellinfo.mat` :
* `placeFields.01_pctThresh.mat` :
* `placeFields.05_pctThresh.mat` :
* `placeFields.10_pctThresh.mat` :
* `placeFields.20_pctThresh.mat` :
* `placeFields.40_pctThresh.mat` :
* `positionDecodingGLM.cellinfo.mat` : Ignore all of these; we care about the ACTUAL position, not the processing of it. They may hold the x, y, z positions, but more likely they are processing parameters for the analysis.
* `positionDecodingGLM_binnedspace_box.cellinfo.mat` :
* `positionDecodingGLM_binnedspace_box_median.cellinfo.mat` :
* `positionDecodingGLM_binnedspace_box_nozero.cellinfo.mat` :
* `positionDecodingGLM_binnedspace_gauss.cellinfo.mat` :
* `positionDecodingGLM_box.cellinfo.mat` :
* `positionDecodingGLM_gaussian.cellinfo.mat` :
* `positionDecodingMaxCorr_binned_box_median.cellinfo.mat` :
* `referenceFrames.mat` :
* `sessionInfo.mat` : Initialize the CellExplorer sorting extractor in the folder path where this file is. This header was missing from the Yuta file. It will not process the customer
* `spikes.cellinfo.mat` :

## CSV files - To do

# Temporary

In [None]:
import pandas as pd

In [None]:
file_path = Path("/home/jovyan/globus_data/TingleyD/DT8/20170217_216um_1944um_merge/Session 2017-02-17/")
file_path.is_dir()