# Documentation
The purpose of this notebook is to describe the files available in the data repository and their structure. 

For version control purposes this file should be commited without output and only run locally.

In [1]:
from pathlib import Path
from pprint import pprint


import numpy as np
import pandas as pd

In [2]:
def read_matlab_file(file_path):
    file_path = str(file_path)

    try:
        mat_file = loadmat_mat4py(str(file_path))
        mat_file['read'] = 'mat4py'
    except:
        try:
            mat_file = loadmat_mat73(file_path)
            mat_file['read'] = 'mat73'
        except:
            mat_file = loadmat_scipy(file_path)
            mat_file['read'] = 'scipy'
    return mat_file

# Data directory structure

In [3]:
base_path = Path('/shared/catalystneuro/Buzsaki/TingleyD')  # Change this with the right location

## Subjects 
First, let's get the different subjects and filter the ones that we decide shoule be included in this transformation

In [4]:
subjects_in_study = ['DT2', 'DT5', 'DT7', 'DT8', 'DT9']
subject_path_dic = {p.stem:p for p in base_path.iterdir() if p.is_dir() and p.name in subjects_in_study}
pprint(subject_path_dic.keys())

dict_keys(['DT8', 'DT2', 'DT9', 'DT5', 'DT7'])


The results of this should be all the subjects in this paper
['DT2', 'DT5', 'DT7', 'DT8', 'DT9'] and the dictionary values are the paths for each of the subjects folder. 

Now, it happens that this particular project has a large amount of data. We calculate the folder of each directory

In [21]:
subject_sizes_pairs = [(subject, sum([p.stat().st_size for p in path.rglob('*')])) for (subject, path) in subject_path_dic.items()]

for subject, size in subject_sizes_pairs:
    print(f"{subject} directory size is {size / (1000 ** 4) :2.2f} TB and {size / (1025 ** 4) :2.2f} TiB")

The code above should output this:

    DT8 directory size is 0.54 TB and 0.49 TiB
    DT2 directory size is 4.18 TB and 3.79 TiB
    DT9 directory size is 0.83 TB and 0.75 TiB
    DT5 directory size is 1.83 TB and 1.66 TiB
    DT7 directory size is 1.08 TB and 0.98 TiB

The directory associated to subject `DT2` for example is 4.18 terabytes.

## Sessions
Now let's see how the sessions are structure. For this dataset we have that each subject contains multiple files, tentatively belonging to different seessions and some top level files

### Session folders

In [37]:
subject = "DT8" 
sessions_path_in_subject = {p.stem:p for p in subject_path_dic[subject].iterdir() if p.is_dir()}
session_names_list = list(sessions_path_in_subject.keys())
session_names_list.sort()
pprint(session_names_list, compact=False)

['DT2_rPPC_rCCG_1226um_290um_20160211_160213_120712',
 'DT2_rPPC_rCCG_144um_72um_20160207_160207_170623',
 'DT2_rPPC_rCCG_2016um_290um_20160214_160214_105540',
 'DT2_rPPC_rCCG_218um_218um_20160208_160208_142910',
 'DT2_rPPC_rCCG_3036um_362um_20160215_160215_122710',
 'DT2_rPPC_rCCG_3108um_434um_20160215_160215_161846',
 'DT2_rPPC_rCCG_3324um_1000um_20160217_160217_103121',
 'DT2_rPPC_rCCG_3324um_1072um_20160218_160218_133625',
 'DT2_rPPC_rCCG_3324um_1144um_20160219_160219_130435',
 'DT2_rPPC_rCCG_3324um_650um_20160216_160216_132620',
 'DT2_rPPC_rCCG_3324um_794um_20160216_160216_163304',
 'DT2_rPPC_rCCG_3396um_1180um_20160220_160220_141839',
 'DT2_rPPC_rCCG_3396um_1180um_20160221_160221_135947',
 'DT2_rPPC_rCCG_3468um_1216um_20160222_160222_120902',
 'DT2_rPPC_rCCG_3468um_1216um_20160223_160223_121321',
 'DT2_rPPC_rCCG_3468um_1216um_20160223_160223_173937',
 'DT2_rPPC_rCCG_3468um_1216um_20160224_160224_133121',
 'DT2_rPPC_rCCG_3468um_1216um_20160225_160225_130701',
 'DT2_rPPC_rCCG_3468u

The results should be something like:
    
    ['20170124_0um_0um_170124_151501',
     '20170125_0um_0um_merge',
     '20170126_0um_0um_merge',
     '20170127_0um_0um_170127_121658',
     '20170128_0um_72um_merge',
     '20170129_0um_72um_170129_130356',
     '20170130_0um_144um_170130_143246',
     '20170131_72um_432um_merge',
     '20170201_72um_432um_merge',
     '20170202_72um_576um_170202_092033',
     '20170203_72um_648um_170203_105608',
     '20170206_72um_1872um_merge',
     '20170207_144um_1944um_merge',
     '20170208_144um_1944um_merge',
     '20170209_144um_1944um_merge',
     '20170210_144um_1944um_170210_112254',
     '20170211_144um_1944um_merge',
     '20170212_144um_1944um_170212_154634',
     '20170213_144um_1944um_merge',
     '20170214_216um_1944um_170214_102804',
     '20170215_216um_1944um_170215_140739',
     '20170216_216um_1944um_merge',
     '20170217_216um_1944um_merge',
     '20170218_216um_1944um_170218_114724',
     '20170220_216um_1944um_170220_192456',
     '20170221_324um_1944um_170221_101929',
     '20170222_324um_2088um_170222_113940',
     '20170223_324um_2088um_170223_103741',
     '20170224_324um_2088um_170224_101710',
     '20170227_324um_2088um_170227_105926',
     '20170228_324um_2088um_merge',
     '20170229_324um_2088um_merge',
     '20170303_324um_2088um_170303_114804',
     '20170305_468um_2088um_170305_135233',
     '20170306_468um_2088um_170306_113628',
     '20170307_540um_2088um_merge',
     '20170308_612um_2088um_merge',
     '20170309_612um_2088um_170309_093245',
     '20170310_612um_2088um_170310_140825',
     '20170311_684um_2088um_170311_134350',
     '20170316_828um_2088um_170316_093559',
     '20170318_828um_2088um_170318_201151',
     '20170320_828um_2088um_170320_200023']

Which seem to be the different recording sessions with the following format:
 * `{date}_{x}um_{date_time |merge}`
 
 The only subject that does not adapt to this format is subject `DT2`. Here the files have a different format:
 
     ['DT2_rPPC_rCCG_1226um_290um_20160211_160213_120712',
     'DT2_rPPC_rCCG_144um_72um_20160207_160207_170623',
     'DT2_rPPC_rCCG_2016um_290um_20160214_160214_105540',
     'DT2_rPPC_rCCG_218um_218um_20160208_160208_142910',
     'DT2_rPPC_rCCG_3036um_362um_20160215_160215_122710',
     'DT2_rPPC_rCCG_3108um_434um_20160215_160215_161846',
     'DT2_rPPC_rCCG_3324um_1000um_20160217_160217_103121',
     'DT2_rPPC_rCCG_3324um_1072um_20160218_160218_133625',
     'DT2_rPPC_rCCG_3324um_1144um_20160219_160219_130435',
     'DT2_rPPC_rCCG_3324um_650um_20160216_160216_132620',
     'DT2_rPPC_rCCG_3324um_794um_20160216_160216_163304',
     'DT2_rPPC_rCCG_3396um_1180um_20160220_160220_141839',
     'DT2_rPPC_rCCG_3396um_1180um_20160221_160221_135947',
     'DT2_rPPC_rCCG_3468um_1216um_20160222_160222_120902',
     'DT2_rPPC_rCCG_3468um_1216um_20160223_160223_121321',
     'DT2_rPPC_rCCG_3468um_1216um_20160223_160223_173937',
     'DT2_rPPC_rCCG_3468um_1216um_20160224_160224_133121',
     'DT2_rPPC_rCCG_3468um_1216um_20160225_160225_130701',
     'DT2_rPPC_rCCG_3468um_1252um_20160226_160226_124132',
     'DT2_rPPC_rCCG_3540um_1288um_20160227_160227_121226',
     'DT2_rPPC_rCCG_3540um_1288um_20160228_160228_135001',
     'DT2_rPPC_rCCG_3612um_1360um_20160302_160302_134841',
     'DT2_rPPC_rCCG_3612um_1360um_20160303_160303_084915',
     'DT2_rPPC_rCCG_3612um_1360um_20160304_160304_150220',
     'DT2_rPPC_rCCG_3612um_1360um_20160305_160305_141507',
     'DT2_rPPC_rCCG_362um_218um_20160209_160209_183610',
     'DT2_rPPC_rCCG_650um_218um_20160210_160210_120100',
     'DT2_rPPC_rCCG_794um_290um_20160210_160210_225301',
     'DT2_rPPC_rCCG_938um_290um_20160212_160212_122801',

 It seems that the general format is `DT2_rPPC_rCCG_{x}_um_{y}_um_{date}_{time_stamp}_{time_stamp}`. It is not clear yet to me what is the meaning of the second timestamp. Possibility is a notation for an interval and the second one indicates the duration. That is, a (start, duration) interval. 

## Structure of the session directory

Now let's look inside the directory of those directories inside the session path. 
Let's show some examples taken from the list above for subject DT8

In [27]:
session_name = "20170124_0um_0um_170124_151501" 
path_of_dirs_in_session = {p.name:p for p in sessions_path_in_subject[session_name].iterdir() if p.is_dir()}
directories_in_session = list(path_of_dirs_in_session.keys())
directories_in_session.sort(key=lambda x:int(x))
pprint(directories_in_session)

KeyError: '20170124_0um_0um_170124_151501'

In [None]:
session_name = "20170224_324um_2088um_170224_101710"
path_of_dirs_in_session = {p.name:p for p in sessions_path_in_subject[session_name].iterdir() if p.is_dir()}
directories_in_session = list(path_of_dirs_in_session.keys())
directories_in_session.sort()
pprint(directories_in_session)

In [None]:
session_name = "20170307_540um_2088um_merge" 
path_of_dirs_in_session = {p.name:p for p in sessions_path_in_subject[session_name].iterdir() if p.is_dir()}
directories_in_session = list(path_of_dirs_in_session.keys())
directories_in_session.sort()
pprint(directories_in_session)

In [None]:
session_name ="20170308_612um_2088um_merge"  
path_of_dirs_in_session = {p.name:p for p in sessions_path_in_subject[session_name].iterdir() if p.is_dir()}
directories_in_session = list(path_of_dirs_in_session.keys())
directories_in_session.sort()
pprint(directories_in_session)

The output of the last example should be something like this:

    ['1',
     '2',
     '3',
     '4',
     '5',
     '6',
     '7',
     '8',
     '9',
     '10',
     '20170308_612um_2088um_170308_115428', # these are available on merge sessions
     '20170308_612um_2088um_170308_144851', # there are available on merge sessions
     'Session 2017-03-08', # not available in every session
     'StateScoreFigures'   # not available in every session
     ]

The general structure seem to be the following. Each of the sessions contains ten folders numerated from 1 to 10. Merge sessions contain further folders wher the specific time-stamps of the merged sessions can be found and finally sometimes there is a folder with the naming "Session {date}" 

Importantly, while subject `DT5` has the same number, 10, of numbered folders the rest of the subject do not. `DT7` has 7 folders whereas `DT9` and `DT2` have 12 and 13 folders respectively. 

### Files in each of the sessions

Here a brief description of what type of files are each of the levels. A more extended discussion can be vound in the Data section below

In the session directory top-most level the following formats can be found:
* evt
* clu
* fet
* pos
* res
* lfp
* spk
* nrs
* xml
* mat

This seems to be where the main data for the session is, plus the mat files whereas both processing and behavioral might be.

In the directories that are named with numbers the following formats can be found:
* kwd 
* kwik
* log
* clu
* fet
* fmask
* klg
* prm
* kvlog

This seems the results of `klusta` suite for spike sorting and analysis

In merge session when merge directories are available the following files can be found on them:
* info.rhd
* amplifiers.nrs
* amplifiers.xlm

Finally, when the directory "Sesion {date}" is available, it contains csv files with a time stamp as title and some `.tak` files which seem to be audio

## Subject top-level files
Moreover, we also find some files in the topmost directory. We describe them in more detail in the data section of this document

In [None]:
files_not_in_session = {p.name:p for p in subject_path_dic[subject].iterdir() if p.is_file()}
files_names = list(files_not_in_session.keys())
files_names.sort()
pprint(files_names)

The result should be something like this:

    ['.DS_Store',
     'DT8_current_dataset_parietal_theta.mat',
     'behav.mat',
     'groupRecordings.m']
     
 They seem to be behavioral files and we will come back to them when we discuss the specific files below

# Data 

## An overview of the available data
We discuss now what formats are present before describing the files in more detail.

In [15]:
not_data_formats = [".jpg", ".png", ".pdf", ".svg", ".fig", ".py", ".m", '.py']

format_list = [p.suffixes for p in base_path.rglob("*") if p.is_file()]
format_list = list({_ for suffixes in format_list for _ in suffixes}) 
format_list = [_ for _ in format_list if len(_)==4 and " " not in _]  # Only standar three letter formats.
format_list = [_ for _ in format_list if not any(map(str.isdigit, _))]  # Remove numbers 
format_list = [_ for _ in format_list if _ not in not_data_formats] # remove data formts
format_list.sort(key=lambda x : x.lower())  # sort by lower case

pprint(format_list, compact=True)

['.alg', '.avi', '.cat', '.clu', '.com', '.csv', '.dat', '.EMG', '.evt', '.fet',
 '.klg', '.kwd', '.kwx', '.led', '.lfp', '.LFP', '.lnk', '.log', '.LOG', '.low',
 '.mat', '.nrs', '.out', '.pos', '.prb', '.prm', '.raw', '.res', '.rhd', '.rLS',
 '.rls', '.spk', '.tak', '.txt', '.url', '.WAV', '.xml']


The output of this should give us the formats available for this subject and should look something like this:

    ['.alg', '.avi', '.cat', '.clu', '.com', '.csv', '.dat', '.EMG', '.evt', '.fet',
     '.klg', '.kwd', '.kwx', '.led', '.lfp', '.LFP', '.lnk', '.log', '.LOG', '.low',
     '.mat', '.nrs', '.out', '.pos', '.prb', '.prm', '.raw', '.res', '.rhd', '.rLS',
     '.rls', '.spk', '.tak', '.txt', '.url', '.WAV', '.xml']
     

Descripton of the files

* `.alg` :
* `.avi` : video format.
* `.cat` :
* `.clu` : usually associated with the neuroscope sorting format
* `.com` :
* `.csv` : comma separated values.
* `.dat` : usually the raw data.
* `.EMG` : 
* `.evt` :
* `.fet` :
* `.klg` : files related to the klusta spike sorting suit.
* `.kwd` : files related to the klusta spike sorting suit.
* `.kwx` : files related to the klusta spike sorting suit.
* `.led` :
* `.lfp` : local field potential data.
* `.LFP` : local field potential data
* `.lnk` :
* `.log` : log files, usually not useful. Sometimes associated with the Phy sorting program.
* `.LOG` : log files, usually not useful.
* `.low` : 
* `.mat` : data structures form matlab
* `.nrs` :
* `.out` :
* `.pos` :
* `.prb` :
* `.prm` : 
* `.raw` :
* `.res` : usually associated with the neuroscope sorting format
* `.rhd` :
* `.rLS` :
* `.rls` :
* `.spk` : 
* `.tak` : 
* `.txt` : plain text files
* `.url` : a url address 
* `.WAV` : audio format.
* `.xml` : xml files. 

## Dat file
These files are usually raw data. However, for this project it seems that most files do not have the raw data and only have the lfp.

In [33]:
format_to_search = ".dat"
path_and_size_pairs = [('/'.join(str(p).split('/')[5:]), f"{p.stat().st_size/1000**2:2.2f} MB") for p in base_path.rglob(f"*{format_to_search}") if p.is_file()]
pprint(path_and_size_pairs[:10], compact=True)

[('DT2/DT2_rPPC_rCCG_3612um_1360um_20160302_160302_134841/DT2_rPPC_rCCG_3612um_1360um_20160302_160302_134841.dat',
  '76435.85 MB'),
 ('DT2/DT2_rPPC_rCCG_1226um_290um_20160211_160213_120712/analogin.dat',
  '579.90 MB'),
 ('DT2/DT2_rPPC_rCCG_1226um_290um_20160211_160213_120712/supply.dat',
  '1159.80 MB'),
 ('DT2/DT2_rPPC_rCCG_1226um_290um_20160211_160213_120712/time.dat',
  '1159.80 MB'),
 ('DT2/DT2_rPPC_rCCG_1226um_290um_20160211_160213_120712/DT2_rPPC_rCCG_1226um_290um_20160211_160213_120712.dat',
  '74226.95 MB'),
 ('DT2/DT2_rPPC_rCCG_1226um_290um_20160211_160213_120712/auxiliary.dat',
  '3479.39 MB'),
 ('DT2/z_Intruder_test_160304_152951/z_Intruder_test_160304_152951.dat',
  '51600.94 MB'),
 ('DT2/z_Intruder_test_160304_152951/extras/analogin.dat', '403.13 MB'),
 ('DT2/z_Intruder_test_160304_152951/extras/supply.dat', '806.26 MB'),
 ('DT2/z_Intruder_test_160304_152951/extras/time.dat', '806.26 MB')]


The output should be something like this:

    [('DT2/DT2_rPPC_rCCG_3612um_1360um_20160302_160302_134841/DT2_rPPC_rCCG_3612um_1360um_20160302_160302_134841.dat',
      '76435.85 MB'),
     ('DT2/DT2_rPPC_rCCG_1226um_290um_20160211_160213_120712/analogin.dat',
      '579.90 MB'),
     ('DT2/DT2_rPPC_rCCG_1226um_290um_20160211_160213_120712/supply.dat',
      '1159.80 MB'),
     ('DT2/DT2_rPPC_rCCG_1226um_290um_20160211_160213_120712/time.dat',
      '1159.80 MB'),
     ('DT2/DT2_rPPC_rCCG_1226um_290um_20160211_160213_120712/DT2_rPPC_rCCG_1226um_290um_20160211_160213_120712.dat',
      '74226.95 MB'),
     ('DT2/DT2_rPPC_rCCG_1226um_290um_20160211_160213_120712/auxiliary.dat',
      '3479.39 MB'),
     ('DT2/z_Intruder_test_160304_152951/z_Intruder_test_160304_152951.dat',
      '51600.94 MB'),
     ('DT2/z_Intruder_test_160304_152951/extras/analogin.dat', '403.13 MB'),
     ('DT2/z_Intruder_test_160304_152951/extras/supply.dat', '806.26 MB'),
     ('DT2/z_Intruder_test_160304_152951/extras/time.dat', '806.26 MB')]

Exploring the dat files above, we see that not all sessions have a .dat file, and that there are some .dat files that do not seem to correspond to sessions.

## Matlab files
As the matlab files are the ones usually associated with behavioral data they will be describe first. We will see what files are available on a sesion.
Here we use a session with merger to have a picture of the most complicated case

In [35]:
session_name = "20170308_612um_2088um_merge"  
mat_files_paths = {p.name:p for p in sessions_path_in_subject[session_name].rglob('*') if not p.is_dir() and 'mat' in ''.join(p.suffixes)}
mat_files_names = list(mat_files_paths.keys())
mat_files_names.sort()
pprint(mat_files_names)

NameError: name 'sessions_path_in_subject' is not defined

The file above should produce something like:

    ['20170308_612um_2088um_merge.behavior.mat',
     '20170308_612um_2088um_merge.firingMaps.cellinfo.mat',
     '20170308_612um_2088um_merge.isolationMetrics.cellinfo.mat',
     '20170308_612um_2088um_merge.olypherInfo.cellinfo.mat',
     '20170308_612um_2088um_merge.phaseMaps.cellinfo.mat',
     '20170308_612um_2088um_merge.placeFields.01_pctThresh.mat',
     '20170308_612um_2088um_merge.placeFields.05_pctThresh.mat',
     '20170308_612um_2088um_merge.placeFields.10_pctThresh.mat',
     '20170308_612um_2088um_merge.placeFields.20_pctThresh.mat',
     '20170308_612um_2088um_merge.placeFields.40_pctThresh.mat',
     '20170308_612um_2088um_merge.positionDecodingGLM_binnedspace_box.cellinfo.mat',
     '20170308_612um_2088um_merge.positionDecodingGLM_binnedspace_box_median.cellinfo.mat',
     '20170308_612um_2088um_merge.positionDecodingMaxCorr_binned_box_median.cellinfo.mat',
     '20170308_612um_2088um_merge.sessionInfo.mat',
     '20170308_612um_2088um_merge.spikes.cellinfo.mat']
     
There are similar files in every section that look like {date}_{x}um_{x}.{name}.mat. We now look at all the files available for the subject disambiguation everything but the name at the end to see what is the variety of mat files available.

In [None]:
mat_files_names_simple = [name for name in mat_files_names if len(name.split(".")) == 2]
mat_files_names_composed = ['.'.join(name.split('.')[1:]) for name in mat_files_names if len(name.split(".")) != 2]

mat_files_names_formatted = list(set( mat_files_names_composed))
mat_files_names_formatted += mat_files_names_simple
mat_files_names_formatted.sort()
pprint(mat_files_names_formatted)

This should produce something that looks like this:

    ['DT8_current_dataset_parietal_theta.mat',
    'PhaseLockingData.cellinfo.mat',
    'assembliesCrossRegionData.mat',
    'assembliesCrossRegionData_w_theta_sin_cos_coord_vel.mat',
    'assembliesCrossRegion_split_w_theta.mat',
    'assembliesWithinRegionData_w_theta_sin_cos_coord_vel.mat',
    'behav.mat',
    'behav_temp.mat',
    'behavior.mat',
    'firingMaps.cellinfo.mat',
    'isolationMetrics.cellinfo.mat',
    'ls_RipplePhaseModulation.cellinfo.mat',
    'meta.mat',
    'noiseCorrs.mat',
    'olypherInfo.cellinfo.mat',g. Most likely we will ignore them.
    `spk`
    'olypherInfo_w_disc.cellinfo.mat',
    'phaseMaps.cellinfo.mat',
    'placeFields.01_pctThresh.mat',
    'placeFields.05_pctThresh.mat',
    'placeFields.10_pctThresh.mat',
    'placeFields.20_pctThresh.mat',
    'placeFields.40_pctThresh.mat',
    'positionDecodingGLM.cellinfo.mat',
    'positionDecodingGLM_binnedspace_box.cellinfo.mat',
    'positionDecodingGLM_binnedspace_box_median.cellinfo.mat',
    'positionDecodingGLM_binnedspace_box_nozero.cellinfo.mat',
    'positionDecodingGLM_binnedspace_gauss.cellinfo.mat',
    'positionDecodingGLM_box.cellinfo.mat',
    'positionDecodingGLM_gaussian.cellinfo.mat',
    'positionDecodingMaxCorr_binned_box_median.cellinfo.mat',
    'referenceFrames.mat',
    'sessionInfo.mat',
    'spikes.cellinfo.mat']
    
Which gives us an idea of all the files available. I will describe them in more details

### Description overview (To-do)

* `DT8_current_dataset_parietal_theta.mat` :
* `PhaseLockingData.cellinfo.mat` : 
* `assembliesCrossRegionData.mat` :  <- Check this
* `assembliesCrossRegionData_w_theta_sin_cos_coord_vel.mat` :
* `assembliesCrossRegion_split_w_theta.mat` :
* `assembliesWithinRegionData_w_theta_sin_cos_coord_vel.mat` :
* `behav.mat` : <- Check this
* `behav_temp.mat` :
* `behavior.mat` :
* `firingMaps.cellinfo.mat` :
* `isolationMetrics.cellinfo.mat` :
* `ls_RipplePhaseModulation.cellinfo.mat` :
* `meta.mat` :
* `noiseCorrs.mat` : <Ignore this
* `olypherInfo.cellinfo.mat` :
* `olypherInfo_w_disc.cellinfo.mat` :
* `phaseMaps.cellinfo.mat` :
* `placeFields.01_pctThresh.mat` :
* `placeFields.05_pctThresh.mat` :
* `placeFields.10_pctThresh.mat` :
* `placeFields.20_pctThresh.mat` :
* `placeFields.40_pctThresh.mat` :
* `positionDecodingGLM.cellinfo.mat` : Ignore all of these guys -we care about the ACTUAL position not the processing of it. Maybe the od have the x,y z, positions but more likely they are processing parameters for the anlalysis. 
* `positionDecodingGLM_binnedspace_box.cellinfo.mat` :
* `positionDecodingGLM_binnedspace_box_median.cellinfo.mat` :
* `positionDecodingGLM_binnedspace_box_nozero.cellinfo.mat` :
* `positionDecodingGLM_binnedspace_gauss.cellinfo.mat` :
* `positionDecodingGLM_box.cellinfo.mat` :
* `positionDecodingGLM_gaussian.cellinfo.mat` :
* `positionDecodingMaxCorr_binned_box_median.cellinfo.mat` :
* `referenceFrames.mat` :
* `sessionInfo.mat` : Initial the cell explorer sorter extrator in the folder paht where this file is. This header was missing from the Yuta file. It will not process the customer 
* `spikes.cellinfo.mat` :

## CSV files - To do

In [38]:
format_to_search = ".csv"
path_and_size_pairs = [(p.stem, f"{p.stat().st_size/1000**2:2.2f} MB") for p in base_path.rglob(f"*{format_to_search}") if p.is_file()]
pprint(path_and_size_pairs[:10], compact=True)

[('Take 2017-02-20 07.26.44 PM', '24.45 MB'),
 ('Take 2017-02-01 05.02.41 PM', '39.27 MB'),
 ('Take 2017-03-07 03.11.44 PM_batch', '14.27 MB'),
 ('Take 2017-03-07 02.29.11 PM_batch', '30.21 MB'),
 ('Take 2017-03-07 03.11.44 PM', '13.43 MB'),
 ('Take 2017-03-07 02.29.11 PM', '28.50 MB'),
 ('Take 2017-02-17 01.48.32 PM', '64.16 MB'),
 ('Take 2017-02-08 01.04.32 PM', '25.19 MB'),
 ('Take 2017-02-08 01.47.29 PM', '15.13 MB'),
 ('Take 2017-02-08 01.42.33 PM', '1.76 MB')]


The output of the file above should be something like this:

    [('Take 2017-02-20 07.26.44 PM', '24.45 MB'),
     ('Take 2017-02-01 05.02.41 PM', '39.27 MB'),
     ('Take 2017-03-07 03.11.44 PM_batch', '14.27 MB'),
     ('Take 2017-03-07 02.29.11 PM_batch', '30.21 MB'),
     ('Take 2017-03-07 03.11.44 PM', '13.43 MB'),
     ('Take 2017-03-07 02.29.11 PM', '28.50 MB'),
     ('Take 2017-02-17 01.48.32 PM', '64.16 MB'),
     ('Take 2017-02-08 01.04.32 PM', '25.19 MB'),
     ('Take 2017-02-08 01.47.29 PM', '15.13 MB'),
     ('Take 2017-02-08 01.42.33 PM', '1.76 MB')]
     
 Those files are very similar among themselves and as discussed previously most of them are located in the folders named "Session {date}" 
 
 Let's open a random path to see what file to see what kind of information they have inside.

In [43]:
format_to_search = ".csv"
path_to_csv_list = [p for p in base_path.rglob(f"*{format_to_search}") if p.is_file()]


In [48]:
np.random.seed(0)
p = np.random.choice(path_to_csv_list)
p

PosixPath('/shared/catalystneuro/Buzsaki/TingleyD/DT9/20170522_900um_936um_170522_151132/Session 2017-05-22/Take 2017-05-22 03.37.41 PM_batch.csv')

This should be a path object indicating the complete path to the file:

`'/shared/catalystneuro/Buzsaki/TingleyD/DT9/20170522_900um_936um_170522_151132/Session 2017-05-22/Take 2017-05-22 03.37.41 PM_batch.csv'`

In [55]:
df_csv = pd.read_csv(p)
df_csv.head()

Unnamed: 0,0,0.000000,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9
0,1,0.008333,-0.120431,0.87929,0.033715,0.459574,328.160767,863.285339,622.358459,0.567362
1,2,0.016667,-0.117984,0.881127,0.036669,0.456455,328.562561,862.566406,622.785278,0.532845
2,3,0.025,-0.114739,0.882404,0.0384,0.454669,328.857544,861.887329,623.203125,0.566924
3,4,0.033333,-0.111382,0.883373,0.039152,0.453557,328.82489,861.194885,623.96759,0.546953
4,5,0.041667,-0.109384,0.883821,0.042722,0.452846,328.706787,860.477905,624.920898,0.542314


The initial file does seem to have some columns. We can re-run the following cells to get an idea of how these files look like:

In [65]:
p = np.random.choice(path_to_csv_list)
print(p)
df_csv = pd.read_csv(p)
df_csv.head()

/shared/catalystneuro/Buzsaki/TingleyD/DT7/20170327_864um_360um_170327_115803/Session 2017-03-27/Take 2017-03-27 03.45.19 PM_batch.csv


Unnamed: 0,0,0.000000,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9
0,1,0.008333,0.221911,-0.215985,-0.112192,-0.944203,187.646439,848.022705,41.52462,0.34271
1,2,0.016667,0.217375,-0.223007,-0.118962,-0.942796,188.166672,848.171204,41.057453,0.302789
2,3,0.025,0.216126,-0.225629,-0.12217,-0.942049,188.228455,848.195984,41.224876,0.287108
3,4,0.033333,0.214703,-0.223274,-0.124896,-0.942578,187.707886,848.144836,41.479603,0.717094
4,5,0.041667,0.210819,-0.215694,-0.128672,-0.944709,187.569504,847.715881,41.604931,0.24515


We find that some files seem to be list of positions. The following file is one of those examples

In [67]:
file_path = Path("/home/jovyan/globus_data/TingleyD/DT8/20170217_216um_1944um_merge/Session 2017-02-17/Take 2017-02-17 01.48.32 PM.csv")
df_csv = pd.read_csv(file_path)
df_csv.head()

Unnamed: 0,Format Version,1.21,Take Name,Take 2017-02-17 01.48.32 PM,Capture Frame Rate,120.000000,Export Frame Rate,120.000000.1,Capture Start Time,2017-02-17 01.48.32 PM,Total Frames in Take,679616,Total Exported Frames,679616.1,Rotation Type,Quaternion,Length Units,Meters,Coordinate Space,Global
0,,,Rigid Body,Rigid Body,Rigid Body,Rigid Body,Rigid Body,Rigid Body,Rigid Body,Rigid Body,,,,,,,,,,
1,,,RigidBody 2,RigidBody 2,RigidBody 2,RigidBody 2,RigidBody 2,RigidBody 2,RigidBody 2,RigidBody 2,,,,,,,,,,
2,,,8392B20CF49E35732,8392B20CF49E35732,8392B20CF49E35732,8392B20CF49E35732,8392B20CF49E35732,8392B20CF49E35732,8392B20CF49E35732,8392B20CF49E35732,,,,,,,,,,
3,,,Rotation,Rotation,Rotation,Rotation,Position,Position,Position,Error Per Marker,,,,,,,,,,
4,Frame,Time,X,Y,Z,W,X,Y,Z,,,,,,,,,,,
