# Files documentation
This notebook describes in more detail the files that are available. For version control
purposes, this file should be committed without output and only run locally.

In [None]:
from pathlib import Path
from pprint import pprint

import numpy as np
import scipy as sp
import pandas as pd
import h5py
import mat73
from scipy.io import loadmat

# Data loading
Here we load our base path:

In [None]:
data_location = '/home/heberto/globus_data'  # Change this with the right location
data_path = Path(data_location)
author_path = Path("SenzaiY")
base_path = data_path.joinpath(author_path)

This data set is organized with one folder per subject. Let's peek inside `base_path`:

In [None]:
subject_path_dic = {p.stem:p for p in base_path.iterdir() if p.is_dir()}
subject_path_dic.keys()

The output should be something like ['YMV01', 'YMV02', ...], indicating the different subjects.

Inside each subject folder we find one folder per session:

In [None]:
subject = 'YMV04'
sessions_path_dic = {p.stem:p for p in subject_path_dic[subject].iterdir() if p.is_dir()}
sessions_path_dic.keys()

The output should be the session folder names for the chosen subject; for subject `YMV01`, for example, it is `YMV01_170818`.

Session names follow the pattern `{subject}_{date}`.
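As a small illustration of this naming pattern, a session name can be split back into its two parts. This is only a sketch: `parse_session_name` is a hypothetical helper, and it assumes that the date is encoded as `YYMMDD` (as `170818` suggests), which the data set does not state explicitly.

```python
from datetime import date, datetime


def parse_session_name(session_name):
    """Split '{subject}_{date}' into a subject string and a date object."""
    subject, date_string = session_name.rsplit("_", 1)
    # Assumption: the date part is YYMMDD, e.g. 170818 -> 18 August 2017.
    session_date = datetime.strptime(date_string, "%y%m%d").date()
    return subject, session_date


parsed_subject, parsed_date = parse_session_name("YMV01_170818")
print(parsed_subject, parsed_date.isoformat())
```

If the date assumption is wrong, `strptime` raises a `ValueError` for names that do not fit, which makes mismatches easy to spot.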

Let's gather all the available sessions in one dict for convenience:

In [None]:
data_path = Path("/home/heberto/globus_data")
author_path = Path("SenzaiY")
base_path = data_path.joinpath(author_path)

session_list = [
    session
    for subject in base_path.iterdir()
    if subject.is_dir() and "YMV" in subject.name
    for session in subject.iterdir()
]
session_path_dic = {session.stem:session for session in session_list if session.is_dir()}
session_path_dic

The output here should be a dictionary mapping each session name to its path, for all the sessions.

# An overview of the available data
Let's find out which data types are available. Files with the formats `.jpg`, `.png`, `.fig`, `.pdf` and `.svg` are photos, vector graphics or documents; we are not concerned with them, so we remove them (together with `.py` scripts). We focus here on a single session:

In [None]:
not_data_formats = ['.jpg', '.png', '.pdf', '.svg', '.fig', '.py']

subject = 'YMV04'
date = '170907'
session = f"{subject}_{date}"
session_path = session_path_dic[session]

format_list = list({p.suffix for p in session_path.rglob('*') if not p.is_dir()})
format_list.sort()
format_list = [p for p in format_list if p not in not_data_formats]
pprint(format_list, compact=True)

The output should be something like this:

    ['', '.1', '.dat', '.eeg', '.json', '.log', '.mat', '.npy', '.nrs',
    '.pkl', '.tsv', '.xml']

The goal of this document is to explore the data available in the remaining formats, which we do in the following sections. Meanwhile, for orientation purposes, here is a brief description of the available formats and the files associated with them:

1. First we have the format `.1`, which actually covers two formats, `.res.1` and `.clu.1`. These are plain-text files related to the Neuroscope sorting format.

2. Then we have the typical `.dat` and `.eeg` formats, which hold the raw data and the local field potential respectively.

3. The `.json` files seem to be associated with hidden files corresponding to the `phy` format. This is related to spike sorting.

4. The `.log` extension is the log file that corresponds to the `phy` program.

5. There is a variety of `.mat` files.

6. There is a variety of `.npy` files.

7. `.nrs`

8. `.pkl`: pickled files.

9. `.tsv`: tab-separated data.

10. `.xml`: an XML file.
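
To see how many files fall under each of these formats, the suffixes can be counted instead of only listed. The sketch below uses a hypothetical helper, `count_suffixes`, and runs on a throwaway folder with made-up file names so that it is self-contained; in the notebook one would pass `session_path` and `not_data_formats` instead.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_suffixes(root, ignored_suffixes=()):
    """Count the files under root by their last suffix."""
    return Counter(
        p.suffix
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix not in ignored_suffixes
    )


# Toy folder standing in for a session directory (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.dat", "a.eeg", "a.res.1", "a.clu.1", "plot.png"]:
        (Path(tmp) / name).touch()
    suffix_counts = count_suffixes(tmp, ignored_suffixes={".png"})

print(suffix_counts)
```

Note that `Path.suffix` only returns the last suffix, which is why `.res.1` and `.clu.1` both show up under `.1`.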



# Neuroscope res and clu
These files have names of the form `{session}.res.1` and `{session}.clu.1`. Since `Path.stem` drops only the last suffix, `{session}.res` and `{session}.clu` should be the keys of the following dict:

In [None]:
sorting_files_dic = {p.stem:p for p in session_path.rglob('*') if p.suffix == '.1'}
sorting_files_dic.keys()

These are plain-text files and can be opened with pandas as data frames:

In [None]:
clu_file_name = f"{session}.clu"
res_file_name = f"{session}.res"

clu_df = pd.read_csv(sorting_files_dic[clu_file_name], header=None, names=['unit'])
res_df = pd.read_csv(sorting_files_dic[res_file_name], header=None, names=['times'])
res_df.shape, clu_df.shape

The files should have the same shape. As mentioned, they are related to spike sorting: `.clu` contains the units and `.res` the spike times.
We can concatenate them so that each spike time is paired with its unit:

In [None]:
df_sorting = pd.concat([clu_df, res_df], axis=1)
df_sorting.head()
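
One caveat worth checking: in the Neuroscope/Klusters convention, the first line of a `.clu` file typically stores the number of clusters rather than a spike label. If that holds for these files, the `.clu` column has one more row than the `.res` column and the rows are shifted by one. The sketch below, on toy data, drops that leading entry when the lengths differ by exactly one and then groups spike times by unit. `align_clu_and_res` is a hypothetical helper, and whether the header line is present in this data set is an assumption that should be verified.

```python
import numpy as np
import pandas as pd


def align_clu_and_res(clu_values, res_values):
    """Return one row per spike with its unit and its time."""
    clu = np.asarray(clu_values)
    res = np.asarray(res_values)
    # Assumption: an extra leading entry in clu is the cluster-count header.
    if len(clu) == len(res) + 1:
        clu = clu[1:]
    if len(clu) != len(res):
        raise ValueError("clu and res lengths do not match")
    return pd.DataFrame({"unit": clu, "times": res})


# Toy data: the header says 3 clusters, followed by one label per spike.
toy_clu = [3, 0, 1, 2, 1]
toy_res = [10, 25, 40, 55]
toy_df = align_clu_and_res(toy_clu, toy_res)

# Spike train (array of times) for every unit.
spike_trains = {unit: group["times"].to_numpy() for unit, group in toy_df.groupby("unit")}
print(spike_trains)
```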

In [None]:
df_sorting['unit'].max()

This indicates that 

In [None]:
xml_files_dic = {p.stem:p for p in session_path.rglob('*') if p.suffix == '.xml'}
xml_files_dic

In [None]:
from spikeextractors import NeuroscopeSortingExtractor

sorting = NeuroscopeSortingExtractor(
    resfile_path=sorting_files_dic[res_file_name],
    clufile_path=sorting_files_dic[clu_file_name],
    keep_mua_units=False,
)

In [None]:
pprint(sorting.get_unit_ids(), compact=True)

In [None]:
from spikeextractors import PhySortingExtractor
sorting_phy = PhySortingExtractor(folder_path=session_path, exclude_cluster_groups=['noise', 'mua'])

In [None]:
pprint(sorting_phy.get_unit_ids(), compact=True)

In [None]:
len(sorting_phy.get_unit_ids())

#### Let's compare spikes

In [None]:
len(sorting_phy.get_unit_spike_train(unit_id=15)), len(sorting.get_unit_spike_train(unit_id=1))

In [None]:
phy_unit_list = sorting_phy.get_unit_ids()  
spikes_number_phy = [len(sorting_phy.get_unit_spike_train(unit_id=unit_id)) for unit_id in phy_unit_list]

In [None]:
neuroscope_unit_list = sorting.get_unit_ids()
spikes_number_neuro = [len(sorting.get_unit_spike_train(unit_id=unit_id)) for unit_id in neuroscope_unit_list]

In [None]:
spikes_number_phy.sort()
spikes_number_neuro.sort()
[(x, y) for (x, y) in zip(spikes_number_phy, spikes_number_neuro)]

We should use the phy output by default, since the comparison above suggests that both sorters carry the same information once 'noise' and 'mua' are removed.
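
The comparison above relies on reading the zipped pairs by eye, and `zip` silently stops at the shorter list, so a different number of units would go unnoticed. A programmatic check could look like the sketch below, which compares the per-unit spike counts as multisets. Both function names are hypothetical and the numbers are toy values, not results from this data set.

```python
from collections import Counter


def same_spike_counts(counts_a, counts_b):
    """True if both sortings yield the same multiset of per-unit spike counts."""
    return Counter(counts_a) == Counter(counts_b)


def unmatched_spike_counts(counts_a, counts_b):
    """Spike counts that appear in only one of the two sortings."""
    only_a = Counter(counts_a) - Counter(counts_b)
    only_b = Counter(counts_b) - Counter(counts_a)
    return sorted(only_a.elements()), sorted(only_b.elements())


# Toy values standing in for spikes_number_phy and spikes_number_neuro.
toy_phy = [120, 45, 300]
toy_neuro = [300, 120, 45]
counts_match = same_spike_counts(toy_phy, toy_neuro)
leftovers = unmatched_spike_counts([120, 45], [120, 50, 7])
print(counts_match, leftovers)
```

Equal spike counts do not prove that the spike trains themselves are identical; comparing the actual spike times per unit would be a stricter test.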

# Json files

In [None]:
json_files_dic = {p.stem:p for p in session_path.rglob('*') if p.suffix == '.json'}
json_files_dic

These files correspond to some metadata of the `phy` software.

# Mat files
Let's gather all the `.mat` files:

In [None]:
mat_files_dic = {p.stem:p for p in session_path.iterdir() if p.suffix=='.mat'}

As there are many files available, we list them in sorted order:

In [None]:
mat_files_list = list(mat_files_dic.keys())
mat_files_list.sort()
pprint(mat_files_list, compact=True)

We find the following files:

    ['YMV01_170818--InterpDownLFP_params', 'YMV01_170818--InterpUpDownLFP_params',
    'YMV01_170818--LFPbasedLayer', 'YMV01_170818-DownUpAlignedLFP-CSD',
    'YMV01_170818-MonoSynConvClick', 'YMV01_170818-UnitPhaseMod',
    'YMV01_170818.EMGFromLFP.LFP', 'YMV01_170818.SleepScoreLFP.LFP',
    'YMV01_170818.SleepScoreMetrics.LFP', 'YMV01_170818.SleepState.states',
    'YMV01_170818.SlowWaves.events', 'YMV01_170818.StatePlotMaterials',
    'YMV01_170818.cell_metrics.cellinfo', 'YMV01_170818.chanCoords.channelInfo',
    'YMV01_170818.eegstates', 'YMV01_170818.mono_res.cellinfo',
    'YMV01_170818.noiseLevel.channelInfo', 'YMV01_170818.session',
    'YMV01_170818.spikes.cellinfo',
    'YMV01_170818.waveform_filter_metrics.cellinfo', 'YMV01_170818_UnitFeature',
    'YMV01_170818_meanWaveforms', 'YMV01_170818_wavelet_NREM_8_300Hz',
    'YMV01_170818_wavelet_NREM_8_300Hz--Whiten',
    'YMV01_170818_wavelet_REM_8_300Hz', 'YMV01_170818_wavelet_REM_8_300Hz--Whiten',
    'YMV01_170818_wavelet_WAKE_8_300Hz',
    'YMV01_170818_wavelet_WAKE_8_300Hz--Whiten', 'autoclusta_params',
    'cell_metrics', 'chanMap', 'depthsort_parameter_1', 'meanWaveforms', 'rez',
    'session']


Ignore:
* Anything with `param` in the name (e.g. `depthsort_parameter_1`).
* Anything that has `plot` in the name (e.g. `StatePlotMaterials`).
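
These two rules can be applied mechanically to the keys of `mat_files_dic`. The following is a minimal sketch: `keep_mat_file` is a hypothetical helper, and matching the fragments case-insensitively is an assumption made so that names such as `StatePlotMaterials` are caught.

```python
IGNORED_FRAGMENTS = ("param", "plot")  # taken from the two rules above


def keep_mat_file(name, ignored_fragments=IGNORED_FRAGMENTS):
    """Return False for file names containing any ignored fragment."""
    lowered = name.lower()  # assumption: matching ignores case
    return not any(fragment in lowered for fragment in ignored_fragments)


# A few names taken from the listing above.
example_names = [
    "YMV01_170818--InterpDownLFP_params",
    "YMV01_170818.StatePlotMaterials",
    "depthsort_parameter_1",
    "YMV01_170818.session",
    "chanMap",
]
kept_names = [name for name in example_names if keep_mat_file(name)]
print(kept_names)
```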

Temporary note: when we add the phy data, we should exclude noise and mua (multi-unit activity).

"As a general rule if something can be made by the state data that we use, then it should not be included" 


* `noiseLevel.channelInfo`: this can be added as an electrodes property.
* `chanCoords.channelInfo`: this is duplicated information from `chanMap`.
* `SlowWaves.events`: this can be considered processed data involving up-down intervals, and can be included as processed data.
* `DownUpAlignedLFP-CSD`: the LFP aligned to specific events. This is analysis; for our purposes it is a duplication, because we have the base LFP data.
* `LFPbasedLayer`: it is unclear how to assign this to a specific channel, so it is unclear whether this is duplicated data or analysis.
* `SleepScoreLFP.LFP`: indicates the specific channels that were used for sleep detection. We can add this as boolean flags, keyed by channel ID, indicating that a channel was used for sleep detection.
* `SleepScoreMetrics.LFP`: ignore this. It is part of how they made the state classification (REM vs. non-REM), but the information there seems to be obtainable from other pieces of information. It seems that they have the time series for the slow-wave and theta activity; they assigned some ratio value and then thresholded it. That is analysis leading to the final state, which we will add as behavioral data.
* `EMGFromLFP.LFP`: we have seen this in previous work but have not included it. Normally EMG (electromyography) is a separate recording. As this is used for the state classification, we will ignore it.
* `UnitPhaseMod`: this is analysis, so we will ignore it.
* `eegstates`: related to the EMG and the state classifier; we will ignore it.

* `UnitFeature`: contains additional ad-hoc unit properties not covered by `cell_metrics`.


We have three files that correspond to the CellExplorer format / interface:
* `cell_metrics.cellinfo`
* `mono_res.cellinfo`
* `spikes.cellinfo`

To-do:
1) Check if the number of units in CellExplorer is consistent with either phy or Neuroscope.

Let's focus on the last files:
* `session`: contains behavioral info and general information related to the session, such as the experimenter, the species, the strain and timestamps for the creation of the session.
* `cell_metrics`: here we find important information concerning the cells, as well as some of the session information duplicated. This includes information about the specific cells that were identified in the study, such as the number of cells identified, their brain region, their putative type, etc. In general, the fields in this file have a shape determined by the number of cells found, that is, (1, n_cells), where n_cells is the number of cells identified.
* `chanMap`: this seems to hold information about the channels in the electrode. For example, we find both the x and y coordinates of each of the channels. The structure of the fields here is (1, n_channels), where n_channels is 64 for this setup.
* `rez`: contains duplicated information from `chanMap` concerning the location of the electrodes, plus some principal component analysis parameters.

We ignore anything related to waveforms and wavelets, because this information can be computed from the raw data.

For sessions that have a merge file, we will ignore it, as there is only one `.dat` file. We should inspect the `.dat` file to ensure that the recordings are NaN-padded. If not, we might need to investigate splitting the electrical series into segments with different start times according to the merge files.

How to open `.dat` files: the Neuroscope recording extractor allows you to open `.dat` files and look at the traces of a single channel.
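
For a quick look without any extractor, a `.dat` file can also be memory-mapped with NumPy. The sketch below writes a small toy file and reads one channel back. It assumes 16-bit integer samples stored interleaved by channel (a common layout for this kind of file) and the 64 channels mentioned above; the actual bit depth and channel count should be taken from the accompanying `.xml` file rather than from this example.

```python
import tempfile
from pathlib import Path

import numpy as np

n_channels = 64  # channel count mentioned for this setup
n_samples = 100  # toy recording length

with tempfile.TemporaryDirectory() as tmp:
    toy_dat_path = Path(tmp) / "toy.dat"
    # Write a toy recording: rows are time samples, columns are channels.
    toy_data = np.arange(n_samples * n_channels, dtype=np.int16)
    toy_data.reshape(n_samples, n_channels).tofile(toy_dat_path)

    # Assumption: int16 samples interleaved by channel.
    traces = np.memmap(toy_dat_path, dtype=np.int16, mode="r").reshape(-1, n_channels)
    traces_shape = traces.shape
    channel_trace = np.array(traces[:, 3])  # copy a single channel into memory
    del traces  # release the memory map before the folder is removed

print(traces_shape, channel_trace[:3])
```

Memory-mapping avoids loading the whole recording, which matters because raw `.dat` files are usually large.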

In [None]:
file_name = 'cell_metrics'
mat_file_path = mat_files_dic[file_name]
try:
    mat_file = loadmat(mat_file_path)
except NotImplementedError:
    mat_file = mat73.loadmat(mat_file_path, use_attrdict=True)

In [None]:
mat_file['cell_metrics'].keys()

In [None]:
mat_file['cell_metrics']['general']

In [None]:
for file_path in mat_files_dic.values():
    try:
        mat_file = loadmat(file_path)
    except NotImplementedError:
        mat_file = mat73.loadmat(file_path, use_attrdict=True)
    print(file_path.name, type(mat_file))
    print(mat_file.keys())
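
Printing only the top-level keys hides most of the structure of these files. A recursive summary, sketched below, lists nested keys together with the type and (where available) the shape of each entry. `summarize` is a hypothetical helper demonstrated on a toy dictionary; it walks plain dict-like results, so struct arrays as returned by `scipy.io.loadmat` would likely need extra handling.

```python
def summarize(obj, name="root", indent=0):
    """Return lines describing keys, types and shapes of a nested container."""
    pad = "  " * indent
    if isinstance(obj, dict):
        lines = [f"{pad}{name}: dict with {len(obj)} keys"]
        for key, value in obj.items():
            lines.extend(summarize(value, key, indent + 1))
        return lines
    shape = getattr(obj, "shape", None)  # present on NumPy arrays
    if shape is not None:
        return [f"{pad}{name}: {type(obj).__name__} shape={shape}"]
    return [f"{pad}{name}: {type(obj).__name__}"]


# Toy nested dictionary standing in for a loaded .mat file.
toy_mat = {"outer": {"inner": {"count": 3}, "labels": ["a", "b"]}}
summary_lines = summarize(toy_mat, "toy_mat")
print("\n".join(summary_lines))
```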

# Numpy files

In [None]:
numpy_files_dic = {p.stem:p for p in session_path.rglob('*') if p.suffix == '.npy'}
numpy_files_dic.keys()

The output should be something like the following files, depending on the session:

    ['templates_ind', 'spike_times', 'templates', 'pc_feature_ind',
    'whitening_mat_inv', 'similar_templates', 'spike_clusters', 'template_features', 
    'spike_templates', 'template_feature_ind', 'amplitudes', 'channel_map',
    'pc_features', 'channel_positions', 'whitening_mat']

Let's explore the `spike_times` file:

In [None]:
numpy_file = np.load(numpy_files_dic['spike_times'])
numpy_file.shape

In [None]:
numpy_file = np.load(numpy_files_dic['amplitudes'])
numpy_file.shape

In [None]:
numpy_file = np.load(numpy_files_dic['channel_map'])
numpy_file.shape

In [None]:
numpy_file = np.load(numpy_files_dic['spike_clusters'])
np.unique(numpy_file)

In [None]:
numpy_file = np.load(numpy_files_dic['templates'])
numpy_file.shape

# NRS

# Pickled

In [None]:
pickle_files_dic = {p.stem:p for p in session_path.rglob('*') if p.suffix == '.pkl'}
pickle_files_dic.keys()

All of those files are in the hidden folder for the `phy` software.

Opening the files is not working right now. This is not a priority, as it is not clear that we will have to parse these files.

In [None]:
import pickle

file_name = 'spikes_per_cluster'
file_path = pickle_files_dic[file_name]
try:
    with open(str(file_path), 'rb') as f:
        data = pickle.load(f)
except Exception as error:  # report why loading failed instead of hiding it
    print(f"problem opening this file: {error}")

# TSV - Tab-separated file

In [None]:
tsv_files_dic = {p.stem:p for p in session_path.rglob('*') if p.suffix == '.tsv'}
tsv_files_dic.keys()

The only file here is `cluster_group`. It seems related to the spike sorting.

In [None]:
file_name = 'cluster_group'
file_path = tsv_files_dic[file_name]

df_cluster_group = pd.read_csv(file_path, sep='\t')
df_cluster_group.head()

In [None]:
df_cluster_group.groupby(['group'])['cluster_id'].count()

The output of this should be something like this:

    group
    good      53
    mua       13
    noise    460

For session `YMV01_170818` (the only session for subject YMV01), this seems to indicate that there are 53 good clusters. This corresponds with the cells identified in `cell_metrics.mat`. My guess right now is that this indicates which of the clusters in `spike_clusters.npy` correspond to a cell ('good'), which ones are noise, etc.
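
If that guess is right, the labels in `cluster_group` can be used to keep only the spikes that belong to 'good' clusters. The sketch below shows the idea on toy arrays standing in for `cluster_group.tsv`, `spike_clusters.npy` and `spike_times.npy`; the values are made up, and the interpretation of the files is the assumption stated above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for cluster_group.tsv, spike_clusters.npy and spike_times.npy.
toy_cluster_group = pd.DataFrame(
    {"cluster_id": [0, 1, 2, 3], "group": ["good", "noise", "mua", "good"]}
)
toy_spike_clusters = np.array([0, 1, 1, 3, 2, 0, 3])
toy_spike_times = np.array([5, 9, 14, 20, 31, 44, 58])

# Cluster ids labelled 'good' in the table.
is_good = toy_cluster_group["group"] == "good"
good_ids = toy_cluster_group.loc[is_good, "cluster_id"].to_numpy()

# Keep only the spikes whose cluster is in the 'good' set.
good_mask = np.isin(toy_spike_clusters, good_ids)
good_spike_times = toy_spike_times[good_mask]
print(good_ids, good_spike_times)
```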

# XML
A file that pairs with the `.dat` and `.eeg` and contains all the header information. This is processed 

In [None]:
session_path