# Data pre-processing
Pre-processing of ephys (voltage + timestamps) and position data.
The pre-processing of ephys data includes:

    (1). Conversion of the data in the .dat files into voltage;
    (2). Removal of the samples preceding the first video frame index (needed only for chunk=0);
    (3). Subsampling of the data to 5 kHz;
    (4). Saving the data, organized by tetrode channels, into .csv files.
    
The pre-processing of ephys timestamps includes:

    (1). Removal of the samples preceding the first video frame index (needed only for chunk=0);
    (2). Conversion to seconds;
    (3). Subsampling of the timestamps to 5 kHz.
    
  
The pre-processing of position data includes:

    (1). Visual confirmation of each run (numbered);
    (2). Creation of a run specifications file with run information (error/correct, sample/test, trial number; described below). The file is created outside the notebook and saved as a .csv file;
    (3). Collection of ROI limits using the first video frame (start ROI, choice point, corners and reward ports);
    (4). Combining the run specifications with the position data and saving everything into a .csv file.
    

#### Imports

In [12]:
import pandas as pd
import os
import cv2
import numpy as np
import seaborn as sns
from tqdm import tqdm
import glob
import warnings
import matplotlib.pyplot as plt
from matplotlib.widgets  import RectangleSelector
from ephys_utils import *
from position_utils import * 

warnings.filterwarnings('ignore')

#### Definition of variables

In [13]:
sample_rate = 30000  # in Hz
nr_channels = 128
chunk_size = 900000  
sub_sampling_factor = 6
# You can check the list of common average tts in the dataset preparation file  
common_average_tts = [1,5,12,16,22,30] 

code = '20180504094832'
path="/VOLUMES/Biló/EphysData/NAPOLEAO_DNMP22_20trials_20180504094832"
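
The quantities that follow from these variables can be checked with a short sketch. It only uses the values defined above; the names `subsampled_rate`, `chunk_duration_s` and `rows_per_chunk` are introduced here for illustration and are not part of the notebook.

```python
sample_rate = 30000        # Hz, as defined above
chunk_size = 900000        # samples per channel read in one chunk
sub_sampling_factor = 6    # keep every 6th sample

# Effective sampling rate after keeping every 6th sample
subsampled_rate = sample_rate / sub_sampling_factor

# Duration covered by one full chunk, in seconds
chunk_duration_s = chunk_size / sample_rate

# Rows left in a full chunk after slicing with [::sub_sampling_factor]
rows_per_chunk = len(range(0, chunk_size, sub_sampling_factor))

print(subsampled_rate, chunk_duration_s, rows_per_chunk)  # 5000.0 30.0 150000
```

The first chunk will contain fewer rows than this, because the samples before the first video frame are clipped before subsampling.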

## Ephys data pre-processing
Includes the pre-processing of both voltage and timestamps.

##### Get data file names

In [14]:
amplifier_file, ttl_input_file, timestamp_file = get_data_filenames(path)

Amplifier file: 20180504094832_b_amplifier.dat
TTL input file: 20180504094832_b_ttlinput.dat
TTL input file: 20180504094832_b_tstamp_rhd2000.dat


##### Calculate number of blocks

In [15]:
nr_blocks, leftover=calculate_nr_of_blocks(amplifier_file, nr_channels, chunk_size)
all_chunk_sizes=[chunk_size]*nr_blocks#+[chunk_size*leftover]
#Print number of blocks
print('Number of blocks to process: {}'.format(len(all_chunk_sizes)))

Number of samples per channel: 74620920.0
Number of blocks to process: 82
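
As a rough check of the numbers printed above, the block count can be reproduced from the file size. This is a minimal sketch, not the actual `calculate_nr_of_blocks` from `ephys_utils`: the function `count_blocks` is hypothetical, and it assumes 2 bytes per sample (16-bit data) with all channels interleaved in a single file. The file size below is back-computed from the printed sample count under that assumption; on a real file it would come from `os.path.getsize(amplifier_file)`.

```python
def count_blocks(file_size_bytes, nr_channels, chunk_size, bytes_per_sample=2):
    """Return (samples per channel, full blocks, leftover samples).

    Hypothetical helper; assumes 16-bit samples interleaved across channels.
    """
    samples_per_channel = file_size_bytes // (bytes_per_sample * nr_channels)
    nr_blocks, leftover_samples = divmod(samples_per_channel, chunk_size)
    return samples_per_channel, nr_blocks, leftover_samples


# File size implied by the printed 74620920 samples per channel (assumption: 2 bytes/sample)
file_size_bytes = 74620920 * 128 * 2

samples, nr_blocks, leftover_samples = count_blocks(file_size_bytes, 128, 900000)
print(samples, nr_blocks, leftover_samples)  # 74620920 82 820920
```

With the leftover term commented out in the cell above, the final 820920 samples per channel (about 27.4 s at 30 kHz) are not processed.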


##### Read, convert and reorganize chunks of ephys data

In [27]:
with open(amplifier_file, 'rb') as fileid:  # binary mode: the .dat file holds raw samples

    for count, ch_s in tqdm(enumerate(all_chunk_sizes)):

        # Open chunk into a dataframe
        chunk = open_and_convert_chunk_to_df(fileid, ch_s, nr_channels)

        # Convert raw values to voltage: remove the 32768 offset, then scale by 0.195
        chunk_converted = (chunk.astype(int) - 32768) * 0.195

        if count==0:
            # Get first video frame index in the TTL input file
            first_frame_index = get_ttl_input_first_index(ttl_input_file)
            # Drop the samples recorded before the first video frame (first chunk only)
            chunk_converted = chunk_converted.iloc[first_frame_index:].reset_index(drop=True)

        # Subsample, then split chunk according to tetrode mapping and save to .csv files
        chunk_subsampled = chunk_converted.iloc[::sub_sampling_factor]
        organize_chunk_by_tt_and_save(path, chunk_subsampled, count)

78it [31:31, 24.25s/it]
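
The read, convert and subsample steps can be illustrated on a tiny in-memory example. This is only a sketch under stated assumptions: `read_chunk` is a hypothetical stand-in for `open_and_convert_chunk_to_df` (whose implementation is not shown in this notebook), and it assumes unsigned 16-bit samples interleaved across channels, which is consistent with the 32768 offset used above.

```python
import io

import numpy as np
import pandas as pd


def read_chunk(fileid, n_samples, nr_channels):
    """Read n_samples per channel of interleaved uint16 data into a DataFrame.

    Hypothetical helper; one row per sample, one column per channel.
    """
    n_bytes = n_samples * nr_channels * 2  # 2 bytes per uint16 value
    raw = np.frombuffer(fileid.read(n_bytes), dtype=np.uint16)
    return pd.DataFrame(raw.reshape(-1, nr_channels))


# Four samples from two channels, written as a binary stream
demo = np.array([[32768, 32769],
                 [32767, 32778],
                 [32768, 32768],
                 [32868, 32668]], dtype=np.uint16)
fileid = io.BytesIO(demo.tobytes())

chunk = read_chunk(fileid, n_samples=4, nr_channels=2)

# Same conversion as in the cell above: remove the offset, scale by 0.195
volts = (chunk.astype(int) - 32768) * 0.195

# Keep every 2nd sample (the notebook uses sub_sampling_factor = 6)
subsampled = volts.iloc[::2]
print(subsampled)
```

Because the file object keeps its position between reads, calling `read_chunk` repeatedly walks through the file chunk by chunk, which is what the loop above relies on.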


##### Read, convert and store chunks of ephys timestamps

Clip on TTL input using first_frame_index and split into chunks

In [5]:
# Clip on TTL input using first_frame_index
# Split into chunks
with open(timestamp_file, 'rb') as fileid:  # binary mode for np.fromfile
    for count, ch_s in tqdm(enumerate(all_chunk_sizes)):
        # Read the next chunk (the file position advances on each read) and convert into a dataframe
        timestamps_chunk = np.fromfile(fileid, np.int32, ch_s)
        timestamps_chunk = pd.DataFrame(timestamps_chunk)

        if count==0:
            # Clip on TTL input first frame index
            first_frame_index = get_ttl_input_first_index(ttl_input_file)
            timestamps_chunk = timestamps_chunk.iloc[first_frame_index:].reset_index(drop=True)
            # Get the first timestamp (a scalar), used as time zero for every chunk
            first_timestamp = timestamps_chunk.iloc[0, 0]

        # Convert to seconds from start
        timestamps_converted = (timestamps_chunk - first_timestamp) * (1/sample_rate)

        # The single column is labelled with the integer 0, not the string '0'
        timestamps_converted = timestamps_converted.rename(columns={0:'t'})
        timestamps_subsampled = timestamps_converted.iloc[::sub_sampling_factor]
        # Report the chunk number and the number of duplicated timestamps
        print(count)
        print(len(timestamps_subsampled[timestamps_subsampled.duplicated()]))
        # Save chunk into a .csv file
        filename='timestamps_chunk{}.csv'.format(count)
        timestamps_subsampled.to_csv(os.path.join(path, 'Ephys_timestamps', filename))

(Output condensed: for every chunk, the cell prints the chunk number followed by the number of duplicated timestamps.)

Chunks 0 to 79: 0 duplicated timestamps each
Chunk 80: 116880 duplicated timestamps
Chunk 81: 149840 duplicated timestamps

82it [00:22,  3.62it/s]
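
The timestamp conversion can be followed on a small synthetic example. This is a sketch with made-up values: the sample counters and `first_frame_index = 3` are hypothetical, chosen only to make the arithmetic easy to verify. Note that a DataFrame built from a 1-D array labels its column with the integer `0`, so the rename key must be `0` rather than `'0'`.

```python
import numpy as np
import pandas as pd

sample_rate = 30000
sub_sampling_factor = 6
first_frame_index = 3  # hypothetical value, for illustration only

# 15 consecutive sample counters, as they would be read from the timestamp file
raw = pd.DataFrame(np.arange(1000, 1015, dtype=np.int32))

# Clip everything before the first video frame and take the new first value as time zero
clipped = raw.iloc[first_frame_index:].reset_index(drop=True)
first_timestamp = clipped.iloc[0, 0]

# Convert sample counters to seconds from start and name the column 't'
t = ((clipped - first_timestamp) / sample_rate).rename(columns={0: 't'})

# Subsample and count duplicated timestamps, as the notebook cell does
t_sub = t.iloc[::sub_sampling_factor]
n_dup = int(t_sub.duplicated().sum())

print(t_sub)
print(n_dup)  # 0
```

A duplicate count of zero is what one would expect from a strictly increasing sample counter, so the non-zero counts reported for the last two chunks are probably worth inspecting before using those chunks.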


### Calculate the common average

In [11]:
tt_lfps = {}

# Calculate common average from one channel per tetrode
for tt in tqdm(common_average_tts):

    tt_path = os.path.join(path, 'TT{}'.format(tt))
    files = get_file_list(tt_path, "*.csv")
    tt_lfp = []

    for f in files:
        # Read each ephys raw data file
        file_path = os.path.join(tt_path, f)
        chunk = pd.read_csv(file_path, index_col=0)

        # Append the first channel of each chunk to the list
        tt_lfp.append(chunk.iloc[:,0])

    # Concatenate all chunks of this tetrode and store in a dictionary
    tt_lfps[tt] = pd.concat(tt_lfp)

# One column per tetrode; the common average is the mean across columns
data = pd.concat(tt_lfps, axis=1)
common_average = data.mean(axis=1)
file_path = os.path.join(path, "%s_common_average.csv"%code)
print('Saving...')
common_average.to_csv(file_path, index=False)
print('Saved!')

100%|██████████████████████████████████████████████████████████████████████| 6/6 [01:31<00:00, 15.27s/it]


Saving...
Saved!
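
To illustrate what the common average represents, the sketch below builds three synthetic channels that share a large common noise component. The data are simulated, and the final subtraction step (common average referencing) is an assumption about how the saved file is used later; this notebook only computes and saves the average.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Noise shared by every channel (e.g. movement or reference artefacts)
shared = rng.normal(0, 50, n)

# One column per tetrode: shared noise plus a smaller channel-specific signal
channels = pd.DataFrame({tt: shared + rng.normal(0, 5, n) for tt in [1, 5, 12]})

# Common average: mean across channels at every time point
common_average = channels.mean(axis=1)

# Assumed downstream use: subtract the common average from each channel
referenced = channels.sub(common_average, axis=0)

print(channels.std().round(1))
print(referenced.std().round(1))
```

After subtraction, the spread of each channel drops sharply because the shared component cancels, leaving mostly the channel-specific part.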



# Position data pre-processing

#### Visual confirmation of individual runs in the session

In [None]:
visual_check_of_individual_runs(path, -30, 190)    #-100, 100

Session code: 20180504094832, Rat code: NAPOLEAO

 Opening timestamps:20180504094832_b_tstamp_image.csv. Length:35078

 Opening  x position:20180504094832_b_xcoord.csv. Length:35078

 Opening y position:20180504094832_b_ycoord.csv. Length:35078

   timestamp          x         y   x_diff  run_nr         session       rat
0   0.000000  116.26666  75.66666  0.00000     1.0  20180504094832  NAPOLEAO
1   0.092365  115.35556  61.04444 -0.91110     1.0  20180504094832  NAPOLEAO
2   0.118362  115.30000  61.00000 -0.05556     1.0  20180504094832  NAPOLEAO
3   0.155866  115.10910  60.92728 -0.19090     1.0  20180504094832  NAPOLEAO
4   0.187123  115.10000  60.90000 -0.00910     1.0  20180504094832  NAPOLEAO




### Creation of a run specs file:
Confirm that all runs are visually consistent with the rat's trajectories in space. Create a run_specs .csv file (for example: 'xxx(ratname)_RUN_SPECS_DNMPxx_2020-06-30T14_05_16.csv') that maps each run number to its run information:

    1st column - run_nr;
    2nd column - run_type ('S', sample or 'T', test);
    3rd column - outcome (0 - error, 1 - correct);
    4th column - trial number.
If a run does not exist (i.e., it is not visually consistent with a rat's trajectory), all other columns should contain 'NaN'.
Column names should not be included in the file; they are added later on.
Save the file as a .csv into the directory containing the position and timestamp files, so that the run specifications can be added later on during the analysis.
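
A run specs file with this layout could be written and read back as sketched below. The runs listed are invented for illustration, and the column names used when reading (`run_nr`, `run_type`, `outcome`, `trial_nr`) are assumptions; the names actually added by `position_utils` may differ. An in-memory buffer stands in for the real file path.

```python
import io

import pandas as pd

# One row per run: run_nr, run_type, outcome, trial number (made-up example values)
rows = [
    (1, 'S', 1, 1),
    (2, 'T', 1, 1),
    (3, 'NaN', 'NaN', 'NaN'),  # run 3 was not a valid trajectory
    (4, 'S', 1, 2),
    (5, 'T', 0, 2),
]
run_specs = pd.DataFrame(rows)

# Write without column names and without the index, as required above
buffer = io.StringIO()
run_specs.to_csv(buffer, header=False, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back: column names are supplied at load time
loaded = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=['run_nr', 'run_type', 'outcome', 'trial_nr'],  # assumed names
)
print(loaded)
```

When the file is read back, pandas parses the literal text `NaN` as a missing value, which makes the invalid runs easy to drop later.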

### Collect CP and start limits
In this order:

<b>1st: start limits</b>
   - Only the xlim will be used later on. Size of ROI not as important.
   - Use the reference tape to create ROI. 

<b>2nd: CP limits</b>
   - Size of ROI is important and it should align with maze limits.
   - Use maze CP square limits to create ROI. 

<b>3rd: Corner1 limits (left from start region) </b>
   - Size of ROI is important and it should align with maze limits.

<b>4th:  RW1 limits (left from start region) </b>
   - Only the xlim will be used later on. Size of ROI not as important.
   - Target xlim to be in the center of the well.

<b>5th:  Corner2 limits  (right from start region) </b>
   - Size of ROI is important and it should align with maze limits.

<b>6th: RW2 limits (right from start region) </b>
   - Only the xlim will be used later on. Size of ROI not as important.
   - Target xlim to be in the center of the well.

In [31]:
session_code = collect_maze_limits(path)

Directory is empty. Selecting the 1st frame from the videos
/VOLUMES/Biló/EphysData/NAPOLEAO_DNMP24_15trials_20180505150034/ROI_Frames/20180505150034_b_movie_1stframe.jpg
(221, 507, 82, 50)
/VOLUMES/Biló/EphysData/NAPOLEAO_DNMP24_15trials_20180505150034/ROI_Frames/20180505150034_b_movie_1stframe.jpg
(1068, 507, 77, 50)
/VOLUMES/Biló/EphysData/NAPOLEAO_DNMP24_15trials_20180505150034/ROI_Frames/20180505150034_b_movie_1stframe.jpg
(1070, 135, 68, 63)
/VOLUMES/Biló/EphysData/NAPOLEAO_DNMP24_15trials_20180505150034/ROI_Frames/20180505150034_b_movie_1stframe.jpg
(907, 131, 8, 77)
/VOLUMES/Biló/EphysData/NAPOLEAO_DNMP24_15trials_20180505150034/ROI_Frames/20180505150034_b_movie_1stframe.jpg
(1061, 864, 72, 62)
/VOLUMES/Biló/EphysData/NAPOLEAO_DNMP24_15trials_20180505150034/ROI_Frames/20180505150034_b_movie_1stframe.jpg
(879, 869, 9, 72)
      x      y  width  height         session
0  44.2  101.4   16.4    10.0  20180505150034
       x      y  width  height         session
0  213.6  101.4   15
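
In the printed output, each ROI tuple `(x, y, width, height)` in pixels appears again in the table that follows it, divided by 5 (for example 221 becomes 44.2 and 82 becomes 16.4). The sketch below reproduces that conversion. The factor of 5 is inferred from the output only, and `roi_to_position_units` is a hypothetical helper, not the code inside `collect_maze_limits`.

```python
import pandas as pd

PIXELS_PER_UNIT = 5  # assumption: inferred from the printed output above


def roi_to_position_units(roi, session, scale=PIXELS_PER_UNIT):
    """Convert an (x, y, width, height) ROI in pixels to position units.

    Hypothetical helper, for illustration only.
    """
    x, y, width, height = roi
    return pd.DataFrame([{
        'x': x / scale,
        'y': y / scale,
        'width': width / scale,
        'height': height / scale,
        'session': session,
    }])


# First ROI printed above (start limits)
start = roi_to_position_units((221, 507, 82, 50), '20180505150034')
print(start)

# For ROIs where only the x limits matter, the range follows from x and width
x_min = start.loc[0, 'x']
x_max = x_min + start.loc[0, 'width']
```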

### Add run specifications
All position data from runs marked with NaNs in the run specs file are removed from the dataframe.

In [32]:
data = add_run_specs(path, code)#session_code)
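
The effect of this step can be sketched with a left merge followed by a drop of the NaN runs. This is an illustration under assumptions, not the implementation of `add_run_specs`: the position values are made up, and the run specs column names (`run_type`, `outcome`, `trial_nr`) are assumed.

```python
import numpy as np
import pandas as pd

# Position samples labelled with a run number (made-up values)
position = pd.DataFrame({
    'timestamp': [0.0, 0.1, 5.0, 5.1, 9.0],
    'x': [116.3, 115.4, 80.2, 79.9, 116.0],
    'y': [75.7, 61.0, 60.5, 60.4, 75.1],
    'run_nr': [1.0, 1.0, 2.0, 2.0, 3.0],
})
# Run numbers are stored as floats in the position data; cast so the keys match
position['run_nr'] = position['run_nr'].astype(int)

# Run specs: run 2 was not a valid run, so its other columns are NaN
run_specs = pd.DataFrame({
    'run_nr': [1, 2, 3],
    'run_type': ['S', np.nan, 'T'],
    'outcome': [1, np.nan, 0],
    'trial_nr': [1, np.nan, 1],
})

# Attach the run information to every position sample of that run
merged = position.merge(run_specs, on='run_nr', how='left')

# Remove all samples that belong to runs without specifications
clean = merged.dropna(subset=['run_type', 'outcome', 'trial_nr']).reset_index(drop=True)
print(clean)
```

Only the samples from runs 1 and 3 remain, which matches the behaviour described above.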

### Save data into a .csv file

In [33]:
file_path = os.path.join(path,
                        "Timestamped_position", 
                        "%s_timestamped_position_df_clean.csv"%code)#session_code)
data.to_csv(file_path, index=False)