## Rat hexmaze import data script

This script does the following:

1. Reads all rat hexmaze experimental logs in a given directory, removes duplicate lines and duplicate trials, and stores all data in a pandas dataframe
2. Performs basic data integrity verification, checking:
    - Temporal consistency of the data read (i.e., all recorded times are ascending throughout the session)
    - Experimental integrity: the rat passes through the target node only once, and always at the end of the trial
    - Between-column consistency (e.g., speed equals distance / time)
3. Generates a cleaned version of the dataframe, removing trials that do not meet the above criteria
4. Adds new columns to the clean data to assist in later data exploration
5. Saves both the cleaned dataframe generated in step 3 and the clean, enriched dataframe generated in step 4 to csv files, for easy loading in subsequent exploration and analysis
6. Saves the results of the import and cleaning operations to a logfile. The names of the generated csv files and of the import logfile include the date on which this script was executed.
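
The three checks in step 2 can be sketched on a toy session as follows. This is only an illustration of the idea, not the code in `myRatlib.py`: the function `find_bad_trials`, the column names and the toy values are all assumptions made for this example.

```python
import pandas as pd

def find_bad_trials(session, target_node, tol=1e-6):
    """Return the trial numbers that fail any of the three integrity checks.

    The column names used here ('trial_no', 'time', 'node', 'distance',
    'seconds', 'speed') are assumptions for this sketch.
    """
    bad = []
    for trial_no, trial in session.groupby('trial_no', sort=True):
        # 1. Temporal consistency: times must never decrease
        times_ok = trial['time'].is_monotonic_increasing
        # 2. Target node appears exactly once, as the last node of the trial
        nodes = trial['node'].tolist()
        target_ok = nodes.count(target_node) == 1 and nodes[-1] == target_node
        # 3. Between-column consistency: speed equals distance / time
        expected = trial['distance'] / trial['seconds']
        speed_ok = bool(((trial['speed'] - expected).abs() <= tol).all())
        if not (times_ok and target_ok and speed_ok):
            bad.append(int(trial_no))
    return bad

# Toy session: trial 1 visits the target node 104 twice, trial 2 is valid
session = pd.DataFrame({
    'trial_no': [1, 1, 1, 2, 2, 2],
    'time':     pd.to_timedelta([0, 2, 4, 10, 12, 14], unit='s'),
    'node':     [101, 104, 104, 101, 102, 104],
    'distance': [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
    'seconds':  [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],
    'speed':    [0.25, 0.25, 0.25, 0.25, 0.25, 0.25],
})

bad_trials = find_bad_trials(session, target_node=104)
clean = session[~session['trial_no'].isin(bad_trials)]
print("Removed bad trials:", bad_trials)
```

Filtering with `isin` drops whole trials, which mirrors step 3: a trial is either kept in full or removed in full.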

In [1]:
## Import functions

import numpy as np  
import pandas as pd
import re # Regular expressions
from datetime import datetime, time 
import logging
import glob
import os
import myRatlib as rat # Custom functions contained in myRatlib.py to read and clean rat data

In [2]:
# Set up import log filename and path
today = datetime.today()
log_filename = 'import_log_' + today.strftime("%Y%m%d") + '.log'
log_path = '../data/log/' 
logfile = log_path + log_filename
logging.basicConfig(filename = logfile, level=logging.INFO)

# Set up data filename and path
data_path = '../data/'
data_filename = 'Rat_HM_Ephys_Agg_' + today.strftime("%Y%m%d") + '.csv' # Aggregated Clean data
data_pr_path = '../results/'
data_pr_fname = 'Rat_HM_Ephys_AggProc_' + today.strftime("%Y%m%d") + '.csv' # Aggregated Clean data with additional processed columns

# Get raw data filenames from directory
raw_dir = '../data/raw/'
raw_list = glob.glob(raw_dir + 'Rat_HM_Ephys_Rat*.txt')
# Get file names without the paths
raw_fname_list = [os.path.basename(fpath) for fpath in raw_list]
raw_fname_list.sort() # Sort filenames so they are in Rat order and date order

msg_info = '\n -------------- FOUND ' + str(len(raw_list)) + ' FILES TO IMPORT -------------\n\n'
print(msg_info)
logging.info(msg_info)


 -------------- FOUND 22 FILES TO IMPORT -------------




In [3]:
# Load data from all files removing duplicates, and perform basic verification and cleaning
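sessions = []     # Imported sessions (duplicates removed)
sessions_cl = []  # Cleaned sessions
for file in raw_fname_list:
    session = rat.import_file(file, raw_dir)
    sessions.append(session)

    session_cl = rat.clean_data(session)
    sessions_cl.append(session_cl)

# DataFrame.append was removed in pandas 2.0, so concatenate once after the loop
data = pd.concat(sessions)
data_cl = pd.concat(sessions_cl)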
    
# Save clean data to csv (times are written without the '0 days ' prefix, then restored to timedelta)
data_cl['time'] = data_cl['time'].astype(str).str.split('0 days ').str[-1]
data_cl.to_csv(data_path+data_filename)
data_cl['time'] = pd.to_timedelta(data_cl['time'])

# Report number of imported rats and sessions per rat
import_summary = data_cl.groupby(['rat_no', 'date'])['trial_no'].nunique()

msg_info = '\n -------------- IMPORT RESULTS --------------------------\n\n Imported ' + str(len(import_summary)) + ' sessions: \n'
msg_info = msg_info + 'rat_no    date     no of trials'
print(msg_info)
print(import_summary)
logging.info(msg_info)
logging.info(import_summary)

Importing data from Rat5 on 2021-06-28 (Rat_HM_Ephys_Rat5_406576_20210628.txt):
29 trials found initially

3 duplicate trials found

Trial 1
Trial 11
Trial 25
746 duplicate lines found

Final number of trials loaded: 25

ERROR: trial 1 contains target node 104 in places other than last

ERROR: trial 11 contains target node 104 in places other than last

ERROR: trial 25 contains target node 104 in places other than last


Removed bad trials: [1, 11, 25]

Importing data from Rat5 on 2021-06-29 (Rat_HM_Ephys_Rat5_406576_20210629.txt):
47 trials found initially

18 duplicate trials found

Trial 1
Trial 1
Trial 1
Trial 2
Trial 3
Trial 4
Trial 5
Trial 6
Trial 7
Trial 8
Trial 9
Trial 10
Trial 11
Trial 12
Trial 13
Trial 14
Trial 14
Trial 28
1147 duplicate lines found

Final number of trials loaded: 28

ERROR: trial 1 contains target node 104 in places other than last

ERROR: trial 14 contains target node 104 in places other than last

ERROR: trial 28 contains target node 104 in places other th

### Process data
Now that the data have been imported, add columns to ease later exploration and analysis:

* island          string     Current island name
* re-visit        bool       Whether this node has already been visited within this trial
* u_turn          bool       Whether the rat has performed a u-turn (is currently in the same node as two steps ago)
* cum_distance    float64    Cumulative distance travelled within the trial
* cum_seconds     float64    Cumulative seconds elapsed within the trial
* act_stps_2trgt  int        Number of steps left to target along the *actual* path followed by the rat
* min_stps_2trgt  int        Number of steps left to target along the *shortest* path
* nd-visits-day   int        Number of visits to the present node so far within the session, counting the current one (session node familiarity)
* nd-visits-rat   int        Number of visits to the present node so far for the rat, counting the current one (overall node familiarity)
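
As a quick illustration of how the within-trial columns behave, here is a minimal sketch on a single made-up trial. The node labels and distances are invented for this example; only the pandas operations mirror what the processing cell does.

```python
import pandas as pd

# Toy single-trial trajectory (values are made up for illustration)
trial = pd.DataFrame({
    'node':     [101, 102, 101, 103, 104],
    'distance': [0.0, 0.5, 0.5, 0.5, 0.5],
})

# Re-visit: the node was already seen earlier in this trial
trial['re-visit'] = trial['node'].duplicated()
# U-turn: the rat is in the same node as two steps ago
trial['u_turn'] = trial['node'] == trial['node'].shift(2)
# Cumulative distance travelled within the trial
trial['cum_distance'] = trial['distance'].cumsum()
# Steps left along the actual path: counts down to 0 at the target (last row)
trial['act_stps_2trgt'] = list(range(len(trial) - 1, -1, -1))

print(trial)
```

In this toy trial the third row is both a re-visit and a u-turn, because the rat steps from 101 to 102 and back to 101.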


In [4]:
## BUILD A MAZE REPRESENTATION:
# Load hexMaze graph and compute shortest paths
import networkx as nx # Package for graph representations 

edgelist_filename = '../data/graph_edgelist.dat'

print("load graph")
G = nx.read_edgelist(edgelist_filename)  
print("graph: number of nodes = ",G.number_of_nodes(),", edges = ",G.number_of_edges())

# Pre-compute all shortest path lengths and store them in a dictionary (the paths themselves are not stored)
DD = nx.shortest_path_length(G) 
DD = dict(DD)
# Set rat, date and trial number as index to make selecting trials easy
data_cl.set_index(['rat_no', 'date', 'trial_no'], inplace=True)



load graph
graph: number of nodes =  96 , edges =  125
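
The shape of the `DD` dictionary can be seen on a small example. The sketch below uses a made-up four-node graph rather than the real maze, and the names `G_toy` and `DD_toy` are introduced only for this illustration. Edge-list parsing in networkx keeps node labels as strings unless a `nodetype` is given, so lookups need string keys.

```python
import networkx as nx

# Toy edge list in "u v" text format (node labels are made up)
lines = ["101 102", "102 103", "103 104", "101 104"]
G_toy = nx.parse_edgelist(lines)  # nodetype defaults to None: labels stay strings

# All-pairs shortest path lengths as a nested dict: DD_toy[source][target]
DD_toy = dict(nx.shortest_path_length(G_toy))

steps = DD_toy['102']['104']   # string keys, not integers
print("steps from 102 to 104:", steps)
print("integer key present:", 101 in DD_toy)
```

This is why the per-trial code converts each node with `str(...)` before indexing into `DD`.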


In [5]:
# Add column with node classification according to island
data_cl["island"] = ""
data_cl.loc[data_cl['node'].astype('int') // 100 == 1, 'island'] = 'Ireland'
data_cl.loc[data_cl['node'].astype('int') // 100 == 2, 'island'] = 'Japan'
data_cl.loc[data_cl['node'].astype('int') // 100 == 3, 'island'] = 'Hawaii'
data_cl.loc[data_cl['node'].astype('int') // 100 == 4, 'island'] = 'Easter I.'

# Add within-trial info: 
    # cum distance and seconds, 
    # within-trial u-turns and revisits
    # optimal and actual traj left at each step
    
data_cl['re-visit'] = False
data_cl['u_turn'] = False

for trial in data_cl.index.unique():
    # Cumulative distance and cumulative seconds columns
    data_cl.loc[trial, 'cum_distance'] = data_cl.loc[trial, 'distance'].cumsum()
    data_cl.loc[trial, 'cum_seconds'] = data_cl.loc[trial, 'seconds_ff'].cumsum()
    # U-turns and re-visits
    data_cl.loc[trial, 'u_turn'] = (data_cl.loc[trial, 'node'] == data_cl.loc[trial, 'node'].shift(2))
    dup_nodes = data_cl.loc[trial, 'node'].duplicated()
    data_cl.loc[trial, 're-visit'] = dup_nodes
    
    # Trajectory data
    # Number of steps taken by rat in the trial (trajectory length):
    traj_len = len(data_cl.loc[trial, 'node'])
    # Number of steps still left until rat reaches target, at each node passed:
    act_stps_2trgt = [traj_len-n for n in range(1, traj_len+1)]
    # Target node
    tgt_node = data_cl.loc[trial, 'node'].iloc[-1]
    # Minimum number of steps until target at each node passed by rat
    # (graph node labels are strings, so both keys are converted with str)
    min_stps_2trgt = [DD[str(a_node)][str(tgt_node)] for a_node in data_cl.loc[trial, 'node']]
    data_cl.loc[trial, 'act_stps_2trgt'] = act_stps_2trgt
    data_cl.loc[trial, 'min_stps_2trgt'] = min_stps_2trgt

# Across session variables: cumulative node visits according to day and to rat
data_cl['nd-visits-day'] = None
data_cl['nd-visits-rat'] = None

# Cumulative node visits within one session (rat_day)
for rat_day, df in data_cl.groupby(level=['rat_no', 'date']):
    visits_count = {}
    for i in range(len(data_cl.loc[rat_day])):
        node = data_cl.loc[rat_day, 'node'].iloc[i]
        if node in visits_count.keys():
            visits_count[node] += 1
        else:
            visits_count[node] = 1
        data_cl.loc[rat_day, 'nd-visits-day'].iloc[i] = visits_count[node]

# Cumulative node visits for one rat across sessions
# (loop variable is named rat_name so it does not shadow the imported module `rat`)
for rat_name, df in data_cl.groupby(level=['rat_no']):
    visits_count = {}
    for i in range(len(data_cl.loc[rat_name])):
        node = data_cl.loc[rat_name, 'node'].iloc[i]
        if node in visits_count.keys():
            visits_count[node] += 1
        else:
            visits_count[node] = 1
        data_cl.loc[rat_name, 'nd-visits-rat'].iloc[i] = visits_count[node]

data_cl['act_stps_2trgt'] = data_cl['act_stps_2trgt'].astype('int')
data_cl['min_stps_2trgt'] = data_cl['min_stps_2trgt'].astype('int')
data_cl['nd-visits-day'] = data_cl['nd-visits-day'].astype('int')
data_cl['nd-visits-rat'] = data_cl['nd-visits-rat'].astype('int')

        
# Save data to csv
data_cl['time'] = data_cl['time'].astype(str).str.split('0 days ').str[-1]
data_cl.to_csv(data_pr_path+data_pr_fname)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value, self.name)
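
The warning above comes from the chained `.loc[...].iloc[i] = ...` assignments in the visit-count loops. A possible alternative, sketched here on made-up data with the same index layout, is to compute the running counts with `groupby(...).cumcount()`, which avoids chained assignment altogether. The frame `toy` and its values are assumptions for illustration; this is not what the notebook ran.

```python
import pandas as pd

# Toy data with the same index levels as data_cl (values are made up)
idx = pd.MultiIndex.from_tuples(
    [('Rat5', '2021-06-28', 1), ('Rat5', '2021-06-28', 1),
     ('Rat5', '2021-06-28', 2), ('Rat5', '2021-06-29', 1),
     ('Rat5', '2021-06-29', 1)],
    names=['rat_no', 'date', 'trial_no'])
toy = pd.DataFrame({'node': ['101', '102', '101', '101', '102']}, index=idx)

# cumcount numbers the rows of each group 0, 1, 2, ... in row order;
# adding 1 counts the current visit, as the loops do.
# to_numpy() assigns by position, sidestepping alignment on the non-unique index.
toy['nd-visits-day'] = (
    toy.groupby(['rat_no', 'date', 'node']).cumcount() + 1).to_numpy()
toy['nd-visits-rat'] = (
    toy.groupby(['rat_no', 'node']).cumcount() + 1).to_numpy()

print(toy)
```

Grouping by a mix of index level names and column names is supported by pandas, so no `reset_index` is needed here.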


In [6]:
data_cl.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 7149 entries, ('Rat5', Timestamp('2021-06-28 00:00:00'), 2) to ('Rat8', Timestamp('2021-08-19 00:00:00'), 26)
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   rat_id          7149 non-null   object 
 1   node            7149 non-null   object 
 2   time            7149 non-null   object 
 3   distance        7149 non-null   float64
 4   seconds_ff      7149 non-null   float64
 5   speed_ff        7149 non-null   float64
 6   island          7149 non-null   object 
 7   re-visit        7149 non-null   bool   
 8   u_turn          7149 non-null   bool   
 9   cum_distance    7149 non-null   float64
 10  cum_seconds     7149 non-null   float64
 11  act_stps_2trgt  7149 non-null   int64  
 12  min_stps_2trgt  7149 non-null   int64  
 13  nd-visits-day   7149 non-null   int64  
 14  nd-visits-rat   7149 non-null   int64  
dtypes: bool(2), float64(5), int64(4), obj