# Prepare-a-SLAV

Prepare-a-SLAV uses the mirofile library to load raw MIROSLAV data into a pandas DataFrame. It labels the sensor columns with the appropriate animal IDs and downsamples the data from the default MIROSLAV sampling rate (10 binary sensor readings per second) to a user-defined time interval (bin). Prepare-a-SLAV is configured through its TOML configuration file, which also documents each parameter.
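
As a rough illustration of the downsampling step, the sketch below averages synthetic 10 Hz binary readings into 1-minute bins with pandas. The column name `H_cage1`, the timestamps, and the bin width are invented for this example; the notebook's actual processing is in the cells further down.

```python
import numpy as np
import pandas as pd

# Synthetic data: 3 minutes of readings at 10 Hz (one row every 100 ms).
# "H_cage1" is a made-up sensor column, not a name taken from a real config.
index = pd.date_range("2022-05-06 19:20:00", periods=3 * 60 * 10, freq="100ms")
rng = np.random.default_rng(seed=0)
raw = pd.DataFrame({"H_cage1": rng.integers(0, 2, size=len(index))}, index=index)

# Downsample: each 1-minute bin becomes the mean of its 600 binary readings,
# i.e. the fraction of readings in that minute that were 1.
binned = raw.resample(pd.to_timedelta("1 minute")).mean()

print(binned)
```

Averaging binary values keeps the result between 0 and 1 regardless of the bin width, so bins of different lengths remain comparable.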

If you are running Prepare-a-SLAV via Google Colab, Prepare-a-SLAV will autodetect and set up the Colab environment in the following cell, and pull example data and the TOML configuration file from the [MIROSLAV toolkit GitHub repository](https://github.com/davorvr/MIROSLAV-analysis).

If you want to run Prepare-a-SLAV in Google Colab *and* with your own data, you can upload your configuration and data files using the File Browser in the sidebar on the left after running the following cell.

In [1]:
try:
    import google.colab
    IN_COLAB = True
except ModuleNotFoundError:
    IN_COLAB = False
else:
    import sys
    from IPython.display import clear_output 
    clear_output()
    %pip install --ignore-requires-python mirofile
    %pip install pandas --upgrade
    %pip install fastparquet
    !wget https://raw.githubusercontent.com/davorvr/MIROSLAV-analysis/main/1_Prepare-a-SLAV_config.toml
    !mkdir 0_raw_logs
    !wget -O 0_raw_logs/mph-pir-rack_M.2022-05-06T19-19-57-669585.gz https://github.com/davorvr/MIROSLAV-analysis/raw/main/0_raw_logs/mph-pir-rack_M.2022-05-06T19-19-57-669585.gz
    !wget -O 0_raw_logs/mph-pir-rack_M.2022-05-16T01-33-25-478055.gz https://github.com/davorvr/MIROSLAV-analysis/raw/main/0_raw_logs/mph-pir-rack_M.2022-05-16T01-33-25-478055.gz
    !wget -O 0_raw_logs/mph-pir-rack_M.2022-05-25T14-03-17-240158.gz https://github.com/davorvr/MIROSLAV-analysis/raw/main/0_raw_logs/mph-pir-rack_M.2022-05-25T14-03-17-240158.gz
    !wget -O 0_raw_logs/mph-pir-rack_R.2022-05-06T19-19-57-669185.gz https://github.com/davorvr/MIROSLAV-analysis/raw/main/0_raw_logs/mph-pir-rack_R.2022-05-06T19-19-57-669185.gz
    !wget -O 0_raw_logs/mph-pir-rack_R.2022-05-16T01-33-22-935712.gz https://github.com/davorvr/MIROSLAV-analysis/raw/main/0_raw_logs/mph-pir-rack_R.2022-05-16T01-33-22-935712.gz
    !wget -O 0_raw_logs/mph-pir-rack_R.2022-05-25T07-57-01-575482.gz https://github.com/davorvr/MIROSLAV-analysis/raw/main/0_raw_logs/mph-pir-rack_R.2022-05-25T07-57-01-575482.gz
    clear_output()

The environment has been set up. If you wish, you can load your own data using the sidebar now.

***

In [2]:
import sys  # imported here too, so this cell does not rely on the Colab branch above
import pandas as pd
if IN_COLAB and sys.hexversion < 0x030b0000:
    OLD_TOML = True
    import toml
else:
    OLD_TOML = False
    import tomllib
import os
from datetime import datetime
from pathlib import Path
from math import ceil
from mirofile import mirofile

Define helper functions for managing imported column mappings

In [3]:
def _unpack_mcp_colmap(colmap: dict) -> dict:
    """Expand one board's PHL0..PHL7 mapping into 16 per-sensor column names."""
    unpacked_colmap = {}
    # first do PH7..0
    for i in range(7,-1,-1):
        unpacked_colmap.update({"PH"+str(i):"H_"+colmap["PHL"+str(i)]})
    # then do PL0..7
    for i in range(0,8):
        unpacked_colmap.update({"PL"+str(i):"L_"+colmap["PHL"+str(i)]})
    return unpacked_colmap

def unpack_full_colmap(colmap: dict[str, dict]) -> dict:
    """Unpack all boards of one device, prefixing columns with MCP1_, MCP2_, ..."""
    unpacked_colmap = {}
    # top_board always becomes MCP1; the remaining boards follow in sorted order
    input_boards = list(colmap.keys())
    input_boards.remove("top_board")
    input_boards.sort()
    input_boards = ["top_board"] + input_boards
    for mcp_i, colmap_name in enumerate(input_boards):
        mcp_colmap_unpacked = _unpack_mcp_colmap(colmap[colmap_name])
        for k, v in mcp_colmap_unpacked.items():
            unpacked_colmap.update({f"MCP{mcp_i+1}_"+k : v})
    return unpacked_colmap
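
To make the mapping concrete, here is a small standalone sketch that copies `_unpack_mcp_colmap` from the cell above and applies it to a hypothetical board mapping. The cage labels (`cage1` to `cage7`) are illustrative assumptions; `na` is used for an unused input because the processing loop later drops columns starting with `H_na` and `L_na`. Real values come from the TOML configuration file.

```python
def _unpack_mcp_colmap(colmap: dict) -> dict:
    """Copy of the notebook helper: one PHLn entry yields a PHn and a PLn column."""
    unpacked = {}
    for i in range(7, -1, -1):
        unpacked["PH" + str(i)] = "H_" + colmap["PHL" + str(i)]
    for i in range(0, 8):
        unpacked["PL" + str(i)] = "L_" + colmap["PHL" + str(i)]
    return unpacked

# Hypothetical board: inputs 0..6 are wired to cages, input 7 is unused.
board = {f"PHL{i}": f"cage{i + 1}" for i in range(7)}
board["PHL7"] = "na"

mapping = _unpack_mcp_colmap(board)
print(mapping["PH0"], mapping["PL0"], mapping["PH7"])  # H_cage1 L_cage1 H_na

# unpack_full_colmap adds an "MCPn_" prefix per board; top_board is MCP1.
prefixed = {f"MCP1_{k}": v for k, v in mapping.items()}
print(prefixed["MCP1_PL7"])  # L_na
```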

Store the directory this notebook runs from in `wd` (in a notebook, this resolves to the current working directory)

In [4]:
wd = Path(os.path.dirname(os.path.realpath('__file__'))).resolve()

Load the TOML config file and extract all user-defined parameters

In [5]:
if OLD_TOML:
    with open(wd / "1_Prepare-a-SLAV_config.toml", "r") as cfg_file:
        config = toml.load(cfg_file)
else:
    with open(wd / "1_Prepare-a-SLAV_config.toml", "rb") as cfg_file:
        config = tomllib.load(cfg_file)

try:
    experiment = config["id_variables"]["experiment"]
    set_dtypes = config["processing_params"]["set_dtypes"]
    resample = config["processing_params"]["resample"]
    resample_bin = config["processing_params"]["resample_bin"]
    toml_colnames = config[experiment]
except KeyError as exc:
    raise KeyError("Config file is improperly formatted!") from exc

colmaps = {}
for k, v in toml_colnames.items():
    try:
        do_process = v.pop("process")
    except KeyError as exc:
        raise KeyError("Config file is improperly formatted!") from exc
    if not do_process:
        continue
    try:
        v = unpack_full_colmap(v)
    except Exception as exc:
        raise KeyError("Couldn't process cage mappings from config file!") from exc
    colmaps.update({k: v})

log_path = Path.cwd() / "0_raw_logs"
output_path = Path.cwd() / "1_outputs_prepared"
output_path.mkdir(exist_ok=True)
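
The keys read in the cell above imply a particular layout for the configuration file. The sketch below parses a hypothetical, heavily abbreviated config from a string to show that layout. The cage labels are placeholders, the `tomli` fallback is an assumption for Python versions older than 3.11 (the notebook itself falls back to the `toml` package), and the real `1_Prepare-a-SLAV_config.toml` in the repository remains the reference for the actual parameters.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # assumed third-party backport with the same API

# Hypothetical, abbreviated config; only the structure mirrors what the
# notebook reads. All cage labels below are invented.
EXAMPLE_CONFIG = """
[id_variables]
experiment = "mph"

[processing_params]
set_dtypes = true
resample = true
resample_bin = "1 minute"

[mph.rack_M]
process = true

[mph.rack_M.top_board]
PHL0 = "cage1"
PHL1 = "cage2"
PHL2 = "cage3"
PHL3 = "cage4"
PHL4 = "cage5"
PHL5 = "cage6"
PHL6 = "cage7"
PHL7 = "na"

[mph.rack_R]
process = false
"""

config = tomllib.loads(EXAMPLE_CONFIG)
experiment = config["id_variables"]["experiment"]
devices = config[experiment]

# Only devices with process = true are unpacked and processed.
to_process = [name for name, dev in devices.items() if dev["process"]]
print(to_process)  # ['rack_M']
```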

Set the number of rows read at once. Reduce this value if you run into memory issues.

In [6]:
chunk_size = 10**6

In [7]:
ts_column = "ts_recv"
ts_delta = True
ts_index_map = {"ts_recv": 0,
                "ts_sent": 1}
ts_index = ts_index_map[ts_column]
resample_bin_td = pd.to_timedelta(resample_bin)

for device, colmap in colmaps.items():
    print(f"Processing device {device}. ")
    mname = "-".join([experiment, "pir", device])

    mfile = mirofile.open_experiment(experiment, device, compression="gz", path=log_path)
    suffix = ""
    if set_dtypes:
        suffix += "-dtyped"
    if resample:
        suffix += f"-resampled-{resample_bin}".replace(" ", "")
    pqfile = Path(output_path, mname+suffix+".parquet")
    
    if pqfile.exists():
        raise FileExistsError("Database file already exists, not overwriting!")
        #pass

    # Column names for each of the 16 bits output by one MCP are stored
    # in order in the mirofile.mcp_columns list (for more explanation, see
    # mirofile_columns.txt). But, since we usually have more than one MCP,
    # we need to prepend the column names with "MCPn_". This line does this:
    data_columns = [ f"MCP{n_mcp}_"+colname for n_mcp in range(1, mfile.n_mcps+1) for colname in mirofile.mcp_columns ]
    all_columns = mirofile.timestamp_columns+data_columns
    all_column_dtypes = [*["datetime64[ns]"]*2, *["int"]*(len(all_columns)-2)]

    start = datetime.now()
    file_chunk_carryover = None
    #file_chunk = mfile.readlists(size=chunk_size, progress_bar=True)
    sampling_period = None
    n_chunk = 0
    while True:
        n_chunk += 1
        print(f"({device}) Processing chunk {n_chunk}...")
        if isinstance(file_chunk_carryover, pd.DataFrame):
            #file_chunk = [file_chunk_carryover]
            file_chunk = mfile.readlists(size=chunk_size-len(file_chunk_carryover), progress_bar=True)
        else:
            file_chunk = mfile.readlists(size=chunk_size, progress_bar=True)
        if not file_chunk:
            break
        file_chunk = pd.DataFrame.from_records(file_chunk, columns=all_columns)
        if set_dtypes:
            file_chunk = file_chunk.astype(dict(zip(all_columns, all_column_dtypes)))
            if isinstance(file_chunk_carryover, pd.DataFrame) and not file_chunk_carryover.empty:
                file_chunk = pd.concat([file_chunk_carryover, file_chunk], ignore_index=True)
                file_chunk_carryover = None
            if not sampling_period:
                sampling_period = file_chunk[ts_column].diff().mode().iloc[0]
                supp_len = ceil(resample_bin_td / sampling_period)
            if resample:
                # Read ahead so the chunk's last bin is complete before resampling.
                bin_end = file_chunk[ts_column].iloc[-1].ceil(freq=resample_bin_td)
                file_chunk_supplement = mfile.readlists(supp_len, progress_bar=True)
                if file_chunk_supplement:
                    chunk_end = pd.to_datetime(file_chunk_supplement[-1][ts_index])
                    while chunk_end < bin_end:
                        more_rows = mfile.readlists(supp_len, progress_bar=True)
                        if not more_rows:
                            # End of data reached before the bin boundary.
                            break
                        # extend (not append) keeps the supplement a flat list of rows
                        file_chunk_supplement.extend(more_rows)
                        chunk_end = pd.to_datetime(file_chunk_supplement[-1][ts_index])
                
                    file_chunk_supplement = pd.DataFrame.from_records(file_chunk_supplement, columns=all_columns)
                    file_chunk_supplement = file_chunk_supplement.astype(dict(zip(all_columns, all_column_dtypes)))
                    file_chunk = pd.concat([file_chunk, file_chunk_supplement.loc[file_chunk_supplement[ts_column] < bin_end].copy()], ignore_index=True)
                    file_chunk_carryover = file_chunk_supplement.loc[file_chunk_supplement[ts_column] >= bin_end].copy()
                    file_chunk_supplement = None
                else:
                    file_chunk_carryover = None

        if colmap:
            file_chunk = file_chunk.rename(colmap, axis="columns")
            file_chunk = file_chunk.loc[:,~file_chunk.columns.str.startswith("H_na")]
            file_chunk = file_chunk.loc[:,~file_chunk.columns.str.startswith("L_na")]
            file_chunk = file_chunk.copy()
        if ts_delta:
            if ts_column == "ts_recv":
                secondary_ts = "ts_sent"
            elif ts_column == "ts_sent":
                secondary_ts = "ts_recv"
            else:
                raise ValueError("ts_column must be either 'ts_recv' or 'ts_sent'")
            # delta is always ts_recv - ts_sent since ts_recv is always later, but
            # we keep the other's name so it's clear which column got replaced with a delta.
            file_chunk[secondary_ts+"_delta"] = file_chunk["ts_recv"] - file_chunk["ts_sent"]
            file_chunk = file_chunk.drop(columns=secondary_ts)
        file_chunk = file_chunk.set_index(ts_column)
        if resample:
            if ts_delta:
                file_chunk = file_chunk.resample(resample_bin_td).mean(numeric_only=False)
            else:
                file_chunk = file_chunk.resample(resample_bin_td).mean(numeric_only=True)
            
        if not pqfile.exists():
            file_chunk.to_parquet(pqfile.absolute(), engine="fastparquet")
        else:
            file_chunk.to_parquet(pqfile.absolute(), engine="fastparquet", append=True)
        #DEBUG: last_chunk = file_chunk.copy()
        print(f"({device}) Chunk processed. Last timestamp: {file_chunk.index[-1]}")
    print(f"{pqfile.name}: ", datetime.now()-start)

Processing device rack_M. 
(rack_M) Processing chunk 1...


100%|██████████| 1000000/1000000 [00:07<00:00, 130730.55it/s]
100%|██████████| 600/600 [00:00<00:00, 199950.93it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-07 23:06:00
(rack_M) Processing chunk 2...


100%|██████████| 999615/999615 [00:06<00:00, 147381.35it/s]
100%|██████████| 600/600 [00:00<00:00, 151391.59it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-09 02:53:00
(rack_M) Processing chunk 3...


100%|██████████| 999590/999590 [00:06<00:00, 155170.97it/s]
100%|██████████| 600/600 [00:00<00:00, 146151.48it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-10 06:40:00
(rack_M) Processing chunk 4...


100%|██████████| 999590/999590 [00:06<00:00, 147166.06it/s]
100%|██████████| 600/600 [00:00<00:00, 120111.80it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-11 10:27:00
(rack_M) Processing chunk 5...


100%|██████████| 999591/999591 [00:06<00:00, 158758.20it/s]
100%|██████████| 600/600 [00:00<00:00, 100011.22it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-12 14:14:00
(rack_M) Processing chunk 6...


100%|██████████| 999590/999590 [00:06<00:00, 158953.05it/s]
100%|██████████| 600/600 [00:00<00:00, 149957.24it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-13 18:01:00
(rack_M) Processing chunk 7...


100%|██████████| 999590/999590 [00:06<00:00, 150724.34it/s]
100%|██████████| 600/600 [00:00<00:00, 108872.26it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-14 21:48:00
(rack_M) Processing chunk 8...


100%|██████████| 999590/999590 [00:07<00:00, 136944.73it/s]
100%|██████████| 600/600 [00:00<00:00, 120111.80it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-16 01:35:00
(rack_M) Processing chunk 9...


100%|██████████| 999590/999590 [00:06<00:00, 151292.37it/s]
100%|██████████| 600/600 [00:00<00:00, 200094.01it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-17 05:22:00
(rack_M) Processing chunk 10...


100%|██████████| 999590/999590 [00:06<00:00, 144384.19it/s]
100%|██████████| 600/600 [00:00<00:00, 120025.87it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-18 09:09:00
(rack_M) Processing chunk 11...


100%|██████████| 999591/999591 [00:08<00:00, 121456.86it/s]
100%|██████████| 600/600 [00:00<00:00, 117224.82it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-19 12:56:00
(rack_M) Processing chunk 12...


100%|██████████| 999591/999591 [00:07<00:00, 137905.88it/s]
100%|██████████| 600/600 [00:00<00:00, 119951.50it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-20 16:43:00
(rack_M) Processing chunk 13...


100%|██████████| 999581/999581 [00:06<00:00, 153022.04it/s]
100%|██████████| 600/600 [00:00<00:00, 200062.20it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-21 20:30:00
(rack_M) Processing chunk 14...


100%|██████████| 999590/999590 [00:06<00:00, 147260.75it/s]
100%|██████████| 600/600 [00:00<00:00, 120111.80it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-23 00:17:00
(rack_M) Processing chunk 15...


100%|██████████| 999590/999590 [00:07<00:00, 140558.22it/s]
100%|██████████| 600/600 [00:00<00:00, 119980.09it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-24 04:04:00
(rack_M) Processing chunk 16...


100%|██████████| 999591/999591 [00:06<00:00, 150468.68it/s]
100%|██████████| 600/600 [00:00<00:00, 75003.20it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-25 14:08:00
(rack_M) Processing chunk 17...


100%|██████████| 999958/999958 [00:07<00:00, 125487.10it/s]
100%|██████████| 600/600 [00:00<00:00, 298668.69it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-26 17:55:00
(rack_M) Processing chunk 18...


100%|██████████| 999591/999591 [00:06<00:00, 149806.70it/s]
100%|██████████| 600/600 [00:00<00:00, 74985.32it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-27 21:42:00
(rack_M) Processing chunk 19...


100%|██████████| 999590/999590 [00:06<00:00, 148348.15it/s]
100%|██████████| 600/600 [00:00<00:00, 150082.44it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-29 01:29:00
(rack_M) Processing chunk 20...


100%|██████████| 999591/999591 [00:06<00:00, 151916.71it/s]
100%|██████████| 600/600 [00:00<00:00, 149121.97it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-30 05:16:00
(rack_M) Processing chunk 21...


100%|██████████| 999592/999592 [00:06<00:00, 153768.01it/s]
100%|██████████| 600/600 [00:00<00:00, 199903.28it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-31 09:03:00
(rack_M) Processing chunk 22...


 35%|███▌      | 350064/999592 [00:02<00:04, 152025.59it/s]


(rack_M) Chunk processed. Last timestamp: 2022-05-31 18:48:00
(rack_M) Processing chunk 23...
mph-pir-rack_M-dtyped-resampled-1minute.parquet:  0:05:16.964134
Processing device rack_R. 
(rack_R) Processing chunk 1...


100%|██████████| 1000000/1000000 [00:05<00:00, 172155.62it/s]
100%|██████████| 600/600 [00:00<00:00, 113089.58it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-07 23:06:00
(rack_R) Processing chunk 2...


100%|██████████| 999618/999618 [00:06<00:00, 150666.85it/s]
100%|██████████| 600/600 [00:00<00:00, 150055.60it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-09 02:53:00
(rack_R) Processing chunk 3...


100%|██████████| 999593/999593 [00:06<00:00, 154786.67it/s]
100%|██████████| 600/600 [00:00<00:00, 145990.39it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-10 06:40:00
(rack_R) Processing chunk 4...


100%|██████████| 999594/999594 [00:06<00:00, 162791.69it/s]
100%|██████████| 600/600 [00:00<00:00, 148208.62it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-11 10:27:00
(rack_R) Processing chunk 5...


100%|██████████| 999593/999593 [00:05<00:00, 186177.52it/s]
100%|██████████| 600/600 [00:00<00:00, 200141.75it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-12 14:14:00
(rack_R) Processing chunk 6...


100%|██████████| 999593/999593 [00:06<00:00, 166405.77it/s]
100%|██████████| 600/600 [00:00<00:00, 149716.37it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-13 18:01:00
(rack_R) Processing chunk 7...


100%|██████████| 999593/999593 [00:06<00:00, 156971.00it/s]
100%|██████████| 600/600 [00:00<00:00, 150037.70it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-14 21:48:00
(rack_R) Processing chunk 8...


100%|██████████| 999594/999594 [00:05<00:00, 166684.48it/s]
100%|██████████| 600/600 [00:00<00:00, 132277.66it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-16 01:35:00
(rack_R) Processing chunk 9...


100%|██████████| 999593/999593 [00:06<00:00, 163526.92it/s]
100%|██████████| 600/600 [00:00<00:00, 205100.44it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-17 05:22:00
(rack_R) Processing chunk 10...


100%|██████████| 999594/999594 [00:07<00:00, 135626.04it/s]
100%|██████████| 600/600 [00:00<00:00, 149823.33it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-18 09:09:00
(rack_R) Processing chunk 11...


100%|██████████| 999593/999593 [00:06<00:00, 146232.49it/s]
100%|██████████| 600/600 [00:00<00:00, 149680.75it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-19 12:56:00
(rack_R) Processing chunk 12...


100%|██████████| 999595/999595 [00:08<00:00, 116446.99it/s]
100%|██████████| 600/600 [00:00<00:00, 151336.97it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-20 16:43:00
(rack_R) Processing chunk 13...


100%|██████████| 999584/999584 [00:05<00:00, 179896.94it/s]
100%|██████████| 600/600 [00:00<00:00, 199554.55it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-21 20:30:00
(rack_R) Processing chunk 14...


100%|██████████| 999593/999593 [00:04<00:00, 223844.84it/s]
100%|██████████| 600/600 [00:00<00:00, 198374.78it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-23 00:17:00
(rack_R) Processing chunk 15...


100%|██████████| 999594/999594 [00:04<00:00, 209779.45it/s]
100%|██████████| 600/600 [00:00<00:00, 292150.27it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-24 04:04:00
(rack_R) Processing chunk 16...


100%|██████████| 999593/999593 [00:04<00:00, 213886.99it/s]
100%|██████████| 600/600 [00:00<00:00, 199554.55it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-25 08:01:00
(rack_R) Processing chunk 17...


100%|██████████| 999468/999468 [00:04<00:00, 211087.09it/s]
100%|██████████| 600/600 [00:00<00:00, 311073.23it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-26 11:48:00
(rack_R) Processing chunk 18...


100%|██████████| 999593/999593 [00:04<00:00, 214173.09it/s]
100%|██████████| 600/600 [00:00<00:00, 199538.73it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-27 15:35:00
(rack_R) Processing chunk 19...


100%|██████████| 999594/999594 [00:04<00:00, 216903.12it/s]
100%|██████████| 600/600 [00:00<00:00, 150172.00it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-28 19:22:00
(rack_R) Processing chunk 20...


100%|██████████| 999594/999594 [00:04<00:00, 211655.60it/s]
100%|██████████| 600/600 [00:00<00:00, 171335.95it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-29 23:09:00
(rack_R) Processing chunk 21...


100%|██████████| 999595/999595 [00:04<00:00, 212544.86it/s]
100%|██████████| 600/600 [00:00<00:00, 190462.61it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-31 02:56:00
(rack_R) Processing chunk 22...


 57%|█████▋    | 570267/999595 [00:02<00:01, 216555.06it/s]


(rack_R) Chunk processed. Last timestamp: 2022-05-31 18:48:00
(rack_R) Processing chunk 23...
mph-pir-rack_R-dtyped-resampled-1minute.parquet:  0:04:00.568651
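
The supplement and carryover logic in the loop above exists because a chunk of exactly `chunk_size` rows usually ends partway through a resampling bin, and resampling each chunk independently would then write two partial rows for that bin. The sketch below reproduces the idea on synthetic in-memory data by extending each chunk up to the end of the bin that contains its last row. It is a simplified illustration and not the notebook's implementation: `resample_in_chunks`, its parameters, and the column `x` are hypothetical, and it slices a DataFrame instead of reading from a mirofile.

```python
import numpy as np
import pandas as pd

def resample_in_chunks(df: pd.DataFrame, chunk_rows: int, bin_width: str) -> pd.DataFrame:
    """Resample a time-sorted frame chunk by chunk without splitting any bin."""
    bin_td = pd.to_timedelta(bin_width)
    pieces = []
    start = 0
    while start < len(df):
        stop = min(start + chunk_rows, len(df))
        # End of the bin that contains the chunk's last row.
        bin_end = df.index[stop - 1].floor(bin_td) + bin_td
        # Extend the chunk to the first row at or after that bin boundary.
        stop = df.index.searchsorted(bin_end, side="left")
        pieces.append(df.iloc[start:stop].resample(bin_td).mean())
        start = stop
    return pd.concat(pieces)

# Synthetic 10 Hz data covering 5 minutes; "x" is a made-up column.
index = pd.date_range("2022-05-06 19:20:00", periods=3000, freq="100ms")
rng = np.random.default_rng(seed=1)
df = pd.DataFrame({"x": rng.integers(0, 2, size=len(index))}, index=index)

whole = df.resample("1min").mean()
chunked = resample_in_chunks(df, chunk_rows=1000, bin_width="1min")
# Naive chunking for comparison: fixed 1000-row slices, resampled separately.
naive = pd.concat([df.iloc[i:i + 1000].resample("1min").mean()
                   for i in range(0, len(df), 1000)])

print(len(whole), len(chunked), len(naive))  # 5 5 7
```

With 1000-row chunks, the naive version emits the 19:21 and 19:23 bins twice, while the boundary-aligned version matches a single-pass resample.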


***

You can now download the output files (written to the `1_outputs_prepared` directory) from the sidebar and proceed to [TidySLAV](https://github.com/davorvr/MIROSLAV-analysis).