# Data Preprocessing Notes for i417
Author: Camilla Billari <br> 
Date: 16/01/24

Notes on the data preprocessing carried out when loading the Mele Veedu lab experiment i417.

In [1]:
import sys

import numpy as np
import pandas as pd

MAIN_DICT = "/gws/nopw/j04/ai4er/users/cgbill/earthquake-predictability"
sys.path.append(MAIN_DICT)

from utils.dataset import SlowEarthquakeDataset

## Raw Data

In [2]:
# Directory paths
GTC_DATA_DIR = "/gws/nopw/j04/ai4er/users/cgbill/earthquake-predictability/data/gtc_quakes_data"
LABQUAKES_DATA_DIR = f"{GTC_DATA_DIR}/labquakes"
MELEVEEDU_DATA_DIR = f"{LABQUAKES_DATA_DIR}/MeleVeeduetal2020"

# Open i417 experiment in a dataframe
i417_FILE_PATH = f"{MELEVEEDU_DATA_DIR}/i417/i417.txt"
with open(i417_FILE_PATH, "r") as file:
    df = pd.read_csv(
        file, sep=r"\s+", header=0, index_col=0, low_memory=False
    )  # sep=r"\s+" replaces delim_whitespace=True, which is deprecated in pandas 2.2

# Remove units
df = df.iloc[1:, :]

# Handle exception for space in "# Rec" column name creating two separate columns
cols = list(df.keys()) + [""]  # create a new cols list
df.columns = cols[1:]  # remove the first
df.pop(df.columns[-1])  # pop the last column

df.head()

| # | lp_disp | LT | Tau | SigN | dcdtOB | Time | recN | timedcdt | ec_disp | mu | etrain | slipVelocity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3805.51 | 0 | 1e-07 | 0 | 0 | 0 | 39999.9 | 0 | 0 | 0 | 0.0 |
| 1 | 0 | 3805.51 | 0 | 1e-07 | 0 | 1 | 1 | 39999.9 | 0 | 0 | 0 | 0.0 |
| 2 | 0 | 3805.51 | 0 | 1e-07 | 0 | 2 | 2 | 39999.9 | 0 | 0 | 0 | 0.0 |
| 3 | 0 | 3805.51 | 0 | 1e-07 | 0 | 3 | 3 | 39999.9 | 0 | 0 | 0 | 0.0 |
| 4 | 0 | 3805.5 | 0 | 1e-07 | 0 | 4 | 4 | 39999.9 | 0 | 0 | 0 | 0.0 |


## Pre-processed Data

In [3]:
# Access i417 and output the dataframe head using Pritt's data loaders (which use Adriano's loading + pre-processing)
dataset = SlowEarthquakeDataset(["i417"])
dataset.load()

# Get data outputs
ds_exp = dataset["i417"]
X, Y, t = ds_exp["X"], ds_exp["Y"], ds_exp["t"]

# Create dataframe
df = pd.DataFrame(
    np.hstack((X, Y, t.reshape(-1, 1))),
    columns=[ds_exp["hdrs"]["X"], *ds_exp["hdrs"]["Y"], ds_exp["hdrs"]["t"]],
)

df.head()

|   | det_shear_stress | obs_shear_stress | obs_normal_stress | obs_ecdisp | obs_shear_strain | time |
|---|---|---|---|---|---|---|
| 0 | -0.105806 | 15.857 | 24.9724 | 22476500.0 | 16191.7 | 0.0 |
| 1 | -0.101709 | 15.8611 | 24.9736 | 22477400.0 | 16192.1 | 0.01 |
| 2 | -0.111013 | 15.8518 | 24.9725 | 22477000.0 | 16191.5 | 0.02 |
| 3 | -0.117616 | 15.8452 | 24.9755 | 22476900.0 | 16192.0 | 0.03 |
| 4 | -0.12222 | 15.8406 | 24.9744 | 22476800.0 | 16191.2 | 0.04 |


## Notes on Pre-processing

### General notes:

* We have sampled 3.78% of the dataset (the 3650-3850 s time window).
* Downsampling frequency = (from Mele Veedu?).
* Original columns were: [RecNum, lp_disp, LT, Tau, SigN, dcdtOB, Time, recN, timedcdt, ec_disp, mu, etrain, slipVelocity].
* Loader output columns: [det_shear_stress, obs_shear_stress, obs_normal_stress, obs_ecdisp, obs_shear_strain, time], where:
    * Tau + polyfit -> det_shear_stress &emsp; (processed - detrended)
    * Tau -> obs_shear_stress &emsp; &emsp; &emsp; &emsp; (not processed)
    * SigN -> obs_normal_stress &emsp; &emsp; &emsp; (not processed)
    * ec_disp -> obs_ecdisp &emsp; &emsp; &emsp; &emsp; &emsp;(processed - handles exceptions)
    * etrain -> obs_shear_strain &emsp; &emsp;&emsp;&emsp;(processed - handles exceptions)
    * Time -> time &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp;&emsp;&emsp;(not processed)
* Pre-processing steps:
    * Exception handling for ec_disp and etrain: if either quantity cannot be read from the imported data, the loader falls back to an empty (all-NaN) column for obs_ecdisp or obs_shear_strain. For i417 both are present, as the loader output above shows. (See load_data(), lines 41-47.)
    * De-trending for det_shear_stress is done by fitting a first-degree polynomial to obs_shear_stress against time with np.polyfit (deg=1), and then subtracting the fitted line from obs_shear_stress. (See load_data(), lines 62-63.)
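
The de-trending step can be sketched on synthetic data as follows. The signal below is made up for illustration (a linear drift plus an oscillation) and is not taken from i417; only the `np.polyfit` fit-and-subtract pattern mirrors the loader.

```python
import numpy as np

# Synthetic stand-in for obs_shear_stress (hypothetical values, not i417 data)
t = np.arange(0.0, 200.0, 0.01)
obs_shear_stress = 15.8 + 0.002 * t + 0.1 * np.sin(2 * np.pi * t / 10.0)

# Fit a straight line (slope p[0], intercept p[1]) and subtract it
p = np.polyfit(t, obs_shear_stress, deg=1)
det_shear_stress = obs_shear_stress - (p[0] * t + p[1])

# The residual of a least-squares line fit has zero mean and no linear trend left
residual_slope = np.polyfit(t, det_shear_stress, deg=1)[0]
residual_mean = det_shear_stress.mean()
```

Because the fit includes an intercept, the de-trended series is centred on zero, which is consistent with the small negative values seen in the det_shear_stress column.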



#### Useful notes from Gualandi et al. 2023

> "We use data from 14 stick-slip friction experiments conducted at different imposed normal stress ($\sigma_n$) conditions (Mele Veedu et al., 2020). In the experiments, two layers of quartz powder are put under $\sigma_n$ and then sheared using an acrylic piston to modulate the elastic stiffness k around a value (14.8 GPa/m) (Tinti et al., 2016; Mele Veedu et al., 2020) (Fig. 1a). The applied normal stress is used as a control parameter to systematically traverse the critical stability condition for which the stiffness of the loading apparatus (k) is equal to the critical rate of frictional weakening with slip (also called the critical stiffness) of the system ($k_c \propto \frac{1}{\sigma_n}$) (Rice and Ruina, 1983; Gu et al., 1984). In fact, we use data that includes the full transition from stable frictional sliding to slow and fast stick-slip motion. The transition from slow to fast events (Fig. 1b) is defined by the peak sliding velocity during failure, which is dictated by the ratio of $k\gtrsim k_c$ or $k\lesssim k_c$. We provide this information in Tab. S1 in the Supplementary Material."

> "For normal stresses below 14 MPa slip is stable (e.g., Scuderi et al., 2016). At higher normal stresses, slow stick-slip events (labquakes) occur and these have small stress drop. At intermediate normal stresses, we observe alternating fast and slow events (e.g., Leeman et al., 2016). At the highest normal stress, labquakes are faster with elastodynamic energy release and large amplitude acoustic emissions (Bolton et al., 2020) (Fig. 1b). We use the laboratory observations to determine the number of dofs governing the system with particular focus on changes across the stability transition. We first set up a model with the appropriate number of phase space dimensions to match laboratory data and then explore possible variations for the range of labquake behaviors observed."

> "Each experiment includes several labquake cycles. We use data in a 200s time window (Fig. 1 and Tab. S2) in order to have a sufficient number of cycles to perform dynamical systems analysis. We do not extend this window because we do not want to include friction evolution effects associated with shear fabric development and wear. Throughout each experiment, the loading velocity $v_0$ and the applied $\sigma_n$ are kept constant to some precision using servo-control. Their mean values and standard deviations are reported in Tab. S2. The nominal $v_0$ was 10 μm/s for all experiments. Data are sampled every $\Delta t=0.01s$."

### Annotated Code

#### Setting Experiment Parameters
From _params.py_:

```python
elif exp == "i417":
        parameters = {
            "t0": 3650.0,           # Starting time window loaded - Note: raw data min = 0
            "tend": 3850.0,         # Ending time window loaded - Note: raw data max = 5285.9
            "Nheaders": 2,          # Number of header lines subtracted from the line count in import_data
            "dir_data": "gtc_quakes_data/labquakes/",
            "case_study": "MeleVeeduetal2020/i417",
            "data_type": "lab",
            "struct_type": "MeleVeeduetal2020",
            "file_format": "txt",
            "downsample_factor": 1, # No downsampling (if != 1, no code has been written for it)
            "vl": 10,               # Loading velocity
            "segment": None,        # Only relevant for gnss data to segment the data
            "obs_unit": "MPa",
            "time_unit": "s",
        }

        [...] # Assigns new params for obs and time labels with units
```
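As a quick check, the window set by `t0` and `tend` can be related to the 3.78% figure in the general notes. This is a sketch that assumes uniform sampling over the whole record, so that a fraction of time equals a fraction of samples; the 0.01 s interval is the one quoted from Gualandi et al. 2023.

```python
# Values copied from the i417 parameters and their comments
t0, tend = 3650.0, 3850.0   # loaded time window (s)
raw_t_max = 5285.9          # raw data maximum time (s); the raw minimum is 0
dt = 0.01                   # sampling interval (s), assumed uniform

window_length = tend - t0                    # length of the loaded window (s)
fraction = window_length / raw_t_max         # share of the full record
approx_samples = round(window_length / dt)   # approximate sample count in the window

print(f"{fraction:.2%} of the record, about {approx_samples} samples")
```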


#### Importing Data
Relevant parts from _load.py_: 

```python
def import_data(dirs, filename, parameters):
    [...] # sets format

    if struct == "MeleVeeduetal2020":
        [...] # accesses file

            Nheaders = parameters["Nheaders"] # From parameters, for "i417" =2
            L = L - Nheaders

            [...] # Creates new array columns, one per quantity in data (see below for assignment)

            [...] # loads data, for loop to assign columns from data

                # For each header, assign quantity from column - see comments for data column headers
                Rec[tt] = int(columns[0])               # RecNum
                LPDisp[tt] = float(columns[1])          # lp_disp (mic)
                LayerThick[tt] = float(columns[2])      # LT (mic) - micrometer?
                ShearStress[tt] = float(columns[3])     # Tau (MPa)
                NormStress[tt] = float(columns[4])      # SigN (MPa)
                OnBoard[tt] = float(columns[5])         # dcdtOB (mic) - micrometer?
                Time[tt] = float(columns[6])            # Time (sec)
                Rec_float[tt] = float(columns[7])       # recN
                TimeOnBoard[tt] = float(columns[8])     # timedcdt (sec)
                ecDisp[tt] = float(columns[9])          # ec_disp
                mu[tt] = float(columns[10])             # mu
                ShearStrain[tt] = float(columns[11])    # etrain
                slip_velocity[tt] = float(columns[12])  # slipVelocity (micrometer/sec)

            [...] # only keep indices with time between time range chosen (3650-3850) as set in parameters

```
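The elided time-window step is not reproduced in these notes. A minimal sketch of how such a selection could look is given below; the array values, the inclusive bounds and the `in_window` name are assumptions for illustration, not the actual code in _load.py_.

```python
import numpy as np

# Hypothetical stand-ins for two of the arrays filled in import_data
Time = np.array([3649.98, 3649.99, 3650.00, 3650.01, 3849.99, 3850.00, 3850.01])
ShearStress = np.array([15.80, 15.81, 15.86, 15.86, 15.90, 15.91, 15.92])

parameters = {"t0": 3650.0, "tend": 3850.0}

# Boolean mask for samples inside the window (inclusive bounds are an assumption)
in_window = (Time >= parameters["t0"]) & (Time <= parameters["tend"])

# Apply the same mask to every quantity so the columns stay aligned
data = {"Time": Time[in_window], "ShearStress": ShearStress[in_window]}

# Re-zero the time axis, as load_data does for MeleVeeduetal2020
t = data["Time"] - data["Time"][0]
```

Applying one mask to every array keeps the rows aligned across quantities, which is what allows the later column stacking into X and Y.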

#### Loading and Pre-processing
Note: load_data() calls import_data(), which contains the loading code for the lab data; the rest of load_data() then processes the result and returns it as X, Y, t, dt, vl.

Relevant parts from _load.py_: 

```python
def load_data(exp, dirs, params):

    if params["data_type"] == "lab":
            [...] # choose data based on params set and run import_data()

            #---- Copy obs_shear_stress, obs_normal_stress as is!
            ShearStressobs = data["ShearStress"]
            NormalStressobs = data["NormStress"]

            #---- Copy obs_ecdisp and obs_shear_strain, if error create an empty (NaN) column
            try:
                ecDispobs = data["ecDisp"]
            except Exception:
                ecDispobs = np.nan * np.ones(ShearStressobs.shape)
            try:
                ShearStrainobs = data["ShearStrain"]
            except Exception:
                ShearStrainobs = np.nan * np.ones(ShearStressobs.shape)

            [...] # about n of samples, only relevant for Marone

            #----  Reassign time for new range
            if params["struct_type"] == "MeleVeeduetal2020":
                t = data["Time"] - data["Time"][0]
                
            [...] # handle time for other experiments

            #---- Detrend shear stress (into our det_shear_stress) and normal stress
            p = np.polyfit(t, ShearStressobs, deg=1)
            ShearStressobs_det = ShearStressobs - (p[0] * t + p[1]) # our det_shear_stress
            del p

            [...] #---- Detrend normal stress, displacement and strain in the same way,
            #           but these lines are commented out in the source. Not needed?

            #---- Assign outputs 
            # observed data
            X = np.array([ShearStressobs_det]).T # our det_shear_stress, note it will be 1st column

            # observed time step
            dt = t[1] - t[0]

            vl = params["vl"]
            [...] #---- Estimate loading velocity from loading displacement if not present, but in i417 vl=10

            # Y = np.array([ShearStressobs_det, NormalStressobs_det]).T
            Y = np.array(
                [ShearStressobs, NormalStressobs, ecDispobs, ShearStrainobs]
            ).T
            # our [obs_shear_stress, obs_normal_stress, obs_ecdisp, obs_shear_strain]
            
    
    return X, Y, t, dt, vl # note we read the first 3 in as our 6-column dataset [X, Y, t]
```
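The NaN fallback and the layout of `Y` can be illustrated with a small sketch. The `get_or_nan` helper and the three-sample `data` dictionary are hypothetical; the loader itself uses inline `try`/`except Exception` blocks, and the narrower `KeyError` here is a choice made for the example.

```python
import numpy as np

# Hypothetical import_data output with no "ecDisp" or "ShearStrain" entries
data = {
    "ShearStress": np.array([15.857, 15.8611, 15.8518]),
    "NormStress": np.array([24.9724, 24.9736, 24.9725]),
}


def get_or_nan(data, key, shape):
    """Return data[key], or an all-NaN array of the given shape if the key is missing."""
    try:
        return data[key]
    except KeyError:
        return np.full(shape, np.nan)


shear = data["ShearStress"]
ec_disp = get_or_nan(data, "ecDisp", shear.shape)
shear_strain = get_or_nan(data, "ShearStrain", shear.shape)

# Same layout as Y in load_data: one row per sample, one column per quantity
Y = np.array([shear, data["NormStress"], ec_disp, shear_strain]).T
```

With the quantities missing, the third and fourth columns of `Y` are entirely NaN, so the output keeps its four-column shape regardless of which experiment is loaded.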