# Data Preprocessing Notes for p4679
Author: Pritthijit Nath <br> 
Date: 17/01/24 <br>
Notebook Template Credit: Camilla Billari

Notes on the data preprocessing carried out when loading experiment p4679 from the Chris Marone lab.

In [1]:
import sys

import numpy as np
import pandas as pd

MAIN_DICT = "/gws/nopw/j04/ai4er/users/pn341/earthquake-predictability"
sys.path.append(MAIN_DICT)

from utils.dataset import SlowEarthquakeDataset

## Raw Data

In [2]:
# Directory paths
GTC_DATA_DIR = "/gws/nopw/j04/ai4er/users/pn341/earthquake-predictability/data/gtc_quakes_data"
LABQUAKES_DATA_DIR = f"{GTC_DATA_DIR}/labquakes"
MARONE_DATA_DIR = f"{LABQUAKES_DATA_DIR}/Marone"

# Open p4679 experiment in a dataframe
p4679_FILE_PATH = f"{MARONE_DATA_DIR}/p4679/p4679.txt"
with open(p4679_FILE_PATH, "r") as file:
    raw_df = pd.read_csv(file, skiprows=1)

# Rename the columns to match the raw data
raw_df.columns = [
    "id",
    "lp_disp",
    "shr_stress",
    "nor_disp",
    "nor_stress",
    "time",
    "mu",
    "layer_thick",
    "ec_disp",
]

# Drop the record number column
raw_df = raw_df.drop(["id"], axis=1)

raw_df.head()

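|   | lp_disp | shr_stress | nor_disp | nor_stress | time | mu | layer_thick | ec_disp |
|---|---------|------------|----------|------------|------|----|-------------|---------|
| 0 | 0.0 | 0.0 | -3893.90946 | 1e-18 | 1.0 | 0.0 | 3766.95473 | 0.0 |
| 1 | 0.0 | 0.0 | -3893.90946 | 1e-18 | 2.0 | 0.0 | 3766.95473 | 0.0 |
| 2 | 0.0 | 0.0 | -3893.90946 | 1e-18 | 3.0 | 0.0 | 3766.95473 | 0.0 |
| 3 | 0.0 | 0.0 | -3893.90946 | 1e-18 | 4.0 | 0.0 | 3766.95473 | 0.0 |
| 4 | 0.0 | 0.0 | -3893.90946 | 1e-18 | 5.0 | 0.0 | 3766.95473 | 0.0 |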


## Pre-processed Data

In [3]:
# Access p4679 via Pritt's data loader (which uses Adriano's loading and pre-processing) and show the dataframe head
dataset = SlowEarthquakeDataset(["p4679"])

# Get data outputs
ds_exp = dataset["p4679"]
X, Y, t = ds_exp["X"], ds_exp["Y"], ds_exp["t"]

# Create dataframe
df = pd.DataFrame(
    np.hstack((X, Y, t.reshape(-1, 1))),
    columns=[ds_exp["hdrs"]["X"], *ds_exp["hdrs"]["Y"], ds_exp["hdrs"]["t"]],
)

# Drop obs_shear_strain because it contains only NaN
df = df.drop(["obs_shear_strain"], axis=1)
df.head()

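|   | det_shear_stress | obs_shear_stress | obs_normal_stress | obs_ecdisp | time |
|---|------------------|------------------|-------------------|------------|------|
| 0 | 0.057305 | 5.09152 | 6.98674 | 22107.1104 | 0.0 |
| 1 | 0.056437 | 5.090652 | 6.98841 | 22109.7823 | 0.001 |
| 2 | 0.055774 | 5.089989 | 6.986299 | 22103.79 | 0.002 |
| 3 | 0.055277 | 5.089492 | 6.98597 | 22109.2161 | 0.003 |
| 4 | 0.054028 | 5.088243 | 6.987547 | 22108.59 | 0.004 |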


In [4]:
analysis = {}

# Percentage of processed data
analysis["%_processed"] = round((len(df) / len(raw_df)) * 100, 2)

# Raw time range
analysis[
    "raw_time_range"
] = f"{raw_df['time'].iloc[0]}-{raw_df['time'].iloc[-1]}"

# Sampled time range
analysis["sampled_time_range"] = "4233.28-4535"

# Raw dataset - column count
analysis["raw_col_count"] = len(raw_df.columns)

# Processed dataset - column count
analysis["proc_col_count"] = len(df.columns)

# Raw dataset - columns
analysis["raw_cols"] = list(raw_df.columns)

# Processed dataset - columns
analysis["proc_cols"] = list(df.columns)

In [5]:
from tabulate import tabulate

x = pd.DataFrame([(k, analysis[k]) for k in analysis], columns=["", ""])
print(tabulate(x, headers="keys", tablefmt="psql"))

+----+--------------------+---------------------------------------------------------------------------------------------+
|    |                    |                                                                                             |
|----+--------------------+---------------------------------------------------------------------------------------------|
|  0 | %_processed        | 4.73                                                                                        |
|  1 | raw_time_range     | 1.0-8382.329                                                                                |
|  2 | sampled_time_range | 4233.28-4535                                                                                |
|  3 | raw_col_count      | 8                                                                                           |
|  4 | proc_col_count     | 5                                                                                           |
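|  5 | raw_cols           | ['lp_disp', 'shr_stress', 'nor_disp', 'nor_stress', 'time', 'mu', 'layer_thick', 'ec_disp'] |
|  6 | proc_cols          | ['det_shear_stress', 'obs_shear_stress', 'obs_normal_stress', 'obs_ecdisp', 'time']         |
+----+--------------------+---------------------------------------------------------------------------------------------+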

## Notes on Pre-processing

### General notes:

* The processed data covers 4.73% of the rows in the raw dataset (the 4233.28-4535 s time window).
* Downsampling frequency = 1s (window), 0.1s (shift) (from Laurentti et al.).
* Original columns were: [lp_disp, shr_stress, nor_disp, nor_stress, time, mu, layer_thick, ec_disp].
* Pre-processed columns: [det_shear_stress, obs_shear_stress, obs_normal_stress, obs_ecdisp, time], where:
    * shr_stress + polyfit -> det_shear_stress (processed: detrended)
    * shr_stress -> obs_shear_stress (not processed)
    * nor_stress -> obs_normal_stress (not processed)
    * ec_disp -> obs_ecdisp (processed: replaced by a NaN column if it cannot be read)
    * time -> time (shifted so the selected window starts at 0)
* Pre-processing steps:
    * Handling exceptions for ec_disp and obs_shear_strain: if either quantity cannot be read from the imported data, it is replaced by a column of NaN (obs_ecdisp or obs_shear_strain). For p4679, obs_shear_strain is entirely NaN and is dropped in the notebook above. (See load_data(), lines 42-49.)
    * De-trending for det_shear_stress is done by fitting a degree-1 polynomial (np.polyfit) to obs_shear_stress as a function of time, and then subtracting the fitted line from obs_shear_stress. (See load_data(), lines 67-69.)
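
The de-trending step can be illustrated on synthetic data. This is only a minimal sketch: the signal below is made up (a linear drift plus an oscillation) and stands in for obs_shear_stress, and the final refit is an added sanity check that is not part of load_data().

```python
import numpy as np

# Synthetic stand-in for obs_shear_stress: linear drift plus an oscillation.
t = np.arange(0.0, 10.0, 0.001)  # time in seconds, starting at 0
obs_shear_stress = 5.0 + 0.02 * t + 0.05 * np.sin(2 * np.pi * t)

# Fit a straight line (degree 1) and subtract it, mirroring the step in load_data().
slope, intercept = np.polyfit(t, obs_shear_stress, deg=1)
det_shear_stress = obs_shear_stress - (slope * t + intercept)

# Sanity check (not in load_data()): refit the residual to confirm no linear trend is left.
residual_slope, residual_intercept = np.polyfit(t, det_shear_stress, deg=1)
```

After subtraction the residual has, up to floating-point error, zero mean and zero slope, which is why det_shear_stress in the processed dataframe oscillates around 0 while obs_shear_stress stays near 5 MPa.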

### Annotated Code

#### Setting Experiment Parameter
From _params.py_: 

```python
elif exp == "p4679":
        parameters = {
            "t0": 4233.28,            # Starting time window loaded - Note: raw data min = 0
            "tend": 4535.0,           # Ending time window loaded - Note: raw data max = 8382.329
            "Nheaders": 2,            # Number of header lines skipped in import_data
            "dir_data": "gtc_quakes_data/labquakes/",
            "case_study": "Marone/p4679",
            "data_type": "lab",
            "struct_type": "Marone",
            "file_format": "txt",
            "downsample_factor": 1,   # No downsampling (if != 1, no code has been written for it)
            "vl": None,               # Loading velocity
            "segment": None,          # Only relevant for gnss data to segment the data
            "obs_unit": "MPa",
            "time_unit": "s",
        }
        [...] # Assigns new params for obs and time labels with units
```


#### Importing Data
Relevant parts from _load.py_: 

```python
def import_data(dirs, filename, parameters):
    [...] # sets format

    if struct == "Marone":
        [...] # accesses file

            Nheaders = parameters["Nheaders"] # From parameters, for "p4679" =2
            L = L - Nheaders

            [...] # Creates new array columns, one per quantity in data (see below for assignment)

            [...] # loads data, for loop to assign columns from data

                # For each header, assign quantity from column - see comments for data column headers
                Rec[tt] = int(columns[0])               # id
                LPDisp[tt] = float(columns[1])          # lp_disp (mic)                
                ShearStress[tt] = float(columns[2])     # shr_stress (MPa)
                NormDisp[tt] = float(columns[3][:-1])   # nor_disp (mic)
                NormStress[tt] = float(columns[4])      # nor_stress (MPa)
                Time[tt] = float(columns[5])            # time (sec)                
                mu[tt] = float(columns[6][:-1])         # mu
                LayerThick[tt] = float(columns[7][:-1]) # layer_thick (mic)
                ecDisp[tt] = float(columns[8])          # ec_disp (mic)

            [...] # only keep indices with time between time range chosen (4233.28-4535) as set in parameters

```
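
The elided time-range selection at the end of import_data() is not reproduced here. A minimal sketch of how such a selection could be written with a boolean mask is given below; the arrays are synthetic stand-ins, and whether the real code includes or excludes the end points is an assumption.

```python
import numpy as np

# Hypothetical stand-ins for the Time and ShearStress arrays filled by import_data().
time = np.linspace(0.0, 8382.329, 100_000)  # seconds
shear_stress = np.random.default_rng(0).normal(5.0, 0.1, time.size)

# Window bounds taken from the p4679 parameters.
t0, tend = 4233.28, 4535.0

# Keep only samples inside [t0, tend]; inclusive bounds are an assumption.
in_window = (time >= t0) & (time <= tend)
time_win = time[in_window]
shear_win = shear_stress[in_window]
```

The same mask would be applied to every quantity so that all columns stay aligned with the selected times.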

#### Loading and Pre-processing
Note: load_data() calls import_data(), which contains the actual loading code; the rest of load_data() then processes the imported data and returns X, Y, t, dt, vl.

Relevant parts from _load.py_: 

```python
def load_data(exp, dirs, params):

    if params["data_type"] == "lab":
            [...] # choose data based on params set and run import_data()

            #---- Copy obs_shear_stress, obs_normal_stress as is!
            ShearStressobs = data["ShearStress"]
            NormalStressobs = data["NormStress"]

            #---- Copy obs_ecdisp and obs_shear_strain, if error create an empty (NaN) column
            try:
                ecDispobs = data["ecDisp"]
            except Exception:
                ecDispobs = np.nan * np.ones(ShearStressobs.shape)
            try:
                ShearStrainobs = data["ShearStrain"]
            except Exception:
                ShearStrainobs = np.nan * np.ones(ShearStressobs.shape)

            [...] # about n of samples, only relevant for Marone

            #----  Reassign time for new range
            elif params["struct_type"] == "Marone":
                t = data["Time"] - data["Time"][0]
                
            [...] # handle time for other experiments

            #---- Detrend shear stress (into our det_shear_stress) and normal stress
            p = np.polyfit(t, ShearStressobs, deg=1)
            ShearStressobs_det = ShearStressobs - (p[0] * t + p[1]) # our det_shear_stress
            del p

            [...] #---- Detrend normal stress, displacement and strain in same way,
            #           but they are already commented out? No need?

            #---- Assign outputs 
            # observed data
            X = np.array([ShearStressobs_det]).T # our det_shear_stress, note it will be 1st column

            # observed time step
            dt = t[1] - t[0]

            vl = params["vl"]
            [...] #---- Estimate loading velocity from loading displacement if not present, but in p4679 vl=None

            # Y = np.array([ShearStressobs_det, NormalStressobs_det]).T
            Y = np.array(
                [ShearStressobs, NormalStressobs, ecDispobs, ShearStrainobs]
            ).T
            # our [obs_shear_stress, obs_normal_stress, obs_ecdisp, obs_shear_strain]
            
    
    return X, Y, t, dt, vl # note we read the first 3 in as our 6-column dataset [X, Y, t]
```
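
The NaN fallback and the later removal of the empty column can be sketched as follows. The `data` dict and the helper `get_or_nan` are hypothetical and only used for illustration; the helper catches `KeyError` specifically, whereas the excerpt above catches the broader `Exception`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dict returned by import_data(); "ShearStrain" is absent.
data = {
    "ShearStress": np.array([5.09, 5.08, 5.07]),
    "NormStress": np.array([6.98, 6.99, 6.98]),
    "ecDisp": np.array([22107.1, 22109.8, 22103.8]),
}


def get_or_nan(data, key, like):
    """Return data[key], or a NaN array shaped like `like` if the key is missing."""
    try:
        return data[key]
    except KeyError:
        return np.full(like.shape, np.nan)


shear = data["ShearStress"]
ec_disp = get_or_nan(data, "ecDisp", shear)
shear_strain = get_or_nan(data, "ShearStrain", shear)  # missing, so all NaN

# Stack the observed quantities column-wise, as done for Y in load_data().
Y = np.array([shear, data["NormStress"], ec_disp, shear_strain]).T
df = pd.DataFrame(
    Y,
    columns=["obs_shear_stress", "obs_normal_stress", "obs_ecdisp", "obs_shear_strain"],
)

# Drop any column that is entirely NaN (here: obs_shear_strain).
df = df.dropna(axis=1, how="all")
```

Using `dropna(axis=1, how="all")` removes only columns with no valid values, so a partially filled column would be kept; the notebook instead drops obs_shear_strain by name.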