# MLCW Data Preparation and Compilation Workflow

## Scientific Purpose
Compile monthly multi-layer displacement measurements from all MLCW stations into a single, analysis-ready table that carries station identifiers and coordinates.

## Workflow Overview
1. **Data Extraction**
   - Load monthly displacement measurements from HDF5 file
   - Extract time series for multiple monitoring stations

2. **Data Processing Stages**
   - Decode time and measurement arrays
   - Create multi-layered DataFrame
   - Temporal filtering (from 2014 onwards)
   - Trim to first and last valid measurements

3. **Geospatial Enrichment**
   - Append station geographical coordinates (TWD97 system)
   - Add station identifiers to each measurement record

4. **Data Compilation**
   - Concatenate measurements from all stations
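   - Convert cumulative displacement to month-to-month displacement by first differencing (the cumulative-only variant is kept as commented-out code)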
   - Maintain multi-layer structure

5. **Output**
   - Save processed data as compressed pickle file
   - Keep values as floating-point numbers (no rounding is applied)
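
The temporal filtering and trimming steps listed above can be illustrated on a small synthetic frame. This is a minimal sketch, not the notebook's code: the dates and values are made up, and the column names `Layer_1` and `Layer_2` simply mirror the naming used later in the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series for one hypothetical station (values are made up).
time = pd.date_range("2013-10-01", periods=8, freq="MS")
df = pd.DataFrame(
    {
        "Layer_1": [0.0, 0.1, 0.2, np.nan, 0.4, 0.5, 0.6, np.nan],
        "Layer_2": [0.0, np.nan, 0.3, 0.4, 0.5, 0.6, 0.7, np.nan],
    },
    index=pd.Index(time, name="time"),
)

# Temporal filtering: partial-string indexing keeps rows from 2014-01-01 onwards.
df_2014 = df.loc["2014":]

# Trim to the first and last rows that hold at least one non-NaN value.
first = df_2014.first_valid_index()
last = df_2014.last_valid_index()
trimmed = df_2014.loc[first:last]
```

Note that `first_valid_index` on a DataFrame treats a row as valid when any column is non-NaN, so rows with gaps in only some layers are kept.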

## Key Scientific Techniques
- Multidimensional time series management
- Geospatial data integration
- Temporal data cleaning and validation

## Computational Strategy
- Vectorized pandas operations
- Memory-efficient data processing
- Systematic station-wise data transformation
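
One common way to keep station-wise compilation memory-efficient is to collect the per-station frames in a list and concatenate once at the end, rather than growing a DataFrame inside the loop. The sketch below is illustrative only: `build_station_frame` and the station names `STN_A` and `STN_B` are hypothetical stand-ins, not part of the MLCW dataset or the notebook.

```python
import numpy as np
import pandas as pd


def build_station_frame(name, n_months=3):
    """Return a tiny synthetic frame standing in for one station's data."""
    idx = pd.date_range("2014-01-01", periods=n_months, freq="MS", name="time")
    frame = pd.DataFrame({"Layer_1": np.arange(n_months, dtype=float)}, index=idx)
    frame.insert(0, "STATION", name)  # a scalar is broadcast to every row
    return frame


# Build every station frame first, then concatenate a single time.
frames = [build_station_frame(name) for name in ["STN_A", "STN_B"]]
combined = pd.concat(frames, axis=0)
```

Because every frame shares the same index name, the concatenated index keeps the name `time` without any extra bookkeeping.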

## Use Case
Preparing ground displacement monitoring data for geospatial analysis: per-layer monthly displacement, derived from the cumulative measurements, for every MLCW station.
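
The processing cell in this notebook derives month-to-month displacement from the cumulative series with `DataFrame.diff`. The relationship between the two representations can be sketched as follows; the values are made up for illustration.

```python
import pandas as pd

idx = pd.date_range("2014-01-01", periods=4, freq="MS", name="time")
cumulative = pd.DataFrame({"Layer_1": [0.0, -1.5, -2.0, -4.0]}, index=idx)

# First difference along the time axis; the first row has no predecessor and becomes NaN.
incremental = cumulative.diff(axis=0)

# A cumulative sum plus the initial value recovers the original series.
rebuilt = incremental.fillna(0.0).cumsum() + cumulative.iloc[0]
```

The leading NaN produced by `diff` is one reason the notebook trims each station to its first and last valid rows.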

In [1]:
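# Explicit imports for the names used below; the star imports may also provide them.
import numpy as np
import pandas as pd
from tqdm import tqdm

# The MLCW reader class comes from one of these two packages.
from my_packages import *
from appgeopy import *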

In [2]:
mlcw_fpath = r"MLCW_dataset/20250415_MLCW_CRFP_monthly_v2.h5"
mlcw_obj = MLCW(mlcw_fpath)

mlcw_measures, mlcw_metadata = mlcw_obj.get_data()

available_stations = mlcw_obj.list_stations()
available_stations[:5]

['ANHE', 'BEICHEN', 'CANLIN', 'DONGGUANG', 'ERLUN']
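
The dates in the HDF5 file are stored as byte strings, which the processing cell decodes before building a datetime index. A minimal sketch of that decoding step is shown here; the `YYYY-MM-DD` format of the sample values is an assumption, not something confirmed by the file.

```python
import numpy as np
import pandas as pd

# Hypothetical byte-string dates, standing in for an HDF5 "monthly_date" array.
raw_dates = np.array([b"2014-01-01", b"2014-02-01", b"2014-03-01"])

# Decode each element to str, then parse the list into a DatetimeIndex.
decoded = [x.decode("utf-8") for x in raw_dates]
dates = pd.to_datetime(decoded)
```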

In [3]:
station_info = pd.read_pickle(r"MLCW_dataset/MLCW_station_info.xz")
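
The later cells use `station_info` as a table indexed by station name with `X_TWD97` and `Y_TWD97` columns. The sketch below shows how such a lookup can be attached to a measurement frame; the station name `STN_A` and the coordinate values are made up and only stand in for the real table.

```python
import pandas as pd

# Hypothetical station table; the real one is loaded from the pickle above.
station_info = pd.DataFrame(
    {"X_TWD97": [200000.0, 210000.0], "Y_TWD97": [2600000.0, 2610000.0]},
    index=["STN_A", "STN_B"],
)

idx = pd.date_range("2014-01-01", periods=3, freq="MS", name="time")
df = pd.DataFrame({"Layer_1": [0.1, 0.2, 0.3]}, index=idx)

# Look up the station row once, then insert each value as a broadcast scalar.
row = station_info.loc["STN_A"]
df.insert(0, "Y_TWD97", row["Y_TWD97"])
df.insert(0, "X_TWD97", row["X_TWD97"])
df.insert(0, "STATION", "STN_A")
```

Inserting at position 0 in reverse order leaves the columns as `STATION`, `X_TWD97`, `Y_TWD97`, followed by the layer columns.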

In [6]:
combined_df = pd.DataFrame(data=None, index=None, dtype=np.float32)

# select_station = available_stations[0]

number_of_layers = []

def string_decoder(arr):
    """Decode an array of UTF-8 byte strings into a list of str."""
    return [x.decode("utf-8") for x in arr]


for select_station in tqdm(available_stations):
    
    measures_byStation = mlcw_measures[select_station]
    monthly_date_arr = pd.to_datetime(string_decoder(measures_byStation["monthly_date"]))
    monthly_values_arr = measures_byStation["monthly_values"]["compactbylayer_PCA"]
    
    cdisp_mlcw_df = pd.DataFrame(data={"time":monthly_date_arr})
    cdisp_mlcw_df = cdisp_mlcw_df.set_index("time")
    
    n_layers = monthly_values_arr.shape[0]
    number_of_layers.append(n_layers)
    
    # Report stations that have exactly three layers
    if n_layers == 3:
        print(select_station)
    
    for i in range(n_layers):
        cdisp_mlcw_df[f"Layer_{i+1}"] = monthly_values_arr[i]

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
    # 2025/4/15 : decide to keep working with cumulative displacement
    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    # cdisp_mlcw_df = cdisp_mlcw_df.set_index("time")
    # # cdisp_mlcw_df = cdisp_mlcw_df.interpolate(method="time")
    
    # cdisp_mlcw_df = cdisp_mlcw_df.loc["2014":, :]

    # first_datetime = cdisp_mlcw_df.first_valid_index()
    # last_datetime = cdisp_mlcw_df.last_valid_index()

    # cdisp_mlcw_df = cdisp_mlcw_df.loc[first_datetime:last_datetime, :]

    # y_twd97 = station_info.loc[select_station].Y_TWD97
    # x_twd97 = station_info.loc[select_station].X_TWD97
    
    # cdisp_mlcw_df.insert(loc=0, column="Y_TWD97", value=[y_twd97]*len(cdisp_mlcw_df))
    # cdisp_mlcw_df.insert(loc=0, column="X_TWD97", value=[x_twd97]*len(cdisp_mlcw_df))
    # cdisp_mlcw_df.insert(loc=0, column="STATION", value=[select_station]*len(cdisp_mlcw_df))
    
    
    # combined_df = pd.concat([combined_df, cdisp_mlcw_df], axis=0)
    # - - - - - - - - - - - - - - - - - - - - - - - - - 
    # 2025/4/8 : convert to displacement time series
    #
    # 2025/4/18: come back to reproduce displacement input data
    # for testing
    # - - - - - - - - - - - - - - - - - - - - - - - - - 
    disp_mlcw_df = cdisp_mlcw_df.diff(axis=0)
    disp_mlcw_df = disp_mlcw_df.loc["2014":, :]
    
    first_datetime = disp_mlcw_df.first_valid_index()
    last_datetime = disp_mlcw_df.last_valid_index()
    
    disp_mlcw_df = disp_mlcw_df.loc[first_datetime:last_datetime, :]

    y_twd97 = station_info.loc[select_station].Y_TWD97
    x_twd97 = station_info.loc[select_station].X_TWD97
    
    # Scalars are broadcast to every row; inserting at loc=0 in reverse order
    # yields the column order STATION, X_TWD97, Y_TWD97, Layer_1, ...
    disp_mlcw_df.insert(loc=0, column="Y_TWD97", value=y_twd97)
    disp_mlcw_df.insert(loc=0, column="X_TWD97", value=x_twd97)
    disp_mlcw_df.insert(loc=0, column="STATION", value=select_station)
    
    combined_df = pd.concat([combined_df, disp_mlcw_df], axis=0)

# Make sure the concatenated index is named "time" before saving
combined_df = combined_df.rename_axis("time")
# The .xz extension makes pandas compress the pickle with LZMA
combined_df.to_pickle(r"Monthly_MLCW_pca_DISPLACEMENT_v3.xz")

  0%|          | 0/32 [00:00<?, ?it/s]

HAIFENG
JIANYANG
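
A quick round-trip check can confirm that a compressed pickle reloads unchanged. The sketch below uses a small made-up frame and a temporary file rather than the notebook's actual output; the file name `demo_displacement.xz` and station name `STN_A` are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

idx = pd.date_range("2014-01-01", periods=3, freq="MS", name="time")
demo = pd.DataFrame(
    {"STATION": ["STN_A"] * 3, "Layer_1": [0.1, -0.2, 0.05]},
    index=idx,
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_displacement.xz"
    demo.to_pickle(path)  # compression is inferred from the .xz extension
    restored = pd.read_pickle(path)

# Count rows per station as a simple sanity check on the reloaded table.
rows_per_station = restored.groupby("STATION").size()
```

The same pattern applies to the real output file: reload it with `pd.read_pickle` and inspect the per-station row counts before using it downstream.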
