# Data Processing

- Using the other Jupyter notebooks, you've generated the following:
    - .h5 files containing atomic point clouds
    - CSV files containing targets and global descriptors (hMOFX-DB only)
    
___________


- This notebook gives recommendations that you should follow to ensure compatibility with our training script and to prevent data leakage. It also provides the code for rotational data augmentation.

___________

- **We recommend the following steps for rotational data augmentation:** 
    - 1. Manually split the data into train/validation/test sets, generating three CSV files with non-overlapping structure IDs.
    - 2. Perform rotational upsampling of the atomic point clouds (.h5). This changes the structure IDs.
    - 3. Update the structure IDs in the CSV files to match (training set first, then validation and test sets, using separate functions).
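
Step 1 can be sketched as follows. This is a minimal illustration rather than a prescribed procedure: the helper name `split_by_structure_id`, the 80/10/10 fractions, and the fixed seed are assumptions, and the ID column is assumed to be named `ID`.

```python
import numpy as np
import pandas as pd


def split_by_structure_id(df, id_col="ID", fractions=(0.8, 0.1, 0.1), seed=0):
    """Split df into train/val/test so that no structure ID is shared.

    Hypothetical helper: the name, fractions, and seed are illustrative.
    """
    # Shuffle the unique IDs rather than the rows, so that rows sharing
    # an ID always end up in the same split.
    unique_ids = df[id_col].drop_duplicates().tolist()
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(unique_ids))
    shuffled = [unique_ids[k] for k in order]

    n_train = int(fractions[0] * len(shuffled))
    n_val = int(fractions[1] * len(shuffled))
    train_ids = shuffled[:n_train]
    val_ids = shuffled[n_train:n_train + n_val]
    test_ids = shuffled[n_train + n_val:]  # remainder goes to the test set

    train_df = df[df[id_col].isin(train_ids)].reset_index(drop=True)
    val_df = df[df[id_col].isin(val_ids)].reset_index(drop=True)
    test_df = df[df[id_col].isin(test_ids)].reset_index(drop=True)
    return train_df, val_df, test_df


# Example with made-up IDs and targets.
example = pd.DataFrame({"ID": [f"s{i}" for i in range(20)], "target": range(20)})
train_df, val_df, test_df = split_by_structure_id(example)
# Each split can then be written with DataFrame.to_csv(path, index=False).
```

Splitting on unique IDs before any upsampling is what keeps rotated copies of one structure from ending up in different splits.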

### Rotational Data Augmentation for Atomic Point Clouds (.h5)

- This code modifies the structure IDs used to reference each atomic point cloud.
- The number of point clouds added per structure equals the number of rotational duplicates requested.
- For instance, if we upsample by **adding** 5 rotational duplicates:
    - This changes the ID of the canonical point cloud to {ID}_0.
    - We further add rotated duplicates {ID}_1 ... {ID}_5 to the .h5 file.
- Read the section on processing target CSV files below, because this renaming affects how the training, validation, and test CSV files must be processed.
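
The naming scheme above can be illustrated with a small sketch. The helper `augmented_ids` and the ID `struct-42` are hypothetical and serve only to show which keys end up in the augmented file.

```python
def augmented_ids(base_id, n_duplicates):
    """Return the keys written for one structure: {ID}_0 ... {ID}_N."""
    # Suffix 0 is the canonical (unrotated) point cloud; 1..N are rotated copies.
    return [f"{base_id}_{j}" for j in range(n_duplicates + 1)]


# "struct-42" is a made-up ID used for illustration.
ids = augmented_ids("struct-42", 5)
print(ids)
# ['struct-42_0', 'struct-42_1', 'struct-42_2', 'struct-42_3', 'struct-42_4', 'struct-42_5']
```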

### CAUTION: This code can generate a large file (~20-50 GB). Make sure enough storage is available before running it.
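
A rough size estimate can be made before running the augmentation. The sketch below assumes the output is about (duplicates + 1) times the size of the input, which ignores HDF5 overhead and underestimates the result if the input file is compressed. The helper names are hypothetical.

```python
import shutil


def estimate_augmented_size_gb(input_bytes, n_duplicates):
    """Rough estimate: the output stores n_duplicates + 1 clouds per structure."""
    return input_bytes * (n_duplicates + 1) / 1e9


def free_space_gb(directory):
    """Free space (in GB) on the filesystem that contains `directory`."""
    return shutil.disk_usage(directory).free / 1e9


# Example with an assumed 1.5 GB input file and 24 duplicates.
estimate = estimate_augmented_size_gb(1.5e9, 24)
print(f"Estimated output size: {estimate:.1f} GB")  # Estimated output size: 37.5 GB

# With real paths, os.path.getsize(input_h5_path) gives the input size in bytes,
# and free_space_gb(<output directory>) gives the space available for the output.
```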

In [1]:
input_h5_path = ""    # path to the original atomic point cloud file (.h5)
output_h5_path = ""   # path for the rotationally upsampled output file (.h5)
duplicates = 24       # number of rotated copies added per structure; 24 in all our experiments
normalization = False # unit-sphere normalization; False in all our experiments

In [8]:
import numpy as np
import h5py
import pandas as pd
from scipy.spatial.transform import Rotation as R

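def normalize_to_unit_sphere(point_cloud: np.ndarray) -> np.ndarray:
    """Center the xyz columns (first three) on their centroid and scale them into the unit sphere."""
    pc = point_cloud.copy()
    xyz = pc[:, :3]
    centroid = np.mean(xyz, axis=0)
    xyz -= centroid
    max_dist = np.max(np.linalg.norm(xyz, axis=1))
    if max_dist > 0:  # skip scaling if all points coincide
        xyz /= max_dist
    pc[:, :3] = xyz
    return pc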

def random_rotation(point_cloud: np.ndarray, seed: int = None) -> np.ndarray:
    """Apply one random 3D rotation to the xyz columns; all other columns are left unchanged."""
    pc = point_cloud.copy()
    xyz = pc[:, :3]
    rng = np.random.default_rng(seed)
    rot = R.random(random_state=rng)
    xyz_rot = rot.apply(xyz)
    pc[:, :3] = xyz_rot
    return pc

def augment_h5_pointclouds(input_path: str, output_path: str, N: int, normalize: bool = True):
    """Write each point cloud as {ID}_0 plus N randomly rotated copies {ID}_1 ... {ID}_N."""
    with h5py.File(input_path, "r") as f_in, h5py.File(output_path, "w") as f_out:
        # Fresh entropy is drawn on every call, so the rotations differ between runs.
        global_seed = np.random.SeedSequence().entropy

        for i, id_ in enumerate(f_in.keys()):
            pc = f_in[id_][:]
            id_str = id_.decode("utf-8") if isinstance(id_, bytes) else str(id_)

            # Normalize once if requested
            base_pc = normalize_to_unit_sphere(pc) if normalize else pc

            # Save the unrotated (canonical) point cloud as {ID}_0
            f_out.create_dataset(f"{id_str}_0", data=base_pc)

            # Generate N rotated versions; each seed is derived from the run-level
            # seed, the structure index i, and the copy index j
            for j in range(1, N + 1):
                seed = np.random.SeedSequence([global_seed, i, j]).generate_state(1)[0]
                rotated = random_rotation(base_pc, seed=seed)
                f_out.create_dataset(f"{id_str}_{j}", data=rotated)

    print(f"Augmented dataset saved to {output_path}")

In [9]:
augment_h5_pointclouds(input_h5_path, output_h5_path, duplicates, normalization)

Augmented dataset saved to /scratch/sk10275/AdsMOFNet-LIBRARY/0_final_submission_bench/datasets/jarvis_dft_2021_structures_up.h5
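
As an optional sanity check, a rotated copy {ID}_j can be compared against {ID}_0 from the output file. The sketch below is an assumption-laden illustration, not part of the pipeline: `is_rotated_copy` is a hypothetical helper that checks whether pairwise distances and distances to the origin are preserved and whether the non-coordinate columns are unchanged. It cannot distinguish a rotation from a reflection, and it builds an n x n distance matrix, so it is only practical for individual point clouds.

```python
import numpy as np


def pairwise_distances(xyz):
    """Dense matrix of Euclidean distances between all pairs of points."""
    diff = xyz[:, None, :] - xyz[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def is_rotated_copy(original, rotated, atol=1e-6):
    """Check that `rotated` is consistent with a rotation of `original` about the origin."""
    if original.shape != rotated.shape:
        return False
    # Columns after xyz (e.g. per-atom features) must be untouched.
    same_features = np.array_equal(original[:, 3:], rotated[:, 3:])
    # A rotation about the origin preserves each point's distance to the origin...
    same_norms = np.allclose(
        np.linalg.norm(original[:, :3], axis=1),
        np.linalg.norm(rotated[:, :3], axis=1),
        atol=atol,
    )
    # ...and all distances between points.
    same_geometry = np.allclose(
        pairwise_distances(original[:, :3]),
        pairwise_distances(rotated[:, :3]),
        atol=atol,
    )
    return same_features and same_norms and same_geometry


# Synthetic example: 10 points with xyz plus one feature column.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(10, 4))
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
turned = cloud.copy()
turned[:, :3] = cloud[:, :3] @ q.T
check_ok = is_rotated_copy(cloud, turned)

# With the real file, the two arrays would be read from the augmented .h5,
# e.g. f[f"{some_id}_0"][:] and f[f"{some_id}_1"][:] inside h5py.File(output_h5_path, "r").
```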


### Processing Target CSV Files 

- Rotational data augmentation changes the structure IDs of the point clouds, as described above. 
- For the training set, we want to use all of the upsampled structures. We therefore apply **upsample_training_dataframe**, which replaces each row with N + 1 rows whose IDs carry the suffixes _0 through _N.
- For the validation and test sets, we want to use only the original structures, so we append only _0 to the ID column. To this end, we apply **edit_val_test_dataframe**.
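
Because every rotated copy of a structure shares its base ID, leakage can be checked after the suffixes have been added by stripping them again and comparing the splits. This is a minimal sketch: `base_ids` and `find_leaked_ids` are hypothetical helpers, and they assume every ID already carries a trailing _{j} suffix.

```python
def base_ids(ids):
    """Strip the trailing _{j} suffix from each ID and return the unique base IDs."""
    # rsplit with maxsplit=1 removes only the last suffix, so base IDs that
    # contain underscores themselves are preserved.
    return {str(i).rsplit("_", 1)[0] for i in ids}


def find_leaked_ids(train_ids, val_ids, test_ids):
    """Return base IDs that occur in more than one split (empty set if none)."""
    train, val, test = base_ids(train_ids), base_ids(val_ids), base_ids(test_ids)
    return (train & val) | (train & test) | (val & test)


# Made-up IDs for illustration.
clean = find_leaked_ids(["a_0", "a_1", "b_0", "b_1"], ["c_0"], ["d_0"])
leaky = find_leaked_ids(["a_0", "a_1", "b_0", "b_1"], ["b_0"], ["d_0"])
print(clean)  # set()
print(leaky)  # {'b'}
```

With the DataFrames from this notebook, the three arguments would be the ID columns of the upsampled training set and the edited validation and test sets.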

In [19]:
training_df = pd.read_csv("")  # training split CSV
val_df = pd.read_csv("")       # validation split CSV
test_df = pd.read_csv("")      # test split CSV

structural_id_col = "ID" # name of the structure ID column (typically 'ID')

training_csv_upsampled_save_path = ""
val_csv_edited_save_path = ""
test_csv_edited_save_path = ""

In [20]:
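import numpy as np
import pandas as pd

def upsample_training_dataframe(
    df: pd.DataFrame,
    N: int,
    structural_id_col: str
) -> pd.DataFrame:
    """Repeat every row N + 1 times and suffix the IDs with _0 ... _N."""
    if structural_id_col not in df.columns:
        raise KeyError(f"Column '{structural_id_col}' not found in DataFrame.")

    # Repeat rows positionally so column dtypes are preserved. Iterating with
    # iterrows() can upcast numeric IDs to float, producing IDs such as "12.0_0".
    repeated = df.iloc[np.repeat(np.arange(len(df)), N + 1)].reset_index(drop=True)

    # Suffixes cycle 0..N for each original row, matching the keys in the .h5 file
    suffixes = np.tile(np.arange(N + 1), len(df))
    base_ids = repeated[structural_id_col].astype(str)
    repeated[structural_id_col] = [f"{b}_{j}" for b, j in zip(base_ids, suffixes)]
    return repeated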

def edit_val_test_dataframe(
    df: pd.DataFrame,
    structural_id_col: str
) -> pd.DataFrame:
    """Append _0 to every ID so that only the unrotated point clouds are referenced."""
    if structural_id_col not in df.columns:
        raise KeyError(f"Column '{structural_id_col}' not found in DataFrame.")

    new_df = df.copy()
    new_df[structural_id_col] = new_df[structural_id_col].astype(str) + "_0"
    return new_df

In [21]:
training_df_upsampled = upsample_training_dataframe(training_df, duplicates, structural_id_col)
val_df_edited = edit_val_test_dataframe(val_df, structural_id_col)
test_df_edited = edit_val_test_dataframe(test_df, structural_id_col)

training_df_upsampled.to_csv(training_csv_upsampled_save_path, index=False)
val_df_edited.to_csv(val_csv_edited_save_path, index=False)
test_df_edited.to_csv(test_csv_edited_save_path, index=False)
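
Before training, it can be worth confirming that every ID in the saved CSV files resolves to a dataset in the augmented .h5 file. The following is a sketch under the assumption that the .h5 keys are available as a collection of strings; `missing_ids` is a hypothetical helper.

```python
def missing_ids(csv_ids, h5_keys):
    """Return the CSV IDs (as strings) that have no matching key in the .h5 file."""
    keys = set(h5_keys)
    return sorted(i for i in map(str, csv_ids) if i not in keys)


# Made-up keys and IDs for illustration.
h5_keys = ["a_0", "a_1", "a_2", "b_0", "b_1", "b_2"]
print(missing_ids(["a_0", "a_1", "a_2"], h5_keys))  # []
print(missing_ids(["b_0", "c_0"], h5_keys))         # ['c_0']

# With the real data, the keys can be read once from the output file:
#     with h5py.File(output_h5_path, "r") as f:
#         h5_keys = set(f.keys())
# and compared against training_df_upsampled[structural_id_col], etc.
# The upsampled training set should also contain
# len(training_df) * (duplicates + 1) rows.
```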