# UCA-EHAR – Data Preprocessing for Unit 2.1 (Native AI Solutions Development)

This notebook builds a **preprocessed dataset** from the original archive `UCA-EHAR-1.0.0.zip`.

Goals:

- Read every CSV file inside the original ZIP.
- Keep only the 8 activities relevant to our Smart Glasses scenario:
  - `STANDING`, `SITTING`, `WALKING`, `WALKING_UPSTAIRS`,
    `WALKING_DOWNSTAIRS`, `RUNNING`, `LYING`, `DRINKING`
- Segment each continuous run of a single activity into sliding windows:
  - 100 samples per window, with 50% overlap (a step of 50 samples).
- Generate NumPy tensors:
  - `X` shape → `(num_windows, 100, 7)` using sensor signals `Ax, Ay, Az, Gx, Gy, Gz, P`
  - `y` shape → `(num_windows,)` with label IDs `0–7`
- Split the dataset into `train / val / test` (70% / 15% / 15%).
- Save everything into a compressed `.npz` file for the Unit 2.1 optimization lab.
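
To make the windowing goal concrete, here is a minimal sketch of how many windows a single activity run yields with a window of 100 samples and a step of 50. The helper name `window_starts` and the 260-sample run are hypothetical, chosen only for illustration.

```python
def window_starts(segment_len, window_size=100, step_size=50):
    """Return the start offsets of all full windows that fit in a segment."""
    if segment_len < window_size:
        return []  # segment too short: no window is produced
    return list(range(0, segment_len - window_size + 1, step_size))


# A hypothetical 260-sample run of one activity
starts = window_starts(260)
print(starts)       # [0, 50, 100, 150]
print(len(starts))  # 4, i.e. (260 - 100) // 50 + 1
```

The last window covers samples 150 to 249, so the trailing 10 samples of this run do not appear in any window.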

In [16]:
# Import required libraries
import collections
import zipfile

import numpy as np
import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.width", 120)

## Main Parameters

(Update the file paths if the ZIP is located elsewhere.)

In [17]:
import os

# Path to the original dataset ZIP file
ZIP_PATH = os.path.join("..", "data", "UCA-EHAR-1.0.0.zip")

# Output NPZ file
OUTPUT_NPZ = os.path.join("..", "data", "uca_ehar_preprocessed_win100_step50.npz")

# Target activities for our use case
TARGET_CLASSES = [
    "STANDING",
    "SITTING",
    "WALKING",
    "WALKING_UPSTAIRS",
    "WALKING_DOWNSTAIRS",
    "RUNNING",
    "LYING",
    "DRINKING",
]

# Sliding window parameters
WINDOW_SIZE = 100  # samples per window
STEP_SIZE = 50     # window shift (50% overlap)

TARGET_CLASSES

['STANDING',
 'SITTING',
 'WALKING',
 'WALKING_UPSTAIRS',
 'WALKING_DOWNSTAIRS',
 'RUNNING',
 'LYING',
 'DRINKING']

## Inspect ZIP Contents

In [18]:
# List CSV files inside the ZIP and display one sample

with zipfile.ZipFile(ZIP_PATH, "r") as z:
    csv_files = [name for name in z.namelist() if name.endswith(".csv")]

print(f"Total CSV files: {len(csv_files)}")
print("Examples:")
for name in csv_files[:10]:
    print("  ", name)

sample_name = csv_files[0]
print("\nPreview of:", sample_name)

with zipfile.ZipFile(ZIP_PATH, "r") as z:
    with z.open(sample_name) as f:
        df_sample = pd.read_csv(f, sep=";")

display(df_sample.head())
print("\nColumns:", list(df_sample.columns))

Total CSV files: 135
Examples:
   UCA-EHAR-1.0.0/WALKING_T9.csv
   UCA-EHAR-1.0.0/WALKING_T8.csv
   UCA-EHAR-1.0.0/WALKING_T7.csv
   UCA-EHAR-1.0.0/WALKING_T6.csv
   UCA-EHAR-1.0.0/WALKING_T5.csv
   UCA-EHAR-1.0.0/WALKING_T4.csv
   UCA-EHAR-1.0.0/WALKING_T3.csv
   UCA-EHAR-1.0.0/WALKING_T2.csv
   UCA-EHAR-1.0.0/WALKING_T21.csv
   UCA-EHAR-1.0.0/WALKING_T20.csv

Preview of: UCA-EHAR-1.0.0/WALKING_T9.csv


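|   | T | Ax | Ay | Az | Gx | Gy | Gz | P | CLASS |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 35.0 | 1.22 | -6.03 | 8.37 | -2.83 | 6.4 | -1.95 | 1004.69 | STANDING |
| 1 | 74.0 | 1.22 | -5.99 | 8.2 | 0.07 | -0.01 | -0.06 | 1004.67 | STANDING |
| 2 | 113.0 | 0.1 | -4.44 | 8.64 | 0.11 | 0.05 | 0.16 | 1004.66 | STANDING |
| 3 | 156.0 | 0.64 | -5.82 | 8.19 | 0.12 | 0.13 | 0.54 | 1004.64 | STANDING |
| 4 | 195.0 | 0.63 | -5.99 | 8.52 | 0.06 | 0.18 | 0.45 | 1004.63 | STANDING |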



Columns: ['T', 'Ax', 'Ay', 'Az', 'Gx', 'Gy', 'Gz', 'P', 'CLASS']


## Helper Function: Sliding Window Extraction

In [19]:
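def extract_windows_from_df(df, target_classes, window_size, step_size):
    """Extract sliding windows from continuous activity segments.

    df: DataFrame with columns ['T','Ax','Ay','Az','Gx','Gy','Gz','P','CLASS']
    target_classes: labels to keep; rows with any other label are dropped
    window_size: number of samples per window
    step_size: shift between the starts of consecutive windows

    Returns (X_list, y_list): a list of float32 arrays of shape
    (window_size, 7) and the matching list of class-name strings.

    Note: rows are filtered before segments are detected, so two runs of the
    same class separated only by dropped rows are treated as one segment.
    """
    # Drop incomplete rows and keep only the activities of interest
    df = df.dropna()
    df = df[df["CLASS"].isin(target_classes)].reset_index(drop=True)
    if df.empty:
        return [], []

    X_list, y_list = [], []
    start_idx = 0                       # first row of the current segment
    current_class = df.loc[0, "CLASS"]  # label of the current segment

    # Iterate one step past the end so the final segment is flushed too
    for i in range(1, len(df) + 1):
        if i == len(df) or df.loc[i, "CLASS"] != current_class:
            seg_len = i - start_idx

            # Segments shorter than one window produce no output
            if seg_len >= window_size:
                seg = df.iloc[start_idx:i]
                for start in range(0, seg_len - window_size + 1, step_size):
                    win = seg.iloc[start:start + window_size]
                    feats = win[["Ax","Ay","Az","Gx","Gy","Gz","P"]].to_numpy(dtype=np.float32)
                    X_list.append(feats)
                    y_list.append(current_class)

            # Start a new segment at the row where the label changed
            if i < len(df):
                start_idx = i
                current_class = df.loc[i, "CLASS"]

    return X_list, y_list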
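
For comparison, the same segmentation can be sketched without the row-by-row loop. This is a hypothetical variant, not the function used in this notebook. It assumes the DataFrame has already been filtered to the target classes, labels each run of identical consecutive `CLASS` values with a run id, and cuts windows with NumPy's `sliding_window_view` (available in NumPy 1.20 and later). The names `extract_windows_vectorized`, `FEATURES`, and the synthetic `demo` frame are made up for illustration.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

FEATURES = ["Ax", "Ay", "Az", "Gx", "Gy", "Gz", "P"]


def extract_windows_vectorized(df, window_size, step_size):
    """Hypothetical variant; expects df already filtered to target classes."""
    X_parts, y_list = [], []
    # A new run id starts whenever the label differs from the previous row
    run_id = (df["CLASS"] != df["CLASS"].shift()).cumsum()
    for _, seg in df.groupby(run_id, sort=False):
        if len(seg) < window_size:
            continue  # run too short for a single window
        values = seg[FEATURES].to_numpy(dtype=np.float32)
        # Shape (len - window_size + 1, 7, window_size); keep every step_size-th start
        wins = sliding_window_view(values, window_size, axis=0)[::step_size]
        # Reorder to (n_windows, window_size, 7)
        X_parts.append(np.transpose(wins, (0, 2, 1)))
        y_list.extend([seg["CLASS"].iloc[0]] * len(wins))
    if not X_parts:
        return np.empty((0, window_size, len(FEATURES)), dtype=np.float32), []
    return np.concatenate(X_parts, axis=0), y_list


# Synthetic data: 120 rows of STANDING followed by 230 rows of WALKING
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(350, 7)), columns=FEATURES)
demo["CLASS"] = ["STANDING"] * 120 + ["WALKING"] * 230

X_demo, y_demo = extract_windows_vectorized(demo, window_size=100, step_size=50)
print(X_demo.shape)  # (4, 100, 7): 1 STANDING window + 3 WALKING windows
print(y_demo)
```

The loop-based helper above returns lists of per-window arrays, whereas this sketch returns one stacked array, so the two are not drop-in replacements for each other.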

## Generate Sliding Windows From All CSV Files

In [20]:
all_X, all_y = [], []

with zipfile.ZipFile(ZIP_PATH, "r") as z:
    csv_files = [name for name in z.namelist() if name.endswith(".csv")]
    print(f"Processing {len(csv_files)} files...")

    for name in csv_files:
        with z.open(name) as f:
            df = pd.read_csv(f, sep=";")

        X_list, y_list = extract_windows_from_df(df, TARGET_CLASSES, WINDOW_SIZE, STEP_SIZE)

        all_X.extend(X_list)
        all_y.extend(y_list)

print("Total windows generated:", len(all_X))
print("Total labels:", len(all_y))

Processing 135 files...
Total windows generated: 11774
Total labels: 11774


## Convert to NumPy Arrays + Dataset Summary

In [21]:
X = np.stack(all_X, axis=0)
class_names = TARGET_CLASSES.copy()
label_idx = {cls: i for i, cls in enumerate(class_names)}
y = np.array([label_idx[label] for label in all_y], dtype=np.int64)

print("X shape:", X.shape)
print("y shape:", y.shape)
print("Label mapping:", label_idx)

counter = collections.Counter(y.tolist())
print("\nClass distribution:")
for idx, name in enumerate(class_names):
    count = counter.get(idx, 0)
    perc = 100.0 * count / len(y)
    print(f"{idx:2d} - {name:20s}: {count:6d}  ({perc:5.1f}%)")

X shape: (11774, 100, 7)
y shape: (11774,)
Label mapping: {'STANDING': 0, 'SITTING': 1, 'WALKING': 2, 'WALKING_UPSTAIRS': 3, 'WALKING_DOWNSTAIRS': 4, 'RUNNING': 5, 'LYING': 6, 'DRINKING': 7}

Class distribution:
 0 - STANDING            :   1564  ( 13.3%)
 1 - SITTING             :   2762  ( 23.5%)
 2 - WALKING             :   3135  ( 26.6%)
 3 - WALKING_UPSTAIRS    :    710  (  6.0%)
 4 - WALKING_DOWNSTAIRS  :    614  (  5.2%)
 5 - RUNNING             :   1531  ( 13.0%)
 6 - LYING               :   1365  ( 11.6%)
 7 - DRINKING            :     93  (  0.8%)


## Train / Validation / Test Split

In [22]:
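# Shuffle all windows with a fixed seed so the split is reproducible
rng = np.random.RandomState(42)
indices = np.arange(len(X))
rng.shuffle(indices)

X = X[indices]
y = y[indices]

# 70% train, 15% validation; the test set takes the remainder (about 15%)
n = len(X)
n_train = int(0.7 * n)
n_val = int(0.15 * n)
n_test = n - n_train - n_val

# Slice the shuffled arrays into three consecutive blocks
X_train, y_train = X[:n_train], y[:n_train]
X_val, y_val = X[n_train:n_train + n_val], y[n_train:n_train + n_val]
X_test, y_test = X[n_train + n_val:], y[n_train + n_val:]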

print("Train:", X_train.shape)
print("Val:  ", X_val.shape)
print("Test: ", X_test.shape)

def show_dist(arr, name):
    cnt = collections.Counter(arr.tolist())
    print(f"\n{name} set distribution:")
    for idx, cname in enumerate(class_names):
        print(f"{idx} - {cname:20s}: {cnt.get(idx,0):5d}")

show_dist(y_train, "Train")
show_dist(y_val, "Val")
show_dist(y_test, "Test")

Train: (8241, 100, 7)
Val:   (1766, 100, 7)
Test:  (1767, 100, 7)

Train set distribution:
0 - STANDING            :  1091
1 - SITTING             :  1929
2 - WALKING             :  2215
3 - WALKING_UPSTAIRS    :   493
4 - WALKING_DOWNSTAIRS  :   426
5 - RUNNING             :  1045
6 - LYING               :   975
7 - DRINKING            :    67

Val set distribution:
0 - STANDING            :   222
1 - SITTING             :   424
2 - WALKING             :   465
3 - WALKING_UPSTAIRS    :   105
4 - WALKING_DOWNSTAIRS  :    85
5 - RUNNING             :   242
6 - LYING               :   210
7 - DRINKING            :    13

Test set distribution:
0 - STANDING            :   251
1 - SITTING             :   409
2 - WALKING             :   455
3 - WALKING_UPSTAIRS    :   112
4 - WALKING_DOWNSTAIRS  :   103
5 - RUNNING             :   244
6 - LYING               :   180
7 - DRINKING            :    13
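
The split above shuffles all windows together, so the share of a rare class such as `DRINKING` (93 windows in total) in each subset is left to chance. One possible alternative, shown here only as a sketch and not as the method used in this notebook, is to split each class separately so the 70/15/15 proportions hold per class. The function name `stratified_split` and the use of `np.random.default_rng` are assumptions made for this example; the demo labels reuse the class counts printed earlier.

```python
import numpy as np


def stratified_split(y, fractions=(0.7, 0.15, 0.15), seed=42):
    """Return (train_idx, val_idx, test_idx), splitting each class separately."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for cls in np.unique(y):
        # Shuffle the indices of this class only
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_train = int(fractions[0] * len(idx))
        n_val = int(fractions[1] * len(idx))
        train.append(idx[:n_train])
        val.append(idx[n_train:n_train + n_val])
        test.append(idx[n_train + n_val:])  # remainder goes to test
    # Shuffle within each split so classes are not grouped together
    return tuple(rng.permutation(np.concatenate(part)) for part in (train, val, test))


# Labels with the same class counts as the distribution printed above
counts = [1564, 2762, 3135, 710, 614, 1531, 1365, 93]
y_demo = np.repeat(np.arange(8), counts)

train_idx, val_idx, test_idx = stratified_split(y_demo)
print(len(train_idx), len(val_idx), len(test_idx))
print("DRINKING per split:",
      [(y_demo[part] == 7).sum() for part in (train_idx, val_idx, test_idx)])
```

This sketch only balances class proportions. It does not address a separate consideration: with 50% overlap, neighbouring windows from the same recording share samples and can still end up in different subsets.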


## Save Preprocessed Dataset to `.npz`

In [23]:
np.savez_compressed(
    OUTPUT_NPZ,
    X_train=X_train, y_train=y_train,
    X_val=X_val, y_val=y_val,
    X_test=X_test, y_test=y_test,
    class_names=np.array(class_names),
    window_size=np.array(WINDOW_SIZE),
    step_size=np.array(STEP_SIZE),
)

print(f"Dataset saved successfully → {OUTPUT_NPZ}")

Dataset saved successfully → ..\data\uca_ehar_preprocessed_win100_step50.npz
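
As a quick check of the saved format, the archive can be read back with `np.load`. The sketch below writes a tiny stand-in file with a subset of the same keys to a temporary directory, so it runs without the real dataset; the path `demo_preprocessed.npz` and the zero-filled arrays are placeholders, not real data.

```python
import os
import tempfile

import numpy as np

# Stand-in archive using some of the notebook's keys (placeholder values)
tmp_path = os.path.join(tempfile.mkdtemp(), "demo_preprocessed.npz")
np.savez_compressed(
    tmp_path,
    X_train=np.zeros((4, 100, 7), dtype=np.float32),
    y_train=np.array([0, 1, 2, 7], dtype=np.int64),
    class_names=np.array([
        "STANDING", "SITTING", "WALKING", "WALKING_UPSTAIRS",
        "WALKING_DOWNSTAIRS", "RUNNING", "LYING", "DRINKING",
    ]),
    window_size=np.array(100),
    step_size=np.array(50),
)

# Arrays are decompressed when accessed by key
with np.load(tmp_path) as data:
    print("Keys:", sorted(data.files))
    X_train = data["X_train"]
    y_train = data["y_train"]
    class_names = data["class_names"].tolist()
    window_size = int(data["window_size"])  # stored as 0-d arrays
    step_size = int(data["step_size"])

print(X_train.shape, y_train.shape)
print([class_names[i] for i in y_train])
```

To load the real file, replace `tmp_path` with the `OUTPUT_NPZ` path defined earlier; the remaining keys (`X_val`, `y_val`, `X_test`, `y_test`) are read the same way.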
