## Project Summary

This notebook focuses on recognizing human activities — **jumping**, **standing**, **walking**, and **staying still** — using motion data collected from an iPhone’s **accelerometer** and **gyroscope** sensors.

Each dataset contains time-series readings of linear acceleration and angular velocity along three axes. We preprocess the data, extract **time-domain** (mean, variance, correlation) and **frequency-domain** (dominant frequency, spectral energy) features, and use these as observation vectors for a **Hidden Markov Model (HMM)**.

The HMM is designed to learn how activities transition over time and to decode the most probable sequence of actions using the **Viterbi algorithm**, while its parameters are trained via the **Baum–Welch algorithm**.
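
To make the decoding step concrete, here is a minimal NumPy sketch of Viterbi decoding in log space. It is an illustration under assumptions, not the notebook's implementation: the function name `viterbi_log` and all toy probabilities are made up for this example, and the per-state emission likelihoods are assumed to be already computed.

```python
import numpy as np

def viterbi_log(log_pi, log_A, log_B):
    """Most probable state path for one sequence.

    log_pi: (K,)   log initial-state probabilities
    log_A:  (K, K) log transitions, log_A[i, j] = log P(state j at t+1 | state i at t)
    log_B:  (T, K) log emission likelihood of each observation under each state
    """
    T, K = log_B.shape
    delta = np.empty((T, K))            # best log-score of any path ending in state k at time t
    back = np.zeros((T, K), dtype=int)  # best predecessor, used for backtracking
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: come from i, move to j
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(K)] + log_B[t]
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# Toy example (made-up numbers): two "sticky" states, five observations.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])
B = np.array([[0.8, 0.2],
              [0.7, 0.3],
              [0.4, 0.6],   # weak evidence for state 1 at a single time step
              [0.7, 0.3],
              [0.8, 0.2]])
toy_path = viterbi_log(np.log(pi), np.log(A), np.log(B))
framewise = np.argmax(B, axis=1)   # per-frame decision that ignores transitions
print(toy_path, framewise)
```

In this toy case the frame-by-frame decision flips to state 1 at the third step, while the Viterbi path stays in state 0 because two unlikely transitions would cost more than the weak emission evidence gains. That smoothing effect is the reason for decoding a sequence rather than classifying each window independently.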

Finally, we evaluate the model’s accuracy and ability to generalize to unseen recordings, and reflect on the strengths and weaknesses of the approach.







### 1. Imports and Configuration

We import NumPy, pandas, SciPy, scikit-learn, Matplotlib, seaborn, hmmlearn, and joblib.
The constants that control windowing, dimensionality reduction, and reproducibility are defined in the configuration cell of the next section.

In [1]:
import os
import glob
import numpy as np
import pandas as pd
from scipy import fftpack
from scipy.stats import zscore, multivariate_normal
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
from hmmlearn import hmm
import joblib

### 2. Data Loading
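We first define the configuration constants, then load the pre-organized sensor files from one folder per activity.
Each CSV file must contain the columns `seconds_elapsed`, `ax`, `ay`, `az`, `gx`, `gy`, and `gz` (timestamps plus three-axis accelerometer and gyroscope readings); a file missing any of them raises a `ValueError`.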

In [None]:
DATA_FOLDERS = {
    'jump': 'datasets/final_jump_data',
    'walking': 'datasets/final_walking_data',
    'stand': 'datasets/final_stand_data',
    'still': 'datasets/final_still_data'
}
FS = 100  # sampling frequency (Hz)
WINDOW_SEC = 1.0  # 1-second windows
WINDOW_STEP = 0.5  # seconds (50% overlap)
N_PCA = 10  # optional dimensionality reduction for HMM emissions
N_COMPONENTS_HMM = 4  # 4 hidden states (jump, walking, stand, still)
RANDOM_STATE = 42

In [None]:
def load_all_sessions(data_folders):
    """Load CSVs from labeled folders. Returns list of dicts: {'activity', 'file', 'df'}"""
    sessions = []
    for label, folder in data_folders.items():
        files = sorted(glob.glob(os.path.join(folder, '*.csv')))
        for f in files:
            df = pd.read_csv(f)
            # Ensure expected columns
            expected = ['seconds_elapsed','ax','ay','az','gx','gy','gz']
            if not all(c in df.columns for c in expected):
                raise ValueError(f"File {f} missing expected columns. Found: {df.columns.tolist()}")
            sessions.append({'activity': label, 'file': f, 'df': df[expected].copy()})
    return sessions

sessions = load_all_sessions(DATA_FOLDERS)
print(f"Loaded {len(sessions)} sessions")


Loaded 102 sessions
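
The windowing below converts seconds to samples using the fixed `FS = 100`, so it can be worth confirming that the recordings were really sampled at that rate. The following is a minimal sketch, assuming timestamps are in seconds and roughly evenly spaced; `estimate_sampling_rate` is a hypothetical helper and the demo timestamps are synthetic.

```python
import numpy as np

def estimate_sampling_rate(seconds):
    """Estimate the sampling rate (Hz) from the median gap between timestamps."""
    dt = np.diff(np.asarray(seconds, dtype=float))
    return 1.0 / np.median(dt)

# Synthetic stand-in for one recording: 300 samples spaced 10 ms apart.
demo_seconds = np.arange(300) * 0.01
demo_fs = estimate_sampling_rate(demo_seconds)
print(f"Estimated rate: {demo_fs:.1f} Hz")
```

On the real data the same check could be run as `estimate_sampling_rate(s['df']['seconds_elapsed'])` for each entry `s` in `sessions`. The median is used rather than the mean so that a few dropped samples do not distort the estimate.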


### 3. Data Preprocessing

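We segment each recording into fixed-length, overlapping windows: 1-second windows (100 samples at 100 Hz) with a 0.5-second step, i.e. 50% overlap. Trailing samples that do not fill a complete window are dropped.
Normalization to a consistent scale is applied later, after feature extraction; the cell below does not perform noise filtering or axis alignment.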

In [None]:
def sliding_windows(df, window_sec=WINDOW_SEC, step_sec=WINDOW_STEP, fs=FS):
    n_win = int(round(window_sec * fs))
    step = int(round(step_sec * fs))
    data = df[['ax','ay','az','gx','gy','gz']].values
    starts = np.arange(0, len(df) - n_win + 1, step)
    windows = []
    for s in starts:
        win = data[s:s+n_win]
        windows.append({'start_idx': s, 'window': win, 'seconds': df['seconds_elapsed'].iloc[s]})
    return windows
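
The number of windows per recording follows directly from the arithmetic above: with `n_win` samples per window and a step of `step` samples, a recording of `N` samples gives `floor((N - n_win) / step) + 1` windows, or none if `N < n_win`. A small sketch that mirrors the index computation (the helper name `window_starts` and the 670-sample recording are hypothetical):

```python
import numpy as np

def window_starts(n_samples, window_sec=1.0, step_sec=0.5, fs=100):
    """Start indices of full windows, using the same arithmetic as sliding_windows."""
    n_win = int(round(window_sec * fs))
    step = int(round(step_sec * fs))
    return np.arange(0, n_samples - n_win + 1, step)

# A hypothetical 6.7 s recording at 100 Hz (670 samples).
starts = window_starts(670)
print(len(starts), starts[:3], starts[-1])

# A recording shorter than one window produces no windows at all.
short = window_starts(80)
print(len(short))
```

Here the last window starts at sample 550 and ends at sample 649, so the final 20 samples are not used.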


### 4. Feature Extraction

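For every window we compute time-domain features (per-axis mean, standard deviation, variance, RMS, and peak-to-peak range; the accelerometer signal magnitude area (SMA); and pairwise correlations between accelerometer axes) and frequency-domain features (per-axis dominant frequency and spectral energy from the FFT).
These features summarize the motion characteristics of each window.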

In [None]:
def compute_time_features(win):
    # win shape: (n_samples, 6) -> ax,ay,az,gx,gy,gz
    feats = {}
    accel = win[:,0:3]
    gyro = win[:,3:6]
    # Time-domain: per-axis mean, std, var, rms, ptp
    for i, name in enumerate(['ax','ay','az']):
        x = accel[:,i]
        feats[f'{name}_mean'] = np.mean(x)
        feats[f'{name}_std'] = np.std(x)
        feats[f'{name}_var'] = np.var(x)
        feats[f'{name}_rms'] = np.sqrt(np.mean(x**2))
        feats[f'{name}_ptp'] = np.ptp(x)
    for i, name in enumerate(['gx','gy','gz']):
        x = gyro[:,i]
        feats[f'{name}_mean'] = np.mean(x)
        feats[f'{name}_std'] = np.std(x)
        feats[f'{name}_var'] = np.var(x)
        feats[f'{name}_rms'] = np.sqrt(np.mean(x**2))
        feats[f'{name}_ptp'] = np.ptp(x)
    # SMA for accel
    feats['acc_sma'] = np.sum(np.abs(accel)) / accel.shape[0]
    # Correlations between axes (accel)
    feats['acc_corr_xy'] = np.corrcoef(accel[:,0], accel[:,1])[0,1]
    feats['acc_corr_xz'] = np.corrcoef(accel[:,0], accel[:,2])[0,1]
    feats['acc_corr_yz'] = np.corrcoef(accel[:,1], accel[:,2])[0,1]
    return feats

def compute_freq_features(win, fs=FS):
    # win shape: (n_samples, 6) -> ax,ay,az,gx,gy,gz
    feats = {}
    n = win.shape[0]
    freqs = np.fft.rfftfreq(n, 1/fs)
    # For each axis compute dominant frequency and spectral energy
    for i, name in enumerate(['ax','ay','az','gx','gy','gz']):
        x = win[:,i]
        P = np.abs(np.fft.rfft(x))**2  # power spectrum of the real-valued signal
        # Skip the DC bin (index 0): a nonzero mean would otherwise
        # always be reported as the dominant frequency
        feats[f'{name}_domfreq'] = freqs[1:][np.argmax(P[1:])]
        feats[f'{name}_specenergy'] = np.sum(P)  # total energy, DC term included
    return feats

def extract_features_from_windows(windows):
    feat_list = []
    for w in windows:
        win = w['window']
        tfeats = compute_time_features(win)
        ffeats = compute_freq_features(win)
        merged = {**tfeats, **ffeats}
        merged['start_idx'] = w['start_idx']
        merged['seconds'] = w['seconds']
        feat_list.append(merged)
    return pd.DataFrame(feat_list)
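
The dominant-frequency feature is sensitive to how the DC (0 Hz) bin is handled: any constant offset in a window concentrates power at 0 Hz. The sketch below shows the effect on a synthetic signal (a made-up constant offset plus a 2 Hz oscillation), not on the recorded data.

```python
import numpy as np

fs = 100                       # Hz, matching FS in the configuration cell
t = np.arange(fs) / fs         # one 1-second window (100 samples)
x = 1.0 + 0.5 * np.sin(2 * np.pi * 2.0 * t)   # synthetic: offset + 2 Hz oscillation

P = np.abs(np.fft.rfft(x)) ** 2
freqs = np.fft.rfftfreq(len(x), 1 / fs)

dom_with_dc = freqs[np.argmax(P)]             # DC bin included
dom_without_dc = freqs[1:][np.argmax(P[1:])]  # DC bin skipped
print(dom_with_dc, dom_without_dc)
```

With the DC bin included the peak lands at 0 Hz, and with it skipped the peak lands at 2 Hz. Note also that a 1-second window gives a frequency resolution of 1 Hz, so dominant frequencies can only take integer values from 0 to 50 Hz in this setup.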


### 5. Feature Aggregation, Labeling, and Train-Test Split

Features extracted from all files are concatenated into a single DataFrame, and each window is labeled with its activity and source file.
The feature matrix is then z-score normalized, reduced with PCA, and kept alongside per-session window counts so that each recording can be treated as a separate sequence.
We split the dataset into training and unseen test sets to ensure a fair evaluation.
Class balance across activities is checked to prevent bias.

In [None]:
all_feature_sessions = []  # list of dicts with: activity, file, features_df
for s in sessions:
    windows = sliding_windows(s['df'])
    feats = extract_features_from_windows(windows)
    feats['activity'] = s['activity']
    feats['file'] = os.path.basename(s['file'])
    all_feature_sessions.append({'activity': s['activity'], 'file': s['file'], 'features': feats})

print('Extracted features for', len(all_feature_sessions), 'sessions')

# Concatenate to a single DataFrame for convenience
features_df = pd.concat([fs['features'] for fs in all_feature_sessions], ignore_index=True)
print('Total windows:', len(features_df))


Extracted features for 102 sessions
Total windows: 1270


In [None]:
# Drop columns not used as features
non_feature_cols = ['start_idx','seconds','activity','file']
feature_cols = [c for c in features_df.columns if c not in non_feature_cols]
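# Replace NaNs (e.g., correlations involving a constant-valued axis) with 0
X_raw = features_df[feature_cols].fillna(0).values
# Z-score normalize each feature across the whole dataset (all sessions)
X_norm = zscore(X_raw, axis=0)
# Optional PCA to reduce dimensionality for HMM emissions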
pca = PCA(n_components=min(N_PCA, X_norm.shape[1]), random_state=RANDOM_STATE)
X_pca = pca.fit_transform(X_norm)
print('Features -> PCA dims:', X_pca.shape)


Features -> PCA dims: (1270, 10)
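
The cell above computes normalization statistics over every window in the dataset. When a held-out test set is used, a common alternative is to estimate those statistics on the training windows only and reuse them for the test windows. The sketch below illustrates that pattern with plain NumPy; `fit_standardizer`, `apply_standardizer`, and the random demo arrays are hypothetical and stand in for the real feature matrices.

```python
import numpy as np

def fit_standardizer(X_train, eps=1e-12):
    """Column means and standard deviations estimated from training rows only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma < eps, 1.0, sigma)  # guard against constant columns
    return mu, sigma

def apply_standardizer(X, mu, sigma):
    """Standardize X using statistics that were fitted elsewhere."""
    return (X - mu) / sigma

rng = np.random.default_rng(42)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_train_demo[:, 2] = 1.0   # constant column: zero variance
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

mu, sigma = fit_standardizer(X_train_demo)
Z_train = apply_standardizer(X_train_demo, mu, sigma)
Z_test = apply_standardizer(X_test_demo, mu, sigma)
print(Z_train.mean(axis=0).round(6), Z_train.std(axis=0).round(6))
```

The guard on `sigma` matters because dividing by a zero standard deviation produces NaN values, which PCA cannot handle. The same fit-on-train, apply-to-test pattern carries over to PCA through scikit-learn's separate `fit` and `transform` methods.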


In [None]:
lengths = [len(fs['features']) for fs in all_feature_sessions]  # number of windows per session
# Order of concatenation matches all_feature_sessions, so these arrays
# reproduce X_raw, X_norm, and X_pca from the previous cell
X_concat = np.vstack([fs['features'][feature_cols].fillna(0).values for fs in all_feature_sessions])
X_concat_norm = zscore(X_concat, axis=0)
X_concat_pca = pca.transform(X_concat_norm)

# Keep labels per window for evaluation
labels_concat = np.concatenate([fs['features']['activity'].values for fs in all_feature_sessions])
filenames_concat = np.concatenate([np.repeat(os.path.basename(fs['file']), len(fs['features'])) for fs in all_feature_sessions])
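
Because consecutive windows overlap by 50%, splitting at the window level would put nearly identical samples in both training and test sets. One way to avoid this is to split by session and rebuild the stacked matrix and its `lengths` list for each side. The sketch below assumes that approach; `split_by_session`, `gather`, and the toy data are hypothetical and are not taken from the notebook.

```python
import numpy as np

def split_by_session(session_labels, test_fraction=0.2, seed=42):
    """Hold out whole sessions per activity; returns train and test session indices."""
    rng = np.random.default_rng(seed)
    session_labels = np.asarray(session_labels)
    test_idx = []
    for label in np.unique(session_labels):
        idx = np.flatnonzero(session_labels == label)
        n_test = max(1, int(round(test_fraction * len(idx))))
        test_idx.extend(rng.choice(idx, size=n_test, replace=False))
    test_idx = np.sort(np.array(test_idx, dtype=int))
    train_idx = np.setdiff1d(np.arange(len(session_labels)), test_idx)
    return train_idx, test_idx

def gather(blocks, idx):
    """Stack the selected per-session feature blocks and return their lengths."""
    chosen = [blocks[i] for i in idx]
    return np.vstack(chosen), [len(b) for b in chosen]

# Toy stand-in: 10 sessions, 2 activities, varying window counts, 4 features each.
demo_labels = ['walking'] * 5 + ['jump'] * 5
window_counts = [12, 11, 13, 12, 10, 9, 12, 14, 11, 12]
demo_blocks = [np.full((n, 4), i, dtype=float) for i, n in enumerate(window_counts)]

train_idx, test_idx = split_by_session(demo_labels)
X_train, train_lengths = gather(demo_blocks, train_idx)
X_test, test_lengths = gather(demo_blocks, test_idx)
print(len(train_idx), len(test_idx), X_train.shape, X_test.shape)
```

The `(X, lengths)` pair is the form that hmmlearn's `fit(X, lengths)` accepts for multiple sequences, so each recording is treated as its own sequence rather than as one long concatenated series.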
