Predict the sit_stand label using time-series sensor data from data.csv and generate predictions for evaluation_ext.csv.

In [74]:
import pandas as pd
import numpy as np
import re
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report


In [76]:
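# Step 1: Parse space-separated array strings
def fix_array_string(s):
    try:
        # Drop brackets and commas, then split on whitespace. This avoids eval,
        # which would read "1.5 -2.0" as a subtraction, and also handles
        # values such as "1." that the digit-only regex missed.
        tokens = re.sub(r'[\[\],]', ' ', s).split()
        return np.array(tokens, dtype=float)
    except (TypeError, ValueError):
        # Non-string cells (e.g. NaN) or unparseable tokens
        return np.nan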
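
One way to check a parser like this is to run it on a few hand-written strings. The sketch below uses a hypothetical helper `parse_array` (not part of the notebook) and made-up inputs; it assumes the sensor columns in data.csv hold bracketed, whitespace-separated numbers.

```python
import re
import numpy as np

def parse_array(s):
    """Hypothetical helper: parse '[a b c]' into a float array without eval."""
    try:
        tokens = re.sub(r"[\[\],]", " ", s).split()
        return np.array(tokens, dtype=float)
    except (TypeError, ValueError):
        return np.nan

# Made-up sample strings: a negative value, trailing-dot floats, an empty array.
samples = ["[1.5 -2.0 3.0]", "[0. 1. 2.]", "[]"]
parsed = [parse_array(s) for s in samples]
for s, p in zip(samples, parsed):
    print(repr(s), "->", p)
```

Negative values are worth including in such a check: a string like "1.5 -2.0" evaluated as Python becomes a subtraction rather than two elements.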

In [78]:
# Step 2: Feature extraction from sensor arrays
def extract_features(row, col):
    arr = row[col]
    if isinstance(arr, np.ndarray) and arr.size > 0:
        return pd.Series({
            f"{col}_mean": np.mean(arr),
            f"{col}_std": np.std(arr),
            f"{col}_min": np.min(arr),
            f"{col}_max": np.max(arr)
        })
    else:
        return pd.Series({f"{col}_mean": 0, f"{col}_std": 0, f"{col}_min": 0, f"{col}_max": 0})
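
The same four statistics can also be computed for every sensor in a single pass per row. This is a minimal sketch on a toy frame with made-up values; `summarize_row` and `STATS` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

STATS = {"mean": np.mean, "std": np.std, "min": np.min, "max": np.max}

def summarize_row(row, cols):
    """Hypothetical helper: all summary features for one row, zeros for missing arrays."""
    out = {}
    for col in cols:
        arr = row[col]
        ok = isinstance(arr, np.ndarray) and arr.size > 0
        for name, fn in STATS.items():
            out[f"{col}_{name}"] = float(fn(arr)) if ok else 0.0
    return out

# Toy frame with made-up values; the second 'ay' entry mimics a failed parse.
toy = pd.DataFrame({
    "ax": [np.array([1.0, 2.0, 3.0]), np.array([0.0, 4.0])],
    "ay": [np.array([-1.0, 1.0]), np.nan],
})
features = pd.DataFrame([summarize_row(r, ["ax", "ay"]) for _, r in toy.iterrows()])
print(features)
```

As in the notebook's fallback branch, a missing array is filled with zeros, so it cannot be told apart from a sensor that really read zero throughout.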


In [79]:
# Load training data
df = pd.read_csv("data.csv")

sensor_cols = ['ax', 'ay', 'az', 'gx', 'gy', 'gz', 'mx', 'my', 'mz']


In [82]:
# Parse array columns
for col in sensor_cols:
    df[col] = df[col].apply(fix_array_string)


In [83]:
# Extract features
features_df = pd.concat([df.apply(lambda row: extract_features(row, col), axis=1) for col in sensor_cols], axis=1)


In [84]:
# Target variable
y = df["sit_stand"]

In [86]:
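# PCA: reduce the 36 summary features (9 sensors x 4 statistics) to 10 components.
# Note: PCA is fit on all rows before the split, so validation rows influence the components.
pca = PCA(n_components=10)
X_pca = pca.fit_transform(features_df)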

In [87]:
# Train/test split
X_train, X_val, y_train, y_val = train_test_split(X_pca, y, test_size=0.2, random_state=42)

In [88]:
# Model
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
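
An alternative arrangement, sketched here on synthetic data rather than the real features, splits first and then fits scaling, PCA and the classifier together on the training rows only. The `StandardScaler` step and the `stratify=y` argument are assumptions added for illustration; the notebook uses neither.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the 36 summary features; not the real data.
X = rng.normal(size=(300, 36))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=300) > 0.8, "Sitting", "Standing")

# Split before any fitting so validation rows never influence the transforms.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = make_pipeline(
    StandardScaler(),          # assumption: put features on a common scale for PCA
    PCA(n_components=10),
    RandomForestClassifier(random_state=42),
)
model.fit(X_train, y_train)
val_pred = model.predict(X_val)
print("validation accuracy:", (val_pred == y_val).mean())
```

With a pipeline, calling `model.predict` on new rows applies the fitted scaler and PCA automatically, so no separate transform step is needed at prediction time.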


In [89]:
# Evaluate
y_pred = clf.predict(X_val)
print(classification_report(y_val, y_pred))


              precision    recall  f1-score   support

     Sitting       0.82      0.75      0.78        12
    Standing       0.94      0.96      0.95        48

    accuracy                           0.92        60
   macro avg       0.88      0.85      0.87        60
weighted avg       0.91      0.92      0.92        60
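
The summary rows of the report can be rebuilt from the per-class rows. The sketch below copies the rounded values printed above, so results recomputed this way can differ from the printed ones in the last digit.

```python
# Per-class values copied from the printed report above.
report = {
    "Sitting": {"precision": 0.82, "recall": 0.75, "support": 12},
    "Standing": {"precision": 0.94, "recall": 0.96, "support": 48},
}
total = sum(c["support"] for c in report.values())

# Correct predictions per class = recall * support (rounded, since the inputs are rounded).
correct = {k: round(c["recall"] * c["support"]) for k, c in report.items()}
accuracy = sum(correct.values()) / total

# Macro average weights each class equally; weighted average weights by support.
macro_precision = sum(c["precision"] for c in report.values()) / len(report)
weighted_precision = sum(c["precision"] * c["support"] for c in report.values()) / total

print(correct)
print(f"accuracy={accuracy:.3f} macro_precision={macro_precision:.3f} "
      f"weighted_precision={weighted_precision:.3f}")
```

The recall figures imply 9 of 12 Sitting rows and 46 of 48 Standing rows were classified correctly, so the overall accuracy is driven mostly by the larger Standing class.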



In [90]:
# Load evaluation data
eval_df = pd.read_csv("evaluation_ext.csv")


In [91]:
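# Parse and extract features. Work on a copy so that eval_df keeps the original
# sensor strings when it is written back to CSV.
eval_parsed = eval_df.copy()
for col in sensor_cols:
    eval_parsed[col] = eval_parsed[col].apply(fix_array_string)
eval_features = pd.concat([eval_parsed.apply(lambda row: extract_features(row, col), axis=1) for col in sensor_cols], axis=1)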

In [92]:
# PCA transform on eval features
eval_pca = pca.transform(eval_features)


In [93]:
# Predict
eval_df["sit_stand_predicted"] = clf.predict(eval_pca)


In [94]:
# Save to CSV (overwrites the input file, adding the sit_stand_predicted column)
eval_df.to_csv("evaluation_ext.csv", index=False)
print("✅ Prediction file saved: evaluation_ext.csv")

✅ Prediction file saved: evaluation_ext.csv



**Key Observations:**
- Sensor readings were noisy, but per-sensor summary features (mean, std, min, max) captured enough signal to reach 0.92 accuracy on the 60-row validation split.
- The Sitting class was harder to predict: recall was 0.75 on only 12 validation rows, against 0.96 for Standing on 48 rows.
- Random Forest itself does not need scaled or normalized inputs. PCA is scale-sensitive, however, and here it was applied to unscaled features.
- PCA reduced the 36 summary features to 10 components. No run without PCA was compared, so its effect on accuracy and overfitting is untested.

**Model Used:**
- Random Forest Classifier (scikit-learn defaults, random_state=42), chosen for its robustness; tree-based splits do not depend on feature scale.
- The model was trained on statistical summaries (mean, std, min, max) of each sensor array.
- PCA (Principal Component Analysis) was used to reduce dimensionality before model training.

**Assumptions:**
- All sensor readings cover similar time durations.
- Each row represents a complete and valid activity segment.
- The posture labels in data.csv are accurate and trustworthy.
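
The first two assumptions can be spot-checked by comparing array lengths across rows and sensors. This is a minimal sketch on a toy frame with made-up arrays; it assumes a fixed sampling rate, so that sample count stands in for duration, and `array_length` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up arrays; in the notebook this would be the parsed df.
toy = pd.DataFrame({
    "ax": [np.zeros(50), np.zeros(50), np.zeros(20)],
    "gx": [np.zeros(50), np.zeros(48), np.nan],
})

def array_length(value):
    """Length of a parsed array, or 0 when parsing failed or the array is missing."""
    return value.size if isinstance(value, np.ndarray) else 0

# One integer length per row and sensor column.
lengths = toy.apply(lambda col: col.map(array_length))
print(lengths.describe().loc[["min", "max"]])

# Rows whose sensors disagree on length, or that contain a missing array.
suspect = lengths[(lengths.nunique(axis=1) > 1) | (lengths == 0).any(axis=1)]
print(suspect)
```

Rows flagged this way are candidates for closer inspection rather than automatic removal, since a short segment may still be a valid recording.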



