# Data Preprocessing

The dataset used below was downloaded from Kaggle (https://www.kaggle.com/datasets/die9origephit/human-activity-recognition/data).

This section imports the necessary packages and loads the data:

1. Loads the dataset from sensor_data.csv using its relative path and stores it in a pandas DataFrame.
2. Prints the first few rows to confirm that the data loaded correctly.

In [13]:
import pandas as pd
import numpy as np
import os
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

data = pd.read_csv("../data/sensor_data.csv")

print(data.head())


   user activity      timestamp  x-axis  y-axis  z-axis
0     1  Walking  4991922345000    0.69   10.80   -2.03
1     1  Walking  4991972333000    6.85    7.44   -0.50
2     1  Walking  4992022351000    0.93    5.63   -0.50
3     1  Walking  4992072339000   -2.11    5.01   -0.69
4     1  Walking  4992122358000   -4.59    4.29   -1.95


## Data Normalization using Min-Max Scaling

This section normalizes the data by executing the following steps:

1. Removing any data instances with missing values.
2. Normalizing the x-axis, y-axis, and z-axis accelerometer values to the range [0, 1].
3. Encoding the activity labels as integers.
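
As a rough illustration of step 2, min-max scaling maps each value to (x - min) / (max - min), computed separately for each column. The sketch below uses made-up toy readings (not values from the dataset) and compares a by-hand calculation with `MinMaxScaler` at its default settings; the names `toy`, `manual`, and `scaled` are introduced here only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy accelerometer-like readings (hypothetical values, not from the dataset)
toy = np.array([[-2.0, 4.0, 0.5],
                [0.0, 8.0, -1.5],
                [6.0, 10.0, 2.5]])

# Min-max scaling by hand: (x - min) / (max - min), computed per column
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

# MinMaxScaler with its default feature_range=(0, 1) should give the same result
scaled = MinMaxScaler().fit_transform(toy)

print(np.allclose(manual, scaled))
```

Because the minimum and maximum are taken per column, each axis is scaled independently of the other two.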

In [14]:
data = data.dropna() 

# Normalize x-axis, y-axis, z-axis columns
scaler = MinMaxScaler()
data[['x-axis', 'y-axis', 'z-axis']] = scaler.fit_transform(data[['x-axis', 'y-axis', 'z-axis']])

activities = data['activity']

# Encode the activity names as integer labels
encoder = LabelEncoder()
data['activity'] = encoder.fit_transform(data['activity'])

# Check the normalized data
print(data.head())
print("\n")

# Map each activity name to its encoded integer value
activity_catalog = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))

# Print the catalog
print("Activity Encoding Legend:")
for activity, encoded_value in activity_catalog.items():
    print(f"{activity}: {encoded_value}")

folder_path = os.path.join('..', 'data')

# Save the normalized and encoded data to a CSV file in the 'data' folder
data.to_csv(os.path.join(folder_path, 'preprocessed_data.csv'), index=False)


   user  activity      timestamp    x-axis    y-axis    z-axis
0     1         5  4991922345000  0.513145  0.766961  0.450901
1     1         5  4991972333000  0.668857  0.682219  0.489723
2     1         5  4992022351000  0.519211  0.636570  0.489723
3     1         5  4992072339000  0.442366  0.620933  0.484902
4     1         5  4992122358000  0.379676  0.602774  0.452931


Activity Encoding Legend:
Downstairs: 0
Jogging: 1
Sitting: 2
Standing: 3
Upstairs: 4
Walking: 5
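
The legend above is in alphabetical order because `LabelEncoder` sorts the class names before assigning integers. The sketch below shows this on a small made-up list of labels (not the dataset) and shows how `inverse_transform` can recover the original names from the codes; the variables `names`, `toy_encoder`, `codes`, and `decoded` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label list for illustration (not read from the dataset)
names = ["Walking", "Jogging", "Sitting", "Walking", "Upstairs"]

toy_encoder = LabelEncoder()
codes = toy_encoder.fit_transform(names)

# classes_ holds the sorted unique labels; a label's code is its index in classes_
print(list(toy_encoder.classes_))

# inverse_transform maps integer codes back to the original label strings
decoded = toy_encoder.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted encoder (or the printed legend) around is one way to translate model predictions back into activity names later.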


## Data Segmentation for Time Series Analysis

This section segments the data into windows using the following steps:

1. Defining the window size: 100 samples per window (5 seconds).
2. Defining the stride length: 75 samples, which gives a 25% overlap between consecutive windows.
3. Removing windows that span more than one user or activity.
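
The window and stride arithmetic in these steps can be sketched as follows. The 20 Hz sampling rate is an assumption inferred from the roughly 50 ms spacing between the timestamps printed earlier (assuming they are in nanoseconds), and `n_samples` is a hypothetical recording length chosen only for the example.

```python
window_size = 100      # samples per window
stride = 75            # samples between consecutive window starts
sampling_rate_hz = 20  # assumption: inferred from ~50 ms timestamp spacing

# Duration of one window and the fraction shared with the next window
window_seconds = window_size / sampling_rate_hz
overlap_fraction = (window_size - stride) / window_size

# Start indices for a hypothetical recording of 1000 samples
n_samples = 1000
starts = list(range(0, n_samples - window_size + 1, stride))

print(window_seconds, overlap_fraction, len(starts))
```

The `+ 1` in the range bound allows a final window that ends exactly on the last sample, while any trailing samples that cannot fill a whole window are left out.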

In [15]:
def segmentData(rawData, windowSize, stride):
    segmentedData = []
    labels = [] 
    
    for start in range(0, len(rawData) - windowSize + 1, stride):
        window = rawData.iloc[start:start + windowSize]

        uniqueActivities = window['activity'].unique()
        uniqueUsers = window['user'].unique()
        if len(uniqueActivities) > 1 or len(uniqueUsers) > 1:
            continue  
        segmentedData.append(window)
        
        # Store user and activity information for each window
        labels.append({
            'user': uniqueUsers[0], 
            'activity': uniqueActivities[0], 
            'timestamp_start': window['timestamp'].iloc[0], 
            'timestamp_end': window['timestamp'].iloc[-1] 
        })
    
    return segmentedData, labels

stride = 75
windowSize = 100

segmentedData, labels = segmentData(data, windowSize, stride)

# Save the accelerometer data
segmentedDataArray = np.array([window[['x-axis', 'y-axis', 'z-axis']].values for window in segmentedData])
np.save(os.path.join(folder_path, 'segmented_data.npy'), segmentedDataArray)

# Save the labels of the accelerometer data at the corresponding indexes
labels_df = pd.DataFrame(labels)
labels_df.to_csv(os.path.join(folder_path, 'segmented_labels.csv'), index=False)
