## 1. Time Series Classification Part 1: Feature Creation/Extraction

### (a) Download Data

Package imports

In [18]:
import pandas as pd
import os
import numpy as np



Get the AReM Data Set

In [None]:
# Relative path to the AReM folder
base_path = "../Data/AReM"

# List of activities (folders)
activities = ["bending1", "bending2", "cycling", "lying", "sitting", "standing", "walking"]

for activity in activities:
    activity_path = os.path.join(base_path, activity)
    
    # Get a list of all dataset files in the current activity folder
    datasets = [f for f in os.listdir(activity_path) if f.startswith("dataset")]
    
    for dataset in datasets:
        file_path = os.path.join(activity_path, dataset)
        
        # Check if the current file is bending2/dataset4.csv because dataset4 for bending2 is space separated
        is_special_file = (activity == "bending2" and dataset == "dataset4.csv")
        
        try:
            if is_special_file:
                # Read the file as space-separated
                df = pd.read_csv(file_path, skiprows=4, sep=' ', header=None)
                # Rename columns
                df.columns = ['time', 'avg_rss12', 'var_rss12', 'avg_rss13', 'var_rss13', 'avg_rss23', 'var_rss23']
            else:
                # Load the dataset into a pandas DataFrame, assuming the first 4 rows are metadata
                df = pd.read_csv(file_path, skiprows=4)
            
            # Display the first few rows of the DataFrame
            print(f"First few rows for {activity} - {dataset}:")
            print(df.head())
            print("-" * 50)
            
        except pd.errors.ParserError:
            print(f"Error reading {file_path}.")

### (b) Test and Train Data

In [15]:
base_path = "../Data/AReM"

activities = ["bending1", "bending2", "cycling", "lying", "sitting", "standing", "walking"]

test_data_paths = []
train_data_paths = []

for activity in activities:
    activity_path = os.path.join(base_path, activity)
    
    # Check for the existence of the activity_path
    if not os.path.exists(activity_path):
        print(f"Directory not found: {activity_path}")
        continue
    
    # List all datasets for the current activity and remove .csv extension
    datasets = [f.replace('.csv', '') for f in os.listdir(activity_path) if f.startswith("dataset")]
    
    # Debugging print to show detected datasets for each activity
    print(f"Detected datasets for {activity}: {datasets}")

    if activity in ["bending1", "bending2"]:
        test_sets = [ds for ds in datasets if ds in ["dataset1", "dataset2"]]
        train_sets = [ds for ds in datasets if ds not in ["dataset1", "dataset2"]]
    else:
        test_sets = [ds for ds in datasets if ds in ["dataset1", "dataset2", "dataset3"]]
        train_sets = [ds for ds in datasets if ds not in ["dataset1", "dataset2", "dataset3"]]
    
    # Debugging prints to show which datasets are assigned to test and train for each activity
    print(f"Test datasets for {activity}: {test_sets}")
    print(f"Train datasets for {activity}: {train_sets}")

    test_data_paths.extend([os.path.join(activity_path, ds + '.csv') for ds in test_sets])  # Added .csv back for the path
    train_data_paths.extend([os.path.join(activity_path, ds + '.csv') for ds in train_sets])  # Added .csv back for the path

print("\nTest Data Paths:")
for path in test_data_paths:
    print(path)

print("\nTrain Data Paths:")
for path in train_data_paths:
    print(path)
    
print(len(train_data_paths)+len(test_data_paths))

Detected datasets for bending1: ['dataset1', 'dataset2', 'dataset3', 'dataset4', 'dataset5', 'dataset6', 'dataset7']
Test datasets for bending1: ['dataset1', 'dataset2']
Train datasets for bending1: ['dataset3', 'dataset4', 'dataset5', 'dataset6', 'dataset7']
Detected datasets for bending2: ['dataset1', 'dataset2', 'dataset3', 'dataset4', 'dataset5', 'dataset6']
Test datasets for bending2: ['dataset1', 'dataset2']
Train datasets for bending2: ['dataset3', 'dataset4', 'dataset5', 'dataset6']
Detected datasets for cycling: ['dataset1', 'dataset10', 'dataset11', 'dataset12', 'dataset13', 'dataset14', 'dataset15', 'dataset2', 'dataset3', 'dataset4', 'dataset5', 'dataset6', 'dataset7', 'dataset8', 'dataset9']
Test datasets for cycling: ['dataset1', 'dataset2', 'dataset3']
Train datasets for cycling: ['dataset10', 'dataset11', 'dataset12', 'dataset13', 'dataset14', 'dataset15', 'dataset4', 'dataset5', 'dataset6', 'dataset7', 'dataset8', 'dataset9']
Detected datasets for lying: ['dataset1', '

### (c) Feature Extraction

#### i. Research

Basic Statistical Features:

- Mean: Average value of the time series.
- Median: Middle value of the time series.
- Mode: The value that appears most frequently in the time series.
- Standard Deviation: Measure of the amount of variation or dispersion in the values.
- Variance: Square of the standard deviation. It represents the variability around the average.
- Skewness: Measure of the asymmetry of the distribution of the time series.
- Kurtosis: Measure of the "tailedness" of the distribution of the time series.

Range-Based Features:

- Minimum: The smallest value in the time series.
- Maximum: The largest value in the time series.
- Range: Difference between the maximum and the minimum.
- Quantiles: Values taken at regular intervals, e.g., 25th, 50th (median), and 75th percentiles.

Frequency:

- Zero Crossing Rate: The rate at which the signal changes from positive to negative or vice versa.
- Mean Crossings: The rate at which the signal crosses its mean value.
- Signal Magnitude Area (SMA): Represents the cumulative magnitude of a time series.

Energy Measures:

- Total Energy: Sum of the squared values.
- Signal Power: Average power of the signal.

Shape-Based Features:

- Waveform Length: Sum of the absolute differences between adjacent data points.
- Slope: Directional information about the signal.
- Autocorrelation: Measures the similarity between observations as a function of the time lag between them.

Entropy: Measure of unpredictability or randomness in the time series.

Time Series Decomposition Components: Such as trends, seasonalities, and residuals.
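A few of the features listed above can be computed directly with NumPy. The following is a minimal sketch, not the extraction used later in this notebook; the function name `extra_time_domain_features` and the toy series are hypothetical, and the mean-crossing rate is defined here (as an assumption) as the fraction of adjacent pairs whose mean-centered values change sign.

```python
import numpy as np

def extra_time_domain_features(x):
    """Compute a handful of time-domain features for a 1-D series."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    # Fraction of adjacent pairs where the mean-centered signal changes sign
    signs = np.sign(centered)
    mean_crossing_rate = float(np.mean(signs[:-1] * signs[1:] < 0))
    return {
        "range": float(x.max() - x.min()),
        "total_energy": float(np.sum(x ** 2)),       # sum of squared values
        "signal_power": float(np.mean(x ** 2)),      # average squared value
        "waveform_length": float(np.sum(np.abs(np.diff(x)))),
        "mean_crossing_rate": mean_crossing_rate,
    }

# Hypothetical toy series that alternates around its mean
toy = [1.0, 3.0, 1.0, 3.0, 1.0]
feats = extra_time_domain_features(toy)
print(feats)
```

The same function could be applied to each of the six AReM columns in turn, in the same way the min/max/mean features are extracted below.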

#### ii. Extraction
1 -> avg_rss12
2 -> var_rss12
3 -> avg_rss13
4 -> var_rss13
5 -> avg_rss23
6 -> var_rss23

In [16]:
# Setting the path to the dataset
base_path = "../Data/AReM/"

# Activities (one folder per activity)
activities = ["bending1", "bending2", "cycling", "lying", "sitting", "standing", "walking"]

# Collect the paths of every instance (train and test alike); a separate name
# avoids overwriting the train_data_paths list built in part (b)
all_data_paths = []

for activity in activities:
    activity_path = os.path.join(base_path, activity)
    for filename in os.listdir(activity_path):
        # Files whose name contains 'dataset' are data files
        if "dataset" in filename:
            all_data_paths.append(os.path.join(activity_path, filename))

all_features = []

for data_path in all_data_paths:
    data = pd.read_csv(data_path, skiprows=4)
    instance_features = []
    
    for column in ['avg_rss12','var_rss12','avg_rss13','var_rss13','avg_rss23','var_rss23']:
        values = data[column].values
        features = [
            np.min(values),
            np.max(values),
            np.mean(values),
            np.median(values),
            np.std(values),  # population standard deviation (NumPy default, ddof=0)
            np.percentile(values, 25),
            np.percentile(values, 75),
        ]
        instance_features.extend(features)
        
    all_features.append(instance_features)

# Create DataFrame columns dynamically based on the features and time series
columns = []
for i in range(1, 7):  # For each of the six time series
    columns.extend([f'min{i}', f'max{i}', f'mean{i}', f'median{i}', f'std{i}', f'1st_quart{i}', f'3rd_quart{i}'])

df_features = pd.DataFrame(all_features, columns=columns)
print(df_features)

     min1   max1      mean1  median1      std1  1st_quart1  3rd_quart1  min2  \
0   37.25  45.00  40.624792   40.500  1.475428       39.25     42.0000   0.0   
1   38.00  45.67  42.812812   42.500  1.434054       42.00     43.6700   0.0   
2   35.00  47.40  43.954500   44.330  1.557210       43.00     45.0000   0.0   
3   33.00  47.75  42.179812   43.500  3.666840       39.15     45.0000   0.0   
4   33.00  45.75  41.678063   41.750  2.241152       41.33     42.7500   0.0   
..    ...    ...        ...      ...       ...         ...         ...   ...   
83  20.75  46.25  34.763333   35.290  4.737266       31.67     38.2500   0.0   
84  21.50  51.00  34.935812   35.500  4.641102       32.00     38.0625   0.0   
85  18.33  47.67  34.333042   34.750  4.943612       31.25     38.0000   0.0   
86  18.33  45.75  34.599875   35.125  4.726858       31.50     38.0000   0.0   
87  15.50  43.67  34.225875   34.750  4.437168       31.25     37.2500   0.0   

     max2     mean2  ...      std5  1st

#### iii. Standard Deviation

In [17]:
# Define function to compute bootstrap confidence intervals for the standard deviation
def bootstrap_std_conf_interval(data, num_samples=10000, alpha=0.1):
    np.random.seed(42)  # for reproducibility
    bootstrap_stds = []
    
    for _ in range(num_samples):
        # Draw a random sample of data with replacement
        sample = data.sample(len(data), replace=True)
        
        # Store the standard deviation of this sample
        bootstrap_stds.append(sample.std())

    # Determine percentiles for confidence interval
    lower = (alpha/2) * 100
    upper = (1 - (alpha/2)) * 100
    conf_intervals = np.percentile(bootstrap_stds, [lower, upper])

    return conf_intervals

features = ['avg_rss12', 'var_rss12', 'avg_rss13', 'var_rss13', 'avg_rss23', 'var_rss23']
results = {}

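# Compute bootstrap confidence intervals for the standard deviation of each feature
# NOTE: `df` is not defined in this cell; it is presumably the DataFrame left over
# from the loading loop in part (a), i.e. a single raw file rather than df_features
for feature in features:
    conf_intervals = bootstrap_std_conf_interval(df[feature])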
    results[feature] = {
        'Estimated Std Dev': df[feature].std(),
        '5th Percentile': conf_intervals[0],
        '95th Percentile': conf_intervals[1]
    }

results_df = pd.DataFrame(results).transpose()
print(results_df)

           Estimated Std Dev  5th Percentile  95th Percentile
avg_rss12           4.441798        4.203300         4.671170
var_rss12           2.518991        2.335537         2.697440
avg_rss13           2.812274        2.659772         2.951090
var_rss13           1.730792        1.635415         1.819587
avg_rss23           2.992920        2.821113         3.156677
var_rss23           1.761146        1.659445         1.858199
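
The same percentile bootstrap can also be written without an explicit Python loop. The sketch below is a vectorized variant, assuming NumPy's `Generator` API; the function name `bootstrap_std_ci` and the synthetic input are illustrative and not part of the notebook. It takes any single column of values, so it could equally be pointed at a column of `df_features`.

```python
import numpy as np

def bootstrap_std_ci(values, num_samples=2000, alpha=0.1, seed=42):
    """Percentile bootstrap confidence interval for the sample standard deviation."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(values)
    # Each row of idx holds the positions of one resample drawn with replacement
    idx = rng.integers(0, n, size=(num_samples, n))
    boot_stds = values[idx].std(axis=1, ddof=1)
    lower, upper = np.percentile(boot_stds, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper)

# Synthetic stand-in for one feature column (88 instances, made-up distribution)
synthetic = np.random.default_rng(0).normal(loc=40.0, scale=2.0, size=88)
low, high = bootstrap_std_ci(synthetic)
print(f"sample std = {synthetic.std(ddof=1):.3f}, 90% CI = ({low:.3f}, {high:.3f})")
```

Using `ddof=1` here matches the pandas `.std()` default used in the cell above; NumPy's own default is `ddof=0`.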


#### iv. Select Features

Mean: The mean captures the central tendency of each series, i.e. the baseline signal level during an activity. It is valuable when activities differ noticeably in their average level.

Min/Max: The minimum and maximum bound the range of the data and capture extreme values, which matters when sudden spikes or drops in the time series help differentiate activities. For example, the maximum might help distinguish walking from sitting if movement produces higher peak values. Similarly, the minimum records the lowest point the signal reaches.

Standard Deviation: The standard deviation measures the variability or dispersion of the data around the mean. It is an important feature when the activities being classified differ in how variable or consistent their signals are.
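
Once the statistics are chosen, the feature table can be reduced to the matching columns. The following is a minimal sketch assuming the column naming used in the extraction step (`min1`, `max1`, `mean1`, ...); the small `demo` frame and its values are made-up placeholders standing in for `df_features`.

```python
import pandas as pd

# Statistics discussed above
selected_stats = ["min", "max", "mean", "std"]

# Hypothetical stand-in for df_features, restricted to two series with placeholder values
demo = pd.DataFrame({
    "min1": [37.0], "max1": [45.0], "mean1": [40.6], "median1": [40.5], "std1": [1.5],
    "min2": [0.0], "max2": [1.3], "mean2": [0.4], "median2": [0.4], "std2": [0.3],
})

# Strip the trailing series index from each column name and keep the selected statistics
keep = [c for c in demo.columns if c.rstrip("0123456789") in selected_stats]
reduced = demo[keep]
print(list(reduced.columns))
```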

### (a) Linear Train

For the training data, the cubic regression model will never have a higher RSS than the linear regression model. The cubic model contains the linear model as a special case (set the quadratic and cubic coefficients to zero), so a least squares fit of the cubic model can always match the linear fit and will usually improve on it. The added flexibility of the quadratic and cubic terms lets it follow the data more closely even if the true relationship is purely linear, although in that case the extra "fit" is just capturing noise.

So, we would expect the training RSS for the cubic regression to be lower than (or at worst, equal to) the training RSS for the linear regression when both are fit to the same dataset.
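
A small simulation illustrates this. The sketch below assumes a made-up linear truth `y = 1 + 2x + noise` and uses `np.polyfit` for the least squares fits; the helper name `train_rss` and all constants are illustrative choices, not part of the question.

```python
import numpy as np

def train_rss(x, y, degree):
    """Fit a polynomial of the given degree by least squares and return its training RSS."""
    coefs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coefs, x)
    return float(np.sum(residuals ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=100)  # assumed linear truth plus noise

rss_linear = train_rss(x, y, 1)
rss_cubic = train_rss(x, y, 3)
print(f"training RSS: linear = {rss_linear:.2f}, cubic = {rss_cubic:.2f}")
```

Whatever seed is used, the cubic training RSS should not exceed the linear one (up to floating-point error), because the two models are nested.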

### (b) Linear Test

Given that the true relationship between X and Y is linear, a cubic regression model, even though it might fit the training data better (as explained in part (a)), may not generalize well to new, unseen data. This phenomenon is commonly referred to as overfitting.

Overfitting happens when a model is too flexible and captures not just the underlying relationship but also the random noise in the training data. When the model is then applied to test data (or any new data outside the training set), it might produce predictions that are off because it's reacting to patterns (noise) that don't exist in the new dataset.

For the test RSS:

The linear regression model is likely to generalize better to new data if the true relationship is indeed linear because it captures just the linear relationship without being influenced by the noise in the training data.

The cubic regression model, on the other hand, might perform worse on the test data because it may have adapted too closely to the idiosyncrasies (including noise) of the training data, leading to poorer generalization.

Therefore, we would expect the test RSS for the cubic regression to be higher (indicating worse performance) than the test RSS for the linear regression if the true relationship is linear. However, it's worth noting that this isn't guaranteed and depends on the nature of the noise and the specific data at hand.
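
The expectation above is a statement about average behaviour, which can be checked by repeating the experiment many times. The sketch below assumes the same made-up linear truth `y = 1 + 2x + noise`, small training sets of 30 points, and fresh test sets of the same size; the function name `avg_test_rss_linear_truth` and all constants are illustrative assumptions.

```python
import numpy as np

def avg_test_rss_linear_truth(degree, n_trials=500, n=30, seed=0):
    """Average test RSS of a degree-`degree` polynomial fit when the truth is linear."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_trials):
        x_tr, x_te = rng.uniform(-2, 2, n), rng.uniform(-2, 2, n)
        y_tr = 1.0 + 2.0 * x_tr + rng.normal(size=n)
        y_te = 1.0 + 2.0 * x_te + rng.normal(size=n)
        coefs = np.polyfit(x_tr, y_tr, degree)
        total += np.sum((y_te - np.polyval(coefs, x_te)) ** 2)
    return total / n_trials

# The shared seed means both models see exactly the same simulated datasets
avg_linear = avg_test_rss_linear_truth(1)
avg_cubic = avg_test_rss_linear_truth(3)
print(f"average test RSS: linear = {avg_linear:.2f}, cubic = {avg_cubic:.2f}")
```

On any single train/test split the cubic fit can still come out ahead by chance; the averaging is what exposes the extra variance of the more flexible model.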







### (c) Not Linear Train

The linear regression model is constrained to model only a linear relationship between X and Y. If the true relationship is not linear, this model may not fit the data very well.

The cubic regression model, on the other hand, has the potential to fit both linear and certain nonlinear relationships (up to a cubic polynomial). This added flexibility means it can adapt to a wider variety of relationships between X and Y compared to the linear model.

Considering the training RSS:

If the relationship is indeed not linear, the cubic regression can capture more of the nonlinear pattern. Because the cubic model contains the linear model as a special case, its training RSS can never be higher, and with a nonlinear truth it will typically be strictly lower.

Therefore, we would expect the training RSS for the cubic regression to be lower than the training RSS for the linear regression when the true relationship is not linear. However, the size of the gap depends on the nature and degree of the nonlinearity in the true relationship. If the true relationship is very close to linear, the difference in RSS might be small; if it is strongly nonlinear (and particularly if it resembles the shapes a cubic polynomial can capture), the difference might be substantial.
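
To see how the gap in training RSS grows with the strength of the nonlinearity, the sketch below fits both models to `y = x + c * x**2 + noise` for a few values of `c`. The quadratic form of the truth, the values of `c`, and the helper name `train_rss_gap` are all assumptions made for illustration.

```python
import numpy as np

def train_rss_gap(x, y):
    """Return (linear training RSS) minus (cubic training RSS) for the same data."""
    rss = []
    for degree in (1, 3):
        coefs = np.polyfit(x, y, degree)
        rss.append(np.sum((y - np.polyval(coefs, x)) ** 2))
    return float(rss[0] - rss[1])

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
noise = rng.normal(size=100)

curvatures = [0.0, 0.1, 0.5, 2.0]  # c = 0 is the purely linear case
gaps = [train_rss_gap(x, x + c * x ** 2 + noise) for c in curvatures]
for c, gap in zip(curvatures, gaps):
    print(f"c = {c}: linear RSS - cubic RSS = {gap:.2f}")
```

The gap is never negative, and it should be far larger for the strongly curved case than for the linear one.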

### (d) Not Linear Testing

If the nonlinearity in the true relationship between X and Y can be well-approximated by a cubic polynomial, then the cubic regression is likely to outperform the linear regression on both the training and test datasets. This means that the test RSS for the cubic regression would be expected to be lower than that of the linear regression because the cubic regression would generalize better to new data by capturing the underlying nonlinear relationship.

However, if the true nonlinear relationship is more complex or is of a different form than what can be captured by a cubic polynomial, the cubic regression model might still overfit the training data. This overfitting could result in a higher test RSS for the cubic regression compared to the linear regression, even if the latter is underfitting. This is because the cubic model could be capturing noise or idiosyncratic patterns in the training data that don't generalize well to the test data.

In conclusion, knowing only that the relationship is not linear, with no information about the form of the nonlinearity, there is not enough information to tell which model will have the lower test RSS. The cubic model can better capture certain nonlinearities, but its extra flexibility also raises the risk of overfitting, which can hurt test RSS. Deciding between the two requires knowing more about the true relationship or evaluating both models on test data.
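
Evaluating both models on simulated test data shows how the answer depends on the truth. The sketch below compares average test RSS under two hypothetical truths: one that a cubic polynomial matches exactly, and one that is only barely nonlinear. Both truth functions, the sample sizes, and the name `avg_test_rss_for_truth` are assumptions chosen for illustration.

```python
import numpy as np

def avg_test_rss_for_truth(truth, degree, n_trials=200, n=50, seed=0):
    """Average test RSS of a polynomial fit when y = truth(x) + standard normal noise."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_trials):
        x_tr, x_te = rng.uniform(-2, 2, n), rng.uniform(-2, 2, n)
        y_tr = truth(x_tr) + rng.normal(size=n)
        y_te = truth(x_te) + rng.normal(size=n)
        coefs = np.polyfit(x_tr, y_tr, degree)
        total += np.sum((y_te - np.polyval(coefs, x_te)) ** 2)
    return total / n_trials

# Hypothetical truths: strongly cubic vs. almost linear
truths = {
    "cubic truth": lambda x: x ** 3 - x,
    "barely nonlinear truth": lambda x: 2.0 * x + 0.02 * x ** 2,
}

results = {}
for name, f in truths.items():
    results[name] = (avg_test_rss_for_truth(f, 1), avg_test_rss_for_truth(f, 3))
    print(f"{name}: linear = {results[name][0]:.2f}, cubic = {results[name][1]:.2f}")
```

With the cubic truth, the cubic fit should clearly win on test data. With the barely nonlinear truth, the bias removed by the cubic terms is tiny, so the comparison comes down to whether that saving outweighs the added variance.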





