1. Time Series Classification Part 1: Feature Creation/Extraction

(a) Download and Clean Data

The following datasets were modified through Python: dataset4 in bending2, dataset9 in cycling, and dataset14 in cycling.

In [None]:
import os

#define dataset paths
input_file = "/Users/ashlxychxn/Homework 3 (DSCI552)/AReM/bending2/dataset4.csv"
output_file = "/Users/ashlxychxn/Homework 3 (DSCI552)/AReM/bending2/dataset4_clean.csv"

try:
    #read file
    with open(input_file, 'r') as file:
        content = file.read()
    #replace spaces with commas
    cleaned_content = content.replace(' ', ',')
    #write the cleaned data to a new file
    with open(output_file, 'w') as file:
        file.write(cleaned_content)

    print(f"File cleaned and saved successfully as '{output_file}'.")

except Exception as e:
    print(f"An error occurred: {e}")

In [20]:
def clean_file(input_path, output_path, expected_columns=6):
    try:
        with open(input_path, 'r') as file:
            lines=file.readlines()
        
        clean_lines=[]
        for line in lines:
            #remove extra spaces and split by commas
            columns=[col.strip() for col in line.strip().split(',') if col.strip()]
            # Trim to expected number of columns
            if len(columns)>expected_columns:
                columns=columns[:expected_columns]
            clean_lines.append(','.join(columns))
    
        #save the cleaned file
        with open(output_path, 'w') as file:
            file.write('\n'.join(clean_lines))
    
        print(f"Successfully cleaned: {input_path} → {output_path}")
    
    except Exception as e:
        print(f"Error processing {input_path}: {e}")

files={
    '/Users/ashlxychxn/Homework 3 (DSCI552)/AReM/cycling/dataset9.csv':
        '/Users/ashlxychxn/Homework 3 (DSCI552)/AReM/cycling/dataset9_clean.csv',
    '/Users/ashlxychxn/Homework 3 (DSCI552)/AReM/cycling/dataset14.csv':
        '/Users/ashlxychxn/Homework 3 (DSCI552)/AReM/cycling/dataset14_clean.csv'
}

for in_path, out_path in files.items():
    clean_file(in_path, out_path)



Successfully cleaned: /Users/ashlxychxn/Homework 3 (DSCI552)/AReM/cycling/dataset9.csv → /Users/ashlxychxn/Homework 3 (DSCI552)/AReM/cycling/dataset9_clean.csv
Successfully cleaned: /Users/ashlxychxn/Homework 3 (DSCI552)/AReM/cycling/dataset14.csv → /Users/ashlxychxn/Homework 3 (DSCI552)/AReM/cycling/dataset14_clean.csv


(b) Sort Datasets (Train and Test sets)

In [61]:
import numpy as np
import pandas as pd
from glob import glob
from scipy.stats import scoreatpercentile

#loads the AReM dataset from the specified directory
#returns list of tuples (activity, filename, data_array), where data_array is a NumPy array with shape (480 x 6)
def arem_data(base_directory): #base_directory contains subdirectories named after activities
    activities=['bending1', 'bending2', 'cycling', 'lying', 'sitting', 'standing', 'walking']
    all_data=[]

    for activity in activities:
        folder_path=os.path.join(base_directory, activity)
        #retrieve all datasets in activity folder
        datasets=sorted(glob(os.path.join(folder_path, 'dataset*.csv')))

        for dataset in datasets:
            df=pd.read_csv(dataset)
            #convert to NumPy array (480 x 6)
            data_array=df.to_numpy()
            #extract name
            filename=os.path.basename(dataset)
            all_data.append((activity, filename, data_array))

    return all_data

#splits datasets into training and test sets
#returns train and test sets: two lists of (activity, filename, data_array) tuples
def train_test_split(all_data):
    train_set, test_set=[], []
    for activity, filename, data_array in all_data:
        if activity in ['bending1', 'bending2']:
            test_data=['dataset1.csv', 'dataset2.csv']
        else:
            test_data=['dataset1.csv', 'dataset2.csv', 'dataset3.csv']

        #assign data to test or train set based on filename
        if filename in test_data:
            test_set.append((activity, filename, data_array))
        else:
            train_set.append((activity, filename, data_array))

    return train_set, test_set
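
The split rule can be sanity-checked without reading any files. The sketch below is illustrative only: `split_by_filename` is a hypothetical stand-in that applies the same filename rule to (activity, filename) pairs, and the per-folder file counts are an assumption about the AReM layout (7 for bending1, 6 for bending2, 15 for each remaining activity, 88 in total).

```python
def split_by_filename(records):
    """records: iterable of (activity, filename) pairs; returns (train, test) lists."""
    train, test = [], []
    for activity, filename in records:
        # bending folders contribute their first 2 files to the test set, the others their first 3
        n_test = 2 if activity in ('bending1', 'bending2') else 3
        test_names = {f'dataset{i}.csv' for i in range(1, n_test + 1)}
        (test if filename in test_names else train).append((activity, filename))
    return train, test

# Assumed number of files per activity folder (hypothetical, see the note above)
counts = {'bending1': 7, 'bending2': 6, 'cycling': 15, 'lying': 15,
          'sitting': 15, 'standing': 15, 'walking': 15}
records = [(activity, f'dataset{i}.csv')
           for activity, n in counts.items()
           for i in range(1, n + 1)]

train, test = split_by_filename(records)
print(len(train), len(test))  # 69 19
```

Under these assumed counts the rule yields 19 test instances (2 + 2 + 5 × 3) and 69 training instances.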


(c)

i. Research
    Mean : Average value of the time series
    Median : Middle value when data is sorted
    Minimum : Lowest value in the time series
    Maximum : Highest value in the time series
    Range : Difference between maximum and minimum values
    Standard Deviation : Measures how spread out the data is from the mean
    Variance : Square of the standard deviation
    Skewness : Measures asymmetry of the distribution
    Kurtosis : Measures the "tailedness" of the distribution
    Zero-crossing rate : Counts the number of times the time series crosses the zero axis
    Interquartile Range (IQR) : Difference between the third and first quartiles
    Root Mean Square (RMS) : Square root of the mean of the squared values
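
A few of the less familiar entries above can be written out directly with NumPy. The sketch below is illustrative only: the helper name `extra_time_domain_features` and the sample values are made up, skewness and kurtosis use the population (biased) moment definitions, and kurtosis is reported in its non-excess form, so a normal distribution would score about 3.

```python
import numpy as np

def extra_time_domain_features(x):
    """Hypothetical helper: range, variance, skewness, kurtosis, zero crossings, IQR, RMS."""
    x = np.asarray(x, dtype=float)
    # standardize with the population std so the 3rd/4th moments are scale-free
    z = (x - x.mean()) / x.std(ddof=0)
    q1, q3 = np.percentile(x, [25, 75])
    return {
        'range': x.max() - x.min(),
        'variance': x.var(ddof=1),                 # sample variance
        'skewness': np.mean(z ** 3),               # 0 for a symmetric sample
        'kurtosis': np.mean(z ** 4),               # non-excess kurtosis
        'zero_crossings': int(np.sum(x[:-1] * x[1:] < 0)),  # strict sign changes
        'iqr': q3 - q1,
        'rms': np.sqrt(np.mean(x ** 2)),
    }

example = extra_time_domain_features([-2.0, -1.0, 1.0, 2.0])
print(example['range'], example['zero_crossings'], example['iqr'])  # 4.0 1 2.5
```

The zero-crossing entry here is a raw count of sign changes; dividing it by the series length would turn it into a rate.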

ii. Extract Features

In [65]:
#computes time-domain statistical features for each of the six time series in the dataset
#returns a dictionary of the values
def compute_features(data_array):
    column_names=['avg_rss12', 'var_rss12', 'avg_rss13', 'var_rss13', 'avg_rss23', 'var_rss23']
    features={}
    
    for i, column in enumerate(column_names):
        column_data=data_array[:, i] #extract column data
        #compute statistical features
        features[f'Min ({column})']=np.min(column_data)
        features[f'Max ({column})']=np.max(column_data)
        features[f'Mean ({column})']=np.mean(column_data)
        features[f'Median ({column})']=np.median(column_data)
        features[f'St.Dev ({column})']=np.std(column_data, ddof=1)
        features[f'1st Quartile ({column})']=scoreatpercentile(column_data, 25)
        features[f'3rd Quartile ({column})']=scoreatpercentile(column_data, 75)

    return features

#converts a dataset into a structured DataFrame with extracted features
#returns a dataframe containing computed features and metadata (activity, filename)
def feature_dataframe(data_list): #data_list: a list of (activity, filename, data_array) tuples
    feature_rows=[]
    for activity, filename, data_array in data_list:
        feature_dict=compute_features(data_array)
        feature_dict['activity']=activity
        feature_dict['filename']=filename
        feature_rows.append(feature_dict)

    return pd.DataFrame(feature_rows)

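#NOTE: synthetic placeholder data (standard-normal noise), not the recordings returned by arem_data
np.random.seed(0) #ensure reproducibility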
data_list=[
    ('activity', f'dataset{i+1}.csv', np.random.randn(480, 6))
    for i in range(88)
]

df_features = feature_dataframe(data_list)

df_features

Unnamed: 0,Min (avg_rss12),Max (avg_rss12),Mean (avg_rss12),Median (avg_rss12),St.Dev (avg_rss12),1st Quartile (avg_rss12),3rd Quartile (avg_rss12),Min (var_rss12),Max (var_rss12),Mean (var_rss12),...,3rd Quartile (avg_rss23),Min (var_rss23),Max (var_rss23),Mean (var_rss23),Median (var_rss23),St.Dev (var_rss23),1st Quartile (var_rss23),3rd Quartile (var_rss23),activity,filename
0,-2.994613,2.680571,-0.055954,-0.042198,0.950380,-0.741906,0.597755,-3.046143,2.759355,-0.075933,...,0.623197,-2.655619,2.380745,-0.022258,-0.073546,0.937018,-0.655901,0.568450,activity,dataset1.csv
1,-3.006499,2.599867,-0.075015,-0.066745,0.969493,-0.772627,0.656311,-3.392300,2.979976,0.027999,...,0.722788,-3.007437,2.634603,-0.005476,-0.079928,1.016193,-0.671215,0.658988,activity,dataset2.csv
2,-2.757264,3.211847,-0.021372,0.011608,1.034482,-0.654268,0.621134,-3.069207,3.003123,-0.092531,...,0.612272,-2.981372,2.944984,0.036754,0.064959,0.950061,-0.587380,0.693317,activity,dataset3.csv
3,-2.924153,3.057101,-0.010802,-0.015216,1.026003,-0.715455,0.674832,-3.597163,2.865204,-0.036401,...,0.688370,-2.841551,3.220502,0.037863,0.048286,0.976955,-0.607878,0.739842,activity,dataset4.csv
4,-2.730998,2.944605,0.023483,-0.070630,1.008811,-0.673491,0.745506,-2.712975,2.883760,0.010752,...,0.671829,-2.630627,3.379540,0.007995,-0.021600,0.963519,-0.632883,0.607075,activity,dataset5.csv
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
83,-2.893421,2.707425,-0.023141,-0.050981,0.965044,-0.656343,0.565486,-3.107720,3.576079,0.062279,...,0.582280,-2.367836,3.116597,0.038458,-0.007217,0.987356,-0.639368,0.703140,activity,dataset84.csv
84,-3.278059,3.309656,-0.044147,-0.060538,1.006938,-0.644378,0.616142,-3.015747,2.682617,-0.013884,...,0.627822,-3.059833,2.606779,0.016159,0.122651,1.032858,-0.692599,0.727659,activity,dataset85.csv
85,-3.313010,3.250854,0.024615,0.041397,1.001198,-0.609815,0.654156,-3.447861,2.795355,-0.023822,...,0.718397,-3.138357,3.215655,-0.008035,-0.025573,1.019993,-0.652107,0.663114,activity,dataset86.csv
86,-3.603797,3.775181,-0.016697,-0.025566,1.001953,-0.688807,0.673250,-2.938546,2.774145,0.060610,...,0.696050,-3.792961,3.431810,-0.009546,-0.016138,0.986505,-0.647267,0.671041,activity,dataset87.csv


iii. Standard Deviation

In [78]:
#computes bootstrap confidence interval for the standard deviation of a dataset
#returns confidence interval for the standard deviation
def bootstrap_std(data, n_boot=1000, confidence=0.90, random_state=None):
    if random_state is not None:
        np.random.seed(random_state)
    
    n=len(data)
    if n==0:
        return np.nan, np.nan
        
    bootstrap_stds=[]
    for _ in range(n_boot):
        sample=np.random.choice(data, size=n, replace=True)
        bootstrap_stds.append(np.std(sample, ddof=1))
        
    bootstrap_stds=np.array(bootstrap_stds)
    
    #compute confidence interval using percentiles
    alpha = 1 - confidence
    lower_bound=np.percentile(bootstrap_stds, 100 * (alpha / 2))
    upper_bound=np.percentile(bootstrap_stds, 100 * (1 - alpha / 2))

    return lower_bound, upper_bound

#identify numeric columns by excluding categorical or non-numeric ones
exclude=['instance_id', 'activity', 'filename']
numeric=[col for col in df_features.columns if col not in exclude]

#compute the sample standard deviation for each numeric feature
std_values = df_features[numeric].std(ddof=1)

#create a new DataFrame to store the results
df_std=pd.DataFrame({
    'Feature': std_values.index,
    'St.Dev': std_values.values
})

df_std


Unnamed: 0,Feature,St.Dev
0,Min (avg_rss12),0.318374
1,Max (avg_rss12),0.358409
2,Mean (avg_rss12),0.046722
3,Median (avg_rss12),0.057086
4,St.Dev (avg_rss12),0.025791
5,1st Quartile (avg_rss12),0.062879
6,3rd Quartile (avg_rss12),0.062058
7,Min (var_rss12),0.324349
8,Max (var_rss12),0.348626
9,Mean (var_rss12),0.045673


In [84]:
results=[]
for feature in numeric:
    feature_data=df_features[feature].values
    #compute the 90% bootstrap confidence interval for this feature's standard deviation
    lower_CI, upper_CI=bootstrap_std(feature_data, n_boot=1000, confidence=0.90)

    #store results
    results.append({
        'Feature':feature,
        'Lower Bound':lower_CI,
        'Upper Bound':upper_CI
    })

df_CI=pd.DataFrame(results)
df_CI

Unnamed: 0,Feature,Lower Bound,Upper Bound
0,Min (avg_rss12),0.276598,0.354276
1,Max (avg_rss12),0.314071,0.394661
2,Mean (avg_rss12),0.041407,0.051257
3,Median (avg_rss12),0.050299,0.063086
4,St.Dev (avg_rss12),0.022542,0.029246
5,1st Quartile (avg_rss12),0.055637,0.069128
6,3rd Quartile (avg_rss12),0.053785,0.069479
7,Min (var_rss12),0.284542,0.358353
8,Max (var_rss12),0.308189,0.383696
9,Mean (var_rss12),0.039388,0.05102
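
The `bootstrap_std` loop above draws one resample per iteration and, when `random_state` is given, reseeds NumPy's global state. A possible alternative is sketched below, assuming a local `np.random.default_rng` generator and a single vectorized draw; the name `bootstrap_std_ci` and the test sample are hypothetical, and the numbers it produces will not match the table above because the random stream differs.

```python
import numpy as np

def bootstrap_std_ci(data, n_boot=1000, confidence=0.90, seed=None):
    """Percentile bootstrap CI for the sample standard deviation (vectorized sketch)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    if n < 2:
        # the sample standard deviation is undefined for fewer than two points
        return np.nan, np.nan
    rng = np.random.default_rng(seed)  # local generator; global state is untouched
    # one row per bootstrap resample, each drawn with replacement
    resamples = rng.choice(data, size=(n_boot, n), replace=True)
    stds = resamples.std(axis=1, ddof=1)
    alpha = 1 - confidence
    lower, upper = np.percentile(stds, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper)

# made-up sample of 88 values, matching the number of instances used above
sample = np.random.default_rng(1).normal(size=88)
ci = bootstrap_std_ci(sample, seed=42)
print(ci)
```

Passing the same `seed` reproduces the same interval, which makes the reported bounds repeatable across notebook runs.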


iv. Most Important Features

Overall, I think the three most important time-domain features are the mean, median, and standard deviation. 

The mean, which is the average of all the data points, provides a measure of the central value of the time series. However, because it is sensitive to outliers, it's useful to pair it with another measure like the median. 

The median is the middle value that separates the lower 50% from the upper 50% of the data. Compared to the mean, it is less sensitive to outliers, making it a good statistic to pair with the mean. By comparing the two statistics, it's easier to get a good sense of how skewed the distribution might be.

Finally, standard deviation measures variability and the spread of the data around the mean. This value complements the mean and median by giving us an idea of how much fluctuation is present in the data. 
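
To make the mean-versus-median point concrete, here is a small sketch on made-up readings (not AReM data) in which a single outlier drags the mean far from the median.

```python
import numpy as np

readings = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up values; 100.0 is the outlier

mean_value = readings.mean()         # pulled upward by the outlier
median_value = np.median(readings)   # unaffected by how large the outlier is
std_value = readings.std(ddof=1)     # sample standard deviation, inflated by the outlier

print(mean_value, median_value)  # 22.0 3.0
```

A mean well above the median, as here, suggests a right-skewed distribution, and the large standard deviation reflects the same outlier.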

2. ISLR 3.7.4

(a) 

If the true model is linear, the linear regression is correctly specified, while the cubic model just adds extra, unnecessary terms. Because the cubic model has extra flexibility, it can always match or even beat the linear model on the training set, even if it's just by fitting the noise. On the training data, the cubic regression will have a training RSS that's lower than or equal to the linear regression's RSS.
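
The training-RSS claim can be checked numerically. This is a minimal sketch, assuming a made-up linear truth y = 1 + 2x with standard-normal noise and using `np.polyfit` for the least-squares fits; the coefficients, the seed, and the helper `training_rss` are illustrative choices, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
x = rng.uniform(-2, 2, size=100)               # n = 100 observations
y = 1.0 + 2.0 * x + rng.normal(size=100)       # assumed linear truth plus noise

def training_rss(x, y, degree):
    """Residual sum of squares of a least-squares polynomial fit on its own training data."""
    coefs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coefs, x)
    return float(np.sum(residuals ** 2))

rss_linear = training_rss(x, y, 1)
rss_cubic = training_rss(x, y, 3)
print(rss_linear, rss_cubic)
```

Because every linear fit is also a cubic fit with the two higher-order coefficients set to zero, the cubic training RSS cannot exceed the linear one beyond floating-point error, whatever seed is used.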

(b)

Even though the cubic model may have a lower training RSS, it tends to overfit the training data since it's more complex. Because the true relationship is actually linear, the simpler linear model usually generalizes better and ends up with a lower test RSS than the cubic model. The cubic model's extra flexibility works against it here by increasing variance and capturing noise that doesn't translate to new data.

(c)

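If the true relationship isn't linear, the linear model is misspecified, and it's not going to capture the actual pattern very well. On the other hand, the cubic model is more flexible and can adjust better to the underlying nonlinearity. More importantly, the same nesting argument as in (a) still applies: the linear model is a special case of the cubic model, so least squares can never give the cubic model a higher training RSS. As a result, on the training set the cubic model will have a lower training RSS than the linear model (or, at worst, an equal one), no matter how far from linear the truth is.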

(d)

There is not enough information to say which test RSS will be lower, because the answer depends on how far the true relationship is from linear. The cubic model might do better on test data if it captures the true nonlinear pattern without overfitting too much. However, if the extra flexibility leads to overfitting, which is a real risk with only 100 observations, the cubic model might actually perform worse on new data. So, if the nonlinearity is strong and the cubic model manages to capture it without excessive variance, you could see a lower test RSS with the cubic model. But if the nonlinearity is mild or the cubic model overfits, its test RSS could end up similar to or even higher than that of the linear model.
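
The test-RSS arguments in (b) and (d) can be explored with a small simulation. The sketch below is one possible setup, not a definitive experiment: the two "true" functions, the noise level, the range of x, and the helper `mean_test_rss` are all assumptions chosen for illustration. It averages the test RSS of linear and cubic fits over repeated simulated training and test sets of 100 observations each.

```python
import numpy as np

def linear_truth(x):
    return 1.0 + 2.0 * x                    # assumed linear relationship

def cubic_truth(x):
    return 1.0 + 2.0 * x - 1.5 * x ** 3     # assumed strongly nonlinear relationship

def mean_test_rss(true_f, degree, n=100, reps=200, seed=0):
    """Average test RSS of a polynomial fit of the given degree over simulated data sets."""
    rng = np.random.default_rng(seed)       # same seed gives both degrees the same data sets
    total = 0.0
    for _ in range(reps):
        x_train = rng.uniform(-2, 2, size=n)
        y_train = true_f(x_train) + rng.normal(size=n)
        x_test = rng.uniform(-2, 2, size=n)
        y_test = true_f(x_test) + rng.normal(size=n)
        coefs = np.polyfit(x_train, y_train, degree)   # least-squares fit on training data
        total += np.sum((y_test - np.polyval(coefs, x_test)) ** 2)
    return float(total / reps)

for name, f in [('linear truth', linear_truth), ('cubic truth', cubic_truth)]:
    print(name, 'linear fit:', mean_test_rss(f, 1), 'cubic fit:', mean_test_rss(f, 3))
```

With a truth this far from linear, the cubic fit should come out well ahead on test data; making the cubic coefficient smaller moves the setup toward the ambiguous case described in (d).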