### Joseph Bu <br />
### github: @josephhbu <br />
### USC ID: 3752428485
---

## HW 3 <br />

### 1. Time Series Classification Part 1: Feature Creation/Extraction <br />

An interesting task in machine learning is classification of time series. In this problem, we will classify the activities of humans based on time series obtained by a Wireless Sensor Network. 

**(a) Download the AReM data from: https://archive.ics.uci.edu/ml/datasets/Activity+Recognition+system+based+on+Multisensor+data+fusion+\%28AReM\%29. The dataset contains 7 folders that represent seven types of activities. In each folder, there are multiple files, each of which represents an instance of a human performing an activity. Each file contains 6 time series collected from activities of the same person, which are called avg_rss12, var_rss12, avg_rss13, var_rss13, avg_rss23, and var_rss23. There are 88 instances in the dataset, each of which contains 6 time series, and each time series has 480 consecutive values.**

In [89]:
import os
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import bootstrap

**(b) Keep datasets 1 and 2 in folders bending1 and bending2, as well as datasets 1, 2, and 3 in other folders as test data and other datasets as train data.**

In [90]:
# Splitting all data into train and test
def train_test_split():
    bending = ["bending1", "bending2"]
    all_folders = ["bending1", "bending2", "cycling", "lying", "sitting", "standing", "walking"]

    test_data = []
    train_data = []

    for folder_name in all_folders:
        folder_path = 'AReM/' + folder_name
        if not os.path.isdir(folder_path):
            print(f"Warning: Folder {folder_path} does not exist.")
            continue
        for file_name in os.listdir(folder_path):
            file_path = os.path.join(folder_path, file_name)
            # Extract the dataset number from names such as "dataset12.csv"
            match = re.search(r'\d+', file_name)
            if match is None:
                # Skip files without a dataset number (e.g. hidden files)
                continue
            dataset_num = int(match.group(0))
            # bending1/bending2 hold out datasets 1-2; other folders hold out 1-3
            n_test = 2 if folder_name in bending else 3
            if dataset_num <= n_test:
                test_data.append(file_path)
            else:
                train_data.append(file_path)

    return test_data, train_data

In [91]:
# Checking to make sure we have all the correct splits for train and test
test_data, train_data = train_test_split()
print("Test Data:", test_data)
print("Length:", len(test_data))
print("Train Data:", train_data)
print("Length:", len(train_data))

Test Data: ['AReM/bending1/dataset1.csv', 'AReM/bending1/dataset2.csv', 'AReM/bending2/dataset1.csv', 'AReM/bending2/dataset2.csv', 'AReM/cycling/dataset1.csv', 'AReM/cycling/dataset2.csv', 'AReM/cycling/dataset3.csv', 'AReM/lying/dataset1.csv', 'AReM/lying/dataset2.csv', 'AReM/lying/dataset3.csv', 'AReM/sitting/dataset1.csv', 'AReM/sitting/dataset2.csv', 'AReM/sitting/dataset3.csv', 'AReM/standing/dataset1.csv', 'AReM/standing/dataset2.csv', 'AReM/standing/dataset3.csv', 'AReM/walking/dataset1.csv', 'AReM/walking/dataset2.csv', 'AReM/walking/dataset3.csv']
Length: 19
Train Data: ['AReM/bending1/dataset7.csv', 'AReM/bending1/dataset6.csv', 'AReM/bending1/dataset4.csv', 'AReM/bending1/dataset5.csv', 'AReM/bending1/dataset3.csv', 'AReM/bending2/dataset6.csv', 'AReM/bending2/dataset4.csv', 'AReM/bending2/dataset5.csv', 'AReM/bending2/dataset3.csv', 'AReM/cycling/dataset7.csv', 'AReM/cycling/dataset6.csv', 'AReM/cycling/dataset4.csv', 'AReM/cycling/dataset5.csv', 'AReM/cycling/dataset10.cs

In [92]:
# Convert into a test and train pandas dataframe
test_df_list = []
train_df_list = []
for file_path in test_data:
    df = pd.read_csv(file_path, skiprows=4, on_bad_lines='skip')
    test_df_list.append(df)
for file_path in train_data:
    df = pd.read_csv(file_path, skiprows=4, on_bad_lines='skip')
    train_df_list.append(df)

test_df = pd.concat(test_df_list)
train_df = pd.concat(train_df_list)
print("Testing set shape:", test_df.shape)
print("Training set shape:", train_df.shape)


Testing set shape: (9120, 7)
Training set shape: (33117, 7)
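
As a quick arithmetic check on the shapes above: with 480 rows per instance, 19 test files give 19 × 480 = 9120 rows, which matches. If the remaining 88 − 19 = 69 files are all in the training set, we would expect 69 × 480 = 33120 rows, so the training frame is 3 rows short. One plausible cause (an assumption, not verified here) is `on_bad_lines='skip'` dropping malformed rows. The sketch below shows one way to locate such files; `find_short_instances` is a hypothetical helper and the frames are synthetic stand-ins, not AReM data.

```python
import pandas as pd

def find_short_instances(frames, expected_rows=480):
    """Return (position, row_count) for every frame whose length differs from expected_rows."""
    return [(i, len(df)) for i, df in enumerate(frames) if len(df) != expected_rows]

# Synthetic stand-ins for per-file DataFrames (not real AReM data)
full = pd.DataFrame({"avg_rss12": range(480)})
short = pd.DataFrame({"avg_rss12": range(479)})

print(find_short_instances([full, short, full]))  # → [(1, 479)]
```

Applied to `train_df_list`, the returned positions index into `train_data`, which would identify the files that lost rows.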


**(c) Feature Extraction** 

Classification of time series usually needs extracting features from them. In this problem, we focus on time-domain features.

**i. Research what types of time-domain features are usually used in time series classification and list them (examples are minimum, maximum, mean, etc).**

**Answer**:

Statistical Features:
 - Mean
 - Standard deviation/Variance
 - Max, Min, Range
 - Skew
 - Kurtosis: tailedness of the distribution
 - IQR
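
The features listed above that are not extracted later in this notebook (range, skew, kurtosis, IQR) can be sketched as follows. This is a minimal illustration on a toy series, assuming `scipy.stats.skew` and `scipy.stats.kurtosis` with their default settings (kurtosis is reported as excess kurtosis); `extra_time_domain_features` is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def extra_time_domain_features(series):
    """Compute range, skew, excess kurtosis, and IQR for a 1-D series."""
    values = np.asarray(series, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    return {
        "range": float(values.max() - values.min()),
        "skew": float(skew(values)),          # 0 for a symmetric series
        "kurtosis": float(kurtosis(values)),  # Fisher definition: normal -> 0
        "iqr": float(q3 - q1),
    }

# Toy series for illustration only
toy = [1, 2, 3, 4, 5]
feats = extra_time_domain_features(toy)
print(feats)
```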

**ii. Extract the time-domain features minimum, maximum, mean, median, standard deviation, first quartile, and third quartile for all of the 6 time series in each instance. You are free to normalize/standardize features or use them directly. Your new dataset will look like this:**

| Instance | $\text{min}_1$ | $\text{max}_1$ | $\text{mean}_1$ | $\text{median}_1$ | ... | $\text{1st quart}_6$ | $\text{3rd quart}_6$ |
|----------|-----|-----|------|--------|-----|-----------|-----------|
| 1        |     |     |      |        |     |           |           |
| 2        |     |     |      |        |     |           |           |
| 3        |     |     |      |        |     |           |           |
| ...      | ... | ... | ...  | ...    | ... | ...       | ...       |
| 88       |     |     |      |        |     |           |           |

**where, for example, $\text{1st quart}_6$ means the first quartile of the sixth time series in each of the 88 instances.**

In [93]:
all_data = test_data + train_data
feature_list = []
# Names of the six time series, in column order after the time column is dropped
time_series_names = ["avg_rss12", "var_rss12", "avg_rss13", "var_rss13", "vg_rss23", "ar_rss23"]

for file_path in all_data:
    df = pd.read_csv(file_path, skiprows=4, on_bad_lines='skip')
    df = df.drop(columns=["# Columns: time"])
    data = df.values

    # One row of features per instance: 7 statistics for each of the 6 series
    features = {}
    for i, name in enumerate(time_series_names):
        value = data[:, i]
        features[f"{name}_min"] = np.min(value)
        features[f"{name}_max"] = np.max(value)
        features[f"{name}_mean"] = np.mean(value)
        features[f"{name}_median"] = np.median(value)
        features[f"{name}_std"] = np.std(value)  # population std (ddof=0)
        features[f"{name}_1st_quart"] = np.percentile(value, 25)
        features[f"{name}_3rd_quart"] = np.percentile(value, 75)

    feature_list.append(features)

df_features = pd.DataFrame(feature_list)
df_features

Unnamed: 0,avg_rss12_min,avg_rss12_max,avg_rss12_mean,avg_rss12_median,avg_rss12_std,avg_rss12_1st_quart,avg_rss12_3rd_quart,var_rss12_min,var_rss12_max,var_rss12_mean,...,vg_rss23_std,vg_rss23_1st_quart,vg_rss23_3rd_quart,ar_rss23_min,ar_rss23_max,ar_rss23_mean,ar_rss23_median,ar_rss23_std,ar_rss23_1st_quart,ar_rss23_3rd_quart
0,37.25,45.00,40.624792,40.500,1.475428,39.2500,42.00,0.0,1.30,0.358604,...,2.186168,33.00,36.00,0.00,1.92,0.570583,0.430,0.582308,0.0000,1.3000
1,38.00,45.67,42.812812,42.500,1.434054,42.0000,43.67,0.0,1.22,0.372437,...,1.993175,32.00,34.50,0.00,3.11,0.571083,0.430,0.600383,0.0000,1.3000
2,12.75,51.00,24.562958,24.250,3.733619,23.1875,26.50,0.0,6.87,0.590833,...,3.689936,20.50,27.00,0.00,4.97,0.700188,0.500,0.692997,0.4300,0.8700
3,0.00,42.75,27.464604,28.000,3.579847,25.5000,30.00,0.0,7.76,0.449708,...,5.048375,15.00,20.75,0.00,6.76,1.122125,0.830,1.011287,0.4700,1.3000
4,24.25,45.00,37.177042,36.250,3.577569,34.5000,40.25,0.0,8.58,2.374208,...,2.887335,17.95,21.75,0.00,9.34,2.921729,2.500,1.850669,1.5000,3.9000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
83,19.75,45.50,34.322750,35.250,4.747524,31.0000,38.00,0.0,13.47,4.456333,...,3.116605,13.50,17.75,0.00,9.67,3.432563,3.200,1.730921,2.1575,4.5650
84,19.25,44.00,34.473188,35.000,4.791706,31.2500,38.00,0.0,13.86,4.359312,...,3.153030,13.73,17.75,0.43,9.00,3.340458,3.090,1.697343,2.1200,4.3750
85,23.50,46.25,34.873229,35.250,4.526997,31.7500,38.25,0.0,14.82,4.380583,...,3.127813,13.75,18.00,0.00,9.51,3.424646,3.270,1.689198,2.1700,4.5000
86,18.33,45.75,34.599875,35.125,4.726858,31.5000,38.00,0.0,15.37,4.398833,...,2.902659,14.00,18.25,0.00,8.86,3.289542,3.015,1.678418,2.1200,4.2600


**iii. Estimate the standard deviation of each of the time-domain features you extracted from the data. Then, use Python’s bootstrapped or any other method to build a 90% bootstrap confidence interval for the standard deviation of each feature.**

In [94]:
results = []
for feature in df_features.columns:
    feature_values = df_features[feature].dropna().values
    std = np.std(feature_values)

    if std == 0:
        # Constant feature: every resample has zero spread, so skip the
        # bootstrap (the default BCa method is degenerate here) and report
        # a zero-width interval
        lower, upper = 0.0, 0.0
    else:
        res = bootstrap((feature_values,), np.std, confidence_level=0.9, n_resamples=999)
        lower = res.confidence_interval.low
        upper = res.confidence_interval.high
    results.append({
        "feature": feature,
        "std": std,
        "lower bound": lower,
        "upper bound": upper,
    })

bootstrap_df = pd.DataFrame(results)
bootstrap_df


Unnamed: 0,feature,std,lower bound,upper bound
0,avg_rss12_min,9.568541,8.565176,11.100549
1,avg_rss12_max,4.183493,3.353507,5.402882
2,avg_rss12_mean,5.246001,4.689871,5.900136
3,avg_rss12_median,5.355577,4.839773,6.125514
4,avg_rss12_std,1.759268,1.587131,1.977463
5,avg_rss12_1st_quart,6.092822,5.634456,6.712439
6,avg_rss12_3rd_quart,5.002031,4.318934,5.861556
7,var_rss12_min,0.0,0.0,0.0
8,var_rss12_max,5.030493,4.632296,5.423781
9,var_rss12_mean,1.568847,1.42753,1.709246
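
To make the interval construction concrete, here is a hand-rolled sketch of a plain percentile bootstrap for the standard deviation. Note that this is a simpler technique than the BCa method that `scipy.stats.bootstrap` uses by default, so its bounds will generally differ slightly from the table above. The helper name `percentile_bootstrap_std`, the seed, and the synthetic sample are all assumptions for illustration.

```python
import numpy as np

def percentile_bootstrap_std(values, confidence_level=0.9, n_resamples=999, seed=0):
    """Percentile-bootstrap confidence interval for the population std (ddof=0)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n = values.size
    # Each row of idx selects one resample of size n, drawn with replacement
    idx = rng.integers(0, n, size=(n_resamples, n))
    boot_stds = values[idx].std(axis=1)
    # Cut off (1 - confidence_level) / 2 of the bootstrap distribution in each tail
    alpha = 1.0 - confidence_level
    lower, upper = np.quantile(boot_stds, [alpha / 2, 1 - alpha / 2])
    return float(values.std()), float(lower), float(upper)

# Synthetic stand-in for one feature column with 88 instances
sample = np.random.default_rng(1).normal(loc=0.0, scale=2.0, size=88)
std, low, high = percentile_bootstrap_std(sample)
print(f"std={std:.3f}, 90% CI=({low:.3f}, {high:.3f})")
```

A constant column needs no special case here: every resample has zero spread, so the function returns a zero-width interval at 0.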


**iv. Use your judgement to select the three most important time-domain features (one option may be min, mean, and max).**

**Answer:**

I will pick the mean, max, and standard deviation. In the bootstrap results, these three features tend to have narrower confidence intervals than the others. The mean and max also capture more of the range of behavior in each time series.

### 2. ISLR 3.7.4

I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. $$ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon $$

**(a) Suppose that the true relationship between X and Y is linear, i.e. $Y = \beta_0 + \beta_1 X + \epsilon$. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.**

**Answer:** We would expect the training RSS for the cubic regression to be lower than, or at worst equal to, the training RSS for the linear regression. The cubic model contains the linear model as a special case ($\beta_2 = \beta_3 = 0$), so its extra flexibility lets it fit the training data at least as well, noise included.
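
A small simulation can illustrate this. The sketch below assumes an arbitrary linear truth $Y = 1 + 2X + \epsilon$ with standard normal noise and uses `np.polyfit` for the least-squares fits; the coefficients, seed, and the helper name `training_rss` are illustrative choices, not part of the exercise.

```python
import numpy as np

def training_rss(x, y, degree):
    """Fit a polynomial of the given degree by least squares and return its training RSS."""
    coefs = np.polyfit(x, y, deg=degree)
    residuals = y - np.polyval(coefs, x)
    return float(np.sum(residuals ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=100)  # the true relationship is linear

rss_linear = training_rss(x, y, degree=1)
rss_cubic = training_rss(x, y, degree=3)
print(f"linear: {rss_linear:.3f}, cubic: {rss_cubic:.3f}")
```

Because the linear fit is one of the candidates the cubic least-squares fit can choose, `rss_cubic` cannot exceed `rss_linear` beyond floating-point error.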

**(b) Answer (a) using test rather than training RSS** 

**Answer:** In this case, the linear regression is likely to have a lower test RSS than the cubic regression. The same flexibility that gives the cubic model a lower training RSS can lead it to overfit the training data, which would cause a higher test RSS. The true relationship is also linear, so the linear regression model should perform better on test data.
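
The test-set claim is about expected behavior, so a single draw is not enough to show it. The sketch below averages the test RSS over many simulated train/test pairs, again assuming a linear truth $Y = 1 + 2X + \epsilon$ with n = 100 and standard normal noise; `average_test_rss`, the number of trials, and the seed are hypothetical choices.

```python
import numpy as np

def average_test_rss(degree, n_trials=500, n=100, seed=0):
    """Average test RSS of a degree-`degree` polynomial fit when the truth is linear."""
    rng = np.random.default_rng(seed)  # same seed -> same data for every degree
    total = 0.0
    for _ in range(n_trials):
        x_train = rng.uniform(-2, 2, size=n)
        y_train = 1.0 + 2.0 * x_train + rng.normal(size=n)
        x_test = rng.uniform(-2, 2, size=n)
        y_test = 1.0 + 2.0 * x_test + rng.normal(size=n)
        coefs = np.polyfit(x_train, y_train, deg=degree)
        total += np.sum((y_test - np.polyval(coefs, x_test)) ** 2)
    return total / n_trials

avg_linear = average_test_rss(degree=1)
avg_cubic = average_test_rss(degree=3)
print(f"average test RSS, linear: {avg_linear:.2f}, cubic: {avg_cubic:.2f}")
```

On any single test set the cubic fit can still win by chance; the gap only shows up reliably in the average.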

**(c) Suppose that the true relationship between X and Y is not linear, but we don’t know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.**

**Answer:** For the same reasons as in part (a), we would expect the training RSS for the cubic regression to be lower than that of the linear regression. Since the cubic regression has more predictors, it has more flexibility to fit the non-linearity in the training data, leading to a lower training RSS.

**(d) Answer (c) using test rather than training RSS.**

**Answer:** There is not enough information to tell, as it depends on whether the true relationship is closer to linear or to cubic. If it is closer to linear, then the test RSS for the linear regression will likely be lower. If it is more strongly non-linear, i.e. farther from linear, then the test RSS for the cubic regression will likely be lower.
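
This dependence on the degree of non-linearity can be sketched with a simulation. The example assumes a truth of the form $Y = 1 + 2X + cX^3 + \epsilon$, where the hypothetical parameter `cubic_coef` (the $c$ here) controls how far the relationship is from linear; all coefficients, the seed, and the function name are illustrative assumptions.

```python
import numpy as np

def mean_test_rss_by_degree(cubic_coef, n_trials=500, n=100, seed=0):
    """Average test RSS of linear (1) and cubic (3) fits for Y = 1 + 2X + cubic_coef * X^3 + noise."""
    rng = np.random.default_rng(seed)
    totals = {1: 0.0, 3: 0.0}
    for _ in range(n_trials):
        x_train = rng.uniform(-2, 2, size=n)
        x_test = rng.uniform(-2, 2, size=n)
        y_train = 1 + 2 * x_train + cubic_coef * x_train ** 3 + rng.normal(size=n)
        y_test = 1 + 2 * x_test + cubic_coef * x_test ** 3 + rng.normal(size=n)
        # Fit both models on the same training data and score on the same test data
        for degree in totals:
            coefs = np.polyfit(x_train, y_train, deg=degree)
            totals[degree] += np.sum((y_test - np.polyval(coefs, x_test)) ** 2)
    return {degree: total / n_trials for degree, total in totals.items()}

nearly_linear = mean_test_rss_by_degree(cubic_coef=0.02)
strongly_cubic = mean_test_rss_by_degree(cubic_coef=1.0)
print("nearly linear truth: ", nearly_linear)
print("strongly cubic truth:", strongly_cubic)
```

With a very small cubic term, the bias the linear model incurs is smaller than the extra variance of the cubic fit, so the linear model tends to win on test data; with a large cubic term, the ordering reverses.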