# 1.2-agifford-AnalyzeSingleDataFile
This notebook performs exploratory data analysis on an example datafile.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import numpy as np
from scipy import signal
from scipy.fft import fft, fftshift

parq_file = "../../data/interim/raw/fileID1_subjID3_dataID0.parquet"
df = pd.read_parquet(parq_file, engine="fastparquet")

In [None]:
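def _make_single_annot_frame(df, shift):
    # Keep rows whose label differs from the neighbouring row
    # (shift=1: previous row, so activity starts; shift=-1: next row, so activity ends).
    annot_df = df[df.label != df.label.shift(shift)]
    # Drop boundary rows that belong to unlabeled stretches.
    annot_df = annot_df.dropna(subset=["label"]).reset_index()
    return annot_df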

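def make_annot_dataframe(df, t_start=None, t_end=None):
    # Compare against None explicitly so that a bound of 0 is not ignored.
    t_start = df.time.min() if t_start is None else t_start
    t_end = df.time.max() if t_end is None else t_end

    # Restrict to the requested time window before locating label changes.
    df = df[(df.time >= t_start) & (df.time <= t_end)].copy()

    act_starts_df, act_ends_df = (
        _make_single_annot_frame(df, shift) for shift in [1, -1]
    )
    return act_starts_df, act_ends_df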
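
To make the `shift` comparison concrete, here is a minimal sketch on a toy frame. The column names match the ones used above, but the data is made up for illustration.

```python
import pandas as pd

# Toy frame: two labeled activities separated by an unlabeled row (made-up data).
toy = pd.DataFrame({
    "time": [0, 1, 2, 3, 4, 5, 6],
    "label": ["walk", "walk", "walk", None, "jump", "jump", None],
})

# A row starts a run when its label differs from the previous row's label,
# and ends a run when its label differs from the next row's label.
starts = toy[toy.label != toy.label.shift(1)].dropna(subset=["label"])
ends = toy[toy.label != toy.label.shift(-1)].dropna(subset=["label"])

print(starts.time.tolist())  # [0, 4]
print(ends.time.tolist())    # [2, 5]
```

Row `i` of the starts frame pairs with row `i` of the ends frame, which is what lets a single activity instance be sliced out by time.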

In [None]:
# Not sure I'll need the start/end frames, but rows with no activity label
# should probably be removed.
activity_starts_df, activity_ends_df = make_annot_dataframe(df)
df_dropna = df.dropna(subset=["label"])

In [None]:
df.shape, df_dropna.shape

In [None]:
df.columns

In [None]:
df_dropna.label.unique()

In [None]:
df_dropna.label_group.unique()

To get through this project end-to-end, I will not spend much time building a sophisticated model. I will therefore use the `label_group` column as the prediction target.

Process for each `label_group`:
1. Compute the FFT with a Hann window for each instance of the `label_group`.
2. Average the FFTs across instances.
3. Identify the major frequencies of the group by applying an (arbitrary) threshold to pick out peaks.
4. Use those frequencies to generate `sin` and `cos` features as inputs to a basic model that predicts `label_group`.
5. Repeat steps 1-4 for each variable × direction combination (e.g., "accel_x", "accel_y").
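
A rough sketch of steps 1-4 on synthetic data follows. It makes several assumptions that are not settled in this notebook: instances are zero-padded to a common `n_fft` so their spectra can be averaged, `np.hanning` stands in for the SciPy Hann window, the -10 dB cutoff is arbitrary, and the helper name `windowed_magnitude` is made up for illustration.

```python
import numpy as np

fs = 50      # sampling rate in Hz (assumed, matching the value used in this notebook)
n_fft = 512  # assumed common FFT length so spectra of different instances align

def windowed_magnitude(x, n_fft):
    """Magnitude spectrum of one instance, Hann-windowed and zero-padded to n_fft."""
    x = np.asarray(x, dtype=float)[:n_fft]
    return np.abs(np.fft.rfft(np.hanning(len(x)) * x, n=n_fft))

# Synthetic stand-ins for three instances of one label_group: a 1 Hz tone plus noise.
rng = np.random.default_rng(0)
instances = [
    np.sin(2 * np.pi * 1.0 * np.arange(n) / fs) + 0.1 * rng.standard_normal(n)
    for n in (400, 450, 500)
]

# Steps 1-2: per-instance spectra, then the average across instances.
mean_mag = np.mean([windowed_magnitude(x, n_fft) for x in instances], axis=0)
freqs = np.fft.rfftfreq(n_fft, d=1 / fs)

# Step 3: keep interior local maxima within 10 dB of the strongest bin.
mean_db = 20 * np.log10(mean_mag / mean_mag.max())
inner_max = (mean_db[1:-1] > mean_db[:-2]) & (mean_db[1:-1] > mean_db[2:])
is_peak = np.concatenate(([False], inner_max, [False]))
peak_freqs = freqs[is_peak & (mean_db > -10)]

# Step 4: one sin and one cos column per peak frequency, evaluated at each timestamp.
t = np.arange(200) / fs
features = np.column_stack(
    [f(2 * np.pi * pf * t) for pf in peak_freqs for f in (np.sin, np.cos)]
)
```

Averaging magnitudes (rather than complex FFT values) is a choice made here so that phase differences between instances do not cancel the peaks.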

First, let's template out the process of analyzing a single instance of a single `label_group`.

In [None]:
act_tp_df = pd.concat([activity_starts_df.head(1), activity_ends_df.head(1)], ignore_index=True)
df_snip = df_dropna[(df_dropna.time >= act_tp_df.loc[0, "time"]) & (df_dropna.time <= act_tp_df.loc[1, "time"])]

In [None]:
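fs = 50  # sampling rate in Hz
n_fft = df_snip.shape[0]
window = signal.windows.hann(n_fft)  # Hann taper to reduce spectral leakage
X_w = fft(window * df_snip.accel_x.values)
n_points = n_fft  # one frequency point per FFT bin
# Approximate axis: linspace includes both endpoints, so the spacing is
# fs / (n_fft - 1) rather than the exact bin width fs / n_fft.
freq = fs/2 * np.linspace(-1, 1, n_points)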
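
The frequency axis above is built with `np.linspace`, which is close to but not identical to the true FFT bin centres. The sketch below compares it with a bin-exact axis from `np.fft.fftfreq` on a tiny example; the length of 8 is chosen only to make the numbers easy to read.

```python
import numpy as np

fs = 50    # sampling rate in Hz, as above
n_fft = 8  # tiny length so the values are easy to inspect

# Axis construction used above: n_fft evenly spaced points spanning [-fs/2, fs/2].
freq_linspace = fs / 2 * np.linspace(-1, 1, n_fft)

# Bin-exact axis: centre frequency of each FFT bin, reordered the way fftshift does.
freq_exact = np.fft.fftshift(np.fft.fftfreq(n_fft, d=1 / fs))

dc_ix = n_fft // 2  # position of the 0 Hz bin after fftshift
print(freq_exact[dc_ix])     # 0.0
print(freq_linspace[dc_ix])  # fs / (2 * (n_fft - 1)) when n_fft is even
```

For an even number of samples, the `linspace` axis therefore places the 0 Hz bin slightly above zero, and the offset shrinks as the snippet gets longer.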


No strong frequency content in "\<Initial Activity\>" apart from the 0 Hz component.

In [None]:
# X_w_norm = np.abs(fftshift(X_w))
X_w_norm = 20 * np.log10(np.abs(fftshift(X_w / abs(X_w).max())))
plt.plot(freq, X_w_norm)
plt.title("Frequency response first activity")
plt.ylabel("Normalized magnitude [dB]")
plt.xlabel("F [Hz]")
print(act_tp_df.loc[0, "label"])
plt.show()

In [None]:
act_tp_df = pd.concat([activity_starts_df.loc[[4], :], activity_ends_df.loc[[4], :]], ignore_index=True)
df_snip = df_dropna[(df_dropna.time >= act_tp_df.loc[0, "time"]) & (df_dropna.time <= act_tp_df.loc[1, "time"])]

In contrast, "Jumping Jacks" shows prominent peaks at ~1 Hz and ~2.75 Hz.

In [None]:
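fs = 50  # sampling rate in Hz
n_fft = df_snip.shape[0]
window = signal.windows.hann(n_fft)  # Hann taper to reduce spectral leakage
X_w = fft(window * df_snip.accel_x.values)
n_points = n_fft  # one frequency point per FFT bin
# Approximate axis: linspace includes both endpoints, so the spacing is
# fs / (n_fft - 1) rather than the exact bin width fs / n_fft.
freq = fs/2 * np.linspace(-1, 1, n_points)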

# X_w_norm = np.abs(fftshift(X_w))
X_w_norm = 20 * np.log10(np.abs(fftshift(X_w / abs(X_w).max())))

# Sign changes of the slope mark local extrema. The + 1 is needed because
# index i of the second difference describes sample i + 1 of the signal.
turns = np.diff(np.sign(np.diff(X_w_norm)))
a = turns.nonzero()[0] + 1          # local min & max
b = (turns > 0).nonzero()[0] + 1    # local min
c = (turns < 0).nonzero()[0] + 1    # local max

plt.plot(freq, X_w_norm, color="grey")
plt.axhline(-10, color="orange")  # threshold line at -10 dB
plt.plot(freq[b], X_w_norm[b], "o", label="min", color='r')
plt.plot(freq[c], X_w_norm[c], "o", label="max", color='b')
plt.title(f'Frequency response: {act_tp_df.loc[0, "label"]}')
plt.ylabel("Normalized magnitude [dB]")
plt.xlabel("F [Hz]")
plt.xlim([0, 5])
print(act_tp_df.loc[0, "label"])
plt.show()
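
The `np.diff(np.sign(np.diff(...)))` idiom used above is easier to follow on a short array. A minimal illustration with made-up values:

```python
import numpy as np

x = np.array([0.0, 2.0, 1.0, 3.0, 5.0, 4.0])

slope_sign = np.sign(np.diff(x))  # [ 1., -1.,  1.,  1., -1.]
turns = np.diff(slope_sign)       # [-2.,  2.,  0., -2.]

# Index i of `turns` describes sample i + 1 of `x`, hence the + 1.
local_max_ix = (turns < 0).nonzero()[0] + 1
local_min_ix = (turns > 0).nonzero()[0] + 1

print(local_max_ix)  # [1 4]
print(local_min_ix)  # [2]
```

One caveat: on a flat plateau such as `[1, 2, 2, 1]` the slope sign passes through 0, so both plateau samples are flagged as maxima.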


Let's build a function that returns the frequencies of the local maxima that exceed the threshold (here, just the peaks at 0.02, 0.95, and 2.73 Hz).

In [None]:
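def local_fmax_above_thresh(freq, x_w, threshold):
    """Return the positive frequencies where x_w has a local maximum above threshold."""
    # A negative change in slope sign marks a local maximum; + 1 realigns the index.
    local_max_ix = (np.diff(np.sign(np.diff(x_w))) < 0).nonzero()[0] + 1
    x_w_max = x_w[local_max_ix]
    freq_max = freq[local_max_ix]

    return freq_max[(x_w_max > threshold) & (freq_max > 0)]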

In [None]:
local_fmax_above_thresh(freq, X_w_norm, -10)

Now, we want to cycle through all of the activities and extract the peak frequencies above a given threshold. What I want to find is an "ideal" threshold that picks out only 2 peak frequencies (3 including 0 Hz) for the majority of activities. This is the threshold I will use to extract features for the rest of the project.
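
One possible way to score candidate thresholds is sketched below. The spectra are hypothetical hand-written dB values (positive frequencies only, 0 Hz excluded), and `count_peaks_above` is an illustrative helper rather than a function defined in this notebook.

```python
import numpy as np

def count_peaks_above(spec_db, threshold):
    """Number of interior local maxima of spec_db that exceed threshold."""
    is_max = (spec_db[1:-1] > spec_db[:-2]) & (spec_db[1:-1] > spec_db[2:])
    return int(np.sum(is_max & (spec_db[1:-1] > threshold)))

# Hypothetical normalized spectra in dB, one array per activity instance.
spectra_db = [
    np.array([-30.0, 0.0, -25.0, -8.0, -30.0, -18.0, -40.0]),
    np.array([-30.0, -3.0, -20.0, 0.0, -35.0, -12.0, -40.0]),
    np.array([-30.0, 0.0, -28.0, -6.0, -33.0, -22.0, -40.0]),
]

target = 2  # desired number of non-DC peaks per activity
thresholds = [-5, -10, -15, -20]

# Share of activities that yield exactly `target` peaks at each threshold.
share_on_target = {
    th: np.mean([count_peaks_above(s, th) == target for s in spectra_db])
    for th in thresholds
}
best_threshold = max(share_on_target, key=share_on_target.get)
print(best_threshold)  # -10
```

With real data, ties between thresholds are likely, so it may be worth inspecting the whole `share_on_target` dictionary instead of only the maximum.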

In [None]:
def calculate_normed_spectrum():
    pass
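
The stub above is not filled in yet. One way it might look, mirroring the windowing and normalization steps from the earlier cells, is sketched here. The `(x, fs)` signature and the `(freq, spectrum_db)` return value are assumptions, and `np.hanning` is used in place of the SciPy window so the example depends on NumPy only.

```python
import numpy as np

def calculate_normed_spectrum(x, fs):
    """Return (freq, spectrum_db) for a Hann-windowed, peak-normalized spectrum.

    Hypothetical signature; the notebook has not settled on one.
    """
    x = np.asarray(x, dtype=float)
    n_fft = x.shape[0]
    magnitude = np.abs(np.fft.fftshift(np.fft.fft(np.hanning(n_fft) * x)))
    # Floor the ratio so empty bins give a large negative dB value instead of -inf.
    spectrum_db = 20 * np.log10(np.maximum(magnitude / magnitude.max(), 1e-12))
    # Same axis construction as the cells above.
    freq = fs / 2 * np.linspace(-1, 1, n_fft)
    return freq, spectrum_db

# Usage on a synthetic 2 Hz tone sampled at 50 Hz for 10 s.
t = np.arange(500) / 50
freq, spec_db = calculate_normed_spectrum(np.sin(2 * np.pi * 2.0 * t), fs=50)
```

An all-zero input would divide by zero in the normalization, so a real implementation would probably want to guard against that case.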

In [None]:
thresholds = [-5, -10, -15, -20]
for r_ix in range(activity_starts_df.shape[0]):
    act_tp_df = pd.concat([activity_starts_df.loc[[r_ix], :], activity_ends_df.loc[[r_ix], :]], ignore_index=True)
    df_snip = df_dropna[(df_dropna.time >= act_tp_df.loc[0, "time"]) & (df_dropna.time <= act_tp_df.loc[1, "time"])]
