# Machine Learning for Time Series Data in Python

Time series data is ubiquitous. Whether it be stock market fluctuations, sensor data recording climate change, or activity in the brain, any signal that changes over time can be described as a time series. Machine learning has emerged as a powerful method for leveraging complexity in data in order to generate predictions and insights into the problem one is trying to solve. This course sits at the intersection of these two worlds, machine learning and time series data, and covers feature engineering, spectrograms, and other advanced techniques in order to classify heartbeat sounds and predict stock prices.

## I. Time Series and Machine Learning Primer
* Intro to the basics of machine learning, time series data, and the intersection between the two.

### Timeseries kinds and applications
* Put simply, a **timeseries** means data that changes over time.
* This can take many different forms: atmospheric CO2 over time, the waveform of a spoken word, climate sensor data, the fluctuation of a stock's value over a year, or demographic information about a city.
* **Timeseries data** consists of at least two things: 
    * One: an array of numbers that represents the data itself.
    * Two: another array that contains a timestamp for each datapoint.
* In other words, each datapoint should have a corresponding time point (whether that be a month, year, hour, or any combination of these). Note: multiple data points may have the same time point
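
A minimal sketch of this two-array structure, using made-up values (the names `values` and `timestamps` are illustrative and not part of the course data). It assumes pandas is available and pairs the two arrays in a `Series` whose index holds the timestamps; the spacing between consecutive timestamps can then be read off with `diff()`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: five daily measurements
values = np.array([101.2, 102.5, 101.9, 103.4, 104.0])
timestamps = pd.date_range("2020-01-01", periods=len(values), freq="D")

# Pair each datapoint with its timestamp
series = pd.Series(values, index=timestamps, name="value")
print(series)

# Spacing between consecutive timestamps (one day in this example)
spacing = series.index.to_series().diff().dropna()
print(spacing.unique())
```
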

#### Plotting a pandas timeseries

```
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12,6))
data.plot('date', 'close', ax=ax)
ax.set(title='AAPL daily closing price')
```
* **The amount of time that passes between timestamps defines the *period* of the timeseries.**
    * This often helps us infer what kind of timeseries we're dealing with.
* One crucial part of machine learning is that we can build a model of the world that formalizes our knowledge of the problem at hand. We can...
    * Predict the future
    * Automate this process
    * Use the model as a critical component of an organization's decision making
    
* We treat timeseries data slightly differently than other types of datasets
    * Timeseries data always changes over time, which turns out to be a useful pattern to utilize
    * Using timeseries-specific features lets us see a much richer representation of the raw data.
    
* This course will focus on a simple machine learning pipeline in the context of timeseries data.

* **A machine learning pipeline:**
    * Feature extraction: What kind of special features leverage a signal that changes over time?
    * Model fitting: What kinds of models are suitable for asking questions with timeseries data?
    * Prediction and Validation: How can we validate a model that uses timeseries data? What considerations must we make because it changes in time?

```
# Plot the time series in each dataset
fig, axs = plt.subplots(2, 1, figsize=(5, 10))
data.iloc[:1000].plot(y='data_values', ax=axs[0])
data2.iloc[:1000].plot(y='data_values', ax=axs[1])
plt.show()
```

```
# Plot the time series in each dataset
fig, axs = plt.subplots(2, 1, figsize=(5, 10))
data.iloc[:1000].plot(x='time', y='data_values', ax=axs[0])
data2.iloc[:1000].plot(x='time', y='data_values', ax=axs[1])
plt.show()
```

#### Machine Learning Basics
* Always begin by looking at your data:
    * `array.shape`
    * `array[:3]`
    * `dataframe.head()`
    * `dataframe.info()`
    * `dataframe.describe()`
* It is also crucial to visualize your data:

```
# Using matplotlib
fig, ax = plt.subplots()
ax.plot(...)

# Using pandas
fig, ax = plt.subplots()
df.plot(..., ax=ax)
```
* The proper visualization will depend on the kind of data you've got, though histograms and scatterplots are a good place to start.

* The most popular library for machine-learning in Python is `scikit-learn`.
    * Standardized API so that you can fit many different models with a similar code structure
    
#### Preparing data for scikit-learn
* `scikit-learn` expects a particular structure of data:
    * **`(samples, features)`**
* Make sure that your data is *at least two-dimensional.*
* Make sure the first dimension is *samples*.
* The first axis should correspond to sample number, and the second axis should correspond to feature number.

* If the axes are swapped: **transpose**
    * `array.transpose().shape`
    * `dataframe.T.shape`
    * will swap first and last axis
   
* Use **`.reshape()`** method:
    * lets you specify the shape you want

```
array.shape
array.reshape([-1, 1]).shape
```
    * `-1` tells NumPy to infer the length of that axis from the remaining values
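
A short sketch of both fixes on made-up arrays; the names `array`, `swapped`, and `X` are placeholders for illustration only.

```python
import numpy as np

# Hypothetical 1-D array of 6 samples with a single feature
array = np.arange(6)
print(array.shape)            # (6,)

# Add a feature axis; -1 lets NumPy infer the number of samples
column = array.reshape(-1, 1)
print(column.shape)           # (6, 1)

# Hypothetical array stored as (features, samples)
swapped = np.zeros((3, 100))
X = swapped.T                 # now (samples, features)
print(X.shape)                # (100, 3)
```
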
   
* **Investigating the model:**
    * It is often useful to investigate what kind of pattern the model has found.
    * Most models will store this information in attributes that are created after calling `.fit()`
        * `model.coef_`
        * `model.intercept_`
    * Call `.predict()` on the model to determine labels for unseen datapoints

```
# Generate predictions with the model using those inputs
predictions = model.predict(new_inputs.reshape(-1, 1))

# Visualize the inputs and predicted values
plt.scatter(new_inputs, predictions, color='r', s=3)
plt.xlabel('inputs')
plt.ylabel('predictions')
plt.show()
```
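
To make the fit, inspect, and predict steps concrete, here is a self-contained sketch. It assumes a `LinearRegression` model and noiseless made-up data following `y = 2x + 1`; neither comes from the course materials, they are chosen only so the fitted attributes are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noiseless data following y = 2x + 1
inputs = np.arange(10, dtype=float)
targets = 2 * inputs + 1

# scikit-learn expects X with shape (samples, features)
model = LinearRegression()
model.fit(inputs.reshape(-1, 1), targets)

# Attributes ending in "_" exist only after fitting
print(model.coef_)       # slope, close to [2.]
print(model.intercept_)  # close to 1.0

# Predict for inputs the model has not seen
new_inputs = np.array([10.0, 11.0])
predictions = model.predict(new_inputs.reshape(-1, 1))
print(predictions)       # close to [21. 23.]
```
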

#### Combining timeseries data with machine learning
* Interaction between ML and timeseries data; introduce why they're worth thinking about in tandem.

#### The heartbeat acoustic data
* Why acoustic data? Audio is a very common kind of timeseries data
* Many recordings of heart sounds from different patients
* Some had normally-functioning hearts, others had abnormalities
* Data comes in the form of audio files + labels for each file
* Goal? Can we find the "abnormal" heart beats?

* Audio tends to have a very high sampling frequency (often above 20,000 samples per second)
* Audio data is often stored in `.wav` files
* List all of these files using the `glob` function:
    * lists files that match a certain pattern

```
from glob import glob
files = glob('data/heartbeat-sounds/files/*.wav')
print(files)
```
* We'll use a library called **`librosa`** to read in the audio dataset:

```
import librosa as lr
# 'load' accepts a path to an audio file
audio, sfreq = lr.load('data/heartbeat-sounds/proc/files/murmur__201101051104.wav')
print(sfreq)
```
* Output: `2205`

* **`librosa`** has functions for extracting features, visualizations, and analysis for auditory data
* Import the data using the `load` function
* The data is stored as `audio` and the sampling frequency is stored in `sfreq`
* In this case, the sampling frequency is `2205`, meaning there are `2205` samples per second.

#### Inferring time from samples
* If we know the sampling rate of a timeseries, then we know the timestamp of each datapoint *relative to the first datapoint*.
* Note: this assumes the sampling rate is fixed and also that no data points are lost.
* Now, **we can create an array of timestamps for our data:**
    * Create an array of indices, one for each sample, and divide by the sampling frequency.
    * There are two ways to do so:
        * 1. Generate a range of indices from zero to the number of data points in your audio file; divide each index by the sampling frequency, and you have a timepoint for each data point.
        * 2. Calculate the final timepoint of your audio data using a similar method: find the timestamp of the last data point (index *N-1*). Then use `linspace()` to interpolate from zero to that time, with one point per sample.
    
```
import numpy as np

indices = np.arange(0, len(audio))
time = indices / sfreq
```
***

```
final_time = (len(audio) - 1) / sfreq
# The third argument is the number of points: one per audio sample
time = np.linspace(0, final_time, len(audio))
```

* In either case, you should have an array of numbers of the same length as your audio data
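
As a quick check, the sketch below builds the time array both ways for a made-up signal and confirms that they agree. The random `audio` array is a stand-in for a loaded recording; note that `linspace` needs the number of samples (`len(audio)`) as its third argument to produce one timepoint per sample.

```python
import numpy as np

# Hypothetical stand-in for a loaded recording: 1 second at 2205 Hz
sfreq = 2205
audio = np.random.default_rng(0).normal(size=sfreq)

# Option 1: divide sample indices by the sampling frequency
time_from_indices = np.arange(len(audio)) / sfreq

# Option 2: interpolate from 0 to the time of the last sample
final_time = (len(audio) - 1) / sfreq
time_from_linspace = np.linspace(0, final_time, len(audio))

print(time_from_indices.shape == audio.shape)
print(np.allclose(time_from_indices, time_from_linspace))
```
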

#### The New York Stock Exchange dataset
* This dataset consists of company stock values for 10 years
* This dataset runs over a much longer timespan than the audio data, and has a sampling frequency on the order of one sample per day (compared with 2,205 samples per second with the audio data).
* Can we detect any patterns in historical records that allow us to predict the value of companies in the future?
* As we are predicting a continuous output value, this is a regression problem.

```
import librosa as lr
from glob import glob

# List all the wav files in the folder
audio_files = glob(data_dir + '/*.wav')

# Read in the first audio file, create the time array
audio, sfreq = lr.load(audio_files[0])
time = np.arange(0, len(audio)) / sfreq

# Plot audio over time
fig, ax = plt.subplots()
ax.plot(time, audio)
ax.set(xlabel='Time (s)', ylabel='Sound Amplitude')
plt.show()
```

```
# Read in the data
data = pd.read_csv('prices.csv', index_col=0)

# Convert the index of the DataFrame to datetime
data.index = pd.to_datetime(data.index)
print(data.head())

# Loop through each column, plot its values over time
fig, ax = plt.subplots()
for column in data:
    data[column].plot(ax=ax, label=column)
ax.legend()
plt.show()
```

## II. Timeseries as Inputs to a Model

#### Classification and Feature Engineering

* One of the most common categories for machine learning problems is classification.

* **Always visualize raw data before fitting models!**
* To plot raw audio, we need two things:
    * the raw audio waveform, usually in a one- or two-dimensional array.
    * we also need the timepoint of each sample
    
    
```
ixs = np.arange(audio.shape[-1])
time = ixs / sfreq
fig, ax = plt.subplots()
ax.plot(time, audio)

```
* We can calculate the time by dividing the index of each sample by the sampling frequency of the timeseries.
* This gives us the time for each sample relative to the beginning of the audio.

* **What features to use?**
* Raw timeseries data is too noisy for classification
* We need to calculate features!
* An easy start: summarize your audio data
    * Summary statistics remove the "time" dimension and give us a more traditional classification dataset.
    * For each timeseries, we calculate several summary statistics (min, max, avg, etc.)
    * We have expanded a single feature (raw audio amplitude) into several features $\Rightarrow$ min, max, avg of each timeseries (sample).
    
* How to calculate multiple features for several timeseries:

```
# print: (n_files, time)
print(audio.shape)

means = np.mean(audio, axis=-1)
maxs = np.max(audio, axis = -1)
stds = np.std(audio, axis =-1)

# print: (n_files,)
print(means.shape)
```
* By using the "axis equals -1" keyword, we collapse across the last dimension, which is time.
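
The same idea on a made-up batch of recordings, so the shapes can be checked end to end. The random `audio` array is a placeholder for real data; stacking the results as columns is one way to arrive at the `(samples, features)` layout scikit-learn expects.

```python
import numpy as np

# Hypothetical batch: 4 recordings, 1000 timepoints each -> (n_files, time)
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 1000))

# Collapse the last axis (time) to get one value per recording
means = np.mean(audio, axis=-1)
maxs = np.max(audio, axis=-1)
stds = np.std(audio, axis=-1)

# Stack the 1-D feature arrays as columns -> (n_files, n_features)
X = np.column_stack([means, maxs, stds])
print(means.shape)  # (4,)
print(X.shape)      # (4, 3)
```
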

#### Fitting a classifier with scikit-learn
* We've just collapsed a 2-D dataset (samples x time) into several features of a 1-D dataset (samples)
* We can combine each feature, and use it as input to a model
* If we have a label for each sample, we can use scikit-learn to create and fit a classifier

* **In the case of classification, we also need a label for each timeseries that allows us to build a classifier.**

```
from sklearn.svm import LinearSVC
# Stack the 1-D features as columns; reshape the labels to 2-D
X = np.column_stack([means, maxs, stds])
y = labels.reshape([-1, 1])
model = LinearSVC()
model.fit(X, y)
```

* The **`np.column_stack()`** function lets us stack 1-D arrays by turning them into the columns of a 2-D array
* Additionally, the labels array is 1-D, so we reshape it so that it is 2-D (scikit-learn classifiers also accept a 1-D label array directly)

```
from sklearn.metrics import accuracy_score
# Different input data
predictions = model.predict(X_test)

# Score our model with % correct
# Manually
percent_score = sum(predictions == labels_test) / len(labels_test)
# Using an sklearn scorer
percent_score = accuracy_score(labels_test, predictions)
```

```
# Average across the audio files of each DataFrame
mean_normal = np.mean(normal, axis=1)
mean_abnormal = np.mean(abnormal, axis=1)

# Plot each average over time
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3), sharey=True)
ax1.plot(time, mean_normal)
ax1.set(title="Normal Data")
ax2.plot(time, mean_abnormal)
ax2.set(title="Abnormal Data")
plt.show()
```

```
from sklearn.svm import LinearSVC

# Initialize and fit the model
model = LinearSVC()
model.fit(X_train, y_train)

# Generate predictions and score them manually
predictions = model.predict(X_test)
print(sum(predictions == y_test.squeeze()) / len(y_test))
```

#### Improving the features we use for classification
* A few more features that are unique to timeseries data

#### The auditory envelope
* Smooth the data to calculate the auditory envelope
* Related to the total amount of audio energy present at each moment
* The envelope throws away information about the fine-grained changes in the signal, focusing on the general shape of the audio waveform
* To do this, we'll need to calculate the audio's amplitude, then smooth it over time.
    
#### Smoothing over time
* Instead of averaging over *all* time, we can do a *local* average
* This is called *smoothing* your timeseries
* It removes short-term noise, while retaining the general pattern

#### Calculating a rolling window statistic

```
# Audio is a pandas df
print(audio.shape)
# (n_times, n_audio_files)
```
* Output: `(5000, 20)`

```
# Smooth our data by taking the rolling mean in a window of 50 samples
window_size = 50
windowed = audio.rolling(window=window_size)
audio_smooth = windowed.mean()
```
* The `.rolling()` method returns **an object that can be used to calculate many different statistics within each window.**
* The `window` parameter tells us how many timepoints to include in each window
    * The larger the window, the smoother the result will be
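
To illustrate the effect of the window size, the sketch below smooths a made-up noisy sine wave with two different windows. The signal and the window sizes are arbitrary choices for illustration. With pandas' default settings, the first `window - 1` values of a rolling mean are `NaN` because the window is not yet full.

```python
import numpy as np
import pandas as pd

# Hypothetical noisy signal: a slow sine wave plus random noise
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500)
noisy = pd.Series(np.sin(2 * np.pi * 2 * t) + rng.normal(scale=0.5, size=t.size))

# Rolling means with a small and a large window
smooth_small = noisy.rolling(window=5).mean()
smooth_large = noisy.rolling(window=50).mean()

# The first (window - 1) values are NaN because the window is incomplete
print(smooth_large.isna().sum())  # 49

# Sample-to-sample jitter shrinks as the window grows
print(noisy.diff().std(), smooth_small.diff().std(), smooth_large.diff().std())
```
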
    
#### Calculating the auditory envelope
* First *rectify* (calculate absolute value) your audio, then smooth it
* Called "rectification," because you ensure that all time points are positive.

```
audio_rectified = audio.apply(np.abs)
audio_envelope = audio_rectified.rolling(50).mean()
```
* Once we've calculated the acoustic envelope, we can create better features for our classifier:

#### Feature engineering the envelope

```
# Calculate several features of the envelope, one per audio file
envelope_mean = np.mean(audio_envelope, axis=0)
envelope_std = np.std(audio_envelope, axis=0)
envelope_max = np.max(audio_envelope, axis=0)

# Create our training data for a classifier
X = np.column_stack([envelope_mean, envelope_std, envelope_max])
y = labels.reshape([-1, 1])
```

* `cross_val_score` automates the process of:
    * Splitting data into training/validation sets
    * Fitting the model on training data
    * Scoring it on validation data
    * Repeating this process
    
```
from sklearn.model_selection import cross_val_score
model = LinearSVC()
scores = cross_val_score(model, X, y, cv=3)
print(scores)
```
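
A rough sketch of what `cross_val_score` automates, written as an explicit loop. It relies on scikit-learn's documented behaviour that an integer `cv` with a classifier uses stratified folds, so `StratifiedKFold` is used for the manual version. The data is made up, and `random_state=0` is set only to keep the two runs comparable.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Hypothetical features and labels for two separable classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(30, 3)), rng.normal(3, 1, size=(30, 3))])
y = np.array([0] * 30 + [1] * 30)

model = LinearSVC(random_state=0)

# Manual version: split, fit on training folds, score on the held-out fold
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=3).split(X, y):
    fold_model = clone(model)
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))

# Automated version
scores = cross_val_score(model, X, y, cv=3)
print(manual_scores)
print(scores)
```
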

#### Auditory features: The Tempogram
* We can summarize more complex temporal information with timeseries-specific functions
* `librosa` is a great library for auditory and timeseries feature engineering
* The tempogram attempts to detect particular patterns over time, and summarize them statistically
* Here we'll calculate the *tempogram*, which estimates the tempo of a sound over time.
* We can calculate summary statistics of tempo in the same way that we can for the envelope

#### Computing the Tempogram

```
# Import librosa and calculate the tempo of a 1-D sound array
# (librosa 0.10 and later expose this function as lr.feature.tempo)
import librosa as lr
audio_tempo = lr.beat.tempo(y=audio, sr=sfreq, hop_length=2**6, aggregate=None)
```
***

```
# Plot the raw data first
audio.plot(figsize=(10, 5))
plt.show()

# Rectify the audio signal
audio_rectified = audio.apply(np.abs)

# Plot the result
audio_rectified.plot(figsize=(10, 5))
plt.show()

# Smooth by applying a rolling mean
audio_rectified_smooth = audio_rectified.rolling(50).mean()

# Plot the result
audio_rectified_smooth.plot(figsize=(10, 5))
plt.show()

# Calculate stats
means = np.mean(audio_rectified_smooth, axis=0)
stds = np.std(audio_rectified_smooth, axis=0)
maxs = np.max(audio_rectified_smooth, axis=0)

# Create the X and y arrays
X = np.column_stack([means, stds, maxs])
y = labels.reshape([-1, 1])

# Fit the model and score on testing data
from sklearn.model_selection import cross_val_score
percent_score = cross_val_score(model, X, y, cv=5)
print(np.mean(percent_score))

# Calculate the tempo of the sounds
tempos = []
for col, i_audio in audio.items():
    tempos.append(lr.beat.tempo(y=i_audio.values, sr=sfreq, hop_length=2**6, aggregate=None))

# Convert the list to an array so you can manipulate it more easily
tempos = np.array(tempos)

# Calculate statistics of each tempo
tempos_mean = tempos.mean(axis=-1)
tempos_std = tempos.std(axis=-1)
tempos_max = tempos.max(axis=-1)

# Create the X and y arrays
X = np.column_stack([means, stds, maxs, tempos_mean, tempos_std, tempos_max])
y = labels.reshape([-1, 1])

# Fit the model and score on testing data
percent_score = cross_val_score(model, X, y, cv=5)
print(np.mean(percent_score))
```