# Machine Learning for Time Series Data in Python

Time series data is ubiquitous. Whether it be stock market fluctuations, sensor data recording climate change, or activity in the brain, any signal that changes over time can be described as a time series. Machine learning has emerged as a powerful method for leveraging complexity in data in order to generate predictions and insights into the problem one is trying to solve. This course sits at the intersection of these two worlds, machine learning and time series data, and covers feature engineering, spectrograms, and other advanced techniques in order to classify heartbeat sounds and predict stock prices.

## I. Time Series and Machine Learning Primer
* Intro to the basics of machine learning, time series data, and the intersection between the two.

### Timeseries kinds and applications
* Put simply, a **timeseries** means data that changes over time.
* This can take many different forms: atmospheric CO2 over time, the waveform of a spoken word, climate sensor data, the fluctuation of a stock's value over the year, or demographic information about a city.
* **Timeseries data** consists of at least two things: 
    * One: an array of numbers that represents the data itself.
    * Two: another array that contains a timestamp for each datapoint.
* In other words, each datapoint should have a corresponding time point (whether that be a month, year, hour, or any combination of these). Note: multiple data points may have the same time point
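
As a minimal sketch of the "data array plus timestamp array" idea, the snippet below pairs a small array of made-up measurements with a daily timestamp for each one. The values, the start date, and the name `measurement` are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: five daily measurements
values = np.array([1.2, 1.5, 1.1, 1.8, 2.0])

# One timestamp per datapoint (daily spacing is an assumption)
timestamps = pd.date_range("2020-01-01", periods=len(values), freq="D")

# Pair each datapoint with its timestamp
series = pd.Series(values, index=timestamps, name="measurement")
print(series)

# The spacing between consecutive timestamps
spacing = series.index[1] - series.index[0]
print(spacing)
```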

#### Plotting a pandas timeseries

```
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12,6))
data.plot('date', 'close', ax=ax)
ax.set(title='AAPL daily closing price')
```
* **The amount of time that passes between timestamps defines the *period* of the timeseries.**
    * This often helps us infer what kind of timeseries we're dealing with.
* One crucial part of machine learning is that we can build a model of the world that formalizes our knowledge of the problem at hand. With such a model, we can:
    * Predict the future
    * Automate this process
    * Support an organization's decision making, where such predictions can be a critical component
    
* We treat timeseries data slightly differently than other types of datasets
    * Timeseries data always change over time, which turns out to be a useful pattern to utilize
    * Using timeseries-specific features lets us see a much richer representation of the raw data.
    
* This course will focus on a simple machine learning pipeline in the context of timeseries data.

* **A machine learning pipeline:**
    * Feature extraction: What kind of special features leverage a signal that changes over time?
    * Model fitting: What kinds of models are suitable for asking questions with timeseries data?
    * Prediction and Validation: How can we validate a model that uses timeseries data? What considerations must we make because it changes in time?

```
# Plot the time series in each dataset
fig, axs = plt.subplots(2, 1, figsize=(5, 10))
data.iloc[:1000].plot(y='data_values', ax=axs[0])
data2.iloc[:1000].plot(y='data_values', ax=axs[1])
plt.show()
```

```
# Plot the time series in each dataset
fig, axs = plt.subplots(2, 1, figsize=(5, 10))
data.iloc[:1000].plot(x='time', y='data_values', ax=axs[0])
data2.iloc[:1000].plot(x='time', y='data_values', ax=axs[1])
plt.show()
```

#### Machine Learning Basics
* Always begin by looking at your data:
    * `array.shape`
    * `array[:3]`
    * `dataframe.head()`
    * `dataframe.info()`
    * `dataframe.describe()`
* It is also crucial to visualize your data:

```
# Using matplotlib
fig, ax = plt.subplots()
ax.plot(...)

# Using pandas
fig, ax = plt.subplots()
df.plot(..., ax=ax)
```
* The proper visualization will depend on the kind of data you've got, though histograms and scatterplots are a good place to start.

* The most popular library for machine-learning in Python is `scikit-learn`.
    * Standardized API so that you can fit many different models with a similar code structure
    
#### Preparing data for scikit-learn
* `scikit-learn` expects a particular structure of data:
    * **`(samples, features)`**
* Make sure that your data is *at least two-dimensional.*
* Make sure the first dimension is *samples*.
* The first axis should correspond to sample number, and the second axis should correspond to feature number.

* If the axes are swapped: **transpose**
    * `array.transpose().shape`
    * `dataframe.T.shape`
    * will swap first and last axis
   
* Use **`.reshape()`** method:
    * lets you specify the shape you want

```
array.shape
array.reshape([-1, 1]).shape
```
   * `-1` will automatically fill that axis with remaining values
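
A short sketch of both fixes, using placeholder arrays (the names `wrong_way` and `right_way` are hypothetical): reshaping a 1-D array into one column, and transposing an array whose axes are the wrong way round.

```python
import numpy as np

# A 1-D array of 6 samples: not yet in (samples, features) form
array = np.arange(6)
print(array.shape)  # (6,)

# Reshape into 6 samples with 1 feature each; -1 means "infer this axis"
X = array.reshape(-1, 1)
print(X.shape)  # (6, 1)

# Hypothetical array stored as (features, samples): transpose it
wrong_way = np.zeros((3, 100))
right_way = wrong_way.T
print(right_way.shape)  # (100, 3)
```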
   
* **Investigating the model:**
    * It is often useful to investigate what kind of pattern the model has found.
    * Most models will store this information in attributes that are created after calling `.fit()`
        * `model.coef_`
        * `model.intercept_`
    * Call `.predict()` on the model to determine labels for unseen datapoints

```
# Generate predictions with the model using those inputs
predictions = model.predict(new_inputs.reshape(-1, 1))

# Visualize the inputs and predicted values
plt.scatter(new_inputs, predictions, color='r', s=3)
plt.xlabel('inputs')
plt.ylabel('predictions')
plt.show()
```

#### Combining timeseries data with machine learning
* Interaction between ML and timeseries data; introduce why they're worth thinking about in tandem.

#### The heartbeat acoustic data
* Why acoustic data? Audio is a very common kind of timeseries data
* Many recordings of heart sounds from different patients
* Some had normally-functioning hearts, others had abnormalities
* Data comes in the form of audio files + labels for each file
* Goal? Can we find the "abnormal" heart beats?

* Audio tends to have a very high sampling frequency (often above 20,000 samples per second)
* Audio data is often stored in `.wav` files
* We can list all of these files using the `glob` function:
    * It lists files that match a certain pattern

```
from glob import glob
files = glob('data/heartbeat-sounds/files/*.wav')
print(files)
```
* We'll use a library called **`librosa`** to read in the audio dataset:

```
import librosa as lr
# 'load' accepts a path to an audio file
audio, sfreq = lr.load('data/heartbeat-sounds/proc/files/murmur__201101051104.wav')
print(sfreq)
```
* Output: `2205`

* **`librosa`** has functions for extracting features, visualizations, and analysis for auditory data
* Import the data using the `load` function
* The data is stored as `audio` and the sampling frequency is stored in `sfreq`
* In this case, the sampling frequency is `2205`, meaning there are `2205` samples per second.

#### Inferring time from samples
* If we know the sampling rate of a timeseries, then we know the timestamp of each datapoint *relative to the first datapoint*.
* Note: this assumes the sampling rate is fixed and also that no data points are lost.
* Now, **we can create an array of timestamps for our data:**
    * Create an array of indices, one for each sample, and divide by the sampling frequency.
    * To do so, there are two options:
        * 1. Generate a range of indices from zero to the number of data points in your audio file; divide each index by the sampling frequency, and you have a timepoint for each data point.
        * 2. Calculate the final timepoint of your audio data using a similar method: find the timestamp for the *(N-1)*th data point, then use `linspace()` to interpolate from zero to that time.
    
```
indices = np.arange(0, len(audio))
time = indices / sfreq
```
***

```
final_time = (len(audio) - 1) / sfreq
# The third argument is the number of points, so it must match the audio length
time = np.linspace(0, final_time, len(audio))
```

* In either case, you should have an array of numbers of the same length as your audio data
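
The following sketch checks that the two approaches agree. It uses a placeholder signal of zeros rather than real audio, and reuses the sampling frequency quoted in these notes. Note that the third argument to `np.linspace` is the number of points, so it has to be `len(audio)`.

```python
import numpy as np

sfreq = 2205                 # sampling frequency quoted in these notes
audio = np.zeros(3 * sfreq)  # placeholder signal: three seconds of silence

# Option 1: divide each sample index by the sampling frequency
time_a = np.arange(len(audio)) / sfreq

# Option 2: interpolate from zero to the final timepoint
final_time = (len(audio) - 1) / sfreq
time_b = np.linspace(0, final_time, len(audio))

print(time_a.shape, time_b.shape)
print(np.allclose(time_a, time_b))
```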

#### The New York Stock Exchange dataset
* This dataset consists of company stock values for 10 years
* This dataset runs over a much longer timespan than the audio data, and has a sampling frequency on the order of one sample per day (compared with 2,205 samples per second with the audio data).
* Can we detect any patterns in historical records that allow us to predict the value of companies in the future?
* As we are predicting a continuous output value, this is a regression problem.

```
import librosa as lr
from glob import glob

# List all the wav files in the folder
audio_files = glob(data_dir + '/*.wav')

# Read in the first audio file, create the time array
audio, sfreq = lr.load(audio_files[0])
time = np.arange(0, len(audio)) / sfreq

# Plot audio over time
fig, ax = plt.subplots()
ax.plot(time, audio)
ax.set(xlabel='Time (s)', ylabel='Sound Amplitude')
plt.show()
```

```
# Read in the data
data = pd.read_csv('prices.csv', index_col=0)

# Convert the index of the DataFrame to datetime
data.index = pd.to_datetime(data.index)
print(data.head())

# Loop through each column, plot its values over time
fig, ax = plt.subplots()
for column in data:
    data[column].plot(ax=ax, label=column)
ax.legend()
plt.show()
```

## II. Timeseries as Inputs to a Model

#### Classification and Feature Engineering

* One of the most common categories for machine learning problems is classification.

* **Always visualize raw data before fitting models!**
* To plot raw audio, we need two things:
    * The raw audio waveform, usually in a one- or two-dimensional array
    * The timepoint of each sample
    
    
```
ixs = np.arange(audio.shape[-1])
time = ixs / sfreq
fig, ax = plt.subplots()
ax.plot(time, audio)

```
* We can calculate the time by dividing the index of each sample by the sampling frequency of the timeseries.
* This gives us the time for each sample relative to the beginning of the audio.

* **What features to use?**
* Raw timeseries data is usually too noisy to use directly for classification
* We need to calculate features!
* An easy start: summarize your audio data
    * Summary statistics remove the "time" dimension and give us a more traditional classification dataset.
    * For each timeseries, we calculate several summary statistics (min, max, avg, etc.)
    * We have expanded a single feature (raw audio amplitude) into several features $\Rightarrow$ the min, max, and avg of each sample (audio file).
    
* How to calculate multiple features for several timeseries 

```
# print: (n_files, time)
print(audio.shape)

means = np.mean(audio, axis=-1)
maxs = np.max(audio, axis = -1)
stds = np.std(audio, axis =-1)

# print: (n_files,)
print(means.shape)
```
* By using the "axis equals -1" keyword, we collapse across the last dimension, which is time.

#### Fitting a classifier with scikit-learn
* We've just collapsed a 2-D dataset (samples x time) into several features of a 1-D dataset (samples)
* We can combine each feature, and use it as input to a model
* If we have a label for each sample, we can use scikit-learn to create and fit a classifier

* **In the case of classification, we also need a label for each timeseries that allows us to build a classifier.**

```
from sklearn.svm import LinearSVC
# Stack the 1-D features into columns; reshape the labels into a column
X = np.column_stack([means, maxs, stds])
y = labels.reshape([-1, 1])
model = LinearSVC()
model.fit(X, y)
```

* The **`.column_stack()`** function lets us stack 1-D arrays by turning them into the columns of a 2-D array
* Additionally, the labels array is 1-D, so we reshape it so that it is 2-D (scikit-learn classifiers also accept a 1-D label array, and may warn when given a column vector)
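
To make the stacking step concrete, here is a small sketch with made-up feature values for four audio files. Each 1-D array becomes one column of the resulting `(samples, features)` matrix.

```python
import numpy as np

# Hypothetical summary features for 4 audio files
means = np.array([0.1, 0.2, 0.3, 0.4])
maxs = np.array([1.0, 1.1, 0.9, 1.2])
stds = np.array([0.5, 0.4, 0.6, 0.5])

# Each 1-D array becomes one column
X = np.column_stack([means, maxs, stds])
print(X.shape)  # (4, 3)

# The second column holds the max of every file
print(X[:, 1])
```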

```
from sklearn.metrics import accuracy_score
# Different input data
predictions = model.predict(X_test)

# Score our model with % correct
# Manually
percent_score = sum(predictions == labels_test) / len(labels_test)
# Using an sklearn scorer
percent_score = accuracy_score(labels_test, predictions)
```

```
# Average across the audio files of each DataFrame
mean_normal = np.mean(normal, axis=1)
mean_abnormal = np.mean(abnormal, axis=1)

# Plot each average over time
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3), sharey=True)
ax1.plot(time, mean_normal)
ax1.set(title="Normal Data")
ax2.plot(time, mean_abnormal)
ax2.set(title="Abnormal Data")
plt.show()
```

```
from sklearn.svm import LinearSVC

# Initialize and fit the model
model = LinearSVC()
model.fit(X_train, y_train)

# Generate predictions and score them manually
predictions = model.predict(X_test)
print(sum(predictions == y_test.squeeze()) / len(y_test))
```

#### Improving the features we use for classification
* A few more features that are unique to timeseries data

#### The auditory envelope
* Smooth the data to calculate the auditory envelope
* Related to the total amount of audio energy present at each moment
* The envelope throws away information about the fine-grained changes in the signal, focusing on the general shape of the audio waveform
* To do this, we'll need to calculate the audio's amplitude, then smooth it over time.
    
#### Smoothing over time
* Instead of averaging over *all* time, we can do a *local* average
* This is called *smoothing* your timeseries
* It removes short-term noise, while retaining the general pattern

#### Calculating a rolling window statistic

```
# Audio is a pandas df
print(audio.shape)
# (n_times, n_audio_files)
```
* Output: `(5000, 20)`

```
# Smooth our data by taking the rolling mean in a window of 50 samples
window_size = 50
windowed = audio.rolling(window=window_size)
audio_smooth = windowed.mean()
```
* The `.rolling()` method returns **an object that can be used to calculate many different statistics within each window.**
* The `window` parameter tells us how many timepoints to include in each window
    * The larger the window, the smoother the result will be
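
The effect of the window size can be illustrated on synthetic data. The sketch below assumes a noisy sine wave (not the heartbeat recordings) and compares a small and a large window. Note that the first `window - 1` entries of a rolling mean are `NaN`, because a full window is not yet available.

```python
import numpy as np
import pandas as pd

# Synthetic noisy signal (an assumption for illustration)
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500)
noisy = pd.Series(np.sin(2 * np.pi * 2 * t) + rng.normal(scale=0.5, size=t.size))

# Rolling means with a small and a large window
smooth_small = noisy.rolling(window=5).mean()
smooth_large = noisy.rolling(window=50).mean()

# The first window - 1 values have no full window yet, so they are NaN
n_missing = smooth_large.isna().sum()
print(n_missing)

# Point-to-point variation shrinks as the window grows
print(smooth_small.diff().std(), smooth_large.diff().std())
```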
    
#### Calculating the auditory envelope
* First *rectify* (calculate absolute value) your audio, then smooth it
* Called "rectification," because you ensure that all time points are positive.

```
audio_rectified = audio.apply(np.abs)
audio_envelope = audio_rectified.rolling(50).mean()
```
* Once we've calculated the acoustic envelope, we can create better features for our classifier:

#### Feature engineering the envelope

```
# Calculate several features of the envelope, one per sound
envelope_mean = np.mean(audio_envelope, axis=0)
envelope_std = np.std(audio_envelope, axis=0)
envelope_max = np.max(audio_envelope, axis=0)

# Create our training data for a classifier
X = np.column_stack([envelope_mean, envelope_std, envelope_max])
y = labels.reshape([-1, 1])
```
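
For a self-contained picture of why the envelope is informative, the sketch below builds a synthetic "sound" (a fast oscillation whose loudness varies slowly) and compares its raw mean with its envelope mean. The signal, the column name `file_0`, and the sampling frequency are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

sfreq = 1000  # assumed sampling frequency
time = np.arange(2 * sfreq) / sfreq

# A 100 Hz oscillation whose loudness rises and falls once per second
loudness = 1 + 0.5 * np.sin(2 * np.pi * 1 * time)
audio = pd.DataFrame({"file_0": loudness * np.sin(2 * np.pi * 100 * time)})

# Rectify, then smooth with a rolling mean
audio_rectified = audio.apply(np.abs)
audio_envelope = audio_rectified.rolling(50).mean()

# The raw signal averages out to roughly zero; the envelope does not
raw_mean = audio["file_0"].mean()
envelope_mean = audio_envelope["file_0"].mean()
print(raw_mean, envelope_mean)

# One row per sound, one column per envelope feature
X = np.column_stack([audio_envelope.mean(), audio_envelope.std(), audio_envelope.max()])
print(X.shape)
```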

* `cross_val_score` automates the process of:
    * Splitting data into training/validation sets
    * Fitting the model on training data
    * Scoring it on validation data
    * Repeating this process
    
```
from sklearn.model_selection import cross_val_score
model = LinearSVC()
scores = cross_val_score(model, X, y, cv=3)
print(scores)
```

#### Auditory features: The Tempogram
* We can summarize more complex temporal information with timeseries-specific functions
* `librosa` is a great library for auditory and timeseries feature engineering
* The tempogram attempts to detect particular patterns over time, and summarize them statistically
* Here we'll calculate the *tempogram*, which estimates the tempo of a sound over time.
* We can calculate summary statistics of tempo in the same way that we can for the envelope

#### Computing the Tempogram

```
# Import librosa and calculate the tempo of a 1-D sound array
import librosa as lr
audio_tempo = lr.beat.tempo(audio, sr=sfreq, hop_length=2**6, aggregate=None)
```
***

```
# Plot the raw data first
audio.plot(figsize=(10, 5))
plt.show()

# Rectify the audio signal
audio_rectified = audio.apply(np.abs)

# Plot the result
audio_rectified.plot(figsize=(10, 5))
plt.show()

# Smooth by applying a rolling mean
audio_rectified_smooth = audio_rectified.rolling(50).mean()

# Plot the result
audio_rectified_smooth.plot(figsize=(10, 5))
plt.show()

# Calculate stats
means = np.mean(audio_rectified_smooth, axis=0)
stds = np.std(audio_rectified_smooth, axis=0)
maxs = np.max(audio_rectified_smooth, axis=0)

# Create the X and y arrays
X = np.column_stack([means, stds, maxs])
y = labels.reshape([-1, 1])

# Fit the model and score on testing data
from sklearn.model_selection import cross_val_score
percent_score = cross_val_score(model, X, y, cv=5)
print(np.mean(percent_score))

# Calculate the tempo of the sounds
tempos = []
for col, i_audio in audio.items():
    tempos.append(lr.beat.tempo(i_audio.values, sr=sfreq, hop_length=2**6, aggregate=None))

# Convert the list to an array so you can manipulate it more easily
tempos = np.array(tempos)

# Calculate statistics of each tempo
tempos_mean = tempos.mean(axis=-1)
tempos_std = tempos.std(axis=-1)
tempos_max = tempos.max(axis=-1)

# Create the X and y arrays
X = np.column_stack([means, stds, maxs, tempos_mean, tempos_std, tempos_max])
y = labels.reshape([-1, 1])

# Fit the model and score on testing data
percent_score = cross_val_score(model, X, y, cv=5)
print(np.mean(percent_score))
```

#### The spectrogram: spectral changes to sound over time
* The spectrogram is a special case of timeseries features
* Spectrograms are common in timeseries analysis

#### Fourier transforms
* Timeseries data can be described as a combination of quickly-changing things and slowly-changing things
* At each moment in time, we can describe the relative presence of fast- and slow-moving components
* The simplest way to do this is called a *Fourier Transform*
* This converts a single timeseries into an array that describes the timeseries as a combination of oscillations.
* This is the **Fourier Transform**, usually computed with the **Fast Fourier Transform (FFT)** algorithm
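
As a rough illustration, the sketch below builds a signal from one slow and one fast oscillation (5 Hz and 50 Hz, both arbitrary choices) and uses NumPy's FFT routines to recover those two frequencies. This is a sketch on synthetic data, not part of the heartbeat analysis.

```python
import numpy as np

sfreq = 1000                     # assumed sampling frequency
time = np.arange(sfreq) / sfreq  # exactly one second of samples

# A slow oscillation plus a weaker fast oscillation
signal = np.sin(2 * np.pi * 5 * time) + 0.5 * np.sin(2 * np.pi * 50 * time)

# Amplitude of each oscillation present in the signal
amplitudes = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sfreq)

# The two strongest frequencies, sorted from low to high
top_two = np.sort(freqs[np.argsort(amplitudes)[-2:]])
print(top_two)
```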

#### Spectrograms: combinations of windowed Fourier Transforms
* We can calculate multiple Fourier transforms in a sliding window to see how the spectral content changes over time.
* A **spectrogram** is a collection of windowed Fourier transforms over time
* Similar to how a rolling mean was calculated:
    * 1. Choose a window size and shape
    * 2. At a timepoint, calculate the FFT for that window
    * 3. Slide the window over by one
    * 4. Aggregate the results
* Called a **Short-Time Fourier Transform (STFT)**
* Result: a description of the Fourier transform as it changes through the time series
* To calculate the spectrogram, we square the magnitude of each value of the STFT
* Spectral content of the sound changes over time. With speech, we see interesting patterns that correspond to spoken words (for example, vowels or consonants)
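
The steps above can be sketched directly in NumPy. The helper `simple_stft` below is a hypothetical, simplified stand-in written for illustration (it is not `librosa`'s implementation); it slides the window by a hop of several samples rather than one, and the test signal switches from 50 Hz to 200 Hz halfway through.

```python
import numpy as np

def simple_stft(signal, window_size=128, hop_length=16):
    """Windowed FFTs; returns an array of shape (n_freqs, n_windows)."""
    window = np.hanning(window_size)
    starts = range(0, len(signal) - window_size + 1, hop_length)
    frames = [np.fft.rfft(window * signal[s:s + window_size]) for s in starts]
    return np.array(frames).T

sfreq = 1000  # assumed sampling frequency
time = np.arange(2 * sfreq) / sfreq

# 50 Hz during the first second, 200 Hz during the second
signal = np.where(time < 1,
                  np.sin(2 * np.pi * 50 * time),
                  np.sin(2 * np.pi * 200 * time))

# The spectrogram is the squared magnitude of the STFT
spectrogram = np.abs(simple_stft(signal)) ** 2
freqs = np.fft.rfftfreq(128, d=1 / sfreq)
print(spectrogram.shape)

# Strongest frequency in the first and in the last window
freq_first = freqs[spectrogram[:, 0].argmax()]
freq_last = freqs[spectrogram[:, -1].argmax()]
print(freq_first, freq_last)
```

The recovered frequencies are only as precise as the frequency resolution of the window (here `sfreq / 128`, about 7.8 Hz), which is the usual trade-off: longer windows resolve frequency better but blur changes over time.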

#### Calculating the STFT
* We can calculate the STFT with `librosa`
* There are several parameters we can tweak (such as window size)
* For our purposes, we'll convert into *decibels* which normalizes the average values of all frequencies
* We can visualize it with the `specshow()` function (results in the visualized spectrogram)

#### Calculating the STFT with code

```
# Import the functions we'll use for the STFT
from librosa.core import stft, amplitude_to_db
from librosa.display import specshow

# Calculate our STFT
HOP_LENGTH = 2**4
SIZE_WINDOW = 2**7
audio_spec = stft(audio, hop_length= HOP_LENGTH, n_fft= SIZE_WINDOW)

# Convert into decibels for visualization
spec_db = amplitude_to_db(audio_spec)

# Visualize
specshow(spec_db, sr = sfreq, x_axis = 'time', y_axis= 'hz', hop_length= HOP_LENGTH)

```

#### Spectral feature engineering
* Each timeseries has a different spectral pattern
* We can calculate these spectral patterns by analyzing the spectrogram.
* For example, **spectral bandwidth** and **spectral centroids** describe where most of the energy is at each moment in time.
* This means we can use patterns in the spectrogram to distinguish classes from one another
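
To give some intuition for these two features, the sketch below computes a magnitude-weighted mean frequency and a weighted spread for a single time slice. The function `centroid_and_spread` is a hypothetical helper written for illustration; `librosa`'s exact definitions and defaults may differ in detail.

```python
import numpy as np

def centroid_and_spread(magnitudes, freqs):
    """Weighted mean frequency and weighted standard deviation around it."""
    weights = magnitudes / magnitudes.sum()
    centroid = np.sum(weights * freqs)
    spread = np.sqrt(np.sum(weights * (freqs - centroid) ** 2))
    return centroid, spread

# Four frequency bins (Hz) and two made-up spectra for one moment in time
freqs = np.array([100.0, 200.0, 300.0, 400.0])
narrow = np.array([0.0, 1.0, 1.0, 0.0])  # energy concentrated in the middle
wide = np.array([1.0, 0.0, 0.0, 1.0])    # energy at the extremes

narrow_centroid, narrow_spread = centroid_and_spread(narrow, freqs)
wide_centroid, wide_spread = centroid_and_spread(wide, freqs)

print(narrow_centroid, narrow_spread)  # 250.0 50.0
print(wide_centroid, wide_spread)      # 250.0 150.0
```

Both spectra share the same centroid, but the spread separates them, which is why the two features are often used together.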

#### Calculating spectral features: spectral centroid and bandwidth 

```
# Calculate the spectral centroid and bandwidth for the spectrogram
# (np.abs ensures these features receive a real-valued magnitude spectrogram)
bandwidths = lr.feature.spectral_bandwidth(S=np.abs(spec))[0]
centroids = lr.feature.spectral_centroid(S=np.abs(spec))[0]

# Display these features on top of the spectrogram
fig, ax = plt.subplots()
specshow(spec, x_axis='time', y_axis='hz', hop_length=HOP_LENGTH, ax=ax)
ax.plot(times_spec, centroids)
ax.fill_between(times_spec, centroids - bandwidths / 2,
                centroids + bandwidths / 2, alpha=0.5)
```

#### Combining spectral and temporal features in a classifier

```
centroids_all = []
bandwidths_all = []
for spec in spectrograms:
    bandwidths = lr.feature.spectral_bandwidth(S=lr.db_to_amplitude(spec))
    centroids = lr.feature.spectral_centroid(S=lr.db_to_amplitude(spec))
    # Calculate the mean spectral bandwidth
    bandwidths_all.append(np.mean(bandwidths))
    # Calculate the mean spectral centroid
    centroids_all.append(np.mean(centroids))
    
# Create our X matrix
X = np.column_stack([means, stds, maxs, tempo_mean, tempo_max, tempo_std, bandwidths_all, centroids_all])
```

* As we include more informative features in our model, performance often improves, although this is not guaranteed and should always be checked with cross-validation.

#### Spectrograms of heartbeat audio

```
# Import the stft function
from librosa.core import stft

# Prepare the STFT
HOP_LENGTH = 2**4
spec = stft(audio, hop_length=HOP_LENGTH, n_fft=2**7)

from librosa.core import amplitude_to_db
from librosa.display import specshow

# Convert into decibels
spec_db = amplitude_to_db(spec)

# Compare the raw audio to the spectrogram of the audio
fig, axs = plt.subplots(2, 1, figsize=(10, 10), sharex=True)
axs[0].plot(time, audio)
specshow(spec_db, sr=sfreq, x_axis='time', y_axis='hz', hop_length=HOP_LENGTH)
plt.show()

import librosa as lr

# Calculate the spectral centroid and bandwidth for the spectrogram
# (np.abs converts the complex STFT into a real-valued magnitude spectrogram)
bandwidths = lr.feature.spectral_bandwidth(S=np.abs(spec))[0]
centroids = lr.feature.spectral_centroid(S=np.abs(spec))[0]

from librosa.core import amplitude_to_db
from librosa.display import specshow

# Convert spectrogram to decibels for visualization
spec_db = amplitude_to_db(spec)

# Display these features on top of the spectrogram
fig, ax = plt.subplots(figsize=(10, 5))
specshow(spec_db, x_axis='time', y_axis='hz', hop_length=HOP_LENGTH, ax=ax)
ax.plot(times_spec, centroids)
ax.fill_between(times_spec, centroids - bandwidths / 2, centroids + bandwidths / 2, alpha=.5)
ax.set(ylim=[None, 6000])
plt.show()

# Loop through each spectrogram
bandwidths = []
centroids = []

for spec in spectrograms:
    # Calculate the mean spectral bandwidth
    this_mean_bandwidth = np.mean(lr.feature.spectral_bandwidth(S=spec))
    # Calculate the mean spectral centroid
    this_mean_centroid = np.mean(lr.feature.spectral_centroid(S=spec))
    # Collect the values
    bandwidths.append(this_mean_bandwidth)  
    centroids.append(this_mean_centroid)
    
# Create X and y arrays
X = np.column_stack([means, stds, maxs, tempo_mean, tempo_max, tempo_std, bandwidths, centroids])
y = labels.reshape([-1, 1])

# Fit the model and score on testing data
percent_score = cross_val_score(model, X, y, cv=5)
print(np.mean(percent_score))
```

## III. Predicting Time Series Data
If you want to predict patterns from data over time, there are special considerations to take in how you choose and construct your model. This chapter covers how to gain insights into the data before fitting your model, as well as best-practices in using predictive modeling for time series data.

#### Predicting data over time
* In this chapter, we shift our focus from classification to regression.
* Regression has several features and caveats that are unique to timeseries data.
* **Regression** is similar to calculating correlation, with some key differences:
    * **Regression:** A process that results in a formal model of the data
    * **Correlation:** A statistic that describes the data. It carries less information than a regression model.

* **Correlation between variables often changes over time**
    * Timeseries often have patterns that change over time
    * Two timeseries that seem correlated at one moment may not remain so over time
* To visualize how the data changes over time, you can either plot the whole timeseries at once, or directly compare two segments of time.
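
One way to see a changing relationship numerically is a rolling correlation. The sketch below uses two synthetic series (an assumption for illustration): `y` follows `x` during the first half and moves against it during the second half, so the overall correlation is close to zero even though the local correlation is strong throughout.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 400
x = pd.Series(rng.normal(size=n))

# y tracks x in the first half and mirrors it in the second half
sign = np.where(np.arange(n) < n // 2, 1.0, -1.0)
y = pd.Series(sign * x + rng.normal(scale=0.1, size=n))

# A single correlation over the whole series hides the change
overall = x.corr(y)
print(overall)

# Correlation within a sliding window of 50 samples
rolling_corr = x.rolling(window=50).corr(y)
print(rolling_corr.iloc[150], rolling_corr.iloc[350])
```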

#### Visualizing relationships between timeseries: 2 ways

```
fig, axs = plt.subplots(1,2)

# Make a line plot for each timeseries
axs[0].plot(x, c='k', lw=3, alpha=.2)
axs[0].plot(y)
axs[0].set(xlabel='time', title='X values = time')

# Encode time as color in a scatterplot
axs[1].scatter(x_long, y_long, c=np.arange(len(x_long)), cmap = 'viridis')
axs[1].set(xlabel='x', ylabel='y', title='Color = time')
```

#### Regression models with sklearn

```
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X,y)
model.predict(X)
```

#### Visualize predictions with sklearn
```
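from sklearn.linear_model import Ridge

# Assumed setup: an axis to draw on and a colormap for the model lines
fig, ax = plt.subplots()
cmap = plt.cm.viridis
alphas = [.1, 1e2, 1e3]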
ax.plot(y_test, color='k', alpha=.3, lw=3)
for ii, alpha in enumerate(alphas):
    y_predicted = Ridge(alpha = alpha).fit(X_train, y_train).predict(X_test)
    ax.plot(y_predicted, c=cmap(ii / len(alphas)))
ax.legend(['True values', 'Model 1', 'Model 2', 'Model 3'])
ax.set(xlabel='Time')
```
* Here (above), we visualize the predictions from several different models fit on the same data
* Ridge regression has a parameter called "alpha" that causes coefficients to be smoother and smaller, and is useful if you have noisy or correlated variables.
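
The shrinking effect of alpha can be sketched with the closed-form ridge solution in NumPy. This is a simplified illustration on synthetic, strongly correlated inputs; the helper `ridge_coefficients` is hypothetical and, unlike scikit-learn's `Ridge` with default settings, it fits no intercept.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Two nearly identical (highly correlated) input variables
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=200)

# Larger alpha values pull the coefficients towards zero
alphas = [0.1, 1e2, 1e3]
norms = []
for alpha in alphas:
    coefs = ridge_coefficients(X, y, alpha)
    norms.append(np.linalg.norm(coefs))
    print(alpha, coefs)
```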

#### Scoring regression models
* Visualizing is useful, but not quantifiable.
* Two most common methods of scoring regression models:
    * Correlation (*r*, or $\rho$) -- simplest
    * Coefficient of Determination ($R^2$) -- most common
        * Value bounded on the top by 1, and can be infinitely low (since models can be infinitely bad)
        * Values closer to 1 mean the model does a better job of predicting outputs
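
To make the bounds concrete, here is a small NumPy sketch of the usual definition, $R^2 = 1 - SS_{res} / SS_{tot}$, applied to made-up predictions. The function name `r_squared` is hypothetical; note that the true values and the predictions play different roles, so their order matters.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = y_true.copy()                          # exact predictions
mean_only = np.full_like(y_true, y_true.mean())  # always predict the mean
reversed_pred = y_true[::-1]                     # predictions in the wrong order

print(r_squared(y_true, perfect))        # 1.0
print(r_squared(y_true, mean_only))      # 0.0
print(r_squared(y_true, reversed_pred))  # -3.0
```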

```
from sklearn.metrics import r2_score
# r2_score expects the true values first, then the predictions
print(r2_score(y_test, y_predicted))
```

```
# Scatterplot with color relating to time
prices.plot.scatter('EBAY', 'YHOO', c=prices.index, 
                    cmap=plt.cm.viridis, colorbar=False)
plt.show()
```

```
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Use stock symbols to extract training data
X = all_prices[['EBAY', 'NVDA', 'YHOO']]
y = all_prices[['AAPL']]

# Fit and score the model with cross-validation
scores = cross_val_score(Ridge(), X, y, cv=3)
print(scores)
```

#### Cleaning and improving your data

#### Interpolation: using time to fill in missing data
* A common way to deal with missing data is to *interpolate* missing values
* With timeseries data, you can use time to assist in interpolation
* In this case, **interpolation** means using the *known* values on either side of a gap in the data to make assumptions about what's missing.
    
#### Interpolation in pandas

```
# Return a boolean that notes where missing values are
missing = prices.isna()

# Interpolate linearly within missing windows
prices_interp = prices.interpolate('linear')

# Plot the interpolated data in red and the data with missing values in black
ax = prices_interp.plot(c='r')
prices.plot(c='k', ax=ax, lw=2)
```
* Above we used the `linear` method to interpolate, but other methods (such as `quadratic`) will fill the gaps differently.
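
A tiny worked example shows how the choice of method changes the result. The series below is made up, with unevenly spaced dates: the `linear` method treats the points as equally spaced, while the `time` method takes the actual timestamps into account.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one missing value and uneven spacing in time
index = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04"])
prices_small = pd.Series([1.0, np.nan, 4.0], index=index)

# 'linear' ignores the index and assumes equal spacing
filled_linear = prices_small.interpolate("linear")

# 'time' weights by the actual gaps between timestamps
filled_time = prices_small.interpolate("time")

print(filled_linear.iloc[1])  # 2.5
print(filled_time.iloc[1])    # 2.0
```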

#### Using a rolling window to transform data
* Another common use of rolling windows is to transform the data
* We've already done this once, in order to *smooth* the data
* However, we can also use this to do more complex transformations

#### Transforming data to standardize variance
* A common transformation to apply to data is to standardize its mean and variance over time. There are many ways to do this.
* Here, we'll show how to convert your dataset so that each point represents the *% change over a previous window*.
* This makes timepoints more comparable to one another if the absolute values of the data change a lot.
* This standardizes our data and reduces long-term drift.

```
def percent_change(values):
    """Calculates the % change between the last value and the mean of previous values"""
    # Separate the last value and all previous values into variables
    previous_values = values[:-1]
    last_value = values[-1]

    # Calculate the % difference between the last value and the mean of earlier values
    percent_change = (last_value - np.mean(previous_values)) / np.mean(previous_values)
    return percent_change
```

* In the above function, we first separate out the final value of the input array 
* Then, we calculate the mean of all but the last data point
* Finally, we subtract the mean from the final datapoint, and divide by the mean 
* The result is the percent change for the final value
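
A quick worked example on made-up numbers: if the previous values average 10 and the final value is 13, the function returns 0.3, that is, a 30% increase. The sketch repeats the function so that it runs on its own, and passes `raw=True` to the rolling `apply` (an assumption made here so that each window arrives as a plain NumPy array).

```python
import numpy as np
import pandas as pd

def percent_change(values):
    """% change between the last value and the mean of the previous values."""
    previous_values = values[:-1]
    last_value = values[-1]
    return (last_value - np.mean(previous_values)) / np.mean(previous_values)

# A single window: previous mean is 10, last value is 13
single = percent_change(np.array([10.0, 10.0, 10.0, 13.0]))
print(single)

# The same calculation in a rolling window of 4 samples
toy_prices = pd.Series([10.0, 10.0, 10.0, 13.0, 10.0])
rolled = toy_prices.rolling(4).apply(percent_change, raw=True)
print(rolled)
```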

#### Applying this to the data

```
# Plot the raw data
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
ax = prices.plot(ax=axs[0])

# Calculate % change and plot
ax = prices.rolling(window=20).aggregate(percent_change).plot(ax=axs[1])
ax.legend_.set_visible(False)
```
* Apply the `percent_change` function to our data using the `.aggregate` method, passing our function as an input
* The data will now be roughly centered at zero, and periods of high and low change are easier to spot.
* Use this transformation to detect outliers

#### Finding outliers in your data
* Outliers are datapoints that differ markedly, in a statistical sense, from the rest of the dataset
* They can have negative effects on the predictive power of your model, biasing it away from its "true" value
* One solution is to *remove* or *replace* outliers with a more representative value
    * *Be very careful* about doing this: often it is difficult to determine what is a legitimately extreme value vs. an aberration
* A common definition:
    * **Outliers:** Any datapoint that is more than three standard deviations away from the mean of the dataset.
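
The three-standard-deviation rule can be sketched on synthetic data. Below, one obviously extreme value is injected into otherwise normally distributed numbers (both assumptions for illustration), flagged with a boolean mask, and replaced with the median.

```python
import numpy as np

# Synthetic data with one injected extreme value
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)
data[50] = 15.0

# Flag points more than three standard deviations from the mean
centered = data - data.mean()
mask = np.abs(centered) > 3 * data.std()
print(mask.sum())

# Replace flagged points with the median of the data
cleaned = data.copy()
cleaned[mask] = np.median(data)
print(cleaned[50])
```

Because the outlier itself inflates the mean and standard deviation, the threshold depends on the very points it is meant to detect; this is one reason to treat the rule as a heuristic.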
    
#### Plotting a threshold on our data

```
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
for data, ax in zip([prices, prices_perc_change], axs):
    # Calculate the mean / standard deviation for the data
    this_mean = data.mean()
    this_std = data.std()
    
    # Plot the data, with a window that is 3 standard deviations around the mean
    data.plot(ax = ax)
    ax.axhline(this_mean + this_std * 3, ls= '--', c ='r')
    ax.axhline(this_mean - this_std * 3, ls = '--', c = 'r')
```

* We calculate the mean and standard deviation of each dataset, then plot outlier "thresholds" (three times the standard deviation from the mean) on the raw and transformed data.
* Note that the datapoints deemed an outlier depend on the transformation of the data.
* Next, replace outliers with the median of the remaining values:

```
# Center the data so the mean is 0
prices_outlier_centered = prices_outlier_perc - prices_outlier_perc.mean()

# Calculate standard deviation
std = prices_outlier_perc.std()

# Use the absolute value of each datapoint to make it easier to find the outliers
outliers = np.abs(prices_outlier_centered) > (std * 3)

# Replace outliers with the median value
# We'll use np.nanmedian since there may be nans around the outliers
prices_outlier_fixed = prices_outlier_centered.copy()
prices_outlier_fixed[outliers] = np.nanmedian(prices_outlier_fixed)
```

#### Visualize the results

```
fig, axs = plt.subplots(1, 2, figsize= (10,5))
prices_outlier_centered.plot(ax=axs[0])
prices_outlier_fixed.plot(ax=axs[1])
```

```
def interpolate_and_plot(prices, interpolation):
    """Interpolate missing values in `prices` and plot the result"""
    # Create a boolean mask for missing values
    missing_values = prices.isna()

    # Interpolate the missing values
    prices_interp = prices.interpolate(interpolation)

    # Plot the full interpolated series in black
    fig, ax = plt.subplots(figsize=(10, 5))
    prices_interp.plot(color='k', alpha=.6, ax=ax, legend=False)

    # Now plot only the interpolated values on top in red
    prices_interp[missing_values].plot(ax=ax, color='r', lw=3, legend=False)
    plt.show()
```

```
# Interpolate using the latest non-missing value
interpolation_type = 'zero'
interpolate_and_plot(prices, interpolation_type)
```

```
# Your custom function
def percent_change(series):
    # Collect all *but* the last value of this window, then the final value
    previous_values = series[:-1]
    last_value = series[-1]

    # Calculate the % difference between the last value and the mean of earlier values
    percent_change = (last_value - np.mean(previous_values)) / np.mean(previous_values)
    return percent_change

# Apply your custom function and plot
prices_perc = prices.rolling(20).apply(percent_change)
prices_perc.loc["2014":"2015"].plot()
plt.show()
```

```
def replace_outliers(series):
    """Define function to replace outliers with median of a series"""
    # Calculate the absolute difference of each timepoint from the series mean
    absolute_differences_from_mean = np.abs(series - np.mean(series))

    # Calculate a mask for the differences that are > 3 standard deviations from zero
    this_mask = absolute_differences_from_mean > (np.std(series) * 3)

    # Replace these values with the median across the data
    series[this_mask] = np.nanmedian(series)
    return series
```

```
# Apply your preprocessing function to the timeseries and plot the results
prices_perc = prices_perc.apply(replace_outliers)
prices_perc.loc["2014":"2015"].plot()
plt.show()
```