<div style="text-align: right">INFO 6105 Data Sci Engineering Methods and Tools, Week 13 Day 1</div>
<div style="text-align: right">Prof. Dino Konstantopoulos, 10 March 2023</div>

# Wall Street
Exploring the limits of statistical learning.

<br />
<center>
<img src="ipynb.images/broken-dreams.jpg" width=800 />
</center>

If you browse for machine learning with financial data, you will find plenty of articles that purport to predict the future price of any stock.

```
pip install alpha_vantage
```

Then visit their [web site](https://www.alphavantage.co/) to get your API key. DO NOT USE MINE!

In [None]:
api_key = '73YAIZQIISA9D5X8'
from alpha_vantage.timeseries import TimeSeries

ticker = 'AAPL'
handler = TimeSeries(key=api_key, output_format="pandas")
aapl = handler.get_daily_adjusted(ticker)
aapl[0]

In [None]:
aapl[0].iloc[:, 1].values

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
hi = aapl[0].iloc[:, 1].values
plt.plot(hi[::-1])
plt.grid(True)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
hi = aapl[0].iloc[:, 1].values
plt.plot(aapl[0].iloc[:, 1])
plt.grid(True)

Another option was `investpy`, but that stopped working:

In [None]:
import investpy

df = investpy.get_stock_historical_data(stock='AAPL',
                                        country='United States',
                                        from_date='01/01/2019',
                                        to_date='31/01/2021')
print(df.head())

Another option was `pandas_datareader`, but that stopped working, too!

In [None]:
import pandas as pd
# the line below is the fix for the is_list_like bug
#pd.core.common.is_list_like = pd.api.types.is_list_like

import pandas_datareader as web
import datetime
#start = datetime.datetime(2019, 1, 1)
#end = datetime.datetime(2021, 1, 30)
start = '2019-01-01'
end = '2021-01-31'

aapl = web.DataReader('AAPL', 'yahoo', start, end)
aapl.head(10)

Another option was `yfinance`, but that stopped working, too:

In [None]:
import yfinance as yf
data = yf.download("AAPL", start="2017-01-01", end="2017-04-30")

We can calculate the n-th (n=1 is the default) discrete difference along a given axis to find out about gains/losses, using numpy's [diff](https://docs.scipy.org/doc/numpy/reference/generated/numpy.diff.html) API:

In [None]:
import numpy as np
hi = aapl[0].iloc[:, 1].values
returns = np.diff(hi)
plt.plot(returns)
plt.grid(True)

In [None]:
dir(TimeSeries)

In [None]:
ticker = 'AAPL'
handler = TimeSeries(key=api_key, output_format="pandas")
aapl = handler.get_monthly_adjusted(ticker)
aapl[0]

Another option is googlefinance
```
pip install googlefinance
```
But that has stopped working, too :-(

### Time Series Exploratory Data Analysis (EDA)

A time series is simply a series of data points ordered in time. In a time series, time is often the independent variable and the goal is usually to make a **forecast** for the future, preferably from other columns, but sometimes from that very column itself, too.

However, there are other aspects that come into play when dealing with time series. Namely:
- Is it **stationary**? Stationarity is an important characteristic of time series. A time series is said to be stationary if its statistical properties do not change over time. 

In other words, it has constant **mean** and **variance**, and **covariance** is independent of time. We'll study what these concepts represent when we get into statistics. For now just think of them as point estimates of a distribution of numbers. 

Often, stock prices are ***not a stationary process***, since we might see a growing trend, or its volatility might increase over time (meaning that variance is changing). Ideally, we want to have a **stationary** time series for modelling. Of course, not all of them are stationary, but we can often make different transformations to make them stationary. 

[Dickey-Fuller](https://en.wikipedia.org/wiki/Dickey%E2%80%93Fuller_test) is the statistical test that we run to determine whether a time series is stationary. If you co-op on Wall Street or get a job as a quant, you'll be running this test *all the time*.


- Is the target variable **autocorrelated**? Autocorrelation is the similarity between observations as a function of the time lag between them.


- Is there **seasonality**? Seasonality refers to periodic fluctuations. For example, electricity consumption is high during the day and low at night, or online sales increase during Christmas before slowing down again. Seasonality can also be derived from an autocorrelation plot if it has a sinusoidal shape: simply look at the period, and it gives the length of the season.
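
As a minimal sketch of that last point, the snippet below computes the sample autocorrelation of a synthetic series and reads the season length off its highest peak. The sine series, its 12-step period, and the helper name `autocorr` are assumptions made for illustration; none of this uses the AAPL data.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at a positive lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

t = np.arange(240)
seasonal = np.sin(2 * np.pi * t / 12)   # synthetic series with a 12-step season

# Autocorrelation for lags 1..24; the highest peak marks the season length
acf = [autocorr(seasonal, lag) for lag in range(1, 25)]
best_lag = 1 + int(np.argmax(acf))
print(best_lag)
```

On real prices the autocorrelation plot is rarely this clean, so treat the peak as a hint to investigate rather than as a measurement.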

# 1. First time series model: A sketchy model

Let's see if we can predict one column (y) from the *same column*, but time-shifted by an amount.

In our NYC meteorological dataset, we also attempted to predict the future of a spreadsheet column, but from other columns instead.

We will train a Random Forest with a sample of our data, then test it with another sample to see how it performs. Let's import the `Scikit-learn` modules.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.concat([aapl[0].iloc[:, 1]], join='outer', axis=1, sort=False)
data

In [None]:
data.columns

Let's create our dependent variable, and see if we can predict stock price ***one month*** in advance:

In [None]:
data['predict'] = data['2. high'].shift(-30)
data

Let's remove the last 30 rows:

In [None]:
data2 = data[0:-30]
data2
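
To see what `shift` and the row removal do, here is a small sketch on a toy column. The values, the column name `price`, and the 2-row horizon are hypothetical stand-ins for the 30 rows used above.

```python
import pandas as pd

# Toy price column (hypothetical values)
toy = pd.DataFrame({"price": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})

horizon = 2  # stands in for the 30 rows used above
# Each row receives the value found `horizon` rows further down
toy["predict"] = toy["price"].shift(-horizon)

# The last `horizon` rows have nothing to look at (NaN), so we drop them
toy_clean = toy[0:-horizon]
print(toy_clean)
```

Whether "further down" means "later in time" depends on how the rows are sorted, so it is worth checking `data.index` before trusting the label.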

Let's name our independent and dependent variables:

In [None]:
x = data2['2. high']
y = data2['predict']

In [None]:
x #a pandas Series (of 1 column)

In [None]:
y #a pandas Series (of 1 column)

Now, the Random Forest API, like almost all scikit-learn APIs that have a `.fit()` training method, expects the independent variables as a 2-D structure of shape (n_samples, n_features), such as a pandas `DataFrame`, whereas the dependent variable can be 1-D, such as a pandas `Series`. In other words, the independent variables cannot be passed as a single (1-D) pandas Series!

So what we need to do is turn `x` into a numpy array, because the `.fit()` API is polymorphic and works with both pandas DataFrames and numpy arrays, and then use `.reshape(-1,1)` to add a dummy extra dimension.

In [None]:
x.values # a numpy array

In [None]:
X = x.values.reshape(-1, 1)
X # a 2D numpy array that is really 1D but has a dummy extra dimension

Now we import `RandomForestRegressor` from the `scikit-learn` package.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.model_selection import train_test_split

Let's create training and test data:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

Now we create an ML model, using `RandomForestRegressor` from the `scikit-learn` package, and we train:

In [None]:
# Create a model 
rf_model = RandomForestRegressor()

# Train the model
rf_model.fit(X_train, y_train)

Let's score our model on how good it is at predicting stock prices 30 days in the future:

In [None]:
rf_model.score(X_train, y_train)

In [None]:
rf_model.score(X_test, y_test)

Wow! 93%!! What the heck am I doing teaching at Northeastern?!! I should quit my job, move to New York City, and play the stock market!

Hmmm.. What do you think the random forest is actually doing?

- It has all the data minus 200 points here and there. So it just fills in the blanks with the value right before it (e.g. if you have the data for Wednesday but not the Thursday thereafter, just use the value for Wednesday! Because *for all time series* that are somewhat continuous, the point *before* is a great approximation for the point *after*!)

<br />
<center>
<img src="ipynb.images/x-test-y-test.png" width=700 />
</center>

So the prediction is ***trivial***! What we really need to do is to ***mask the future*** for the model, to force it to look at only the **past**, and not the **future** in making predictions!
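
The difference between the two ways of splitting can be sketched without any model at all. The snippet below only looks at day numbers: for every test day, it measures how far away the closest training day is. The 500-day length and the helper name are assumptions; the 75/25 proportion mirrors the default of `train_test_split`.

```python
import numpy as np

n = 500                      # hypothetical number of daily rows
days = np.arange(n)          # day 0 is the oldest

# Random split: shuffle the days, keep 75% for training
rng = np.random.default_rng(42)
shuffled = rng.permutation(days)
random_train, random_test = shuffled[:375], shuffled[375:]

# Chronological split: train on the past, test on the future
chrono_train, chrono_test = days[:375], days[375:]

def gap_to_nearest_training_day(test_days, train_days):
    """For each test day, the distance in days to the closest training day."""
    return np.abs(test_days[:, None] - train_days[None, :]).min(axis=1)

median_gap_random = float(np.median(gap_to_nearest_training_day(random_test, random_train)))
median_gap_chrono = float(np.median(gap_to_nearest_training_day(chrono_test, chrono_train)))
print("random split, median gap:", median_gap_random)
print("chronological split, median gap:", median_gap_chrono)
```

With a random split, nearly every test day has a training day right next to it, which is exactly the "fill in the blanks" situation described above. With a chronological split, the model has to reach weeks beyond anything it has seen.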

# 2. A Better Model

Let's do that: Let's pick X\[0:300\] as our training data, and X\[300:500\] as our test data:

In [None]:
len(data2)

The **label** `y_xx` is the independent column `X_xx` shifted 30 days to the left:

In [None]:
X_train = X[:300]
X_test = X[300:]
y_train = y[:300]
y_test = y[300:]
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
type(y_train)

In [None]:
plt.plot(list(range(0,100)), list(range(0,100)))

In [None]:
len(X_train), len(X_test)

In [None]:
plt.plot(list(range(0, len(X_train))), X_train)
plt.plot(list(range(len(X_train), len(X_train) + len(X_test))), X_test)

We learn to predict the time-shifted blue in the graph above, and then we attempt to predict the time-shifted orange.

In the graph below, we learn to predict the green from the blue. Can we then leverage the model to predict the red from the orange?

In [None]:
plt.plot(list(range(0, len(X_train))), X_train)
plt.plot(list(range(len(X_train), len(X_train) + len(X_test))), X_test)
plt.plot(list(range(0, len(y_train))), y_train)
plt.plot(list(range(len(y_train), len(y_train) + len(y_test))), y_test)

>**Note**: `X_train` (independent variable) and `y_train` (dependent variable) are the blue and green curves above: What we give the algorithm to learn from: **How to predict a 30-day time-shift**.

>**Note**: `X_test` (independent variable) and `y_test` (dependent variable) are the orange and red curves: We want to be able to predict a 30-day shift to the left.

In [None]:
# Create a model 
rf_model2 = RandomForestRegressor()

# Train the model
rf_model2.fit(X_train, y_train)

# Score the model
rf_model2.score(X_test, y_test)

Uh oh.... Our model sucks now!

>**Conclusion**: ***We cannot predict the future unless it is somehow correlated with the past!***

Just out of curiosity, let's see what we predict:

In [None]:
y_pred = rf_model2.predict(X_test)

In [None]:
type(y_test)

In [None]:
type(y_pred)

If the types above are *different*, you need to convert either `y_test` or `y_pred` into the same type to be able to plot the data. Let's say we convert to a pandas DataFrame; this is how to do it:
```(python)
y_pred_df = pd.DataFrame(y_pred, columns=['CLOSE'], index=y_test.index)
```

In [None]:
y_pred_df = pd.DataFrame(y_pred, columns=['CLOSE'], index=y_test.index)

In [None]:
type(y_pred_df)

Let's plot what we predict:

In [None]:
plt.plot(y_test)
plt.plot(y_pred_df)

Awful!

<br />
<center>
<img src="ipynb.images/garfield-oh-no.png" width=200 />
</center>

# 3. A more serious Model
Let's decompose our data into 4 sections: Past1/Future1, and Past2/Future2. 

We will train the model to predict Future1 from Past1 and see if it can use what it learned to predict Future2 from Past2.

No more simple 30-day time shift! This should be a much more complex prediction job. However, arguably, if we learn to predict a piece of the future from a piece of the past (say 30 days ago, but not just a simple time-shift anymore), then we should be able to apply this learning to predict 30 days into the future from a different piece of the past than the one we learned from.

In [None]:
len(X)

In [None]:
X_train2 = X[:120]
y_train2 = X[120:240]
X_test2 = X[240:360]
y_test2 = X[360:480]
X_train2.shape, y_train2.shape, X_test2.shape, y_test2.shape

In [None]:
plt.plot(list(range(0,120)), X_train2)
plt.plot(list(range(120,240)), y_train2)
plt.plot(list(range(240,360)), X_test2)
plt.plot(list(range(360,480)), y_test2)

Notice this model is ***different*** than our previous one!

In [None]:
# Create a model 
rf_model2 = RandomForestRegressor()

# Train the model
rf_model2.fit(X_train2, y_train2)

# Score the model
rf_model2.score(X_test2, y_test2)

Awful!

In [None]:
y_pred2 = rf_model2.predict(X_test2)

In [None]:
y_pred2_df = pd.DataFrame(y_pred2, columns=['CLOSE'], index=data2.index[360:480])

In [None]:
plt.plot(y_pred2_df)


You know what? When something does not work for me, I start again *from scratch*:

# 4. Linear regression Models
Autoregressive (AR) models follow linear regression. Let's now try to predict the future from the past:

In [None]:
data.columns

In [None]:
data.index

In [None]:
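import numpy as np
# Work on a copy without the 'predict' column, so that `data` itself is left untouched
data3 = data.drop(columns=['predict'])
# Alpha Vantage names this column '2. high'; the cells below refer to it as 'High'
data3 = data3.rename(columns={'2. high': 'High'})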
total_data = len(data3.index)

split = int(total_data * 0.90)
train = data3[0:split]
test = data3[split:]

plt.figure(figsize=(12,8))
plt.plot(train.index, train.High, label='Train')
plt.plot(test.index, test.High, label='Test')
plt.xticks(data3.index, data3.index, rotation='vertical')
plt.legend(loc='best')
plt.title("Train Test Split")
plt.show()

## 4.1 Baseline

The simplest method says that the forecast for any period equals the last observed value. If the time series contains seasonality, it is better to take forecasts equal to the value from the last season. This is often used for benchmarking purposes.

I am creating a **baseline**:

In [None]:
predictions_nv = test.copy()
# Copy the last observed High from the training data
predictions_nv["Predictions"] = train.tail(1).iloc[0]["High"]
print (predictions_nv)

Let's measure the error of this **baseline**:

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(test["High"], predictions_nv["Predictions"]))
print("Naive - Root Mean Square Error (RMSE): %.3f" % rmse)

In [None]:
plt.figure(figsize=(12,8))
plt.plot(train.index, train.High, label='Train')
plt.plot(test.index, test.High, label='Test')
plt.plot(predictions_nv.index, predictions_nv.Predictions,
label='Prediction')
plt.xticks(data3.index, data3.index, rotation='vertical')
plt.legend(loc='best')
plt.title("Predictions by Naive model")
plt.show()
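
The cells above implement the last-value rule. The seasonal variant described at the start of this section could look like the sketch below. The function names, the toy history, and the season length of 4 are assumptions for illustration only.

```python
import numpy as np

def naive_forecast(history, horizon):
    """Repeat the last observed value `horizon` times."""
    return np.full(horizon, history[-1], dtype=float)

def seasonal_naive_forecast(history, horizon, season_length):
    """Repeat the last observed season until `horizon` values are produced."""
    last_season = np.asarray(history, dtype=float)[-season_length:]
    repeats = int(np.ceil(horizon / season_length))
    return np.tile(last_season, repeats)[:horizon]

history = [1, 2, 3, 4, 1, 2, 3, 4]   # toy series with a season of 4 steps
naive = naive_forecast(history, horizon=3)
seasonal = seasonal_naive_forecast(history, horizon=6, season_length=4)
print(naive)
print(seasonal)
```

Either forecast can then be scored with the same RMSE computation as above, which gives the number any fancier model has to beat.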

## 4.2 Auto-ARIMA
The **Autoregressive** AR(p) model follows linear regression. It makes one prediction at a time and feeds the output back into the model. Here, p specifies the order of the model, e.g. AR(1) is the first-order autoregressive model.

The output variable depends linearly on its own values at previous time steps (called lags), i.e. it is a regression of the series on itself. The lag length must be specified when creating the model.
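
To make "regression of the series on itself" concrete, here is a minimal sketch that simulates an AR(1) series and recovers its coefficient with ordinary least squares. The coefficient 0.8, the series length, and the variable names are assumptions; this is not how `pmdarima` or `statsmodels` estimate their models internally.

```python
import numpy as np

rng = np.random.default_rng(0)
phi_true = 0.8        # assumed AR(1) coefficient used to simulate the data
n = 2000

# Simulate x[t] = phi_true * x[t-1] + noise
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + rng.normal()

# Regress each value on the one right before it (lag 1, no intercept)
lagged = x[:-1].reshape(-1, 1)
target = x[1:]
phi_hat = float(np.linalg.lstsq(lagged, target, rcond=None)[0][0])
print(phi_hat)
```

The estimate lands close to the coefficient used in the simulation. For an AR(p) model, `lagged` would simply get p columns instead of one.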

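The **Moving Average** MA(q) process is another approach to modeling univariate time series: the output depends linearly on the current and the previous q forecast errors (random shocks). Despite the name, it is not the same thing as a moving-average *smoother*, which averages neighboring values to remove seasonal fluctuations and reveal the trend in the data. Here, q specifies the order of the model, e.g. MA(2) is the second-order moving average model.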

The **ARMA** process combines both AutoRegression (AR) and Moving Average (MA) models.
This is usually referred to as the ARMA(p,q) model where p is the order of the AR part and
q is the order of the MA part.

The **ARIMA** model has three components:
- (i) Autoregressive component, AR(p), i.e. linear regression on its previous values or lags (p).

- (ii) Integrated component, I(d), which indicates that the data have been replaced with the difference between the current observation and the previous time step, d times.

- (iii) Moving average component, MA(q), i.e. a moving average model of order q.

This model is represented as ARIMA(p, d, q), where p, d and q specify the order of the AR(p), I(d) and MA(q) models respectively.
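
Written out, and using symbols that are assumptions of this note rather than notation from the lecture ($c$ for a constant, $\phi_i$ for the AR coefficients, $\theta_j$ for the MA coefficients, $\varepsilon_t$ for the forecast error at time $t$), an ARMA(p, q) model reads:

$$
y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
$$

ARIMA(p, d, q) fits the same equation after differencing the series d times. For d = 1, $y_t$ is replaced by $y'_t = y_t - y_{t-1}$.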

The **SARIMA** model extends the ARIMA model with the ability to perform the same AR, I, and MA modeling at the seasonal level. Seasonal ARIMA models are denoted as ARIMA(p,d,q)(P,D,Q)m, where m refers to the number of periods in each season and P, D, Q (uppercase) refer to the autoregressive, differencing, and moving average terms for the seasonal part of the ARIMA model.

The **Auto-ARIMA** model automatically discovers the optimal order for an ARIMA model. The auto-ARIMA process seeks to identify the best parameters for an ARIMA model, settling on a single fitted ARIMA model.

`Pmdarima` wraps `statsmodels` under the hood.
```
pip install pmdarima
```

In [None]:
from pmdarima.arima import auto_arima
model_aarima = auto_arima (y = train["High"], seasonal=False, stepwise=True)
# seasonal : default=True, whether to fit a seasonal ARIMA.
# stepwise : default=True, the auto_arima function has two modes: stepwise & parallelized (slower)

predictions_aarima = model_aarima.predict(n_periods=test.index.size, X=None, return_conf_int=False, alpha=0.05)
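# np.asarray drops any index on the predictions, so the concat below lines up row by row
predictions_aarimaDf = pd.DataFrame({'Predictions': np.asarray(predictions_aarima)})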
result_aarima = pd.concat([test.reset_index(drop=True), predictions_aarimaDf], axis=1)
print (result_aarima)

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse_aarima = sqrt(mean_squared_error(test["High"], predictions_aarima))
print("Auto ARIMA - Root Mean Square Error (RMSE): %.3f" %rmse_aarima)

In [None]:
len(data3.index[-53:])

In [None]:
plt.figure(figsize=(12,8))
plt.plot(train.index, train.High, label='Train')
plt.plot(test.index, test.High, label='Test')
plt.plot(test.index, predictions_aarimaDf.Predictions, label='Prediction')
plt.xticks(data3.index, data3.index, rotation='vertical')
plt.legend(loc='best')
plt.title("Predictions by Auto-ARIMA model")
plt.show()

# 5. Facebook's Prophet
```
pip install fbprophet
```

The input to Prophet is always a dataframe with two columns: ds (datestamp, either YYYY-MM-DD or YYYY-MM-DD HH:MM:SS formats) and y (numeric, represents the measurement we wish to forecast).

In [None]:
train

In [None]:
from fbprophet import Prophet

# instantiate the model and set parameters
model_fb = Prophet(interval_width = 0.95, growth = "linear", daily_seasonality = False, weekly_seasonality = False, \
yearly_seasonality = False, seasonality_mode = "multiplicative")

train_fb = train.copy()
train_fb["ds"] = train_fb.index
train_fb["y"]= train_fb["High"]
train_fb.drop(['High'], axis=1, inplace=True)
train_fb.columns

In [None]:
# fit the model to historical data
model_fb.fit(train_fb)

In [None]:
future_pd = model_fb.make_future_dataframe(periods = 6, freq = 'm', include_history=True)

# predict over the dataset
predictions_fb = model_fb.predict(future_pd)

In [None]:
predict_fig = model_fb.plot(predictions_fb, xlabel='Date', ylabel='High')

# 6. Non-neural ML Models
How about we use a Support Vector Machine (SVM), for example? We haven't seen this yet in class, but when I google, it looks like a lot of people are using SVMs for time series analysis...

In [None]:
from sklearn.svm import SVR
regressor = SVR(kernel='rbf')

In [None]:
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

In [None]:
type(y_test)

In [None]:
type(y_pred)

In [None]:
y_pred_df = pd.DataFrame(y_pred, columns=['High'], index=y_test.index)

In [None]:
plt.plot(y_test)
plt.plot(y_pred_df)

Better in the beginning, but then awful, too!
<br />
<center>
<img src="ipynb.images/garfield-oh-no.png" width=200 />
</center>

# 7. Neural ML Models
Ok, how about a Neural Network? I hear that neural networks are pretty cool for time series. I also hear there is a professor that is going to teach just that next semester!

For this to work, you will need tensorflow. If your `pip install`s do not work, you can wait for when we do this in class, or you can ask your TAs :-)

```(python)
pip install tensorflow
```

>**Hint**: I'm using an older version of tensorflow (the 1.x version). Most of you will download the 2.x version, so you will probably need to replace
```(python)
from keras.models import Sequential
```
with
```(python)
from tensorflow.keras.models import Sequential
```
below.

We start with our first dataset: the 30-day time-shift. We have 4 datasets: `X_train`, `y_train`, `X_test`, `y_test`.

In [None]:
import tensorflow as tf
tf.__version__

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import keras.backend as K

You may have to replace `keras.optimizers` with `tensorflow.keras.optimizers`.... Talk to your TAs!

## 7.1 Simple 1-Layer Neural Model
We'll use a single layer with one neuron, and 100 epochs of training, in batches of 16 samples:

In [None]:
#K.clear_session()
model = Sequential()
model.add(Dense(1, input_shape=(X_test.shape[1],), activation='tanh', kernel_initializer='lecun_uniform'))
model.compile(optimizer=Adam(lr=0.001), loss='mean_squared_error')
model.fit(X_train, y_train, batch_size=16, epochs=100, verbose=1)

In [None]:
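from sklearn.metrics import r2_score

def adj_r2_score(r2, n, k):
    # Adjusted R-squared for n samples and k features
    return 1-((1-r2)*((n-1)/(n-k-1)))

# Score the neural model's own predictions (not the earlier SVR's y_pred)
y_pred = model.predict(X_test)
r2_test = r2_score(y_test, y_pred)
print("R-squared is: %f"%r2_test)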

Hmmm... Does not look too good, does it?

Let's see if we can predict the test dataset `X_test`:

In [None]:
y_pred = model.predict(X_test)

In [None]:
y_pred_df = pd.DataFrame(y_pred, columns=['High'], index=y_test.index)

In [None]:
plt.plot(y_test)
plt.plot(y_pred_df)
print('R-Squared: %f'%(r2_score(y_test, y_pred)))

Nope, bad! Our very simple neural model cannot even learn the 30-day time shift to the right: To predict `y_test`, just shift `X_test` 30 days to the right, because essentially that's what we asked the simple neural network to learn: `y_train` was just a 30-day-shifted-to-the-right version of `X_train`! One reason it fails: its single `tanh` unit can only output values between -1 and 1, nowhere near the scale of stock prices.

## 7.2 Deeper 2-Layer neural Model
Hmmm... I hear that ML is all about ***deep*** networks, so how about we increase the *intelligence* of our neural network with **2 Hidden Layers** and **50 neurons each**, and a [ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) activation function as well?


<br />
<center>
<img src="ipynb.images/champagne-glass-cake.jpg" width=300 />
</center>

### 7.2.1 Shift time-series to the right by 30 days dataset
Let's see if we can learn the simple 30-days time shift with our deeper 2-Layer model:

In [None]:
#K.clear_session()
model2 = Sequential()
model2.add(Dense(50, input_shape=(X_test.shape[1],), activation='relu', kernel_initializer='lecun_uniform'))
model2.add(Dense(50, input_shape=(X_test.shape[1],), activation='relu'))
model2.add(Dense(1))
model2.compile(optimizer=Adam(lr=0.001), loss='mean_squared_error')
model2.fit(X_train, y_train, batch_size=16, epochs=50, verbose=1)

In [None]:
y_pred = model2.predict(X_test)

In [None]:
y_pred_df = pd.DataFrame(y_pred, columns=['High'], index=y_test.index)

In [None]:
plt.plot(y_test)
plt.plot(y_pred_df)
print('R-Squared: %f'%(r2_score(y_test, y_pred)))

Oooooh.. wow! The deeper 2-Layer Neural Network appears to learn how to do the 30-day time-shift correctly :-)

>***Hey wait prof, you said we cannot predict the future unless it is somehow correlated with the past!***

First of all, a guess of around 50% is not that great.

But, yay, pretty smart little deep neural network: Training with this data:

In [None]:
plt.plot(list(range(0,300)), X_train)
plt.plot(list(range(0,300)), y_train)

It learned to do this prediction:

In [None]:
plt.plot(list(range(300, 300 + len(X_test))), X_test)
plt.plot(list(range(300, 300 + len(y_test))), y_test)

So, what did it learn?
- To predict 30 days in advance, shift the curve 30 ticks to the right!

<br />
<center>
<img src="ipynb.images/x-test-x-test-30+.png" width=400 />
</center>

So yes, in "*predicting*" the time series, all your model really did is shift the training data 30 clicks to the right.

<br />
<center>
<img src="ipynb.images/duh.gif" width=200 />
</center>

### 7.2.2 2nd Model with more complex prediction (no 30-day time shift)
Now let's try our 2nd model on our deep network: We learn to predict the orange part from the blue part, and then test our learning by attempting to predict the red part from the green part.

In [None]:
plt.plot(list(range(0,120)), X_train2)
plt.plot(list(range(120,240)), y_train2)
plt.plot(list(range(240,360)), X_test2)
plt.plot(list(range(360,480)), y_test2)

In [None]:
X_test2.shape

In [None]:
#K.clear_session()
model3 = Sequential()
model3.add(Dense(50, input_shape=(X_test2.shape[1],), activation='relu', kernel_initializer='lecun_uniform'))
model3.add(Dense(50, input_shape=(X_test2.shape[1],), activation='relu'))
model3.add(Dense(1))
model3.compile(optimizer=Adam(lr=0.001), loss='mean_squared_error')
model3.fit(X_train2, y_train2, batch_size=16, epochs=50, verbose=1)

Evaluating on our test data:

In [None]:
y_pred2 = model3.predict(X_test2)  # model3 is the network we just trained on X_train2/y_train2

In [None]:
y_pred2_df = pd.DataFrame(y_pred2, columns=['CLOSE'], index=data2.index[360:480])
y_test2_df = pd.DataFrame(y_test2, columns=['CLOSE'], index=data2.index[360:480])

In [None]:
plt.plot(y_test2_df)
plt.plot(y_pred2_df)
print('R-Squared: %f'%(r2_score(y_test2, y_pred2)))

Started well, but then missed the peaks! Even though our deep net did a good job on our first dataset, the second prediction task (predicting the next segment from the previous one, with no simple time shift) was too hard for it!

Even though training succeeded on the training data...

Wait, did it?

In [None]:
y_pred2 = model3.predict(X_train2)  # same network, now evaluated on its own training data

In [None]:
y_pred2_df = pd.DataFrame(y_pred2, columns=['CLOSE'], index=data2.index[120:240])
y_train2_df = pd.DataFrame(y_train2, columns=['CLOSE'], index=data2.index[120:240])

In [None]:
plt.plot(y_train2_df)
plt.plot(y_pred2_df)
print('R-Squared: %f'%(r2_score(y_train2, y_pred2)))

Oh no! It did not really. And you can see why by looking at the loss values in our training loop: they reach a plateau and never improve further.

# 8. About Time Series prediction

First of all, a really ***bad*** way to do Machine Learning on financial data is to look for how many days in the past I can predict stock price... ***from the stock price itself***.

You may not predict a dependent variable from the ***same*** variable that you will call ***independent*** in the days prior, unless there is ***autocorrelation*** or you ***first process those historical values in some way***.

<center>
<img src="ipynb.images/oopsie.png" width=300 />
</center>

This is ***bad ML***, and you would be surprised at how many people do that on the Web. 

Why is it flawed? Because it's missing statistical **reasoning**.

Data science is about ***going back in time*** to find the process that yields the data, right? 

The process that yields the price for stock data is ***not the stock price itself***!

The price ***rarely*** decides what the price is ***going*** to be (in the future)! 

It's ***market forces*** that shape the price! So ML should attempt to uncover those market forces! And it probably could, if we give it the *right columns*. But we didn't! We just used the price in the past to attempt to predict the price in the future! 

That is like learning how to land a plane by looking at runway lights (which light up different colors depending on how we're doing, coming in too high or too low), instead of learning how to fly by understanding the dynamics of flight. The too-low lights *don't tell us why we're too low* (*but if you learn to fly just by using a flight simulator, you might just look at runway lights to land the plane*)!

Poor little Machine Learning algorithms will however do their *best* and attempt to predict future price from past price. 

Obviously, the ***best*** predictor of $y$ at time $t$ is $y$ at time $t-1$! So, **day - 1** will always work ***great*** as a predictor for **day 0**, **day - 2** also, albeit less so, and so on until it stops working as we increase the number of days we go in the past. But the best trick is to just learn to shift the data x clicks to the right, and our ***deep*** model learned to do just that.

Neural networks, arguably the *smartest* of all ML algorithms as we increase their intelligence (add more layers and more neurons), will figure out what constitutes the *best strategy*, and ***do just that***! So, they ***cheat***!

<br />
<center>
<img src="ipynb.images/oh-no.png" width=200 />
    Oh no!!
</center>

Here is an example of statistical learning gone bad: A Google AI learned how to run, but nobody told it that humans like to conserve energy while running, because energy is in limited quantities for humans and animals. So the AI learned to balance itself with enormous expenditures of energy, leading to physically unrealistic solutions for balance (look at the arms waving wildly): 

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('gn4nRCC9TwQ')

So we learned a few things:

>You need to find your *independent variables* ***first***, and ***then*** see if they are good predictors of your *dependent variable* (target)! 

>Machine Learning is ***not*** a replacement for Science! You still have to do the Science to figure out **cause** and **effect**, and only ***then*** use statistics (and ML) to help with prediction. 

YOU NEED TO ***LEVERAGE HISTORICAL INFORMATION ON THE STOCK TO MAKE AN EDUCATED GUESS ON THE FUTURE BEHAVIOR OF THE STOCK***! 

Otherwise, you will earn a little money when you guess right, lose a little money when you guess wrong, and at the end you will have guessed very close, but you will make no profit and lose ***lots of money*** from the overhead associated with every stock transaction.

>**Technical Analysis**: Point values of stock prices cannot be used to estimate future stock price. However, integral values (aggregated values ***over*** a specified time period, like **averages**) ***may***. That is the impetus behind what is called **financial [technical analysis](https://en.wikipedia.org/wiki/Technical_analysis)** (TA). [This](https://school.stockcharts.com/doku.php?id=overview:technical_analysis) is also a great reference.

>**Efficient-market hypothesis**: You should know, however, that the **[efficient-market hypothesis](https://en.wikipedia.org/wiki/Efficient-market_hypothesis)** (EMH) contradicts the basic tenets of technical analysis by stating that past prices cannot be used to profitably predict future prices in any shape or form.
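
The simplest of the aggregated values that technical analysis relies on is the moving average. Here is a minimal sketch, assuming a 3-step window and a handful of made-up prices; the function name is hypothetical, and real indicators are usually computed over much longer windows.

```python
import numpy as np

def simple_moving_average(prices, window):
    """Mean of each run of `window` consecutive prices."""
    prices = np.asarray(prices, dtype=float)
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the full window fits
    return np.convolve(prices, kernel, mode="valid")

toy_prices = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical prices
sma = simple_moving_average(toy_prices, window=3)
print(sma)
```

Note that the output is shorter than the input by `window - 1` values, so it has to be aligned with the dates carefully before it is used as a feature.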

Here's the proof that predicting the next day by using the value from the day before can be a very good strategy: We evaluate the R-Squared score between the current point and the point x days before:

In [None]:
#print('R-Squared: %f'%(r2_score(y_test[30:130], y_test[0:100])))
print('R-Squared-3: %f'%(r2_score(y_test[:-3], y_test[3:])))

In [None]:
print('R-Squared-10: %f'%(r2_score(y_test[:-10], y_test[10:])))

In [None]:
print('R-Squared-20: %f'%(r2_score(y_test[:-20], y_test[20:])))

In [None]:
print('R-Squared-30: %f'%(r2_score(y_test[:-30], y_test[30:])))

These are **baselines** you need to keep in mind while building data models.
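
The same kind of baseline can be wrapped in a small helper so that it is easy to compute for any lag. The sketch below is written in plain numpy and assumes the series is sorted oldest first; the helper names and the synthetic rising series are assumptions, not the AAPL data.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def persistence_r2(series, lag):
    """Score the rule 'the value `lag` steps ago is the forecast for today'."""
    series = np.asarray(series, dtype=float)
    return r2(series[lag:], series[:-lag])

trend = np.arange(200, dtype=float)   # synthetic, steadily rising series
scores = {lag: persistence_r2(trend, lag) for lag in (1, 3, 10, 30)}
for lag, score in scores.items():
    print("lag %d: R-Squared %f" % (lag, score))
```

The score drops as the lag grows, which is the pattern to expect from any reasonably smooth series: the further back the "last known value" is, the weaker it becomes as a forecast.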

Before you start leveraging black-box deep learning models to solve prediction problems, you need to try out a simple **common-sense approach**. It will serve as a sanity check and will establish a **baseline** that you will have to beat in order to demonstrate the usefulness of a machine learning model. 

>**Example**: A classic example is that of *unbalanced* classification tasks (such as fraud data), where some classes can be much more common than others. If your dataset contains 90% of instances of class A and 10% of instances of class B, then a common-sense approach to the classification task would be to always predict "A" when presented with a new sample. Such a classifier would be 90% accurate overall, and any learning-based approach should therefore beat this 90% score in order to demonstrate usefulness. Sometimes such elementary baselines can prove surprisingly hard to beat.
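
The arithmetic of that example can be checked in a few lines. The labels below are a made-up 90/10 dataset, and the variable names are assumptions.

```python
import numpy as np

labels = np.array(["A"] * 90 + ["B"] * 10)   # hypothetical 90/10 dataset

# Find the most frequent class and always predict it
classes, counts = np.unique(labels, return_counts=True)
majority_class = classes[np.argmax(counts)]
baseline_predictions = np.full(labels.shape, majority_class)

baseline_accuracy = float(np.mean(baseline_predictions == labels))
print(majority_class, baseline_accuracy)
```

This baseline never finds a single instance of class B, which is why accuracy alone is a poor yardstick on unbalanced data.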

So in the case of financial time series predictions, one needs to ask oneself, how many days before does it *make sense* to attempt a prediction for (that depends on the investment portfolio)? What if I take the value of the last possible day as my prediction, how well will I do? Then, try to beat that baseline. Now, ***that*** is *science*. Everything else is voodoo. And there's *tons* of voodoo on the Web.

So, you need to ask yourself: *How many times a month can I afford to make a trade in order to make more money than the commission fees? Is it every 3 days? In that case I need to beat the following common-sense baseline, and if I build a machine model, then I need to **hide** (remove as independent features) the last 3 days, otherwise a clever machine algorithm will pick up the pattern and predict based on those values*!

In [None]:
print('R-Squared-3: %f'%(r2_score(y_test[:-3], y_test[3:])))
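
One way to "hide" the most recent days is to build the feature matrix so that the newest value a row may use is already `horizon` steps older than its target. The sketch below assumes a series sorted oldest first; the function name, the parameters `horizon` and `n_lags`, and the toy series are all hypothetical.

```python
import numpy as np

def make_lagged_features(series, horizon, n_lags):
    """Features for target series[t] are series[t-horizon], series[t-horizon-1], ...

    Nothing newer than `horizon` steps before the target is ever used.
    """
    series = np.asarray(series, dtype=float)
    first_target = horizon + n_lags - 1   # earliest t with a full feature window
    X = np.array([
        series[t - horizon - n_lags + 1 : t - horizon + 1][::-1]
        for t in range(first_target, len(series))
    ])
    y = series[first_target:]
    return X, y

toy = np.arange(10)                       # toy series: 0, 1, ..., 9
X_lagged, y_lagged = make_lagged_features(toy, horizon=3, n_lags=2)
print(X_lagged[0], y_lagged[0])           # features for the first usable target
```

A model trained on `X_lagged` then has to beat the matching common-sense baseline, the R-Squared for the same lag, before it can claim to have learned anything.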

>**Note**: Picking the *dependent features* (what you're trying to predict) from the *independent features* (what you're leveraging as *knowledge*) is the most important problem in machine learning. It's the basis of the scientific approach of **cause and effect**.

## Data needs to be stationary
For another good explanation of why predicting a random variable from itself is ***junk science***, and why time series **stationarity** is so critical, read [this](https://towardsdatascience.com/how-not-to-use-machine-learning-for-time-series-forecasting-avoiding-the-pitfalls-19f9d7adf424) good article.
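
To see the pitfall concretely, here is a minimal sketch on a synthetic random walk (made-up data, not AAPL prices). The helper `r2` is a small stand-in that follows the usual definition of the coefficient of determination; the seed and series length are arbitrary assumptions. Predicting tomorrow's *price* with today's price looks excellent, while predicting tomorrow's *change* with today's change does not:

```python
import numpy as np

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic random walk: each step adds independent Gaussian noise
rng = np.random.default_rng(42)
prices = 100 + np.cumsum(rng.normal(0, 1, size=2000))

# "Yesterday's value" baseline on the raw (non-stationary) levels
r2_levels = r2(prices[1:], prices[:-1])

# The same baseline on the differenced (stationary) series
changes = np.diff(prices)
r2_changes = r2(changes[1:], changes[:-1])

print(f"R-squared on levels:  {r2_levels:.3f}")
print(f"R-squared on changes: {r2_changes:.3f}")
```

The series has no predictable structure by construction, so the high score on levels only reflects that consecutive prices are close to each other.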

>**LESSON LEARNED**: The only prediction that is relevant is the ***difference between consecutive days***. Taking differences is a way of introducing **stationarity** in the dataset: statistics such as the **mean**, the **standard deviation**, and the **autocorrelation** need to ***remain constant over time***. 

<br />
<center>
<img src="ipynb.images/will-smith-aladin-2.jpg" width=400 />
    You need to learn statistics..
</center>

So I went looking for **independent variables**. Since everyone seems to download **prices** and **volumes**, I surmised these are used to ***compute*** independent variables to make decisions with, so I went looking for financial indicators that can be computed from these basic metrics. In other words, I am making the hypothesis (unproven) that ***financial technical analysis (TA) works***. Scikit-Learn should be able to tell me if it does or not, in the end.

Specifically, I would like to know if financial indicators provided by **technical analysis** (TA) can predict up and down trends of the stock. So I plan on leveraging the libraries I located first (`investpy` and `trendnet`) to give me the **up** and **down** intervals for a stock, and to see if those can be predicted from financial indicators.

This is the ***scientific procedure***. It may not work and my financial knowledge is limited, but the *procedure* is correct, and my results will either prove or disprove my hypothesis.

So now that you know why this kind of learning *does not work*, let's prove it by making our time series **stationary**, and rerunning our last neural network experiment that ***appeared to work so well***..

In [None]:
X_train

To evaluate gains and losses (today's price minus yesterday's), we will calculate the n-th discrete difference (n=1 is the default) along a given axis, using numpy's [diff](https://docs.scipy.org/doc/numpy/reference/generated/numpy.diff.html) API. Differencing removes the trend and makes the series (approximately) **stationary**. First, let's check the shape of our training data:

In [None]:
X_train.shape

This is how we remove the extra dimension:

In [None]:
np.squeeze(X_train).shape

In [None]:
import numpy as np
X_train3 = np.diff(np.squeeze(X_train))
y_train3 = np.diff(np.squeeze(y_train))
X_test3 = np.diff(np.squeeze(X_test))
y_test3 = np.diff(np.squeeze(y_test))
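
Why `squeeze` before `diff`? A minimal sketch on a toy column vector (the numbers are made up): `np.diff` works along the last axis by default, so on an array of shape `(n, 1)` it has nothing to subtract and returns an empty result.

```python
import numpy as np

column = np.array([[10], [12], [11], [15]])   # shape (4, 1), like a single-feature matrix

# Differencing along the last axis (length 1) leaves nothing
print(np.diff(column).shape)                  # (4, 0)

# Remove the extra dimension first, then take today's value minus yesterday's
flat = np.squeeze(column)                     # shape (4,)
changes = np.diff(flat)                       # shape (3,): one element shorter
print(flat.shape, changes.shape)
print(changes)                                # [ 2 -1  4]
```

Note that the differenced series is one element shorter than the original.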

In [None]:
X_train3

In [None]:
plt.plot(X_train3)
plt.title('Training data made stationary')
plt.ylabel('Daily change in high price ($)')
plt.xlabel('Trading day index')
plt.grid(True)
plt.show()

Great, now we have a stationary series. Let's predict! 

Notice how I need to **reshape** my data to accommodate the requirements of `keras`'s `fit()` API. Data reshaping using the `reshape` and `squeeze` APIs is something you need to learn how to do:

In [None]:
X_train.shape

In [None]:
X_train3.shape

In [None]:
X_train3.shape[0]

In [None]:
X_train4 = X_train3.reshape((X_train3.shape[0], 1))
X_train4.shape

In [None]:
y_train4 = y_train3.reshape((y_train3.shape[0], 1))
X_test4 = X_test3.reshape((X_test3.shape[0], 1))
y_test4 = y_test3.reshape((y_test3.shape[0], 1))

Let's rerun our last neural network:

In [None]:
#K.clear_session()
model4 = Sequential()
model4.add(Dense(50, input_shape=(X_train4.shape[1],), activation='relu', kernel_initializer='lecun_uniform'))
model4.add(Dense(50, activation='relu'))  # input shape is inferred from the previous layer
model4.add(Dense(1))
model4.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')
model4.fit(X_train4, y_train4, batch_size=16, epochs=50, verbose=1)

Let's predict:

In [None]:
# Use the model we just trained on the differenced data (model4, not the earlier model)
y_pred4 = model4.predict(X_test4)

In [None]:
type(y_test4)

In [None]:
type(y_pred4)

In [None]:
plt.plot(y_test4)
plt.plot(y_pred4)
print('R-Squared: %f'%(r2_score(y_test4, y_pred4)))

***Horrible***, even with our previously successful model!

Prediction of a time series from the time series *itself* is ***not possible*** (unless the time series is **autocorrelated**: future behavior is related to past behavior)! If it were, ***I would not be teaching you today***! I'd be a millionaire driving a Maserati and lounging in the Marino gym checking out the crowd ;-)

So what exactly do [quants](https://en.wikipedia.org/wiki/Quantitative_analysis_(finance)) do on *Wall Street*?

<br />
<center>
<img src="ipynb.images/quants.jpg" width=400 />
A quant making a million dollar salary
</center>

# 9. Financial Indicators

[This](https://www.investopedia.com/articles/active-trading/041814/four-most-commonlyused-indicators-trend-trading.asp) is where I learned about financial indicators. 

Here is what I learned:

##  Price Action
[Price action](https://www.investopedia.com/terms/p/price-action.asp) is the movement of a security's price plotted over time. Price action forms the basis for all technical analysis of a stock, commodity or other asset chart. Many short-term traders rely exclusively on price action and the formations and trends extrapolated from it to make trading decisions. Technical analysis as a practice is a derivative of price action since it uses past prices in calculations that can then be used to inform trading decisions.

### What Does Price Action Tell You?
Price action can be seen and interpreted using charts that plot prices over time. Traders use different chart compositions to improve their ability to spot and interpret trends, breakouts and reversals. Many traders use candlestick charts since they help better visualize price movements by displaying the open, high, low, and close values in the context of up or down sessions.

>**LESSON**: This tells me that it is important to understand price action in the context of the Dow Jones Industrial Average. So I plan to add columns that reflect how the Dow is trending.

## What is a Reversal?
A [reversal](https://www.investopedia.com/terms/r/reversal.asp) is a change in the price direction of an asset. A reversal can occur to the upside or downside. Following an uptrend, a reversal would be to the downside. Following a downtrend, a reversal would be to the upside. Reversals are based on overall price direction and are not typically based on one or two periods/bars on a chart.

## Moving Averages
We already studied this.

Moving averages ***smooth*** price data by creating a single flowing line. The line represents the average price over a period of time. Which moving average the trader decides to use is determined by the time frame in which he or she trades. For investors and long-term trend followers, the 200-day, 100-day, and 50-day simple moving average are popular choices.

There are several ways to utilize the moving average. The first is to look at the angle of the moving average. If it is mostly moving horizontally for an extended amount of time, then the price isn't trending, it is **ranging**. If the moving average line is angled up, an **uptrend** is underway. 

Moving averages don't predict though. They simply show what the price is doing, on average, over a period of time.

When the price crosses above a moving average, it can also be used as a **buy signal**, and when the price crosses below a moving average, it can be used as a **sell signal**. But since the price is more volatile than the moving average, this method is prone to more false signals.

These buy and sell signals are indicators of **latent** variables: hidden, yet-to-be-uncovered variables that explain why the stock price moves up or down.

Many traders will watch for a short-term moving average to cross above a longer-term moving average and use this to signal increasing upward momentum. This bullish crossover suggests that the price has recently been rising at a faster rate than it has in the past, so it is a common technical buy sign. 

Conversely, a short-term moving average crossing below a longer-term average is used to illustrate that the asset's price has been moving downward at a faster rate and that it may be a good time to sell.

Crossovers are another way to utilize moving averages. By plotting a 200-day and 50-day moving average on your chart, a buy signal occurs when the 50-day crosses above the 200-day. A sell signal occurs when the 50-day drops below the 200-day.

That is the basic baseline theory behind how investors buy or sell stock.
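
The 50/200-day crossover rule can be sketched with pandas rolling means. This is a minimal sketch on a synthetic price series (a random walk with drift, not real market data); the names `sma_50`, `sma_200`, `buy_dates`, and `sell_dates` are just illustrative:

```python
import numpy as np
import pandas as pd

# Synthetic closing prices (random walk with drift), NOT real market data
rng = np.random.default_rng(1)
dates = pd.bdate_range("2019-01-01", periods=600)
close = pd.Series(100 + np.cumsum(rng.normal(0.05, 1.0, size=600)), index=dates)

sma_50 = close.rolling(window=50).mean()
sma_200 = close.rolling(window=200).mean()

# Only compare where the 200-day average is defined (from the 200th day on)
valid = sma_200.notna()
fast_above = (sma_50[valid] > sma_200[valid]).astype(int)

# diff() is +1 on the day the 50-day moves above the 200-day, -1 on the day it drops below
change = fast_above.diff()
buy_dates = change.index[change == 1]
sell_dates = change.index[change == -1]

print("Buy signals: ", list(buy_dates.date))
print("Sell signals:", list(sell_dates.date))
```

Because the signal is defined by a change of sign of the difference between the two averages, buy and sell signals necessarily alternate.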

## MACD (Moving Average Convergence Divergence)
The [MACD](https://www.investopedia.com/trading/macd/) was designed to profit by analyzing the difference between the two exponential moving averages (EMAs). Specifically, the value for the long-term moving average is subtracted from the short-term average, and the result is plotted onto a chart. The periods used to calculate the MACD can be easily customized to fit any strategy, but traders will commonly rely on the default settings of 12- and 26-day periods.

A positive MACD value, created when the short-term average is above the longer-term average, is used to signal increasing upward momentum. This value can also be used to suggest that traders may want to refrain from taking short positions until a signal suggests it is appropriate. On the other hand, falling negative MACD values suggest that the downtrend is getting stronger, and that it may not be the best time to buy.

One basic MACD strategy is to look at which side of zero the MACD lines are on in the histogram below the chart. Above zero for a sustained period of time, and the trend is likely up; below zero for a sustained period of time, and the trend is likely down. Potential buy signals occur when the MACD moves above zero, and potential sell signals when it crosses below zero.

Signal line crossovers provide additional buy and sell signals. A MACD has two lines – a fast line and a slow line. A buy signal occurs when the fast line crosses through and above the slow line. A sell signal occurs when the fast line crosses through and below the slow line.

The MACD indicator is one of the most popular tools in technical analysis because it gives traders the ability to quickly and easily identify the short-term trend direction. The clear transaction signals help minimize the subjectivity involved in trading, and the crosses over the signal line make it easy for traders to ensure that they are trading in the direction of momentum. Very few indicators in technical analysis have proved to be more reliable than the MACD, and this relatively simple indicator can quickly be incorporated into any short-term trading strategy.
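
A minimal sketch of the MACD computation with pandas, on made-up prices that rise for 60 days and then fall for 60 days. The 12- and 26-day spans come from the text above; the 9-day span for the signal (slow) line is the conventional default and is an assumption here, as are all variable names:

```python
import numpy as np
import pandas as pd

# Synthetic closes: 60 rising days followed by 60 falling days (made-up data)
close = pd.Series(np.concatenate([np.linspace(100, 130, 60),
                                  np.linspace(130, 100, 60)]))

# Short-term and long-term exponential moving averages
ema_12 = close.ewm(span=12, adjust=False).mean()
ema_26 = close.ewm(span=26, adjust=False).mean()

# MACD (fast line): short-term EMA minus long-term EMA
macd = ema_12 - ema_26

# Signal (slow line): EMA of the MACD itself (9 periods assumed)
signal = macd.ewm(span=9, adjust=False).mean()

# Crossovers: fast line moving above / below the slow line
buy = (macd > signal) & (macd.shift(1) <= signal.shift(1))
sell = (macd < signal) & (macd.shift(1) >= signal.shift(1))

print("MACD at end of the rise:", round(macd.iloc[59], 3))
print("MACD at end of the fall:", round(macd.iloc[-1], 3))
print("First sell crossover at index:", sell.idxmax() if sell.any() else None)
```

On this toy series the MACD is positive during the rise and turns negative during the fall, which is the zero-line reading described above.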

## Volume As An Indicator
Volume is an important indicator in technical analysis as it is used to measure the relative worth of a market move. If the markets make a strong price movement, then the strength of that movement depends on the volume for that period. The higher the volume during the price move, the more significant the move.

**Fundamental analysis** is based on company performance and is used to determine which stock to buy. 

**Technical analysis** is based on stock price and is used to determine when to buy. Technical analysts are primarily looking for entry and exit price points, and volume levels provide clues about where the best entry and exit points are located.

Volume is one of the most important measures of strength for traders and technical analysts. Put simply, volume refers to the number of shares traded. For any trade to occur, the market needs to produce a buyer and a seller. A transaction occurs when a buyer and a seller agree on a price, which is referred to as the **market price**. From an auction perspective, when buyers and sellers become particularly active at a certain price, it means there is a lot of volume.

If traders want to confirm a reversal on a level of support, or **floor**, they look for high buying volume. Conversely, if traders are looking to confirm a break in the level of support, they look for low volume from buyers.

If traders want to confirm a reversal on a level of resistance, or **ceiling**, they look for high selling volume. Conversely, if traders are looking to confirm a break in the level of resistance, they look for high volume from buyers.

### Calculating OBV
[On-balance volume](https://www.investopedia.com/terms/o/onbalancevolume.asp) provides a running total of an asset's trading volume and indicates whether this volume is flowing in or out of a given security or currency pair. The OBV is a cumulative total of volume (positive and negative). There are three rules implemented when calculating the OBV. They are:

- If today's closing price is higher than yesterday's closing price, then: Current OBV = Previous OBV + today's volume


- If today's closing price is lower than yesterday's closing price, then: Current OBV = Previous OBV - today's volume


- If today's closing price equals yesterday's closing price, then: Current OBV = Previous OBV

The theory behind OBV is based on the distinction between **smart money** – namely, institutional investors – and less sophisticated retail investors. As mutual funds and pension funds begin to buy into an issue that retail investors are selling, volume may increase even as the price remains relatively level. Eventually, volume drives the price upward. At that point, larger investors begin to sell, and smaller investors begin buying.

Despite being plotted on a price chart and measured numerically, the actual individual quantitative value of OBV is not relevant. The indicator itself is cumulative, while the time interval remains fixed by a dedicated starting point, meaning the real number value of OBV arbitrarily depends on the start date. Instead, traders and analysts look to the nature of OBV movements over time; the slope of the OBV line carries all of the weight of analysis.

Analysts look to volume numbers on the OBV to track large, **institutional investors**. They treat divergences between volume and price as a synonym of the relationship between "smart money" and the disparate masses, hoping to showcase opportunities for buying against incorrect prevailing trends. For example, institutional money may drive up the price of an asset, then sell after other investors jump on the bandwagon.

### Example Of How To Use On-Balance Volume
Below is a list of 10 days' worth of a hypothetical stock's closing price and volume:

- Day one: closing price equals \$10, volume equals 25,200 shares
- Day two: closing price equals \$10.15, volume equals 30,000 shares
- Day three: closing price equals \$10.17, volume equals 25,600 shares
- Day four: closing price equals \$10.13, volume equals 32,000 shares
- Day five: closing price equals \$10.11, volume equals 23,000 shares
- Day six: closing price equals \$10.15, volume equals 40,000 shares
- Day seven: closing price equals \$10.20, volume equals 36,000 shares
- Day eight: closing price equals \$10.20, volume equals 20,500 shares
- Day nine: closing price equals \$10.22, volume equals 23,000 shares
- Day 10: closing price equals \$10.21, volume equals 27,500 shares

As can be seen, days two, three, six, seven and nine are up days, so these trading volumes are added to the OBV. Days four, five and 10 are down days, so these trading volumes are subtracted from the OBV. On day eight, no changes are made to the OBV since the closing price did not change. Given the days, the OBV for each of the 10 days is:

- Day one OBV = 0
- Day two OBV = 0 + 30,000 = 30,000
- Day three OBV = 30,000 + 25,600 = 55,600
- Day four OBV = 55,600 - 32,000 = 23,600
- Day five OBV = 23,600 - 23,000 = 600
- Day six OBV = 600 + 40,000 = 40,600
- Day seven OBV = 40,600 + 36,000 = 76,600
- Day eight OBV = 76,600
- Day nine OBV = 76,600 + 23,000 = 99,600
- Day 10 OBV = 99,600 - 27,500 = 72,100
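
The three OBV rules can be checked against this hypothetical stock with a few lines of numpy. This is a minimal sketch using the 10 closing prices and volumes listed above; it reproduces the running total, ending at 72,100 on day 10:

```python
import numpy as np

close = np.array([10.00, 10.15, 10.17, 10.13, 10.11, 10.15, 10.20, 10.20, 10.22, 10.21])
volume = np.array([25_200, 30_000, 25_600, 32_000, 23_000,
                   40_000, 36_000, 20_500, 23_000, 27_500])

# +1 on up days, -1 on down days, 0 when the close is unchanged
direction = np.sign(np.diff(close)).astype(int)

# Day one starts the running total at 0; afterwards add or subtract the day's volume
obv = np.concatenate([[0], np.cumsum(direction * volume[1:])])

for day, value in enumerate(obv, start=1):
    print(f"Day {day}: OBV = {int(value):,}")
```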

### The Difference Between OBV And Accumulation/Distribution
On-balance volume and the accumulation/distribution line are similar in that they are both momentum indicators that use volume to predict the movement of “smart money”. However, this is where the similarities end. In the case of on-balance volume, it is calculated by summing the volume on an up-day and subtracting the volume on a down-day.

The formula used to create the accumulation/distribution (Acc/Dist) line is quite different than the OBV shown above. The formula for the Acc/Dist, without getting too complicated, is that it uses the position of the current price relative to its recent trading range and multiplies it by that period's volume.

### Limitations Of OBV
One limitation of OBV is that it is a leading indicator, meaning that it may produce predictions, but there is little it can say about what has actually happened in terms of the signals it produces. Because of this, it is prone to produce false signals. It can therefore be balanced by lagging indicators. Add a moving average line to the OBV to look for OBV line breakouts; you can confirm a breakout in the price if the OBV indicator makes a concurrent breakout.

Another note of caution in using the OBV is that a large spike in volume on a single day can throw off the indicator for quite a while. For instance, a surprise earnings announcement, being added or removed from an index, or massive institutional block trades can cause the indicator to spike or plummet, but the spike in volume may not be indicative of a trend.

### Formula for OBV:
The Formula For OBV is:
$$\begin{aligned} &\text{OBV} = \text{OBV}_{prev} + \begin{cases} \text{volume,} & \text{if close} > \text{close}_{prev} \\ \text{0,} & \text{if close} = \text{close}_{prev} \\ -\text{volume,} & \text{if close} < \text{close}_{prev} \\ \end{cases} \\ &\textbf{where:} \\ &\text{OBV} = \text{Current on-balance volume level} \\ &\text{OBV}_{prev} = \text{Previous on-balance volume level} \\ &\text{volume} = \text{Latest trading volume amount} \\ \end{aligned}$$	

## Accumulation/Distribution Indicator (A/D)
Accumulation/distribution is a cumulative indicator that uses volume and price to assess whether a stock is being accumulated or distributed. The accumulation/distribution measure seeks to identify divergences between the stock price and volume flow. This provides insight into how strong a trend is. If the price is rising but the indicator is falling this indicates that buying or accumulation volume may not be enough to support the price rise and a price decline could be forthcoming.

The Formula for the Accumulation/Distribution Indicator is:

$$\text{A/D} = \text{Previous A/D} + \text{CMFV}$$

where:

CMFV = Current money flow volume

$$ \text{CMFV} = \frac{(P_C - P_L) - (P_H - P_C)}{(P_H - P_L)} \times V$$

and

$P_C$ = Closing price

$P_L$ = Low price for the period

$P_H$ = High price for the period

$V$ = Volume for the period

The accumulation/distribution line helps to show how supply and demand factors are influencing price. A/D can move in the same direction as price changes or it may move in the opposite direction.

The multiplier in the calculation provides a gauge for how strong the buying or selling was during a particular period. It does this by determining whether the price closed in the upper or lower portion of its range. This is then multiplied by the volume. Therefore, when a stock closes near the high of the period's range, and has high volume, that will result in a large A/D jump. If the price finishes near the high of the range but volume is low, the A/D will not move up as much. If volume is high but the price finishes more toward the middle of the range, the A/D will also not move up as much.

The same concepts apply when the price closes in the lower portion of the period's price range. Both volume and where the price closes within the period's range determine how much the A/D will decline by.
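
A minimal sketch of the calculation on three made-up bars, accumulating the money flow volume period by period (consistent with A/D being a cumulative indicator). The column names and values are illustrative assumptions, and the sketch does not handle bars where the high equals the low (the multiplier would divide by zero):

```python
import pandas as pd

# Three hypothetical periods: close at the high, at the low, and mid-range
bars = pd.DataFrame({
    "High":   [10.0, 12.0, 11.0],
    "Low":    [8.0, 10.0, 9.0],
    "Close":  [10.0, 10.0, 10.0],
    "Volume": [100, 200, 300],
})

# Multiplier: +1 when the close is at the high, -1 at the low, 0 in the middle of the range
price_range = bars["High"] - bars["Low"]
multiplier = ((bars["Close"] - bars["Low"]) - (bars["High"] - bars["Close"])) / price_range

# Current money flow volume, then the running A/D line
cmfv = multiplier * bars["Volume"]
ad_line = cmfv.cumsum()

print(pd.DataFrame({"multiplier": multiplier, "CMFV": cmfv, "A/D": ad_line}))
```

The mid-range close on the third bar leaves the A/D line unchanged even though its volume is the highest of the three.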

The accumulation/distribution line is used to help assess price trends and potentially spot forthcoming reversals.

If a security's price is in a downtrend while the accumulation/distribution line is in an uptrend, the indicator shows there may be buying pressure and the security's price may reverse to the upside.

Conversely, if a security's price is in an uptrend while the accumulation/distribution line is in a downtrend, the indicator shows there may be selling pressure, or higher distribution. This warns that the price may be due for a decline.

# 10. A Machine Learning plan

>**WHAT I LEARNED**: I need to compute [aggregate statistics](https://en.wikipedia.org/wiki/Aggregate_data) on price indicators, and then use the aggregate statistics as independent columns to predict the closing price (dependent column, or *target*)!

So based on what I learned, I decided to focus on three indicators to evaluate, to add as columns to my dataset:

- MAD: difference between a 50-day and a 200-day moving average
- MACD: difference between a short-term (12-day) and a long-term (26-day) exponential moving average
- OBVD: derivative (slope) of OBV times derivative (slope) of price. If negative, smart money divergence is happening

.. and to include financial average columns, such as the [DOW](https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average).
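
Before reaching for a library, here is a minimal sketch of how these three columns could be computed by hand with pandas. Everything here is an assumption for illustration: the function name `add_plan_features`, the `Close` and `Volume` column names, the sign convention of MAD (50-day minus 200-day), and the 10-day window used for the slopes in OBVD. A window is needed because, with one-day differences, the OBV change always has the same sign as the price change, so the product could never be negative.

```python
import numpy as np
import pandas as pd

def add_plan_features(df, slope_window=10):
    """Hypothetical helper: add MAD, MACD and OBVD columns to a price/volume frame."""
    out = df.copy()
    close, volume = out["Close"], out["Volume"]

    # MAD: 50-day minus 200-day simple moving average
    out["MAD"] = close.rolling(50).mean() - close.rolling(200).mean()

    # MACD: 12-day minus 26-day exponential moving average
    out["MACD"] = (close.ewm(span=12, adjust=False).mean()
                   - close.ewm(span=26, adjust=False).mean())

    # OBV: add volume on up days, subtract on down days, starting at 0
    obv = (np.sign(close.diff()).fillna(0) * volume).cumsum()

    # OBVD: change in OBV times change in price over the same window;
    # a negative value means volume flow and price moved in opposite directions
    out["OBVD"] = obv.diff(slope_window) * close.diff(slope_window)
    return out

# Synthetic data for a quick check (NOT real market data)
rng = np.random.default_rng(7)
frame = pd.DataFrame({
    "Close": 100 + np.cumsum(rng.normal(0, 1, size=300)),
    "Volume": rng.integers(10_000, 50_000, size=300),
})
features = add_plan_features(frame)
print(features.tail())
```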

Armed with this new knowledge, I went looking for python packages that implement these indicators (why write them myself when somebody else may have done the job already, right?).

And I stumbled onto [this](https://github.com/bukosabino/ta).

So I gave it a shot:
```
pip install --upgrade ta
```

In [None]:
import pandas as pd
import pandas_datareader as web
import datetime

start = datetime.datetime(2019, 1, 1)
end = datetime.datetime(2021, 1, 30)
aapl = web.DataReader('AAPL', 'yahoo', start, end)
aapl.head(10)

In [None]:
import ta
help(ta.add_all_ta_features)

In [None]:
import ta

# Clean NaN values
#df = ta.utils.dropna(df)
# Actually, if I run this, I get: an Error: IndexError: single positional indexer is out-of-bounds

# Add all ta features
#aapl = ta.add_all_ta_features(df, open="Open", high="High", low="Low", close="Close", volume="Volume")
aapl = ta.add_all_ta_features(aapl, open="Open", high="High", low="Low", close="Close", volume="Volume")

In [None]:
aapl.head()

What are all the columns?

In [None]:
aapl.columns

Wow, that's a lot of financial indicators! Maybe I don't even need to add my own (the ones I mentioned above) because this data frame already includes them!

It's nice googling github!

In [None]:
aapl.T

Let me take a look at all columns by transposing the matrix and increasing the number of rows to display, so I can do a bit of EDA (e.g. see which columns I should drop because they contain no relevant information).

In [None]:
pd.set_option('display.max_rows', 80)

In [None]:
aapl.T

# A time-proven investment strategy

Remember we read that *by plotting a 200-day and 50-day moving average on our chart, a buy signal occurs when the 50-day crosses above the 200-day. A sell signal occurs when the 50-day drops below the 200-day*?

Let's see if that is true at all, for **AAPL** stock:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error

def plot_moving_average_differential(series, window1, window2, plot_intervals=False, scale=1.96):
    # Rolling means over the short (window1) and long (window2) windows
    rolling_mean1 = series.rolling(window=window1).mean()
    rolling_mean2 = series.rolling(window=window2).mean()

    plt.figure(figsize=(17,8))
    plt.title('Moving average\n differential window sizes = ' + str(window1) + ', ' + str(window2))
    plt.plot(rolling_mean1, 'g', label='Rolling mean window ' + str(window1))
    plt.plot(rolling_mean2, 'r', label='Rolling mean window ' + str(window2))

    # Plot confidence intervals around the short-window rolling mean
    if plot_intervals:
        mae = mean_absolute_error(series[window1:], rolling_mean1[window1:])
        deviation = np.std(series[window1:] - rolling_mean1[window1:])
        lower_bound = rolling_mean1 - (mae + scale * deviation)
        upper_bound = rolling_mean1 + (mae + scale * deviation)
        plt.plot(upper_bound, 'r--', label='Upper bound / Lower bound')
        plt.plot(lower_bound, 'r--')

    plt.plot(series[window1:], label='Actual values')
    plt.legend(loc='best')
    plt.grid(True)

# differential 50-200
plot_moving_average_differential(aapl['High'], 50, 200)

A surprisingly ***good*** indicator for buying and selling **AAPL** stock: When my 50-day rolling mean (in green) drops below my 200-day rolling mean (in red), I *indeed* should ***sell***. And when my 50-day crosses above, I *indeed* should ***buy***. That's what I'm going to use from now on!

<br />
<center>
<img src="ipynb.images/funny-fish.gif" width=400 />
    The End
</center>