# Exploratory Data Analysis - Apple Stock Prices
In this notebook we explore the Apple stock price. We want to see how the price has changed over time, and we will try to build a reasonably accurate model to forecast this particular stock.

As we have said several times during this project, predicting stock prices is very challenging and almost impossible, but we believe that we can follow a marked trend to forecast the future in a reasonable way.

We start, as always, by importing our libraries:

In [1]:
import pandas as pd
from fbprophet import Prophet
from fbprophet.plot import plot_plotly, plot_components_plotly
import plotly.express as px
import os, sys
path = os.getcwd()
path = os.path.dirname(path)
sys.path.append(path)
from train import train, save_model
import datetime as dt
from datetime import timedelta
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

Let's take a look at our customized Apple data:

In [2]:
# Loading Apple data
apple = pd.read_csv('../data/apple.csv')
apple

Unnamed: 0,ds,y
0,2001-11-26,0.381607
1,2001-11-27,0.375000
2,2001-11-28,0.366607
3,2001-11-29,0.364643
4,2001-11-30,0.380357
...,...,...
5032,2021-11-19,160.550003
5033,2021-11-22,161.020004
5034,2021-11-23,161.410004
5035,2021-11-24,161.940002


We have twenty years of data, as we requested from Yahoo when we downloaded it.

The Apple price over time looks like this:

In [3]:
px.line(apple, x='ds', y='y')

If you have read the *EDA_Amazon_Stocks.ipynb* notebook (included in this project), you will notice that the price behaves almost identically for Amazon and Apple: an almost flat price for about 15 years, followed by strong rises in the last few years.

As we did before, we want to make predictions with different training data. In this case, we make three predictions with three training sets (5 years, 10 years, and all the data), because the growth of the Apple price is more "linear" from 2010 to 2020.
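
The three training windows can be sketched with a small helper. This is only an illustration: `train_window` and the toy frame are hypothetical names with made-up values, not part of the project code; only the `ds`/`y` column layout is taken from the notebook.

```python
import pandas as pd

def train_window(df, first_year, last_year=2020):
    """Return the rows of df whose 'ds' year lies in [first_year, last_year]."""
    years = df['ds'].dt.year
    return df[(years >= first_year) & (years <= last_year)]

# Toy data (made-up prices): one row per year, same layout as apple.csv
toy = pd.DataFrame({
    'ds': pd.to_datetime([f'{year}-06-01' for year in range(2001, 2022)]),
    'y': [float(i) for i in range(21)],
})

# 2021 is always left out, because it is the year we want to predict
windows = {
    'All the data': train_window(toy, 2001),
    'Last 10 years': train_window(toy, 2010),
    'Last 5 years': train_window(toy, 2015),
}
sizes = {name: len(frame) for name, frame in windows.items()}
print(sizes)
```

Because both year bounds are inclusive, the "10 years" window (2010-2020) spans eleven calendar years and the "5 years" window (2015-2020) spans six.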

In [4]:
# We want to predict the year 2021
apple['ds'] = pd.to_datetime(apple['ds'])
X_test = apple[apple['ds'].dt.year == 2021][['ds']]
X_test

Unnamed: 0,ds
4809,2021-01-04
4810,2021-01-05
4811,2021-01-06
4812,2021-01-07
4813,2021-01-08
...,...
5032,2021-11-19
5033,2021-11-22
5034,2021-11-23
5035,2021-11-24


## Case 1. Training the model with all the Data (2001-2020)

In [5]:
# Full data 2001-2020
X_train_full_data = apple[apple['ds'].dt.year != 2021]
X_train_full_data

Unnamed: 0,ds,y
0,2001-11-26,0.381607
1,2001-11-27,0.375000
2,2001-11-28,0.366607
3,2001-11-29,0.364643
4,2001-11-30,0.380357
...,...,...
4804,2020-12-24,131.970001
4805,2020-12-28,136.690002
4806,2020-12-29,134.869995
4807,2020-12-30,133.720001


In [6]:
# Predictions
model = Prophet()
model.fit(X_train_full_data)
forecast = model.predict(X_test)

INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.


In [7]:
fig = plot_plotly(model, forecast, xlabel='Date', ylabel='Price')
fig.update_layout(title='APPLE stock 2021 Predictions - Model trained with Full Data')
fig.show()

In [8]:
# Validating predictions
val = forecast.merge(apple, on='ds', how='right')
val = val[['ds', 'yhat', 'y']]
val.columns = ['Date', 'Predicted Price', 'True Price']
val = val[val.Date.dt.year == 2021]
fig = px.scatter(val, x=val.Date, y=val.columns[1:],
                title='APPLE stock 2021 Predictions - Validation')
fig.update_traces(marker_size=5)
fig.show()

In [9]:
# Forecast Components
plot_components_plotly(model, forecast)

In [10]:
# Scores
def scores(y_true, y_pred):
    print('MAE:', mean_absolute_error(y_true, y_pred))
    print('RMSE', np.sqrt(mean_squared_error(y_true, y_pred)))

y_true = apple[apple.ds.dt.year == 2021]['y']
y_pred =  forecast['yhat']
scores(y_true, y_pred)

MAE: 34.95002349689934
RMSE 35.66023269134483
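
For reference, the two metrics can be written out by hand. The sketch below uses NumPy only; `mae`, `rmse` and the toy values are illustrative names and made-up numbers, not the notebook's data.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up prices, compared position by position
y_true_toy = [100.0, 110.0, 120.0]
y_pred_toy = [90.0, 110.0, 140.0]

mae_toy = mae(y_true_toy, y_pred_toy)
rmse_toy = rmse(y_true_toy, y_pred_toy)
print('MAE:', mae_toy)
print('RMSE:', rmse_toy)
```

Both functions compare values by position, so the true and predicted series must be in the same date order. RMSE is never smaller than MAE; when the two are close, as in the output above, the errors are probably of similar size across the year rather than concentrated in a few large misses.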


To put the error in context, we compare the MAE with the mean price in 2021. The score below is 1 - MAE / mean price, so values closer to 1 are better:

In [11]:
print('Mean Apple Price in 2021: $', round(val['True Price'].mean(), 2))
print('Score: ', 1 - 34.95 / val['True Price'].mean())

Mean Apple Price in 2021: $ 137.66
Score:  0.7461212976664359
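
The score used here can be wrapped in a small function. This is a minimal sketch: `relative_score` is a hypothetical helper name, and the toy inputs are made up; only the rounded figures 34.95 and 137.66 come from the outputs above.

```python
import numpy as np

def relative_score(y_true, y_pred):
    """1 - MAE / mean(y_true): 1.0 is a perfect fit, lower is worse."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    return float(1 - mae / np.mean(y_true))

# Made-up example: MAE = 10, mean true price = 150
toy_score = relative_score([100.0, 200.0], [90.0, 210.0])
print('Toy score:', toy_score)

# Same formula applied to the rounded notebook figures
rounded_score = 1 - 34.95 / 137.66
print('Score from rounded figures:', round(rounded_score, 4))
```

This score is a relative error measure, not an R² value: it becomes negative whenever the MAE exceeds the mean price.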


This score is low. The model seems to follow only the long-term trend, and we may be able to improve the results simply by training on less, more recent data. Let's check this.

## Case 2. Training with 10 years of data (2010-2020)

In [12]:
# Training data - 10 years
X_train_last_ten = apple[(apple.ds.dt.year >= 2010) & (apple.ds.dt.year <=2020)]
X_train_last_ten

Unnamed: 0,ds,y
2040,2010-01-04,7.643214
2041,2010-01-05,7.656429
2042,2010-01-06,7.534643
2043,2010-01-07,7.520714
2044,2010-01-08,7.570714
...,...,...
4804,2020-12-24,131.970001
4805,2020-12-28,136.690002
4806,2020-12-29,134.869995
4807,2020-12-30,133.720001


In [13]:
# Predictions
model = Prophet()
model.fit(X_train_last_ten)
forecast2 = model.predict(X_test)
fig = plot_plotly(model, forecast2, xlabel='Date', ylabel='Price')
fig.update_layout(title='APPLE stock 2021 Predictions - Model trained with last 10 years of Data')
fig.show()

INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.


In [14]:
# Validating predictions
val2 = forecast2.merge(apple, on='ds', how='right')
val2 = val2[['ds', 'yhat', 'y']]
val2 = val2[val2.ds.dt.year == 2021]
fig = px.scatter(val2, x=val2.ds, y=val2.columns[1:],
                title='APPLE stock 2021 Predictions - Validation')
fig.update_traces(marker_size=5)
fig.show()

In [15]:
plot_components_plotly(model, forecast2)

Now we have more accurate predictions. Let's check our scores:

In [16]:
y_true = apple[apple.ds.dt.year == 2021]['y']
y_pred =  forecast2['yhat']
scores(y_true, y_pred)

MAE: 11.503978678573864
RMSE 13.201466685849448


In [17]:
print('Mean Apple Price in 2021: $', round(val['True Price'].mean(), 2))
print('Score: ', 1 - 11.5 / val['True Price'].mean())

Mean Apple Price in 2021: $ 137.66
Score:  0.9164633740533337


Training with 10 years instead of all the data gives a score of 0.91, compared with 0.74 for the full data set.

Now we are going to predict with five years of data:

## Case 3. Training with 5 years of data (2015-2020)

In [18]:
# Training data - 5 years
X_train_last_five = apple[(apple.ds.dt.year >= 2015) & (apple.ds.dt.year <=2020)]
X_train_last_five

Unnamed: 0,ds,y
3298,2015-01-02,27.332500
3299,2015-01-05,26.562500
3300,2015-01-06,26.565001
3301,2015-01-07,26.937500
3302,2015-01-08,27.972500
...,...,...
4804,2020-12-24,131.970001
4805,2020-12-28,136.690002
4806,2020-12-29,134.869995
4807,2020-12-30,133.720001


In [19]:
# Predictions
model = Prophet()
model.fit(X_train_last_five)
forecast3 = model.predict(X_test)
fig = plot_plotly(model, forecast3, xlabel='Date', ylabel='Price')
fig.update_layout(title='APPLE stock 2021 Predictions - Model trained with last 5 years of Data')
fig.show()

INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.


In [20]:
# Validating predictions
val3 = forecast3.merge(apple, on='ds', how='right')
val3 = val3[['ds', 'yhat', 'y']]
val3 = val3[val3.ds.dt.year == 2021]
fig = px.scatter(val3, x=val3.ds, y=val3.columns[1:],
                title='APPLE stock 2021 Predictions - Validation')
fig.update_traces(marker_size=5)
fig.show()

In [21]:
plot_components_plotly(model, forecast3)

In [22]:
# Scores
y_true = apple[apple.ds.dt.year == 2021]['y']
y_pred =  forecast3['yhat']
scores(y_true, y_pred)

MAE: 14.603625305675749
RMSE 16.32959810245292


In [23]:
print('Mean Apple Price in 2021: $', round(val['True Price'].mean(), 2))
print('Score: ', 1 - 14.6 / val['True Price'].mean())

Mean Apple Price in 2021: $ 137.66
Score:  0.8939448053198845


In this case the score is good (0.89), but not better than the one we obtained by training with 10 years (0.91).

For the Apple stock, 10 years of training data give us solid predictions that follow the price growth.

We could certainly improve these results, but we don't want to overfit the model, since we want it to make "guide" predictions. Remember that we can't *predict* the real value of a stock.

## Results
As we mentioned before, our best results came from training the model with 10 years of data.

In [24]:
results = pd.DataFrame(
    {'MAE': [34.95, 11.5, 14.6], 
    'RMSE': [35.66, 13.2, 16.32], 
    'Train Data': ['All the data', 'Last 10 years', 'Last 5 years']})
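
As a sketch of how the summary table could also carry the score and pick the best window, the snippet below rebuilds it from the rounded values printed earlier. The names `results_demo`, `mean_price_2021` and `best` are illustrative and not part of the notebook.

```python
import pandas as pd

# Rounded mean 2021 price printed earlier in the notebook
mean_price_2021 = 137.66

results_demo = pd.DataFrame({
    'Train Data': ['All the data', 'Last 10 years', 'Last 5 years'],
    'MAE': [34.95, 11.5, 14.6],
    'RMSE': [35.66, 13.2, 16.32],
})

# Score = 1 - MAE / mean price, as in the cells above
results_demo['Score'] = 1 - results_demo['MAE'] / mean_price_2021

# The best window is the one with the lowest MAE
best = results_demo.loc[results_demo['MAE'].idxmin(), 'Train Data']
print(results_demo.round(3))
print('Best window:', best)
```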

In [34]:
px.bar(results, x='Train Data', y=['MAE', 'RMSE'], barmode='group', 
        title='2021 MAE and RMSE: All Data vs Last 5 and 10 Years (Lower is Better)')

In [26]:
# val2 holds the 10-year forecast and val3 holds the 5-year forecast
val['Last 10 Years'] = val2['yhat']
val['Last 5 Years'] = val3['yhat']
val = val.rename(columns={'Predicted Price': 'All the Data'})
px.line(val, x='Date', y=val.columns[1:], title='Apple Stock Predictions: Train with all the Data vs Train with Last 5 and 10 years')
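
Assigning a column from another frame relies on both frames sharing the same index. One alternative, sketched below with made-up numbers, is to merge each forecast on the date so the label attached to every forecast is explicit. The frames `truth`, `forecasts` and `comparison` are hypothetical stand-ins for the notebook's data.

```python
import pandas as pd

dates = pd.to_datetime(['2021-01-04', '2021-01-05', '2021-01-06'])

# Made-up true prices and forecasts, using Prophet's 'ds'/'yhat' column names
truth = pd.DataFrame({'ds': dates, 'y': [129.0, 131.0, 127.0]})
forecasts = {
    'All the Data': pd.DataFrame({'ds': dates, 'yhat': [95.0, 95.1, 95.2]}),
    'Last 10 Years': pd.DataFrame({'ds': dates, 'yhat': [120.0, 120.2, 120.4]}),
    'Last 5 Years': pd.DataFrame({'ds': dates, 'yhat': [140.0, 140.3, 140.6]}),
}

comparison = truth.rename(columns={'ds': 'Date', 'y': 'True Price'})
for label, fc in forecasts.items():
    # Keep only the date and the prediction, renamed to the window label
    column = fc[['ds', 'yhat']].rename(columns={'ds': 'Date', 'yhat': label})
    comparison = comparison.merge(column, on='Date', how='left')

print(comparison)
```

The resulting frame has one row per date and one column per training window, which is the shape `px.line` expects when several columns are passed as `y`.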

## Training a model to make predictions in Apple Stocks
We now know which training window gives the best performance. Next, we train and save a model to make future predictions with it.

In [29]:
model = train('apple', '../data/apple.csv', False, True, len(X_train_last_ten))

INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.


In [30]:
# Making future predictions with the model: two years

# 1. Creating the forecast Dates
X_test_future = []
end = dt.datetime.strptime('2023-12-31', '%Y-%m-%d').date()
start = dt.datetime.strptime('2021-11-20', '%Y-%m-%d').date()

for i in range((end-start).days):
    X_test_future += [(start+timedelta(i)).strftime('%Y-%m-%d')]

X_test_future = pd.DataFrame(X_test_future)
X_test_future.columns = ['ds']
X_test_future

Unnamed: 0,ds
0,2021-11-20
1,2021-11-21
2,2021-11-22
3,2021-11-23
4,2021-11-24
...,...
766,2023-12-26
767,2023-12-27
768,2023-12-28
769,2023-12-29
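
The same date grid can also be built without an explicit loop. The sketch below uses `pd.date_range` and, like the loop, stops one day before the end date; the names `future` and `weekdays_only` are illustrative, and the weekday filter is an optional extra that the notebook does not apply.

```python
import pandas as pd

start = pd.Timestamp('2021-11-20')
end = pd.Timestamp('2023-12-31')

# One row per calendar day, from start up to (but not including) end
n_days = (end - start).days
future = pd.DataFrame({
    'ds': pd.date_range(start=start, periods=n_days, freq='D')
})

# Optional: keep Monday-Friday only, since the training data has no weekends
weekdays_only = future[future['ds'].dt.dayofweek < 5]

print(len(future), future['ds'].iloc[0].date(), future['ds'].iloc[-1].date())
print(len(weekdays_only))
```

Here the `ds` column holds timestamps instead of strings. The weekday filter does not remove market holidays; it only drops Saturdays and Sundays.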


In [32]:
# 2. Making predictions: 2 years
forecast = model.predict(X_test_future)
fig = plot_plotly(model, forecast, xlabel='Date', ylabel='Price')
fig.update_layout(title='Apple Stocks - Two Years Forecasting')
fig.show()

## Final Step: Saving the model
The final step is to save our model in the *models* folder:

In [33]:
# Saving the model
save_model('../models', model, 'apple')

Model Succesfully Saved in: 
../models/apple.json


## Conclusions
**With 10 years of training data, our model got a 0.91 score**, slightly better than the model trained with 5 years of data, and much better than training with all the data.

In the next notebook we will analyze the BTC-USD pair, perhaps the most challenging asset, because crypto market volatility is very high.