# Machine Learning Trading Bot

In this Challenge, you’ll assume the role of a financial advisor at one of the top five financial advisory firms in the world. Your firm constantly competes with the other major firms to manage and automatically trade assets in a highly dynamic environment. In recent years, your firm has heavily profited by using computer algorithms that can buy and sell faster than human traders.

The speed of these transactions gave your firm a competitive advantage early on. But people still need to program these systems explicitly, which limits their ability to adapt to new data. You’re thus planning to improve the existing algorithmic trading systems and maintain the firm’s competitive advantage in the market. To do so, you’ll enhance the existing trading signals with machine learning algorithms that can adapt to new data.

## Instructions:

Use the starter code file to complete the steps that the instructions outline. The steps for this Challenge are divided into the following sections:

* Establish a Baseline Performance

* Tune the Baseline Trading Algorithm

* Evaluate a New Machine Learning Classifier

* Create an Evaluation Report

#### Establish a Baseline Performance

In this section, you’ll run the provided starter code to establish a baseline performance for the trading algorithm. To do so, complete the following steps.

Open the Jupyter notebook. Restart the kernel, run the provided cells that correspond with the first three steps, and then proceed to step four. 

1. Import the OHLCV dataset into a Pandas DataFrame.

2. Generate trading signals using short- and long-window SMA values. 

3. Split the data into training and testing datasets.

4. Use the `SVC` classifier model from SKLearn's support vector machine (SVM) learning method to fit the training data and make predictions based on the testing data. Review the predictions.

5. Review the classification report associated with the `SVC` model predictions. 

6. Create a predictions DataFrame that contains columns for “Predicted” values, “Actual Returns”, and “Strategy Returns”.

7. Create a cumulative return plot that shows the actual returns vs. the strategy returns. Save a PNG image of this plot. This will serve as a baseline against which to compare the effects of tuning the trading algorithm.

8. Write your conclusions about the performance of the baseline trading algorithm in the `README.md` file that’s associated with your GitHub repository. Support your findings by using the PNG image that you saved in the previous step.

#### Tune the Baseline Trading Algorithm

In this section, you’ll tune, or adjust, the model’s input features to find the parameters that result in the best trading outcomes. (You’ll choose the best by comparing the cumulative products of the strategy returns.) To do so, complete the following steps:

1. Tune the training algorithm by adjusting the size of the training dataset. To do so, slice your data into different periods. Rerun the notebook with the updated parameters, and record the results in your `README.md` file. Answer the following question: What impact resulted from increasing or decreasing the training window?

> **Hint** To adjust the size of the training dataset, you can use a different `DateOffset` value (for example, six months). Be aware that changing the size of the training dataset also affects the size of the testing dataset.

2. Tune the trading algorithm by adjusting the SMA input features. Adjust one or both of the windows for the algorithm. Rerun the notebook with the updated parameters, and record the results in your `README.md` file. Answer the following question: What impact resulted from increasing or decreasing either or both of the SMA windows?

3. Choose the set of parameters that best improved the trading algorithm returns. Save a PNG image of the cumulative product of the actual returns vs. the strategy returns, and document your conclusion in your `README.md` file.
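The hint about `DateOffset` can be illustrated with a minimal sketch on synthetic data. The business-day index, the single `SMA_Fast` column, and the helper `split_by_offset` are assumptions made for illustration and are not part of the starter code. The sketch only shows how a longer training window shrinks the testing set.

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import DateOffset

# Synthetic stand-in for the features DataFrame: one row per business day for two years.
index = pd.date_range("2015-01-01", "2016-12-30", freq="B")
X = pd.DataFrame({"SMA_Fast": np.arange(len(index), dtype=float)}, index=index)


def split_by_offset(features, months):
    """Hypothetical helper: train on the first `months` months, test on the rest."""
    training_begin = features.index.min()
    training_end = training_begin + DateOffset(months=months)
    train = features.loc[training_begin:training_end]
    # Start the test set strictly after training_end so that no row is shared.
    test = features.loc[features.index > training_end]
    return train, test


for months in (3, 6, 12):
    train, test = split_by_offset(X, months)
    print(f"{months:>2} months: {len(train)} training rows, {len(test)} testing rows")
```

Because the two sets partition the same index, every row added to the training window is a row removed from the backtest period, so results across window sizes are not measured over identical test data.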

#### Evaluate a New Machine Learning Classifier

In this section, you’ll keep the original parameters that the starter code provided but apply them to a second machine learning model. To do so, complete the following steps:

1. Import a new classifier, such as `AdaBoostClassifier`, `DecisionTreeClassifier`, or `LogisticRegression`. (For the full list of classifiers, refer to the [Supervised learning page](https://scikit-learn.org/stable/supervised_learning.html) in the scikit-learn documentation.)

2. Using the same training data as the baseline model, fit another model with the new classifier.

3. Backtest the new model to evaluate its performance. Save a PNG image of the cumulative product of the actual returns vs. the strategy returns for this updated trading algorithm, and write your conclusions in your `README.md` file. Answer the following questions: Did this new model perform better or worse than the provided baseline model? Did this new model perform better or worse than your tuned trading algorithm?
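The steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual second model: the arrays are randomly generated stand-ins for `X_train`, `X_test`, `y_train`, and `y_test`, the choice of `LogisticRegression` is arbitrary, and the names `lr_model` and `lr_test_pred` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the notebook's training and testing data (assumption).
rng = np.random.default_rng(seed=1)
X_train = rng.normal(size=(200, 2))
X_test = rng.normal(size=(100, 2))
y_train = np.where(X_train[:, 0] - X_train[:, 1] >= 0, 1.0, -1.0)
y_test = np.where(X_test[:, 0] - X_test[:, 1] >= 0, 1.0, -1.0)

# Fit the scaler on the training features only, then transform both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Swap in the second classifier; the rest of the pipeline is unchanged.
lr_model = LogisticRegression()
lr_model.fit(X_train_scaled, y_train)
lr_test_pred = lr_model.predict(X_test_scaled)

# Evaluate with the same report used for the baseline model.
print(classification_report(y_test, lr_test_pred))
```

Keeping the scaler, the split, and the report identical means any difference in the backtest can be attributed to the classifier alone.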

#### Create an Evaluation Report

In the previous sections, you updated your `README.md` file with your conclusions. To complete this section, add a summary evaluation report at the end of the `README.md` file. In this report, state your final conclusions and analysis. Support your findings by using the PNG images that you created.


In [None]:
# Installation packages reference for virtual environment .venv on Mac Silicon (M1), including tensorflow-metal, c.f. https://developer.apple.com/metal/tensorflow-plugin/
# !pip install tensorflow
# !pip install tensorflow-metal
# !pip install pandas
# !pip install numpy
# !pip install scikit-learn
# !pip install matplotlib
# !pip install holoviews
# !conda install -c pyviz hvplot geoviews -y
# !pip install finta

# Imports
import pandas as pd
import numpy as np
from pathlib import Path
import holoviews as hv
import hvplot.pandas
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from pandas.tseries.offsets import DateOffset
from sklearn.metrics import classification_report

---

## Tune the Baseline Trading Algorithm

In this section, you’ll tune, or adjust, the model’s input features to find the parameters that result in the best trading outcomes. You’ll choose the best by comparing the cumulative products of the strategy returns.

### Step 3: Choose the set of parameters that best improved the trading algorithm returns. 

Save a PNG image of the cumulative product of the actual returns vs. the strategy returns, and document your conclusion in your `README.md` file.

### Step 1: Import the OHLCV dataset into a Pandas DataFrame.

In [2]:
# Import the OHLCV dataset into a Pandas DataFrame
heem_etf_df = pd.read_csv(
    Path("Resources/emerging_markets_ohlcv.csv"),
    index_col='date',
    parse_dates=True
)

# Review the DataFrame
display(heem_etf_df.info(), heem_etf_df)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4323 entries, 2015-01-21 09:30:00 to 2021-01-22 15:45:00
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   open    4323 non-null   float64
 1   high    4323 non-null   float64
 2   low     4323 non-null   float64
 3   close   4323 non-null   float64
 4   volume  4323 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 202.6 KB



None

| date | open | high | low | close | volume |
| --- | --- | --- | --- | --- | --- |
| 2015-01-21 09:30:00 | 23.83 | 23.83 | 23.83 | 23.83 | 100 |
| 2015-01-21 11:00:00 | 23.98 | 23.98 | 23.98 | 23.98 | 100 |
| 2015-01-22 15:00:00 | 24.42 | 24.42 | 24.42 | 24.42 | 100 |
| 2015-01-22 15:15:00 | 24.42 | 24.44 | 24.42 | 24.44 | 200 |
| 2015-01-22 15:30:00 | 24.46 | 24.46 | 24.46 | 24.46 | 200 |
| ... | ... | ... | ... | ... | ... |
| 2021-01-22 09:30:00 | 33.27 | 33.27 | 33.27 | 33.27 | 100 |
| 2021-01-22 11:30:00 | 33.35 | 33.35 | 33.35 | 33.35 | 200 |
| 2021-01-22 13:45:00 | 33.42 | 33.42 | 33.42 | 33.42 | 200 |
| 2021-01-22 14:30:00 | 33.47 | 33.47 | 33.47 | 33.47 | 200 |


In [3]:
# Filter the date index and close columns
signals_df = heem_etf_df.loc[:, ["close"]]
#signals_df = heem_etf_df[['close']] # Alternative specification that would bypass the .loc function
#display(signals_df)

# Use the pct_change function to generate returns from close prices
signals_df["Actual Returns"] = signals_df["close"].pct_change()

# Drop all NaN values from the DataFrame
signals_df = signals_df.dropna()

# Review the DataFrame
display(signals_df)

| date | close | Actual Returns |
| --- | --- | --- |
| 2015-01-21 11:00:00 | 23.98 | 0.006295 |
| 2015-01-22 15:00:00 | 24.42 | 0.018349 |
| 2015-01-22 15:15:00 | 24.44 | 0.000819 |
| 2015-01-22 15:30:00 | 24.46 | 0.000818 |
| 2015-01-26 12:30:00 | 24.33 | -0.005315 |
| ... | ... | ... |
| 2021-01-22 09:30:00 | 33.27 | -0.006866 |
| 2021-01-22 11:30:00 | 33.35 | 0.002405 |
| 2021-01-22 13:45:00 | 33.42 | 0.002099 |
| 2021-01-22 14:30:00 | 33.47 | 0.001496 |
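
As a quick check of how `pct_change` produces the `Actual Returns` column, the following sketch recomputes the first two returns from the first three close prices shown in the dataset preview. The variable names `close`, `returns`, and `manual` are chosen for illustration only.

```python
import pandas as pd

# First three close prices from the dataset preview.
close = pd.Series(
    [23.83, 23.98, 24.42],
    index=pd.to_datetime(
        ["2015-01-21 09:30", "2015-01-21 11:00", "2015-01-22 15:00"]
    ),
    name="close",
)

# pct_change computes close[t] / close[t-1] - 1; the first row has no prior bar.
returns = close.pct_change()

# The same calculation written out explicitly.
manual = close / close.shift() - 1

print(returns.round(6))
```

The first row is `NaN` because it has no predecessor, which is why the notebook calls `dropna()` immediately afterwards.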


### Step 2: Generate trading signals using short- and long-window SMA values.

In [4]:
# Set the short window and long window
short_window = 4
long_window = 95

# Generate the fast and slow simple moving averages (4 and 95 bars, respectively)
signals_df['SMA_Fast'] = signals_df['close'].rolling(window=short_window).mean()
signals_df['SMA_Slow'] = signals_df['close'].rolling(window=long_window).mean()

# Drop all NaN values from the DataFrame
signals_df = signals_df.dropna()

# Review the DataFrame
display(signals_df)

| date | close | Actual Returns | SMA_Fast | SMA_Slow |
| --- | --- | --- | --- | --- |
| 2015-04-02 13:30:00 | 24.89 | -0.000803 | 24.9075 | 24.289895 |
| 2015-04-02 13:45:00 | 24.93 | 0.001607 | 24.9125 | 24.299895 |
| 2015-04-02 14:00:00 | 24.91 | -0.000802 | 24.9100 | 24.305053 |
| 2015-04-02 14:15:00 | 24.92 | 0.000401 | 24.9125 | 24.310105 |
| 2015-04-02 14:30:00 | 24.92 | 0.000000 | 24.9200 | 24.314947 |
| ... | ... | ... | ... | ... |
| 2021-01-22 09:30:00 | 33.27 | -0.006866 | 33.2025 | 30.471526 |
| 2021-01-22 11:30:00 | 33.35 | 0.002405 | 33.2725 | 30.517737 |
| 2021-01-22 13:45:00 | 33.42 | 0.002099 | 33.3850 | 30.564474 |
| 2021-01-22 14:30:00 | 33.47 | 0.001496 | 33.3775 | 30.611421 |
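
To make the rolling-window calculation concrete, this small sketch applies a 4-bar simple moving average to the five close prices listed at the top of the output. It reproduces the `SMA_Fast` values shown for 14:15 and 14:30. The standalone `close` series is an illustration and is not part of the notebook.

```python
import pandas as pd

# The five close prices from the first rows of the output.
close = pd.Series(
    [24.89, 24.93, 24.91, 24.92, 24.92],
    index=pd.to_datetime([
        "2015-04-02 13:30", "2015-04-02 13:45", "2015-04-02 14:00",
        "2015-04-02 14:15", "2015-04-02 14:30",
    ]),
)

# A window of 4 needs 4 observations, so the first 3 results are NaN.
sma_fast = close.rolling(window=4).mean()

print(sma_fast)
```

The same warm-up effect applies to the long window: the first `long_window - 1` rows have no `SMA_Slow` value, which is why `dropna()` moves the start of the data from January to April 2015.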


In [5]:
# The baseline strategy provided by the starter code goes long the HEEM ETF when the latest bar's return is non-negative
# and reverses to a short position when the latest bar's return is negative.

# Initialize the new Signal column
signals_df['Signal'] = 0.0

# Note: the starter-code signals below depend only on the sign of the bar-to-bar price change, a very simple momentum (trend-continuation) rule,
# and not on SMA_Fast crossing SMA_Slow. The starter code is used as-is so that the project does not deviate from it.

# When Actual Returns are greater than or equal to 0, generate signal to buy stock long
signals_df.loc[(signals_df['Actual Returns'] >= 0), 'Signal'] = 1

# When Actual Returns are less than 0, generate signal to sell stock short
signals_df.loc[(signals_df['Actual Returns'] < 0), 'Signal'] = -1

# Review the DataFrame
display(signals_df)

| date | close | Actual Returns | SMA_Fast | SMA_Slow | Signal |
| --- | --- | --- | --- | --- | --- |
| 2015-04-02 13:30:00 | 24.89 | -0.000803 | 24.9075 | 24.289895 | -1.0 |
| 2015-04-02 13:45:00 | 24.93 | 0.001607 | 24.9125 | 24.299895 | 1.0 |
| 2015-04-02 14:00:00 | 24.91 | -0.000802 | 24.9100 | 24.305053 | -1.0 |
| 2015-04-02 14:15:00 | 24.92 | 0.000401 | 24.9125 | 24.310105 | 1.0 |
| 2015-04-02 14:30:00 | 24.92 | 0.000000 | 24.9200 | 24.314947 | 1.0 |
| ... | ... | ... | ... | ... | ... |
| 2021-01-22 09:30:00 | 33.27 | -0.006866 | 33.2025 | 30.471526 | -1.0 |
| 2021-01-22 11:30:00 | 33.35 | 0.002405 | 33.2725 | 30.517737 | 1.0 |
| 2021-01-22 13:45:00 | 33.42 | 0.002099 | 33.3850 | 30.564474 | 1.0 |
| 2021-01-22 14:30:00 | 33.47 | 0.001496 | 33.3775 | 30.611421 | 1.0 |


In [6]:
print(signals_df['Signal'].value_counts())

Signal
 1.0    2371
-1.0    1857
Name: count, dtype: int64


In [7]:
# Calculate the strategy returns and add them to the signals_df DataFrame
signals_df['Strategy Returns'] = signals_df['Actual Returns'] * signals_df['Signal'].shift()

# Review the DataFrame
display(signals_df)

| date | close | Actual Returns | SMA_Fast | SMA_Slow | Signal | Strategy Returns |
| --- | --- | --- | --- | --- | --- | --- |
| 2015-04-02 13:30:00 | 24.89 | -0.000803 | 24.9075 | 24.289895 | -1.0 | NaN |
| 2015-04-02 13:45:00 | 24.93 | 0.001607 | 24.9125 | 24.299895 | 1.0 | -0.001607 |
| 2015-04-02 14:00:00 | 24.91 | -0.000802 | 24.9100 | 24.305053 | -1.0 | -0.000802 |
| 2015-04-02 14:15:00 | 24.92 | 0.000401 | 24.9125 | 24.310105 | 1.0 | -0.000401 |
| 2015-04-02 14:30:00 | 24.92 | 0.000000 | 24.9200 | 24.314947 | 1.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... |
| 2021-01-22 09:30:00 | 33.27 | -0.006866 | 33.2025 | 30.471526 | -1.0 | -0.006866 |
| 2021-01-22 11:30:00 | 33.35 | 0.002405 | 33.2725 | 30.517737 | 1.0 | -0.002405 |
| 2021-01-22 13:45:00 | 33.42 | 0.002099 | 33.3850 | 30.564474 | 1.0 | 0.002099 |
| 2021-01-22 14:30:00 | 33.47 | 0.001496 | 33.3775 | 30.611421 | 1.0 | 0.001496 |
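
The role of `shift()` in the strategy-return calculation can be checked against the first five rows of the output. The sketch below rebuilds only those rows; the small `df` frame and its `Cumulative` column are added for illustration and are not columns in the notebook.

```python
import pandas as pd

index = pd.to_datetime([
    "2015-04-02 13:30", "2015-04-02 13:45", "2015-04-02 14:00",
    "2015-04-02 14:15", "2015-04-02 14:30",
])
df = pd.DataFrame(
    {
        "Actual Returns": [-0.000803, 0.001607, -0.000802, 0.000401, 0.000000],
        "Signal": [-1.0, 1.0, -1.0, 1.0, 1.0],
    },
    index=index,
)

# The signal observed at bar t-1 sets the position held over bar t,
# so each return is multiplied by the previous row's signal.
df["Strategy Returns"] = df["Actual Returns"] * df["Signal"].shift()

# Compound the per-bar returns into a growth-of-1.0 curve.
df["Cumulative"] = (1 + df["Strategy Returns"]).cumprod()

print(df)
```

Without the shift, each return would be multiplied by a signal derived from that same return, which would make every strategy return non-negative and overstate performance.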


In [8]:
# Plot Strategy Returns to examine performance
(1 + signals_df['Strategy Returns']).cumprod().hvplot(title="Baseline Strategy's Cumulative Returns (indexed to 1.0)")



### Step 3: Split the data into training and testing datasets.

In [9]:
# Assign a copy of the SMA_Fast and SMA_Slow columns to a features DataFrame called X,
# shifted by one bar so that each row holds the SMA values known at the prior bar
X = signals_df[['SMA_Fast', 'SMA_Slow']].shift().dropna()

# Review the DataFrame
display(X)

| date | SMA_Fast | SMA_Slow |
| --- | --- | --- |
| 2015-04-02 13:45:00 | 24.9075 | 24.289895 |
| 2015-04-02 14:00:00 | 24.9125 | 24.299895 |
| 2015-04-02 14:15:00 | 24.9100 | 24.305053 |
| 2015-04-02 14:30:00 | 24.9125 | 24.310105 |
| 2015-04-02 14:45:00 | 24.9200 | 24.314947 |
| ... | ... | ... |
| 2021-01-22 09:30:00 | 33.1725 | 30.426789 |
| 2021-01-22 11:30:00 | 33.2025 | 30.471526 |
| 2021-01-22 13:45:00 | 33.2725 | 30.517737 |
| 2021-01-22 14:30:00 | 33.3850 | 30.564474 |


In [10]:
# Create the target set selecting the Signal column and assigning it to y
y = signals_df['Signal']

# Review the value counts
print(y.value_counts())

Signal
 1.0    2371
-1.0    1857
Name: count, dtype: int64


In [11]:
# Select the start of the training period
training_begin = X.index.min()

# Display the training begin date
print(training_begin)

2015-04-02 13:45:00


In [12]:
# Select the ending period for the training data with an offset of 6 months
training_end = X.index.min() + DateOffset(months=6)

# Display the training end date
print(training_end)

2015-10-02 13:45:00


In [13]:
# Generate the X_train DataFrame and the y_train Series
X_train = X.loc[training_begin:training_end]
y_train = y.loc[training_begin:training_end]

# Review X_train and y_train
display(X_train, y_train)

| date | SMA_Fast | SMA_Slow |
| --- | --- | --- |
| 2015-04-02 13:45:00 | 24.9075 | 24.289895 |
| 2015-04-02 14:00:00 | 24.9125 | 24.299895 |
| 2015-04-02 14:15:00 | 24.9100 | 24.305053 |
| 2015-04-02 14:30:00 | 24.9125 | 24.310105 |
| 2015-04-02 14:45:00 | 24.9200 | 24.314947 |
| ... | ... | ... |
| 2015-09-29 15:30:00 | 21.8175 | 21.654947 |
| 2015-09-30 14:45:00 | 21.4800 | 21.631684 |
| 2015-10-02 09:30:00 | 21.2325 | 21.612211 |
| 2015-10-02 10:30:00 | 20.9875 | 21.593579 |


date
2015-04-02 13:45:00    1.0
2015-04-02 14:00:00   -1.0
2015-04-02 14:15:00    1.0
2015-04-02 14:30:00    1.0
2015-04-02 14:45:00    1.0
                      ... 
2015-09-29 15:30:00   -1.0
2015-09-30 14:45:00    1.0
2015-10-02 09:30:00    1.0
2015-10-02 10:30:00    1.0
2015-10-02 11:30:00    1.0
Name: Signal, Length: 280, dtype: float64

In [14]:
# Generate the X_test DataFrame and the y_test Series. Label-based .loc slicing is inclusive at both ends,
# so the one-hour offset keeps any bar stamped exactly at training_end out of the test set.
X_test = X.loc[training_end+DateOffset(hours=1):]
y_test = y.loc[training_end+DateOffset(hours=1):]

# Review the X_test and y_test DataFrames
display(X_test, y_test)

| date | SMA_Fast | SMA_Slow |
| --- | --- | --- |
| 2015-10-02 14:45:00 | 20.94500 | 21.558316 |
| 2015-10-02 15:15:00 | 21.06500 | 21.540947 |
| 2015-10-02 15:30:00 | 21.20975 | 21.525568 |
| 2015-10-02 15:45:00 | 21.34725 | 21.510516 |
| 2015-10-05 09:45:00 | 21.42725 | 21.495463 |
| ... | ... | ... |
| 2021-01-22 09:30:00 | 33.17250 | 30.426789 |
| 2021-01-22 11:30:00 | 33.20250 | 30.471526 |
| 2021-01-22 13:45:00 | 33.27250 | 30.517737 |
| 2021-01-22 14:30:00 | 33.38500 | 30.564474 |


date
2015-10-02 14:45:00    1.0
2015-10-02 15:15:00    1.0
2015-10-02 15:30:00    1.0
2015-10-02 15:45:00   -1.0
2015-10-05 09:45:00    1.0
                      ... 
2021-01-22 09:30:00   -1.0
2021-01-22 11:30:00    1.0
2021-01-22 13:45:00    1.0
2021-01-22 14:30:00    1.0
2021-01-22 15:45:00   -1.0
Name: Signal, Length: 3947, dtype: float64
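
The reason for adding an offset to the start of the test slice can be seen in a small sketch. Label-based `.loc` slices include both endpoints, so a bar stamped exactly at `training_end` would land in both sets. The three-row series below, including the bar at 13:45, is hypothetical and chosen only to trigger that boundary case.

```python
import pandas as pd
from pandas.tseries.offsets import DateOffset

# Hypothetical bars around the split point; one falls exactly on training_end.
index = pd.to_datetime(["2015-10-02 11:30", "2015-10-02 13:45", "2015-10-02 14:45"])
y = pd.Series([1.0, -1.0, 1.0], index=index)
training_end = pd.Timestamp("2015-10-02 13:45")

# .loc slicing by label is inclusive at both ends.
train = y.loc[:training_end]
overlapping_test = y.loc[training_end:]

# Starting the test slice one hour later removes the shared row.
offset_test = y.loc[training_end + DateOffset(hours=1):]

print(len(train), len(overlapping_test), len(offset_test))
```

In the notebook's data no bar happens to fall exactly at `training_end`, so the offset changes nothing there, but it guards against leakage when the window is tuned to other lengths.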

In [15]:
# Scale the features DataFrames

# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the scaler to the X_train data only
X_scaler = scaler.fit(X_train)

# Transform the X_train and X_test DataFrames using the X_scaler
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
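
A minimal sketch of why the scaler is fitted on the training data only: the test features are transformed with the training mean and standard deviation, so no information from the test period enters the model. The tiny arrays below are made-up values for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature values: three training rows and one later test row.
X_train = np.array([[24.0, 23.0], [25.0, 24.0], [26.0, 25.0]])
X_test = np.array([[30.0, 29.0]])

# Learn the mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_train)

# Apply those training statistics to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)      # per-column training means
print(X_test_scaled)     # test row expressed in training standard deviations
```

A consequence is that test values can sit far outside the range seen in training, as they do here; with price-level features such as SMAs this is likely whenever the market trends away from its training-period level.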

### Step 4: Use the `SVC` classifier model from SKLearn's support vector machine (SVM) learning method to fit the training data and make predictions based on the testing data. Review the predictions.

In [16]:
# Instantiate an SVC classifier from scikit-learn's SVM module.
# An SVM separates the classes with a hyperplane; per the instructor, it can be somewhat more effective than a logistic regression model.
# The RBF kernel is used because "It's been shown that the linear kernel is a degenerate version of RBF, hence the linear kernel is never more accurate than a properly tuned RBF kernel,"
# cf. https://stats.stackexchange.com/questions/73032/linear-kernel-and-non-linear-kernel-for-support-vector-machine
svm_model = SVC(kernel='rbf')
 
# Fit the model to the data using the training data
svm_model = svm_model.fit(X_train_scaled, y_train)
 
# Use the testing data to make the model predictions
svm_test_pred = svm_model.predict(X_test_scaled)

# Review the model's predicted values
display(svm_test_pred[:30])

print("Predicted proportions in the test data are much different from the target 'y' labels, or the naive trading Strategy signals:")
pd.DataFrame(svm_test_pred).value_counts()

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

Predicted proportions in the test data are much different from the target 'y' labels, or the naive trading Strategy signals:


 1.0    3859
-1.0      88
Name: count, dtype: int64

### Step 5: Review the classification report associated with the `SVC` model predictions. 

In [17]:
# Use a classification report to evaluate the model using the predictions and testing data
svm_testing_report = classification_report(y_test, svm_test_pred)

# Print the classification report
print(svm_testing_report)

              precision    recall  f1-score   support

        -1.0       0.45      0.02      0.04      1733
         1.0       0.56      0.98      0.71      2214

    accuracy                           0.56      3947
   macro avg       0.51      0.50      0.38      3947
weighted avg       0.51      0.56      0.42      3947
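
One way to put the 0.56 accuracy in context is to compare it with a rule that always predicts `1.0`. The sketch below uses only the support counts printed in the report; the variable names are chosen for illustration.

```python
# Support counts taken from the classification report.
support_short = 1733   # test rows labelled -1.0
support_long = 2214    # test rows labelled 1.0

total = support_short + support_long

# Accuracy of a rule that predicts 1.0 for every row.
always_long_accuracy = support_long / total

print(total, round(always_long_accuracy, 2))
```

The always-long rule scores about the same accuracy as the SVC, which is consistent with the model predicting `1.0` for 3,859 of the 3,947 test rows and with its recall of 0.02 on the `-1.0` class.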



### Step 6: Create a predictions DataFrame that contains columns for “Predicted” values, “Actual Returns”, and “Strategy Returns”.

In [18]:
# Create a new, empty predictions DataFrame that shares the index of y_test
predictions_df = pd.DataFrame(index=y_test.index)
#display(predictions_df)

# Add the SVM model predictions to the DataFrame
predictions_df['Predicted'] = svm_test_pred

# Add the actual returns to the DataFrame
predictions_df['Actual Returns'] = signals_df['Actual Returns']

# Add the strategy returns to the DataFrame
predictions_df['Strategy Returns_ML_3'] = predictions_df['Actual Returns'] * predictions_df['Predicted']

# Review the DataFrame
display(predictions_df)

| date | Predicted | Actual Returns | Strategy Returns_ML_3 |
| --- | --- | --- | --- |
| 2015-10-02 14:45:00 | 1.0 | 0.009000 | 0.009000 |
| 2015-10-02 15:15:00 | 1.0 | 0.008873 | 0.008873 |
| 2015-10-02 15:30:00 | 1.0 | 0.000047 | 0.000047 |
| 2015-10-02 15:45:00 | 1.0 | -0.002792 | -0.002792 |
| 2015-10-05 09:45:00 | 1.0 | 0.013532 | 0.013532 |
| ... | ... | ... | ... |
| 2021-01-22 09:30:00 | 1.0 | -0.006866 | -0.006866 |
| 2021-01-22 11:30:00 | 1.0 | 0.002405 | 0.002405 |
| 2021-01-22 13:45:00 | 1.0 | 0.002099 | 0.002099 |
| 2021-01-22 14:30:00 | 1.0 | 0.001496 | 0.001496 |


### Step 7: Create a cumulative return plot that shows the actual returns vs. the strategy returns. Save a PNG image of this plot.

In [19]:
# Plot the actual returns versus the strategy returns
returns_mod_parameters_combined_plot = (predictions_df[['Actual Returns', 'Strategy Returns_ML_3']]+1).cumprod().hvplot(title='SVM Algo Trading Model: Increased Training Period from Baseline 3 months to 6 months AND \n decreased SMA_Slow Window from Baseline 100 to 95', fontscale=0.9)
hv.save(returns_mod_parameters_combined_plot, 'Images/algo_trading_svm_model_returns_mod_parameters_combined_plot.png', fmt='png')
returns_mod_parameters_combined_plot
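
If exporting the HoloViews plot to PNG is not available in a given environment (with the Bokeh backend it may require extra packages such as a browser driver), the same cumulative-return chart can be saved with Matplotlib, which the notebook already imports. This is a sketch under stated assumptions: the returns are randomly generated stand-ins for `predictions_df`, and the output path in the temporary directory is hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in for the notebook's predictions_df (assumption).
rng = np.random.default_rng(seed=0)
index = pd.date_range("2015-10-02", periods=250, freq="B")
predictions_df = pd.DataFrame(
    {"Actual Returns": rng.normal(0, 0.01, size=250)}, index=index
)
predictions_df["Strategy Returns"] = (
    predictions_df["Actual Returns"] * rng.choice([-1.0, 1.0], size=250)
)

# Compound both return series into growth-of-1.0 curves and plot them together.
cumulative = (1 + predictions_df).cumprod()
ax = cumulative.plot(title="Cumulative Returns: Actual vs. Strategy (synthetic data)")

# Save the figure as a PNG (hypothetical location), then release the figure.
output_path = os.path.join(tempfile.gettempdir(), "cumulative_returns_sketch.png")
ax.figure.savefig(output_path)
plt.close(ax.figure)
```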

