# Preparing Data for a Machine Learning Trading Strategy

## Background

Before you incorporate machine learning into a trading algorithm, you need to prepare the data that you will use to fit the model.

In this activity, you’ll prepare training and testing data for fitting a machine learning-powered trading algorithm.

## Instructions

1. Read the OHLCV data provided in the CSV file into a Pandas DataFrame.

    > **Hint:** Remember to set the `date` column as the DataFrame index and parse the dates.

2. Use the `pct_change` function to add a daily returns column to the DataFrame. Name this column `actual_returns`.

    > **Hint:** Remove `NaN` values from the DataFrame.

3. Generate the features and target set as follows:

    * Set a short and long window size of 4 and 100 days, respectively, and add the fast and slow simple moving average columns to the DataFrame.

      > **Hint:** Remove `NaN` values from the DataFrame.

    * Create the features set by copying the `sma_fast` and `sma_slow` columns to a new DataFrame called `X`.

    * Add a `signal` column to the DataFrame and initialize its values to zero.

    * Use the Pandas `loc` function to populate the `signal` column as follows: where the `actual_returns` value is greater than or equal to zero, set the `signal` value to 1. Where the `actual_returns` value is less than zero, set the `signal` value to -1.

    * Create the target set `y` by copying the values of the `signal` column.

4. Split the data into training and testing sets as follows:

    * Use the Pandas `DateOffset` class to set the beginning and end dates for the training and testing sets.

    * Set the `training_begin` date to the minimum date in the dataset.

    * Set the `training_end` date to `training_begin` plus an offset of 3 months.

    * Use the `loc` function to generate the training datasets using the `training_begin` and `training_end` dates as lower and upper limits.

    * Create the testing sets using the `loc` function to slice the index starting at the `training_end` value and ending at the last record of the datasets.

5. Use `StandardScaler` to standardize the data: fit the scaler on the training features, then use it to transform both the training and testing features.

In [4]:
# Imports
import pandas as pd
from pathlib import Path

## Read the CSV file into Pandas DataFrame

In [6]:
# Import the OHLCV dataset into a Pandas DataFrame
trading_df = pd.read_csv(
    Path("../Resources/ohlcv.csv"),
    index_col="date",
    parse_dates=True
)

# Review the DataFrame
trading_df.head()
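
To check how `index_col` and `parse_dates` behave without the real file, the following minimal sketch reads a small in-memory CSV. The two rows of data and the names `csv_text` and `sample_df` are made up for illustration and are not taken from `ohlcv.csv`.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real OHLCV file
csv_text = """date,open,high,low,close,volume
2018-10-19 09:30:00,16.90,17.18,16.90,17.10,11522
2018-10-19 09:45:00,17.11,17.44,17.11,17.40,70593
"""

sample_df = pd.read_csv(
    io.StringIO(csv_text),
    index_col="date",  # use the date column as the row index
    parse_dates=True,  # convert the index values to datetimes
)

# The index should now be a DatetimeIndex rather than plain strings
print(type(sample_df.index).__name__)
print(sample_df.head())
```

A datetime index is what later allows the data to be sliced by date with `loc`.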

## Add a Daily Return Values Column to the DataFrame

In [None]:
# Calculate the daily returns using the closing prices and the pct_change function
trading_df["actual_returns"] = trading_df["close"].pct_change()

# Drop all NaN values from the DataFrame
trading_df = trading_df.dropna()

# Review the DataFrame
display(trading_df.head())
display(trading_df.tail())
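
The hint about removing `NaN` values follows from how `pct_change` works: the first row has no previous value to compare against, so its return is undefined. A small sketch with made-up closing prices (the `prices` DataFrame is an assumption for illustration only):

```python
import pandas as pd

# Made-up closing prices, not taken from the activity's dataset
prices = pd.DataFrame({"close": [10.0, 11.0, 9.9, 9.9]})

# Each return is (current close / previous close) - 1
prices["actual_returns"] = prices["close"].pct_change()

# The first row has no previous close, so its return is NaN
first_is_nan = pd.isna(prices["actual_returns"].iloc[0])

# dropna() removes that first row
cleaned = prices.dropna()

print(prices)
print(len(prices), "rows before dropna,", len(cleaned), "rows after")
```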

## Generate the Features and Target Sets

### Add the Fast and Slow Simple Moving Average Columns to the DataFrame

In [None]:
# Define a window size of 4
short_window = 4

# Create a simple moving average (SMA) using the short_window and assign this to a new column called sma_fast
trading_df['sma_fast'] = trading_df['close'].rolling(window=short_window).mean()

In [None]:
# Define a window size of 100
long_window = 100

# Create a simple moving average (SMA) using the long_window and assign this to a new column called sma_slow
trading_df['sma_slow'] = trading_df['close'].rolling(window=long_window).mean()

In [None]:
# Drop the NaNs using dropna()
trading_df = trading_df.dropna()
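
A rolling mean is undefined until a full window of values is available, so a window of size `n` leaves `n - 1` leading `NaN` rows. With the 100-period slow window, that means the first 99 rows are dropped. The sketch below shows the effect on a short made-up series (the names `close`, `window`, and `sma` are illustrative assumptions):

```python
import pandas as pd

# Made-up closing prices for illustration
close = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

window = 4
sma = close.rolling(window=window).mean()

# The first window - 1 entries cannot be computed and are NaN
nan_count = int(sma.isna().sum())

print(sma)
print("Leading NaN rows:", nan_count)
```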

### Create the features set

In [None]:
# Assign a copy of the sma_fast and sma_slow columns to a new DataFrame called X
X = trading_df[['sma_fast', 'sma_slow']].copy()

# Display sample data
display(X.head())
display(X.tail())

| date | sma_fast | sma_slow |
| --- | --- | --- |
| 2018-10-24 15:00:00 | 15.6525 | 16.3403 |
| 2018-10-24 15:15:00 | 15.61875 | 16.3216 |
| 2018-10-24 15:30:00 | 15.55375 | 16.3029 |
| 2018-10-24 15:45:00 | 15.47625 | 16.2844 |
| 2018-10-25 09:30:00 | 15.4025 | 16.2656 |


| date | sma_fast | sma_slow |
| --- | --- | --- |
| 2020-09-04 14:45:00 | 6.22875 | 6.2703 |
| 2020-09-04 15:00:00 | 6.23875 | 6.26985 |
| 2020-09-04 15:15:00 | 6.25125 | 6.2691 |
| 2020-09-04 15:30:00 | 6.2575 | 6.26855 |
| 2020-09-04 15:45:00 | 6.2575 | 6.26785 |


### Create the target set

In [None]:
# Create a new column in trading_df called signal, setting its value to zero
trading_df['signal'] = 0.0

In [None]:
# Create the signal to buy
trading_df.loc[(trading_df['actual_returns'] >= 0), 'signal'] = 1

In [None]:
# Create the signal to sell
trading_df.loc[(trading_df['actual_returns'] < 0), 'signal'] = -1

In [None]:
# Copy the new signal column to a new Series called y
y = trading_df['signal'].copy()
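
The labeling rule can be checked on a few made-up returns. In this sketch (the `df` and `y_demo` names and the values are assumptions for illustration), note that a return of exactly zero is labeled 1 because the buy condition uses "greater than or equal to".

```python
import pandas as pd

# Made-up returns: positive, negative, exactly zero, negative
df = pd.DataFrame({"actual_returns": [0.02, -0.01, 0.0, -0.03]})

# Start with a neutral signal everywhere
df["signal"] = 0.0

# Boolean masks passed to loc select the rows to overwrite
df.loc[df["actual_returns"] >= 0, "signal"] = 1
df.loc[df["actual_returns"] < 0, "signal"] = -1

# copy() makes the target independent of later changes to df
y_demo = df["signal"].copy()

print(df)
```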

## Split the Data Into Training and Testing Datasets

### Set the Training Begin and End Dates

In [None]:
# Imports 
from pandas.tseries.offsets import DateOffset

In [None]:
# Select the start of the training period
training_begin = X.index.min()

# Display the training begin date
print(training_begin)

2018-10-24 15:00:00


In [None]:
# Select the ending period for the training data with an offset of 3 months
training_end = X.index.min() + DateOffset(months=3)

# Display the training end date
print(training_end)

2019-01-24 15:00:00


### Create the Training Datasets

In [None]:
# Generate the X_train and y_train datasets
X_train = X.loc[training_begin:training_end]
y_train = y.loc[training_begin:training_end]

# Display sample data
X_train.head()

| date | sma_fast | sma_slow |
| --- | --- | --- |
| 2018-10-24 15:00:00 | 15.6525 | 16.3403 |
| 2018-10-24 15:15:00 | 15.61875 | 16.3216 |
| 2018-10-24 15:30:00 | 15.55375 | 16.3029 |
| 2018-10-24 15:45:00 | 15.47625 | 16.2844 |
| 2018-10-25 09:30:00 | 15.4025 | 16.2656 |


### Create the Testing Datasets

In [None]:
# Generate the X_test and y_test datasets
X_test = X.loc[training_end:]
y_test = y.loc[training_end:]

# Display sample data
X_test.head()

| date | sma_fast | sma_slow |
| --- | --- | --- |
| 2019-01-24 15:00:00 | 14.16 | 14.29466 |
| 2019-01-24 15:15:00 | 14.16 | 14.29036 |
| 2019-01-24 15:30:00 | 14.1575 | 14.28666 |
| 2019-01-24 15:45:00 | 14.1525 | 14.28161 |
| 2019-01-25 09:30:00 | 14.175 | 14.27791 |
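
One detail worth knowing about this split: label-based slicing with `loc` includes both endpoints. A row stamped exactly at `training_end` therefore falls into both the training and the testing sets, which is consistent with the testing sample above starting at the `training_end` timestamp. The sketch below demonstrates this on a made-up monthly index and shows one possible way to exclude the boundary row. The names `X_demo`, `begin`, `end`, and `test_no_overlap` are illustrative assumptions, not part of the activity.

```python
import pandas as pd
from pandas.tseries.offsets import DateOffset

# Made-up monthly index: 2018-10-01 through 2019-03-01
idx = pd.date_range("2018-10-01", periods=6, freq="MS")
X_demo = pd.DataFrame({"feature": range(6)}, index=idx)

begin = X_demo.index.min()
end = begin + DateOffset(months=3)  # 2019-01-01

# loc slices by label and includes both the start and end labels
train = X_demo.loc[begin:end]
test = X_demo.loc[end:]

# The boundary row appears in both slices
overlap = train.index.intersection(test.index)

# One option: keep only rows strictly after the boundary in the test set
test_no_overlap = X_demo.loc[X_demo.index > end]

print("Train rows:", len(train), "Test rows:", len(test))
print("Shared timestamps:", list(overlap))
```

Whether the single shared row matters depends on the use case. The activity as written keeps it in both sets.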


## Standardize the Data

In [None]:
# Imports
from sklearn.preprocessing import StandardScaler

In [None]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Apply the scaler model to fit the X_train data
X_scaler = scaler.fit(X_train)

# Transform the X_train and X_test DataFrames using the X_scaler
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
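
The scaler is fit on the training features only, and the same fitted scaler is then applied to the testing features, so that statistics from the testing period do not influence the transformation. A minimal sketch on made-up values (the `train` and `test` frames are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up feature values for illustration
train = pd.DataFrame({"sma_fast": [1.0, 2.0, 3.0], "sma_slow": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"sma_fast": [4.0], "sma_slow": [40.0]})

# Learn the per-column mean and standard deviation from the training data only
scaler = StandardScaler().fit(train)

# Apply the training statistics to both sets; transform returns NumPy arrays
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print("Training means used for scaling:", scaler.mean_)
print("Scaled training column means:", train_scaled.mean(axis=0))
print("Scaled test row:", test_scaled)
```

After scaling, each training column has a mean of approximately 0 and a standard deviation of approximately 1. The testing values are expressed relative to the training distribution, so they need not be centered on 0.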