# Predictive Analysis of Robinhood Popularity Data - Pre-Processing

Purpose: This module picks up from Part Two, where we explored the data and added features to the dataset.

In Part Three (Pre-Processing), our goal is to prepare the data for our machine learning models.

This may involve creating dummy features where appropriate, scaling the dataset, and splitting the data into training and test sets.

We also want to reuse some techniques from our EDA work to check whether there are any issues (such as collinearity) in the features.
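As a rough illustration of the collinearity check mentioned above, the sketch below flags pairs of numeric columns whose absolute Pearson correlation exceeds a threshold. The helper name `high_correlation_pairs`, the 0.9 threshold, and the toy columns are assumptions made for this example; they are not taken from the earlier parts of the project.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, corr) for numeric column pairs with |corr| >= threshold."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, float(c)) for (a, b), c in pairs.items()]

# Toy data (hypothetical): 'b' is an exact linear function of 'a', 'c' is unrelated noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": np.arange(50, dtype=float)})
toy["b"] = 2 * toy["a"] + 1
toy["c"] = rng.normal(size=50)

flagged = high_correlation_pairs(toy, threshold=0.9)
print(flagged)
```

On the real dataset, the lag and moving-average features are natural candidates for this kind of check, since they are derived from the same underlying series.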

In [1]:
# Importing modules for data pre-processing

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

In [2]:
# Let's read in our new dataset from Part Two

filepath = '../data/stock_data_new.csv'
df = pd.read_csv(filepath)
print(df.head())

# Let's also get a list of the stock basket tickers
filepath = '../data/stock_info.csv'
stock = pd.read_csv(filepath)
tickers = stock['Ticker'].tolist()
print(tickers)

   Unnamed: 0        Date  Robinhood      Price    Volume Ticker Company  \
0           0  2018-07-02   150897.0  46.794998  70925200   AAPL   Apple   
1           1  2018-07-03   151073.0  45.980000  55819200   AAPL   Apple   
2           2  2018-07-05   151258.0  46.349998  66416800   AAPL   Apple   
3           3  2018-07-06   151150.0  46.992500  69940800   AAPL   Apple   
4           4  2018-07-09   150664.0  47.645000  79026400   AAPL   Apple   

   Percentage_Volume  ExPost_PriceChange_1D  ExPost_PriceChange_5D  ...  \
0           0.002128                    NaN                    NaN  ...   
1           0.002706              -0.017416                    NaN  ...   
2           0.002277               0.008047                    NaN  ...   
3           0.002161               0.013862                    NaN  ...   
4           0.001907               0.013885                    NaN  ...   

   ExPost_PriceChange_1D_Lag2  ExPost_PriceChange_1D_Lag3  \
0                         NaN  

In [3]:
#Checking for null values

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10060 entries, 0 to 10059
Data columns (total 43 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Unnamed: 0                  10060 non-null  int64  
 1   Date                        10060 non-null  object 
 2   Robinhood                   10060 non-null  float64
 3   Price                       10060 non-null  float64
 4   Volume                      10060 non-null  int64  
 5   Ticker                      10060 non-null  object 
 6   Company                     10060 non-null  object 
 7   Percentage_Volume           10060 non-null  float64
 8   ExPost_PriceChange_1D       10040 non-null  float64
 9   ExPost_PriceChange_5D       9960 non-null   float64
 10  ExPost_PriceChange_10D      9860 non-null   float64
 11  ExAnte_PriceChange_1D       10040 non-null  float64
 12  ExAnte_PriceChange_3D       10000 non-null  float64
 13  ExAnte_PriceChange_5D       996

The null values are explainable by the feature engineering we used in Part Two.

(1) ExPost_PriceChange_1D, ExPost_PriceChange_5D, ExPost_PriceChange_10D will have null values at the beginning of the time series because of the data required to make this calculation.

(2) ExAnte_PriceChange_1D, ExAnte_PriceChange_3D, ExAnte_PriceChange_5D will have null values at the end of the time series because of the data required to make this calculation.

(3) Lag Features (Robinhood_LagX, Percentage_Volume_LagY, ExPost_PriceChange_1D_LagZ) have starting null values because of the lagging nature.

(4) Simple moving averages (SMA_3D, SMA_5D, SMA_10D) and Expanded_Mean will have null values at the beginning of the time series because of the data required to make this calculation.

Since these represent only a small share of the datapoints (300 of 10,060 rows, about 3%), we are comfortable dropping the rows that contain null values.
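To make the four cases above concrete, here is a small sketch on a toy price series showing where backward-looking changes, forward-looking changes, lags, and rolling means leave null values. The column names and formulas are hypothetical stand-ins; the exact feature definitions from Part Two are not reproduced here.

```python
import pandas as pd

# Toy price series for a single stock (hypothetical values)
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], name="Price")

toy = pd.DataFrame({
    "Price": prices,
    "Change_1D": prices.pct_change(1),             # backward-looking: null in the first row
    "Forward_1D": prices.pct_change(1).shift(-1),  # forward-looking: null in the last row
    "Lag2": prices.shift(2),                       # lag feature: null in the first two rows
    "SMA_3D": prices.rolling(3).mean(),            # 3-day moving average: null in the first two rows
})

null_counts = toy.isna().sum()
clean = toy.dropna()
dropped_share = 1 - len(clean) / len(toy)

print(null_counts)
print(f"Rows kept: {len(clean)} of {len(toy)}")
```

The dropped share is large here only because the toy series is six rows long; on a long series the same handful of edge rows per stock is a small fraction.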

In [4]:
df.dropna(axis=0, inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9760 entries, 10 to 10054
Data columns (total 43 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Unnamed: 0                  9760 non-null   int64  
 1   Date                        9760 non-null   object 
 2   Robinhood                   9760 non-null   float64
 3   Price                       9760 non-null   float64
 4   Volume                      9760 non-null   int64  
 5   Ticker                      9760 non-null   object 
 6   Company                     9760 non-null   object 
 7   Percentage_Volume           9760 non-null   float64
 8   ExPost_PriceChange_1D       9760 non-null   float64
 9   ExPost_PriceChange_5D       9760 non-null   float64
 10  ExPost_PriceChange_10D      9760 non-null   float64
 11  ExAnte_PriceChange_1D       9760 non-null   float64
 12  ExAnte_PriceChange_3D       9760 non-null   float64
 13  ExAnte_PriceChange_5D       976

Here is a summary of the dataset we're working with.

Number of observations: 9,760

Number of Features: 32

Feature Names: Robinhood, Ticker, Percentage_Volume, Year, Month, Day, Day_of_Week, Robinhood_LagX (1...7), Percentage_Volume_LagY (1...7), ExPost_PriceChange_1D_LagZ (1...7), SMA_3D, SMA_5D, SMA_10D, Expanded_Mean

Number of Target Variables: 6

Names of Target Variables: ExPost_PriceChange_1D, ExPost_PriceChange_5D, ExPost_PriceChange_10D, ExAnte_PriceChange_1D, ExAnte_PriceChange_3D, ExAnte_PriceChange_5D

Let's rearrange our dataframe with the features on the left and the target variables on the right, and drop any extraneous variables.

In [6]:
feature_names = ['Robinhood', 'Ticker', 'Percentage_Volume', 'Year', 'Month', 'Day', 'Day_of_Week',
                 'Robinhood_Lag1', 'Robinhood_Lag2', 'Robinhood_Lag3', 'Robinhood_Lag4', 'Robinhood_Lag5', 'Robinhood_Lag6', 'Robinhood_Lag7',
                 'Percentage_Volume_Lag1', 'Percentage_Volume_Lag2', 'Percentage_Volume_Lag3', 'Percentage_Volume_Lag4', 'Percentage_Volume_Lag5', 'Percentage_Volume_Lag6', 'Percentage_Volume_Lag7',
                 'ExPost_PriceChange_1D_Lag1', 'ExPost_PriceChange_1D_Lag2', 'ExPost_PriceChange_1D_Lag3', 'ExPost_PriceChange_1D_Lag4', 'ExPost_PriceChange_1D_Lag5', 'ExPost_PriceChange_1D_Lag6', 'ExPost_PriceChange_1D_Lag7', 
                 'SMA_3D', 'SMA_5D', 'SMA_10D', 'Expanded_Mean']

target_names = ['ExPost_PriceChange_1D', 'ExPost_PriceChange_5D', 'ExPost_PriceChange_10D',
                'ExAnte_PriceChange_1D', 'ExAnte_PriceChange_3D', 'ExAnte_PriceChange_5D']

df_new = df[feature_names + target_names].reset_index(drop=True)
df_new.head()

Unnamed: 0,Robinhood,Ticker,Percentage_Volume,Year,Month,Day,Day_of_Week,Robinhood_Lag1,Robinhood_Lag2,Robinhood_Lag3,...,SMA_3D,SMA_5D,SMA_10D,Expanded_Mean,ExPost_PriceChange_1D,ExPost_PriceChange_5D,ExPost_PriceChange_10D,ExAnte_PriceChange_1D,ExAnte_PriceChange_3D,ExAnte_PriceChange_5D
0,150065.0,AAPL,0.002415,2018,7,17,1,150087.0,150117.0,150592.0,...,0.000735,0.001199,0.002316,0.002316,0.002829,0.005779,0.022812,-0.005484,-5.2e-05,0.008096
1,150523.0,AAPL,0.002295,2018,7,18,2,150065.0,150087.0,150117.0,...,-0.001617,0.002697,0.00351,0.001607,-0.005484,0.013413,0.035233,0.007773,0.006355,0.023214
2,150750.0,AAPL,0.001858,2018,7,19,3,150523.0,150065.0,150087.0,...,0.001706,0.000899,0.003482,0.002121,0.007773,0.00445,0.034952,-0.002293,0.005837,0.012143
3,150839.0,AAPL,0.001824,2018,7,20,4,150750.0,150523.0,150065.0,...,-1e-06,0.000126,0.001867,0.001782,-0.002293,0.000575,0.01846,0.000888,0.017656,-0.002403
4,151029.0,AAPL,0.002361,2018,7,23,0,150839.0,150750.0,150523.0,...,0.002123,0.000742,0.000567,0.001718,0.000888,0.003667,0.005405,0.007254,0.013569,-0.008872


Let's also convert our stock tickers categorical data to dummy variables.

Although this is not necessary for some machine learning models, such as decision trees, we want every model to use the same dataset in case we want to standardize this pre-processing in the future.

In [7]:
df_new = pd.get_dummies(df_new)

df_new.head()

Unnamed: 0,Robinhood,Percentage_Volume,Year,Month,Day,Day_of_Week,Robinhood_Lag1,Robinhood_Lag2,Robinhood_Lag3,Robinhood_Lag4,...,Ticker_NKE,Ticker_NVDA,Ticker_PYPL,Ticker_SNAP,Ticker_SQ,Ticker_T,Ticker_TSLA,Ticker_TWTR,Ticker_V,Ticker_ZNGA
0,150065.0,0.002415,2018,7,17,1,150087.0,150117.0,150592.0,150575.0,...,0,0,0,0,0,0,0,0,0,0
1,150523.0,0.002295,2018,7,18,2,150065.0,150087.0,150117.0,150592.0,...,0,0,0,0,0,0,0,0,0,0
2,150750.0,0.001858,2018,7,19,3,150523.0,150065.0,150087.0,150117.0,...,0,0,0,0,0,0,0,0,0,0
3,150839.0,0.001824,2018,7,20,4,150750.0,150523.0,150065.0,150087.0,...,0,0,0,0,0,0,0,0,0,0
4,151029.0,0.002361,2018,7,23,0,150839.0,150750.0,150523.0,150065.0,...,0,0,0,0,0,0,0,0,0,0
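For reference, a minimal sketch of the same encoding on a toy frame. It passes `columns=` to limit the encoding to the ticker column and `dtype=int` to request 0/1 values, since recent pandas versions return booleans by default. The toy tickers and values are made up for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values) with one categorical column
toy = pd.DataFrame({
    "Ticker": ["AAPL", "TSLA", "AAPL"],
    "Robinhood": [150065.0, 98000.0, 150523.0],
})

# columns= restricts encoding to the named column; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["Ticker"], dtype=int)
print(encoded.columns.tolist())
```

The original column is removed and one `Ticker_<symbol>` column per category is appended on the right, which matches the layout shown in the output above.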


We're now ready to split the data into training and test sets.

To prevent any training data from leaking into the test data, we should now split the data before any data normalization.

Since this is a time-series and again to prevent training data from leaking into the test data, we need to do a contiguous split (no shuffling).

In [8]:
Total_Observations = df_new.shape[0]
Total_Stocks = 20
# Integer division: this assumes every stock has the same number of trading days
Unique_Days = Total_Observations // Total_Stocks

# Let's do a 75/25 split of the data
Split = int(Unique_Days * .75)

# Collecting each stock's train/test slices in lists
train_parts = []
test_parts = []

# For each stock, we will do a contiguous split of the data

for stock in tickers:
    
    column_name = 'Ticker_' + stock
    stock_rows = df_new[df_new[column_name]==1]
    
    # A contiguous split of the training and test dataframes
    train_parts.append(stock_rows.iloc[:Split])
    test_parts.append(stock_rows.iloc[Split:])

# DataFrame.append was removed in pandas 2.0, so we combine the slices with pd.concat
df_train = pd.concat(train_parts)
df_test = pd.concat(test_parts)
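The split above relies on every stock having the same number of rows. As an alternative sketch, not the approach used in this notebook, the helper below computes the cutoff per group, so it also works when groups differ in length. It assumes a single ticker column is still present (that is, it would run before one-hot encoding) and that rows are already in date order within each ticker; `chronological_split` is a hypothetical name.

```python
import pandas as pd

def chronological_split(frame, group_col, train_frac=0.75):
    """Split each group's rows, in their existing order, into a leading train part and a trailing test part."""
    # Position of each row within its group (0, 1, 2, ...)
    position = frame.groupby(group_col).cumcount()
    # Number of rows in the group each row belongs to
    group_size = frame[group_col].map(frame[group_col].value_counts())
    is_train = position < (group_size * train_frac).astype(int)
    return frame[is_train], frame[~is_train]

# Toy data (hypothetical): two tickers with different numbers of days
toy = pd.DataFrame({
    "Ticker": ["A"] * 8 + ["B"] * 4,
    "Day": list(range(8)) + list(range(4)),
})

train, test = chronological_split(toy, "Ticker")
print(train.groupby("Ticker")["Day"].max())
print(test.groupby("Ticker")["Day"].min())
```

Within each ticker, every training row precedes every test row, which is the property the contiguous split is meant to guarantee.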


Let's check the shapes of the training and test dataframes to see if they make sense.

In [9]:
print('Training dataframe shape: ' + str(df_train.shape))
print('Training size percentage: ' + str(df_train.shape[0]/df_new.shape[0]))

print('Test dataframe shape: ' + str(df_test.shape))
print('Test size percentage: ' + str(df_test.shape[0]/df_new.shape[0]))

Training dataframe shape: (7320, 57)
Training size percentage: 0.75
Test dataframe shape: (2440, 57)
Test size percentage: 0.25


Looks right, so we just need to split the training and test dataframes into the explanatory (X) and target (y) columns.

In [10]:
X_train = df_train.drop(target_names, axis=1)
y_train = df_train[target_names]

X_test = df_test.drop(target_names, axis=1)
y_test = df_test[target_names]

# Let's do a final shape check of the train/test split data
print('X training shape: ' + str(X_train.shape))
print('y training shape: ' + str(y_train.shape))

print('X test shape: ' + str(X_test.shape))
print('y test shape: ' + str(y_test.shape))

X training shape: (7320, 51)
y training shape: (7320, 6)
X test shape: (2440, 51)
y test shape: (2440, 6)


Now we're ready to apply the Standard Scaler to our training data.

However, we don't need to apply it to our dummy variables so let's use a Column Transformer which excludes the transformation on the dummy columns.

In [19]:
# This column transformer applies standard scaling to our non-dummy columns and passes the dummy columns through

# The non-dummy columns run from position 0 through 'Expanded_Mean' (inclusive)
last_numeric = X_train.columns.get_loc('Expanded_Mean')
numeric_columns = list(range(0, last_numeric + 1))

transformer = ColumnTransformer(transformers = [('standard', StandardScaler(), numeric_columns)],
                                remainder = 'passthrough')

# First, fitting the transformer to our training data only
transformer.fit(X_train)

# Finally, use the same fitted transformer to scale our training and test data
# (calling fit_transform on the test data would refit the scaler on test statistics)
X_train_scaled = transformer.transform(X_train)
X_test_scaled = transformer.transform(X_test)
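To illustrate why the scaler's statistics should come from the training rows only, here is a toy sketch with one numeric column and one dummy column. The frames and column names are made up for this example. The test rows are scaled with the training mean and standard deviation, so they are not centered at zero themselves.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frames (hypothetical): 'x' is numeric, 'dummy' is a 0/1 indicator
train = pd.DataFrame({"x": [1.0, 2.0, 3.0], "dummy": [0, 1, 0]})
test = pd.DataFrame({"x": [4.0, 5.0], "dummy": [1, 0]})

ct = ColumnTransformer([("standard", StandardScaler(), ["x"])], remainder="passthrough")

train_scaled = ct.fit_transform(train)  # statistics learned from training rows only
test_scaled = ct.transform(test)        # reuses the training mean and standard deviation

# The fitted scaler is available by the name given in the transformer list
train_mean = ct.named_transformers_["standard"].mean_[0]

print(train_scaled)
print(test_scaled)
```

Note that `ColumnTransformer` places the transformed columns first and the passed-through columns after them, so the output column order matches the input only when the scaled columns already come first.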

For the sake of this capstone project, and to keep the notebooks for each process separate, we're going to export the four datasets (the scaled X_train and X_test, plus y_train and y_test) as CSV files for the next part, data modeling.

In practice this export step is usually unnecessary, because the pre-processing would be integrated into the data modeling pipeline.

In [30]:
X_train_export = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_train_export.to_csv("../data/X_train.csv")

X_test_export = pd.DataFrame(X_test_scaled, columns=X_test.columns)
X_test_export.to_csv("../data/X_test.csv")

y_train.to_csv("../data/y_train.csv")
y_test.to_csv("../data/y_test.csv")
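As a sketch of the integrated alternative mentioned above, scaling and modeling can be chained in a single scikit-learn `Pipeline`, so the scaler is fitted only on whatever data the pipeline is trained on. The toy data and the choice of `LinearRegression` are assumptions for illustration; they do not reflect the models used in the next part.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): one numeric feature, one dummy column, a noisy linear target
rng = np.random.default_rng(0)
X = pd.DataFrame({"feature": rng.normal(size=40), "Ticker_A": [1, 0] * 20})
y = 3.0 * X["feature"] + rng.normal(scale=0.1, size=40)

# Chronological 75/25 split (rows are assumed to already be in time order)
X_tr, X_te = X.iloc[:30], X.iloc[30:]
y_tr, y_te = y.iloc[:30], y.iloc[30:]

pipe = Pipeline([
    ("scale", ColumnTransformer([("standard", StandardScaler(), ["feature"])],
                                remainder="passthrough")),
    ("model", LinearRegression()),
])

pipe.fit(X_tr, y_tr)            # the scaler is fitted on the training rows only
predictions = pipe.predict(X_te)  # test rows are scaled with the training statistics
print(predictions.shape)
```

With this layout there is no need to write intermediate CSV files, and the same fitted object handles both scaling and prediction.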