# Predictive Analysis of Robinhood Popularity Data - EDA

Purpose: In this module, we pick up from Part One - where the data from Robinhood and Yahoo Finance was extracted and cleaned.

Our goal in Part Two, Exploratory Data Analysis, is to visually explore all of the features and use statistical techniques to identify any relationships in the dataset that we should be aware of.

We will also leverage these insights in Part Three - Pre-Processing and Training - where feature engineering will be a critical aspect of this project.  The principal reason is that the dataset is a time series, so we want to develop features that capture temporal patterns which a single snapshot in time cannot.


In [1]:
# Importing modules for exploratory data analysis
# We will be using the Pandas Profiling library, in addition to matplotlib and seaborn

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport 

In [2]:
# Let's read in our cleaned data from Part One

filepath = '../data/stock_data.csv'
df = pd.read_csv(filepath)
print(df.head())

# Let's also get a list of the stock basket tickers
filepath = '../data/stock_info.csv'
stock = pd.read_csv(filepath)
tickers = stock['Ticker'].tolist()
print(tickers)

         Date  Robinhood      Price    Volume Ticker Company
0  2018-07-02   150897.0  46.794998  70925200   AAPL   Apple
1  2018-07-03   151073.0  45.980000  55819200   AAPL   Apple
2  2018-07-05   151258.0  46.349998  66416800   AAPL   Apple
3  2018-07-06   151150.0  46.992500  69940800   AAPL   Apple
4  2018-07-09   150664.0  47.645000  79026400   AAPL   Apple
['AAPL', 'AMD', 'AMZN', 'BABA', 'FB', 'GOOGL', 'INTC', 'JPM', 'MSFT', 'NFLX', 'NKE', 'NVDA', 'PYPL', 'SQ', 'SNAP', 'T', 'TSLA', 'TWTR', 'V', 'ZNGA']


In [3]:
# Let's create a Pandas Profile report for comprehensive review of data

profile = ProfileReport(df, title='Robinhood Stock Data Profiling Report', explorative=True)
profile.to_widgets()


The correlation plot in the profiling report shows negative correlations between Price and Volume, and between Price and Robinhood shares (popularity).

We do see a positive correlation between Robinhood shares and Volume, which makes sense since Volume includes Robinhood shares.

Based on this positive correlation, we can extract a new feature - call it Percentage_Volume.

Percentage Volume represents Robinhood shares as a fraction of total Volume (stored as a ratio, not multiplied by 100).

It captures Volume implicitly and is a more interesting feature to study.

In [4]:
# Calculating percentage volume of Robinhood owned

df['Percentage_Volume'] = df['Robinhood'] / df['Volume']

# Let's rerun the pandas profiling on this new feature

profile = ProfileReport(df, title='Robinhood Stock Data Profiling Report', explorative=True)
profile.to_widgets()


Indeed, Percentage Volume is a more insightful feature as it has a lower correlation with the other explanatory variables.

Additionally, it has a strong correlation with the raw price level - which is interesting indeed.
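
The correlations read off the profiling report can also be cross-checked with plain pandas. The sketch below is a minimal illustration on a small made-up frame (the values are invented for the example, not taken from the dataset). It assumes Spearman rank correlation, which may differ from the measure displayed in the report.

```python
import pandas as pd

# Toy values, invented for illustration only (not from stock_data.csv)
toy = pd.DataFrame({
    "Robinhood": [100.0, 120.0, 150.0, 130.0, 170.0],
    "Price": [50.0, 48.0, 45.0, 47.0, 44.0],
    "Volume": [1000.0, 1500.0, 2500.0, 1800.0, 3000.0],
})
toy["Percentage_Volume"] = toy["Robinhood"] / toy["Volume"]

# Pairwise rank correlation between all numeric columns
corr = toy.corr(method="spearman")
print(corr.round(2))
```

On the real data the same call would be `df[['Robinhood', 'Price', 'Volume', 'Percentage_Volume']].corr(method='spearman')`. Note that this pools all tickers together, so differences in price level between stocks contribute to the result.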

Speaking of price level, it is now helpful to define our target variable.  Raw price level is not ideal because it is not comparable across stocks trading at very different prices; percentage change is preferred.

Let's calculate 1-day, 5-day (1 trading week), and 10-day (2 trading weeks) realized ("Ex-Post") returns.

Additionally, let's calculate 1-day, 3-day, and 5-day future ("Ex-Ante") returns.

We shortened the time horizon of the ex-ante returns relative to the ex-post returns, as we expect minimal (if any) predictive power.

In [6]:
# Calculating realized (ex-post) and future (ex-ante) price changes per ticker,
# each over three different time horizons.

for ticker in tickers:
    mask = df.Ticker == ticker        # rows belonging to this ticker only
    price = df.loc[mask, 'Price']

    # Ex-post: percentage change relative to the price N trading days earlier
    df.loc[mask, 'ExPost_PriceChange_1D'] = price.pct_change(1)
    df.loc[mask, 'ExPost_PriceChange_5D'] = price.pct_change(5)
    df.loc[mask, 'ExPost_PriceChange_10D'] = price.pct_change(10)

    # Ex-ante: percentage change from today to the price N trading days ahead
    df.loc[mask, 'ExAnte_PriceChange_1D'] = price.shift(-1) / price - 1
    df.loc[mask, 'ExAnte_PriceChange_3D'] = price.shift(-3) / price - 1
    df.loc[mask, 'ExAnte_PriceChange_5D'] = price.shift(-5) / price - 1

df.sample(10)

Unnamed: 0,Date,Robinhood,Price,Volume,Ticker,Company,Percentage_Volume,ExPost_PriceChange_1D,ExPost_PriceChange_5D,ExPost_PriceChange_10D,ExAnte_PriceChange_1D,ExAnte_PriceChange_3D,ExAnte_PriceChange_5D
9857,2019-09-11,121872.0,5.76,15696000,ZNGA,Zynga,0.007765,0.006993,0.023091,0.006993,0.020833,0.019097,0.048611
8240,2019-04-08,131737.0,54.639999,52052000,TSLA,Tesla,0.002531,-0.006401,-0.05526,0.049075,-0.003258,-0.017496,-0.024963
5254,2019-05-23,46470.0,82.639999,9704200,NKE,Nike,0.004789,-0.006731,-0.019459,-0.004097,-0.005808,-0.04562,-0.066554
4716,2019-04-03,101165.0,369.75,5368900,NFLX,Netflix,0.018843,0.005521,0.046354,-0.014578,-0.005057,-0.022556,-0.015767
4454,2020-03-18,356180.0,140.399994,81593200,MSFT,Microsoft,0.004365,-0.042096,-0.086116,-0.176781,0.016453,-0.031481,0.046439
9930,2019-12-24,120628.0,6.33,3160800,ZNGA,Zynga,0.038164,0.007962,0.009569,0.032626,-0.004739,-0.030016,-0.028436
4836,2019-09-24,106076.0,254.589996,16338200,NFLX,Netflix,0.006493,-0.042607,-0.147388,-0.115976,0.039907,0.033348,0.058879
4383,2019-12-04,235279.0,149.850006,17574700,MSFT,Microsoft,0.013387,0.003617,-0.014339,-0.003591,0.000534,0.010077,0.012346
8766,2019-05-10,100844.0,38.450001,12259000,TWTR,Twitter,0.008226,-0.008765,-0.057598,-0.005689,-0.048375,-0.014304,-0.024707
33,2018-08-17,160351.0,54.395,141708000,AAPL,Apple,0.001132,0.01997,0.048427,0.046108,-0.009744,-0.011628,-0.006526
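
The per-ticker loop above can also be written with `groupby`, which makes it explicit that returns never cross from one ticker into the next. The following is a minimal sketch on made-up prices (the tickers `AAA` and `BBB` and their values are hypothetical); it is an alternative formulation, not the code used to produce the output above.

```python
import pandas as pd

# Hypothetical two-ticker frame, ordered by date within each ticker
toy = pd.DataFrame({
    "Ticker": ["AAA"] * 4 + ["BBB"] * 4,
    "Price": [100.0, 110.0, 121.0, 133.1, 50.0, 45.0, 40.5, 36.45],
})

grouped = toy.groupby("Ticker")["Price"]

# Ex-post: change from the previous row to the current row (within a ticker)
toy["ExPost_PriceChange_1D"] = grouped.pct_change(1)

# Ex-ante: change from the current row to the next row (within a ticker)
toy["ExAnte_PriceChange_1D"] = grouped.shift(-1) / toy["Price"] - 1

print(toy)
```

The first row of each ticker has no ex-post value and the last row has no ex-ante value, which is the same pattern of missing values visible in the sampled rows of the real data.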


Let's move on to adding additional features to the dataset.

Since this is time-series data, it is helpful to include some temporal features before training.

The first one is simple - let's extract the year, month, day, and day of week as separate features.

In [7]:
# Feature engineering of date into year, month, day and day of week

df['Date'] = df['Date'].astype('datetime64[ns]') 
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['Day_of_Week'] = df['Date'].dt.dayofweek

df.sample(10)

Unnamed: 0,Date,Robinhood,Price,Volume,Ticker,Company,Percentage_Volume,ExPost_PriceChange_1D,ExPost_PriceChange_5D,ExPost_PriceChange_10D,ExAnte_PriceChange_1D,ExAnte_PriceChange_3D,ExAnte_PriceChange_5D,Year,Month,Day,Day_of_Week
6789,2019-07-01,78519.0,73.199997,8092400,SQ,Square,0.009703,0.009238,0.006739,0.01371,0.010519,0.015164,0.065027,2019,7,1,0
1266,2019-07-16,108165.0,2009.900024,2618200,AMZN,Amazon,0.041313,-0.005487,0.010864,0.04563,-0.008891,-0.022578,-0.007667,2019,7,16,1
5506,2020-05-22,90028.0,93.75,4049500,NKE,Nike,0.022232,-0.005411,0.07771,0.03637,0.030613,0.05024,0.06176,2020,5,22,4
958,2020-04-23,202839.0,55.900002,69662700,AMD,AMD,0.002912,-0.000358,-0.018437,0.145727,0.005009,-0.006977,-0.062791,2020,4,23,3
6165,2019-01-07,30305.0,86.93,11094100,PYPL,Paypal,0.002732,0.00765,0.044079,0.054464,0.020361,0.043368,0.040262,2019,1,7,0
3014,2020-06-25,82457.0,1441.099976,1197900,GOOGL,Google,0.068835,0.005863,0.004867,0.027962,-0.054514,-0.015995,,2020,6,25,3
4719,2019-04-08,101638.0,361.410004,4653800,NFLX,Netflix,0.02184,-0.011163,-0.015124,-0.013161,0.009131,0.017266,-0.034697,2019,4,8,0
7862,2019-10-04,90116.0,37.509998,21914100,T,AT&T,0.004112,0.008604,0.002137,-0.010551,0.003999,-0.012263,0.001866,2019,10,4,4
1761,2019-07-03,122924.0,174.669998,8532600,BABA,Alibaba,0.014406,-0.004446,0.033611,0.055663,-0.007843,-0.033606,-0.046488,2019,7,3,2
3555,2018-08-20,14369.0,114.620003,8618700,JPM,JPMorgan,0.001667,-0.001307,0.00641,-0.021346,0.006107,0.00096,0.018234,2018,8,20,0
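
When reading the Day_of_Week column, it helps to remember that pandas numbers weekdays from Monday = 0 to Sunday = 6. A small sketch using two dates from the first week of the dataset (the variable names `dates` and `parts` are chosen here for illustration):

```python
import pandas as pd

# 2018-07-02 was a Monday and 2018-07-06 a Friday
dates = pd.to_datetime(pd.Series(["2018-07-02", "2018-07-06"]))

parts = pd.DataFrame({
    "Date": dates,
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    "Day": dates.dt.day,
    "Day_of_Week": dates.dt.dayofweek,  # Monday=0 ... Sunday=6
})
print(parts)
```

Because the data only covers trading days, Day_of_Week should take values 0 through 4 in practice.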


Another useful addition is lagged features, which capture the history of the data.

We can do this for, say, lags one to seven (representing the prior seven trading days).

Let's apply these lagged features to (1) Robinhood shares, (2) Percentage Volume, and (3) the 1-day price change.

In [9]:
# Calculating lagged features (lags 1 to TotalLag) for each feature in LagFeatures

TotalLag = 7
LagFeatures = ['Robinhood', 'Percentage_Volume', 'ExPost_PriceChange_1D']

for stock in tickers:
    for feature in LagFeatures:
        for lag in range(1,TotalLag+1):
            lag_name = feature + '_Lag' + str(lag)
            df.loc[(df.Ticker == stock),lag_name] = df.loc[(df.Ticker == stock),feature].shift(lag)

df.sample(10)

Unnamed: 0,Date,Robinhood,Price,Volume,Ticker,Company,Percentage_Volume,ExPost_PriceChange_1D,ExPost_PriceChange_5D,ExPost_PriceChange_10D,...,Percentage_Volume_Lag5,Percentage_Volume_Lag6,Percentage_Volume_Lag7,ExPost_PriceChange_1D_Lag1,ExPost_PriceChange_1D_Lag2,ExPost_PriceChange_1D_Lag3,ExPost_PriceChange_1D_Lag4,ExPost_PriceChange_1D_Lag5,ExPost_PriceChange_1D_Lag6,ExPost_PriceChange_1D_Lag7
339,2019-11-05,204410.0,64.282501,79897600,AAPL,Apple,0.002558,-0.001437,0.056887,0.071554,...,0.001426,0.002101,0.002748,0.006567,0.028381,0.02261,-0.000123,-0.023128,0.010017,0.012316
6711,2019-03-11,85172.0,75.529999,7901700,SQ,Square,0.010779,0.015188,0.011246,-0.020744,...,0.005102,0.003808,0.002086,0.003913,-0.013708,-0.011836,0.018075,-0.03576,-0.046529,0.024206
4546,2018-07-30,116832.0,334.959991,18260700,NFLX,Netflix,0.006398,-0.057009,-0.07638,-0.163604,...,0.010263,0.007824,0.006995,-0.021703,0.000606,0.015532,-0.014725,0.004459,-0.008731,-0.029057
4421,2020-01-30,276436.0,172.779999,51597500,MSFT,Microsoft,0.005358,0.028208,0.036348,0.058831,...,0.013632,0.011076,0.008995,0.015593,0.019596,-0.016723,-0.010077,0.006156,-0.004805,-0.003591
8649,2018-11-19,105102.0,31.98,15745000,TWTR,Twitter,0.006675,-0.050193,-0.000937,-0.059965,...,0.005808,0.006587,0.006986,0.015686,0.007293,0.012927,0.014995,-0.06074,-0.002926,-0.02315
5062,2018-08-16,17611.0,80.050003,5475800,NKE,Nike,0.003216,0.006032,-0.01489,0.0178,...,0.003819,0.003268,0.003205,-0.007113,-0.000125,-0.007184,-0.006522,0.009441,-0.000373,0.012829
8146,2018-11-19,83697.0,70.694,48544500,TSLA,Tesla,0.001724,-0.002371,0.066983,0.035354,...,0.002467,0.003375,0.002433,0.016846,0.012907,0.015558,0.022489,-0.054863,-0.002533,0.009306
9385,2019-10-24,45742.0,176.160004,7872600,V,Visa,0.00581,0.028251,-0.010003,0.007319,...,0.009131,0.006858,0.010172,0.002692,-0.031571,0.004098,-0.012532,0.000394,-0.004923,0.007837
19,2018-07-30,151954.0,47.477501,84118000,AAPL,Apple,0.001806,-0.005603,-0.008872,-0.005238,...,0.002361,0.001824,0.001858,-0.016632,-0.003131,0.00943,0.007254,0.000888,-0.002293,0.007773
1193,2019-04-01,115241.0,1814.189941,4238800,AMZN,Amazon,0.027187,0.018779,0.022505,0.041351,...,0.022535,0.018136,0.020048,0.004133,0.004372,-0.010125,0.005354,0.005377,-0.029952,0.012235
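
To make the meaning of a lag concrete, here is a minimal sketch on made-up data (hypothetical tickers and values, with only two lags for brevity). It uses `groupby(...).shift(...)` as one possible alternative to the masked loop above, and follows the same `<feature>_Lag<n>` naming.

```python
import pandas as pd

# Hypothetical frame, ordered by date within each ticker
toy = pd.DataFrame({
    "Ticker": ["AAA"] * 3 + ["BBB"] * 3,
    "Robinhood": [10.0, 11.0, 12.0, 20.0, 21.0, 22.0],
})

total_lag = 2  # the notebook uses 7; 2 keeps the example small

# Build all lag columns first, then attach them in one step
lag_cols = {
    f"Robinhood_Lag{lag}": toy.groupby("Ticker")["Robinhood"].shift(lag)
    for lag in range(1, total_lag + 1)
}
toy = pd.concat([toy, pd.DataFrame(lag_cols)], axis=1)

print(toy)
```

Each lag column holds the value observed that many rows earlier for the same ticker, so the first rows of every ticker are missing, as in the real data.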


Let's turn our attention to "window functions" to calculate statistics that capture the temporal aspects of the series.

First, let's use a rolling window across 3 different time horizons (3-day, 5-day, and 10-day).

This is commonly known as the simple moving average (SMA) when our statistic is the mean.

Let's apply this feature engineering to (1) Robinhood shares, (2) Percentage Volume, and (3) the 1-day price change.

In [10]:
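# Calculating the rolling window feature across 3 different time horizons.
# Note: every feature writes to the same SMA_3D / SMA_5D / SMA_10D columns, so after
# the loop they hold the averages of the last feature in the list (ExPost_PriceChange_1D).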

RollingWindowFeatures = ['Robinhood', 'Percentage_Volume', 'ExPost_PriceChange_1D']

for stock in tickers:
    for feature in RollingWindowFeatures:
        df.loc[(df.Ticker == stock),'SMA_3D'] = df.loc[(df.Ticker == stock),feature].rolling(window=3).mean()
        df.loc[(df.Ticker == stock),'SMA_5D'] = df.loc[(df.Ticker == stock),feature].rolling(window=5).mean()
        df.loc[(df.Ticker == stock),'SMA_10D'] = df.loc[(df.Ticker == stock),feature].rolling(window=10).mean()

df.sample(10)

Unnamed: 0,Date,Robinhood,Price,Volume,Ticker,Company,Percentage_Volume,ExPost_PriceChange_1D,ExPost_PriceChange_5D,ExPost_PriceChange_10D,...,ExPost_PriceChange_1D_Lag1,ExPost_PriceChange_1D_Lag2,ExPost_PriceChange_1D_Lag3,ExPost_PriceChange_1D_Lag4,ExPost_PriceChange_1D_Lag5,ExPost_PriceChange_1D_Lag6,ExPost_PriceChange_1D_Lag7,SMA_3D,SMA_5D,SMA_10D
7434,2020-01-23,181544.0,19.25,25071500,SNAP,Snapchat,0.007241,0.007853,0.058274,0.154769,...,0.005263,-0.005756,0.047123,0.003298,0.011117,-0.000556,0.033889,0.002453,0.011556,0.01465
5653,2018-12-21,97615.0,129.570007,21593500,NVDA,NVIDIA,0.004521,-0.040933,-0.115261,-0.122214,...,-0.024619,-0.05737,0.023402,-0.019597,-0.016388,-6.7e-05,0.004791,-0.040974,-0.023824,-0.012616
3618,2018-11-16,15162.0,109.989998,13798600,JPM,JPMorgan,0.001099,-0.000727,-0.011681,0.014855,...,0.025529,-0.020622,0.005874,-0.021026,-0.009699,0.008073,0.017153,0.001393,-0.002194,0.001578
950,2020-04-13,198128.0,50.939999,64290100,AMD,AMD,0.003082,0.052914,0.196055,0.093602,...,-0.008403,0.025862,0.000842,0.115755,-0.042706,0.019011,-0.040018,0.023458,0.037394,0.010101
430,2020-03-18,281262.0,61.6675,75058400,AAPL,Apple,0.003747,-0.02448,-0.104419,-0.185208,...,0.04397,-0.128647,0.119808,-0.098755,-0.034731,0.072022,-0.079092,-0.036386,-0.017621,-0.017562
5202,2019-03-11,42899.0,85.82,3999800,NKE,Nike,0.010725,0.012028,0.002102,0.007632,...,-0.005395,0.001527,-0.003628,-0.002335,-0.017439,0.01668,-0.005106,0.00272,0.000439,0.000804
2942,2020-03-13,38488.0,1214.27002,3970000,GOOGL,Google,0.009695,0.092411,-0.062875,-0.093321,...,-0.082046,-0.050401,0.048841,-0.061702,-0.014467,-0.048379,0.032802,-0.013345,-0.01058,-0.008285
1945,2020-03-26,141380.0,195.320007,15416800,BABA,Alibaba,0.009171,0.035851,0.079832,0.055213,...,0.015128,0.053363,-0.027358,0.002322,0.004889,-0.026027,0.033324,0.03478,0.015861,0.006148
8322,2019-08-05,164655.0,45.664001,35141500,TSLA,Tesla,0.004685,-0.025689,-0.031599,-0.107009,...,0.002095,-0.032118,-0.002683,0.027527,0.033898,-0.003409,-0.136137,-0.018571,-0.006174,-0.010085
437,2020-03-27,310600.0,61.935001,51007500,AAPL,Apple,0.006089,-0.041402,0.080701,-0.108753,...,0.052623,-0.005509,0.100325,-0.021244,-0.063486,-0.007662,-0.02448,0.001904,0.016959,-0.009551
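
As written, the loop above reuses the column names SMA_3D, SMA_5D and SMA_10D for every feature, so only the averages of the last feature (ExPost_PriceChange_1D) survive. A minimal sketch that keeps one set of columns per feature is shown below; the `<feature>_SMA_<n>D` naming, the toy data, and the shortened windows are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical frame, ordered by date within each ticker
toy = pd.DataFrame({
    "Ticker": ["AAA"] * 4 + ["BBB"] * 4,
    "Robinhood": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
    "Percentage_Volume": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})

features = ["Robinhood", "Percentage_Volume"]
windows = [2, 3]  # the notebook uses 3, 5 and 10

for feature in features:
    for window in windows:
        # One column per (feature, window) pair, so nothing is overwritten
        col = f"{feature}_SMA_{window}D"
        toy[col] = toy.groupby("Ticker")[feature].transform(
            lambda s, w=window: s.rolling(window=w).mean()
        )

print(toy)
```

With three features and three windows on the real data, this pattern would yield nine SMA columns instead of three.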


Finally, we can add a more advanced version of the rolling window technique known as the expanding window.

While the window length is fixed in the rolling window, the expanding window grows to include the full history up to each date.

Using the average as our statistic, this Expanded_Mean feature will capture the historical average over the period of interest.

Let's apply this feature engineering to (1) Robinhood shares, (2) Percentage Volume, and (3) the 1-day price change.

In [11]:
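# Calculating the expanding-window mean (at least 2 observations required).
# Note: every feature writes to the same Expanded_Mean column, so after the loop
# it holds the expanding mean of the last feature in the list (ExPost_PriceChange_1D).
RollingWindowFeatures = ['Robinhood', 'Percentage_Volume', 'ExPost_PriceChange_1D']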

for stock in tickers:
    for feature in RollingWindowFeatures:
        df.loc[(df.Ticker == stock),'Expanded_Mean'] = df.loc[(df.Ticker == stock),feature].expanding(2).mean()
        
df.sample(10)

Unnamed: 0,Date,Robinhood,Price,Volume,Ticker,Company,Percentage_Volume,ExPost_PriceChange_1D,ExPost_PriceChange_5D,ExPost_PriceChange_10D,...,ExPost_PriceChange_1D_Lag2,ExPost_PriceChange_1D_Lag3,ExPost_PriceChange_1D_Lag4,ExPost_PriceChange_1D_Lag5,ExPost_PriceChange_1D_Lag6,ExPost_PriceChange_1D_Lag7,SMA_3D,SMA_5D,SMA_10D,Expanded_Mean
8142,2018-11-13,85502.0,67.746002,27243000,TSLA,Tesla,0.003138,0.022489,-0.006832,0.026766,...,-0.002533,0.009306,0.020818,-0.000996,-0.014463,0.006187,-0.011636,-0.000957,0.002907,0.001108
3236,2019-05-15,49637.0,45.619999,23407900,INTC,Intel,0.002121,0.009962,-0.073518,-0.101261,...,-0.031169,-0.009009,-0.053209,-0.024564,-0.014448,-0.010242,-0.004016,-0.014853,-0.010392,-0.000233
9237,2019-03-26,35140.0,155.300003,15594400,V,Visa,0.002253,0.014834,0.004593,0.023529,...,-0.017522,0.013333,-0.005434,-0.002388,-0.003216,0.008171,-0.000983,0.00099,0.002374,0.000999
9731,2019-03-13,80229.0,5.37,13312400,ZNGA,Zynga,0.006027,0.03071,0.046784,0.042718,...,0.023437,0.007874,-0.009747,-0.003883,-0.009615,-0.026217,0.016141,0.00931,0.004341,0.001747
5823,2019-08-27,97846.0,161.800003,7274200,NVDA,NVIDIA,0.013451,-0.022061,-0.036159,0.036847,...,-0.052717,0.00146,0.020015,-0.017039,0.070318,0.072528,-0.01875,-0.006955,0.004405,-0.000874
6285,2019-06-28,35896.0,114.459999,6679800,PYPL,Paypal,0.005374,0.00695,-0.015059,-0.01472,...,-0.002534,-0.017682,0.002495,-0.021719,0.014086,0.010002,7e-05,-0.002995,-0.001416,0.001448
2359,2019-11-15,128975.0,195.100006,11524300,FB,Facebook,0.011192,0.010096,0.022322,0.007644,...,-0.006582,0.025632,-0.006445,0.002206,-0.005899,-0.014255,0.001102,0.004499,0.000817,0.000215
4704,2019-03-18,102385.0,363.440002,7194700,NFLX,Netflix,0.014231,0.005478,0.012763,0.035324,...,-0.006617,0.013866,-0.007217,0.026487,-0.008508,-0.019493,0.002073,0.002573,0.003563,-7.6e-05
9057,2018-07-06,23501.0,134.089996,4839800,V,Visa,0.004856,0.006002,,,...,-0.007925,,,,,,0.004025,,,0.004025
8904,2019-11-25,120816.0,30.540001,14035700,TWTR,Twitter,0.008608,0.016983,0.035605,0.0409,...,0.021255,-0.009171,-0.001695,0.008205,0.012461,-0.006534,0.015431,0.007085,0.004071,-0.000582
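
The same overwriting applies to Expanded_Mean in the loop above: each feature writes to the one column, so it ends up holding the expanding mean of ExPost_PriceChange_1D. The sketch below shows one way to keep a separate expanding mean per feature; the `<feature>_Expanded_Mean` naming and the toy values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame, ordered by date within each ticker
toy = pd.DataFrame({
    "Ticker": ["AAA"] * 4 + ["BBB"] * 4,
    "Robinhood": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
    "ExPost_PriceChange_1D": [0.01, 0.03, -0.02, 0.02, 0.00, 0.02, 0.04, -0.06],
})

for feature in ["Robinhood", "ExPost_PriceChange_1D"]:
    # Mean of all observations so far for the same ticker;
    # min_periods=2 leaves the first row of each ticker missing
    toy[f"{feature}_Expanded_Mean"] = toy.groupby("Ticker")[feature].transform(
        lambda s: s.expanding(min_periods=2).mean()
    )

print(toy)
```

Unlike the fixed-length SMA, each value here averages everything from the ticker's first row up to the current one.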


In [15]:
# Finally, let's export the new dataset with all these features to a CSV file
filepath = "../data/stock_data_new.csv"
df.to_csv(filepath)
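
One detail worth knowing for the next part: `to_csv` writes the row index by default, and when such a file is read back the index appears as an extra column named `Unnamed: 0`. The sketch below demonstrates this on a tiny made-up frame using in-memory buffers rather than the real file path. Whether to pass `index=False` is a choice left to the reader.

```python
import io
import pandas as pd

toy = pd.DataFrame({"Ticker": ["AAA", "BBB"], "Price": [1.0, 2.0]})

# Default behaviour: the index is written as the first, unnamed column
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index)
buf_with_index.seek(0)
with_index = pd.read_csv(buf_with_index)

# index=False: only the data columns are written
buf_without_index = io.StringIO()
toy.to_csv(buf_without_index, index=False)
buf_without_index.seek(0)
without_index = pd.read_csv(buf_without_index)

print(list(with_index.columns))
print(list(without_index.columns))
```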