Predictive Analysis of Robinhood Popularity Data - Data Wrangling

Purpose: In this module, we will analyze the Robinhood Popularity Data provided by RobinTrack, which tracks customer ownership of U.S. stocks on the Robinhood trading platform.

Our goal is to determine whether any features in the Robinhood data provide meaningful insight into both historical (ex post) and prospective (ex ante) price returns (ex dividends).

We will focus on a representative basket of 10 actively traded stocks over a two-year period from July 1, 2018 to June 30, 2020.
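
To make the two return definitions concrete, here is a minimal sketch on made-up prices. It assumes the ex post return on a date is the percentage change from the previous close, and the ex ante return is the following day's return aligned back to the current date. The column names `ex_post` and `ex_ante` are illustrative and are not used elsewhere in this module.

```python
import pandas as pd

# Hypothetical closing prices, for illustration only
prices = pd.Series(
    [100.0, 110.0, 99.0, 108.9],
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
    name="Price",
)

returns = pd.DataFrame({"Price": prices})

# Ex post: return realized over the day ending on each date
returns["ex_post"] = prices.pct_change()

# Ex ante: the following day's return, shifted back to the current date
returns["ex_ante"] = returns["ex_post"].shift(-1)

print(returns)
```

The first ex post value and the last ex ante value are missing by construction, since there is no prior or following price to compare against.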

Data Sources: RobinTrack (https://robintrack.net/data-download) and the Yahoo Finance API (via yfinance)

In [258]:
# Importing modules for data wrangling and collection
# yfinance is a library for downloading historical market data from Yahoo Finance

import numpy as np
import pandas as pd
from datetime import datetime
import yfinance as yf
import os

In [259]:
# Representative basket of 10 actively traded stocks on Robinhood
# Source: Robinhood, https://robinhood.com/collections/100-most-popular

stock_ticker = ['AAPL', 'AMZN', 'FB', 'GOOGL', 'JPM', 'MSFT', 'NFLX', 'NVDA', 'SQ', 'TSLA']
stock_name = ['Apple', 'Amazon', 'Facebook', 'Google', 'JPMorgan', 'Microsoft', 'Netflix', 'NVIDIA', 'Square', 'Tesla']
N = len(stock_ticker)

stock_list = dict(zip(stock_ticker,stock_name))
stock_list

{'AAPL': 'Apple',
 'AMZN': 'Amazon',
 'FB': 'Facebook',
 'GOOGL': 'Google',
 'JPM': 'JPMorgan',
 'MSFT': 'Microsoft',
 'NFLX': 'Netflix',
 'NVDA': 'NVIDIA',
 'SQ': 'Square',
 'TSLA': 'Tesla'}

In [260]:
# The Robinhood Popularity data is stored in separate CSV files for each stock
# We're going to read in each CSV file as a dataframe and store them all in a single list

df_list = list()

for ticker in stock_ticker:
    filepath = "../data/popularity_export/" + ticker + ".csv"
    df = pd.read_csv(filepath,parse_dates=[0])
    df.name = ticker
    df_list.append(df)
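
As an alternative to tagging each dataframe with a custom `name` attribute (which pandas does not guarantee to carry through operations such as merges or group-bys), the frames could be kept in a dictionary keyed by ticker. The sketch below uses small in-memory CSV strings with made-up timestamps in place of the real files, and the names `raw_csv` and `frames` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical in-memory CSVs standing in for the per-ticker export files
raw_csv = {
    "AAPL": "timestamp,users_holding\n2018-07-01 00:46:54,150718\n2018-07-01 01:46:54,150720\n",
    "AMZN": "timestamp,users_holding\n2018-07-01 00:46:54,78884\n",
}

# Read each CSV into a dataframe, parsing the first column as datetimes
frames = {}
for ticker, text in raw_csv.items():
    frames[ticker] = pd.read_csv(io.StringIO(text), parse_dates=[0])

# The ticker travels with the dataframe as the dictionary key
for ticker, frame in frames.items():
    print(ticker, frame.shape, frame["timestamp"].dtype)
```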

In [261]:
# Let's do a quick preview of the dataframes we've imported

for df in df_list:
    print(df.name)
    print(df.info())

AAPL
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19792 entries, 0 to 19791
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   timestamp      19792 non-null  datetime64[ns]
 1   users_holding  19792 non-null  int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 309.4 KB
None
AMZN
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19756 entries, 0 to 19755
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   timestamp      19756 non-null  datetime64[ns]
 1   users_holding  19756 non-null  int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 308.8 KB
None
FB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19785 entries, 0 to 19784
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   timestamp      19785

In [262]:
# We want to keep only the dates within our 2-year period of analysis
# We also want to keep just the last observation for each date

start_date = datetime(2018,7,1)
end_date = datetime(2020,7,1)

for i in range(N):
    
    # Time period filtering
    # (named date_mask to avoid shadowing Python's built-in filter function)
    date_mask = (df_list[i]['timestamp'] >= start_date) & (df_list[i]['timestamp'] < end_date)
    df_list[i] = df_list[i].loc[date_mask]
    
    # Keep last observation for each date
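    # (.last() takes the final row per date, assuming rows are in chronological order;
    #  .max() would instead take the largest value seen that day)
    df_list[i] = df_list[i].groupby(df_list[i]['timestamp'].dt.date).last()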
    
    # We can drop the redundant timestamp column
    df_list[i].drop(columns=['timestamp'], inplace=True)
    
    print(df_list[i])

            users_holding
timestamp                
2018-07-01         150718
2018-07-02         150897
2018-07-03         151073
2018-07-04         151076
2018-07-05         151258
...                   ...
2020-06-26         484175
2020-06-27         484175
2020-06-28         484144
2020-06-29         486061
2020-06-30         489538

[714 rows x 1 columns]
            users_holding
timestamp                
2018-07-01          78884
2018-07-02          78950
2018-07-03          78903
2018-07-04          78903
2018-07-05          79221
...                   ...
2020-06-26         283515
2020-06-27         283515
2020-06-28         283499
2020-06-29         286784
2020-06-30         291879

[714 rows x 1 columns]
            users_holding
timestamp                
2018-07-01         104202
2018-07-02         104294
2018-07-03         104629
2018-07-04         104631
2018-07-05         104632
...                   ...
2020-06-26         227796
2020-06-27         227796
2020-06-28      
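
Taking the last observation per date and taking the maximum per date can give different answers whenever the holder count dips before the end of the day. The sketch below illustrates this on a small made-up sample; the timestamps and counts are hypothetical.

```python
import pandas as pd

# Hypothetical intraday snapshots; the count dips before the end of each day
snapshots = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-07-02 09:00", "2018-07-02 15:00", "2018-07-02 23:00",
        "2018-07-03 09:00", "2018-07-03 23:00",
    ]),
    "users_holding": [100, 120, 110, 115, 112],
})

# Sort chronologically so that "last" really means the final snapshot of the day
ordered = snapshots.sort_values("timestamp")
by_date = ordered.groupby(ordered["timestamp"].dt.date)["users_holding"]

daily_last = by_date.last()  # final snapshot of each day
daily_max = by_date.max()    # highest snapshot of each day

print(pd.DataFrame({"last": daily_last, "max": daily_max}))
```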

In [263]:
# Let's pull in the prices and volume for each stock from the Yahoo API
# We right join on the stock price data because it only includes trading days (excludes weekends and holidays)


for i in range(N):
    
    # Pulling pricing data
    price_data = yf.download(stock_ticker[i], start=start_date, end=end_date)
    
    # Closing price is the relevant price to keep, along with volume
    closing_data = price_data[['Close', 'Volume']]
    
    # Right join this data with our Robinhood participation data
    df_list[i] = df_list[i].merge(closing_data, how='right', left_index=True, right_index=True)
    
    # Let's also change the column names to be concise
    df_list[i].columns = ['Shares', 'Price', 'Volume']
    
    print(df_list[i])

[*********************100%***********************]  1 of 1 completed
              Shares      Price     Volume
Date                                      
2018-07-02  150897.0  46.794998   70925200
2018-07-03  151073.0  45.980000   55819200
2018-07-05  151258.0  46.349998   66416800
2018-07-06  151150.0  46.992500   69940800
2018-07-09  150664.0  47.645000   79026400
...              ...        ...        ...
2020-06-24  481357.0  90.014999  192623200
2020-06-25  483107.0  91.209999  137522400
2020-06-26  484175.0  88.407501  205256800
2020-06-29  486061.0  90.445000  130646000
2020-06-30  489538.0  91.199997  140223200

[503 rows x 3 columns]
[*********************100%***********************]  1 of 1 completed
              Shares        Price   Volume
Date                                      
2018-07-02   78950.0  1713.780029  3185700
2018-07-03   78903.0  1693.959961  2177300
2018-07-05   79221.0  1699.729980  2983100
2018-07-06   79175.0  1710.630005  2650300
2018-07-09   78968.0 
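
The effect of the right join can be seen on a small made-up example: rows for non-trading days drop out, and trading days with no holder count appear with a missing value. The data below is hypothetical and uses the original column names rather than the renamed ones.

```python
import pandas as pd

# Hypothetical daily holder counts, including a weekend and one missing weekday
holders = pd.DataFrame(
    {"users_holding": [100, 101, 102, 104]},
    index=pd.to_datetime(["2018-07-06", "2018-07-07", "2018-07-08", "2018-07-10"]),
)

# Hypothetical prices, available for trading days only
prices = pd.DataFrame(
    {"Close": [10.0, 10.5, 10.2], "Volume": [1000, 1200, 900]},
    index=pd.to_datetime(["2018-07-06", "2018-07-09", "2018-07-10"]),
)

# The right join keeps every row of prices and only the matching rows of holders
merged = holders.merge(prices, how="right", left_index=True, right_index=True)
print(merged)
```

The weekend rows (July 7 and 8) disappear, while July 9 is kept with a missing `users_holding` value. This is also why the column becomes floating point after the merge.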

In [264]:
# Let's check for missing values

for df in df_list:
    print(df[df['Shares'].isnull()])

            Shares      Price     Volume
Date                                    
2018-08-09     NaN  52.220001   93970400
2019-01-24     NaN  38.174999  101766000
2019-01-25     NaN  39.439999  134142000
2019-01-28     NaN  39.075001  104768400
2019-01-29     NaN  38.669998  166348800
2020-01-07     NaN  74.597504  108872000
2020-01-08     NaN  75.797501  132079200
2020-01-09     NaN  77.407501  170108400
2020-01-10     NaN  77.582497  140644800
2020-01-13     NaN  79.239998  121532000
2020-01-14     NaN  78.169998  161954400
2020-01-15     NaN  77.834999  121923600
            Shares        Price   Volume
Date                                    
2018-08-09     NaN  1898.520020  4860400
2019-01-24     NaN  1654.930054  4089900
2019-01-25     NaN  1670.569946  4945900
2019-01-28     NaN  1637.890015  4837700
2019-01-29     NaN  1593.880005  4632800
2020-01-07     NaN  1906.859985  4044900
2020-01-08     NaN  1891.969971  3508000
2020-01-09     NaN  1901.050049  3167300
2020-01-10     N

In [265]:
# For the few missing values in the Shares column, a forward fill makes sense

for i in range(N):
    df_list[i].ffill(axis=0,inplace=True)
    print(df_list[i])
 

              Shares      Price     Volume
Date                                      
2018-07-02  150897.0  46.794998   70925200
2018-07-03  151073.0  45.980000   55819200
2018-07-05  151258.0  46.349998   66416800
2018-07-06  151150.0  46.992500   69940800
2018-07-09  150664.0  47.645000   79026400
...              ...        ...        ...
2020-06-24  481357.0  90.014999  192623200
2020-06-25  483107.0  91.209999  137522400
2020-06-26  484175.0  88.407501  205256800
2020-06-29  486061.0  90.445000  130646000
2020-06-30  489538.0  91.199997  140223200

[503 rows x 3 columns]
              Shares        Price   Volume
Date                                      
2018-07-02   78950.0  1713.780029  3185700
2018-07-03   78903.0  1693.959961  2177300
2018-07-05   79221.0  1699.729980  2983100
2018-07-06   79175.0  1710.630005  2650300
2018-07-09   78968.0  1739.020020  3012000
...              ...          ...      ...
2020-06-24  277979.0  2734.399902  4526600
2020-06-25  280237.0  2754.580
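
A forward fill carries the most recent known value into each gap, so it cannot fill a gap at the very start of a series. The sketch below shows this on made-up data; checking for remaining missing values after the fill is a cheap safeguard.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap and a two-day gap in the middle
toy = pd.DataFrame(
    {"Shares": [np.nan, 100.0, np.nan, np.nan, 104.0]},
    index=pd.to_datetime(
        ["2019-01-22", "2019-01-23", "2019-01-24", "2019-01-25", "2019-01-28"]
    ),
)

# Each gap is filled with the last value observed before it
filled = toy.ffill()
print(filled)

# The leading gap has no earlier value to copy, so it remains missing
remaining = int(filled["Shares"].isna().sum())
print("Remaining gaps:", remaining)
```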

In [266]:
# Let's check for duplicate rows

for df in df_list:
    print(df[df.duplicated()])

Empty DataFrame
Columns: [Shares, Price, Volume]
Index: []
Empty DataFrame
Columns: [Shares, Price, Volume]
Index: []
Empty DataFrame
Columns: [Shares, Price, Volume]
Index: []
Empty DataFrame
Columns: [Shares, Price, Volume]
Index: []
Empty DataFrame
Columns: [Shares, Price, Volume]
Index: []
Empty DataFrame
Columns: [Shares, Price, Volume]
Index: []
Empty DataFrame
Columns: [Shares, Price, Volume]
Index: []
Empty DataFrame
Columns: [Shares, Price, Volume]
Index: []
Empty DataFrame
Columns: [Shares, Price, Volume]
Index: []
Empty DataFrame
Columns: [Shares, Price, Volume]
Index: []
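
Note that `DataFrame.duplicated()` compares column values only, so it would not flag a date that appears twice in the index with different values. A complementary check on the index can be sketched as follows, using made-up data.

```python
import pandas as pd

# Hypothetical frame where the same date appears twice with different values
toy = pd.DataFrame(
    {
        "Shares": [100.0, 101.0, 102.0],
        "Price": [10.0, 10.5, 10.5],
        "Volume": [1000, 1200, 900],
    },
    index=pd.to_datetime(["2018-07-02", "2018-07-03", "2018-07-03"]),
)

# Rows whose column values repeat an earlier row (none here)
value_dupes = toy[toy.duplicated()]

# Rows whose index label repeats an earlier row (the second 2018-07-03)
date_dupes = toy[toy.index.duplicated(keep="first")]

print(len(value_dupes), len(date_dupes))
```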


In [267]:
# Finally, let's export the final dataset to separate CSVs

for i in range(N):
    filepath = "../data/" + stock_ticker[i] + ".csv"
    # to_csv returns None when given a path, so there is nothing to assign
    df_list[i].to_csv(filepath)
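
To confirm that an exported file can be read back with the date index intact, a round trip like the one below may be useful. This is a sketch that writes to a temporary directory rather than `../data/`, and it uses only the first two AAPL rows shown above.

```python
import os
import tempfile
import pandas as pd

# Two rows in the same shape as the final dataset
toy = pd.DataFrame(
    {
        "Shares": [150897.0, 151073.0],
        "Price": [46.794998, 45.98],
        "Volume": [70925200, 55819200],
    },
    index=pd.to_datetime(["2018-07-02", "2018-07-03"]).rename("Date"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    filepath = os.path.join(tmp_dir, "AAPL.csv")
    toy.to_csv(filepath)
    # Restore the Date column as a parsed datetime index on reload
    reloaded = pd.read_csv(filepath, index_col="Date", parse_dates=True)

print(reloaded)
print(reloaded.index.dtype)
```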
    