# Predictive Analysis of Robinhood Popularity Data - Data Wrangling

Purpose: In this module, we will analyze the Robinhood Popularity Data provided by RobinTrack, which tracks customer ownership of U.S. stocks on the Robinhood trading platform.

Our goal is to determine whether any features in the Robinhood data provide meaningful insight into both historical (ex post) and prospective (ex ante) price returns (ex dividends).

We will focus on a representative basket of 20 actively traded stocks over a two-year period from July 1, 2018 to June 30, 2020.

Data Sources: RobinTrack, https://robintrack.net/data-download, Yahoo API

In [1]:
# Importing modules for data wrangling and collection
# yfinance is a library for downloading historical market data from Yahoo Finance

import numpy as np
import pandas as pd
from datetime import datetime
import yfinance as yf

In [2]:
# Representative basket of 20 actively traded stocks on Robinhood
# Source: Robinhood, https://robinhood.com/collections/100-most-popular

stock_ticker = ['AAPL', 'AMD', 'AMZN', 'BABA', 'FB', 'GOOGL', 'INTC', 'JPM', 'MSFT', 'NFLX',
                'NKE', 'NVDA', 'PYPL', 'SQ', 'SNAP', 'T', 'TSLA', 'TWTR', 'V', 'ZNGA']


stock_name = ['Apple', 'AMD', 'Amazon', 'Alibaba', 'Facebook', 'Google', 'Intel', 'JPMorgan', 'Microsoft', 'Netflix',
              'Nike', 'NVIDIA', 'Paypal', 'Square', 'Snapchat', 'AT&T', 'Tesla', 'Twitter', 'Visa', 'Zynga']

N = len(stock_ticker)

stock_info = dict(zip(stock_ticker,stock_name))
stock_info

{'AAPL': 'Apple',
 'AMD': 'AMD',
 'AMZN': 'Amazon',
 'BABA': 'Alibaba',
 'FB': 'Facebook',
 'GOOGL': 'Google',
 'INTC': 'Intel',
 'JPM': 'JPMorgan',
 'MSFT': 'Microsoft',
 'NFLX': 'Netflix',
 'NKE': 'Nike',
 'NVDA': 'NVIDIA',
 'PYPL': 'Paypal',
 'SQ': 'Square',
 'SNAP': 'Snapchat',
 'T': 'AT&T',
 'TSLA': 'Tesla',
 'TWTR': 'Twitter',
 'V': 'Visa',
 'ZNGA': 'Zynga'}

In [3]:
# The Robinhood Popularity data is stored in separate CSV files for each stock
# We're going to read in each CSV file as a dataframe and store them all in a single list

df_list = list()

for ticker in stock_ticker:
    filepath = "../data/raw_export/" + ticker + ".csv"
    df = pd.read_csv(filepath,parse_dates=[0])
    df.name = ticker
    df_list.append(df)

In [4]:
# Let's do a quick preview of the dataframes we've imported

for df in df_list:
    print(df.name)
    print(df.info())

AAPL
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19792 entries, 0 to 19791
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   timestamp      19792 non-null  datetime64[ns]
 1   users_holding  19792 non-null  int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 309.4 KB
None
AMD
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19775 entries, 0 to 19774
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   timestamp      19775 non-null  datetime64[ns]
 1   users_holding  19775 non-null  int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 309.1 KB
None
AMZN
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19756 entries, 0 to 19755
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   timestamp      1975

In [5]:
# We want to keep only the dates in our 2-year period of analysis
# We also want a single observation for each date

start_date = datetime(2018,7,1)
end_date = datetime(2020,7,1)

for i in range(N):
    
    # Time period filtering (end_date is exclusive, so June 30, 2020 is the last day kept)
    in_period = (df_list[i]['timestamp'] >= start_date) & (df_list[i]['timestamp'] < end_date)
    df_list[i] = df_list[i].loc[in_period]
    
    # Collapse to one row per date; note that .max() keeps the day's highest users_holding,
    # which is not necessarily the last observation of that date
    df_list[i] = df_list[i].groupby([df_list[i]['timestamp'].dt.date]).max()
    
    # We can drop the redundant timestamp column
    df_list[i].drop(columns=['timestamp'], inplace=True)
    
    print(stock_ticker[i])
    print(df_list[i])

AAPL
            users_holding
timestamp                
2018-07-01         150718
2018-07-02         150897
2018-07-03         151073
2018-07-04         151076
2018-07-05         151258
...                   ...
2020-06-26         484175
2020-06-27         484175
2020-06-28         484144
2020-06-29         486061
2020-06-30         489538

[714 rows x 1 columns]
AMD
            users_holding
timestamp                
2018-07-01         125668
2018-07-02         125760
2018-07-03         125792
2018-07-04         125688
2018-07-05         125687
...                   ...
2020-06-26         227203
2020-06-27         227307
2020-06-28         227307
2020-06-29         229581
2020-06-30         229657

[714 rows x 1 columns]
AMZN
            users_holding
timestamp                
2018-07-01          78884
2018-07-02          78950
2018-07-03          78903
2018-07-04          78903
2018-07-05          79221
...                   ...
2020-06-26         283515
2020-06-27         283515
20
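The cell above collapses each day's snapshots with `.max()`, which returns the highest `users_holding` recorded that day rather than the final reading. The sketch below uses a small made-up frame (the timestamps and counts are illustrative, not RobinTrack data) to show how the two aggregations can differ, in case the last observation is what is actually wanted.

```python
import pandas as pd

# Hypothetical intraday snapshots in the same two-column layout as the RobinTrack CSVs
raw = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-07-02 09:00", "2018-07-02 15:00", "2018-07-02 23:00",
        "2018-07-03 10:00", "2018-07-03 22:00",
    ]),
    "users_holding": [100, 120, 110, 115, 105],
})

# Sort by time so that "last" really means the latest snapshot of each date
by_date = (
    raw.sort_values("timestamp")
       .groupby(raw["timestamp"].dt.date)["users_holding"]
)

daily_max = by_date.max()    # highest count seen on each date
daily_last = by_date.last()  # final snapshot on each date

comparison = pd.DataFrame({"max": daily_max, "last": daily_last})
print(comparison)
```

When holdings only rise during the day the two columns agree; they diverge on days where the count peaks and then falls back.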

In [6]:
# Let's pull in the prices and volume for each stock from the Yahoo API
# We want to right join on the stock price because it only includes trading days (excludes weekends and holidays)


for i in range(N):
    
    # Pulling pricing data
    price_data = yf.download(stock_ticker[i], start=start_date, end=end_date)
    
    # Closing price is the relevant price to keep, along with volume
    closing_data = price_data[['Close', 'Volume']]
    
    # Right join this data with our Robinhood participation data
    df_list[i] = df_list[i].merge(closing_data, how='right', left_index=True, right_index=True)
    
    # Let's also change the column names to be concise
    df_list[i].columns = ['Robinhood', 'Price', 'Volume']
    
    # Let's also add the ticker and stock names
    df_list[i]['Ticker'] = stock_ticker[i]
    df_list[i]['Company'] = stock_name[i]
    
     
    print(df_list[i])

[*********************100%***********************]  1 of 1 completed
            Robinhood      Price     Volume Ticker Company
Date                                                      
2018-07-02   150897.0  46.794998   70925200   AAPL   Apple
2018-07-03   151073.0  45.980000   55819200   AAPL   Apple
2018-07-05   151258.0  46.349998   66416800   AAPL   Apple
2018-07-06   151150.0  46.992500   69940800   AAPL   Apple
2018-07-09   150664.0  47.645000   79026400   AAPL   Apple
...               ...        ...        ...    ...     ...
2020-06-24   481357.0  90.014999  192623200   AAPL   Apple
2020-06-25   483107.0  91.209999  137522400   AAPL   Apple
2020-06-26   484175.0  88.407501  205256800   AAPL   Apple
2020-06-29   486061.0  90.445000  130646000   AAPL   Apple
2020-06-30   489538.0  91.199997  140223200   AAPL   Apple

[503 rows x 5 columns]
[*********************100%***********************]  1 of 1 completed
            Robinhood      Price    Volume Ticker Company
Date         

[*********************100%***********************]  1 of 1 completed
            Robinhood       Price    Volume Ticker Company
Date                                                      
2018-07-02    16922.0   78.349998  11867000    NKE    Nike
2018-07-03    16869.0   76.279999   5794900    NKE    Nike
2018-07-05    16854.0   76.550003   6534500    NKE    Nike
2018-07-06    16877.0   76.480003   5916200    NKE    Nike
2018-07-09    16879.0   77.279999   4871400    NKE    Nike
...               ...         ...       ...    ...     ...
2020-06-24   101671.0  100.080002   8611600    NKE    Nike
2020-06-25   102498.0  101.400002  11531400    NKE    Nike
2020-06-26   103314.0   93.669998  24918500    NKE    Nike
2020-06-29   103774.0   95.870003   9624200    NKE    Nike
2020-06-30   103774.0   98.050003   9065500    NKE    Nike

[503 rows x 5 columns]
[*********************100%***********************]  1 of 1 completed
            Robinhood       Price    Volume Ticker Company
Date        
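The right join above relies on the holdings index (plain `datetime.date` objects produced by `.dt.date`) lining up with the `DatetimeIndex` returned by the price download. Whether those two index types match can depend on the pandas version, so a defensive variant converts the holdings index first. This is a minimal sketch on made-up data; the frame names and values are hypothetical.

```python
import datetime as dt
import pandas as pd

# Hypothetical daily holdings indexed by datetime.date objects (weekend included)
holdings = pd.DataFrame(
    {"users_holding": [10, 11, 12, 13]},
    index=[dt.date(2018, 7, 6), dt.date(2018, 7, 7),
           dt.date(2018, 7, 8), dt.date(2018, 7, 9)],
)

# Hypothetical prices on trading days only, indexed by a DatetimeIndex
prices = pd.DataFrame(
    {"Close": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2018-07-05", "2018-07-06", "2018-07-09"]),
)

# Make both indexes the same type before joining
holdings.index = pd.to_datetime(holdings.index)

# Right join: keep every trading day, even when no holdings row exists for it
merged = holdings.merge(prices, how="right", left_index=True, right_index=True)
print(merged)
```

The weekend rows drop out because they have no matching price, and the trading day with no holdings snapshot survives with a missing value, which is why the holdings column becomes float.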

In [7]:
# Let's check for missing values

for df in df_list:
    print(df.info())

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 503 entries, 2018-07-02 to 2020-06-30
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Robinhood  491 non-null    float64
 1   Price      503 non-null    float64
 2   Volume     503 non-null    int64  
 3   Ticker     503 non-null    object 
 4   Company    503 non-null    object 
dtypes: float64(2), int64(1), object(2)
memory usage: 43.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 503 entries, 2018-07-02 to 2020-06-30
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Robinhood  491 non-null    float64
 1   Price      503 non-null    float64
 2   Volume     503 non-null    int64  
 3   Ticker     503 non-null    object 
 4   Company    503 non-null    object 
dtypes: float64(2), int64(1), object(2)
memory usage: 43.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 503 e

In [8]:
# For the few missing Robinhood holdings values, a forward fill makes sense

for i in range(N):
    df_list[i].ffill(axis=0,inplace=True)
    print(df_list[i])
 

            Robinhood      Price     Volume Ticker Company
Date                                                      
2018-07-02   150897.0  46.794998   70925200   AAPL   Apple
2018-07-03   151073.0  45.980000   55819200   AAPL   Apple
2018-07-05   151258.0  46.349998   66416800   AAPL   Apple
2018-07-06   151150.0  46.992500   69940800   AAPL   Apple
2018-07-09   150664.0  47.645000   79026400   AAPL   Apple
...               ...        ...        ...    ...     ...
2020-06-24   481357.0  90.014999  192623200   AAPL   Apple
2020-06-25   483107.0  91.209999  137522400   AAPL   Apple
2020-06-26   484175.0  88.407501  205256800   AAPL   Apple
2020-06-29   486061.0  90.445000  130646000   AAPL   Apple
2020-06-30   489538.0  91.199997  140223200   AAPL   Apple

[503 rows x 5 columns]
            Robinhood      Price    Volume Ticker Company
Date                                                     
2018-07-02   125760.0  15.160000  43398800    AMD     AMD
2018-07-03   125792.0  15.000000  3
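Forward filling carries the most recent known value into each gap, which has two side effects worth checking: a gap at the very start of the series has nothing to carry forward, and long gaps get filled with increasingly stale values. The sketch below illustrates both on a made-up series; the `limit=1` cap is an illustrative choice, not something the notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical holdings series with a leading gap and a two-day gap
s = pd.Series(
    [np.nan, 100.0, np.nan, np.nan, 130.0],
    index=pd.to_datetime(
        ["2018-07-02", "2018-07-03", "2018-07-05", "2018-07-06", "2018-07-09"]
    ),
    name="Robinhood",
)

n_missing = int(s.isna().sum())   # count the gaps before filling

filled = s.ffill()                # carry the last known value forward
capped = s.ffill(limit=1)         # fill at most one consecutive gap

print(n_missing)
print(filled)
print(capped)
```

The leading missing value stays missing in both results, so a follow-up `isna()` check after the fill is a cheap way to confirm nothing was left behind.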

In [9]:
# Let's check for duplicate rows

for df in df_list:
    print(df[df.duplicated()])

Empty DataFrame
Columns: [Robinhood, Price, Volume, Ticker, Company]
Index: []
Empty DataFrame
Columns: [Robinhood, Price, Volume, Ticker, Company]
Index: []
Empty DataFrame
Columns: [Robinhood, Price, Volume, Ticker, Company]
Index: []
Empty DataFrame
Columns: [Robinhood, Price, Volume, Ticker, Company]
Index: []
Empty DataFrame
Columns: [Robinhood, Price, Volume, Ticker, Company]
Index: []
Empty DataFrame
Columns: [Robinhood, Price, Volume, Ticker, Company]
Index: []
Empty DataFrame
Columns: [Robinhood, Price, Volume, Ticker, Company]
Index: []
Empty DataFrame
Columns: [Robinhood, Price, Volume, Ticker, Company]
Index: []
Empty DataFrame
Columns: [Robinhood, Price, Volume, Ticker, Company]
Index: []
Empty DataFrame
Columns: [Robinhood, Price, Volume, Ticker, Company]
Index: []
Empty DataFrame
Columns: [Robinhood, Price, Volume, Ticker, Company]
Index: []
Empty DataFrame
Columns: [Robinhood, Price, Volume, Ticker, Company]
Index: []
Empty DataFrame
Columns: [Robinhood, Price, Volume, 
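`DataFrame.duplicated()` compares column values only, so it would not flag two rows that share the same date but carry different numbers. A complementary check on the index is sketched below using a made-up frame; the values are hypothetical.

```python
import pandas as pd

# Hypothetical frame where the same date appears twice with different values
df = pd.DataFrame(
    {"Robinhood": [10.0, 11.0, 12.0], "Price": [1.0, 1.5, 2.0]},
    index=pd.to_datetime(["2018-07-02", "2018-07-02", "2018-07-03"]),
)

# Value-based check: no two rows have identical column values
row_dupes = df[df.duplicated()]

# Index-based check: flags the repeated date
date_dupes = df[df.index.duplicated(keep="first")]

print(len(row_dupes), len(date_dupes))
```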

In [10]:
# We're going to merge all the dataframes into one dataframe

# DataFrame.append was removed in pandas 2.0, so we stack the frames with pd.concat
df_export = pd.concat(df_list)

print(df_export.shape)
df_export.sample(10)


(10060, 5)


            Robinhood        Price     Volume Ticker  Company
Date                                                         
2020-02-27    41002.0   107.839996   15461900   PYPL   Paypal
2019-01-29   112653.0   328.899994    7655200   NFLX  Netflix
2018-09-27   169148.0        32.59   87934400    AMD      AMD
2019-09-05   104787.0       293.25    8966800   NFLX  Netflix
2020-01-14    29839.0  1430.589966    1303800  GOOGL   Google
2018-10-24   109319.0    57.700001  100291500   TSLA    Tesla
2019-11-12    29572.0  1297.209961    1442600  GOOGL   Google
2020-06-30   489538.0    91.199997  140223200   AAPL    Apple
2019-06-04   109851.0         6.35   13559800   ZNGA    Zynga
2019-05-28    51039.0        43.57   34779900   INTC    Intel
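After stacking the per-stock frames, a quick per-ticker row count confirms that every stock contributed the expected number of trading days. The sketch below shows the idea on two small made-up frames; the tickers `AAA` and `BBB` are placeholders.

```python
import pandas as pd

idx = pd.to_datetime(["2018-07-02", "2018-07-03"])

# Hypothetical per-stock frames sharing the same date index
frames = [
    pd.DataFrame({"Price": [1.0, 2.0], "Ticker": "AAA"}, index=idx),
    pd.DataFrame({"Price": [3.0, 4.0], "Ticker": "BBB"}, index=idx),
]

# Stack the frames vertically; dates repeat once per ticker
combined = pd.concat(frames)

# Each ticker should contribute the same number of rows
rows_per_ticker = combined.groupby("Ticker").size()
print(rows_per_ticker)
```

Because the dates repeat across tickers, the combined index is not unique, so later lookups generally need the `Ticker` column as well as the date.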


In [11]:
# Finally, let's export the final dataset to a single CSV file
filepath = "../data/stock_data.csv"
df_export.to_csv(filepath)
    
# Let's also export a csv of the stock basket

filepath = "../data/stock_info.csv"
stock_df = pd.DataFrame.from_dict(stock_info, orient='index')
stock_df.index.rename('Ticker',inplace=True)
stock_df.rename(columns={0: 'Company'}, inplace=True)
stock_df.to_csv(filepath)
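When the exported file is read back in a later module, the date index comes in as plain strings unless it is parsed explicitly. The following is a minimal round-trip sketch that writes to a temporary directory rather than `../data/`; the two sample rows are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row frame in the same layout as the exported dataset
df = pd.DataFrame(
    {
        "Robinhood": [150897.0, 151073.0],
        "Price": [46.5, 45.75],
        "Volume": [70925200, 55819200],
        "Ticker": ["AAPL", "AAPL"],
        "Company": ["Apple", "Apple"],
    },
    index=pd.DatetimeIndex(["2018-07-02", "2018-07-03"], name="Date"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "stock_data.csv"
    df.to_csv(path)

    # Restore the Date column as a parsed DatetimeIndex
    restored = pd.read_csv(path, index_col="Date", parse_dates=["Date"])

print(restored.dtypes)
```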