## Requirements
- Identifies which of the three proposals you outlined in your lightning talk you have chosen
- Articulates the main goal of your project (your problem statement)
- Outlines your proposed methods and models
- Defines the risks & assumptions of your data
- Revises initial goals & success criteria, as needed
- Documents your data source


## Problem Statement
- Specific:
  - What precisely do you plan to do?
  - What type of model will you need to develop?
- Measurable:
  - What metrics will you be using to assess performance?
  - MSE? Accuracy? Precision? AUC?
- Achievable:
  - Is your project appropriately scoped?
  - Is it too aggressive?  Too easy?
  - *Note:* If your project is too big, break it up into smaller pieces.  Sometimes a good project is simply one part of a larger, longer-term agenda.
- Relevant:
  - Does anyone care about this?
  - Why should people be interested in your results?
  - What value will the completion of your project be adding?
- Time-bound:
  - What's your deadline?

### Main goal & Problem statement

My project will build a stock picker that selects stocks using fundamental data from the companies' financial statements. The main goal is to pick a portfolio of ten stocks that outperforms the market benchmark (S&P 500). Because fundamental data is published quarterly, we are more interested in each stock's performance after one year; this is not meant to be used as a trading strategy. Since we are predicting stock performance, this is a regression problem.

Much of the work will go into feature engineering on the fundamental data (e.g. financial and accounting ratios), so that the features correlate with business success and serve as good performance indicators. Working with time series is also important, as we want to relate each stock's price performance to the company's financial performance.
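As a minimal sketch of what such ratio features could look like, the snippet below derives three common ratios from column names that appear in the SimFin statements (`Revenue`, `Gross Profit`, `Net Income`, `Total Liabilities`, `Total Equity`). The helper name `add_basic_ratios`, the choice of ratios and the toy figures are assumptions for illustration, not the final feature set.

```python
import numpy as np
import pandas as pd

def add_basic_ratios(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a few illustrative ratio columns added."""
    out = df.copy()
    # Treat a zero denominator as missing so ratios become NaN rather than inf.
    revenue = out["Revenue"].replace(0, np.nan)
    equity = out["Total Equity"].replace(0, np.nan)
    out["Gross Margin"] = out["Gross Profit"] / revenue
    out["Net Margin"] = out["Net Income"] / revenue
    out["Debt to Equity"] = out["Total Liabilities"] / equity
    return out

# Toy rows; the figures are made up for illustration only.
toy = pd.DataFrame({
    "Revenue": [1000.0, 0.0],
    "Gross Profit": [400.0, 0.0],
    "Net Income": [100.0, -5.0],
    "Total Liabilities": [300.0, 50.0],
    "Total Equity": [600.0, 100.0],
})
ratios = add_basic_ratios(toy)
print(ratios[["Gross Margin", "Net Margin", "Debt to Equity"]])
```

Replacing zero denominators with NaN keeps infinite values out of the feature matrix, at the cost of having to handle those missing values later.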

### Proposed methods and models

I will work with models learnt throughout the course, keep those that show predictive ability on stock performance, and reuse them for backtesting. Candidate models are linear regression, regularized regression (Lasso, Ridge or Elastic Net), KNN, SVM regressor and decision trees. MSE will likely be the metric used to assess model performance.


Feature importance is also worth examining, as we want to know which variables/ratios give us better predictive ability for stock performance. After running these models, we can put the results in a table to compare the top 10 stock portfolios each model selects.

After this, we will do backtesting to find out the volatility as well as the return profile of each portfolio. Backtesting will follow roughly these steps:
- Split the data into train and test sets and train the model on the training set
- For a given year, use the model to predict stock returns and pick the 10 top predicted performers from the test set to create an equal-weighted portfolio
- Record the daily/weekly change in portfolio value to track the portfolio's performance
- Repeat for each year of available data, likely in a loop (this gives us the return and volatility)

Each model should give us a return/volatility profile, which we can then use to decide which stocks to pick.
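The backtesting steps above could be sketched roughly as follows. This is a simplified illustration rather than the final implementation: the model-training step is left out, the predictions are synthetic random numbers, and returns are recorded once per year instead of daily or weekly. The function name `backtest_top_n` and the columns `Year`, `predicted` and `realized` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

def backtest_top_n(preds: pd.DataFrame, n: int = 10) -> pd.Series:
    """For each year, equal-weight the n stocks with the highest predicted return.

    `preds` needs the columns Year, Ticker, predicted and realized.
    Returns the realized portfolio return for each year.
    """
    yearly = {}
    for year, group in preds.groupby("Year"):
        picks = group.nlargest(n, "predicted")
        # With equal weights, the portfolio return is the simple mean.
        yearly[year] = picks["realized"].mean()
    return pd.Series(yearly, name="portfolio_return").sort_index()

# Synthetic stand-in for model output: 50 made-up tickers over 3 years.
rng = np.random.default_rng(0)
rows = []
for year in (2017, 2018, 2019):
    realized = rng.normal(0.08, 0.25, size=50)
    predicted = realized + rng.normal(0.0, 0.25, size=50)
    for i in range(50):
        rows.append((year, f"T{i:02d}", predicted[i], realized[i]))
preds = pd.DataFrame(rows, columns=["Year", "Ticker", "predicted", "realized"])

returns = backtest_top_n(preds, n=10)
summary = {"mean_return": returns.mean(), "volatility": returns.std()}
print(returns)
print(summary)
```

Running the same function on each model's predictions would give one return/volatility pair per model, which can then be tabulated side by side.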

### Relevance

Potentially, this can be used by portfolio managers or retail investors who wish to beat market returns without taking on too much additional risk. As I do not yet know how the tested models will perform, the minimum success criterion is that results at least match the benchmark return. The number of stocks picked is also kept small to minimize potential transaction costs.

### Timeline

I will start working on the project right after completing Project 4 and aim to have a working prototype by the second week of the capstone project. I will refine the model in the third week and prepare the presentation in the fourth week.

## Project starts here

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

SimFin is where we will download our fundamentals data from. Instead of downloading CSV files manually, we will use an API key to download the data directly.

In [2]:
import simfin as sf

# Set your SimFin + API-key for downloading data.
sf.set_api_key('nYdjeL237mt7QaQ4OZTaPa4Xn3YECrTH')

# Set the local directory where data-files are stored.
# The directory will be created if it does not already exist.
sf.set_data_dir('~/simfin_data/')

# Download the data from the SimFin server and load into a Pandas DataFrame.
# We will be getting 3 sets of financial statements (income statement, balance sheet, cashflow statement)
df_income = sf.load_income(variant='quarterly', market='us')
df_balance = sf.load_balance(variant='quarterly', market='us')
df_cashflow = sf.load_cashflow(variant='quarterly', market='us')

# Print the first rows of the data.
df_income.head()

Dataset "us-income-quarterly" on disk (6 days old).
- Loading from disk ... Done!
Dataset "us-balance-quarterly" on disk (6 days old).
- Loading from disk ... Done!
Dataset "us-cashflow-quarterly" on disk (6 days old).
- Loading from disk ... Done!


Ticker,Report Date,SimFinId,Currency,Fiscal Year,Fiscal Period,Publish Date,Restated Date,Shares (Basic),Shares (Diluted),Revenue,Cost of Revenue,...,Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Extraordinary Gains (Losses),Net Income,Net Income (Common)
A,2016-10-31,45846,USD,2016,Q4,2016-12-20,2018-12-20,324000000.0,328000000.0,1111000000.0,-523000000.0,...,-32000000.0,-16000000.0,151000000,,151000000,-25000000.0,126000000,,126000000,126000000
A,2017-01-31,45846,USD,2017,Q1,2017-03-08,2018-03-06,322000000.0,326000000.0,1067000000.0,-493000000.0,...,-13000000.0,-16000000.0,193000000,,193000000,-25000000.0,168000000,,168000000,168000000
A,2017-04-30,45846,USD,2017,Q2,2017-06-06,2018-05-31,321000000.0,325000000.0,1102000000.0,-510000000.0,...,-10000000.0,-15000000.0,191000000,,191000000,-27000000.0,164000000,,164000000,164000000
A,2017-07-31,45846,USD,2017,Q3,2017-09-06,2018-08-30,321000000.0,326000000.0,1114000000.0,-518000000.0,...,-8000000.0,-13000000.0,193000000,,193000000,-18000000.0,175000000,,175000000,175000000
A,2017-10-31,45846,USD,2017,Q4,2017-12-21,2018-12-20,324000000.0,327000000.0,1189000000.0,-542000000.0,...,-7000000.0,-13000000.0,226000000,,226000000,-49000000.0,177000000,,177000000,177000000


In [3]:
df_balance.head()

Ticker,Report Date,SimFinId,Currency,Fiscal Year,Fiscal Period,Publish Date,Restated Date,Shares (Basic),Shares (Diluted),"Cash, Cash Equivalents & Short Term Investments",Accounts & Notes Receivable,...,Short Term Debt,Total Current Liabilities,Long Term Debt,Total Noncurrent Liabilities,Total Liabilities,Share Capital & Additional Paid-In Capital,Treasury Stock,Retained Earnings,Total Equity,Total Liabilities & Equity
A,2016-10-31,45846,USD,2016,Q4,2016-12-20,2017-12-21,324000000.0,328000000.0,2289000000.0,631000000.0,...,0.0,945000000,1904000000.0,2603000000.0,3548000000,9165000000.0,-10508000000.0,6089000000.0,4246000000.0,7794000000
A,2017-01-31,45846,USD,2017,Q1,2017-03-08,2017-03-08,322000000.0,326000000.0,2241000000.0,653000000.0,...,190000000.0,1089000000,1802000000.0,2483000000.0,3572000000,5239000000.0,0.0,-453000000.0,4300000000.0,7872000000
A,2017-04-30,45846,USD,2017,Q2,2017-06-06,2017-06-06,321000000.0,325000000.0,2389000000.0,677000000.0,...,241000000.0,1187000000,1802000000.0,2454000000.0,3641000000,5242000000.0,0.0,-393000000.0,4375000000.0,8016000000
A,2017-07-31,45846,USD,2017,Q3,2017-09-06,2017-09-06,321000000.0,326000000.0,2563000000.0,678000000.0,...,280000000.0,1241000000,1801000000.0,2409000000.0,3650000000,5285000000.0,0.0,-260000000.0,4611000000.0,8261000000
A,2017-10-31,45846,USD,2017,Q4,2017-12-21,2018-12-20,324000000.0,327000000.0,2678000000.0,724000000.0,...,210000000.0,1263000000,1801000000.0,2328000000.0,3591000000,5303000000.0,,-126000000.0,4835000000.0,8426000000


In [4]:
df_cashflow.head()

Ticker,Report Date,SimFinId,Currency,Fiscal Year,Fiscal Period,Publish Date,Restated Date,Shares (Basic),Shares (Diluted),Net Income/Starting Line,Depreciation & Amortization,...,Net Cash from Operating Activities,Change in Fixed Assets & Intangibles,Net Change in Long Term Investment,Net Cash from Acquisitions & Divestitures,Net Cash from Investing Activities,Dividends Paid,Cash from (Repayment of) Debt,Cash from (Repurchase of) Equity,Net Cash from Financing Activities,Net Change in Cash
A,2016-10-31,45846,USD,2016,Q4,2016-12-20,2018-12-20,324000000.0,328000000.0,126000000.0,56000000.0,...,234000000.0,-52000000.0,0.0,-26000000.0,-78000000.0,-38000000.0,27000000.0,-43000000.0,-56000000.0,90000000
A,2017-01-31,45846,USD,2017,Q1,2017-03-08,2018-03-06,322000000.0,326000000.0,168000000.0,55000000.0,...,116000000.0,-32000000.0,,-69000000.0,-101000000.0,-42000000.0,89000000.0,-93000000.0,-58000000.0,-48000000
A,2017-04-30,45846,USD,2017,Q2,2017-06-06,2018-05-31,321000000.0,325000000.0,164000000.0,54000000.0,...,257000000.0,-43000000.0,,0.0,-43000000.0,-43000000.0,52000000.0,-75000000.0,-67000000.0,148000000
A,2017-07-31,45846,USD,2017,Q3,2017-09-06,2018-08-30,321000000.0,326000000.0,175000000.0,51000000.0,...,228000000.0,-43000000.0,,-57000000.0,-101000000.0,-42000000.0,39000000.0,32000000.0,29000000.0,174000000
A,2017-10-31,45846,USD,2017,Q4,2017-12-21,2018-12-20,324000000.0,327000000.0,177000000.0,52000000.0,...,288000000.0,-58000000.0,0.0,0.0,-60000000.0,-43000000.0,-70000000.0,8000000.0,-106000000.0,115000000


In [5]:
print('Income Statement CSV data is: ', df_income.shape)
print('Balance Sheet CSV data is: ', df_balance.shape)
print('Cash Flow CSV data is: ', df_cashflow.shape)

Income Statement CSV data is:  (36685, 26)
Balance Sheet CSV data is:  (36685, 28)
Cash Flow CSV data is:  (36685, 26)


The number of rows is consistent across the three statements: 36,685 rows in total.

In [6]:
# Merge the data together
# Define the column features where merge takes place
list_to_merge_on = ['Ticker', 'SimFinId', 'Currency', 'Fiscal Year', 'Report Date', 'Publish Date']

# Merge the income statement and balance sheet first
merge1 = pd.merge(df_income, df_balance, on = list_to_merge_on, how = 'inner')

# Merge previous result with cashflow statement
df_merged = pd.merge(merge1, df_cashflow, on = list_to_merge_on, how = 'inner')

# Reset the index
df_merged.reset_index(inplace=True)

# Make sure that the dates are in correct format
df_merged["Report Date"] = pd.to_datetime(df_merged["Report Date"])
df_merged["Publish Date"] = pd.to_datetime(df_merged["Publish Date"])

print('Merged data matrix shape is: ', df_merged.shape)

Merged data matrix shape is:  (36685, 74)


In [7]:
df_merged.head()

Unnamed: 0,Ticker,Report Date,SimFinId,Currency,Fiscal Year,Fiscal Period_x,Publish Date,Restated Date_x,Shares (Basic)_x,Shares (Diluted)_x,...,Net Cash from Operating Activities,Change in Fixed Assets & Intangibles,Net Change in Long Term Investment,Net Cash from Acquisitions & Divestitures,Net Cash from Investing Activities,Dividends Paid,Cash from (Repayment of) Debt,Cash from (Repurchase of) Equity,Net Cash from Financing Activities,Net Change in Cash
0,A,2016-10-31,45846,USD,2016,Q4,2016-12-20,2018-12-20,324000000.0,328000000.0,...,234000000.0,-52000000.0,0.0,-26000000.0,-78000000.0,-38000000.0,27000000.0,-43000000.0,-56000000.0,90000000
1,A,2017-01-31,45846,USD,2017,Q1,2017-03-08,2018-03-06,322000000.0,326000000.0,...,116000000.0,-32000000.0,,-69000000.0,-101000000.0,-42000000.0,89000000.0,-93000000.0,-58000000.0,-48000000
2,A,2017-04-30,45846,USD,2017,Q2,2017-06-06,2018-05-31,321000000.0,325000000.0,...,257000000.0,-43000000.0,,0.0,-43000000.0,-43000000.0,52000000.0,-75000000.0,-67000000.0,148000000
3,A,2017-07-31,45846,USD,2017,Q3,2017-09-06,2018-08-30,321000000.0,326000000.0,...,228000000.0,-43000000.0,,-57000000.0,-101000000.0,-42000000.0,39000000.0,32000000.0,29000000.0,174000000
4,A,2017-10-31,45846,USD,2017,Q4,2017-12-21,2018-12-20,324000000.0,327000000.0,...,288000000.0,-58000000.0,0.0,0.0,-60000000.0,-43000000.0,-70000000.0,8000000.0,-106000000.0,115000000
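The `_x` and `_y` suffixes in the merged columns (e.g. `Fiscal Period_x`, `Shares (Basic)_y`) appear because those columns exist in more than one statement but are not part of the merge keys. One possible way to avoid them, sketched below on toy frames, is to drop from the right-hand frame any non-key column already present on the left. The helper `merge_without_duplicates` is hypothetical, and the sketch assumes `Ticker` and `Report Date` are ordinary columns rather than index levels. It would also discard differing values such as `Restated Date`, which is not always identical across statements.

```python
import pandas as pd

def merge_without_duplicates(left, right, keys):
    """Inner-merge on keys, dropping right-hand columns already present in left."""
    duplicated = [c for c in right.columns if c in left.columns and c not in keys]
    return pd.merge(left, right.drop(columns=duplicated), on=keys, how="inner")

# Toy frames with made-up figures; only the column names mirror the real data.
income = pd.DataFrame({
    "Ticker": ["A", "A"],
    "Report Date": ["2016-10-31", "2017-01-31"],
    "Fiscal Period": ["Q4", "Q1"],
    "Revenue": [1.0, 2.0],
})
balance = pd.DataFrame({
    "Ticker": ["A", "A"],
    "Report Date": ["2016-10-31", "2017-01-31"],
    "Fiscal Period": ["Q4", "Q1"],
    "Total Assets": [10.0, 20.0],
})
merged = merge_without_duplicates(income, balance, ["Ticker", "Report Date"])
print(merged.columns.tolist())
```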


In [8]:
df_prices = sf.load_shareprices(variant='daily', market='us')

Dataset "us-shareprices-daily" on disk (5 days old).
- Loading from disk ... Done!


In [9]:
df_prices.reset_index(inplace=True)

In [10]:
df_prices.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3157862 entries, 0 to 3157861
Data columns (total 11 columns):
 #   Column              Dtype         
---  ------              -----         
 0   Ticker              object        
 1   Date                datetime64[ns]
 2   SimFinId            int64         
 3   Open                float64       
 4   Low                 float64       
 5   High                float64       
 6   Close               float64       
 7   Adj. Close          float64       
 8   Dividend            float64       
 9   Volume              int64         
 10  Shares Outstanding  float64       
dtypes: datetime64[ns](1), float64(7), int64(2), object(1)
memory usage: 265.0+ MB


In [11]:
df_prices.isnull().sum()

Ticker                      0
Date                        0
SimFinId                    0
Open                    21601
Low                     22052
High                    21166
Close                   21571
Adj. Close              21456
Dividend              3134252
Volume                      0
Shares Outstanding      44696
dtype: int64

## Function 1: to get price of a ticker (or find the nearest price given date)

In [12]:
# This function takes in the ticker, start date, days forward offset and the prices dataframe
# It returns the ticker, open price, date and transaction value (volume x open price).

def getPrice (ticker, datestart, days_fwd, df=df_prices):
    
    # looks into a window of the next 5 days if date chosen falls on a non-trading day
    window = 5
    
    # returns all the rows between the start date and the window, filtered from df
    rows = df[
    (df["Date"].between(pd.to_datetime(datestart) + pd.Timedelta(days=days_fwd),\
                        pd.to_datetime(datestart) + pd.Timedelta(days=window + days_fwd)))\
        & (df["Ticker"]==ticker)]
    
    # if nothing between start date and window, return nothing
    # NaT stands for Not a Time
    if rows.empty:
        # np.float was removed in NumPy 1.24, so use np.nan instead
        return [ticker, np.nan, np.datetime64('NaT'), np.nan]
    
    # else return the first row within the filtered df i.e. closest date
    # returns ticker, open price, date, transaction value (volume x open price)
    else:
        return [ticker, 
                rows.iloc[0]["Open"],\
                rows.iloc[0]["Date"],\
                rows.iloc[0]["Volume"] * rows.iloc[0]["Open"]]

### To demonstrate how the above function works

In [13]:
# get daily AAPL data in Sep 2016
df_prices.query("Ticker=='AAPL' & Date<'2016-10-01'")

Unnamed: 0,Ticker,Date,SimFinId,Open,Low,High,Close,Adj. Close,Dividend,Volume,Shares Outstanding
10237,AAPL,2016-09-06,111052,26.98,26.88,27.07,26.93,25.11,,107521564,21553770000.0
10238,AAPL,2016-09-07,111052,26.96,26.77,27.19,27.09,25.26,,169457312,21553770000.0
10239,AAPL,2016-09-08,111052,26.81,26.31,26.82,26.38,24.6,,212008104,21553770000.0
10240,AAPL,2016-09-09,111052,26.16,25.78,26.43,25.78,24.04,,186227936,21553770000.0
10241,AAPL,2016-09-12,111052,25.66,25.63,26.43,26.36,24.58,,181171080,21553770000.0
10242,AAPL,2016-09-13,111052,26.88,26.81,27.2,26.99,25.16,,248704760,21553770000.0
10243,AAPL,2016-09-14,111052,27.18,27.15,28.26,27.94,26.06,,449361272,21553770000.0
10244,AAPL,2016-09-15,111052,28.46,28.37,28.93,28.89,26.94,,362452708,21553770000.0
10245,AAPL,2016-09-16,111052,28.78,28.51,29.03,28.73,26.79,,319547644,21553770000.0
10246,AAPL,2016-09-19,111052,28.8,28.31,29.05,28.39,26.48,,188092184,21553770000.0


In [14]:
# use function to get price data from AAPL 09 Sep 2016
getPrice('AAPL', '2016-09-09', 0, df_prices)

['AAPL', 26.16, Timestamp('2016-09-09 00:00:00'), 4871722805.76]

In [15]:
# use function to get price data for AAPL on 10 Sep 2016
# as 10 Sep 2016 is a non-trading day, the function finds the price info from 12 Sep instead
getPrice('AAPL', '2016-09-10', 0, df_prices)

['AAPL', 25.66, Timestamp('2016-09-12 00:00:00'), 4648849912.8]

In [16]:
# use function to get AAPL price data 30 days in the future using the days_fwd parameter
getPrice('AAPL', '2016-09-10', 30, df_prices)

['AAPL', 28.75, Timestamp('2016-10-10 00:00:00'), 4167134940.0]

## Function 2
- Take in all fundamental data (fund_data) and all stock prices (price_data), with a days forward modifier
- Returns stock price info for report date and stock price one year (can change using modifier) from that date

In [17]:
def getReportDatePrices(fund_data, price_data, days_fwd=365):
    
    # We want one row per report, holding two getPrice results side by side:
    # [price info at date] + [price info at date + days_fwd]
    
    # Pre-allocate a list of len(fund_data) rows, each with 8 placeholders
    # (4 values from each of the two getPrice calls)
    
    y = [[None]*8 for i in range(len(fund_data))]
    
    # We can use Publish Date or Report Date, but to get performance we should use Publish Date
    # We cannot take action on Report Date, but we can trade when info is being published
    date_column = 'Publish Date' 

    # start from zero, loop through all rows and fill in the pre-allocated list
    
    i=0
    for index in range(len(fund_data)):
        y[i] = (getPrice(fund_data['Ticker'].iloc[index], fund_data[date_column].iloc[index], 0, price_data) +\
                getPrice(fund_data['Ticker'].iloc[index], fund_data[date_column].iloc[index], days_fwd, price_data))
        i=i+1
    return y
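Each row returned by the function is a flat list of 8 values: the 4 values from `getPrice` at the publish date followed by the 4 values at `days_fwd` later. A possible next step, sketched below, is to load that list into a DataFrame and compute the simple return between the two open prices. The column names and the helper `to_return_frame` are hypothetical, and the sample row reuses values from the AAPL demonstrations above purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical names for the 8 values in each row of y.
COLUMNS = ["Ticker", "Open Price", "Date", "Value",
           "Ticker2", "Open Price2", "Date2", "Value2"]

def to_return_frame(y):
    """Turn the list of 8-value rows into a DataFrame with a simple return."""
    frame = pd.DataFrame(y, columns=COLUMNS)
    # Return over the holding period; NaN whenever either price is missing.
    frame["Return"] = frame["Open Price2"] / frame["Open Price"] - 1
    return frame

y = [
    ["AAPL", 26.16, pd.Timestamp("2016-09-09"), 4871722805.76,
     "AAPL", 28.75, pd.Timestamp("2016-10-10"), 4167134940.0],
    # A made-up ticker with no price found in either window.
    ["XYZ", np.nan, pd.NaT, np.nan, "XYZ", np.nan, pd.NaT, np.nan],
]
frame = to_return_frame(y)
print(frame[["Ticker", "Open Price", "Open Price2", "Return"]])
```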

In [18]:
#getReportDatePrices(df_merged, df_prices, 365)

This would take a long time to run (an estimated few hours), because every call to getPrice filters the roughly 3 million rows of the df_prices dataset. A workaround is to first merge fund_data with price_data to get a much smaller subset of the price data, and then apply the getPrice function to that subset.
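An alternative that avoids the row-by-row loop altogether is `pd.merge_asof`, which matches each report to the nearest price row in a single vectorized pass. The sketch below is a swapped-in technique rather than the approach used in this notebook: `direction="forward"` looks for the first price on or after the date, and `tolerance` plays the role of the 5-day window. The helper name `attach_next_price` is an assumption, and the toy prices are taken from the AAPL rows shown earlier.

```python
import pandas as pd

def attach_next_price(reports, prices, date_col="Publish Date", window_days=5):
    """Attach the first price row on or after date_col, within window_days."""
    # merge_asof requires both frames to be sorted by their date key.
    left = reports.sort_values(date_col)
    right = prices.sort_values("Date")
    return pd.merge_asof(
        left, right,
        left_on=date_col, right_on="Date",
        by="Ticker",                 # only match rows of the same ticker
        direction="forward",         # first price on or after the date
        tolerance=pd.Timedelta(days=window_days),
    )

reports = pd.DataFrame({
    "Ticker": ["AAPL", "AAPL"],
    "Publish Date": pd.to_datetime(["2016-09-10", "2016-09-25"]),
})
prices = pd.DataFrame({
    "Ticker": ["AAPL", "AAPL", "AAPL"],
    "Date": pd.to_datetime(["2016-09-09", "2016-09-12", "2016-09-13"]),
    "Open": [26.16, 25.66, 26.88],
})
result = attach_next_price(reports, prices)
print(result)
```

To get the price `days_fwd` later, the same helper could be applied to a copy of the reports whose date column has been shifted by `pd.Timedelta(days=days_fwd)`. Rows with no price inside the window come back with `NaT` and `NaN` instead of raising an error.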

## STOPPED HERE TO TEST OUT AND REDO FUNCTIONS TO MAKE IT RUN FASTER

In [19]:
pd.set_option('display.max_columns', 200)
df_merged.head()

Unnamed: 0,Ticker,Report Date,SimFinId,Currency,Fiscal Year,Fiscal Period_x,Publish Date,Restated Date_x,Shares (Basic)_x,Shares (Diluted)_x,Revenue,Cost of Revenue,Gross Profit,Operating Expenses,"Selling, General & Administrative",Research & Development,Depreciation & Amortization_x,Operating Income (Loss),Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Extraordinary Gains (Losses),Net Income,Net Income (Common),Fiscal Period_y,Restated Date_y,Shares (Basic)_y,Shares (Diluted)_y,"Cash, Cash Equivalents & Short Term Investments",Accounts & Notes Receivable,Inventories,Total Current Assets,"Property, Plant & Equipment, Net",Long Term Investments & Receivables,Other Long Term Assets,Total Noncurrent Assets,Total Assets,Payables & Accruals,Short Term Debt,Total Current Liabilities,Long Term Debt,Total Noncurrent Liabilities,Total Liabilities,Share Capital & Additional Paid-In Capital,Treasury Stock,Retained Earnings,Total Equity,Total Liabilities & Equity,Fiscal Period,Restated Date,Shares (Basic),Shares (Diluted),Net Income/Starting Line,Depreciation & Amortization_y,Non-Cash Items,Change in Working Capital,Change in Accounts Receivable,Change in Inventories,Change in Accounts Payable,Change in Other,Net Cash from Operating Activities,Change in Fixed Assets & Intangibles,Net Change in Long Term Investment,Net Cash from Acquisitions & Divestitures,Net Cash from Investing Activities,Dividends Paid,Cash from (Repayment of) Debt,Cash from (Repurchase of) Equity,Net Cash from Financing Activities,Net Change in Cash
0,A,2016-10-31,45846,USD,2016,Q4,2016-12-20,2018-12-20,324000000.0,328000000.0,1111000000.0,-523000000.0,588000000.0,-405000000.0,-321000000.0,-84000000.0,,183000000.0,-32000000.0,-16000000.0,151000000,,151000000,-25000000.0,126000000,,126000000,126000000,Q4,2017-12-21,324000000.0,328000000.0,2289000000.0,631000000.0,533000000.0,3635000000.0,639000000.0,135000000.0,3385000000.0,4159000000.0,7794000000,257000000.0,0.0,945000000,1904000000.0,2603000000.0,3548000000,9165000000.0,-10508000000.0,6089000000.0,4246000000.0,7794000000,Q4,2018-12-20,324000000.0,328000000.0,126000000.0,56000000.0,2000000.0,50000000.0,-52000000.0,4000000.0,12000000.0,86000000.0,234000000.0,-52000000.0,0.0,-26000000.0,-78000000.0,-38000000.0,27000000.0,-43000000.0,-56000000.0,90000000
1,A,2017-01-31,45846,USD,2017,Q1,2017-03-08,2018-03-06,322000000.0,326000000.0,1067000000.0,-493000000.0,574000000.0,-368000000.0,-289000000.0,-79000000.0,,206000000.0,-13000000.0,-16000000.0,193000000,,193000000,-25000000.0,168000000,,168000000,168000000,Q1,2017-03-08,322000000.0,326000000.0,2241000000.0,653000000.0,551000000.0,3635000000.0,653000000.0,133000000.0,3451000000.0,4237000000.0,7872000000,268000000.0,190000000.0,1089000000,1802000000.0,2483000000.0,3572000000,5239000000.0,0.0,-453000000.0,4300000000.0,7872000000,Q1,2018-03-06,322000000.0,326000000.0,168000000.0,55000000.0,45000000.0,-152000000.0,-31000000.0,-26000000.0,9000000.0,-104000000.0,116000000.0,-32000000.0,,-69000000.0,-101000000.0,-42000000.0,89000000.0,-93000000.0,-58000000.0,-48000000
2,A,2017-04-30,45846,USD,2017,Q2,2017-06-06,2018-05-31,321000000.0,325000000.0,1102000000.0,-510000000.0,592000000.0,-391000000.0,-307000000.0,-84000000.0,,201000000.0,-10000000.0,-15000000.0,191000000,,191000000,-27000000.0,164000000,,164000000,164000000,Q2,2017-06-06,321000000.0,325000000.0,2389000000.0,677000000.0,548000000.0,3800000000.0,675000000.0,134000000.0,3407000000.0,4216000000.0,8016000000,265000000.0,241000000.0,1187000000,1802000000.0,2454000000.0,3641000000,5242000000.0,0.0,-393000000.0,4375000000.0,8016000000,Q2,2018-05-31,321000000.0,325000000.0,164000000.0,54000000.0,27000000.0,12000000.0,-17000000.0,-3000000.0,-3000000.0,35000000.0,257000000.0,-43000000.0,,0.0,-43000000.0,-43000000.0,52000000.0,-75000000.0,-67000000.0,148000000
3,A,2017-07-31,45846,USD,2017,Q3,2017-09-06,2018-08-30,321000000.0,326000000.0,1114000000.0,-518000000.0,596000000.0,-395000000.0,-308000000.0,-87000000.0,,201000000.0,-8000000.0,-13000000.0,193000000,,193000000,-18000000.0,175000000,,175000000,175000000,Q3,2017-09-06,321000000.0,326000000.0,2563000000.0,678000000.0,566000000.0,3996000000.0,716000000.0,137000000.0,3412000000.0,4265000000.0,8261000000,289000000.0,280000000.0,1241000000,1801000000.0,2409000000.0,3650000000,5285000000.0,0.0,-260000000.0,4611000000.0,8261000000,Q3,2018-08-30,321000000.0,326000000.0,175000000.0,51000000.0,83000000.0,-81000000.0,19000000.0,-17000000.0,5000000.0,-88000000.0,228000000.0,-43000000.0,,-57000000.0,-101000000.0,-42000000.0,39000000.0,32000000.0,29000000.0,174000000
4,A,2017-10-31,45846,USD,2017,Q4,2017-12-21,2018-12-20,324000000.0,327000000.0,1189000000.0,-542000000.0,647000000.0,-414000000.0,-325000000.0,-89000000.0,,233000000.0,-7000000.0,-13000000.0,226000000,,226000000,-49000000.0,177000000,,177000000,177000000,Q4,2018-12-20,324000000.0,327000000.0,2678000000.0,724000000.0,575000000.0,4169000000.0,757000000.0,138000000.0,3362000000.0,4257000000.0,8426000000,305000000.0,210000000.0,1263000000,1801000000.0,2328000000.0,3591000000,5303000000.0,,-126000000.0,4835000000.0,8426000000,Q4,2018-12-20,324000000.0,327000000.0,177000000.0,52000000.0,38000000.0,21000000.0,-52000000.0,-15000000.0,-9000000.0,97000000.0,288000000.0,-58000000.0,0.0,0.0,-60000000.0,-43000000.0,-70000000.0,8000000.0,-106000000.0,115000000


In [20]:
df_prices.head()

Unnamed: 0,Ticker,Date,SimFinId,Open,Low,High,Close,Adj. Close,Dividend,Volume,Shares Outstanding
0,A,2016-09-06,45846,46.9,46.72,47.11,46.95,44.72,,1460928,324384755.0
1,A,2016-09-07,45846,47.07,46.86,47.17,47.11,44.87,,1542576,324384755.0
2,A,2016-09-08,45846,47.07,46.94,47.16,47.02,44.79,,884738,324384755.0
3,A,2016-09-09,45846,46.51,44.87,46.53,44.88,42.75,,2507018,324384755.0
4,A,2016-09-12,45846,44.59,44.47,45.81,45.75,43.58,,1835434,324384755.0


In [31]:
df_prices.isnull().sum()

Ticker                      0
Date                        0
SimFinId                    0
Open                    21601
Low                     22052
High                    21166
Close                   21571
Adj. Close              21456
Dividend              3134252
Volume                      0
Shares Outstanding      44696
dtype: int64

In [45]:
results = pd.merge(df_merged, df_prices, how='left',
                   left_on=['Ticker', 'SimFinId', 'Report Date'],
                   right_on=['Ticker', 'SimFinId', 'Date'])

In [46]:
results.shape

(36685, 83)

In [47]:
getPrice('A','2016-10-31',0,df_prices)

['A', 43.33, Timestamp('2016-10-31 00:00:00'), 67454497.46]

In [48]:
pd.set_option('display.max_rows', 200)
results.isnull().sum()

Ticker                                                 0
Report Date                                            0
SimFinId                                               0
Currency                                               0
Fiscal Year                                            0
Fiscal Period_x                                        0
Publish Date                                           0
Restated Date_x                                        0
Shares (Basic)_x                                     182
Shares (Diluted)_x                                   182
Revenue                                              973
Cost of Revenue                                     3642
Gross Profit                                        3682
Operating Expenses                                    55
Selling, General & Administrative                   1744
Research & Development                             20893
Depreciation & Amortization_x                      20700
Operating Income (Loss)        

In [24]:
df_merged.shape

(36685, 74)

In [50]:
df_prices_columns = df_prices.columns.to_list()
df_prices_on_report_date = results[['Report Date'] + df_prices_columns]

In [51]:
df_prices_on_report_date

Unnamed: 0,Report Date,Ticker,Date,SimFinId,Open,Low,High,Close,Adj. Close,Dividend,Volume,Shares Outstanding
0,2016-10-31,A,2016-10-31,45846,43.33,43.18,43.70,43.57,41.60,,1556762.0,324384755.0
1,2017-01-31,A,2017-01-31,45846,48.04,48.01,49.15,48.97,46.89,,2552612.0,322000000.0
2,2017-04-30,A,NaT,45846,,,,,,,,
3,2017-07-31,A,2017-07-31,45846,60.32,59.76,60.62,59.79,57.52,,1038744.0,322000000.0
4,2017-10-31,A,2017-10-31,45846,67.74,67.54,68.17,68.03,65.59,,1609938.0,322000000.0
...,...,...,...,...,...,...,...,...,...,...,...,...
36680,2020-06-30,ZYXI,2020-06-30,171401,20.95,20.74,22.95,22.61,22.37,,1160230.0,36668496.0
36681,2020-09-30,ZYXI,2020-09-30,171401,15.55,15.49,16.25,15.86,15.69,,1115434.0,38210992.0
36682,2020-12-31,ZYXI,2020-12-31,171401,12.71,12.15,12.75,12.24,12.11,,318435.0,38271124.0
36683,2021-03-31,ZYXI,2021-03-31,171401,13.60,13.45,14.05,13.88,13.73,,423240.0,38334769.0


In [54]:
def nearestDatewithPrice (ticker, report_date, df=df_prices):
    
    # looks into a window of the next 5 days if date chosen falls on a non-trading day
    window = 5
    
    # returns all the rows between the start date and the window, filtered from df
    rows = df[
    (df["Date"].between(pd.to_datetime(report_date),\
                        pd.to_datetime(report_date) + pd.Timedelta(days=window)))\
        & (df["Ticker"]==ticker)]
    
    # if nothing between start date and window, return NaT (Not a Time)
    if rows.empty:
        return pd.NaT
    
    # else return the date of the first row within the filtered df i.e. closest date
    else:
        return  rows.iloc[0]["Date"]

In [115]:
def nearestDatewithPrice2 (ticker, report_date):
    
    df = df_prices[['Ticker', 'Date']].sort_values(['Ticker', 'Date'])
    return df[(df['Ticker']==ticker) & (df['Date']>=report_date)]['Date'].iloc[0]

In [119]:
nearestDatewithPrice2('A','2016-09-11')

Timestamp('2016-09-12 00:00:00')

In [105]:
nearestDatewithPrice('A', '2016-09-10', df_prices)

Timestamp('2016-09-12 00:00:00')

In [52]:
getPrice('A', '2017-04-30', 0, df_prices)

['A', 55.5, Timestamp('2017-05-01 00:00:00'), 105284499.0]

In [53]:
df_prices_on_report_date.isnull().sum()

Report Date               0
Ticker                    0
Date                  14437
SimFinId                  0
Open                  14445
Low                   14446
High                  14445
Close                 14446
Adj. Close            14445
Dividend              36507
Volume                14437
Shares Outstanding    14508
dtype: int64

In [64]:
df_tick_report = df_merged[['Ticker','Report Date']]
df_tick_report

Unnamed: 0,Ticker,Report Date
0,A,2016-10-31
1,A,2017-01-31
2,A,2017-04-30
3,A,2017-07-31
4,A,2017-10-31
...,...,...
36680,ZYXI,2020-06-30
36681,ZYXI,2020-09-30
36682,ZYXI,2020-12-31
36683,ZYXI,2021-03-31


In [69]:
df_tick_report.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36685 entries, 0 to 36684
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Ticker       36685 non-null  object        
 1   Report Date  36685 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(1)
memory usage: 573.3+ KB


In [116]:
df_tick_report['NearestDate'] = df_tick_report.apply(lambda x: nearestDatewithPrice2(x['Ticker'], x['Report Date']), axis=1)

IndexError: single positional indexer is out-of-bounds
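This `IndexError` is what `.iloc[0]` raises on an empty selection, which happens here when a ticker has no price row on or after the given report date. A minimal sketch of a guarded version is shown below; the name `nearest_date_or_nat` is hypothetical, and it takes the prices frame as an argument instead of reading the global `df_prices`.

```python
import pandas as pd

def nearest_date_or_nat(ticker, report_date, prices):
    """Return the first trading date on or after report_date, or NaT if none."""
    mask = (prices["Ticker"] == ticker) & (prices["Date"] >= pd.to_datetime(report_date))
    dates = prices.loc[mask, "Date"]
    # An empty selection is what makes .iloc[0] raise IndexError.
    return dates.min() if not dates.empty else pd.NaT

# Toy price rows for ticker A, using dates from the df_prices sample above.
toy_prices = pd.DataFrame({
    "Ticker": ["A", "A"],
    "Date": pd.to_datetime(["2016-09-09", "2016-09-12"]),
})
print(nearest_date_or_nat("A", "2016-09-11", toy_prices))
print(nearest_date_or_nat("A", "2030-01-01", toy_prices))
```

Using `min()` instead of sorting the whole frame on every call also avoids the repeated `sort_values` inside the function.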

In [29]:
def get_report_date_price(df_prices_on_report_date, days_fwd=365):
    
    # We want one row per report, holding two getPrice results side by side:
    # [price info at date] + [price info at date + days_fwd]
    
    # Pre-allocate a list of len(df_prices_on_report_date) rows, each with 8 placeholders
    # (4 values from each of the two getPrice calls)
    
    y = [[None]*8 for i in range(len(df_prices_on_report_date))]
    
    # Uses 'Date', the price date matched on Report Date in the merge above
    # Note: we cannot take action on Report Date, but we can trade when info is being published
    date_column = 'Date' 

    # start from zero, loop through all rows and fill in the pre-allocated list
    
    i=0
    for index in range(len(df_prices_on_report_date)):
        y[i] = (getPrice(df_prices_on_report_date['Ticker'].iloc[index], df_prices_on_report_date[date_column].iloc[index], 0, df_prices_on_report_date) +\
                getPrice(df_prices_on_report_date['Ticker'].iloc[index], df_prices_on_report_date[date_column].iloc[index], days_fwd, df_prices_on_report_date))
        i=i+1
    return y

In [30]:
#y = get_report_date_price(df_prices_on_report_date, days_fwd=365)

Started at 3:15 pm, ended at 3:20 pm.