# Fundamental Stock Data
In this homework we will guide you through how to download free historical fundamentals for stocks. This data can be used to construct value and other fundamental strategies. There will be no "solution" for this homework as it is more of a guide.

### Packages
We will be using the <a href='https://pypi.org/project/simfin/' target="_blank" >simfin</a> and <a href='https://pypi.org/project/yfinance/' target="_blank" >yfinance</a> packages. Install them with `pip install simfin` and `pip install yfinance` if you haven't already, then import them as below.

In [1]:
pip install simfin yfinance

Note: you may need to restart the kernel to use updated packages.


In [2]:
import yfinance as yf 
import simfin as sf

### Get Your SimFin API Key

First you must obtain an API key. Make an account at https://simfin.com, confirm it via email, and then head to https://simfin.com/data/api to get your key.

In [3]:
# set your API key here (keep your real key private; do not commit it to a shared notebook)
api_key = 'YOUR_API_KEY'
sf.config.set_api_key(api_key=api_key)
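If you would rather not paste the key into the notebook at all, one option is to read it from an environment variable. The sketch below is only an illustration: the helper name `get_simfin_api_key` and the variable name `SIMFIN_API_KEY` are assumptions made for this example, not something simfin defines.

```python
import os


def get_simfin_api_key(env_var: str = "SIMFIN_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    Both this helper and the default variable name are hypothetical;
    pick whatever name you export in your own shell.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your SimFin API key."
        )
    return key


# Usage in the notebook (requires simfin to be installed):
#   import simfin as sf
#   sf.config.set_api_key(api_key=get_simfin_api_key())
```

This keeps the key out of the notebook file, so the notebook can be shared without exposing it.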

### Set Your SimFin Data Directory

simfin requires you to set a data directory where downloaded data is stored. On later loads, data is read from this directory instead of being downloaded again, which makes retrieval faster. The default location is a folder named simfin_data in the home directory.

In [4]:
# set simfin_data
sf.set_data_dir(r'C:\Users\mroha\Documents\wallStreetQuantNotes\Finances\simfin_data')
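The path above is specific to one Windows machine. A minimal sketch of a more portable alternative is to build the path with `pathlib`; the helper name `simfin_data_dir` is hypothetical, and the folder name simply follows the default mentioned above.

```python
from pathlib import Path
from typing import Optional


def simfin_data_dir(base: Optional[Path] = None) -> Path:
    """Return <base>/simfin_data, creating it if needed.

    `base` defaults to the current user's home directory. This helper is
    an illustration only and is not part of simfin.
    """
    base = Path.home() if base is None else Path(base)
    data_dir = base / "simfin_data"
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir


# Usage in the notebook (requires simfin to be installed):
#   import simfin as sf
#   sf.set_data_dir(str(simfin_data_dir()))
```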

### Download Historical Financial Data from SimFin
The three financial statements containing fundamental data are the income statement, the balance sheet, and the cash flow statement. Quarterly data from these statements can be loaded for all US tickers from simfin as below.

In [5]:
income = sf.load_income(variant='quarterly', market='us')

balance_sheet = sf.load_balance(variant='quarterly', market='us')

cash_flow = sf.load_cashflow(variant='quarterly', market='us')

Dataset "us-income-quarterly" on disk (0 days old).
- Loading from disk ... Done!
Dataset "us-balance-quarterly" on disk (0 days old).
- Loading from disk ... Done!
Dataset "us-cashflow-quarterly" on disk (0 days old).
- Loading from disk ... Done!

Below we observe the shape of the data.

All of the results are multi-index dataframes: each row is indexed by (Ticker, Report Date) and each column is a financial statement item.

One of the important columns for backtesting is the Publish Date. This is the date the information was available to the public. 

For backtesting, we can assume the simfin data becomes available one business day after the Publish Date. Companies sometimes publish after the market close, and depending on your data vendor you may not receive the data immediately on publication, so trading on the Publish Date itself would be optimistic.

Because there are restatements to financial statements, even this will not be a fully "point-in-time" backtest where we only use information available at the time. It should be a reasonable approximation in many cases, however.

In [6]:
cash_flow.head()

Ticker,Report Date,SimFinId,Currency,Fiscal Year,Fiscal Period,Publish Date,Restated Date,Shares (Basic),Shares (Diluted),Net Income/Starting Line,Depreciation & Amortization,...,Net Cash from Operating Activities,Change in Fixed Assets & Intangibles,Net Change in Long Term Investment,Net Cash from Acquisitions & Divestitures,Net Cash from Investing Activities,Dividends Paid,Cash from (Repayment of) Debt,Cash from (Repurchase of) Equity,Net Cash from Financing Activities,Net Change in Cash
A,2019-10-31,45846,USD,2019,Q4,2019-12-19,2019-08-30,309000000.0,313000000.0,194000000.0,76000000.0,...,314000000,-30000000.0,,-1160000000.0,-1193000000.0,-51000000.0,600000000.0,-47000000.0,497000000.0,-383000000
A,2020-01-31,45846,USD,2020,Q1,2020-03-03,2021-03-02,310000000.0,313000000.0,197000000.0,79000000.0,...,-59000000,-34000000.0,,,-35000000.0,-56000000.0,56000000.0,-28000000.0,-61000000.0,-156000000
A,2020-04-30,45846,USD,2020,Q2,2020-06-01,2021-03-02,309000000.0,312000000.0,101000000.0,76000000.0,...,313000000,-33000000.0,,,-53000000.0,-55000000.0,25000000.0,-126000000.0,-156000000.0,97000000
A,2020-07-31,45846,USD,2020,Q3,2020-09-01,2021-06-01,309000000.0,312000000.0,199000000.0,77000000.0,...,290000000,-24000000.0,,,-32000000.0,-56000000.0,-161000000.0,-9000000.0,-231000000.0,35000000
A,2020-10-31,45846,USD,2020,Q4,2020-12-18,2021-09-01,308000000.0,311000000.0,222000000.0,76000000.0,...,377000000,-27000000.0,,,-27000000.0,-55000000.0,35000000.0,-246000000.0,-269000000.0,83000000
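The one-business-day lag on the Publish Date described above can be sketched with pandas' `BDay` offset. The small dataframe below is a stand-in built from three of the `cash_flow` rows shown above, and the column name `Available Date` is an assumption made for this example. Note that `BDay` only skips weekends; it does not know about exchange holidays, so treat this as an approximation.

```python
import pandas as pd
from pandas.tseries.offsets import BDay

# Stand-in for a simfin statement, using values from the cash_flow rows above.
demo = pd.DataFrame(
    {
        "Ticker": ["A", "A", "A"],
        "Report Date": pd.to_datetime(["2019-10-31", "2020-01-31", "2020-10-31"]),
        "Publish Date": pd.to_datetime(["2019-12-19", "2020-03-03", "2020-12-18"]),
        "Net Change in Cash": [-383000000, -156000000, 83000000],
    }
).set_index(["Ticker", "Report Date"])

# 'Available Date' is a hypothetical column: the first date on which a
# backtest is allowed to use the row (Publish Date plus one business day).
demo["Available Date"] = demo["Publish Date"] + BDay(1)

print(demo[["Publish Date", "Available Date"]])
```

The last row is published on a Friday (2020-12-18), so its available date rolls over the weekend to Monday, 2020-12-21.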


Let's observe which columns are available to us below from each financial statement.

In [7]:
# observe available income columns
income.columns

Index(['SimFinId', 'Currency', 'Fiscal Year', 'Fiscal Period', 'Publish Date',
       'Restated Date', 'Shares (Basic)', 'Shares (Diluted)', 'Revenue',
       'Cost of Revenue', 'Gross Profit', 'Operating Expenses',
       'Selling, General & Administrative', 'Research & Development',
       'Depreciation & Amortization', 'Operating Income (Loss)',
       'Non-Operating Income (Loss)', 'Interest Expense, Net',
       'Pretax Income (Loss), Adj.', 'Abnormal Gains (Losses)',
       'Pretax Income (Loss)', 'Income Tax (Expense) Benefit, Net',
       'Income (Loss) from Continuing Operations',
       'Net Extraordinary Gains (Losses)', 'Net Income',
       'Net Income (Common)'],
      dtype='object')

In [8]:
# observe available balance_sheet columns
balance_sheet.columns

Index(['SimFinId', 'Currency', 'Fiscal Year', 'Fiscal Period', 'Publish Date',
       'Restated Date', 'Shares (Basic)', 'Shares (Diluted)',
       'Cash, Cash Equivalents & Short Term Investments',
       'Accounts & Notes Receivable', 'Inventories', 'Total Current Assets',
       'Property, Plant & Equipment, Net',
       'Long Term Investments & Receivables', 'Other Long Term Assets',
       'Total Noncurrent Assets', 'Total Assets', 'Payables & Accruals',
       'Short Term Debt', 'Total Current Liabilities', 'Long Term Debt',
       'Total Noncurrent Liabilities', 'Total Liabilities',
       'Share Capital & Additional Paid-In Capital', 'Treasury Stock',
       'Retained Earnings', 'Total Equity', 'Total Liabilities & Equity'],
      dtype='object')

In [9]:
# observe available cash_flow columns
cash_flow.columns

Index(['SimFinId', 'Currency', 'Fiscal Year', 'Fiscal Period', 'Publish Date',
       'Restated Date', 'Shares (Basic)', 'Shares (Diluted)',
       'Net Income/Starting Line', 'Depreciation & Amortization',
       'Non-Cash Items', 'Change in Working Capital',
       'Change in Accounts Receivable', 'Change in Inventories',
       'Change in Accounts Payable', 'Change in Other',
       'Net Cash from Operating Activities',
       'Change in Fixed Assets & Intangibles',
       'Net Change in Long Term Investment',
       'Net Cash from Acquisitions & Divestitures',
       'Net Cash from Investing Activities', 'Dividends Paid',
       'Cash from (Repayment of) Debt', 'Cash from (Repurchase of) Equity',
       'Net Cash from Financing Activities', 'Net Change in Cash'],
      dtype='object')
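The listings above show that the three statements share the same first eight metadata columns, and that `Depreciation & Amortization` appears in both the income and cash flow statements. If you want a single table per (Ticker, Report Date), one possible approach is sketched below. The helper `join_statements`, the `" (CF)"` suffix, and the demo dataframes are assumptions for illustration, and the numeric values are placeholders rather than real data.

```python
import pandas as pd

# Metadata columns that appear in all three simfin statements
# (taken from the column listings above).
META_COLS = [
    "SimFinId", "Currency", "Fiscal Year", "Fiscal Period",
    "Publish Date", "Restated Date", "Shares (Basic)", "Shares (Diluted)",
]


def join_statements(left, right, suffix):
    """Join two statements on their shared (Ticker, Report Date) index."""
    # Drop the metadata from the right-hand statement so it is not duplicated.
    right = right.drop(columns=META_COLS, errors="ignore")
    # Any remaining name clash (e.g. 'Depreciation & Amortization')
    # gets the suffix on the right-hand column.
    return left.join(right, how="inner", rsuffix=suffix)


# Tiny stand-ins for the income and cash flow statements; values are placeholders.
index = pd.MultiIndex.from_tuples(
    [("A", pd.Timestamp("2019-10-31"))], names=["Ticker", "Report Date"]
)
income_demo = pd.DataFrame(
    {"Currency": ["USD"], "Revenue": [1.0], "Depreciation & Amortization": [2.0]},
    index=index,
)
cash_flow_demo = pd.DataFrame(
    {"Currency": ["USD"], "Depreciation & Amortization": [3.0], "Net Change in Cash": [4.0]},
    index=index,
)

combined = join_statements(income_demo, cash_flow_demo, suffix=" (CF)")
print(list(combined.columns))
```

With the real data, the same call would look like `join_statements(income, cash_flow, suffix=' (CF)')`.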

Below we observe the number of tickers available for each statement and the start and end dates.

In [10]:
def describe_data(data, data_name):
    """Print the number of tickers and the Report Date range of a statement."""
    tickers = data.index.get_level_values('Ticker')
    dates = data.index.get_level_values('Report Date')
    start_dt = dates.min().strftime('%Y%m%d')
    end_dt = dates.max().strftime('%Y%m%d')
    print(f'{data_name}: {tickers.nunique()} tickers. Date range: {start_dt} to {end_dt}')

describe_data(income, 'Income Data')
describe_data(balance_sheet, 'Balance Sheet Data')
describe_data(cash_flow, 'Cash Flow Data')

Income Data: 3646 tickers. Date range: 20190930 to 20240731
Balance Sheet Data: 3646 tickers. Date range: 20190930 to 20240731
Cash Flow Data: 3646 tickers. Date range: 20190930 to 20240731


### Yahoo Finance Data
As the date ranges above show, the simfin data ends about a year ago: simfin only provides data with a one-year lag. To get the most recent financial statement information, you can use Yahoo Finance.

First you must obtain a yfinance.Ticker object for your desired ticker.

In [23]:
yf_ticker = yf.Ticker('CAT')

Download the three financial statements as below.

In [24]:
# Get the quarterly financial statements from yfinance.
# The _yf suffix keeps these separate from the simfin dataframes loaded above.
income_yf = yf_ticker.quarterly_financials

cash_flow_yf = yf_ticker.quarterly_cashflow

balance_sheet_yf = yf_ticker.quarterly_balance_sheet

In [25]:
income_yf

Item,2025-06-30,2025-03-31,2024-12-31,2024-09-30,2024-06-30,2024-03-31,2023-12-31
Tax Effect Of Unusual Items,-3667850.0,18732000.0,37691030.0,621000.0,31051430.0,,
Tax Rate For Calcs,0.229241,0.223,0.142769,0.207,0.238857,,
Normalized EBITDA,3514000000.0,3142000000.0,3641000000.0,3763000000.0,4038000000.0,,
Total Unusual Items,-16000000.0,84000000.0,264000000.0,3000000.0,130000000.0,,
Total Unusual Items Excluding Goodwill,-16000000.0,84000000.0,264000000.0,3000000.0,130000000.0,,
Net Income From Continuing Operation Net Minority Interest,2179000000.0,2003000000.0,2791000000.0,2464000000.0,2681000000.0,,
Reconciled Depreciation,554000000.0,540000000.0,555000000.0,543000000.0,531000000.0,,
Reconciled Cost Of Revenue,11143000000.0,9291000000.0,10659000000.0,10402000000.0,10464000000.0,,
EBITDA,3498000000.0,3226000000.0,3905000000.0,3766000000.0,4168000000.0,,
EBIT,2944000000.0,2686000000.0,3350000000.0,3223000000.0,3637000000.0,,
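Note that the yfinance output is laid out the other way round from simfin: statement items are rows and report dates are columns, and quarters without data are filled with NaN. A minimal sketch of reshaping it into the simfin-style (Ticker, Report Date) layout follows. The helper name `to_simfin_layout` is hypothetical, and the small dataframe is a stand-in built from the EBITDA and EBIT rows above so the example runs without a network call.

```python
import pandas as pd

# Stand-in for yf_ticker.quarterly_financials, using two rows from the output above.
income_yf_demo = pd.DataFrame(
    {
        pd.Timestamp("2025-06-30"): [3498000000.0, 2944000000.0],
        pd.Timestamp("2025-03-31"): [3226000000.0, 2686000000.0],
        pd.Timestamp("2024-03-31"): [float("nan"), float("nan")],
    },
    index=["EBITDA", "EBIT"],
)


def to_simfin_layout(statement, ticker):
    """Transpose a yfinance statement so rows are (Ticker, Report Date)."""
    # Dates become rows, statement items become columns, oldest quarter first.
    out = statement.T.sort_index()
    # Drop quarters for which yfinance returned no data at all.
    out = out.dropna(axis=0, how="all")
    out.index = pd.MultiIndex.from_product(
        [[ticker], out.index], names=["Ticker", "Report Date"]
    )
    return out


income_long = to_simfin_layout(income_yf_demo, "CAT")
print(income_long)
```

With live data, the call would be `to_simfin_layout(income_yf, 'CAT')`. Column names still differ between the two sources, so they would need to be mapped by hand before combining them.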
