<h1 style="text-align:center">Pandas' read_html function, Part 2</h1>

#### In my [read_html Part 1](read_html_part1.ipynb), I provided a simple example of using [Pandas](https://pandas.pydata.org/) to read tables from a static HTML file you have on disk.  This is certainly valid for some use cases.  However, if you're like me, you'll have other use cases where you'll want to read tables live from the Internet.  This "Part 2" provides an example of that.

#### My go-to Python package for reading files from the Internet is [requests](http://docs.python-requests.org/en/master/).  Indeed, I started this example with *requests*, but quickly found it wouldn't work with the page I wanted to read.  Some pages on the Internet already contain their data pre-loaded in the HTML.  Increasingly, though, web developers are using JavaScript to load data onto their pages.  Unfortunately, *requests* only downloads the raw HTML and doesn't execute JavaScript, so it never sees data loaded that way.  So, I had to turn to a slightly more sophisticated approach.  [Selenium](https://www.seleniumhq.org/) proved to be the solution I needed.
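
#### One quick way to tell which kind of page you're dealing with is to count the table tags in the raw HTML that a plain HTTP client returns (for *requests*, that would be `response.text`).  Below is a minimal sketch using only the standard library.  The class name, the function name and both HTML strings are made up for illustration; they are not the Fed page's actual markup.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count <table> tags and data cells (<td>) in an HTML string."""

    def __init__(self):
        super().__init__()
        self.tables = 0
        self.cells = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1
        elif tag == "td":
            self.cells += 1


def count_tables(html):
    """Return (number of tables, number of data cells) found in the raw HTML."""
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.tables, parser.cells


# Hypothetical page whose data is already in the HTML.
preloaded = "<table><tr><td>1-Aug-2007</td><td>870261</td></tr></table>"
# Hypothetical page that fills an empty placeholder with JavaScript.
js_loaded = "<div id='chart'></div><script>loadData('chart');</script>"

print(count_tables(preloaded))  # (1, 2)
print(count_tables(js_loaded))  # (0, 0)
```

#### If the raw HTML comes back with no tables or no cells, the data is probably being filled in after the page loads, and a browser-driven tool such as selenium is likely needed.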

#### In this example, I use Python's selenium package to retrieve [Federal Reserve balance sheet data](https://www.federalreserve.gov/monetarypolicy/bst_recenttrends_accessible.htm), which is loaded onto the page with JavaScript.  As part of the setup, I had to do two things:
1. pip/conda install the selenium package
1. download Mozilla's [gecko driver](https://github.com/mozilla/geckodriver/releases) to my hard drive

#### Below is my code.  Here are a few things to note:
 - By default, when you run selenium, a new instance of your browser will launch and run all the commands you programmatically issue to it.  This can be very helpful when debugging your code, but can also get annoying after a while, so I suppress the launch of the browser window with the [Options class](https://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.firefox.options).
 - Load time can be important.  Not all web pages load quickly.  Selenium includes a variety of options to pause your code until the data you're looking for has been loaded.  For simplicity, I just made my code sleep for five seconds to give the page time to fully load.
 - Pandas' [read_html](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html) isn't without its challenges.  Initially, I tried to pass the page's entire HTML to the function, but it was unable to find the tables.  I suspect the tables may be too deeply nested in the HTML for pandas to find, but I don't know for sure.  So, I used selenium to locate the table element I wanted and passed just that element's HTML into read_html.  There are several tables on this page that I'll want to process, so I'll probably have to write a loop to grab them all.  This code only grabs the first table.
 
#### This is a great way to easily grab the data you need from the Internet!

In [23]:
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

In [24]:
options = Options()
options.headless = True  # stop the browser from popping up

# raw string so the backslashes in the Windows path aren't treated as escape sequences
driver = webdriver.Firefox(options=options, executable_path=r"C:\geckodriver-v0.24.0-win64\geckodriver.exe")
driver.get("https://www.federalreserve.gov/monetarypolicy/bst_recenttrends_accessible.htm")
assert "Federal Reserve Board - Balance Sheet Trends - Accessible" in driver.title

time.sleep(5)  # wait 5 seconds for the page to load the data
df_total_assets = pd.read_html(driver.find_element_by_tag_name("table").get_attribute('outerHTML'))[0]

driver.quit()

In [25]:
df_total_assets.head()

Unnamed: 0,Date,Total Assets
0,Date,Total Assets
1,1-Aug-2007,870261
2,8-Aug-2007,865453
3,15-Aug-2007,864931
4,22-Aug-2007,862775


In [26]:
df_total_assets.tail()

Unnamed: 0,Date,Total Assets
597,2-Jan-2019,4058378
598,9-Jan-2019,4056563
599,16-Jan-2019,4050044
600,23-Jan-2019,4047052
601,30-Jan-2019,4039678
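
#### The `head()` output above shows the header text ("Date", "Total Assets") repeated as row 0, and the dates are still plain strings.  Here is a minimal clean-up sketch.  The `raw` frame is a small stand-in built from the first rows shown above, and `promote_header` is a name I made up; I'm assuming the real `df_total_assets` has the same shape.

```python
import pandas as pd

# Stand-in for df_total_assets, built from the first rows of the output above.
raw = pd.DataFrame(
    [
        ["Date", "Total Assets"],
        ["1-Aug-2007", "870261"],
        ["8-Aug-2007", "865453"],
        ["15-Aug-2007", "864931"],
    ]
)


def promote_header(df):
    """Use the first row as column labels and convert the columns to real types."""
    out = df.iloc[1:].reset_index(drop=True)  # drop the header row from the data
    out.columns = list(df.iloc[0])            # relabel columns with that row's text
    out["Date"] = pd.to_datetime(out["Date"], format="%d-%b-%Y")
    out["Total Assets"] = pd.to_numeric(out["Total Assets"])
    return out


clean = promote_header(raw)
print(clean.dtypes)
```

#### Passing `header=0` to `read_html` may avoid the duplicated row in the first place, but I haven't verified that against this particular page.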
