# Examples

1. Simple HTML table
2. Custom table selector
3. Apache server directory

## Read a simple HTML table

Read data from one or more `<table>` elements in a local file or at a web URL:

In [1]:
from intake_html_table import HtmlTableSource

In [2]:
src = HtmlTableSource("table.html")

In [3]:
df = src.read()

In [4]:
df.head()

| Unnamed: 0 | Column1 | Column2 | Column3 |
| --- | --- | --- | --- |
| 0 | Row1 Column1 | Row1 Column2 | Row1 Column3 |
| 1 | Row2 Column1 | Row2 Column2 | Row2 Column3 |
| 2 | Row3 Column1 | Row3 Column2 | Row3 Column3 |
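
To make the idea of reading a `<table>` element concrete, here is a minimal standard-library sketch that pulls rows of cell text out of a small HTML string. It is only an illustration and not how `HtmlTableSource` or `pandas.read_html` are implemented; the `TableRows` class and the `SAMPLE_HTML` string are hypothetical stand-ins, since the contents of `table.html` are not shown here.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, or None
        self._cell = None  # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Close the cell and store its stripped text in the current row
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical stand-in for the contents of table.html
SAMPLE_HTML = """
<table>
  <tr><th>Column1</th><th>Column2</th><th>Column3</th></tr>
  <tr><td>Row1 Column1</td><td>Row1 Column2</td><td>Row1 Column3</td></tr>
  <tr><td>Row2 Column1</td><td>Row2 Column2</td><td>Row2 Column3</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
parser.close()

# Treat the first row as the header and the rest as data rows
header, *body = parser.rows
print(header)
print(body)
```

A real reader also has to handle nested tables, `colspan`/`rowspan`, and type inference, which is why the source delegates that work to a full HTML table parser.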


## Use `pandas.read_html` kwargs

Extra keyword arguments are passed through to `pandas.read_html` for more control over which tables are read and how. Here, `attrs` selects the table whose `id` is `data`:

In [5]:
src = HtmlTableSource("document.html", attrs={"id": "data"})

In [6]:
df = src.read()

In [7]:
df.head()

| Unnamed: 0 | Column1 | Column2 | Column3 | Column4 |
| --- | --- | --- | --- | --- |
| 0 | 11 | R1C2 | False | 2021-04-03 12:23 |
| 1 | 22 | R2C2 | True | 2021-03-30 08:39 |
| 2 | 33 | R3C2 | False | 2021-04-02 18:17 |
| 3 | 44 | R4C2 | True | 2021-07-09 01:23 |
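
The effect of an attribute filter such as `attrs={"id": "data"}` can be sketched with the standard library alone: only tables whose attributes match every requested key and value are kept. This is a rough illustration of the selection idea, not the pandas implementation; `TableFinder` and `SAMPLE_HTML` are hypothetical names, and the sample markup is an assumption because `document.html` is not shown.

```python
from html.parser import HTMLParser


class TableFinder(HTMLParser):
    """Collect the attributes of every <table> that matches `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted  # e.g. {"id": "data"}
        self.selected = []    # attribute dicts of the matching tables

    def handle_starttag(self, tag, attrs):
        if tag != "table":
            return
        found = dict(attrs)
        # A table matches when every wanted attribute is present with the same value
        if all(found.get(name) == value for name, value in self.wanted.items()):
            self.selected.append(found)


# Hypothetical stand-in for document.html: two tables, only one with id="data"
SAMPLE_HTML = """
<table id="nav"><tr><td>Home</td></tr></table>
<table id="data" class="wide"><tr><td>11</td><td>R1C2</td></tr></table>
"""

finder = TableFinder({"id": "data"})
finder.feed(SAMPLE_HTML)
finder.close()
print(finder.selected)
```

Filtering on an attribute is useful when a page mixes layout or navigation tables with the one table that actually holds the data.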


## Read an Apache Server directory

A catalog can also be built from an Apache directory listing. For example, the National Centers for Environmental Information (NCEI) publishes data in an Apache directory structure
at <https://www.ncei.noaa.gov/data>.

The *Global Summary of the Day* is organized by year, with a CSV file for each station for each year at
<https://www.ncei.noaa.gov/data/global-summary-of-the-day/access>.

Here we build a [catalog](https://intake.readthedocs.io/en/latest/catalog.html) from the root of that directory and list its subdirectories:

In [8]:
from intake_html_table import ApacheDirectoryCatalog

In [9]:
cat = ApacheDirectoryCatalog("https://www.ncei.noaa.gov/data/global-summary-of-the-day/access/")

In [10]:
# [:10] to print only the first 10 items
list(cat)[:10]


['parent',
 '1929',
 '1930',
 '1931',
 '1932',
 '1933',
 '1934',
 '1935',
 '1936',
 '1937']

An individual subdirectory can be listed the same way:

In [11]:
list(cat['2021'])[:10]

['parent',
 '01001099999.csv',
 '01001499999.csv',
 '01002099999.csv',
 '01003099999.csv',
 '01006099999.csv',
 '01007099999.csv',
 '01008099999.csv',
 '01009099999.csv',
 '01010099999.csv']
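
An Apache directory index is an HTML page of links, so a listing like the ones above can be approximated by collecting link targets and resolving them against the directory URL. The sketch below uses only the standard library on a made-up index page; `LinkCollector`, `list_entries`, `SAMPLE_INDEX`, and the `example.invalid` URL are all hypothetical, and this is not necessarily how `ApacheDirectoryCatalog` works internally. Unlike the catalog output above, which exposes a `parent` entry, this sketch simply skips the parent-directory link.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def list_entries(index_html, base_url):
    """Map each entry name in a directory index to its absolute URL."""
    collector = LinkCollector()
    collector.feed(index_html)
    collector.close()
    entries = {}
    for href in collector.hrefs:
        # Skip column-sort links ("?C=N;O=D") and the parent link ("/data/")
        if href.startswith(("?", "/")):
            continue
        entries[href.rstrip("/")] = urljoin(base_url, href)
    return entries


# Hypothetical index page, loosely modeled on an Apache autoindex listing
SAMPLE_INDEX = """
<table>
  <tr><th><a href="?C=N;O=D">Name</a></th></tr>
  <tr><td><a href="/data/">Parent Directory</a></td></tr>
  <tr><td><a href="1929/">1929/</a></td></tr>
  <tr><td><a href="1930/">1930/</a></td></tr>
  <tr><td><a href="readme.txt">readme.txt</a></td></tr>
</table>
"""

BASE_URL = "https://example.invalid/data/access/"  # placeholder, not a real server
entries = list_entries(SAMPLE_INDEX, BASE_URL)
for name, url in entries.items():
    print(name, url)
```

Keeping the resolved URL next to each name means a subdirectory entry can be fetched and listed in turn, which mirrors the nested lookups shown in this section.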

This directory contains CSV files. We can read them via the catalog:

In [12]:
df = cat['2021']['01001099999.csv'].read()

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188 entries, 0 to 187
Data columns (total 28 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   STATION           188 non-null    int64  
 1   DATE              188 non-null    object 
 2   LATITUDE          188 non-null    float64
 3   LONGITUDE         188 non-null    float64
 4   ELEVATION         188 non-null    float64
 5   NAME              188 non-null    object 
 6   TEMP              188 non-null    float64
 7   TEMP_ATTRIBUTES   188 non-null    int64  
 8   DEWP              188 non-null    float64
 9   DEWP_ATTRIBUTES   188 non-null    int64  
 10  SLP               188 non-null    float64
 11  SLP_ATTRIBUTES    188 non-null    int64  
 12  STP               188 non-null    float64
 13  STP_ATTRIBUTES    188 non-null    int64  
 14  VISIB             188 non-null    float64
 15  VISIB_ATTRIBUTES  188 non-null    int64  
 16  WDSP              188 non-null    float64
 1