<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Getting Data from HTML Tables with `pd.read_html()` - Solution

_Author: Jeff Hale (DC)_

---

## Learning Objectives

After this lesson, students will be able to:
- Scrape tables from websites with `pd.read_html()`
- Read tables into pandas DataFrames
- Know when to try `pd.read_html()`


### Import pandas under the usual alias

In [1]:
import pandas as pd

#### %autoreload

The following Jupyter magic commands reload imported modules automatically before each cell runs, so changes to a module's source code take effect without restarting the kernel.

In [2]:
# Reload all imported modules automatically before executing code
%load_ext autoreload
%autoreload 2

## Let's see a faster way to scrape the basketball reference example Riley used

Let's look at the function signature for `pd.read_html()`. This is another way to see the help docs:

In [5]:
help(pd.read_html)
# Note: no parentheses after read_html, because we pass the function itself to help() rather than calling it

Help on function read_html in module pandas.io.html:

read_html(io, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True)
    Read HTML tables into a ``list`` of ``DataFrame`` objects.
    
    Parameters
    ----------
    io : str, path object or file-like object
        A URL, a file-like object, or a raw string containing HTML. Note that
        lxml only accepts the http, ftp and file url protocols. If you have a
        URL that starts with ``'https'`` you might try removing the ``'s'``.
    
    match : str or compiled regular expression, optional
        The set of tables containing text matching this regex or string will be
        returned. Unless the HTML is extremely simple you will probably need to
        pass a non-empty string here. Defaults to '.+' (match any non-empty
        string). The default value wil
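As a minimal sketch of what the signature above implies, the snippet below parses a small, made-up HTML string instead of a live page. It shows that `pd.read_html()` returns a list and that `match` narrows the list to tables containing the given text. The HTML and the variable names are invented for illustration, and the call assumes an HTML parser such as `lxml` is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two tables, standing in for a real web page.
html = """
<table>
  <tr><th>East</th><th>W</th><th>L</th></tr>
  <tr><td>MIL</td><td>53</td><td>12</td></tr>
  <tr><td>TOR</td><td>46</td><td>18</td></tr>
</table>
<table>
  <tr><th>West</th><th>W</th><th>L</th></tr>
  <tr><td>LAL</td><td>49</td><td>14</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per table it finds.
all_tables = pd.read_html(StringIO(html))

# match keeps only the tables whose text matches the string or regex.
west_only = pd.read_html(StringIO(html), match="West")

print(len(all_tables))
print(len(west_only))
print(west_only[0])
```

Wrapping the string in `StringIO` hands pandas a file-like object, which is one of the input types the docstring lists for `io`.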

#### Try to read the data directly with `pd.read_html()`

In [6]:
url = 'https://www.basketball-reference.com/'
bball = pd.read_html(url)

In [8]:
bball

[         East Unnamed: 1 Unnamed: 2   W   L
 0   MIL * (1)          F          $  53  12
 1   TOR * (2)          F          $  46  18
 2   BOS * (3)          F          $  43  21
 3     MIA (4)          F          $  41  24
 4     IND (5)          F          $  39  26
 5     PHI (6)          F          $  39  26
 6     BRK (7)          F          $  30  34
 7     ORL (8)          F          $  30  35
 8     WAS (9)          F          $  24  40
 9    CHO (10)          F          $  23  42
 10   CHI (11)          F          $  22  43
 11   NYK (12)          F          $  21  45
 12   DET (13)          F          $  20  46
 13   ATL (14)          F          $  20  47
 14   CLE (15)          F          $  19  46,
           West Unnamed: 1 Unnamed: 2   W   L
 0    LAL * (1)          F          $  49  14
 1      LAC (2)          F          $  44  20
 2      DEN (3)          F          $  43  22
 3      UTA (4)          F          $  41  23
 4      OKC (5)          F          $  40  24
 5 

#### Print the top two rows of each DataFrame

In [18]:
# Loop over the DataFrames in the list and show the first two rows of each
for df in bball:
    print(df.head(2))

        East Unnamed: 1 Unnamed: 2   W   L
0  MIL * (1)          F          $  53  12
1  TOR * (2)          F          $  46  18
        West Unnamed: 1 Unnamed: 2   W   L
0  LAL * (1)          F          $  49  14
1    LAC (2)          F          $  44  20

#### Make a DataFrame from the first table

In [21]:
df1 = bball[0]
df1

Unnamed: 0,East,Unnamed: 1,Unnamed: 2,W,L
0,MIL * (1),F,$,53,12
1,TOR * (2),F,$,46,18
2,BOS * (3),F,$,43,21
3,MIA (4),F,$,41,24
4,IND (5),F,$,39,26
5,PHI (6),F,$,39,26
6,BRK (7),F,$,30,34
7,ORL (8),F,$,30,35
8,WAS (9),F,$,24,40
9,CHO (10),F,$,23,42


#### What columns are in the first DataFrame?

In [23]:
df1.columns

Index(['East', 'Unnamed: 1', 'Unnamed: 2', 'W', 'L'], dtype='object')

#### Clean the DataFrame

Drop the two unnamed columns, which contain only the placeholder values `F` and `$`.

In [24]:
df1.drop(columns=['Unnamed: 1', 'Unnamed: 2'], inplace=True)

In [25]:
df1

Unnamed: 0,East,W,L
0,MIL * (1),53,12
1,TOR * (2),46,18
2,BOS * (3),43,21
3,MIA (4),41,24
4,IND (5),39,26
5,PHI (6),39,26
6,BRK (7),30,34
7,ORL (8),30,35
8,WAS (9),24,40
9,CHO (10),23,42


In [26]:
df1['East'] = df1['East'].str[:3]

In [27]:
df1

Unnamed: 0,East,W,L
0,MIL,53,12
1,TOR,46,18
2,BOS,43,21
3,MIA,41,24
4,IND,39,26
5,PHI,39,26
6,BRK,30,34
7,ORL,30,35
8,WAS,24,40
9,CHO,23,42
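Slicing the first three characters works here because every team abbreviation is exactly three letters long. As an alternative sketch that does not rely on a fixed length, a regular expression can capture the leading run of capital letters. The sample values are copied from the table above, and the names `teams` and `abbrev` are made up for this example.

```python
import pandas as pd

# Sample values copied from the standings table.
teams = pd.Series(["MIL * (1)", "MIA (4)", "CHO (10)"])

# Capture the leading capital letters; expand=False returns a Series
# instead of a one-column DataFrame.
abbrev = teams.str.extract(r"^([A-Z]+)", expand=False)

print(abbrev.tolist())
```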


### Wasn't that lovely? 💐

## Let's do another example with a soccer website ⚽️

- Go to FiveThirtyEight's soccer predictions page: https://projects.fivethirtyeight.com/soccer-predictions/champions-league
- Inspect it with the Chrome dev tools: right-click in the table of numbers on the page and select `Inspect` from the dropdown menu.

#### Do you see HTML tags that indicate it's an HTML table?

```html
<table>
    <th>
    <tr>
        <td>
```


#### Let's try to read it 

In [28]:
soccer_url = 'https://projects.fivethirtyeight.com/soccer-predictions/champions-league'

In [29]:
soccer_list = pd.read_html(soccer_url)

#### How many tables (DataFrames) are there?

In [32]:
len(soccer_list)

217
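With this many tables, printing the whole list is hard to read. One possible approach, sketched below with a tiny stand-in list, is to build a one-row-per-table summary and scan it for the table you want. The `tables` list and the "widest table" heuristic are assumptions for illustration; with real data you would pass the list returned by `pd.read_html()`.

```python
import pandas as pd

# Stand-in for the list returned by pd.read_html(); the real list is much longer.
tables = [
    pd.DataFrame({"a": [1, 2]}),
    pd.DataFrame({"team": ["PSG", "Lyon"], "spi": [89.2, 73.5]}),
]

# One summary row per table: its shape and its first few column names.
summary = pd.DataFrame(
    {
        "n_rows": [t.shape[0] for t in tables],
        "n_cols": [t.shape[1] for t in tables],
        "first_columns": [list(t.columns[:3]) for t in tables],
    }
)
print(summary)

# Assumed heuristic: the table of interest is the one with the most columns.
widest = tables[int(summary["n_cols"].idxmax())]
print(widest)
```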

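#### Let's look at the mess

`pd.read_html()` returns a plain Python list, so DataFrame methods such as `.head()` are not available on the result itself:

In [34]:
soccer_list.head()

AttributeError: 'list' object has no attribute 'head'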

#### Look at the last DataFrame in the list

In [38]:
df_soccer = soccer_list[-1]
df_soccer

Unnamed: 0_level_0,Unnamed: 0_level_0,Team rating,Team rating,Team rating,Unnamed: 4_level_0,round-by-round probabilities,round-by-round probabilities,round-by-round probabilities,round-by-round probabilities,round-by-round probabilities,Unnamed: 10_level_0
Unnamed: 0_level_1,team,spi,off.,def.,make round of 16,make qtrs,make semis,make final,win final,Unnamed: 9_level_1,Unnamed: 10_level_1
0,Bayern Munich,94.4,3.5,0.4,✓,>99%,71%,48%,29%,,
1,Man. City,94.8,3.2,0.2,✓,90%,64%,43%,26%,,
2,PSG,89.2,3.2,0.7,✓,✓,54%,26%,12%,,
3,Barcelona,90.3,2.9,0.4,✓,87%,49%,25%,11%,,
4,RB Leipzig,88.1,2.6,0.4,✓,✓,49%,21%,9%,,
5,Atalanta,82.8,2.7,0.7,✓,✓,37%,12%,4%,,
6,Atlético Madrid,83.8,2.1,0.4,✓,✓,37%,12%,4%,,
7,Juventus,84.3,2.4,0.5,✓,53%,22%,8%,3%,,
8,Real Madrid,89.5,2.7,0.4,✓,10%,6%,3%,2%,,
9,Lyon,73.5,1.8,0.6,✓,47%,10%,2%,<1%,,


#### Which DataFrame do we want?

In [None]:
# The last table in the list: soccer_list[-1]

### Now we're rockin! 🎸

## Inspect the DataFrame

#### Q: What do you notice about the column names?

In [None]:
# The columns are a MultiIndex: each column has two header levels.

## Clean the Data

The commas in the column names and the two header rows in the DataFrame's layout tell us the columns are a MultiIndex, which is a more advanced pandas feature. Let's drop the top level, because we don't need it.

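In [42]:
# Drop the top level of the column MultiIndex.
# The nlevels check makes the cell safe to re-run: calling droplevel(0) on
# columns that are already single-level raises
# "ValueError: Cannot remove 1 levels from an index with 1 levels".
if df_soccer.columns.nlevels > 1:
    df_soccer.columns = df_soccer.columns.droplevel(0)
# Approach from this Stack Overflow answer: https://stackoverflow.com/questions/44023770/pandas-get-rid-of-multiindex/44023799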

In [45]:
df_soccer

Unnamed: 0,team,spi,off.,def.,make round of 16,make qtrs,make semis,make final,win final,Unnamed: 9_level_1,Unnamed: 10_level_1
0,Bayern Munich,94.4,3.5,0.4,✓,>99%,71%,48%,29%,,
1,Man. City,94.8,3.2,0.2,✓,90%,64%,43%,26%,,
2,PSG,89.2,3.2,0.7,✓,✓,54%,26%,12%,,
3,Barcelona,90.3,2.9,0.4,✓,87%,49%,25%,11%,,
4,RB Leipzig,88.1,2.6,0.4,✓,✓,49%,21%,9%,,
5,Atalanta,82.8,2.7,0.7,✓,✓,37%,12%,4%,,
6,Atlético Madrid,83.8,2.1,0.4,✓,✓,37%,12%,4%,,
7,Juventus,84.3,2.4,0.5,✓,53%,22%,8%,3%,,
8,Real Madrid,89.5,2.7,0.4,✓,10%,6%,3%,2%,,
9,Lyon,73.5,1.8,0.6,✓,47%,10%,2%,<1%,,
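Dropping a level is one way to flatten two-level column headers. As a hedged sketch on a small made-up DataFrame (not the scraped table itself), the snippet below shows two options: keeping only the bottom level, or joining both levels while skipping the `Unnamed` placeholders that pandas generates for empty header cells. The names `demo`, `bottom_only`, and `joined` are hypothetical.

```python
import pandas as pd

# Small stand-in with two header levels, mimicking the scraped table.
columns = pd.MultiIndex.from_tuples(
    [
        ("Unnamed: 0_level_0", "team"),
        ("Team rating", "spi"),
        ("Team rating", "off."),
    ]
)
demo = pd.DataFrame([["PSG", 89.2, 3.2]], columns=columns)

# Option 1: keep only the bottom level of the header.
bottom_only = demo.copy()
bottom_only.columns = demo.columns.get_level_values(-1)

# Option 2: join both levels, skipping the "Unnamed" placeholder parts.
joined = demo.copy()
joined.columns = [
    "_".join(part for part in col if not part.startswith("Unnamed"))
    for col in demo.columns
]

print(list(bottom_only.columns))
print(list(joined.columns))
```

Joining the levels is useful when the top level carries information you want to keep, such as which columns belong to the team rating group.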


### Fix the strings so we can make them into numbers in a minute

In [69]:
# Strip the characters that keep 'make final' from being parsed as a number.
# All three replacements write back to the same column.
df_soccer['make final'] = df_soccer['make final'].str.replace('%', '', regex=False)
df_soccer['make final'] = df_soccer['make final'].str.replace('—', '', regex=False)
df_soccer['make final'] = df_soccer['make final'].str.replace('<', '', regex=False)

# Conversion to numbers is left for the next step:
#df_soccer['make final'] = df_soccer['make final'].astype(int)
#df_soccer['make final'] = df_soccer['make final']/100
df_soccer

Unnamed: 0,team,spi,off.,def.,make round of 16,make qtrs,make semis,make final,win final,Unnamed: 9_level_1,Unnamed: 10_level_1
0,Bayern Munich,94.4,3.5,0.4,✓,>99%,71%,48,29%,,
1,Man. City,94.8,3.2,0.2,✓,90%,64%,43,26%,,
2,PSG,89.2,3.2,0.7,✓,✓,54%,26,12%,,
3,Barcelona,90.3,2.9,0.4,✓,87%,49%,25,11%,,
4,RB Leipzig,88.1,2.6,0.4,✓,✓,49%,21,9%,,
5,Atalanta,82.8,2.7,0.7,✓,✓,37%,12,4%,,
6,Atlético Madrid,83.8,2.1,0.4,✓,✓,37%,12,4%,,
7,Juventus,84.3,2.4,0.5,✓,53%,22%,8,3%,,
8,Real Madrid,89.5,2.7,0.4,✓,10%,6%,3,2%,,
9,Lyon,73.5,1.8,0.6,✓,47%,10%,2,<1%,,


#### Turn the object columns into numbers
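One possible way to do this, sketched on a few sample strings copied from the probability columns, is to strip the symbols and then call `pd.to_numeric()` with `errors="coerce"`. Treating `>99%` as 99 and `<1%` as 1 is an approximation, and letting `✓` and `—` become `NaN` is an assumption; you may prefer to map them to explicit values instead. The names `raw`, `cleaned`, and `as_fraction` are made up for this example.

```python
import pandas as pd

# Sample strings copied from the probability columns.
raw = pd.Series([">99%", "71%", "<1%", "✓", "—"])

# Remove the percent sign and the comparison symbols as literal text.
cleaned = (
    raw.str.replace("%", "", regex=False)
    .str.replace("<", "", regex=False)
    .str.replace(">", "", regex=False)
)

# errors="coerce" turns anything non-numeric (such as ✓ or —) into NaN.
as_fraction = pd.to_numeric(cleaned, errors="coerce") / 100

print(as_fraction)
```

The same chain can be applied to each probability column in turn, for example inside a loop over the column names.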

## Let's make a quick plot

## Summary

You saw how to scrape a website's tables and get them into a pandas DataFrame. 🎉

> Check for Understanding


- What does `pd.read_html()` return?
- When might you want to use `pd.read_html()`?

### Go forth and save time scraping tables with `pd.read_html()`! 🎺