# How to Scrape an HTML Table with Python Pandas

## Load HTML Page
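`read_html()` accepts a URL, a local file path, or a file-like object. If you pass a URL, it must be valid. The lxml parser only accepts the http, ftp and file URL protocols, so if the URL starts with https, try removing the final s. The example here reads a local copy of the page.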

In [113]:
import pandas as pd

df_list = pd.read_html("source/euro2020_groups.html")

In [114]:
df_list[0]

,Unnamed: 0,P,+/-,Pts
0,1 ITA Italy,3,7,9
1,2 WAL Wales,3,1,4
2,3 SUI Switzerland,3,-1,4
3,4 TUR Turkey,3,-7,0


## Group all the retrieved tables
The `read_html()` function retrieves all the HTML tables contained in the page and returns a list of dataframes, one for each table.
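
To make the "one dataframe per table" behavior concrete, here is a minimal sketch on a small inline HTML snippet. The snippet and its values are made up for illustration and are not the Euro 2020 page; the HTML is wrapped in `StringIO` so it is treated as a file-like object, and `read_html()` needs an HTML parser such as lxml to be installed.

```python
from io import StringIO

import pandas as pd

# Illustrative HTML with two tables (not the real source page)
html = """
<table>
  <tr><th>Team</th><th>Pts</th></tr>
  <tr><td>Italy</td><td>9</td></tr>
  <tr><td>Wales</td><td>4</td></tr>
</table>
<table>
  <tr><th>Team</th><th>Pts</th></tr>
  <tr><td>Belgium</td><td>9</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> element
tables = pd.read_html(StringIO(html))
print(len(tables))
print(tables[0])
```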

We label each dataframe with its group letter and then combine them all into a single dataframe.

In [115]:
N = len(df_list)

In [116]:
import string
groups_names = list(string.ascii_uppercase[0:N])
groups_names

['A', 'B', 'C', 'D', 'E', 'F']

In [117]:
frames = []
for name, table in zip(groups_names, df_list):
    # Label every row of this table with its group letter
    frames.append(table.assign(Group=name))
# DataFrame.append was removed in pandas 2.0; concatenate once instead
df = pd.concat(frames)
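
As an alternative sketch, `pd.concat()` can attach the group labels itself through its `keys` argument, which avoids the explicit loop. The toy tables below are stand-ins for `df_list`; their values and the `row` level name are assumptions made for illustration.

```python
import string

import pandas as pd

# Toy stand-ins for the per-group tables returned by read_html()
toy_list = [
    pd.DataFrame({"Team": ["Italy", "Wales"], "Pts": [9, 4]}),
    pd.DataFrame({"Team": ["Belgium", "Denmark"], "Pts": [9, 3]}),
]
toy_names = list(string.ascii_uppercase[:len(toy_list)])

# keys= labels each input frame; names= names the two resulting index levels
combined = pd.concat(toy_list, keys=toy_names, names=["Group", "row"])

# Move the group label out of the index and into an ordinary column
combined = combined.reset_index(level="Group")
print(combined)
```

This variant leaves the input tables untouched and puts `Group` in the first column, so it may also reduce the need for reordering later.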

In [118]:
df.head(10)

,Unnamed: 0,P,+/-,Pts,Group
0,1 ITA Italy,3,7,9,A
1,2 WAL Wales,3,1,4,A
2,3 SUI Switzerland,3,-1,4,A
3,4 TUR Turkey,3,-7,0,A
0,1 BEL Belgium,3,6,9,B
1,2 DEN Denmark,3,1,3,B
2,3 FIN Finland,3,-2,3,B
3,4 RUS Russia,3,-5,3,B
0,1 NED Netherlands,3,6,9,C
1,2 AUT Austria,3,1,6,C


## Clean the dataframe
Rename the `Unnamed: 0` column to `Team`, then expand it into its space-separated parts.

In [119]:
df.rename(columns={"Unnamed: 0": "Team"}, inplace=True)

In [120]:
df_new = df['Team'].str.split(' ',expand=True)
df_new.head(5)

,0,1,2,3
0,1,ITA,Italy,
1,2,WAL,Wales,
2,3,SUI,Switzerland,
3,4,TUR,Turkey,
0,1,BEL,Belgium,
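
The extra column `3` appears because splitting on every space also breaks up country names that contain more than one word. A possible refinement, sketched here on made-up sample values that assume the same "position code country" layout, is to cap the number of splits with `n=2` so that everything after the second space stays together.

```python
import pandas as pd

# Illustrative values in the assumed "position code country" layout
team = pd.Series(["1 ITA Italy", "3 CZE Czech Republic"])

# n=2 performs at most two splits, so a multi-word country stays in one piece
parts = team.str.split(" ", n=2, expand=True)
parts.columns = ["N", "ID", "Country"]
print(parts)
```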


Keep only the columns of interest: the position (`N`), the team code (`ID`) and the country name (`Country`).

In [121]:
df[['N', 'ID', 'Country']] = df_new[[0,1,2]]

Drop the original combined `Team` column

In [122]:
df.drop(['Team'], axis=1, inplace=True)

In [123]:
df.head()

,P,+/-,Pts,Group,N,ID,Country
0,3,7,9,A,1,ITA,Italy
1,3,1,4,A,2,WAL,Wales
2,3,-1,4,A,3,SUI,Switzerland
3,3,-7,0,A,4,TUR,Turkey
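
Note that `str.split()` produces strings, so a column such as `N` holds text even though it looks numeric. If numeric behavior is wanted (for sorting or arithmetic), one option is `pd.to_numeric()`. The small frame below uses assumed values purely to show the conversion.

```python
import pandas as pd

# Assumed sample: N holds digits stored as strings, as str.split() would produce
sample = pd.DataFrame({"N": ["1", "2", "10"], "Pts": [9, 4, 0]})

# Convert the text column to an integer dtype
sample["N"] = pd.to_numeric(sample["N"])
print(sample.dtypes)
```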
0,3,6,9,B,1,BEL,Belgium


Reorder columns

In [109]:
df = df.reindex(columns=['Group','N', 'ID', 'Country', 'P', '+/-', 'Pts'])

In [110]:
df.head()

,Group,N,ID,Country,P,+/-,Pts
0,A,1,ITA,Italy,3,7,9
1,A,2,WAL,Wales,3,1,4
2,A,3,SUI,Switzerland,3,-1,4
3,A,4,TUR,Turkey,3,-7,0
0,B,1,BEL,Belgium,3,6,9
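
One thing to keep in mind with `reindex()` is that a misspelled column name does not raise an error: it silently adds a column full of missing values. Selecting with a list of names gives the same ordering but fails loudly on a typo. The sketch below shows both on a one-row frame with assumed values; `Countri` is a deliberate misspelling.

```python
import pandas as pd

sample = pd.DataFrame({"Pts": [9], "Group": ["A"], "Country": ["Italy"]})
order = ["Group", "Country", "Pts"]

# Both approaches return the columns in the requested order
by_reindex = sample.reindex(columns=order)
by_selection = sample[order]

# With a misspelled label, reindex creates an all-NaN column ...
typo = sample.reindex(columns=["Group", "Countri", "Pts"])

# ... while list selection raises KeyError
try:
    sample[["Group", "Countri", "Pts"]]
    selection_failed = False
except KeyError:
    selection_failed = True

print(typo)
print(selection_failed)
```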


## Save the Dataframe as a CSV file

In [56]:
df.to_csv('euro_2020_groups.csv')
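
By default `to_csv()` also writes the index, which here is just the repeating row position within each group. Reading such a file back yields an extra `Unnamed: 0` column. If the index is not needed, passing `index=False` leaves it out. The sketch below writes to an in-memory buffer instead of a file, with assumed sample rows, to compare the two.

```python
from io import StringIO

import pandas as pd

# Assumed sample rows with a repeating index, like the combined dataframe
sample = pd.DataFrame(
    {"Group": ["A", "B"], "Country": ["Italy", "Belgium"], "Pts": [9, 9]},
    index=[0, 0],
)

# Default: the index is written as an unnamed first column
with_index = StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```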