## Creating a DataFrame

In the [cc_scraping.ipynb](Documents/Coding/Notebooks/tutorials/cc_scraping.ipynb) demo, we extracted data from a fairly friendly F1 table.

What if we didn't have that? 

In this self-assigned exercise, we will extract simple data for every match played at EURO 2020 from the [Flash Score](https://www.flashscore.com/football/europe/euro-2020/results/) website. We will build a DataFrame from the following info:
1. Date
2. Stage of Competition
3. Home Team
4. Away Team
5. Score
6. Penalty Score

All of this uses Requests and BeautifulSoup, then pandas.

In [1]:
import requests
from bs4 import BeautifulSoup

euro_url = "https://terrikon.com/en/euro-2020"
e_page = requests.get(euro_url)
e_html = e_page.text
e_soup = BeautifulSoup(e_html, "html.parser")  # name the parser explicitly to avoid bs4's guessed-parser warning

e_tables = e_soup.find("div", class_="maincol")
e_headers = e_tables.find_all("h2")
e_comps = ["".join(header.text.split("Euro-2020."))[1:] for header in e_headers]
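
To see what the list comprehension does to each heading, here is a minimal sketch on made-up heading strings. The sample texts are an assumption about what the page's `h2` elements contain, and the `clean_header` helper is only illustrative; the real headings may differ.

```python
# Hypothetical heading texts, assumed to resemble the page's <h2> elements.
sample_headers = ["Euro-2020. Final", "Euro-2020. Semi-finals", "Euro-2020. Group F"]

def clean_header(text):
    # Same steps as the comprehension: split out the "Euro-2020." prefix,
    # re-join the pieces, then slice off the one leading space left behind.
    return "".join(text.split("Euro-2020."))[1:]

cleaned = [clean_header(text) for text in sample_headers]
print(cleaned)
```

Note that the `[1:]` slice removes the first character unconditionally, so a heading without the leading space would lose a letter.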

##### Section the HTML by match type
*i.e. Final, Semi-finals, ..., Group F*

In [2]:
e_sections = []
for cnt, header in enumerate(e_headers):
    tracker = header.next_sibling
    cur_html = ""
    next_header = False
    try:
        next_header = e_headers[cnt + 1]
    except IndexError:  # last header: collect every remaining sibling
        while tracker:
            cur_html = cur_html + str(tracker)
            tracker = tracker.next_sibling
        if cur_html:
            e_sections.append(BeautifulSoup(cur_html, "html.parser").div)
        continue
    # collect siblings up to (but not including) the next header
    while next_header and tracker != next_header and tracker:
        cur_html = cur_html + str(tracker)
        tracker = tracker.next_sibling
    if cur_html:
        e_sections.append(BeautifulSoup(cur_html, "html.parser").div)
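
The sibling-walking idea in the cell above can be tried on a tiny inline page. This is a sketch under assumptions: the HTML below is a made-up stand-in for the real page, and it is a variation on the loop above in that it iterates `next_siblings` and stops at the next `h2` instead of indexing into the header list.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page: each <h2> is followed by one <div> of results.
html = """
<div class="maincol">
<h2>Euro-2020. Final</h2><div><p>final results</p></div>
<h2>Euro-2020. Semi-finals</h2><div><p>semi-final results</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
headers = soup.find("div", class_="maincol").find_all("h2")

sections = []
for header in headers:
    chunk = ""
    # Walk forward through the siblings until the next <h2> (or the end).
    for sibling in header.next_siblings:
        if getattr(sibling, "name", None) == "h2":
            break
        chunk += str(sibling)
    if chunk.strip():
        # Re-parse the collected markup and keep its first <div>.
        sections.append(BeautifulSoup(chunk, "html.parser").div)

print([section.p.text for section in sections])
```

Stopping on the tag name keeps the loop short, but it only works when the headings are direct siblings of the content they introduce, as they are in this toy page.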

##### Attaches the relevant match type to the data
*i.e. all Group A-F matches are marked as Group Stage matches*

In [3]:
import pandas as pd
import io

tester = e_sections[0]
table = tester.select("table.gameresult")[0].prettify()
table = pd.read_html(io.StringIO(table))[0]

**To note above**

`pd.read_html` expects raw HTML text, like so:

`page = requests.get(page_url)`<br>
`df = pd.read_html(io.StringIO(page.text), match="table title")`

Above, `tester` was a BeautifulSoup object, so we had to call `prettify()` to turn it back into HTML text that pandas can parse.

So this was an extra hurdle we made for ourselves. Now we know. Still, our sectioning loop above (In [2]) was a good exercise in partitioning the HTML when the page doesn't hand it to us as ready-made tables.

In [135]:
all_matches = []

In [140]:
for i, section in enumerate(e_sections):
    s_tester = section
    # prettify() turns the BeautifulSoup tag back into HTML text for pandas
    s_table = s_tester.select("table.gameresult")[0].prettify()
    s_table = pd.read_html(io.StringIO(s_table))[0]
    s_table = s_table.drop(columns=[0])

    # placeholder penalty values: NA unless the row carries a penalty score
    s_filler = [pd.NA] * len(s_table)
    mType = e_comps[i]
    if "Group" in mType:
        mType = "Group Stage"
    s_matchtype = [mType] * len(s_table)
    
    s_table = s_table.rename(columns={1:"Home Team", 2:"Home Goals", 3:"Away Team", 4:"Away Goals", 5:"Match Date"})
    s_table.insert(2, "Home Penalties", s_filler, True)
    s_table.insert(5, "Away Penalties", s_filler, True)
    s_table.insert(6, "Match Type", s_matchtype, True)
    
    for index, row in s_table.iterrows():
        # the raw score string sits in "Home Goals"; values are read by character position
        scoring = row["Home Goals"]
        hG = scoring[0]
        hP = pd.NA
        aG = scoring[2]
        aP = pd.NA
        if len(scoring) > 3:  # a longer string means the match has a penalty score
            hP = scoring[6]
            aP = scoring[8]
            s_table.loc[index, "Home Penalties"] = int(hP)
            s_table.loc[index, "Away Penalties"] = int(aP)

        # overwrite the raw string with the numeric goal counts
        s_table.loc[index, "Home Goals"] = int(hG)
        s_table.loc[index, "Away Goals"] = int(aG)
    
    s_table[['Home Goals', 'Away Goals']] = s_table[['Home Goals', 'Away Goals']].astype('int')
    #s_table.dtypes
    all_matches.append(s_table)
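
Reading goals by character position only works while every number is a single digit. A possible alternative, sketched here with a hypothetical `parse_score` helper, pulls out the digit groups with a regular expression instead. The `"1:1 (3:2)"` shoot-out format is an assumption for illustration, not necessarily what the site emits, and the sketch assumes the string holds only the match score and, optionally, the penalty score.

```python
import re

def parse_score(scoring):
    """Return (home_goals, away_goals, home_pens, away_pens); pens are None if absent."""
    # Collect every run of digits, in order of appearance.
    numbers = [int(n) for n in re.findall(r"\d+", scoring)]
    home_goals, away_goals = numbers[0], numbers[1]
    if len(numbers) >= 4:
        # Third and fourth numbers are taken to be the penalty score.
        return home_goals, away_goals, numbers[2], numbers[3]
    return home_goals, away_goals, None, None

print(parse_score("2:1"))
print(parse_score("1:1 (3:2)"))  # hypothetical shoot-out format
print(parse_score("10:0"))       # multi-digit goals still parse
```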

In [144]:
matches_df = pd.concat(all_matches)

In [148]:
matches_df

| | Home Team | Home Goals | Home Penalties | Away Team | Away Goals | Away Penalties | Match Type | Match Date |
|---|---|---|---|---|---|---|---|---|
| 0 | Italy | 1 | 3 | England | 1 | 2 | Final | 11.07.21 |
| 0 | Spain | 1 | 2 | Italy | 1 | 4 | Semi-finals | 06.07.21 |
| 1 | England | 2 | | Denmark | 1 | | Semi-finals | 07.07.21 |
| 0 | Switzerland | 1 | 1 | Spain | 1 | 3 | Quarter-finals | 02.07.21 |
| 1 | Belgium | 1 | | Italy | 2 | | Quarter-finals | 02.07.21 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1 | France | 1 | | Germany | 0 | | Group Stage | 15.06.21 |
| 2 | Hungary | 1 | | France | 1 | | Group Stage | 19.06.21 |
| 3 | Portugal | 2 | | Germany | 4 | | Group Stage | 19.06.21 |
| 4 | Portugal | 2 | | France | 2 | | Group Stage | 23.06.21 |
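
The index in the output above restarts for each section (0, 0, 1, 0, 1, ...) because `pd.concat` keeps every table's own row labels by default. If a single running index is preferred, `ignore_index=True` renumbers the rows. A minimal sketch on two small stand-in tables built from the first rows of the output:

```python
import pandas as pd

# Two small stand-in tables, each with its own 0-based index.
final = pd.DataFrame({"Home Team": ["Italy"], "Away Team": ["England"]})
semis = pd.DataFrame({"Home Team": ["Spain", "England"], "Away Team": ["Italy", "Denmark"]})

kept = pd.concat([final, semis])                            # keeps each table's labels
renumbered = pd.concat([final, semis], ignore_index=True)   # fresh 0..n-1 index

print(list(kept.index))
print(list(renumbered.index))
```

Keeping the original labels is harmless for display, but label-based lookups such as `.loc[0]` would then return several rows at once.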
