## Creating a DataFrame

In the [cc_scraping.ipynb](Documents/Coding/Notebooks/tutorials/cc_scraping.ipynb) demo, we extracted data from an F1 table that was already easy to parse.

What if we didn't have that? 

In this self-assigned exercise, I will extract basic data for every match played at EURO 2020 from the [Flash Score](https://www.flashscore.com/football/europe/euro-2020/results/) website. Note that the code below actually requests the results page at `https://terrikon.com/en/euro-2020`. We will build a DataFrame from the following fields:
1. Date
2. Stage of Competition
3. Home Team
4. Away Team
5. Score
6. Penalty Score

All of this is done with Requests and BeautifulSoup, then pandas.

In [109]:
import requests
from bs4 import BeautifulSoup

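euro_url = "https://terrikon.com/en/euro-2020"
e_page = requests.get(euro_url)
e_html = e_page.text
# Name the parser explicitly so bs4 does not have to guess (and warn)
e_soup = BeautifulSoup(e_html, "html.parser")

# The code looks for the stage headings (<h2>) inside the "maincol" div
e_tables = e_soup.find("div", class_="maincol")
e_headers = e_tables.find_all("h2")
# Drop the "Euro-2020." prefix and the single character that follows it
e_headers = ["".join(header.text.split("Euro-2020."))[1:] for header in e_headers]
e_headers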

['Final',
 'Semi-finals',
 'Quarter-finals',
 'Round of 16',
 'Group A',
 'Group B',
 'Group C',
 'Group D',
 'Group E',
 'Group F']

##### Section the HTML by match type
*i.e. Final, Semi-finals . . . Group F*

In [71]:
# e_headers now holds plain strings, so fetch the <h2> tags again to walk the tree
e_header_tags = e_tables.find_all("h2")

e_sections = []
for header in e_header_tags:
    cur_html = ""
    tracker = header.next_sibling
    # Collect every sibling until the next <h2> or the end of the container
    while tracker is not None and getattr(tracker, "name", None) != "h2":
        cur_html += str(tracker)
        tracker = tracker.next_sibling
    if cur_html:
        # Re-parse the collected HTML and keep its first <div>
        e_sections.append(BeautifulSoup(cur_html, "html.parser").div)
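The sibling walk can also be tried on a small inline sample, without touching the network. This is only a sketch: the markup in `sample_html` is a made-up stand-in for the real page, and `sections_by_header` is a hypothetical helper name, not something from the notebook.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, simplified markup: each <h2> is followed by one <div> of results
sample_html = """
<div class="maincol">
  <h2>Euro-2020. Final</h2>
  <div><table class="gameresult"><tr><td>Italy</td></tr></table></div>
  <h2>Euro-2020. Semi-finals</h2>
  <div><table class="gameresult"><tr><td>Italy</td></tr><tr><td>England</td></tr></table></div>
</div>
"""

def sections_by_header(container):
    """Map each <h2> title to the tags that follow it, up to the next <h2>."""
    sections = {}
    for header in container.find_all("h2"):
        tags = []
        for sibling in header.next_siblings:
            if isinstance(sibling, Tag) and sibling.name == "h2":
                break  # reached the next section
            if isinstance(sibling, Tag):
                tags.append(sibling)  # skip whitespace-only text nodes
        title = header.get_text().replace("Euro-2020.", "").strip()
        sections[title] = tags
    return sections

soup = BeautifulSoup(sample_html, "html.parser")
sections = sections_by_header(soup.find("div", class_="maincol"))
print(list(sections))
```

Keying the sections by title keeps each block of HTML attached to its stage name, which avoids relying on two parallel lists staying in the same order.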

##### Attach the relevant match type to the data
*i.e. all of Groups A-F are marked as Group Stage matches*
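One way to express that mapping is a small helper that collapses every group heading into a single label. This is a sketch; `match_type` is a hypothetical name, and the heading strings are the ones listed in the output above.

```python
def match_type(header: str) -> str:
    """Collapse 'Group A' ... 'Group F' into one 'Group Stage' label."""
    return "Group Stage" if header.startswith("Group ") else header

headers = [
    "Final", "Semi-finals", "Quarter-finals", "Round of 16",
    "Group A", "Group B", "Group C", "Group D", "Group E", "Group F",
]
match_types = [match_type(h) for h in headers]
print(match_types)
```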



In [93]:
import pandas as pd
import io

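# Take the first section (the Final) and pull out its results table
s_tester = e_sections[0]
# read_html expects HTML text, so turn the bs4 tag back into a string first
s_table = s_tester.select("table.gameresult")[0].prettify()
s_table = pd.read_html(io.StringIO(s_table))[0]
s_table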

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 |   | Italy | 1:1 (3:2) | England |   | 11.07.21 |


**To Note Above**

Pandas works best with raw HTML text, like so:

`page = requests.get(page_url)`<br>
`dfs = pd.read_html(io.StringIO(page.text), match="table title")`

Above, `s_tester` was a BeautifulSoup object, so we had to convert it back to HTML text before pandas could parse it. We used `prettify()` here; `str()` on the tag also works.

So this was an extra hurdle we made for ourselves. Now we know. Still, the method in code cell 71 was a good exercise in partitioning sections of the HTML when the data isn't handed to us as ready-made tables.
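The two routes can be compared side by side on a small inline table. This is a sketch under assumptions: the HTML string is a hand-written stand-in for the real page, and `pd.read_html` needs an HTML parser backend such as `lxml` installed.

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

# Hand-written stand-in for one results table (not fetched from the site)
html = """
<div><table class="gameresult">
<tr><td></td><td>Italy</td><td>1:1 (3:2)</td><td>England</td><td></td><td>11.07.21</td></tr>
</table></div>
"""

# Route 1: give pandas the raw HTML text directly
direct = pd.read_html(io.StringIO(html))[0]

# Route 2: go through BeautifulSoup, then convert the tag back to a string
tag = BeautifulSoup(html, "html.parser").select_one("table.gameresult")
via_soup = pd.read_html(io.StringIO(str(tag)))[0]

print(direct.equals(via_soup))
```

Route 2 only pays off when the table first has to be located or isolated with BeautifulSoup, as in the sectioning step earlier.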

In [155]:
import pandas as pd
import io

s_tester = e_sections[0]
s_table = s_tester.select("table.gameresult")[0].prettify()
s_table = pd.read_html(io.StringIO(s_table))[0]
# Column 0 is empty, so drop it
s_table = s_table.drop(columns=[0])

# One placeholder value and one match-type value per row
s_filler = ["NaN"] * len(s_table)
s_matchtype = [e_headers[0]] * len(s_table)

s_table = s_table.rename(columns={1:"Home Team", 2:"Home Goals", 3:"Away Team", 4:"Away Goals", 5:"Match Date"})
# Note: "Home Goals" still holds the full score string (e.g. "1:1 (3:2)"), and
# these insert positions leave "Away Goals" after "Match Type" instead of
# before "Away Penalties".
s_table.insert(2, "Home Penalties", s_filler, True)
s_table.insert(4, "Away Penalties", s_filler, True)
s_table.insert(5, "Match Type", s_matchtype, True)
s_table

|   | Home Team | Home Goals | Home Penalties | Away Team | Away Penalties | Match Type | Away Goals | Match Date |
|---|---|---|---|---|---|---|---|---|
| 0 | Italy | 1:1 (3:2) |   | England |   | Final |   | 11.07.21 |


In [None]:
# Target column order:
# Home Team | Home Goals | Home Penalties | Away Team | Away Goals | Away Penalties | Match Type | Match Date

# Reference snippets:
# df.insert(2, "Age", [21, 23, 24, 21], True)
# df.rename(columns={"A": "a", "B": "c"})
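To reach that target layout, the combined score string (currently sitting in "Home Goals") still needs to be split into goals and penalties. The sketch below is one possible approach, not the notebook's method: `split_score` and `SCORE_RE` are hypothetical names, and it assumes the parenthesised part of a score such as `1:1 (3:2)` is always the penalty shoot-out result. The `"2:1"` input is an invented example of a match without penalties.

```python
import re

# goals "H:A", optionally followed by penalties "(H:A)"
SCORE_RE = re.compile(r"^\s*(\d+):(\d+)(?:\s*\((\d+):(\d+)\))?\s*$")

def split_score(score: str) -> dict:
    """Split 'H:A' or 'H:A (PH:PA)' into goal and penalty fields."""
    m = SCORE_RE.match(score)
    if m is None:
        raise ValueError(f"Unrecognised score format: {score!r}")
    home, away, home_pen, away_pen = m.groups()
    return {
        "Home Goals": int(home),
        "Away Goals": int(away),
        # None when the match had no shoot-out
        "Home Penalties": int(home_pen) if home_pen is not None else None,
        "Away Penalties": int(away_pen) if away_pen is not None else None,
    }

final = split_score("1:1 (3:2)")  # the Final row from the table above
plain = split_score("2:1")        # invented score without penalties
print(final)
print(plain)
```

A list of these dicts, one per match, could then be passed to `pd.DataFrame(...)` and joined with the team, match type and date columns in whatever order is needed.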