## Creating a DataFrame

In my personal [cc_scraping.ipynb](Documents/Coding/Notebooks/tutorials/cc_scraping.ipynb) demo, I had already extracted data from a fairly friendly F1 table.

What if I didn't have that? 

In this self-assigned exercise, I will extract simple data for every match played at EURO 2020 from the [Terrikon](https://terrikon.com/en/euro-2020) website. We will build a DataFrame from the following info:
1. Date
2. Stage of Competition
3. Home Team
4. Away Team
5. Score
6. Penalty Score

All of this uses Requests and BeautifulSoup for the scraping, then pandas for the DataFrame.

In [38]:
import requests
from bs4 import BeautifulSoup

euro_url = "https://terrikon.com/en/euro-2020"
e_page = requests.get(euro_url)
e_html = e_page.text
e_soup = BeautifulSoup(e_html, "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning

e_tables = e_soup.find("div", class_="maincol")
e_headers = e_tables.find_all("h2")
e_comps = ["".join(header.text.split("Euro-2020."))[1:] for header in e_headers]

##### Sectioning the HTML by Match Type
*i.e. Final, Semi-finals . . . Group F*

In [39]:
e_sections = []
for cnt, header in enumerate(e_headers):
    tracker = header.next_sibling
    cur_html = ""
    # The last header has no successor, so next_header is None
    next_header = e_headers[cnt + 1] if cnt + 1 < len(e_headers) else None
    # Collect every sibling up to (not including) the next header as raw HTML
    while tracker and tracker != next_header:
        cur_html += str(tracker)
        tracker = tracker.next_sibling
    if cur_html:
        # Re-parse the fragment and keep its first <div>
        e_sections.append(BeautifulSoup(cur_html, "html.parser").div)
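As a side note, the same partitioning can be sketched without rebuilding HTML strings, by walking `next_siblings` and keeping the tags themselves. The HTML below is a made-up stand-in for the page (the real Terrikon markup may differ), and `sections` is a name I introduce only for this illustration.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, simplified markup: one <h2> per stage, followed by its content
html = (
    '<div class="maincol">'
    "<h2>Euro-2020. Final</h2>"
    '<div><table class="gameresult"><tr><td>Italy</td></tr></table></div>'
    "<h2>Euro-2020. Semi-finals</h2>"
    '<div><table class="gameresult"><tr><td>Spain</td></tr></table></div>'
    "</div>"
)

soup = BeautifulSoup(html, "html.parser")
headers = soup.find("div", class_="maincol").find_all("h2")

sections = {}
for header in headers:
    parts = []
    for sibling in header.next_siblings:
        if isinstance(sibling, Tag):
            if sibling.name == "h2":
                break  # reached the next stage, so this section is done
            parts.append(sibling)
    sections[header.get_text(strip=True)] = parts

for name, parts in sections.items():
    print(name, len(parts))
```

Keeping the tags avoids the round trip through `str()` and a second `BeautifulSoup` call, at the cost of skipping bare text between the headers.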

##### Attaches the relevant match type to the data
*i.e. all of Groups A-F are marked as Group Stage matches*
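To make the idea concrete, here is a small sketch of the header clean-up and the group collapsing in one helper. It assumes the header text looks like `"Euro-2020. Group A"`, which is what the split on `"Euro-2020."` in the first cell suggests; `clean_stage` is a hypothetical helper name, not something from the notebook.

```python
def clean_stage(header_text, year=2020):
    """Strip the 'Euro-<year>.' prefix and collapse every group into 'Group Stage'."""
    prefix = f"Euro-{year}."
    stage = header_text.replace(prefix, "", 1).strip()
    return "Group Stage" if "Group" in stage else stage


for text in ["Euro-2020. Final", "Euro-2020. Group A", "Euro-2020. Group F"]:
    print(text, "->", clean_stage(text))
```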



In [40]:
import pandas as pd
import io

tester = e_sections[0]
table = tester.select("table.gameresult")[0].prettify()
table = pd.read_html(io.StringIO(table))[0]

**To Note Above**

Pandas works best with plain HTML text, like so:

`page = requests.get(page_url) `<br>
`df = pd.read_html(io.StringIO(page.text), match="table title")`

Above, `tester` was a BeautifulSoup object, so we had to call `prettify()` to turn it back into HTML text before pandas could parse it.

So, this was an extra hurdle we made for ourselves. Now we know. Still, the sectioning method in code cell 39 above was a great exercise in partitioning the HTML when it isn't handed to us as tables.
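A minimal sketch of the two routes, run on a made-up table rather than the live page. It assumes pandas has an HTML backend installed (lxml, or bs4 with html5lib). `str(tag)` is used here as an alternative to `prettify()`, and `attrs` is the `read_html` parameter for picking tables by attribute.

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical page fragment standing in for the real site
page_html = """
<div>
  <table class="gameresult">
    <tr><td>Italy</td><td>1:1</td><td>England</td></tr>
    <tr><td>Spain</td><td>1:1</td><td>Italy</td></tr>
  </table>
</div>
"""

# Route 1: already holding a BeautifulSoup tag, so turn it back into text first
tag = BeautifulSoup(page_html, "html.parser").select_one("table.gameresult")
from_tag = pd.read_html(io.StringIO(str(tag)))[0]

# Route 2: hand pandas the page text directly and select the table by its class
from_page = pd.read_html(io.StringIO(page_html), attrs={"class": "gameresult"})[0]

print(from_tag)
print(from_page)
```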

In [41]:
t_all_matches = []

In [42]:
for i, section in enumerate(e_sections):
    s_tester = section  # enumerate already yields the section itself
    s_table = s_tester.select("table.gameresult")[0].prettify()
    s_table = pd.read_html(io.StringIO(s_table))[0]
    s_table = s_table.drop(columns=[0])
    
    s_filler = [pd.NA for row in s_table.iterrows()]
    mType = e_comps[i]
    if "Group" in mType:
        mType = "Group Stage"
    s_matchtype = [mType for row in s_table.iterrows()]
    
    s_table = s_table.rename(columns={1:"Home Team", 2:"Home Goals", 3:"Away Team", 4:"Away Goals", 5:"Match Date"})
    s_table.insert(2, "Home Penalties", s_filler, True)
    s_table.insert(5, "Away Penalties", s_filler, True)
    s_table.insert(6, "Match Type", s_matchtype, True)
    
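    for index, row in s_table.iterrows():
        # At this point the "Home Goals" column still holds the full score text
        scoring = row["Home Goals"]
        hG = scoring[0]  # first character: home goals (single digit assumed)
        aG = scoring[2]  # third character: away goals (single digit assumed)
        if len(scoring) > 3:  # a longer string also carries a penalty shootout score
            hP = scoring[6]
            aP = scoring[8]
            s_table.loc[index, "Home Penalties"] = int(hP)
            s_table.loc[index, "Away Penalties"] = int(aP)

        s_table.loc[index, "Home Goals"] = int(hG)
        s_table.loc[index, "Away Goals"] = int(aG)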
    
    s_table[['Home Goals', 'Away Goals']] = s_table[['Home Goals', 'Away Goals']].astype('int')
    #s_table.dtypes
    t_all_matches.append(s_table)
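The loop above reads the score by fixed character positions, so a double-digit number would be cut short. A hedged alternative is to pull out every run of digits with a regular expression. The exact layout of the score text on the site is not shown in this notebook, so the sample strings below are made up, and `parse_score` is a hypothetical helper.

```python
import re


def parse_score(scoring):
    """Return (home_goals, away_goals, home_pens, away_pens); pens are None if absent."""
    numbers = [int(n) for n in re.findall(r"\d+", scoring)]
    if len(numbers) == 2:
        return numbers[0], numbers[1], None, None
    if len(numbers) == 4:
        return tuple(numbers)
    raise ValueError(f"Unexpected score format: {scoring!r}")


# Made-up examples of what a score cell might contain
print(parse_score("2:1"))
print(parse_score("1:1 p. 3:2"))
```

This assumes a score cell contains only goals and, optionally, a shootout result. Any other number in the cell would trigger the `ValueError` instead of being silently misread.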

In [43]:
# ignore_index=True renumbers the rows 0..n-1 across all sections
t_matches_df = pd.concat(t_all_matches, ignore_index=True)
t_matches_df.head()

| | Home Team | Home Goals | Home Penalties | Away Team | Away Goals | Away Penalties | Match Type | Match Date |
|---|---|---|---|---|---|---|---|---|
| 0 | Italy | 1 | 3.0 | England | 1 | 2.0 | Final | 11.07.21 |
| 1 | Spain | 1 | 2.0 | Italy | 1 | 4.0 | Semi-finals | 06.07.21 |
| 2 | England | 2 | | Denmark | 1 | | Semi-finals | 07.07.21 |
| 3 | Switzerland | 1 | 1.0 | Spain | 1 | 3.0 | Quarter-finals | 02.07.21 |
| 4 | Belgium | 1 | | Italy | 2 | | Quarter-finals | 02.07.21 |


### Pulling it All Together

Below, the same code is repeated, but restructured to loop over multiple pages and coalesce the data from every Euros from 1964 to 2020, including the qualifiers where that is practical.
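As a quick sanity check on which pages get visited, this sketch lists the years and URLs the loop builds. The stop value of `range` is exclusive, so 1964 is the oldest tournament included. Whether every one of these URLs exists on the site is not verified here.

```python
base_url = "https://terrikon.com/en/euro-"

# Count down from 2020 in steps of four; 1960 itself is excluded
years = list(range(2020, 1960, -4))
urls = [f"{base_url}{year}" for year in years]

print(years[0], years[-1], len(years))
print(urls[0])
```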

In [47]:
import requests 
import pandas as pd
import io
from bs4 import BeautifulSoup

years = list(range(2020, 1960, -4))
url = "https://terrikon.com/en/euro-"

all_matches = []

for year in years:
    cur_url = f"{url}{year}"
    cur_page = requests.get(cur_url)
    cur_html = cur_page.text
    cur_soup = BeautifulSoup(cur_html, "html.parser")  # BeautifulSoup takes the HTML string directly

    cur_tables = cur_soup.find("div", class_="maincol")
    cur_headers = cur_tables.find_all("h2")
    cur_comps = ["".join(header.text.split(f"Euro-{year}."))[1:] for header in cur_headers]

    cur_sections = []
    for cnt, header in enumerate(cur_headers):
        tracker = header.next_sibling
        section_html = ""
        # The last header has no successor, so next_header is None
        next_header = cur_headers[cnt + 1] if cnt + 1 < len(cur_headers) else None
        # Collect every sibling up to (not including) the next header as raw HTML
        while tracker and tracker != next_header:
            section_html += str(tracker)
            tracker = tracker.next_sibling
        if section_html:
            # Re-parse the fragment and keep its first <div>
            cur_sections.append(BeautifulSoup(section_html, "html.parser").div)

    for i, section in enumerate(cur_sections): 
        s_tester = cur_sections[i]
        s_table = s_tester.select("table.gameresult")[0].prettify()
        s_table = pd.read_html(io.StringIO(s_table))[0]
        s_table = s_table.drop(columns=[0])
        
        s_filler = [pd.NA for row in s_table.iterrows()]
        mType = cur_comps[i]
        cType = f"Euros {year}"
        if "Group" in mType:
            mType = "Group Stage"
        s_matchtype = [mType for row in s_table.iterrows()]
        s_competition = [cType for row in s_table.iterrows()]

        s_table = s_table.rename(columns={1:"home team", 2:"home goals", 3:"away team", 4:"away goals", 5:"match date"})
        s_table.insert(2, "home penalties", s_filler, True)
        s_table.insert(5, "away penalties", s_filler, True)
        s_table.insert(6, "match type", s_matchtype, True)
        s_table.insert(8, "competition", s_competition, True)
        
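        for index, row in s_table.iterrows():
            # At this point the "home goals" column still holds the full score text
            scoring = row["home goals"]
            hG = scoring[0]  # first character: home goals (single digit assumed)
            aG = scoring[2]  # third character: away goals (single digit assumed)
            if len(scoring) > 3:  # a longer string also carries a penalty shootout score
                hP = scoring[6]
                aP = scoring[8]
                s_table.loc[index, "home penalties"] = int(hP)
                s_table.loc[index, "away penalties"] = int(aP)

            s_table.loc[index, "home goals"] = int(hG)
            s_table.loc[index, "away goals"] = int(aG)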
        
        s_table[['home goals', 'away goals']] = s_table[['home goals', 'away goals']].astype('int')
        #s_table.dtypes
        all_matches.append(s_table)

# ignore_index=True renumbers the rows 0..n-1 across all tournaments
matches_df = pd.concat(all_matches, ignore_index=True)
matches_df.head()

| | home team | home goals | home penalties | away team | away goals | away penalties | match type | match date | competition |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Italy | 1 | 3.0 | England | 1 | 2.0 | Final | 11.07.21 | Euros 2020 |
| 1 | Spain | 1 | 2.0 | Italy | 1 | 4.0 | Semi-finals | 06.07.21 | Euros 2020 |
| 2 | England | 2 | | Denmark | 1 | | Semi-finals | 07.07.21 | Euros 2020 |
| 3 | Switzerland | 1 | 1.0 | Spain | 1 | 3.0 | Quarter-finals | 02.07.21 | Euros 2020 |
| 4 | Belgium | 1 | | Italy | 2 | | Quarter-finals | 02.07.21 | Euros 2020 |


In [48]:
matches_df

| | home team | home goals | home penalties | away team | away goals | away penalties | match type | match date | competition |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Italy | 1 | 3 | England | 1 | 2 | Final | 11.07.21 | Euros 2020 |
| 1 | Spain | 1 | 2 | Italy | 1 | 4 | Semi-finals | 06.07.21 | Euros 2020 |
| 2 | England | 2 | | Denmark | 1 | | Semi-finals | 07.07.21 | Euros 2020 |
| 3 | Switzerland | 1 | 1 | Spain | 1 | 3 | Quarter-finals | 02.07.21 | Euros 2020 |
| 4 | Belgium | 1 | | Italy | 2 | | Quarter-finals | 02.07.21 | Euros 2020 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 378 | Switzerland | 1 | | Netherlands | 1 | | Qualifying Round | 31.03.63 | Euros 1964 |
| 379 | GDR | 2 | | ČSSR | 1 | | Qualifying Round | 21.11.62 | Euros 1964 |
| 380 | ČSSR | 1 | | GDR | 1 | | Qualifying Round | 31.03.63 | Euros 1964 |
| 381 | Italy | 6 | | Türkiye | 0 | | Qualifying Round | 02.12.62 | Euros 1964 |

In [50]:
matches_df.to_csv("euros_data.csv")
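A note on the export: `to_csv` writes the row index as an unnamed first column by default, and that column comes back as `Unnamed: 0` when the file is read again. A small sketch of the difference, using an in-memory buffer and a toy frame instead of the real file:

```python
import io

import pandas as pd

# Toy frame standing in for matches_df
df = pd.DataFrame({"home team": ["Italy", "Spain"], "home goals": [1, 1]})

# Default: the index is written as an extra, unnamed column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
round_trip_a = pd.read_csv(with_index)

# index=False leaves the index out of the file
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
round_trip_b = pd.read_csv(without_index)

print(list(round_trip_a.columns))
print(list(round_trip_b.columns))
```

Either pass `index=False` when writing, or `index_col=0` when reading, depending on whether the row numbers are worth keeping.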