## Modeling with a Dataframe

After web scraping all of the Euros data and formatting it into a dataframe and then a CSV in my [euros_scraping.ipynb](http://localhost:8888/files/Documents/Coding/Notebooks/repos/euros-WebScraping/euros_scraping.ipynb?_xsrf=2%7C8c3fecd8%7C02d37b9b4c7c7af0dd7bca94af94aa7f%7C1716421226) notebook, I will now attempt modeling and predictions.

The hope is that I scraped the relevant data, with every result from the Euros and the Euros qualifiers recorded in the CSV.

In this notebook, I will use pandas as the main tool for modeling this data.

In [2]:
import pandas as pd

In [3]:
matches = pd.read_csv("euros.csv", index_col=0)
matches["mDate"] = pd.to_datetime(matches["mDate"], format="%d.%m.%y")
matches.loc[matches["mDate"].dt.year > 2024, "mDate"] -= pd.DateOffset(years=100) 

# necessary because two-digit years before 2000 are parsed as 20yy (10.05.59 becomes 2059 instead of 1959)
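
The century problem can be reproduced with the standard library alone. Python's `datetime.strptime` maps two-digit years 00 to 68 onto 2000 to 2068, and the fix above suggests pandas behaves the same way here. The sketch below is illustrative only: `parse_match_date` and its `latest_year` cutoff are hypothetical names, not part of this notebook.

```python
from datetime import datetime


def parse_match_date(text, latest_year=2024):
    """Parse a dd.mm.yy string, pushing impossible future years back a century.

    latest_year is an assumed cutoff: no scraped match can be later than it.
    """
    parsed = datetime.strptime(text, "%d.%m.%y")
    if parsed.year > latest_year:
        # e.g. "59" was read as 2059, but the match was played in 1959
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed


raw = datetime.strptime("10.05.59", "%d.%m.%y")   # uncorrected parse
old = parse_match_date("10.05.59")                # corrected parse
recent = parse_match_date("26.03.24")             # already in range, left alone
print(raw.year, old.year, recent.year)
```

The same cutoff idea is what the `> 2024` mask in the cell above applies to the whole `mDate` column at once.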

In [4]:
matches["hTeam"].value_counts().head()

hTeam
Italy       84
Germany     84
Spain       83
Denmark     81
Portugal    80
Name: count, dtype: int64

In [5]:
matches["aTeam"].value_counts().head()

aTeam
Spain          88
Italy          79
Netherlands    77
Denmark        76
Portugal       76
Name: count, dtype: int64

### Next Steps 

Currently, a single national team's data is split between the home and away columns. For example, when we scraped the page data, Italy was sometimes listed as the left team and sometimes as the right team.

For simplicity, I set Italy as the home team whenever it appeared on the left and as the away team whenever it appeared on the right.

It will be better to reorganize the data so that all of Italy's matches sit together, re-stitching the columns as follows:

> hTeam --> team <BR>
> hGoals --> goals <BR>
> hPenalties --> penalties <BR>
> hResult --> result <BR>
> aTeam --> opponent <BR>
> aGoals --> gAllowed <BR>
> aPenalties --> pAllowed <BR>

This will be difficult and requires duplicating row data <br>
(i.e., with Italy as the home team and Denmark as the away team, the same match is recorded twice, once with each nation as "team" and the other as "opponent", so we can focus on a nation's individual performance)

Instead of manipulating our current dataframe, it may be best to build a new dataframe
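
As a rough illustration of the reshaping idea, the sketch below builds the "team / opponent" view from a tiny made-up frame by renaming the home and away columns and stacking the two views with `pd.concat`. The `sample` rows are invented for the example, and this is an alternative to the row-by-row loop used in this notebook, not a description of it. Note that it stacks all home rows before all away rows, whereas the loop further down interleaves the two rows of each match.

```python
import pandas as pd

# Invented sample using the same column names as the scraped CSV
sample = pd.DataFrame({
    "hTeam": ["Italy", "Spain"],
    "aTeam": ["Denmark", "Italy"],
    "hGoals": [2, 1],
    "aGoals": [0, 1],
    "hResult": ["W", "D"],
    "aResult": ["L", "D"],
})

# View each match from the home side...
home_view = sample.rename(columns={
    "hTeam": "team", "aTeam": "opponent",
    "hGoals": "goals", "aGoals": "gAllowed", "hResult": "result",
}).drop(columns="aResult")

# ...and again from the away side
away_view = sample.rename(columns={
    "aTeam": "team", "hTeam": "opponent",
    "aGoals": "goals", "hGoals": "gAllowed", "aResult": "result",
}).drop(columns="hResult")

# Every match now appears twice, once per team
long_form = pd.concat([home_view, away_view], ignore_index=True)
italy = long_form[long_form["team"] == "Italy"]
print(italy)
```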

In [6]:
# team | opponent | result | goals | gAllowed | penalties | pAllowed | mDate | SoC | mType | comp
mInfo = {"team": ["filler"], "opponent": ["filler"],
        "result": ['f'], "goals": [1.1], "gAllowed": [1.1], 
        "penalties": [1.1], "pAllowed": [1.1], "mDate": [pd.NA], 
        "SoC": ["filler"], 'mType': ["filler"], 'comp': ["filler"],}
cMatches = pd.DataFrame(data=mInfo)
cMatches[["mDate"]] = cMatches[["mDate"]].astype("datetime64[ns]")
cMatches = cMatches.drop(index=0)
cMatches

| | team | opponent | result | goals | gAllowed | penalties | pAllowed | mDate | SoC | mType | comp |
|---|---|---|---|---|---|---|---|---|---|---|---|


In [7]:
mType_corrections = { "Group Stage" : "Group Stage",             
                    "Quarter-finals": "Quarter-finals",       
                    "Round of 16": "Round of 16",                
                    "Semi-finals": "Semi-finals",                
                    "Qualifying Round": "Qualifying Round",
                    "Third place play-off": "Third place play-off",      
                    "Finals": "Final",                     
                    "Quater-finals": "Quarter-finals",   
                    "Final":"Final",     
                    "Head to head round": "Head-To-Head",                             
                    "Head to Head": "Head-To-Head"}

SoC_corrections = { "Group Stage" : "Group Stage",             
                    "Quarter-finals": "Knockout Round",       
                    "Round of 16": "Knockout Round",                
                    "Semi-finals": "Knockout Round",                
                    "Qualifying Round": "Knockout Round",
                    "Third place play-off": "Knockout Round",      
                    "Finals": "Knockout Round",                     
                    "Quater-finals": "Knockout Round",   
                    "Final":"Knockout Round",     
                    "Head to head round": "Head-To-Head",                             
                    "Head to Head": "Head-To-Head"}          
           
for index, row in matches.iterrows():
    mType = mType_corrections[row["mType"]] 
    SoC = SoC_corrections[row["mType"]] 
    cMatches.loc[-1] = [row["hTeam"], row["aTeam"], row["hResult"], row["hGoals"], row["aGoals"], row["hPenalties"], row["aPenalties"], row["mDate"], SoC, mType, row["comp"], ]
    cMatches.index += 1
    cMatches.loc[-1] = [row["aTeam"], row["hTeam"], row["aResult"], row["aGoals"], row["hGoals"], row["aPenalties"], row["hPenalties"], row["mDate"], SoC, mType, row["comp"]]
    cMatches.index += 1
cMatches.index = list(range(len(cMatches["team"])))

In [8]:
cMatches.head()

| | team | opponent | result | goals | gAllowed | penalties | pAllowed | mDate | SoC | mType | comp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Georgia | Greece | W | 0 | 0 | 4.0 | 2.0 | 2024-03-26 | Knockout Round | Final | Euros 2024, Qualifiers |
| 1 | Greece | Georgia | L | 0 | 0 | 2.0 | 4.0 | 2024-03-26 | Knockout Round | Final | Euros 2024, Qualifiers |
| 2 | Wales | Poland | L | 0 | 0 | 4.0 | 5.0 | 2024-03-26 | Knockout Round | Final | Euros 2024, Qualifiers |
| 3 | Poland | Wales | W | 0 | 0 | 5.0 | 4.0 | 2024-03-26 | Knockout Round | Final | Euros 2024, Qualifiers |
| 4 | Ukraine | Iceland | W | 2 | 1 | | | 2024-03-26 | Knockout Round | Final | Euros 2024, Qualifiers |


In [9]:
cMatches.tail()

| | team | opponent | result | goals | gAllowed | penalties | pAllowed | mDate | SoC | mType | comp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6089 | Denmark | ČSSR | L | 1 | 5 | | | 1959-10-18 | Knockout Round | Round of 16 | Euros 1960 |
| 6090 | Ireland | ČSSR | W | 2 | 0 | | | 1959-04-05 | Knockout Round | Qualifying Round | Euros 1960 |
| 6091 | ČSSR | Ireland | L | 0 | 2 | | | 1959-04-05 | Knockout Round | Qualifying Round | Euros 1960 |
| 6092 | ČSSR | Ireland | W | 4 | 0 | | | 1959-05-10 | Knockout Round | Qualifying Round | Euros 1960 |
| 6093 | Ireland | ČSSR | L | 0 | 4 | | | 1959-05-10 | Knockout Round | Qualifying Round | Euros 1960 |


In [10]:
cMatches["mType"].value_counts()

mType
Group Stage             5590
Quarter-finals           136
Round of 16               96
Semi-finals               88
Head-To-Head              70
Qualifying Round          54
Final                     48
Third place play-off      12
Name: count, dtype: int64

In [11]:
cMatches["SoC"].value_counts()

SoC
Group Stage       5590
Knockout Round     434
Head-To-Head        70
Name: count, dtype: int64

## Setting category predictors

Below, we add extra columns to use as predictors for a match. A team may perform differently depending on the stage of the competition and on its opponent.

The target column encodes the result a team achieved as points:

> a "W" equates to 3 <br>
> a "D" equates to 1 <br>
> an "L" equates to 0

In [13]:
cMatches["team"] = cMatches["team"].astype("category")
cMatches["opponent"] = cMatches["opponent"].astype("category")
cMatches["SoC"] = cMatches["SoC"].astype("category")
cMatches["mType"] = cMatches["mType"].astype("category")

dT_keys = cMatches["team"].to_list() 
dO_keys = cMatches["opponent"].to_list()
dS_keys = cMatches["SoC"].to_list()
dM_keys = cMatches["mType"].to_list()

dT_values = cMatches["team"].cat.codes
dO_values = cMatches["opponent"].cat.codes
dS_values = cMatches["SoC"].cat.codes
dM_values = cMatches["mType"].cat.codes

d_Teams = dict(zip(dT_keys, dT_values)) # same mapping as d_Opp whenever both columns contain the same set of teams
d_Opp = dict(zip(dO_keys, dO_values)) 
d_SoC = dict(zip(dS_keys, dS_values))
d_Matches = dict(zip(dM_keys, dM_values))

cMatches["tCode"] = cMatches["team"].cat.codes
cMatches["oppCode"] = cMatches["opponent"].cat.codes
cMatches["socCode"] = cMatches["SoC"].cat.codes
cMatches["mCode"] = cMatches["mType"].cat.codes
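
One caveat with separate categorical columns is that `team` and `opponent` only receive matching codes when both columns contain exactly the same set of names. A minimal sketch of one way to guarantee that, assuming a shared `pd.CategoricalDtype` built from the union of both columns (the `games` frame and the `shared` / `team_to_code` names are invented for the example):

```python
import pandas as pd

# Invented sample: Wales never appears as an opponent
games = pd.DataFrame({
    "team": ["Italy", "Spain", "Wales"],
    "opponent": ["Spain", "Italy", "Italy"],
})

# Build one category list from both columns so the codes always agree
names = sorted(set(games["team"]) | set(games["opponent"]))
shared = pd.CategoricalDtype(categories=names)

games["tCode"] = games["team"].astype(shared).cat.codes
games["oppCode"] = games["opponent"].astype(shared).cat.codes

# A single name -> code lookup that is valid for either column
team_to_code = {name: code for code, name in enumerate(names)}
print(games)
print(team_to_code)
```

With a shared dtype, one lookup dictionary can stand in for both `d_Teams` and `d_Opp`.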

In [23]:
cMatches.loc[(cMatches["team"] == "Georgia") | (cMatches["opponent"] == "Georgia")].head()
    # dataframe.loc with multiple conditions needs 
    # () around each condition and 
    # uses & for a logical and
    # uses | for a logical or 

| | team | opponent | result | goals | gAllowed | penalties | pAllowed | mDate | SoC | mType | comp | tCode | oppCode | socCode | mCode | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Georgia | Greece | W | 0 | 0 | 4.0 | 2.0 | 2024-03-26 | Knockout Round | Final | Euros 2024, Qualifiers | 19 | 22 | 2 | 0 | 3 |
| 1 | Greece | Georgia | L | 0 | 0 | 2.0 | 4.0 | 2024-03-26 | Knockout Round | Final | Euros 2024, Qualifiers | 22 | 19 | 2 | 0 | 0 |
| 6 | Georgia | Luxembourg | W | 2 | 0 | | | 2024-03-21 | Knockout Round | Semi-finals | Euros 2024, Qualifiers | 19 | 33 | 2 | 6 | 3 |
| 7 | Luxembourg | Georgia | L | 0 | 2 | | | 2024-03-21 | Knockout Round | Semi-finals | Euros 2024, Qualifiers | 33 | 19 | 2 | 6 | 0 |
| 22 | Georgia | Norway | D | 1 | 1 | | | 2023-03-28 | Group Stage | Group Stage | Euros 2024, Qualifiers | 19 | 40 | 0 | 1 | 1 |


In [24]:
points = {"W": 3, "D": 1, "L": 0} 
pointsR = {3: "W", 1: "D", 0: "L"}

cMatches["target"] = [ points[row["result"]] for index, row in cMatches.iterrows()]
cMatches["result"].value_counts()

result
W    2460
L    2460
D    1174
Name: count, dtype: int64

## Model 
With a RandomForestClassifier, we can model our category codes without assuming a linear relationship between a code and the result.

We then split the data into a "train" and a "test" dataframe 

> A train dataframe to . . . train our rf model on our selected predictors (team, opponent, stage of competition, and match type) against the known result (target) <br>
> A test dataframe to see how our predictions compare with the actual outcomes 

In [158]:
from sklearn.ensemble import RandomForestClassifier

# tree-based models make no linear assumption about the codes 
# i.e. oppCode 22 vs oppCode 23 implies no meaningful numerical difference
# just a different category

rf = RandomForestClassifier(n_estimators=100, min_samples_split=10, random_state=42)

train_set = cMatches[cMatches["mDate"] < "2000-01-01"] 
test_set = cMatches[cMatches["mDate"] > "2000-01-01"]

predictors = ["tCode", "oppCode", "socCode", "mCode"]

rf.fit(train_set[predictors], train_set["target"])
# fits the forest on the known predictor combinations, with the actual result ("target") as the label

preds = rf.predict(test_set[predictors])
# then, for the test set, we produce a prediction of the result for each row's predictors

## Comparisons

With our trained rf, we create predictions for the test data set from the same predictors

Using accuracy_score, we can see how often the trained model guessed correctly

In [159]:
from sklearn.metrics import accuracy_score

# measures our predictions 
acc = accuracy_score(test_set["target"], preds)
acc

# compares the likeness of each prediction to the actual result (target) and reports our accuracy

0.5600797266514806

In [160]:
combined = pd.DataFrame(dict(actual=test_set["target"], prediction=preds))
pd.crosstab(index=combined["actual"], columns=combined["prediction"])

| actual \ prediction | 0 | 1 | 3 |
|---|---|---|---|
| 0 | 956 | 133 | 358 |
| 1 | 274 | 72 | 272 |
| 3 | 352 | 156 | 939 |


In [161]:
from sklearn.metrics import precision_score

"""
Precision score is defined as 

    ps = tp / (tp + fp) 

    tp = true positives 
    fp = false positives 

With average=None it is reported per class (0 = "L", 1 = "D", 3 = "W"):
of all the times we predicted a given result, how often was that call correct?
"""

precision_score(test_set["target"], preds, average=None)

array([0.60429836, 0.19944598, 0.59847036])
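
The accuracy and per-class precision above can be checked by hand from the crosstab. The sketch below copies the confusion counts from that table into a plain dictionary (the `confusion` and `baseline` names are introduced here for illustration) and also works out a naive baseline, the share of the most common actual result, to put the accuracy in context.

```python
# Counts copied from the crosstab above: outer key = actual, inner key = predicted
labels = [0, 1, 3]  # 0 = "L", 1 = "D", 3 = "W"
confusion = {
    0: {0: 956, 1: 133, 3: 358},
    1: {0: 274, 1: 72, 3: 272},
    3: {0: 352, 1: 156, 3: 939},
}

total = sum(sum(row.values()) for row in confusion.values())

# Accuracy: correct calls (the diagonal) over all predictions
correct = sum(confusion[label][label] for label in labels)
accuracy = correct / total

# Precision per class: tp / (tp + fp), i.e. diagonal over the column total
precision = {}
for label in labels:
    predicted_as_label = sum(confusion[actual][label] for actual in labels)
    precision[label] = confusion[label][label] / predicted_as_label

# Baseline: always guessing the most frequent actual result
baseline = max(sum(row.values()) for row in confusion.values()) / total

print(accuracy, precision, baseline)
```

The low precision for draws stands out: the column total for class 1 is small and most of those calls were wrong.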

## Predictions 

With our model, we will now predict the results of the Group Stage fixtures and move on from there until a winner is decided

Here's what is happening below 

1. Web scraping all of the Group Stage tables for Euros 2024
2. Isolating just the fixtures
3. Cleaning up each fixture table into our new table format
4. Recording each match twice per table (with team and opponent swapped)

After this, en_tables should be a list of our cleaned-up tables with doubled match data


In [129]:
import requests
import io 
from bs4 import BeautifulSoup

e_2024 = "https://terrikon.com/en/euro-2024"
e_2024_page = requests.get(e_2024)
e_html = e_2024_page.text

e_tables = pd.read_html(io.StringIO(e_html))
group_tables = [table for table in e_tables if table.shape[0] != 6]
group_tables = [table.drop(["Unnamed: 0", "G", "S", "M"], axis=1) for table in group_tables]
group_tables = [table.rename(columns={"Unnamed: 1": "Team"}) for table in group_tables]
lineup_tables = [table for table in e_tables if table.shape[0] == 6]
en_tables = [] 

mInfo = {"team": ["filler"], "opponent": ["filler"],  "mDate": [pd.NA], 
        'mType': ["filler"], 'comp': ["filler"], "tCode": ["filler"], "oppCode": ["filler"], "socCode": ["filler"], "mCode": ["filler"]}
for table in lineup_tables:
    f_table = pd.DataFrame(data=mInfo)
    f_table = f_table.drop(index=0)
    table[5] = table[5].apply(lambda x: x.split(" ")[0])
    table[5] = pd.to_datetime(table[5], format="%d.%m.%y")
    for index, row in table.iterrows():
        f_table.loc[-1] = [row[1], row[3], row[5], "Group Stage", "Euros 2024", d_Teams[row[1]], d_Opp[row[3]], d_SoC["Group Stage"], d_Matches["Group Stage"]]
        f_table.index += 1
        f_table.loc[-1] = [row[3], row[1], row[5], "Group Stage", "Euros 2024", d_Teams[row[3]], d_Opp[row[1]], d_SoC["Group Stage"], d_Matches["Group Stage"]]
        f_table.index += 1
    f_table.index = range(f_table.shape[0])
    en_tables.append(f_table)


## Predicting Group Stage 

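Per table, we now produce a prediction column to see the predicted result of each match.

Because each match appears twice (once per team), the two predictions can contradict each other. When both rows predict the same outcome (for example, both teams to win), we chalk the match up as a draw; otherwise the team with the better predicted result takes the win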



In [131]:
group_tables[0]

| | Team | W | D | L | - | Pts |
|---|---|---|---|---|---|---|
| 0 | Switzerland | 0 | 0 | 0 | - | 0 |
| 1 | Hungary | 0 | 0 | 0 | - | 0 |
| 2 | Scotland | 0 | 0 | 0 | - | 0 |
| 3 | Germany | 0 | 0 | 0 | - | 0 |


In [134]:
en_tables[0]

| | team | opponent | mDate | mType | comp | tCode | oppCode | socCode | mCode |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Germany | Scotland | 2024-06-14 | Group Stage | Euros 2024 | 20 | 46 | 0 | 1 |
| 1 | Scotland | Germany | 2024-06-14 | Group Stage | Euros 2024 | 46 | 20 | 0 | 1 |
| 2 | Hungary | Switzerland | 2024-06-15 | Group Stage | Euros 2024 | 23 | 52 | 0 | 1 |
| 3 | Switzerland | Hungary | 2024-06-15 | Group Stage | Euros 2024 | 52 | 23 | 0 | 1 |
| 4 | Germany | Hungary | 2024-06-19 | Group Stage | Euros 2024 | 20 | 23 | 0 | 1 |
| 5 | Hungary | Germany | 2024-06-19 | Group Stage | Euros 2024 | 23 | 20 | 0 | 1 |
| 6 | Scotland | Switzerland | 2024-06-19 | Group Stage | Euros 2024 | 46 | 52 | 0 | 1 |
| 7 | Switzerland | Scotland | 2024-06-19 | Group Stage | Euros 2024 | 52 | 46 | 0 | 1 |
| 8 | Switzerland | Germany | 2024-06-23 | Group Stage | Euros 2024 | 52 | 20 | 0 | 1 |
| 9 | Germany | Switzerland | 2024-06-23 | Group Stage | Euros 2024 | 20 | 52 | 0 | 1 |


### Making a prediction 

In [136]:
def run_prediction(uMatches, kMatches, metrics): 
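    # filter on kMatches itself rather than the global cMatches
    # note: comp_set is a subset of training_set, so ps and ac below are in-sample scores
    comp_set = kMatches[kMatches["mDate"] > "2000-01-01"] 
    training_set = kMatches # add conditional here if desired to select the training set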
    forest = RandomForestClassifier(n_estimators=100, min_samples_split=10, random_state=42) 

    forest.fit(training_set[metrics], training_set["target"])

    cPreds = forest.predict(comp_set[metrics])
    uPreds = forest.predict(uMatches[metrics])

    ps = precision_score(comp_set["target"], cPreds, average=None)
    ac = accuracy_score(comp_set["target"], cPreds)
    
    return uPreds, ps, ac  

def decide_result(resultA, resultB): 
    match resultA:
        case "W":
            match resultB:
                case "W": return "D", "D"
                case "D": return "W", "L"
                case "L": return "W", "L"
        case "D":
            match resultB:
                case "W": return "L", "W"
                case "D": return "D", "D"
                case "L": return "W", "L"
        case "L":
            match resultB:
                case "W": return "L", "W"
                case "D": return "L", "W"
                case "L": return "D", "D"
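
The nine cases in `decide_result` follow a single pattern: whichever side has the better predicted result wins, and equal predictions become a draw. A compact sketch of the same rule, assuming a hypothetical `reconcile` helper and `SCORE` mapping that are not part of this notebook:

```python
# Rank each predicted result so the two sides can be compared
SCORE = {"W": 1, "D": 0, "L": -1}


def reconcile(result_a, result_b):
    """Return consistent (team_a, team_b) results from two independent predictions."""
    diff = SCORE[result_a] - SCORE[result_b]
    if diff > 0:
        return "W", "L"   # team A had the better prediction
    if diff < 0:
        return "L", "W"   # team B had the better prediction
    return "D", "D"       # same prediction on both sides: call it a draw


# Print all nine combinations for comparison with decide_result
for a in "WDL":
    for b in "WDL":
        print(a, b, reconcile(a, b))
```

The explicit `match` version above is easier to audit case by case; this form is shorter and cannot fall through without returning a value.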

In [137]:
for g, group in enumerate(en_tables): 
    preds, confidence, acc = run_prediction(group, cMatches, ["tCode", "oppCode", "socCode", "mCode"])
    group["prediction"] = [ pointsR[pred] for pred in preds]
    print(preds, confidence, acc)
    corrected_results = []
    for index in range(group.shape[0]):
        if index % 2 == 0: 
            team_a = group.loc[index]["team"]
            team_b = group.loc[index]["opponent"]
            result_a = group.loc[index]["prediction"] 
            result_b = group.loc[index+1]["prediction"]
            result, result_opp = decide_result(result_a, result_b)

            corrected_results.append(result)
            corrected_results.append(result_opp)

            i_a= group_tables[g].loc[group_tables[g]["Team"] == team_a].index
            i_b= group_tables[g].loc[group_tables[g]["Team"] == team_b].index
            
            group_tables[g].loc[i_a, result] += 1
            group_tables[g].loc[i_a, "Pts"] += points[result]

            group_tables[g].loc[i_b, result_opp] += 1
            group_tables[g].loc[i_b, "Pts"] += points[result_opp]
    group["correctedPred"] = corrected_results
    group_tables[g] = group_tables[g].sort_values(by="Pts", ascending=False)

[3 0 0 3 1 1 0 3 3 3 3 3] [0.74707692 0.62162162 0.7463145 ] 0.7374715261958997
[3 3 3 0 1 1 1 1 0 3 1 1] [0.74707692 0.62162162 0.7463145 ] 0.7374715261958997
[0 3 0 3 3 0 0 3 3 0 3 0] [0.74707692 0.62162162 0.7463145 ] 0.7374715261958997
[3 0 0 3 3 0 3 0 3 0 1 1] [0.74707692 0.62162162 0.7463145 ] 0.7374715261958997
[1 1 3 0 3 3 0 3 0 3 3 0] [0.74707692 0.62162162 0.7463145 ] 0.7374715261958997
[3 0 3 0 3 0 0 3 0 3 0 3] [0.74707692 0.62162162 0.7463145 ] 0.7374715261958997


In [149]:
group_tables[0]

| | Team | W | D | L | - | Pts |
|---|---|---|---|---|---|---|
| 0 | Switzerland | 2 | 1 | 0 | - | 7 |
| 3 | Germany | 1 | 2 | 0 | - | 5 |
| 1 | Hungary | 0 | 2 | 1 | - | 2 |
| 2 | Scotland | 0 | 1 | 2 | - | 1 |


In [150]:
en_tables[0]

| | team | opponent | mDate | mType | comp | tCode | oppCode | socCode | mCode | prediction | correctedPred |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Germany | Scotland | 2024-06-14 | Group Stage | Euros 2024 | 20 | 46 | 0 | 1 | W | W |
| 1 | Scotland | Germany | 2024-06-14 | Group Stage | Euros 2024 | 46 | 20 | 0 | 1 | L | L |
| 2 | Hungary | Switzerland | 2024-06-15 | Group Stage | Euros 2024 | 23 | 52 | 0 | 1 | L | L |
| 3 | Switzerland | Hungary | 2024-06-15 | Group Stage | Euros 2024 | 52 | 23 | 0 | 1 | W | W |
| 4 | Germany | Hungary | 2024-06-19 | Group Stage | Euros 2024 | 20 | 23 | 0 | 1 | D | D |
| 5 | Hungary | Germany | 2024-06-19 | Group Stage | Euros 2024 | 23 | 20 | 0 | 1 | D | D |
| 6 | Scotland | Switzerland | 2024-06-19 | Group Stage | Euros 2024 | 46 | 52 | 0 | 1 | L | L |
| 7 | Switzerland | Scotland | 2024-06-19 | Group Stage | Euros 2024 | 52 | 46 | 0 | 1 | W | W |
| 8 | Switzerland | Germany | 2024-06-23 | Group Stage | Euros 2024 | 52 | 20 | 0 | 1 | W | D |
| 9 | Germany | Switzerland | 2024-06-23 | Group Stage | Euros 2024 | 20 | 52 | 0 | 1 | W | D |


In [103]:
lineup_tables[0]

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | | Germany | -:- | Scotland | | 2024-03-26 |
| 1 | | Hungary | -:- | Switzerland | | 2024-03-26 |
| 2 | | Germany | -:- | Hungary | | 2024-03-26 |
| 3 | | Scotland | -:- | Switzerland | | 2024-03-21 |
| 4 | | Switzerland | -:- | Germany | | 2024-03-21 |
| 5 | | Scotland | -:- | Hungary | | 2024-03-21 |
