## Modeling with a Dataframe

After webscraping all of the Euros data and formatting into a dataframe then csv in my [euros_scraping.ipynb](http://localhost:8888/files/Documents/Coding/Notebooks/repos/euros-WebScraping/euros_scraping.ipynb?_xsrf=2%7C8c3fecd8%7C02d37b9b4c7c7af0dd7bca94af94aa7f%7C1716421226) notebook, I will now attempt to perform Modeling and predictions

THe hope is I scraped the relevant data, having all results recorded from Euro's and Euro's qualifiers into the csv.

In this notebook, I will use pandas as the main method of modeling this data.

In [70]:
import pandas as pd

In [71]:
matches = pd.read_csv("euros.csv", index_col=0)
matches["mDate"] = pd.to_datetime(matches["mDate"], format="%d.%m.%y")
matches.loc[matches["mDate"].dt.year > 2024, "mDate"] -= pd.DateOffset(years=100) 

# necesssary to offset pre-2000 mDates being interpreted as 20%yy (10.05.59 as 1959)

In [72]:
matches["hTeam"].value_counts().head()

hTeam
Italy       84
Germany     84
Spain       83
Denmark     81
Portugal    80
Name: count, dtype: int64

In [73]:
matches["aTeam"].value_counts().head()

aTeam
Spain          88
Italy          79
Netherlands    77
Denmark        76
Portugal       76
Name: count, dtype: int64

### Next Steps 

Currently, a single National team's data is split between home and away sections. For example, when we scraped the page data, sometimes Italy was listed as the left team and sometimes it was listed as the right team.

For brevity, I selected to set Italy as the home team whenever it appeared on the left and designated it as the away team when it appeared on the right.

It will be better to reorganize the data, coalescing all of Italy's matches together, re-stiching the data from

> hTeam --> team <BR>
> hGoals --> gScored <BR>
> hPenalties --> pScored <BR>
> hResult --> result <BR>
> aTeam --> opponent <BR>
> aGoals --> gAllowed <BR>
> aPenalties --> pAllowed <BR>

This will be difficult and require duplicate row data <br>
(i.e Italy as home team, Denmark as away, recording same match data for both in their own "team" as opposite's "opponent" so we can focus on a Nation's individual performance)

Instead of manipulating our current dataframe, it may be best to update into a new dataframe

In [74]:
# comp | mType | team | opponent | result | goals | gAllowed | penalties | pAllowed | mDate
mInfo = {"team": ["filler"], "opponent": ["fller"],
        "result": ['f'], "goals": [1.1], "gAllowed": [1.1], 
        "penalties": [1.1], "pAllowed": [1.1], "mDate": [pd.NA], 
        'mType': ["filler"], 'comp': ["filler"],}
cMatches = pd.DataFrame(data=mInfo)
cMatches[["mDate"]] = cMatches[["mDate"]].astype("datetime64[ns]")
cMatches = cMatches.drop(index=0)
cMatches

Unnamed: 0,team,opponent,result,goals,gAllowed,penalties,pAllowed,mDate,mType,comp


In [75]:
for index, row in matches.iterrows():
    cMatches.loc[-1] = [row["hTeam"], row["aTeam"], row["hResult"], row["hGoals"], row["aGoals"], row["hPenalties"], row["aPenalties"], row["mDate"], row["mType"], row["comp"], ]
    cMatches.index += 1
    cMatches.loc[-1] = [row["aTeam"], row["hTeam"], row["aResult"], row["aGoals"], row["hGoals"], row["aPenalties"], row["hPenalties"], row["mDate"], row["mType"], row["comp"]]
    cMatches.index += 1
cMatches.index = list(range(len(cMatches["team"])))

In [76]:
cMatches.head()

Unnamed: 0,team,opponent,result,goals,gAllowed,penalties,pAllowed,mDate,mType,comp
0,Georgia,Greece,W,0,0,4.0,2.0,2024-03-26,Final,"Euros 2024, Qualifiers"
1,Greece,Georgia,L,0,0,2.0,4.0,2024-03-26,Final,"Euros 2024, Qualifiers"
2,Wales,Poland,L,0,0,4.0,5.0,2024-03-26,Final,"Euros 2024, Qualifiers"
3,Poland,Wales,W,0,0,5.0,4.0,2024-03-26,Final,"Euros 2024, Qualifiers"
4,Ukraine,Iceland,W,2,1,,,2024-03-26,Final,"Euros 2024, Qualifiers"


In [77]:
cMatches.tail()

Unnamed: 0,team,opponent,result,goals,gAllowed,penalties,pAllowed,mDate,mType,comp
6089,Denmark,ČSSR,L,1,5,,,1959-10-18,Round of 16,Euros 1960
6090,Ireland,ČSSR,W,2,0,,,1959-04-05,Qualifying Round,Euros 1960
6091,ČSSR,Ireland,L,0,2,,,1959-04-05,Qualifying Round,Euros 1960
6092,ČSSR,Ireland,W,4,0,,,1959-05-10,Qualifying Round,Euros 1960
6093,Ireland,ČSSR,L,0,4,,,1959-05-10,Qualifying Round,Euros 1960


In [78]:
cMatches["team"].value_counts().head()

team
Spain          171
Italy          163
Denmark        157
Portugal       156
Netherlands    156
Name: count, dtype: int64

## Setting category predictors

Below, we are adding additional columns as metrics for predicting a match. A team may perform differently based on the match type of the competition and their competitor 

The target column is if a team achieved its desired result, 

> a "W" was desired, equating to a 1 <br>
> a "L" or "D" was not desired, equating to a 0

In [79]:
cMatches["oppCode"] = cMatches["opponent"].astype("category").cat.codes
cMatches["mCode"] = cMatches["mType"].astype("category").cat.codes

In [80]:
cMatches["target"] = (cMatches["result"] == "W").astype("int")

## Predictions 
With a RandomForestClassifer, we can run simulations with our category codes non-linearity,

We then split from a "trained" and "test" dataframe 

> A Train dataframe to . . . train our rf model on our selected predictors (opponent and match type) to see what the expected result (target) would be in our test model <br>
> A Test data frame to see how our predictions actually compare with that of the outcome 

In [120]:
from sklearn.ensemble import RandomForestClassifier

# associates non-linearity 
# i.e oppCode 22 doesn't imply numerical difference between oppCode 23
# just categorical differences

rf = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)

train_set = cMatches[cMatches["mDate"] < "2022-01-01"] 
test_set = cMatches[cMatches["mDate"] > "2022-01-01"]

predictors = ["oppCode", "mCode"]

rf.fit(train_set[predictors], train_set["target"])
# runs our forest model on known oppCode mCode combos, then we provide the result that came out of it "target"

preds = rf.predict(test_set[predictors])
# then with our test set, we produce a precition column on what result would be given oppCode and mCode

## Comparisons

With our trained rf, we then create predictions based on the same metrics for the test data set

Using the accuracy_score, we're able to see how often our trained model guessed correctlty

In [121]:
from sklearn.metrics import accuracy_score

# measures our predictions 
acc = accuracy_score(test_set["target"], preds)
acc

# compares the likeness of each prediction to the actual result (target) and reports our accuracy

0.698744769874477

In [122]:
combined = pd.DataFrame(dict(actual=test_set["target"], prediction=preds))
pd.crosstab(index=combined["actual"], columns=combined["prediction"])

prediction,0,1
actual,Unnamed: 1_level_1,Unnamed: 2_level_1
0,213,69
1,75,121


In [123]:
from sklearn.metrics import precision_score

"""
Precision Score then defined as 

    ps = tp / (tp + fp) 

    tp = true positive 
    fp = false positive 

Essentially reports the score "how often did we correctly call a winning result
"""

precision_score(test_set["target"], preds)

0.6368421052631579

## Predictions 

With our model, we will now predict the results of Group Stage Fixtures with our training model and move on from there until a winner is decided

Here's what happening below 

1. Web-scraping for all the Group Stages in the Euros 2024
2. Isolating to just the fiixtures
3. Cleaning up given fixture table into our new tables
4. Repeating the data twice per table (with swapped team and opponent)

Now en_tables should be an array of our tables cleaned up with doubled match data


In [212]:
import requests
import io 
from bs4 import BeautifulSoup

e_2024 = "https://terrikon.com/en/euro-2024"
e_2024_page = requests.get(e_2024)
e_html = e_2024_page.text

e_tables = pd.read_html(io.StringIO(e_html))
group_tables = [table for table in e_tables if table.shape[0] != 6]
group_tables = [table.drop(["Unnamed: 0", "G", "S", "M"], axis=1) for table in group_tables]
group_tables = [table.rename(columns={"Unnamed: 1": "Team"}) for table in group_tables]
lineup_tables = [table for table in e_tables if table.shape[0] == 6]
en_tables = [] 

mInfo = {"team": ["filler"], "opponent": ["fller"],  "mDate": [pd.NA], 
        'mType': ["filler"], 'comp': ["filler"]}
for table in lineup_tables:
    f_table = pd.DataFrame(data=mInfo)
    f_table = f_table.drop(index=0)
    table[5] = pd.to_datetime(matches["mDate"], format="%d.%m.%y %h:%m")
    for index, row in table.iterrows():
        f_table.loc[-1] = [row[1], row[3], row[5], "Group Stage", "Euros 2024"]
        f_table.index += 1
        f_table.loc[-1] = [row[3], row[1], row[5], "Group Stage", "Euros 2024"]
        f_table.index += 1
    en_tables.append(f_table)


## Predicting Group Stage 

Per table, we are now going to produce a prediction column to see who the result of the match.

Should the same-matchup produce different results, we will chalk it up as a tie



In [225]:
group_tables[0]

Unnamed: 0,Team,W,D,L,-,Pts
0,Switzerland,0,0,0,-,0
1,Hungary,0,0,0,-,0
2,Scotland,0,0,0,-,0
3,Germany,0,0,0,-,0


In [226]:
en_tables[0]

Unnamed: 0,team,opponent,mDate,mType,comp
11,Germany,Scotland,2024-03-26,Group Stage,Euros 2024
10,Scotland,Germany,2024-03-26,Group Stage,Euros 2024
9,Hungary,Switzerland,2024-03-26,Group Stage,Euros 2024
8,Switzerland,Hungary,2024-03-26,Group Stage,Euros 2024
7,Germany,Hungary,2024-03-26,Group Stage,Euros 2024
6,Hungary,Germany,2024-03-26,Group Stage,Euros 2024
5,Scotland,Switzerland,2024-03-21,Group Stage,Euros 2024
4,Switzerland,Scotland,2024-03-21,Group Stage,Euros 2024
3,Switzerland,Germany,2024-03-21,Group Stage,Euros 2024
2,Germany,Switzerland,2024-03-21,Group Stage,Euros 2024
