# CPSC322 Final Project

# Premier League Match Outcome Prediction

Author: Arjuna Herbst | Date: 11/11/2024 | Gonzaga University

---

## Dataset

### Source and Format
The dataset was scraped from [FBref](https://fbref.com/) using the Python libraries BeautifulSoup and requests, then saved in CSV format with pandas. Each row represents one team's record of a single Premier League match, so every match appears twice (once per team), and contains match details, team statistics, and outcome information. All code used to gather the dataset can be found below.

### Contents
The dataset includes 2281 instances with the following attributes:

- **Date**: Date of the match
- **Time**: Start time of the match
- **Comp**: Competition type (e.g., Premier League)
- **Round**: Round of the competition (e.g., Matchweek 1)
- **Day**: Day of the week the match was played
- **Venue**: Venue type, indicating if the match was played at home or away
- **Result**: Outcome of the match for the team: Win (`W`), Loss (`L`), or Draw (`D`)
- **GF**: Goals scored by the team
- **GA**: Goals allowed by the team
- **Opponent**: Opposing team
- **Opp Formation**: Formation used by the opponent
- **Referee**: Referee who officiated the match
- **Sh**: Total shots taken by the team
- **SoT**: Shots on target
- **Dist**: Average shot distance from goal, in yards
- **FK**: Shots taken from free kicks
- **PK**: Penalty kicks scored
- **PKatt**: Penalties attempted

### Target Attribute (Class Information)
We aim to classify the **Result** of each match (Win, Loss, or Draw) based on the various match statistics and conditions provided. This will help us understand and predict the outcome based on factors like team performance, match venue, and opponent strength.
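
A majority-class baseline gives the classifiers a reference point: a useful model should beat the accuracy obtained by always predicting the most common result. The sketch below is a minimal illustration, assuming results are coded `W`, `D`, and `L` as in the scraped `Result` column. The sample labels and the function name `majority_baseline` are made up for the example and are not drawn from the dataset.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Hypothetical sample of Result codes (not real data)
results = ["W", "W", "D", "L", "W", "L", "D", "W"]
baseline_label, baseline_accuracy = majority_baseline(results)
print(baseline_label, baseline_accuracy)
```

Because each match is recorded once per team, wins and losses occur equally often across the full dataset, so the baseline there depends mainly on how common draws are.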

---

## Implementation/Technical Merit

The project will involve:
1. **Data Cleaning**: Removing or handling missing values, converting data types where necessary, and extracting relevant features from the raw data.
2. **Feature Engineering**: Creating new features from existing ones to improve the predictive power, such as recent team performance or opponent strength metrics.
3. **Classification Algorithms**: Experimenting with multiple classification algorithms (e.g., Logistic Regression, Random Forest, k-Nearest Neighbors) to find the model that best predicts match outcomes.
4. **Evaluation Metrics**: Using metrics like accuracy, precision, recall, and F1-score to evaluate the models' performance.
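
The evaluation metrics in step 4 can be computed per class by treating each outcome as one-vs-rest. The following is a sketch in plain Python rather than the project's final evaluation code. The function names and the sample labels are hypothetical, and a class with no predictions or no true instances is assigned a score of 0.0 by assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, labels):
    """Compute precision, recall, and F1 for each label (one-vs-rest)."""
    metrics = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Hypothetical true labels and predictions (not real results)
y_true = ["W", "W", "D", "L", "L", "D"]
y_pred = ["W", "L", "D", "L", "W", "W"]

scores = per_class_metrics(y_true, y_pred, labels=["W", "D", "L"])
# Macro-F1 averages the per-class F1 scores with equal weight per class
macro_f1 = sum(m["f1"] for m in scores.values()) / len(scores)
print(accuracy(y_true, y_pred), scores, macro_f1)
```

Reporting the per-class scores next to accuracy shows whether a model is ignoring a less frequent outcome such as draws.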

---

## Anticipated Challenges

1. **Data Pre-processing**: Handling missing values, standardizing formats, and converting categorical features like `Comp` and `Opponent` into numerical representations.
2. **Class Imbalance**: Some outcomes (such as Wins) may occur more often than others (such as Draws), which could bias a model toward the majority class and hurt performance on the rarer outcomes.
3. **Feature Selection**: Some attributes may not contribute significantly to predicting the outcome. Techniques like correlation analysis or model-based feature selection will be explored to reduce dimensionality.
4. **Noise in Data**: Attributes like referee or day of the week may introduce noise rather than useful information, so careful consideration will be needed to include or exclude such features.
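
One common way to turn a categorical attribute into numbers, as the first challenge requires, is one-hot encoding: each observed category becomes its own 0/1 column. The sketch below is one possible approach, not necessarily the encoding the project will use. The helper name `one_hot_encode` is hypothetical, and the rows reuse `Venue` and `Opponent` values from the sample output in this notebook.

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 column per observed category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # keep every other attribute unchanged
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


rows = [
    {"Venue": "Away", "Opponent": "Burnley"},
    {"Venue": "Home", "Opponent": "Newcastle Utd"},
    {"Venue": "Away", "Opponent": "Sheffield Utd"},
]
encoded = one_hot_encode(rows, "Venue")
print(encoded[0])
```

Attributes with many distinct values, such as `Opponent` or `Referee`, produce one column per value under this scheme, which adds to the dimensionality concern raised in the feature selection item.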

---

## Potential Impact of the Results

### Usefulness of Results
Predicting match outcomes can offer insights into team performance and game strategy. Sports analysts, betting agencies, and even coaching staff may find value in understanding the factors that contribute to a match result. Additionally, fans may find it interesting to see how statistical factors can affect match outcomes.

### Stakeholders
1. **Sports Analysts and Statisticians**: Professionals interested in identifying patterns in match data and understanding predictors of success or failure in sports.
2. **Betting Agencies**: Companies that offer betting options on football matches could use these insights to set odds more accurately.
3. **Coaches and Team Staff**: Insights could help teams adjust strategies based on factors that influence winning chances.
4. **Fans**: Football enthusiasts interested in analytics and predictive insights may find this project engaging.


In [None]:
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
# website with the data
STANDINGS_URL = "https://fbref.com/en/comps/9/2023-2024/2023-2024-Premier-League-Stats"

# get the data with requests library, store in data var
data = requests.get(STANDINGS_URL)

In [3]:
# initialize BeautifulSoup object
soup = BeautifulSoup(data.text, "html.parser")

# find the table with league standings
standings_table = soup.select('table.stats_table')[0]

# extract all links from the table
links = standings_table.find_all('a')
links = [l.get('href') for l in links]

In [56]:
# find only squad links, no player links
links = [l for l in links if l and '/squads/' in l]
team_urls = [f"https://fbref.com{l}" for l in links]


In [68]:
team_url = team_urls[0]
data = requests.get(team_url)

In [69]:
# use pandas to read the "Scores & Fixtures" table from Man City page
matches = pd.read_html(data.text, match='Scores & Fixtures')
matches[0].head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Opp Formation,Referee,Match Report,Notes
0,2023-08-06,16:00,Community Shield,FA Community Shield,Sun,Neutral,D,1 (1),1 (4),Arsenal,,,55,81145.0,Kyle Walker,4-2-3-1,4-3-3,Stuart Attwell,Match Report,Arsenal won on penalty kicks following normal ...
1,2023-08-11,20:00,Premier League,Matchweek 1,Fri,Away,W,3,0,Burnley,1.9,0.3,65,21572.0,Kevin De Bruyne,4-2-3-1,5-4-1,Craig Pawson,Match Report,
2,2023-08-16,22:00,Super Cup,UEFA Super Cup,Wed,Home,D,1 (5),1 (4),es Sevilla,,,74,,Kyle Walker,4-2-3-1,4-2-3-1,François Letexier,Match Report,
3,2023-08-19,20:00,Premier League,Matchweek 2,Sat,Home,W,1,0,Newcastle Utd,1.0,0.3,59,53419.0,Kyle Walker,4-2-3-1,4-3-3,Robert Jones,Match Report,
4,2023-08-27,14:00,Premier League,Matchweek 3,Sun,Away,W,2,1,Sheffield Utd,3.5,0.7,79,31336.0,Kyle Walker,4-2-3-1,3-5-2,Jarred Gillett,Match Report,


In [59]:
# init another BeautifulSoup object
soup = BeautifulSoup(data.text, "html.parser")
links = soup.find_all('a')

# use bs to find all shooting stats links
links = [l.get('href') for l in links]
links = [l for l in links if l and 'all_comps/shooting/' in l]

In [62]:
data = requests.get(f"https://fbref.com{links[0]}")

# pandas data frame of shooting stats
shooting = pd.read_html(data.text, match='Shooting')[0]

shooting.columns = shooting.columns.droplevel()
shooting.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Dist,FK,PK,PKatt,xG,npxG,npxG/Sh,G-xG,np:G-xG,Match Report
0,2023-08-06,16:00,Community Shield,FA Community Shield,Sun,Neutral,D,1 (1),1 (4),Arsenal,...,,,0,0,,,,,,Match Report
1,2023-08-11,20:00,Premier League,Matchweek 1,Fri,Away,W,3,0,Burnley,...,13.9,0.0,0,0,1.9,1.9,0.12,1.1,1.1,Match Report
2,2023-08-16,22:00,Super Cup,UEFA Super Cup,Wed,Home,D,1 (5),1 (4),es Sevilla,...,,,0,0,,,,,,Match Report
3,2023-08-19,20:00,Premier League,Matchweek 2,Sat,Home,W,1,0,Newcastle Utd,...,17.9,0.0,0,0,1.0,1.0,0.07,0.0,0.0,Match Report
4,2023-08-27,14:00,Premier League,Matchweek 3,Sun,Away,W,2,1,Sheffield Utd,...,17.3,2.0,0,1,3.5,2.8,0.1,-1.5,-0.8,Match Report


In [None]:
# combine shooting data with match data into one data frame
team_data = matches[0].merge(shooting[["Date", "Sh", "SoT", "Dist", "FK", "PK", "PKatt"]], on='Date')
team_data.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Opp Formation,Referee,Match Report,Notes,Sh,SoT,Dist,FK,PK,PKatt
0,2023-08-06,16:00,Community Shield,FA Community Shield,Sun,Neutral,D,1 (1),1 (4),Arsenal,...,4-3-3,Stuart Attwell,Match Report,Arsenal won on penalty kicks following normal ...,8,4,,,0,0
1,2023-08-11,20:00,Premier League,Matchweek 1,Fri,Away,W,3,0,Burnley,...,5-4-1,Craig Pawson,Match Report,,17,8,13.9,0.0,0,0
2,2023-08-16,22:00,Super Cup,UEFA Super Cup,Wed,Home,D,1 (5),1 (4),es Sevilla,...,4-2-3-1,François Letexier,Match Report,,23,7,,,0,0
3,2023-08-19,20:00,Premier League,Matchweek 2,Sat,Home,W,1,0,Newcastle Utd,...,4-3-3,Robert Jones,Match Report,,14,4,17.9,0.0,0,0
4,2023-08-27,14:00,Premier League,Matchweek 3,Sun,Away,W,2,1,Sheffield Utd,...,3-5-2,Jarred Gillett,Match Report,,29,9,17.3,2.0,0,1


In [None]:
# seasons to scrape, newest first (2023 labels the 2023-2024 season)
years = list(range(2023, 2020, -1))

# list of per-team data frames, one per team and season
all_matches = []

STANDINGS_URL = "https://fbref.com/en/comps/9/2023-2024/2023-2024-Premier-League-Stats"

for year in years:
    data = requests.get(STANDINGS_URL)
    soup = BeautifulSoup(data.text, "html.parser")
    standings_table = soup.select('table.stats_table')[0]

    links = [l.get('href') for l in standings_table.find_all('a')]
    links = [l for l in links if l and '/squads/' in l]
    team_urls = [f"https://fbref.com{l}" for l in links]

    # follow the "previous season" link on the next pass of the outer loop
    previous_season = soup.select('a.prev')[0].get('href')
    STANDINGS_URL = f"https://fbref.com{previous_season}"

    for team_url in team_urls:
        # pause before every team so that skipped teams are rate-limited too
        time.sleep(20)

        team_name = team_url.split('/')[-1].replace("-Stats", "").replace("-", " ")

        data = requests.get(team_url)
        matches = pd.read_html(data.text, match='Scores & Fixtures')[0]

        # separate name so the standings soup above is not overwritten
        team_soup = BeautifulSoup(data.text, "html.parser")
        links = [l.get('href') for l in team_soup.find_all('a')]
        links = [l for l in links if l and 'all_comps/shooting/' in l]
        data = requests.get(f"https://fbref.com{links[0]}")
        shooting = pd.read_html(data.text, match='Shooting')[0]
        shooting.columns = shooting.columns.droplevel()

        try:
            team_data = matches.merge(shooting[["Date", "Sh", "SoT", "Dist", "FK", "PK", "PKatt"]], on='Date')
        except ValueError:
            continue

        # copy the filtered rows so the new columns are not set on a view
        team_data = team_data[team_data["Comp"] == "Premier League"].copy()
        team_data["Season"] = year
        team_data["Team"] = team_name
        all_matches.append(team_data)

In [8]:
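# stack the per-team data frames into one data frame
match_df = pd.concat(all_matches)

# normalize column names to lowercase
match_df.columns = [c.lower() for c in match_df.columns]

# write to CSV (the row index is written as the first, unnamed column)
match_df.to_csv("premier_league_data.csv")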
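
Once the CSV is written, the data cleaning and feature engineering steps can start from it. The sketch below shows one way to convert string fields to typed values and to derive a "recent performance" feature that uses only earlier matches. It assumes the lowercase column names produced above and reads three sample rows (values taken from the output shown earlier) from an in-memory string rather than the real file, which has more columns and a leading index column. The names `parse_row` and `rolling_mean_before` are hypothetical helpers, and the real file may store some counts as floats.

```python
import csv
import io
from collections import deque
from datetime import date

# Three sample rows in the lowercase-column layout (subset of columns)
SAMPLE_CSV = """date,venue,result,gf,ga,opponent,sh,sot,dist
2023-08-11,Away,W,3,0,Burnley,17,8,13.9
2023-08-19,Home,W,1,0,Newcastle Utd,14,4,17.9
2023-08-27,Away,W,2,1,Sheffield Utd,29,9,17.3
"""


def parse_row(row):
    """Convert the string fields of one CSV row to typed values."""
    return {
        "date": date.fromisoformat(row["date"]),
        "venue": row["venue"],
        "result": row["result"],
        "opponent": row["opponent"],
        "gf": int(row["gf"]),
        "ga": int(row["ga"]),
        "sh": int(row["sh"]),
        "sot": int(row["sot"]),
        # an empty field is treated as a missing value
        "dist": float(row["dist"]) if row["dist"] else None,
    }


def rolling_mean_before(values, window):
    """Mean of up to `window` earlier values at each position; None if there are none."""
    history = deque(maxlen=window)
    means = []
    for value in values:
        means.append(sum(history) / len(history) if history else None)
        history.append(value)  # the current value only affects later positions
    return means


parsed = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
parsed.sort(key=lambda m: m["date"])  # rolling features need chronological order
recent_sh = rolling_mean_before([m["sh"] for m in parsed], window=2)
print(recent_sh)
```

Excluding the current match from its own average keeps the feature limited to information that was available before kickoff.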