## **Title: Player Performance Prediction in Euro League Football**

## Team Members:

1. Resham Deepak Bahira     -  Reshamm13
2. Sudhanshu Gopalrao Pawar    - Psudhanshu
3. Nithin Kumar Hadhge Girish Kumar -  NK12131
4. Aaryan wani          -    aaryanwani

## INTRODUCTION

The goal of this project is to develop a predictive model that forecasts player performance, focusing on one task:  
**Predicting the form score for the upcoming season** based on historical form scores from previous years, while incorporating contextual factors such as match location (home or away games) and weather conditions (e.g., likelihood of rain on game day).  

Our primary stakeholders include team managers and coaches, sports analysts, and fans or enthusiasts. Each group relies on accurate predictions of player performance for decision-making, analysis, and engagement.  

Traditional methods of performance prediction often fail to capture the dynamic nature of sports performance, relying heavily on static data and ignoring trends or external factors. This leads to suboptimal decision-making and missed opportunities for strategic planning.  

By predicting the **form score for the upcoming season**, our model helps stakeholders in several ways:  
- **For Team Managers and Coaches**: Seasonal form predictions provide a clear understanding of a player’s likely trajectory, enabling proactive decisions about lineups, player rotation, and training focus. For example, identifying players expected to have a strong season allows coaches to build strategies around their strengths, while those predicted to underperform can receive targeted interventions.  
- **For Sports Analysts**: These insights improve the accuracy of long-term analyses and predictions, helping analysts provide data-driven recommendations and insights to teams or media outlets.  
- **For Fans and Enthusiasts**: Knowing which players are likely to perform well throughout the season enhances fan engagement and deepens their connection to the game.  

By analyzing how players perform under specific conditions, such as rainy weather or away games, the model helps coaches make game-specific decisions. For instance:  
- **Strategic Lineups**: Coaches can identify players who excel in certain contexts, such as wet conditions or high-pressure away games, and adjust their lineup accordingly.  
- **Tailored Strategies**: Insights into contextual performance enable coaches to adapt their game plans to exploit player strengths and mitigate weaknesses under expected conditions.  
- **Optimized Rotations**: Combining form and performance predictions helps coaches plan rotations that maintain peak performance while reducing the risk of injuries or fatigue.  

This project represents an important step toward creating a comprehensive predictive system. While not exhaustive, it lays the groundwork for future enhancements, such as integrating additional contextual factors or refining predictions for specific sports or leagues, ensuring broader adoption and greater impact across stakeholders.  

## LITERATURE REVIEW

[Player Performance Prediction in Sports](https://ieeexplore.ieee.org/document/8474750)

Predicting player performance is crucial in sports analytics for optimizing strategies and player selection. Various methods, from traditional statistics to advanced machine learning, analyze historical data, including player stats and contextual factors, to forecast future performance. Pariath et al. (2023) proposed a model for football performance prediction, using KPIs like goals and assists, and machine learning models like Random Forest and Gradient Boosting for accurate predictions.

**Limitations of Existing Research**

Pariath et al.’s work has some limitations:
- **Football-Specific Metrics**: Focuses on football, limiting broader applicability.
- **Static Form Representation**: Ignores dynamic changes in player performance over time.
- **Limited Time-Series Use**: Does not model temporal trends in player form.

**Our Contribution**

Building on the foundation laid by Pariath et al., our project models player form scores dynamically over time. We address the limitations above by incorporating time-series modeling techniques, which capture trends in player performance across multiple games. Our contributions include:

- **Time-Series Analysis**: We employ time-series methods, such as ARIMA, Long Short-Term Memory (LSTM) networks, and Prophet, to model player form scores over multiple games, capturing temporal dependencies and trends that static models cannot.
- **Form Score Metric**: We define a form score by aggregating key performance indicators (e.g., goals, assists, passing accuracy) into a dynamic metric that evolves over time, reflecting a player’s consistency and performance trends.
- **Confidence Intervals**: We provide predictions with uncertainty bounds, offering teams and analysts not only the expected form score but also the confidence associated with each prediction. This allows for more informed decision-making under uncertainty.
- **Generalized Framework**: While our primary focus is football, our methodology is designed to be extensible to other sports (e.g., basketball) where performance can be tracked longitudinally.
- **Incorporating Contextual Factors**: We add contextual factors such as match location (home or away games) and weather conditions (e.g., likelihood of rain on game day). These factors can influence player performance and may make the predictions more accurate and comprehensive.

## DATA 

Dataset Link : https://data.world/cclayford/statbunker-football-statistics

This project utilizes a comprehensive dataset sourced from the **Statbunker football statistics database**, publicly available through **data.world**, a reliable platform known for hosting high-quality and verified datasets. The dataset comprises detailed information about European football leagues, covering historical match data from the **2014/15 season to the 2022/23 season**. It includes over **100,000 rows** and **117 features**, offering a rich and diverse array of statistics for analysis.  

#### Data Summary  

The dataset spans seven prominent European leagues:  
- **Premier League**  
- **Bundesliga**  
- **La Liga**  
- **French Ligue 1**  
- **Eredivisie**  
- **Serie A**  
- **Scottish Premiership**  

It contains granular details about player performance, match outcomes, and contextual factors, categorized into the following disciplines:  
- **Player Stats**: Appearances, goals, assists, yellow/red cards, and performance metrics.  
- **Team Defense/Offense**: Goals for/against, clean sheets, and team-based performance metrics.  
- **League Tables**: Seasonal rankings and team performance.  
- **Match Context**: Home/away stats, weather conditions, and match dates.  

#### Key Features  
The dataset includes columns such as:  
- **Player Performance Metrics**: Goals, assists, minutes played, form score, rolling mean/variance of form scores.  
- **Contextual Data**: Match location (home/away), weather data (temperature, precipitation, wind speed), and match date.  
- **Team Metrics**: Goals scored/conceded, clean sheets, and set-piece stats.  

#### Data Quality  
The dataset is sourced from **data.world**, ensuring its reliability and accuracy. Data.world is a well-established platform known for hosting curated datasets with proper documentation and verification. The data is clean, well-structured, and includes normalized features like **Form Score** and **Normalized Performance**, making it suitable for machine learning applications.  

#### Data Balance and Distribution  
The dataset covers a balanced representation of players and teams across seven leagues, ensuring diversity in match contexts (home vs. away) and performance scenarios (weather conditions, seasonal trends). Initial exploratory data analysis (EDA) reveals no significant class imbalances or missing data in critical features.  



#### Visualizations  
![image.png](attachment:5325bc63-e39f-45be-9c22-aedb08c304ab.png)


![image.png](attachment:5002b2f9-2eb4-4fba-8949-a2d461410963.png)

![image.png](attachment:bf70ea04-0d42-430c-9135-0aeb0f0e3156.png)

![image.png](attachment:e1cfad3c-9179-42f5-942c-039069151f95.png)

![image.png](attachment:2abc462a-37cc-4a53-997c-3dddc42fef9a.png)

![image.png](attachment:ca3d5bbf-93b5-4e9a-8507-0182ea5514d6.png)

- There are strong positive correlations between certain weather variables (e.g., tavg, tmin, tmax) and performance metrics like Appearances and Goals.
- Precipitation (prcp) and wind speed (wspd) show mostly negative correlations with the performance metrics.
- The correlations provide insights into how different weather conditions may impact the team's or player's performance across various metrics.


## METHODS

### Step 1: Merging the data

In this project, we merged 63 CSV files from the 2014/15 to 2022/23 seasons, covering datasets such as attendance, league nationalities, player stats, and team performance. To ensure consistency, key columns like 'League', 'Team', and 'Season' were standardized to string data types using the `astype(str)` method.

The merging process began by combining the `league_nationalities` and `attendance` DataFrames using an outer join on the 'League' and 'Team' columns. Additional datasets were then merged based on 'League', 'Team', and 'Season' using the same outer join to preserve all records.

The `pd.merge` function was used for merging, with the 'outer' join method to ensure no data was lost. After each merge, the integrity of the combined DataFrame was verified using methods like `head()` and column checks. This systematic approach successfully integrated the datasets, providing a comprehensive dataset for analysis.
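
The merge described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the two toy tables, their values, and the helper `standardize_keys` are hypothetical stand-ins for the real CSV files.

```python
import pandas as pd

KEYS = ["League", "Team", "Season"]

def standardize_keys(df: pd.DataFrame) -> pd.DataFrame:
    """Cast whichever join keys are present to str so merges compare like with like."""
    out = df.copy()
    for col in KEYS:
        if col in out.columns:
            out[col] = out[col].astype(str)
    return out

# Toy stand-ins for two of the source tables (hypothetical values).
league_nationalities = standardize_keys(pd.DataFrame({
    "League": ["Premier League", "Bundesliga"],
    "Team": ["Arsenal", "Bayern"],
    "Nationalities": [15, 12],
}))
attendance = standardize_keys(pd.DataFrame({
    "League": ["Premier League", "Serie A"],
    "Team": ["Arsenal", "Juventus"],
    "Avg Attendance": [60000, 39000],
}))

# Outer join keeps teams that appear in only one table; unmatched cells become NaN.
merged = pd.merge(league_nationalities, attendance, on=["League", "Team"], how="outer")

# Quick integrity checks after the merge.
print(merged.head())
print(merged.columns.tolist())
```

An outer join never drops rows, but it does introduce missing values wherever a team is absent from one table, which is why a cleaning step is needed afterwards.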

### Step 2: Data Cleaning and Exporting

In this project, a systematic approach was applied to clean the merged dataset, ensuring it was ready for analysis and modeling. The cleaning process involved feature selection, handling missing values, verification, and data export.

To streamline the dataset, unnecessary columns were filtered out, retaining only those relevant to the analysis. This reduced complexity and focused the data on meaningful variables. Missing values in numerical columns, where they represented the absence of an event (e.g., no goals scored), were replaced with zeros to maintain dataset integrity. Additionally, the '% Clean Sheets' column was converted to a float, and missing values were imputed using the median to minimize the impact of outliers.

After handling missing values, the dataset was thoroughly checked to ensure no residual missing values remained. Finally, the cleaned data was exported to a CSV file for easy access and further analysis. This process ensured the dataset was accurate, consistent, and ready for modeling.
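
A minimal sketch of these cleaning steps is shown below. The toy rows, the `Unused` column, and the assumption that '% Clean Sheets' arrives as strings with a trailing percent sign are illustrative only.

```python
import pandas as pd

# Toy merged data (hypothetical values).
df = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Goals": [2.0, None, 1.0],
    "Assists": [None, 3.0, 0.0],
    "% Clean Sheets": ["40%", None, "20%"],
    "Unused": ["x", "y", "z"],
})

# Feature selection: keep only the columns relevant to the analysis.
keep = ["Player", "Goals", "Assists", "% Clean Sheets"]
cleaned = df[keep].copy()

# A missing event count means the event did not happen, so fill with zero.
event_cols = ["Goals", "Assists"]
cleaned[event_cols] = cleaned[event_cols].fillna(0)

# Convert the percentage column to float, then impute gaps with the median.
pct = cleaned["% Clean Sheets"].astype(str).str.rstrip("%")
cleaned["% Clean Sheets"] = pd.to_numeric(pct, errors="coerce")
cleaned["% Clean Sheets"] = cleaned["% Clean Sheets"].fillna(cleaned["% Clean Sheets"].median())

# Verification: no residual missing values.
assert cleaned.isna().sum().sum() == 0

# With no path, to_csv returns the CSV text; pass a filename to write to disk instead.
csv_text = cleaned.to_csv(index=False)
```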

### Step 3: EDA

In this step we explored the data and examined the correlations between key variables.

Some insights from EDA are as follows:

- Playing time strongly correlates with goal-scoring (0.81), while goals and assists show a moderate link (0.43), indicating some overlap in scoring and assisting.
- Weaker correlations between assists and appearances (0.27) and appearances and minutes (0.35) suggest that more games do not necessarily lead to higher statistics due to factors like substitutions.
- Forwards are the primary goal scorers, with midfielders and defenders scoring fewer goals. Goalkeepers rarely score, except for a few outliers.
- Teams tend to score more goals at home than away, highlighting the "home advantage," likely due to factors such as crowd support and less travel stress.
- There is a weak positive correlation between goals and average temperature, indicating minimal impact of weather on goal-scoring.
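
Figures like these can be obtained with a pairwise Pearson correlation matrix and a grouped mean. The sketch below uses made-up rows and a hypothetical `Venue` column, so its numbers will not match the ones reported above.

```python
import pandas as pd

# Toy player-season rows (hypothetical values and column names).
stats = pd.DataFrame({
    "Goals": [0, 1, 2, 3, 5, 8],
    "Assists": [1, 0, 3, 2, 4, 3],
    "Minutes": [90, 300, 600, 900, 1500, 2400],
    "tavg": [10.0, 12.0, 9.0, 15.0, 11.0, 14.0],
    "Venue": ["Home", "Away", "Home", "Away", "Home", "Away"],
})

# Pairwise Pearson correlations between the numeric columns.
corr = stats[["Goals", "Assists", "Minutes", "tavg"]].corr()

# Average goals by venue, the kind of comparison behind the home-advantage observation.
goals_by_venue = stats.groupby("Venue")["Goals"].mean()

print(corr.round(2))
print(goals_by_venue)
```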

### Step 4: Feature engineering

**1. Feature Engineering for Weather Summary**

The weather summary feature engineering process involves cleaning, validating, and enriching the dataset with relevant weather data to improve its analytical value. Initially, the `meteostat` package is used to fetch historical weather data based on specific geographical coordinates and dates. The dataset, loaded from the 'Cleaned_data.csv' file, is validated to ensure it contains the necessary columns—'latitude', 'longitude', and 'game_date'—for fetching weather data.

A custom function, `get_weather_data`, is defined to retrieve daily weather statistics such as average temperature, minimum and maximum temperature, precipitation, and wind speed for each row in the dataset. This function is applied to each row using the `apply` method, expanding the results into new columns within the DataFrame.
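
The row-wise lookup can be sketched as below. The body of `get_weather_data` is a guess at its shape, not the project's implementation. The `meteostat` call assumes the package's 1.x `Point`/`Daily` interface, and `fake_fetch` is a hypothetical offline stand-in so the example runs without network access.

```python
import pandas as pd

WEATHER_COLS = ["tavg", "tmin", "tmax", "prcp", "wspd"]
REQUIRED_COLS = ["latitude", "longitude", "game_date"]

def fetch_daily_meteostat(lat, lon, day):
    """Fetch one day of weather (assumes meteostat's 1.x Point/Daily interface)."""
    from meteostat import Daily, Point  # imported lazily; needs network access
    return Daily(Point(lat, lon), day, day).fetch()

def get_weather_data(row, fetch_daily=fetch_daily_meteostat):
    """Return the daily weather statistics for one match row as a Series."""
    day = pd.Timestamp(row["game_date"]).to_pydatetime()
    daily = fetch_daily(row["latitude"], row["longitude"], day)
    if daily.empty:
        # No station data for that place and date: return NaNs.
        return pd.Series(float("nan"), index=WEATHER_COLS)
    return daily.iloc[0].reindex(WEATHER_COLS)

def fake_fetch(lat, lon, day):
    """Offline stand-in returning fixed values in a daily-weather column layout."""
    return pd.DataFrame(
        {"tavg": [12.0], "tmin": [8.0], "tmax": [16.0],
         "prcp": [0.4], "snow": [0.0], "wspd": [14.0]},
        index=[day],
    )

matches = pd.DataFrame({
    "Player": ["A", "B"],
    "latitude": [51.55, 48.22],
    "longitude": [-0.11, 11.62],
    "game_date": ["2022-10-01", "2022-10-08"],
})

# Validate that the columns needed for the lookup are present.
missing = [c for c in REQUIRED_COLS if c not in matches.columns]
if missing:
    raise ValueError(f"missing columns: {missing}")

# apply(axis=1) calls the function per row; returned Series expand into columns.
weather = matches.apply(get_weather_data, axis=1, fetch_daily=fake_fetch)
enriched = pd.concat([matches, weather], axis=1)
```

Passing the fetcher in as an argument keeps the per-row logic testable offline; dropping `fetch_daily=fake_fetch` would use the real lookup.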

To clean the weather data, irrelevant columns, like 'snow', are dropped, and rows with placeholder values, such as 'Unknown' in the 'weather_summary' column, are removed. This ensures that the data reflects actual weather conditions. After cleaning, the dataset is verified by checking its shape and analyzing the frequency of weather conditions to understand the distribution of weather types.

Finally, the cleaned data is exported to a CSV file ('data_with_weather_summary_cleaned.csv') for future analysis. This process ensures that the weather data is accurate, relevant, and ready for later steps, such as evaluating the impact of weather on player form.


**2. Position-Specific Form Score Feature Engineering**

**Position-Specific Feature Selection**
To accurately evaluate player performance, we identified key metrics that vary by position. Each position on the field has distinct responsibilities and contributions to the game. Therefore, a tailored feature set was developed for each position to capture the most relevant performance indicators:
- **Midfielders**:
  - **Primary Metrics**: Goals, Assists, Minutes Played
  - **Contextual Metrics**: Open Play, Free Kicks, Penalties
  - **Negative Indicators**: Yellow Cards, Red Cards
- **Attackers**:
  - **Primary Metrics**: Goals, Assists, Goal Efficiency (Minutes per Goal)
  - **Contextual Metrics**: Open Play, Crosses, First/Last Scorer
  - **Negative Indicators**: Yellow Cards, Red Cards
- **Defenders**:
  - **Primary Metrics**: Clean Sheets (CS), Goals Against
  - **Defensive Metrics**: Goals Against per Match, Tackles, Clearances
  - **Negative Indicators**: Yellow Cards, Red Cards
- **Goalkeepers**:
  - **Primary Metrics**: Clean Sheets (CS)
  - **Defensive Metrics**: Goals Against, Goals Against per Match
  - **Negative Indicators**: Yellow Cards, Red Cards

This position-specific feature selection ensures that each player's performance is assessed using metrics that are most relevant to their role on the field.

**Feature Normalization**
To account for differences in the scale of various performance metrics (e.g., goals scored vs. minutes played), we applied `StandardScaler` to standardize the features. This normalization removes inherent biases in raw statistics, making it easier to compare players across different positions and teams. The resulting features are centered around zero with a standard deviation of one.

**Weighted Performance Calculation**

**Position-Specific Weighting**
Each position was assigned a specific weight for the selected metrics. These weights reflect the relative importance of each metric for that position. For example, goals are weighted more heavily for attackers, while clean sheets are given more importance for goalkeepers and defenders.
- **Midfielder Weights**:
  - Goals: 25%
  - Assists: 25%
  - Minutes Played: 20%
  - Open Play: 10%
  - Yellow Cards (Negative Weight): -10%

- **Attacker Weights**:
  - Goals: 35%
  - Assists: 25%
  - Goal Efficiency (Minutes per Goal): 20%
  - Open Play: 10%
  - First Scorer Bonus: 10%

- **Defender Weights**:
  - Clean Sheets (CS): 30%
  - Goals Against (Negative): -20%
  - Goals Against per Match (Negative): -20%
  - Minutes Played: 20%
  - Yellow Cards (Negative): -10%

- **Goalkeeper Weights**:
  - Clean Sheets (CS): 40%
  - Goals Against (Negative): -30%
  - Minutes Played: 30%

These weights were derived from domain expertise and historical performance data, ensuring that the scoring system reflects the relative contributions of each position to team success.

**Weighted Score Calculation**
For each player, the weighted score was calculated by multiplying the normalized feature values by their respective weights. This process ensures that each metric is appropriately accounted for based on the player’s position. The final score for each player is the sum of the weighted contributions across all relevant metrics.

**Performance Score Normalization**
To ensure that the final form score is interpretable and consistent, we used the `MinMaxScaler` to transform the weighted performance score into a range between 0.5 and 1. The minimum (0.5) corresponds to the lowest weighted score in the data and serves as the baseline, while the maximum (1) corresponds to the highest weighted score. This scaling makes the form score comparable across players and seasons.
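
Written out, and assuming the scaler was configured with `feature_range=(0.5, 1)`, the weighted score and its rescaling amount to the following. The symbols are notation introduced here for illustration: $z_{ik}$ is the standardized value of metric $k$ for player $i$, and $w_k^{(p_i)}$ is the weight of metric $k$ for that player's position.

$$
S_i = \sum_{k} w_k^{(p_i)}\, z_{ik},
\qquad
\text{FormScore}_i = 0.5 + 0.5 \cdot \frac{S_i - S_{\min}}{S_{\max} - S_{\min}}
$$

For example, with $S_{\min} = -1$, $S_{\max} = 3$, and $S_i = 1$, the form score is $0.5 + 0.5 \cdot \tfrac{2}{4} = 0.75$.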

**Overall Process Flow**
The methodology for calculating the position-specific form score follows these steps:
1. **Position-Specific Feature Selection**: Identify relevant features for each position.
2. **Missing Value Handling**: Impute missing values using the mean strategy.
3. **Feature Normalization**: Standardize features using `StandardScaler`.
4. **Weighted Performance Calculation**: Apply position-specific weights to the normalized features.
5. **Score Normalization**: Normalize the final score to a [0.5, 1] range using `MinMaxScaler`.

This approach results in a position-specific form score that reflects a player’s overall performance in a way that is tailored to their role on the field.
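
The five steps can be sketched for a single position as follows. The weights are the attacker weights listed above; the column names, the toy rows, and the helper `form_score` are hypothetical.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Step 1: attacker features and weights as listed in the text (column names are hypothetical).
ATTACKER_WEIGHTS = {
    "Goals": 0.35,
    "Assists": 0.25,
    "Goal Efficiency": 0.20,
    "Open Play": 0.10,
    "First Scorer": 0.10,
}

def form_score(players: pd.DataFrame, weights: dict) -> pd.Series:
    """Impute, standardize, weight, and rescale to [0.5, 1] for one position group."""
    cols = list(weights)
    X = SimpleImputer(strategy="mean").fit_transform(players[cols])  # step 2
    Z = StandardScaler().fit_transform(X)                            # step 3
    weighted = Z @ pd.Series(weights)[cols].to_numpy()               # step 4
    scaled = MinMaxScaler(feature_range=(0.5, 1.0)).fit_transform(   # step 5
        weighted.reshape(-1, 1)
    )
    return pd.Series(scaled.ravel(), index=players.index, name="Form Score")

# Toy attackers (hypothetical values); one missing assist count is mean-imputed.
attackers = pd.DataFrame({
    "Goals": [20, 5, 10, 0],
    "Assists": [7, None, 3, 1],
    "Goal Efficiency": [0.9, 0.3, 0.5, 0.0],
    "Open Play": [15, 4, 8, 0],
    "First Scorer": [6, 1, 2, 0],
})
scores = form_score(attackers, ATTACKER_WEIGHTS)
print(scores.round(3))
```

Because the scaler is fitted on the group it scores, the lowest-rated player in the group always receives 0.5 and the highest 1, regardless of absolute quality.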

**3. Rolling Feature Creation**

In our analysis, capturing temporal performance dynamics is essential to understanding how a player's form evolves over time. To achieve this, we implemented a feature engineering strategy that generates rolling statistical measures for the player's form score. These rolling features help to capture the player's performance trends and consistency over recent matches.

1. Rolling Mean
The **rolling mean** is calculated over sliding windows of 3 matches. This feature provides a smoothed version of the player's form score by averaging the scores within each window, which helps the model capture short-term performance trends.

2. Rolling Variance
The **rolling variance** measures the consistency of the player's performance across the same sliding windows. It shows how much the player's form score fluctuates within each window: a high variance indicates inconsistent performance, while a low variance suggests stable performance.

The rolling features contribute significantly to the model by:
- **Capturing short-term performance trends**: The rolling mean helps the model to track the player's recent performance and identify whether they are improving or declining.
- **Providing insights into player consistency**: The rolling variance reveals how consistent the player's form is, which is crucial for evaluating reliability.
- **Allowing the model to learn from recent performance patterns**: These features enable the model to focus on recent data, making it more responsive to changes in player form and improving its predictive accuracy.

By incorporating rolling mean and variance features, we enhance the model's ability to track and understand the temporal dynamics of a player's form. This allows for more accurate predictions based on recent performance trends and consistency.
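
A minimal sketch of the rolling features with pandas is shown below. `Form_Score_Rolling_Mean_3` matches the column name used later in the report; `Form_Score_Rolling_Var_3`, `Season_Start`, and the toy values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Player": ["A"] * 4 + ["B"] * 3,
    "Season_Start": [2019, 2020, 2021, 2022, 2020, 2021, 2022],
    "Form Score": [0.60, 0.70, 0.80, 0.90, 0.55, 0.65, 0.60],
})

# Rolling windows must follow time order within each player.
df = df.sort_values(["Player", "Season_Start"])
grouped = df.groupby("Player")["Form Score"]

# Window of 3 observations; min_periods lets early rows use a shorter window.
df["Form_Score_Rolling_Mean_3"] = grouped.transform(
    lambda s: s.rolling(window=3, min_periods=1).mean()
)
# Variance needs at least two points, so each player's first row stays NaN.
df["Form_Score_Rolling_Var_3"] = grouped.transform(
    lambda s: s.rolling(window=3, min_periods=2).var()
)
print(df)
```

As written, each window includes the current observation. Applying `s.shift(1)` before `rolling` would keep a row's own score out of its features; the report does not say which variant was used.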


### Step 5: Preprocessing and Modelling


**Part 1:** Predict the form score for the upcoming season using two methods: a traditional machine learning (ML) model and a time series approach. Here the feature-engineered **form score is our target variable**.

Preprocessing:
The preprocessing began with the categorization of features into numerical and categorical groups to ensure appropriate preprocessing. Irrelevant columns such as *game_date*, *latitude*, and *longitude* were identified and removed using the `drop()` method. For high-cardinality features, the *Player* column was target-encoded by calculating the mean *Form Score* for each player, after which the original *Player* column was dropped. Separate preprocessing pipelines were then created for numerical and categorical features: numerical features were standardized using `StandardScaler` to ensure consistent scaling, while categorical features were one-hot encoded using `OneHotEncoder` to transform categories into binary variables. These pipelines were combined using a `ColumnTransformer` to streamline the preprocessing process during model training and evaluation.

The dataset was split into training (80%) and testing (20%) subsets to assess model performance. Additionally, weather-related features (*tavg*, *tmin*, *tmax*, etc.) and rolling performance metrics (e.g., *Form_Score_Rolling_Mean_3*) were retained as they were deemed relevant for predictive modeling. Since the target variable is continuous, we did not use SMOTE for handling class imbalance. Following preprocessing, the total number of features increased due to the application of one-hot encoding. To maintain interpretability, the feature names were retrieved after transformation.

Modelling:
We used `RandomForestRegressor`, Linear Regression, and `XGBRegressor` to predict the Form Score for the upcoming season, since it is a continuous target variable. All three are regression models, evaluated here with error metrics such as Mean Squared Error (MSE) and Mean Absolute Error (MAE). Because SMOTE addresses class imbalance in classification problems, it was not applicable here. Instead, the focus was on building robust regression models capable of capturing both linear and nonlinear relationships. Class-based splitting criteria such as Gini impurity and entropy likewise apply only to classification trees; the regression variants of random forests and gradient boosting split on loss-reduction criteria such as squared error.
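
A minimal sketch of fitting and comparing the regressors with cross-validation follows. To keep the example dependent on scikit-learn alone, `GradientBoostingRegressor` is swapped in for `XGBRegressor`; it belongs to the same family of boosted trees but is not the same implementation. The data is synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic stand-in for the preprocessed feature matrix and form score.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 0.75 + 0.05 * X[:, 0] - 0.03 * X[:, 1] ** 2 + rng.normal(scale=0.01, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=42),
    # Stand-in for XGBRegressor (see the note above).
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validation on the training split only.
    cv = cross_validate(
        model, X_train, y_train, cv=5,
        scoring=("neg_root_mean_squared_error", "r2"),
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "cv_rmse": -cv["test_neg_root_mean_squared_error"].mean(),
        "cv_r2": cv["test_r2"].mean(),
        "test_rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "test_r2": r2_score(y_test, pred),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

Reporting cross-validated and held-out scores side by side makes a train/test gap visible, which is the comparison the results section relies on.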


### Time Series Models
This section explores the application of time series models to analyze seasonal trends and predict player form scores, with the goal of creating interpretable models and visualizations for stakeholders. These models aim to provide insights into historical performance and forecast form score intervals, enabling stakeholders to assess a player’s potential performance in upcoming seasons.

**Data Preprocessing and Feature Engineering**
To address the research question of whether weather conditions influence player form over time, we incorporated weather data as a contextual feature. Additionally, we implemented position-specific feature mapping to construct form scores tailored to each player's role. For example, metrics such as goals scored, assists, and penalties were emphasized for attackers, while metrics like goals conceded and clean sheets were prioritized for goalkeepers.

A simplified dataset was created for the time series analysis, comprising the computed form scores and contextual features such as weather and player position. Since the data was organized by season (e.g., "2018/19"), a new column representing the season's starting year in datetime format was introduced to facilitate time series modeling. To enhance model performance, rolling statistics such as mean and variance were computed and included as additional features.
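
The season-to-datetime conversion can be sketched as below; the column name `Season_Start` and the toy values are hypothetical.

```python
import pandas as pd

seasons = pd.DataFrame({
    "Season": ["2014/15", "2018/19", "2022/23"],
    "Form Score": [0.61, 0.72, 0.68],
})

# Take the starting year ("2018/19" -> "2018") and parse it with an explicit format.
start_year = seasons["Season"].str.split("/").str[0]
seasons["Season_Start"] = pd.to_datetime(start_year, format="%Y")
print(seasons)
```

Parsing the year as a string with an explicit format matters: `pd.to_datetime(2018)` interprets a bare integer as nanoseconds since the Unix epoch and returns a timestamp in 1970, a pitfall worth ruling out whenever a plot's time axis unexpectedly starts in 1970.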

**Time Series Modeling**
Given the seasonal nature of the data, models such as SARIMA and Facebook Prophet were employed, as they are well-suited for seasonal time series data. Initial exploration revealed that the dataset lacked precise match-level timestamps, so each player's series was modeled at season granularity.

*SARIMA (Seasonal Autoregressive Integrated Moving Average):*
SARIMA extends ARIMA by incorporating seasonal components, accounting for periodic fluctuations in the data. The model's parameters include:

- **(p, d, q)**: the orders of autoregression (AR), differencing (I), and moving average (MA), respectively.
- **(P, D, Q, m)**: the seasonal counterparts of AR, I, and MA, with m denoting the number of periods in a seasonal cycle.

The time series was first tested for stationarity using the Augmented Dickey-Fuller (ADF) test, which confirmed stationarity without the need for differencing. Autocorrelation (ACF) and partial autocorrelation (PACF) plots were then analyzed to determine the optimal values for p and q. The SARIMA model was fitted using the identified parameters, and predictions were generated to forecast form scores for upcoming seasons. Forecast accuracy was evaluated by plotting predicted values against historical data. We further included additional contextual features in SARIMA to obtain more dynamic results.

*Facebook Prophet*:
Prophet is a robust model designed for time series forecasting, particularly effective for data with strong seasonal and trend components. The model requires two key columns:

- `ds` (date): the time variable.
- `y` (value): the target variable (form score).

After structuring the dataset to meet Prophet’s requirements, the model was trained and used to generate forecasts for future seasons. The model also provided confidence intervals for predictions, offering stakeholders a range of potential outcomes and associated uncertainties.
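
A minimal sketch of the data structuring and forecast call is given below. The helpers `to_prophet_frame` and `forecast_form`, the source column names, and the 95% interval width are assumptions; `prophet` is imported lazily so the reshaping step can run without it installed.

```python
import pandas as pd

def to_prophet_frame(df, date_col="Season_Start", target_col="Form Score"):
    """Rename and order columns into the ds/y layout Prophet expects."""
    out = df[[date_col, target_col]].rename(columns={date_col: "ds", target_col: "y"})
    out["ds"] = pd.to_datetime(out["ds"])
    return out.sort_values("ds").reset_index(drop=True)

def forecast_form(history, seasons_ahead=5):
    """Fit Prophet on a ds/y frame and forecast yearly steps with uncertainty bounds."""
    from prophet import Prophet  # imported lazily; requires the `prophet` package
    model = Prophet(interval_width=0.95)
    model.fit(history)
    # "YS" = year start, one row per upcoming season.
    future = model.make_future_dataframe(periods=seasons_ahead, freq="YS")
    forecast = model.predict(future)
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]

# Toy per-season history for one player (hypothetical values).
player = pd.DataFrame({
    "Season_Start": ["2021-01-01", "2019-01-01", "2020-01-01", "2022-01-01"],
    "Form Score": [0.74, 0.66, 0.70, 0.69],
})
history = to_prophet_frame(player)
print(history)
# forecast = forecast_form(history)  # uncomment when prophet is installed
```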

## RESULTS 

The first part of this project was to predict the form score for the upcoming season based on historical form scores from previous years. To achieve this, three machine learning models (Linear Regression, Random Forest Regressor, and XGBoost Regressor) were implemented and evaluated. The performance of each model was compared using Root Mean Squared Error (RMSE) and R² score on both cross-validation and test sets. Because all three models predict continuous values, classification metrics such as F1 score and the confusion matrix do not apply.

1. **Linear Regression**:  
The model achieved an almost perfect **R² score of 0.9999** with an extremely low **Mean Squared Error (2.8926e-13)**.  
While this result suggests an excellent fit, such near-perfect performance often indicates overfitting, target leakage, or a target that is almost exactly linear in the inputs. The last is plausible here, because the form score is itself built as a weighted linear combination of standardized metrics. This raises concerns about the model's ability to generalize to unseen, real-world data.  

2. **Random Forest Regressor**:  
   - **Cross-validation results**:  
      RMSE: **0.0071 ± 0.0008**  
      R²: **0.9239 ± 0.0176**  
   - **Training set**:  
     RMSE: **0.0058**, R²: **0.9491**  
   - **Test set**:  
     RMSE: **0.0085**, R²: **0.8983**  
The Random Forest model demonstrated strong performance across both the training and test sets, with consistent cross-validation results. The relatively small difference between training and test R² scores highlights its ability to generalize well to unseen data.  

3. **XGBoost Regressor**:  
   - **Cross-validation results**:  
      RMSE: **0.0017 ± 0.0009**  
      R²: **0.9945 ± 0.0065**  
   - **Training set**:  
      RMSE: **0.0003**, R²: **0.9999**  
   - **Test set**:  
     RMSE: **0.0034**, R²: **0.9834**  
XGBoost achieved very high accuracy on both the training and test sets, with an **R² score of 0.9834** on the test set. However, the extremely low error and near-perfect fit on the training set (**R² = 0.9999**) suggest that the model may be overfitting, which could reduce its robustness when applied to new data.

While **Linear Regression** and **XGBoost Regressor** produced near-perfect results, their extremely low errors and exceptionally high R² scores indicate a strong likelihood of overfitting, limiting their reliability for real-world predictions. In contrast, the **Random Forest Regressor** achieved a strong balance between accuracy and generalization, with a **cross-validated R² score of 0.9239 ± 0.0176** and a **test R² score of 0.8983**. The model demonstrated consistent performance across training, validation, and test sets, making it the most robust and reliable choice for predicting player form scores for the upcoming season.

**Sensitivity to weather summary**

The sensitivity test was performed by varying the weather_summary feature (e.g., "Hot," "Cold," "Windy," "Snowy") and observing the impact on the predicted Form Score. Here are the results:

- Hot: 0.5586
- Cold: 0.5585
- Windy: 0.5585
- Snowy: 0.5585

The predicted Form Score remains nearly identical across different weather conditions. Differences in predictions are only within the 4th or 5th decimal place, indicating that weather has a negligible influence on the player's form as per the current model and data.
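
A sensitivity test of this kind can be sketched by holding one input row fixed and swapping only the weather category. The training data below is synthetic and the pipeline is a simplified stand-in for the project's model, so its outputs will not match the values listed above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

CONDITIONS = ["Hot", "Cold", "Windy", "Snowy"]

# Synthetic training data in which weather carries no signal.
rng = np.random.default_rng(0)
n = 200
train = pd.DataFrame({
    "Form_Score_Rolling_Mean_3": rng.uniform(0.5, 1.0, n),
    "weather_summary": rng.choice(CONDITIONS, n),
})
target = train["Form_Score_Rolling_Mean_3"] + rng.normal(scale=0.01, size=n)

pipeline = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["Form_Score_Rolling_Mean_3"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["weather_summary"]),
    ])),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
]).fit(train, target)

# Hold one player row fixed and change only the weather category.
baseline = train.iloc[[0]]
sensitivity = {}
for condition in CONDITIONS:
    variant = baseline.assign(weather_summary=condition)
    sensitivity[condition] = float(pipeline.predict(variant)[0])

# Spread between the highest and lowest prediction across conditions.
spread = max(sensitivity.values()) - min(sensitivity.values())
print(sensitivity, round(spread, 5))
```

A small spread shows that the fitted model barely uses the feature; it does not by itself show that weather has no real-world effect on performance.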

4. **SARIMA**


- ![Screenshot 2024-12-17 at 10.04.46 PM.png](attachment:fe9c65d0-c069-416e-b164-dff109ab3c42.png)
- The results presented are from the SARIMA model applied to the player Erling Haaland, with the accompanying graph displaying the forecast plots. These plots illustrate the predicted form scores along with their confidence intervals for upcoming seasons. Such visualizations enable stakeholders to assess the potential performance trajectory of the player, facilitating data-driven decision-making regarding his expected form and overall contribution.
- ![WhatsApp Image 2024-12-17 at 21.37.00.jpeg](attachment:90e4483a-c68c-4249-af90-eb1b0337366a.jpeg)

5. **SARIMA with additional contextual features**

- ![Screenshot 2024-12-17 at 10.01.19 PM.png](attachment:20e1b817-bb7e-499d-a3b7-ce38177fdefc.png)
- The SARIMA model was enhanced by incorporating additional contextual features such as home vs. away status, minutes played, and open play contributions. This integration resulted in more refined confidence intervals and a more accurate prediction of the form score. When comparing the predicted form scores to the players' current real-time performance, the results aligned closely with actual observations, demonstrating that adding context-specific features significantly improves forecasting precision.

- The figures illustrate the improved forecasting accuracy for players Erling Haaland and Lionel Messi. The plots clearly depict the predicted form scores for upcoming seasons, providing a robust visualization that strengthens the validity of our results.
- ![image.png](attachment:c2068a2d-18f4-45c1-966f-0aed9da0d7e8.png)
- ![WhatsApp Image 2024-12-17 at 21.43.25.jpeg](attachment:ef780f91-b2a5-47dc-967a-83a91eeb0bf3.jpeg)


6. **Prophet**

![Screenshot 2024-12-17 at 10.11.01 PM.png](attachment:cfa821c6-e3b6-4f49-ba24-8044cd3741d4.png)
- The historical form scores are shown from 1970 to 1974, with the "yhat_lower" and "yhat_upper" columns giving the lower and upper bounds of each predicted form score.
- The "Predicted Form Scores for the next 5 years" section shows the forecast for the next 5 years, with values holding nearly constant at 0.68, 0.68, 0.67, 0.67, and 0.67.

![WhatsApp Image 2024-12-17 at 21.53.35.jpeg](attachment:23e722b5-8403-45fa-83bb-347641f9d60e.jpeg)
- The data points are scattered, showing the player's form score fluctuating over time from around 0.65 to 0.80.
- The graph provides a detailed view of the player's past form scores, complementing the forecasted information shown in the previous table.

Overall, these two images suggest that the forecasted form scores for the next 5 years are expected to remain relatively stable, while the historical data indicates the player's performance has been variable over time.

## DISCUSSION

The primary goal of this project was to predict player form scores for the upcoming season using historical performance data and contextual features, thereby addressing stakeholder needs for data-driven decision-making. The project implemented and evaluated three machine learning models (Linear Regression, Random Forest Regressor, and XGBoost Regressor), each with its own strengths and limitations, and the results reflect varying degrees of success in achieving the stated objectives.

The models demonstrated strong predictive capabilities, with the Random Forest Regressor emerging as the most robust and reliable. It achieved consistent performance across training, validation, and test sets, with an R² score of 0.8983 on the test set and 0.9239 ± 0.0176 during cross-validation. This suggests that the model generalizes well to unseen data and aligns closely with the stakeholders' needs for reliable and interpretable predictions.
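
To make the "mean ± spread" style of cross-validation reporting concrete, here is a small pure-Python sketch. It assumes the ± value is the standard deviation of per-fold R² scores, which the report does not state explicitly. The data and the one-variable least-squares model are synthetic stand-ins, not the project's Random Forest.

```python
from statistics import mean, stdev

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def cross_val_r2(xs, ys, k=5):
    """Return the validation R^2 of each of k interleaved folds."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        val = [i for i in range(len(xs)) if i % k == fold]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [a + b * xs[i] for i in val]
        scores.append(r2_score([ys[i] for i in val], preds))
    return scores

# Synthetic data: a linear trend plus a small deterministic wobble.
xs = [i / 10 for i in range(50)]
ys = [0.5 + 0.06 * x + 0.01 * ((i * 7) % 11 - 5) for i, x in enumerate(xs)]

fold_scores = cross_val_r2(xs, ys, k=5)
cv_mean, cv_std = mean(fold_scores), stdev(fold_scores)
print(f"CV R^2: {cv_mean:.4f} +/- {cv_std:.4f}")
```

A small spread across folds, as in the reported result, indicates that the score does not depend heavily on which subset of the data is held out.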

However, the Linear Regression and XGBoost Regressor models, despite their near-perfect performance metrics (e.g., R² scores of 0.9999), likely suffer from overfitting. While these results might appear ideal on the surface, they raise concerns about the models’ ability to generalize to real-world scenarios. This undermines their reliability for practical application, as stakeholders require models that can perform well on unseen data, not just on the training set.
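
One simple way to check for this kind of overfitting is to compare R² on the training data with R² on held-out data. The sketch below is illustrative only: `MemorizingModel` is a hypothetical, deliberately overfit model (it returns the target of the nearest training point), and the targets are synthetic noise around 0.7, so any apparent fit on the training set is pure memorization.

```python
import random
from statistics import mean

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

class MemorizingModel:
    """Deliberately overfit: predicts the target of the nearest training x."""

    def fit(self, xs, ys):
        self.xs, self.ys = list(xs), list(ys)
        return self

    def predict(self, xs):
        preds = []
        for x in xs:
            nearest = min(range(len(self.xs)), key=lambda i: abs(self.xs[i] - x))
            preds.append(self.ys[nearest])
        return preds

def generalization_gap(model, train, test):
    """Fit on train, then return (train R^2, test R^2, difference)."""
    (x_tr, y_tr), (x_te, y_te) = train, test
    model.fit(x_tr, y_tr)
    train_r2 = r2_score(y_tr, model.predict(x_tr))
    test_r2 = r2_score(y_te, model.predict(x_te))
    return train_r2, test_r2, train_r2 - test_r2

# Synthetic targets: noise around 0.7 with no real dependence on x.
rng = random.Random(0)
xs = [i / 20 for i in range(40)]
ys = [0.7 + 0.05 * rng.gauss(0, 1) for _ in xs]

train = (xs[0::2], ys[0::2])   # even indices for training
test = (xs[1::2], ys[1::2])    # odd indices held out

train_r2, test_r2, gap = generalization_gap(MemorizingModel(), train, test)
print(f"train R^2 = {train_r2:.4f}, test R^2 = {test_r2:.4f}, gap = {gap:.4f}")
```

A perfect training score paired with a much lower held-out score is the signature of overfitting. A near-perfect score on both sets would instead call for checking whether any input feature encodes the target.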

Stakeholders, such as team managers and analysts, require actionable insights into a player’s potential performance for strategic decision-making. This includes identifying trends in form scores and understanding how external factors, such as weather, impact player performance. The Random Forest Regressor provides a reasonable balance of accuracy and generalization, making it the most suitable model for addressing these needs. Its ability to consistently predict form scores across datasets enhances its utility in real-world applications.

On the other hand, the sensitivity analysis revealed that the weather_summary feature had negligible impact on the predicted form scores, with differences only appearing in the fourth or fifth decimal place. This result suggests that weather conditions, as modeled in the current dataset, do not significantly influence player performance. While this aligns with the findings, it may not fully satisfy stakeholders who expected more nuanced insights into the interplay between weather and form. Future work could involve incorporating more granular weather data, such as temperature, humidity, or wind speed, to better capture potential effects.
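
A sensitivity analysis of this kind can be sketched as follows: hold every other feature fixed, vary one feature over its possible values, and record the largest change in the prediction. The `toy_predict` function, the `prev_form` and `is_home` features, and the weather categories are hypothetical stand-ins. The weather effect in the toy model is tiny by construction, purely to mirror the pattern described above.

```python
def sensitivity_to_feature(predict, rows, feature, values):
    """Largest spread in predictions when `feature` is varied over `values`
    while all other features in each row are held fixed."""
    max_spread = 0.0
    for row in rows:
        preds = [predict({**row, feature: value}) for value in values]
        max_spread = max(max_spread, max(preds) - min(preds))
    return max_spread

def toy_predict(row):
    """Hypothetical stand-in for a trained model's prediction function."""
    weather_effect = {"clear": 0.0, "rain": -0.00003, "snow": -0.00005}
    return (0.4
            + 0.3 * row["prev_form"]
            + 0.02 * row["is_home"]
            + weather_effect[row["weather_summary"]])

rows = [
    {"prev_form": 0.70, "is_home": 1, "weather_summary": "clear"},
    {"prev_form": 0.55, "is_home": 0, "weather_summary": "rain"},
]

spread_weather = sensitivity_to_feature(
    toy_predict, rows, "weather_summary", ["clear", "rain", "snow"])
spread_home = sensitivity_to_feature(toy_predict, rows, "is_home", [0, 1])

print(f"max change from weather_summary: {spread_weather:.6f}")
print(f"max change from is_home:         {spread_home:.6f}")
```

Comparing the spread for `weather_summary` against the spread for another feature puts the weather effect in context, which is more informative than reading the raw difference alone.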


The **time series** models developed in this project provide valuable insights and serve as a strong supplementary tool to our primary machine learning models, which should be the preferred choice due to their greater reliability. While the time series models occasionally produce vague results and do not account for major career changes, such as player retirements or transfers to leagues outside the Euro League, their predictions are often closely aligned with actual performance data. This makes them a useful resource for stakeholders, offering realistic projections of player performance based on contextual factors.

These models can aid stakeholders in making informed decisions, such as scouting and sorting players based on their current form. Although not suitable for sole reliance in critical decision-making, they serve as an excellent starting point for evaluating players and gaining actionable insights. To address stakeholder needs more comprehensively, future work could focus on incorporating career shift variables and improving prediction granularity. These enhancements would increase the models' practical applicability and ensure they align even more closely with the stakeholders' requirements.



## LIMITATIONS

1. One notable limitation of our models is the minimal impact observed from including weather as a contextual factor. Although we initially hypothesized that weather conditions might significantly influence player performance, the results did not demonstrate a substantial effect.

2. Additionally, the time series models, while useful for supplementary analysis, have their own constraints. They can produce vague predictions at times and fail to account for major career changes, such as player injuries or transfers to leagues outside the Euro League. These omissions can reduce the accuracy and relevance of predictions in certain scenarios.

## FUTURE WORK

To enhance the accuracy and reliability of predictions, future work could focus on incorporating additional contextual features, such as player injuries, team formations, player synergy with clubs they have played for, and interactions with other players. These factors could provide deeper insights into player performance and make the models more robust.

Furthermore, developing an interactive dashboard or frontend as a visual medium would greatly benefit stakeholders. Such a tool would allow stakeholders to explore predictions and insights in an intuitive way, facilitating more effective decision-making.