# **Premier League  Match Estimator**

Welcome to my Premier League Match Estimator Project! 🏆

This project predicts match outcomes in the Premier League by combining:
- Player SofaScore ratings
- Team's recent form
- Machine learning models trained on historical match data

Key features:
- Interactive input for teams and players
- Integration of player statistics and team performance
- Weighted prediction combining model results with SofaScore and form data

# Let's dive into the details and predictions! ⚽

## 1. Importing Required Libraries

Below are the libraries used in this project, along with their purposes:

- **NumPy**: Efficient numerical operations and arrays.
- **Pandas**: Data manipulation and handling.
- **LabelEncoder (from scikit-learn)**: Encoding categorical variables into numerical format.
- **TensorFlow**: Framework for machine learning and neural network models.
- **TensorFlow Decision Forests (TFDF)**: Specialized for decision tree-based models.
- **Matplotlib**: Visualization library for plotting graphs and figures.

The above imports set the foundation for the project, and specific functionalities will be utilized in later cells.

In [2]:
# Import required libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Optional: Verify library versions
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")


NumPy version: 1.26.4
Pandas version: 2.2.2


## 2. Loading Dataset

In this section, we load the dataset containing match data directly from a Google Sheets link in CSV format. This approach allows for seamless integration of external data sources, enabling dynamic updates when the data is modified in the sheet.

### Steps:
1. **Google Sheets CSV Export**: The Google Sheets file is exported as a CSV using the `export?format=csv` endpoint in the URL.
2. **Load Data**: The `pandas` library is used to read the CSV file and load it into a DataFrame.
3. **Display Dataset**:
   - The first 5 rows are displayed to understand the data structure.
   - Column data types are printed to ensure proper formatting for further processing.

In [3]:
# Google Sheets CSV URL
file_url = "premier_league_data_all.xlsx" #This should be your file path

# Load the data as a CSV file
data = pd.read_csv(file_url)

# Display the first 5 rows of the dataset
print("First 5 rows of the dataset:")
print(data.head())

# Display the data types of the columns
print("\nColumn data types:")
print(data.dtypes)

First 5 rows of the dataset:
         Date             Team Venue Result   GF   GA         Opponent   xG  \
0  2022-08-07  Manchester City  Away      W  2.0  0.0         West Ham  2,2   
1  2022-08-13  Manchester City  Home      W  4.0  0.0      Bournemouth  1,7   
2  2022-08-21  Manchester City  Away      D  3.0  3.0    Newcastle Utd  2,1   
3  2022-08-27  Manchester City  Home      W  4.0  2.0   Crystal Palace  2,2   
4  2022-08-31  Manchester City  Home      W  6.0  0.0  Nott'ham Forest  3,3   

   xGA  Poss  Attendance Formation Opp Formation  
0  0,5  75.0     62443.0     4-3-3       4-2-3-1  
1  0,1  67.0     53453.0   4-2-3-1         3-4-3  
2  1,8  69.0     52258.0     4-3-3         4-3-3  
3  0,1  74.0     53112.0   4-2-3-1         5-4-1  
4  0,7  74.0     53409.0   4-2-3-1         5-3-2  

Column data types:
Date              object
Team              object
Venue             object
Result            object
GF               float64
GA               float64
Opponent          ob

## 3. Data Filtering: Home Matches

In this step, we filter the dataset to retain only the rows where the match venue is marked as "Home". This filtering simplifies the analysis and model training by focusing solely on matches played by the home team.

### Steps:
1. **Filter Home Matches**: Rows where the "Venue" column is "Home" are kept.
2. **Drop Venue Column**: Since the venue information is now redundant (all rows are home matches), the column is removed for cleaner data.
3. **Verify the Changes**:
   - Display the first few rows of the filtered dataset.
   - Ensure that only home matches are retained in the dataset.

In [4]:
# Filter data to keep only rows where Venue is "Home"
data = data[data["Venue"] == "Home"]

# Drop the "Venue" column as it's no longer needed
data = data.drop(columns=["Venue"])

# Display the first few rows of the cleaned dataset
print("Cleaned dataset (only 'Home' matches retained):")
print(data.head())


Cleaned dataset (only 'Home' matches retained):
         Date             Team Result   GF   GA         Opponent   xG  xGA  \
1  2022-08-13  Manchester City      W  4.0  0.0      Bournemouth  1,7  0,1   
3  2022-08-27  Manchester City      W  4.0  2.0   Crystal Palace  2,2  0,1   
4  2022-08-31  Manchester City      W  6.0  0.0  Nott'ham Forest  3,3  0,7   
7  2022-10-02  Manchester City      W  6.0  3.0   Manchester Utd  3,2  1,7   
8  2022-10-08  Manchester City      W  4.0  0.0      Southampton  2,4  0,2   

   Poss  Attendance Formation Opp Formation  
1  67.0     53453.0   4-2-3-1         3-4-3  
3  74.0     53112.0   4-2-3-1         5-4-1  
4  74.0     53409.0   4-2-3-1         5-3-2  
7  53.0     53475.0     4-3-3       4-2-3-1  
8  65.0     53365.0   4-2-3-1         4-4-2  


## 4. Data Preprocessing: Date Formatting and Missing Values

In this step, we preprocess the dataset to ensure that the data is well-structured and ready for analysis.

### Steps:
1. **Date Conversion**:
   - Convert the `Date` column into a datetime format to enable proper sorting and temporal analysis.

2. **Sorting**:
   - Sort the dataset by date to ensure chronological order, which is essential for analyzing trends and patterns over time.

3. **Dataset Information**:
   - Display basic information such as column names, data types, and non-null counts to understand the structure of the dataset.

4. **Missing Value Check**:
   - Identify any missing values in the dataset to decide if additional cleaning or imputation is required.

By organizing the data chronologically and verifying its structure, we ensure that subsequent steps in the pipeline work seamlessly.

In [5]:
# Convert 'Date' column to datetime format
data['Date'] = pd.to_datetime(data['Date'])

# Sort data by date
data = data.sort_values(by='Date')

# Display basic information about the dataset
print("\nBasic information about the dataset:")
print(data.info())

# Check for missing values
print("\nMissing values in the dataset:")
print(data.isnull().sum())


Basic information about the dataset:
<class 'pandas.core.frame.DataFrame'>
Index: 920 entries, 1102 to 1711
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           920 non-null    datetime64[ns]
 1   Team           920 non-null    object        
 2   Result         919 non-null    object        
 3   GF             919 non-null    float64       
 4   GA             919 non-null    float64       
 5   Opponent       920 non-null    object        
 6   xG             919 non-null    object        
 7   xGA            919 non-null    object        
 8   Poss           919 non-null    float64       
 9   Attendance     914 non-null    float64       
 10  Formation      919 non-null    object        
 11  Opp Formation  919 non-null    object        
dtypes: datetime64[ns](1), float64(4), object(7)
memory usage: 93.4+ KB
None

Missing values in the dataset:
Date             0
Team            

## 5. Handling Missing Values

In this step, we ensure that the dataset contains no missing values by performing the following:

### Steps:
1. **Dropping Missing Values**:
   - Rows with any missing values are removed from the dataset. This is done to maintain data integrity and avoid issues during modeling.

2. **Verification**:
   - The total number of rows after dropping missing values is displayed.
   - A final check is performed to confirm that no missing values remain.

By cleaning the dataset of missing values, we prepare it for further analysis and modeling.

In [6]:
# Drop rows with missing values
data_cleaned = data.dropna()

# Verify the data
print("\nRemaining rows after dropping missing values:", len(data_cleaned))
print(data_cleaned.isnull().sum())



Remaining rows after dropping missing values: 914
Date             0
Team             0
Result           0
GF               0
GA               0
Opponent         0
xG               0
xGA              0
Poss             0
Attendance       0
Formation        0
Opp Formation    0
dtype: int64


## 6. Dropping Unnecessary Columns

In this step, we remove the following columns from the dataset:

- **`GF` (Goals For)**: Represents the number of goals scored by the team.
- **`GA` (Goals Against)**: Represents the number of goals conceded by the team.

### Reason:
These columns are dropped because:
- They may not be directly relevant for the current modeling process.
- The model will rely on aggregated or alternative features to evaluate team performance.

### Verification:
After dropping the specified columns, the dataset is displayed to ensure the removal was successful.

In [7]:
# Drop the specified columns
columns_to_drop = ['GF', 'GA']
data = data.drop(columns=columns_to_drop, errors='ignore')

# Verify the remaining columns
print("Remaining columns in the dataset:")
print(data.head())

Remaining columns in the dataset:
           Date            Team Result         Opponent   xG  xGA  Poss  \
1102 2022-08-05  Crystal Palace      L          Arsenal  1,2  1,0  56.0   
1178 2022-08-06     Bournemouth      W      Aston Villa  0,6  0,7  35.0   
912  2022-08-06       Tottenham      W      Southampton  1,5  0,5  58.0   
988  2022-08-06       Newcastle      W  Nott'ham Forest  1,7  0,3  61.0   
1444 2022-08-06           Leeds      W           Wolves  0,8  1,3  40.0   

      Attendance Formation Opp Formation  
1102     25286.0   4-2-3-1         4-3-3  
1178     11013.0     3-4-3         4-3-3  
912      61732.0     3-4-3         5-3-2  
988      52245.0     4-3-3       3-4-1-2  
1444     36347.0   4-2-3-1       4-2-3-1  


## 7. Calculating Team Statistics

### Data Cleaning:
- The `xG` (Expected Goals) and `xGA` (Expected Goals Against) columns are converted to numeric format:
  - Commas are replaced with periods (`.`) for proper decimal representation.
  - The columns are then cast to the `float` data type to enable mathematical operations.

### Aggregating Team Statistics:
For each team, the following statistics are computed:
- **`xG`**: Average expected goals across all matches.
- **`xGA`**: Average expected goals against across all matches.
- **`Possession`**: Average ball possession percentage.
- **`Attendance`**: Average attendance for the team's home matches.
- **`Formation`**: The most frequently used formation (calculated using the mode).

These aggregated statistics provide an overview of each team's performance and will be utilized in later stages of the analysis or modeling process.

In [8]:
# Ensure xG and xGA columns are numeric by replacing commas and converting
data["xG"] = data["xG"].str.replace(",", ".").astype(float)
data["xGA"] = data["xGA"].str.replace(",", ".").astype(float)

# Compute average xG, xGA, and Possession for each team
team_stats = data.groupby("Team").agg({
    "xG": "mean",
    "xGA": "mean",
    "Poss": "mean",
    "Attendance": "mean",
    "Formation": lambda x: x.mode()[0]  # Most frequent formation
}).reset_index()

## 8. Encoding Results and Teams

### Encoding Match Results:
- The `Result` column is encoded as:
  - **Win (W):** `1`
  - **Draw (D):** `0`
  - **Lose (L):** `-1`
  - This allows for numerical representation of match outcomes, suitable for modeling.

### Encoding Teams:
- Each team is represented as a one-hot encoded column:
  - **Home Team:** `1`
  - **Away Team:** `-1`
  - **Neither:** `0`
  - This encoding scheme captures the role of each team in a match.

### Dropping Redundant Columns:
- Original columns (`Team`, `Opponent`, and `Result`) are dropped after encoding as they are no longer needed.

This step prepares the dataset for machine learning by converting categorical variables into numerical representations.

In [9]:
from sklearn.preprocessing import LabelEncoder


# Encode 'Result' as Win: 2, Draw: 1, Lose: 0
data['Result_Encoded'] = data['Result'].map({'W': 1, 'D': 0, 'L': -1})

# Create encoded columns for each team
unique_teams = data['Team'].unique()
for team in unique_teams:
    data[team] = data.apply(
        lambda row: 1 if row['Team'] == team else (-1 if row['Opponent'] == team else 0),
        axis=1
    )

# Drop original 'Team', 'Opponent' and 'Result' columns
data = data.drop(columns=['Team', 'Opponent', 'Result'], errors='ignore')

# Verify the changes
print("\nSample of the encoded dataset:")
print(data.head())


Sample of the encoded dataset:
           Date   xG  xGA  Poss  Attendance Formation Opp Formation  \
1102 2022-08-05  1.2  1.0  56.0     25286.0   4-2-3-1         4-3-3   
1178 2022-08-06  0.6  0.7  35.0     11013.0     3-4-3         4-3-3   
912  2022-08-06  1.5  0.5  58.0     61732.0     3-4-3         5-3-2   
988  2022-08-06  1.7  0.3  61.0     52245.0     4-3-3       3-4-1-2   
1444 2022-08-06  0.8  1.3  40.0     36347.0   4-2-3-1       4-2-3-1   

      Result_Encoded  Crystal Palace  Bournemouth  ...  Manchester City  \
1102            -1.0               1            0  ...                0   
1178             1.0               0            1  ...                0   
912              1.0               0            0  ...                0   
988              1.0               0            0  ...                0   
1444             1.0               0            0  ...                0   

      Arsenal  Southampton  Chelsea  Nottingham  Liverpool  Burnley  \
1102       -1      

## 9. Splitting the Dataset into Training and Testing Sets

### Objective:
- Split the dataset into **training** and **testing** sets while preserving its time-series nature.

### Steps:
1. **Sort by Date:**
   - Ensure the dataset is ordered chronologically to maintain the time-series structure.
   
2. **Define Split Ratio:**
   - Split the data into training (80%) and testing (20%) sets based on a defined ratio.

3. **Separate Features and Target:**
   - Features (`X`): All columns except the target variable and `Date`.
   - Target (`y`): The encoded `Result_Encoded` column.

4. **Output:**
   - Training set size and testing set size are displayed to verify the split.
   - These datasets will be used for training and evaluating the machine learning model.

### Notes:
- `Date` column is excluded from the features for now as it is not a direct predictor.
- Adjust the split ratio as needed for your model's requirements.

In [10]:
# Sort data by date to preserve the time-series nature
data = data.sort_values(by="Date")

# Define the split ratio (e.g., 80% train, 20% test)
split_ratio = 0.8
split_index = int(len(data) * split_ratio)

# Split the data
train_data = data.iloc[:split_index]
test_data = data.iloc[split_index:]

# Separate features (X) and target (y)
X_train = train_data.drop(columns=["Date"])  # Drop Date for now
y_train = train_data["Result_Encoded"]  # Replace with actual target if needed

X_test = test_data.drop(columns=["Date"])
y_test = test_data["Result_Encoded"]  # Replace with actual target if needed

# Check the splits
print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")


Training set size: 736
Test set size: 184


## 10. Installing and Importing TensorFlow Decision Forests

### Objective:
- Install and set up **TensorFlow Decision Forests (TFDF)** for building and training the machine learning model.

### Steps:
1. **Install TFDF:**
   - Use `!pip install tensorflow_decision_forests` to ensure the library is available in the Colab environment.

2. **Import Libraries:**
   - Import `tensorflow_decision_forests` for working with decision forest models.
   - Import `tensorflow` for general machine learning utilities and compatibility.

### Notes:
- Make sure the installation step runs successfully before importing the libraries.
- TensorFlow Decision Forests is specifically designed for decision tree-based models and is a powerful tool for structured data tasks.

In [11]:
# Install TensorFlow Decision Forests
!pip install tensorflow_decision_forests

# Import required libraries
import tensorflow_decision_forests as tfdf
import tensorflow as tf

Collecting tensorflow_decision_forests
  Downloading tensorflow_decision_forests-1.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.0 kB)
Collecting tensorflow==2.18.0 (from tensorflow_decision_forests)
  Downloading tensorflow-2.18.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting wurlitzer (from tensorflow_decision_forests)
  Downloading wurlitzer-3.1.1-py3-none-any.whl.metadata (2.5 kB)
Collecting ydf (from tensorflow_decision_forests)
  Downloading ydf-0.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.2 kB)
Collecting tensorboard<2.19,>=2.18 (from tensorflow==2.18.0->tensorflow_decision_forests)
  Downloading tensorboard-2.18.0-py3-none-any.whl.metadata (1.6 kB)
INFO: pip is looking at multiple versions of tf-keras to determine which version is compatible with other requirements. This could take a while.
Collecting tf-keras~=2.17 (from tensorflow_decision_forests)
  Downloading tf_keras-2.

## 11. Preparing Data for TensorFlow Models

### Objective:
- Combine the features and target labels into a single DataFrame suitable for TensorFlow.
- Ensure numeric columns are properly formatted for training and testing.

### Steps:
1. **Combine Features and Target:**
   - Create `train_data_combined` and `test_data_combined` by merging the features and target labels (`Result_Encoded`).
   - Convert the target label (`Result_Encoded`) to a string format for TensorFlow compatibility.

2. **Ensure Proper Numeric Formatting:**
   - Replace commas with dots in `xG` and `xGA` columns to ensure they are numeric.
   - Verify the updated format by displaying a preview of the relevant columns.

3. **Check Updated Columns:**
   - Print the first few rows of the `xG` and `xGA` columns in the test dataset to confirm the changes.

### Notes:
- This step is essential for ensuring that the data is compatible with TensorFlow Decision Forests.
- Any formatting issues in numeric columns (e.g., `xG` and `xGA`) must be resolved before training.

In [12]:
# Combine features and target into a single DataFrame for TensorFlow
train_data_combined = X_train.copy()
train_data_combined["Result_Encoded"] = y_train.astype(str)  # Update target to Result_Encoded

# Combine test features and target into a single DataFrame for TensorFlow
test_data_combined = X_test.copy()
test_data_combined["Result_Encoded"] = y_test.astype(str)


# Replace commas with dots and convert to float
train_data_combined['xG'] = train_data_combined['xG']
train_data_combined['xGA'] = train_data_combined['xGA']

test_data_combined['xG'] = test_data_combined['xG']
test_data_combined['xGA'] = test_data_combined['xGA']

# Check the updated columns
print("\nUpdated xG and xGA columns:")
print(test_data_combined[['xG', 'xGA']].head())



Updated xG and xGA columns:
      xG  xGA
415  1.7  0.4
606  2.3  0.6
568  1.6  2.6
301  2.2  1.4
225  2.7  0.7


## 12. Training the Gradient Boosted Trees Model

### Objective:
- Use TensorFlow Decision Forests to train a Gradient Boosted Trees model for classification tasks.

### Steps:
1. **Convert Data to TensorFlow Dataset:**
   - Use `tfdf.keras.pd_dataframe_to_tf_dataset` to convert the training DataFrame (`train_data_combined`) into a format compatible with TensorFlow.
   - Specify `Result_Encoded` as the target label.

2. **Initialize the Model:**
   - Define a `GradientBoostedTreesModel` using TensorFlow Decision Forests.
   - Set the task type to `CLASSIFICATION` to predict categorical outcomes.

3. **Train the Model:**
   - Fit the model on the training dataset using `model.fit(train_ds)`.

4. **Model Summary:**
   - Display a detailed summary of the trained model, including the structure and key metrics.

### Notes:
- **Gradient Boosted Trees** is a robust ensemble learning technique suitable for classification tasks.
- Ensure the training data is clean and properly formatted to avoid errors during training.
- Hyperparameters of the model (e.g., the number of trees, learning rate) can be tuned later for optimal performance.

In [13]:
# Convert the DataFrame to a TensorFlow dataset
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_data_combined, label="Result_Encoded")

# Initialize the Gradient Boosted Trees model with valid hyperparameters
model = tfdf.keras.GradientBoostedTreesModel(task=tfdf.keras.Task.CLASSIFICATION)

# Train the model
model.fit(train_ds)

# Display the model's summary
model.summary()






Use /tmp/tmp26yk2od8 as temporary training directory
Reading training dataset...
Training dataset read in 0:00:07.049360. Found 736 examples.
Training model...
Model trained in 0:00:00.553789
Compiling model...
Model compiled.
Model: "gradient_boosted_trees_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
Total params: 1 (1.00 Byte)
Trainable params: 0 (0.00 Byte)
Non-trainable params: 1 (1.00 Byte)
_________________________________________________________________
Type: "GRADIENT_BOOSTED_TREES"
Task: CLASSIFICATION
Label: "__LABEL"

Input Features (30):
	Arsenal
	Aston_Villa
	Attendance
	Bournemouth
	Brentford
	Brighton
	Burnley
	Chelsea
	Crystal_Palace
	Everton
	Formation
	Fulham
	Ipswich
	Leeds
	Leicester
	Liverpool
	Luton
	Manchester_City
	Manchester_United
	Newcastle
	Nottingham
	Opp_Formation
	Poss
	Sheffield
	Southampton
	Tottenham
	West_Ham
	Wolverhampton
	xG
	xGA

No weights

Variable Impo

## 13. Evaluating the Gradient Boosted Trees Model

### Objective:
- Evaluate the trained Gradient Boosted Trees model on the test dataset to assess its performance.

### Steps:
1. **Convert Test Data:**
   - Use `tfdf.keras.pd_dataframe_to_tf_dataset` to convert the test DataFrame (`test_data_combined`) into a TensorFlow dataset format.
   - Specify `Result_Encoded` as the target label for evaluation.

2. **Model Evaluation:**
   - Evaluate the model on the test dataset using `model.evaluate(test_ds)`.
   - Print the evaluation results, which may include metrics like accuracy, log loss, etc., depending on the model's configuration.

### Notes:
- The test dataset should be unseen by the model during training to provide a fair evaluation.
- Use the evaluation results to determine if the model's performance meets expectations or requires further tuning.

### Potential Issue: `Model Evaluation Results: 0.0`
- A result of `0.0` indicates that the model has failed to predict correctly on the test dataset.
- Possible reasons for this issue:
  1. **Data Encoding Mismatch:**
     - The encoding of the labels or features in the test data might not align with the model's expectations.
  2. **Insufficient Training Data:**
     - The training data might not adequately represent the patterns in the test data.
  3. **Overfitting:**
     - The model might have memorized the training data instead of learning generalizable patterns.
  4. **Imbalanced Classes:**
     - If one class dominates the dataset, the model might struggle to predict the minority class.
  5. **Model Configuration:**
     - The hyperparameters of the model (e.g., number of trees, depth) might need tuning.
  6. **Incorrect Label or Feature Setup:**
     - Verify that the `Result_Encoded` label and features in both train and test datasets are correctly set up.


In [14]:
# Convert the test data to a TensorFlow dataset
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_data_combined, label="Result_Encoded")

# Evaluate the model
evaluation = model.evaluate(test_ds)
print("\nModel Evaluation Results:")
print(evaluation)







Model Evaluation Results:
0.0


## 14. Making Predictions with the Trained Model

### Objective:
- Use the trained Gradient Boosted Trees model to make predictions on the test dataset.

### Steps:
1. **Generate Predictions:**
   - Use `model.predict(test_ds)` to generate predictions for the test dataset.
   - The predictions will typically be probabilities for each class if the model's task is classification.

2. **Display Predictions:**
   - Print the predictions to inspect the output for each instance in the test dataset.

### Notes:
- The predictions should correspond to the `Result_Encoded` labels (e.g., Win: `2`, Draw: `1`, Lose: `0`).
- For classification tasks, the output includes probabilities for each class. For instance:
  - `[0.7, 0.2, 0.1]` indicates a 70% chance of the first class, 20% for the second, and 10% for the third.
- If the model's evaluation results were poor, the predictions might not align well with actual outcomes.



In [15]:
# Make predictions
predictions = model.predict(test_ds)

# Display the predictions
print("Predictions:")
print(predictions)


Predictions:
[[0.09444992 0.21564417 0.68990594]
 [0.06277465 0.3547328  0.58249253]
 [0.7308416  0.10344719 0.16571124]
 [0.13193764 0.2060469  0.6620155 ]
 [0.06138319 0.10212296 0.83649385]
 [0.47525805 0.24489927 0.2798427 ]
 [0.3801714  0.2548892  0.36493942]
 [0.86156905 0.06572634 0.0727046 ]
 [0.05179171 0.06220222 0.88600606]
 [0.6449924  0.19833334 0.15667413]
 [0.1104141  0.16055864 0.7290273 ]
 [0.42402285 0.10162788 0.47434932]
 [0.19176312 0.40618733 0.40204954]
 [0.41462934 0.21402624 0.3713444 ]
 [0.21052597 0.18342291 0.6060511 ]
 [0.05454466 0.06423309 0.8812222 ]
 [0.23474777 0.27315268 0.49209958]
 [0.26223132 0.14422713 0.59354144]
 [0.57131016 0.2692622  0.15942766]
 [0.04660321 0.5230025  0.43039423]
 [0.05161078 0.05793957 0.89044964]
 [0.74707526 0.11293733 0.13998744]
 [0.81948113 0.11441227 0.06610657]
 [0.84974563 0.05081836 0.09943601]
 [0.06766795 0.09052868 0.8418033 ]
 [0.26785544 0.5516053  0.1805393 ]
 [0.07347508 0.13274817 0.7937767 ]
 [0.3422346  0.

## 15. Inspecting Model Features

### Objective:
- Display the feature names that the Gradient Boosted Trees model used during training.

### Code Explanation:
1. **Retrieve Feature Names:**
   - The `_semantics` attribute of the model stores metadata about the input features.
   - Using `.keys()`, we can extract the names of all features that were used during training.

2. **Purpose:**
   - Ensures that the input dataset for predictions (`test_ds`) matches the feature set used during training.
   - Helps debug issues if predictions fail due to mismatched features.



In [16]:
# Find out the column names that we used to train our model
print("Model features during training:")
print(model._semantics.keys())

Model features during training:
dict_keys(['xG', 'xGA', 'Poss', 'Attendance', 'Formation', 'Opp_Formation', 'Crystal_Palace', 'Bournemouth', 'Tottenham', 'Newcastle', 'Leeds', 'Fulham', 'Everton', 'Manchester_United', 'West_Ham', 'Leicester', 'Aston_Villa', 'Brighton', 'Wolverhampton', 'Brentford', 'Manchester_City', 'Arsenal', 'Southampton', 'Chelsea', 'Nottingham', 'Liverpool', 'Burnley', 'Sheffield', 'Luton', 'Ipswich'])


## 16. Preparing for Predictions

### Objective:
- Align prediction workflow with the column order used during model training.
- Ensure team names are formatted correctly and consistent between training and prediction datasets.

### Code Explanation:
1. **Retrieve Training Columns:**
   - `model._semantics.keys()` provides the column names used during model training. This ensures the prediction dataset matches the training dataset structure.

2. **Team Name Formatting:**
   - Team names in the `Team` column of `team_stats` are adjusted by replacing spaces with underscores to ensure compatibility with feature names used during training.
   - Example: `"Manchester City"` becomes `"Manchester_City"`.

3. **Team Initialization:**
   - A list of all teams (`teams`) is initialized to validate team inputs during predictions.
   - Ensure that this list is comprehensive and matches the team names in your dataset.



In [17]:

# Step 1: Get the column order used during training
trained_columns = model._semantics.keys()

# Replace spaces with underscores in team names in the "Team" column
team_stats["Team"] = team_stats["Team"].str.replace(" ", "_")

# Initialize the encoder with all teams used in training and testing
teams = ["Arsenal", "Liverpool", "Nottingham", "Chelsea", "Manchester City",
         "Newcastle", "Aston Villa", "Bournemouth", "Brentford", "Fulham",
         "Tottenham", "Brighton", "Manchester United", "West Ham",
         "Crystal Palace", "Wolverhampton", "Everton", "Ipswich", "Leicester",
         "Southampton", "Burnley", "Sheffield", "Luton"]  # Replace with your full team list


print("Prediction Workflow\n")


Prediction Workflow



## 17. Prediction Workflow Implementation

### Objective:
- Create a streamlined prediction workflow for user inputs, leveraging trained model semantics and consistent formatting.

### Code Breakdown:
1. **Step 2: Input Home and Away Team**
   - User inputs for `home_team` and `away_team` are formatted to match the structure in `team_stats` (replacing spaces with underscores).
   - Validation ensures both teams exist in the dataset, prompting for re-entry if invalid.

2. **Fetch Average Stats for Teams**
   - Extract team-specific stats (`xG`, `xGA`, `Poss`, etc.) from `team_stats`.
   - Calculate average possession as a combination of home and away team stats.

3. **Step 3: Team Encoding**
   - Create a dictionary where the `home_team` is encoded as `1`, the `away_team` as `-1`, and all other teams as `0`.

4. **Step 4: Prepare Input Data**
   - Assemble the input data dictionary with required features and encoded team columns.
   - Missing columns from the training dataset are initialized with default values (e.g., `0`).

5. **Step 5: Reorder Columns**
   - Ensure the prediction dataset matches the column order used during model training.

6. **Step 6: TensorFlow Dataset Conversion**
   - Convert the input DataFrame into a TensorFlow-compatible dataset for prediction.

7. **Step 7: Predict Using the Trained Model**
   - Use the trained model to generate predictions for the input data.



In [20]:
# Step 2: Input Home and Away Team
while True:
    try:
        home_team = input("Enter the Home Team: ").replace(" ", "_")
        away_team = input("Enter the Away Team: ").replace(" ", "_")
        if home_team not in team_stats["Team"].values or away_team not in team_stats["Team"].values:
            raise ValueError("Invalid team name. Please enter valid team names from the dataset.")
        break
    except ValueError as e:
        print(e)

# Fetch average stats for both teams
home_team_stats = team_stats[team_stats["Team"] == home_team].iloc[0]
away_team_stats = team_stats[team_stats["Team"] == away_team].iloc[0]

# Step 3: Create encoded columns for teams
team_encoding = {team: 0 for team in teams}
team_encoding[home_team] = 1
team_encoding[away_team] = -1


# Step 4: Prepare Input Data
input_data = {
    "xG": [home_team_stats["xG"]],
    "xGA": [home_team_stats["xGA"]],
    "Poss": [(home_team_stats["Poss"] + away_team_stats["Poss"]) / 2], # Average possession
    "Attendance": [home_team_stats["Attendance"]],
    "Formation": [home_team_stats["Formation"]],
    "Opp Formation": [away_team_stats["Formation"]]
}

# Add encoded team columns
for team, value in team_encoding.items():
    input_data[team] = [value]

# Create DataFrame
input_df = pd.DataFrame(input_data)

# Add missing columns with default values
for col in trained_columns:
    if col not in input_df.columns:
        input_df[col] = 0  # Add missing columns with default value 0

# Step 5: Reorder columns to match training order
input_df = input_df[trained_columns]

# Step 6: Convert to TensorFlow Dataset
input_ds = tfdf.keras.pd_dataframe_to_tf_dataset(input_df)

# Step 7: Predict Using the Trained Model
predictions = model.predict(input_ds)

# Extract probabilities
win_rate, draw_rate, lose_rate = predictions[0][2], predictions[0][1], predictions[0][0]

# Display Results
print("\nPrediction Results:")
print(f"Win Rate: {win_rate:.2f}")
print(f"Draw Rate: {draw_rate:.2f}")
print(f"Lose Rate: {lose_rate:.2f}")


Enter the Home Team: Manchester City
Enter the Away Team: Wolverhampton





Prediction Results:
Win Rate: 0.86
Draw Rate: 0.07
Lose Rate: 0.07


## 18. Load External Data: SofaScore and Last 5 Matches Form Data

### Objective:
- Import two additional datasets:
  1. **SofaScore Player Ratings**: Individual player performance ratings.
  2. **Last 5 Matches Form Data**: Team performance trends from their last five matches.

### Code Explanation:
1. **Step 1: Load SofaScore Player Ratings**
   - Data is imported directly from a shared Google Sheets link in CSV format.
   - Example data includes columns like `Name` (player name) and `Sofa Point` (rating).

2. **Step 2: Load Last 5 Matches Form Data**
   - Data is similarly imported from a Google Sheets link.
   - Example data includes columns like `Team` and `Current Form` (numerical representation of form).

3. **Data Verification**
   - Print the first few rows of each dataset to confirm successful loading and ensure data integrity.



In [26]:
# Step 1: Load SofaScore player ratings
# Google Sheets link (CSV format)
sofascore_file_path = "sofascore_data.xlsx" #Replace this with your file path
sofascore_data = pd.read_csv(sofascore_file_path)

# Step 2: Load last 5 matches form data
# Google Sheets link (CSV format)
form_file_path = "team_form_data.xlsx" #Replace this with your file path
form_data = pd.read_csv(form_file_path)

# Display the first few rows of each dataset to ensure they are loaded correctly
print("SofaScore Player Ratings:")
print(sofascore_data.head())

print("\nLast 5 Matches Form Data:")
print(form_data.head())

SofaScore Player Ratings:
   ID             Name  Sofa Point
0   1      Bukayo Saka        7.93
1   2    Mohamed Salah        7.90
2   3      Cole Palmer        7.85
3   4             Neto        7.85
4   5  Kevin De Bruyne        7.65

Last 5 Matches Form Data:
          Team  Current Form
0      Arsenal            11
1  Aston Villa             7
2  Bournemouth            10
3    Brentford             7
4     Brighton             5


## 19. Input Player and Team Information

### Objective:
- Collect team and player information interactively for the match prediction workflow.

### Code Explanation:
1. **Valid Teams**:
   - A list of valid team names is extracted from the `form_data` dataset.

2. **Validate Player Names**:
   - The `validate_player_name` function ensures that player names exist in the `SofaScore` dataset.
   - Prevents invalid inputs by checking against available data.

3. **Get Player Names**:
   - The `get_team_players` function prompts the user to input 11 player names for each team.
   - If a player name is invalid (not in `SofaScore`), the user is prompted again.

4. **Interactive Input for Teams**:
   - The user is asked to input a valid home and away team.
   - Ensures no duplicate team entries.

5. **Output Verification**:
   - Displays the team names and their respective player lists for verification.



In [27]:
# Define the list of valid teams
valid_teams = form_data['Team'].tolist()

# Function to validate player names
def validate_player_name(player_name, sofascore_data):
    """
    Check if the player name exists in the SofaScore data.
    """
    return player_name in sofascore_data['Name'].values

# Function to get player names for a team
def get_team_players(team_name):
    """
    Prompt the user to enter 11 player names for a specific team.
    """
    players = []
    print(f"Enter player names for {team_name} (11 players):")
    while len(players) < 11:
        player_name = input(f"Enter player {len(players) + 1} name: ")
        if validate_player_name(player_name, sofascore_data):
            players.append(player_name)
        else:
            print("Invalid player name. Please try again.")
    return players

# Prompt user for home and away teams
while True:
    home_team = input("Enter the Home Team Name: ")
    if home_team in valid_teams:
        break
    print("Invalid team name. Please try again.")

# Get players for the home team
home_players = get_team_players(home_team)

while True:
    away_team = input("Enter the Away Team Name: ")
    if away_team in valid_teams and away_team != home_team:
        break
    print("Invalid team name or duplicate team. Please try again.")

# Get players for the away team
away_players = get_team_players(away_team)

# Display the collected data for verification
print("\nHome Team and Players:")
print(f"Team: {home_team}, Players: {home_players}")
print("\nAway Team and Players:")
print(f"Team: {away_team}, Players: {away_players}")

Enter the Home Team Name: Arsenal
Enter player names for Arsenal (11 players):
Enter player 1 name: Bukayo Saka
Enter player 2 name: Kai Havertz
Enter player 3 name: mama
Invalid player name. Please try again.
Enter player 3 name: David Raya
Enter player 4 name: Bukayo Saka
Enter player 5 name: Bukayo Saka
Enter player 6 name: Bukayo Saka
Enter player 7 name: Bukayo Saka
Enter player 8 name: Bukayo Saka
Enter player 9 name: Bukayo Saka
Enter player 10 name: Bukayo Saka
Enter player 11 name: Bukayo Saka
Enter the Away Team Name: Liverpool
Enter player names for Liverpool (11 players):
Enter player 1 name: Bukayo Saka
Enter player 2 name: Bukayo Saka
Enter player 3 name: Bukayo Saka
Enter player 4 name: Bukayo Saka
Enter player 5 name: Bukayo Saka
Enter player 6 name: Bukayo Saka
Enter player 7 name: Bukayo Saka
Enter player 8 name: a
Invalid player name. Please try again.
Enter player 8 name: Bukayo Saka
Enter player 9 name: Bukayo Saka
Enter player 10 name: Bukayo Saka
Enter player 11 

## 20. Calculating Team SofaScore Averages

### Objective:
- Compute the average SofaScore points for a team based on player performance.

### Code Explanation:
1. **Function Definition**:
   - `calculate_team_score`:
     - Iterates through the list of players for a given team.
     - Validates each player's name against the `SofaScore` dataset.
     - Accumulates the SofaScore points for valid players and calculates the average.

2. **Validation**:
   - Ensures only valid players (present in the dataset) contribute to the average.
   - Issues a warning for invalid or missing players.

3. **Average Calculation**:
   - If valid players exist, the average is computed.
   - If no valid players are found, the average is set to `0`.

4. **Display Results**:
   - Outputs the calculated average SofaScore for the home and away teams.



In [28]:
# Function to calculate average SofaScore points for a team
def calculate_team_score(team_name, players, sofascore_data):
    """
    Calculate the average SofaScore points for a given team based on its players.
    """
    total_score = 0
    valid_players = 0

    print(f"\nCalculating SofaScore for {team_name}...")

    for player in players:
        # Check if the player exists in the SofaScore data
        if player in sofascore_data['Name'].values:
            # Get the player's score
            score = sofascore_data.loc[sofascore_data['Name'] == player, 'Sofa Point'].values[0]
            total_score += score
            valid_players += 1
        else:
            print(f"Warning: {player} not found in SofaScore data.")

    # Calculate the average score
    if valid_players > 0:
        average_score = total_score / valid_players
    else:
        average_score = 0  # Handle case where no valid players are found

    print(f"Average SofaScore for {team_name}: {average_score:.2f}")
    return average_score

# Calculate scores for both teams
home_team_sofa_score = calculate_team_score(home_team, home_players, sofascore_data)
away_team_sofa_score = calculate_team_score(away_team, away_players, sofascore_data)

# Display calculated scores
print("\nTeam Scores:")
print(f"{home_team}: {home_team_sofa_score:.2f}")
print(f"{away_team}: {away_team_sofa_score:.2f}")


Calculating SofaScore for Arsenal...
Average SofaScore for Arsenal: 7.78

Calculating SofaScore for Liverpool...
Average SofaScore for Liverpool: 7.93

Team Scores:
Arsenal: 7.78
Liverpool: 7.93


## 21. Calculating Final Predictions

### Objective:
- Combine the model predictions, SofaScore data, and recent form data to produce final match outcome probabilities.

### Code Explanation:
1. **Weights**:
   - `SOFASCORE_WEIGHT`: Weight assigned to SofaScore differences between teams.
   - `FORM_WEIGHT`: Weight assigned to recent form differences between teams.
   - `MODEL_WEIGHT`: Weight assigned to the original model predictions.
   - These weights are adjustable for experimentation and are verified to sum to 1.

2. **Function Definition**:
   - `calculate_final_prediction`:
     - Extracts raw probabilities from the model's predictions.
     - Calculates normalized differences in SofaScore and Form between the home and away teams.
     - Scales these differences into probabilities (e.g., scaling factor `0.05` or `0.1` can be tweaked).
     - Combines all probabilities using the defined weights.

3. **Normalization**:
   - Ensures the combined probabilities (`final_home_win`, `final_draw`, `final_away_win`) sum to 1.

4. **Example Usage**:
   - Retrieves form data for the home and away teams from the `form_data` DataFrame.
   - Uses `model.predict()` to get raw probabilities for the match.
   - Combines all components to calculate the final probabilities.

5. **Output**:
   - Final probabilities for each possible match outcome are displayed:
     ```
     Final Prediction Results:
     Home Win Probability: 0.48
     Draw Probability: 0.28
     Away Win Probability: 0.24
     ```

### Notes:
- **Scalability**:
  - The function can be extended to include additional metrics or features for prediction.
- **Adjustable Weights**:
  - Experiment with different weight configurations to optimize prediction accuracy.
- **Validation**:
  - The function ensures probabilities remain normalized regardless of input variations.

In [29]:
# Define weights for SofaScore and Form
# You can tweak these weights to adjust the impact of SofaScore and Form on the final prediction
SOFASCORE_WEIGHT = 0.4
FORM_WEIGHT = 0.3
MODEL_WEIGHT = 0.3

# Verify weights sum up to 1
assert SOFASCORE_WEIGHT + FORM_WEIGHT + MODEL_WEIGHT == 1.0, "Weights must sum up to 1."

# Combine the data
def calculate_final_prediction(home_team_sofa_score, away_team_sofa_score, home_team_form, away_team_form, model_predictions):
    """
    Calculate the final prediction by combining model predictions, SofaScore, and Form data.
    """
    # Extract model prediction probabilities
    home_win_prob, draw_prob, away_win_prob = model_predictions[0][2], model_predictions[0][1], model_predictions[0][0]

    # Normalize SofaScore and Form differences
    sofa_diff = home_team_sofa_score - away_team_sofa_score
    form_diff = home_team_form - away_team_form

    # Convert differences to probabilities (simple scaling)
    sofa_prob = max(0, min(1, 0.5 + 0.05 * sofa_diff))  # Adjust scaling factor as needed
    form_prob = max(0, min(1, 0.5 + 0.1 * form_diff))   # Adjust scaling factor as needed

    # Combine probabilities using the weights
    final_home_win = (home_win_prob * MODEL_WEIGHT) + (sofa_prob * SOFASCORE_WEIGHT) + (form_prob * FORM_WEIGHT)
    final_away_win = (away_win_prob * MODEL_WEIGHT) + ((1 - sofa_prob) * SOFASCORE_WEIGHT) + ((1 - form_prob) * FORM_WEIGHT)
    final_draw = draw_prob * MODEL_WEIGHT  # Draw remains unaffected by sofa/form

    # Normalize the final probabilities
    total = final_home_win + final_draw + final_away_win
    final_home_win /= total
    final_draw /= total
    final_away_win /= total

    return final_home_win, final_draw, final_away_win

# Example Usage
# Assuming model_predictions, SofaScore, and Form data are already calculated
home_team_form = form_data.loc[form_data['Team'] == home_team, 'Current Form'].values[0]
away_team_form = form_data.loc[form_data['Team'] == away_team, 'Current Form'].values[0]

# Predict with the trained model
model_predictions = model.predict(input_ds)

# Calculate final probabilities
final_home_win, final_draw, final_away_win = calculate_final_prediction(
    home_team_sofa_score, away_team_sofa_score, home_team_form, away_team_form, model_predictions
)

# Display the final probabilities
print("\nFinal Prediction Results:")
print(f"Home Win Probability: {final_home_win:.2f}")
print(f"Draw Probability: {final_draw:.2f}")
print(f"Away Win Probability: {final_away_win:.2f}")




Final Prediction Results:
Home Win Probability: 0.69
Draw Probability: 0.02
Away Win Probability: 0.28


## Comprehensive Prediction Workflow

### Purpose:
This code integrates multiple data sources (SofaScore player ratings, recent team form, and machine learning predictions) to generate a final probability of match outcomes (home win, draw, away win).

---

### **1. Team and Player Input**
1. **Valid Teams**:
   - A list of valid team names is derived from the form data to ensure user inputs are restricted to known teams.
   
2. **Player Validation**:
   - The `validate_player_name` function checks if entered player names exist in the SofaScore dataset.

3. **User Prompts**:
   - Users are prompted to enter home and away teams, followed by 11 player names for each team.
   - Invalid inputs trigger retry messages, ensuring only valid data is accepted.

---

### **2. Calculating SofaScore for Teams**
- **`calculate_team_score`**:
  - Iterates through the list of entered players for each team.
  - For valid players:
    - Retrieves SofaScore ratings from the dataset.
    - Calculates the average SofaScore for the team.
  - Handles missing players gracefully with warnings.

---

### **3. Prediction Weighting**
1. **Weights**:
   - `SOFASCORE_WEIGHT`: Importance of SofaScore ratings.
   - `FORM_WEIGHT`: Importance of recent form.
   - `MODEL_WEIGHT`: Importance of machine learning predictions.
   - Ensures weights sum to 1 for accurate probability normalization.

2. **Probability Normalization**:
   - SofaScore and form differences are scaled to probabilities.
   - Combined with raw model predictions using defined weights.

3. **`calculate_final_prediction`**:
   - Combines SofaScore, form, and model predictions.
   - Includes additional adjustments for draw probabilities.
   - Normalizes the results to ensure the sum of probabilities equals 1.

---

### **4. Model Predictions**
- Machine learning model predicts raw probabilities for match outcomes (home win, draw, away win).
- These are combined with SofaScore and form-based probabilities.

---

### **5. Example Usage**
1. **Inputs**:
   - Team names and player lists.
   - SofaScore data (loaded from a CSV).
   - Form data (loaded from a CSV).

2. **Workflow**:
   - Fetch team-specific SofaScore and form data.
   - Generate raw model predictions using `model.predict`.
   - Combine all inputs to calculate final probabilities.

3. **Output**:
  Team Scores:
Arsenal: 7.93
Manchester City: 7.93
1/1 [==============================] - 0s 63ms/step

Final Prediction Results:
Home Win Probability: 0.55
Draw Probability: 0.33
Away Win Probability: 0.12

In [34]:
# Define the list of valid teams
valid_teams = form_data['Team'].tolist()

# Function to validate player names
def validate_player_name(player_name, sofascore_data):
    """
    Check if the player name exists in the SofaScore data.
    """
    return player_name in sofascore_data['Name'].values

# Function to get player names for a team
def get_team_players(team_name):
    """
    Prompt the user to enter 11 player names for a specific team.
    """
    players = []
    print(f"Enter player names for {team_name} (11 players):")
    while len(players) < 11:
        player_name = input(f"Enter player {len(players) + 1} name: ")
        if validate_player_name(player_name, sofascore_data):
            players.append(player_name)
        else:
            print("Invalid player name. Please try again.")
    return players

# Prompt user for home and away teams
while True:
    home_team = input("Enter the Home Team Name: ")
    if home_team in valid_teams:
        break
    print("Invalid team name. Please try again.")

# Get players for the home team
home_players = get_team_players(home_team)

while True:
    away_team = input("Enter the Away Team Name: ")
    if away_team in valid_teams and away_team != home_team:
        break
    print("Invalid team name or duplicate team. Please try again.")

# Get players for the away team
away_players = get_team_players(away_team)

# Display the collected data for verification
print("\nHome Team and Players:")
print(f"Team: {home_team}, Players: {home_players}")
print("\nAway Team and Players:")
print(f"Team: {away_team}, Players: {away_players}")

# Function to calculate average SofaScore points for a team
def calculate_team_score(team_name, players, sofascore_data):
    """
    Calculate the average SofaScore points for a given team based on its players.
    """
    total_score = 0
    valid_players = 0

    print(f"\nCalculating SofaScore for {team_name}...")

    for player in players:
        # Check if the player exists in the SofaScore data
        if player in sofascore_data['Name'].values:
            # Get the player's score
            score = sofascore_data.loc[sofascore_data['Name'] == player, 'Sofa Point'].values[0]
            total_score += score
            valid_players += 1
        else:
            print(f"Warning: {player} not found in SofaScore data.")

    # Calculate the average score
    if valid_players > 0:
        average_score = total_score / valid_players
    else:
        average_score = 0  # Handle case where no valid players are found

    print(f"Average SofaScore for {team_name}: {average_score:.2f}")
    return average_score

# Calculate scores for both teams
home_team_sofa_score = calculate_team_score(home_team, home_players, sofascore_data)
away_team_sofa_score = calculate_team_score(away_team, away_players, sofascore_data)

# Display calculated scores
print("\nTeam Scores:")
print(f"{home_team}: {home_team_sofa_score:.2f}")
print(f"{away_team}: {away_team_sofa_score:.2f}")

# Define weights for SofaScore and Form
SOFASCORE_WEIGHT = 0.4
FORM_WEIGHT = 0.3
MODEL_WEIGHT = 0.3


# Verify weights sum up to 1
assert SOFASCORE_WEIGHT + FORM_WEIGHT + MODEL_WEIGHT == 1.0, "Weights must sum up to 1."

# Combine the data
def calculate_final_prediction(home_team_sofa_score, away_team_sofa_score, home_team_form, away_team_form, model_predictions):
    """
    Calculate the final prediction by combining model predictions, SofaScore, and Form data.
    """
    # Extract model prediction probabilities
    home_win_prob, draw_prob, away_win_prob = model_predictions[0][2], model_predictions[0][1], model_predictions[0][0]

    # Normalize SofaScore and Form differences
    sofa_diff = home_team_sofa_score - away_team_sofa_score
    form_diff = home_team_form - away_team_form

    # Convert differences to probabilities (simple scaling)
    sofa_prob = max(0, min(1, 0.5 + 0.05 * sofa_diff))  # Adjust scaling factor as needed
    form_prob = max(0, min(1, 0.5 + 0.1 * form_diff))   # Adjust scaling factor as needed

    # Combine probabilities using the weights
    final_home_win = (home_win_prob * MODEL_WEIGHT) + (sofa_prob * SOFASCORE_WEIGHT) + (form_prob * FORM_WEIGHT)
    final_away_win = (away_win_prob * MODEL_WEIGHT) + ((1 - sofa_prob) * SOFASCORE_WEIGHT) + ((1 - form_prob) * FORM_WEIGHT)
    final_draw = draw_prob * MODEL_WEIGHT + (0.5 * (sofa_prob + form_prob)) * (SOFASCORE_WEIGHT + FORM_WEIGHT)

    # Normalize the final probabilities
    total = final_home_win + final_draw + final_away_win
    final_home_win /= total
    final_draw /= total
    final_away_win /= total

    return final_home_win, final_draw, final_away_win

# Example Usage
home_team_form = form_data.loc[form_data['Team'] == home_team, 'Current Form'].values[0]
away_team_form = form_data.loc[form_data['Team'] == away_team, 'Current Form'].values[0]

# Predict with the trained model
model_predictions = model.predict(input_ds)

# Calculate final probabilities
final_home_win, final_draw, final_away_win = calculate_final_prediction(
    home_team_sofa_score, away_team_sofa_score, home_team_form, away_team_form, model_predictions
)

# Display the final probabilities
print("\nFinal Prediction Results:")
print(f"Home Win Probability: {final_home_win:.2f}")
print(f"Draw Probability: {final_draw:.2f}")
print(f"Away Win Probability: {final_away_win:.2f}")

Enter the Home Team Name: Arsenal
Enter player names for Arsenal (11 players):
Enter player 1 name: Bukayo Saka
Enter player 2 name: Bukayo Saka
Enter player 3 name: Bukayo Saka
Enter player 4 name: Bukayo Saka
Enter player 5 name: Bukayo Saka
Enter player 6 name: Bukayo Saka
Enter player 7 name: Bukayo Saka
Enter player 8 name: Bukayo Saka
Enter player 9 name: Bukayo Saka
Enter player 10 name: Bukayo Saka
Enter player 11 name: Bukayo Saka
Enter the Away Team Name: Manchester City
Enter player names for Manchester City (11 players):
Enter player 1 name: Bukayo Saka
Enter player 2 name: Bukayo Saka
Enter player 3 name: Bukayo Saka
Enter player 4 name: Bukayo Saka
Enter player 5 name: Bukayo Saka
Enter player 6 name: Bukayo Saka
Enter player 7 name: Bukayo Saka
Enter player 8 name: Bukayo Saka
Enter player 9 name: Bukayo Saka
Enter player 10 name: Bukayo Saka
Enter player 11 name: Bukayo Saka

Home Team and Players:
Team: Arsenal, Players: ['Bukayo Saka', 'Bukayo Saka', 'Bukayo Saka', 'B