In [2]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))


Saving ligue-1-matches.csv to ligue-1-matches.csv
User uploaded file "ligue-1-matches.csv" with length 624554 bytes


**Loading data:** At the beginning of the code, the required libraries are imported, and then a CSV file containing data from French league matches is loaded.

In [3]:
import pandas as pd

df = pd.read_csv('ligue-1-matches.csv', index_col=0)  

In [4]:
df.head()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,match report,notes,sh,sot,dist,fk,pk,pkatt,season,team
1,2022-08-06,21:00,Ligue 1,Matchweek 1,Sat,Away,W,5.0,0.0,Clermont Foot,...,Match Report,,18.0,12.0,12.9,1.0,0.0,0.0,2023,Paris Saint Germain
2,2022-08-13,21:00,Ligue 1,Matchweek 2,Sat,Home,W,5.0,2.0,Montpellier,...,Match Report,,18.0,8.0,18.2,3.0,1.0,2.0,2023,Paris Saint Germain
3,2022-08-21,20:45,Ligue 1,Matchweek 3,Sun,Away,W,7.0,1.0,Lille,...,Match Report,,16.0,9.0,11.9,0.0,0.0,0.0,2023,Paris Saint Germain
4,2022-08-28,20:45,Ligue 1,Matchweek 4,Sun,Home,D,1.0,1.0,Monaco,...,Match Report,,17.0,4.0,18.7,0.0,1.0,1.0,2023,Paris Saint Germain
5,2022-08-31,21:00,Ligue 1,Matchweek 5,Wed,Away,W,3.0,0.0,Toulouse,...,Match Report,,20.0,12.0,14.8,2.0,0.0,0.0,2023,Paris Saint Germain


In [5]:
df.columns

Index(['date', 'time', 'comp', 'round', 'day', 'venue', 'result', 'gf', 'ga',
       'opponent', 'xg', 'xga', 'poss', 'attendance', 'captain', 'formation',
       'referee', 'match report', 'notes', 'sh', 'sot', 'dist', 'fk', 'pk',
       'pkatt', 'season', 'team'],
      dtype='object')

#Data preprocessing:

Removal of redundant columns: Only necessary columns are selected to be used for analysis and prediction of results.

In [6]:
columns_to_keep = ['venue', 'opponent', 'xg', 'xga', 'poss', 'attendance', 'sh', 'sot', 
                   'formation', 'fk', 'pk', 'pkatt', 'season', 'date', 'team', 'result']
df = df[columns_to_keep]


Quick overview of selected columns: 
1. **venue**: The venue of the match may affect the result. Teams often play better at home.
2. **opponent**: The strength of the opponent affects the probability of winning.
3. **xg and xga**: These are the "expected goals" stats for the team and its opponent. They can provide information on how well a team has performed in the past.
4. **poss**: Possession of the ball can be a significant factor in the outcome of a match.
5. **attendance**: Match attendance can affect the atmosphere of the match and the final result.
6. **sh and sot**: Shots and shots on goal can be a good indicator of a team's offensive capabilities.
7. **formation**: The formation a team plays in can affect its effectiveness.
8. **fk, pk and pkatt**: Free kicks, penalty kicks and penalty kick attempts can affect the outcome of a match.
9. **season and date**: Time-related trends, such as a team's form in different seasons or at different points in a season, can affect the outcome of a match.


In [7]:
df.head()

Unnamed: 0,venue,opponent,xg,xga,poss,attendance,sh,sot,formation,fk,pk,pkatt,season,date,team,result
1,Away,Clermont Foot,3.5,0.3,62.0,12203.0,18.0,12.0,3-4-3,1.0,0.0,0.0,2023,2022-08-06,Paris Saint Germain,W
2,Home,Montpellier,3.2,0.9,59.0,46000.0,18.0,8.0,3-4-1-2,3.0,1.0,2.0,2023,2022-08-13,Paris Saint Germain,W
3,Away,Lille,3.4,1.7,52.0,47526.0,16.0,9.0,3-4-3,0.0,0.0,0.0,2023,2022-08-21,Paris Saint Germain,W
4,Home,Monaco,2.7,1.2,67.0,46000.0,17.0,4.0,3-4-3,0.0,1.0,1.0,2023,2022-08-28,Paris Saint Germain,D
5,Away,Toulouse,2.8,0.8,62.0,31700.0,20.0,12.0,3-4-3,2.0,0.0,0.0,2023,2022-08-31,Paris Saint Germain,W


In [8]:
df.columns

Index(['venue', 'opponent', 'xg', 'xga', 'poss', 'attendance', 'sh', 'sot',
       'formation', 'fk', 'pk', 'pkatt', 'season', 'date', 'team', 'result'],
      dtype='object')

In [9]:
df.isna().sum()

venue           0
opponent        0
xg              0
xga             0
poss            0
attendance    912
sh              0
sot             0
formation       0
fk              0
pk              0
pkatt           0
season          0
date            0
team            0
result          0
dtype: int64

In [10]:
df.dtypes

venue          object
opponent       object
xg            float64
xga           float64
poss          float64
attendance    float64
sh            float64
sot           float64
formation      object
fk            float64
pk            float64
pkatt         float64
season          int64
date           object
team           object
result         object
dtype: object

Supplementing missing data:

You can either delete these rows or fill in the missing values. Deleting the rows is not a good solution because you can lose a lot of important data. Instead, we can fill in the missing values with the mean frequency or zero.

In [11]:
df['attendance'] = df['attendance'].fillna(df['attendance'].mean()) 
# or df['attendance'].fillna(0)


In [12]:
df.isna().sum()

venue         0
opponent      0
xg            0
xga           0
poss          0
attendance    0
sh            0
sot           0
formation     0
fk            0
pk            0
pkatt         0
season        0
date          0
team          0
result        0
dtype: int64

Encoding categorical columns:

'*venue*', '*opponent*', '*formation*', '*team*' and '*resutl*' columns are categorical and must be encoded. We can use the LabelEncoder from the sklearn library for this encoding.

In [13]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df['venue'] = le.fit_transform(df['venue'])
df['opponent'] = le.fit_transform(df['opponent'])
df['formation'] = le.fit_transform(df['formation'])
df['team'] = le.fit_transform(df['team'])
df['result'] = le.fit_transform(df['result'])


Creating new features:

creating new columns based on the average values of "xg" and "xga" for each team. The average "xg" and "xga" for the team can give us information about the team's overall performance in the past, which can be useful for our model. 



In [14]:
# Calculate mean 'xg' and 'xga' for each team
mean_xg = df.groupby('team')['xg'].mean()
mean_xga = df.groupby('team')['xga'].mean()

# Create new features
df['mean_team_xg'] = df['team'].map(mean_xg)
df['mean_team_xga'] = df['team'].map(mean_xga)

# Drop the original 'xg' and 'xga' columns
#model learned better with these columns
#df = df.drop(columns=['xg', 'xga'])


In [15]:
df.head()

Unnamed: 0,venue,opponent,xg,xga,poss,attendance,sh,sot,formation,fk,pk,pkatt,season,date,team,result,mean_team_xg,mean_team_xga
1,0,7,3.5,0.3,62.0,12203.0,18.0,12.0,4,1.0,0.0,0.0,2023,2022-08-06,21,2,2.221788,1.059218
2,1,17,3.2,0.9,59.0,46000.0,18.0,8.0,3,3.0,1.0,2.0,2023,2022-08-13,21,2,2.221788,1.059218
3,0,11,3.4,1.7,52.0,47526.0,16.0,9.0,4,0.0,0.0,0.0,2023,2022-08-21,21,2,2.221788,1.059218
4,1,16,2.7,1.2,67.0,46000.0,17.0,4.0,4,0.0,1.0,1.0,2023,2022-08-28,21,0,2.221788,1.059218
5,0,26,2.8,0.8,62.0,31700.0,20.0,12.0,4,2.0,0.0,0.0,2023,2022-08-31,21,2,2.221788,1.059218


In [16]:
df.columns

Index(['venue', 'opponent', 'xg', 'xga', 'poss', 'attendance', 'sh', 'sot',
       'formation', 'fk', 'pk', 'pkatt', 'season', 'date', 'team', 'result',
       'mean_team_xg', 'mean_team_xga'],
      dtype='object')

Change the date format to a Pandas date:

In [17]:
df['date'] = pd.to_datetime(df['date'])

In [18]:
df.head()

Unnamed: 0,venue,opponent,xg,xga,poss,attendance,sh,sot,formation,fk,pk,pkatt,season,date,team,result,mean_team_xg,mean_team_xga
1,0,7,3.5,0.3,62.0,12203.0,18.0,12.0,4,1.0,0.0,0.0,2023,2022-08-06,21,2,2.221788,1.059218
2,1,17,3.2,0.9,59.0,46000.0,18.0,8.0,3,3.0,1.0,2.0,2023,2022-08-13,21,2,2.221788,1.059218
3,0,11,3.4,1.7,52.0,47526.0,16.0,9.0,4,0.0,0.0,0.0,2023,2022-08-21,21,2,2.221788,1.059218
4,1,16,2.7,1.2,67.0,46000.0,17.0,4.0,4,0.0,1.0,1.0,2023,2022-08-28,21,0,2.221788,1.059218
5,0,26,2.8,0.8,62.0,31700.0,20.0,12.0,4,2.0,0.0,0.0,2023,2022-08-31,21,2,2.221788,1.059218


In the first step of this project, the Random Forest model was used as a starting point for predicting match results.



---

Data up to August 5, 2022 were used as training data, and data after that date (the last 2022/2022 season) as test data.

In [19]:
train = df[df["date"] < '2022-08-05']
test = df[df["date"] >= '2022-08-05']


Then it was decided what the goal would be (i.e. what we want to predict). We want to predict the outcome of the match (that is, the "result" column).

In [20]:
y_train = train["result"]
X_train = train.drop("result", axis=1)

y_test = test["result"]
X_test = test.drop("result", axis=1)


In [96]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)
rf.fit(X_train, y_train)


TypeError: ignored

You can convert a date to numerical characteristics such as year, month, day of the month, day of the week, etc. Depending on the context, different aspects of the date may be important. In the context of predicting the results of matches, the day of the week, the month and maybe even the year can be important.

In [21]:
#Convert 'date' column to datetime type
X_train['date'] = pd.to_datetime(X_train['date'])
X_test['date'] = pd.to_datetime(X_test['date'])

#Extract features from the date
X_train['year'] = X_train['date'].dt.year
X_train['month'] = X_train['date'].dt.month
X_train['day'] = X_train['date'].dt.day
X_train['dayofweek'] = X_train['date'].dt.dayofweek

X_test['year'] = X_test['date'].dt.year
X_test['month'] = X_test['date'].dt.month
X_test['day'] = X_test['date'].dt.day
X_test['dayofweek'] = X_test['date'].dt.dayofweek

#Drop the original 'date' column
X_train = X_train.drop('date', axis=1)
X_test = X_test.drop('date', axis=1)


Random Forest model is created and trained on the training data.

In [22]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, min_samples_split=10, random_state=1)
rf.fit(X_train, y_train)


In [23]:
preds = rf.predict(X_test)


The accuracy of the model is calculated.

In [24]:
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test, preds)
print("Accuracy: ", acc)


Accuracy:  0.6052631578947368


Using other metrics to evaluate our RandomForest model:

**Precision**: the ratio of true positives to the sum of true positives and false positives. Precision is a measure that tells us how well the model identifies only significant cases.

**Recall**: the ratio of true positives to the sum of true positives and false negatives. Recall tells us how well the model identifies all relevant cases.

**F1 Score**: the harmonic mean of precision and recall. For the F1 Score to be high, both precision and recall must be high.

**Confusion Matrix**: a table that describes the performance of a classification model on a set of data for which the truth is known. This allows you to easily understand what errors the model is making.

In [26]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_pred = rf.predict(X_test)

print('Precision:', precision_score(y_test, y_pred, average='macro'))
print('Recall:', recall_score(y_test, y_pred, average='macro'))
print('F1 Score:', f1_score(y_test, y_pred, average='macro'))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))


Precision: 0.5607786157741554
Recall: 0.5539955716586151
F1 Score: 0.536248126149224
Confusion Matrix:
 [[ 33  83  68]
 [ 21 204  63]
 [ 22  43 223]]


Overall accuracy is 0.61, which means that the model correctly predicts about 61% of the cases. For the problem of predicting the results of football matches, this is quite a good result, because it is a task with high uncertainty.

The average precision of the model is 0.56. This means that when the model predicts a team to win, it is correct 56% of the time.

The average Recall of the model is 0.55. This means that out of all the matches that a given team actually won, the model correctly predicted 55% of them. 

The average F1 value is 0.54, which is the harmonic mean of precision and sensitivity. F1 is useful when the costs of false positives and false negatives are uncertain, and it is important to balance precision and sensitivity.

This matrix shows what errors the model makes. From your results, the model tends to predict second class, which may suggest that the model is somewhat biased.


*Overall, the results suggest that the model has some ability to predict the outcome of French league matches, but also can be improved. May be considered using other modeling techniques, adding more features that can help the model better understand the data, or fine-tuning the parameters of the model.*



---

We now turn to the use of the XGBoost model. XGBoost is an implementation of the Gradient Boosting algorithm which is very effective in many classification problems. Here's how we can apply XGBoost to this problem:

In [27]:
from xgboost import XGBClassifier
from sklearn.metrics import classification_report

In [28]:
#Create an XGBoost model
xgb = XGBClassifier(random_state=1)

In [29]:
#Train the model
xgb.fit(X_train, y_train)

In [32]:
#Use the model to make predictions on test data
xgb_preds = xgb.predict(X_test)

In [33]:
#Print the metrics
print('Accuracy: ', accuracy_score(y_test, xgb_preds))
print('Precision: ', precision_score(y_test, xgb_preds, average='macro'))
print('Recall: ', recall_score(y_test, xgb_preds, average='macro'))
print('F1 Score: ', f1_score(y_test, xgb_preds, average='macro'))
print('Confusion Matrix:\n', confusion_matrix(y_test, xgb_preds))

Accuracy:  0.6118421052631579
Precision:  0.5713982120483286
Recall:  0.5735205314009661
F1 Score:  0.5691309996036993
Confusion Matrix:
 [[ 54  73  57]
 [ 42 200  46]
 [ 40  37 211]]


The accuracy of the XGBoost model is 0.61, which is slightly higher than the accuracy of the RandomForest model (0.60). This suggests that the XGBoost model is slightly better at predicting the results of French league matches.

The average precision of the XGBoost model is 0.57, which is slightly higher than that of the RandomForest model (0.56). This means that the XGBoost model is slightly more accurate in predicting match winners.

The average sensitivity of the XGBoost model is 0.57, which is also slightly higher compared to the RandomForest model (0.55). This means that the XGBoost model is better at identifying real winners.

The average F1 value for the XGBoost model is 0.57, which is slightly higher compared to the RandomForest model (0.54). This means that the XGBoost model better balances precision and sensitivity.

Looking at the confusion matrix, it can be seen that the number of false predictions has decreased for each class compared to the RandomForest model. This suggests that the XGBoost model is a bit more reliable.

In conclusion, the XGBoost model seems to improve the results compared to the RandomForest model in the context of predicting the results of French league matches based on the provided metrics. However, it is important to note that these differences are minor and both models have similar results. Still, XGBoost seems to be a better choice based on the data provided.






---
**Creating own model:**


In [34]:
df.head()

Unnamed: 0,venue,opponent,xg,xga,poss,attendance,sh,sot,formation,fk,pk,pkatt,season,date,team,result,mean_team_xg,mean_team_xga
1,0,7,3.5,0.3,62.0,12203.0,18.0,12.0,4,1.0,0.0,0.0,2023,2022-08-06,21,2,2.221788,1.059218
2,1,17,3.2,0.9,59.0,46000.0,18.0,8.0,3,3.0,1.0,2.0,2023,2022-08-13,21,2,2.221788,1.059218
3,0,11,3.4,1.7,52.0,47526.0,16.0,9.0,4,0.0,0.0,0.0,2023,2022-08-21,21,2,2.221788,1.059218
4,1,16,2.7,1.2,67.0,46000.0,17.0,4.0,4,0.0,1.0,1.0,2023,2022-08-28,21,0,2.221788,1.059218
5,0,26,2.8,0.8,62.0,31700.0,20.0,12.0,4,2.0,0.0,0.0,2023,2022-08-31,21,2,2.221788,1.059218


In [35]:
#Importing the necessary libraries
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import regularizers

from sklearn.preprocessing import StandardScaler
from keras.utils import to_categorical

#Data scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#Converting labels to one-hot form
y_train_categorical = to_categorical(y_train)
y_test_categorical = to_categorical(y_test)

#Define the model
model = Sequential()
input_dim = X_train.shape[1]  #number of features

model.add(Dense(64, input_dim=input_dim, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(0.2))  # Dodaj warstwę Dropout
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))  # Dodaj warstwę Dropout
model.add(Dense(16, activation='relu'))
model.add(Dense(3, activation='softmax'))

#Model compile
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

#Model training
history = model.fit(X_train, y_train_categorical, validation_data=(X_test, y_test_categorical), epochs=100, batch_size=10)


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

The sequential model from Keras was used, which means that the layers were added one after the other. The first layer is the Dense layer (full connection layer) with 64 neurons, ReLU activation function and L2 regularization. Added a Dropout layer with a rate of 0.2, which means 20% of neurons are randomly turned off during training to prevent overfitting. The addition of Dense and Dropout layers continued, and the last layer is a Dense layer with 3 neurons (one per class) and a softmax activation function, which is typically used in multi-class classification.

In [36]:
#Rate the model
loss, accuracy = model.evaluate(X_test, y_test_categorical, verbose=0)

print('Test loss:', loss)
print('Test accuracy:', accuracy)


Test loss: 1.0558794736862183
Test accuracy: 0.5789473652839661


In [37]:
from numpy import argmax

predictions = model.predict(X_test)

predicted_labels = argmax(predictions, axis=1)




In [38]:
from sklearn.metrics import classification_report, confusion_matrix

#Generate classification report
print(classification_report(y_test, predicted_labels))

#Generate confusion matrix
print(confusion_matrix(y_test, predicted_labels))


              precision    recall  f1-score   support

           0       0.35      0.41      0.38       184
           1       0.64      0.64      0.64       288
           2       0.72      0.63      0.67       288

    accuracy                           0.58       760
   macro avg       0.57      0.56      0.56       760
weighted avg       0.60      0.58      0.59       760

[[ 76  66  42]
 [ 75 183  30]
 [ 68  39 181]]


Create the table with the real and predicted result for one team (in this example Paris Saint-Germain):

In [40]:
team_encoder = LabelEncoder()
team_encoder.fit(df['team'])
team_decoder = {code: team for code, team in enumerate(team_encoder.classes_)}

opponent_encoder = LabelEncoder()
opponent_encoder.fit(df['opponent'])
opponent_decoder = {code: opponent for code, opponent in enumerate(opponent_encoder.classes_)}

#Create a new DataFrame with match information
results_df = pd.DataFrame(columns=['team_A', 'team_B', 'home/away', 'date', 'result', 'predicted_result'])

team_code = 21
team_name = team_decoder[team_code]

import numpy as np

for i, row in test.iterrows():
    if row['team'] == team_code or row['opponent'] == team_code:
        team_A = team_decoder[row['team']]
        team_B = opponent_decoder[row['opponent']]
        home_away = 'Home' if row['venue'] == 'home' else 'Away'
        date = row['date']
        result = row['result']
        predicted_probs = model.predict(np.expand_dims(X_test[i], axis=0))[0]
        predicted_result = np.argmax(predicted_probs)

        results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
                                        'date': date, 'result': result, 'predicted_result': predicted_result},
                                       ignore_index=True)

print(results_df)




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,




  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,
  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,


   team_A team_B home/away       date result predicted_result
0      21      7      Away 2022-08-06      2                2
1      21     17      Away 2022-08-13      2                2
2      21     11      Away 2022-08-21      2                2
3      21     16      Away 2022-08-28      0                2
4      21     26      Away 2022-08-31      2                2
..    ...    ...       ...        ...    ...              ...
71      0     21      Away 2023-05-13      1                0
72     27     21      Away 2022-10-29      1                2
73     27     21      Away 2023-05-07      1                2
74      2     21      Away 2023-01-11      1                0
75      2     21      Away 2023-04-21      1                2

[76 rows x 6 columns]


  results_df = results_df.append({'team_A': team_A, 'team_B': team_B, 'home/away': home_away,


In [41]:
results_df

Unnamed: 0,team_A,team_B,home/away,date,result,predicted_result
0,21,7,Away,2022-08-06,2,2
1,21,17,Away,2022-08-13,2,2
2,21,11,Away,2022-08-21,2,2
3,21,16,Away,2022-08-28,0,2
4,21,26,Away,2022-08-31,2,2
...,...,...,...,...,...,...
71,0,21,Away,2023-05-13,1,0
72,27,21,Away,2022-10-29,1,2
73,27,21,Away,2023-05-07,1,2
74,2,21,Away,2023-01-11,1,0


In [42]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Convert y_test to numpy array
y_test_array = y_test.to_numpy()

print('Accuracy:', accuracy_score(y_test_array, predicted_labels))
print('Precision:', precision_score(y_test_array, predicted_labels, average='macro'))
print('Recall:', recall_score(y_test_array, predicted_labels, average='macro'))
print('F1 Score:', f1_score(y_test_array, predicted_labels, average='macro'))


Accuracy: 0.5789473684210527
Precision: 0.5659545499666107
Recall: 0.5589774557165862
F1 Score: 0.5605730403316266


#CONCLUSIONS

Analyzing the results obtained from the neural network model for predicting the results of football matches, we see that the accuracy of the model is 58%. This means that the model correctly predicted the outcome of approximately 58% of the test matches. This is quite a good result, given the complex nature of predicting the outcome of football matches, where many variables can affect the final result.

Looking more closely at the metrics for each class, we see that Precision, Recall, and F1-scores are relatively low for class 0. This suggests that the model is having a hard time predicting that particular class. It is possible that the model needs more data for this class, or that there are some features that are unique to this class that are difficult for the model to capture.

For classes 1 and 2, the metric values are higher, suggesting that the model is better at predicting these classes.

The confusion matrix further illustrates where the model is making mistakes. For example, the model often confuses classes 0 and 1, suggesting that the two classes may be difficult to distinguish based on available features.

Compared to the **Random Forest** and **XGBoost** models, the neural network has comparable accuracy, suggesting that neither model is significantly better than the other for this particular problem.



---
However, it is important to emphasize that predicting the outcome of football matches is a very complex task. Football is an unpredictable sport and the outcome of matches can be affected by many factors that are difficult to account for in the model, such as team strategy, player injuries, player form, and even factors such as weather conditions and mental strain. In fact, even the most accurate prediction models may not be able to anticipate the surprises that are a common feature of football.

---

When analyzing the results of machine learning models in the context of predicting football match results, we noticed that even the most efficient model (XGBoost) achieved an accuracy of around 61%. The model based on the neural network achieved an accuracy of about 58%. These results highlight the difficulty of the problem - football is an unpredictable sport where the outcome of a match can change at any minute.

While these models may provide some information, they do not always reflect the full dynamics of a match. It is important to understand that modeling such a complex field as football requires taking into account many variables that can affect the result. This understanding is one of the key lessons learned from this project.

During the construction of the models, I experimented with various parameters and features, studying their impact on the quality of predictions. After many tests and research, I selected a set of features that provided the most promising results, achieving accuracy in the range of 58% to 60%. This process underscores the importance of the feature selection and model tuning process in machine learning, while demonstrating that even after careful selection of features and parameters, the model's accuracy in predicting football outcomes is limited by the unpredictability of the sport itself.

When predicting match outcomes, an alternative approach might be to look for additional data that may be useful, such as player stats, injury data, and even weather conditions. We can also explore other modeling techniques. 


