## AFCON Match Outcome Predictor

In [15]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
import joblib

In [5]:
# Load the datasets 

matches_df = pd.read_csv('datasets/Matches.csv')
players_df = pd.read_csv('datasets/Players.csv')
team_stats_df = pd.read_csv('datasets/Participated_Team.csv')
tournament_stats_df = pd.read_csv('datasets/Tournaments.csv')

# Displaying the first few rows of each dataset to understand their structure
(matches_df.head(), players_df.head(), team_stats_df.head(), tournament_stats_df.head())

(   Year      Date  Time    HomeTeam       AwayTeam  HomeTeamGoals  \
 0  1957  10-Feb-57   NaN     Sudan           Egypt            1.0   
 1  1957  10-Feb-57   NaN  Ethiopia    South Africa            NaN   
 2  1957  16-Feb-57   NaN     Egypt        Ethiopia            4.0   
 3  1959  22-May-59   NaN      Egypt       Ethiopia            4.0   
 4  1959  25-May-59   NaN     Sudan        Ethiopia            1.0   
 
    AwayTeamGoals             Stage  \
 0            2.0        Semifinals   
 1            NaN        Semifinals   
 2            0.0             Final   
 3            0.0  Final Tournament   
 4            0.0  Final Tournament   
 
                                 SpecialWinConditions                Stadium  \
 0                                                NaN      Municipal Stadium   
 1  Ethiopia  wins due to disqualification of othe...                    NaN   
 2                                                NaN      Municipal Stadium   
 3                    

Matches Dataset: This dataset includes details about each match, such as the year, date, teams involved, goals scored, and the match stage. Key columns are 'Year', 'HomeTeam', 'AwayTeam', 'HomeTeamGoals', 'AwayTeamGoals', etc.

Players Dataset: This dataset lists players who participated in the tournament, including their positions, names, birth dates, caps, goals, clubs, and countries. Key columns include 'Year', 'PlayerPosition', 'PlayerName', 'Club', 'Country', etc.

Team Statistics Dataset: This dataset provides general statistics for each team that participated in the tournaments, such as the number of times participated, games played, wins, draws, losses, goals for and against, etc. Key columns are 'Team', 'Part', 'Pld', 'W', 'D', 'L', 'GF', 'GA', 'GD', 'Pts'.

Tournament Statistics Dataset: This dataset includes information about debut teams for each tournament year. Key columns are 'Year', 'Debuting teams'.

In [7]:
matches_df.columns

Index(['Year', 'Date ', 'Time ', 'HomeTeam', 'AwayTeam', 'HomeTeamGoals',
       'AwayTeamGoals', 'Stage', 'SpecialWinConditions', 'Stadium', 'City',
       'Attendance'],
      dtype='object')
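
The listing above shows that two headers, 'Date ' and 'Time ', carry a trailing space, which is easy to miss when selecting columns. A minimal sketch of normalizing the headers, assuming a small stand-in frame (`demo_df` is hypothetical, not the real Matches.csv):

```python
import pandas as pd

# Stand-in frame reproducing the trailing-space headers seen in matches_df.columns
demo_df = pd.DataFrame({"Year": [1957], "Date ": ["10-Feb-57"], "Time ": [None]})

# Strip leading/trailing whitespace from every column name
demo_df.columns = demo_df.columns.str.strip()

print(list(demo_df.columns))
```

If applied to `matches_df` right after loading, later cells would refer to 'Date' and 'Time' without the trailing space.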

## Data preprocessing

In [9]:
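# Handling missing values and data types in Matches dataset
# .copy() gives an independent frame, which guards against SettingWithCopyWarning on the assignment below
matches_df = matches_df.dropna(subset=['HomeTeamGoals', 'AwayTeamGoals']).copy()
# The source column is named 'Date ' (trailing space); the parsed values go into a new 'Date' column.
# Two-digit years are resolved to the 2000s here, e.g. '10-Feb-57' becomes 2057-02-10.
matches_df['Date'] = pd.to_datetime(matches_df['Date '], errors='coerce')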

# For Players dataset, we might not need detailed player info for predicting match outcomes, but let's clean it
players_df['DateofBirth(age)'] = pd.to_datetime(players_df['DateofBirth(age)'], errors='coerce')

# For Team and Tournament stats
# Dropping every row that contains at least one missing value
team_stats_df = team_stats_df.dropna()
tournament_stats_df = tournament_stats_df.dropna()

# Checking the data types and missing values after preprocessing
(matches_df.dtypes, players_df.dtypes, team_stats_df.dtypes, tournament_stats_df.dtypes,
 matches_df.isnull().sum(), players_df.isnull().sum(), team_stats_df.isnull().sum(), tournament_stats_df.isnull().sum())


(Year                             int64
 Date                            object
 Time                            object
 HomeTeam                        object
 AwayTeam                        object
 HomeTeamGoals                  float64
 AwayTeamGoals                  float64
 Stage                           object
 SpecialWinConditions            object
 Stadium                         object
 City                            object
 Attendance                     float64
 Date                    datetime64[ns]
 dtype: object,
 Unnamed: 0                   int64
 Year                         int64
 ShirtNumber                 object
 PlayerPosition              object
 PlayerName                  object
 DateofBirth(age)    datetime64[ns]
 Caps                        object
 Goals                       object
 Club                        object
 Country                     object
 dtype: object,
 Rank     int64
 Team    object
 Part     int64
 Pld      int64
 W        int64
 D      

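Matches Dataset:

- Removed rows with missing values in 'HomeTeamGoals' or 'AwayTeamGoals'.
- Parsed the original 'Date ' column (note the trailing space in its name) into a new datetime column, 'Date'. Two-digit years are resolved to the 2000s, so 10-Feb-57 becomes 2057-02-10; the model is unaffected because it does not use the date.
- Some columns, such as 'Time', 'SpecialWinConditions', and 'Attendance', still have missing values, which might not be critical for our model.

Players Dataset:

- Converted 'DateofBirth(age)' to datetime format.
- There are missing values in several columns, but for predicting match outcomes we may not need detailed player info.

Team Statistics Dataset:

- Rows with missing values dropped; no missing values remain.

Tournament Statistics Dataset:

- Rows with missing values dropped; no missing values remain.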
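
Because the dates use two-digit years, matches from the 1950s end up a century in the future after parsing. One possible correction, sketched here under the assumption that the 'Year' column holds the true tournament year, is to shift any date whose parsed year overshoots 'Year' back by 100 years. The frame `demo` and the column name `MatchDate` are hypothetical:

```python
import pandas as pd

# Stand-in rows in the same shape as the Matches dataset
demo = pd.DataFrame({
    "Year": [1957, 1959, 2006],
    "Date": ["10-Feb-57", "22-May-59", "05-Feb-06"],
})

# An explicit format avoids guessing; %y still maps '57' to 2057
parsed = pd.to_datetime(demo["Date"], format="%d-%b-%y", errors="coerce")

# Where the parsed year lies beyond the tournament year, move the date back one century
overshoot = parsed.dt.year > demo["Year"]
demo["MatchDate"] = parsed.where(~overshoot, parsed - pd.DateOffset(years=100))

print(demo["MatchDate"].dt.year.tolist())
```

This only matters if the date is later used as a feature or for sorting; the model in this notebook does not use it.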

## Feature engineering

In [12]:
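# Encoding categorical variables in the Matches dataset
# Note: each fit_transform call refits the encoder, so HomeTeam and AwayTeam get
# independent code tables and the same team can receive a different integer in each column
label_encoder = LabelEncoder()
matches_df['HomeTeam_encoded'] = label_encoder.fit_transform(matches_df['HomeTeam'])
matches_df['AwayTeam_encoded'] = label_encoder.fit_transform(matches_df['AwayTeam'])
matches_df['Stage_encoded'] = label_encoder.fit_transform(matches_df['Stage'])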

# Creating a target variable for match outcome
# 1 for HomeTeam win, 0 for draw, -1 for AwayTeam win
matches_df['MatchOutcome'] = matches_df.apply(
    lambda row: 1 if row['HomeTeamGoals'] > row['AwayTeamGoals'] else 
                (-1 if row['HomeTeamGoals'] < row['AwayTeamGoals'] else 0), 
    axis=1)

# Dropping columns not used for prediction
matches_df.drop(columns=['SpecialWinConditions', 'Stadium', 'City', 'Attendance'], inplace=True)


# Verifying the dataset after dropping columns
matches_df.head()




| | Year | Date | Time | HomeTeam | AwayTeam | HomeTeamGoals | AwayTeamGoals | Stage | Date.1 | HomeTeam_encoded | AwayTeam_encoded | Stage_encoded | MatchOutcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1957 | 10-Feb-57 | NaN | Sudan | Egypt | 1.0 | 2.0 | Semifinals | 2057-02-10 | 36 | 13 | 12 | -1 |
| 2 | 1957 | 16-Feb-57 | NaN | Egypt | Ethiopia | 4.0 | 0.0 | Final | 2057-02-16 | 12 | 15 | 0 | 1 |
| 3 | 1959 | 22-May-59 | NaN | Egypt | Ethiopia | 4.0 | 0.0 | Final Tournament | 2059-05-22 | 11 | 15 | 1 | 1 |
| 4 | 1959 | 25-May-59 | NaN | Sudan | Ethiopia | 1.0 | 0.0 | Final Tournament | 2059-05-25 | 36 | 15 | 1 | 1 |
| 5 | 1959 | 29-May-59 | NaN | Egypt | Sudan | 2.0 | 1.0 | Final Tournament | 2059-05-29 | 11 | 38 | 1 | 1 |


- Encoded categorical variables: 'HomeTeam_encoded', 'AwayTeam_encoded', and 'Stage_encoded'.
- Target variable for match outcome: 'MatchOutcome' (1 for HomeTeam win, 0 for draw, -1 for AwayTeam win).
- Dropped columns not used for prediction: 'SpecialWinConditions', 'Stadium', 'City', and 'Attendance'.
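
In the output above, Sudan is encoded as 36 when playing at home and 38 when playing away, because the home and away columns were encoded separately. A possible alternative, sketched on a hypothetical stand-in frame (`demo` and `team_encoder` are illustrative names, not part of the notebook), is to fit one encoder on the union of both columns so each team has a single code:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in fixtures; the real notebook would use matches_df instead
demo = pd.DataFrame({
    "HomeTeam": ["Sudan", "Egypt", "Egypt"],
    "AwayTeam": ["Egypt", "Ethiopia", "Sudan"],
})

# Fit once on every team name that appears in either column
team_encoder = LabelEncoder()
team_encoder.fit(pd.concat([demo["HomeTeam"], demo["AwayTeam"]]))

# Reuse the same fitted mapping for both columns
demo["HomeTeam_encoded"] = team_encoder.transform(demo["HomeTeam"])
demo["AwayTeam_encoded"] = team_encoder.transform(demo["AwayTeam"])

print(demo)
```

Keeping a dedicated encoder for teams also means `team_encoder.inverse_transform` can later map codes back to team names, which is not possible once a shared `LabelEncoder` object has been refitted on another column.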

## Building the predictive model

In [14]:
# Preparing data for model
X = matches_df[['HomeTeam_encoded', 'AwayTeam_encoded', 'Stage_encoded']]
y = matches_df['MatchOutcome']

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Building the model - using Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predicting on test data
y_pred = rf_model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

(accuracy, classification_rep)


(0.472636815920398,
 '              precision    recall  f1-score   support\n\n          -1       0.25      0.18      0.21        33\n           0       0.26      0.22      0.24        54\n           1       0.59      0.68      0.63       114\n\n    accuracy                           0.47       201\n   macro avg       0.37      0.36      0.36       201\nweighted avg       0.44      0.47      0.46       201\n')

The Random Forest Classifier model has been trained and evaluated. Here are the results:

- Accuracy: The model achieved an accuracy of approximately 47.26% on the 201 test matches, so it predicts the correct outcome a little less than half the time.
- Classification report: Precision, recall, and f1-score vary across the classes (-1 for AwayTeam win, 0 for draw, 1 for HomeTeam win). The model performs best on HomeTeam wins (1), with a precision of 59% and a recall of 68%. For AwayTeam wins and draws, precision and recall all fall between 18% and 26%.

These results indicate weak predictive power: HomeTeam wins account for 114 of the 201 test matches (about 57%), so always predicting a HomeTeam win would score higher than the model's 47%. The performance might be improved by:

- Using more features: Incorporating more data like team statistics, player statistics, or historical performance.
- Hyperparameter tuning: Optimizing the parameters of the Random Forest model.
- Trying different models: Experimenting with other algorithms like Gradient Boosting or Logistic Regression.
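
To put the accuracy figure in context, it helps to compare it with a trivial baseline that always predicts the most frequent class. The sketch below does that arithmetic using only the support counts and accuracy printed in the report above; the variable names are illustrative:

```python
# Support counts per class, copied from the classification report
support = {-1: 33, 0: 54, 1: 114}
model_accuracy = 0.472636815920398

total = sum(support.values())

# The majority class is the label with the largest support
majority_class = max(support, key=support.get)

# Accuracy obtained by predicting the majority class for every test match
baseline_accuracy = support[majority_class] / total

print(f"baseline={baseline_accuracy:.4f} model={model_accuracy:.4f}")
```

Any change to the features or the model would ideally be judged against this baseline rather than against chance.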

## Save the model

In [18]:
# Saving the model to a file
model_filename = 'model/rf_afcon_prediction_model.joblib'
joblib.dump(rf_model, model_filename)

model_filename


'model/rf_afcon_prediction_model.joblib'
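
A saved model is only useful if it can be reloaded and fed inputs with the same columns it was trained on. The following is a self-contained sketch of that round trip; it trains a tiny stand-in model on made-up rows and writes to a temporary folder, so the data, the file name `rf_demo.joblib`, and the variable names are all assumptions rather than the notebook's actual artifacts:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up training rows with the same three feature columns as the notebook
X_demo = pd.DataFrame({
    "HomeTeam_encoded": [0, 1, 2, 0, 1, 2],
    "AwayTeam_encoded": [1, 2, 0, 2, 0, 1],
    "Stage_encoded": [0, 0, 0, 1, 1, 1],
})
y_demo = [1, 0, -1, 1, 1, 0]

model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = os.path.join(tmp_dir, "model")
    # joblib.dump does not create missing folders, so make sure the target exists
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, "rf_demo.joblib")
    joblib.dump(model, path)
    loaded = joblib.load(path)

# New inputs must use the same column names and the same integer encoding as training
new_match = pd.DataFrame([[0, 1, 0]], columns=X_demo.columns)
prediction = loaded.predict(new_match)[0]
print(prediction)
```

In practice the fitted encoders would need to be saved alongside the model, since the integer codes are meaningless without the mapping that produced them.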