# Classification problem
Can we predict the best stage (E.g Group Stage, Third-Place match, Final) to be reached by a team that has qualified for a world cup?

by Brian O'Sullivan and Eike Stoltze

# Preprocessing

Built cleaned dataset (no missing values)

The first problems occurred when some string attributes appeared twice or caused trouble due to their content. 
In some instances, the seemingly same string (eg. “Lionel Messi”) was listed twice. That 
happened due to a different syntax for the space sign in between the first and last names. This 
error can be fixed by replacing “\xa0” with a regular space sign using the following line:
Similar mistakes happened with letters and names that include an apostrophe “ ’ ”, such as 
“Côte d'Ivoire”. 
Another content-related mistake occurred, due to a change in countries' names. Some 
countries such as “Yugoslavia” were dissolved, so they can just be left out. But other countries 
changed their names or merges, such as the case with “Germany”. Both “East-” and “West 
Germany” played in at least one World Cup. Since their reunion in 1991, a new country with 
the name “Germany” was put in place in FIFA matches. Since “East Germany” never won an
international competition, all of “West Germany” 's achievements also count as achievements 
of “Germany”. That’s why all entries of “West Germany” are replaced with “Germany".
The World Cup in 2002 also caused trouble, as it was the first and, so far, only competition 
hosted by two nations. We also came across missing data. Multiple captains and coaches from the squads of the 
tournaments held between 1998 and 2014 were missing. All these names had to be looked up 
and filled in manually. The data was taken from the Wikipedia articles of each squad.


# Data transformation, Attribute/feature construction : 
Built csv of all teams that have ever qualified for the World Cup. Each row has Country,Year,GoalsFor,GoalsAgainst,Goal Difference,Round Achieved

GoalsFor, GoalsAgainst and Goal Difference are all calculated as the teams average per each 90 minutes of game time.

Country became Country_Algeria = F, Country_Angola = T, Country_Argentina = F,... through hot-one encoding. This helped our decision tree handle this categorical data.

# Classes
Nine options: "Group stage", "First group stage", "Second group stage", "Group stage play-off", "First round", "Second round",
"Round of 16", "Quarter-finals", "Semi-finals", "Third-place match", "Final Stage", "Final"

# Redundant classes?
Some may think we have redundant classes here, however the format of the tournament has changed over the years. For example,
1934 Structure: Final, Third-place match, Quarter-finals, Round of 16
1930 Structure: Final, Semi-finals, Group stage
This means we need all achieved stage names, even if they have the same "rank" or prestige.

# Model
Decided to use a Decision Tree for our classifier. This is because they posses many favourable qualities such as

Handles Both Numerical and Categorical Data:
Decision trees can handle both numerical and categorical data without the need for extensive pre-processing. This flexibility simplifies the data preparation phase. This was needed for classifying data such as goal difference as well as country of teams origin.

Automatic Feature Selection:
Decision trees can automatically select important features from the dataset, making them robust to irrelevant or less important variables. This can lead to more efficient models.

Non-Linear Relationships:
Decision trees can capture non-linear relationships between features and the target variable, making them suitable for complex decision boundaries in the data.

Low Computational Cost for Prediction:
Once trained, decision trees have relatively low computational cost for making predictions. Predictions involve traversing the tree structure, and the time complexity is logarithmic in the number of data points.

Supports Multi-Class Classification:
Decision trees are not limited to binary classification, alllwing us to predict many different types of stages that a team may reach.

Interpretability:
Decision trees are easily interpretable and can be visualised graphically. The tree structure represents a series of decisions and their outcomes, making it easy to understand and explain to a layperson.

# Code cell 1

In [20]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# Read
match_data = pd.read_csv('new.csv')

# Extract features and target variable
X = match_data.drop(columns=['Round Achieved'])
y = match_data['Round Achieved']

print("1:")

print(f"X values:")
print(X[:10])
print("\n2:")
print(f"y values:")
print(y[:10])

# Split the data into training and testing sets
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


# Country -> Country_Algeria = F, Country_Angola = T, Country_Argentina = F,...
# Combine X_train and X_test for one-hot encoding
X_combined = pd.concat([X_train, X_test], ignore_index=True)
X_combined_encoded = pd.get_dummies(X_combined, columns=['Country'])

# Split back into X_train_encoded and X_test_encoded
X_train_encoded = X_combined_encoded.iloc[:len(X_train)]
X_test_encoded = X_combined_encoded.iloc[len(X_train):]

# Create and train the model
model = DecisionTreeClassifier()
model.fit(X_train_encoded, y_train)

# Make predictions on a new data point with named columns
new_data_point = pd.DataFrame([['Germany','2026','1.66670','0.83333','0.83333'],['Ireland','2026','1.0','1.0','0.0']], columns=X.columns)
#new_data_point = pd.DataFrame([['Ireland','2026','1.0','1.0','0.0']], columns=X.columns)

# One-hot encode the 'Country' column explicitly
new_data_point_encoded = pd.get_dummies(new_data_point, columns=['Country'])
# Ensure the columns are in the same order as X_train_encoded
new_data_point_encoded = new_data_point_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

# Make predictions on the new data point
y_pred = model.predict(new_data_point_encoded)

# Evaluate the model on the test set
accuracy = model.score(X_test_encoded, y_test)

print("\n3:")
print("Model Accuracy:", accuracy)
print("\n4:")
print("Prediciton", y_pred)
#save
#joblib.dump(model, 'Round-Classifier.joblib')

1:
X values:
       Country  Year  GoalsFor  GoalsAgainst  Goal Difference
0    Argentina  2022   1.95650       1.04350          0.91304
1       France  2022   2.18180       1.09090          1.09090
2      Croatia  2022   1.04350       0.91304          0.13043
3      Morocco  2022   0.81818       0.68182          0.13636
4     Portugal  2022   2.40000       1.20000          1.20000
5      England  2022   2.60000       0.80000          1.80000
6       Brazil  2022   1.50000       0.56250          0.93750
7  Netherlands  2022   1.87500       0.75000          1.12500
8        Spain  2022   2.07690       0.69231          1.38460
9  Switzerland  2022   1.25000       2.25000         -1.00000

2:
y values:
0                Final
1                Final
2    Third-place match
3    Third-place match
4       Quarter-finals
5       Quarter-finals
6       Quarter-finals
7       Quarter-finals
8          Round of 16
9          Round of 16
Name: Round Achieved, dtype: object

3:
Model Accuracy: 0.540

# Code cell 1 summary

Print first ten X values and first ten y values to demonstrate structure of dataset used to train model. (1,2)

Hot-one encode country values.
Trained Decision Tree model with 80% of data and tested against remaining 20%.

Analyse model accuracy on % of correct predictions on remaining 20% of data. (3)

Made prediction of two teams' performance in the 2026 world cup. This demonstates our model is capable of making predictions on unseen data. (4)

# Overfitting

Overfitting is a problem that models can be vunerable to. It occurs when the model learns the training data "too well". It may make incorrect inferences from meaningless patterns in the data. E.g. If the model is trained on data with ten occurences where Irish teams only reach the round of 16, the model may predict Ireland will always only reach this stage, even if there is an Irish team that scores 100 goals a game. This is an extreme example but we believe it demnstartes this issue well.

# Smoothing
Smoothing data is a technique employed to enhance the generalization of a dataset, making it more suitable for predictions beyond the specific context of the original dataset. We smoothed our dataset by rounding all stats to the nearest 0.125. For example:


Argentina, 2022, 1.95650, 1.04350, 0.91304, Final becomes Argentina, 2022, 2.0, 1.0, 0.875, Final

We will now comapre the performance of two models trained on these two datasets.

# Code cell 2

In [80]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import joblib

# Read
match_data = pd.read_csv('simple-new.csv')

# Extract features and target variable
X = match_data.drop(columns=['Round Achieved'])
y = match_data['Round Achieved']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Combine X_train and X_test for one-hot encoding
X_combined = pd.concat([X_train, X_test], ignore_index=True)

# Assuming 'Country' is the column containing country names
X_combined_encoded = pd.get_dummies(X_combined, columns=['Country'])

# Split back into X_train_encoded and X_test_encoded
X_train_encoded = X_combined_encoded.iloc[:len(X_train)]
X_test_encoded = X_combined_encoded.iloc[len(X_train):]

# Create and train the model
model = DecisionTreeClassifier()
model.fit(X_train_encoded, y_train)

# Make predictions on a new data point with named columns
new_data_point = pd.DataFrame([['Germany','1978','1.66670','0.83333','0.83333']], columns=X.columns)
#new_data_point = pd.DataFrame([['Brazil','2026','200.0','2.0','198.0']], columns=X.columns)

# One-hot encode the 'Country' column explicitly
new_data_point_encoded = pd.get_dummies(new_data_point, columns=['Country'])
# Ensure the columns are in the same order as X_train_encoded
new_data_point_encoded = new_data_point_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

# Make predictions on the new data point
y_pred = model.predict(new_data_point_encoded)

# Evaluate the model on the test set
accuracy = model.score(X_test_encoded, y_test)

print(f"1:~~~~~")
print("Model Accuracy:", accuracy)
print(f"\n2:~~~~~")
print("Prediciton", y_pred)

#save
#joblib.dump(model, 'Round-Classifier-Smoothed.joblib')

1:~~~~~
Model Accuracy: 0.5612244897959183

2:~~~~~
Prediciton ['Second round']


# Code cell 2 summary

Model trained on 'simple-new.csv' instead of 'new.csv'. This dataset is smoothed as described under the previous heading.

Print model accuracy (1)

Make prediction on unseen data point (2)

# Code cell 3

In [103]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import joblib


# Read
match_data = pd.read_csv('new.csv')

# Extract features and target variable
X = match_data.drop(columns=['Round Achieved'])
y = match_data['Round Achieved']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


X_combined = pd.concat([X_train, X_test], ignore_index=True)
X_combined_encoded = pd.get_dummies(X_combined, columns=['Country'])

X_train_encoded = X_combined_encoded.iloc[:len(X_train)]
X_test_encoded = X_combined_encoded.iloc[len(X_train):]


print('\nModel 1 (Trained with untouched data)')
model = joblib.load('Round-Classifier.joblib')
print('1:~~~~~~~')

# Display feature importances
feature_importances = model.feature_importances_
print("Feature Importances:")
for feature, importance in zip(X.columns, feature_importances):
    print(f"{feature}: {importance}")

print("\n2:~~~~~~~")
    
# Assuming y_true is the true labels and y_pred is the predicted labels
conf_matrix = confusion_matrix(y_test, model.predict(X_test_encoded))

print(conf_matrix)

# Accuracy
accuracy = accuracy_score(y_test, model.predict(X_test_encoded))

# Precision
precision = precision_score(y_test, model.predict(X_test_encoded), average='weighted', zero_division=0)

# Recall
recall = recall_score(y_test, model.predict(X_test_encoded), average='weighted', zero_division=0)

# F1 Score
f1 = f1_score(y_test, model.predict(X_test_encoded), average='weighted')

#print("Model score:", )
print("\n3:~~~~~~~")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

model = joblib.load('Round-Classifier-Smoothed.joblib')

print('\nModel 2 (Trained with smoothed data)')


print("4:~~~~~~~")


# Display feature importances
feature_importances = model.feature_importances_
print("Feature Importances:")
for feature, importance in zip(X.columns, feature_importances):
    print(f"{feature}: {importance}")


    
# Assuming y_true is the true labels and y_pred is the predicted labels
conf_matrix = confusion_matrix(y_test, model.predict(X_test_encoded))

print("\n5:~~~~~~~")
print(conf_matrix)

# Accuracy
#accuracy = accuracy_score(y_test, model.predict(X_test_encoded))
#accuracy = model.score(X_test_encoded, y_test)

# Precision
precision = precision_score(y_test, model.predict(X_test_encoded), average='weighted',zero_division=0)

# Recall
recall = recall_score(y_test, model.predict(X_test_encoded), average='weighted', zero_division=0)

# F1 Score
f1 = f1_score(y_test, model.predict(X_test_encoded), average='weighted')

print("\n6:~~~~~~~")

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)



Model 1 (Trained with untouched data)
1:~~~~~~~
Feature Importances:
Country: 0.2463759834714667
Year: 0.151279199747222
GoalsFor: 0.08781585679754596
GoalsAgainst: 0.3058157946556295
Goal Difference: 0.0

2:~~~~~~~
[[ 5  0  1  0  1  0  0  0  0  0  0]
 [ 0  1  0  0  0  0  0  0  0  0  0]
 [ 0  0  2  0  0  0  0  0  0  0  0]
 [ 0  0  0  3  0  0  0  0  0  0  0]
 [ 0  0  0  0 33  1  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0 18  0  0  0  0]
 [ 0  0  0  0  0  0  0 21  0  0  0]
 [ 0  0  0  0  0  0  0  0  2  0  0]
 [ 0  0  0  0  0  0  0  0  0  1  0]
 [ 0  0  0  0  0  0  0  1  0  0  8]]

3:~~~~~~~
Accuracy: 0.9591836734693877
Precision: 0.9732529375386518
Recall: 0.9591836734693877
F1 Score: 0.9634239742408591

Model 2 (Trained with smoothed data)
4:~~~~~~~
Feature Importances:
Country: 0.25577524914601335
Year: 0.1261541906336787
GoalsFor: 0.10094611601730474
GoalsAgainst: 0.2745099429106064
Goal Difference: 0.006046584993953417

5:~~~~~~~
[[ 7  0  0  0  0  0  0  0

# Code cell 3 summary

Create test/train split.

Load model 1 (Trained on untouched data) 

Display feature importances (1)

Display confusion matrix (2)

Display accuracy, precision, recall and F1 score (3)

Load model 2 (Trained on untouched data)

Display feature importances (4)

Display confusion matrix (5)

Display accuracy, precision, recall and F1 score (6)

# Explanation of metrics

Feature importances: 
Feature importances refer to the contribution of each feature in a machine learning model to the prediction. Feature importances can be calculated based on how much each feature contributes to the reduction in impurity or entropy (how unknown something is) during the tree's construction. Displaying feature importances helps understand which features are most influential in making predictions.

Confusion matrix:
A confusion matrix displays the number of true positive (correctly predicted positive instances), true negatives (correctly predicted negative instances), false positives (actual negatives incorrectly predicted as positives), and false negatives (actual positives incorrectly predicted as negatives). It provides a comprehensive view of a model's performance.

Accuracy:
Accuracy is the ratio of correctly predicted instances to the total instances. It provides a general measure of a model's correctness.

Precision:
Precision is the ratio of correctly predicted positive observations to the total predicted positives. It measures the accuracy of positive predictions and is relevant when the cost of false positives is high.

Recall (Sensitivity or True Positive Rate):
Recall is the ratio of correctly predicted positive observations to the all observations in the actual class. It measures the ability of the model to capture all the relevant instances.

F1 Score:
F1 score is the harmonic mean of precision and recall. It provides a balance between precision and recall, useful when there is an uneven class distribution.

# Conclusion 

In conclusion we are happy that our approach to this classification problem has been successful. Random guessing would have one predict with 0.111 accuracy. Both of our models can consistently predict with >85% accuracy, precision, recall, and F-1 score. Even though both models have similarly successful results, we believe that our second model 'Round-Classifier-Smoothed.joblib' would be better for predictions outside of the context of this dataset as it is less vunerable to overfitting. 


# Visualise Decision Tree

Will be saved as 'Round-Classiifer-Smoothed.dot'

In [104]:
import joblib
from sklearn import tree

# Load the trained model
model = joblib.load('Round-Classifier-Smoothed.joblib')

# Check the number of features in the model
num_features = len(model.feature_importances_)

#print(num_features)

# Specify the correct number of feature names
feature_names = ['Year', 'GoalsFor', 'GoalsAgainst', 'Goal Difference',
       'Country_Algeria', 'Country_Angola', 'Country_Argentina',
       'Country_Australia', 'Country_Austria', 'Country_Belgium',
       'Country_Bolivia', 'Country_Bosnia and Herzegovina', 'Country_Brazil',
       'Country_Bulgaria', 'Country_Cameroon', 'Country_Canada',
       'Country_Chile', 'Country_China PR', 'Country_Colombia',
       'Country_Costa Rica', 'Country_Croatia', 'Country_Cuba',
       'Country_Czech Republic', 'Country_Czechoslovakia',
       'Country_Côte dIvoire', 'Country_Denmark', 'Country_Dutch East Indies',
       'Country_Ecuador', 'Country_Egypt', 'Country_El Salvador',
       'Country_England', 'Country_FR Yugoslavia', 'Country_France',
       'Country_Germany', 'Country_Germany DR', 'Country_Ghana',
       'Country_Greece', 'Country_Haiti', 'Country_Honduras',
       'Country_Hungary', 'Country_IR Iran', 'Country_Iceland', 'Country_Iraq',
       'Country_Israel', 'Country_Italy', 'Country_Jamaica', 'Country_Japan',
       'Country_Korea DPR', 'Country_Korea Republic', 'Country_Kuwait',
       'Country_Mexico', 'Country_Morocco', 'Country_Netherlands',
       'Country_New Zealand', 'Country_Nigeria', 'Country_Northern Ireland',
       'Country_Norway', 'Country_Panama', 'Country_Paraguay', 'Country_Peru',
       'Country_Poland', 'Country_Portugal', 'Country_Qatar',
       'Country_Republic of Ireland', 'Country_Romania', 'Country_Russia',
       'Country_Saudi Arabia', 'Country_Scotland', 'Country_Senegal',
       'Country_Serbia', 'Country_Serbia and Montenegro', 'Country_Slovakia',
       'Country_Slovenia', 'Country_South Africa', 'Country_Soviet Union',
       'Country_Spain', 'Country_Sweden', 'Country_Switzerland',
       'Country_Togo', 'Country_Trinidad and Tobago', 'Country_Tunisia',
       'Country_Türkiye', 'Country_Ukraine', 'Country_United Arab Emirates',
       'Country_United States', 'Country_Uruguay', 'Country_Wales',
       'Country_Yugoslavia', 'Country_Zaire']
# Add more feature names if needed to match the total number of features

# Export the decision tree visualization
tree.export_graphviz(model, out_file='Round-Classiifer-Smoothed.dot',
                    feature_names=feature_names,
                    class_names=sorted(["Group stage",
                                        "First group stage",
                                        "Second group stage",
                                        "Group stage play-off",
                                        "First round",
                                        "Second round",
                                        "Round of 16",
                                        "Quarter-finals",
                                        "Semi-finals",
                                        "Third-place match",
                                        "Final Stage",
                                        "Final"]),
                    label='all',
                    rounded=True,
                    filled=True)


#http://viz-js.com/

# filled: fills each box
# rounded: rounds each box
# label='all': every node has labels that we can read
# class_names=sorted(y.unique()): displaying class for each node
# feature_names=['age', 'gender']: see the rules for each node