# World Cup Winner Project

Wherever there is data, data science can be applied. If some of you are bettors at heart and want to back your decisions with statistics, we have built an algorithm that predicts the winning team of a football match.


We also fed the average statistics of each player and each team into the equation.

The goal of this challenge is therefore to predict the outcome of a football match using the ensemble methods we covered together. You can try two things:

1. Voting Classifier 

2. Stacking 

See which method works best.
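
To make the two options concrete, here is a minimal sketch of both ensembles using scikit-learn. It runs on synthetic data from `make_classification`, not on the match data, and the choice of base models (`RandomForestClassifier`, `KNeighborsClassifier`, `LogisticRegression`) is only an assumption for illustration. `StackingClassifier` requires scikit-learn 0.22 or later.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the match features (assumption: not the real dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Illustrative base models; any set of classifiers could be used here.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier()),
    ("logreg", LogisticRegression(max_iter=1000)),
]

# Soft voting averages the predicted class probabilities of the base models.
voting = VotingClassifier(estimators=base_models, voting="soft")
voting.fit(X_tr, y_tr)

# Stacking trains a final estimator on out-of-fold predictions of the base models.
stacking = StackingClassifier(estimators=base_models,
                              final_estimator=LogisticRegression(max_iter=1000),
                              cv=5)
stacking.fit(X_tr, y_tr)

voting_score = voting.score(X_te, y_te)
stacking_score = stacking.score(X_te, y_te)
print(f"voting: {voting_score:.3f}  stacking: {stacking_score:.3f}")
```

Which of the two scores higher depends on the data and the base models, so both are worth trying on the real features.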


Feel free to use this tutorial for help:

[World Cup Winner](https://github.com/JedhaBootcamp/world-cup-winner)

0. Import the usual libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

1. Import the datasets

In [2]:
matches = pd.read_csv("./Datasets/results.csv")
rankings = pd.read_csv("./Datasets/fifa_ranking.csv")
world_cup_matches = pd.read_csv("./Datasets/World Cup 2018 Dataset.csv")
players = pd.read_csv("./Datasets/FullData.csv")
all_time_stats = pd.read_csv("./Datasets/all_time_fifa_statistics.csv")

We do not need all the data in each file. Some country names also differ from year to year (Germany counted as two countries before the fall of the Berlin Wall in 1989). We therefore start with a first cleaning pass.

In [3]:
rankings = rankings.loc[:,['rank',
                           'country_full',
                           'country_abrv',
                           'cur_year_avg_weighted',
                           'rank_date',
                           'two_year_ago_weighted',
                           'three_year_ago_weighted']]
rankings = rankings.replace({"IR Iran": "Iran"})
rankings['weighted_points'] =  rankings['cur_year_avg_weighted'] + rankings['two_year_ago_weighted'] + rankings['three_year_ago_weighted']
rankings["rank_date"] = pd.to_datetime(rankings["rank_date"])


In [11]:
rankings.describe()

Unnamed: 0,rank,cur_year_avg_weighted,two_year_ago_weighted,three_year_ago_weighted,weighted_points
count,57793.0,57793.0,57793.0,57793.0,57793.0
mean,101.628086,61.798602,17.933277,11.834811,91.566691
std,58.618424,138.014883,40.888849,27.106675,197.891852
min,1.0,0.0,0.0,0.0,0.0
25%,51.0,0.0,0.0,0.0,0.0
50%,101.0,0.0,0.0,0.0,0.0
75%,152.0,32.25,6.45,4.25,64.81
max,209.0,1158.66,347.91,240.15,1511.5


In [12]:
matches = matches.replace({"Germany DR": "Germany", "China": "China PR"})
matches["date"] = pd.to_datetime(matches["date"])

world_cup_matches = world_cup_matches.loc[:, ['Team',
                                              'Group',
                                              'First match \nagainst',
                                              'Second match\n against',
                                              'Third match\n against']]
world_cup_matches = world_cup_matches.dropna(how='all')
world_cup_matches = world_cup_matches.replace({"IRAN": "Iran",
                               "Costarica": "Costa Rica",
                               "Porugal": "Portugal",
                               "Columbia": "Colombia",
                               "Korea" : "Korea Republic"})
world_cup_matches = world_cup_matches.set_index('Team')
world_cup_matches.head()

Team,Group,First match against,Second match against,Third match against
Russia,A,Saudi Arabia,Egypt,Uruguay
Saudi Arabia,A,Russia,Uruguay,Egypt
Egypt,A,Uruguay,Russia,Saudi Arabia
Uruguay,A,Egypt,Saudi Arabia,Russia
Portugal,B,Spain,Morocco,Iran


In [13]:
matches.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


Given the amount of data we have and how little of it is missing, we decide to simply drop the rows with missing data. Note that `dropna(how='all')`, as used in these cells, only removes rows in which every value is missing. Let's finish by importing the player statistics.
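
The difference between dropping fully empty rows and dropping any row with a gap can be seen on a toy frame. This is a small sketch with made-up values, not the players dataset:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): row 1 is partly missing, row 2 is entirely missing.
toy = pd.DataFrame({"Rating": [80.0, np.nan, np.nan],
                    "Age": [25.0, 30.0, np.nan]})

# how="all" drops a row only when every column is NaN: rows 0 and 1 remain.
only_empty_rows_dropped = toy.dropna(how="all")

# The default (how="any") drops a row as soon as one value is NaN: only row 0 remains.
any_missing_dropped = toy.dropna()

print(len(only_empty_rows_dropped), len(any_missing_dropped))
```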

In [15]:
players = players.loc[:, ["Nationality",
                            "Rating",
                            "Age",
                            "Weak_foot",
                            "Skill_Moves",
                            "Ball_Control",
                            "Dribbling",
                            "Marking",
                            "Sliding_Tackle",
                            "Standing_Tackle",
                            "Aggression",
                            "Reactions",
                            "Attacking_Position",
                            "Interceptions",
                            "Vision",
                            "Composure",
                            "Crossing",
                             "Short_Pass",
                             "Long_Pass",
                             "Acceleration",
                             "Speed",
                             "Stamina",
                             "Strength",
                             "Balance",
                             "Agility",
                             "Jumping",
                             "Heading",
                             "Shot_Power",
                             "Finishing",
                             "Long_Shots",
                             "Curve",
                             "Freekick_Accuracy",
                             "Penalties",
                             "Volleys"]]
players.describe()

Unnamed: 0,Rating,Age,Weak_foot,Skill_Moves,Ball_Control,Dribbling,Marking,Sliding_Tackle,Standing_Tackle,Aggression,...,Agility,Jumping,Heading,Shot_Power,Finishing,Long_Shots,Curve,Freekick_Accuracy,Penalties,Volleys
count,17588.0,17588.0,17588.0,17588.0,17588.0,17588.0,17588.0,17588.0,17588.0,17588.0,...,17588.0,17588.0,17588.0,17588.0,17588.0,17588.0,17588.0,17588.0,17588.0,17588.0
mean,66.166193,25.460314,2.934103,2.303161,57.972766,54.802877,44.230327,45.565499,47.441096,55.920173,...,63.206732,64.918524,52.393109,55.581192,45.157607,47.403173,47.181146,43.383443,49.165738,43.275586
std,7.083012,4.680217,0.655927,0.746156,16.834779,18.913857,21.561703,21.515179,21.827815,17.445464,...,14.618163,11.430807,17.473703,17.600155,19.374428,19.211887,18.464396,17.701903,15.871735,17.710839
min,45.0,17.0,1.0,1.0,5.0,4.0,3.0,5.0,3.0,2.0,...,11.0,15.0,4.0,3.0,2.0,4.0,6.0,4.0,7.0,3.0
25%,62.0,22.0,3.0,2.0,53.0,47.0,22.0,23.0,26.0,44.0,...,55.0,58.0,45.0,45.0,29.0,32.0,34.0,31.0,39.0,30.0
50%,66.0,25.0,3.0,2.0,63.0,60.0,48.0,51.0,54.0,59.0,...,65.0,65.0,56.0,59.0,48.0,52.0,48.0,42.0,50.0,44.0
75%,71.0,29.0,3.0,3.0,69.0,68.0,64.0,64.0,66.0,70.0,...,74.0,73.0,65.0,69.0,61.0,63.0,62.0,57.0,61.0,57.0
max,94.0,47.0,5.0,5.0,95.0,97.0,92.0,95.0,92.0,96.0,...,96.0,95.0,94.0,93.0,95.0,91.0,92.0,93.0,96.0,93.0


In [16]:
# Drop rows that are entirely empty, then average every statistic per nationality.
players = players.dropna(how="all")
grouped = players.groupby(["Nationality"], as_index=False)
players = grouped.mean()

The last part of the code computes the average of the player statistics for each team so that we can then include them in the comparison between countries.

Our data is now imported, but we need to merge it so that our algorithm can learn from the different statistics. This has to be done in several steps.

First, the ranking dates and the match dates do not line up exactly: rankings are published month by month, whereas matches are dated to the day. We therefore need to build a day-by-day ranking so that the two tables can be merged on the date.

Once this is done, we perform a first merge.
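
As an alternative to expanding the rankings to one row per day, `pandas.merge_asof` can look up, for each match, the most recent ranking on or before the match date. The sketch below uses toy data with hypothetical values and is not the approach taken in this notebook; it only illustrates the same date-alignment idea.

```python
import pandas as pd

# Toy monthly rankings for two countries (hypothetical values).
toy_rankings = pd.DataFrame({
    "rank_date": pd.to_datetime(["2018-05-01", "2018-06-01", "2018-05-01", "2018-06-01"]),
    "country_full": ["France", "France", "Brazil", "Brazil"],
    "rank": [7, 6, 2, 1],
})

# Toy matches dated to the day.
toy_matches = pd.DataFrame({
    "date": pd.to_datetime(["2018-05-20", "2018-06-15"]),
    "home_team": ["France", "Brazil"],
})

# merge_asof requires both frames to be sorted on their time key.
merged = pd.merge_asof(
    toy_matches.sort_values("date"),
    toy_rankings.sort_values("rank_date"),
    left_on="date",
    right_on="rank_date",
    left_by="home_team",
    right_by="country_full",
    direction="backward",  # latest ranking on or before the match date
)
print(merged[["date", "home_team", "rank"]])
```

This avoids materialising a daily table with millions of rows, at the cost of having to keep both frames sorted by date.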

In [20]:
rankings.head()

Unnamed: 0,rank,country_full,country_abrv,cur_year_avg_weighted,rank_date,two_year_ago_weighted,three_year_ago_weighted,weighted_points
0,1,Germany,GER,0.0,1993-08-08,0.0,0.0,0.0
1,2,Italy,ITA,0.0,1993-08-08,0.0,0.0,0.0
2,3,Switzerland,SUI,0.0,1993-08-08,0.0,0.0,0.0
3,4,Sweden,SWE,0.0,1993-08-08,0.0,0.0,0.0
4,5,Argentina,ARG,0.0,1993-08-08,0.0,0.0,0.0


In [21]:
rankings = rankings.set_index(['rank_date'])\
            .groupby(['country_full'], group_keys=False)\
            .resample('D').first()\
            .ffill()\
            .reset_index()


rankings.head()

Unnamed: 0,rank_date,rank,country_full,country_abrv,cur_year_avg_weighted,two_year_ago_weighted,three_year_ago_weighted,weighted_points
0,2003-01-15,204.0,Afghanistan,AFG,0.0,0.0,0.0,0.0
1,2003-01-16,204.0,Afghanistan,AFG,0.0,0.0,0.0,0.0
2,2003-01-17,204.0,Afghanistan,AFG,0.0,0.0,0.0,0.0
3,2003-01-18,204.0,Afghanistan,AFG,0.0,0.0,0.0,0.0
4,2003-01-19,204.0,Afghanistan,AFG,0.0,0.0,0.0,0.0


In [35]:
rankings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1830658 entries, 0 to 1830657
Data columns (total 8 columns):
rank_date                  datetime64[ns]
rank                       float64
country_full               object
country_abrv               object
cur_year_avg_weighted      float64
two_year_ago_weighted      float64
three_year_ago_weighted    float64
weighted_points            float64
dtypes: datetime64[ns](1), float64(5), object(2)
memory usage: 111.7+ MB


In [36]:
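# Attach the home team's ranking as of the match date (inner join on date and country).
matches = matches.merge(rankings,
                        left_on=['date', 'home_team'],
                        right_on=['rank_date', 'country_full'])
# Attach the away team's ranking; overlapping columns get the _home / _away suffixes.
matches = matches.merge(rankings,
                        left_on=['date', 'away_team'],
                        right_on=['rank_date', 'country_full'],
                        suffixes=('_home', '_away'))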

In [39]:
matches = matches.merge(players,
                       left_on =["home_team"],
                       right_on = ["Nationality"])

matches = matches.merge(players,
                        left_on = ['away_team'],
                        right_on = ["Nationality"],
                        suffixes = ('_home', "_away"))

matches = matches.merge(all_time_stats,
                       left_on = ["home_team"],
                       right_on = ["Country"])

matches = matches.merge(all_time_stats,
                       left_on = ["away_team"],
                        right_on = ["Country"],
                       suffixes = ("_home", "_away"))

In [40]:
matches.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6057 entries, 0 to 6056
Columns: 119 entries, date to Best_finish_away
dtypes: bool(1), datetime64[ns](3), float64(78), int64(22), object(15)
memory usage: 5.5+ MB


How are we going to evaluate the teams facing each other? A simple way is to **take the difference of each statistic between the two teams.** For example, we take the difference in FIFA ranking position, the difference in average player age, and so on. The process is a bit tedious because everything has to be written by hand, but here it is:

In [41]:
matches['rank_difference'] = matches['rank_home'] - matches['rank_away']
matches['average_rank'] = (matches['rank_home'] + matches['rank_away'])/2
matches['score_difference'] = matches['home_score'] - matches['away_score']
matches["point_difference"] = matches['weighted_points_home'] - matches['weighted_points_away']
matches["rating_difference"] = matches["Rating_home"] - matches["Rating_away"]
matches["Age_difference"] = matches["Age_home"] - matches["Age_away"]
matches["Weak_foot_difference"] = matches["Weak_foot_home"] - matches["Weak_foot_away"]
matches["Skill_Moves_difference"] = matches["Skill_Moves_home"] - matches["Skill_Moves_away"]
matches["Ball_Control_difference"] = matches["Ball_Control_home"] - matches["Ball_Control_away"]
matches["Dribbling_difference"] = matches["Dribbling_home"] - matches["Dribbling_away"]
matches["Marking_difference"] = matches["Marking_home"] - matches["Marking_away"]
matches["Sliding_Tackle_difference"] = matches["Sliding_Tackle_home"] - matches["Sliding_Tackle_away"]
matches["Standing_Tackle_difference"] = matches["Standing_Tackle_home"] - matches["Standing_Tackle_away"]
matches["Aggression_difference"] = matches["Aggression_home"] - matches["Aggression_away"]
matches["Reactions_difference"] = matches["Reactions_home"] - matches["Reactions_away"]
matches["Attacking_Position_difference"] = matches["Attacking_Position_home"] - matches["Attacking_Position_away"]
matches["Interceptions_difference"] = matches["Interceptions_home"] - matches["Interceptions_away"]
matches["Vision_difference"] = matches["Vision_home"] - matches["Vision_away"]
matches["Composure_difference"] = matches["Composure_home"] - matches["Composure_away"]
matches["Crossing_difference"] = matches["Crossing_home"] - matches["Crossing_away"]
matches["Short_Pass_difference"] = matches["Short_Pass_home"] - matches["Short_Pass_away"]
matches["Long_Pass_difference"] = matches["Long_Pass_home"] - matches["Long_Pass_away"]
matches["Stamina_difference"] = matches["Stamina_home"] - matches["Stamina_away"]
matches["Penalties_difference"] = matches["Penalties_home"] - matches["Penalties_away"]
matches["Acceleration_difference"] = matches["Acceleration_home"] - matches["Acceleration_away"]
matches["Speed_difference"] = matches["Speed_home"] - matches["Speed_away"]
matches["Strength_difference"] = matches["Strength_home"] - matches["Strength_away"]
matches["Balance_difference"] = matches["Balance_home"] - matches["Balance_away"]
matches["Agility_difference"] = matches["Agility_home"] - matches["Agility_away"]
matches["Jumping_difference"] = matches["Jumping_home"] - matches["Jumping_away"]
matches["Heading_difference"] = matches["Heading_home"] - matches["Heading_away"]
matches["Shot_Power_difference"] = matches["Shot_Power_home"] - matches["Shot_Power_away"]
matches["Finishing_difference"] = matches["Finishing_home"] - matches["Finishing_away"]
matches["Long_Shots_difference"] = matches["Long_Shots_home"] - matches["Long_Shots_away"]
matches["Curve_difference"] = matches["Curve_home"] - matches["Curve_away"]
matches["Freekick_Accuracy_difference"] = matches["Freekick_Accuracy_home"] - matches["Freekick_Accuracy_away"]
matches["Volleys_difference"] = matches["Volleys_home"] - matches["Volleys_away"]
matches["Part's_difference"] = matches["Part's_home"] - matches["Part's_away"]
matches["Played_difference"] = matches["Played_home"] - matches["Played_away"]
matches["Won_difference"] = matches["Won_home"] - matches["Won_away"]
matches["Drawn_difference"] = matches["Drawn_home"] - matches["Drawn_away"]
matches["Lost_difference"] = matches["Lost_home"] - matches["Lost_away"]
matches["Goal_Difference_difference"] = matches["Goal Difference_home"] - matches["Goal Difference_away"]
matches["Points_difference"] = matches["Points_home"] - matches["Points_away"]
matches["Average_points_difference"] = matches["Average_points_home"] - matches["Average_points_away"]
matches['is_won'] = matches['score_difference'] > 0 # take draw as lost
matches['is_stake'] = matches['tournament'] != 'Friendly'

Handling each of our variables in the next step is likewise somewhat long-winded. There are almost certainly better ways to manage this but, given time constraints, we chose to proceed this way.
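
One possible shortcut, sketched here on a toy frame with hypothetical values, is to loop over the statistic names and build each `<stat>_difference` column programmatically. It assumes, as in the merged `matches` table, that every statistic exists with both a `_home` and an `_away` suffix.

```python
import pandas as pd

# Toy frame with two statistics per side (hypothetical values).
toy = pd.DataFrame({
    "Age_home": [26.0, 27.5], "Age_away": [25.0, 28.0],
    "Speed_home": [70.0, 65.0], "Speed_away": [68.0, 66.0],
})

# In the notebook this would be the full list of statistic names.
stats = ["Age", "Speed"]

# One `<stat>_difference` column per statistic: home value minus away value.
for stat in stats:
    toy[f"{stat}_difference"] = toy[f"{stat}_home"] - toy[f"{stat}_away"]

# The feature matrix can then be selected by name instead of a hand-written list.
feature_columns = [f"{stat}_difference" for stat in stats]
X_toy = toy[feature_columns]
print(X_toy)
```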

## Building the model

In [42]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

X = matches.loc[:,['average_rank',
                    'rank_difference',
                    "point_difference",
                    'is_stake',
                    "rating_difference",
                     "Age_difference",
                    "Weak_foot_difference",
                     "Skill_Moves_difference",
                    "Ball_Control_difference",
                     "Dribbling_difference",
                     "Marking_difference",
                     "Sliding_Tackle_difference",
                     "Standing_Tackle_difference",
                     "Aggression_difference",
                     "Reactions_difference",
                     "Interceptions_difference",
                     "Vision_difference",
                   "Crossing_difference",
                     "Short_Pass_difference",
                     "Long_Pass_difference",
                    "Stamina_difference",
                     "Penalties_difference",
                     "Acceleration_difference",                   
                     "Speed_difference",
                    "Strength_difference",
                    "Balance_difference",
                     "Agility_difference",
                     "Jumping_difference",
                    "Heading_difference",
                     "Shot_Power_difference",
                    "Finishing_difference",
                   "Long_Shots_difference",
                     "Curve_difference",
                    "Freekick_Accuracy_difference",
                     "Volleys_difference",
                     "Won_difference",
                     "Drawn_difference",
                     "Lost_difference",
                     "Average_points_difference",
                  ]]
y = matches['is_won']

In [43]:
y = pd.get_dummies(y, drop_first = True)
y.head()

Unnamed: 0,True
0,1
1,1
2,1
3,0
4,0


In [44]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [45]:
from sklearn.ensemble import RandomForestClassifier

pre_classifier = RandomForestClassifier()
pre_classifier.fit(X_train, y_train)


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [46]:
pre_classifier.score(X_test, y_test)

0.6023102310231023

We add the predictions of our Random Forest to the dataset so that we can then apply XGBoost to it.

In [47]:
X_new = pd.concat([X, pd.DataFrame({"prediction_from_RF":pre_classifier.predict(X)})], axis=1)
X_new.head()

Unnamed: 0,average_rank,rank_difference,point_difference,is_stake,rating_difference,Age_difference,Weak_foot_difference,Skill_Moves_difference,Ball_Control_difference,Dribbling_difference,...,Finishing_difference,Long_Shots_difference,Curve_difference,Freekick_Accuracy_difference,Volleys_difference,Won_difference,Drawn_difference,Lost_difference,Average_points_difference,prediction_from_RF
0,40.5,37.0,0.0,True,-4.107843,1.605882,-0.234641,-0.264052,-7.986928,-6.426144,...,-6.547059,-8.129412,-10.150327,-7.005882,-9.759477,-22,-11,-14,-1.33,1
1,42.5,-17.0,0.0,True,-4.107843,1.605882,-0.234641,-0.264052,-7.986928,-6.426144,...,-6.547059,-8.129412,-10.150327,-7.005882,-9.759477,-22,-11,-14,-1.33,1
2,31.0,-26.0,0.0,True,-4.107843,1.605882,-0.234641,-0.264052,-7.986928,-6.426144,...,-6.547059,-8.129412,-10.150327,-7.005882,-9.759477,-22,-11,-14,-1.33,1
3,51.0,30.0,0.0,True,-4.107843,1.605882,-0.234641,-0.264052,-7.986928,-6.426144,...,-6.547059,-8.129412,-10.150327,-7.005882,-9.759477,-22,-11,-14,-1.33,0
4,53.0,26.0,0.0,True,-4.107843,1.605882,-0.234641,-0.264052,-7.986928,-6.426144,...,-6.547059,-8.129412,-10.150327,-7.005882,-9.759477,-22,-11,-14,-1.33,0


In [48]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size = 0.2)

In [49]:
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [50]:
classifier.score(X_test, y_test)

0.8902640264026402

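The performance looks impressive! The jump from about 0.60 to 0.89 should be read with caution, though: `prediction_from_RF` is computed on all of `X`, including the rows the Random Forest was trained on, and the second `train_test_split` draws a new random split. Part of the new test set therefore carries predictions from a model that has already seen its labels, so the score is likely optimistic.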
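
A way to add the Random Forest prediction as a feature without letting it see the labels it predicts is to use out-of-fold predictions on the training set and to keep a single train/test split for both stages. The sketch below is a minimal illustration on synthetic data, not on the match data. `GradientBoostingClassifier` stands in for `XGBClassifier` only to keep the example within scikit-learn, and the feature names `f0`, `f1`, ... are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in for X and y (assumption: not the match data).
X_arr, y_arr = make_classification(n_samples=600, n_features=12, random_state=0)
X_demo = pd.DataFrame(X_arr)
X_demo.columns = [f"f{i}" for i in range(12)]

# Split once, and keep the same split for both stages.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_arr, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)

# Out-of-fold predictions: each training row is predicted by a forest that never saw it.
X_tr_new = X_tr.assign(prediction_from_RF=cross_val_predict(rf, X_tr, y_tr, cv=5))

# Refit on the full training set, then predict the untouched test rows.
rf.fit(X_tr, y_tr)
X_te_new = X_te.assign(prediction_from_RF=rf.predict(X_te))

# Second-stage model trained on the original features plus the forest's prediction.
booster = GradientBoostingClassifier(random_state=0)
booster.fit(X_tr_new, y_tr)
stacked_score = booster.score(X_te_new, y_te)
print(f"stacked test accuracy: {stacked_score:.3f}")
```

With this setup the test rows are never used to fit either model, so the reported accuracy is a fairer estimate of how the stacked model would behave on unseen matches.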