# The Project

The idea here is to simulate the matches of the 2024 Africa Cup of Nations (CAN 2024) using machine learning, in order to predict the winner of the competition. The project uses two datasets: [International football results from 1872 to 2023](https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017) and [FIFA World Ranking 1992-2023](https://www.kaggle.com/datasets/cashncarry/fifaworldranking).


# Data Preparation

Here, I prepare the data so that feature engineering methods can be applied to it, producing the dataset needed to train machine learning algorithms.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('C:/Users/hp/Desktop/S5_enetcom/Mlops_project1/data/footballresults/results.csv')

In [3]:
df["date"] = pd.to_datetime(df["date"])

In [4]:
df.isna().sum()

date          0
home_team     0
away_team     0
home_score    0
away_score    0
tournament    0
city          0
country       0
neutral       0
dtype: int64

In [5]:
df.dropna(inplace=True)

In [6]:
df.dtypes

date          datetime64[ns]
home_team             object
away_team             object
home_score             int64
away_score             int64
tournament            object
city                  object
country               object
neutral                 bool
dtype: object

In [7]:
df.sort_values("date").tail()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
44907,2023-09-12,China PR,Syria,0,1,Friendly,Chengdu,China PR,False
44908,2023-09-12,Egypt,Tunisia,1,3,Friendly,Cairo,Egypt,False
44909,2023-09-12,Germany,France,2,1,Friendly,Dortmund,Germany,False
44901,2023-09-12,Ecuador,Uruguay,2,1,FIFA World Cup qualification,Quito,Ecuador,False
44933,2023-09-12,Romania,Kosovo,2,0,UEFA Euro qualification,Bucharest,Romania,False


In [8]:
df.sort_values("date").head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


In [14]:
df = df[(df["date"] >= "2014-8-1")].reset_index(drop=True)

In [15]:
df.sort_values("date").head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,2014-08-02,Guinea-Bissau,Botswana,1,1,African Cup of Nations qualification,Bissau,Guinea-Bissau,False
1,2014-08-02,Malawi,Benin,1,0,African Cup of Nations qualification,Blantyre,Malawi,False
2,2014-08-02,Rwanda,Congo,2,0,African Cup of Nations qualification,Kigali,Rwanda,False
3,2014-08-03,Kenya,Lesotho,0,0,African Cup of Nations qualification,Nairobi,Kenya,False
4,2014-08-03,Mauritania,Uganda,0,1,African Cup of Nations qualification,Nouakchott,Mauritania,False


In [16]:
df.sort_values("date").tail()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
8314,2023-09-12,China PR,Syria,0,1,Friendly,Chengdu,China PR,False
8315,2023-09-12,Egypt,Tunisia,1,3,Friendly,Cairo,Egypt,False
8316,2023-09-12,Germany,France,2,1,Friendly,Dortmund,Germany,False
8308,2023-09-12,Ecuador,Uruguay,2,1,FIFA World Cup qualification,Quito,Ecuador,False
8340,2023-09-12,Romania,Kosovo,2,0,UEFA Euro qualification,Bucharest,Romania,False


In [17]:
df.home_team.value_counts()

United States    98
Mexico           92
Qatar            84
Japan            84
Morocco          78
                 ..
Vatican City      1
Galicia           1
Aymara            1
Ticino            1
Hmong             1
Name: home_team, Length: 278, dtype: int64

# FIFA Ranking

In [18]:
rank = pd.read_csv('C:/Users/hp/Desktop/S5_enetcom/Mlops_project1/data/fifaranking/fifa_ranking-2023-07-20.csv')
rank

Unnamed: 0,rank,country_full,country_abrv,total_points,previous_points,rank_change,confederation,rank_date
0,1,Germany,GER,57.00,0.00,0,UEFA,1992-12-31
1,96,Syria,SYR,11.00,0.00,0,AFC,1992-12-31
2,97,Burkina Faso,BFA,11.00,0.00,0,CAF,1992-12-31
3,99,Latvia,LVA,10.00,0.00,0,UEFA,1992-12-31
4,100,Burundi,BDI,10.00,0.00,0,CAF,1992-12-31
...,...,...,...,...,...,...,...,...
64752,66,Cabo Verde,CPV,1354.65,1354.65,0,CAF,2023-07-20
64753,67,Iceland,ISL,1352.98,1352.98,0,UEFA,2023-07-20
64754,68,North Macedonia,MKD,1350.53,1350.53,0,UEFA,2023-07-20
64755,58,Jamaica,JAM,1409.73,1367.83,-5,CONCACAF,2023-07-20


In [19]:
rank["rank_date"] = pd.to_datetime(rank["rank_date"])
rank = rank[(rank["rank_date"] >= "2014-8-1")].reset_index(drop=True)
rank.head()

Unnamed: 0,rank,country_full,country_abrv,total_points,previous_points,rank_change,confederation,rank_date
0,131,Kazakhstan,KAZ,213.0,220.0,4,UEFA,2014-08-14
1,147,Syria,SYR,161.0,169.0,1,AFC,2014-08-14
2,129,Afghanistan,AFG,217.0,217.0,0,AFC,2014-08-14
3,129,Burundi,BDI,217.0,222.0,3,CAF,2014-08-14
4,128,Philippines,PHI,221.0,218.0,0,AFC,2014-08-14


Some teams are named differently in the ranking dataset than in the results dataset, so the ranking names have to be adapted before the two can be joined.

In [20]:
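# Align ranking names with those used in the results dataset.
# Caution: "Congo DR" becomes "Congo", so it can no longer be told apart from any existing "Congo" entry.
rank["country_full"] = (
    rank["country_full"]
    .str.replace("IR Iran", "Iran")
    .str.replace("Korea Republic", "South Korea")
    .str.replace("USA", "United States")
    .str.replace("Côte d'Ivoire", "Ivory Coast")
    .str.replace("Congo DR", "Congo")
    .str.replace("The Gambia", "Gambia")
    .str.replace("Cabo Verde", "Cape Verde")
    .str.replace("Cape Verde Islands", "Cape Verde")
)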
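
A quick way to see which names still need adapting is to compare the team names of the two datasets directly. The sketch below is only an illustration on toy data: the frames `results` and `ranking` and the dictionary `name_map` are hypothetical stand-ins, and it uses an exact-match `Series.replace` with a dictionary rather than the chained substring `str.replace` calls of the notebook.

```python
import pandas as pd

# Toy stand-ins for the notebook's results and ranking frames (hypothetical data).
results = pd.DataFrame({
    "home_team": ["Ivory Coast", "Cape Verde", "Gambia"],
    "away_team": ["South Korea", "Iran", "Czech Republic"],
})
ranking = pd.DataFrame({
    "country_full": ["Côte d'Ivoire", "Cabo Verde", "The Gambia",
                     "Korea Republic", "IR Iran", "Czechia"],
})

# Hypothetical mapping from ranking names to results names (whole-value match only).
name_map = {
    "Côte d'Ivoire": "Ivory Coast",
    "Cabo Verde": "Cape Verde",
    "The Gambia": "Gambia",
    "Korea Republic": "South Korea",
    "IR Iran": "Iran",
}
ranking["country_full"] = ranking["country_full"].replace(name_map)

# Teams that appear in the results but have no ranking entry under that name.
result_teams = set(results["home_team"]) | set(results["away_team"])
unmatched = sorted(result_teams - set(ranking["country_full"]))
print(unmatched)
```

Any name printed here would be dropped by an inner join on team name, so it is a candidate for the mapping.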

In [21]:
# Expand each team's ranking snapshots to one row per day, carrying the last published values forward.
rank = rank.set_index(['rank_date']).groupby(['country_full'], group_keys=False).resample('D').first().ffill().reset_index()

In [22]:
rank.head()


Unnamed: 0,rank_date,rank,country_full,country_abrv,total_points,previous_points,rank_change,confederation
0,2014-08-14,129.0,Afghanistan,AFG,217.0,217.0,0.0,AFC
1,2014-08-15,129.0,Afghanistan,AFG,217.0,217.0,0.0,AFC
2,2014-08-16,129.0,Afghanistan,AFG,217.0,217.0,0.0,AFC
3,2014-08-17,129.0,Afghanistan,AFG,217.0,217.0,0.0,AFC
4,2014-08-18,129.0,Afghanistan,AFG,217.0,217.0,0.0,AFC
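
To make the daily expansion easier to follow, here is a minimal sketch on toy data. It assumes two hypothetical teams "A" and "B" with a few ranking snapshots each, and it uses an explicit per-team loop instead of the chained `groupby(...).resample(...)` call, so it illustrates the idea rather than reproducing the notebook cell exactly.

```python
import pandas as pd

# Hypothetical ranking snapshots: rankings are only published on some dates.
snapshots = pd.DataFrame({
    "rank_date": pd.to_datetime(
        ["2014-08-14", "2014-08-17", "2014-08-14", "2014-08-16"]
    ),
    "country_full": ["A", "A", "B", "B"],
    "rank": [10, 8, 50, 52],
})

pieces = []
for country, grp in snapshots.groupby("country_full"):
    # One row per calendar day; days without a snapshot become NaN,
    # then ffill() carries the last published rank forward.
    per_day = grp.set_index("rank_date")[["rank"]].resample("D").first().ffill()
    per_day["country_full"] = country
    pieces.append(per_day)

daily = pd.concat(pieces).reset_index()
print(daily)
```

With one row per team and per day, every match date can later be looked up with an exact join on date and team name.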


In [23]:
# Attach the home team's ranking on the match date (default inner join: matches without a ranking row are dropped).
df_wc_ranked = df.merge(rank[["country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date"]], left_on=["date", "home_team"], right_on=["rank_date", "country_full"]).drop(["rank_date", "country_full"], axis=1)


In [24]:
df_wc_ranked.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,total_points,previous_points,rank,rank_change
0,2014-08-14,Guatemala,Nicaragua,3,0,Friendly,Antigua Guatemala,Guatemala,False,203.0,204.0,134.0,0.0
1,2014-08-16,British Virgin Islands,Saint Martin,2,4,Friendly,Road Town,British Virgin Islands,False,13.0,13.0,201.0,1.0
2,2014-08-20,Azerbaijan,Uzbekistan,0,0,Friendly,Baku,Azerbaijan,False,413.0,410.0,73.0,0.0
3,2014-08-20,Panama,Cuba,4,0,Friendly,Panama City,Panama,False,474.0,684.0,63.0,30.0
4,2014-08-23,Guatemala,Cuba,1,0,Friendly,Guatemala,Guatemala,False,203.0,204.0,134.0,0.0
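
Because `merge` performs an inner join by default, matches whose team has no ranking row on the match date disappear silently. The following sketch, on toy data, shows one possible way to list them with a left join and `indicator=True`. The frames `matches` and `daily_rank` are hypothetical stand-ins for the notebook's frames, and the values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the results frame and the daily ranking frame.
matches = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-31", "2015-02-01", "2015-02-01"]),
    "home_team": ["Equatorial Guinea", "Ghana", "Galicia"],
})
daily_rank = pd.DataFrame({
    "rank_date": pd.to_datetime(["2015-01-31", "2015-02-01"]),
    "country_full": ["Equatorial Guinea", "Ghana"],
    "rank": [118.0, 37.0],
})

# A left join keeps every match; the _merge column says whether a ranking was found.
checked = matches.merge(
    daily_rank,
    how="left",
    left_on=["date", "home_team"],
    right_on=["rank_date", "country_full"],
    indicator=True,
)
dropped = checked.loc[checked["_merge"] == "left_only", "home_team"].tolist()
print(dropped)
```

Teams listed in `dropped` are either absent from the ranking or spelled differently there, which is worth checking before training on the merged data.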


In [25]:
# Attach the away team's ranking the same way; the suffixes distinguish home and away ranking columns.
df_wc_ranked = df_wc_ranked.merge(rank[["country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date"]], left_on=["date", "away_team"], right_on=["rank_date", "country_full"], suffixes=("_home", "_away")).drop(["rank_date", "country_full"], axis=1)

In [26]:
df_wc_ranked.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,total_points_home,previous_points_home,rank_home,rank_change_home,total_points_away,previous_points_away,rank_away,rank_change_away
0,2014-08-14,Guatemala,Nicaragua,3,0,Friendly,Antigua Guatemala,Guatemala,False,203.0,204.0,134.0,0.0,78.0,78.0,175.0,0.0
1,2014-08-20,Azerbaijan,Uzbekistan,0,0,Friendly,Baku,Azerbaijan,False,413.0,410.0,73.0,0.0,528.0,523.0,51.0,-1.0
2,2014-08-20,Panama,Cuba,4,0,Friendly,Panama City,Panama,False,474.0,684.0,63.0,30.0,233.0,245.0,124.0,4.0
3,2014-08-23,Guatemala,Cuba,1,0,Friendly,Guatemala,Guatemala,False,203.0,204.0,134.0,0.0,233.0,245.0,124.0,4.0
4,2014-08-26,Seychelles,Sri Lanka,1,2,Friendly,Victoria,Seychelles,False,68.0,64.0,180.0,-2.0,71.0,71.0,178.0,0.0


In [27]:
df_wc_ranked.to_csv('C:/Users/hp/Desktop/S5_enetcom/Mlops_project1/data/my_dataset_merge.csv', index=False)
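
The export above writes to an absolute path on one machine. As a hedged alternative, the sketch below builds the path with `pathlib` and reloads the file. The directory `data_dir` and the toy frame `merged` are assumptions for illustration (a temporary directory is used so the example runs anywhere). Note that CSV does not store dtypes, so the `date` column has to be parsed again on reload.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data directory; a temporary folder keeps the sketch runnable anywhere.
data_dir = Path(tempfile.mkdtemp()) / "data"
data_dir.mkdir(parents=True, exist_ok=True)

# Toy stand-in for the merged frame.
merged = pd.DataFrame({
    "date": pd.to_datetime(["2015-02-08"]),
    "home_team": ["Ivory Coast"],
    "away_team": ["Ghana"],
    "rank_home": [28.0],
    "rank_away": [37.0],
})

out_path = data_dir / "my_dataset_merge.csv"
merged.to_csv(out_path, index=False)

# parse_dates restores the datetime dtype that the CSV format does not keep.
reloaded = pd.read_csv(out_path, parse_dates=["date"])
print(reloaded.dtypes)
```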

In [28]:
#df_wc_ranked[(df_wc_ranked.home_team == "Morocco") | (df_wc_ranked.away_team == "Morocco")].tail(50)


In [29]:
df_wc_ranked.dtypes

date                    datetime64[ns]
home_team                       object
away_team                       object
home_score                       int64
away_score                       int64
tournament                      object
city                            object
country                         object
neutral                           bool
total_points_home              float64
previous_points_home           float64
rank_home                      float64
rank_change_home               float64
total_points_away              float64
previous_points_away           float64
rank_away                      float64
rank_change_away               float64
dtype: object

In [30]:
rank['country_full'].unique()
#df['home_team'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra',
       'Angola', 'Anguilla', 'Antigua and Barbuda',
       'Aotearoa New Zealand', 'Argentina', 'Armenia', 'Aruba',
       'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain',
       'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin',
       'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'British Virgin Islands',
       'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi',
       'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands',
       'Central African Republic', 'Chad', 'Chile', 'China PR',
       'Chinese Taipei', 'Colombia', 'Comoros', 'Congo', 'Cook Islands',
       'Costa Rica', 'Croatia', 'Cuba', 'Curacao', 'Curaçao', 'Cyprus',
       'Czech Republic', 'Czechia', 'Denmark', 'Djibouti', 'Dominica',
       'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'England',
       'Equatorial Guinea', 'Eritrea', 'Estonia', 'Eswa

In [31]:
# Keep only Africa Cup of Nations final-tournament matches.
df1 = df_wc_ranked[df_wc_ranked.tournament == "African Cup of Nations"]

In [32]:
# Restrict to the date window that contains the 2015 edition.
df2 = df1[(df1["date"] >= "2015-1-1") & (df1["date"] < "2016-8-1")].reset_index(drop=True)

In [33]:
df2.tail(5)

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,total_points_home,previous_points_home,rank_home,rank_change_home,total_points_away,previous_points_away,rank_away,rank_change_away
21,2015-01-31,Equatorial Guinea,Tunisia,2,1,African Cup of Nations,Bata,Equatorial Guinea,False,260.0,251.0,118.0,-2.0,873.0,867.0,22.0,0.0
22,2015-02-01,Ghana,Guinea,3,0,African Cup of Nations,Malabo,Equatorial Guinea,True,714.0,714.0,37.0,0.0,706.0,706.0,39.0,0.0
23,2015-02-01,Ivory Coast,Algeria,3,1,African Cup of Nations,Malabo,Equatorial Guinea,True,833.0,833.0,28.0,0.0,948.0,948.0,18.0,0.0
24,2015-02-05,Equatorial Guinea,Ghana,0,3,African Cup of Nations,Malabo,Equatorial Guinea,False,260.0,251.0,118.0,-2.0,714.0,714.0,37.0,0.0
25,2015-02-08,Ivory Coast,Ghana,0,0,African Cup of Nations,Bata,Equatorial Guinea,True,833.0,833.0,28.0,0.0,714.0,714.0,37.0,0.0
