# Who will win the 2018 FIFA World Cup final?

### Project goal:

* The goal of this project is, for once, exceptionally clear and simple: predicting the winner of the World Cup final.

### Assumptions:

* Algorithm: I place no restrictions on the choice of algorithm. What counts is predictive performance, so I will test several of the more sophisticated solutions.
* Explanatory variables: I focus on variables from two sources (historical match results and FIFA ranking updates).
* Model quality: measured with *Accuracy*.

### Datasets:
* [FIFA ranking.](https://www.kaggle.com/agostontorok/soccer-world-cup-2018-winner/data)
* [Match results.](https://www.kaggle.com/agostontorok/soccer-world-cup-2018-winner/data)

## 1. Loading the required libraries.

In [1]:
import pandas as pd
import numpy as np
import xgboost
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_curve, auc
from scipy.stats import randint
import warnings

In [2]:
pd.set_option('display.max_columns', 50)

In [3]:
warnings.filterwarnings(module='sklearn*', action='ignore', category=DeprecationWarning)

## 2. Loading the data.

### 2.1. Loading the match results dataset.

In [4]:
matches = pd.read_csv('data//long/matches.csv')

Preview of the dataset.

In [5]:
matches.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


Checking the variable types.

In [6]:
matches.dtypes

date          object
home_team     object
away_team     object
home_score     int64
away_score     int64
tournament    object
city          object
country       object
neutral         bool
dtype: object

Converting the 'date' variable to a datetime type.

In [7]:
matches['date'] = pd.to_datetime(matches['date'], format = '%Y-%m-%d') 

In [8]:
matches.dtypes

date          datetime64[ns]
home_team             object
away_team             object
home_score             int64
away_score             int64
tournament            object
city                  object
country               object
neutral                 bool
dtype: object

In [9]:
# encode the boolean flag as 0/1
matches['neutral'] = matches['neutral'].astype(int)

Renaming the variables.

In [10]:
matches.columns = ['date', 'team_1', 'team_2', 'score_1', 'score_2', 'tournament', 'city', 'country', 'neutral']

### 2.2. Loading the FIFA ranking.

In [11]:
fifa_ranking = pd.read_csv('data//long/fifa_ranking.csv')

Preview of the dataset.

In [12]:
fifa_ranking.sort_values('rank_date').head()

Unnamed: 0,rank,country_full,country_abrv,total_points,previous_points,rank_change,cur_year_avg,cur_year_avg_weighted,last_year_avg,last_year_avg_weighted,two_year_ago_avg,two_year_ago_weighted,three_year_ago_avg,three_year_ago_weighted,confederation,rank_date
0,1,Germany,GER,0.0,57,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08
107,108,Lebanon,LIB,0.0,0,53,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,AFC,1993-08-08
108,109,South Africa,RSA,0.0,4,15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CAF,1993-08-08
109,110,Luxembourg,LUX,0.0,9,-7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08
110,111,Faroe Islands,FRO,0.0,11,-17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08


Checking the variable types.

In [13]:
fifa_ranking.dtypes

rank                         int64
country_full                object
country_abrv                object
total_points               float64
previous_points              int64
rank_change                  int64
cur_year_avg               float64
cur_year_avg_weighted      float64
last_year_avg              float64
last_year_avg_weighted     float64
two_year_ago_avg           float64
two_year_ago_weighted      float64
three_year_ago_avg         float64
three_year_ago_weighted    float64
confederation               object
rank_date                   object
dtype: object

Converting the 'rank_date' variable to a datetime type.

In [14]:
fifa_ranking['rank_date'] = pd.to_datetime(fifa_ranking['rank_date'], format = '%Y-%m-%d')

In [15]:
fifa_ranking.head()

Unnamed: 0,rank,country_full,country_abrv,total_points,previous_points,rank_change,cur_year_avg,cur_year_avg_weighted,last_year_avg,last_year_avg_weighted,two_year_ago_avg,two_year_ago_weighted,three_year_ago_avg,three_year_ago_weighted,confederation,rank_date
0,1,Germany,GER,0.0,57,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08
1,2,Italy,ITA,0.0,57,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08
2,3,Switzerland,SUI,0.0,50,9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08
3,4,Sweden,SWE,0.0,55,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08
4,5,Argentina,ARG,0.0,51,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CONMEBOL,1993-08-08


Renaming the variables.

In [16]:
fifa_ranking.columns = ['rank', 'country_full', 'country_abrv', 'total_points',
       'previous_points', 'rank_change', 'cur_year_avg',
       'cur_year_avg_weighted', 'last_year_avg', 'last_year_avg_weighted',
       'two_year_ago_avg', 'two_year_ago_weighted', 'three_year_ago_avg',
       'three_year_ago_weighted', 'confederation', 'date']

## 3. Preparing the input datasets.

#### Building the target variable

In [17]:
matches = matches.assign(y = 0)
matches.loc[matches.score_1 > matches.score_2, 'y'] = 1
matches.loc[matches.score_1 < matches.score_2, 'y'] = 2
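
The same encoding can be written in a single step. The sketch below is only an illustration on made-up scores. It uses `np.select` instead of the two `.loc` assignments, and the `toy` frame is a hypothetical stand-in for `matches`.

```python
import numpy as np
import pandas as pd

# Toy scores (made up); the column names follow the notebook.
toy = pd.DataFrame({'score_1': [0, 4, 2, 1], 'score_2': [0, 2, 3, 1]})

# 1 = team 1 wins, 2 = team 2 wins, 0 = draw (the default).
toy['y'] = np.select(
    [toy.score_1 > toy.score_2, toy.score_1 < toy.score_2],
    [1, 2],
    default=0,
)
print(toy)
```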

#### Removing matches that ended in a draw

I focus on determining the winner, because a final cannot end in a draw.

In [18]:
print(matches.shape)
matches = matches[matches.y != 0]
print(matches.shape)

(39668, 10)
(30517, 10)


#### Removing matches played on non-neutral ground

In [19]:
matches = matches[matches.neutral == 1]
matches.shape

(7590, 10)

In [20]:
matches.drop('neutral', axis = 1, inplace = True)

#### Removing old matches.

In [21]:
fifa_ranking.date.describe()

count                   57793
unique                    286
top       2016-12-22 00:00:00
freq                      211
first     1993-08-08 00:00:00
last      2018-06-07 00:00:00
Name: date, dtype: object

The first ranking was published on 8 August 1993, so I remove all matches played before that date.

In [22]:
matches.shape

(7590, 9)

In [23]:
matches = matches[matches.date >= '1993-08-08']
matches.shape

(4550, 9)

I keep the matches played on 8 August 1993 in the dataset. A ranking published on the same day a match is played certainly does not take the result of that match into account :)
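
A quick check of the boundary behaviour, on made-up dates: comparing a datetime column with the string `'1993-08-08'` using `>=` keeps the boundary day itself. The `dates` series is a hypothetical stand-in for `matches.date`.

```python
import pandas as pd

# Three made-up dates around the publication of the first ranking.
dates = pd.Series(pd.to_datetime(['1993-08-07', '1993-08-08', '1993-08-09']))

# The string is parsed as a timestamp; >= keeps the boundary day as well.
kept = dates[dates >= '1993-08-08']
print(kept.dt.strftime('%Y-%m-%d').tolist())
```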

#### Adding an identifier (index) of the FIFA ranking entry.

Joining the two datasets directly on the country name and the date (year and month) of the match or ranking publication is not a good idea, because:
* The FIFA ranking is not published every month, so joining on the date would produce many missing values.
* It would leak data. A ranking published a few days after a match (especially a high-stakes one) would hint at the result of that match.

I therefore need a function that assigns a FIFA ranking identifier to each match. I do not want to assign the ranking position directly, because the `fifa_ranking` dataset contains much more information.

In [24]:
def find_current_ranking_position(team_1, team_2, date):
    _fifa_ranking = fifa_ranking[(fifa_ranking.date <= date)]
    
    _fifa_ranking_team_1 = _fifa_ranking[(_fifa_ranking.country_full == team_1)]
    _fifa_ranking_team_2 = _fifa_ranking[(_fifa_ranking.country_full == team_2)]
    
    if(_fifa_ranking_team_1.empty):
        _team_1_id = None
    else:
        _team_1_id = _fifa_ranking_team_1.sort_values('date', ascending = False).iloc[0:1].index[0]
    
    
    if(_fifa_ranking_team_2.empty):
        _team_2_id = None
    else:
        _team_2_id = _fifa_ranking_team_2.sort_values('date', ascending = False).iloc[0:1].index[0]
        
    return pd.Series([_team_1_id, _team_2_id])
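
The function above scans the ranking table once per match. As a possible vectorised alternative (not what this notebook uses), `pd.merge_asof` can attach the latest ranking row published on or before each match date. The sketch below runs on made-up toy frames and handles one team column only. The notebook would need a second pass for `team_2`.

```python
import pandas as pd

# Toy data; the column names follow the notebook, the values are made up.
toy_matches = pd.DataFrame({
    'date': pd.to_datetime(['1993-09-22', '1993-10-06', '1993-08-01']),
    'team_1': ['Mexico', 'Mexico', 'Mexico'],
})
toy_ranking = pd.DataFrame({
    'date': pd.to_datetime(['1993-08-08', '1993-09-23', '1993-10-22']),
    'country_full': ['Mexico', 'Mexico', 'Mexico'],
    'rank': [14, 16, 17],
})

# merge_asof requires both frames to be sorted by the "on" key.
toy_matches = toy_matches.sort_values('date')
toy_ranking = toy_ranking.sort_values('date')

# direction='backward' picks, per country, the most recent ranking row
# whose date is <= the match date; matches with no earlier ranking get NaN.
linked = pd.merge_asof(
    toy_matches,
    toy_ranking,
    on='date',
    left_by='team_1',
    right_by='country_full',
    direction='backward',
)
print(linked[['date', 'team_1', 'rank']])
```

The match played before the first ranking gets no rank, which mirrors the `None` returned by `find_current_ranking_position`.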

In the FIFA ranking, Ireland appears as "Republic of Ireland". I therefore change the name to keep both datasets consistent.

In [25]:
matches.team_1.replace(['Ireland'], ['Republic of Ireland'], inplace = True)

I run the function, which assigns each team its id from the FIFA ranking.

In [26]:
matches[['t1_id', 't2_id']] = matches.apply(lambda row: find_current_ranking_position(row['team_1'], row['team_2'], row['date']), axis = 1)

In [27]:
matches.head()

Unnamed: 0,date,team_1,team_2,score_1,score_2,tournament,city,country,y,t1_id,t2_id
17890,1993-09-22,Mexico,Cameroon,1,0,Friendly,Los Angeles,USA,1,13.0,23.0
17893,1993-09-22,San Marino,Netherlands,0,7,FIFA World Cup qualification,Bologna,Italy,2,118.0,15.0
17907,1993-10-06,Mexico,South Africa,4,0,Friendly,Los Angeles,USA,1,182.0,261.0
17927,1993-10-15,Korea DPR,Iraq,3,2,FIFA World Cup qualification,Doha,Qatar,1,230.0,225.0
17929,1993-10-16,Iran,Korea Republic,0,3,FIFA World Cup qualification,Doha,Qatar,2,,202.0


#### Merging both datasets.

In [28]:
matches = matches.merge(fifa_ranking, how = 'inner', left_on = 't1_id', right_index = True, suffixes = ['', '_1'])
matches = matches.merge(fifa_ranking, how = 'inner', left_on = 't2_id', right_index = True, suffixes = ['', '_2'])
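
Because both merges are inner joins on the ranking index, matches for which no ranking id was found (a missing `t1_id` or `t2_id`) drop out at this point. A small illustration on made-up toy frames:

```python
import numpy as np
import pandas as pd

# Toy frames (made up): the second match has no ranking id.
toy_matches = pd.DataFrame({'team_1': ['Mexico', 'Iran'], 't1_id': [13.0, np.nan]})
toy_ranking = pd.DataFrame({'rank': [14, 59]}, index=[13, 230])

# Inner join of a column against the other frame's index:
# rows whose id is missing or absent from the index are discarded.
merged = toy_matches.merge(toy_ranking, how='inner', left_on='t1_id', right_index=True)
print(merged)
```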

In [29]:
matches.columns = ['date', 'team_1', 'team_2', 'score_1', 'score_2', 'tournament', 'city',
       'country', 'y', 't1_id', 't2_id', 'rank_1', 'country_full_1',
       'country_abrv_1', 'total_points_1', 'previous_points_1', 'rank_change_1',
       'cur_year_avg_1', 'cur_year_avg_weighted_1', 'last_year_avg_1',
       'last_year_avg_weighted_1', 'two_year_ago_avg_1', 'two_year_ago_weighted_1',
       'three_year_ago_avg_1', 'three_year_ago_weighted_1', 'confederation_1',
       'date_1', 'rank_2', 'country_full_2', 'country_abrv_2',
       'total_points_2', 'previous_points_2', 'rank_change_2',
       'cur_year_avg_2', 'cur_year_avg_weighted_2', 'last_year_avg_2',
       'last_year_avg_weighted_2', 'two_year_ago_avg_2',
       'two_year_ago_weighted_2', 'three_year_ago_avg_2',
       'three_year_ago_weighted_2', 'confederation_2', 'date_2']

#### Removing unneeded variables.

In [30]:
matches.drop(['team_1', 'team_2', 'score_1', 'score_2', 'city', 'country', 't1_id', 't2_id', 'country_full_1', 'country_abrv_1', 'date_1', 'country_full_2', 'country_abrv_2', 'date_2'], axis = 1, inplace = True)

#### Checking the size of the dataset.

In [31]:
matches.shape

(3352, 29)

#### Adding new variables.

I add new variables. `time_delta` describes how long ago the match was played, in months relative to the day of the final (2018-07-15). I will use it to build the sample weights for the model.

In [32]:
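# months before the final (2018-07-15); one month = 30.436875 days (365.2425 / 12),
# which reproduces the values obtained with np.timedelta64(1, 'M') without relying on the 'M' unit
matches['time_delta'] = (matches['date'] - pd.to_datetime('2018-07-15', format = '%Y-%m-%d'))/np.timedelta64(1, 'D')/30.436875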
matches['year'] = matches.date.dt.year
matches['month'] = matches.date.dt.month
matches['day_of_week'] = matches.date.dt.dayofweek
matches['day_of_year'] = matches.date.dt.dayofyear
matches['is_year_end'] = matches.date.dt.is_year_end
matches['is_year_start'] = matches.date.dt.is_year_start
matches['quarter'] = matches.date.dt.quarter
matches['is_year_end'].replace([False, True], [0, 1], inplace = True)
matches['is_year_start'].replace([False, True], [0, 1], inplace = True)

In [33]:
matches.drop(['date'], axis = 1, inplace = True)

#### Encoding the categorical variables.

In [34]:
matches.head()

Unnamed: 0,tournament,y,rank_1,total_points_1,previous_points_1,rank_change_1,cur_year_avg_1,cur_year_avg_weighted_1,last_year_avg_1,last_year_avg_weighted_1,two_year_ago_avg_1,two_year_ago_weighted_1,three_year_ago_avg_1,three_year_ago_weighted_1,confederation_1,rank_2,total_points_2,previous_points_2,rank_change_2,cur_year_avg_2,cur_year_avg_weighted_2,last_year_avg_2,last_year_avg_weighted_2,two_year_ago_avg_2,two_year_ago_weighted_2,three_year_ago_avg_2,three_year_ago_weighted_2,confederation_2,time_delta,year,month,day_of_week,day_of_year,is_year_end,is_year_start,quarter
17890,Friendly,1,14,0.0,42,11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CONCACAF,24,0.0,43,-2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CAF,-297.73096,1993,9,2,265,0,0,3
17893,FIFA World Cup qualification,2,119,0.0,4,7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,16,0.0,53,-9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,-297.73096,1993,9,2,265,0,0,3
17907,Friendly,1,16,0.0,50,-2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CONCACAF,95,0.0,11,14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CAF,-297.270991,1993,10,2,279,0,0,4
17935,Friendly,1,16,0.0,50,-2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CONCACAF,132,0.0,4,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,-296.811023,1993,10,2,293,0,0,4
17927,FIFA World Cup qualification,1,64,0.0,30,-3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,AFC,59,0.0,30,-2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,AFC,-296.975297,1993,10,4,288,0,0,4


In [35]:
label_encoder = LabelEncoder()
label_encoder.fit(matches.tournament)
matches.tournament = label_encoder.transform(matches.tournament)

label_encoder.fit(matches.confederation_1)
matches.confederation_1 = label_encoder.transform(matches.confederation_1)

matches.confederation_2 = label_encoder.transform(matches.confederation_2)
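
The cell above fits the encoder on `confederation_1` and reuses it for `confederation_2`, which works as long as every confederation in the second column also occurs in the first. A slightly more defensive sketch, on made-up toy data, fits the encoder on the values of both columns:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (made up): 'AFC' never appears in the first column.
toy = pd.DataFrame({
    'confederation_1': ['UEFA', 'CAF', 'UEFA'],
    'confederation_2': ['AFC', 'UEFA', 'CAF'],
})

encoder = LabelEncoder()
# Fit on both columns so that neither transform meets an unseen label.
encoder.fit(pd.concat([toy.confederation_1, toy.confederation_2]))
toy['confederation_1'] = encoder.transform(toy.confederation_1)
toy['confederation_2'] = encoder.transform(toy.confederation_2)

print(encoder.classes_.tolist())
print(toy)
```

Both columns then share one mapping, so the same code denotes the same confederation on either side.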

## 4. Preparing the data for modelling.

### 4.1. Removing missing values.

In [36]:
matches.isnull().any().any()

False

### 4.2. Removing outliers.

I will use XGBoost and GradientBoostingClassifier. Both are tree-based and largely insensitive to outliers in the explanatory variables, so I skip this step.

### 4.3. Splitting the dataset.

In [37]:
print(matches.y.value_counts(normalize = True))

1    0.555191
2    0.444809
Name: y, dtype: float64


In [38]:
# separate the target variable from the explanatory variables
y = matches.y.copy()
x = matches.drop('y', axis = 1)

# recode the target: 1 (team 1 wins) -> 0, 2 (team 2 wins) -> 1
y.replace([1,2], [0, 1], inplace = True)

# split the dataset into a training and a test part
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size = 0.2, stratify = y, random_state = 20180715)

# set aside the weights that I will use when training the model
weights_tr = x_tr.time_delta
weights_te = x_te.time_delta

In [39]:
print(y_tr.value_counts(normalize = True))
print(y_te.value_counts(normalize = True))

0    0.555017
1    0.444983
Name: y, dtype: float64
0    0.555887
1    0.444113
Name: y, dtype: float64


In [40]:
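# make the weights positive: -min(time_delta) - time_delta
# (time_delta is negative, so with this formula older matches receive larger weights)
weights_tr = (weights_tr.min()*-1) - weights_tr
weights_te = (weights_te.min()*-1) - weights_te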
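
To see what this transformation does, the sketch below applies it to three made-up `time_delta` values and compares it with an alternative that gives recent matches the larger weights. The alternative is an assumption about one possible intent, not the formula used in this notebook.

```python
import pandas as pd

# time_delta in months relative to the final: negative, older = more negative.
time_delta = pd.Series([-297.7, -120.0, -1.0])

# Formula from the cell above: the oldest match ends up with the largest weight.
w_original = (time_delta.min() * -1) - time_delta

# Alternative (assumed intent): shift by the minimum, so the oldest match
# gets weight 0 and the most recent match gets the largest weight.
w_recent = time_delta - time_delta.min()

print(w_original.round(1).tolist())
print(w_recent.round(1).tolist())
```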

## 5. Modelling.

### 5.1. Model 1.

Model description:
* GradientBoostingClassifier.
* No hyperparameter tuning.
* No feature selection.

In [42]:
gbc_1 = GradientBoostingClassifier()

In [47]:
cv = cross_val_score(gbc_1, x_tr, y_tr, cv = 10, scoring = 'accuracy', n_jobs=-1)
print('Mean accuracy: ' + str(cv.mean().round(3)))
print('Stability: ' + str((cv.std()*100/cv.mean()).round(3)) + '%')

Mean accuracy: 0.697
Stability: 4.875%
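
The stability figure is the coefficient of variation of the fold scores: their standard deviation expressed as a percentage of their mean, so lower values mean more consistent folds. A minimal illustration with hypothetical fold accuracies:

```python
import numpy as np

# Hypothetical fold accuracies (made up for illustration).
fold_scores = np.array([0.70, 0.66, 0.73, 0.69, 0.71])

mean_acc = fold_scores.mean()
# Standard deviation as a percentage of the mean, as in the cell above.
stability = fold_scores.std() * 100 / mean_acc

print(round(mean_acc, 3), round(stability, 3))
```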


In [48]:
gbc_1.fit(x_tr, y_tr)
print("Model accuracy on the test set without weights: {}.".format(gbc_1.score(x_te, y_te).round(4)))

Model accuracy on the test set without weights: 0.7139.


In [49]:
gbc_1.fit(x_tr, y_tr, sample_weight=weights_tr)
print("Model accuracy on the test set with weights: {}.".format(gbc_1.score(x_te, y_te, sample_weight=weights_te).round(4)))

Model accuracy on the test set with weights: 0.716.
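
When `sample_weight` is passed to `score`, the accuracy becomes a weighted average: the total weight of the correctly classified matches divided by the total weight. The toy arrays below are made up and only illustrate the arithmetic.

```python
import numpy as np

# Made-up labels, predictions and weights.
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
weights = np.array([1.0, 2.0, 3.0, 4.0])

correct = (y_true == y_pred)

# Plain accuracy: share of correct predictions.
plain_accuracy = correct.mean()
# Weighted accuracy: weight of correct predictions / total weight.
weighted_accuracy = np.average(correct, weights=weights)

print(plain_accuracy, weighted_accuracy)
```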


In [55]:
pd.DataFrame(gbc_1.feature_importances_, index = x_tr.columns).sort_values(0)

Unnamed: 0,0
quarter,0.0
is_year_end,0.0
is_year_start,0.0
cur_year_avg_weighted_2,0.001212
month,0.002876
cur_year_avg_1,0.002897
year,0.003589
total_points_1,0.00463
last_year_avg_1,0.004661
cur_year_avg_weighted_1,0.007113


### 5.2. Model 2.

Model description:
* XGBoost.

In [50]:
xgb_1 = xgboost.XGBClassifier()

In [51]:
cv = cross_val_score(xgb_1, x_tr, y_tr, cv = 10, scoring = 'accuracy', n_jobs=-1)
print('Mean accuracy: ' + str(cv.mean().round(3)))
print('Stability: ' + str((cv.std()*100/cv.mean()).round(3)) + '%')

Mean accuracy: 0.693
Stability: 4.941%


In [52]:
xgb_1.fit(x_tr, y_tr)
print("Model accuracy on the test set without weights: {}.".format(xgb_1.score(x_te, y_te).round(4)))

Model accuracy on the test set without weights: 0.7139.


In [53]:
xgb_1.fit(x_tr, y_tr, sample_weight=weights_tr)
print("Model accuracy on the test set with weights: {}.".format(xgb_1.score(x_te, y_te, sample_weight=weights_te).round(4)))

Model accuracy on the test set with weights: 0.7087.


In [54]:
pd.DataFrame(xgb_1.feature_importances_, index = x_tr.columns).sort_values(0)

Unnamed: 0,0
quarter,0.0
is_year_end,0.0
two_year_ago_weighted_2,0.0
last_year_avg_weighted_2,0.0
cur_year_avg_weighted_2,0.0
cur_year_avg_weighted_1,0.0
is_year_start,0.0
last_year_avg_weighted_1,0.0
year,0.0
two_year_ago_weighted_1,0.0


### 5.3. Model 3.

##### Searching for the optimal set of parameters - gbc_1

In [84]:
params_rs = {'loss' : ('deviance', 'exponential'),
             'n_estimators' : randint(50,200),
             'max_depth' : randint(5, 50),
             'criterion' : ('friedman_mse', 'mse', 'mae'),
             'min_samples_split' : randint(2, 50),
             'min_samples_leaf' : randint(1, 50),
             'max_features' : ('auto', 'sqrt', 'log2'),
             'max_leaf_nodes' :  randint(2, 50)}  # max_leaf_nodes must be at least 2

In [None]:
rs = RandomizedSearchCV(gbc_1, param_distributions=params_rs, n_iter = 20, n_jobs = -1, cv=5)
rs.fit(x_tr, y_tr)
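
The integer ranges in `params_rs` are `scipy.stats.randint` distributions, from which the search draws candidate values. Their upper bound is exclusive, so `randint(low, high)` yields integers from `low` to `high - 1`. A small check on an arbitrary example range:

```python
from scipy.stats import randint

# Draw 1000 values from randint(2, 5); the seed is fixed only for repeatability.
samples = randint(2, 5).rvs(size=1000, random_state=0)

# The lower bound is inclusive, the upper bound is exclusive.
print(samples.min(), samples.max())
```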

In [71]:
rs.best_params_

{'criterion': 'friedman_mse',
 'loss': 'deviance',
 'max_depth': 27,
 'max_features': 'log2',
 'max_leaf_nodes': 11,
 'min_samples_leaf': 36,
 'min_samples_split': 7,
 'n_estimators': 72}

In [56]:
gbc_params = {'criterion': 'friedman_mse',
 'loss': 'deviance',
 'max_depth': 27,
 'max_features': 'log2',
 'max_leaf_nodes': 11,
 'min_samples_leaf': 36,
 'min_samples_split': 7,
 'n_estimators': 72}

In [57]:
gbc_2 = GradientBoostingClassifier(**gbc_params)

In [58]:
cv = cross_val_score(gbc_2, x_tr, y_tr, cv = 10, scoring = 'accuracy')
print('Mean accuracy: ' + str(cv.mean().round(3)))
print('Stability: ' + str((cv.std()*100/cv.mean()).round(3)) + '%')

Mean accuracy: 0.702
Stability: 5.303%


In [59]:
gbc_2.fit(x_tr, y_tr)
print("Model accuracy on the test set without weights: {}.".format(gbc_2.score(x_te, y_te).round(4)))

Model accuracy on the test set without weights: 0.7049.


In [60]:
gbc_2.fit(x_tr, y_tr, sample_weight=weights_tr)
print("Model accuracy on the test set with weights: {}.".format(gbc_2.score(x_te, y_te, sample_weight=weights_te).round(4)))

Model accuracy on the test set with weights: 0.7121.


### 5.4. Model 4.

##### Searching for the optimal set of parameters - xgb

In [61]:
params_rs = {'max_depth':randint(15, 50),
             'n_estimators':randint(100,300),
             'booster' : ('gbtree', 'gblinear', 'dart'),
             'min_child_weight' : randint(1, 30),
             'max_delta_step' : randint(1, 20)}

In [80]:
rs = RandomizedSearchCV(xgb_1, param_distributions=params_rs, n_iter = 20, n_jobs = -1, cv=5)
rs.fit(x_tr, y_tr)

RandomizedSearchCV(cv=5, error_score='raise',
          estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
          fit_params=None, iid=True, n_iter=20, n_jobs=-1,
          param_distributions={'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fb3942c9748>, 'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fb3942c96a0>, 'booster': ('gbtree', 'gblinear', 'dart'), 'min_child_weight': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fb3942c9da0>, 'max_delta_step': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fb3942c9be0>},
          pre_dispatch='2*n_jobs', random_state=None, re

In [62]:
xgb_params = {'booster': 'gblinear',
 'max_delta_step': 9,
 'max_depth': 46,
 'min_child_weight': 27,
 'n_estimators': 106}

In [63]:
xgb_2 = xgboost.XGBClassifier(**xgb_params)

In [64]:
cv = cross_val_score(xgb_2, x_tr, y_tr, cv = 10, scoring = 'accuracy')
print('Mean accuracy: ' + str(cv.mean().round(3)))
print('Stability: ' + str((cv.std()*100/cv.mean()).round(3)) + '%')

Mean accuracy: 0.707
Stability: 4.92%


In [65]:
xgb_2.fit(x_tr, y_tr)
print("Model accuracy on the test set without weights: {}.".format(xgb_2.score(x_te, y_te).round(4)))

Model accuracy on the test set without weights: 0.7049.


In [66]:
xgb_2.fit(x_tr, y_tr, sample_weight=weights_tr)
print("Model accuracy on the test set with weights: {}.".format(xgb_2.score(x_te, y_te, sample_weight=weights_te).round(4)))

Model accuracy on the test set with weights: 0.7084.


## 6. Predicting the World Cup winner.

##### Preparing the data.

In [67]:
final_match = [21, 20, 945.18, 975,-2, 397.75, 397.75, 672.78, 336.39, 335.96, 100.79, 551.26, 110.25, 5, 7, 1198.13, 1166, 0, 520.12, 520.12, 856.75, 428.38, 393.65, 118.09, 657.68, 131.54, 5, 0, 2018, 7, 6, 196, 0, 0, 3]

In [68]:
final_match = pd.DataFrame(final_match, index = x_tr.columns)

In [69]:
final_match.transpose()

Unnamed: 0,tournament,rank_1,total_points_1,previous_points_1,rank_change_1,cur_year_avg_1,cur_year_avg_weighted_1,last_year_avg_1,last_year_avg_weighted_1,two_year_ago_avg_1,two_year_ago_weighted_1,three_year_ago_avg_1,three_year_ago_weighted_1,confederation_1,rank_2,total_points_2,previous_points_2,rank_change_2,cur_year_avg_2,cur_year_avg_weighted_2,last_year_avg_2,last_year_avg_weighted_2,two_year_ago_avg_2,two_year_ago_weighted_2,three_year_ago_avg_2,three_year_ago_weighted_2,confederation_2,time_delta,year,month,day_of_week,day_of_year,is_year_end,is_year_start,quarter
0,21.0,20.0,945.18,975.0,-2.0,397.75,397.75,672.78,336.39,335.96,100.79,551.26,110.25,5.0,7.0,1198.13,1166.0,0.0,520.12,520.12,856.75,428.38,393.65,118.09,657.68,131.54,5.0,0.0,2018.0,7.0,6.0,196.0,0.0,0.0,3.0
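
A single observation can also be built directly as a one-row frame, which avoids the transpose and allows a length check against the feature list. The sketch below uses a hypothetical, shortened feature list. In the notebook the columns would come from `x_tr.columns`.

```python
import pandas as pd

# Hypothetical, shortened feature list; the notebook uses x_tr.columns.
columns = ['tournament', 'rank_1', 'rank_2']
values = [21, 20, 7]

# Fail early if a value is missing or superfluous.
assert len(values) == len(columns)

# One row, one column per feature; no transpose needed.
final_row = pd.DataFrame([values], columns=columns)
print(final_row)
```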


In [70]:
gbc_2.predict_proba(final_match.transpose())

array([[0.21760278, 0.78239722]])

In [72]:
xgb_2.predict_proba(final_match.transpose())

array([[0.23463535, 0.76536465]], dtype=float32)
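
The columns of `predict_proba` follow the recoded target: column 0 is class 0 (team 1 wins) and column 1 is class 1 (team 2 wins). The sketch below reads the `gbc_2` output this way. The team order (Croatia as team 1, France as team 2) is inferred from the ranks in `final_match` and from the conclusion, so treat it as an assumption.

```python
import numpy as np

# Probabilities copied from the gbc_2 output above.
proba = np.array([[0.21760278, 0.78239722]])

# Assumed order: team 1 = Croatia (rank 20), team 2 = France (rank 7).
teams = ['Croatia', 'France']

# The column with the highest probability indicates the predicted winner.
winner = teams[int(proba[0].argmax())]
print(winner, round(float(proba[0].max()), 2))
```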

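Class 1 corresponds to a win for team 2, which in `final_match` is France (rank 7); team 1 is Croatia (rank 20). According to `gbc_2`, France (78%) will beat Croatia (22%) in the World Cup final :)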