# Google Cloud & NCAA® ML Competition 2020-NCAAM

#### _Created on Mon Apr. 20 13:22:15 2020_
#### _Author: Kang Li_
#### _Email: kohnlee1001@gmail.com_


## 1. Introduction

As the official public cloud provider of the NCAA, Google is proud to provide a competition to help participants strengthen their knowledge of basketball, statistics, data modeling, and cloud technology. As part of its journey to the cloud, the NCAA has migrated 80+ years of historical and play-by-play data, from 90 championships and 24 sports, to Google Cloud Platform (GCP). The NCAA has tapped into decades of historical basketball data using BigQuery, Cloud Spanner, Datalab, Cloud Machine Learning and Cloud Dataflow, to power the analysis of team and player performance. The mission of the NCAA has long been about serving the needs of schools, their teams and students. Google Cloud is proud to support that mission by helping the NCAA use data and machine learning to better engage with its millions of fans, 500,000 student-athletes, and more than 19,000 teams. Game on!

Each season there are thousands of NCAA basketball games played between Division I men's teams, culminating in March Madness®, the 68-team national championship that starts in the middle of March. We have provided a large amount of historical data about college basketball games and teams, going back many years. Armed with this historical data, you can explore it and develop your own distinctive ways of predicting March Madness® game outcomes. You can even evaluate and compare different approaches by seeing which of them would have done best at predicting tournament games from the past.

## 2. Data and Packages Import

In [1]:
import numpy as np
import pandas as pd 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from lightgbm import LGBMClassifier
from xgboost.sklearn import XGBClassifier

In [2]:
tourney_results=pd.read_csv('MNCAATourneyDetailedResults.csv')
tourney_compact=pd.read_csv('MNCAATourneyCompactResults.csv')
regular_results=pd.read_csv('MRegularSeasonDetailedResults.csv')
regular_compact=pd.read_csv('MRegularSeasonCompactResults.csv')
seeds_data=pd.read_csv('MNCAATourneySeeds.csv')

In [3]:
tourney_results.shape

(1115, 34)

## 3. Data Preprocessing

Because the data is spread across several files, we first combine everything into a single dataframe. We also clean and preprocess it so that it is ready for the machine learning models built later.

In [4]:
# This helper duplicates every game with the winning and losing teams swapped,
# so each team appears as T1 in all of its games, whether it won or lost.
def prepare_data(df):
    df = df.copy()  # work on a copy so the caller's dataframe is not modified
    dfswap = df[['Season', 'DayNum', 'LTeamID', 'LScore', 'WTeamID', 'WScore', 'WLoc', 'NumOT', 
    'LFGM', 'LFGA', 'LFGM3', 'LFGA3', 'LFTM', 'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF', 
    'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF']].copy()

    # Flip home/away in the swapped copy. Both masks read the untouched df,
    # so the second assignment does not undo the first.
    dfswap.loc[df['WLoc'] == 'H', 'WLoc'] = 'A'
    dfswap.loc[df['WLoc'] == 'A', 'WLoc'] = 'H'
    # Rename WLoc first so the W/L prefix replacement below leaves it alone.
    df = df.rename(columns={'WLoc': 'location'})
    dfswap = dfswap.rename(columns={'WLoc': 'location'})

    df.columns = [x.replace('W','T1_').replace('L','T2_') for x in list(df.columns)]
    dfswap.columns = [x.replace('L','T1_').replace('W','T2_') for x in list(dfswap.columns)]

    output = pd.concat([df, dfswap]).reset_index(drop=True)
    # Encode the location from T1's point of view: neutral 0, home 1, away -1.
    output['location'] = output['location'].map({'N': 0, 'H': 1, 'A': -1}).astype(int)

    output['PointDiff'] = output['T1_Score'] - output['T2_Score']

    return output
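
The swap idea can be checked on a tiny example. The sketch below is only an illustration on made-up games: the helper name `mirror_games`, the team IDs, and the scores are hypothetical, and it uses explicit `rename` mappings instead of the string replacement in `prepare_data`.

```python
import pandas as pd

# Two made-up games (hypothetical IDs and scores), in the W/L layout of the source files.
games = pd.DataFrame({
    "Season": [2019, 2019],
    "WTeamID": [1101, 1102],
    "WScore": [70, 65],
    "LTeamID": [1102, 1103],
    "LScore": [60, 64],
    "WLoc": ["H", "N"],
})

def mirror_games(df):
    """Return every game twice: once from the winner's view, once from the loser's."""
    winner_view = df.rename(columns={"WTeamID": "T1_TeamID", "WScore": "T1_Score",
                                     "LTeamID": "T2_TeamID", "LScore": "T2_Score"})
    loser_view = df.rename(columns={"LTeamID": "T1_TeamID", "LScore": "T1_Score",
                                    "WTeamID": "T2_TeamID", "WScore": "T2_Score"})
    loc_code = {"H": 1, "A": -1, "N": 0}
    # WLoc describes the winner, so the loser's location is the negated code.
    winner_view = winner_view.assign(location=winner_view["WLoc"].map(loc_code))
    loser_view = loser_view.assign(location=-loser_view["WLoc"].map(loc_code))
    out = pd.concat([winner_view, loser_view], ignore_index=True).drop(columns="WLoc")
    out["PointDiff"] = out["T1_Score"] - out["T2_Score"]
    return out

mirrored = mirror_games(games)
print(mirrored)
```

The first half of the result has positive `PointDiff` (T1 won) and the second half has the same games with the sign flipped.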

In [5]:
regular_data=prepare_data(regular_results)
tourney_data=prepare_data(tourney_results)

In [6]:
tourney_data[tourney_data['PointDiff']>0]

Unnamed: 0,Season,DayNum,T1_TeamID,T1_Score,T2_TeamID,T2_Score,location,NumOT,T1_FGM,T1_FGA,...,T2_FTM,T2_FTA,T2_OR,T2_DR,T2_Ast,T2_TO,T2_Stl,T2_Blk,T2_PF,PointDiff
0,2003,134,1421,92,1411,84,0,1,32,69,...,14,31,17,28,16,15,5,0,22,8
1,2003,136,1112,80,1436,51,0,0,31,66,...,7,7,8,26,12,17,10,3,15,29
2,2003,136,1113,84,1272,71,0,0,31,59,...,14,21,20,22,11,12,2,5,18,13
3,2003,136,1141,79,1166,73,0,0,29,53,...,12,17,14,17,20,21,6,6,21,6
4,2003,136,1143,76,1301,74,0,1,27,64,...,15,20,10,26,16,14,5,8,19,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1110,2019,146,1120,77,1246,71,0,1,26,65,...,12,21,11,30,14,14,5,5,19,6
1111,2019,146,1277,68,1181,67,0,0,30,70,...,8,13,13,29,14,17,4,9,9,1
1112,2019,152,1403,61,1277,51,0,0,22,51,...,14,18,8,28,6,11,1,2,15,10
1113,2019,152,1438,63,1120,62,0,0,25,51,...,11,14,9,24,9,5,3,3,12,1


In [7]:
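# Add the match result from T1's point of view (1.0 = T1 won, 0.0 = T1 lost).
# Deriving it from PointDiff avoids hard-coding the row count of the original data.
tourney_data['Result'] = (tourney_data['PointDiff'] > 0).astype(float)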

In [8]:
# Extract the statistics columns of each game
stat_cols=[col for col in regular_data.columns if col not in ['Season', 'DayNum', 'T1_TeamID', 'T1_Score', 'T2_TeamID', 'T2_Score',
       'location', 'NumOT']]

In [9]:
# Get each team's average per-game statistics for every season
stat_seasons=regular_data.groupby(['Season','T1_TeamID'])[stat_cols].agg(['mean']).reset_index()

In [10]:
stat_seasons

Unnamed: 0_level_0,Season,T1_TeamID,T1_FGM,T1_FGA,T1_FGM3,T1_FGA3,T1_FTM,T1_FTA,T1_OR,T1_DR,...,T2_FTM,T2_FTA,T2_OR,T2_DR,T2_Ast,T2_TO,T2_Stl,T2_Blk,T2_PF,PointDiff
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,mean,mean,mean,mean,mean,mean,mean,mean,...,mean,mean,mean,mean,mean,mean,mean,mean,mean,mean
0,2003,1102,19.142857,39.785714,7.821429,20.821429,11.142857,17.107143,4.178571,16.821429,...,13.678571,19.250000,9.607143,20.142857,9.142857,12.964286,5.428571,1.571429,18.357143,0.250000
1,2003,1103,27.148148,55.851852,5.444444,16.074074,19.037037,25.851852,9.777778,19.925926,...,15.925926,22.148148,12.037037,22.037037,15.481481,15.333333,6.407407,2.851852,22.444444,0.629630
2,2003,1104,24.035714,57.178571,6.357143,19.857143,14.857143,20.928571,13.571429,23.928571,...,12.142857,17.142857,10.892857,22.642857,11.678571,13.857143,5.535714,3.178571,19.250000,4.285714
3,2003,1105,24.384615,61.615385,7.576923,20.769231,15.423077,21.846154,13.500000,23.115385,...,16.384615,24.500000,13.192308,26.384615,15.807692,18.807692,9.384615,4.192308,19.076923,-4.884615
4,2003,1106,23.428571,55.285714,6.107143,17.642857,10.642857,16.464286,12.285714,23.857143,...,15.535714,21.964286,11.321429,22.357143,11.785714,15.071429,8.785714,3.178571,16.142857,-0.142857
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5829,2019,1462,26.121212,55.848485,7.000000,21.030303,12.636364,18.424242,10.424242,25.757576,...,10.272727,14.636364,9.818182,22.121212,14.363636,11.060606,7.000000,3.030303,17.424242,1.272727
5830,2019,1463,29.821429,60.107143,7.785714,20.821429,13.464286,18.357143,8.678571,29.821429,...,14.214286,19.285714,9.607143,24.357143,12.678571,11.214286,7.107143,2.964286,17.142857,7.178571
5831,2019,1464,26.833333,63.633333,9.566667,28.000000,10.266667,14.733333,12.966667,24.000000,...,16.566667,22.533333,9.800000,25.766667,14.133333,11.500000,5.900000,3.333333,15.166667,-5.733333
5832,2019,1465,26.038462,59.038462,8.807692,25.230769,14.576923,18.961538,10.076923,26.269231,...,17.076923,24.038462,10.769231,24.961538,11.230769,10.807692,5.538462,2.615385,18.346154,0.269231


In [11]:
# Inspect the MultiIndex column names before flattening them
stat_seasons.columns.values

array([('Season', ''), ('T1_TeamID', ''), ('T1_FGM', 'mean'),
       ('T1_FGA', 'mean'), ('T1_FGM3', 'mean'), ('T1_FGA3', 'mean'),
       ('T1_FTM', 'mean'), ('T1_FTA', 'mean'), ('T1_OR', 'mean'),
       ('T1_DR', 'mean'), ('T1_Ast', 'mean'), ('T1_TO', 'mean'),
       ('T1_Stl', 'mean'), ('T1_Blk', 'mean'), ('T1_PF', 'mean'),
       ('T2_FGM', 'mean'), ('T2_FGA', 'mean'), ('T2_FGM3', 'mean'),
       ('T2_FGA3', 'mean'), ('T2_FTM', 'mean'), ('T2_FTA', 'mean'),
       ('T2_OR', 'mean'), ('T2_DR', 'mean'), ('T2_Ast', 'mean'),
       ('T2_TO', 'mean'), ('T2_Stl', 'mean'), ('T2_Blk', 'mean'),
       ('T2_PF', 'mean'), ('PointDiff', 'mean')], dtype=object)

In [12]:
stat_seasons.columns=[''.join(col).strip() for col in stat_seasons.columns.values]
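
To see what the aggregation and the flattening step do, here is a minimal sketch on made-up numbers (the `toy` frame and its values are hypothetical). Aggregating with a list of functions produces two-level column names, and joining each tuple gives flat names such as `T1_FGMmean`.

```python
import pandas as pd

# Made-up per-game rows: team 1101 played two games, team 1102 played one.
toy = pd.DataFrame({
    "Season": [2019, 2019, 2019],
    "T1_TeamID": [1101, 1101, 1102],
    "T1_FGM": [20, 30, 25],
    "PointDiff": [5, -3, 2],
})

# Passing a list of aggregations yields MultiIndex columns like ('T1_FGM', 'mean').
toy_means = (toy.groupby(["Season", "T1_TeamID"])[["T1_FGM", "PointDiff"]]
                .agg(["mean"])
                .reset_index())
print(toy_means.columns.values)

# Join the two levels; the grouping keys have an empty second level, so they keep their names.
toy_means.columns = ["".join(col).strip() for col in toy_means.columns.values]
print(toy_means)
```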

In [13]:
stat_seasons.head()

Unnamed: 0,Season,T1_TeamID,T1_FGMmean,T1_FGAmean,T1_FGM3mean,T1_FGA3mean,T1_FTMmean,T1_FTAmean,T1_ORmean,T1_DRmean,...,T2_FTMmean,T2_FTAmean,T2_ORmean,T2_DRmean,T2_Astmean,T2_TOmean,T2_Stlmean,T2_Blkmean,T2_PFmean,PointDiffmean
0,2003,1102,19.142857,39.785714,7.821429,20.821429,11.142857,17.107143,4.178571,16.821429,...,13.678571,19.25,9.607143,20.142857,9.142857,12.964286,5.428571,1.571429,18.357143,0.25
1,2003,1103,27.148148,55.851852,5.444444,16.074074,19.037037,25.851852,9.777778,19.925926,...,15.925926,22.148148,12.037037,22.037037,15.481481,15.333333,6.407407,2.851852,22.444444,0.62963
2,2003,1104,24.035714,57.178571,6.357143,19.857143,14.857143,20.928571,13.571429,23.928571,...,12.142857,17.142857,10.892857,22.642857,11.678571,13.857143,5.535714,3.178571,19.25,4.285714
3,2003,1105,24.384615,61.615385,7.576923,20.769231,15.423077,21.846154,13.5,23.115385,...,16.384615,24.5,13.192308,26.384615,15.807692,18.807692,9.384615,4.192308,19.076923,-4.884615
4,2003,1106,23.428571,55.285714,6.107143,17.642857,10.642857,16.464286,12.285714,23.857143,...,15.535714,21.964286,11.321429,22.357143,11.785714,15.071429,8.785714,3.178571,16.142857,-0.142857


In [14]:
# Make two copies of the statistics, one for each team in a match
stat_seasons_T1=stat_seasons.copy()
stat_seasons_T2=stat_seasons.copy()

In [15]:
# Rename the columns: a team's own stats get the T1/T2 prefix, and the stats it
# allowed to its opponents get the T1oppponent/T2oppponent prefix.
# The first column stays 'Season' so it can be used as a merge key.
stat_seasons_T1.columns=['Season']+['T1'+ x.replace('T1','').replace('T2','oppponent') for x in stat_seasons_T1.columns[1:]]
stat_seasons_T2.columns=['Season']+['T2'+ x.replace('T1','').replace('T2','oppponent') for x in stat_seasons_T2.columns[1:]]

In [16]:
stat_seasons_T1.head()

Unnamed: 0,Season,T1_TeamID,T1_FGMmean,T1_FGAmean,T1_FGM3mean,T1_FGA3mean,T1_FTMmean,T1_FTAmean,T1_ORmean,T1_DRmean,...,T1oppponent_FTMmean,T1oppponent_FTAmean,T1oppponent_ORmean,T1oppponent_DRmean,T1oppponent_Astmean,T1oppponent_TOmean,T1oppponent_Stlmean,T1oppponent_Blkmean,T1oppponent_PFmean,T1PointDiffmean
0,2003,1102,19.142857,39.785714,7.821429,20.821429,11.142857,17.107143,4.178571,16.821429,...,13.678571,19.25,9.607143,20.142857,9.142857,12.964286,5.428571,1.571429,18.357143,0.25
1,2003,1103,27.148148,55.851852,5.444444,16.074074,19.037037,25.851852,9.777778,19.925926,...,15.925926,22.148148,12.037037,22.037037,15.481481,15.333333,6.407407,2.851852,22.444444,0.62963
2,2003,1104,24.035714,57.178571,6.357143,19.857143,14.857143,20.928571,13.571429,23.928571,...,12.142857,17.142857,10.892857,22.642857,11.678571,13.857143,5.535714,3.178571,19.25,4.285714
3,2003,1105,24.384615,61.615385,7.576923,20.769231,15.423077,21.846154,13.5,23.115385,...,16.384615,24.5,13.192308,26.384615,15.807692,18.807692,9.384615,4.192308,19.076923,-4.884615
4,2003,1106,23.428571,55.285714,6.107143,17.642857,10.642857,16.464286,12.285714,23.857143,...,15.535714,21.964286,11.321429,22.357143,11.785714,15.071429,8.785714,3.178571,16.142857,-0.142857


In [17]:
# Keep only the game identifiers, the scores, and the target; the season averages are merged in next
tourney_data = tourney_data[['Season', 'DayNum', 'T1_TeamID', 'T1_Score', 'T2_TeamID' ,'T2_Score','Result']]

In [18]:
tourney_data=pd.merge(tourney_data,stat_seasons_T1,on=['Season','T1_TeamID'],how='left')
tourney_data=pd.merge(tourney_data,stat_seasons_T2,on=['Season','T2_TeamID'],how='left')
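
The two left merges attach one team's season averages per side of the matchup. A minimal sketch of the same pattern on made-up data follows; the helper `attach_stats` and the numbers are hypothetical, and `validate="many_to_one"` is an optional safety check that is not part of the notebook.

```python
import pandas as pd

# One made-up tournament matchup and made-up season averages for both teams.
matchups = pd.DataFrame({"Season": [2019], "T1_TeamID": [1101], "T2_TeamID": [1102]})
team_stats = pd.DataFrame({
    "Season": [2019, 2019],
    "TeamID": [1101, 1102],
    "PointDiffmean": [4.0, -1.5],
})

def attach_stats(games, stats, side):
    """Left-merge season stats onto one side ('T1' or 'T2') of each matchup."""
    renamed = stats.rename(columns={"TeamID": f"{side}_TeamID",
                                    "PointDiffmean": f"{side}PointDiffmean"})
    # many_to_one raises if a (Season, team) pair appears twice in the stats table.
    return games.merge(renamed, on=["Season", f"{side}_TeamID"],
                       how="left", validate="many_to_one")

wide = attach_stats(attach_stats(matchups, team_stats, "T1"), team_stats, "T2")
print(wide)
```

Because the merge is a left join, the number of matchup rows is unchanged; a team without season statistics would simply get missing values.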

In [19]:
tourney_data.shape

(2230, 61)

In [20]:
# Add another important indicator of team strength: the tournament seed
seeds_data.head()

Unnamed: 0,Season,Seed,TeamID
0,1985,W01,1207
1,1985,W02,1210
2,1985,W03,1228
3,1985,W04,1260
4,1985,W05,1374


In [21]:
set(seeds_data['Seed'])

{'W01',
 'W02',
 'W03',
 'W04',
 'W05',
 'W06',
 'W07',
 'W08',
 'W09',
 'W10',
 'W11',
 'W11a',
 'W11b',
 'W12',
 'W12a',
 'W12b',
 'W13',
 'W14',
 'W15',
 'W16',
 'W16a',
 'W16b',
 'X01',
 'X02',
 'X03',
 'X04',
 'X05',
 'X06',
 'X07',
 'X08',
 'X09',
 'X10',
 'X11',
 'X11a',
 'X11b',
 'X12',
 'X12a',
 'X12b',
 'X13',
 'X14',
 'X15',
 'X16',
 'X16a',
 'X16b',
 'Y01',
 'Y02',
 'Y03',
 'Y04',
 'Y05',
 'Y06',
 'Y07',
 'Y08',
 'Y09',
 'Y10',
 'Y11',
 'Y11a',
 'Y11b',
 'Y12',
 'Y12a',
 'Y12b',
 'Y13',
 'Y14',
 'Y15',
 'Y16',
 'Y16a',
 'Y16b',
 'Z01',
 'Z02',
 'Z03',
 'Z04',
 'Z05',
 'Z06',
 'Z07',
 'Z08',
 'Z09',
 'Z10',
 'Z11',
 'Z11a',
 'Z11b',
 'Z12',
 'Z13',
 'Z13a',
 'Z13b',
 'Z14',
 'Z14a',
 'Z14b',
 'Z15',
 'Z16',
 'Z16a',
 'Z16b'}

In [22]:
# Helper that encodes the play-in suffix of a seed: -1 for no suffix, 0 for 'a', 1 for 'b'
def extra(x):
    if len(x)==3:
        return -1
    elif x[3]=='a':
        return 0
    else:
        return 1

In [23]:
# Extract the seed rank
seeds_data['seed']=seeds_data['Seed'].apply(lambda x: int(x[1:3]))

In [24]:
seeds_data['seed_2']=seeds_data['Seed'].apply(extra)
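
The seed strings listed above combine a region letter, a two-digit seed number, and an optional play-in suffix. As an alternative illustration only, the sketch below splits all three parts with a regular expression; `parse_seed` and `SEED_PATTERN` are hypothetical names, and the pattern assumes only the formats that appear in the set above.

```python
import re

# Region letter (W/X/Y/Z), two-digit seed, optional play-in suffix 'a' or 'b'.
SEED_PATTERN = re.compile(r"^([WXYZ])(\d{2})([ab]?)$")

def parse_seed(seed):
    """Split a seed string such as 'W16a' into (region, seed number, suffix)."""
    match = SEED_PATTERN.match(seed)
    if match is None:
        raise ValueError(f"unexpected seed format: {seed!r}")
    region, number, suffix = match.groups()
    return region, int(number), suffix

print(parse_seed("W16a"))
print(parse_seed("X01"))
```

Raising on an unexpected format makes a change in the seed encoding visible immediately instead of silently producing a wrong number.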

In [25]:
# Make duplicates for both teams in a match
seeds_T1 = seeds_data[['Season','TeamID','seed','seed_2']].copy()
seeds_T2 = seeds_data[['Season','TeamID','seed','seed_2']].copy()

In [26]:
seeds_T1.columns = ['Season','T1_TeamID','T1_seed','T1_seed_2']
seeds_T2.columns = ['Season','T2_TeamID','T2_seed','T2_seed_2']

In [27]:
seeds_T2.head()

Unnamed: 0,Season,T2_TeamID,T2_seed,T2_seed_2
0,1985,1207,1,-1
1,1985,1210,2,-1
2,1985,1228,3,-1
3,1985,1260,4,-1
4,1985,1374,5,-1


In [28]:
tourney_data = pd.merge(tourney_data, seeds_T1, on = ['Season', 'T1_TeamID'], how = 'left')
tourney_data = pd.merge(tourney_data, seeds_T2, on = ['Season', 'T2_TeamID'], how = 'left')

In [29]:
tourney_data.head()

Unnamed: 0,Season,DayNum,T1_TeamID,T1_Score,T2_TeamID,T2_Score,Result,T1_FGMmean,T1_FGAmean,T1_FGM3mean,...,T2oppponent_Astmean,T2oppponent_TOmean,T2oppponent_Stlmean,T2oppponent_Blkmean,T2oppponent_PFmean,T2PointDiffmean,T1_seed,T1_seed_2,T2_seed,T2_seed_2
0,2003,134,1421,92,1411,84,1.0,24.37931,56.793103,6.482759,...,13.766667,14.333333,8.0,2.6,21.633333,1.966667,16,1,16,0
1,2003,136,1112,80,1436,51,1.0,30.321429,65.714286,7.035714,...,13.275862,13.0,7.103448,3.655172,17.931034,4.655172,1,-1,16,-1
2,2003,136,1113,84,1272,71,1.0,27.206897,56.896552,4.0,...,13.310345,15.068966,7.275862,3.172414,19.931034,8.689655,10,-1,7,-1
3,2003,136,1141,79,1166,73,1.0,26.62069,52.689655,6.827586,...,12.363636,17.060606,6.333333,2.575758,19.393939,14.909091,11,-1,6,-1
4,2003,136,1143,76,1301,74,1.0,27.344828,58.724138,6.413793,...,12.566667,14.633333,7.433333,2.833333,19.333333,4.4,8,-1,9,-1


In [30]:
# Create the seed difference column, since intuitively it is an important predictor of the match result
tourney_data["Seed_diff"] = tourney_data["T1_seed"].astype(int) - tourney_data["T2_seed"].astype(int)

In [31]:
tourney_data.dtypes

Season       int64
DayNum       int64
T1_TeamID    int64
T1_Score     int64
T2_TeamID    int64
             ...  
T1_seed      int64
T1_seed_2    int64
T2_seed      int64
T2_seed_2    int64
Seed_diff    int32
Length: 66, dtype: object

In [32]:
tourney_data.columns

Index(['Season', 'DayNum', 'T1_TeamID', 'T1_Score', 'T2_TeamID', 'T2_Score',
       'Result', 'T1_FGMmean', 'T1_FGAmean', 'T1_FGM3mean', 'T1_FGA3mean',
       'T1_FTMmean', 'T1_FTAmean', 'T1_ORmean', 'T1_DRmean', 'T1_Astmean',
       'T1_TOmean', 'T1_Stlmean', 'T1_Blkmean', 'T1_PFmean',
       'T1oppponent_FGMmean', 'T1oppponent_FGAmean', 'T1oppponent_FGM3mean',
       'T1oppponent_FGA3mean', 'T1oppponent_FTMmean', 'T1oppponent_FTAmean',
       'T1oppponent_ORmean', 'T1oppponent_DRmean', 'T1oppponent_Astmean',
       'T1oppponent_TOmean', 'T1oppponent_Stlmean', 'T1oppponent_Blkmean',
       'T1oppponent_PFmean', 'T1PointDiffmean', 'T2_FGMmean', 'T2_FGAmean',
       'T2_FGM3mean', 'T2_FGA3mean', 'T2_FTMmean', 'T2_FTAmean', 'T2_ORmean',
       'T2_DRmean', 'T2_Astmean', 'T2_TOmean', 'T2_Stlmean', 'T2_Blkmean',
       'T2_PFmean', 'T2oppponent_FGMmean', 'T2oppponent_FGAmean',
       'T2oppponent_FGM3mean', 'T2oppponent_FGA3mean', 'T2oppponent_FTMmean',
       'T2oppponent_FTAmean', 'T2opp

## 4. Model Building

With the data cleaned, the next steps are selecting features, preparing several algorithms, tuning hyperparameters, and validating the models.

In [33]:
# Drop the identifier columns, the scores, and the target; every remaining column is a feature
features=[col for col in tourney_data.columns.values if col not in ['Season', 'DayNum', 'T1_TeamID', 'T2_TeamID','Result','T1_Score','T2_Score']]

In [34]:
len(features)

59

In [35]:
y = tourney_data['Result']
X = tourney_data[features].values

In [36]:
features

['T1_FGMmean',
 'T1_FGAmean',
 'T1_FGM3mean',
 'T1_FGA3mean',
 'T1_FTMmean',
 'T1_FTAmean',
 'T1_ORmean',
 'T1_DRmean',
 'T1_Astmean',
 'T1_TOmean',
 'T1_Stlmean',
 'T1_Blkmean',
 'T1_PFmean',
 'T1oppponent_FGMmean',
 'T1oppponent_FGAmean',
 'T1oppponent_FGM3mean',
 'T1oppponent_FGA3mean',
 'T1oppponent_FTMmean',
 'T1oppponent_FTAmean',
 'T1oppponent_ORmean',
 'T1oppponent_DRmean',
 'T1oppponent_Astmean',
 'T1oppponent_TOmean',
 'T1oppponent_Stlmean',
 'T1oppponent_Blkmean',
 'T1oppponent_PFmean',
 'T1PointDiffmean',
 'T2_FGMmean',
 'T2_FGAmean',
 'T2_FGM3mean',
 'T2_FGA3mean',
 'T2_FTMmean',
 'T2_FTAmean',
 'T2_ORmean',
 'T2_DRmean',
 'T2_Astmean',
 'T2_TOmean',
 'T2_Stlmean',
 'T2_Blkmean',
 'T2_PFmean',
 'T2oppponent_FGMmean',
 'T2oppponent_FGAmean',
 'T2oppponent_FGM3mean',
 'T2oppponent_FGA3mean',
 'T2oppponent_FTMmean',
 'T2oppponent_FTAmean',
 'T2oppponent_ORmean',
 'T2oppponent_DRmean',
 'T2oppponent_Astmean',
 'T2oppponent_TOmean',
 'T2oppponent_Stlmean',
 'T2oppponent_Blkmean

In [37]:
# Prepare three different algorithms
rf = RandomForestClassifier()
lgb = LGBMClassifier()
xgb = XGBClassifier()

In [38]:
# Random forest tuning
n_estimators = [100, 200, 500, 1000]
# 'auto' is equivalent to 'sqrt' for classifiers and was removed in scikit-learn 1.3;
# drop it from the list when running on a recent version.
max_features = ['auto', 'sqrt']
max_depth = [20, 30, 40, 50, 100, None]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
rf_param_grid = {'n_estimators': n_estimators,
                 'max_features': max_features,
                 'max_depth': max_depth,
                 'min_samples_split': min_samples_split,
                 'min_samples_leaf': min_samples_leaf,
                 'bootstrap': bootstrap}
rf_grid = GridSearchCV(estimator=rf,
                       param_grid=rf_param_grid,
                       cv=5,
                       verbose=1,
                       n_jobs=-1)
rf_grid.fit(X,y)

rf_best = rf_grid.best_estimator_


Fitting 5 folds for each of 864 candidates, totalling 4320 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    5.7s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:   37.2s
[Parallel(n_jobs=-1)]: Done 418 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 768 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 1218 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done 1768 tasks      | elapsed:  6.0min
[Parallel(n_jobs=-1)]: Done 2418 tasks      | elapsed:  8.5min
[Parallel(n_jobs=-1)]: Done 3168 tasks      | elapsed: 12.1min
[Parallel(n_jobs=-1)]: Done 4018 tasks      | elapsed: 16.1min
[Parallel(n_jobs=-1)]: Done 4320 out of 4320 | elapsed: 17.6min finished
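
The exhaustive grid above fits all 864 combinations. A cheaper alternative is `RandomizedSearchCV`, which is imported in section 2 but not used in this notebook. The sketch below runs it on synthetic data with a deliberately small search space; the parameter values, `n_iter`, and the choice of `scoring="neg_log_loss"` are assumptions for illustration, not the settings used for the submissions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the tournament feature matrix.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Small illustrative search space (18 combinations in total).
param_distributions = {
    "n_estimators": [20, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=4,                 # sample 4 combinations instead of trying all 18
    scoring="neg_log_loss",   # score predicted probabilities rather than accuracy
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

# scikit-learn maximizes scores, so the log loss is the negated best score.
best_log_loss = -search.best_score_
print(search.best_params_, best_log_loss)
```

By default `GridSearchCV` and `RandomizedSearchCV` score classifiers by accuracy, so passing a probability-based `scoring` matters when the submission is a probability.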


In [39]:
# LightGBM tuning
lgb_param_grid = {
    'learning_rate': [0.01,0.05,0.1],
    'n_estimators': [100,500,1000,2000,3000],
    'num_leaves': [20,50,100,200],
    'boosting_type' : ['gbdt'],
    'objective' : ['binary'],
    'colsample_bytree' : [0.6, 0.8],
    'subsample' : [0.6,0.8],  # note: only takes effect when subsample_freq > 0
    'reg_alpha' : [0.8,1.2],
    'reg_lambda' : [0.8,1.2],
    }
lgb_grid = GridSearchCV(lgb, lgb_param_grid,
                    verbose=1,
                    cv=5,
                    n_jobs=-1)
lgb_grid.fit(X,y)
lgb_best = lgb_grid.best_estimator_
lgb_grid.best_params_

Fitting 5 folds for each of 960 candidates, totalling 4800 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:   10.5s
[Parallel(n_jobs=-1)]: Done 418 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 768 tasks      | elapsed:  7.0min
[Parallel(n_jobs=-1)]: Done 1218 tasks      | elapsed:  8.5min
[Parallel(n_jobs=-1)]: Done 1768 tasks      | elapsed: 10.0min
[Parallel(n_jobs=-1)]: Done 2418 tasks      | elapsed: 11.2min
[Parallel(n_jobs=-1)]: Done 3168 tasks      | elapsed: 20.2min
[Parallel(n_jobs=-1)]: Done 4018 tasks      | elapsed: 23.6min
[Parallel(n_jobs=-1)]: Done 4800 out of 4800 | elapsed: 25.3min finished


{'boosting_type': 'gbdt',
 'colsample_bytree': 0.8,
 'learning_rate': 0.01,
 'n_estimators': 100,
 'num_leaves': 20,
 'objective': 'binary',
 'reg_alpha': 1.2,
 'reg_lambda': 1.2,
 'subsample': 0.6}

In [40]:
# XGBoost tuning
xgb_param_grid = {'learning_rate':[0.1,0.05,0.001],
                'n_estimators':[100,200,500,1000],
                'max_depth':[10,20,50,100],
                'min_child_weight':[1,5,10]
                 }
xgb_grid = GridSearchCV(xgb, xgb_param_grid,
                    verbose=1,
                    cv=5,
                    n_jobs=-1)
xgb_grid.fit(X,y)
xgb_best = xgb_grid.best_estimator_

Fitting 5 folds for each of 144 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    9.7s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 418 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:  7.5min finished




## 5. Submission (for Kaggle only)

Because Kaggle scores submissions against its own test data, we do not make a train/test split here; instead, we check model accuracy by submitting our predictions.
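
If a local estimate is wanted before submitting, one option that this notebook does not use is to hold out whole seasons. The sketch below shows the idea on synthetic data with scikit-learn's `GroupKFold`; the season labels, the features, and the logistic regression model are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 5 seasons with 40 games each and 3 features.
seasons = np.repeat([2015, 2016, 2017, 2018, 2019], 40)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Each fold holds out exactly one season, so no season is split between train and test.
scores = cross_val_score(
    LogisticRegression(),
    X_demo,
    y_demo,
    groups=seasons,
    cv=GroupKFold(n_splits=5),
    scoring="neg_log_loss",
)
season_log_loss = -scores
print(season_log_loss)
```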

In [41]:
sub=pd.read_csv('MSampleSubmissionStage1_2020.csv')
sub['Season'] = sub['ID'].map(lambda x: int(x[:4]))
sub["T1_TeamID"] = sub["ID"].apply(lambda x: x[5:9]).astype(int)
sub["T2_TeamID"] = sub["ID"].apply(lambda x: x[10:14]).astype(int)
sub = pd.merge(sub, stat_seasons_T1, on = ['Season', 'T1_TeamID'])
sub = pd.merge(sub, stat_seasons_T2, on = ['Season', 'T2_TeamID'])
sub = pd.merge(sub, seeds_T1, on = ['Season', 'T1_TeamID'])
sub = pd.merge(sub, seeds_T2, on = ['Season', 'T2_TeamID'])
sub["Seed_diff"] = sub["T1_seed"] - sub["T2_seed"]
X_test=sub[features].values
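
The slicing above relies on the `ID` layout `Season_Team1_Team2` with four-digit fields. A small sketch of an equivalent split on the underscore is shown below; the two IDs are made-up examples in that layout.

```python
import pandas as pd

# Made-up IDs in the Season_T1_T2 layout that the slices above assume.
sample = pd.DataFrame({"ID": ["2015_1107_1112", "2019_1101_1458"]})

# Split on '_' instead of fixed character positions, then convert to integers.
parts = sample["ID"].str.split("_", expand=True).astype(int)
parts.columns = ["Season", "T1_TeamID", "T2_TeamID"]
sample = pd.concat([sample, parts], axis=1)
print(sample)
```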

In [42]:
def get_submission(model, submission_name):
    
    y_pred = model.predict_proba(X_test)
    sub["Pred"] = y_pred[:, 1]
    sub[['ID','Pred']].to_csv("submission_{}.csv".format(submission_name), 
                              index = False)
    return y_pred, sub

In [44]:
_,_ = get_submission(rf_best, 'rf_best')
_,_ = get_submission(lgb_best, 'lgb_best')
_,_ = get_submission(xgb_best,'xgb_best')

In [45]:
lgb=LGBMClassifier(n_estimators=3000,learning_rate=0.05)
lgb.fit(X,y)
_,_ = get_submission(lgb, 'lgb')

In [46]:
xgb_grid.best_params_

{'learning_rate': 0.001,
 'max_depth': 10,
 'min_child_weight': 10,
 'n_estimators': 1000}

In [47]:
xgb=XGBClassifier(learning_rate=0.1,n_estimators=5000,max_depth=6)
xgb.fit(X,y)
_,_ = get_submission(xgb,'xgb')





### The final score we got on Kaggle is 0.00000, which means the loss of our model is 0, and we ranked 7th (actually tied for 1st).
### Unfortunately, this competition was shut down due to COVID-19, so we no longer have the URL to our score.
### We only have a screenshot of our rank and score at that time.
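
For reference, assuming the leaderboard score is the binary log loss (an assumption; the metric is not stated in this notebook), a score of 0 corresponds to predicting every outcome correctly with full confidence. A minimal sketch of the metric, with the hypothetical helper name `binary_log_loss`:

```python
import numpy as np

def binary_log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-likelihood of the true outcomes under the predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs for predictions of exactly 0 or 1.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

# Predicting 0.5 for every game gives log(2); perfect confident predictions give roughly 0.
print(binary_log_loss([1, 0], [0.5, 0.5]))
print(binary_log_loss([1, 0], [1.0, 0.0]))
```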

![title](Score.png)