#### First of all, what is Gradient Boosting?
##### Credit for this part goes to Sunil Ray
Definition: The term ‘Boosting’ refers to a family of algorithms that convert weak learners into strong learners.

Let’s understand this definition in detail by working through a spam email identification problem:

How would you classify an email as SPAM or not? Like everyone else, our initial approach would be to identify ‘spam’ and ‘not spam’ emails using the following criteria. If:<br/>

-Email has only one image file (promotional image), it’s SPAM<br/>
-Email has only link(s), it’s SPAM<br/>
-Email body contains a sentence like “You won a prize money of $ xxxxxx”, it’s SPAM<br/>
-Email is from our official domain “Analyticsvidhya.com”, not SPAM<br/>
-Email is from a known source, not SPAM<br/>

Above, we’ve defined multiple rules to classify an email as ‘spam’ or ‘not spam’. But do you think these rules are individually strong enough to classify an email successfully? No.<br/>

Because no single rule is powerful enough on its own to classify an email as ‘spam’ or ‘not spam’, these rules are called weak learners.<br/>

To convert weak learners into a strong learner, we’ll combine the predictions of the weak learners using methods like:
•   Using an average / weighted average<br/>
•   Taking the prediction that has the most votes<br/>

For example: above, we defined 5 weak learners. Of these 5, 3 vote ‘SPAM’ and 2 vote ‘Not a SPAM’. In this case, by default, we’ll consider the email SPAM because ‘SPAM’ has the higher vote count (3)...<br/>
For the full blog post: https://www.analyticsvidhya.com/blog/2015/11/quick-introduction-boosting-algorithms-machine-learning/
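
The voting example can be sketched in a few lines of Python. This is only an illustration of combining weak rules by majority vote: the rule functions and the email fields (`only_image`, `only_links`, `mentions_prize`, `official_domain`, `known_sender`) are hypothetical names made up for this sketch. Real boosting algorithms go further, training the weak learners one after another and weighting them.

```python
from collections import Counter

# Five hypothetical weak rules mirroring the criteria in the text.
# Each rule looks at a single property of the email and casts one vote.
RULES = [
    lambda email: "spam" if email["only_image"] else "not spam",
    lambda email: "spam" if email["only_links"] else "not spam",
    lambda email: "spam" if email["mentions_prize"] else "not spam",
    lambda email: "not spam" if email["official_domain"] else "spam",
    lambda email: "not spam" if email["known_sender"] else "spam",
]


def majority_vote(email, rules=RULES):
    """Combine the weak rules by returning the label with the most votes."""
    votes = Counter(rule(email) for rule in rules)
    return votes.most_common(1)[0][0]


# Three rules vote "spam", two vote "not spam", as in the example.
email = {
    "only_image": True,
    "only_links": True,
    "mentions_prize": True,
    "official_domain": True,
    "known_sender": True,
}
print(majority_vote(email))  # spam
```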

#### Introduction to Gradient Boost Algorithms
##### Credit for this part goes to my friend Sefik Ilkin Serengil
It is a fact that decision-tree-based machine learning algorithms dominate Kaggle competitions. More than half of the winning solutions have adopted XGBoost. Recently, Microsoft announced its gradient boosting framework, LightGBM, and it now steals the spotlight among gradient boosting machines. Kagglers have started to use LightGBM more than XGBoost. Even though XGBoost might have higher accuracy, LightGBM previously ran 10 times faster, and currently runs 6 times faster, than XGBoost. Moreover, there are tens of solutions standing atop a challenge podium...<br/>

(My additional note: GPUs give XGBoost a real boost, because the library supports GPU training natively through a single parameter, whereas for GPU training with LightGBM you have to compile it for GPU usage yourself. This undercuts LightGBM's 6 times faster claim, because XGBoost running on a GPU cannot be fairly compared with LightGBM running on a CPU.)

For the full blog post: https://sefiks.com/2018/10/13/a-gentle-introduction-to-lightgbm-for-applied-machine-learning/

#### Introduction to Gridsearch
Grid search is a simple form of automated hyperparameter tuning. The scikit-learn team provides a GridSearchCV class in the model_selection module. It exhaustively searches over a specified parameter grid (multiple candidate values for each hyperparameter) for a given estimator.

GridSearchCV implements “fit” and “score” methods and evaluates every hyperparameter combination in the grid with cross-validation. It also implements “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used. After fitting, the best cross-validated score and the parameter combination that achieved it are available as best_score_ and best_params_. So we leave the trial-and-error part to scikit-learn.

Usage is simply:

from sklearn import svm, datasets<br/>
from sklearn.model_selection import GridSearchCV<br/>
iris = datasets.load_iris()<br/>
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}<br/>
svc = svm.SVC(gamma="scale")<br/>
clf = GridSearchCV(svc, parameters, cv=5)<br/>
clf.fit(iris.data, iris.target)<br/>
sorted(clf.cv_results_.keys())
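
As a small follow-up sketch on the same iris example, the fitted search object can be queried for the winning combination. The variable name `search` is just a choice made here; `best_params_`, `best_score_` and `best_estimator_` are attributes scikit-learn sets after `fit`.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {"kernel": ("linear", "rbf"), "C": [1, 10]}

# 2 kernels x 2 values of C = 4 candidates, each scored with 5-fold CV
search = GridSearchCV(svm.SVC(gamma="scale"), parameters, cv=5)
search.fit(iris.data, iris.target)

print(search.best_params_)           # the winning combination
print(round(search.best_score_, 3))  # its mean cross-validated accuracy

# best_estimator_ is the model refitted on all data with best_params_
print(search.best_estimator_.predict(iris.data[:3]))
```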

In [None]:
# Import necessary everyday os libs
import sys
import gc

# Import the usual suspects
import numpy as np
import pandas as pd

### Useful functions for the community

In [None]:
# General-purpose pandas DataFrame memory footprint reducer, for data that is big but not big enough to require Spark
def df_footprint_reduce(df, skip_obj=False, skip_int=False, skip_float=False, print_comparison=True):
    '''
    :param df              : Pandas DataFrame whose memory footprint should be reduced
    :param skip_obj        : If True, string (object) columns are left untouched
    :param skip_int        : If True, integer columns are left untouched
    :param skip_float      : If True, float columns are left untouched
    :param print_comparison: Beware! Printing the comparison requires computing the size of every column,
                             so turn this off if you need speed. It is only here for information
    :return                : The same DataFrame with columns cast to smaller dtypes. Integer and category
                             casts preserve the values; float16/float32 casts can lose precision
    '''
    if print_comparison:
        print(f"Dataframe size before shrinking column types into smallest possible: {round((sys.getsizeof(df)/1024/1024),4)} MB")
    for column in df.columns:
        if (skip_obj is False) and (str(df[column].dtype)[:6] == 'object'):
            # Low-cardinality string columns take less memory as categoricals
            num_unique_values = len(df[column].unique())
            num_total_values = len(df[column])
            if num_unique_values / num_total_values < 0.5:
                # Assign the column directly: df.loc[:, column] = ... may keep the old object dtype
                df[column] = df[column].astype('category')
        elif (skip_int is False) and (str(df[column].dtype)[:3] == 'int'):
            if df[column].min() > np.iinfo(np.int8).min and df[column].max() < np.iinfo(np.int8).max:
                df[column] = df[column].astype(np.int8)
            elif df[column].min() > np.iinfo(np.int16).min and df[column].max() < np.iinfo(np.int16).max:
                df[column] = df[column].astype(np.int16)
            elif df[column].min() > np.iinfo(np.int32).min and df[column].max() < np.iinfo(np.int32).max:
                df[column] = df[column].astype(np.int32)
        elif (skip_float is False) and (str(df[column].dtype)[:5] == 'float'):
            # Only the value range is checked here, not the precision:
            # float16 keeps roughly 3 significant decimal digits, so values may be rounded
            if df[column].min() > np.finfo(np.float16).min and df[column].max() < np.finfo(np.float16).max:
                df[column] = df[column].astype(np.float16)
            elif df[column].min() > np.finfo(np.float32).min and df[column].max() < np.finfo(np.float32).max:
                df[column] = df[column].astype(np.float32)
    if print_comparison:
        print(f"Dataframe size after shrinking column types into smallest possible: {round((sys.getsizeof(df)/1024/1024),4)} MB")
    return df
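
For comparison, pandas ships a built-in downcasting option that gives a similar effect on numeric columns. The sketch below is not the function above but an alternative, using `pd.to_numeric(..., downcast=...)` on a made-up toy frame (the column names and values are invented for illustration). Note that `downcast="float"` stops at float32, so it is more conservative than the float16 cast used above.

```python
import numpy as np
import pandas as pd

# Toy data: int64 and float64 columns, the pandas defaults
df = pd.DataFrame({
    "kills": np.arange(1000, dtype=np.int64) % 50,
    "damage": np.linspace(0.0, 500.0, 1000),
})
before = df.memory_usage(deep=True).sum()

small = df.copy()
# downcast picks the smallest dtype that can hold every value in the column
small["kills"] = pd.to_numeric(small["kills"], downcast="integer")
small["damage"] = pd.to_numeric(small["damage"], downcast="float")
after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print(f"{before} bytes -> {after} bytes")
```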

In [None]:
# Universal pandas dataframe null/nan cleaner
def df_null_cleaner(df, fill_with=None, drop_na=False, axis=0):
    '''
    Very good information on dealing with missing values of dataframes can be found at 
    http://pandas.pydata.org/pandas-docs/stable/missing_data.html
    
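    :param df        : Pandas Dataframe to clean from missing values
    :param fill_with : Value used to fill missing entries; None (the default) skips filling
    :param drop_na   : If True, drop the rows (axis=0) or columns (axis=1) that contain missing values
    :param axis      : Axis used when drop_na is True; defaults to rows
    :return          : Pandas Dataframe cleaned from missing values
    '''
    # Treat +/- infinity (e.g. produced by division by zero) as missing.
    # np.inf / np.nan are used because the np.Inf, np.NINF and np.NaN aliases were removed in NumPy 2.0
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    if drop_na:
        df.dropna(axis=axis, inplace=True)
    # `~fill_with` raised a TypeError for None and was truthy for 0; test for None explicitly
    if fill_with is not None:
        df.fillna(fill_with, inplace=True)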
    return df
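
Why infinities need cleaning at all: the ratio features built later in this notebook divide by counts that are often zero. A minimal sketch on a made-up three-row frame (the values are invented for illustration) shows what pandas produces in that case and how replacing and filling tidies it up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"kills": [2, 0, 4], "headshotKills": [1, 0, 0]})

# pandas does not raise on division by zero:
# 2/1 -> 2.0, 0/0 -> NaN, 4/0 -> inf
df["ratio"] = df["kills"] / df["headshotKills"]

# Turn infinities into NaN first, then fill every missing value with 0
cleaned = df.replace([np.inf, -np.inf], np.nan).fillna(0)
print(cleaned["ratio"].tolist())  # [2.0, 0.0, 0.0]
```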

### Feature Engineering

In [None]:
def feature_engineering(df,is_train=True):
    if is_train:          
        df = df[df['maxPlace'] > 1].copy()

    target = 'winPlacePerc'
    print('Grouping similar match types together')
    df.loc[(df['matchType'] == 'solo'), 'matchType'] = 1
    df.loc[(df['matchType'] == 'normal-solo'), 'matchType'] = 1
    df.loc[(df['matchType'] == 'solo-fpp'), 'matchType'] = 1
    df.loc[(df['matchType'] == 'normal-solo-fpp'), 'matchType'] = 1

    df.loc[(df['matchType'] == 'duo'), 'matchType'] = 2
    df.loc[(df['matchType'] == 'normal-duo'), 'matchType'] = 2
    df.loc[(df['matchType'] == 'duo-fpp'), 'matchType'] = 2    
    df.loc[(df['matchType'] == 'normal-duo-fpp'), 'matchType'] = 2

    df.loc[(df['matchType'] == 'squad'), 'matchType'] = 3
    df.loc[(df['matchType'] == 'normal-squad'), 'matchType'] = 3    
    df.loc[(df['matchType'] == 'squad-fpp'), 'matchType'] = 3
    df.loc[(df['matchType'] == 'normal-squad-fpp'), 'matchType'] = 3
    
    df.loc[(df['matchType'] == 'flaretpp'), 'matchType'] = 0
    df.loc[(df['matchType'] == 'flarefpp'), 'matchType'] = 0
    df.loc[(df['matchType'] == 'crashtpp'), 'matchType'] = 0
    df.loc[(df['matchType'] == 'crashfpp'), 'matchType'] = 0
    df.loc[(df['rankPoints'] < 0), 'rankPoints'] = 0
    
    print('Adding new features using existing ones')
    df['headshotrate'] = df['kills']/df['headshotKills']
    df['killStreakrate'] = df['killStreaks']/df['kills']
    df['healthitems'] = df['heals'] + df['boosts']
    df['totalDistance'] = df['rideDistance'] + df["walkDistance"] + df["swimDistance"]
    df['killPlace_over_maxPlace'] = df['killPlace'] / df['maxPlace']
    df['headshotKills_over_kills'] = df['headshotKills'] / df['kills']
    df['distance_over_weapons'] = df['totalDistance'] / df['weaponsAcquired']
    df['walkDistance_over_heals'] = df['walkDistance'] / df['heals']
    df['walkDistance_over_kills'] = df['walkDistance'] / df['kills']
    df['killsPerWalkDistance'] = df['kills'] / df['walkDistance']
    df['skill'] = df['headshotKills'] + df['roadKills']
    
    print('Adding normalized features')
    df['playersJoined'] = df.groupby('matchId')['matchId'].transform('count')
    gc.collect()
    df['killsNorm'] = df['kills']*((100-df['playersJoined'])/100 + 1)
    df['damageDealtNorm'] = df['damageDealt']*((100-df['playersJoined'])/100 + 1)
    df['maxPlaceNorm'] = df['maxPlace']*((100-df['playersJoined'])/100 + 1)
    df['matchDurationNorm'] = df['matchDuration']*((100-df['playersJoined'])/100 + 1)
    df['headshotKillsNorm'] = df['headshotKills']*((100-df['playersJoined'])/100 + 1)
    df['killPlaceNorm'] = df['killPlace']*((100-df['playersJoined'])/100 + 1)
    df['killPointsNorm'] = df['killPoints']*((100-df['playersJoined'])/100 + 1)
    df['killStreaksNorm'] = df['killStreaks']*((100-df['playersJoined'])/100 + 1)
    df['longestKillNorm'] = df['longestKill']*((100-df['playersJoined'])/100 + 1)
    df['roadKillsNorm'] = df['roadKills']*((100-df['playersJoined'])/100 + 1)
    df['teamKillsNorm'] = df['teamKills']*((100-df['playersJoined'])/100 + 1)
    df['DBNOsNorm'] = df['DBNOs']*((100-df['playersJoined'])/100 + 1)
    df['revivesNorm'] = df['revives']*((100-df['playersJoined'])/100 + 1)    
    
    # Clean null values from dataframe
    df = df_null_cleaner(df,fill_with=0)

    features = list(df.columns)
    features.remove("Id")
    features.remove("matchId")
    features.remove("groupId")
    features.remove("matchType")  
    
    y = pd.DataFrame()
    if is_train: 
        print('Preparing target variable')
        y = df.groupby(['matchId','groupId'])[target].agg('mean')
        gc.collect()
        features.remove(target)
        
    print('Aggregating means')
    means_features = list(df.columns)
    means_features.remove("Id")
    means_features.remove("matchId")
    means_features.remove("groupId")
    means_features.remove("matchType")  
    
    if is_train:
        means_features.remove(target)
    
    agg = df.groupby(['matchId','groupId'])[means_features].agg('mean')
    gc.collect()
    agg_rank = agg.groupby('matchId')[means_features].rank(pct=True).reset_index()
    gc.collect()
    
    if is_train: 
        X = agg.reset_index()[['matchId','groupId']]
    else: 
        X = df[['matchId','groupId']]

    X = X.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    X = X.merge(agg_rank, suffixes=["_mean", "_mean_rank"], how='left', on=['matchId', 'groupId'])
    del agg, agg_rank
    gc.collect()
    
    print('Aggregating maxes')
    maxes_features = list(df.columns) 
    maxes_features.remove("Id")
    maxes_features.remove("matchId")
    maxes_features.remove("groupId")
    maxes_features.remove("matchType")  

    if is_train:
        maxes_features.remove(target)
    
    agg = df.groupby(['matchId','groupId'])[maxes_features].agg('max')
    gc.collect()
    agg_rank = agg.groupby('matchId')[maxes_features].rank(pct=True).reset_index()
    gc.collect()
    X = X.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    X = X.merge(agg_rank, suffixes=["_max", "_max_rank"], how='left', on=['matchId', 'groupId'])
    del agg, agg_rank
    gc.collect()
    
    print('Aggregating mins')
    mins_features = list(df.columns) 
    mins_features.remove("Id")
    mins_features.remove("matchId")
    mins_features.remove("groupId")
    mins_features.remove("matchType")  
    
    if is_train:
        mins_features.remove(target)
    
    agg = df.groupby(['matchId','groupId'])[mins_features].agg('min')
    gc.collect()
    agg_rank = agg.groupby('matchId')[mins_features].rank(pct=True).reset_index()
    gc.collect()
    X = X.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    X = X.merge(agg_rank, suffixes=["_min", "_min_rank"], how='left', on=['matchId', 'groupId'])
    del agg, agg_rank
    gc.collect()
    
    print('Aggregating group sizes')
    grpsize_features = list(df.columns) 
    grpsize_features.remove("Id")
    grpsize_features.remove("matchId")
    grpsize_features.remove("groupId")
    grpsize_features.remove("matchType")  
    grpsize_features.remove("DBNOsNorm")
    grpsize_features.remove("damageDealtNorm")
    grpsize_features.remove("headshotKillsNorm")
    grpsize_features.remove("killPlaceNorm")
    grpsize_features.remove("killPlace_over_maxPlace")
    grpsize_features.remove("killPointsNorm")
    grpsize_features.remove("killStreaksNorm")
    grpsize_features.remove("killsNorm")
    grpsize_features.remove("longestKillNorm")
    grpsize_features.remove("matchDurationNorm")
    grpsize_features.remove("matchDuration")
    grpsize_features.remove("maxPlaceNorm")
    grpsize_features.remove("maxPlace")
    grpsize_features.remove("numGroups")
    grpsize_features.remove("playersJoined")
    grpsize_features.remove("revivesNorm")
    grpsize_features.remove("roadKillsNorm")
    grpsize_features.remove("teamKillsNorm")    
    agg = df.groupby(['matchId','groupId'])[grpsize_features].size().reset_index(name='group_size')
    gc.collect()
    X = X.merge(agg, how='left', on=['matchId', 'groupId'])
    
    print('Aggregating match means')
    mmeans_features = list(df.columns) 
    mmeans_features.remove("Id")
    mmeans_features.remove("matchId")
    mmeans_features.remove("groupId")
    mmeans_features.remove("DBNOsNorm")
    mmeans_features.remove("damageDealtNorm")
    mmeans_features.remove("headshotKillsNorm")
    mmeans_features.remove("killPlace_over_maxPlace")
    mmeans_features.remove("killPointsNorm")
    mmeans_features.remove("killStreaksNorm")
    mmeans_features.remove("longestKillNorm")
    mmeans_features.remove("matchDurationNorm")
    mmeans_features.remove("matchDuration")
    mmeans_features.remove("maxPlaceNorm")
    mmeans_features.remove("numGroups")
    mmeans_features.remove("revivesNorm")
    mmeans_features.remove("roadKillsNorm")
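    mmeans_features.remove("teamKillsNorm")
    if is_train:
        # Keep the target out of the match-level means: it does not exist at test time,
        # so leaving it in would give the training set one more feature column than the test set
        mmeans_features.remove(target)
    agg = df.groupby(['matchId'])[mmeans_features].agg('mean').reset_index()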
    gc.collect()
    X = X.merge(agg, suffixes=["", "_match_mean"], how='left', on=['matchId'])
    
    print('Aggregating match sizes')
    msizes_features = list(df.columns) 
    msizes_features.remove("Id")
    msizes_features.remove("matchId")
    msizes_features.remove("groupId")
    msizes_features.remove("DBNOsNorm")
    msizes_features.remove("damageDealtNorm")
    msizes_features.remove("headshotKillsNorm")
    msizes_features.remove("killPlace_over_maxPlace")
    msizes_features.remove("killPointsNorm")
    msizes_features.remove("killStreaksNorm")
    msizes_features.remove("longestKillNorm")
    msizes_features.remove("matchDurationNorm")
    msizes_features.remove("matchDuration")
    msizes_features.remove("maxPlaceNorm")
    msizes_features.remove("numGroups")
    msizes_features.remove("revivesNorm")
    msizes_features.remove("roadKillsNorm")
    msizes_features.remove("teamKillsNorm")      
    agg = df.groupby(['matchId']).size().reset_index(name='match_size')
    gc.collect()
    X = X.merge(agg, how='left', on=['matchId'])
    del df, agg
    gc.collect()

    X.drop(columns = ['matchId', 
                      'groupId'
                     ], axis=1, inplace=True)  
    gc.collect()
    if is_train:
        return X, y
    
    return X
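
The core pattern of the function above (aggregate per group, then rank the groups inside each match) is easier to see on a toy frame. The sketch below uses invented data with a single `kills` column and `join` instead of the chained merges, so it illustrates the idea rather than reproducing the exact pipeline.

```python
import pandas as pd

# Toy data: two matches, five groups, one stat per player
players = pd.DataFrame({
    "matchId": ["m1", "m1", "m1", "m1", "m2", "m2"],
    "groupId": ["g1", "g1", "g2", "g3", "g4", "g5"],
    "kills":   [3, 1, 0, 5, 2, 2],
})

# Step 1: one row per (match, group) holding the group mean
group_mean = players.groupby(["matchId", "groupId"])[["kills"]].mean()

# Step 2: percentile rank of each group within its own match
group_rank = group_mean.groupby("matchId")[["kills"]].rank(pct=True)

# Step 3: put both side by side, mirroring the "_mean" / "_mean_rank" suffixes
features = group_mean.join(
    group_rank, lsuffix="_mean", rsuffix="_mean_rank"
).reset_index()
print(features)
```

In match `m1` the group with the highest mean gets rank 1.0; in `m2` the two tied groups share the average rank, 0.75 each.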

### Load dataset files

In [None]:
X_train = pd.read_csv('../input/train_V2.csv', engine='c')

In [None]:
X_train = df_footprint_reduce(X_train, skip_obj=True)
gc.collect()

In [None]:
X_train, y_train = feature_engineering(X_train, True)
gc.collect()

In [None]:
X_train = df_footprint_reduce(X_train, skip_obj=True)
gc.collect()

In [None]:
from sklearn import preprocessing

In [None]:
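# Note: the scaler is only fitted here; X_train itself is not transformed.
# Tree-based models such as XGBoost are largely insensitive to feature scaling, so training uses the unscaled features.
scaler = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=False).fit(X_train)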

In [None]:
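# Map the target from [0, 1] to [-1, 1]; predictions are mapped back with (pred + 1) / 2 before submission
y_train = y_train * 2 - 1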

### Train / validation split of dataset

In [None]:
# Import good old friend
from sklearn.model_selection import train_test_split, GridSearchCV

In [None]:
# Hold out 20% of X_train as a validation set and keep the remaining 80% for training
X_train, X_validation, y_train, y_validation = train_test_split(X_train, 
                                                                y_train, 
                                                                test_size=0.2)
gc.collect()
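
The split above is random on every run, which makes validation scores hard to compare between runs. As a hedged suggestion, `train_test_split` accepts a `random_state` argument that fixes the shuffle; the seed value 13 and the toy arrays below are arbitrary choices for this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real feature matrix and target
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# With a fixed random_state the same rows land in the validation set every time
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=13)
X_tr2, X_val2, y_tr2, y_val2 = train_test_split(X, y, test_size=0.2, random_state=13)

print(len(X_tr), len(X_val))         # 8 2
print(np.array_equal(y_val, y_val2)) # True
```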

### Initialize model

In [None]:
# Import the real deal
import xgboost as xgb

In [None]:
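# Initialize model with initial parameters given.
# XGBoost parameter names are used here: bagging_fraction, feature_fraction, bagging_seed and metric
# are LightGBM names that XGBoost does not use. The evaluation metric is set at training time.
model = xgb.XGBRegressor(objective = 'reg:squarederror',  # 'reg:linear' is the deprecated name
                         n_estimators = 30000,
                         subsample = 0.7,                  # fraction of rows sampled per tree
                         random_state = 13,
                         colsample_bytree = 0.7,           # fraction of columns sampled per tree
                         tree_method = 'gpu_hist',         # XGBoost >= 2.0: tree_method='hist', device='cuda'
                         verbosity = 1)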

### Finding right hyperparameters for our model

In [None]:
def find_best_hyperparameters(model):
    # Grid parameters for using in Gridsearch while tuning (XGBoost parameter names)
    gridParams = {
        'learning_rate'         : [0.1, 0.01 , 0.05],
        'n_estimators'          : [1000, 10000, 20000],
        'subsample'             : [0.5, 0.6 ,0.7],
        'colsample_bytree'      : [0.5, 0.6 ,0.7],
        'max_leaves'            : [31, 80, 140]   # XGBoost counterpart of LightGBM's num_leaves
    }
    # Create the grid
    grid = GridSearchCV(model, 
                        gridParams,
                        verbose=5,
                        cv=3)
    # Run the grid
    grid.fit(X_train, y_train)
    print('Best parameters: %s' % grid.best_params_)
    # For a regressor, best_score_ is the mean cross-validated R^2 unless `scoring` is set
    print('Best CV score (R^2): %.4f' % grid.best_score_)
    return grid.best_params_
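
Before running a search like this, it helps to count how much work it implies. The sketch below uses scikit-learn's `ParameterGrid` on a grid with the same shape (five hyperparameters, three candidate values each); the parameter names are illustrative, since only the number of values matters for the count.

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the grid above: 5 hyperparameters x 3 candidates each
grid_params = {
    "learning_rate": [0.1, 0.01, 0.05],
    "n_estimators": [1000, 10000, 20000],
    "subsample": [0.5, 0.6, 0.7],
    "colsample_bytree": [0.5, 0.6, 0.7],
    "max_leaves": [31, 80, 140],
}

n_candidates = len(ParameterGrid(grid_params))  # 3 ** 5 combinations
cv_folds = 3
total_fits = n_candidates * cv_folds            # one fit per candidate per fold

print(n_candidates, total_fits)  # 243 729
```

With up to 20000 trees per fit, 729 fits is a lot of training time, which is why the call in the next cell is commented out.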

In [None]:
#find_best_hyperparameters(model)   # This takes time so comment out after finding your right parameters for model training

### Model Training

In [None]:
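%%time
# Set the metric through set_params: recent XGBoost releases no longer accept eval_metric in fit()
model.set_params(eval_metric='mae')
model.fit(X_train, y_train,
          eval_set=[(X_train, y_train), (X_validation, y_validation)])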

In [None]:
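# Competition evaluation is based on mean absolute error, so we report it on the validation set.
# Entries follow the eval_set order: validation_0 is the training set, validation_1 is the validation set.
# The target was rescaled to [-1, 1], so this value is twice the MAE on the original [0, 1] scale.
print('The mean absolute error of model on validation set is:', min(model.evals_result()['validation_1']['mae']))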
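
Because the target was mapped from [0, 1] to [-1, 1] before training, the MAE reported during training is on the stretched scale. A small sketch with invented numbers shows the relationship to the MAE on the original scale.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented targets and predictions on the original [0, 1] scale
y_true = np.array([0.0, 0.25, 0.5, 1.0])
y_pred = np.array([0.1, 0.25, 0.4, 0.9])

mae_original = mean_absolute_error(y_true, y_pred)

# The same data after the y * 2 - 1 mapping used in this notebook
mae_scaled = mean_absolute_error(y_true * 2 - 1, y_pred * 2 - 1)

# Doubling the range doubles every absolute error, and so the MAE
print(mae_original, mae_scaled)
```

To compare against the competition leaderboard, halve the MAE measured on the [-1, 1] scale.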

### Feature importance

In [None]:
import matplotlib.pyplot as plt

In [None]:
ax = xgb.plot_importance(model)
fig = ax.figure
fig.set_size_inches(20, 50)

In [None]:
# Clean memory and load test set
del X_train, X_validation, y_train, y_validation 
gc.collect()

### Model Prediction 

In [None]:
test_x = pd.read_csv('../input/test_V2.csv', engine='c')

In [None]:
test_x = df_footprint_reduce(test_x, skip_obj=True)
gc.collect()

In [None]:
test_x = feature_engineering(test_x, False)
gc.collect()

In [None]:
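# Feature scaling is skipped here: the model was trained on unscaled features,
# so test_x must reach the model unscaled as well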

In [None]:
pred_test = model.predict(test_x)
del test_x
gc.collect()

### Prepare for submission 

In [None]:
test_set = pd.read_csv('../input/test_V2.csv', engine='c')

In [None]:
pred_test = pred_test.reshape(-1)
pred_test = (pred_test + 1) / 2
for i in range(len(test_set)):
    winPlacePerc = pred_test[i]
    maxPlace = int(test_set.iloc[i]['maxPlace'])
    if maxPlace == 0:
        winPlacePerc = 0.0
    elif maxPlace == 1:
        winPlacePerc = 1.0
    else:
        gap = 1.0 / (maxPlace - 1)
        winPlacePerc = round(winPlacePerc / gap) * gap
    
    if winPlacePerc < 0: winPlacePerc = 0.0
    if winPlacePerc > 1: winPlacePerc = 1.0    
    pred_test[i] = winPlacePerc

    if (i + 1) % 100000 == 0:
        print(i, flush=True, end=" ")

test_set['winPlacePerc'] = pred_test

submission = test_set[['Id', 'winPlacePerc']]
submission.to_csv('submission.csv', index=False)
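
The row-by-row loop above can be slow on a test set with well over a million rows. A vectorized NumPy version of the same snapping rule might look like the sketch below. The helper name `snap_to_placement_grid` is made up here, and it is meant as an illustration of the idea rather than a drop-in replacement verified against the full data.

```python
import numpy as np


def snap_to_placement_grid(pred, max_place):
    """Snap predictions to multiples of 1 / (max_place - 1), clipped to [0, 1]."""
    pred = np.asarray(pred, dtype=float)
    max_place = np.asarray(max_place, dtype=int)

    # Number of steps between 0 and 1; use 1 where max_place <= 1 to avoid dividing by zero
    steps = np.where(max_place > 1, max_place - 1, 1)
    gap = 1.0 / steps
    snapped = np.round(pred / gap) * gap

    # Special cases handled explicitly, as in the loop
    snapped = np.where(max_place == 0, 0.0, snapped)
    snapped = np.where(max_place == 1, 1.0, snapped)
    return np.clip(snapped, 0.0, 1.0)


# Toy check: 0.26 with 5 places snaps to the nearest multiple of 0.25
result = snap_to_placement_grid([0.26, 0.9, 0.3, -0.2, 1.4], [5, 1, 0, 3, 3])
print(result)
```

Assuming `test_set` and `pred_test` as defined above, usage would be `test_set['winPlacePerc'] = snap_to_placement_grid(pred_test, test_set['maxPlace'].values)`.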

##### Credits for some of the code used in feature engineering and post-processing:
###### https://www.kaggle.com/harshitsheoran/mlp-and-fe
###### https://www.kaggle.com/ceshine/a-simple-post-processing-trick-lb-0237-0204