## Table of Contents

* __Step 1: Importing the Relevant Libraries__
    
* __Step 2: Data Inspection__
    
* __Step 3: Data Cleaning__
    
* __Step 4: Exploratory Data Analysis__
    
* __Step 5: Building Model__
    
* __How to Make a Submission?__
* __Guidelines for Final Submission__

### Step 1: Importing the Relevant Libraries

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression

import warnings
# Silence library warnings in the notebook output
warnings.filterwarnings('ignore')

### Step 2: Data Inspection

In [2]:
train = pd.read_csv("./train_1K0BDt5/train.csv")
test = pd.read_csv("./test_kuhCxHY/test.csv")

In [3]:
train.shape,test.shape

((20453, 800), (8774, 799))

In [4]:
# percentage of null values per column (train)
train.isnull().sum()/train.shape[0] *100

row_id                            0.000000
scout_id                          0.000000
rating_num                        0.000000
winner                            0.000000
team                              0.000000
                                    ...   
team2_defensive_derived_var_15    6.976972
team2_offensive_derived_var_16    6.976972
team2_defensive_derived_var_17    6.976972
team2_offensive_derived_var_18    6.976972
team2_offensive_derived_var_19    6.976972
Length: 800, dtype: float64

In [5]:
# percentage of null values per column (test), highest first
(test.isnull().sum()/test.shape[0] *100).sort_values(ascending=False)

team2_other_ratio_var_32    100.0
team2_other_raw_var_51      100.0
team1_other_ratio_var_32    100.0
team2_other_raw_var_40      100.0
team2_other_raw_var_43      100.0
                            ...  
player_general_var_2          0.0
player_general_var_3          0.0
player_general_var_5          0.0
scout_id                      0.0
row_id                        0.0
Length: 799, dtype: float64

* __Some columns are 100% missing in the test set. These columns need to be dropped.__

In [6]:
#categorical features
categorical = train.select_dtypes(include=["object"])  # np.object was removed from NumPy
print("Categorical Features in Train Set:",categorical.shape[1])

#numerical features
numerical= train.select_dtypes(include =[np.float64,np.int64])
print("Numerical Features in Train Set:",numerical.shape[1])

Categorical Features in Train Set: 2
Numerical Features in Train Set: 798


In [8]:
#categorical features
categorical_test = test.select_dtypes(include=["object"])  # np.object was removed from NumPy
print("Categorical Features in Test Set:",categorical_test.shape[1])

#numerical features
numerical_test= test.select_dtypes(include =[np.float64,np.int64])
print("Numerical Features in Test Set:",numerical_test.shape[1])

Categorical Features in Test Set: 2
Numerical Features in Test Set: 797


### Step 3: Data Cleaning

Why is missing-value treatment required?
Missing data in the training set can weaken a model's fit or bias the model, because the behavior of the affected variable and its relationship with the other variables are estimated from incomplete information. This can lead to wrong predictions.
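
To make the bias concrete, here is a minimal sketch on made-up numbers (not the competition data). It assumes the largest values are the ones that go missing, and shows that both the observed mean and the mean after mean imputation understate the true mean.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only: the values 1..10
full = pd.Series(np.arange(1, 11), dtype="float64")

# Assumption: every value above 7 was never recorded
observed = full.where(full <= 7)

true_mean = full.mean()
observed_mean = observed.mean()           # pandas skips NaN by default
filled = observed.fillna(observed_mean)   # mean imputation keeps the biased mean

print(true_mean, observed_mean, filled.mean())
```

When values are missing completely at random the distortion is much smaller, which is why it helps to look at which rows are missing before choosing a treatment.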

In [9]:
train.isnull().sum()

row_id                               0
scout_id                             0
rating_num                           0
winner                               0
team                                 0
                                  ... 
team2_defensive_derived_var_15    1427
team2_offensive_derived_var_16    1427
team2_defensive_derived_var_17    1427
team2_offensive_derived_var_18    1427
team2_offensive_derived_var_19    1427
Length: 800, dtype: int64

In [10]:
test.isnull().sum()

row_id                              0
scout_id                            0
winner                              0
team                                1
competitionId                       1
                                 ... 
team2_defensive_derived_var_15    416
team2_offensive_derived_var_16    416
team2_defensive_derived_var_17    416
team2_offensive_derived_var_18    416
team2_offensive_derived_var_19    416
Length: 799, dtype: int64

__3.1 Remove the columns with more than 25% missing values__

In [61]:
# Percentage of missing values per column in the training set
na = train.isna().mean() * 100
# Names of the columns where more than 25% of the values are missing
null_var_names = na[na > 25.0].index.tolist()

In [62]:
# Drop the columns with more than 25% missing values from both sets
train_new = train.drop(columns = null_var_names)
test_new = test.drop(columns = null_var_names)
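
The same filtering can be wrapped in a small reusable helper. This is only a sketch: the function name `drop_sparse_columns` and the toy frames are hypothetical, and the threshold is passed in so that different cut-offs can be compared.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(train_df, test_df, threshold_pct):
    """Drop columns whose share of missing values in train_df exceeds threshold_pct."""
    missing_pct = train_df.isna().mean() * 100
    to_drop = missing_pct[missing_pct > threshold_pct].index.tolist()
    # errors="ignore" skips names that are absent from test_df (e.g. the target)
    return (train_df.drop(columns=to_drop),
            test_df.drop(columns=to_drop, errors="ignore"),
            to_drop)

# Toy frames, invented for illustration only
toy_train = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                          "b": [np.nan, np.nan, 1.0, 2.0],   # 50% missing
                          "c": [np.nan, 1.0, 2.0, 3.0]})     # 25% missing
toy_test = pd.DataFrame({"a": [5.0], "b": [np.nan], "c": [1.0]})

train_kept, test_kept, dropped = drop_sparse_columns(toy_train, toy_test, 25.0)
print(dropped)
```

The decision is based on the training set alone, so both sets always end up with the same feature columns.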

In [63]:
test_new.team.isna().sum()

1

In [64]:
train_new[categorical.columns[0]].isna().sum(), test[categorical.columns[0]].isna().sum()

(0, 0)

In [65]:
train_new[categorical.columns[1]].isna().sum(), test[categorical.columns[1]].isna().sum()

(0, 1)

In [66]:
# Convert categorical values to numerical codes.
# The encoder is fitted on the training set and its mapping is reused for the
# test set, so the same category gets the same code in both sets.
encoder = LabelEncoder()  # imported in Step 1

for col in categorical.columns:
    train_new[col] = encoder.fit_transform(train_new[col])
    mapping = {label: code for code, label in enumerate(encoder.classes_)}
    # Test values that are missing or were not seen in train are coded as -1
    test_new[col] = test_new[col].map(mapping).fillna(-1).astype(int)
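
Another way to keep the train and test codes aligned is scikit-learn's `OrdinalEncoder`, which can send unseen categories to a fixed code. The sketch below uses made-up category names (`team1`, `team2`, `team3`) and a hypothetical `"missing"` placeholder; it is not taken from the competition data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames, invented for illustration only
toy_train = pd.DataFrame({"team": ["team1", "team2", "team1"]})
toy_test = pd.DataFrame({"team": ["team2", None, "team3"]})

# Categories not seen during fit are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(toy_train[["team"]])

# Replace missing entries with a placeholder the encoder has never seen
test_filled = toy_test[["team"]].fillna("missing")

train_codes = encoder.transform(toy_train[["team"]]).ravel()
test_codes = encoder.transform(test_filled).ravel()
print(train_codes, test_codes)
```

Because the mapping is learned once from the training data, a given category always receives the same code wherever it appears.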

In [67]:
train_new.team

0        0
1        1
2        0
3        1
4        0
        ..
20448    1
20449    1
20450    0
20451    0
20452    1
Name: team, Length: 20453, dtype: int32

In [68]:
test_new.winner.nunique()

3

In [69]:
# Replace the missing values in each column with that column's mean.
# The categorical columns are already encoded and have no missing values.
train_new = train_new.fillna(train_new.mean(numeric_only=True))

In [70]:
# Replace the missing values in each column with that column's mean.
# The categorical columns are already encoded and have no missing values.
test_new = test_new.fillna(test_new.mean(numeric_only=True))
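
A common variation is to learn the column means on the training set only and reuse them for the test set, so that no test-set statistics enter the pipeline. The sketch below shows this with scikit-learn's `SimpleImputer` on made-up toy frames; it is an alternative under that assumption, not the approach used in the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames, invented for illustration only
toy_train = pd.DataFrame({"x": [1.0, 3.0, np.nan], "y": [10.0, np.nan, 30.0]})
toy_test = pd.DataFrame({"x": [np.nan, 5.0], "y": [np.nan, 40.0]})

# Learn the column means from the training data only
imputer = SimpleImputer(strategy="mean")
imputer.fit(toy_train)

# Apply the same training means to both sets
train_imputed = pd.DataFrame(imputer.transform(toy_train), columns=toy_train.columns)
test_imputed = pd.DataFrame(imputer.transform(toy_test), columns=toy_test.columns)
print(test_imputed)
```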

In [71]:
train_new.isna().sum()

row_id                            0
scout_id                          0
rating_num                        0
winner                            0
team                              0
                                 ..
team2_defensive_derived_var_15    0
team2_offensive_derived_var_16    0
team2_defensive_derived_var_17    0
team2_offensive_derived_var_18    0
team2_offensive_derived_var_19    0
Length: 661, dtype: int64

In [72]:
test_new.isna().sum()

row_id                            0
scout_id                          0
winner                            0
team                              0
competitionId                     0
                                 ..
team2_defensive_derived_var_15    0
team2_offensive_derived_var_16    0
team2_defensive_derived_var_17    0
team2_offensive_derived_var_18    0
team2_offensive_derived_var_19    0
Length: 660, dtype: int64

### Step 5: Building Model

In [73]:
X = train_new.drop(columns = ['rating_num', 'row_id'])
y = train_new.rating_num
X_validate = test_new.drop(columns = ['row_id'])

In [74]:
X.shape, y.shape, X_validate.shape

((20453, 659), (20453,), (8774, 659))

In [75]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [76]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((14317, 659), (6136, 659), (14317,), (6136,))

In [77]:
# Model building using LightGBM
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score
lgbm = LGBMRegressor()
lgbm.fit(X_train,y_train)
y_pred = lgbm.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(r2)

0.29639795904282096
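
An R² score is easier to interpret next to a trivial baseline. The sketch below uses synthetic data from `make_regression` and a `LinearRegression` model as stand-ins (both are assumptions for illustration, not the competition data or the LightGBM model) and compares the model with a `DummyRegressor` that always predicts the training mean.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, generated for illustration only
X_toy, y_toy = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

# Baseline: always predict the mean of the training target
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

baseline_r2 = r2_score(y_te, baseline.predict(X_te))
model_r2 = r2_score(y_te, model.predict(X_te))
print(baseline_r2, model_r2)
```

A constant prediction scores at most 0 on held-out data, so any positive R² means the model explains some of the variance that the baseline cannot.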


In [78]:
submission = pd.read_csv('sample_submission_wBWLI0s.csv')
final_predictions = lgbm.predict(X_validate)
submission['rating_num'] = final_predictions
# Clip negative predictions to 0 so the target variable is never negative
submission['rating_num'] = submission['rating_num'].clip(lower=0)
submission.to_csv('my_submission.csv', index=False)
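
Before uploading, a quick sanity check on the submission frame can catch common mistakes. The helper below is a hypothetical sketch: the name `check_submission` and the specific checks are assumptions, not rules stated by the competition.

```python
import numpy as np
import pandas as pd

def check_submission(submission_df, expected_rows, target="rating_num"):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(submission_df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(submission_df)}")
    if submission_df[target].isna().any():
        problems.append("target column contains missing values")
    if (submission_df[target] < 0).any():
        problems.append("target column contains negative values")
    return problems

# Toy submission, invented for illustration only
toy = pd.DataFrame({"row_id": [1, 2, 3], "rating_num": [0.5, -0.2, np.nan]})
print(check_submission(toy, expected_rows=3))

# Clip negatives and fill the gap, then check again
toy["rating_num"] = toy["rating_num"].clip(lower=0).fillna(0)
print(check_submission(toy, expected_rows=3))
```

In this notebook the call would be along the lines of `check_submission(submission, expected_rows=len(test))`.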

## How to Make a Submission?

In [38]:
from IPython.display import HTML

HTML('<iframe width="700" height="380" src="https://www.youtube.com/embed/zevnI9TgTtA" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')    
