<a href="https://colab.research.google.com/github/captainawesome78/Data-Analysis/blob/main/PUBG_Playground_Finish_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Jovian Commit Essentials
# Please retain and execute this cell without modifying the contents for `jovian.commit` to work
!pip install jovian --upgrade -q
import jovian
jovian.set_project('pubg-playground-finish-prediction')
jovian.set_colab_id('14MlU_KC76IOg9BbPwa7uO8FbjjPFw4VV')

# PUBG Playground


## About
In a PUBG game, up to 100 players start in each match (matchId). Players can be on teams (groupId) which get ranked at the end of the game (winPlacePerc) based on how many other teams are still alive when they are eliminated. In game, players can pick up different munitions, revive downed-but-not-out (knocked) teammates, drive vehicles, swim, run, shoot, and experience all of the consequences -- such as falling too far or running themselves over and eliminating themselves.

You are provided with a large number of anonymized PUBG game stats, formatted so that each row contains one player's post-game stats. The data comes from matches of all types: solos, duos, squads, and custom; there is no guarantee of there being 100 players per match, nor at most 4 players per group.
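
Since match and group sizes are not guaranteed, it can help to count them directly. The following is a minimal sketch on made-up rows; only the column names `matchId` and `groupId` come from the description above, and the values are purely illustrative.

```python
import pandas as pd

# Toy rows standing in for the real data; the IDs are invented for illustration.
toy = pd.DataFrame({
    "matchId": ["m1", "m1", "m1", "m2", "m2"],
    "groupId": ["g1", "g1", "g2", "g3", "g3"],
})

# One row per player, so counting rows gives the number of players.
players_per_match = toy.groupby("matchId").size()
players_per_group = toy.groupby(["matchId", "groupId"]).size()

print(players_per_match)
print(players_per_group)
```

On the real data, the same two `groupby` calls could be run on `train_df` once it is loaded.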


## Why are you summoned
You are to predict each player's finishing placement based on their final stats, on a scale from 1 (first place) to 0 (last place).

All the very best!

# The Overview

### Downloading the data
- Loading it into a pandas dataframe
- Exploring the dataset

### Processing and Feature Engineering
- Checking for Duplicate values
- Checking for missing values

### Exploratory Data Analysis and Visualization
- Target Variable
- Correlation heatmap

### Identifying Input and Target Columns
- Creating a list of Input columns and target columns

### Categorical and Numeric Columns
- Imputing missing numerical columns
- Scaling numerical values
- Encoding categorical columns

### Splitting the data for training
- Creating Training data(X_train)
- Creating Validation data(X_val)

## Training and Tuning Different Models
- Regression
- Decision Trees
- Random Forest Regression
- XGBRegressor from xgboost
- Gradient Boosting 

### Training Final Model
- Selecting the best performing model to make final predictions

### Saving The Model
- Using joblib

### Test Prediction
- The results

### Conclusion
- Summary
- Further Explorations

In [2]:
!pip install jovian --upgrade --quiet

In [3]:
import jovian

In [4]:
# Execute this to save new versions of the notebook
jovian.commit(project="PUBG-Playground-Finish-Prediction")

[jovian] Detected Colab notebook...
[jovian] Uploading colab notebook to Jovian...


[jovian] Error: Looks like the notebook is missing output cells, please save the notebook and try jovian.commit again.


Committed successfully! https://jovian.ai/mailmenaveed88/pubg-playground-finish-prediction


'https://jovian.ai/mailmenaveed88/pubg-playground-finish-prediction'

Let's install all the necessary libraries.

In [5]:

!pip install numpy pandas-profiling matplotlib seaborn --quiet
!pip install opendatasets xgboost graphviz lightgbm scikit-learn --quiet

In [6]:
import os
import opendatasets as od
import pandas as pd
pd.set_option("display.max_columns", 200)
pd.set_option("display.max_rows", 200)

Downloading the dataset from Kaggle

In [7]:
od.download('https://www.kaggle.com/c/pubg-finish-placement-prediction/data')

Skipping, found downloaded files in "./pubg-finish-placement-prediction" (use force=True to force download)


In [8]:
os.listdir('pubg-finish-placement-prediction')
# Provides you the list of files present in the folder

['test_V2.csv', 'sample_submission_V2.csv', 'train_V2.csv']

Let's read the datasets train_V2.csv and test_V2.csv

In [None]:
train_df = pd.read_csv('pubg-finish-placement-prediction/train_V2.csv')
test_df = pd.read_csv('pubg-finish-placement-prediction/test_V2.csv')

Let's look at the training data and the test data

In [None]:
train_df.head()


In [None]:
test_df.head()

Let's check the shape of the training set and test set

In [None]:
train_df.shape


In [None]:
test_df.shape

As shown above, `shape` returns the number of rows and columns of train_df and test_df

In [None]:
train_df.info()

In [None]:
train_df.describe()

Let's check for missing values in both the train and test dataframes

In [None]:
train_df.isna().sum()

In [None]:
test_df.isna().sum()

As you can see, there is just one missing entry, in the target column winPlacePerc of train_df, so we can drop that row

In [None]:
train_df.dropna(subset = ['winPlacePerc'], inplace=True)

In [None]:
train_df.isna().sum()

## Let's perform Exploratory Data Analysis now

Let's look at the columns in train_df

In [None]:
train_df.columns

In [None]:
import seaborn as sns
sns.histplot(train_df['boosts'], kde=True)  # distplot is deprecated; histplot with kde=True is the closest replacement

Let us check some OLS assumptions, starting with whether individual features relate linearly to winPlacePerc

In [None]:
!pip install matplotlib



In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(train_df['assists'],train_df['winPlacePerc'])
plt.xlabel('assists')
plt.ylabel('winPlacePerc')

In [None]:
plt.scatter(train_df['damageDealt'],train_df['winPlacePerc'])
plt.xlabel('damageDealt')
plt.ylabel('winPlacePerc')

In [None]:
plt.scatter(train_df['headshotKills'],train_df['winPlacePerc'])
plt.xlabel('headshotKills')
plt.ylabel('winPlacePerc')

In [None]:
plt.scatter(train_df['heals'],train_df['winPlacePerc'])
plt.xlabel('heals')
plt.ylabel('winPlacePerc')

In [None]:
plt.scatter(train_df['killPlace'],train_df['winPlacePerc'])
plt.xlabel('killPlace')
plt.ylabel('winPlacePerc')

In [None]:
plt.scatter(train_df['killPoints'],train_df['winPlacePerc'])
plt.xlabel('killPoints')
plt.ylabel('winPlacePerc')

In [None]:
plt.scatter(train_df['kills'],train_df['winPlacePerc'])
plt.xlabel('kills')
plt.ylabel('winPlacePerc')

In [None]:
plt.scatter(train_df['killStreaks'],train_df['winPlacePerc'])
plt.xlabel('killStreaks')
plt.ylabel('winPlacePerc')

In [None]:
plt.scatter(train_df['longestKill'],train_df['winPlacePerc'])
plt.xlabel('longestKill')
plt.ylabel('winPlacePerc')

In [None]:
plt.scatter(train_df['matchDuration'],train_df['winPlacePerc'])
plt.xlabel('matchDuration')
plt.ylabel('winPlacePerc')

Let's check the distribution of some of the features

In [None]:
# lets commit first
jovian.commit(project="PUBG-Playground-Finish-Prediction")

Let us check for multicollinearity

Multicollinearity occurs when two or more independent variables in a multiple regression model are highly correlated with each other.

When some features are highly correlated, we might have difficulty distinguishing between their individual effects on the dependent variable.

Multicollinearity can be detected by various techniques; one of them is VIF (Variance Inflation Factor).

In the VIF method, we pick each feature and regress it against all of the other features. For each regression, the VIF is calculated.

The VIF can be calculated for numerical features only.
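
To make the procedure concrete: for feature j, the VIF is 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained by regressing feature j on all the other features. The sketch below implements this with NumPy on synthetic data; the function name `vif_by_regression` and the toy columns are assumptions made for illustration, not part of this notebook. It fits each regression with an intercept, whereas statsmodels' `variance_inflation_factor` regresses on exactly the columns it is given, so the two can differ unless a constant column is included there.

```python
import numpy as np

def vif_by_regression(X):
    """Return one VIF per column of X, computed as 1 / (1 - R_j^2)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        y = X[:, j]                                      # feature being explained
        others = np.delete(X, j, axis=1)                 # all remaining features
        A = np.column_stack([np.ones(n_rows), others])   # add an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)     # ordinary least squares fit
        residuals = y - A @ coef
        ss_res = residuals @ residuals
        ss_tot = ((y - y.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Synthetic example: x2 is almost a copy of x1, x3 is independent of both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.05, size=500)
x3 = rng.normal(size=500)
toy_vifs = vif_by_regression(np.column_stack([x1, x2, x3]))
print(toy_vifs)
```

The two nearly identical columns should receive a large VIF, while the independent column should stay close to 1.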

In [None]:
jovian.commit(project="PUBG-Playground-Finish-Prediction")

First, let's identify the numerical columns

In [None]:
num_cols = train_df.select_dtypes(exclude = 'object').columns.tolist()
num_cols

We will install the statsmodels library to compute the VIF

In [None]:
!pip install statsmodels --quiet

In [None]:
vif = pd.DataFrame()
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
variables = train_df[num_cols]

In [None]:
vif['VIF'] = [variance_inflation_factor(variables.values,i) for i in range(variables.shape[1])]
vif['features'] = variables.columns

In [None]:
vif

In [None]:
vif.sort_values(['VIF'], ascending=False)

As a rule of thumb, VIF values are interpreted as follows:

VIF = 1     -> No multicollinearity

1 < VIF < 5 -> Perfectly OK

VIF > 10    -> Unacceptable
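
The thresholds above can be applied to a VIF table programmatically. This is a small sketch on invented numbers; the feature names and VIF values are hypothetical, and the label for the 5 to 10 range is an assumption, since the list above does not cover it.

```python
import pandas as pd

def vif_verdict(value):
    """Map a VIF value to the rule-of-thumb labels listed above."""
    if value > 10:
        return "unacceptable"
    if value < 5:
        return "ok"
    # Assumption: the rule of thumb above leaves 5..10 unlabelled.
    return "borderline"

# Hypothetical VIF table with the same two columns as the `vif` frame.
toy_vif = pd.DataFrame({
    "features": ["feat_a", "feat_b", "feat_c"],
    "VIF": [1.2, 7.5, 42.0],
})
toy_vif["verdict"] = toy_vif["VIF"].apply(vif_verdict)
print(toy_vif)
```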

Let's look at the heatmap and see what it tells us


In [None]:
plt.figure(figsize = (40,20))
plt.title('Correlation View')
sns.heatmap(train_df.select_dtypes(exclude = 'object').corr(), annot = True, cmap = 'viridis')  # numeric columns only; recent pandas raises on object columns
plt.show()

So you can see that some features have a strong positive correlation with ```winPlacePerc```
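
A large heatmap can be hard to read, so one option is to pull out just the target's column of the correlation matrix and sort it. The sketch below uses a toy frame; the `feat_*` columns are invented for illustration, and only `winPlacePerc` is a real column name.

```python
import pandas as pd

# Toy data: the feat_* columns are made up, winPlacePerc is the target name.
toy = pd.DataFrame({
    "feat_up":      [1, 2, 3, 4, 5],
    "feat_down":    [5, 4, 3, 2, 1],
    "feat_flat":    [2, 1, 2, 1, 2],
    "winPlacePerc": [0.1, 0.3, 0.5, 0.7, 0.9],
})

# Correlation of every feature with the target, strongest positive first.
target_corr = (
    toy.corr()["winPlacePerc"]
    .drop("winPlacePerc")
    .sort_values(ascending=False)
)
print(target_corr)
```

On the real data, the same chain could be applied to the numeric columns of `train_df`.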




In [None]:
jovian.commit(project='PUBG-Playground-Finish-Prediction')

## Feature Engineering

Let's check the distributions of the features to see their shapes and identify any odd behaviour

In [None]:
sns.displot(train_df['winPoints'],bins=10) 

In [None]:
jovian.commit(project='PUBG-Playground-Finish-Prediction')

In [None]:
plt.figure(figsize = (20, 10))
sns.countplot(x=train_df['kills'])  # pass the column as x explicitly; recent seaborn versions require keyword arguments

As the distribution is skewed to the right, let's trim the data to keep only rows with at most 20 kills

In [None]:
indexes = train_df[train_df['kills'] > 20].index
train_df.drop(index = indexes, inplace = True)
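
The right skew mentioned above can also be quantified before trimming. The sketch below uses an invented series of kill counts; a positive value from `Series.skew()` indicates a right-skewed distribution, and the mean of a boolean mask gives the share of rows a cutoff would remove.

```python
import pandas as pd

# Invented kill counts: most players have few kills, one has many.
toy_kills = pd.Series([0] * 60 + [1] * 25 + [2] * 10 + [5] * 4 + [25])

skewness = toy_kills.skew()                     # > 0 means a long right tail
share_above_cutoff = (toy_kills > 20).mean()    # fraction of rows the cutoff drops

print(skewness, share_above_cutoff)
```

Checking the dropped share on `train_df['kills']` would show how much data the cutoff of 20 actually removes.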

Checking ```heals```

In [None]:
sns.displot(train_df['heals'],bins=10)

In [None]:
# Removing heals which are more than 35

indexes = train_df[train_df['heals'] > 35].index
train_df.drop(index = indexes, inplace = True)

Checking Weapons Acquired

In [None]:
sns.displot(train_df['weaponsAcquired'], bins=50)

In [None]:
# Removing weaponsAcquired which are more than 50

indexes = train_df[train_df['weaponsAcquired'] > 50].index
train_df.drop(index = indexes, inplace = True)

Checking Swim Distance

In [None]:
sns.displot(train_df['swimDistance'])

In [57]:
# Removing swimDistance which are more than 1500

indexes = train_df[train_df['swimDistance'] > 1500].index
train_df.drop(index = indexes, inplace = True)

Checking Ride Distance

In [None]:
sns.displot(train_df['rideDistance'])

In [None]:
# Removing rideDistance which are more than 20000

indexes = train_df[train_df['rideDistance'] > 20000].index
train_df.drop(index = indexes, inplace = True)

Checking walkDistance

In [None]:
sns.displot(train_df['walkDistance'],bins=1000)

In [None]:
# Removing walkDistance which are more than 10000

indexes = train_df[train_df['walkDistance'] > 10000].index
train_df.drop(index = indexes, inplace = True)

Checking longestKill

In [None]:
#Removing longestKill which are more than 1000

indexes = train_df[train_df['longestKill'] > 1000].index
train_df.drop(index = indexes, inplace = True)

Let's see the unique values of roadKills

In [None]:
train_df['roadKills'].unique()

In [None]:
# We will just drop all the rows where the roadKills value is greater than 10

indexes = train_df[train_df['roadKills'] > 10].index
train_df.drop(index = indexes, inplace = True)

Checking Damage Dealt

In [None]:
fig,(ax1,ax2)=plt.subplots(1,2)
fig.set_figwidth(20)

sns.histplot(train_df['damageDealt'], kde=True, ax=ax1)  # replaces the deprecated distplot
sns.boxplot(x=train_df['damageDealt'], ax=ax2)


In [None]:
# dropping damageDealt where more than 1500

indexes = train_df[train_df['damageDealt'] > 1500].index
train_df.drop(index = indexes, inplace = True)
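
All the trimming cells in this section follow the same pattern, so they could be collapsed into a single pass. The following is a sketch, not the notebook's own method: the helper `trim_outliers` and the dictionary name `UPPER_LIMITS` are hypothetical, the limits are copied from the cells above, and the demo frame is made up.

```python
import pandas as pd

# Upper limits copied from the trimming cells in this section.
UPPER_LIMITS = {
    "kills": 20,
    "heals": 35,
    "weaponsAcquired": 50,
    "swimDistance": 1500,
    "rideDistance": 20000,
    "walkDistance": 10000,
    "longestKill": 1000,
    "roadKills": 10,
    "damageDealt": 1500,
}

def trim_outliers(df, limits):
    """Return a copy of df without rows that exceed any per-column limit."""
    keep = pd.Series(True, index=df.index)
    for column, limit in limits.items():
        if column in df.columns:
            # Same condition as the cells above: drop values strictly above the limit.
            keep &= ~(df[column] > limit)
    return df.loc[keep].copy()

# Made-up rows: the second exceeds the kills limit, the third the heals limit.
toy = pd.DataFrame({"kills": [0, 21, 5, 20], "heals": [1, 2, 40, 35]})
trimmed = trim_outliers(toy, UPPER_LIMITS)
print(trimmed)
```

Because each condition is independent of the others, one combined mask removes the same rows as the sequence of separate `drop` calls.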

In [None]:
jovian.commit(project='PUBG-Playground-Finish-Prediction')