<center><h1>PUBG Player's Data Analysis and Placement Prediction</h1></center>
<center><h2>Fan Wang &nbsp;&nbsp;Bolun Yan </h2></center>
<center><h2>A53277514 &nbsp;&nbsp;A92413094 </h2></center>

In [1]:
# preparation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

<h1>Part 1 - Data Preprocessing</h1>

In [2]:
# read data
train = pd.read_csv('data/train_v2.csv')
print(train.shape)

(4446966, 29)


In [17]:
# show column names
columns = list(train.columns.values)
print(columns)

['Id', 'groupId', 'matchId', 'assists', 'boosts', 'damageDealt', 'DBNOs', 'headshotKills', 'heals', 'killPlace', 'killPoints', 'kills', 'killStreaks', 'longestKill', 'matchDuration', 'matchType', 'maxPlace', 'numGroups', 'rankPoints', 'revives', 'rideDistance', 'roadKills', 'swimDistance', 'teamKills', 'vehicleDestroys', 'walkDistance', 'weaponsAcquired', 'winPoints', 'winPlacePerc']


In [73]:
# generate info table
# numerical data
info_table = train.describe().T  # describe() skips non-numeric columns (object / string dtypes) by default
type_list = [train[col].dtype for col in info_table.index.to_list()]
info_table['type'] = type_list
info_table = info_table[['type', 'count', 'mean', 'std', 'min', 'max']]
print('Information Table for Numerical Data:')
print(info_table)

# non-numerical data
print('\n\nInformation Table for Non-numerical Data:')
non_numerical_columns = ['Id', 'groupId', 'matchId', 'matchType']
for col in non_numerical_columns:
    print(f'Column "{col}" ({train[col].dtype}) has {train[col].count()} records with {train[col].nunique()} unique values.')

Information Table for Numerical Data:
                    type      count         mean          std  min      max
assists            int64  4446966.0     0.233815     0.588573  0.0     22.0
boosts             int64  4446966.0     1.106908     1.715794  0.0     33.0
damageDealt      float64  4446966.0   130.717138   170.780621  0.0   6616.0
DBNOs              int64  4446966.0     0.657876     1.145743  0.0     53.0
headshotKills      int64  4446966.0     0.226820     0.602155  0.0     64.0
heals              int64  4446966.0     1.370147     2.679982  0.0     80.0
killPlace          int64  4446966.0    47.599350    27.462937  1.0    101.0
killPoints         int64  4446966.0   505.006042   627.504896  0.0   2170.0
kills              int64  4446966.0     0.924783     1.558445  0.0     72.0
killStreaks        int64  4446966.0     0.543955     0.710972  0.0     20.0
longestKill      float64  4446966.0    22.997595    50.972619  0.0   1094.0
matchDuration      int64  4446966.0  1579.506440  
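
The non-numerical columns in the cell above are listed by hand. As a minimal sketch, they could instead be derived from the column dtypes with `select_dtypes`. The toy frame below is made-up data used only for illustration; it is not part of the PUBG dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for `train` (values are invented).
toy = pd.DataFrame({
    "Id": ["a1", "b2", "c3"],
    "matchType": ["solo", "duo", "duo"],
    "kills": [0, 2, 1],
    "damageDealt": [0.0, 153.5, 88.0],
})

# Keep every column whose dtype is not numeric.
non_numerical_columns = toy.select_dtypes(exclude="number").columns.tolist()

for col in non_numerical_columns:
    print(f'Column "{col}" ({toy[col].dtype}) has {toy[col].count()} '
          f'records with {toy[col].nunique()} unique values.')
```

This avoids keeping the hard-coded list in sync with the data if columns are added or removed.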

- **groupId** - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
- **matchId** - ID to identify a match. There are no matches that are in both the training and testing sets.
- **assists** - Number of enemy players this player damaged that were killed by teammates.
- **boosts** - Number of boost items used.
- **damageDealt** - Total damage dealt. Note: Self-inflicted damage is subtracted.
- **DBNOs** - Number of enemy players knocked.
- **headshotKills** - Number of enemy players killed with headshots.
- **heals** - Number of healing items used.
- **killPlace** - Ranking in match of number of enemy players killed.
- **killPoints** - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.)
- **kills** - Number of enemy players killed.
- **killStreaks** - Max number of enemy players killed in a short amount of time.
- **longestKill** - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
- **maxPlace** - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
- **numGroups** - Number of groups we have data for in the match.
- **revives** - Number of times this player revived teammates.
- **rideDistance** - Total distance traveled in vehicles measured in meters.
- **roadKills** - Number of kills while in a vehicle.
- **swimDistance** - Total distance traveled by swimming measured in meters.
- **teamKills** - Number of times this player killed a teammate.
- **vehicleDestroys** - Number of vehicles destroyed.
- **walkDistance** - Total distance traveled on foot measured in meters.
- **weaponsAcquired** - Number of weapons picked up.
- **winPoints** - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.)
- **winPlacePerc** - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
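
To make the `winPlacePerc` definition concrete, here is a small sketch of one mapping that satisfies the description above (1 for 1st place, 0 for last, scaled by `maxPlace`). The linear formula and the function name `win_place_perc` are assumptions for illustration; the source only states the two endpoints and that `maxPlace` is used.

```python
def win_place_perc(placement: int, max_place: int) -> float:
    """Map a 1-based placement to a value in [0, 1].

    Assumed linear formula: (max_place - placement) / (max_place - 1).
    """
    if max_place <= 1:
        # Assumption: a match with a single placement has no meaningful percentile.
        return 0.0
    return (max_place - placement) / (max_place - 1)


# 1st place, a middle placement, and last place in a 100-place match.
for place in (1, 50, 100):
    print(place, round(win_place_perc(place, 100), 4))
```

Under this assumed formula, a match where `numGroups` is smaller than `maxPlace` would leave some values of the scale unused, which is consistent with the "missing chunks" mentioned in the definition.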

In [74]:
print(list(train['matchType'].unique()))

['squad-fpp', 'duo', 'solo-fpp', 'squad', 'duo-fpp', 'solo', 'normal-squad-fpp', 'crashfpp', 'flaretpp', 'normal-solo-fpp', 'flarefpp', 'normal-duo-fpp', 'normal-duo', 'normal-squad', 'crashtpp', 'normal-solo']


In [82]:
# filter & clean data
cleaned_data = train.copy(deep=True)
print(f'Original records number: {cleaned_data.shape[0]}')

# delete records with 35 or more kills (likely cheating / hacks); the filter keeps kills < 35
cleaned_data = cleaned_data[cleaned_data['kills'] < 35].copy(deep=True)
print(f'Records number after delete the record where "kill" > 35: {cleaned_data.shape[0]}')

# delete the non-standard match types (crash / flare event modes)
normal_match_types = ['solo', 'solo-fpp', 'normal-solo', 'normal-solo-fpp',
                      'duo', 'duo-fpp', 'normal-duo', 'normal-duo-fpp',
                      'squad','squad-fpp', 'normal-squad-fpp',  'normal-squad']
cleaned_data = cleaned_data[cleaned_data['matchType'].isin(normal_match_types)].copy(deep=True)
print(f'Records number after delete the unnormal / crashed matchtype: {cleaned_data.shape[0]}')

Original records number: 4446966
Records number after delete the record where "kill" > 35: 4446898
Records number after delete the unnormal / crashed matchtype: 4437017
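
The two filtering steps above follow the same pattern: apply a boolean mask, then report how many rows remain. A minimal sketch of that pattern as a reusable helper is shown below. The helper name `drop_and_report` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd


def drop_and_report(df, keep_mask, label):
    """Keep rows where keep_mask is True and print how many rows were removed."""
    kept = df[keep_mask].copy()
    print(f"{label}: removed {len(df) - len(kept)} rows, {len(kept)} remain")
    return kept


# Invented toy data covering the boundary case of exactly 35 kills.
toy = pd.DataFrame({
    "kills": [0, 3, 34, 35, 40],
    "matchType": ["solo", "flaretpp", "squad", "crashfpp", "solo"],
})

# Strict inequality: a record with exactly 35 kills is removed as well.
step1 = drop_and_report(toy, toy["kills"] < 35, "kills < 35")

# Keep only the standard modes (shortened list for the toy example).
step2 = drop_and_report(step1, step1["matchType"].isin(["solo", "duo", "squad"]),
                        "standard match types")
```

The toy run makes the boundary visible: with `kills < 35`, a player with exactly 35 kills is dropped, so `<= 35` would be needed to remove only records above 35.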


In [84]:
# split the data by match type (solo / duo / squad), using numGroups as a proxy
solos = cleaned_data[cleaned_data['numGroups']>50]
duos = cleaned_data[(cleaned_data['numGroups']>25) & (cleaned_data['numGroups']<=50)]
squads = cleaned_data[cleaned_data['numGroups']<=25]
print("There are {} ({:.2f}%) solo games, {} ({:.2f}%) duo games and {} ({:.2f}%) squad games.".format(len(solos), 100*len(solos)/len(train), len(duos), 100*len(duos)/len(train), len(squads), 100*len(squads)/len(train),))

There are 709015 (15.94%) solo games, 3287713 (73.93%) duo games and 440289 (9.90%) squad games.
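
The split above infers the mode from `numGroups`, so a squad match with, say, 27 groups would be counted as a duo under those thresholds. As a sketch of an alternative, the mode could be read from the `matchType` string, assuming each standard match type name contains `solo`, `duo`, or `squad` (which holds for the values printed earlier). The function `match_mode` and the toy rows are hypothetical.

```python
import pandas as pd


def match_mode(match_type: str) -> str:
    """Return 'solo', 'duo', 'squad', or 'other' based on the matchType name."""
    for mode in ("solo", "duo", "squad"):
        if mode in match_type:
            return mode
    return "other"


# Invented rows; numGroups is included only to contrast with the threshold split.
toy = pd.DataFrame({
    "matchType": ["solo-fpp", "duo", "normal-squad-fpp", "squad", "crashfpp"],
    "numGroups": [95, 48, 27, 26, 40],
})

toy["mode"] = toy["matchType"].map(match_mode)
print(toy["mode"].value_counts())
```

In this toy data the two squad rows have more than 25 groups, so the threshold-based split would place them with the duos, while the name-based split keeps them as squads.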


<h1> Part 2 - Data Visualization and Analysis </h1>

<h1> Part 3 - Placement Prediction </h1>