## DNSC 6290 Large Datasets Group Project

### Group 6 PUBG Match Deaths and Statistics

Create the SparkSession:

In [47]:
import findspark
findspark.init()

In [48]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
     .appName("Test SparkSession") \
     .getOrCreate()

#Remember to stop the SparkSession with spark.stop() at the end

In [49]:
spark

In [3]:
#aggr0=sc.textFile("s3://bigdata2020group6/aggregate/agg_match_stats_0.csv")
#aggr1=sc.textFile("s3://bigdata2020group6/aggregate/agg_match_stats_1.csv")
#aggr2=sc.textFile("s3://bigdata2020group6/aggregate/agg_match_stats_2.csv")
#aggr3=sc.textFile("s3://bigdata2020group6/aggregate/agg_match_stats_3.csv")
#aggr4=sc.textFile("s3://bigdata2020group6/aggregate/agg_match_stats_4.csv")

In [4]:
#death0=sc.textFile("s3://bigdata2020group6/deaths/kill_match_stats_final_0.csv")
#death1=sc.textFile("s3://bigdata2020group6/deaths/kill_match_stats_final_1.csv")
#death2=sc.textFile("s3://bigdata2020group6/deaths/kill_match_stats_final_2.csv")
#death3=sc.textFile("s3://bigdata2020group6/deaths/kill_match_stats_final_3.csv")
#death4=sc.textFile("s3://bigdata2020group6/deaths/kill_match_stats_final_4.csv")

#### 1. Load and Prepare Data

We need to stack all five aggregate files into one DataFrame, and likewise stack all five deaths files into one.

In [58]:
#This command reads every matching file in the "aggregate" folder
aggr_all = spark.read.option("header", "true") \
    .option("delimiter", ",") \
    .option("inferSchema", "true") \
    .csv("s3://bigdata2020group6/aggregate/agg_match_stats_*.csv")

In [59]:
aggr_all.count() #67369231 rows; the raw line count of 67369236 presumably includes the 5 header lines

67369231

In [61]:
#This command reads every matching file in the "deaths" folder
death_all = spark.read.option("header", "true") \
    .option("delimiter", ",") \
    .option("inferSchema", "true") \
    .csv("s3://bigdata2020group6/deaths/kill_match_stats_final_*.csv")

In [62]:
death_all.count() #65370475 rows; the raw line count of 65370480 presumably includes the 5 header lines

65370475
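The row counts returned by Spark are each 5 lower than the totals noted in the cell comments. A small arithmetic check, assuming the commented totals are raw line counts of the CSV files and that each of the five files starts with one header line (both are assumptions, not something the notebook confirms):

```python
# Assumption: the totals in the comments are raw line counts (headers included).
N_FILES = 5
raw_line_counts = {"aggregate": 67369236, "deaths": 65370480}
# Row counts returned by DataFrame.count() in the cells above.
spark_row_counts = {"aggregate": 67369231, "deaths": 65370475}

# Difference per dataset; one header line per file would give N_FILES.
header_lines = {
    name: raw_line_counts[name] - spark_row_counts[name]
    for name in raw_line_counts
}
print(header_lines)  # {'aggregate': 5, 'deaths': 5}
```

If the assumption holds, the gap is fully explained by `option("header", "true")` dropping one header line per file, so no data rows were lost in the load.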

In [None]:
#Data cleaning: are there null values in killer_placement?

#### 3. Explore Data Structure

In [60]:
aggr_all.show(10)

+--------------------+---------+--------------------+----------+----------+--------------+-----------+------------------+------------------+----------+------------+-----------+-------------------+-------+--------------+
|                date|game_size|            match_id|match_mode|party_size|player_assists|player_dbno|  player_dist_ride|  player_dist_walk|player_dmg|player_kills|player_name|player_survive_time|team_id|team_placement|
+--------------------+---------+--------------------+----------+----------+--------------+-----------+------------------+------------------+----------+------------+-----------+-------------------+-------+--------------+
|2017-11-26T20:59:...|       37|2U4GBNA0YmnNZYkzj...|       tpp|         2|             0|          1|          2870.724|        1784.84778|       117|           1|   SnuffIes|            1106.32|      4|            18|
|2017-11-26T20:59:...|       37|2U4GBNA0YmnNZYkzj...|       tpp|         2|             0|          1|2938.4072300000003

In [76]:
death_all.show(10)

+------------+----------------+----------------+-----------------+-----------------+-------+--------------------+----+---------------+----------------+-----------------+-----------------+
|   killed_by|     killer_name|killer_placement|killer_position_x|killer_position_y|    map|            match_id|time|    victim_name|victim_placement|victim_position_x|victim_position_y|
+------------+----------------+----------------+-----------------+-----------------+-------+--------------------+----+---------------+----------------+-----------------+-----------------+
|     Grenade| KrazyPortuguese|             5.0|         657725.1|         146275.2|MIRAMAR|2U4GBNA0YmnLSqvEy...| 823|KrazyPortuguese|             5.0|         657725.1|         146275.2|
|      SCAR-L|nide2Bxiaojiejie|            31.0|         93091.37|         722236.4|MIRAMAR|2U4GBNA0YmnLSqvEy...| 194|    X3evolution|            33.0|         92238.68|         723375.1|
|        S686|        Ascholes|            43.0|         366

In [67]:
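aggr_all.printSchema()

root
 |-- date: string (nullable = true)
 |-- game_size: integer (nullable = true)
 |-- match_id: string (nullable = true)
 |-- match_mode: string (nullable = true)
 |-- party_size: integer (nullable = true)
 |-- player_assists: integer (nullable = true)
 |-- player_dbno: integer (nullable = true)
 |-- player_dist_ride: double (nullable = true)
 |-- player_dist_walk: double (nullable = true)
 |-- player_dmg: integer (nullable = true)
 |-- player_kills: integer (nullable = true)
 |-- player_name: string (nullable = true)
 |-- player_survive_time: double (nullable = true)
 |-- team_id: integer (nullable = true)
 |-- team_placement: integer (nullable = true)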

In [69]:
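death_all.printSchema()

root
 |-- killed_by: string (nullable = true)
 |-- killer_name: string (nullable = true)
 |-- killer_placement: double (nullable = true)
 |-- killer_position_x: double (nullable = true)
 |-- killer_position_y: double (nullable = true)
 |-- map: string (nullable = true)
 |-- match_id: string (nullable = true)
 |-- time: integer (nullable = true)
 |-- victim_name: string (nullable = true)
 |-- victim_placement: double (nullable = true)
 |-- victim_position_x: double (nullable = true)
 |-- victim_position_y: double (nullable = true)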

In [70]:
aggr_all.createOrReplaceTempView("aggr")
death_all.createOrReplaceTempView("death")  

In [73]:
#Split the aggregate DataFrame into three DataFrames based on party_size
from pyspark.sql.functions import col
single = aggr_all.filter(col("party_size") == 1)
double = aggr_all.filter(col("party_size") == 2)
quadruple = aggr_all.filter(col("party_size") == 4)

#### 4. Analyze Data

##### (1) SQL:

a. Which locations are "dangerous" for parachuting?
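A possible starting point for this question is to bin early deaths into grid cells and rank the cells by death count. The sketch below is plain Python on made-up rows; the cell size, the time cutoff, the interpretation of `time` as seconds since match start, and the helper name `early_death_cells` are all assumptions for illustration:

```python
from collections import Counter

CELL_SIZE = 100_000   # hypothetical grid cell width, in victim_position units
EARLY_SECONDS = 120   # hypothetical cutoff for "died shortly after landing"

def early_death_cells(deaths, cell_size=CELL_SIZE, early_seconds=EARLY_SECONDS):
    """Count deaths per (x, y) grid cell, keeping only deaths before the cutoff.

    deaths: iterable of (time, victim_position_x, victim_position_y) tuples.
    """
    counts = Counter()
    for time_s, x, y in deaths:
        if time_s <= early_seconds:
            counts[(int(x // cell_size), int(y // cell_size))] += 1
    return counts

# Made-up rows, not taken from the dataset.
toy_deaths = [
    (95, 92000.0, 723000.0),
    (110, 95000.0, 780000.0),
    (100, 366000.0, 420000.0),
    (823, 657000.0, 146000.0),   # too late in the match, ignored
]
hot_cells = early_death_cells(toy_deaths).most_common(1)
print(hot_cells)  # [((0, 7), 2)]
```

On the full data the same idea could be expressed in Spark SQL by grouping on `floor(victim_position_x / cell_size)` and `floor(victim_position_y / cell_size)`, separately for each value of `map`.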

b. Player's placement vs Number of enemies killed

In [75]:
single.createOrReplaceTempView("single")
double.createOrReplaceTempView("double")  
quadruple.createOrReplaceTempView("quadruple")

In [95]:
#Party Size = 1
party1_kill = spark.sql("""
             select team_placement as rank , avg(player_kills) as avg_kills from single
             group by team_placement
             order by team_placement asc
          """).cache()

In [96]:
party1_kill.show(10)

+----+------------------+
|rank|         avg_kills|
+----+------------------+
|   1| 6.970846857480082|
|   2| 3.599979000144374|
|   3|3.0568580976829995|
|   4| 2.708515111695138|
|   5|  2.47160518182416|
|   6| 2.285916429187261|
|   7| 2.144181446561614|
|   8|  2.01534766485873|
|   9|1.9034656409849717|
|  10|1.8151479399756085|
+----+------------------+
only showing top 10 rows



In [97]:
#Party Size = 2
party2_kill = spark.sql("""
             select team_placement as rank , avg(player_kills) as avg_kills from double
             group by team_placement
             order by team_placement asc
          """).cache()

In [98]:
party2_kill.show(10)  #Unexpected rank 0; check the rows with team_placement = 0

+----+------------------+
|rank|         avg_kills|
+----+------------------+
|   0|               1.5|
|   1| 4.416512205489827|
|   2|2.5763982214079224|
|   3|2.2311533877941216|
|   4|1.9527973618868921|
|   5|1.7622350599079941|
|   6|1.6203774044897594|
|   7|1.5054382262074872|
|   8|1.4006394631320518|
|   9|1.3206494950410355|
+----+------------------+
only showing top 10 rows



In [99]:
#Party Size = 4
party4_kill = spark.sql("""
             select team_placement as rank , avg(player_kills) as avg_kills from quadruple
             group by team_placement
             order by team_placement asc
          """).cache()

In [100]:
party4_kill.show(10)  #Rank 0 appears again; check the rows with team_placement = 0

+----+------------------+
|rank|         avg_kills|
+----+------------------+
|   0|1.1176470588235294|
|   1| 2.937261017288994|
|   2|1.7871176501136514|
|   3|1.5751676654505462|
|   4|1.3730355960439127|
|   5|1.2366525816118643|
|   6|1.1366548365049696|
|   7| 1.051095615368824|
|   8|0.9775038751257716|
|   9|0.9118165067424356|
+----+------------------+
only showing top 10 rows



c. Kill Distance vs Kill By
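Kill distance is not a column in the deaths data, but it can be derived from the killer and victim positions. A minimal sketch, assuming straight-line (Euclidean) distance in the map's own coordinate units is the intended measure; the helper name `kill_distance` is illustrative:

```python
import math

def kill_distance(killer_x, killer_y, victim_x, victim_y):
    """Straight-line distance between killer and victim, in map coordinate units."""
    return math.hypot(killer_x - victim_x, killer_y - victim_y)

# The two complete rows visible in the death_all.show(10) output above.
grenade_self_kill = kill_distance(657725.1, 146275.2, 657725.1, 146275.2)
scar_l_kill = kill_distance(93091.37, 722236.4, 92238.68, 723375.1)

print(grenade_self_kill)       # 0.0
print(round(scar_l_kill, 1))   # 1422.6
```

In Spark the same expression could be built as a column with `pyspark.sql.functions.sqrt` and `pow`, then averaged per `killed_by` with a `group by`.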

##### (2) Machine Learning:

#### 5. 

In [46]:
spark.stop()