# BAN5600 Advanced Big Data Computing & Programming Final Project

NFL Big Data Bowl 2021-Evaluation on passing plays of the defensive team

Team member: Yi-Hsuan Tsai, Yuxiu Wang

Instructor: Dr. Hamidreza Ahady Dolatsara

Data: https://www.kaggle.com/c/nfl-big-data-bowl-2021/data

In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('myproj').getOrCreate()

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format='retina'
from matplotlib.pyplot import figure

In [0]:
sp_play = spark.read.csv('/FileStore/tables/plays.csv',inferSchema=True,header=True)
sp_week1 = spark.read.csv('/FileStore/tables/week1.csv',inferSchema=True,header=True)

# 1. Data Preparation

## 1.1 Exclude football data
Since the missing value is from football, which will not be used in our clustering model. Therefore, we filter it out directly.

In [0]:
#sp_week1 = sp_week1.filter(sp_week1.event != 'Football')

In [0]:
sp_week1.show()

In [0]:
sp_week1.printSchema()

# 2. Feature Engineering

### 2.1 "Event" Variable
First, we reorganize the **event** variable. Since the event dataset have lots of "None". Also, the movement of defensive player could change drastically based on different time. For example, if Cornerbacks change his orientation when the ball is thrown, he is more likely to perform man to man coverage since he has to track his targeting player. One the contrary, if a Cornerback is still facing the line of scrimmage when the ball is thrown, he is more likely to be playing zone coverage. Because of this reason, we reorganize the ‘event’ categorical variable into three categories.

In [0]:
df_week1 = sp_week1.select('*').toPandas()
week_array = np.array(df_week1)
previousEvent = 'ball_snap'
for i, instance in enumerate(week_array):
    event = instance[8]
    frameId = instance[13]
    if (previousEvent == 'ball_snap' and event != 'ball_snap') or frameId == 1:
        week_array[i][8] = 'ball_snap'
        previousEvent = 'ball_snap'
    elif (event == 'ball_snap'):
        previousEvent = 'between_snap'
    elif (previousEvent == 'between_snap' and event != 'pass_forward'):
        week_array[i][8] = 'between_snap'
        previousEvent = 'between_snap'
    elif (event == 'pass_forward'):
        week_array[i][8] = 'after_thrown'
        previousEvent = 'after_thrown'
    elif (previousEvent == 'after_thrown' and frameId != 1):
        week_array[i][8] = 'after_thrown'
        previousEvent = 'after_thrown'
        
new_week1 = pd.DataFrame(week_array, columns=df_week1.columns)
df_week1['event'] = new_week1['event']
new_week1 = df_week1

## 2.2 Creating new features

To create good features that can successfully distinguish man and zone coverages, we refer to (Rishav Dutta, Ronald Yurko, Samuel L. Ventura. "Unsupervised Methods for Identifying Pass Coverage Among Defensive Backs with NFL Player Tracking Data. Apr, 2020)". In this paper, th authors generate a rich set of features describing the coverage types. Therefore, we refer to these features and input some of the features that are influential of distinguishing coverage types into our model. In addition to the initial features that generated by the authors, we also creat three new features to see our feature can bring contribution to this clustering method.

**(1) Features that we refer to**

x_var: Variance in the x coordinate

y_var: Variance in the y coordinate

s_var: Variance in the speed

opp_var: Variance in the distance from the nearest opposing player

mate_var: Variance in the distance from the nearest teammate

opp_mean: Mean distance from the nearest opposing player in each frame

mate_mean: Mean distance from the nearest teammate in each frame

opp_dir_var: Variance in the difference in degrees of the direction of motion between the player and the nearest opposing player

opp_dir_mean: Mean difference in degrees of the direction of motion between the player and the nearest opposing player

rat_mean: Mean ratio of the distance to the nearest opposing player and the distance from the nearest opposing player to the nearest teammate

rat_var: Variance of the ratio of the distance to the nearest opposing player and the distance from the nearest opposing player to the nearest teammate

**(2) Features that generated by us**

One important thing in this year’s tracking dataset is that the NFL provide a new variable this year - player’s orientation. We think this new variable could be very helpful for the clustering model to identify pass coverage type. **Therefore, we generate three new feature for our clustering model.**

opp_ori_var: Variance in the orientation between player and the nearest opposing player

opp_ori_mean: Mean difference in orientation between player and his nearest oppsing player

face_LOS: Does the Cornerback is facing on the line of scrimmage; 1 means "Yes", 0 means "No"

Find the nearest oppnent of the cornerback, defining their ID, location (x, y), direction, orientation, and calculating the nearest distance between Cornerbak and his nearest opponents.

In [0]:
new_week2 = new_week1.groupby(['gameId', 'playId', 'frameId'])
player_xy = {}
for name, new in new_week2:
    player_xy[name] = []
    for row in new.iterrows():
        d = [row[1]['nflId'], row[1]['team'], row[1]['x'], row[1]['y'], row[1]['dir'], row[1]['o']]
        player_xy[name].append(d)

In [0]:
features = list(new_week1.columns)
week1Array = np.array(new_week1)
Near_opp_dist = []
for player in week1Array:
    if player[features.index('team')] != 'football':
        opponentLocations = player_xy[(player[features.index('gameId')], player[features.index('playId')], player[features.index('frameId')])]
        distances = []
        directions = []
        orientations =[]
        opponents = []
        xs = []
        ys = []
        for oppLoca in opponentLocations: 
            if player[features.index('team')] != oppLoca[1] and player[features.index('team')] != 'football' and oppLoca[1] != 'football':
                dx = (player[features.index('x')] - oppLoca[2])**2
                dy = (player[features.index('y')] - oppLoca[3])**2
                dist = np.sqrt(dx+dy)
                distances.append(dist)
                directions.append(oppLoca[4])
                orientations.append(oppLoca[5]) ##
                opponents.append(oppLoca[0])
                xs.append(oppLoca[2])
                ys.append(oppLoca[3])
        minDist = min(distances)
        closestOpponent = opponents[np.argmin(distances)]
        opponentDir = directions[np.argmin(distances)]
        opponentOri = orientations[np.argmin(distances)]
        opponentX = xs[np.argmin(distances)]
        opponentY = ys[np.argmin(distances)]
        summary = [player[features.index('gameId')], player[features.index('playId')], player[features.index('frameId')], player[features.index('nflId')], minDist, closestOpponent, opponentDir, opponentOri, opponentX, opponentY]
        Near_opp_dist.append(summary)
        
Near_opp_dist = pd.DataFrame(Near_opp_dist, columns=['gameId', 'playId', 'frameId', 'nflId', 'oppMinDist', 'closestOpp(nflId)', 'oppDir', 'oppOri', 'oppX', 'oppY'])
new_week1 = pd.merge(new_week1, Near_opp_dist, how='left', on=['gameId', 'frameId', 'playId', 'nflId'])

Find the nearest teammate of the cornerback, defining their ID, location (x, y), direction, orientation, and calculating the nearest distance between Cornerbak and his nearest teammate.

In [0]:
features = list(new_week1.columns)
week1Array = np.array(new_week1)
Near_mate_dist = []
for player in week1Array:
    if player[features.index('team')] != 'football':
        mateLocations = player_xy[(player[features.index('gameId')], player[features.index('playId')], player[features.index('frameId')])]
        distances = []
        mates = []
        xs = []
        ys = []
        for mateLoca in mateLocations: 
            if player[features.index('team')] == mateLoca[1] and player[features.index('nflId')] != mateLoca[0] and player[features.index('team')] != 'football' and mateLoca[1] != 'football':
                dx = (player[features.index('x')] - mateLoca[2])**2
                dy = (player[features.index('y')] - mateLoca[3])**2
                dist = np.sqrt(dx+dy)
                distances.append(dist)
                mates.append(mateLoca[0])
                xs.append(oppLoca[2])
                ys.append(oppLoca[3])
        minDist = min(distances)
        closestMate = mates[np.argmin(distances)]
        mateX = xs[np.argmin(distances)]
        mateY = ys[np.argmin(distances)]
        summary = [player[features.index('gameId')], player[features.index('playId')], player[features.index('frameId')], player[features.index('nflId')], minDist, closestMate, mateX, mateY]
        Near_mate_dist.append(summary)

Near_mate_dist = pd.DataFrame(Near_mate_dist, columns=['gameId', 'playId', 'frameId', 'nflId', 'mateMinDist', 'closestMate(nflId)', 'mateX', 'mateY'])
new_week1 = pd.merge(new_week1, Near_mate_dist, how='left', on=['gameId', 'frameId', 'playId', 'nflId'])


## 2.2.1 Feature that we refer to

## (1) x_var, y_var, s_var
x_var: Variance in the x coordinate

y_var: Variance in the y coordinate

s_var: Variance in the speed

In [0]:
#x_var feature baesd on the three time period
x_var = new_week1.groupby(['gameId', 'playId', 'event', 'nflId'])['x'].agg(['var']).reset_index().rename(columns={"var": "x_var"})
#y_var feature baesd on the three time period
y_var = new_week1.groupby(['gameId', 'playId', 'event', 'nflId'])['y'].agg(['var']).reset_index().rename(columns={"var": "y_var"})
#s_var feature
s_var = new_week1.groupby(['gameId', 'playId', 'event', 'nflId'])['s'].agg(['var']).reset_index().rename(columns={"var": "s_var"})

## (2) opp_var & mate_var
opp_var: Variance in the distance from the nearest opposing player
mate_var: Variance in the distance from the nearest teammate

In [0]:
opp_var = new_week1.groupby(['gameId', 'playId', 'event', 'nflId'])['oppMinDist'].agg(['var']).reset_index().rename(columns={"var": "opp_var"})
mate_var = new_week1.groupby(['gameId', 'playId', 'event', 'nflId'])['mateMinDist'].agg(['var']).reset_index().rename(columns={"var": "mate_var"})

## (3) opp_mean & mate_mean
opp_mean: Mean distance from the nearest opposing player

mate_mean: Mean distance from the nearest teammate

In [0]:
opp_mean = new_week1.groupby(['gameId', 'playId', 'event', 'nflId'])['oppMinDist'].agg(['mean']).reset_index().rename(columns={"mean": "opp_mean"})
mate_mean = new_week1.groupby(['gameId', 'playId', 'event', 'nflId'])['mateMinDist'].agg(['mean']).reset_index().rename(columns={"mean": "mate_mean"})

## (4) rat_mean ＆ rat_var
rat_mean: Mean ratio of the distance to the nearest opposing player and the distance from the nearest opposing player to the nearest teammate

rat_var: Variance of the ratio of the distance to the nearest opposing player and the distance from the nearest opposing player to the nearest teammate

In [0]:
ratio = new_week1['oppMinDist'] / np.sqrt((new_week1['oppX'] - new_week1['mateX'])**2 + (new_week1['oppY'] - new_week1['mateY'])**2)
new_week1['opp_mate_dist_ratio'] = ratio
rat_mean = new_week1.groupby(['gameId', 'playId', 'event', 'nflId'])['opp_mate_dist_ratio'].agg(['mean']).reset_index().rename(columns={"mean": "rat_mean"})
rat_var = new_week1.groupby(['gameId', 'playId', 'event', 'nflId'])['opp_mate_dist_ratio'].agg(['var']).reset_index().rename(columns={"var": "rat_var"})

## (5) opp_dir_var & opp_dir_mean
opp_dir_var: Variance difference in degrees of the direction of motion between the player and the nearest opposing player

opp_dir_mean: Mean difference in degrees of the direction of motion between the player and the nearest opposing player

In [0]:
diff_dir = np.absolute(new_week1['dir'] - new_week1['oppDir'])
new_week1['diff_dir'] = diff_dir
opp_dir_var = new_week1.groupby(['gameId', 'playId', 'event', 'nflId'])['diff_dir'].agg(['var']).reset_index().rename(columns={"var": "opp_dir_var"})
opp_dir_mean = new_week1.groupby(['gameId', 'playId', 'event', 'nflId'])['diff_dir'].agg(['mean']).reset_index().rename(columns={"mean": "opp_dir_mean"})

## 2.2.2 Features that generated by us

## (6) opp_ori_var & opp_ori_mean
opp_ori_var: Variance difference in the orientation of motion between the player and the nearest opposing player

opp_ori_mean: Mean difference in the orientation of motion between the player and the nearest opposing player

In [0]:
diff_oir = np.absolute(new_week1['o'] - new_week1['oppOri'])
new_week1['diff_oir'] = diff_oir
opp_ori_var = new_week1.groupby(['gameId', 'playId', 'event', 'nflId'])['diff_oir'].agg(['var']).reset_index().rename(columns={"var": "opp_ori_var"})
opp_ori_mean = new_week1.groupby(['gameId', 'playId', 'event', 'nflId'])['diff_oir'].agg(['mean']).reset_index().rename(columns={"mean": "opp_ori_mean"})

## (7) face_LOS
face_Los: Does the Cornerback face on the line of scrimmage; 1 means "Yes", 0 means "No"

In [0]:
new_week1["face_LOS"] = np.nan
lst = [new_week1]

for col in lst:
    col.loc[col["o"] < 70, "face_LOS"] = 0
    col.loc[(col["o"] >= 70.0) & (col["o"] <= 110), "face_LOS"] = 1
    col.loc[(col["o"] > 110) & (col["o"] < 250.0), "face_LOS"] = 0
    col.loc[(col["o"] >= 250.0) & (col["o"] <= 290.0), "face_LOS"] = 1
    col.loc[col["o"] >290, "face_LOS"] = 0

## 2.2.3 Combine all new feature into the original data

In [0]:
features = [x_var, y_var, s_var, opp_var, opp_mean, mate_var, mate_mean, opp_dir_var, opp_dir_mean, rat_mean, rat_var, opp_ori_var, opp_ori_mean]
for feature in features:
    new_week1 = pd.merge(new_week1, feature, how='left', on=['gameId', 'event', 'playId', 'nflId'])

# 3. Input features into clustering model
To see if our new features are helpful for the clustering model. We first only input the initial 11 feature into GaussianMixture model. After that, we then combine the initial 11 feature with the 3 features that we generated into GaussianMixture model. Finally, we will compare their performance by silhouettes score.

## 3.1 Set the train dataset with the initial 11 features.

In [0]:
#Choose the data only with cornerbacks
features=['x_var','y_var', 's_var', 'opp_var', 'opp_mean', 'mate_var', 'mate_mean', 'opp_dir_var', 'opp_dir_mean', 'rat_mean', 'rat_var']
only_CB = new_week1.loc[new_week1['position'] == 'CB']
only_CB = only_CB[features].dropna()
only_CB1 = only_CB.drop_duplicates()

sp_only_CB = spark.createDataFrame(only_CB)
sp_only_CB1 = spark.createDataFrame(only_CB1)

In [0]:
#Prepare for machine learning
from pyspark.ml.feature import (VectorAssembler,VectorIndexer)
seed_data = VectorAssembler(inputCols = sp_only_CB1.columns, outputCol='features').transform(sp_only_CB1)

#Scale the data
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)

seed_data_scaled = scaler.fit(seed_data).transform(seed_data) #New dataset

In [0]:
#Use GaussianMixture model
from pyspark.ml.clustering import GaussianMixture

gmm = GaussianMixture(featuresCol='scaledFeatures').setK(2).setSeed(101)
model = gmm.fit(seed_data_scaled)

#The result
prediction = model.transform(seed_data_scaled)
#prediction.show()

#Cluster evaluation
from pyspark.ml.evaluation import ClusteringEvaluator
silhouette = ClusteringEvaluator().evaluate(prediction)
print("Silhouette value is = " + str(silhouette))

In [0]:
silhouette_scores = [] 

for k in range(2, 8):
  gmm2 = GaussianMixture(featuresCol='scaledFeatures').setK(k).setSeed(101)
  models = gmm2.fit(seed_data_scaled)
  predictions = models.transform(seed_data_scaled)
  silhouettes = ClusteringEvaluator().evaluate(predictions)
  print("Silhouette value is = " + str(silhouettes))
# Divide data into 2 cluster is the best choice

## The clustering model with the 11 features is only 0.0986

## 3.2 Set the train dataset with the initial 11 features and the 3 features that we generated.

In [0]:
features=['x_var','y_var', 's_var', 'opp_var', 'opp_mean', 'mate_var', 'mate_mean', 'opp_dir_var', 'opp_dir_mean', 'rat_mean', 'rat_var', 'opp_ori_var', 'opp_ori_mean']
only_CB = new_week1.loc[new_week1['position'] == 'CB']
only_CB = only_CB[features].dropna()
only_CB1 = only_CB.drop_duplicates()

sp_only_CB = spark.createDataFrame(only_CB)
sp_only_CB1 = spark.createDataFrame(only_CB1)

In [0]:
#Prepare for machine learning
from pyspark.ml.feature import (VectorAssembler,VectorIndexer)
seed_data = VectorAssembler(inputCols = sp_only_CB1.columns, outputCol='features').transform(sp_only_CB1)

In [0]:
#Scale the data
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)

In [0]:
seed_data_scaled = scaler.fit(seed_data).transform(seed_data) #New dataset

In [0]:
#Use GaussianMixture model
from pyspark.ml.clustering import GaussianMixture

gmm = GaussianMixture(featuresCol='scaledFeatures').setK(2).setSeed(101)
model = gmm.fit(seed_data_scaled)

#The result
prediction = model.transform(seed_data_scaled)
#prediction.show()

#Cluster evaluation
from pyspark.ml.evaluation import ClusteringEvaluator
silhouette = ClusteringEvaluator().evaluate(prediction)
print("Silhouette value is = " + str(silhouette))

In [0]:
silhouette_scores = [] 

for k in range(2, 8):
  gmm2 = GaussianMixture(featuresCol='scaledFeatures').setK(k).setSeed(101)
  models = gmm2.fit(seed_data_scaled)
  predictions = models.transform(seed_data_scaled)
  silhouettes = ClusteringEvaluator().evaluate(predictions)
  print("Silhouette value is = " + str(silhouettes))
# Divide data into 2 cluster is the best choice

## We can see that the Silhouette score increase form 0.0986 to 0.5759, which means the new features that we generated improve the model ability of labeling man coverage and zone coverage.

## 3.4 Fit the scaled features into K-means Cluster

In [0]:
from pyspark.ml.clustering import KMeans
for k in range(2, 5):
  kmeans = KMeans(featuresCol='scaledFeatures',k=k).setSeed(101)
  k_model = kmeans.fit(seed_data_scaled)
  k_predictions = k_model.transform(seed_data_scaled)
  k_silhouette = ClusteringEvaluator().evaluate(k_predictions)
  print("Silhouette value is = " + str(k_silhouette))

## 3.5 Fit the scaled features into BisectingKMeans Cluster

In [0]:
from pyspark.ml.clustering import BisectingKMeans
for k in range(2, 5):
  bkm = BisectingKMeans(featuresCol='scaledFeatures').setK(k).setSeed(101)
  bkm_model = bkm.fit(seed_data_scaled)
  bkm_predictions = bkm_model.transform(seed_data_scaled)
  bkm_silhouette = ClusteringEvaluator().evaluate(bkm_predictions)
  print("Silhouette with squared euclidean distance = " + str(bkm_silhouette))

## Gaussian mixture model with k=2 has the best performance

# 4. Merge the predicting result into our orginal data

In [0]:
#Merge the reasult data into the original data
sp_new_week1 = spark.createDataFrame(new_week1)
new_df = sp_new_week1.join(prediction, on=['x_var','y_var', 's_var', 'opp_var', 'opp_mean', 'mate_var', 'mate_mean', 'opp_dir_var', 'opp_dir_mean', 'rat_mean', 'rat_var', 'opp_ori_var', 'opp_ori_mean'], how='left_outer')

In [0]:
#Choose only CB data
new_df2 = new_df.filter(new_df.position=="CB")
#Using SQL
new_df2.createOrReplaceTempView('new_df2')

In [0]:
results = spark.sql("SELECT gameId, playId, position, jerseyNumber, frameId, probability, prediction FROM new_df2 WHERE playId==75" )
results.show()