# Evaluating NFL Win-Loss Potential Using KMeans Clustering Analysis

The purpose of this analysis is to use clustering to identify playing-style archetypes within the NFL and to compare the success of these archetypes, measured by the number of wins in a season, over the span of a decade. The results may help identify team statistical attributes associated with success, which could be useful for sports betting or team building.

#### Imported Libraries

In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import KFold, train_test_split,GridSearchCV
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import completeness_score, homogeneity_score

#### Data

This data contains offensive and defensive seasonal statistics from every NFL team from the years 2010-2019 and is sourced from pro-football-reference.com. The dataset contains 320 observations and 53 features.

In [20]:
#Importing the dataset
nflData= pd.read_csv('data\\NFL_Combined.csv')
nflData.rename(columns= {'Unnamed: 0': 'Year'}, inplace= True)
nflData

Unnamed: 0,Year,Rk,Tm,ID,PF_Off,Yds_Off,Ply_Off,Y/P_Off,TO_Off,FL_Off,...,YdsRush_Def,TDrush_Def,Y/Arush_Def,1stDrush_Def,Pen_Def,PenYds_Def,1stPy_Def,Sc%_Def,TO%_Def,W
0,2010,27,Arizona Cardinals,1,289,4309,931,4.6,35,16,...,2323,19,4.4,123,108,894,38,39.7,13.2,5
1,2011,24,Arizona Cardinals,2,312,5192,993,5.2,32,9,...,1986,15,4.2,118,122,950,41,32.8,9.3,8
2,2012,31,Arizona Cardinals,3,250,4209,1018,4.1,34,13,...,2192,12,4.3,107,100,810,21,27.0,14.2,5
3,2013,16,Arizona Cardinals,4,379,5542,1037,5.3,31,9,...,1351,5,3.7,68,111,960,33,28.9,13.7,10
4,2014,24,Arizona Cardinals,5,310,5116,993,5.2,17,5,...,1739,9,4.4,77,130,1192,28,29.8,12.8,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
315,2015,10,Washington Redskins,316,388,5661,1011,5.6,22,11,...,1962,10,4.8,103,112,955,24,35.4,14.9,9
316,2016,12,Washington Redskins,317,396,6454,1009,6.4,21,9,...,1916,19,4.5,109,100,1023,34,38.5,11.8,8
317,2017,16,Washington Redskins,318,342,5199,982,5.3,27,14,...,2146,13,4.5,105,100,887,31,35.9,11.8,7
318,2018,29,Washington Redskins,319,281,4795,967,5.0,19,4,...,1860,12,4.5,109,102,791,24,35.1,14.6,7


In [21]:
nflData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 320 entries, 0 to 319
Data columns (total 53 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Year          320 non-null    int64  
 1   Rk            320 non-null    int64  
 2   Tm            320 non-null    object 
 3   ID            320 non-null    int64  
 4   PF_Off        320 non-null    int64  
 5   Yds_Off       320 non-null    int64  
 6   Ply_Off       320 non-null    int64  
 7   Y/P_Off       320 non-null    float64
 8   TO_Off        320 non-null    int64  
 9   FL_Off        320 non-null    int64  
 10  1stDTot_Off   320 non-null    int64  
 11  Cmp_Off       320 non-null    int64  
 12  AttPass_Off   320 non-null    int64  
 13  YdsPass_Off   320 non-null    int64  
 14  TDPass_Off    320 non-null    int64  
 15  Int_Off       320 non-null    int64  
 16  NY/Apass_Off  320 non-null    float64
 17  1stDPass_Off  320 non-null    int64  
 18  AttRush_Off   320 non-null    

In [22]:
nflData.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Year,320.0,2014.5,2.87678,2010.0,2012.0,2014.5,2017.0,2019.0
Rk,320.0,16.5,9.247553,1.0,8.75,16.5,24.25,32.0
ID,320.0,160.5,92.520268,1.0,80.75,160.5,240.25,320.0
PF_Off,320.0,362.290625,70.133012,193.0,311.5,361.0,406.0,606.0
Yds_Off,320.0,5542.271875,591.598582,3865.0,5108.0,5526.5,5956.25,7474.0
Ply_Off,320.0,1021.05,47.915521,878.0,993.0,1017.5,1052.25,1191.0
Y/P_Off,320.0,5.4225,0.481045,4.1,5.1,5.4,5.8,6.8
TO_Off,320.0,23.74375,6.537327,8.0,19.0,23.0,28.0,44.0
FL_Off,320.0,9.534375,3.299863,2.0,7.0,9.0,11.0,22.0
1stDTot_Off,320.0,317.546875,36.029825,225.0,292.0,315.5,341.0,444.0


In [23]:
#Discretizing the target column for better interpretability
#Four Categories: Losing-Low(0-4), Losing(5-8), Winning(9-12), Winning-High(13-16)
nflData['W_binned']= pd.cut(nflData['W'], bins=[0,4,8,12,16],labels=['Losing-Low','Losing','Winning','Winning-High'],include_lowest=True)
nflData

Unnamed: 0,Year,Rk,Tm,ID,PF_Off,Yds_Off,Ply_Off,Y/P_Off,TO_Off,FL_Off,...,TDrush_Def,Y/Arush_Def,1stDrush_Def,Pen_Def,PenYds_Def,1stPy_Def,Sc%_Def,TO%_Def,W,W_binned
0,2010,27,Arizona Cardinals,1,289,4309,931,4.6,35,16,...,19,4.4,123,108,894,38,39.7,13.2,5,Losing
1,2011,24,Arizona Cardinals,2,312,5192,993,5.2,32,9,...,15,4.2,118,122,950,41,32.8,9.3,8,Losing
2,2012,31,Arizona Cardinals,3,250,4209,1018,4.1,34,13,...,12,4.3,107,100,810,21,27.0,14.2,5,Losing
3,2013,16,Arizona Cardinals,4,379,5542,1037,5.3,31,9,...,5,3.7,68,111,960,33,28.9,13.7,10,Winning
4,2014,24,Arizona Cardinals,5,310,5116,993,5.2,17,5,...,9,4.4,77,130,1192,28,29.8,12.8,11,Winning
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
315,2015,10,Washington Redskins,316,388,5661,1011,5.6,22,11,...,10,4.8,103,112,955,24,35.4,14.9,9,Winning
316,2016,12,Washington Redskins,317,396,6454,1009,6.4,21,9,...,19,4.5,109,100,1023,34,38.5,11.8,8,Losing
317,2017,16,Washington Redskins,318,342,5199,982,5.3,27,14,...,13,4.5,105,100,887,31,35.9,11.8,7,Losing
318,2018,29,Washington Redskins,319,281,4795,967,5.0,19,4,...,12,4.5,109,102,791,24,35.1,14.6,7,Losing
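
The bin edges used above are easy to misread, so the following sketch applies the same `pd.cut` call to a handful of hypothetical win totals (`sample_wins` is illustrative and not taken from the dataset). Under these assumptions it shows that each interval is closed on the right, and that `include_lowest=True` is what keeps a winless season inside the Losing-Low bin.

```python
import pandas as pd

# Hypothetical win totals chosen to sit on the bin edges (not rows from the dataset)
sample_wins = pd.Series([0, 4, 5, 8, 9, 12, 13, 16])
bins = [0, 4, 8, 12, 16]
labels = ['Losing-Low', 'Losing', 'Winning', 'Winning-High']

# Same call as in the notebook: intervals are [0,4], (4,8], (8,12], (12,16]
binned = pd.cut(sample_wins, bins=bins, labels=labels, include_lowest=True)

# Without include_lowest the first interval is (0,4], so 0 wins falls outside every bin
binned_open = pd.cut(sample_wins, bins=bins, labels=labels)

print(pd.DataFrame({'W': sample_wins,
                    'W_binned': binned,
                    'without_include_lowest': binned_open}))
```

This matters for this dataset because the per-cluster summaries report a minimum of 0 wins in one cluster.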


#### Preparing Data for KMeans Clustering

In [24]:
#Separating out numerical statistical values
#.copy() so the later in-place drop does not act on a slice of nflData
statsMatrix= nflData.loc[:,'PF_Off':'TO%_Def'].copy()
statsMatrix

Unnamed: 0,PF_Off,Yds_Off,Ply_Off,Y/P_Off,TO_Off,FL_Off,1stDTot_Off,Cmp_Off,AttPass_Off,YdsPass_Off,...,AttRush_Def,YdsRush_Def,TDrush_Def,Y/Arush_Def,1stDrush_Def,Pen_Def,PenYds_Def,1stPy_Def,Sc%_Def,TO%_Def
0,289,4309,931,4.6,35,16,241,285,561,2921,...,526,2323,19,4.4,123,108,894,38,39.7,13.2
1,312,5192,993,5.2,32,9,286,307,550,3567,...,475,1986,15,4.2,118,122,950,41,32.8,9.3
2,250,4209,1018,4.1,34,13,246,337,608,3005,...,506,2192,12,4.3,107,100,810,21,27.0,14.2
3,379,5542,1037,5.3,31,9,329,363,574,4002,...,370,1351,5,3.7,68,111,960,33,28.9,13.7
4,310,5116,993,5.2,17,5,302,320,568,3808,...,396,1739,9,4.4,77,130,1192,28,29.8,12.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
315,388,5661,1011,5.6,22,11,317,386,555,4095,...,406,1962,10,4.8,103,112,955,24,35.4,14.9
316,396,6454,1009,6.4,21,9,345,407,607,4758,...,423,1916,19,4.5,109,100,1023,34,38.5,11.8
317,342,5199,982,5.3,27,14,278,347,540,3751,...,472,2146,13,4.5,105,100,887,31,35.9,11.8
318,281,4795,967,5.0,19,4,280,311,509,3021,...,413,1860,12,4.5,109,102,791,24,35.1,14.6


Several features, such as total yards on offense (Yds_Off), are redundant or are used to compute other statistics. These features are removed so that the same information is not counted more than once when measuring similarity between teams. Where features were redundant, the more granular features were preferred because they are more descriptive. For example, total passing yards and total rushing yards are preferable to total yards.

In [25]:
#Removing some redundant features
statsMatrix.drop(columns=['Yds_Off','Ply_Off','Y/P_Off','TO_Off','1stDTot_Off','NY/Apass_Off','Y/Arush_Off','Yds_Def','Ply_Def','Y/P_Def','TO_Def','1stDTot_Def','NY/Apass_Def','Y/Arush_Def'],inplace=True)
statsMatrix

Unnamed: 0,PF_Off,FL_Off,Cmp_Off,AttPass_Off,YdsPass_Off,TDPass_Off,Int_Off,1stDPass_Off,AttRush_Off,YdsRush_Off,...,1stDPass_Def,AttRush_Def,YdsRush_Def,TDrush_Def,1stDrush_Def,Pen_Def,PenYds_Def,1stPy_Def,Sc%_Def,TO%_Def
0,289,16,285,561,2921,10,19,154,320,1388,...,178,526,2323,19,123,108,894,38,39.7,13.2
1,312,9,307,550,3567,21,23,175,389,1625,...,174,475,1986,15,118,122,950,41,32.8,9.3
2,250,13,337,608,3005,11,21,168,352,1204,...,160,506,2192,12,107,100,810,21,27.0,14.2
3,379,9,363,574,4002,24,22,205,422,1540,...,208,370,1351,5,68,111,960,33,28.9,13.7
4,310,5,320,568,3808,21,12,191,397,1308,...,195,396,1739,9,77,130,1192,28,29.8,12.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
315,388,11,386,555,4095,30,11,208,429,1566,...,202,406,1962,10,103,112,955,24,35.4,14.9
316,396,9,407,607,4758,25,12,226,379,1696,...,225,423,1916,19,109,100,1023,34,38.5,11.8
317,342,14,347,540,3751,27,13,191,401,1448,...,171,472,2146,13,105,100,887,31,35.9,11.8
318,281,4,311,509,3021,16,15,156,414,1774,...,197,413,1860,12,109,102,791,24,35.1,14.6
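
As a quick illustration of the redundancy described above, the sketch below checks two of the dropped features against the features they are built from. The values are the first five Arizona Cardinals seasons copied from the tables in this notebook; the frame name `check` is introduced here for illustration. Five rows only illustrate the relationship and do not verify it for the full dataset.

```python
import pandas as pd

# First five Arizona Cardinals seasons, copied from the tables shown above
check = pd.DataFrame({
    'Yds_Off':     [4309, 5192, 4209, 5542, 5116],
    'Ply_Off':     [931, 993, 1018, 1037, 993],
    'Y/P_Off':     [4.6, 5.2, 4.1, 5.3, 5.2],
    'YdsPass_Off': [2921, 3567, 3005, 4002, 3808],
    'YdsRush_Off': [1388, 1625, 1204, 1540, 1308],
})

# Total yards equals passing yards plus rushing yards in every sampled row
total_is_sum = (check['YdsPass_Off'] + check['YdsRush_Off'] == check['Yds_Off']).all()

# Yards per play equals total yards divided by plays, up to one-decimal rounding
ypp_gap = (check['Yds_Off'] / check['Ply_Off'] - check['Y/P_Off']).abs().max()

print(total_is_sum, round(ypp_gap, 3))
```

Keeping both a total and its components would effectively give that information extra weight in the distance calculation used by KMeans.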


The scales of the remaining features vary significantly. As a result, features with larger numeric ranges would have more influence on the similarity between teams even though they are no more important than other statistics. To mitigate this issue, min-max normalization is applied to the features of the statistical matrix so that each feature lies between 0 and 1.

In [26]:
#Rescaling the feature values to [0, 1] using min-max normalization
MnMx= MinMaxScaler()
statsMatrixSc= MnMx.fit_transform(statsMatrix)
statsMatrixSc

array([[0.23244552, 0.7       , 0.26666667, ..., 0.58333333, 0.75464684,
        0.54385965],
       [0.28813559, 0.35      , 0.35294118, ..., 0.64583333, 0.49814126,
        0.31578947],
       [0.13801453, 0.55      , 0.47058824, ..., 0.22916667, 0.28252788,
        0.60233918],
       ...,
       [0.36077482, 0.6       , 0.50980392, ..., 0.4375    , 0.6133829 ,
        0.4619883 ],
       [0.21307506, 0.1       , 0.36862745, ..., 0.29166667, 0.58364312,
        0.62573099],
       [0.17675545, 0.3       , 0.31764706, ..., 0.5       , 0.83643123,
        0.48538012]])
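
To make the scaling concrete, the sketch below applies the min-max formula, (x - min) / (max - min), by hand to PF_Off for the first three rows. The minimum (193) and maximum (606) are taken from the `describe()` output earlier in the notebook; the helper `min_max` is a hypothetical name used only for this illustration.

```python
# PF_Off minimum and maximum, taken from the describe() output above
pf_min, pf_max = 193, 606

# PF_Off for the first three rows of the dataset
pf_values = [289, 312, 250]

def min_max(x, lo, hi):
    """Rescale x to [0, 1] given the column minimum and maximum."""
    return (x - lo) / (hi - lo)

scaled = [min_max(x, pf_min, pf_max) for x in pf_values]
print([round(s, 8) for s in scaled])
```

The results can be compared with the first column of the first three rows in the scaled array printed above.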

The initial pass of KMeans uses a k value of 4. This value was chosen because the number-of-wins feature was binned into 4 classes. The purity of the resulting clusters with respect to the win classes is evaluated using homogeneity and completeness scores, and the cluster means are examined to identify the features most characteristic of the teams within each cluster.

In [27]:
#Reporting the 4 cluster centers
km4= KMeans(n_clusters=4, random_state=9)
km4.fit(statsMatrixSc)
km4.cluster_centers_

array([[0.26815981, 0.46      , 0.46960784, 0.46858209, 0.41155436,
        0.26968085, 0.55133929, 0.34429825, 0.26263587, 0.23595794,
        0.27403846, 0.22443182, 0.47835648, 0.52319724, 0.41611842,
        0.30348665, 0.61982249, 0.72016729, 0.27440476, 0.51384943,
        0.36828704, 0.51859742, 0.48455882, 0.33362069, 0.48137019,
        0.57186192, 0.60025585, 0.45982143, 0.5875    , 0.4585    ,
        0.44487395, 0.42838542, 0.71570632, 0.34758772],
       [0.54713479, 0.34055556, 0.68204793, 0.59512438, 0.67326187,
        0.49692671, 0.41706349, 0.60220923, 0.32805958, 0.2621028 ,
        0.40811966, 0.28846801, 0.45226337, 0.49248262, 0.53654971,
        0.61519947, 0.44135437, 0.56869062, 0.37407407, 0.65511364,
        0.56563786, 0.62448274, 0.46666667, 0.42681992, 0.59893162,
        0.33417015, 0.45618583, 0.36230159, 0.46217494, 0.51896296,
        0.51232493, 0.41041667, 0.61664601, 0.50565302],
       [0.51346852, 0.33046875, 0.40012255, 0.29412313, 0.40863323,
  

In [28]:
#Viewing the distribution of teams within the 4 clusters
cluster_assignments= km4.predict(statsMatrixSc)
pd.Series(cluster_assignments).value_counts()

1    90
3    86
0    80
2    64
dtype: int64

#### Evaluating Cluster Purity

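The resulting clusters are evaluated for purity using homogeneity and completeness scores relative to the 4 win classes defined above. Both scores range from 0 to 1. A homogeneity near 1.0 means each cluster contains teams from mostly a single win class, while a completeness near 1.0 means teams from the same win class mostly fall in the same cluster. Scores close to 1.0 would therefore indicate that the clusters are closely associated with the number of wins in a season.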

In [29]:
homogeneity_score(nflData['W_binned'],cluster_assignments)

0.2540751887696435

In [30]:
completeness_score(nflData['W_binned'],cluster_assignments)

0.22340135622607654
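
To show what these two numbers measure, the sketch below recomputes them from class-by-cluster counts using the usual entropy-based definitions: homogeneity is 1 - H(class | cluster) / H(class), and completeness is 1 - H(cluster | class) / H(cluster). The counts are copied from the Cluster by W_binned crosstab reported in the Results section of this notebook; the helper `entropy` and the variable names are introduced here for illustration and are not part of the original analysis.

```python
from math import log

# Counts copied from the Cluster x W_binned crosstab in the Results section.
# Rows are clusters 0-3; columns are Losing-Low, Losing, Winning, Winning-High.
counts = [
    [41, 37, 2, 0],
    [3, 40, 37, 10],
    [0, 6, 46, 12],
    [4, 49, 31, 2],
]

def entropy(totals):
    """Shannon entropy (in nats) of a sequence of counts."""
    n = sum(totals)
    return -sum(c / n * log(c / n) for c in totals if c > 0)

n_total = sum(map(sum, counts))
cluster_totals = [sum(row) for row in counts]
class_totals = [sum(col) for col in zip(*counts)]

# Conditional entropies: class given cluster, and cluster given class
h_class_given_cluster = sum(sum(row) / n_total * entropy(row) for row in counts)
h_cluster_given_class = sum(sum(col) / n_total * entropy(col) for col in zip(*counts))

homogeneity = 1 - h_class_given_cluster / entropy(class_totals)
completeness = 1 - h_cluster_given_class / entropy(cluster_totals)

print(homogeneity, completeness)
```

Both values sit well below 1.0, which suggests that the clusters only partly line up with the four win classes.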

#### Evaluating Cluster Means

In [31]:
k4centers= pd.DataFrame(km4.cluster_centers_,columns=statsMatrix.columns)
k4centers

Unnamed: 0,PF_Off,FL_Off,Cmp_Off,AttPass_Off,YdsPass_Off,TDPass_Off,Int_Off,1stDPass_Off,AttRush_Off,YdsRush_Off,...,1stDPass_Def,AttRush_Def,YdsRush_Def,TDrush_Def,1stDrush_Def,Pen_Def,PenYds_Def,1stPy_Def,Sc%_Def,TO%_Def
0,0.26816,0.46,0.469608,0.468582,0.411554,0.269681,0.551339,0.344298,0.262636,0.235958,...,0.48137,0.571862,0.600256,0.459821,0.5875,0.4585,0.444874,0.428385,0.715706,0.347588
1,0.547135,0.340556,0.682048,0.595124,0.673262,0.496927,0.417063,0.602209,0.32806,0.262103,...,0.598932,0.33417,0.456186,0.362302,0.462175,0.518963,0.512325,0.410417,0.616646,0.505653
2,0.513469,0.330469,0.400123,0.294123,0.408633,0.369016,0.292411,0.33708,0.595675,0.471145,...,0.472656,0.230779,0.331177,0.227121,0.306516,0.422708,0.404779,0.407227,0.415428,0.611385
3,0.321077,0.371512,0.437255,0.405415,0.387222,0.269916,0.455565,0.329185,0.409842,0.279667,...,0.338886,0.423616,0.455216,0.313123,0.422934,0.475504,0.478249,0.398983,0.450463,0.478648


For each cluster we will identify the features whose average value within that cluster is the highest among all clusters. These features will be used to characterize each cluster.

In [32]:
#Identifying the cluster with the highest average value for each feature
featmax= k4centers.idxmax()
featmax

PF_Off          1
FL_Off          0
Cmp_Off         1
AttPass_Off     1
YdsPass_Off     1
TDPass_Off      1
Int_Off         0
1stDPass_Off    1
AttRush_Off     2
YdsRush_Off     2
TDrush_Off      2
1stDRush_Off    2
Pen_Off         0
PenYds_Off      0
1stPy_Off       1
Sc%_Off         1
TO%_Off         0
PF_Def          0
FL_Def          2
Cmp_Def         1
AttPass_Def     2
YdsPass_Def     1
TDPass_Def      0
Int_Def         2
1stDPass_Def    1
AttRush_Def     0
YdsRush_Def     0
TDrush_Def      0
1stDrush_Def    0
Pen_Def         1
PenYds_Def      1
1stPy_Def       0
Sc%_Def         0
TO%_Def         2
dtype: int64

In [33]:
clust= {}

for f in featmax.index:
    if featmax[f] in clust:
        clust[featmax[f]].append(f)
    else:
        clust[featmax[f]]= [f]

for c in clust:
    print('Cluster {}:'.format(c),clust[c],'\n')

Cluster 1: ['PF_Off', 'Cmp_Off', 'AttPass_Off', 'YdsPass_Off', 'TDPass_Off', '1stDPass_Off', '1stPy_Off', 'Sc%_Off', 'Cmp_Def', 'YdsPass_Def', '1stDPass_Def', 'Pen_Def', 'PenYds_Def'] 

Cluster 0: ['FL_Off', 'Int_Off', 'Pen_Off', 'PenYds_Off', 'TO%_Off', 'PF_Def', 'TDPass_Def', 'AttRush_Def', 'YdsRush_Def', 'TDrush_Def', '1stDrush_Def', '1stPy_Def', 'Sc%_Def'] 

Cluster 2: ['AttRush_Off', 'YdsRush_Off', 'TDrush_Off', '1stDRush_Off', 'FL_Def', 'AttPass_Def', 'Int_Def', 'TO%_Def'] 
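
The printout above lists only three clusters because cluster 3 does not have the highest average for any feature. The sketch below repeats the same `idxmax` grouping on three features, with center values rounded from the `k4centers` table, and pre-fills every cluster with an empty list so that such clusters still appear. The names `centers` and `features_by_cluster` are introduced here for illustration.

```python
import pandas as pd

# Cluster centers for three features, rounded from the k4centers table above
centers = pd.DataFrame({
    'PF_Off':      [0.268, 0.547, 0.513, 0.321],
    'FL_Off':      [0.460, 0.341, 0.330, 0.372],
    'AttRush_Off': [0.263, 0.328, 0.596, 0.410],
})

# For each feature (column), the row label of the cluster with the largest center value
featmax = centers.idxmax()

# Start every cluster with an empty list so clusters that never rank first still appear
features_by_cluster = {int(cluster): [] for cluster in centers.index}
for feature, cluster in featmax.items():
    features_by_cluster[int(cluster)].append(feature)

for cluster, features in features_by_cluster.items():
    print('Cluster {}:'.format(cluster), features)
```

Note that this approach only captures features where a cluster is highest; a cluster that is distinctive for being lowest on a feature is not flagged by `idxmax`.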



#### Results

Defining the clusters:

cluster 0:
These teams are prone to turnovers and costly penalties on offense and are vulnerable to the run game on defense.

cluster 1:
These teams have high-scoring, pass-oriented offenses but notably weak, undisciplined defenses, particularly against the pass.

cluster 2:
These teams have run-oriented offenses and tend to have defenses that force turnovers.

cluster 3:
These teams are not particularly notable in any statistical category.

In [37]:
#Creating new dataframe with cluster assignments
#.copy() so the added columns do not end up in statsMatrix
k4complete= statsMatrix.copy()
k4complete['Cluster']= pd.Series(cluster_assignments)
k4complete['Wins']= nflData['W']
k4complete['W_binned']= nflData['W_binned']
k4complete

Unnamed: 0,PF_Off,FL_Off,Cmp_Off,AttPass_Off,YdsPass_Off,TDPass_Off,Int_Off,1stDPass_Off,AttRush_Off,YdsRush_Off,...,TDrush_Def,1stDrush_Def,Pen_Def,PenYds_Def,1stPy_Def,Sc%_Def,TO%_Def,Cluster,Wins,W_binned
0,289,16,285,561,2921,10,19,154,320,1388,...,19,123,108,894,38,39.7,13.2,0,5,Losing
1,312,9,307,550,3567,21,23,175,389,1625,...,15,118,122,950,41,32.8,9.3,0,8,Losing
2,250,13,337,608,3005,11,21,168,352,1204,...,12,107,100,810,21,27.0,14.2,3,5,Losing
3,379,9,363,574,4002,24,22,205,422,1540,...,5,68,111,960,33,28.9,13.7,1,10,Winning
4,310,5,320,568,3808,21,12,191,397,1308,...,9,77,130,1192,28,29.8,12.8,3,11,Winning
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
315,388,11,386,555,4095,30,11,208,429,1566,...,10,103,112,955,24,35.4,14.9,1,9,Winning
316,396,9,407,607,4758,25,12,226,379,1696,...,19,109,100,1023,34,38.5,11.8,1,8,Losing
317,342,14,347,540,3751,27,13,191,401,1448,...,13,105,100,887,31,35.9,11.8,0,7,Losing
318,281,4,311,509,3021,16,15,156,414,1774,...,12,109,102,791,24,35.1,14.6,3,7,Losing


Distribution of Wins by Cluster

In [45]:
#Total number of teams within each cluster
k4complete['Cluster'].value_counts()

1    90
3    86
0    80
2    64
Name: Cluster, dtype: int64

In [47]:
#Distribution of win categories by cluster
#Losing-Low(0-4), Losing(5-8), Winning(9-12), Winning-High(13-16)
pd.crosstab(k4complete['Cluster'],k4complete['W_binned'], margins=True)

Cluster \ W_binned,Losing-Low,Losing,Winning,Winning-High,All
0,41,37,2,0,80
1,3,40,37,10,90
2,0,6,46,12,64
3,4,49,31,2,86
All,48,132,116,24,320


In [49]:
#Distribution of win categories by cluster as a percentage of total cluster size
#Losing-Low(0-4), Losing(5-8), Winning(9-12), Winning-High(13-16)
pd.crosstab(k4complete['Cluster'],k4complete['W_binned'], normalize='index')

Cluster \ W_binned,Losing-Low,Losing,Winning,Winning-High
0,0.5125,0.4625,0.025,0.0
1,0.033333,0.444444,0.411111,0.111111
2,0.0,0.09375,0.71875,0.1875
3,0.046512,0.569767,0.360465,0.023256


In [39]:
#Distribution of total number of wins by cluster
pd.crosstab(k4complete['Cluster'],k4complete['Wins'])

Cluster \ Wins,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,1,1,10,11,18,9,14,7,7,0,1,1,0,0,0,0
1,0,0,0,1,2,7,8,15,10,11,10,6,10,9,0,1
2,0,0,0,0,0,0,0,4,2,6,15,14,11,8,3,1
3,0,0,0,0,4,8,11,16,14,13,10,5,3,2,0,0


Statistical Summary of Clusters

In [40]:
#Cluster 0: High Turnovers, Costly Penalties, poor run defense
k4_cluster0= k4complete[k4complete['Cluster']==0]
k4_cluster0.describe()

Unnamed: 0,PF_Off,FL_Off,Cmp_Off,AttPass_Off,YdsPass_Off,TDPass_Off,Int_Off,1stDPass_Off,AttRush_Off,YdsRush_Off,...,YdsRush_Def,TDrush_Def,1stDrush_Def,Pen_Def,PenYds_Def,1stPy_Def,Sc%_Def,TO%_Def,Cluster,Wins
count,80.0,80.0,80.0,80.0,80.0,80.0,80.0,80.0,80.0,80.0,...,80.0,80.0,80.0,80.0,80.0,80.0,80.0,80.0,80.0,80.0
mean,303.75,11.2,336.75,561.975,3522.725,20.675,17.4375,180.875,392.4875,1660.95,...,2030.4375,15.875,108.225,103.3875,877.7,30.5625,38.6525,9.84375,0.0,4.7625
std,46.419796,3.519925,38.360878,51.761374,449.511858,5.071077,4.327843,22.622591,38.170833,282.885892,...,266.079071,4.384596,15.629391,13.661823,124.717037,7.828613,3.110781,2.479171,0.0,2.118297
min,193.0,4.0,256.0,455.0,2289.0,8.0,8.0,125.0,320.0,1156.0,...,1566.0,7.0,77.0,69.0,634.0,13.0,31.0,3.9,0.0,0.0
25%,276.75,9.0,306.25,524.75,3255.0,17.0,14.75,161.75,361.5,1447.75,...,1810.25,13.0,97.25,94.75,790.75,25.0,36.675,8.175,0.0,3.0
50%,307.0,11.0,336.0,557.5,3560.5,21.0,18.0,180.0,386.5,1632.0,...,2011.0,15.0,105.5,103.5,890.5,31.0,38.9,9.7,0.0,4.0
75%,337.25,14.0,365.0,605.5,3805.75,24.0,20.0,197.25,409.25,1820.25,...,2225.25,18.0,120.25,112.25,948.5,35.25,40.725,11.825,0.0,6.0
max,445.0,22.0,445.0,681.0,4729.0,32.0,29.0,237.0,512.0,2408.0,...,2714.0,31.0,147.0,144.0,1208.0,58.0,46.3,15.3,0.0,11.0


In [41]:
#Cluster 1: High Scoring Pass Offense, Undisciplined,Poor Defense against the pass
k4_cluster1= k4complete[k4complete['Cluster']==1]
k4_cluster1.describe()

Unnamed: 0,PF_Off,FL_Off,Cmp_Off,AttPass_Off,YdsPass_Off,TDPass_Off,Int_Off,1stDPass_Off,AttRush_Off,YdsRush_Off,...,YdsRush_Def,TDrush_Def,1stDrush_Def,Pen_Def,PenYds_Def,1stPy_Def,Sc%_Def,TO%_Def,Cluster,Wins
count,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,...,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0
mean,418.966667,8.811111,390.922222,604.366667,4377.2,31.355556,13.677778,224.977778,410.544444,1716.9,...,1784.077778,13.144444,96.444444,107.922222,917.833333,29.7,35.987778,12.546667,1.0,8.844444
std,57.569391,3.165454,35.412887,43.179128,413.42152,7.042904,4.617572,22.892017,36.65759,225.81732,...,229.23283,3.66447,14.559534,11.596423,121.348162,7.691305,4.022488,2.593605,0.0,2.685421
min,320.0,2.0,306.0,512.0,3427.0,18.0,4.0,173.0,333.0,1221.0,...,1181.0,5.0,60.0,79.0,641.0,10.0,26.9,8.2,1.0,3.0
25%,379.25,7.0,368.5,574.0,4090.5,27.0,11.0,210.0,389.25,1527.25,...,1626.0,11.0,88.0,100.25,843.25,25.0,33.125,10.5,1.0,7.0
50%,410.0,8.5,384.5,598.0,4301.5,30.0,14.0,222.0,408.5,1725.5,...,1806.0,13.0,99.0,109.0,920.0,30.0,35.6,12.65,1.0,9.0
75%,449.25,11.0,410.0,631.5,4661.75,35.0,17.0,237.75,435.75,1872.75,...,1939.25,15.0,104.0,115.0,987.25,34.75,39.275,14.175,1.0,11.0
max,606.0,16.0,472.0,740.0,5444.0,55.0,30.0,293.0,523.0,2231.0,...,2361.0,21.0,130.0,136.0,1202.0,54.0,45.3,20.4,1.0,15.0


In [42]:
#Cluster 2: Running Offense, Strong Defense 
k4_cluster2= k4complete[k4complete['Cluster']==2]
k4_cluster2.describe()

Unnamed: 0,PF_Off,FL_Off,Cmp_Off,AttPass_Off,YdsPass_Off,TDPass_Off,Int_Off,1stDPass_Off,AttRush_Off,YdsRush_Off,...,YdsRush_Def,TDrush_Def,1stDrush_Def,Pen_Def,PenYds_Def,1stPy_Def,Sc%_Def,TO%_Def,Cluster,Wins
count,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,...,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0
mean,405.0625,8.609375,319.03125,503.53125,3513.1875,25.34375,10.1875,179.640625,484.40625,2164.25,...,1570.3125,9.359375,81.8125,100.703125,853.84375,29.546875,30.575,14.354687,2.0,10.859375
std,44.667688,2.292965,34.035448,48.304579,385.538991,5.901893,3.531603,18.184563,36.420262,306.919099,...,220.954486,3.751686,13.260366,13.23497,113.87186,6.049871,4.161959,2.769558,0.0,1.807104
min,321.0,4.0,244.0,405.0,2751.0,14.0,2.0,148.0,396.0,1422.0,...,1004.0,3.0,53.0,69.0,613.0,14.0,19.4,9.0,2.0,7.0
25%,374.25,7.0,291.75,473.0,3216.5,21.0,8.0,164.75,461.5,1968.25,...,1412.75,7.0,72.0,90.75,779.75,25.0,27.275,11.975,2.0,10.0
50%,402.0,9.0,314.5,501.0,3539.5,24.5,9.0,176.5,481.0,2119.0,...,1552.0,9.0,81.5,101.0,858.0,30.0,30.15,14.25,2.0,11.0
75%,428.5,10.0,344.25,538.25,3791.25,29.0,12.25,196.25,502.0,2331.5,...,1734.5,11.0,91.0,110.25,928.5,33.0,33.3,16.15,2.0,12.0
max,531.0,14.0,386.0,620.0,4308.0,38.0,20.0,210.0,596.0,3296.0,...,2130.0,22.0,122.0,137.0,1191.0,44.0,39.3,21.0,2.0,15.0


In [43]:
#Cluster 3: Average Offense and Defense
k4_cluster3= k4complete[k4complete['Cluster']==3]
k4_cluster3.describe()

Unnamed: 0,PF_Off,FL_Off,Cmp_Off,AttPass_Off,YdsPass_Off,TDPass_Off,Int_Off,1stDPass_Off,AttRush_Off,YdsRush_Off,...,YdsRush_Def,TDrush_Def,1stDrush_Def,Pen_Def,PenYds_Def,1stPy_Def,Sc%_Def,TO%_Def,Cluster,Wins
count,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,...,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0
mean,325.604651,9.430233,328.5,540.813953,3443.27907,20.686047,14.755814,178.290698,433.116279,1754.488372,...,1782.418605,11.767442,92.755814,104.662791,897.55814,29.151163,31.517442,12.084884,3.0,7.906977
std,46.281563,3.337984,39.622557,47.110262,460.239341,5.590457,4.469336,23.692573,38.711427,279.096346,...,220.412846,3.43571,12.364024,11.122431,116.810818,8.084142,3.166152,2.677335,0.0,2.145463
min,212.0,2.0,217.0,429.0,2179.0,11.0,4.0,122.0,352.0,1204.0,...,1337.0,4.0,68.0,79.0,676.0,11.0,23.6,5.5,3.0,4.0
25%,289.0,7.0,299.0,504.25,3103.5,16.0,12.0,162.0,409.0,1591.5,...,1632.75,9.25,84.25,97.0,806.25,24.0,29.4,10.325,3.0,6.0
50%,325.0,9.0,329.5,539.5,3411.5,20.0,14.0,179.0,426.5,1737.0,...,1789.0,11.0,92.0,104.0,882.0,29.0,31.7,12.3,3.0,8.0
75%,355.0,11.0,354.0,577.75,3760.0,23.0,17.5,192.0,456.75,1902.5,...,1919.0,14.0,99.75,112.0,983.25,34.0,33.6,13.975,3.0,9.0
max,441.0,18.0,450.0,661.0,4519.0,34.0,26.0,236.0,546.0,2632.0,...,2359.0,22.0,126.0,130.0,1196.0,50.0,38.0,19.0,3.0,13.0


#### Conclusion

After several runs of KMeans using a k value of 4, a consistent set of clusters with distinct characteristics emerges. The teams evaluated in this dataset fall into one of 4 clusters:

- High-scoring passing offense/Mistake-prone, poor pass defense (cluster 1)
- Running-oriented offense/Strong, turnover-forcing defense (cluster 2)
- High-turnover, mistake-prone offense/Poor run defense (cluster 0)
- Average offense and defense (cluster 3)

Of these 4 archetypes, the most successful by a significant margin is the running-oriented offense/strong, turnover-forcing defense group. Of the teams in this cluster, 90.63% finished the season in a winning category (9 or more wins). By comparison, only 52.22% of teams in the second most successful group, high-scoring pass offense/mistake-prone, poor pass defense, did so. Unsurprisingly, the cluster defined by offensive turnovers, penalties, and poor run defense was the least successful: only 2.5% of these teams finished with 9 or more wins. Average teams, largely unremarkable in any offensive or defensive category, generally fare poorly, with 61.63% of them finishing in a losing category (8 or fewer wins). This suggests that teams may be better off specializing strategically and maximizing their advantages rather than building balanced rosters that are decent at most things but great at nothing.
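
The percentages quoted above follow directly from the Cluster by W_binned crosstab. The sketch below reproduces the arithmetic from those counts, treating the Winning and Winning-High bins (9 or more wins) as winning seasons and the Losing-Low and Losing bins (8 or fewer wins) as losing seasons. The dictionary names are introduced here for illustration.

```python
# Counts copied from the Cluster x W_binned crosstab above.
# Order within each list: Losing-Low, Losing, Winning, Winning-High.
crosstab = {
    0: [41, 37, 2, 0],
    1: [3, 40, 37, 10],
    2: [0, 6, 46, 12],
    3: [4, 49, 31, 2],
}

winning_pct = {}
losing_pct = {}
for cluster, (losing_low, losing, winning, winning_high) in crosstab.items():
    total = losing_low + losing + winning + winning_high
    winning_pct[cluster] = 100 * (winning + winning_high) / total
    losing_pct[cluster] = 100 * (losing_low + losing) / total

for cluster in sorted(crosstab):
    print('Cluster {}: {:.3f}% winning, {:.3f}% losing'.format(
        cluster, winning_pct[cluster], losing_pct[cluster]))
```

Because the Losing bin includes 8-win seasons, teams that finished 8-8 are counted on the losing side in these figures.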

Interestingly, despite clearly being the most successful, the running offense/strong defense cluster is notably smaller than the other groups (64 team-seasons, versus 80 to 90 for the others). The NFL is sometimes referred to as a copycat league, where teams chase success by imitating successful teams. If this is true, one might expect this group to contain more teams and a more even mix of winning and losing teams, but this is not the case. One possibility is that teams of this archetype, despite having more regular-season success, do not have as much playoff or Super Bowl success, which is the ultimate goal of all NFL teams. Teams with greater postseason success may have different characteristics than those defining this group, in which case other teams may try to imitate them instead. This could be evaluated in the future using data that includes playoff records and statistics.