# PGA Tour Statistical EDA
***By Brent Parsons***

- The dataset used for this project can be found [here.](https://www.kaggle.com/laranikal/golf-data-set-from-kaggle-that-became-unavailable)
- Data consists of statistics for the PGA Tour from 2019.
- EDA will be performed in this notebook to determine what statistics are relevant for this project.

## Imports

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Import dataset
pga_raw = pd.read_csv("data/PGA_Tour_Golf_Data_2019_Kaggle.csv")
pga_raw

Unnamed: 0,Player Name,Date,Statistic,Variable,Value
0,Cameron Champ,2019-08-25,Driving Distance,Driving Distance - (ROUNDS),78
1,Rory McIlroy,2019-08-25,Driving Distance,Driving Distance - (ROUNDS),72
2,Luke List,2019-08-25,Driving Distance,Driving Distance - (ROUNDS),66
3,Dustin Johnson,2019-08-25,Driving Distance,Driving Distance - (ROUNDS),73
4,Wyndham Clark,2019-08-25,Driving Distance,Driving Distance - (ROUNDS),87
...,...,...,...,...,...
9720524,Beau Hossler,2019-07-07,Fairway Bunker Tendency,Fairway Bunker Tendency - (RELATIVE TO PAR),+0.087
9720525,John Chin,2019-07-07,Fairway Bunker Tendency,Fairway Bunker Tendency - (RELATIVE TO PAR),+0.212
9720526,Matt Every,2019-07-07,Fairway Bunker Tendency,Fairway Bunker Tendency - (RELATIVE TO PAR),+0.485
9720527,Stewart Cink,2019-07-07,Fairway Bunker Tendency,Fairway Bunker Tendency - (RELATIVE TO PAR),+0.226


## Review Data

In [3]:
pga_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9720529 entries, 0 to 9720528
Data columns (total 5 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   Player Name  object
 1   Date         object
 2   Statistic    object
 3   Variable     object
 4   Value        object
dtypes: object(5)
memory usage: 370.8+ MB


In [4]:
# Summary statistics for each field in the DataFrame
pga_raw.describe()

Unnamed: 0,Player Name,Date,Statistic,Variable,Value
count,9720045,9720529,9720529,9720529,9546362
unique,1448,31,378,1496,220889
top,Charles Howell III,2019-05-12,Official World Golf Ranking,Official World Golf Ranking - (AVG POINTS),1
freq,44538,325412,181032,30172,225294


**NOTES:**
- There are 31 unique dates; these are explored further below.

### Dates

In [5]:
# Date values contained in the dataset
pga_raw["Date"].value_counts()

2019-05-12    325412
2019-03-03    325345
2019-02-24    324528
2019-04-14    323284
2019-05-05    323206
2019-05-19    323185
2019-04-07    322696
2019-04-28    322347
2019-03-10    322279
2019-03-17    322259
2019-06-16    321982
2019-03-31    321908
2019-06-09    321406
2019-05-26    319795
2019-07-28    319092
2019-06-02    318803
2019-02-10    317553
2019-06-23    316909
2019-01-27    316559
2019-04-21    315805
2019-06-30    314204
2019-07-21    313478
2019-02-17    312356
2019-02-03    311598
2019-08-04    311572
2019-07-07    311467
2019-08-11    310055
2019-07-14    309935
2019-08-18    307811
2019-08-25    282476
2019-03-24    211224
Name: Date, dtype: int64

**NOTES:**
- The dataset appears to record statistics at the end of each week of tournament play on the PGA Tour.
- I will keep all dates in the final dataset but average most of the measures I choose to analyze across them. This gives the broadest possible view of the season.
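
The weekly cadence noted above can be checked directly. The following is a minimal sketch that uses the 31 dates from the `value_counts()` output; in the notebook the same check could start from `pga_raw["Date"].unique()` instead of a hard-coded list.

```python
import pandas as pd

# The 31 dates listed in the value_counts() output above, in calendar order.
dates = pd.Series(pd.to_datetime([
    "2019-01-27", "2019-02-03", "2019-02-10", "2019-02-17", "2019-02-24",
    "2019-03-03", "2019-03-10", "2019-03-17", "2019-03-24", "2019-03-31",
    "2019-04-07", "2019-04-14", "2019-04-21", "2019-04-28", "2019-05-05",
    "2019-05-12", "2019-05-19", "2019-05-26", "2019-06-02", "2019-06-09",
    "2019-06-16", "2019-06-23", "2019-06-30", "2019-07-07", "2019-07-14",
    "2019-07-21", "2019-07-28", "2019-08-04", "2019-08-11", "2019-08-18",
    "2019-08-25",
]))

# Gap between consecutive snapshot dates; a weekly dataset should show only 7-day gaps.
gaps = dates.sort_values().diff().dropna()
print(gaps.value_counts())

# Day of the week on which the snapshots fall.
print(dates.dt.day_name().unique())
```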

### Statistical Fields

In [6]:
pga_raw["Statistic"].value_counts()

Official World Golf Ranking                         181032
Consecutive GIR                                      88854
Consecutive Sand Saves                               82824
Lowest Round                                         61728
% of Potential Pts won - FedExCup Regular Season     45072
                                                     ...  
Total 3 Putts - Inside 5'                             3498
Last 15 Events - Putting                                66
Last 15 Events - Scoring                                66
Last 5 Events - Putting                                 12
Last 5 Events - Scoring                                 12
Name: Statistic, Length: 378, dtype: int64

**NOTES:**
- 378 different statistical categories (and 1,496 distinct variables, per the `describe()` output above)
- Export to CSV for review (e.g., in Excel) to determine which ones to use

In [7]:
# Export to CSV for review
pga_raw.to_csv("output/pga_filtered.csv")
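
Reviewing the categories does not require the full 9.7-million-row table, and an Excel worksheet holds at most 1,048,576 rows, so the raw export cannot be opened in full there. The following is a minimal sketch that writes one row per Statistic/Variable pair with its row count. The small `demo` frame stands in for `pga_raw`, and the output filename is hypothetical.

```python
import pandas as pd

# Stand-in for pga_raw with the same column names (values taken from the previews above).
demo = pd.DataFrame({
    "Player Name": ["Cameron Champ", "Rory McIlroy", "Cameron Champ", "Rory McIlroy"],
    "Date": ["2019-08-25"] * 4,
    "Statistic": ["Driving Distance"] * 4,
    "Variable": ["Driving Distance - (ROUNDS)"] * 2 + ["Driving Distance - (AVG.)"] * 2,
    "Value": ["78", "72", "317.9", "313.5"],
})

# One row per Statistic/Variable pair, with the number of raw rows behind it.
catalog = (
    demo.groupby(["Statistic", "Variable"])
        .size()
        .reset_index(name="rows")
)
print(catalog)

# Hypothetical filename; uncomment to write the review file.
# catalog.to_csv("output/pga_variable_catalog.csv", index=False)
```

With the real data this would produce a file of roughly 1,496 rows, which is small enough to scan by hand.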

In [8]:
# Chosen stats from review (28 total)
pga_stats = [
    '3-Putts per Round - (AVG)',
    'All-Around Ranking - (TOTAL)',
    'Average Approach Shot Distance - (AVG)',
    'Birdie Average - (# OF BIRDIES)',
    'Birdie or Better Percentage - (%)',
    'Birdie to Bogey Ratio - (BIRDIE TO BOGEY RATIO)',
    'Bogey Average - (AVERAGE BOGEYS PER ROUND)',
    'Club Head Speed - (AVG.)',
    'Driving Distance - (AVG.)',
    'FedExCup Season Points - (POINTS)',
    'GIR Percentage from Fairway - (%)',
    'GIR Percentage from Other than Fairway - (%)',
    'Good Drive Percentage - (%)',
    'Greens in Regulation Percentage - (%)',
    'Hit Fairway Percentage - (%)',
    'Lowest Round - (VALUE)',
    'Official Money - (MONEY)',
    'Overall Putting Average - (AVG)',
    'Par 3 Scoring Average - (AVG)',
    'Par 4 Scoring Average - (AVG)',
    'Par 5 Scoring Average - (AVG)',
    "Putting from - > 25' - (% MADE)",
    "Putting from - 10-15' - (% MADE)",
    "Putting from 4-8' - (% MADE)",
    'Putts Per Round - (AVG)',
    'Scoring Average - (AVG)',
    'Scrambling - (%)',
    'SG: Total - (AVERAGE)',
]

print(pga_stats)

['3-Putts per Round - (AVG)', 'All-Around Ranking - (TOTAL)', 'Average Approach Shot Distance - (AVG)', 'Birdie Average - (# OF BIRDIES)', 'Birdie or Better Percentage - (%)', 'Birdie to Bogey Ratio - (BIRDIE TO BOGEY RATIO)', 'Bogey Average - (AVERAGE BOGEYS PER ROUND)', 'Club Head Speed - (AVG.)', 'Driving Distance - (AVG.)', 'FedExCup Season Points - (POINTS)', 'GIR Percentage from Fairway - (%)', 'GIR Percentage from Other than Fairway - (%)', 'Good Drive Percentage - (%)', 'Greens in Regulation Percentage - (%)', 'Hit Fairway Percentage - (%)', 'Lowest Round - (VALUE)', 'Official Money - (MONEY)', 'Overall Putting Average - (AVG)', 'Par 3 Scoring Average - (AVG)', 'Par 4 Scoring Average - (AVG)', 'Par 5 Scoring Average - (AVG)', "Putting from - > 25' - (% MADE)", "Putting from - 10-15' - (% MADE)", "Putting from 4-8' - (% MADE)", 'Putts Per Round - (AVG)', 'Scoring Average - (AVG)', 'Scrambling - (%)', 'SG: Total - (AVERAGE)']


In [9]:
# Limit the dataset to rows whose Variable is one of the chosen statistics
pga_df_stats = pga_raw.loc[pga_raw['Variable'].isin(pga_stats)]
pga_df_stats

Unnamed: 0,Player Name,Date,Statistic,Variable,Value
188,Cameron Champ,2019-08-25,Driving Distance,Driving Distance - (AVG.),317.9
189,Rory McIlroy,2019-08-25,Driving Distance,Driving Distance - (AVG.),313.5
190,Luke List,2019-08-25,Driving Distance,Driving Distance - (AVG.),313.3
191,Dustin Johnson,2019-08-25,Driving Distance,Driving Distance - (AVG.),312.0
192,Wyndham Clark,2019-08-25,Driving Distance,Driving Distance - (AVG.),311.8
...,...,...,...,...,...
9703454,Grayson Murray,2019-07-07,SG: Total,SG: Total - (AVERAGE),-1.443
9703455,Whee Kim,2019-07-07,SG: Total,SG: Total - (AVERAGE),-1.581
9703456,Satoshi Kodaira,2019-07-07,SG: Total,SG: Total - (AVERAGE),-1.721
9703457,Michael Kim,2019-07-07,SG: Total,SG: Total - (AVERAGE),-1.978


In [10]:
# Convert the Value field to numeric so the dataset can be pivoted with statistical variables as columns.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered slice.
pga_df_stats = pga_df_stats.assign(Value=pd.to_numeric(pga_df_stats["Value"], errors="coerce"))
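
With `errors='coerce'`, any string that does not parse as a number becomes NaN. That may matter for fields such as `Official Money - (MONEY)`, which is selected in cell 8 but does not appear in the null counts in cell 12. The following is a minimal sketch of cleaning such strings first. It assumes money and points values carry a `$` sign and thousands separators, which is an assumption about the raw file and is not verified here. `clean_numeric` is a hypothetical helper.

```python
import pandas as pd


def clean_numeric(values: pd.Series) -> pd.Series:
    """Hypothetical helper: strip '$' and ',' before converting to numbers."""
    stripped = values.astype(str).str.replace(r"[$,]", "", regex=True).str.strip()
    # Anything that still does not parse (e.g. "T12", "None") becomes NaN.
    return pd.to_numeric(stripped, errors="coerce")


# Assumed example formats; only "317.9" and "-1.443" are taken from the outputs above.
sample = pd.Series(["$1,234,567", "317.9", "-1.443", "1,050", "T12", None])
result = clean_numeric(sample)
print(result)
```

Comparing `Value.isna().sum()` per `Variable` before and after cleaning would show whether the assumption holds for this dataset.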


In [11]:
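# Pivot the dataset so statistical variables are columns.
# pivot_table aggregates with the mean by default, so each cell is a player's average across all dates.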
pga_df_cleaned = pd.pivot_table(data = pga_df_stats, index = 'Player Name', columns = 'Variable', values = 'Value')
pga_df_cleaned.head(10)

Player Name,3-Putts per Round - (AVG),All-Around Ranking - (TOTAL),Average Approach Shot Distance - (AVG),Birdie Average - (# OF BIRDIES),Birdie or Better Percentage - (%),Birdie to Bogey Ratio - (BIRDIE TO BOGEY RATIO),Bogey Average - (AVERAGE BOGEYS PER ROUND),Club Head Speed - (AVG.),Driving Distance - (AVG.),FedExCup Season Points - (POINTS),...,Par 3 Scoring Average - (AVG),Par 4 Scoring Average - (AVG),Par 5 Scoring Average - (AVG),Putting from - 10-15' - (% MADE),Putting from - > 25' - (% MADE),Putting from 4-8' - (% MADE),Putts Per Round - (AVG),SG: Total - (AVERAGE),Scoring Average - (AVG),Scrambling - (%)
Aaron Baddeley,0.345161,796.645161,166.173333,162.483871,21.903226,1.731,2.058966,113.669,287.096774,357.7,...,3.014194,3.964839,4.70129,38.315484,5.29,70.513,28.072258,0.450633,70.749032,67.201935
Aaron Rai,,,,,,,,,,,...,,,,,,,,,,
Aaron Wise,0.445161,640.193548,170.206667,188.645161,25.926774,1.609667,2.523103,116.913,303.816129,295.6,...,2.946129,4.025806,4.569032,24.548387,5.631613,64.028667,29.058065,0.110967,71.033871,53.427097
Abraham Ancer,0.500645,580.612903,167.723333,223.0,23.212258,1.746,2.187586,112.179333,294.883871,497.333333,...,3.034839,3.961613,4.617742,32.322581,4.636452,70.173333,28.707097,0.5969,70.681065,64.098065
Adam Bland,,,,,,,,,,,...,,,,,,,,,,
Adam Hadwin,0.597742,453.451613,166.536667,213.419355,23.494516,1.485,2.583103,112.175333,291.706452,607.633333,...,3.061613,4.029032,4.498065,31.475484,5.326129,67.184,28.722581,0.163367,70.586129,58.182258
Adam Long,0.569677,929.076923,165.803333,145.451613,19.13871,1.157,2.58931,110.987667,292.43871,609.9,...,3.074194,4.056774,4.671935,23.604516,4.254194,67.519667,29.336129,-0.369733,71.497129,58.781613
Adam Schenk,0.504839,666.322581,164.643333,265.16129,22.613548,1.563333,2.305517,123.065667,298.883871,355.2,...,3.031613,4.009677,4.566129,33.662903,6.090645,68.618,29.014516,0.5814,70.862226,60.91129
Adam Scott,0.709677,448.193548,174.016667,146.806452,24.679355,1.557,2.577586,119.972667,300.035484,565.176471,...,3.056129,4.025161,4.441613,39.876452,4.902258,61.053333,29.152258,1.611767,69.969355,61.336129
Adam Svensson,0.742903,887.115385,164.563333,171.451613,21.45129,1.307667,2.695862,113.809,287.770968,129.166667,...,3.015806,4.037419,4.635161,24.083871,7.262903,65.458,29.246774,0.270667,71.315387,59.752903
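
`pd.pivot_table` aggregates with the mean unless told otherwise, so each cell in the table above is a player's average across the weekly snapshots, and a player with no rows for a variable gets NaN there. The following is a minimal sketch on a toy long-format frame; the player names and values are made up for illustration.

```python
import pandas as pd

# Toy long-format data: Player A has two weekly driving-distance entries,
# Player B has no driving-distance entry at all.
long_df = pd.DataFrame({
    "Player Name": ["Player A", "Player A", "Player A", "Player B"],
    "Date": ["2019-08-18", "2019-08-25", "2019-08-25", "2019-08-25"],
    "Variable": [
        "Driving Distance - (AVG.)",
        "Driving Distance - (AVG.)",
        "Scrambling - (%)",
        "Scrambling - (%)",
    ],
    "Value": [300.0, 310.0, 60.0, 55.0],
})

# aggfunc="mean" is the default; it is spelled out here to make the averaging explicit.
wide = pd.pivot_table(
    long_df, index="Player Name", columns="Variable", values="Value", aggfunc="mean"
)
print(wide)
```

Spelling out `aggfunc` in the notebook cell would make the averaging step visible to readers who do not know the default.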


In [12]:
# Count the null (NaN) values in each statistical variable column
pga_df_cleaned.isnull().sum()

Variable
3-Putts per Round - (AVG)                          373
All-Around Ranking - (TOTAL)                       423
Average Approach Shot Distance - (AVG)             379
Birdie Average - (# OF BIRDIES)                    373
Birdie or Better Percentage - (%)                  373
Birdie to Bogey Ratio - (BIRDIE TO BOGEY RATIO)    373
Bogey Average - (AVERAGE BOGEYS PER ROUND)         373
Club Head Speed - (AVG.)                           379
Driving Distance - (AVG.)                          378
FedExCup Season Points - (POINTS)                  364
GIR Percentage from Fairway - (%)                  391
GIR Percentage from Other than Fairway - (%)       373
Good Drive Percentage - (%)                        373
Greens in Regulation Percentage - (%)              373
Hit Fairway Percentage - (%)                       379
Lowest Round - (VALUE)                               0
Overall Putting Average - (AVG)                    373
Par 3 Scoring Average - (AVG)                      373
...

In [13]:
# Drop any rows (players) that contain null (NaN) values
pga_df_nonull = pga_df_cleaned.dropna()
pga_df_nonull

Player Name,3-Putts per Round - (AVG),All-Around Ranking - (TOTAL),Average Approach Shot Distance - (AVG),Birdie Average - (# OF BIRDIES),Birdie or Better Percentage - (%),Birdie to Bogey Ratio - (BIRDIE TO BOGEY RATIO),Bogey Average - (AVERAGE BOGEYS PER ROUND),Club Head Speed - (AVG.),Driving Distance - (AVG.),FedExCup Season Points - (POINTS),...,Par 3 Scoring Average - (AVG),Par 4 Scoring Average - (AVG),Par 5 Scoring Average - (AVG),Putting from - 10-15' - (% MADE),Putting from - > 25' - (% MADE),Putting from 4-8' - (% MADE),Putts Per Round - (AVG),SG: Total - (AVERAGE),Scoring Average - (AVG),Scrambling - (%)
Aaron Baddeley,0.345161,796.645161,166.173333,162.483871,21.903226,1.731000,2.058966,113.669000,287.096774,357.700000,...,3.014194,3.964839,4.701290,38.315484,5.290000,70.513000,28.072258,0.450633,70.749032,67.201935
Aaron Wise,0.445161,640.193548,170.206667,188.645161,25.926774,1.609667,2.523103,116.913000,303.816129,295.600000,...,2.946129,4.025806,4.569032,24.548387,5.631613,64.028667,29.058065,0.110967,71.033871,53.427097
Abraham Ancer,0.500645,580.612903,167.723333,223.000000,23.212258,1.746000,2.187586,112.179333,294.883871,497.333333,...,3.034839,3.961613,4.617742,32.322581,4.636452,70.173333,28.707097,0.596900,70.681065,64.098065
Adam Hadwin,0.597742,453.451613,166.536667,213.419355,23.494516,1.485000,2.583103,112.175333,291.706452,607.633333,...,3.061613,4.029032,4.498065,31.475484,5.326129,67.184000,28.722581,0.163367,70.586129,58.182258
Adam Long,0.569677,929.076923,165.803333,145.451613,19.138710,1.157000,2.589310,110.987667,292.438710,609.900000,...,3.074194,4.056774,4.671935,23.604516,4.254194,67.519667,29.336129,-0.369733,71.497129,58.781613
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Vaughn Taylor,0.354516,671.580645,163.043333,232.258065,22.806452,1.619000,2.140000,107.793333,282.080645,387.766667,...,3.011613,3.992903,4.650000,32.964839,5.102903,69.121000,28.210968,0.507867,70.705323,62.364516
Webb Simpson,0.375484,436.032258,174.510000,165.000000,23.049677,1.998000,1.982069,110.009000,285.596774,614.416667,...,3.013871,3.936452,4.542258,32.593548,5.986774,68.556000,28.496452,1.436100,69.656387,67.100968
Wes Roach,0.417097,844.111111,166.580000,138.612903,19.729677,1.348333,2.463793,110.083000,290.432258,123.966667,...,2.965806,4.066452,4.589032,29.601290,4.278387,71.781333,29.021935,-0.159100,71.663903,61.987742
Wyndham Clark,0.285161,634.645161,166.340000,238.483871,24.776774,1.623000,2.371034,121.138000,312.132258,304.033333,...,3.089677,4.005161,4.475484,34.852258,4.250000,76.268667,28.038387,0.288700,70.714613,59.041290
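
`dropna()` with no arguments removes every player who is missing even one statistic, so it is worth checking how many rows are lost. The following is a minimal sketch on a toy wide frame; the player names and values are made up, and in the notebook `wide` would be `pga_df_cleaned`.

```python
import numpy as np
import pandas as pd

# Toy wide-format frame: Player B is missing one statistic.
wide = pd.DataFrame(
    {
        "Scoring Average - (AVG)": [70.7, np.nan, 71.0],
        "Lowest Round - (VALUE)": [64.0, 66.0, 65.0],
    },
    index=pd.Index(["Player A", "Player B", "Player C"], name="Player Name"),
)

# Keep only players with a value in every column, then list who was removed.
complete = wide.dropna()
dropped = wide.index.difference(complete.index)
print(f"kept {len(complete)} of {len(wide)} players; dropped: {list(dropped)}")
```

If too many players are lost, `dropna(subset=[...])` restricts the check to the columns that matter for a given analysis.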


## Export Cleansed Data

In [14]:
# Export cleansed dataset for visual analysis in Tableau
pga_df_nonull.to_excel("output/pga_cleansed.xlsx")