# Golf Performance Analysis

## Description
This is the final project for the course Sports Analytics (TDDE64) at Linköping University. It analyzes the impact of various golf performance metrics on player outcomes using machine learning techniques and regression analysis.

## Table of Contents
1. Data Preprocessing
2. Exploratory Data Analysis (EDA)
3. Feature Selection
4. Modeling
5. Feature Importance Analysis
6. Conclusion

### Data Preprocessing
The raw file pgaTourData.csv contains 2312 rows and 18 columns; 1674 rows remain after cleaning. Each row summarizes one golfer's performance in one year.

- Player Name: Name of the golfer
- Rounds: The number of rounds that a player played in that year
- Fairway Percentage: The percentage of time a tee shot lands on the fairway
- Year: The year in which the statistic was collected
- Avg Distance: The average distance of the tee-shot
- gir: Greens in Regulation percentage. A green is hit in regulation if any part of the ball is touching the putting surface while the number of strokes taken is at least two fewer than par
- Average Putts: The average number of strokes taken on the green
- Average Scrambling: Scrambling is when a player misses the green in regulation, but still makes par or better on a hole
- Average Score: The average of all the scores a player recorded in that year
- Points: The number of FedExCup points a player earned in that year. These points can be earned by competing in tournaments.
- Wins: The number of competitions a player has won in that year
- Top 10: The number of competitions where a player has placed in the Top 10
- Average SG Putts: Strokes gained: putting measures how many strokes a player gains (or loses) on the greens.
- Average SG Total: Strokes gained: total is the sum of the off-the-tee, approach-the-green, around-the-green, and putting statistics.
- SG:OTT: Strokes gained: off-the-tee measures player performance off the tee on all par-4s and par-5s.
- SG:APR: Strokes gained: approach-the-green measures player performance on approach shots. Approach shots include all shots that are not from the tee on par-4 and par-5 holes and are not included in strokes gained: around-the-green and strokes gained: putting. Approach shots also include tee shots on par-3s.
- SG:ARG: Strokes gained: around-the-green measures player performance on any shot within 30 yards of the edge of the green. This statistic does not include any shots taken on the putting green.
- Money: The amount of prize money a player has earned from tournaments
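
Because Average SG Total is described as the sum of the four strokes-gained components, the relationship can be checked directly. The sketch below is illustrative only: it hard-codes three rows taken from the cleaned `df.head()` output shown later in this README, and the `SG sum` column and variable names are introduced here for the example. Gaps of about 0.001 are expected because the published values are rounded to three decimals.

```python
import pandas as pd

# Three rows copied from the cleaned df.head() output in this README.
sample = pd.DataFrame({
    "Player Name": ["Henrik Stenson", "Ryan Armour", "Chez Reavie"],
    "Average SG Putts": [-0.207, -0.058, 0.192],
    "Average SG Total": [1.153, 0.337, 0.674],
    "SG:OTT": [0.427, -0.012, 0.183],
    "SG:APR": [0.960, 0.213, 0.437],
    "SG:ARG": [-0.027, 0.194, -0.137],
})

# Sum the four components row by row ("SG sum" is a name made up for this sketch).
components = ["SG:OTT", "SG:APR", "SG:ARG", "Average SG Putts"]
sample["SG sum"] = sample[components].sum(axis=1)

# Largest absolute gap between the reported total and the component sum.
max_gap = (sample["SG sum"] - sample["Average SG Total"]).abs().max()

print(sample[["Player Name", "Average SG Total", "SG sum"]])
print(f"largest gap: {max_gap:.4f}")
```

On the full dataset the same check could be run on `df` instead of `sample`, with a tolerance that allows for rounding.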

In [49]:
# importing packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [50]:
df = pd.read_csv('data/raw/pgaTourData.csv')
print(df.head())

      Player Name  Rounds  Fairway Percentage  Year  Avg Distance    gir  \
0  Henrik Stenson    60.0               75.19  2018         291.5  73.51   
1     Ryan Armour   109.0               73.58  2018         283.5  68.22   
2     Chez Reavie    93.0               72.24  2018         286.5  68.67   
3      Ryan Moore    78.0               71.94  2018         289.2  68.80   
4    Brian Stuard   103.0               71.44  2018         278.9  67.12   

   Average Putts  Average Scrambling  Average Score Points  Wins  Top 10  \
0          29.93               60.67         69.617    868   NaN     5.0   
1          29.31               60.13         70.758  1,006   1.0     3.0   
2          29.12               62.27         70.432  1,020   NaN     3.0   
3          29.17               64.16         70.015    795   NaN     5.0   
4          29.11               59.23         71.038    421   NaN     3.0   

   Average SG Putts  Average SG Total  SG:OTT  SG:APR  SG:ARG       Money  
0         

In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2312 entries, 0 to 2311
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Player Name         2312 non-null   object 
 1   Rounds              1678 non-null   float64
 2   Fairway Percentage  1678 non-null   float64
 3   Year                2312 non-null   int64  
 4   Avg Distance        1678 non-null   float64
 5   gir                 1678 non-null   float64
 6   Average Putts       1678 non-null   float64
 7   Average Scrambling  1678 non-null   float64
 8   Average Score       1678 non-null   float64
 9   Points              2296 non-null   object 
 10  Wins                293 non-null    float64
 11  Top 10              1458 non-null   float64
 12  Average SG Putts    1678 non-null   float64
 13  Average SG Total    1678 non-null   float64
 14  SG:OTT              1678 non-null   float64
 15  SG:APR              1678 non-null   float64
 16  SG:ARG

In [52]:
df.shape

(2312, 18)
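
The `info()` output shows fewer non-null values than rows in most columns (for example 1678 of 2312 for Rounds and 293 of 2312 for Wins). One way to list these gaps per column is `isna().sum()`. The snippet below is a minimal sketch on a made-up miniature frame, not on the real file; the `toy` data and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the raw file (NOT real PGA Tour data).
toy = pd.DataFrame({
    "Player Name": ["A", "B", "C", "D"],
    "Rounds": [60.0, np.nan, 93.0, np.nan],
    "Wins": [np.nan, 1.0, np.nan, np.nan],
    "Points": ["868", "1,006", None, "421"],
})

# Number of missing values in each column.
missing = toy.isna().sum()

# Number of rows that have a value in the Rounds column.
rows_with_rounds = toy.dropna(subset=["Rounds"]).shape[0]

print(missing)
print(f"rows with Rounds present: {rows_with_rounds}")
```

Applied to the real `df`, the same two calls would show which columns drive the row count down when incomplete rows are dropped.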

### Data Cleaning

A first look at the raw data shows that it needs further cleaning before analysis:

- Convert the NaNs in the columns Top 10 and Wins to 0s
- Change Top 10 and Wins to int
- Drop the rows with NaN values for players who do not have the full statistics
- Change the column Rounds to int
- Remove the commas in the column Points and change it to int
- Remove the dollar sign ($) and commas in the column Money and change it to float

In [53]:
# Replace NaN values with 0 in the 'Top 10' column and convert it to int
df['Top 10'] = df['Top 10'].fillna(0).astype(int)

# Replace NaN values with 0 in the 'Wins' column and convert it to int
df['Wins'] = df['Wins'].fillna(0).astype(int)

# Drop the remaining rows with NaN values (players without full statistics)
df = df.dropna(axis=0)

# Change the 'Rounds' column to int
df['Rounds'] = df['Rounds'].astype(int)

# Remove the thousands separators from the 'Points' column and convert it to int
df['Points'] = df['Points'].apply(lambda x: x.replace(',', '') if isinstance(x, str) else x).astype(int)

# Remove the $ sign and the commas from the 'Money' column and convert it to float
df['Money'] = df['Money'].astype(str).apply(lambda x: x.replace('$', ''))
df['Money'] = df['Money'].apply(lambda x: x.replace(',', '')).astype(float)
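
The string clean-up for Points and Money can also be written with pandas' vectorized string methods. The following is a sketch under assumptions rather than the method used in this project: `to_number` is a hypothetical helper, and the toy strings only imitate the formatting described above (thousands separators, a leading dollar sign).

```python
import pandas as pd

def to_number(column: pd.Series) -> pd.Series:
    """Hypothetical helper: strip '$' and ',' from text, then parse numbers."""
    stripped = column.astype(str).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(stripped)

# Toy strings that imitate the assumed raw formatting.
toy = pd.DataFrame({
    "Points": ["868", "1,006", "1,020"],
    "Money": ["$2,680,487", "$2,485,203", "$2,700,018"],
})

toy["Points"] = to_number(toy["Points"]).astype(int)
toy["Money"] = to_number(toy["Money"]).astype(float)

print(toy)
print(toy.dtypes)
```

`pd.to_numeric` raises an error on any value it cannot parse, which makes unexpected formatting visible early instead of silently producing wrong numbers.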

In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1674 entries, 0 to 1677
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Player Name         1674 non-null   object 
 1   Rounds              1674 non-null   int64  
 2   Fairway Percentage  1674 non-null   float64
 3   Year                1674 non-null   int64  
 4   Avg Distance        1674 non-null   float64
 5   gir                 1674 non-null   float64
 6   Average Putts       1674 non-null   float64
 7   Average Scrambling  1674 non-null   float64
 8   Average Score       1674 non-null   float64
 9   Points              1674 non-null   int64  
 10  Wins                1674 non-null   int64  
 11  Top 10              1674 non-null   int64  
 12  Average SG Putts    1674 non-null   float64
 13  Average SG Total    1674 non-null   float64
 14  SG:OTT              1674 non-null   float64
 15  SG:APR              1674 non-null   float64
 16  SG:ARG     

In [55]:
df.head()

| | Player Name | Rounds | Fairway Percentage | Year | Avg Distance | gir | Average Putts | Average Scrambling | Average Score | Points | Wins | Top 10 | Average SG Putts | Average SG Total | SG:OTT | SG:APR | SG:ARG | Money |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Henrik Stenson | 60 | 75.19 | 2018 | 291.5 | 73.51 | 29.93 | 60.67 | 69.617 | 868 | 0 | 5 | -0.207 | 1.153 | 0.427 | 0.96 | -0.027 | 2680487.0 |
| 1 | Ryan Armour | 109 | 73.58 | 2018 | 283.5 | 68.22 | 29.31 | 60.13 | 70.758 | 1006 | 1 | 3 | -0.058 | 0.337 | -0.012 | 0.213 | 0.194 | 2485203.0 |
| 2 | Chez Reavie | 93 | 72.24 | 2018 | 286.5 | 68.67 | 29.12 | 62.27 | 70.432 | 1020 | 0 | 3 | 0.192 | 0.674 | 0.183 | 0.437 | -0.137 | 2700018.0 |
| 3 | Ryan Moore | 78 | 71.94 | 2018 | 289.2 | 68.8 | 29.17 | 64.16 | 70.015 | 795 | 0 | 5 | -0.271 | 0.941 | 0.406 | 0.532 | 0.273 | 1986608.0 |
| 4 | Brian Stuard | 103 | 71.44 | 2018 | 278.9 | 67.12 | 29.11 | 59.23 | 71.038 | 421 | 0 | 3 | 0.164 | 0.062 | -0.227 | 0.099 | 0.026 | 1089763.0 |


In [56]:
df.describe()

| | Rounds | Fairway Percentage | Year | Avg Distance | gir | Average Putts | Average Scrambling | Average Score | Points | Wins | Top 10 | Average SG Putts | Average SG Total | SG:OTT | SG:APR | SG:ARG | Money |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1674.0 | 1674.0 | 1674.0 | 1674.0 | 1674.0 | 1674.0 | 1674.0 | 1674.0 | 1674.0 | 1674.0 | 1674.0 | 1674.0 | 1674.0 | 1674.0 | 1674.0 | 1674.0 | 1674.0 |
| mean | 78.769415 | 61.448614 | 2014.002987 | 290.786081 | 65.667103 | 29.163542 | 58.120687 | 70.922877 | 631.125448 | 0.206691 | 2.337515 | 0.025408 | 0.147527 | 0.037019 | 0.065192 | 0.020192 | 1488682.0 |
| std | 14.241512 | 5.057758 | 2.609352 | 8.908379 | 2.743211 | 0.518966 | 3.386783 | 0.698738 | 452.741472 | 0.516601 | 2.060691 | 0.344145 | 0.6954 | 0.379702 | 0.380895 | 0.223493 | 1410333.0 |
| min | 45.0 | 43.02 | 2010.0 | 266.4 | 53.54 | 27.51 | 44.01 | 68.698 | 3.0 | 0.0 | 0.0 | -1.475 | -3.209 | -1.717 | -1.68 | -0.93 | 24650.0 |
| 25% | 69.0 | 57.955 | 2012.0 | 284.9 | 63.8325 | 28.8025 | 55.9025 | 70.49425 | 322.0 | 0.0 | 1.0 | -0.18775 | -0.26025 | -0.19025 | -0.18 | -0.123 | 565641.2 |
| 50% | 80.0 | 61.435 | 2014.0 | 290.5 | 65.79 | 29.14 | 58.29 | 70.9045 | 530.0 | 0.0 | 2.0 | 0.04 | 0.147 | 0.055 | 0.081 | 0.0225 | 1046144.0 |
| 75% | 89.0 | 64.91 | 2016.0 | 296.375 | 67.5875 | 29.52 | 60.42 | 71.34375 | 813.75 | 0.0 | 3.0 | 0.2585 | 0.5685 | 0.28775 | 0.3145 | 0.17575 | 1892478.0 |
| max | 120.0 | 76.88 | 2018.0 | 319.7 | 73.52 | 31.0 | 69.33 | 74.4 | 4169.0 | 5.0 | 14.0 | 1.13 | 2.406 | 1.485 | 1.533 | 0.66 | 12030460.0 |
