# Golf Performance Analysis

## Description
This is the final project for the course Sports Analytics (TDDE64) at Linköping University. It analyzes the impact of various golf performance metrics on player outcomes using machine learning techniques and regression analysis. The dataset contains PGA Tour player data from 2007 to 2017.

#### Can we determine the most important features that affect a golfer's Average Score and predict the Average Score of a golfer based on their performance metrics?

# <a id='TOC'>Table of Contents</a>
<ol>
<li><a href='#section_1'>Understanding the Problem and the Data</a></li>
<li><a href='#section_2'>Data Cleaning and Formatting</a></li>
<li><a href='#section_3'>Exploratory Data Analysis</a></li>
<li><a href='#section_4'>Baseline Model</a></li>
<li><a href='#section_5'>Improved Model</a></li>
<li><a href='#section_6'>Conclusion</a></li>
</ol>

### 1. <a id='section_1'>Understanding the Problem and the Data</a>
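`pgatour_cleaned.csv` contains the 15 columns described below (the cleaning step also adds two derived flag columns, `300+` and `Winner`). Each row describes one golfer's performance in a given year.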

- **Name**: Name of the golfer
- **Rounds**: Number of PGA Tour rounds the player played in that year.
- **Scoring**: Average score per round played for that year.
- **Driving Distance**: Average drive distance is calculated from two holes per round, chosen to negate wind effects. The measurement ends where the drive stops, regardless of fairway placement.
- **FWY_%**: The percentage of time a tee shot comes to rest in the fairway (regardless of club)
- **GIR_%**: Green in Regulation (GIR) percentage is the frequency a player's ball touches the green after the GIR stroke, which is par minus two (1st stroke on par 3, 2nd on par 4, 3rd on par 5).
- **SG_P (Strokes gained putting)**: Strokes gained putting is calculated by comparing a player's putts from a specific distance to a baseline, subtracting the field average, and summing this for all holes. This total is then divided by the number of rounds played.
- **SG_TTG (Strokes gained tee to green)**: Average per round of how a player's strokes compare to the field average for the same course and event, excluding strokes gained putting.
- **SG_T (Strokes Gained Total)**: The per round average of the number of strokes the player was better or worse than the field average on the same course & event.
- **Points**: FedExCup points earned.
- **TOP_10**: Yearly count of a player's top 10 finishes
- **1ST**: The number of wins the player had in that year.
- **Year**: The year the data was collected.
- **Money**: The amount of money the player earned in that year.
- **Country**: Home country for player.
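
The three strokes-gained columns are related: strokes gained total is the sum of the putting and tee-to-green components. A minimal sketch of that sanity check follows, using the first three rows of the raw data shown in the next section. The function name `sg_total_matches` and the tolerance of 0.002 are assumptions, the latter chosen to absorb rounding in the three-decimal figures.

```python
# (name, SG_P, SG_TTG, SG_T) copied from the first rows of the raw data
rows = [
    ("Aaron Baddeley", 0.629, 0.435, 1.064),
    ("Adam Scott", 0.129, 1.105, 1.234),
    ("Alex Cejka", -0.479, 1.207, 0.728),
]

TOLERANCE = 0.002  # assumed allowance for rounding to three decimals


def sg_total_matches(sg_p, sg_ttg, sg_t, tol=TOLERANCE):
    """Return True when SG_T agrees with SG_P + SG_TTG within tol."""
    return abs((sg_p + sg_ttg) - sg_t) <= tol


# Map each player to whether the identity holds for that row
checks = {name: sg_total_matches(p, ttg, t) for name, p, ttg, t in rows}
print(checks)
```

A row that fails this check would point to a data-entry or parsing problem rather than a modelling issue.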

### 2. <a id='section_2'>Data Cleaning and Formatting</a>

#### Importing Libraries

In [37]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [38]:
df = pd.read_csv('data/raw/pgatour_raw.csv', encoding = 'cp1252', index_col = 0)

In [39]:
df.head()

| | NAME | ROUNDS | SCORING | DRIVE_DISTANCE | FWY_% | GIR_% | SG_P | SG_TTG | SG_T | POINTS | TOP 10 | 1ST | Year | MONEY | COUNTRY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aaron Baddeley | 83 | 70.088 | 291.9 | 60.0 | 60.35 | 0.629 | 0.435 | 1.064 | 17703 | 7.0 | 1.0 | 2007 | $3,441,119 | AUS |
| 1 | Adam Scott | 69 | 70.008 | 300.9 | 59.17 | 65.44 | 0.129 | 1.105 | 1.234 | 15630 | 6.0 | 1.0 | 2007 | $3,413,185 | AUS |
| 2 | Alex Cejka | 80 | 70.437 | 288.9 | 68.08 | 69.44 | -0.479 | 1.207 | 0.728 | 2400 | 4.0 | NaN | 2007 | $868,303 | GER |
| 3 | Anders Hansen | 55 | 70.856 | 280.7 | 66.95 | 62.85 | -0.176 | 0.087 | -0.089 | 1989 | NaN | NaN | 2007 | $461,216 | DEN |
| 4 | Andrew Buckle | 77 | 71.443 | 294.7 | 58.14 | 62.52 | 0.161 | -0.426 | -0.265 | 1875 | 1.0 | NaN | 2007 | $513,630 | AUS |


In [40]:
df.shape

(2044, 15)

In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2044 entries, 0 to 2043
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   NAME            2044 non-null   object 
 1   ROUNDS          2044 non-null   int64  
 2   SCORING         2044 non-null   float64
 3   DRIVE_DISTANCE  2044 non-null   float64
 4   FWY_%           2044 non-null   float64
 5   GIR_%           2044 non-null   float64
 6   SG_P            2044 non-null   float64
 7   SG_TTG          2044 non-null   float64
 8   SG_T            2044 non-null   float64
 9   POINTS          2044 non-null   object 
 10  TOP 10          1692 non-null   float64
 11  1ST             371 non-null    float64
 12  Year            2044 non-null   int64  
 13  MONEY           2044 non-null   object 
 14  COUNTRY         2044 non-null   object 
dtypes: float64(9), int64(2), object(4)
memory usage: 255.5+ KB


A first look at the raw data shows that it needs cleaning:

- Fill the empty cells in TOP 10 and 1ST with 0.
- Make sure the year column header is capitalized as 'Year'.
- Convert POINTS, TOP 10, and 1ST to integers.
- Remove the dollar sign ($) and commas from MONEY, then convert it to an integer.

In [42]:
# Replace empty cells in TOP 10 and 1ST with 0
df.fillna({'TOP 10': 0, '1ST': 0}, inplace = True)

# Capitalize the 'year' header as 'Year' (a no-op if the header is already 'Year')
df.rename(columns = {'year':'Year'}, inplace = True)

# Convert POINTS to integer after removing thousands separators
df['POINTS'] = df['POINTS'].astype(str).str.replace(',', '', regex = False)
df['POINTS'] = df['POINTS'].astype(int)

# Convert TOP 10 and 1ST to integer
df['TOP 10'] = df['TOP 10'].astype(int)
df['1ST'] = df['1ST'].astype(int)

# Remove the $ and commas in MONEY (regex = False so '$' is treated literally)
df['MONEY'] = df['MONEY'].astype(str).str.replace('$', '', regex = False)
df['MONEY'] = df['MONEY'].str.replace(',', '', regex = False)
df['MONEY'] = df['MONEY'].astype(float).astype(int)
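
To check these conversions on a few known values before running them on the full frame, the string cleanup can be wrapped in a small helper. This is only a sketch: the helper name `to_int_column` and the toy values are hypothetical, and `regex=False` is passed so that `$` is treated as a literal character rather than a regular-expression anchor.

```python
import pandas as pd


def to_int_column(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from a string column and convert it to integers."""
    cleaned = (
        series.astype(str)
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
    )
    # Go through float first so values such as '868303.0' also convert.
    return cleaned.astype(float).astype(int)


# Toy frame with the same kinds of values as the raw columns (illustrative only)
toy = pd.DataFrame({
    "POINTS": ["17,703", "2400"],
    "MONEY": ["$3,441,119", "$868,303"],
    "TOP 10": [7.0, None],
})
toy["POINTS"] = to_int_column(toy["POINTS"])
toy["MONEY"] = to_int_column(toy["MONEY"])
toy["TOP 10"] = toy["TOP 10"].fillna(0).astype(int)
print(toy.dtypes)
```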

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2044 entries, 0 to 2043
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   NAME            2044 non-null   object 
 1   ROUNDS          2044 non-null   int64  
 2   SCORING         2044 non-null   float64
 3   DRIVE_DISTANCE  2044 non-null   float64
 4   FWY_%           2044 non-null   float64
 5   GIR_%           2044 non-null   float64
 6   SG_P            2044 non-null   float64
 7   SG_TTG          2044 non-null   float64
 8   SG_T            2044 non-null   float64
 9   POINTS          2044 non-null   int64  
 10  TOP 10          2044 non-null   int64  
 11  1ST             2044 non-null   int64  
 12  Year            2044 non-null   int64  
 13  MONEY           2044 non-null   int64  
 14  COUNTRY         2044 non-null   object 
dtypes: float64(7), int64(6), object(2)
memory usage: 255.5+ KB


The output is now cleaner: every column is fully populated, and POINTS, TOP 10, 1ST, and MONEY are stored as integers.

In [44]:
df.describe()

| | ROUNDS | SCORING | DRIVE_DISTANCE | FWY_% | GIR_% | SG_P | SG_TTG | SG_T | POINTS | TOP 10 | 1ST | Year | MONEY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2044.0 | 2044.0 | 2044.0 | 2044.0 | 2044.0 | 2044.0 | 2044.0 | 2044.0 | 2044.0 | 2044.0 | 2044.0 | 2044.0 | 2044.0 |
| mean | 79.355186 | 70.91263 | 289.540068 | 62.036404 | 65.41248 | 0.022556 | 0.129178 | 0.151788 | 1790.963796 | 2.675147 | 0.234344 | 2011.949119 | 1419303.0 |
| std | 14.604295 | 0.681575 | 8.710074 | 5.209013 | 2.650798 | 0.35115 | 0.621524 | 0.675791 | 3833.522276 | 2.411051 | 0.593136 | 3.185158 | 1325270.0 |
| min | 45.0 | 67.794 | 259.0 | 41.86 | 54.23 | -1.475 | -3.34 | -3.209 | 6.0 | 0.0 | 0.0 | 2007.0 | 45460.0 |
| 25% | 69.0 | 70.4905 | 283.6 | 58.51 | 63.64 | -0.194 | -0.2595 | -0.2625 | 360.0 | 1.0 | 0.0 | 2009.0 | 556418.8 |
| 50% | 80.0 | 70.899 | 289.3 | 62.04 | 65.56 | 0.036 | 0.1405 | 0.1575 | 659.0 | 2.0 | 0.0 | 2012.0 | 1016720.0 |
| 75% | 90.0 | 71.33875 | 295.2 | 65.605 | 67.1225 | 0.261 | 0.528 | 0.56425 | 1272.0 | 4.0 | 0.0 | 2015.0 | 1809302.0 |
| max | 124.0 | 74.262 | 318.4 | 80.42 | 73.52 | 1.13 | 2.38 | 3.189 | 53607.0 | 15.0 | 7.0 | 2017.0 | 12030460.0 |


In [45]:
df.describe(include = ['O'])

| | NAME | COUNTRY |
|---|---|---|
| count | 2044 | 2044 |
| unique | 478 | 29 |
| top | Aaron Baddeley | USA |
| freq | 11 | 1405 |


In [46]:
# Create 300+ and Winner flag columns
df['300+'] = df['DRIVE_DISTANCE'] >= 300
# A player counts as a winner with one or more wins in the year
df['Winner'] = df['1ST'] >= 1
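
Boolean flags like these are element-wise comparisons on a column. The sketch below uses made-up rows (not taken from the dataset) to show how the comparison operator affects a win flag: `== 1` matches only seasons with exactly one win, whereas `>= 1` also covers multi-win seasons. The column names `exactly_one_win` and `at_least_one_win` are hypothetical.

```python
import pandas as pd

# Illustrative rows: no wins, one win, three wins
toy = pd.DataFrame({
    "DRIVE_DISTANCE": [291.9, 300.9, 305.2],
    "1ST": [0, 1, 3],
})

toy["300+"] = toy["DRIVE_DISTANCE"] >= 300   # True at 300 yards or more
toy["exactly_one_win"] = toy["1ST"] == 1     # misses the three-win season
toy["at_least_one_win"] = toy["1ST"] >= 1    # includes multi-win seasons
print(toy)
```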

The FedExCup points system used in 2007 and 2008 differed from the one used in the remaining years of this dataset. For that reason, a second dataframe is created that covers only 2009-2017, so that the points-related features can be used in the machine learning model later.

In [47]:
# Only use data from 2009 onwards
df2 = df[(df['Year'] != 2007) & (df['Year'] != 2008)]

In [48]:
df2.head()

| | NAME | ROUNDS | SCORING | DRIVE_DISTANCE | FWY_% | GIR_% | SG_P | SG_TTG | SG_T | POINTS | TOP 10 | 1ST | Year | MONEY | COUNTRY | 300+ | Winner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 389 | Aaron Baddeley | 66 | 71.153 | 287.8 | 56.48 | 59.57 | 0.604 | -0.755 | -0.151 | 431 | 2 | 0 | 2009 | 837065 | AUS | False | False |
| 390 | Adam Scott | 53 | 71.72 | 294.9 | 58.77 | 62.82 | -0.881 | 0.22 | -0.66 | 432 | 1 | 0 | 2009 | 783138 | AUS | False | False |
| 391 | Alex Cejka | 82 | 70.98 | 281.2 | 69.8 | 66.52 | -0.322 | 0.555 | 0.233 | 416 | 3 | 0 | 2009 | 953664 | GER | False | False |
| 392 | Andres Romero | 58 | 71.462 | 298.5 | 51.62 | 64.91 | -0.044 | -0.247 | -0.291 | 329 | 2 | 0 | 2009 | 789305 | ARG | False | False |
| 393 | Anthony Kim | 76 | 70.507 | 299.0 | 53.65 | 62.69 | 0.245 | 0.235 | 0.479 | 1420 | 3 | 0 | 2009 | 1972155 | USA | False | False |


Export the cleaned data to a new CSV file.

In [50]:
df.to_csv('data/processed/pgatour_cleaned.csv', index = False)
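
Because the file is written with `index = False`, the row index is not stored, so the file can be read back without `index_col`. A minimal round-trip sketch on a toy frame follows. It writes to a temporary directory rather than the project's `data/processed/` path, and the file name `toy_cleaned.csv` is an assumption made for illustration.

```python
import os
import tempfile

import pandas as pd

# Two rows taken from the 2009 data shown earlier
toy = pd.DataFrame({
    "NAME": ["Aaron Baddeley", "Adam Scott"],
    "Year": [2009, 2009],
    "MONEY": [837065, 783138],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_cleaned.csv")
    toy.to_csv(path, index=False)   # the index is not written to the file
    restored = pd.read_csv(path)    # so no index_col is needed when reading

print(restored.dtypes)
```

Reading the file back right after export is a cheap way to confirm that the integer columns survive the round trip without reverting to strings.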