## **PGA PLAYERS** <a id="1"></a>

<a><img style="float: right;" src="https://www.linkpicture.com/q/images2_4.jpg" width="300" /></a>
<a><img style="float: right;" src="https://www.linkpicture.com/q/images_539.png" width="300" /></a>
 



- Dataset source: https://zenodo.org/record/5235684#.ZBcmKtDMLIU

### 1.2 Notebook Preparation <a id="1.2"></a>

This part of the notebook imports the libraries used for data handling, visualization, and clustering.

In [9]:
# Import libraries

import pandas as pd
import numpy as np 
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.graph_objects as go
import plotly.express as px

from sklearn.preprocessing import StandardScaler

from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.metrics import silhouette_samples, silhouette_score

## **2. Data Preparation** <a id="2"></a>

The section below provides an initial exploration of the data.

In [37]:
# Import the PGA data as a DataFrame and check first 10 rows

df = pd.read_csv('pga_dataset_players.csv')

df.head(10)

Unnamed: 0,player,Height cm,Weight lbs,DOB,Age,player id,date,course,tournament name,tournament id,...,visibility,winddirDegree,windspeedKmph,GreensGrass,FariwaysGrass,Water,Bunkers,Slope,Length,Par
0,Robert Allenby,185.0,180.0,12/07/1971,46.0,5.0,29/10/2017,"Country Club of Jackson - Jackson, MS",Sanderson Farms Championship,3763.0,...,10.0,224.0,3.0,2.0,2.0,1.0,4.0,128.0,6532.0,72.0
1,Robert Allenby,185.0,180.0,12/07/1971,47.0,5.0,20/05/2018,"Trinity Forest - Dallas, TX",AT&T Byron Nelson,401025251.0,...,9.0,208.0,11.0,2.0,8.0,1.0,4.0,134.0,7447.0,72.0
2,Robert Allenby,185.0,180.0,12/07/1971,47.0,5.0,10/06/2018,"TPC Southwind, Memphis, TN",FedEx St. Jude Classic,401025254.0,...,10.0,211.0,11.0,1.0,8.0,1.0,7.0,149.0,7244.0,70.0
3,Robert Allenby,185.0,180.0,12/07/1971,47.0,5.0,15/07/2018,"TPC Deere Run - Silvis, IL",John Deere Classic,401025258.0,...,8.0,237.0,6.0,1.0,1.0,1.0,6.0,138.0,7258.0,71.0
4,Robert Allenby,185.0,180.0,12/07/1971,47.0,5.0,23/07/2018,"Keene Trace - Nicholasville, KY",Barbasol Championship,401025271.0,...,10.0,149.0,9.0,1.0,1.0,1.0,3.0,139.0,7334.0,72.0
5,Robert Allenby,185.0,180.0,12/07/1971,48.0,5.0,24/02/2019,"Coco Beach - Rio Grande, Puero Rico",Puerto Rico Open,401056517.0,...,9.0,267.0,38.0,3.0,4.0,1.0,4.0,136.0,7557.0,72.0
6,Robert Allenby,185.0,180.0,12/07/1971,48.0,5.0,31/03/2019,"Corales Puntacana GC - Punta Cana, Dominican R...",Corales Puntacana Resort & Club Championship,401056525.0,...,10.0,136.0,9.0,3.0,4.0,1.0,10.0,140.0,7650.0,72.0
7,Robert Allenby,185.0,180.0,12/07/1971,48.0,5.0,14/07/2019,"TPC Deere Run - Silvis, IL",John Deere Classic,401025258.0,...,9.0,173.0,6.0,1.0,1.0,1.0,6.0,138.0,7258.0,71.0
8,Robert Allenby,185.0,180.0,12/07/1971,48.0,5.0,21/07/2019,"Keene Trace - Nicholasville, KY",Barbasol Championship,401025271.0,...,10.0,97.0,22.0,1.0,1.0,1.0,3.0,139.0,7334.0,72.0
9,Robert Allenby,185.0,180.0,12/07/1971,48.0,5.0,22/09/2019,"Country Club of Jackson - Jackson, MS",Sanderson Farms Championship,3763.0,...,9.0,232.0,21.0,2.0,2.0,1.0,4.0,128.0,6532.0,72.0


In [38]:
# Let us count the number of rows and columns in the PGA dataset.

df.shape

(14041, 64)

- The dataset contains 14,041 rows and 64 columns.

In [39]:
# Check data types and if any records are missing

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14041 entries, 0 to 14040
Data columns (total 64 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   player                 14040 non-null  object 
 1   Height cm              14040 non-null  float64
 2   Weight lbs             14040 non-null  float64
 3   DOB                    14040 non-null  object 
 4   Age                    14040 non-null  float64
 5   player id              14040 non-null  float64
 6   date                   14040 non-null  object 
 7   course                 14040 non-null  object 
 8   tournament name        14040 non-null  object 
 9   tournament id          14040 non-null  float64
 10  season                 14040 non-null  float64
 11  final position         14040 non-null  float64
 12  major                  14040 non-null  float64
 13  made_cut               14040 non-null  float64
 14  Consecutive_Cuts_Made  13721 non-null  float64
 15  Fi

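- The non-null counts are lower than the 14,041 total rows, so the dataset has missing values (for example, `Consecutive_Cuts_Made` has only 13,721 non-null entries). However, we are only interested in missing values in the Age, Weight, and Height columns.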

- Let us count the missing values in each column of our dataset.

In [40]:
# We can count the missing values in each column of our dataset.

df.isnull().sum()

player        1
Height cm     1
Weight lbs    1
DOB           1
Age           1
             ..
Water         1
Bunkers       1
Slope         1
Length        1
Par           1
Length: 64, dtype: int64
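Because the output above is truncated, it can help to restrict the count to the columns of interest. The following is a minimal sketch on a small made-up DataFrame (`toy` and its values are illustrative assumptions, not the PGA data); only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the PGA DataFrame; all values are made up for illustration.
toy = pd.DataFrame({
    "player": ["A", "B", None, "D"],
    "Height cm": [185.0, 178.0, np.nan, 190.0],
    "Weight lbs": [180.0, 165.0, np.nan, 200.0],
    "Age": [46.0, 30.0, np.nan, 27.0],
    "Consecutive_Cuts_Made": [3.0, np.nan, np.nan, 1.0],
})

cols_of_interest = ["Height cm", "Weight lbs", "Age"]

# Count missing values only in the columns we care about.
missing_counts = toy[cols_of_interest].isnull().sum()
print(missing_counts)

# Count the rows where at least one of those columns is missing.
rows_affected = int(toy[cols_of_interest].isnull().any(axis=1).sum())
print(rows_affected)
```

On the real data, the same two expressions could be applied to `df` in place of `toy`.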

- Since we are only interested in the average height, average weight and average age, let us delete the rows with missing records of Height, Weight and Age.

In [42]:
# Drop rows with missing Age, Height, or Weight records

df = df.dropna(subset=['Height cm', 'Weight lbs', 'Age'])

In [43]:
# Let us check if we still have missing Age, Height, and Weight records in the PGA dataset.

df.isnull().sum()

player        0
Height cm     0
Weight lbs    0
DOB           0
Age           0
             ..
Water         0
Bunkers       0
Slope         0
Length        0
Par           0
Length: 64, dtype: int64

- Now, the dataset has no missing records of height, age, and weight.
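Note that `dropna(subset=...)` only removes rows with missing values in the listed columns, so other columns (such as `Consecutive_Cuts_Made`, which had fewer non-null entries in the `df.info()` output) may still contain gaps. A small sketch of this behavior, using a made-up DataFrame (`toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: row 1 is missing everything, row 2 is missing only
# Consecutive_Cuts_Made. Values are made up for illustration.
toy = pd.DataFrame({
    "Height cm": [185.0, np.nan, 178.0],
    "Weight lbs": [180.0, np.nan, 165.0],
    "Age": [46.0, np.nan, 30.0],
    "Consecutive_Cuts_Made": [3.0, np.nan, np.nan],
})

# Drop only the rows that lack Height, Weight, or Age.
cleaned = toy.dropna(subset=["Height cm", "Weight lbs", "Age"])

# Columns outside the subset can still contain missing values.
remaining = cleaned.isnull().sum()
print(len(cleaned))
print(remaining)
```

This is sufficient here because the analysis only uses the three columns passed to `subset`.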

- Let us extract Age, Weight and Height information from the dataset

In [44]:
# Here, we extract the ages of PGA players from our dataset (years)

df_Age = df['Age']

df_Age.head()

0    46.0
1    47.0
2    47.0
3    47.0
4    47.0
Name: Age, dtype: float64

In [45]:
# Let us find the average age of PGA players

df_Age_mean = df_Age.mean()

df_Age_mean

33.340598290598294

In [46]:
# Here, we extract the weight (lbs) of PGA players from our dataset and display the first 20 rows

df_Weight = df['Weight lbs']

df_Weight.head(20)

0     180.0
1     180.0
2     180.0
3     180.0
4     180.0
5     180.0
6     180.0
7     180.0
8     180.0
9     180.0
10    180.0
11    180.0
12    180.0
13    195.0
14    195.0
15    195.0
16    195.0
17    195.0
18    195.0
19    195.0
Name: Weight lbs, dtype: float64

In [47]:
# Let us find the average weight of PGA players (lbs)

df_Weight.mean()

179.03867521367522

In [48]:
# Here, we extract the height (cm) of PGA players from our dataset and display the first 20 rows

df_Height = df['Height cm']

df_Height.head(20)

0     185.0
1     185.0
2     185.0
3     185.0
4     185.0
5     185.0
6     185.0
7     185.0
8     185.0
9     185.0
10    185.0
11    185.0
12    185.0
13    185.0
14    185.0
15    185.0
16    185.0
17    185.0
18    185.0
19    185.0
Name: Height cm, dtype: float64

In [50]:
# Let us find the average height of PGA players (cm)

df_Height.mean()

182.51709401709402
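The three means above could also be computed in a single step by selecting the columns together. A minimal sketch on a made-up DataFrame (`toy` and its values are illustrative assumptions; the column names match the dataset):

```python
import pandas as pd

# Toy stand-in for the cleaned PGA DataFrame; values are made up.
toy = pd.DataFrame({
    "Age": [46.0, 47.0, 30.0],
    "Weight lbs": [180.0, 180.0, 195.0],
    "Height cm": [185.0, 185.0, 180.0],
})

# Column-wise means for all three variables at once, rounded to one decimal.
summary = toy[["Age", "Weight lbs", "Height cm"]].mean().round(1)
print(summary)
```

The result is a Series indexed by column name, which keeps the three averages together for reporting.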

## **5. Conclusion** <a id="5"></a>

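- Based on the records in our dataset, PGA players have an average age of about 33.3 years, an average weight of about 179.0 lbs, and an average height of about 182.5 cm. Each row is one tournament appearance, so these means are weighted by how often a player appears in the data.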
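Since each row in the dataset is one tournament appearance, a player with many appearances contributes more to a row-level mean. One possible alternative, sketched below on made-up data, is to average within each player first and then across players. This is an assumption about how per-player figures might be computed, not a step taken in this notebook; `toy` and its values are illustrative only.

```python
import pandas as pd

# Toy data: player A appears three times, player B once. Values are made up.
toy = pd.DataFrame({
    "player": ["A", "A", "A", "B"],
    "Age": [46.0, 47.0, 48.0, 30.0],
    "Weight lbs": [180.0, 180.0, 180.0, 200.0],
    "Height cm": [185.0, 185.0, 185.0, 175.0],
})
cols = ["Age", "Weight lbs", "Height cm"]

# Row-level mean: every tournament appearance counts once.
row_level = toy[cols].mean()

# Player-level mean: average within each player, then across players.
per_player = toy.groupby("player")[cols].mean()
player_level = per_player.mean()

print(row_level)
print(player_level)
```

In this toy example the two approaches differ (for instance, a row-level mean age of 42.75 versus a player-level mean age of 38.5), which illustrates why the unit of analysis matters when reporting averages.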