# **Top Ten Player Study**

## Objectives

* Answer business requirement one:
  * The client wishes us to conduct an analysis of current elite-level golf tournament data 
    to determine which golfing skills (e.g., driving, approach play, chipping, and putting) 
    are most likely to result in a player reaching the top ten of a tournament. 
    They are specifically interested in learning which skill to focus on to help a player 
    improve from a 30th–11th place finish to a top-ten finish.

## Inputs

* inputs/datasets/raw/ASA All PGA Raw Data - Tourn Level.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit app.

## Additional Comments

* Fuller data cleaning will follow in a later notebook, but some cleaning is done here to resolve the inconsistency between the 'pos' and 'Finish' features discovered in the previous notebook. This was necessary at this stage to avoid analysing data that contains errors.


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [31]:
import os
current_dir = os.getcwd()
current_dir

'c:\\'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [32]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [33]:
current_dir = os.getcwd()
current_dir

'c:\\'
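The outputs above show 'c:\\' both before and after the change, which suggests the cell was run when the working directory was already the drive root. Because `os.chdir(os.path.dirname(...))` moves up one level on every run, re-running it can leave the notebook outside the project. A minimal sketch of a more repeatable approach follows, assuming the project root is the nearest folder that contains an `inputs` directory; `find_project_root` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "inputs") -> Path:
    """Return `start` or its nearest ancestor that contains a `marker` folder.

    Hypothetical helper: the choice of "inputs" as the marker is an assumption
    based on the input path used in this notebook.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No folder named '{marker}' found at or above {start}")
```

Calling `os.chdir(find_project_root(Path.cwd()))` would then give the same result however many times the cell is run, and it fails with a clear message if the notebook has been started outside the project.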

In [24]:
import pandas as pd
import numpy as np

First, gain an overview of the data (bearing in mind missing fields and some data errors have been studied in the Data Collection phase).

In [34]:
file_path = "inputs/datasets/raw/ASA All PGA Raw Data - Tourn Level.csv"
df = pd.read_csv(file_path)

print("Data loaded successfully!")
print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]:,} columns\n")

display(df.head())

df.info()

FileNotFoundError: [Errno 2] No such file or directory: 'inputs/datasets/raw/ASA All PGA Raw Data - Tourn Level.csv'

In [None]:
num_features = df.select_dtypes(include=np.number).columns.tolist()
cat_features = df.select_dtypes(exclude=np.number).columns.tolist()

print("\nNumerical features:", len(num_features))
print(num_features)
print("\nCategorical features:", len(cat_features))
print(cat_features)


Numerical features: 31
['tournament id', 'player id', 'hole_par', 'strokes', 'hole_DKP', 'hole_FDP', 'hole_SDP', 'streak_DKP', 'streak_FDP', 'streak_SDP', 'n_rounds', 'made_cut', 'pos', 'finish_DKP', 'finish_FDP', 'finish_SDP', 'total_DKP', 'total_FDP', 'total_SDP', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'purse', 'season', 'no_cut', 'sg_putt', 'sg_arg', 'sg_app', 'sg_ott', 'sg_t2g', 'sg_total']

Categorical features: 6
['Player_initial_last', 'player', 'tournament name', 'course', 'date', 'Finish']


Check for duplicates.

In [None]:
duplicates = df.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicates}")


Number of duplicate rows: 0


Before exploring the data further, we need a data frame that reconciles the 'pos' and 'Finish' features (see the Data Collection notebook), so that we can check the data is suitable for the client's business requirement.
To do this, we first convert 'Finish' to entirely numeric data.

In [None]:
df_temp = df.copy()

# Strip only a leading tie marker ("T32" -> "32"); codes such as "CUT" are left intact
df_temp['finish_clean'] = df_temp['Finish'].astype(str).str.replace(r'^T(\d+)$', r'\1', regex=True)

# Anything that is not a number (e.g. "CUT" or a missing value) becomes NaN
df_temp['finish_numeric'] = pd.to_numeric(df_temp['finish_clean'], errors='coerce')

# Use 0 as the placeholder for "no numeric finishing position"
df_temp['finish_numeric'] = df_temp['finish_numeric'].fillna(0)

print(df_temp[['pos', 'Finish', 'finish_clean', 'finish_numeric']].head(10))



    pos Finish finish_clean  finish_numeric
0  32.0    T32           32            32.0
1  18.0    T18           18            18.0
2   NaN    CUT          CUT             0.0
3   NaN    CUT          CUT             0.0
4   NaN    CUT          CUT             0.0
5   NaN    CUT          CUT             0.0
6  26.0    T26           26            26.0
7  26.0    T26           26            26.0
8  67.0    T67           67            67.0
9   NaN    CUT          CUT             0.0
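Coercing every non-numeric 'Finish' value to 0 hides which status codes were present in the first place. As a rough sketch, the codes can be counted before they are zeroed out. The data below is a toy Series, not the real dataset: 'CUT' appears in the output above, while 'W/D' is an assumed example of another code.

```python
import pandas as pd

# Toy data: 'CUT' is seen in the dataset, 'W/D' is an assumed example code.
finish = pd.Series(["T32", "CUT", "1", None, "W/D", "T8", "CUT"], name="Finish")

# Remove a leading tie marker only when the rest of the value is a number.
stripped = finish.str.replace(r"^T(\d+)$", r"\1", regex=True)
as_number = pd.to_numeric(stripped, errors="coerce")

# Values that were present but did not convert are status codes.
status_codes = finish[as_number.isna() & finish.notna()].value_counts()
print(status_codes)
```

Running the same count on `df['Finish']` would show whether anything other than a missed cut is being folded into the 0 placeholder.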


Next, check for discrepancies between finish_numeric and pos.

In [30]:
df_temp['pos_differs'] = df_temp['finish_numeric'] != df_temp['pos']
num_differences = df_temp['pos_differs'].sum()
print(f"Number of rows where finish_numeric and pos differ: {num_differences}")
print(df_temp[df_temp['pos_differs']][['pos', 'Finish', 'finish_numeric']].head(10))

Number of rows where finish_numeric and pos differ: 5505
      pos Finish  finish_numeric
94   18.0    NaN             0.0
324  77.0    NaN             0.0
325   8.0    NaN             0.0
394  81.0    NaN             0.0
534  41.0    NaN             0.0
601   5.0    NaN             0.0
677  33.0    NaN             0.0
798  10.0    NaN             0.0
800  21.0    NaN             0.0
927  14.0    NaN             0.0


Replace every missing value in 'pos' with 0 (to indicate a bad finish in the tournament).

In [None]:
df_temp['pos'] = df_temp['pos'].fillna(0)
print(df_temp['pos'].sample(20, random_state=42).to_list())

[0.0, 0.0, 8.0, 0.0, 63.0, 32.0, 3.0, 12.0, 15.0, 32.0, 11.0, 29.0, 0.0, 0.0, 0.0, 69.0, 29.0, 54.0, 0.0, 36.0]


Now check for any discrepancies between finish_numeric and pos again.

In [None]:
df_temp['pos_differs'] = df_temp['finish_numeric'] != df_temp['pos']
num_differences = df_temp['pos_differs'].sum()
print(f"Number of rows where finish_numeric and pos differ: {num_differences}")
print(df_temp[df_temp['pos_differs']][['pos', 'Finish', 'finish_numeric']].head(10))

Number of rows where finish_numeric and pos differ: 5505
      pos Finish  finish_numeric
94   18.0    NaN             0.0
324  77.0    NaN             0.0
325   8.0    NaN             0.0
394  81.0    NaN             0.0
534  41.0    NaN             0.0
601   5.0    NaN             0.0
677  33.0    NaN             0.0
798  10.0    NaN             0.0
800  21.0    NaN             0.0
927  14.0    NaN             0.0
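A single mismatch count mixes several different situations: the sample rows above all have a missing 'Finish' alongside a valid 'pos', which is not the same as two recorded values that disagree. The sketch below classifies rows before any missing values are filled, using `np.select`. The toy values and the `mismatch_type` column name are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows covering each case; NaN marks a missing value (before any fillna).
toy = pd.DataFrame({
    "pos": [32.0, np.nan, 18.0, np.nan, 7.0],
    "finish_numeric": [32.0, np.nan, np.nan, 8.0, 9.0],
})

pos_missing = toy["pos"].isna()
finish_missing = toy["finish_numeric"].isna()

# Conditions are checked in order; the first one that is True wins.
conditions = [
    pos_missing & finish_missing,
    ~pos_missing & finish_missing,
    pos_missing & ~finish_missing,
    toy["pos"] != toy["finish_numeric"],
]
labels = ["both missing", "Finish missing", "pos missing", "values conflict"]

toy["mismatch_type"] = np.select(conditions, labels, default="match")
print(toy["mismatch_type"].value_counts())
```

A breakdown like this makes it easier to judge how many of the differing rows are genuine conflicts and how many are gaps in one of the two fields.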


From manual checks during the Data Collection phase, we believe pos to be the more reliable field. However, in some cases finish_numeric will have a result in the top ten that is accurate and pos will be inaccurate.

In [None]:
top_ten_mismatches = df_temp[
    (df_temp['finish_numeric'].between(1, 10, inclusive='both')) &
    (df_temp['finish_numeric'] != df_temp['pos'])
]
print(top_ten_mismatches[['pos', 'Finish', 'finish_numeric']])
print(f"\nNumber of mismatches: {len(top_ten_mismatches)}")


        pos Finish  finish_numeric
2771    0.0     T8             8.0
3629    0.0      3             3.0
4211    0.0     T6             6.0
4434    0.0     T5             5.0
8676    0.0      4             4.0
8677    0.0     T2             2.0
8685    0.0     T8             8.0
8729    0.0     T5             5.0
8737    0.0     T8             8.0
8761    0.0      1             1.0
8763    0.0     T8             8.0
8765    0.0     T2             2.0
8769    0.0     T8             8.0
8772    0.0     T5             5.0
8792    0.0     T8             8.0
8811    0.0     T5             5.0
8836    0.0     T6             6.0
8838    0.0     T6             6.0
8852    0.0     T6             6.0
8862    0.0     T4             4.0
8871    0.0      3             3.0
8877    0.0     T4             4.0
8887    0.0     T6             6.0
8907    0.0      2             2.0
8923    0.0      1             1.0
8934    0.0     T6             6.0
8940    0.0     T6             6.0
18524   0.0     T7  

We now need to create a new feature called true_pos. It uses the pos value in every case except these 41 mismatches, which use finish_numeric instead.

In [None]:
df_temp['true_pos'] = np.where(
    (df_temp['finish_numeric'].between(1, 10, inclusive='both')) &
    (df_temp['finish_numeric'] != df_temp['pos']),
    df_temp['finish_numeric'],
    df_temp['pos']
)

Print one occurrence where there was an issue to check for accuracy.

In [None]:
print(df_temp.loc[2771, ['pos', 'true_pos']])

pos         0.0
true_pos    8.0
Name: 2771, dtype: object


Finally, use the true_pos value to create a new feature called top_ten, whereby 0 = not in the top ten and 1 = in the top ten.

In [None]:
df_temp['top_ten'] = np.where(df_temp['true_pos'].between(1, 10, inclusive='both'), 1, 0)
print(df_temp['top_ten'].value_counts())

top_ten
0    33116
1     3748
Name: count, dtype: int64
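Since this code is meant to feed the Streamlit app, the steps above could be gathered into one function. This is only a sketch under stated assumptions: `add_top_ten_flag` is a hypothetical name, the input is assumed to have 'pos' and 'Finish' columns as in this dataset, and unlike the notebook it leaves the original 'pos' column unfilled.

```python
import numpy as np
import pandas as pd


def add_top_ten_flag(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with finish_numeric, true_pos and top_ten added.

    Hypothetical helper that mirrors the notebook steps; 0 stands for
    "no numeric finishing position".
    """
    out = df.copy()
    cleaned = out["Finish"].astype(str).str.replace(r"^T(\d+)$", r"\1", regex=True)
    out["finish_numeric"] = pd.to_numeric(cleaned, errors="coerce").fillna(0)
    pos = out["pos"].fillna(0)

    # Prefer pos, except where Finish records a top-ten place that pos lacks.
    use_finish = out["finish_numeric"].between(1, 10) & (out["finish_numeric"] != pos)
    out["true_pos"] = np.where(use_finish, out["finish_numeric"], pos)
    out["top_ten"] = out["true_pos"].between(1, 10).astype(int)
    return out


# Toy rows: a tie outside the top ten, a missed cut, a top ten only in
# Finish, and a top ten only in pos.
toy = pd.DataFrame({
    "pos": [32.0, np.nan, np.nan, 5.0],
    "Finish": ["T32", "CUT", "T8", None],
})
result = add_top_ten_flag(toy)
print(result[["pos", "Finish", "true_pos", "top_ten"]])
```

Keeping the logic in one place means the notebook and the app apply the same rule when deciding which rows count as top-ten finishes.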


## Conclusions and Next Steps

- Further data cleaning is necessary to remove unwanted features and consider what to do with fields with missing data.

## Push files to Repo
It will be time-efficient to push the df_temp dataframe to the repo in preparation for further data cleaning in the next phase of the project.

In [26]:
output_path = "outputs/data/interim"

if os.path.exists(output_path):
    raise FileExistsError(f"The folder '{output_path}' already exists. Please remove or rename it before continuing.")
else:
    os.makedirs(output_path)
    print(f"Folder created at: {output_path}")

file_path = os.path.join(output_path, "cleaned_data.csv")
df_temp.to_csv(file_path, index=False)

print(f"Data saved to: {file_path}")

Folder created at: outputs/data/interim
Data saved to: outputs/data/interim\cleaned_data.csv
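The cell above deliberately stops if the output folder already exists, so it fails on a second run until the folder is removed. If overwriting the interim file is acceptable, a variant that tolerates re-runs could look like the sketch below. `save_interim` is a hypothetical helper, and whether overwriting is appropriate is an assumption that depends on the project's workflow.

```python
from pathlib import Path

import pandas as pd


def save_interim(df: pd.DataFrame, folder: str, filename: str = "cleaned_data.csv") -> Path:
    """Write df to folder/filename, creating the folder only when it is missing.

    Hypothetical helper: an existing file with the same name is overwritten.
    """
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    df.to_csv(target, index=False)
    return target
```

For example, `save_interim(df_temp, "outputs/data/interim")` would write to the same location as the cell above and could be run repeatedly.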
