# Feature Engineering 
## Objectives

* Split the data into training, validation, and test sets
* Create a preprocessing pipeline to scale the strokes gained features for use in the ML modeling process.

## Inputs

* outputs\data\final\cleaned_golfdata.csv

## Outputs

* A scikit-learn pipeline (or preprocessing pipeline) that scales the numeric features, ready to be used in the final ML model.

## Additional Comments

* Due to business requirements and prior work, minimal feature engineering is needed. All required features are already numeric.

* Each strokes gained feature represents a distinct aspect of golf performance, so no features will be removed or combined.

* No categorical encoding or additional transformations are required at this stage.

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'c:\\project-five-golf-data-analytics\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'c:\\project-five-golf-data-analytics'

## Load the Data Frame

In [5]:
import pandas as pd

df = pd.read_csv(r'outputs\data\final\cleaned_golfdata.csv')

print(df.head())


   player_id  tournament_id  finish_numeric  true_pos  top_ten  mid_band  \
0       9261      401353224            32.0      32.0        0         0   
1       5548      401353224            18.0      18.0        0         1   
2       4989      401353224             0.0      91.0        0         0   
3       6015      401353224             0.0      91.0        0         0   
4       3832      401353224             0.0      91.0        0         0   

   sg_putt  sg_arg  sg_app  sg_ott  sg_t2g  sg_total  
0     0.20   -0.13   -0.08    0.86    0.65      0.85  
1     0.36    0.75    0.31    0.18    1.24      1.60  
2    -0.56    0.74   -1.09    0.37    0.02     -0.54  
3    -1.46   -1.86   -0.02    0.80   -1.08     -2.54  
4     0.53   -0.36   -1.39    0.19   -1.56     -1.04  


## Identify the issue with small-field / no-cut tournaments
A significant minority of golf tournaments are played with smaller fields and have no cut. These tournaments could undermine our business requirements: a player could play very badly and finish last, yet still have a true_pos value of 30, which would look quite good in a full-field tournament. This needs to be investigated, as it could be a target leakage issue.
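To put a rough number on "significant minority", the share of such tournaments can be estimated from the worst recorded position in each event. The sketch below uses made-up toy data, and the cut-off of 60 is an assumption for illustration, not a value fixed by the business requirements.

```python
import pandas as pd

# Toy data (hypothetical): tournament 1 has positions up to 91,
# tournament 2 is a small event whose last place is 30.
toy = pd.DataFrame({
    "tournament_id": [1] * 5 + [2] * 3,
    "true_pos": [1, 2, 60, 91, 91, 1, 15, 30],
})

# Worst recorded position per tournament, used as a rough proxy for field size.
worst_pos = toy.groupby("tournament_id")["true_pos"].max()

# Assumed threshold: treat an event as small-field/no-cut when nobody finished worse than 60th.
small_field_ids = worst_pos[worst_pos <= 60].index.tolist()
share_small = len(small_field_ids) / worst_pos.size

print(small_field_ids, f"{share_small:.0%} of tournaments")
```

On the real data, `toy` would be replaced by `df`; the threshold should be chosen to suit the dataset.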

In [6]:
tournaments_with_high_pos = df.loc[df["true_pos"] > 60, "tournament_id"].unique()

df_filtered = df[~df["tournament_id"].isin(tournaments_with_high_pos)]

print(df_filtered)


       player_id  tournament_id  finish_numeric  true_pos  top_ten  mid_band  \
2343        9261      401353203            35.0      35.0        0         0   
2344        4383      401353203            33.0      33.0        0         0   
2345        1651      401353203            23.0      23.0        0         1   
2346        6798      401353203            28.0      28.0        0         1   
2347       10863      401353203            10.0      10.0        1         0   
...          ...            ...             ...       ...      ...       ...   
26501       5619           2262            18.0      18.0        0         1   
26502       4015           2262            18.0      18.0        0         1   
26503       1112           2262            12.0      12.0        0         1   
26504       1037           2262            25.0      25.0        0         1   
26505        686           2262             8.0       8.0        1         0   

       sg_putt  sg_arg  sg_app  sg_ott 

This table verifies the issue. Player 4383 finished 33rd in a tournament (a relatively good result) but has negative strokes gained values in every category. This needs fixing. We will create a feature called adj_pos that scales every finishing position to a value between 0 and 1 within its own tournament, so that 0 is the best position in the field and 1 is the worst.
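Assuming the usual min-max form, the scaling for player $i$ in tournament $t$ can be written as follows (the index notation is introduced here for illustration only):

$$
\text{adj\_pos}_{i,t} = \frac{\text{true\_pos}_{i,t} - \min_{j \in t} \text{true\_pos}_{j,t}}{\max_{j \in t} \text{true\_pos}_{j,t} - \min_{j \in t} \text{true\_pos}_{j,t}}
$$

As a worked example, a 32nd place in a tournament whose positions run from 1 to 91 gives $(32 - 1) / (91 - 1) = 31/90 \approx 0.344$, while a 15th place in a tournament whose positions run from 1 to 30 gives $(15 - 1) / (30 - 1) = 14/29 \approx 0.483$. The formula is undefined when every player in a tournament shares the same position, because the denominator is then zero.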

In [7]:
df['adj_pos'] = df.groupby('tournament_id')['true_pos'] \
    .transform(lambda x: (x - x.min()) / (x.max() - x.min()))

print(df[['tournament_id', 'true_pos', 'adj_pos']].head(20))

    tournament_id  true_pos   adj_pos
0       401353224      32.0  0.344444
1       401353224      18.0  0.188889
2       401353224      91.0  1.000000
3       401353224      91.0  1.000000
4       401353224      91.0  1.000000
5       401353224      91.0  1.000000
6       401353224      26.0  0.277778
7       401353224      26.0  0.277778
8       401353224      67.0  0.733333
9       401353224      91.0  1.000000
10      401353224      45.0  0.488889
11      401353224       2.0  0.011111
12      401353224      91.0  1.000000
13      401353224      18.0  0.188889
14      401353224      91.0  1.000000
15      401353224       1.0  0.000000
16      401353224      32.0  0.344444
17      401353224      60.0  0.655556
18      401353224      10.0  0.100000
19      401353224      69.0  0.755556


In [11]:
tournaments_30 = (
    df.groupby('tournament_id')['true_pos']
      .max()
      .reset_index()
      .query('true_pos == 30')
)

if not tournaments_30.empty:
    sample_tourney_id = tournaments_30.iloc[0]['tournament_id']
    print(f"Tournament selected (max true_pos=30): {sample_tourney_id}")

    display_cols = ['tournament_id', 'player_id', 'true_pos', 'adj_pos']
    sample_records = df[df['tournament_id'] == sample_tourney_id][display_cols] \
                        .sort_values('true_pos') \
                        .head(20)

    print("\nSample of 20 records from tournament with 30 participants:")
    print(sample_records.to_string(index=False))
else:
    print("No tournaments found with exactly 30 participants.")


Tournament selected (max true_pos=30): 2718.0

Sample of 20 records from tournament with 30 participants:
 tournament_id  player_id  true_pos  adj_pos
          2718      10140       1.0 0.000000
          2718       4848       2.0 0.034483
          2718       2552       3.0 0.068966
          2718       5409       3.0 0.068966
          2718         72       5.0 0.137931
          2718       6798       6.0 0.172414
          2718       9780       7.0 0.206897
          2718       5467       7.0 0.206897
          2718       2230       7.0 0.206897
          2718        569      10.0 0.310345
          2718        158      10.0 0.310345
          2718        257      10.0 0.310345
          2718       5579      13.0 0.413793
          2718       1614      13.0 0.413793
          2718       9025      15.0 0.482759
          2718        707      16.0 0.517241
          2718       1680      17.0 0.551724
          2718       3448      17.0 0.551724
          2718       3550      19.0 0.6

Check for missing data in the new feature

In [None]:
nan_adjpos = df[df['adj_pos'].isna()]

nan_adjpos_sorted = nan_adjpos.sort_values(['tournament_id', 'true_pos'])

print(nan_adjpos_sorted[['tournament_id', 'player_id', 'true_pos', 'adj_pos']])
print(f"\nTotal rows with NaN in adj_pos: {len(nan_adjpos_sorted)}")


      tournament_id  player_id  true_pos  adj_pos
5238      401243433      11253      91.0      NaN
5239      401243433      10049      91.0      NaN

Total rows with NaN in adj_pos: 2


In [None]:
tournament_to_check = 401243433

tournament_results = df[df['tournament_id'] == tournament_to_check] \
    .sort_values('true_pos') \
    .reset_index(drop=True)

print(tournament_results[['tournament_id', 'player_id', 'true_pos', 'adj_pos']])
print(f"\nTotal players in tournament {tournament_to_check}: {len(tournament_results)}")


   tournament_id  player_id  true_pos  adj_pos
0      401243433      11253      91.0      NaN
1      401243433      10049      91.0      NaN

Total players in tournament 401243433: 2


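This tournament appears to have only two entrants, both with a true_pos of 91. The maximum and minimum positions are therefore equal, so the scaling divides zero by zero and returns NaN. We will delete this tournament from the dataframe.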

In [21]:
tournament_to_remove = 401243433

df = df[df['tournament_id'] != tournament_to_remove].reset_index(drop=True)

print(f"Tournament {tournament_to_remove} removed. Remaining tournaments: {df['tournament_id'].nunique()}")


Tournament 401243433 removed. Remaining tournaments: 246
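The removal above targets one hard-coded tournament ID. As a hedged alternative, the same degenerate case (every entrant sharing one true_pos) could be detected generically. The sketch below runs on toy data; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy data (hypothetical): tournament 20 has only tied entrants at position 91.
toy = pd.DataFrame({
    "tournament_id": [10, 10, 10, 20, 20],
    "true_pos": [1.0, 2.0, 91.0, 91.0, 91.0],
})

grouped = toy.groupby("tournament_id")["true_pos"]
pos_min = grouped.transform("min")
pos_range = grouped.transform("max") - pos_min

# With a zero range, (x - min) / (max - min) becomes 0 / 0, which pandas evaluates to NaN.
adj_pos = (toy["true_pos"] - pos_min) / pos_range
n_nan = int(adj_pos.isna().sum())

# Tournaments whose position range is zero cannot be scaled and are dropped.
degenerate_ids = toy.loc[pos_range == 0, "tournament_id"].unique().tolist()
cleaned = toy[pos_range > 0].reset_index(drop=True)

print(degenerate_ids, n_nan, len(cleaned))
```

This keeps the clean-up step valid if further degenerate tournaments appear in a refreshed dataset.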


## Split the data into Train, Validation and Test sets

We will use the ratio 70/15/15.
As we are dealing with data from a range of golf tournaments, it makes sense to split on the tournament_id feature. Because tournaments differ in size, the row counts may deviate slightly from the target ratios, but this approach ensures that all data from each tournament remains in the same set.

In [22]:
import numpy as np

tournaments = df['tournament_id'].unique()
np.random.seed(42)  # for reproducibility
np.random.shuffle(tournaments)

n_total = len(tournaments)
n_train = int(0.7 * n_total)
n_val = int(0.15 * n_total)


train_tournaments = tournaments[:n_train]
val_tournaments = tournaments[n_train:n_train+n_val]
test_tournaments = tournaments[n_train+n_val:]

train_data = df[df['tournament_id'].isin(train_tournaments)].reset_index(drop=True)
val_data = df[df['tournament_id'].isin(val_tournaments)].reset_index(drop=True)
test_data = df[df['tournament_id'].isin(test_tournaments)].reset_index(drop=True)

print(f"Train: {len(train_data)} rows, Validation: {len(val_data)} rows, Test: {len(test_data)} rows")
train_data.head(10)



Train: 20468 rows, Validation: 4292 rows, Test: 4414 rows


|   | player_id | tournament_id | finish_numeric | true_pos | top_ten | mid_band | sg_putt | sg_arg | sg_app | sg_ott | sg_t2g | sg_total | adj_pos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9261 | 401353224 | 32.0 | 32.0 | 0 | 0 | 0.2 | -0.13 | -0.08 | 0.86 | 0.65 | 0.85 | 0.344444 |
| 1 | 5548 | 401353224 | 18.0 | 18.0 | 0 | 1 | 0.36 | 0.75 | 0.31 | 0.18 | 1.24 | 1.6 | 0.188889 |
| 2 | 4989 | 401353224 | 0.0 | 91.0 | 0 | 0 | -0.56 | 0.74 | -1.09 | 0.37 | 0.02 | -0.54 | 1.0 |
| 3 | 6015 | 401353224 | 0.0 | 91.0 | 0 | 0 | -1.46 | -1.86 | -0.02 | 0.8 | -1.08 | -2.54 | 1.0 |
| 4 | 3832 | 401353224 | 0.0 | 91.0 | 0 | 0 | 0.53 | -0.36 | -1.39 | 0.19 | -1.56 | -1.04 | 1.0 |
| 5 | 5502 | 401353224 | 0.0 | 91.0 | 0 | 0 | -0.97 | 0.14 | -2.02 | 0.31 | -1.56 | -2.54 | 1.0 |
| 6 | 10906 | 401353224 | 26.0 | 26.0 | 0 | 1 | 2.05 | 0.74 | -1.32 | -0.12 | -0.7 | 1.35 | 0.277778 |
| 7 | 10372 | 401353224 | 26.0 | 26.0 | 0 | 1 | -0.96 | -0.01 | 1.84 | 0.48 | 2.31 | 1.35 | 0.277778 |
| 8 | 388 | 401353224 | 67.0 | 67.0 | 0 | 0 | -0.82 | -1.79 | 2.0 | -1.04 | -0.83 | -1.65 | 0.733333 |
| 9 | 9484 | 401353224 | 0.0 | 91.0 | 0 | 0 | -1.89 | -0.71 | 0.71 | -0.65 | -0.65 | -2.54 | 1.0 |
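Since the split is meant to keep each tournament in exactly one set, a quick sanity check can confirm that no tournament_id appears in more than one split. The helper `splits_are_disjoint` below is a hypothetical name introduced for this sketch, which is demonstrated on toy data rather than on the notebook's dataframes.

```python
import numpy as np
import pandas as pd


def splits_are_disjoint(train, val, test, group_col="tournament_id"):
    """Return True when no group ID is shared between any two of the splits."""
    train_ids = set(train[group_col])
    val_ids = set(val[group_col])
    test_ids = set(test[group_col])
    return (
        train_ids.isdisjoint(val_ids)
        and train_ids.isdisjoint(test_ids)
        and val_ids.isdisjoint(test_ids)
    )


# Toy data (hypothetical): 20 tournaments with 3 rows each.
toy = pd.DataFrame({
    "tournament_id": np.repeat(np.arange(20), 3),
    "sg_total": np.linspace(-2.0, 2.0, 60),
})

# Shuffle the tournament IDs, then slice them into 70/15/15 groups.
rng = np.random.default_rng(42)
ids = rng.permutation(toy["tournament_id"].unique())
n_train = int(0.7 * len(ids))
n_val = int(0.15 * len(ids))

toy_train = toy[toy["tournament_id"].isin(ids[:n_train])]
toy_val = toy[toy["tournament_id"].isin(ids[n_train:n_train + n_val])]
toy_test = toy[toy["tournament_id"].isin(ids[n_train + n_val:])]

disjoint = splits_are_disjoint(toy_train, toy_val, toy_test)
rows_preserved = len(toy_train) + len(toy_val) + len(toy_test) == len(toy)

# Deliberately leaky case: a training row reused as validation data fails the check.
leaky = splits_are_disjoint(toy_train, toy_train.head(1), toy_test)

print(disjoint, rows_preserved, leaky)
```

In the notebook, the same check could be called as `splits_are_disjoint(train_data, val_data, test_data)`.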


## Create a Scaling Pipeline

In [23]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import pandas as pd

sg_features = ['sg_putt', 'sg_arg', 'sg_app', 'sg_ott', 'sg_t2g', 'sg_total']

strokes_pipeline = Pipeline([
    ('scaler', StandardScaler())
])

X_train_sg = strokes_pipeline.fit_transform(train_data[sg_features])
X_val_sg   = strokes_pipeline.transform(val_data[sg_features])
X_test_sg  = strokes_pipeline.transform(test_data[sg_features])

X_train_sg = pd.DataFrame(X_train_sg, columns=sg_features, index=train_data.index)
X_val_sg   = pd.DataFrame(X_val_sg, columns=sg_features, index=val_data.index)
X_test_sg  = pd.DataFrame(X_test_sg, columns=sg_features, index=test_data.index)

print(X_train_sg.head())


    sg_putt    sg_arg    sg_app    sg_ott    sg_t2g  sg_total
0  0.286684 -0.122696  0.018379  1.125777  0.515438  0.590914
1  0.430778  1.097130  0.369261  0.281225  0.878199  0.975308
2 -0.397762  1.083268 -0.890314  0.517203  0.128083 -0.121496
3 -1.208291 -2.520764  0.072361  1.051258 -0.548250 -1.146547
4  0.583878 -0.441514 -1.160223  0.293645 -0.843378 -0.377759
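The pipeline is fitted on the training set only and then applied unchanged to the validation and test sets, so those sets are scaled with the training means and standard deviations. The sketch below illustrates this on synthetic data (the values and the two-column subset are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for two strokes gained columns (not real golf data).
rng = np.random.default_rng(0)
cols = ["sg_putt", "sg_app"]
train = pd.DataFrame(rng.normal(0.5, 2.0, size=(200, 2)), columns=cols)
val = pd.DataFrame(rng.normal(0.5, 2.0, size=(50, 2)), columns=cols)

pipe = Pipeline([("scaler", StandardScaler())])
train_scaled = pipe.fit_transform(train)  # learns mean and std from train only
val_scaled = pipe.transform(val)          # reuses the training statistics

# Reproduce the validation transform by hand from the fitted scaler.
scaler = pipe.named_steps["scaler"]
manual_val = (val.to_numpy() - scaler.mean_) / scaler.scale_

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(np.allclose(val_scaled, manual_val))
```

The scaled training columns have mean 0 and standard deviation 1 by construction; the validation columns generally do not, because they are scaled with statistics from the training set.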


## Save Train, Test and Validation sets to the repo

In [24]:
base_path = "outputs/data/final"

folders = ["train", "validation", "test"]

for folder in folders:
    path = os.path.join(base_path, folder)
    os.makedirs(path, exist_ok=True)

train_data.to_csv(os.path.join(base_path, "train", "train_data.csv"), index=False)
val_data.to_csv(os.path.join(base_path, "validation", "val_data.csv"), index=False)
test_data.to_csv(os.path.join(base_path, "test", "test_data.csv"), index=False)

print("Train, validation, and test datasets saved successfully!")

Train, validation, and test datasets saved successfully!


## Save the Pipeline to the repo

In [25]:
import joblib

pipeline_path = "outputs/pipelines"
os.makedirs(pipeline_path, exist_ok=True)

joblib.dump(strokes_pipeline, os.path.join(pipeline_path, "strokes_pipeline.pkl"))
print("Pipeline saved successfully in outputs/pipelines!")


Pipeline saved successfully in outputs/pipelines!
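A saved pipeline can later be restored with `joblib.load` and applied to new data without refitting. The sketch below is a minimal round trip on toy values, written to a temporary directory rather than `outputs/pipelines` so that it does not touch the repo.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data (hypothetical values) with two of the strokes gained columns.
sg_features = ["sg_putt", "sg_arg"]
train = pd.DataFrame({
    "sg_putt": [0.20, 0.36, -0.56, -1.46],
    "sg_arg": [-0.13, 0.75, 0.74, -1.86],
})
pipe = Pipeline([("scaler", StandardScaler())]).fit(train[sg_features])

# Save and reload the fitted pipeline inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "strokes_pipeline.pkl")
    joblib.dump(pipe, path)
    loaded = joblib.load(path)

# The reloaded pipeline should scale unseen rows exactly like the original.
new_rows = pd.DataFrame({"sg_putt": [0.53], "sg_arg": [-0.36]})
same_output = np.allclose(
    loaded.transform(new_rows[sg_features]),
    pipe.transform(new_rows[sg_features]),
)
print(same_output)
```

In a modelling notebook, the path would instead point at the saved `strokes_pipeline.pkl`, and the input columns must match the feature list used at fit time.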
