# Feature Engineering 
## Objectives

* Split the data into training, validation, and test sets
* Create a preprocessing pipeline to scale the strokes gained features for use in the ML modeling process.

## Inputs

* outputs\data\final\cleaned_golfdata.csv

## Outputs

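* Train, validation, and test sets saved as CSV files under outputs/data/final.
* A scikit-learn preprocessing pipeline that scales the strokes gained features, saved under outputs/pipelines and ready to be used in the final ML model.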

## Additional Comments

* Due to business requirements and prior work, minimal feature engineering is needed. All required features are already numeric.

* Each strokes gained feature represents a distinct aspect of golf performance, so no features will be removed or combined.

* No categorical encoding or additional transformations are required at this stage.

# Change working directory

We need to change the working directory from the notebook's current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\project-five-golf-data-analytics\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\project-five-golf-data-analytics'
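Re-running the directory cell above moves the working directory up one more level each time. A minimal sketch of a guard against that, assuming the notebooks folder is named `jupyter_notebooks` as in the output above; `project_root` is a hypothetical helper and not part of the original notebook.

```python
import os
from pathlib import Path


def project_root(cwd, notebooks_dirname="jupyter_notebooks"):
    """Return the parent of cwd if cwd is the notebooks folder, else cwd unchanged."""
    cwd = Path(cwd)
    return cwd.parent if cwd.name == notebooks_dirname else cwd


# Only moves up while the notebook is still inside the notebooks folder,
# so running this cell a second time leaves the directory where it is.
os.chdir(project_root(os.getcwd()))
```

With this guard, the cell can be run repeatedly without drifting above the project folder.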

## Load the Data Frame

In [4]:
import pandas as pd

df = pd.read_csv(r'outputs\data\final\cleaned_golfdata.csv')

print(df.head())


   player_id  tournament_id  finish_numeric  true_pos  top_ten  mid_band  \
0       9261      401353224            32.0      32.0        0         0   
1       5548      401353224            18.0      18.0        0         1   
2       4989      401353224             0.0      91.0        0         0   
3       6015      401353224             0.0      91.0        0         0   
4       3832      401353224             0.0      91.0        0         0   

   sg_putt  sg_arg  sg_app  sg_ott  sg_t2g  sg_total  
0     0.20   -0.13   -0.08    0.86    0.65      0.85  
1     0.36    0.75    0.31    0.18    1.24      1.60  
2    -0.56    0.74   -1.09    0.37    0.02     -0.54  
3    -1.46   -1.86   -0.02    0.80   -1.08     -2.54  
4     0.53   -0.36   -1.39    0.19   -1.56     -1.04  
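The comments at the top of this notebook state that all required features are already numeric. One way to confirm that after loading is sketched below. It runs on a small stand-in frame built from the first two rows printed above, and `non_numeric_columns` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

sg_columns = ["sg_putt", "sg_arg", "sg_app", "sg_ott", "sg_t2g", "sg_total"]

# Stand-in for the loaded DataFrame (first two rows from the output above).
sample = pd.DataFrame(
    [[0.20, -0.13, -0.08, 0.86, 0.65, 0.85],
     [0.36, 0.75, 0.31, 0.18, 1.24, 1.60]],
    columns=sg_columns,
)


def non_numeric_columns(frame, columns):
    """Return the requested columns that are missing from frame or not numeric."""
    return [
        c for c in columns
        if c not in frame.columns or not is_numeric_dtype(frame[c])
    ]


problems = non_numeric_columns(sample, sg_columns)
print(problems)  # an empty list means every column is present and numeric
```

In the notebook itself, the same call could be made with `df` in place of `sample`.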


## Split the data into Train, Test and Validation sets.

We will use a 70/15/15 ratio.
Because the data comes from a range of golf tournaments, it makes sense to split on the tournament_id feature. The ratio is applied to the number of tournaments rather than the number of rows, so the row-level proportions may differ slightly, but all rows from a given tournament stay in the same set.

In [7]:
import numpy as np

tournaments = df['tournament_id'].unique()
np.random.seed(42)  # for reproducibility
np.random.shuffle(tournaments)

n_total = len(tournaments)
n_train = int(0.7 * n_total)
n_val = int(0.15 * n_total)


train_tournaments = tournaments[:n_train]
val_tournaments = tournaments[n_train:n_train+n_val]
test_tournaments = tournaments[n_train+n_val:]

train_data = df[df['tournament_id'].isin(train_tournaments)].reset_index(drop=True)
val_data = df[df['tournament_id'].isin(val_tournaments)].reset_index(drop=True)
test_data = df[df['tournament_id'].isin(test_tournaments)].reset_index(drop=True)

print(f"Train: {len(train_data)} rows, Validation: {len(val_data)} rows, Test: {len(test_data)} rows")
train_data.head(10)



Train: 20300 rows, Validation: 4201 rows, Test: 4675 rows
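Because the split is made on tournaments, the row-level proportions are not exactly 70/15/15. The short calculation below uses only the row counts printed above to show the shares actually obtained.

```python
# Row counts reported by the split cell above.
rows = {"train": 20300, "validation": 4201, "test": 4675}

total = sum(rows.values())
shares = {name: count / total for name, count in rows.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The resulting shares are roughly 69.6% train, 14.4% validation, and 16.0% test, which is close to the intended ratio.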


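|   | player_id | tournament_id | finish_numeric | true_pos | top_ten | mid_band | sg_putt | sg_arg | sg_app | sg_ott | sg_t2g | sg_total |
|---|-----------|---------------|----------------|----------|---------|----------|---------|--------|--------|--------|--------|----------|
| 0 | 9261 | 401353224 | 32.0 | 32.0 | 0 | 0 | 0.20 | -0.13 | -0.08 | 0.86 | 0.65 | 0.85 |
| 1 | 5548 | 401353224 | 18.0 | 18.0 | 0 | 1 | 0.36 | 0.75 | 0.31 | 0.18 | 1.24 | 1.60 |
| 2 | 4989 | 401353224 | 0.0 | 91.0 | 0 | 0 | -0.56 | 0.74 | -1.09 | 0.37 | 0.02 | -0.54 |
| 3 | 6015 | 401353224 | 0.0 | 91.0 | 0 | 0 | -1.46 | -1.86 | -0.02 | 0.80 | -1.08 | -2.54 |
| 4 | 3832 | 401353224 | 0.0 | 91.0 | 0 | 0 | 0.53 | -0.36 | -1.39 | 0.19 | -1.56 | -1.04 |
| 5 | 5502 | 401353224 | 0.0 | 91.0 | 0 | 0 | -0.97 | 0.14 | -2.02 | 0.31 | -1.56 | -2.54 |
| 6 | 10906 | 401353224 | 26.0 | 26.0 | 0 | 1 | 2.05 | 0.74 | -1.32 | -0.12 | -0.70 | 1.35 |
| 7 | 10372 | 401353224 | 26.0 | 26.0 | 0 | 1 | -0.96 | -0.01 | 1.84 | 0.48 | 2.31 | 1.35 |
| 8 | 388 | 401353224 | 67.0 | 67.0 | 0 | 0 | -0.82 | -1.79 | 2.00 | -1.04 | -0.83 | -1.65 |
| 9 | 9484 | 401353224 | 0.0 | 91.0 | 0 | 0 | -1.89 | -0.71 | 0.71 | -0.65 | -0.65 | -2.54 |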
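The reason for splitting on tournament_id is that no tournament should contribute rows to more than one set. A minimal sketch of a check for that property follows. It uses toy frames with made-up tournament ids, and `shared_groups` is a hypothetical helper rather than part of the original notebook.

```python
import pandas as pd


def shared_groups(*frames, group_col="tournament_id"):
    """Return the group ids that appear in more than one of the given frames."""
    seen, shared = set(), set()
    for frame in frames:
        ids = set(frame[group_col].unique().tolist())
        shared |= seen & ids  # ids already seen in an earlier frame
        seen |= ids
    return shared


# Toy frames with made-up ids, standing in for train_data, val_data, and test_data.
toy_train = pd.DataFrame({"tournament_id": [1, 1, 2]})
toy_val = pd.DataFrame({"tournament_id": [3, 3]})
toy_test = pd.DataFrame({"tournament_id": [4]})

leaks = shared_groups(toy_train, toy_val, toy_test)
print(leaks)  # an empty set means no tournament is split across sets
```

Calling `shared_groups(train_data, val_data, test_data)` in the notebook would be expected to return an empty set, since the three tournament lists are slices of one shuffled array of unique ids.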


## Create a Scaling Pipeline

In [15]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

sg_features = ['sg_putt', 'sg_arg', 'sg_app', 'sg_ott', 'sg_t2g', 'sg_total']

strokes_pipeline = Pipeline([
    ('scaler', StandardScaler())
])

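# Fit the scaler on the training set only, then reuse its statistics for validation and test
X_train_sg = strokes_pipeline.fit_transform(train_data[sg_features])

X_val_sg = strokes_pipeline.transform(val_data[sg_features])
X_test_sg = strokes_pipeline.transform(test_data[sg_features])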

print(X_train_sg[:5])


[[ 0.28469734 -0.12529097  0.01489375  1.11767429  0.50633579  0.58105437]
 [ 0.42837887  1.08800199  0.36490074  0.27704708  0.86636486  0.96229045]
 [-0.39778992  1.07421457 -0.89153458  0.51192821  0.12189796 -0.12550317]
 [-1.20599853 -2.51051463  0.06874098  1.0435013  -0.54934268 -1.14213272]
 [ 0.5810405  -0.44240163 -1.16077072  0.28940924 -0.84224769 -0.37966056]]
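The pipeline is fitted on the training rows and then only applied to the validation and test rows. The sketch below illustrates what that implies, using synthetic numbers rather than the golf data: the training columns end up with mean 0 and standard deviation 1, while the other sets are only approximately standardised because they reuse the training statistics. All `toy_` names are placeholders.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for the training and validation feature matrices.
toy_train = rng.normal(loc=0.5, scale=2.0, size=(200, 3))
toy_val = rng.normal(loc=0.5, scale=2.0, size=(50, 3))

toy_pipeline = Pipeline([("scaler", StandardScaler())])
train_scaled = toy_pipeline.fit_transform(toy_train)  # learns mean and std from training rows
val_scaled = toy_pipeline.transform(toy_val)          # reuses the training mean and std

# Training columns are centred at 0 with unit standard deviation by construction.
print(np.allclose(train_scaled.mean(axis=0), 0.0))
print(np.allclose(train_scaled.std(axis=0), 1.0))
# Validation columns are close to, but not exactly, mean 0.
print(val_scaled.mean(axis=0).round(2))
```

Fitting on the training set alone keeps information from the validation and test rows out of the scaling statistics.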


## Save Train, Test and Validation sets to the repo

In [16]:
base_path = "outputs/data/final"

folders = ["train", "validation", "test"]

for folder in folders:
    path = os.path.join(base_path, folder)
    os.makedirs(path, exist_ok=True)

train_data.to_csv(os.path.join(base_path, "train", "train_data.csv"), index=False)
val_data.to_csv(os.path.join(base_path, "validation", "val_data.csv"), index=False)
test_data.to_csv(os.path.join(base_path, "test", "test_data.csv"), index=False)

print("Train, validation, and test datasets saved successfully!")

Train, validation, and test datasets saved successfully!
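A quick round trip can confirm that a saved file reads back as expected. The sketch below writes a toy frame to a temporary folder, which stands in for outputs/data/final, and compares the reloaded copy with the original. The toy values are taken from the first two rows shown earlier.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for train_data.
toy = pd.DataFrame({"tournament_id": [401353224, 401353224], "sg_total": [0.85, 1.60]})

with tempfile.TemporaryDirectory() as base_path:
    folder = os.path.join(base_path, "train")
    os.makedirs(folder, exist_ok=True)
    file_path = os.path.join(folder, "train_data.csv")

    # index=False keeps the row index out of the file, so no extra column appears on reload.
    toy.to_csv(file_path, index=False)
    reloaded = pd.read_csv(file_path)

# Raises an AssertionError if columns, dtypes, or values differ beyond float tolerance.
pd.testing.assert_frame_equal(reloaded, toy)
print(list(reloaded.columns))
```

The same comparison could be run against the real files by reading train_data.csv back and comparing it with `train_data`.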


## Save the Pipeline to the repo

In [17]:
import joblib

pipeline_path = "outputs/pipelines"
os.makedirs(pipeline_path, exist_ok=True)

joblib.dump(strokes_pipeline, os.path.join(pipeline_path, "strokes_pipeline.pkl"))
print("Pipeline saved successfully in outputs/pipelines!")


Pipeline saved successfully in outputs/pipelines!
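A later notebook can reload the saved pipeline with `joblib.load` and apply it without refitting. The sketch below shows the round trip with a toy pipeline and a temporary folder. The numbers are made up, and the temporary folder stands in for outputs/pipelines.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy pipeline fitted on made-up numbers, standing in for strokes_pipeline.
toy_features = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
toy_pipeline = Pipeline([("scaler", StandardScaler())]).fit(toy_features)

with tempfile.TemporaryDirectory() as pipeline_path:
    file_path = os.path.join(pipeline_path, "strokes_pipeline.pkl")
    joblib.dump(toy_pipeline, file_path)
    loaded = joblib.load(file_path)

# The reloaded pipeline carries the fitted mean and scale, so it transforms identically.
matches = np.allclose(loaded.transform(toy_features), toy_pipeline.transform(toy_features))
print(matches)
```

Pickled scikit-learn objects are generally best loaded with the same scikit-learn version that saved them, so it may be worth recording that version alongside the file.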
