# **Data Collection**

## Objectives

* Fetch the PGA Tour golf dataset from Kaggle and save it as raw data.

## Inputs

* A Kaggle authentication token (kaggle.json) placed in the project root.

## Outputs

* The extracted dataset files, saved under inputs/datasets/raw.

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so the working directory must be changed when running a notebook in the editor.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'c:\\project-five-golf-data-analytics\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'c:\\project-five-golf-data-analytics'
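Re-running the cell that calls os.chdir() moves one folder further up each time. A minimal sketch of a guard against that, assuming the notebooks live in a folder named jupyter_notebooks as in the output above; the helper name `move_to_project_root` is hypothetical:

```python
import os


def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Change to the parent folder only when currently inside the notebooks subfolder."""
    current = os.getcwd()
    # Only step up when the current folder is the notebooks subfolder,
    # so running the cell twice does not leave the project root.
    if os.path.basename(current) == notebook_dir_name:
        os.chdir(os.path.dirname(current))
    return os.getcwd()
```

Calling `move_to_project_root()` from the notebooks folder returns the project root; calling it again from the root leaves the directory unchanged.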

# Fetch data from Kaggle

Point the Kaggle CLI at the current directory so that the kaggle.json token stored there is recognised.

In [5]:
os.environ["KAGGLE_CONFIG_DIR"] = os.getcwd()
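If the token is missing, the download step fails later with a less obvious error. A small sketch of an early check, assuming the token is a file named kaggle.json inside the folder that KAGGLE_CONFIG_DIR points to; the helper name `kaggle_token_available` is hypothetical:

```python
import os


def kaggle_token_available(config_dir=None):
    """Return True when kaggle.json exists in the configured folder."""
    # Fall back to KAGGLE_CONFIG_DIR, then to the current directory.
    config_dir = config_dir or os.environ.get("KAGGLE_CONFIG_DIR", os.getcwd())
    return os.path.isfile(os.path.join(config_dir, "kaggle.json"))
```

Running `kaggle_token_available()` before the download cell gives a quick yes or no on whether the token is in place.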

Define the dataset and download it.

In [6]:
KaggleDatasetPath = "robikscube/pga-tour-golf-data-20152022"
DestinationFolder = "inputs/datasets/raw"
!kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder} -q

Dataset URL: https://www.kaggle.com/datasets/robikscube/pga-tour-golf-data-20152022
License(s): CC0-1.0
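The `!kaggle` line only works inside a notebook. Outside one, the same command could be run through subprocess. The sketch below reuses the flags from the cell above; the helper names `build_download_command` and `download_dataset` are hypothetical, and it assumes the kaggle CLI is installed and on the PATH:

```python
import subprocess


def build_download_command(dataset, destination):
    """Assemble the same CLI call as the notebook cell, as an argument list."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", destination, "-q"]


def download_dataset(dataset, destination):
    """Run the download; requires the kaggle CLI and a valid token."""
    # check=True raises CalledProcessError if the CLI exits with an error.
    subprocess.run(build_download_command(dataset, destination), check=True)
```

Passing the arguments as a list avoids shell quoting problems if a path contains spaces.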


Unzip the downloaded archive, then delete the zip file and the kaggle.json file.

In [7]:
import zipfile
import glob

for zip_path in glob.glob(os.path.join(DestinationFolder, "*.zip")):
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(DestinationFolder)
    os.remove(zip_path)

kaggle_json_path = "kaggle.json"
if os.path.exists(kaggle_json_path):
    os.remove(kaggle_json_path)

print("All zip files extracted and cleaned up. kaggle.json deleted.")



All zip files extracted and cleaned up. kaggle.json deleted.
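It can help to see which files an archive contained before relying on them. A minimal sketch of the extract-then-delete step as a function that also returns the member names; the helper name `extract_and_remove` is hypothetical:

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract one archive into destination, delete it, and return the member names."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        names = archive.namelist()
        archive.extractall(destination)
    # Remove the archive only after extraction has finished without error.
    os.remove(zip_path)
    return names
```

Printing the returned list shows exactly which files landed in the destination folder.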


---

# Load and Inspect Kaggle Data

Load the first CSV file found in the raw data folder and preview its first rows.

In [8]:
import pandas as pd

data_folder = "inputs/datasets/raw"
csv_files = [f for f in os.listdir(data_folder) if f.endswith('.csv')]

if csv_files:
    csv_path = os.path.join(data_folder, csv_files[0])
    df = pd.read_csv(csv_path)
    print(f"Loaded dataset: {csv_files[0]}")
else:
    raise FileNotFoundError("No CSV file found in the data folder.")

df.head()


Loaded dataset: ASA All PGA Raw Data - Tourn Level.csv


Unnamed: 0,Player_initial_last,tournament id,player id,hole_par,strokes,hole_DKP,hole_FDP,hole_SDP,streak_DKP,streak_FDP,...,purse,season,no_cut,Finish,sg_putt,sg_arg,sg_app,sg_ott,sg_t2g,sg_total
0,A. Ancer,401353224,9261,288,289,60.0,51.1,56,3,7.6,...,12.0,2022,0,T32,0.2,-0.13,-0.08,0.86,0.65,0.85
1,A. Hadwin,401353224,5548,288,286,72.5,61.5,61,8,13.0,...,12.0,2022,0,T18,0.36,0.75,0.31,0.18,1.24,1.6
2,A. Lahiri,401353224,4989,144,147,21.5,17.4,27,0,0.0,...,12.0,2022,0,CUT,-0.56,0.74,-1.09,0.37,0.02,-0.54
3,A. Long,401353224,6015,144,151,20.5,13.6,17,0,0.4,...,12.0,2022,0,CUT,-1.46,-1.86,-0.02,0.8,-1.08,-2.54
4,A. Noren,401353224,3832,144,148,23.5,18.1,23,0,1.2,...,12.0,2022,0,CUT,0.53,-0.36,-1.39,0.19,-1.56,-1.04
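os.listdir() returns entries in arbitrary order, so `csv_files[0]` may vary between machines if the folder ever holds more than one CSV. A small sketch that makes the choice predictable by sorting; the helper name `list_csv_files` is hypothetical:

```python
import os


def list_csv_files(folder):
    """Return CSV file names in folder, sorted so the first entry is predictable."""
    return sorted(
        name for name in os.listdir(folder) if name.lower().endswith(".csv")
    )
```

The lowercase comparison also picks up files with an uppercase `.CSV` extension.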


In [10]:
df.columns.tolist()


['Player_initial_last',
 'tournament id',
 'player id',
 'hole_par',
 'strokes',
 'hole_DKP',
 'hole_FDP',
 'hole_SDP',
 'streak_DKP',
 'streak_FDP',
 'streak_SDP',
 'n_rounds',
 'made_cut',
 'pos',
 'finish_DKP',
 'finish_FDP',
 'finish_SDP',
 'total_DKP',
 'total_FDP',
 'total_SDP',
 'player',
 'Unnamed: 2',
 'Unnamed: 3',
 'Unnamed: 4',
 'tournament name',
 'course',
 'date',
 'purse',
 'season',
 'no_cut',
 'Finish',
 'sg_putt',
 'sg_arg',
 'sg_app',
 'sg_ott',
 'sg_t2g',
 'sg_total']

Check for missing data.

In [11]:
missing_values = df.isnull().sum()
print(missing_values)

Player_initial_last        0
tournament id              0
player id                  0
hole_par                   0
strokes                    0
hole_DKP                   0
hole_FDP                   0
hole_SDP                   0
streak_DKP                 0
streak_FDP                 0
streak_SDP                 0
n_rounds                   0
made_cut                   0
pos                    15547
finish_DKP                 0
finish_FDP                 0
finish_SDP                 0
total_DKP                  0
total_FDP                  0
total_SDP                  0
player                     0
Unnamed: 2             36864
Unnamed: 3             36864
Unnamed: 4             36864
tournament name            0
course                     0
date                       0
purse                      0
season                     0
no_cut                     0
Finish                  7683
sg_putt                 7684
sg_arg                  7684
sg_app                  7684
sg_ott        
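With this many columns, the per-column listing is long. A minimal sketch that keeps only the columns with gaps and adds the share of rows affected. It runs on a small made-up frame, not the real dataset, and the helper name `missing_summary` is hypothetical:

```python
import pandas as pd


def missing_summary(frame):
    """Return count and share of missing values, only for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({"missing": counts, "share": counts / len(frame)})
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Illustrative toy data only; the values are not taken from the Kaggle file.
toy = pd.DataFrame({
    "pos": [1.0, None, None],
    "strokes": [280, 147, 151],
    "Unnamed: 2": [None, None, None],
})
print(missing_summary(toy))
```

Applied to `df`, this makes fully empty columns stand out because their share is 1.0.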

Inspect the rows with missing values in the 'pos' field.

In [14]:
missing_pos = df[df['pos'].isnull()]
missing_pos.head()

Unnamed: 0,Player_initial_last,tournament id,player id,hole_par,strokes,hole_DKP,hole_FDP,hole_SDP,streak_DKP,streak_FDP,...,purse,season,no_cut,Finish,sg_putt,sg_arg,sg_app,sg_ott,sg_t2g,sg_total
2,A. Lahiri,401353224,4989,144,147,21.5,17.4,27,0,0.0,...,12.0,2022,0,CUT,-0.56,0.74,-1.09,0.37,0.02,-0.54
3,A. Long,401353224,6015,144,151,20.5,13.6,17,0,0.4,...,12.0,2022,0,CUT,-1.46,-1.86,-0.02,0.8,-1.08,-2.54
4,A. Noren,401353224,3832,144,148,23.5,18.1,23,0,1.2,...,12.0,2022,0,CUT,0.53,-0.36,-1.39,0.19,-1.56,-1.04
5,A. Putnam,401353224,5502,144,151,19.5,12.0,19,0,6.0,...,12.0,2022,0,CUT,-0.97,0.14,-2.02,0.31,-1.56,-2.54
9,A. Smalley,401353224,9484,144,151,18.0,10.9,20,0,0.6,...,12.0,2022,0,CUT,-1.89,-0.71,0.71,-0.65,-0.65,-2.54


Check whether the missing values in the 'pos' field are due to players missing the cut. In golf, missing the cut is a very poor finish, so these rows are still relevant to this study.

In [16]:
cut_check = df[df['pos'].notnull() & (df['finish_DKP'] == 'cut')]
cut_check[['tournament id', 'pos', 'finish_DKP']]

Unnamed: 0,tournament id,pos,finish_DKP


The empty result above indicates that missing values in the 'pos' field correspond to missed cuts (a low finish, well outside the top ten).
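The check above compares finish_DKP with the lowercase string 'cut', but in the sample rows the label 'CUT' appears in the Finish column, so an empty result there may not be conclusive on its own. A complementary sketch is shown below. It assumes made_cut is a 0/1 flag, which the outputs here do not confirm. It uses a made-up frame, and the helper name `unexplained_missing_pos` is hypothetical:

```python
import pandas as pd


def unexplained_missing_pos(frame):
    """Rows with no recorded position even though the player made the cut."""
    return frame[frame["pos"].isnull() & (frame["made_cut"] == 1)]


# Illustrative toy data; the pos and made_cut values are assumptions.
toy = pd.DataFrame({
    "pos": [32.0, 18.0, None, None],
    "made_cut": [1, 1, 0, 0],
    "Finish": ["T32", "T18", "CUT", "CUT"],
})

# An empty frame supports the reading that missing pos means a missed cut.
print(len(unexplained_missing_pos(toy)))

# Which Finish labels accompany a missing pos (NaN is counted as well).
print(toy.loc[toy["pos"].isnull(), "Finish"].value_counts(dropna=False))
```

On the real data, any rows returned by `unexplained_missing_pos(df)` would be cases where a missing position has some other cause.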

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os

try:
    # Create your output folder here, for example:
    # os.makedirs(name='', exist_ok=True)
    pass  # placeholder so the empty try block is valid Python
except Exception as e:
    print(e)
