# Notebook 04 - Feature Engineering

## Objectives

* Split the cleaned data into train and test sets and assess which feature engineering steps to apply to each variable

## Inputs

* CSV file with cleaned data generated in notebook 02: outputs/datasets/cleaned/v1/credit_card_data_cleaned.csv

## Outputs

* Data split into train and test sets, saved in the outputs/datasets/cleaned/v1 folder
* List of variables and feature engineering steps


## Conclusions, Additional Comments

* [xxx]

## Feature Engineering Steps
* Split train and test sets before beginning to avoid data leakage
* Assess what feature engineering processes to apply to each variable


---

# Change working directory

* This notebook is stored in the `jupyter_notebooks` subfolder
* The working directory therefore needs to be changed from this folder to its parent folder, the workspace root

First, the current directory is accessed with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\franc\\credit-card-default\\jupyter_notebooks'

Next, the working directory is set as the parent of the current `jupyter_notebooks` directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory
* This allows access to all the files and folders within the workspace, rather than solely those within the `jupyter_notebooks` directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Finally, confirm that the new current directory has been successfully set

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\franc\\credit-card-default'
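
The `os.chdir` cell above moves up one level every time it runs, so re-running it would leave the workspace. A minimal sketch of a guarded variant follows; the helper name `move_to_workspace_root` is hypothetical, and it assumes the notebook folder is named `jupyter_notebooks` as stated above.

```python
import os


def move_to_workspace_root(notebook_dir_name="jupyter_notebooks"):
    """Move to the parent folder only when currently inside the notebook folder."""
    cwd = os.getcwd()
    if os.path.basename(cwd) == notebook_dir_name:
        # Still inside the notebook folder: step up to the workspace root
        os.chdir(os.path.dirname(cwd))
    # Otherwise leave the working directory untouched
    return os.getcwd()


print(move_to_workspace_root())
```

Calling the helper a second time leaves the working directory unchanged, because the folder name check no longer matches.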

---

# Load data

The data is loaded from the outputs/datasets/cleaned/v1 folder:

In [4]:
import pandas as pd
df = pd.read_csv('outputs/datasets/cleaned/v1/credit_card_data_cleaned.csv')
df.head()

| Unnamed: 0 | credit_limit | sex | education | marital_status | age | late_sep | late_aug | late_jul | late_jun | late_may | ... | prev_payment_sep | prev_payment_aug | prev_payment_jul | prev_payment_jun | prev_payment_may | prev_payment_apr | default_next_month | any_default | total_default | greatest_default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20000.0 | female | university | married | 24 | 2 | 2 | -1 | -1 | -2 | ... | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 4 | 2 |
| 1 | 120000.0 | female | university | single | 26 | -1 | 2 | 0 | 0 | 0 | ... | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 | 1 | 4 | 2 |
| 2 | 90000.0 | female | university | single | 34 | 0 | 0 | 0 | 0 | 0 | ... | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 | 0 | 0 | 0 |
| 3 | 50000.0 | female | university | married | 37 | 0 | 0 | 0 | 0 | 0 | ... | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 | 0 | 0 | 0 |
| 4 | 50000.0 | male | university | married | 57 | -1 | 0 | -1 | 0 | 0 | ... | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 | 0 | 0 | 0 |


---

# Split train and test set

The train and test sets must be split before feature engineering to avoid data leakage
* For example, consider a feature engineering step where data are scaled using the min and max values of the whole dataset
    - If this step were performed before the split, 'knowledge' of the range of the whole dataset, including the test set, would be indirectly built into the training data
    - Note that the data cleaning pipeline already applied to the data in notebook 02 does not contain any such steps: it is limited to intra-row operations, dropping a column that contains no predictive information, and renaming columns to conform to naming standards
    - More information and explanation on this topic is available at [Machine Learning Mastery](https://machinelearningmastery.com/data-preparation-without-data-leakage/)
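
The scaling example above can be sketched as follows. This is an illustration only: it uses a synthetic single-column array rather than the project data, and `MinMaxScaler` is assumed here as the scaler, which the notebook does not itself specify.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for one numeric feature (not the project data)
rng = np.random.default_rng(0)
X = rng.normal(loc=50_000, scale=20_000, size=(1_000, 1))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Leaky: the scaler learns its min and max from every row, test rows included
leaky_scaler = MinMaxScaler().fit(X)

# Safe: the scaler learns its range from the training rows only,
# and the same fitted scaler is then applied to the test rows
safe_scaler = MinMaxScaler().fit(X_train)
X_train_scaled = safe_scaler.transform(X_train)
X_test_scaled = safe_scaler.transform(X_test)

print("Range learned from all rows:  ", leaky_scaler.data_min_, leaky_scaler.data_max_)
print("Range learned from train rows:", safe_scaler.data_min_, safe_scaler.data_max_)
```

With the safe variant, scaled test values may fall slightly outside the 0 to 1 range, which is expected: the test rows played no part in fitting the scaler.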

The percentage of data to put into the train and test sets is a matter of some debate, with [Machine Learning Mastery](https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/) noting that common split percentages include 80% : 20%, 67% : 33% or even 50% : 50%.

Some models using very large datasets can have much more extreme split percentages, with ratios as high as 99% : 1%.
  
The research article ["Optimal ratio for data splitting"](https://onlinelibrary.wiley.com/doi/full/10.1002/sam.11583) (Joseph, 2022) aims to formulate mathematically an optimal ratio based on the number of unique rows and predictor variables in the dataset
* The number of predictor variables is not known before feature engineering begins, since some are likely to be removed when `SmartCorrelatedSelection` is applied
* A method for estimating the number of predictor variables `p` is therefore also provided: `p` is the square root of the number of unique rows in the dataset. With 30,000 rows, this gives a value for `p` of about 173.2
* The dataset can then be split in the ratio `√p : 1`: in this case, 13.2 : 1 or about 93% : 7%
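
The arithmetic above can be reproduced in a few lines. This sketch only restates the calculation as described in this notebook; the variable names are illustrative.

```python
import math

n_rows = 30_000                   # number of rows stated above
p_estimate = math.sqrt(n_rows)    # estimated number of predictor variables
ratio = math.sqrt(p_estimate)     # train : test = sqrt(p) : 1

# Convert the ratio into fractions of the whole dataset
train_fraction = ratio / (ratio + 1)
test_fraction = 1 / (ratio + 1)

print(f"p = {p_estimate:.1f}")
print(f"ratio = {ratio:.1f} : 1")
print(f"train = {train_fraction:.0%}, test = {test_fraction:.0%}")
```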

Given the wide variety of split percentages suggested from various sources, a split of 80% : 20% is used here, since it is a generally accepted split and falls well within the ranges mentioned above.

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['default_next_month'], axis=1),
    df['default_next_month'],
    test_size=0.2,
    random_state=42,
)
print(f"Training features dataset: {X_train.shape}")
print(f"Training target variable: {y_train.shape}")
print(f"Testing features dataset: {X_test.shape}")
print(f"Testing target variable: {y_test.shape}")

Training features dataset: (24000, 26)
Training target variable: (24000,)
Testing features dataset: (6000, 26)
Testing target variable: (6000,)
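
The Outputs section lists the split data as saved to disk. One way this could look is sketched below, under stated assumptions: the file names are hypothetical, a small synthetic DataFrame stands in for the cleaned data, and a temporary folder stands in for the project's outputs folder.

```python
import os
import tempfile

import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data (not the project dataset)
df_demo = pd.DataFrame({
    "credit_limit": [20000.0, 120000.0, 90000.0, 50000.0, 50000.0] * 4,
    "age": [24, 26, 34, 37, 57] * 4,
    "default_next_month": [1, 1, 0, 0, 0] * 4,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    df_demo.drop(["default_next_month"], axis=1),
    df_demo["default_next_month"],
    test_size=0.2,
    random_state=42,
)

# Temporary folder used for this sketch; the notebook would use its own outputs folder
out_dir = tempfile.mkdtemp()

# Hypothetical file names; index=False avoids writing an extra index column
X_tr.to_csv(os.path.join(out_dir, "X_train.csv"), index=False)
X_te.to_csv(os.path.join(out_dir, "X_test.csv"), index=False)
y_tr.to_csv(os.path.join(out_dir, "y_train.csv"), index=False)
y_te.to_csv(os.path.join(out_dir, "y_test.csv"), index=False)

print(sorted(os.listdir(out_dir)))
```

Writing with `index=False` means a later `pd.read_csv` call does not pick up the saved index as an additional `Unnamed: 0` column.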


---


# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [5]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder name is supplied
except Exception as e:
  print(e)