## Loading data and preparing feature lists

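In this step I load the dataset and separate the target column from the feature matrix.
I then split the features into numerical and categorical groups based on the dtypes pandas inferred when reading the file.
Because the grouping follows the inferred dtype, columns that look numeric but were read as `object` (for example `Age` and `Annual_Income` in the output below) land in the categorical list.
The grouping matters because numerical and categorical variables go through different preprocessing steps later in the pipeline.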

In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import joblib
from sklearn.impute import SimpleImputer


# ---- Load data ----
data_path = "../Dataset/train.csv"  # adjust
df = pd.read_csv(data_path)

target_col = "Credit_Score"  # adjust if different

X = df.drop(columns=[target_col])
y = df[target_col]

# ---- Identify column types ----
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_features = X.select_dtypes(include=["object", "category", "bool"]).columns.tolist()

numeric_features, categorical_features


(['Monthly_Inhand_Salary',
  'Num_Bank_Accounts',
  'Num_Credit_Card',
  'Interest_Rate',
  'Delay_from_due_date',
  'Num_Credit_Inquiries',
  'Credit_Utilization_Ratio',
  'Total_EMI_per_month'],
 ['ID',
  'Customer_ID',
  'Month',
  'Name',
  'Age',
  'SSN',
  'Occupation',
  'Annual_Income',
  'Num_of_Loan',
  'Type_of_Loan',
  'Num_of_Delayed_Payment',
  'Changed_Credit_Limit',
  'Credit_Mix',
  'Outstanding_Debt',
  'Credit_History_Age',
  'Payment_of_Min_Amount',
  'Amount_invested_monthly',
  'Payment_Behaviour',
  'Monthly_Balance'])
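
The dtype-based grouping can be illustrated on a tiny frame. This is only a sketch with made-up values: the column names are borrowed from the dataset for readability, and the stray `"28_"` token is an assumption about the kind of entry that makes pandas read a numeric-looking column as `object`.

```python
import pandas as pd

# Toy frame with made-up values; "28_" mimics a numeric field polluted by a stray character
toy = pd.DataFrame(
    {
        "Interest_Rate": [3, 6, 8],
        "Age": ["23", "28_", "34"],
        "Occupation": ["Scientist", "Teacher", "Engineer"],
    }
)

# Same idea as in the cell above: group columns by their inferred dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Share of "Age" entries that would parse as numbers if coerced
parseable_share = pd.to_numeric(toy["Age"], errors="coerce").notna().mean()

print(numeric_cols)
print(other_cols)
print(round(parseable_share, 2))
```

A check like `parseable_share` is one possible way to spot columns that are mostly numeric but were read as text; whether to convert them is a separate modelling decision.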

## Train/Test split

Here I split the dataset into training and test sets using an 80/20 ratio.
The split is stratified on the target variable to keep the same class distribution in both sets.
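Stratification matters for classification problems because a purely random split can leave a class over- or under-represented in one of the two sets, which distorts both training and the test-set evaluation. It does not rebalance the classes; it only preserves their original proportions.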

In [25]:
# ---- Train / test split (stratified) ----
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=42,
)

X_train.shape, X_test.shape


((80000, 27), (20000, 27))
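
The effect of `stratify=y` can be checked on a small synthetic target. The sketch below uses made-up class labels and counts (60/30/10), not the real distribution of `Credit_Score`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with made-up class counts: 60 "Standard", 30 "Poor", 10 "Good"
y_toy = pd.Series(["Standard"] * 60 + ["Poor"] * 30 + ["Good"] * 10)
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Class proportions should match in both splits
train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(train_share)
print(test_share)
```

With these counts the 80/20 split divides evenly, so both splits keep the 60/30/10 proportions exactly; with real data the proportions match only up to rounding.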

## Fixing data types

In this block I make sure that all numeric features are stored as numeric data types, and all categorical features are stored as strings.
This avoids errors in later preprocessing steps.
I work on copies of the training and test sets to avoid `SettingWithCopyWarning`.
Ensuring correct dtypes is essential because scikit-learn transformers expect clean and consistent data formats.

In [26]:
# ---- Fix dtypes: numeric as numbers, categorical as strings ----

# Work on copies to avoid SettingWithCopy warnings
X_train = X_train.copy()
X_test = X_test.copy()

# 1) Ensure numeric features are numeric
for col in numeric_features:
    X_train[col] = pd.to_numeric(X_train[col], errors="coerce")
    X_test[col] = pd.to_numeric(X_test[col], errors="coerce")

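# 2) Ensure categorical features are strings, but keep missing values as NaN.
#    In many pandas versions a bare astype(str) turns NaN into the literal
#    string "nan", which the categorical imputer would treat as a real category.
for col in categorical_features:
    X_train[col] = X_train[col].astype(str).mask(X_train[col].isna())
    X_test[col] = X_test[col].astype(str).mask(X_test[col].isna())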
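
A minimal sketch of what the two conversions do to problematic entries, using a made-up series. `errors="coerce"` replaces anything that cannot be parsed with NaN instead of raising. For the string conversion, the sketch masks missing entries explicitly, since depending on the pandas version a bare `astype(str)` may turn NaN into the text `"nan"`.

```python
import numpy as np
import pandas as pd

# Made-up values: one unparseable token and one genuinely missing entry
raw = pd.Series(["12.5", "n/a", np.nan, "7"], dtype="object")

# Unparseable and missing entries both become NaN
as_number = pd.to_numeric(raw, errors="coerce")

# Convert to strings, then restore NaN where the original value was missing
as_text = raw.astype(str).mask(raw.isna())

print(as_number.tolist())
print(as_text.tolist())
```

Keeping missing values as NaN in both cases leaves the actual filling to the imputers in the preprocessing pipeline.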


## Building the preprocessing pipelines

Here I create two preprocessing pipelines: one for numerical features and one for categorical features.
The numerical pipeline imputes missing values using the median and scales the data with StandardScaler.
The categorical pipeline imputes missing values using the most frequent category and then applies One-Hot Encoding.
Finally, both pipelines are combined inside a ColumnTransformer that applies the correct transformations to the right feature groups.
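Keeping everything in one ColumnTransformer means the same object can be plugged in front of any scikit-learn model, and because it is fit on the training data only, the imputation values, scaling statistics, and category vocabulary are learned without looking at the test set.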

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# ---- Preprocessor: impute + scale numeric, impute + one-hot categorical ----
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Fit on training data only
preprocessor.fit(X_train)


[Estimator diagram of the fitted preprocessor]
ColumnTransformer (remainder='drop', sparse_threshold=0.3, verbose_feature_names_out=True)
  num: SimpleImputer(strategy='median') -> StandardScaler(with_mean=True, with_std=True)
  cat: SimpleImputer(strategy='most_frequent') -> OneHotEncoder(handle_unknown='ignore', sparse_output=True)

## Saving all artifacts

Here I save all important objects into a dictionary: training and test splits, the fitted preprocessor, the feature lists, and the target column name.
I then store this dictionary as a `.pkl` file using `joblib.dump()`.
Saving the artifacts allows me to reuse the exact same preprocessing setup in other notebooks, ensuring consistency across training, validation, and evaluation.

In [None]:
# ---- Save artifacts for later notebooks ----
artifacts = {
    "X_train": X_train,
    "X_test": X_test,
    "y_train": y_train,
    "y_test": y_test,
    "preprocessor": preprocessor,
    "numeric_features": numeric_features,
    "categorical_features": categorical_features,
    "target_col": target_col,
}

joblib.dump(artifacts, "../Dataset/preprocessed_artifacts.pkl")

"Saved preprocessed_artifacts.pkl"


'Saved preprocessed_artifacts.pkl'
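
In a later notebook the dictionary can be read back with `joblib.load()`. The round trip below is a sketch that uses a small stand-in dictionary and a temporary file with a hypothetical name, so it does not touch the real `preprocessed_artifacts.pkl`.

```python
import os
import tempfile

import joblib

# Stand-in dictionary; the real one also holds the splits and the fitted preprocessor
toy_artifacts = {
    "numeric_features": ["Interest_Rate"],
    "categorical_features": ["Credit_Mix"],
    "target_col": "Credit_Score",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_artifacts.pkl")  # hypothetical file name
    joblib.dump(toy_artifacts, path)
    restored = joblib.load(path)

print(restored["target_col"])
```

When the dictionary contains fitted scikit-learn objects, it is safest to load it with the same scikit-learn version that was used to save it, and to load pickle files only from trusted sources.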