<h2>Overview</h2>

This script preprocesses insurance data and uses PyCaret’s AutoML capabilities for classification. It incorporates cross-validation, GPU acceleration, and a pre-trained model to generate predictions for submission.

<h2>1.	Library Imports</h2>

Purpose: 
- Import libraries for data handling, PyCaret setup, and classification tasks.

In [None]:
# Import only the PyCaret functions this notebook actually calls
from pycaret.classification import setup, load_model, predict_model
import pandas as pd
import numpy as np

<h2>2.	Data Loading</h2>

Purpose: 
- Load training and test datasets.

In [None]:
# Load the data
train_data = pd.read_csv('/kaggle/input/playground-series-s4e7/train.csv')
test_data = pd.read_csv('/kaggle/input/playground-series-s4e7/test.csv')

X_train = train_data.drop(['id'], axis = 1)
X_test = test_data.drop(['id'], axis = 1)

<h2>3.	Data Preprocessing</h2>

Purpose:
- Remove ID column.
- Cast Region_Code to int8 and encode the categorical columns (Gender, Vehicle_Age, Vehicle_Damage) as integers.
- Apply a natural-log transform to the Annual_Premium column to reduce skewness.

In [None]:
X_train['Region_Code'] = X_train['Region_Code'].astype('int8')
X_test['Region_Code'] = X_test['Region_Code'].astype('int8')

X_train['Gender'] = X_train['Gender'].map({'Male': 1,'Female': 0}).astype('int8')
X_test['Gender'] = X_test['Gender'].map({'Male': 1,'Female': 0}).astype('int8')

X_train['Vehicle_Age'] = X_train['Vehicle_Age'].map({'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2}).astype('int8')
X_test['Vehicle_Age'] = X_test['Vehicle_Age'].map({'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2}).astype('int8')

X_train['Vehicle_Damage'] = X_train['Vehicle_Damage'].map({'Yes': 1, 'No': 0}).astype('int8')
X_test['Vehicle_Damage'] = X_test['Vehicle_Damage'].map({'Yes': 1, 'No': 0}).astype('int8')

X_train['Annual_Premium'] = np.log(X_train['Annual_Premium'])
X_test['Annual_Premium'] = np.log(X_test['Annual_Premium'])
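The cell above repeats each transformation for the training and test frames. A minimal sketch of the same steps as one reusable function follows. The helper name `encode_features`, the `CATEGORY_MAPS` dictionary, and the toy rows are hypothetical and not part of the original notebook; the check for unmapped labels is an added safeguard, since `.map()` returns NaN for any label missing from the mapping.

```python
import numpy as np
import pandas as pd

# Category-to-integer mappings copied from the preprocessing cell.
CATEGORY_MAPS = {
    "Gender": {"Male": 1, "Female": 0},
    "Vehicle_Age": {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2},
    "Vehicle_Damage": {"Yes": 1, "No": 0},
}


def encode_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return an encoded copy of df (hypothetical helper, not in the notebook)."""
    out = df.copy()
    out["Region_Code"] = out["Region_Code"].astype("int8")
    for column, mapping in CATEGORY_MAPS.items():
        encoded = out[column].map(mapping)
        # .map() yields NaN for labels missing from the mapping; fail loudly
        # here rather than letting the int8 cast raise a less helpful error.
        if encoded.isna().any():
            unknown = sorted(out.loc[encoded.isna(), column].astype(str).unique())
            raise ValueError(f"Unmapped values in {column}: {unknown}")
        out[column] = encoded.astype("int8")
    # The natural log compresses the right tail; it needs strictly positive values.
    out["Annual_Premium"] = np.log(out["Annual_Premium"])
    return out


# Toy rows for illustration only; not taken from the competition data.
toy = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Region_Code": [28.0, 3.0],
    "Vehicle_Age": ["1-2 Year", "> 2 Years"],
    "Vehicle_Damage": ["Yes", "No"],
    "Annual_Premium": [2630.0, 40454.0],
})
encoded_toy = encode_features(toy)
print(encoded_toy.dtypes)
```

Applying one function to both frames keeps the training and test encodings identical by construction.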

<h2>4.	PyCaret Setup</h2>

Purpose: Configure PyCaret’s AutoML setup for the classification task, including:
- Defining features.
- Enabling preprocessing.
- Setting up cross-validation.

In [None]:
s = setup(
    X_train, 
    target='Response', 
    session_id=123,
    train_size = 0.85,
    numeric_features=['Age','Annual_Premium', 'Vintage', 'Vehicle_Age', 'Driving_License', 'Previously_Insured', 'Vehicle_Damage'],
    categorical_features=['Region_Code', 'Policy_Sales_Channel'],
    fold = 5,  # number of cross-validation folds
    preprocess = True,
    use_gpu = True,
    )
        

[LightGBM] [Info] Number of positive: 1, number of negative: 1
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 2, number of used features: 0
[LightGBM] [Info] Using GPU Device: Tesla T4, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 16 bins...




[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000

Description,Value
Session id,123
Target,Response
Target type,Binary
Original data shape,"(11504798, 11)"
Transformed data shape,"(11504798, 11)"
Transformed train set shape,"(9779078, 11)"
Transformed test set shape,"(1725720, 11)"
Numeric features,7
Categorical features,2
Preprocess,True
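The train and hold-out shapes reported above follow from `train_size = 0.85`. The sketch below reproduces that arithmetic and estimates the size of each of the 5 validation folds. It is plain arithmetic, not a call into PyCaret; the rounding rule (floor for the training part) is an assumption that happens to match the reported shapes, and stratified folds may differ from these sizes by a row or so.

```python
import math

n_rows = 11_504_798   # "Original data shape" reported by setup()
train_size = 0.85     # train_size passed to setup()
n_folds = 5           # fold passed to setup()

# Assumed rounding: floor for the training part, remainder for the hold-out part.
n_train = math.floor(n_rows * train_size)
n_holdout = n_rows - n_train

# Spread the training rows over n_folds validation folds whose sizes
# differ by at most one row.
fold_sizes = [
    n_train // n_folds + (1 if i < n_train % n_folds else 0)
    for i in range(n_folds)
]

print("train rows:  ", n_train)
print("hold-out rows:", n_holdout)
print("fold sizes:  ", fold_sizes)
```

Each model in cross-validation is therefore fitted on roughly four fifths of the training rows and validated on the remaining fifth.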




<h2>5.	Load Pre-trained Model (see automl.ipynb)</h2>

Purpose: 
- Load a pre-trained and hyperparameter-tuned XGBoost model.

In [None]:
loaded_model = load_model('/kaggle/input/hyper-parameter-tuned/xgboost_5fold_training85_third_hypprm')

Transformation Pipeline and Model Successfully Loaded


<h2>6.	Prediction</h2>

Purpose: 
- Predict responses for the test set and count the rows whose prediction score falls below 0.1.

In [None]:
# prediction = predict_model(finalized, data = X_test)
# Score the test set with the loaded pipeline
prediction = predict_model(loaded_model, data = X_test)
submission = pd.DataFrame(prediction)
# Count the rows whose prediction score is below 0.1
temp = submission[submission['prediction_score'] < 0.1]
len(temp)

<h2>7.	Final Submission</h2>

Purpose: 
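- Create the Kaggle submission file, converting each prediction score into the probability of the positive class (Response = 1).

In [None]:
final_submission = test_data[['id']].copy()  # copy avoids SettingWithCopyWarning
# prediction_score is the probability of the predicted label; convert it to P(Response = 1)
is_positive = submission['prediction_label'] == 1
final_submission['Response'] = np.where(is_positive, submission['prediction_score'], 1 - submission['prediction_score'])
final_submission.to_csv('/kaggle/working/submission.csv', index=False)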
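PyCaret's `predict_model` reports `prediction_score` as the probability of the predicted label, not of the positive class, so `1 - prediction_score` equals P(Response = 1) only for rows predicted as 0. A minimal sketch of the conversion on toy values follows; the helper name `positive_class_probability` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def positive_class_probability(pred: pd.DataFrame) -> np.ndarray:
    """Convert (prediction_label, prediction_score) pairs into P(label == 1)."""
    label = pred["prediction_label"].to_numpy()
    score = pred["prediction_score"].to_numpy()
    # Keep the score when the predicted label is 1, flip it when it is 0.
    return np.where(label == 1, score, 1.0 - score)


# Toy predictions for illustration only.
toy_pred = pd.DataFrame({
    "prediction_label": [0, 1, 0],
    "prediction_score": [0.95, 0.80, 0.60],
})
probs = positive_class_probability(toy_pred)
print(probs)
```

`predict_model` also accepts `raw_score=True`, which returns a score for every label and can make this conversion unnecessary.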