# **Phase-3:**
**Transforming raw insurance claim data through feature engineering, building and evaluating predictive models, and preparing the best model for deployment**

# **Feature Engineering:**


- Date Parsing: Converted insurance_apply_date and insurance_claimed_date from strings to datetime so that the elapsed time between policy application and claim can be computed.
- Handling Missing Values: Dropped records containing missing values so that every row passed to the models is complete.
- Categorical Encoding: Encoded the categorical variables (region, sex, smoker) as integers using LabelEncoder so that the models can consume them.
- Derived Feature Creation: Engineered a new feature, claim_delay, the number of days between policy application and claim, which may help flag delays or fraud.
- Outlier Treatment: Used z-score filtering to remove records in which any numeric feature (age, children, bmi, bill_amount, claimed_amount, amount_paid, duration) lay three or more standard deviations from its mean, limiting the influence of anomalies on the models.
- Dropping Irrelevant Features: Removed identifier columns (patient_id, full_name, region_code) and the raw date columns, keeping only candidate predictive features.
- Numerical Scaling: Standardized the numeric variables with StandardScaler (zero mean, unit variance). This matters most for scale-sensitive algorithms such as Logistic Regression; tree-based models are largely insensitive to feature scale.
- Final Feature Matrix: After processing, the feature matrix X contains the encoded categorical columns and the numeric columns, with insuranceclaim separated out as the target y.
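
The categorical encoding step can be inspected on a toy example. The sketch below uses made-up category values (they are assumptions, not taken from nhic_data.csv) and keeps each fitted encoder in a hypothetical `encoders` dictionary so that the integer codes can be mapped back to labels later, for instance when building input widgets for deployment.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; the category values are illustrative only
toy = pd.DataFrame({
    "smoker": ["no", "yes", "no"],
    "region": ["southwest", "northeast", "southeast"],
})

encoders = {}
for col in ["smoker", "region"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc  # keep the fitted encoder instead of discarding it

# LabelEncoder assigns integer codes in sorted order of the class labels
mapping = {
    col: {str(label): code for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mapping)
```

Because the codes follow sorted label order, the same mapping has to be applied to any new record before it is scored.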


In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load data
df = pd.read_csv('C:\\Users\\RENUKA\\Downloads\\nhic_data.csv')

# 1. Convert date columns to datetime
df['insurance_apply_date'] = pd.to_datetime(df['insurance_apply_date'])
df['insurance_claimed_date'] = pd.to_datetime(df['insurance_claimed_date'])

# 2. Handle missing values (if any)
df = df.dropna()  # or you can impute, depending on context

# 3. Encode categorical variables
lbl_region = LabelEncoder()
lbl_sex = LabelEncoder()
lbl_smoker = LabelEncoder()
df['region'] = lbl_region.fit_transform(df['region'])
df['sex'] = lbl_sex.fit_transform(df['sex'])
df['smoker'] = lbl_smoker.fit_transform(df['smoker'])

# 4. Create derived features
df['claim_delay'] = (df['insurance_claimed_date'] - df['insurance_apply_date']).dt.days

# 5. Detect and treat outliers in numerical columns with Z-score
numerical_cols = ['age', 'children', 'bmi', 'bill_amount', 'claimed_amount', 'amount_paid', 'duration']
zscores = np.abs((df[numerical_cols] - df[numerical_cols].mean()) / df[numerical_cols].std())
df = df[(zscores < 3).all(axis=1)]

# 6. Drop irrelevant columns
df = df.drop(['patient_id', 'full_name', 'region_code', 'insurance_apply_date', 'insurance_claimed_date'], axis=1)

# 7. Normalize/Scale numerical features
scaler = StandardScaler()
num_feature_cols = ['age', 'children', 'bmi', 'bill_amount', 'claimed_amount', 'amount_paid', 'duration', 'claim_delay']
df[num_feature_cols] = scaler.fit_transform(df[num_feature_cols])

# 8. Ready for model training
X = df.drop(['insuranceclaim'], axis=1)
y = df['insuranceclaim']

# **Column Name Check and Cleanup:**

- Used print(df.columns.tolist()) to display the current column names in the DataFrame and diagnose issues with spaces or typos from source files.
- Applied df.columns = df.columns.str.strip() to remove any leading or trailing whitespace in column names, so that column references in code match the actual names.
- This step is especially important after reading externally sourced CSV files, where hidden spaces in a header can cause a KeyError when selecting or transforming columns.
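
A minimal sketch of the failure mode, using a hypothetical in-memory CSV whose header names are padded with spaces (the data is invented for illustration):

```python
import io
import pandas as pd

# Hypothetical CSV text with stray spaces around the header names
csv_text = " age , insuranceclaim \n35,0\n18,1\n"
toy = pd.read_csv(io.StringIO(csv_text))

raw_columns = toy.columns.tolist()              # padding is preserved on load
has_target_before = "insuranceclaim" in toy.columns

toy.columns = toy.columns.str.strip()           # same cleanup as in the notebook
has_target_after = "insuranceclaim" in toy.columns

print(raw_columns, has_target_before, has_target_after)
```

Before the cleanup, `toy["insuranceclaim"]` would raise a KeyError because the stored name still carries its padding.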

In [None]:
print(df.columns.tolist())

['age', 'children', 'sex', 'region', 'bmi', 'smoker', 'bill_amount', 'insuranceclaim', 'claimed_amount', 'amount_paid', 'duration', 'year_billing', 'claim_delay']


In [None]:
df.columns = df.columns.str.strip()  # Ex: " insuranceclaim "  ----->   "insuranceclaim"

# **Model Training and Selection:**

- Data Preparation: After feature engineering, split the processed data into features (X) and target labels (y), then held out 20% of the rows as a test set (test_size=0.2, random_state=42) to evaluate performance on unseen data.
- Scaling: Fitted StandardScaler on the training split only and applied the same transformation to the test split, so that no test-set statistics leak into training and scale-sensitive algorithms are compared fairly.
- Model Comparison: Trained three classification algorithms (Random Forest, XGBoost, and Logistic Regression) on the standardized training data and evaluated each on accuracy, precision, recall, and F1 score.
- Best Model Selection: Computed all metrics on the test set and selected the model with the highest F1 score, which balances precision and recall, for deployment.
- Serialization: Saved the best-performing model and the fitted scaler together in a .pkl file using Python's pickle module. Other environments, including Streamlit, can then load both objects and score new data without retraining the model or refitting the scaler.
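
As an alternative to pickling a (model, scaler) tuple, scikit-learn's Pipeline can bundle both steps into one object. This is a sketch on synthetic data, not the notebook's own approach; the names `X_demo`, `y_demo`, and `pipe` are hypothetical.

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the insurance feature matrix
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is fitted inside the pipeline, on the training split only
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

# Round-trip through pickle and confirm the restored object behaves identically
restored = pickle.loads(pickle.dumps(pipe))
same_predictions = bool((restored.predict(X_te) == pipe.predict(X_te)).all())
print(same_predictions)
```

With this layout the caller passes raw, unscaled features to `predict`, and the scaling step cannot be skipped by accident.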

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pickle

# Load data (with correct column names)
df = pd.read_csv('C:\\Users\\RENUKA\\Downloads\\nhic_data.csv')
df.columns = df.columns.str.strip()  # Clean column names

# Convert date columns to datetime
df['insurance_apply_date'] = pd.to_datetime(df['insurance_apply_date'])
df['insurance_claimed_date'] = pd.to_datetime(df['insurance_claimed_date'])

# Feature engineering
df['claim_delay'] = (df['insurance_claimed_date'] - df['insurance_apply_date']).dt.days
for col in ['sex', 'region', 'smoker']:
    df[col] = LabelEncoder().fit_transform(df[col])

# Prepare feature matrix and target
X = df.drop(['insuranceclaim', 'patient_id', 'full_name', 'insurance_apply_date', 'insurance_claimed_date', 'region_code'], axis=1)
y = df['insuranceclaim']

# Train/test split and scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Make sure xgboost is installed:
# pip install xgboost OR conda install -c conda-forge xgboost

# Candidate classifiers
models = {
    'Random Forest': RandomForestClassifier(random_state=42),
    # xgboost reports use_label_encoder as unused, so the argument is omitted
    'XGBoost': XGBClassifier(eval_metric='logloss'),
    'Logistic Regression': LogisticRegression(max_iter=1000)
}

# Fit each model on the training split and collect test-set metrics
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1 Score": f1_score(y_test, y_pred)
    }

# Pick the model with the highest test F1 score (ties go to the first entry in models)
best_model_name = max(results, key=lambda k: results[k]['F1 Score'])
best_model = models[best_model_name]

# Persist the model together with the scaler fitted on the training split
with open('final_model.pkl', 'wb') as f:
    pickle.dump((best_model, scaler), f)


In [None]:
!pip install xgboost



# **Model Performance Metrics Table:**
- Compiled the evaluation metrics (Accuracy, Precision, Recall, and F1 Score) for each trained classification model (Random Forest, XGBoost, and Logistic Regression) into a pandas DataFrame for easy comparison.
- Displayed the metrics in tabular format to compare the models side by side on the test set and support the choice of model for deployment.
- The table makes the trade-offs between algorithms explicit, for example XGBoost's slightly lower precision (0.9937) at the same recall of 1.0.

In [None]:
import pandas as pd
results_df = pd.DataFrame(results).T
print(results_df)

                     Accuracy  Precision  Recall  F1 Score
Random Forest        1.000000   1.000000     1.0  1.000000
XGBoost              0.996269   0.993711     1.0  0.996845
Logistic Regression  1.000000   1.000000     1.0  1.000000
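
Scores at or near 1.0 on every metric are unusual for real-world data and are worth a check for target leakage. One possible check is to see whether a single feature predicts the label almost perfectly on its own. The sketch below runs on synthetic data in which a hypothetical claimed_amount column is zero exactly when the label is 0; that relationship is an assumption made for illustration, not a verified property of the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 300
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),                        # unrelated to the target
    "claimed_amount": target * rng.uniform(100, 5000, size=n),  # zero whenever target == 0
})

def single_feature_scores(features, labels):
    """Cross-validated accuracy of a depth-1 tree fitted on each column alone."""
    scores = {}
    for col in features.columns:
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        scores[col] = cross_val_score(stump, features[[col]], labels, cv=5).mean()
    return scores

scores = single_feature_scores(toy, target)
print(scores)
```

A column that scores close to 1.0 by itself is a candidate for review: it may encode the outcome rather than information available before the claim decision.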


# **Best Model Identification and Metrics:**


- Programmatically determined the best-performing classifier by selecting the model with the highest test F1 score.
- Printed both the model's name and its full set of test metrics (accuracy, precision, recall, F1) to document the basis for the selection.
- Random Forest and Logistic Regression tie at an F1 score of 1.0; max() returns the first maximal entry, so Random Forest is selected because it is listed first in the models dictionary.

In [None]:
print("Best Model:", best_model_name)
print("Best Model Metrics:", results[best_model_name])

Best Model: Random Forest
Best Model Metrics: {'Accuracy': 1.0, 'Precision': 1.0, 'Recall': 1.0, 'F1 Score': 1.0}
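
Two models share the top F1 score in the metrics table, so the selection depends on how max() breaks ties. A small sketch with the F1 values copied from that table (the variable names `f1_scores`, `best`, and `tied` are illustrative):

```python
# F1 scores copied from the metrics table
f1_scores = {
    "Random Forest": 1.000000,
    "XGBoost": 0.996845,
    "Logistic Regression": 1.000000,
}

# max() returns the first key whose value is maximal, in dict insertion order
best = max(f1_scores, key=f1_scores.get)

# Listing every model at the top score makes the tie visible
top_score = max(f1_scores.values())
tied = [name for name, score in f1_scores.items() if score == top_score]

print(best)  # Random Forest
print(tied)  # ['Random Forest', 'Logistic Regression']
```

When a tie like this occurs, a secondary criterion such as model simplicity or inference cost may be a more deliberate tie-breaker than dictionary order.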


# **Model and Scaler Loading for Deployment:**


- Loaded the best-performing trained model and the associated StandardScaler from the serialized final_model.pkl file using Python's pickle library.
- Both objects can then be reused for future predictions and deployments (e.g., a Streamlit app) without retraining the model or refitting the scaler; new inputs must still be passed through the loaded scaler's transform before prediction.
- Printing the loaded model confirms that the expected classifier was restored and is ready for inference on new data.

In [None]:
import pickle
with open('final_model.pkl', 'rb') as f:
    model, scaler = pickle.load(f)
print(model)

RandomForestClassifier(random_state=42)
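
A sketch of how a loaded (model, scaler) pair might be used to score one new record. To stay self-contained it trains a small stand-in model on synthetic data and round-trips it through pickle in memory; `demo_model`, `demo_scaler`, and the three-column layout are assumptions, not the notebook's real features.

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in training step on synthetic data so the example runs on its own
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50, scale=10, size=(100, 3))
y_demo = (X_demo[:, 0] > 50).astype(int)
demo_scaler = StandardScaler().fit(X_demo)
demo_model = RandomForestClassifier(random_state=42).fit(demo_scaler.transform(X_demo), y_demo)

# Same (model, scaler) tuple layout as final_model.pkl, kept in memory here
blob = pickle.dumps((demo_model, demo_scaler))

# Inference side: unpack, scale with the stored scaler, then predict
loaded_model, loaded_scaler = pickle.loads(blob)
new_record = np.array([[65.0, 48.0, 52.0]])        # one raw, unscaled row
new_scaled = loaded_scaler.transform(new_record)    # transform only, never refit
proba = loaded_model.predict_proba(new_scaled)[0, 1]
print(round(proba, 2))
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be unpickled.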


# **Feature List Confirmation:**


- Displayed the full list of input (feature) columns used for model training after preprocessing and feature engineering.
- This printout serves as a reference for the exact features and column order that new inputs must follow for reproducible training, evaluation, and deployment (including Streamlit integration).
- It also allows quick verification that identifier and other dropped columns are excluded and that derived variables (like claim_delay) are present.

In [None]:
print(X.columns.tolist())

['age', 'children', 'sex', 'region', 'bmi', 'smoker', 'bill_amount', 'claimed_amount', 'amount_paid', 'duration', 'year_billing', 'claim_delay']
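
The printed list can double as a contract for incoming data. The helper below is a hypothetical sketch (`align_features` is not part of the notebook) that reorders an input frame to the training column order, drops extra columns, and fails loudly when a feature is missing.

```python
import pandas as pd

# Feature order copied from the printed list
FEATURES = ['age', 'children', 'sex', 'region', 'bmi', 'smoker', 'bill_amount',
            'claimed_amount', 'amount_paid', 'duration', 'year_billing', 'claim_delay']

def align_features(frame, expected=FEATURES):
    """Return frame restricted to the expected columns, in training order."""
    missing = [c for c in expected if c not in frame.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing}")
    return frame[list(expected)]

# Hypothetical input with reversed column order and one extra identifier column
row = {name: [0] for name in reversed(FEATURES)}
row["patient_id"] = ["P001"]
aligned = align_features(pd.DataFrame(row))
print(aligned.columns.tolist() == FEATURES)
```

Running such a check before scaler.transform guards against silently scoring columns in the wrong order.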


In [None]:
# X_test is the scaled test feature matrix from the training cell
fraud_prob = best_model.predict_proba(X_test)[:, 1]  # probability of class 1 (insuranceclaim == 1), used here as the fraud score

# Rebuild a DataFrame from the scaled array, restoring the feature names
results_df = pd.DataFrame(X_test, columns=X.columns)
results_df['fraud_probability'] = fraud_prob
results_df['actual_claim'] = y_test.values  # true label, for comparison

# Save or inspect
print(results_df.head())
# Optionally, save to CSV:
results_df.to_csv("insurance_test_with_fraud_prob.csv", index=False)

        age  children       sex    region       bmi    smoker  bill_amount  \
0  0.568178  1.603053  1.007505 -1.350761 -0.142996  1.976931     1.006389   
1 -1.137352 -0.892485 -0.992551  1.350761 -0.417860 -0.505835    -0.871717   
2  0.994560  1.603053 -0.992551 -1.350761 -0.421132 -0.505835    -0.120153   
3  1.065624  1.603053 -0.992551  1.350761 -1.505864 -0.505835    -0.059240   
4  0.639241 -0.060639  1.007505  1.350761  0.981003 -0.505835     1.268720   

   claimed_amount  amount_paid  duration  year_billing  claim_delay  \
0        1.170329    -0.411173 -0.754181     -0.333887    -0.754181   
1       -0.515613    -0.726613 -1.199331     -0.333887    -1.199331   
2        0.175844    -0.636885  0.084357      0.862960     0.084357   
3       -0.711943     1.422050 -0.236565      1.461383    -0.236565   
4        1.431031    -0.421921 -0.971580      0.264537    -0.971580   

   fraud_probability  actual_claim  
0               1.00             1  
1               1.00          



# **Probability Score and Data Export**



In [None]:
# Step 1: Prepare your full feature matrix (exclude target and identifiers)
X_full = df.drop(['insuranceclaim', 'patient_id', 'full_name', 'insurance_apply_date', 'insurance_claimed_date', 'region_code'], axis=1)

# Step 2: Scale the features using the trained scaler
X_full_scaled = scaler.transform(X_full)

# Step 3: Predict the class-1 probability for each record
# Note: this includes rows the model was trained on, so these scores are optimistic
fraud_prob = best_model.predict_proba(X_full_scaled)[:, 1]  # probability of class 1 (insuranceclaim == 1)

# Step 4: Construct output dataframe
results_full = X_full.copy()
results_full['fraud_probability'] = fraud_prob
results_full['actual_claim'] = df['insuranceclaim'].values

# Step 5: Save to CSV
results_full.to_csv("insurance_fraud_prob_full.csv", index=False)

print(results_full.head())

   age  children  sex  region     bmi  smoker  bill_amount  claimed_amount  \
0   35         2    0       2  38.095       0     24915.05            0.00   
1   18         0    1       2  28.500       0      1712.23         1606.95   
2   48         0    0       0  28.900       0      8277.52         7547.65   
3   18         0    1       1  53.130       0      1163.46         1054.89   
4   22         1    1       1  52.580       1     44501.40        40097.64   

   amount_paid  duration  year_billing  claim_delay  fraud_probability  \
0     24915.05       362          2019          362               0.00   
1       105.28       254          2020          254               1.00   
2       729.87       187          2022          187               1.00   
3       108.57       181          2022          181               1.00   
4      4403.76       158          2023          158               0.98   

   actual_claim  
0             0  
1             1  
2             1  
3             
