# PROBLEM STATEMENT 

Anova Insurance, a global health insurance company, seeks to optimize its insurance policy premium pricing based on the health status of applicants. Understanding an applicant's health condition is crucial for two key decisions:
- Determining eligibility for health insurance coverage.
- Deciding on premium rates, particularly if the applicant's health indicates higher risks.

Your objective is to develop a predictive model that uses health data to classify individuals as 'healthy' or 'unhealthy'. This classification will support informed decisions about insurance policy premium pricing.

# OVERVIEW 

The original dataset (before preprocessing) contains 10,000 rows and 20 columns. After preprocessing, the number of columns becomes 23 because of encoding, and the file loaded in this notebook has 9,549 rows. The 23 columns include both numerical and categorical variables. Here is the data dictionary.

- Age: Represents the age of the individual. Negative values seem to be present, which might indicate data entry errors or a specific encoding used for certain age groups.

- BMI (Body Mass Index): A measure of body fat based on height and weight. Typically, a BMI between 18.5 and 24.9 is considered normal.

- Blood_Pressure: Represents systolic blood pressure. Normal blood pressure is usually around 120/80 mmHg.

- Cholesterol: This is the cholesterol level in mg/dL. Desirable levels are usually below 200 mg/dL.

- Glucose_Level: Indicates blood glucose levels. It might be fasting glucose levels, with normal levels usually ranging from 70 to 99 mg/dL.

- Heart_Rate: The number of heartbeats per minute. Normal resting heart rate for adults ranges from 60 to 100 beats per minute.

- Sleep_Hours: The average number of hours the individual sleeps per day.

- Exercise_Hours: The average number of hours the individual exercises per day. 

- Water_Intake: The average daily water intake in liters.

- Stress_Level: A numerical representation of stress level.

- Target: This is a binary outcome variable, with '1' indicating 'Unhealthy' and '0' indicating 'Healthy'.

- Smoking: A categorical variable indicating smoking status. Contains values (0, 1, 2) which specify the regularity of smoking, with 0 being no smoking and 2 being regular smoking.

- Alcohol: A categorical variable indicating alcohol consumption status. Contains values (0, 1, 2) which specify the regularity of alcohol consumption, with 0 being no consumption and 2 being regular consumption.

- Diet: A categorical variable indicating the quality of dietary habits. Contains values (0, 1, 2) which specify the quality of the diet, with 0 being poor quality and 2 being good quality.

- MentalHealth: Possibly a measure of mental health status. Contains values (0, 1, 2) which specify the severity of mental health issues, with 0 being fine and 2 being highly severe.

- PhysicalActivity: A categorical variable indicating levels of physical activity. Contains values (0, 1, 2) which specify the level of physical activity, with 0 being no physical activity and 2 being regularly active.

- MedicalHistory: Indicates the presence of medical conditions or history. Contains values - (0,1,2) which specify the severity of the medical history with 0 being nothing and 2 being highly severe.

- Allergies: A categorical variable indicating allergy status. Contains values - (0,1,2) which specify the severity of the allergies with 0 being nothing and 2 being highly severe.

- Diet_Type: A categorical variable indicating the type of diet an individual follows. Contains values (Vegetarian, Non-Vegetarian, Vegan).
 - This column was encoded during the preprocessing stage; the preprocessed dataset contains the columns Diet_Type__Vegan and Diet_Type__Vegetarian.

- Blood_Group: Indicates the blood group of the individual. Contains values (A, B, AB, O). This column is encoded too; the preprocessed dataset contains the columns Blood_Group_AB, Blood_Group_B and Blood_Group_O.

It is clear from the above description that the target (response) variable is the 'Target' column; all other columns are predictors.
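The encoded column names in the data dictionary are consistent with one-hot encoding in which one category per variable is dropped as a baseline. The preprocessing code is not part of this notebook, so the following is only a sketch under that assumption; the helper `one_hot_drop_first` and the sample rows are hypothetical.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode `values`, dropping the alphabetically first category.

    The dropped category is represented by all indicator columns being False.
    """
    categories = sorted(set(values))
    kept = categories[1:]  # first category becomes the implicit baseline
    return [
        {f"{prefix}_{category}": (value == category) for category in kept}
        for value in values
    ]


# Hypothetical sample rows, not taken from the dataset
blood_groups = ["AB", "A", "O", "B"]
encoded = one_hot_drop_first(blood_groups, "Blood_Group")

for original, row in zip(blood_groups, encoded):
    print(original, row)
```

With four blood groups this yields three indicator columns, and a row with blood group A has all three set to False.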


# -----------------------------------------------------------------------------

## Guidelines to follow in this notebook 
- The name of the dataframe should be df 
- Keep the seed value 42
- Names of training and testing variables should be X_train, X_test, y_train, y_test
- Keep the name of the model instance as "model", e.g. model = DecisionTreeClassifier()
- Keep the predictions on training and testing data in variables named y_train_pred and y_test_pred respectively.

# -------------------------------------------------------------------------------


Let us begin with importing the necessary libraries.

In [1]:

# Install the third-party boosting libraries (only needed once per environment)
!pip install xgboost
!pip install lightgbm
!pip install catboost

# Data handling and plotting
import pandas as pd
import matplotlib.pyplot as plt
from collections import OrderedDict

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# scikit-learn utilities, models and metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import f1_score

# Third-party gradient boosting libraries
import xgboost as xgb
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier




## Load the dataset

In [2]:
#Load the dataset
df = pd.read_csv('Healthcare_Dataset_Preprocessednew.csv')
df.head()

Unnamed: 0,Age,BMI,Blood_Pressure,Cholesterol,Glucose_Level,Heart_Rate,Sleep_Hours,Exercise_Hours,Water_Intake,Stress_Level,Target,Smoking,Alcohol,Diet,MentalHealth,PhysicalActivity,MedicalHistory,Allergies,Diet_Type__Vegan,Diet_Type__Vegetarian,Blood_Group_AB,Blood_Group_B,Blood_Group_O
0,2.0,26.0,111.0,198.0,99.0,72.0,4.0,1.0,5.0,5.0,1,2,2,1,2,1,0,1,False,True,True,False,False
1,8.0,24.0,121.0,199.0,103.0,75.0,2.0,1.0,2.0,9.0,1,0,1,1,2,1,2,2,False,False,True,False,False
2,81.0,27.0,147.0,203.0,100.0,74.0,10.0,-0.0,5.0,1.0,0,2,1,2,0,0,1,0,True,False,False,False,False
3,25.0,21.0,150.0,199.0,102.0,70.0,7.0,3.0,3.0,3.0,0,2,0,1,2,1,2,0,True,False,False,True,False
4,24.0,26.0,146.0,202.0,99.0,76.0,10.0,2.0,5.0,1.0,0,0,1,2,0,2,0,2,False,True,False,True,False


In [3]:
#shape of data
df.shape


(9549, 23)

In [4]:
# Column names in the dataset
df.columns

Index(['Age', 'BMI', 'Blood_Pressure', 'Cholesterol', 'Glucose_Level', 'Heart_Rate', 'Sleep_Hours', 'Exercise_Hours', 'Water_Intake', 'Stress_Level', 'Target', 'Smoking', 'Alcohol', 'Diet', 'MentalHealth', 'PhysicalActivity', 'MedicalHistory', 'Allergies', 'Diet_Type__Vegan', 'Diet_Type__Vegetarian', 'Blood_Group_AB', 'Blood_Group_B', 'Blood_Group_O'], dtype='object')

# Separate the independent features into a dataframe 'X' and the target into a variable 'y'


In [5]:
# your code here
X = df.drop(['Target'], axis=1)
y = df['Target']

In [6]:
X.shape , y.shape

((9549, 22), (9549,))

# Splitting Dataset into Train and Test Sets


In [7]:
# Splitting the dataset into training and testing sets keeping the test size as 25% and seed value as 42

# your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
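As a quick sanity check on the split, the expected sizes can be worked out by hand. The sketch below assumes scikit-learn's behaviour of rounding the test share up when `test_size` is a fraction; the variable names are illustrative only.

```python
import math

n_samples = 9549   # number of rows reported by df.shape
test_size = 0.25   # fraction passed to train_test_split

# The test share is rounded up; the remaining rows go to training
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_train, n_test)  # 7161 2388
```

The training size of 7161 agrees with the number of data points that LightGBM reports in its training log later in this notebook.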

# Create a simple Decision tree 

The first step is to train a model using a simple decision tree.

In [8]:
# Initialize a simple Decision Tree classifier with depth 15 and seed 42. Name it 'model' and then fit it 
# your code here
model = DecisionTreeClassifier(max_depth=15, random_state=42)

In [9]:
## Begin hidden test
assert model.max_depth == 15, "Max_depth is not set to 15"
## End hidden test

In [10]:
# your code here
model.fit(X_train, y_train.values.ravel())

In [11]:
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Evaluate model performance

After creating the model and getting predictions on the test set, calculate the F1 score to evaluate the model's performance.

In [12]:
from sklearn.metrics import f1_score

# your code here
f1_score(y_test,y_test_pred)

0.903097696584591
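For reference, the F1 score is the harmonic mean of precision and recall for the positive class (here, '1' = 'Unhealthy'). The following is a minimal plain-Python sketch of that definition; `f1_from_labels` and the toy labels are hypothetical and are not the notebook's data.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute the binary F1 score for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 3 true positives, 1 false positive, 1 false negative
y_true_toy = [1, 1, 1, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 1]
print(f1_from_labels(y_true_toy, y_pred_toy))  # 0.75
```

Because F1 ignores true negatives, it is computed with respect to one class only; `f1_score` uses label 1 as the positive class by default for binary targets.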

# APPLY ADABOOST ALGORITHM FOR TRAINING 

After creating a simple decision tree, it's now time to create a classifier using AdaBoostClassifier.

In [13]:

# your code here
model = AdaBoostClassifier(random_state=42)
# Train the AdaBoost model
model.fit(X_train, y_train.values.ravel())
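To illustrate what boosting does between rounds, here is a sketch of a single weight update in the textbook discrete AdaBoost formulation, with labels coded as -1/+1. It is not a description of scikit-learn's internals, which differ in detail; the function `adaboost_round` and the toy labels are hypothetical.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One round of discrete AdaBoost: return (alpha, new normalized weights).

    Assumes labels in {-1, +1} and a weighted error strictly between 0 and 1.
    """
    total = sum(weights)
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    alpha = 0.5 * math.log((1 - error) / error)  # weight of this weak learner
    # Misclassified samples (t * p == -1) are up-weighted, correct ones down-weighted
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]


# Toy example: five samples, the weak learner misclassifies the second one
y_true_toy = [1, 1, -1, -1, 1]
y_pred_toy = [1, -1, -1, -1, 1]
start_weights = [0.2] * 5

alpha, new_weights = adaboost_round(start_weights, y_true_toy, y_pred_toy)
print(round(alpha, 4), [round(w, 3) for w in new_weights])
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.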

After training the AdaBoostClassifier model, predict on the training and test data and calculate the F1 score for evaluation.

In [14]:
# Predict on the training and test data
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

In [15]:
# your code here
# Score the AdaBoost predictions stored in y_test_pred by the previous cell
f1_score(y_test, y_test_pred)

# APPLY GRADIENT BOOSTING 

### Now use GradientBoostingClassifier for training the model 


In [16]:
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train.values.ravel())
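The core idea of gradient boosting is that each new weak learner is fitted to the errors left by the current ensemble. The sketch below shows this for the simplest case, squared-error regression on a single feature with one-split stumps. GradientBoostingClassifier itself uses decision trees and a classification loss, so this is an illustration of the principle only; `fit_stump`, `gradient_boost` and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump on a 1-D feature (needs >= 2 distinct x values)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Boost stumps on residuals (the negative gradient of squared error)."""
    base = sum(ys) / len(ys)          # initial constant prediction
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Toy step function: the target jumps from 0 to 1 between x = 3 and x = 4
xs_toy = [1, 2, 3, 4, 5, 6]
ys_toy = [0, 0, 0, 1, 1, 1]
predict = gradient_boost(xs_toy, ys_toy)
print(round(predict(2), 4), round(predict(5), 4))
```

Each round shrinks the remaining residual by the learning rate, which is why the number of rounds and the learning rate are usually tuned together.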

## Evaluation

Predict on the training and testing data using GradientBoostingClassifier and then calculate the F1 score.

In [17]:
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

In [18]:
# Score the gradient boosting predictions stored in y_test_pred by the previous cell
f1_score(y_test, y_test_pred)

# APPLY XGBOOST 

### Here, use XGBClassifier to train the model.


In [19]:
from xgboost import XGBClassifier

model = XGBClassifier(random_state=42)
model.fit(X_train, y_train.values.ravel())

In [20]:
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

In [21]:
# Score the XGBoost predictions stored in y_test_pred by the previous cell
f1_score(y_test, y_test_pred)

# APPLY LIGHTGBM 

### TRAIN MODEL WITH LGBMClassifier this time 

In [22]:
from lightgbm import LGBMClassifier

model = LGBMClassifier(random_state=42)
model.fit(X_train, y_train.values.ravel())

[LightGBM] [Info] Number of positive: 3709, number of negative: 3452
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000266 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 396
[LightGBM] [Info] Number of data points in the train set: 7161, number of used features: 22
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.517944 -> initscore=0.071809
[LightGBM] [Info] Start training from score 0.071809


In [23]:
# Predict on the training and testing data
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)


In [24]:
f1_score(y_test,y_test_pred)

0.9511811023622048

# APPLY CATBOOST 

### Train model with CatBoostClassifier 

In [25]:
from catboost import CatBoostClassifier

model = CatBoostClassifier(random_state=42)

model.fit(X_train, y_train.values.ravel())

Learning rate set to 0.023879
0:	learn: 0.6749614	total: 156ms	remaining: 2m 35s
1:	learn: 0.6560950	total: 168ms	remaining: 1m 24s
2:	learn: 0.6380358	total: 180ms	remaining: 59.8s
3:	learn: 0.6204795	total: 185ms	remaining: 46s
4:	learn: 0.6054240	total: 190ms	remaining: 37.9s
5:	learn: 0.5910817	total: 195ms	remaining: 32.3s
6:	learn: 0.5759772	total: 199ms	remaining: 28.2s
7:	learn: 0.5627926	total: 202ms	remaining: 25s
8:	learn: 0.5511314	total: 206ms	remaining: 22.7s
9:	learn: 0.5395501	total: 210ms	remaining: 20.8s
10:	learn: 0.5284412	total: 215ms	remaining: 19.3s
11:	learn: 0.5172321	total: 219ms	remaining: 18.1s
12:	learn: 0.5091017	total: 225ms	remaining: 17.1s
13:	learn: 0.5001794	total: 228ms	remaining: 16.1s
14:	learn: 0.4927348	total: 232ms	remaining: 15.3s
15:	learn: 0.4844060	total: 237ms	remaining: 14.6s
16:	learn: 0.4759033	total: 241ms	remaining: 14s
17:	learn: 0.4676664	total: 245ms	remaining: 13.4s
18:	learn: 0.4600816	total: 249ms	remaining: 12.9s
19:	learn: 0.45

<catboost.core.CatBoostClassifier at 0x1d4913d4cd0>

In [26]:
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

In [27]:
f1_score(y_test,y_test_pred)

0.9569680221081721

The performance of the various boosting techniques depends on their hyperparameters and on the dataset provided for training. On this dataset, LightGBM (F1 ≈ 0.951) and CatBoost (F1 ≈ 0.957) both scored higher on the test set than the single decision tree (F1 ≈ 0.903).
Gradient boosting reduces error by fitting each new learner to the errors of the current ensemble and, unlike AdaBoost, it can optimize any differentiable loss function. XGBoost is in turn an evolved version of gradient boosting; its versatility and ability to handle large datasets make it popular. CatBoost is popular for its native handling of categorical variables, which do not need to be encoded beforehand.
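When several models are compared in sequence, wrapping the fit, predict and score steps in one helper keeps each score tied to the model that produced it and also exposes the gap between training and test performance. The sketch below relies only on the `fit`/`predict` interface; `compare_models`, `f1_binary` and the `MajorityClassifier` stand-in are hypothetical names used for illustration, and the toy data is not the notebook's dataset.

```python
def f1_binary(y_true, y_pred, positive=1):
    """Binary F1 score for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


class MajorityClassifier:
    """Hypothetical stand-in exposing the same fit/predict interface as the models above."""

    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def compare_models(models, X_train, y_train, X_test, y_test, metric):
    """Fit each model and return its train and test scores, keyed by model name."""
    results = {}
    for name, estimator in models.items():
        estimator.fit(X_train, y_train)
        results[name] = {
            "train": metric(y_train, estimator.predict(X_train)),
            "test": metric(y_test, estimator.predict(X_test)),
        }
    return results


# Toy data for demonstration only
X_tr, y_tr = [[0], [1], [2], [3]], [1, 1, 1, 0]
X_te, y_te = [[4], [5]], [1, 0]
results = compare_models({"majority": MajorityClassifier()}, X_tr, y_tr, X_te, y_te, f1_binary)
print(results)
```

In this notebook the same helper could presumably be called with a dictionary of the classifiers trained above and `f1_score` as the metric, since those estimators follow the same `fit`/`predict` convention.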


