# Problem Statement
Anova Insurance, a global health insurance company, seeks to optimize its insurance policy premium pricing based on the health status of applicants. Understanding an applicant's health condition is crucial for two key decisions:
- Determining eligibility for health insurance coverage.
- Deciding on premium rates, particularly if the applicant's health indicates higher risks.

Your objective is to develop a predictive model that uses health data to classify individuals as 'healthy' or 'unhealthy'. This classification will assist in making informed decisions about insurance policy premium pricing.

# Dataset Overview
The dataset contains 10,000 rows and 20 columns, including both numerical and categorical variables. Some columns have missing values, especially for older individuals, reflecting the scenario where certain health records may not be up-to-date. Here is the data dictionary.

- Age: Represents the age of the individual. Negative values seem to be present, which might indicate data entry errors or a specific encoding used for certain age groups.

- BMI (Body Mass Index): A measure of body fat based on height and weight. Typically, a BMI between 18.5 and 24.9 is considered normal.

- Blood_Pressure: Represents systolic blood pressure. Normal blood pressure is usually around 120/80 mmHg, where the first number (120) is the systolic reading.

- Cholesterol: This is the cholesterol level in mg/dL. Desirable levels are usually below 200 mg/dL.

- Glucose_Level: Indicates blood glucose levels. It might be fasting glucose levels, with normal levels usually ranging from 70 to 99 mg/dL.

- Heart_Rate: The number of heartbeats per minute. Normal resting heart rate for adults ranges from 60 to 100 beats per minute.

- Sleep_Hours: The average number of hours the individual sleeps per day.

- Exercise_Hours: The average number of hours the individual exercises per day. 

- Water_Intake: The average daily water intake in liters.

- Stress_Level: A numerical representation of stress level.

- Target: This is a binary outcome variable, with '1' indicating 'Unhealthy' and '0' indicating 'Healthy'.

- Smoking: A categorical variable indicating smoking status. Contains values - (0,1,2) which specify the regularity of smoking with 0 being no smoking and 2 being regular smoking.

- Alcohol: A categorical variable indicating alcohol consumption status. Contains values - (0,1,2) which specify the regularity of alcohol consumption with 0 being no consumption and 2 being regular consumption.

- Diet: A categorical variable indicating the quality of dietary habits. Contains values - (0,1,2) which specify the quality of the diet with 0 being poor quality and 2 being good quality.

- MentalHealth: Possibly a measure of mental health status. Contains values - (0,1,2) which specify the severity of mental health issues with 0 being fine and 2 being highly severe.

- PhysicalActivity: A categorical variable indicating levels of physical activity. Contains values - (0,1,2) which specify the intensity of physical activity with 0 being no physical activity and 2 being regularly active.

- MedicalHistory: Indicates the presence of medical conditions or history. Contains values - (0,1,2) which specify the severity of the medical history with 0 being nothing and 2 being highly severe.

- Allergies: A categorical variable indicating allergy status. Contains values - (0,1,2) which specify the severity of the allergies with 0 being nothing and 2 being highly severe.

- Diet_Type: A categorical variable indicating the type of diet an individual follows. Contains values (Vegetarian, Non-Vegetarian, Vegan).

- Blood_Group: Indicates the blood group of the individual. Contains values (A, B, AB, O).

It is clear from the above description that the target variable is the 'Target' column and the remaining columns are the predictors. This time we will build a decision tree model.

Let us begin by importing the necessary libraries and reading the preprocessed data.

In [None]:
# Necessary library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

In [None]:
# Load the dataset
df = pd.read_csv('Healthcare_Dataset_Preprocessed.csv')
df.head()

Now that we have a preprocessed dataset, let's separate the feature variables from the target variable. Let the target variable be 'y' and the independent variables be 'X'.

In [None]:
# Separating the independent and target variables
# your code here
raise NotImplementedError

In [None]:
# Ignore this cell
assert X.shape == (9549, 22), 'Make sure you store the Target variable in a separate variable named y and the independent variables in X.'
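
As a minimal sketch of this kind of separation, here is the idea on a made-up frame. Only the 'Target' column name comes from this notebook; the values and the `toy_` names are hypothetical and are not the assignment data.

```python
import pandas as pd

# Hypothetical toy frame; only the 'Target' column name comes from the notebook
toy_df = pd.DataFrame({
    "Age": [34, 51, 29, 62],
    "BMI": [22.1, 30.4, 19.8, 27.5],
    "Target": [0, 1, 0, 1],
})

# Features: every column except the target
toy_X = toy_df.drop(columns=["Target"])
# Labels: the target column on its own
toy_y = toy_df["Target"]

print(toy_X.shape, toy_y.shape)
```

Dropping the target by name keeps the feature frame correct even if the column order of the source file changes.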

<h3>Performing train-test split</h3>

Now, let's split the data in a 70:30 ratio. Use random_state = 42 and the variable names X_train, X_test, y_train, y_test.

In [None]:
# Split the data into train and test
# Write your code below
# your code here
raise NotImplementedError

In [None]:
# Ignore this cell
assert X_train.shape == (6684, 22), 'Make sure to use the test size as 0.3, random_state = 42, and split the data correctly'

With that we are done with the train-test split. Let us have a quick look at the shape of the data.

In [None]:
# A quick look at the shape of the datasets
X_train.shape, X_test.shape, y_train.shape, y_test.shape
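
To illustrate how a 70:30 split behaves, the sketch below runs `train_test_split` on randomly generated stand-in data. The array sizes and the `toy_` names are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features, binary labels
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 3))
toy_y = rng.integers(0, 2, size=100)

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42
)

print(toy_X_train.shape, toy_X_test.shape)
```

Fixing `random_state` makes the split reproducible, which is why shape checks like the one above can expect exact row counts.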

<h3>Building the model</h3>

Now, let us create a decision tree classifier instance. We will set the random state to be 42 and save the instance in a variable named "DT_model".

In [None]:
# Creating a decision tree classifier object called DT_model
DT_model = DecisionTreeClassifier(random_state = 42)

Now fit the model - DT_model with X_train and y_train.

In [None]:
# Building the model using the training data
# your code here
raise NotImplementedError

Now that our model is built, let's perform prediction on both the train and test data.
This will help us compare the model's performance on the train and test data.

In [None]:
# Performing prediction on both the train and test data
y_pred_train = DT_model.predict(X_train)
y_pred_test = DT_model.predict(X_test)

<h3>Evaluating the model</h3>

Now let us get the F1 scores for both train and test data. Store the F1 score for the train data in 'f1_train' and the F1 score for the test data in 'f1_test'.

In [None]:
# Get the train and test score 
# your code here
raise NotImplementedError

In [None]:
assert f1_train == 1.0, 'Make sure you have followed the conditions stated to train the model and predict the output'

As you can see, the model's performance on the train data is exceptional: the F1 score on the train data is 100%. The score on the test data is lower, though, which points to an overfitted model.
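
The same pattern can be reproduced on synthetic data. The sketch below is an illustration only: it uses `make_classification` with some label noise as a stand-in, and the `toy_` names are hypothetical. An unrestricted tree keeps splitting until its training leaves are pure, so it scores perfectly on the data it was fitted on.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y), not the insurance data
toy_X, toy_y = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42
)

# No max_depth: the tree grows until every training leaf is pure
toy_tree = DecisionTreeClassifier(random_state=42)
toy_tree.fit(toy_X_train, toy_y_train)

# Score the fitted tree on the data it has seen and on the held-out data
toy_f1_train = f1_score(toy_y_train, toy_tree.predict(toy_X_train))
toy_f1_test = f1_score(toy_y_test, toy_tree.predict(toy_X_test))
print(f"train F1 = {toy_f1_train:.3f}, test F1 = {toy_f1_test:.3f}")
```

A gap between the two numbers is the usual sign that the tree has memorized noise in the training split.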

<h3>Calculating feature importance</h3>

To possibly counter this issue, Anova Insurance wants to find the important features and get rid of the irrelevant ones. You must store the importance values in a dataframe named 'imp', with one row per feature and the scores in a column named 'Importance'.

In [None]:
# Retrieving the importance score of each feature
feature_imp = DT_model.feature_importances_

#  Sort the features by importance in descending order in a dataframe.

# your code here
raise NotImplementedError

# Calculate the cumulative sum of the 'Importance' column and store it in a new column called 'cum_imp'
imp['cum_imp'] = imp.Importance.cumsum()
imp

In [None]:
assert imp.shape == (22,3), 'Make sure the feature importance is saved in a variable named imp and sorted in descending order of importance'
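
One possible shape for such a table is sketched below on synthetic data. The 'Importance' and 'cum_imp' column names follow this notebook; the 'Feature' column name, the `f0`..`f5` feature names, and the `toy_` variables are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the f0..f5 feature names are made up
toy_X, toy_y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
toy_cols = [f"f{i}" for i in range(toy_X.shape[1])]

toy_tree = DecisionTreeClassifier(random_state=42).fit(toy_X, toy_y)

# One row per feature, most important first
toy_imp = (
    pd.DataFrame({"Feature": toy_cols, "Importance": toy_tree.feature_importances_})
    .sort_values("Importance", ascending=False)
    .reset_index(drop=True)
)

# Running total of importance; the last value is 1.0 up to rounding
toy_imp["cum_imp"] = toy_imp["Importance"].cumsum()
print(toy_imp)
```

Because scikit-learn normalizes `feature_importances_` to sum to 1, the cumulative column can be read as the share of total importance covered so far.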

Now let us drop the columns where the cumulative importance is greater than 0.95. Save these column names in a variable named drop_col.

In [None]:
# Identifying the variables with low importance
# your code here
raise NotImplementedError

In [None]:
# Dropping the low-importance columns from X_train and X_test
# your code here
raise NotImplementedError

In [None]:
# ignore this cell
assert X_train.shape == (6684, 11), 'Make sure you have dropped the unimportant variables in train data'
assert X_test.shape == (2865, 11), 'Make sure you have dropped the unimportant variables in test data'
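
The cumulative-importance cutoff can be illustrated with a small hand-made table. All feature names and importance values below are hypothetical, chosen so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical importance table, already sorted in descending order
toy_imp = pd.DataFrame({
    "Feature": ["a", "b", "c", "d", "e"],
    "Importance": [0.50, 0.30, 0.17, 0.02, 0.01],
})
toy_imp["cum_imp"] = toy_imp["Importance"].cumsum()  # about 0.50, 0.80, 0.97, 0.99, 1.00

# Features whose cumulative importance exceeds 0.95
toy_drop_col = toy_imp.loc[toy_imp["cum_imp"] > 0.95, "Feature"].tolist()

# Drop them from a matching feature frame
toy_X_train = pd.DataFrame([[1, 2, 3, 4, 5]], columns=["a", "b", "c", "d", "e"])
toy_X_train = toy_X_train.drop(columns=toy_drop_col)

print(toy_drop_col, list(toy_X_train.columns))
```

Note that under this rule the feature that carries the running total past 0.95 (here `c`) is dropped as well, so the retained features can cover noticeably less than 95% of the total importance.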

<h3>Making predictions using the model</h3>

Having removed the irrelevant columns, let us create a new decision tree model to check whether the performance improves.

In [None]:
# Building the decision tree classifier object
DT_new_model = DecisionTreeClassifier(random_state = 42)

Now fit the model - DT_new_model with X_train and y_train.

In [None]:
# Building the model using the training data
# your code here
raise NotImplementedError

Now that our model is built, let's perform prediction on both the train and test data.

In [None]:
# Performing prediction on both the train and test data
# your code here
raise NotImplementedError

<h3>Evaluating the model</h3>

As the next step, print the f1-scores for both train and test data. Save the f1 score for train data in 'f1_train_new' and f1 score for test data in 'f1_test_new'.

In [None]:
# your code here
raise NotImplementedError

In [None]:
# ignore this cell
assert f1_train_new == 1.0, 'Make sure you have followed the conditions stated to train the model and predict the output'

We can see that there has only been a marginal improvement in the model. Another way to improve the model is to prune the tree by limiting its depth and exploring which depth works best.

For this, let's look at the depth of the current model.

In [None]:
# Storing the depth of our current model
depth = DT_model.get_depth()
depth

Next, we will create a list of values, starting from the depth of the current (unpruned) tree and decreasing in steps of 3 until reaching a depth of 1. This list will help us perform pruning using different values of max_depth. Save this list in a variable named 'max_depth_list'.

In [None]:
# List of max_depth values 

# your code here
raise NotImplementedError

In [None]:
# ignore this cell
assert len(max_depth_list) == 7, 'Make sure you have followed the conditions stated to train the model and predict the output'
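
A descending list like this can be built with `range` and a negative step. The sketch below assumes a starting depth of 19, which is what the checks in this notebook imply; `depth_grid` and `toy_depths` are hypothetical names.

```python
def depth_grid(full_depth, step=3):
    """Return depths from full_depth downwards in steps of `step` (hypothetical helper).

    The stop value of 0 is exclusive, so the list ends at the smallest
    positive depth reachable from full_depth.
    """
    return list(range(full_depth, 0, -step))


toy_depths = depth_grid(19)
print(toy_depths)  # [19, 16, 13, 10, 7, 4, 1]
```

The list ends exactly at 1 only when the starting depth minus 1 is a multiple of the step, as it is for 19 and a step of 3.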

Next, we will train Decision Tree models with different max_depth values and then compute and display both the train and test F1 scores, rounded to three decimal places, for each model.

In [None]:
# Dictionaries to store the train and test F1 scores, keyed by max_depth

f1_train_scores = {}
f1_test_scores = {}

# Loop through max_depth values and train the models
for depth in max_depth_list:
    # Initialize the Decision Tree model with the current max_depth value
    DT_model_prune = DecisionTreeClassifier(max_depth=depth, random_state=42)
    
    # Train the model
    # your code here
    raise NotImplementedError

# Print the train and test F1 scores for each model
for depth in max_depth_list:
    print(f"max_depth = {depth}|\
    Train Score = {f1_train_scores[depth]:.3f} |\
    Test score = {f1_test_scores[depth]:.3f}")
    print('_'*65)

In [None]:
# ignore this cell
assert 19 in f1_train_scores and f1_train_scores[19] > 0.99, "Train score is different from the expected score for max_depth = 19. Make sure you have followed the instructions carefully."
assert 13 in f1_train_scores and f1_train_scores[13] > 0.97, "Train score is different from the expected score for max_depth = 13. Make sure you have followed the instructions carefully."
assert 7 in f1_train_scores and f1_train_scores[7] > 0.91, "Train score is different from the expected score for max_depth = 7. Make sure you have followed the instructions carefully."
assert 1 in f1_train_scores and f1_train_scores[1] > 0.75, "Train score is different from the expected score for max_depth = 1. Make sure you have followed the instructions carefully."
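
For reference, the general pattern of a depth sweep looks roughly like the sketch below. It runs on synthetic data with a shortened, made-up depth list, and every `toy_` name is hypothetical, so its scores say nothing about the insurance dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise
toy_X, toy_y = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42
)

toy_f1_train_scores = {}
toy_f1_test_scores = {}

for d in [10, 7, 4, 1]:
    # Cap the tree at depth d and fit on the training split only
    toy_model = DecisionTreeClassifier(max_depth=d, random_state=42)
    toy_model.fit(toy_X_train, toy_y_train)
    # Score the same fitted model on both splits
    toy_f1_train_scores[d] = f1_score(toy_y_train, toy_model.predict(toy_X_train))
    toy_f1_test_scores[d] = f1_score(toy_y_test, toy_model.predict(toy_X_test))

for d in toy_f1_train_scores:
    gap = toy_f1_train_scores[d] - toy_f1_test_scores[d]
    print(f"max_depth = {d} | train = {toy_f1_train_scores[d]:.3f} | "
          f"test = {toy_f1_test_scores[d]:.3f} | gap = {gap:.3f}")
```

Printing the train-test gap next to the scores makes it easier to spot the depth at which the model stops memorizing the training data.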


We can see that at max_depth = 7, the train and test scores are almost the same and both stay near 0.9, which is a good score. This suggests that, among the depths we tried, max_depth = 7 is the best choice for our model when making predictions!