# Problem Statement
Anova Insurance, a global health insurance company, seeks to optimize its insurance policy premium pricing based on the health status of applicants. Understanding an applicant's health condition is crucial for two key decisions:
- Determining eligibility for health insurance coverage.
- Deciding on premium rates, particularly if the applicant's health indicates higher risks.

Your objective is to develop a predictive model that uses health data to classify individuals as 'healthy' or 'unhealthy'. This classification will assist in making informed decisions about insurance policy premium pricing.

# Dataset Overview
The dataset contains 10,000 rows and 20 columns, including both numerical and categorical variables. Some columns have missing values, especially for older individuals, reflecting the scenario where certain health records may not be up-to-date. Here is the data dictionary.

- Age: Represents the age of the individual. Negative values seem to be present, which might indicate data entry errors or a specific encoding used for certain age groups.

- BMI (Body Mass Index): A measure of body fat based on height and weight. Typically, a BMI between 18.5 and 24.9 is considered normal.

- Blood_Pressure: Represents systolic blood pressure. Normal blood pressure is usually around 120/80 mmHg.

- Cholesterol: This is the cholesterol level in mg/dL. Desirable levels are usually below 200 mg/dL.

- Glucose_Level: Indicates blood glucose levels. It might be fasting glucose levels, with normal levels usually ranging from 70 to 99 mg/dL.

- Heart_Rate: The number of heartbeats per minute. Normal resting heart rate for adults ranges from 60 to 100 beats per minute.

- Sleep_Hours: The average number of hours the individual sleeps per day.

- Exercise_Hours: The average number of hours the individual exercises per day. 

- Water_Intake: The average daily water intake in liters.

- Stress_Level: A numerical representation of stress level.

- Smoking: A categorical variable indicating smoking status. Contains values (0, 1, 2) which specify the regularity of smoking, with 0 being no smoking and 2 being regular smoking.

- Alcohol: A categorical variable indicating alcohol consumption status. Contains values (0, 1, 2) which specify the regularity of alcohol consumption, with 0 being no consumption and 2 being regular consumption.

- Diet: A categorical variable indicating the quality of dietary habits. Contains values (0, 1, 2) which specify the quality of the diet, with 0 being poor quality and 2 being good quality.

- MentalHealth: Possibly a measure of mental health status. Contains values (0, 1, 2) which specify the severity of mental health issues, with 0 being fine and 2 being highly severe.

- PhysicalActivity: A categorical variable indicating levels of physical activity. Contains values (0, 1, 2) which specify the intensity of physical activity, with 0 being no physical activity and 2 being regularly active.

- MedicalHistory: Indicates the presence of medical conditions or history. Contains values - (0,1,2) which specify the severity of the medical history with 0 being nothing and 2 being highly severe.

- Allergies: A categorical variable indicating allergy status. Contains values - (0,1,2) which specify the severity of the allergies with 0 being nothing and 2 being highly severe.

- Diet_Type: Categorical variable indicating the type of diet an individual follows. Contains values (Vegetarian, Non-Vegetarian, Vegan).

- Blood_Group: Indicates the blood group of the individual. Contains values (A, B, AB, O).

- Target: This is a binary outcome variable, with '1' indicating 'Unhealthy' and '0' indicating 'Healthy'.


It is clear from the above description that the 'Target' column is the target (response) variable; the remaining columns are the predictors.

Let us begin by importing the necessary libraries and reading the data.

In [1]:
# Necessary library imports for data processing and KNN
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, f1_score

In [2]:
# Load the dataset
df = pd.read_csv('Healthcare_Dataset.csv')
df.head()

Unnamed: 0,Age,BMI,Blood_Pressure,Cholesterol,Glucose_Level,Heart_Rate,Sleep_Hours,Exercise_Hours,Water_Intake,Stress_Level,Target,Smoking,Alcohol,Diet,MentalHealth,PhysicalActivity,MedicalHistory,Allergies,Diet_Type,Blood_Group
0,-2.0,26.0,111.0,198.0,99.0,72.0,4.0,-1.0,5.0,5.0,1,2,2,1,2,1,0,1,Vegetarian,AB
1,-8.0,24.0,121.0,199.0,103.0,75.0,2.0,1.0,2.0,9.0,1,0,1,1,2,1,2,2,Non-Vegetarian,AB
2,81.0,27.0,147.0,203.0,,74.0,10.0,-0.0,5.0,1.0,0,2,1,2,0,0,1,0,Vegan,A
3,25.0,21.0,150.0,199.0,102.0,70.0,7.0,3.0,3.0,3.0,0,2,0,1,2,1,2,0,Vegan,B
4,24.0,26.0,146.0,202.0,99.0,76.0,10.0,-2.0,5.0,1.0,0,0,1,2,0,2,0,2,Vegetarian,B


In [3]:
# Shape of the data
df.shape

(10000, 20)

In [4]:
df.columns

Index(['Age', 'BMI', 'Blood_Pressure', 'Cholesterol', 'Glucose_Level',
       'Heart_Rate', 'Sleep_Hours', 'Exercise_Hours', 'Water_Intake',
       'Stress_Level', 'Target', 'Smoking', 'Alcohol', 'Diet', 'MentalHealth',
       'PhysicalActivity', 'MedicalHistory', 'Allergies', 'Diet_Type',
       'Blood_Group'],
      dtype='object')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Age               10000 non-null  float64
 1   BMI               9457 non-null   float64
 2   Blood_Pressure    9457 non-null   float64
 3   Cholesterol       10000 non-null  float64
 4   Glucose_Level     9457 non-null   float64
 5   Heart_Rate        10000 non-null  float64
 6   Sleep_Hours       10000 non-null  float64
 7   Exercise_Hours    10000 non-null  float64
 8   Water_Intake      10000 non-null  float64
 9   Stress_Level      10000 non-null  float64
 10  Target            10000 non-null  int64  
 11  Smoking           10000 non-null  int64  
 12  Alcohol           10000 non-null  int64  
 13  Diet              10000 non-null  int64  
 14  MentalHealth      10000 non-null  int64  
 15  PhysicalActivity  10000 non-null  int64  
 16  MedicalHistory    10000 non-null  int64  

Now, let us describe the numerical features of the data.

In [6]:
# Get the description for numerical data
df.describe()

Unnamed: 0,Age,BMI,Blood_Pressure,Cholesterol,Glucose_Level,Heart_Rate,Sleep_Hours,Exercise_Hours,Water_Intake,Stress_Level,Target,Smoking,Alcohol,Diet,MentalHealth,PhysicalActivity,MedicalHistory,Allergies
count,10000.0,9457.0,9457.0,10000.0,9457.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,23.7636,25.694935,130.799196,199.2521,100.122343,73.5314,7.0062,-0.7569,3.4879,4.2902,0.4999,0.9945,0.9927,1.0055,0.9967,1.0006,1.0019,0.9931
std,42.752514,1.998384,28.396578,2.105941,2.273901,1.724329,2.343227,2.182631,1.705387,2.123213,0.500025,0.815681,0.816525,0.816172,0.82326,0.809979,0.81398,0.816284
min,-128.0,19.0,22.0,192.0,93.0,67.0,-1.0,-8.0,-3.0,-3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-5.0,24.0,113.0,198.0,99.0,72.0,5.0,-2.0,2.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,22.0,26.0,135.0,199.0,100.0,74.0,7.0,-1.0,4.0,4.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
75%,49.0,27.0,151.0,201.0,102.0,75.0,9.0,1.0,5.0,6.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
max,201.0,32.0,225.0,207.0,107.0,80.0,14.0,8.0,10.0,12.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0


In [7]:
# Get the description of the categorical variables
df.describe(include='O')

Unnamed: 0,Diet_Type,Blood_Group
count,10000,10000
unique,3,4
top,Vegetarian,O
freq,3360,2538


# Observations
There are 3 major observations here:
- The Age column has a minimum of -128 and a maximum of 201, neither of which is possible.
- Five variables (Age, Sleep_Hours, Exercise_Hours, Water_Intake, Stress_Level) contain negative values, which is not logically possible.
- There are 2 categorical variables, Diet_Type and Blood_Group, which we must one-hot encode.
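Before treating the negative values, it can help to count how many there are in each column. The following is a minimal sketch on a small made-up frame; the name `toy` and its values are assumptions for illustration, not rows from the real dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Age": [-2.0, 25.0, 81.0, -8.0],
    "Sleep_Hours": [4.0, -1.0, 7.0, 10.0],
    "Water_Intake": [5.0, 2.0, 3.0, 4.0],
})

# Comparing the frame with 0 gives a boolean frame;
# summing it counts the True values (negatives) per column.
negative_counts = (toy < 0).sum()
print(negative_counts)
```

On the real data, the same expression could be applied to `df[['Age', 'Sleep_Hours', 'Exercise_Hours', 'Water_Intake', 'Stress_Level']]`.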

Let us begin by replacing the negative values in these 5 variables (Age, Sleep_Hours, Exercise_Hours, Water_Intake, Stress_Level) with non-negative ones. For this we have the following methods:
- Individually replace the negative values in each column. Possible methods include:
 1. Replace negative values with a central value of the respective column, such as the mean or median.
 2. Replace negative values with a central value computed from other columns, such as sleep hours or blood pressure.

Or, you can choose any other method to treat the negative values in these 5 columns.
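Method 1 can be sketched as a loop that replaces each negative entry with the mean of the non-negative entries in the same column. This is one possible variant, shown on a made-up frame (`toy` and its values are assumptions). It is not the same as using the plain column mean, which is pulled down by the negative entries and can itself be negative, as the -0.7569 mean of Exercise_Hours in the summary above shows.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Age": [-2.0, 20.0, 40.0, -8.0],
    "Exercise_Hours": [-1.0, 1.0, 3.0, -2.0],
})

for col in ["Age", "Exercise_Hours"]:
    is_negative = toy[col] < 0
    # Mean of the valid (non-negative) entries only
    valid_mean = toy.loc[~is_negative, col].mean()
    # Overwrite just the negative entries
    toy.loc[is_negative, col] = valid_mean

print(toy)
```

Because the replacement value is computed from non-negative entries only, it can never be negative, so the same loop works for all five columns without a special case.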

In [8]:
# Replace the negative values in each of these 5 columns, one column per cell.
# Each replacement below uses the mean of the whole column (negative entries included),
# except 'Exercise_Hours', which is handled separately.
mean_age = df["Age"].mean()
mean_age

df.loc[df['Age'] < 0, 'Age'] = mean_age

In [9]:
mean_sleep = df["Sleep_Hours"].mean()
mean_sleep

df.loc[df["Sleep_Hours"] < 0, "Sleep_Hours"] = mean_sleep

In [10]:
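mean_ex = df["Exercise_Hours"].mean()
mean_ex

# The overall mean of 'Exercise_Hours' is negative (about -0.76, see the summary above),
# so it cannot serve as the replacement; negative values are set to 0 instead.
df.loc[df['Exercise_Hours'] < 0, 'Exercise_Hours'] = 0.00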

In [11]:
mean_water = df["Water_Intake"].mean()
mean_water

df.loc[df['Water_Intake'] < 0, 'Water_Intake'] = mean_water

In [12]:
mean_stress = df["Stress_Level"].mean()
mean_stress

df.loc[df['Stress_Level'] < 0, 'Stress_Level'] = mean_stress

In [13]:
assert len(df[df['Age'] <0]) == 0, 'Make sure to make all the values in the "Age" variable as +ve'
assert len(df[df['Sleep_Hours'] <0]) == 0, 'Make sure to make all the values in the "Sleep_Hours" variable as +ve'
assert len(df[df['Exercise_Hours'] <0]) == 0, 'Make sure to make all the values in the "Exercise_Hours" variable as +ve'
assert len(df[df['Water_Intake'] <0]) == 0, 'Make sure to make all the values in the "Water_Intake" variable as +ve'
assert len(df[df['Stress_Level'] <0]) == 0, 'Make sure to make all the values in the "Stress_Level" variable as +ve'

Now, let us treat the age column.

Since Anova is a newly established health insurance company, it is very strict about who is eligible for health insurance. Management has decided that no one above the age of 100 will be issued a policy. This rule also helps us handle age values that are absurdly high (and most probably incorrect entries). First, let's check how many observations have Age > 100.

In [14]:
# Check the number of observations in dataset with 'Age' values greater than 100
len(df[df['Age']>100])

437

Since Anova is a newly established health insurance provider, they do not want to charge everyone with an 'Age' greater than 100 the same premium. Being a new company, data is critical to them, so deleting such rows would not make sense. Instead, let us cap the Age column by setting every value above 100 to 100.

In [15]:
# Replacing values greater than 100 with 100 in the Age column
# Write your code below
# your code here
df.loc[df['Age'] > 100, 'Age'] = 100
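The same capping can also be written with pandas' `Series.clip`, which limits values to a given bound. The sketch below uses a made-up series (`ages` is an assumption for illustration) and is an equivalent alternative to the `.loc` assignment above, not a replacement for it.

```python
import pandas as pd

# Made-up ages, two of them above the cap of 100.
ages = pd.Series([23.0, 81.0, 150.0, 201.0], name="Age")

# clip(upper=100) leaves values <= 100 untouched and sets larger ones to 100.
capped = ages.clip(upper=100)
print(capped.tolist())
```

On the real data this would read `df['Age'] = df['Age'].clip(upper=100)`.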

In [16]:
assert len(df[df['Age'] > 100]) == 0, 'The "Age" variable still has values greater than 100. Make sure to set all "Age" values above 100 to 100'

In [17]:
df.shape

(10000, 20)

Now let us treat the missing values.

# Treating Missing Values

In [18]:
#check missing values
df.isnull().sum()

Age                   0
BMI                 543
Blood_Pressure      543
Cholesterol           0
Glucose_Level       543
Heart_Rate            0
Sleep_Hours           0
Exercise_Hours        0
Water_Intake          0
Stress_Level          0
Target                0
Smoking               0
Alcohol               0
Diet                  0
MentalHealth          0
PhysicalActivity      0
MedicalHistory        0
Allergies             0
Diet_Type             0
Blood_Group           0
dtype: int64

There are 3 variables with missing values. Let us fill each of them with its own median. Let's create a list with the names of these columns.

In [19]:
# List of columns whose missing values will be filled with the column median
columns_to_fill_with_median = ['BMI', 'Blood_Pressure', 'Glucose_Level']

We can create a for loop that iterates over the columns in the 'columns_to_fill_with_median' list. Within the loop, we compute the median of each column and use it to fill that column's missing values.

In [20]:
#compute the median for each column and fill NA values with that median
# Write your code below
# your code here
for column in columns_to_fill_with_median:
    median_value = df[column].median()
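    # Assign the result back instead of using inplace=True on a column selection,
    # which is deprecated in recent pandas versions
    df[column] = df[column].fillna(median_value)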

In [21]:
assert len(df[df.BMI.isna()]) == 0, 'The column "BMI" still has missing values, make sure to impute them properly with the median values'
assert len(df[df.Blood_Pressure.isna()]) == 0, 'The column "Blood_Pressure" still has missing values, make sure to impute them properly with the median values'
assert len(df[df.Glucose_Level.isna()]) == 0, 'The column "Glucose_Level" still has missing values, make sure to impute them properly with the median values'
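As an alternative to the loop, `DataFrame.fillna` accepts a Series of per-column fill values, so all the medians can be applied in one call. The sketch below runs on a made-up frame; `toy`, `cols`, and the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (values are made up).
toy = pd.DataFrame({
    "BMI": [24.0, np.nan, 26.0, 28.0],
    "Glucose_Level": [99.0, 101.0, np.nan, 103.0],
})
cols = ["BMI", "Glucose_Level"]

# median() skips NaN by default and returns one value per column.
medians = toy[cols].median()

# fillna with a Series matches fill values to columns by name.
toy[cols] = toy[cols].fillna(medians)
print(toy)
```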

Then let us check the missing values again.

In [22]:
#check missing values
df.isnull().sum()

Age                 0
BMI                 0
Blood_Pressure      0
Cholesterol         0
Glucose_Level       0
Heart_Rate          0
Sleep_Hours         0
Exercise_Hours      0
Water_Intake        0
Stress_Level        0
Target              0
Smoking             0
Alcohol             0
Diet                0
MentalHealth        0
PhysicalActivity    0
MedicalHistory      0
Allergies           0
Diet_Type           0
Blood_Group         0
dtype: int64

# One Hot Encoding

Now let us check the values in the categorical columns - 'Diet_Type', and 'Blood_Group'.

In [23]:
# Check unique values in Diet_Type Column
df.Diet_Type.unique()

array(['Vegetarian', 'Non-Vegetarian', 'Vegan'], dtype=object)

In [24]:
# Check unique values in Blood_Group Column
df.Blood_Group.unique()

array(['AB', 'A', 'B', 'O'], dtype=object)

Let's perform one-hot encoding. We will pass both column names, ['Diet_Type', 'Blood_Group'], in the columns argument and use the same names, enclosed in square brackets, as the prefix argument. Make sure the drop_first argument is True; this drops the first category of each variable so that it serves as the baseline. The 2 original variables are replaced by their dummy columns after encoding. Make sure the new columns are numeric and carry the prefixes 'Diet_Type' and 'Blood_Group' respectively.

In [25]:
df['Diet_Type'] = df['Diet_Type'].astype('category')

In [26]:
df['Blood_Group'] = df['Blood_Group'].astype('category')

In [27]:
# One Hot Encoding
# Write your code below
# your code here

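# dtype=int keeps the dummy columns as 0/1 integers (recent pandas versions return booleans by default)
df = pd.get_dummies(df, columns=["Diet_Type","Blood_Group"], prefix=["Diet_Type","Blood_Group"], drop_first=True, dtype=int)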

The column names after OHE should be: Diet_Type_Vegan, Diet_Type_Vegetarian, Blood_Group_AB, Blood_Group_B, Blood_Group_O.
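To see why the first category of each variable disappears, here is a minimal sketch of `pd.get_dummies` with `drop_first=True` on a three-row made-up frame (`toy` and `encoded` are assumed names). Categories are ordered alphabetically, so 'Non-Vegetarian' and 'A' become the dropped baselines; blood group 'B' is simply absent from this toy data.

```python
import pandas as pd

# Three made-up rows; blood group 'B' is left out to keep the example small.
toy = pd.DataFrame({
    "Diet_Type": ["Vegetarian", "Non-Vegetarian", "Vegan"],
    "Blood_Group": ["AB", "A", "O"],
})

# The prefix defaults to the column name; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(
    toy, columns=["Diet_Type", "Blood_Group"], drop_first=True, dtype=int
)
print(list(encoded.columns))
print(encoded)
```

A row whose dummy columns are all 0 (the second row here) belongs to the baseline categories, so no information is lost by dropping them.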

In [28]:
# Check the head
df.head()

Unnamed: 0,Age,BMI,Blood_Pressure,Cholesterol,Glucose_Level,Heart_Rate,Sleep_Hours,Exercise_Hours,Water_Intake,Stress_Level,...,Diet,MentalHealth,PhysicalActivity,MedicalHistory,Allergies,Diet_Type_Vegan,Diet_Type_Vegetarian,Blood_Group_AB,Blood_Group_B,Blood_Group_O
0,23.7636,26.0,111.0,198.0,99.0,72.0,4.0,0.0,5.0,5.0,...,1,2,1,0,1,0,1,1,0,0
1,23.7636,24.0,121.0,199.0,103.0,75.0,2.0,1.0,2.0,9.0,...,1,2,1,2,2,0,0,1,0,0
2,81.0,27.0,147.0,203.0,100.0,74.0,10.0,-0.0,5.0,1.0,...,2,0,0,1,0,1,0,0,0,0
3,25.0,21.0,150.0,199.0,102.0,70.0,7.0,3.0,3.0,3.0,...,1,2,1,2,0,1,0,0,1,0
4,24.0,26.0,146.0,202.0,99.0,76.0,10.0,0.0,5.0,1.0,...,2,0,2,0,2,0,1,0,1,0


In [29]:
df.shape

(10000, 23)

As you can see, the dummy variables for both columns are ready.