# Problem Statement
Anova Insurance, a global health insurance company, seeks to optimize its insurance policy premium pricing based on the health status of applicants. Understanding an applicant's health condition is crucial for two key decisions:
- Determining eligibility for health insurance coverage.
- Deciding on premium rates, particularly if the applicant's health indicates higher risks.

Your objective is to develop a predictive model that uses health data to classify individuals as 'healthy' or 'unhealthy'. This classification will inform decisions about insurance policy premium pricing.

# Dataset Overview
The dataset contains 10,000 rows and 20 columns, including both numerical and categorical variables. Some columns have missing values, especially for older individuals, reflecting the scenario where certain health records may not be up to date. Here is the data dictionary.

- Age: Represents the age of the individual. Negative values seem to be present, which might indicate data entry errors or a specific encoding used for certain age groups.

- BMI (Body Mass Index): A measure of body fat based on height and weight. Typically, a BMI between 18.5 and 24.9 is considered normal.

- Blood_Pressure: Represents systolic blood pressure. Normal blood pressure is usually around 120/80 mmHg, which corresponds to a systolic value near 120 mmHg.

- Cholesterol: This is the cholesterol level in mg/dL. Desirable levels are usually below 200 mg/dL.

- Glucose_Level: Indicates blood glucose levels. It might be fasting glucose levels, with normal levels usually ranging from 70 to 99 mg/dL.

- Heart_Rate: The number of heartbeats per minute. Normal resting heart rate for adults ranges from 60 to 100 beats per minute.

- Sleep_Hours: The average number of hours the individual sleeps per day.

- Exercise_Hours: The average number of hours the individual exercises per day. 

- Water_Intake: The average daily water intake in liters.

- Stress_Level: A numerical representation of stress level.

- Target: This is a binary outcome variable, with '1' indicating 'Unhealthy' and '0' indicating 'Healthy'.

- Smoking: A categorical variable indicating smoking status. Contains values - (0,1,2) which specify the regularity of smoking with 0 being no smoking and 2 being regular smoking.

- Alcohol: A categorical variable indicating alcohol consumption status. Contains values - (0,1,2) which specify the regularity of alcohol consumption with 0 being no consumption and 2 being regular consumption.

- Diet: A categorical variable indicating the quality of dietary habits. Contains values - (0,1,2) which specify the quality of the habit with 0 being poor diet quality and 2 being good quality.

- MentalHealth: Possibly a measure of mental health status. Contains values - (0,1,2) which specify the severity of mental health issues with 0 being fine and 2 being highly severe.

- PhysicalActivity: A categorical variable indicating levels of physical activity. Contains values - (0,1,2) which specify the intensity of physical activity with 0 being no physical activity and 2 being regularly active.

- MedicalHistory: Indicates the presence of medical conditions or history. Contains values - (0,1,2) which specify the severity of the medical history with 0 being nothing and 2 being highly severe.

- Allergies: A categorical variable indicating allergy status. Contains values - (0,1,2) which specify the severity of the allergies with 0 being nothing and 2 being highly severe.

- Diet_Type: A categorical variable indicating the type of diet an individual follows. Contains values (Vegetarian, Non-Vegetarian, Vegan).

- Blood_Group: Indicates the blood group of the individual. Contains values (A, B, AB, O).

It is clear from the above description that the variable to be predicted is the 'Target' column; the remaining columns are the predictors.

Let us begin by importing the necessary libraries and reading the preprocessed data.

In [None]:
# Necessary library imports for data processing and KNN
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, f1_score

In [None]:
# Load the dataset
df = pd.read_csv('Healthcare_Data_Preprocessed.csv')
df.head()

<h2>Performing Train Test Split</h2>

In [None]:
# Separating the independent and target variables
y = df['Target']
X = df.drop(columns=['Target'])

In [None]:
# Quick look at the shape of the data
X.shape, y.shape

Now, let's split the data in a 70:30 ratio. Use random_state = 42 and the variable names X_train, X_test, y_train, and y_test.
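As a rough illustration of what a 70:30 split looks like, here is a minimal sketch on a small synthetic array rather than the notebook's data. The names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, and `y_te` are made up for this example; `train_test_split` with `test_size` and `random_state` is the scikit-learn function already imported above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10 rows, 3 features (hypothetical values)
X_demo = np.arange(30).reshape(10, 3)
y_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.3 holds out 30% of the rows; random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

With 10 rows, 7 end up in the training set and 3 in the test set; the same proportions applied to 10,000 rows give the 7,000 and 3,000 checked by the assertions in this notebook.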

In [None]:
# Split the data into train and test
# your code here
raise NotImplementedError

In [None]:
assert X_train.shape == (7000, 22), 'Make sure to use the test size as 0.3, random_state = 42, and split the data correctly'
assert X_test.shape == (3000, 22), 'Make sure to use the test size as 0.3, random_state = 42, and split the data correctly'
assert y_train.shape == (7000, ), 'Make sure to use the test size as 0.3, random_state = 42, and split the data correctly'
assert y_test.shape == (3000, ), 'Make sure to use the test size as 0.3, random_state = 42, and split the data correctly'

With that we are done with the train-test split. Let us have a quick look at the shape of the data.

In [None]:
# A quick look at the shape of the datasets
X_train.shape, X_test.shape, y_train.shape, y_test.shape

Now, we will scale the data. Scaling brings all the variables to a comparable scale, so that features with large numeric ranges do not dominate the distance calculations that KNN relies on.
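To make the idea concrete, the sketch below standardizes a single column by hand using z = (x - mean) / std, which is the transformation `StandardScaler` applies per feature (with the population standard deviation). The column values and the helper names `fit_standardizer` and `standardize` are assumptions made for this illustration only.

```python
from statistics import mean, pstdev

def fit_standardizer(column):
    """Learn the mean and population standard deviation of one column."""
    return mean(column), pstdev(column)

def standardize(column, mu, sigma):
    """Apply z = (x - mu) / sigma using previously learned statistics."""
    return [(x - mu) / sigma for x in column]

# Hypothetical cholesterol readings in mg/dL
cholesterol_train = [180.0, 200.0, 220.0]
cholesterol_test = [240.0]

# Statistics are learned from the training values only ...
mu, sigma = fit_standardizer(cholesterol_train)

# ... and then reused unchanged for both the train and the test values
train_scaled = standardize(cholesterol_train, mu, sigma)
test_scaled = standardize(cholesterol_test, mu, sigma)

print(train_scaled)
print(test_scaled)
```

Learning the mean and standard deviation from the training values alone mirrors the `fit_transform` on train and `transform` on test pattern: the test set is scaled with statistics it did not contribute to.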

# Scaling the Data

In [None]:
#import standard scaler
from sklearn.preprocessing import StandardScaler

In [None]:
#create an instance for StandardScaler
scaler = StandardScaler()

In [None]:
# fit the scaler on the train data and transform it in one step
X_train_scaled = scaler.fit_transform(X_train)

In [None]:
#check X_train_scaled
X_train_scaled

In [None]:
X_train_scaled.shape

In [None]:
# transform the test data using the scaler fitted on the train data
X_test_scaled = scaler.transform(X_test)

In [None]:
X_test_scaled

# KNN Model

Now let us build the KNN model. We have to build a model with 5 neighbors.
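For intuition about what "5 neighbors" means, here is a simplified sketch of the KNN decision rule in plain Python: find the k training points closest to a query (Euclidean distance) and take a majority vote over their labels. It is not scikit-learn's implementation, and the function `knn_predict` and the sample points are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=5):
    """Return the majority label among the k training points nearest to query."""
    # Order training indices by Euclidean distance to the query point
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # Majority vote over the k nearest labels
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical scaled feature vectors: three of class 0, four of class 1
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1)]
labels = [0, 0, 0, 1, 1, 1, 1]

print(knn_predict(points, labels, (0.15, 0.15)))  # 3 of the 5 nearest are class 0
print(knn_predict(points, labels, (1.0, 1.05)))   # 4 of the 5 nearest are class 1
```

Because every prediction depends on raw distances, this sketch also shows why the features were scaled first: an unscaled feature with a large range would decide which points count as "nearest".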

In [None]:
#import the necessary class
from sklearn.neighbors import KNeighborsClassifier

Now build a KNN classification model with k = 5.

In [None]:
# setting k, the number of neighbors, to 5
knn_model = KNeighborsClassifier(n_neighbors=5)

Now train the KNN model using the scaled train data.

In [None]:
# training the knn model with train data
# Write your code below
# your code here
raise NotImplementedError

Now, let us make predictions on the same train data. Store them in a variable named <b>y_train_pred</b>.

In [None]:
# making predictions on the same training data
# Write your code below
# your code here
raise NotImplementedError

Then make predictions on the test data and save them in a variable named <b>y_pred</b>.
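The fit-then-predict pattern described in the last few steps can be sketched on toy data as follows. Everything prefixed with `toy_` or `X_toy`/`y_toy` is invented for this illustration, and k = 3 is used only because the toy set is tiny; `fit` and `predict` are the standard `KNeighborsClassifier` methods.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (hypothetical): one feature, two well-separated clusters
X_toy_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy_train = np.array([0, 0, 0, 1, 1, 1])
X_toy_test = np.array([[1.5], [10.5]])

toy_model = KNeighborsClassifier(n_neighbors=3)

# For KNN, fitting mainly stores the training data for later neighbor lookups
toy_model.fit(X_toy_train, y_toy_train)

# Predictions on data the model has already seen
toy_train_pred = toy_model.predict(X_toy_train)

# Predictions on data the model has not seen
toy_test_pred = toy_model.predict(X_toy_test)

print(toy_train_pred, toy_test_pred)
```

Predicting on both sets is what later allows a train score and a test score to be compared.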

In [None]:
# making predictions on the test data based on what the model learned from the train data
# Write your code below
# your code here
raise NotImplementedError

<h2>Evaluating the Model</h2>

With that, our model is trained. Let us now get the F1 score for both the train and test data. Save the scores in <b>train_f1_score</b> and <b>test_f1_score</b>.
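As a reminder of what the F1 score measures, the sketch below computes it by hand as the harmonic mean of precision and recall for the positive class (label 1, which is 'Unhealthy' in this dataset). The helper `f1_from_labels` and the label lists are hypothetical; scikit-learn's `f1_score` with its default binary settings computes the same quantity.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative
y_true_demo = [1, 1, 1, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 1]

demo_f1 = f1_from_labels(y_true_demo, y_pred_demo)
print(demo_f1)
```

Here precision and recall are both 3/4, so the F1 score is 0.75. A train F1 noticeably higher than the test F1 would suggest the model fits the training data better than it generalizes.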

In [None]:
# getting f1_score for train and test data
# your code here
raise NotImplementedError
print('Train F1 Score: ', train_f1_score)
print('Test F1 Score: ', test_f1_score)

In [None]:
assert train_f1_score > 0.85, 'Make sure you have followed the conditions stated to train the model and predict the output'
assert test_f1_score > 0.77, 'Make sure you have followed the conditions stated to train the model and predict the output'

With both scores clearing these thresholds, we are getting a decent model using KNN.