# Problem Statement
Anova Insurance, a global health insurance company, seeks to optimize its insurance policy premium pricing based on the health status of applicants. Understanding an applicant's health condition is crucial for two key decisions:
- Determining eligibility for health insurance coverage.
- Deciding on premium rates, particularly if the applicant's health indicates higher risks.

Your objective is to develop a predictive model that uses health data to classify individuals as 'healthy' or 'unhealthy'. This classification will support informed decisions about insurance policy premium pricing.

# Dataset Overview
The dataset contains 10,000 rows and 20 columns, including both numerical and categorical variables. Some columns have missing values, especially for older individuals, reflecting the scenario where certain health records may not be up-to-date. Here is the data dictionary.

- Age: Represents the age of the individual. Negative values seem to be present, which might indicate data entry errors or a specific encoding used for certain age groups.

- BMI (Body Mass Index): A measure of body fat based on height and weight. Typically, a BMI between 18.5 and 24.9 is considered normal.

- Blood_Pressure: Represents systolic blood pressure. A normal reading is usually around 120/80 mmHg (systolic/diastolic), so normal systolic values are around 120 mmHg.

- Cholesterol: This is the cholesterol level in mg/dL. Desirable levels are usually below 200 mg/dL.

- Glucose_Level: Indicates blood glucose levels. It might be fasting glucose levels, with normal levels usually ranging from 70 to 99 mg/dL.

- Heart_Rate: The number of heartbeats per minute. Normal resting heart rate for adults ranges from 60 to 100 beats per minute.

- Sleep_Hours: The average number of hours the individual sleeps per day.

- Exercise_Hours: The average number of hours the individual exercises per day.

- Water_Intake: The average daily water intake in liters.

- Stress_Level: A numerical representation of stress level.

- Target: This is a binary outcome variable, with '1' indicating 'Unhealthy' and '0' indicating 'Healthy'.

- Smoking: A categorical variable indicating smoking status. Contains values (0, 1, 2) which specify the regularity of smoking, with 0 being no smoking and 2 being regular smoking.

- Alcohol: A categorical variable indicating alcohol consumption status. Contains values (0, 1, 2) which specify the regularity of alcohol consumption, with 0 being no consumption and 2 being regular consumption.

- Diet: A categorical variable indicating the quality of dietary habits. Contains values (0, 1, 2) which specify the quality of the diet, with 0 being poor quality and 2 being good quality.

- MentalHealth: Possibly a measure of mental health status. Contains values (0, 1, 2) which specify the severity of mental health issues, with 0 being fine and 2 being highly severe.

- PhysicalActivity: A categorical variable indicating levels of physical activity. Contains values (0, 1, 2) which specify the intensity of physical activity, with 0 being no physical activity and 2 being regularly active.

- MedicalHistory: Indicates the presence of medical conditions or history. Contains values (0, 1, 2) which specify the severity of the medical history, with 0 being none and 2 being highly severe.

- Allergies: A categorical variable indicating allergy status. Contains values (0, 1, 2) which specify the severity of the allergies, with 0 being none and 2 being highly severe.

- Diet_Type: A categorical variable indicating the type of diet an individual follows. Contains values (Vegetarian, Non-Vegetarian, Vegan).

- Blood_Group: Indicates the blood group of the individual. Contains values (A, B, AB, O).

It is clear from the above description that the target (response) variable is the 'Target' column; all other columns are candidate predictors.

Let us begin by importing the necessary libraries and reading the data.

In [None]:
# import necessary libraries
import pandas as pd
import numpy as np

In [None]:
# Loading and reading the dataset
data = pd.read_csv('Healthcare_Dataset_Preprocessed.csv')
data.head()

Now, we check the data types of all the columns.

In [None]:
# Check the data types
data.dtypes

# Splitting the data into Features and Target

To feed the data into the model, we must split it into features (X) and target (y).

In [None]:
# Separate features and target variable
X = data.drop(columns = 'Target')
y = data['Target']

In [None]:
# verify the shape of the X and y
X.shape, y.shape

In [None]:
# Import statsmodels
import statsmodels.api as sm

# Building a Logistic Regression Model

Now let's create a logistic regression model. We first add a constant (intercept) term to the predictor variables, since statsmodels does not include one by default. Then we fit the model to the data and finally look at the model summary.

In [None]:
#Adding a constant
X = sm.add_constant(X)

#fitting the model
model = sm.Logit(y, X).fit()

#Let's take a look at the summary of the model
model.summary()

# Checking the VIF Scores

Let's address the issue of highly correlated columns based on VIF scores.

Given that Anova is a newly established health insurance company, accurate predictions and well-informed business decisions are crucial, so the model should rely on the most reliable features. Highly correlated features (multicollinearity) inflate the variance of the coefficient estimates, which makes them unstable and hard to interpret, hence the need to identify and remove such features. To start, let's assess the VIF (Variance Inflation Factor) scores across all features.

In [None]:
# Import variance_inflation_factor
from statsmodels.stats.outliers_influence import variance_inflation_factor

Steps for the solution:
1. **Initialize DataFrame:** Create an empty DataFrame `vif_data` to store VIF results.

2. **Add 'Variable' Column:** Add a column "Variable" to `vif_data` with column names from the input DataFrame `data`.

3. **Calculate VIF:** Use list comprehension and `variance_inflation_factor` to calculate VIF for each column in `data`.

4. **Add 'VIF' Column:** Include a "VIF" column in `vif_data` with the calculated VIF scores for each variable.

5. **Return DataFrame:** Return the resulting DataFrame `vif_data` containing variables and their respective VIF scores.

In [None]:
#function to calculate the VIF score
def calculate_vif(data):
    """Calculate VIF for a dataframe."""
    # your code here
    raise NotImplementedError

# Run the function on the features to verify that it works
calculate_vif(X)

In [None]:
assert calculate_vif(X).shape == (23,2), 'Please make sure to calculate the VIF score for each column within the feature set "X", and store these scores in a DataFrame, associating column names with their respective VIF scores'

We notice that some features exhibit a VIF score exceeding 10, signifying strong correlations among them. Hence, it's necessary to address multicollinearity by systematically removing the feature with the highest VIF score in each iteration. This process is repeated until no feature demonstrates a VIF score greater than 10, ensuring the model doesn't contain highly correlated predictors.
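
For reference, the VIF of feature $j$ is computed from $R_j^2$, the coefficient of determination obtained by regressing feature $j$ on all the other features:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

A threshold of 10 therefore corresponds to $R_j^2 = 0.9$, since $1 / (1 - 0.9) = 10$: at least 90% of the variance of that feature is explained by the remaining features. A VIF of 1 means the feature is uncorrelated with the others.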

# Removing features with VIF score > 10

Steps for the solution:
1. Initialize `VIF_features_removed` as an empty list.
2. While `max_vif` is greater than the VIF threshold:

    a. Calculate the VIF scores for each feature in dataset `X` using the `calculate_vif` function.

    b. Find the maximum VIF score among all features and assign it to `max_vif`.

    c. If `max_vif` exceeds the VIF threshold:

        i. Identify the feature with the highest VIF score and store its name in `remove`.

        ii. Append the name of the removed feature to the `VIF_features_removed` list.

        iii. Remove the identified feature from dataset `X`.

3. End the loop when no feature has a VIF score greater than the threshold.
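
A minimal sketch of this loop on synthetic data follows. The names `vif_table`, `demo_X`, `removed`, and the columns are hypothetical and chosen only for illustration. The synthetic columns have mean close to zero, so no constant is added here. `a_copy` is a noisy duplicate of `a`, so one of the two should be removed.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(df):
    """Hypothetical helper: return each column name with its VIF score."""
    values = df.values.astype(float)
    return pd.DataFrame({
        "Variable": df.columns,
        "VIF": [variance_inflation_factor(values, i) for i in range(df.shape[1])],
    })

# Synthetic data (assumption, not the insurance dataset)
rng = np.random.default_rng(1)
a = rng.normal(size=500)
demo_X = pd.DataFrame({
    "a": a,
    "b": rng.normal(size=500),
    "a_copy": a + rng.normal(scale=0.05, size=500),  # near-duplicate of a
    "d": rng.normal(size=500),
})

threshold = 10
removed = []
max_vif = float("inf")
while max_vif > threshold:
    scores = vif_table(demo_X)           # recompute VIFs after every removal
    max_vif = scores["VIF"].max()
    if max_vif > threshold:
        remove = scores.loc[scores["VIF"].idxmax(), "Variable"]
        removed.append(remove)
        demo_X = demo_X.drop(columns=remove)

print(removed)
print(vif_table(demo_X))
```

Recomputing the scores inside the loop matters: removing one feature changes the VIF of every remaining feature, so dropping all features above the threshold in a single pass could remove more than necessary.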

In [None]:
# Iteratively remove features with the highest VIF
vif_threshold = 10
max_vif = float('inf')
# your code here
raise NotImplementedError

# Display the features and their VIFs after the iterative removal
final_vif = calculate_vif(X)
final_vif.sort_values(by='VIF', ascending = False)

In [None]:
assert final_vif.shape == (17,2), 'Please make sure to remove all features within the DataFrame "X" that possess VIF scores surpassing 10'

We verify the shapes of X (features) and y (target).

In [None]:
X.shape, y.shape

We will now rebuild the model without the removed features and check its performance.

In [None]:
#Adding a constant
X = sm.add_constant(X)

#fitting the model
model = sm.Logit(y, X).fit()

#Let's take a look at the summary of the model
model.summary()

We can see from this model that we received a Pseudo R-squared score of 0.05064, which is low. Next, we will drop the features with a p-value > 0.05, since they are not statistically significant for the model, and then check the performance of the model once again.

In [None]:
X.shape, y.shape

In [None]:
X.drop('const', axis = 1, inplace = True)

Let's proceed to eliminate columns that exhibit statistical insignificance by examining their respective p-values.

# Checking the P value

Steps for the solution:
1. Identify features with p-values exceeding 0.05 as statistically insignificant.
2. Remove the insignificant features from dataset X.

In [None]:
# Identify features with p-values greater than 0.05
# your code here
raise NotImplementedError

In [None]:
# Drop those features
# your code here
raise NotImplementedError

In [None]:
assert X.shape == (9549,5), 'Please make sure to remove all the features with p-values greater than 0.05'

In [None]:
#Building the model using only the statistically significant features
#Adding a constant
X = sm.add_constant(X)

#fitting the model
model2 = sm.Logit(y, X).fit()

#Let's take a look at the summary of the model
model2.summary()

In [None]:
X.drop(columns = ['const'], inplace = True)

# Logistic - Model Building

## Dividing the data

Let's perform the train-test split. We use a 70:30 split with `random_state = 42` and `stratify = y`, so that the training and test sets preserve the class proportions of the target.

In [None]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42, stratify = y)

Let us proceed to scaling the data before creating the logistic regression model.

## Standard Scaling

In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

Now, let's scale the data as usual.

In [None]:
# Scale both the train and the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Let's convert them into a dataframe.

In [None]:
# Converting train and test scaled data into a dataframe
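# Keep the original row index so the scaled features stay aligned with y_train / y_test
X_train_scaled = pd.DataFrame(X_train_scaled, columns = X.columns, index = X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns = X.columns, index = X_test.index)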

# Model Building

Now let us build the logistic regression model.

In [None]:
# Import Logistic Regression
from sklearn.linear_model import LogisticRegression

In [None]:
# Fitting the logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train_scaled, y_train)

In [None]:
#Making predictions on training and test data
y_train_pred = logreg.predict(X_train_scaled)
y_test_pred = logreg.predict(X_test_scaled)

# Model Evaluation

Now let us evaluate the model using the f1-score. Store the f1-scores in <b>score_train</b> and <b>score_test</b> variables.
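
As a quick reminder of what the metric measures, the F1 score is the harmonic mean of precision and recall for the positive class (here, '1' = 'Unhealthy'). The toy labels below are made up for illustration and are not taken from the dataset; the sketch compares a manual calculation with `f1_score`.

```python
from sklearn.metrics import f1_score

# Toy labels (assumption, for illustration only)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]

# Count true positives, false positives and false negatives for class 1
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)   # 3 / 4
recall = tp / (tp + fn)      # 3 / 4
manual_f1 = 2 * precision * recall / (precision + recall)

demo_score = f1_score(y_true, y_pred)
print(manual_f1, demo_score)  # both 0.75
```

Because it ignores true negatives, the F1 score is more informative than plain accuracy when misclassifying the positive class is costly.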

In [None]:
# Import f1-score
from sklearn.metrics import f1_score

In [None]:
# Generate and print the f1-score for train and test
# your code here
raise NotImplementedError
print('Train score: ',score_train)
print('Test score: ', score_test)

In [None]:
assert score_train > 0.64, 'Check if the model has been made after following all the instructions.'
assert score_test > 0.63, 'Check if the model has been made after following the instructions.'

The logistic regression model is underfitting. To tackle this, Anova also decides to implement decision trees for this supervised learning problem.