# Problem Statement
Anova Insurance, a global health insurance company, seeks to optimize its insurance policy premium pricing based on the health status of applicants. Understanding an applicant's health condition is crucial for two key decisions:
- Determining eligibility for health insurance coverage.
- Deciding on premium rates, particularly if the applicant's health indicates higher risks.

Your objective is to develop a predictive model that uses health data to classify individuals as 'healthy' or 'unhealthy'. This classification will assist in making informed decisions about insurance policy premium pricing.

# Dataset Overview
The dataset contains 10,000 rows and 20 columns, including both numerical and categorical variables. Some columns have missing values, especially for older individuals, reflecting the scenario where certain health records may not be up-to-date. Here is the data dictionary.

- Age: Represents the age of the individual. Negative values seem to be present, which might indicate data entry errors or a specific encoding used for certain age groups.

- BMI (Body Mass Index): A measure of body fat based on height and weight. Typically, a BMI between 18.5 and 24.9 is considered normal.

- Blood_Pressure: Represents systolic blood pressure in mmHg. A normal reading is usually around 120/80 mmHg (systolic/diastolic), so a normal systolic value is about 120.

- Cholesterol: This is the cholesterol level in mg/dL. Desirable levels are usually below 200 mg/dL.

- Glucose_Level: Indicates blood glucose levels. It might be fasting glucose levels, with normal levels usually ranging from 70 to 99 mg/dL.

- Heart_Rate: The number of heartbeats per minute. Normal resting heart rate for adults ranges from 60 to 100 beats per minute.

- Sleep_Hours: The average number of hours the individual sleeps per day.

- Exercise_Hours: The average number of hours the individual exercises per day.

- Water_Intake: The average daily water intake in liters.

- Stress_Level: A numerical representation of stress level.

- Target: This is a binary outcome variable, with '1' indicating 'Unhealthy' and '0' indicating 'Healthy'.

- Smoking: A categorical variable indicating smoking status. Contains values - (0,1,2) which specify the regularity of smoking with 0 being no smoking and 2 being regular smoking.

- Alcohol: A categorical variable indicating alcohol consumption status. Contains values - (0,1,2) which specify the regularity of alcohol consumption with 0 being no consumption and 2 being regular consumption.

- Diet: A categorical variable indicating the quality of dietary habits. Contains values - (0,1,2) which specify the quality of the habit with 0 being poor diet quality and 2 being good quality.

- MentalHealth: Possibly a measure of mental health status. Contains values - (0,1,2) which specify the severity of mental health issues with 0 being fine and 2 being highly severe.

- PhysicalActivity: A categorical variable indicating levels of physical activity. Contains values - (0,1,2) which specify the intensity of physical activity with 0 being no physical activity and 2 being regularly active.

- MedicalHistory: Indicates the presence of medical conditions or history. Contains values - (0,1,2) which specify the severity of the medical history with 0 being nothing and 2 being highly severe.

- Allergies: A categorical variable indicating allergy status. Contains values - (0,1,2) which specify the severity of the allergies with 0 being nothing and 2 being highly severe.

- Diet_Type: Categorical variable indicating the type of diet an individual follows. Contains values (Vegetarian, Non-Vegetarian, Vegan).

- Blood_Group: Indicates the blood group of the individual. Contains values (A, B, AB, O).

It is clear from the above description that the target (response) variable is the 'Target' column; the remaining columns are the predictors.
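As a quick sanity check on the two data-quality issues the dictionary mentions (missing values and negative ages), a small sketch like the following could be used. The `sample` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical mini-sample; the values are invented for illustration only.
sample = pd.DataFrame({
    "Age": [34, -5, 61, 47],
    "BMI": [22.1, 27.4, None, 31.0],
    "Glucose_Level": [88.0, 104.0, 97.0, None],
})

# Count missing values per column.
missing_counts = sample.isna().sum()

# Flag rows whose Age is negative (possible entry error or special encoding).
negative_age_rows = sample[sample["Age"] < 0]

print(missing_counts)
print(negative_age_rows)
```

Running the same two checks on the real DataFrame would show how many rows are affected before deciding how to handle them.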

Let us begin by importing the necessary libraries and reading the data.

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np

In [2]:
# Loading and reading the dataset
data = pd.read_csv('Healthcare_Dataset_Preprocessed.csv')
data.head()

Unnamed: 0,Age,BMI,Blood_Pressure,Cholesterol,Glucose_Level,Heart_Rate,Sleep_Hours,Exercise_Hours,Water_Intake,Stress_Level,...,Diet,MentalHealth,PhysicalActivity,MedicalHistory,Allergies,Diet_Type_Vegan,Diet_Type_Vegetarian,Blood_Group_AB,Blood_Group_B,Blood_Group_O
0,2.0,26.0,111.0,198.0,99.0,72.0,4.0,1.0,5.0,5.0,...,1,2,1,0,1,0,1,1,0,0
1,8.0,24.0,121.0,199.0,103.0,75.0,2.0,1.0,2.0,9.0,...,1,2,1,2,2,0,0,1,0,0
2,81.0,27.0,147.0,203.0,100.0,74.0,10.0,-0.0,5.0,1.0,...,2,0,0,1,0,1,0,0,0,0
3,25.0,21.0,150.0,199.0,102.0,70.0,7.0,3.0,3.0,3.0,...,1,2,1,2,0,1,0,0,1,0
4,24.0,26.0,146.0,202.0,99.0,76.0,10.0,2.0,5.0,1.0,...,2,0,2,0,2,0,1,0,1,0


Now, we check the data types of all the columns.

In [3]:
# Check the data types
data.dtypes

Age                     float64
BMI                     float64
Blood_Pressure          float64
Cholesterol             float64
Glucose_Level           float64
Heart_Rate              float64
Sleep_Hours             float64
Exercise_Hours          float64
Water_Intake            float64
Stress_Level            float64
Target                    int64
Smoking                   int64
Alcohol                   int64
Diet                      int64
MentalHealth              int64
PhysicalActivity          int64
MedicalHistory            int64
Allergies                 int64
Diet_Type_Vegan           int64
Diet_Type_Vegetarian      int64
Blood_Group_AB            int64
Blood_Group_B             int64
Blood_Group_O             int64
dtype: object

# Splitting the data into Features and Target

To feed the data into the model, we must split the data into features (X) and target (y).

In [4]:
# Separate features and target variable
X = data.drop(columns='Target')
y = data['Target']

In [5]:
# verify the shape of the X and y
X.shape, y.shape

((9549, 22), (9549,))

In [6]:
# Import statsmodels
import statsmodels.api as sm

# Building a Logistic Regression Model

Now let's create a logistic regression model by first adding a constant term to the predictor variables. Then we fit the model to the data and finally get the model summary.

In [7]:
#Adding a constant
X = sm.add_constant(X)

#fitting the model
model = sm.Logit(y, X).fit()

#Let's take a look at the summary of the model
model.summary()

Optimization terminated successfully.
         Current function value: 0.415608
         Iterations 7


Dep. Variable:,Target,No. Observations:,9549.0
Model:,Logit,Df Residuals:,9526.0
Method:,MLE,Df Model:,22.0
Date:,"Sat, 13 Jan 2024",Pseudo R-squ.:,0.3996
Time:,11:57:18,Log-Likelihood:,-3968.6
converged:,True,LL-Null:,-6610.1
Covariance Type:,nonrobust,LLR p-value:,0.0

,coef,std err,z,P>|z|,[0.025,0.975]
const,133.3028,4.800,27.770,0.000,123.894,142.711
Age,-0.0240,0.001,-16.047,0.000,-0.027,-0.021
BMI,0.8490,0.024,35.723,0.000,0.802,0.896
Blood_Pressure,-0.0234,0.001,-18.228,0.000,-0.026,-0.021
Cholesterol,-0.6372,0.025,-25.939,0.000,-0.685,-0.589
Glucose_Level,-0.1037,0.025,-4.150,0.000,-0.153,-0.055
Heart_Rate,-0.1746,0.022,-8.012,0.000,-0.217,-0.132
Sleep_Hours,0.1190,0.026,4.545,0.000,0.068,0.170
Exercise_Hours,0.0474,0.024,1.950,0.051,-0.000,0.095
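One way to read the coefficients in this summary is as odds ratios: exponentiating a logit coefficient gives the factor by which the odds of `Target = 1` change for a one-unit increase in that feature, holding the others fixed. The sketch below uses three rounded coefficients copied from the table above; the dictionary name `coefficients` is just an illustrative choice.

```python
import math

# Rounded coefficients copied from the model summary above.
coefficients = {"Age": -0.0240, "BMI": 0.8490, "Sleep_Hours": 0.1190}

# exp(coefficient) is the multiplicative change in the odds of Target = 1
# for a one-unit increase in the feature, other features held fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

A ratio above 1 (as for BMI) means higher values raise the odds of the 'Unhealthy' class in this model, while a ratio below 1 (as for Age) means they lower it.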


# Checking the VIF Scores

Let's address the issue of highly correlated columns based on VIF (variance inflation factor) scores.

Given that Anova is a newly established health insurance company, precise predictions and informed business decisions are crucial, so the model should rely on the most reliable features. Highly correlated features inflate the variance of the coefficient estimates and make them unstable, hence the need to identify and remove them. To start, let's assess the VIF scores across all features.
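For intuition, the VIF of a feature is commonly defined as 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes this by hand with NumPy on synthetic data. The helper name `vif_from_r_squared` and the arrays `a`, `b`, `c` are illustrative assumptions and not part of the notebook.

```python
import numpy as np

def vif_from_r_squared(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    # R^2 = 1 - SS_res / SS_tot (residuals have zero mean due to the intercept)
    r_squared = 1.0 - residuals.var() / y.var()
    return 1.0 / (1.0 - r_squared)

# Synthetic data: c is almost a copy of a, b is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
X_demo = np.column_stack([a, b, c])

vifs = [vif_from_r_squared(X_demo, j) for j in range(X_demo.shape[1])]
print(vifs)
```

The two nearly identical columns get large VIF values, while the independent column stays close to 1, which is the pattern the threshold of 10 used later is meant to catch.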

In [8]:
# Import variance_inflation_factor
from statsmodels.stats.outliers_influence import variance_inflation_factor

Steps for the solution:
1. **Initialize DataFrame:** Create an empty DataFrame `vif_data` to store VIF results.

2. **Add 'Variable' Column:** Add a column "Variable" to `vif_data` with column names from the input DataFrame `data`.

3. **Calculate VIF:** Use list comprehension and `variance_inflation_factor` to calculate VIF for each column in `data`.

4. **Add 'VIF' Column:** Include a "VIF" column in `vif_data` with the calculated VIF scores for each variable.

5. **Return DataFrame:** Return the resulting DataFrame `vif_data` containing variables and their respective VIF scores.

In [9]:
# Function to calculate the VIF score
def calculate_vif(data):
    """Return a DataFrame with the VIF score of every column in `data`."""
    vif_data = pd.DataFrame()
    vif_data["Variable"] = data.columns
    # variance_inflation_factor takes the 2-D array of features and a column index
    vif_data["VIF"] = [variance_inflation_factor(data.values, i) for i in range(data.shape[1])]
    return vif_data

# Run the function on the features to verify that it works
calculate_vif(X)

Unnamed: 0,Variable,VIF
0,const,21309.981005
1,Age,1.448242
2,BMI,1.707676
3,Blood_Pressure,1.351868
4,Cholesterol,2.049156
5,Glucose_Level,3.413105
6,Heart_Rate,1.554534
7,Sleep_Hours,4.304752
8,Exercise_Hours,1.177465
9,Water_Intake,2.035291


In [10]:
assert calculate_vif(X).shape == (23,2), 'Please make sure to calculate the VIF score for each column within the feature set "X", and store these scores in a DataFrame, associating column names with their respective VIF scores'

We notice that some features exhibit a VIF score exceeding 10, signifying strong correlations among them. Hence, it's necessary to address multicollinearity by systematically removing the feature with the highest VIF score in each iteration. This process is repeated until no feature demonstrates a VIF score greater than 10, ensuring the model doesn't contain highly correlated predictors.

# Removing features with VIF score > 10

Steps for the solution:
1. Initialize VIF_features_removed as an empty list.
2. While max_vif is greater than the VIF threshold:
    
    a. Calculate VIF scores for each feature in dataset X using calculate_vif function.
    
    b. Find the maximum VIF score among all features and assign it to max_vif.
    
    c. If max_vif exceeds the VIF threshold:
        
        
        i. Identify the feature causing the highest VIF score (max) and store it in 'remove'.

        ii. Append the name of the removed feature to VIF_features_removed list.

        iii. Remove the identified feature from dataset X.

3. End the loop when no feature has a VIF score greater than the threshold.

In [11]:
# Iteratively remove the feature with the highest VIF
vif_threshold = 10
max_vif = float('inf')
VIF_features_removed = []

while max_vif > vif_threshold:
    # Calculate VIF scores for the current feature set
    vif_df = calculate_vif(X)

    # Find the maximum VIF score
    max_vif = vif_df["VIF"].max()

    # If the maximum VIF is above the threshold, remove that variable
    if max_vif > vif_threshold:
        remove = vif_df.sort_values("VIF", ascending=False).iloc[0]
        VIF_features_removed.append(remove["Variable"])

        # Drop the column from the features in place
        X.drop(remove["Variable"], axis=1, inplace=True)

# Display the features and their VIFs after the iterative removal
final_vif = calculate_vif(X)
final_vif.sort_values(by='VIF', ascending=False)

Unnamed: 0,Variable,VIF
1,Sleep_Hours,9.820069
3,Water_Intake,5.86314
4,Stress_Level,4.687377
0,Age,2.870769
2,Exercise_Hours,2.838009
9,PhysicalActivity,2.487722
7,Diet,2.484408
10,MedicalHistory,2.473794
5,Smoking,2.446548
6,Alcohol,2.440187


In [12]:
assert final_vif.shape == (17,2), 'Please make sure to remove all features within the DataFrame "X" that possess VIF scores surpassing 10'

We verify the shapes of X (features) and y (target)

In [13]:
X.shape, y.shape

((9549, 17), (9549,))

We will build the model once again without the removed features and see the performance.

In [14]:
#Adding a constant
X = sm.add_constant(X)

#fitting the model
model = sm.Logit(y, X).fit()

#Let's take a look at the summary of the model
model.summary()

Optimization terminated successfully.
         Current function value: 0.657178
         Iterations 5


Dep. Variable:,Target,No. Observations:,9549.0
Model:,Logit,Df Residuals:,9531.0
Method:,MLE,Df Model:,17.0
Date:,"Sat, 13 Jan 2024",Pseudo R-squ.:,0.05064
Time:,11:57:21,Log-Likelihood:,-6275.4
converged:,True,LL-Null:,-6610.1
Covariance Type:,nonrobust,LLR p-value:,2.7290000000000003e-131

,coef,std err,z,P>|z|,[0.025,0.975]
const,-1.0553,0.190,-5.557,0.000,-1.428,-0.683
Age,-0.0167,0.001,-17.688,0.000,-0.019,-0.015
Sleep_Hours,0.1640,0.012,13.409,0.000,0.140,0.188
Exercise_Hours,-0.0228,0.016,-1.404,0.160,-0.055,0.009
Water_Intake,-0.0790,0.015,-5.277,0.000,-0.108,-0.050
Stress_Level,0.2012,0.015,13.683,0.000,0.172,0.230
Smoking,-0.0211,0.026,-0.810,0.418,-0.072,0.030
Alcohol,0.0081,0.026,0.312,0.755,-0.043,0.059
Diet,0.0139,0.026,0.532,0.595,-0.037,0.065


We can see from this model that we received a Pseudo R-squ. score of 0.05064, down from 0.3996 for the full model. Next, we drop the features that have a p-value > 0.05, as those features are not statistically significant in this model, and then we check the performance of the model once again.
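The Pseudo R-squ. value reported by the summary is McFadden's pseudo R-squared, which is 1 minus the ratio of the model log-likelihood to the null log-likelihood. As a minimal check, the sketch below recomputes it from the rounded `Log-Likelihood` and `LL-Null` values printed in the two summaries above; the function name `mcfadden_pseudo_r2` is an illustrative choice, and the results only match the summaries up to rounding.

```python
def mcfadden_pseudo_r2(log_likelihood, null_log_likelihood):
    """McFadden's pseudo R-squared: 1 - LL_model / LL_null."""
    return 1.0 - log_likelihood / null_log_likelihood

# Rounded values copied from the model summaries above.
ll_null = -6610.1
r2_full = mcfadden_pseudo_r2(-3968.6, ll_null)      # model with all features
r2_reduced = mcfadden_pseudo_r2(-6275.4, ll_null)   # model after VIF-based removal

print(f"Full model:    {r2_full:.4f}")
print(f"Reduced model: {r2_reduced:.4f}")
```

The gap between the two values shows how much explanatory power the features removed in the VIF step were carrying.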

In [15]:
X.shape, y.shape

((9549, 18), (9549,))

In [16]:
X.drop('const', axis = 1, inplace = True)

Let's proceed to eliminate columns that exhibit statistical insignificance by examining their respective p-values.

# Checking the P value

Steps for the solution:
1. Identify features with p-values exceeding 0.05 as statistically insignificant.
2. Remove insignificant features from dataset X.

In [17]:
model.pvalues

const                   2.745110e-08
Age                     5.183459e-70
Sleep_Hours             5.365149e-41
Exercise_Hours          1.603146e-01
Water_Intake            1.312907e-07
Stress_Level            1.275362e-42
Smoking                 4.179934e-01
Alcohol                 7.552523e-01
Diet                    5.947830e-01
MentalHealth            1.739667e-01
PhysicalActivity        7.296350e-01
MedicalHistory          9.010360e-01
Allergies               6.241488e-01
Diet_Type_Vegan         2.856272e-02
Diet_Type_Vegetarian    6.020074e-01
Blood_Group_AB          6.469643e-01
Blood_Group_B           7.686861e-01
Blood_Group_O           6.075067e-01
dtype: float64
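These p-values follow from the z statistics in the summary: for a two-sided test against the standard normal distribution, p = 2 * (1 - Φ(|z|)). The sketch below reproduces two of them from the rounded z values shown in the summary table; the helper name `two_sided_p_value` is illustrative, and the results agree with the output above only up to rounding of z.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|))
    return math.erfc(abs(z) / math.sqrt(2.0))

# Rounded z values copied from the model summary above.
p_exercise = two_sided_p_value(-1.404)   # Exercise_Hours
p_water = two_sided_p_value(-5.277)      # Water_Intake

print(f"Exercise_Hours: p = {p_exercise:.4f}")
print(f"Water_Intake:   p = {p_water:.3e}")
```

Exercise_Hours lands above the 0.05 cutoff and Water_Intake far below it, which is why only the former is dropped in the next step.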

In [18]:
# Identify features with p-values greater than 0.05
features_to_remove = model.pvalues[model.pvalues > 0.05].index
# Never remove the intercept on p-value grounds: it is not a feature, and X no longer
# has a 'const' column (it was dropped earlier), so X.drop would raise a KeyError
if 'const' in features_to_remove:
    features_to_remove = features_to_remove.drop('const')
print(features_to_remove)

Index(['Exercise_Hours', 'Smoking', 'Alcohol', 'Diet', 'MentalHealth',
       'PhysicalActivity', 'MedicalHistory', 'Allergies',
       'Diet_Type_Vegetarian', 'Blood_Group_AB', 'Blood_Group_B',
       'Blood_Group_O'],
      dtype='object')


In [19]:
# Drop those features
X = X.drop(columns=features_to_remove)
X.head()

Unnamed: 0,Age,Sleep_Hours,Water_Intake,Stress_Level,Diet_Type_Vegan
0,2.0,4.0,5.0,5.0,0
1,8.0,2.0,2.0,9.0,0
2,81.0,10.0,5.0,1.0,1
3,25.0,7.0,3.0,3.0,1
4,24.0,10.0,5.0,1.0,0


In [20]:
assert X.shape == (9549,5), 'Please make sure to remove all the features with p-values greater than 0.05'

In [21]:
# Building the model using only the statistically significant features
#Adding a constant
X = sm.add_constant(X)

#fitting the model
model2 = sm.Logit(y, X).fit()

#Let's take a look at the summary of the model
model2.summary()

Optimization terminated successfully.
         Current function value: 0.657476
         Iterations 5


Dep. Variable:,Target,No. Observations:,9549.0
Model:,Logit,Df Residuals:,9543.0
Method:,MLE,Df Model:,5.0
Date:,"Sat, 13 Jan 2024",Pseudo R-squ.:,0.0502
Time:,11:57:21,Log-Likelihood:,-6278.2
converged:,True,LL-Null:,-6610.1
Covariance Type:,nonrobust,LLR p-value:,3.434e-141

,coef,std err,z,P>|z|,[0.025,0.975]
const,-1.1606,0.159,-7.311,0.000,-1.472,-0.849
Age,-0.0165,0.001,-17.822,0.000,-0.018,-0.015
Sleep_Hours,0.1639,0.012,13.416,0.000,0.140,0.188
Water_Intake,-0.0737,0.014,-5.096,0.000,-0.102,-0.045
Stress_Level,0.2046,0.014,14.158,0.000,0.176,0.233
Diet_Type_Vegan,0.1021,0.045,2.259,0.024,0.014,0.191
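To see how fitted coefficients turn into a prediction, the sketch below applies the logistic function to the first row of the `X.head()` output shown earlier, using the rounded coefficients from this summary. The helper name `predict_probability` and the 0.5 cutoff are illustrative assumptions, and because the coefficients are rounded the result is approximate.

```python
import math

# Rounded coefficients copied from the model2 summary above.
coef = {
    "const": -1.1606,
    "Age": -0.0165,
    "Sleep_Hours": 0.1639,
    "Water_Intake": -0.0737,
    "Stress_Level": 0.2046,
    "Diet_Type_Vegan": 0.1021,
}

# First row of the reduced feature set, as printed by X.head().
row = {"Age": 2.0, "Sleep_Hours": 4.0, "Water_Intake": 5.0,
       "Stress_Level": 5.0, "Diet_Type_Vegan": 0}

def predict_probability(coefficients, features):
    """Logistic function applied to the intercept plus the weighted features."""
    log_odds = coefficients["const"] + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

probability = predict_probability(coef, row)
predicted_class = int(probability >= 0.5)  # assumed cutoff of 0.5

print(f"P(Target = 1) = {probability:.3f}, predicted class = {predicted_class}")
```

The probability sits only slightly above 0.5 for this row, which is consistent with the low Pseudo R-squ. of the reduced model.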


In [22]:
X.drop(columns = ['const'], inplace = True)

# Logistic - Model Building

## Dividing the data

Let's perform the train-test split. We use a 70:30 split with random_state = 42 and stratify = y.

In [24]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42, stratify = y)

Let us proceed to scaling the data before creating the logistic regression model.

## Standard Scaling

In [25]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

Now, let's scale the data as usual.

In [26]:
# Fit the scaler on the training data only, then apply the same transform to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Let's convert them into a dataframe.

In [27]:
# Converting train and test scaled data into a dataframe
X_train_scaled = pd.DataFrame(X_train_scaled, columns = X.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns = X.columns)
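The scaling step above standardizes each feature as (x - mean) / std, with the mean and standard deviation learned from the training data only and then reused on the test data. The sketch below shows that arithmetic for a single made-up feature using only the standard library. The lists `train` and `test` are hypothetical, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
import statistics

# Hypothetical values of one feature; not taken from the dataset.
train = [2.0, 4.0, 6.0]
test = [8.0]

# Learn the statistics from the training data only.
mean = statistics.fmean(train)
std = statistics.pstdev(train)  # population standard deviation (ddof = 0)

# Apply the same transform to both sets.
train_scaled = [(x - mean) / std for x in train]
test_scaled = [(x - mean) / std for x in test]

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics on the test set keeps information about the test data from leaking into the model.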

# Model Building

Now let us build the logistic regression model.

In [28]:
# Import Logistic Regression
from sklearn.linear_model import LogisticRegression

In [29]:
# Fitting the logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train_scaled, y_train)

In [30]:
#Making predictions on training and test data
y_train_pred = logreg.predict(X_train_scaled)
y_test_pred = logreg.predict(X_test_scaled)

# Model Evaluation

Now let us evaluate the model using the F1 score. Store the F1 scores in the `score_train` and `score_test` variables.

In [31]:
# Import f1-score
from sklearn.metrics import f1_score

In [32]:
# Generate and print the f1-score for train and test
score_train = f1_score(y_train, y_train_pred)
score_test = f1_score(y_test, y_test_pred)


print('Train score: ',score_train)
print('Test score: ', score_test)

Train score:  0.6454826982630941
Test score:  0.6376448481052301
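For reference, the F1 score is the harmonic mean of precision and recall for the positive class. The sketch below computes it from confusion-matrix counts; the counts are hypothetical and the function name `f1_from_counts` is an illustrative choice, so the result is unrelated to the scores printed above.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, for illustration only.
score = f1_from_counts(tp=60, fp=30, fn=40)
print(f"F1 = {score:.4f}")
```

Because it combines precision and recall, F1 penalizes a model that does well on one but poorly on the other.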


In [33]:
assert score_train > 0.64, 'Check if the model has been made after following all the instructions.'
assert score_test > 0.63, 'Check if the model has been made after following the instructions.'

With F1 scores of roughly 0.65 on the training set and 0.64 on the test set, the logistic regression model is underfitting. To tackle this, Anova also decides to implement decision trees for this supervised problem.