![](image.jpg)


Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a Data Scientist at a top Health Insurance company, you have the opportunity to predict customer healthcare costs using the power of machine learning. Your insights will help tailor services and guide customers in planning their healthcare expenses more effectively.

## Dataset Summary

Meet your primary tool: the `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

## insurance.csv
| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |



A bit of data cleaning is key to ensure the dataset is ready for modeling. Once your model is built using the `insurance.csv` dataset, the next step is to apply it to the `validation_dataset.csv`. This new dataset, similar to your training data minus the `charges` column, tests your model's accuracy and real-world utility by predicting costs for new customers.

In [18]:
# Re-run this cell
# Import required libraries
import pandas as pd
import numpy as np
import warnings
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge, Lasso

In [20]:
# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19.0,female,27.9,0.0,yes,southwest,16884.924
1,18.0,male,33.77,1.0,no,Southeast,1725.5523
2,28.0,male,33.0,3.0,no,southeast,$4449.462
3,33.0,male,22.705,0.0,no,northwest,$21984.47061
4,32.0,male,28.88,0.0,no,northwest,$3866.8552


In [21]:
# Data exploration and cleaning
print(insurance.isna().sum().sort_values())
print(insurance.info())

charges     54
age         66
sex         66
bmi         66
children    66
smoker      66
region      66
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1272 non-null   float64
 1   sex       1272 non-null   object 
 2   bmi       1272 non-null   float64
 3   children  1272 non-null   float64
 4   smoker    1272 non-null   object 
 5   region    1272 non-null   object 
 6   charges   1284 non-null   object 
dtypes: float64(3), object(4)
memory usage: 73.3+ KB
None


In [22]:
columns=insurance.columns
for column in columns:
    print(insurance[column].unique())

[ 19.  18.  28.  33.  32. -31.  46.  37.  60.  25.  62.  23.  56. -27.
  52. -23.  30. -34.  59.  63.  55.  31.  22.  nan  26.  35.  24.  41.
  21.  48.  36.  40.  58.  34.  43.  64.  20.  61.  27.  53.  44.  57.
 -41.  45. -35.  54.  38.  29.  49.  47.  51.  42.  50. -44. -39. -28.
 -40.  39. -25. -52. -26. -47. -45. -57. -43. -50. -58. -56. -30. -51.
 -60. -37. -55. -64. -22. -36. -21. -18. -20. -19. -33.]
['female' 'male' 'woman' 'F' 'man' nan 'M']
[27.9   33.77  33.    22.705 28.88  25.74  33.44  27.74  29.83  25.84
 26.22  26.29  34.4   39.82  42.13  24.6   30.78  23.845 40.3   35.3
 36.005 32.4   34.1      nan 28.025 27.72  23.085 32.775 17.385 36.3
 35.6   26.315 28.6   28.31  36.4   20.425 32.965 20.8   36.67  39.9
 26.6   36.63  21.78  37.3   38.665 34.77  24.53  35.625 28.    34.43
 28.69  36.955 31.825 31.68  22.88  37.335 27.36  33.66  24.7   25.935
 22.42  28.9   39.1   36.19  23.98  24.75  28.5   28.1   32.01  27.4
 34.01  35.53  39.805 26.885 38.285 37.62  41.23  34.8   

In [23]:
# Setting negative age to positive
insurance['age'] = np.abs(insurance['age'])

# Mapping the alternative spellings of sex onto 'female' and 'male'
insurance['sex'] = insurance['sex'].replace({'F':'female', 'woman':'female'})
insurance['sex'] = insurance['sex'].replace({'M':'male', 'man':'male'})

# Setting negative number of children to positive
insurance['children'] = np.abs(insurance['children'])

# Normalizing region labels to lowercase
insurance['region'] = insurance['region'].str.lower()

In [24]:
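# Removing the literal '$' from charges and converting to float
warnings.filterwarnings("ignore")
# regex=False: read as a regular expression, '$' matches the end of the string, not the character
insurance['charges'] = insurance['charges'].str.replace('$', '', regex=False).astype(float)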
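
The cleaning steps so far could be bundled into one function, so that the same rules can later be applied to any file with these columns. The sketch below is one possible way to do that. The name `clean_insurance` and the sample rows are made up for illustration and are not part of the original notebook.

```python
import pandas as pd


def clean_insurance(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of df (hypothetical helper mirroring the cells above)."""
    out = df.copy()
    # Negative ages and child counts are treated as sign errors.
    out["age"] = out["age"].abs()
    out["children"] = out["children"].abs()
    # Collapse the alternative spellings into two labels.
    out["sex"] = out["sex"].replace(
        {"F": "female", "woman": "female", "M": "male", "man": "male"}
    )
    out["region"] = out["region"].str.lower()
    # The target column is absent from the validation file, so guard this step.
    if "charges" in out.columns:
        out["charges"] = (
            out["charges"].astype(str).str.replace("$", "", regex=False).astype(float)
        )
    return out


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "age": [19.0, -31.0],
    "sex": ["F", "man"],
    "bmi": [27.9, 33.0],
    "children": [0.0, -2.0],
    "smoker": ["yes", "no"],
    "region": ["Southeast", "northwest"],
    "charges": ["16884.924", "$4449.462"],
})
cleaned = clean_insurance(sample)
print(cleaned)
```

Because the function works on a copy, the raw frame stays available for comparison.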

In [7]:
# Check for the presence of any negative values
if (insurance['charges'] < 0).any():
    print("The 'charges' feature contains negative values.")
else:
    print("The 'charges' feature does not contain negative values.")

The 'charges' feature does not contain negative values.


In [25]:
# Drop rows with missing values in the categorical features
insurance_copy = insurance.copy()
cat_cols = ['sex', 'smoker', 'region']
num_cols = ['age', 'bmi', 'children', 'charges']
# .copy() avoids writing into a view of the original frame
insurance = insurance.dropna(subset = cat_cols).copy()

# Fill missing numerical features and the target with the column mean.
# The means are computed on the full dataset, so this risks leaking information across CV folds.
for num_col in num_cols:
    insurance[num_col] = insurance[num_col].fillna(insurance[num_col].mean())
print(insurance.isna().sum().sort_values())

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
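
One way to reduce the leakage risk mentioned in the comment above is to move imputation into a scikit-learn `Pipeline`, so that the fill values are recomputed from the training part of every fold. The sketch below uses synthetic data with made-up coefficients, not `insurance.csv`. It also drops rows with a missing target instead of filling them, which is a different choice from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric columns; all values are made up.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "age": rng.integers(18, 65, n).astype(float),
    "bmi": rng.normal(30, 6, n),
    "children": rng.integers(0, 5, n).astype(float),
})
demo["charges"] = 250 * demo["age"] + 300 * demo["bmi"] + rng.normal(0, 2000, n)

# Knock out some feature values and some targets.
demo.loc[rng.choice(n, 20, replace=False), "bmi"] = np.nan
demo.loc[rng.choice(n, 10, replace=False), "charges"] = np.nan

# Rows without a target are dropped rather than filled.
demo = demo.dropna(subset=["charges"])
X_demo = demo[["age", "bmi", "children"]]
y_demo = demo["charges"]

# The imputer and scaler are refit on the training part of every fold.
model = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), LinearRegression())
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=kf_demo, scoring="r2")
print(scores.mean())
```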


In [26]:
# Features-target separation
# X = insurance[insurance.columns[:-1]]
X_cat = insurance[cat_cols]
X_num = insurance[['age', 'bmi', 'children']]
y = insurance[insurance.columns[-1]]

# One-hot encoding of categorical features (first level of each dropped as the baseline)
X_cat_encoded = pd.get_dummies(X_cat, drop_first = True)

# Scaling numerical features
scale = StandardScaler()
X_num_scaled = scale.fit_transform(X_num)

# Recombining all features
X = np.hstack((X_cat_encoded, X_num_scaled)) # returns an np.array

In [27]:
# Cross-validation and hyperparameter tuning for linear regression
kf = KFold(n_splits = 6, shuffle = True, random_state=69)
param_grid = {'fit_intercept': [True, False]}
linreg = LinearRegression()
r2_score_linreg = GridSearchCV(linreg, param_grid, cv = kf, scoring='r2')
r2_score_linreg.fit(X,y)
print(r2_score_linreg.best_params_, r2_score_linreg.best_score_)

{'fit_intercept': True} 0.7430517540590519


In [28]:
# Cross-validation and hyperparameter tuning for Lasso regression
param_grid = {'alpha': np.arange(5,12,0.1),
              'fit_intercept': [True, False]}
              #'selection': ['cyclic', 'random']}
lasreg = Lasso()
r2_score_lasreg = GridSearchCV(lasreg, param_grid, cv = kf, scoring='r2')
r2_score_lasreg.fit(X,y)
print(r2_score_lasreg.best_params_, r2_score_lasreg.best_score_)

{'alpha': 11.899999999999975, 'fit_intercept': True} 0.743189264984097


In [37]:
# Cross-validation and hyperparameter tuning for Ridge regression
warnings.filterwarnings("ignore")
# 'lbfgs' is left out: Ridge accepts it only with positive=True, so those fits would fail
param_grid = {'alpha': np.arange(0.1, 3.0, 0.1),
              'fit_intercept': [True, False],
              'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']}
ridgereg = Ridge()
r2_score_ridgereg = GridSearchCV(ridgereg, param_grid, cv = kf, scoring='r2')
r2_score_ridgereg.fit(X,y)
print(r2_score_ridgereg.best_params_, r2_score_ridgereg.best_score_)

{'alpha': 0.5, 'fit_intercept': True, 'solver': 'saga'} 0.7431199077685534


In [36]:
# Establish the best model according to the R2 score
r2_scores_models = {'Linear Regression':r2_score_linreg.best_score_, 
                    'Lasso Regression':r2_score_lasreg.best_score_, 
                    'Ridge Regression':r2_score_ridgereg.best_score_}

best_model = max(r2_scores_models, key=r2_scores_models.get)
r2_score = r2_scores_models[best_model]

print(f'The best model, {best_model}, yields an R2 score of {r2_score}.')

The best model, Lasso Regression, yields an R2 score of 0.743189264984097.
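
The three mean R2 scores above differ only in the fourth decimal place, so it can help to look at how much the score varies between folds before treating one model as the winner. The sketch below shows one way to report that spread. It runs on synthetic data from `make_regression` with arbitrary `alpha` values, so its numbers say nothing about the insurance models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data; a stand-in, not the insurance features.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=25.0, random_state=0)
kf_demo = KFold(n_splits=6, shuffle=True, random_state=0)

candidates = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(alpha=1.0),
    "Ridge Regression": Ridge(alpha=1.0),
}

summary = {}
for name, estimator in candidates.items():
    fold_scores = cross_val_score(estimator, X_demo, y_demo, cv=kf_demo, scoring="r2")
    # Keep the mean and the fold-to-fold standard deviation side by side.
    summary[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: mean R2 = {fold_scores.mean():.4f}, std = {fold_scores.std():.4f}")
```

If the gap between two means is much smaller than the standard deviations, the ranking may change with a different `random_state`.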


In [38]:
# Importing validation dataset
validation_data_path = 'validation_dataset.csv'
validation_data = pd.read_csv(validation_data_path)
validation_data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region
0,18.0,female,24.09,1.0,no,southeast
1,39.0,male,26.41,0.0,yes,northeast
2,27.0,male,29.15,0.0,yes,southeast
3,71.0,male,65.502135,13.0,yes,southeast
4,28.0,male,38.06,0.0,no,southeast


In [39]:
# One-hot encoding of the validation categorical features
# (assumes the validation file contains the same category levels as the training data)
X_test_cat = validation_data[cat_cols]
X_test_cat_encoded = pd.get_dummies(X_test_cat, drop_first = True)

# Scaling numeric features for validation
X_test_num = validation_data[['age', 'bmi', 'children']]
X_test_num_scaled = scale.transform(X_test_num)

# Recombining all features
X_test = np.hstack((X_test_cat_encoded, X_test_num_scaled)) # returns an np.array
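
Calling `pd.get_dummies` separately on the training and validation frames yields matching columns only when both contain the same category levels. The sketch below, on made-up rows, shows how the columns can drift apart and one possible safeguard: declaring the training levels as a `pd.Categorical` before encoding. The frame and variable names are hypothetical.

```python
import pandas as pd

train_cat = pd.DataFrame({
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})
# This made-up batch happens to contain no 'northwest' customers.
new_cat = pd.DataFrame({
    "smoker": ["no", "yes"],
    "region": ["southeast", "southwest"],
})

train_enc = pd.get_dummies(train_cat, drop_first=True)
naive_enc = pd.get_dummies(new_cat, drop_first=True)
print(list(train_enc.columns))
print(list(naive_enc.columns))  # a different baseline level was dropped

# Fix the levels to those seen in training, then encode.
levels = {col: sorted(train_cat[col].dropna().unique()) for col in train_cat.columns}
new_typed = new_cat.copy()
for col, cats in levels.items():
    new_typed[col] = pd.Categorical(new_typed[col], categories=cats)
fixed_enc = pd.get_dummies(new_typed, drop_first=True)
print(list(fixed_enc.columns))
```

With the levels fixed, the dropped baseline is the same in both frames, so each column keeps the meaning the model learned.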

In [40]:
# Fit the best model as found above
# Reuse the hyperparameters selected by the grid search instead of hard-coding them
best_Lasso = Lasso(**r2_score_lasreg.best_params_)
best_Lasso.fit(X,y)
predicted_charges = best_Lasso.predict(X_test)

In [42]:
# Floor the predictions: any predicted charge below 1000 is set to 1000
predicted_charges = np.where(predicted_charges<1000,1000,predicted_charges)

# Save predictions to the validation dataset
validation_data['predicted_charges'] = predicted_charges
predicted_charges

array([ 1000.        , 30618.27143654, 27856.0455876 , 56556.010591  ,
        7222.98005917, 57986.33752872,  7042.7896758 , 13028.63655232,
       12488.60133552, 16104.1286815 ,  2664.40990183, 14163.39642756,
       11236.34774724, 11693.99079223,  2846.84045308,  4059.01479135,
       42186.3489687 , 63275.20851037, 58431.99619314, 11192.90743572,
        1000.        , 12690.46725223, 32061.73070413, 11916.06584556,
        9772.4094064 ,  5278.35077437, 58031.68117805,  3310.73307333,
       11729.6135255 , 10504.00236841,  6296.94231912, 27234.94835398,
       30505.22499649, 13002.56642456, 32012.02298987, 13916.91154024,
       58210.90846304, 14318.50879437,  1000.        , 29651.23314634,
       29946.07726723, 11863.72708779,  3862.10637292, 59615.02495455,
        5913.89386758, 39658.89841137, 67237.35874955, 30872.90489216,
       14925.78425367, 35251.80580692])
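
The floor of 1000 applied to the predictions can be written in several equivalent ways in NumPy. The short sketch below compares `np.where`, `np.maximum`, and `np.clip` on made-up values. The array `raw` is hypothetical and is not taken from the output above.

```python
import numpy as np

raw = np.array([-350.0, 999.99, 1000.0, 30618.27])  # made-up predictions
floor = 1000.0

with_where = np.where(raw < floor, floor, raw)  # form used in the notebook
with_maximum = np.maximum(raw, floor)           # elementwise maximum with the floor
with_clip = np.clip(raw, floor, None)           # lower bound only, no upper bound

print(with_maximum)
```

All three return the same array, so the choice is mostly a matter of readability.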