![](image.jpg)


Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a data scientist at a leading health insurance company, you have the opportunity to predict customer healthcare costs using machine learning. Your predictions will help tailor services and guide customers in planning their healthcare expenses more effectively.

## Dataset Summary

Meet your primary tool: the `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

## insurance.csv
| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |



A bit of data cleaning is key to ensure the dataset is ready for modeling. Once your model is built using the `insurance.csv` dataset, the next step is to apply it to the `validation_dataset.csv`. This new dataset, similar to your training data minus the `charges` column, tests your model's accuracy and real-world utility by predicting costs for new customers.

## Let's Get Started!

This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!

In [165]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

In [166]:
# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.head()

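|   | age  | sex    | bmi    | children | smoker | region    | charges      |
|---|------|--------|--------|----------|--------|-----------|--------------|
| 0 | 19.0 | female | 27.9   | 0.0      | yes    | southwest | 16884.924    |
| 1 | 18.0 | male   | 33.77  | 1.0      | no     | Southeast | 1725.5523    |
| 2 | 28.0 | male   | 33.0   | 3.0      | no     | southeast | $4449.462    |
| 3 | 33.0 | male   | 22.705 | 0.0      | no     | northwest | $21984.47061 |
| 4 | 32.0 | male   | 28.88  | 0.0      | no     | northwest | $3866.8552   |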


In [167]:
#dropping the missing values
insurance.dropna(inplace=True)

In [168]:
#exploring the dataframe
insurance.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1208 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1208 non-null   float64
 1   sex       1208 non-null   object 
 2   bmi       1208 non-null   float64
 3   children  1208 non-null   float64
 4   smoker    1208 non-null   object 
 5   region    1208 non-null   object 
 6   charges   1208 non-null   object 
dtypes: float64(3), object(4)
memory usage: 75.5+ KB


In [169]:
#standardizing the sex column
insurance['sex'] = insurance['sex'].replace({'woman':'female','F':'female','man':'male','M':'male'})
insurance['sex'].unique()

array(['female', 'male'], dtype=object)
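
The `replace` call above lists each variant explicitly. As a minimal sketch of a slightly more defensive version, the labels can be stripped and lowercased first and then mapped, so that any variant missing from the mapping shows up as `NaN`. The sample values and the `sex_map` name below are hypothetical, not taken from `insurance.csv`.

```python
import pandas as pd

# Hypothetical sample values; the real column may contain other variants.
sex = pd.Series(["female", "F", "woman", "male", "M", "man", " Male "])

# Hypothetical mapping from known variants to the two canonical labels.
sex_map = {
    "f": "female", "woman": "female", "female": "female",
    "m": "male", "man": "male", "male": "male",
}

# Strip whitespace and lowercase first so the mapping needs fewer keys.
normalized = sex.str.strip().str.lower().map(sex_map)

# Labels absent from the mapping become NaN, which makes them easy to count.
print(normalized.unique())
print("unmapped rows:", normalized.isna().sum())
```

Unlike `replace`, `map` does not pass unknown values through unchanged, so a non-zero unmapped count is a prompt to extend the mapping before modeling.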

In [170]:
#changing all the age values to absolute to deal with negatives
insurance['age'] = np.absolute(insurance['age'])
insurance['age'].describe()

count    1208.000000
mean       39.223510
std        14.071944
min        18.000000
25%        26.000000
50%        39.000000
75%        51.000000
max        64.000000
Name: age, dtype: float64

In [171]:
#transform all the negative values in children to zero
insurance['children'] = np.where(insurance['children']<0,0,insurance['children'])

In [172]:
#cleaning the charges column
insurance['charges'] = insurance['charges'].str.replace('$', '', regex=False)


In [173]:
#transform region to all lower case
insurance['region'] = insurance['region'].str.lower()

In [174]:
#converting the charges column to float
insurance['charges'] = insurance['charges'].astype(float)

In [175]:
insurance['charges'].isnull().sum()

1

In [176]:
insurance.dropna(inplace=True)
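
The `charges` cleanup in the cells above (strip the `$`, convert to float, drop what is missing) could also be sketched with `pd.to_numeric`, which coerces unparseable entries to `NaN` instead of raising. The raw values below are hypothetical stand-ins for the mixed formats seen in the preview, not rows from the dataset.

```python
import pandas as pd

# Hypothetical raw values mimicking the mixed formats in `charges`.
raw = pd.Series(["16884.924", "$4449.462", "$21984.47061", "nan", None])

# regex=False treats "$" as a literal character, not as a regex anchor.
stripped = raw.str.replace("$", "", regex=False)

# errors="coerce" turns anything that cannot be parsed into NaN.
charges = pd.to_numeric(stripped, errors="coerce")

# Count the entries that failed to parse, then drop them.
print("missing after conversion:", charges.isna().sum())
clean = charges.dropna()
print(clean.dtype)
```

Counting the `NaN` values before dropping them gives a quick check on how many rows the conversion costs.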

In [177]:
#separating target variable from features
target = insurance.pop('charges')

In [178]:
target.isnull().sum()

0

In [179]:
#transforming the categorical features
cat_features = ['sex','smoker','region']
features = pd.get_dummies(data = insurance, columns = cat_features)
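
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a toy frame. The three rows are made up for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Toy frame with made-up rows; column names follow insurance.csv.
df = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# Each listed column is replaced by one indicator column per category.
encoded = pd.get_dummies(df, columns=["sex", "smoker"])
print(encoded.columns.tolist())
```

Numeric columns such as `age` pass through untouched, while each categorical column expands into `<column>_<category>` indicators.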

In [180]:
#initialize scaler
scaler = StandardScaler()

In [181]:
#Build a Linear Regression model
#initialize the model
lin_reg = LinearRegression()
#build a pipeline with the scaler to preprocess the data and build the model
steps = [('scaler',scaler),('lin_reg',lin_reg)]
insurance_model_pipe = Pipeline(steps)

#fitting the model
insurance_model_pipe.fit(features,target)

In [182]:
#analyzing the model
r2_scores = cross_val_score(insurance_model_pipe,features,target,cv=5,scoring = 'r2')

In [183]:
mean_r2 = np.mean(r2_scores)
mean_r2

0.7438652844825541
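
R² says how much variance the model explains, but not how large the errors are in dollars. As a hedged sketch, the same `cross_val_score` call can report RMSE through scikit-learn's `neg_root_mean_squared_error` scorer. The data below is synthetic and exists only to make the example runnable; it is not the insurance data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, used only so the sketch runs on its own.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

pipe = Pipeline([("scaler", StandardScaler()), ("lin_reg", LinearRegression())])

# R^2 per fold, as in the cell above.
r2 = cross_val_score(pipe, X, y, cv=5, scoring="r2")

# scikit-learn scorers are "higher is better", so RMSE comes back negated.
neg_rmse = cross_val_score(pipe, X, y, cv=5, scoring="neg_root_mean_squared_error")
rmse = -neg_rmse

print("mean R^2:", r2.mean())
print("mean RMSE:", rmse.mean())
```

On the insurance data, the RMSE would be in the same units as `charges`, which makes it easier to judge whether the error is acceptable for cost planning.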

In [184]:
#testing the model
#load validation data
validation_data = pd.read_csv('validation_dataset.csv')

In [186]:
#preprocessing the test data
features_test = pd.get_dummies(validation_data, columns = cat_features)
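
One thing to watch when encoding the validation data separately: `pd.get_dummies` only creates columns for categories that are actually present, so the validation features can end up with fewer columns, or a different column order, than the training features. A minimal sketch of one common guard, using `reindex` against the training columns, is below. The toy frames are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames; the validation set is missing two regions.
train = pd.DataFrame({
    "age": [19, 28, 33, 45],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})
valid = pd.DataFrame({"age": [18, 39], "region": ["southeast", "northeast"]})

train_features = pd.get_dummies(train, columns=["region"])
valid_features = pd.get_dummies(valid, columns=["region"])

# Force the validation frame into the training layout:
# missing indicator columns are added as 0, extra ones are dropped.
valid_aligned = valid_features.reindex(columns=train_features.columns, fill_value=0)

print(valid_features.columns.tolist())
print(valid_aligned.columns.tolist())
```

In this notebook the equivalent would be reindexing `features_test` against `features.columns` before calling `predict`, assuming the validation file uses the same category labels as the cleaned training data.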

In [187]:
#predicting the charges for the validation data
predicted_charges = insurance_model_pipe.predict(features_test)

In [188]:
#add the predicted charges to the validation dataset and set the minimum charge value at 1000
validation_data['predicted_charges'] = predicted_charges

validation_data.loc[validation_data['predicted_charges']<1000,'predicted_charges'] = 1000
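
The floor of 1000 applied above with `.loc` can also be written with `Series.clip`. This is a sketch on made-up prediction values, not the model's actual output.

```python
import pandas as pd

# Made-up predictions, including a negative one a linear model can produce.
preds = pd.Series([-250.0, 980.5, 1856.18, 23401.61])

# clip(lower=...) raises every value below the floor up to the floor.
floored = preds.clip(lower=1000)
print(floored.tolist())
```

Both forms give the same result; `clip` returns a new Series rather than modifying the column in place.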

In [189]:
validation_data.head()

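|   | age  | sex    | bmi       | children | smoker | region    | predicted_charges |
|---|------|--------|-----------|----------|--------|-----------|-------------------|
| 0 | 18.0 | female | 24.09     | 1.0      | no     | southeast | 1000.0            |
| 1 | 39.0 | male   | 26.41     | 0.0      | yes    | northeast | 23401.612182      |
| 2 | 27.0 | male   | 29.15     | 0.0      | yes    | southeast | 20830.965512      |
| 3 | 71.0 | male   | 65.502135 | 13.0     | yes    | southeast | 34099.161895      |
| 4 | 28.0 | male   | 38.06     | 0.0      | no     | southeast | 1856.184228       |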
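
The preview includes a customer aged 71, while the cleaned training data only covered ages 18 to 64 (see the `describe()` output earlier). A linear model will extrapolate for such rows, so it may be worth flagging them. The sketch below does this for `age` only, using the five ages from the preview. The `age_out_of_range` column name is an assumption made for illustration.

```python
import pandas as pd

# Ages copied from the validation preview above.
validation = pd.DataFrame({"age": [18.0, 39.0, 27.0, 71.0, 28.0]})

# Age range of the cleaned training data, from the describe() output.
train_age_min, train_age_max = 18.0, 64.0

# between() is inclusive on both ends; negate it to flag extrapolated rows.
validation["age_out_of_range"] = ~validation["age"].between(train_age_min, train_age_max)
print(validation)
```

The same check could be repeated for `bmi` and `children` once their training ranges are known.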
