![](image.jpg)


Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a data scientist at a leading health insurance company, you will use machine learning to predict customer healthcare costs. Your predictions will help tailor services and guide customers in planning their healthcare expenses more effectively.

## Dataset Summary

Meet your primary tool: the `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

## insurance.csv
| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |



The raw data needs some cleaning before it is ready for modeling. Once your model is built on the `insurance.csv` dataset, the next step is to apply it to `validation_dataset.csv`. This dataset has the same columns as the training data except `charges`, so it tests how well the model predicts costs for new customers.

## Let's Get Started!

This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!

In [69]:
# Re-run this cell
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.head()

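|   | age  | sex    | bmi    | children | smoker | region    | charges      |
|---|------|--------|--------|----------|--------|-----------|--------------|
| 0 | 19.0 | female | 27.9   | 0.0      | yes    | southwest | 16884.924    |
| 1 | 18.0 | male   | 33.77  | 1.0      | no     | Southeast | 1725.5523    |
| 2 | 28.0 | male   | 33.0   | 3.0      | no     | southeast | $4449.462    |
| 3 | 33.0 | male   | 22.705 | 0.0      | no     | northwest | $21984.47061 |
| 4 | 32.0 | male   | 28.88  | 0.0      | no     | northwest | $3866.8552   |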


In [70]:
# Exploring our dataset
insurance.describe()
insurance.dtypes
insurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1272 non-null   float64
 1   sex       1272 non-null   object 
 2   bmi       1272 non-null   float64
 3   children  1272 non-null   float64
 4   smoker    1272 non-null   object 
 5   region    1272 non-null   object 
 6   charges   1284 non-null   object 
dtypes: float64(3), object(4)
memory usage: 73.3+ KB
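
The output above shows `charges` stored as `object` and missing entries in every column, which hints at inconsistent raw values. A minimal sketch of one way to surface such problems before cleaning, using a small made-up frame (the values in `toy` are illustrative assumptions, not rows from `insurance.csv`):

```python
import pandas as pd

# Toy frame with made-up values that mimic the kinds of inconsistencies to look for
toy = pd.DataFrame({
    "sex": ["male", "M", "female", "F", "woman"],
    "region": ["southwest", "Southeast", "southeast", "northwest", "Northwest"],
    "age": [19, -18, 28, 33, 32],
})

# Distinct labels per text column reveal spelling and casing variants
for col in ["sex", "region"]:
    print(col, sorted(toy[col].dropna().unique()))

# Count negative entries in a numeric column that should never be negative
negative_ages = int((toy["age"] < 0).sum())
print("negative ages:", negative_ages)
```

Running the same checks on the real columns is a quick way to decide which replacements and sign fixes are needed.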


In [71]:
# Clean the dataset: drop duplicates and missing values
# drop_duplicates() returns a new DataFrame, so the result must be assigned
insurance = insurance.drop_duplicates()
insurance = insurance.dropna()
insurance.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
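
Both `drop_duplicates()` and `dropna()` return new DataFrames rather than modifying the original in place. A small sketch on a made-up frame (the `toy` data is an assumption for illustration) shows the difference between calling the method and assigning its result:

```python
import pandas as pd

# Made-up frame: one exact duplicate row and one row with a missing value
toy = pd.DataFrame({"age": [19, 19, 28], "smoker": ["yes", "yes", None]})

# Calling the method without assignment leaves the original untouched
toy.drop_duplicates()
rows_without_assignment = len(toy)

# Assigning the chained result keeps the de-duplicated, NaN-free frame
cleaned = toy.drop_duplicates().dropna()
rows_after_cleaning = len(cleaned)

print(rows_without_assignment, rows_after_cleaning)
```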

In [72]:
# Clean the age and children columns, which contain negative values
insurance['age'] = insurance['age'].abs()
insurance['children'] = insurance['children'].abs()

# Standardize the labels in the sex column
rep = {'man':'male',
      'M':'male',
      'woman':'female',
      'F':'female'}
insurance['sex'] = insurance['sex'].replace(rep)

# Lowercase region names so that 'Southeast' and 'southeast' match
insurance['region'] = insurance['region'].str.lower()

# Strip the dollar sign; regex=False treats '$' as a literal character, not a regex anchor
insurance['charges'] = insurance['charges'].str.replace('$', '', regex=False)

In [73]:
# Drop rows whose charges value is the string 'nan', then convert data types
df = insurance[insurance['charges'] != 'nan'].copy()  # copy avoids SettingWithCopyWarning
df['charges'] = pd.to_numeric(df['charges'])

df['sex'] = df['sex'].astype("category")

df['smoker'] = df['smoker'].astype("category")

df['region'] = df['region'].astype("category")

# One-hot encode the categorical variables
df2 = pd.get_dummies(df, drop_first=True)

df2['charges'] = df2['charges'].round(0)

In [74]:
# Splitting dataset to X and y variables
y = df2['charges']
X = df2.drop('charges', axis = 1)
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1207 entries, 0 to 1337
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               1207 non-null   float64
 1   bmi               1207 non-null   float64
 2   children          1207 non-null   float64
 3   sex_male          1207 non-null   uint8  
 4   smoker_yes        1207 non-null   uint8  
 5   region_northwest  1207 non-null   uint8  
 6   region_southeast  1207 non-null   uint8  
 7   region_southwest  1207 non-null   uint8  
dtypes: float64(3), uint8(5)
memory usage: 43.6 KB
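
The column list above has no `sex_female` or `region_northeast` because `drop_first=True` drops the first category (in sorted order) of each encoded column and treats it as the baseline. A minimal sketch on a made-up frame (the rows in `toy` are assumptions, not taken from the dataset):

```python
import pandas as pd

# Made-up rows covering both sexes and all four regions
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "female"],
    "region": ["southwest", "southeast", "northeast", "northwest"],
})

# Numeric columns pass through; each text column becomes k-1 indicator columns
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

A row with zeros in every `region_*` column therefore represents the baseline region.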


In [75]:
# Import the libraries needed for model building
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Split the data into train and test sets before scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Fit the scaler on the training set only, so test-set statistics do not leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Fit and evaluate the model
lin_reg = LinearRegression()
model = lin_reg.fit(X_train, y_train)
model.intercept_
model.coef_
model.score(X_test, y_test)

y_pred = model.predict(X_test)
# Use a new variable name so the imported r2_score function is not overwritten
r2 = r2_score(y_test, y_pred)
r2

0.6870806119226351
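
A single random split gives a score that varies from run to run. One possible way to get a steadier estimate is to wrap the scaler and the regression in a `Pipeline` and score it with `cross_val_score`. The sketch below uses synthetic data (`X_demo` and `y_demo` are stand-ins, not the insurance features), so its score says nothing about the model above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix and target (not the insurance data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Scaling and regression combined into one estimator
pipe = Pipeline([("scaler", StandardScaler()), ("reg", LinearRegression())])

# 5-fold cross-validated R^2; the scaler is re-fitted on each training fold
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")
print(cv_scores.mean())
```

With the real data, `X` and `y` would take the place of `X_demo` and `y_demo`.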

In [77]:
# Loading the validation dataset
validation_data_path = 'validation_dataset.csv'
validation = pd.read_csv(validation_data_path)
validation.dtypes

# Apply the same preprocessing as for the insurance dataset so the model can be applied
validation['sex'] = validation['sex'].astype("category")
validation['smoker'] = validation['smoker'].astype("category")
validation['region'] = validation['region'].astype("category")

# Encode every category, then keep exactly the training columns in the training order
validation1 = pd.get_dummies(validation)
validation1 = validation1.reindex(columns=X.columns, fill_value=0)

# The model was trained on scaled features, so scale the validation features the same way
validation_scaled = scaler.transform(validation1)

# Make predictions
predicted_charges = pd.DataFrame(model.predict(validation_scaled), columns=['predicted_charges'])
validation_data = pd.concat([validation, predicted_charges], axis = 1)
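
When new data is encoded separately from the training data, its dummy columns can differ from those the model was trained on, for example if a region is absent from the new rows. A hedged sketch of one way to align them, using a hypothetical `train_columns` list and made-up rows in `new_rows`:

```python
import pandas as pd

# Hypothetical training columns, in the order the model saw them
train_columns = ["age", "sex_male", "region_northwest", "region_southeast", "region_southwest"]

# Made-up new rows that happen to contain only two of the four regions
new_rows = pd.DataFrame({
    "age": [40, 52],
    "sex": ["female", "male"],
    "region": ["southeast", "southwest"],
})

# Encode every category present, without drop_first, so no real category is lost
encoded = pd.get_dummies(new_rows)

# Keep only the training columns, in training order; absent ones are filled with 0
aligned = encoded.reindex(columns=train_columns, fill_value=0)
print(list(aligned.columns))
```

Encoding without `drop_first` here matters: with only two regions present, `drop_first=True` would drop `region_southeast` as the baseline and mislabel those rows.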