![](image.jpg)


Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a data scientist at a top health insurance company, you have the opportunity to predict customer healthcare costs using machine learning. Your insights will help tailor services and guide customers in planning their healthcare expenses more effectively.

## Dataset Summary

Meet your primary tool: the `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

## insurance.csv
| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |



Some data cleaning is needed before the dataset is ready for modeling. Once your model is built using the `insurance.csv` dataset, the next step is to apply it to `validation_dataset.csv`. This new dataset has the same columns as your training data except for `charges`, so it tests your model's real-world utility by predicting costs for new customers.

## Let's Get Started!

This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!

# Regression Model: Insurance Charges Prediction

Develop a regression model using the `insurance.csv` dataset to predict **charges**. Evaluate the model's accuracy using the **R-Squared Score**. Then, apply the model to estimate `predicted_charges` for unseen data in `validation_dataset.csv`.

---

### Instructions

1. **Build a Regression Model**  
   - Use the `insurance.csv` dataset to predict **charges**.

2. **Evaluate Model Accuracy**  
   - Calculate the **R-Squared Score** of your trained model.  
   - Save the score as a variable named `r2_score`.  
   - The R-Squared Score must **exceed a threshold of 0.65**.

3. **Predict on Validation Data**  
   - Use the trained model to predict charges for entries in `validation_dataset.csv`.
   - Store the predictions in a new column named `predicted_charges` within the validation dataset.
   - Save the updated dataset as a pandas DataFrame called `validation_data`.
   - **Ensure a minimum basic charge of 1000:** no value in `predicted_charges` should fall below 1000.

> **⚠️ Note:**  
> If you encounter errors during model training, make sure the `insurance` DataFrame is properly cleaned and ready for modeling.
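
The first two instructions can be sketched as follows. This is a minimal illustration, not the project's reference solution: the `Pipeline`, `ColumnTransformer`, and `OneHotEncoder` setup is an assumption, and the `demo` DataFrame is synthetic stand-in data generated so the block runs without `insurance.csv`. With the real data, `X` and `y` would come from the cleaned `insurance` DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["age", "bmi", "children"]
CATEGORICAL = ["sex", "smoker", "region"]


def build_model() -> Pipeline:
    """Scale numeric columns, one-hot encode categorical ones, then fit a linear model."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    return Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])


# Synthetic stand-in for the cleaned insurance data (assumption, for illustration only).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["male", "female"], n),
    "bmi": rng.normal(30, 5, n),
    "children": rng.integers(0, 4, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
})
demo["charges"] = (
    250 * demo["age"]
    + 300 * demo["bmi"]
    + 20000 * (demo["smoker"] == "yes")
    + rng.normal(0, 2000, n)
)

X, y = demo[NUMERIC + CATEGORICAL], demo["charges"]
model = build_model()

# Mean R-squared over 5 folds. The variable is named r2_score as the instructions
# require, so sklearn.metrics.r2_score is deliberately not imported here.
r2_score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
model.fit(X, y)
print(f"Mean cross-validated R-squared: {r2_score:.3f}")
```

Cross-validation is used here because the notebook imports `cross_val_score`; a single train/test split would also satisfy the instructions.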

---

In [1]:
# Re-run this cell
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Loading the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
insurance.head()

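|   | age  | sex    | bmi    | children | smoker | region    | charges      |
|---|------|--------|--------|----------|--------|-----------|--------------|
| 0 | 19.0 | female | 27.9   | 0.0      | yes    | southwest | 16884.924    |
| 1 | 18.0 | male   | 33.77  | 1.0      | no     | Southeast | 1725.5523    |
| 2 | 28.0 | male   | 33.0   | 3.0      | no     | southeast | $4449.462    |
| 3 | 33.0 | male   | 22.705 | 0.0      | no     | northwest | $21984.47061 |
| 4 | 32.0 | male   | 28.88  | 0.0      | no     | northwest | $3866.8552   |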
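
The preview shows at least two inconsistencies: `region` values differ in capitalization (`Southeast` vs `southeast`) and some `charges` values carry a `$` prefix. A minimal cleaning sketch for just those visible issues might look like this. The helper name `clean_insurance` is hypothetical, dropping rows with missing values is an assumption (imputation is an alternative), and the full dataset may contain other problems not visible in five rows.

```python
import pandas as pd


def clean_insurance(df: pd.DataFrame) -> pd.DataFrame:
    """Fix the issues visible in the preview; the full data may need more."""
    df = df.copy()
    # Remove a "$" prefix and convert charges to float; unparseable values become NaN.
    df["charges"] = pd.to_numeric(
        df["charges"].astype(str).str.replace("$", "", regex=False),
        errors="coerce",
    )
    # Make region labels case-consistent ("Southeast" -> "southeast").
    df["region"] = df["region"].str.strip().str.lower()
    # Assumption: rows with any missing value are dropped rather than imputed.
    return df.dropna().reset_index(drop=True)


# Small stand-in built from the first three preview rows.
sample = pd.DataFrame({
    "age": [19.0, 18.0, 28.0],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0.0, 1.0, 3.0],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "Southeast", "southeast"],
    "charges": ["16884.924", "1725.5523", "$4449.462"],
})
cleaned = clean_insurance(sample)
print(cleaned[["region", "charges"]])
```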


In [3]:
validation_data_path = 'validation_dataset.csv'
validation_data = pd.read_csv(validation_data_path)
validation_data.head()

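|   | index | age | sex    | bmi       | children | smoker | region    |
|---|-------|-----|--------|-----------|----------|--------|-----------|
| 0 | 0     | 18  | female | 24.09     | 1        | no     | southeast |
| 1 | 1     | 39  | male   | 26.41     | 0        | yes    | northeast |
| 2 | 2     | 27  | male   | 29.15     | 0        | yes    | southeast |
| 3 | 3     | 71  | male   | 65.502135 | 13       | yes    | southeast |
| 4 | 4     | 28  | male   | 38.06     | 0        | no     | southeast |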
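
With the validation data loaded, the prediction step and the minimum charge of 1000 could be sketched as below. The helper `add_predicted_charges` is a hypothetical name, and `stand_in_model` is a toy model fitted on made-up numbers so the block runs on its own; in the project, the model trained on `insurance.csv` would take its place. Reading "minimum basic charge" as a floor on every prediction is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

MIN_CHARGE = 1000


def add_predicted_charges(model, validation_df, feature_cols, min_charge=MIN_CHARGE):
    """Return a copy of validation_df with a floored predicted_charges column."""
    out = validation_df.copy()
    preds = model.predict(out[feature_cols])
    # Raise any prediction below the floor up to the minimum charge.
    out["predicted_charges"] = np.maximum(preds, min_charge)
    return out


# Toy stand-in model on two numeric features (illustrative values, not real data).
train = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "bmi": [22.0, 25.0, 28.0, 31.0, 27.0],
})
train_charges = [500.0, 3000.0, 6000.0, 9000.0, 12000.0]
stand_in_model = LinearRegression().fit(train, train_charges)

validation_demo = pd.DataFrame({"age": [18, 39, 27], "bmi": [24.09, 26.41, 29.15]})
validation_demo = add_predicted_charges(stand_in_model, validation_demo, ["age", "bmi"])
print(validation_demo)
```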


## Initial Hypotheses

Here are some initial hypotheses about factors that might influence insurance charges:

*   **Hypothesis 1 (Smoker Status):** Smokers will have significantly higher insurance charges compared to non-smokers.
*   **Hypothesis 2 (Age):** Insurance charges will increase with age.
*   **Hypothesis 3 (BMI):** Individuals with a higher Body Mass Index (BMI) will have higher insurance charges.
*   **Hypothesis 4 (Children):** The number of children an individual has will have a positive correlation with their insurance charges (perhaps due to family coverage or related factors).
*   **Hypothesis 5 (Region):** Insurance charges will vary significantly depending on the geographical region.
*   **Hypothesis 6 (Sex):** There might be a difference in insurance charges between males and females.

## Final Take on Initial Hypotheses (Based on Modeling Results)

Based on the linear regression model I trained and the analysis of its coefficients, here's how my initial hypotheses held up:

*   **Hypothesis 1 (Smoker Status): Strong Support.** The coefficient for `smoker_yes` was positive and the largest in the model, clearly indicating that being a smoker is a major predictor of higher insurance charges.
*   **Hypothesis 2 (Age): Supported.** The positive coefficient for `age` indicates that predicted insurance charges increase with age.
*   **Hypothesis 3 (BMI): Supported.** The positive coefficient for `bmi` shows that a higher Body Mass Index is associated with higher predicted insurance costs.
*   **Hypothesis 4 (Children): Supported.** The positive coefficient for `children` suggests a positive association between the number of children and predicted charges, although its effect is smaller than that of the top three factors.
*   **Hypothesis 5 (Region): Partially Supported.** While the model shows that region does have some influence (indicated by the region coefficients), its impact is relatively smaller compared to smoking, age, and BMI.
*   **Hypothesis 6 (Sex): Weakly Supported.** The coefficient for `sex_male` was very small and negative, suggesting a minimal difference in predicted charges based on sex in this model.

In conclusion, the modeling results strongly validate the hypotheses that smoking, age, and BMI are the primary drivers of insurance charges, with other factors like the number of children and region having less impact, and sex having the least impact in this linear model.
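
A coefficient review like the one above could be produced along these lines. This is a sketch under assumptions: the column names `smoker_yes` and `sex_male` suggest one-hot encoding with the first category dropped, so `pd.get_dummies(..., drop_first=True)` is used here, but the original encoding step is not shown. The helper name `coefficient_table` is hypothetical and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def coefficient_table(df: pd.DataFrame, target: str = "charges") -> pd.Series:
    """Fit a linear model and return coefficients sorted by absolute size."""
    # One-hot encode categoricals, dropping the first level of each
    # (e.g. smoker -> smoker_yes, sex -> sex_male).
    X = pd.get_dummies(df.drop(columns=[target]), drop_first=True, dtype=float)
    model = LinearRegression().fit(X, df[target])
    coefs = pd.Series(model.coef_, index=X.columns)
    return coefs.sort_values(key=np.abs, ascending=False)


# Synthetic stand-in data with a deliberately large smoker effect (assumption).
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["male", "female"], n),
    "bmi": rng.normal(30, 5, n),
    "children": rng.integers(0, 4, n),
    "smoker": rng.choice(["yes", "no"], n),
})
demo["charges"] = (
    250 * demo["age"]
    + 300 * demo["bmi"]
    + 20000 * (demo["smoker"] == "yes")
    + rng.normal(0, 2000, n)
)

coefs = coefficient_table(demo)
print(coefs)
```

Raw coefficients depend on each feature's units, so comparing magnitudes across `age`, `bmi`, and a 0/1 indicator is only a rough guide unless the numeric features are standardized first.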