This project performs Exploratory Data Analysis (EDA) and builds a Linear Regression model to predict medical insurance charges based on personal and lifestyle factors such as age, BMI, smoking status, and region.
The aim of this project is to understand what factors influence medical insurance costs and to create a predictive model that estimates charges for new individuals.
The analysis includes:
- Data exploration and visualization
- Feature encoding and preparation
- Model training and evaluation
- Insight generation and visualization
Dataset: Insurance.csv
Features:
- age: Age of the policyholder
- sex: Gender (male/female)
- bmi: Body Mass Index
- children: Number of dependents
- smoker: Smoking status (yes/no)
- region: Residential region (northeast, northwest, southeast, southwest)
- charges: Medical insurance cost (target variable)
Key visualizations used to explore data:
- Histogram of Charges – Distribution of insurance costs
- Boxplot of Charges by Smoker – Effect of smoking on cost
- Scatter Plot of BMI vs Charges – Relationship between BMI and charges
- Correlation Heatmap – Strength of relationships between numeric variables
- Pair Plot – Relationships among age, BMI, children, and charges
Key Insights:
- Smokers have significantly higher charges than non-smokers.
- Charges rise moderately with age and BMI.
- The number of children and the region have minimal effect on charges.
- Charges distribution is right-skewed (few high-cost outliers).
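
These observations can be checked numerically as well as visually. The sketch below assumes a DataFrame with the columns from the feature list; the function name `summarize_charges` is hypothetical. A positive skewness value indicates a right-skewed distribution.

```python
import pandas as pd


def summarize_charges(df: pd.DataFrame) -> dict:
    """Return simple statistics that back up the EDA observations (hypothetical helper)."""
    return {
        # Skewness > 0 means a long right tail (a few very high charges).
        "skewness": df["charges"].skew(),
        # Mean charge for each smoking status ("yes" / "no").
        "mean_by_smoker": df.groupby("smoker")["charges"].mean().to_dict(),
        # Pearson correlation of each numeric feature with charges.
        "corr_with_charges": df[["age", "bmi", "children"]]
        .corrwith(df["charges"])
        .to_dict(),
    }


# Usage, assuming Insurance.csv is in the working directory:
# print(summarize_charges(pd.read_csv("Insurance.csv")))
```
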
Algorithm Used: Linear Regression
Steps:
- Encoded categorical features using one-hot encoding.
- Split dataset into training (80%) and testing (20%).
- Trained a linear regression model.
- Evaluated performance using standard metrics.
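
The encoding, splitting, and training steps above might look roughly like this with pandas and scikit-learn. It is a sketch under stated assumptions: `train_charges_model` is a hypothetical helper, and `drop_first=True` and `random_state=42` are illustrative choices that the project does not specify.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def train_charges_model(df: pd.DataFrame, random_state: int = 42):
    """One-hot encode, split 80/20, and fit a linear regression (hypothetical helper)."""
    # One-hot encode the categorical columns. drop_first=True (an assumption)
    # removes one dummy per feature to avoid perfectly collinear columns.
    X = pd.get_dummies(
        df.drop(columns="charges"),
        columns=["sex", "smoker", "region"],
        drop_first=True,
        dtype=float,
    )
    y = df["charges"]

    # 80% training, 20% testing; the seed is an assumption for reproducibility.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )

    model = LinearRegression().fit(X_train, y_train)
    return model, X_test, y_test


# Usage, assuming Insurance.csv is in the working directory:
# model, X_test, y_test = train_charges_model(pd.read_csv("Insurance.csv"))
# predictions = model.predict(X_test)
```
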
Evaluation Metrics:
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- R² Score
- The model shows that smoking status, age, and BMI are the strongest predictors of insurance charges.
- Smokers and individuals with high BMI or older age have much higher predicted costs.
- The model performs well overall but struggles with extreme high-charge cases (outliers).
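
The three evaluation metrics named above are available in `sklearn.metrics`. The following sketch shows one way to compute them for a set of predictions; the wrapper `evaluate_predictions` is a hypothetical name, not part of the project.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate_predictions(y_true, y_pred) -> dict:
    """Compute MAE, MSE, and R² for a set of predictions (hypothetical helper)."""
    return {
        "MAE": mean_absolute_error(y_true, y_pred),  # average absolute error
        "MSE": mean_squared_error(y_true, y_pred),   # penalizes large errors more
        "R2": r2_score(y_true, y_pred),              # share of variance explained
    }


# Usage with a fitted model and a held-out test split:
# print(evaluate_predictions(y_test, model.predict(X_test)))
```

Because MSE squares each error, a handful of badly predicted high-charge cases can dominate it, so comparing it against MAE is one way to see how much the outliers matter.
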
| Visualization | Purpose |
|---|---|
| Histogram of Charges | Understand data distribution |
| Boxplot by Smoker | Compare charges between smokers and non-smokers |
| Scatter Plot (BMI vs Charges) | Show relationship between BMI and charges |
| Correlation Heatmap | Identify strong predictive features |
| Actual vs Predicted Plot | Evaluate model fit |
| Residuals Plot | Check for random distribution of errors |
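
The two model-diagnostic plots in the table (actual vs predicted and residuals) can be drawn with plain matplotlib. This is an illustrative sketch; `plot_diagnostics` and its axis labels are assumptions rather than the project's own code.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_diagnostics(y_true, y_pred):
    """Plot actual vs predicted values and residuals side by side (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    fig, (ax_fit, ax_res) = plt.subplots(1, 2, figsize=(12, 5))

    # Actual vs predicted: points on the dashed diagonal are perfect predictions.
    ax_fit.scatter(y_true, y_pred, alpha=0.6)
    lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
    ax_fit.plot(lims, lims, linestyle="--", color="red")
    ax_fit.set_xlabel("Actual charges")
    ax_fit.set_ylabel("Predicted charges")
    ax_fit.set_title("Actual vs predicted")

    # Residuals: a random scatter around zero suggests a reasonable linear fit.
    ax_res.scatter(y_pred, residuals, alpha=0.6)
    ax_res.axhline(0, linestyle="--", color="red")
    ax_res.set_xlabel("Predicted charges")
    ax_res.set_ylabel("Residual (actual - predicted)")
    ax_res.set_title("Residuals")

    fig.tight_layout()
    return fig


# Usage with a fitted model and a held-out test split:
# plot_diagnostics(y_test, model.predict(X_test))
# plt.show()
```
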
- Upload the dataset Insurance.csv to your Google Colab or local environment.
- Install required libraries:
pip install pandas numpy seaborn matplotlib scikit-learn
📜 Author: atchi.venkataramana94@gmail.com, github.com/VenkataRamanaA