VenkataRamanaA/Python-for-Data-Science

🧾 Insurance Charges Prediction

This project performs Exploratory Data Analysis (EDA) and builds a Linear Regression model to predict medical insurance charges based on personal and lifestyle factors such as age, BMI, smoking status, and region.


📘 Project Overview

The aim of this project is to understand what factors influence medical insurance costs and to create a predictive model that estimates charges for new individuals.
The analysis includes:

  • Data exploration and visualization
  • Feature encoding and preparation
  • Model training and evaluation
  • Insight generation and visualization

📂 Dataset Information

Dataset: Insurance.csv
Features:

  • age: Age of the policyholder
  • sex: Gender (male/female)
  • bmi: Body Mass Index
  • children: Number of dependents
  • smoker: Smoking status (yes/no)
  • region: Residential region (northeast, northwest, southeast, southwest)
  • charges: Medical insurance cost (target variable)
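A minimal sketch of loading the file and checking that it has the columns listed above. The helper name `load_insurance` and the three inline sample rows are assumptions for illustration only; the sample values are made up and are not taken from Insurance.csv.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the column layout described above.
# The values are invented for illustration, NOT taken from Insurance.csv.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
25,female,22.5,0,no,northeast,3000.0
40,male,30.1,2,yes,southeast,25000.0
58,female,27.8,1,no,southwest,12000.0
"""

EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]


def load_insurance(source):
    """Read the CSV and fail early if an expected column is missing."""
    df = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df


df = load_insurance(io.StringIO(SAMPLE_CSV))
print(df.dtypes)
```

With the real file in the working directory, the call would presumably be `load_insurance("Insurance.csv")`.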

📊 Exploratory Data Analysis (EDA)

Key visualizations used to explore data:

  • Histogram of Charges – Distribution of insurance costs
  • Boxplot of Charges by Smoker – Effect of smoking on cost
  • Scatter Plot of BMI vs Charges – Relationship between BMI and charges
  • Correlation Heatmap – Strength of relationships between numeric variables
  • Pair Plot – Relationships among age, BMI, children, and charges
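The first four plots above could be sketched roughly as follows. This is not the project's own notebook code: it uses plain matplotlib rather than seaborn, omits the pair plot, and runs on randomly generated placeholder data (`make_placeholder_data` is a hypothetical helper), so the shapes it draws say nothing about the real dataset.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def make_placeholder_data(n=200, seed=0):
    """Random stand-in for Insurance.csv (assumption, not real data)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "age": rng.integers(18, 65, n),
        "bmi": rng.normal(30, 6, n),
        "children": rng.integers(0, 5, n),
        "smoker": rng.choice(["yes", "no"], n),
    })
    df["charges"] = rng.lognormal(mean=9, sigma=0.8, size=n)
    return df


def plot_eda(df):
    """Draw histogram, boxplot by smoker, BMI scatter, and correlation heatmap."""
    fig, axes = plt.subplots(2, 2, figsize=(12, 9))

    # Histogram of charges
    axes[0, 0].hist(df["charges"], bins=30)
    axes[0, 0].set_title("Histogram of Charges")

    # Boxplot of charges by smoking status
    groups = [df.loc[df["smoker"] == s, "charges"] for s in ("no", "yes")]
    axes[0, 1].boxplot(groups)
    axes[0, 1].set_xticklabels(["no", "yes"])
    axes[0, 1].set_title("Charges by Smoker")

    # Scatter plot of BMI vs charges
    axes[1, 0].scatter(df["bmi"], df["charges"], alpha=0.5)
    axes[1, 0].set_xlabel("bmi")
    axes[1, 0].set_ylabel("charges")
    axes[1, 0].set_title("BMI vs Charges")

    # Correlation heatmap of the numeric columns
    corr = df[["age", "bmi", "children", "charges"]].corr()
    im = axes[1, 1].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
    axes[1, 1].set_xticks(range(len(corr.columns)))
    axes[1, 1].set_xticklabels(corr.columns, rotation=45)
    axes[1, 1].set_yticks(range(len(corr.columns)))
    axes[1, 1].set_yticklabels(corr.columns)
    axes[1, 1].set_title("Correlation Heatmap")
    fig.colorbar(im, ax=axes[1, 1])

    fig.tight_layout()
    return fig


fig = plot_eda(make_placeholder_data())
fig.savefig("eda_overview.png")
```

In practice `plot_eda` would be called on the DataFrame read from Insurance.csv instead of the placeholder.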

Key Insights:

  • Smokers have significantly higher charges than non-smokers.
  • Charges rise moderately with age and BMI.
  • Number of children and region have minimal effect on charges.
  • Charges distribution is right-skewed (few high-cost outliers).

⚙️ Model Building

Algorithm Used: Linear Regression

Steps:

  1. Encoded categorical features using one-hot encoding.
  2. Split dataset into training (80%) and testing (20%).
  3. Trained a linear regression model.
  4. Evaluated performance using standard metrics.
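The steps above can be sketched with pandas and scikit-learn as shown below. The data is randomly generated placeholder data (the formula for `charges` is invented so the script runs without Insurance.csv), and `drop_first=True` and `random_state=42` are assumptions rather than settings confirmed by the project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Placeholder data in the documented column layout (assumption, not real data).
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["male", "female"], n),
    "bmi": rng.normal(30, 6, n),
    "children": rng.integers(0, 5, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
})
# Invented relationship, used only so the example has a target to fit.
df["charges"] = (
    250 * df["age"]
    + 300 * df["bmi"]
    + 20000 * (df["smoker"] == "yes")
    + rng.normal(0, 3000, n)
)

# 1. One-hot encode the categorical features.
#    drop_first=True removes one redundant dummy column per feature.
X = pd.get_dummies(
    df.drop(columns="charges"),
    columns=["sex", "smoker", "region"],
    drop_first=True,
    dtype=float,
)
y = df["charges"]

# 2. Split into training (80%) and testing (20%).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train a linear regression model.
model = LinearRegression().fit(X_train, y_train)

# 4. Predict on the held-out set for later evaluation.
y_pred = model.predict(X_test)
print(dict(zip(X.columns, model.coef_.round(2))))
```

Fixing `random_state` keeps the split reproducible between runs; any integer would do.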

Evaluation Metrics:

  • Mean Absolute Error (MAE)
  • Mean Squared Error (MSE)
  • R² Score
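For reference, the three metrics can be computed directly from their definitions. The sketch below uses NumPy and three made-up values; scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score` should give the same numbers.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R^2) computed from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))   # average absolute error
    mse = np.mean(errors ** 2)      # average squared error
    ss_res = np.sum(errors ** 2)    # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot      # share of variance explained
    return mae, mse, r2


# Made-up values, chosen so the arithmetic is easy to follow by hand.
mae, mse, r2 = regression_metrics([1, 2, 3], [1, 2, 4])
print(mae, mse, r2)  # MAE = 1/3, MSE = 1/3, R^2 = 0.5
```

MAE and MSE are in the units of charges (and charges squared), while R² is unitless, which is why the three are usually reported together.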

📈 Model Results

  • The model shows that smoking status, age, and BMI are the strongest predictors of insurance charges.
  • Smokers and individuals with high BMI or older age have much higher predicted costs.
  • The model performs well overall but struggles with extreme high-charge cases (outliers).

📊 Visualizations Supporting Insights

| Visualization | Purpose |
| --- | --- |
| Histogram of Charges | Understand data distribution |
| Boxplot by Smoker | Compare charges between smokers and non-smokers |
| Scatter Plot (BMI vs Charges) | Show relationship between health and cost |
| Correlation Heatmap | Identify strong predictive features |
| Actual vs Predicted Plot | Evaluate model fit |
| Residuals Plot | Check for random distribution of errors |
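The last two plots, Actual vs Predicted and Residuals, might be produced along these lines. `plot_fit_diagnostics` is a hypothetical helper written with plain matplotlib, and the three input values are placeholders, not model output from this project.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np


def plot_fit_diagnostics(y_true, y_pred):
    """Draw actual-vs-predicted and residual plots; return (figure, residuals)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    fig, (ax_fit, ax_res) = plt.subplots(1, 2, figsize=(11, 4))

    # Actual vs predicted: points on the dashed diagonal are perfect predictions.
    ax_fit.scatter(y_true, y_pred, alpha=0.6)
    low = min(y_true.min(), y_pred.min())
    high = max(y_true.max(), y_pred.max())
    ax_fit.plot([low, high], [low, high], linestyle="--", color="gray")
    ax_fit.set_xlabel("actual charges")
    ax_fit.set_ylabel("predicted charges")
    ax_fit.set_title("Actual vs Predicted")

    # Residuals: a patternless band around zero suggests a reasonable linear fit.
    ax_res.scatter(y_pred, residuals, alpha=0.6)
    ax_res.axhline(0, linestyle="--", color="gray")
    ax_res.set_xlabel("predicted charges")
    ax_res.set_ylabel("residual (actual - predicted)")
    ax_res.set_title("Residuals")

    fig.tight_layout()
    return fig, residuals


# Placeholder numbers for illustration only.
fig, residuals = plot_fit_diagnostics(
    [1000.0, 5000.0, 20000.0], [1500.0, 4000.0, 18000.0]
)
fig.savefig("fit_diagnostics.png")
```

With a trained model, the arguments would be the test-set targets and the model's predictions for the same rows.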

🚀 How to Run This Project

  1. Upload the dataset Insurance.csv to your Google Colab or local environment.
  2. Install required libraries:
    pip install pandas numpy seaborn matplotlib scikit-learn

📜 Author

  • Email: atchi.venkataramana94@gmail.com
  • GitHub: github.com/VenkataRamanaA
