<a href="https://colab.research.google.com/github/bidyashreenayak0211/Labmentix-Internship/blob/main/Medibuddy_Project_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **MediBuddy Project EDA**



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name -** Bidyashree Nayak

# **Project Summary -**

The Medibuddy Insurance Claim Analysis project focuses on extracting valuable insights from health insurance claim data to support data-driven decision-making in the insurance sector. This analysis aims to uncover the underlying factors that impact the amount of claims submitted by customers, while also exploring patterns in customer behavior, demographic influences, and healthcare spending trends.

By leveraging Python for exploratory data analysis (EDA), the project systematically investigates the dataset to understand distributions, detect anomalies, and identify correlations among variables. The process includes feature engineering (age brackets, BMI groups, and charge bands) to support segment-level comparisons, and data visualization to communicate findings effectively.

The ultimate goal is to highlight key drivers behind high claim amounts, detect fraudulent or outlier claims, and propose recommendations that could help insurers refine their policy structures, minimize risk exposure, and improve customer engagement. Through comprehensive analysis, this project provides strategic insights that can contribute to better resource allocation, pricing models, and customer satisfaction in the health insurance domain.



# **GitHub Link -**

https://github.com/bidyashreenayak0211/Labmentix-Internship/blob/main/Medibuddy_Project_EDA.ipynb

# **Problem Statement**


The insurance sector is increasingly challenged by the need to control escalating claim expenses, accurately evaluate risk factors, and develop sustainable pricing strategies. To remain competitive and financially viable, insurers must consider how various customer attributes—such as demographics, health habits, and regional differences—affect the cost of insurance claims.

This project is centered on conducting a detailed analysis of Medibuddy’s insurance claim dataset to uncover the primary drivers behind claim expenditures. By analyzing features such as age, body mass index (BMI), smoking habits, geographic location, and household size, the project aims to:

- Pinpoint demographic segments that are associated with higher average claim costs
- Identify health and lifestyle risks (such as smoking or elevated BMI) that contribute to larger claims
- Explore how regional differences and the number of dependents (children) influence medical expenses
- Generate actionable, data-backed recommendations to support more accurate risk modeling and premium pricing

Through these analyses, the project seeks to equip Medibuddy with meaningful insights that can guide improvements in policy structuring, help tailor premiums to individual risk profiles, and support data-informed business strategies. The ultimate objective is to enhance profitability while delivering fair and customer-centric insurance solutions.



#### **Define Your Business Objective?**

1. **Refine Risk Evaluation and Pricing Models**  
   - Identify customer segments that pose a higher risk—such as individuals with elevated BMI or smoking habits—based on historical claim patterns.  
   - Propose risk-adjusted premium strategies aimed at reducing financial exposure and boosting profitability.

2. **Enhance Customer Segmentation and Personalization**  
   - Segment the policyholder base by key demographics such as age, region, and health metrics to uncover behavioral trends in claim submissions.  
   - Support the development of tailored insurance plans and personalized pricing models to better align with customer needs and risk levels.

3. **Analyze Cost Drivers and Lifestyle Impacts**  
   - Examine the influence of personal habits and household characteristics (e.g., smoking, obesity, number of dependents) on claim expenditures.  
   - Offer data-backed guidance to revise policy coverage or implement tiered pricing based on lifestyle-related risk factors.

4. **Boost Customer Loyalty and Experience**  
   - Leverage claim data insights to craft more relevant policy recommendations, tailored to each customer's risk profile.  
   - Foster trust and satisfaction by offering competitively priced, equitable insurance products.

5. **Detect Anomalies and Uncover Fraudulent Patterns**  
   - Spot unusual or excessive claim amounts and behavioral anomalies that may suggest fraud or misuse.  
   - Strengthen Medibuddy’s fraud detection framework using analytical methods and predictive tools.

6. **Support Strategic Planning Through Analytics**  
   - Deliver actionable insights to inform key business decisions, such as adjusting coverage caps or introducing new plan tiers.  
   - Utilize predictive modeling to anticipate future claim volumes and support financial planning.

By aligning these objectives with analytical outcomes, Medibuddy can streamline its operations, reinforce financial health, and deliver customer-focused insurance solutions more effectively.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical operations
import matplotlib.pyplot as plt  # For basic plotting
import seaborn as sns  # For statistical data visualization
pd.set_option('display.max_columns', None)  # Show all columns in output
pd.set_option('display.width', 1000)        # Increase output width if needed

### Dataset Loading

In [None]:
# Load Dataset
df1 = pd.read_excel('/content/Medibuddy Insurance Data Price (1) (2).xlsx') #Load the Price Dataset
df2 = pd.read_excel('/content/Medibuddy insurance data personal details (1) (2).xlsx') #Load the Personal Details Dataset

In [None]:
# Merge the two DataFrames on the "Policy no." column
merged_df = pd.merge(df1, df2, on="Policy no.")
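
Because `pd.merge` defaults to an inner join, a policy that appears in only one sheet is dropped without any message. The sketch below shows one way to check for that, using two small hypothetical frames (`price`, `personal`, and the `PLC` policy numbers are made up for illustration and are not taken from the dataset).

```python
import pandas as pd

# Hypothetical miniature stand-ins for the two source sheets
price = pd.DataFrame({
    "Policy no.": ["PLC1", "PLC2", "PLC3"],
    "charges in INR": [16000.0, 1700.0, 4400.0],
})
personal = pd.DataFrame({
    "Policy no.": ["PLC1", "PLC2", "PLC4"],
    "age": [19, 18, 28],
})

# validate raises MergeError if a policy number repeats on either side;
# indicator adds a "_merge" column recording where each row came from
checked = pd.merge(
    price, personal,
    on="Policy no.", how="outer",
    validate="one_to_one", indicator=True,
)

# Rows that exist in only one of the two frames
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["Policy no.", "_merge"]])
```

If `unmatched` is empty on the real data, the inner join used above loses nothing.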

### Dataset First View

In [None]:
# Dataset First Look
merged_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Number of rows:", merged_df.shape[0])
print("Number of columns:", merged_df.shape[1])

### Dataset Information

In [None]:
# Dataset Info
merged_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Number of duplicate rows:", merged_df.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Number of Null/Missing Values in Dataset =",merged_df.isnull().sum().sum())

In [None]:
# Visualizing the missing values
sns.heatmap(merged_df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

The dataset comprises **1,338 rows and 8 columns**, representing insurance-related information for a group of individuals. Each row corresponds to a unique policyholder, as indicated by the non-null **Policy no.**, and there are no duplicate or missing entries, ensuring the dataset is complete and clean. The columns include demographic details such as **age**, **sex**, **number of children**, **smoking status**, and **region**. Additionally, it provides health-related data like **BMI (Body Mass Index)** and financial data like **insurance charges (in INR)**. The absence of missing values and duplicates makes this dataset reliable for analysis. The mix of categorical (e.g., sex, smoker, region) and numerical variables (e.g., age, bmi, charges) suggests it's well-suited for exploring patterns in healthcare costs, performing statistical analysis, or building predictive models related to insurance pricing.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
merged_df.columns

In [None]:
# Dataset Describe
merged_df.describe()

### Variables Description

1. **`Policy no.`**
   - **Type:** `object` (string)
   - **Description:** A unique identifier assigned to each insurance policy. It serves as a reference number and ensures that each record in the dataset is distinct.
   - **Role:** Identifier (not used in modeling or analysis directly, but useful for tracking).

2. **`age`**
   - **Type:** `int64` (integer)
   - **Description:** The age of the policyholder in years.
   - **Importance:** A critical factor in determining insurance risk and pricing, as healthcare costs often increase with age.

3. **`sex`**
   - **Type:** `object` (string)
   - **Values:** Typically `'male'` or `'female'`
   - **Description:** The gender of the policyholder.
   - **Importance:** Used to study demographic patterns and can also influence insurance premiums due to differing risk profiles.

4. **`bmi` (Body Mass Index)**
   - **Type:** `float64` (floating point number)
   - **Description:** A measure of body fat calculated from weight and height. It helps categorize individuals as underweight, normal, overweight, or obese.
   - **Importance:** Higher BMI can indicate health risks, which in turn can affect insurance premiums.

5. **`charges in INR`**
   - **Type:** `float64`
   - **Description:** The amount charged by the insurance company for providing health coverage. This is likely the target variable in pricing or prediction tasks.
   - **Importance:** Represents the cost of health insurance; central to any cost-related analysis or prediction.

6. **`children`**
   - **Type:** `int64`
   - **Description:** The number of dependent children covered under the insurance policy.
   - **Importance:** More dependents may influence policy costs and risk assessments.

7. **`smoker`**
   - **Type:** `object` (string)
   - **Values:** `'yes'` or `'no'`
   - **Description:** Indicates whether the policyholder is a smoker.
   - **Importance:** A major risk factor in health insurance, significantly affecting premium amounts due to higher associated health risks.

8. **`region`**
   - **Type:** `object` (string)
   - **Description:** The geographic region in which the policyholder resides (likely within India, though the dataset doesn't specify).
   - **Importance:** Regional differences in healthcare costs and practices may influence insurance pricing.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("Number of Unique values for each column in Dataset-")
print(merged_df.nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
import pandas as pd
import numpy as np

# 📄 1. Make a fresh copy of the merged DataFrame
df_cleaned = merged_df.copy()

# 🔄 2. Format and round numerical values
df_cleaned['bmi'] = df_cleaned['bmi'].round(2)
df_cleaned['charges in INR'] = df_cleaned['charges in INR'].round(2)

# ✅ 3. Create a binary flag for smoking status
df_cleaned['is_smoker'] = df_cleaned['smoker'].map({'yes': 1, 'no': 0})

# 📊 4. Group age into defined age ranges
age_bins = [18, 30, 40, 50, 60, 70]
age_labels = ['18-29', '30-39', '40-49', '50-59', '60+']
# right=False makes the bins left-closed, so age 30 falls in '30-39' as the labels imply
df_cleaned['Age_Bracket'] = pd.cut(df_cleaned['age'], bins=age_bins, labels=age_labels, right=False)

# 🧮 5. Define function to classify BMI values
def classify_bmi(value):
    if value < 18.5:
        return 'Underweight'
    elif 18.5 <= value < 25:
        return 'Normal'
    elif 25 <= value < 30:
        return 'Overweight'
    else:
        return 'Obese'

# 🏷️ 6. Apply BMI classification
df_cleaned['BMI_Group'] = df_cleaned['bmi'].apply(classify_bmi)

# 💸 7. Group insurance charges into INR ranges
inr_bins = [0, 10000, 20000, 30000, 40000, 50000, 60000, np.inf]
inr_labels = ['<10k', '10k-20k', '20k-30k', '30k-40k', '40k-50k', '50k-60k', '60k+']
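# right=False makes the bins left-closed, so exactly 10,000 falls in '10k-20k' rather than '<10k'
df_cleaned['Charges_Band'] = pd.cut(df_cleaned['charges in INR'], bins=inr_bins, labels=inr_labels, right=False)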

# 👶 8. Categorize number of children
def children_group(n):
    if n == 0:
        return 'No Children'
    elif n in [1, 2]:
        return '1-2 Children'
    else:
        return '3+ Children'

df_cleaned['Children_Group'] = df_cleaned['children'].apply(children_group)

# 🧾 9. Average claim amount per policy number
avg_claims_by_policy = df_cleaned.groupby('Policy no.')['charges in INR'].mean()

# 📉 10. Compute claim amount relative to BMI
df_cleaned['Claim_per_BMI'] = df_cleaned['charges in INR'] / df_cleaned['bmi']

# 💾 Save cleaned DataFrame to CSV
df_cleaned.to_csv('cleaned_insurance_data.csv', index=False)


### What all manipulations have you done and insights you found?

### 🔧 **Manipulations (Data Transformations):**

1. **🆕 Data Copy:**
   - A fresh copy of the original `merged_df` was created as `df_cleaned` to ensure the original data remains untouched during manipulation.

2. **🔄 Rounded Numerical Values:**
   - Rounded `bmi` and `charges in INR` to two decimal places for consistency and easier readability.

3. **✅ Created `is_smoker`:**
   - Converted the `smoker` column to a binary flag:
     - `yes` → 1
     - `no` → 0

4. **📊 Age Bracketing:**
   - Grouped `age` into 5 ranges:
     - `18-29`, `30-39`, `40-49`, `50-59`, `60+`

5. **🧮 BMI Classification:**
   - Used BMI value to categorize individuals into:
     - `Underweight`, `Normal`, `Overweight`, `Obese`

6. **💸 Charges Band:**
   - Grouped `charges in INR` into ranges like:
     - `<10k`, `10k–20k`, ..., `60k+`

7. **👶 Children Grouping:**
   - Grouped `children` count into:
     - `No Children`, `1–2 Children`, `3+ Children`

8. **🧾 Average Claims:**
   - Computed average insurance charges per `Policy no.`. Because each policy number appears only once in the dataset, this value equals the policy's own charge.

9. **📉 Claim per BMI:**
   - Created a ratio of `charges in INR` to `bmi`:
     - Indicates how much cost is incurred per BMI unit — a potential risk indicator.
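
The bracket labels read as left-closed ranges (for example, `30-39` starting at age 30), and whether the data actually lands that way depends on how `pd.cut` treats bin edges. A minimal sketch of the two settings, using a few hand-picked ages rather than the real data:

```python
import pandas as pd

ages = pd.Series([18, 29, 30, 64])
bins = [18, 30, 40, 50, 60, 70]
labels = ['18-29', '30-39', '40-49', '50-59', '60+']

# right=True (the default): intervals are (18, 30], (30, 40], ...
# so age 30 is assigned to the '18-29' label
right_closed = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

# right=False: intervals are [18, 30), [30, 40), ...
# so age 30 is assigned to the '30-39' label
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({
    'age': ages,
    'right=True': right_closed,
    'right=False': left_closed,
}))
```

The same edge behavior applies to the charge bands, where an amount of exactly 10,000 moves between `<10k` and `10k-20k` depending on the setting.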

---

### 🔍 **Insights You Can Extract from This Data:**

1. **💨 Smokers vs Non-Smokers:**
   - Analyze how smoking status (`is_smoker`) correlates with charges. Smokers are likely to have significantly higher claims.

2. **📈 Age and Charges:**
   - Using `Age_Bracket` and `Charges_Band`, you can study how charges increase with age.

3. **⚖️ BMI and Risk:**
   - By looking at `BMI_Group` vs `charges in INR`, you can assess whether obese or underweight individuals incur more costs.

4. **👨‍👩‍👧‍👦 Children Impact:**
   - Compare average charges across `Children_Group` — possibly more dependents mean higher healthcare usage.

5. **💰 High-Risk Segments:**
   - Combining `is_smoker`, `BMI_Group`, and `Age_Bracket` may reveal the most expensive customer segments.

6. **📊 Insurance Cost Efficiency:**
   - `Claim_per_BMI` can help identify inefficiencies — high ratios might indicate individuals with higher-than-expected claims.
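
As a rough sketch of the segment comparison described in point 5, engineered columns can be combined in a single `groupby`. The rows here are hypothetical values chosen for illustration, not figures from the dataset; only the column names match the cleaned data.

```python
import pandas as pd

# Hypothetical rows for illustration only
sample = pd.DataFrame({
    'is_smoker': [1, 1, 0, 0, 1, 0],
    'BMI_Group': ['Obese', 'Normal', 'Obese', 'Normal', 'Obese', 'Normal'],
    'charges in INR': [42000.0, 20000.0, 9000.0, 5000.0, 38000.0, 7000.0],
})

# Count and mean charge for every smoker x BMI combination,
# most expensive segment first
segment_summary = (
    sample.groupby(['is_smoker', 'BMI_Group'])['charges in INR']
          .agg(['count', 'mean'])
          .sort_values('mean', ascending=False)
)
print(segment_summary)
```

Keeping the `count` column next to the mean helps flag segments that rest on too few policies to be trusted.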

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Ensure the dataset is loaded
import pandas as pd
df = pd.read_csv('/content/cleaned_insurance_data.csv')

### **Question - 1. Does the gender of the person matter for the company as a constraint for extending policies?**

In [None]:
# Set figure size
plt.figure(figsize=(14, 6))

# Set custom colors: light blue for females, orange for males
gender_colors = {'female': '#5DADE2', 'male': '#F5B041'}

# Bar plot for average claim amount by gender
sns.barplot(
    x='sex',
    y='charges in INR',
    data=df,
    estimator='mean',
    errorbar=None,  # replaces ci=None, which is deprecated since seaborn 0.12
    palette=gender_colors
)

# Labels and title
plt.title('Average Claim Amount by Gender', fontsize=18, fontweight='bold')
plt.xlabel('Gender', fontsize=12)
plt.ylabel('Average Claim Amount (INR)', fontsize=12)

# Display values on top of bars
ax = plt.gca()
for p in ax.patches:
    height = p.get_height()
    ax.annotate(f'{height:.2f}',
                (p.get_x() + p.get_width() / 2., height),
                ha='center', va='bottom',
                fontsize=10, color='black')

plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()


### **Answer:**

From the visualization, gender does not appear to be a major constraint when it comes to extending insurance policies. The average claim amount for males is ₹13,956.75, while for females, it is slightly lower at ₹12,569.58.

Although males tend to claim slightly more, the difference of approximately ₹1,387.17 is relatively modest in the context of healthcare costs and insurance claims. This narrow gap suggests that gender alone does not significantly influence the magnitude of claims made by policyholders.

This implies that gender may not be a critical factor in risk assessment or premium calculation, at least not in isolation. Other factors—such as smoking status, BMI, age, and pre-existing conditions—are likely to have a more substantial impact on claim behavior and overall risk.

Therefore, policy extension or premium decisions should ideally be guided by a holistic view of health and lifestyle attributes, rather than gender alone.
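
To put the gap in proportion, the two averages quoted above can be turned into a relative difference. This is plain arithmetic on the figures already reported, with the female average used as the baseline (a choice made here for illustration).

```python
# Average claim amounts reported in the answer above (INR)
male_avg = 13956.75
female_avg = 12569.58

gap = male_avg - female_avg                 # absolute difference in INR
relative_gap_pct = gap / female_avg * 100   # difference as a share of the female average

print(f"Absolute gap: {gap:.2f} INR")
print(f"Relative gap: {relative_gap_pct:.1f}%")
```

The gap works out to roughly 11% of the female average.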

### **Question - 2. What is the average amount of money the company spent over each policy cover?**

In [None]:
# Grouping by policy and calculating average claim
avg_claim = df.groupby('Policy no.')['charges in INR'].mean().reset_index()

# Sorting for better color mapping
avg_claim = avg_claim.sort_values(by='charges in INR', ascending=False).reset_index(drop=True)

# Set figure size
plt.figure(figsize=(16, 5))

# Create a color palette (gradient)
colors = sns.color_palette("viridis", len(avg_claim))

# Plotting bar chart with gradient colors
bars = plt.bar(avg_claim['Policy no.'].astype(str), avg_claim['charges in INR'], color=colors)

# Adding labels and title
plt.xlabel('Policy No', fontsize=12)
plt.ylabel('Average Claim Amount (INR)', fontsize=12)
plt.title('Average Claim Amount per Policy', fontsize=18, fontweight='bold')

# Hide x-ticks to avoid clutter (optional: show selected ticks for clarity)
plt.xticks([])

# Grid lines for readability
plt.grid(axis='y', linestyle='--', alpha=0.6)

plt.tight_layout()
plt.show()



### **Answer:**

From the bar plot, it is evident that the **average claim amount varies significantly across different policy numbers**. This wide range of claim values points to a **highly irregular distribution** of healthcare expenses among policyholders.

While a few policies show **substantially higher average claim amounts**, a larger number of policies reflect **moderate to low claim values**. This disparity suggests that **certain individuals or groups may be incurring significantly higher medical costs**, possibly due to chronic illnesses, surgeries, or high-risk profiles (e.g., smokers, older age, high BMI).

The absence of a consistent trend or pattern in claim amounts indicates that there is **no uniformity in the average amount spent per policy**. This could be influenced by a variety of factors, such as:
- Differences in health conditions,
- Varying family sizes under each policy,
- Lifestyle choices,
- Frequency of medical utilization,
- Or even policy coverage limits.

Such a **non-uniform distribution** of claim amounts is critical from a risk management and pricing perspective. It emphasizes the need for:
- **Individualized policy assessments,**
- **Tiered premium structures**, and
- **Preventive healthcare strategies** to minimize high-risk claims.
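
Since each policy number appears once, the per-policy average is the policy's own charge, and the single figure that answers the question is the overall mean. A minimal sketch on hypothetical values (the four rows are invented for illustration) shows both points and why the median is worth reporting next to the mean:

```python
import pandas as pd

# Hypothetical policies for illustration only
sample = pd.DataFrame({
    'Policy no.': ['PLC1', 'PLC2', 'PLC3', 'PLC4'],
    'charges in INR': [2000.0, 4000.0, 6000.0, 48000.0],
})

# With one row per policy, grouping returns the original charges unchanged
per_policy = sample.groupby('Policy no.')['charges in INR'].mean()
same_as_raw = (per_policy.to_numpy() == sample['charges in INR'].to_numpy()).all()

# Overall averages across all policies
overall_mean = sample['charges in INR'].mean()
overall_median = sample['charges in INR'].median()

print("Per-policy mean equals raw charge:", same_as_raw)
print("Overall mean:", overall_mean)
print("Overall median:", overall_median)
```

When a few policies carry very large claims, the mean sits well above the median, so quoting both gives a fairer picture of a typical policy.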

In [None]:
# Set figure size
plt.figure(figsize=(14, 5))

# Plotting histogram with custom color and styling
sns.histplot(
    df['charges in INR'],
    bins=30,
    kde=True,
    color='#2E86AB',        # Elegant blue tone
    edgecolor='white',      # Makes bars visually distinct
    linewidth=1.2
)

# Adding labels and title with styling
plt.title('Distribution of Claim Amount', fontsize=18, fontweight='bold', color='black')
plt.xlabel('Claim Amount (INR)', fontsize=12, color='#34495E')
plt.ylabel('Frequency', fontsize=12, color='#34495E')

# Grid styling
plt.grid(axis='y', linestyle='--', alpha=0.6)

# Tight layout for clean spacing
plt.tight_layout()
plt.show()


### **Answer:**

From the histogram visualizing the distribution of claim amounts, it is clear that the data is **right-skewed** (positively skewed). This means that **most policyholders submit relatively low-value claims**, while **a smaller number of policies account for high-value claims** on the far right of the distribution curve.

The bulk of the claims are concentrated **below ₹20,000**, indicating that routine medical expenses like minor treatments, consultations, or outpatient services dominate the claim profile. This forms the **peak of the distribution**, showing high frequency in this cost range.

As the claim amount increases, the frequency of such cases drops significantly. Only a small fraction of the data lies **beyond ₹40,000**, suggesting that **large medical claims are relatively rare**. These high-end claims might be associated with:
- Major surgeries,
- Chronic or critical illnesses,
- Hospitalizations, or
- Specialized treatments.

This skewed distribution implies that the insurance company’s **overall expenditure is heavily driven by many low-cost claims**, punctuated by **a few high-cost outliers**. From a financial planning perspective, this pattern is typical in healthcare insurance and reinforces the importance of:
- **Risk pooling**, where the low-cost claims help offset the burden of occasional high-cost ones,
- **Premium structuring**, to ensure that rare but expensive events are still covered sustainably,
- **Predictive modeling**, to better anticipate and plan for outlier cases.
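
The visual impression of skew can be backed with a few numbers. The sketch below uses a short hypothetical series of claim amounts (not the real data), and the 20,000 and 40,000 cut-offs simply mirror the thresholds mentioned in the text.

```python
import pandas as pd

# Hypothetical claim amounts for illustration only
charges = pd.Series([2000.0, 4000.0, 6000.0, 8000.0, 48000.0])

# Sample skewness: a value above 0 indicates a right-skewed distribution
skewness = charges.skew()

# Fraction of claims below 20,000
share_below_20k = (charges < 20000).mean()

# Fraction of total claim cost that comes from claims above 40,000
cost_share_above_40k = charges[charges > 40000].sum() / charges.sum()

print(f"Skewness: {skewness:.2f}")
print(f"Share of claims below 20k: {share_below_20k:.0%}")
print(f"Share of total cost from claims above 40k: {cost_share_above_40k:.0%}")
```

Comparing the last two figures on the real data would show whether total spending is carried by the many small claims or by the few large ones.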

### **Question - 3. Could you advise if the company needs to offer separate policies based upon the geographic location of the person?**

In [None]:
# Set figure size
plt.figure(figsize=(14, 5))

# Plotting bar chart with a vibrant color palette
sns.barplot(x='region', y='charges in INR', data=df, palette='coolwarm')

# Adding labels and title with refined styling
plt.title('Average Claim Amount by Region', fontsize=18, fontweight='bold', color='black')
plt.xlabel('Region', fontsize=12, color='#34495E')
plt.ylabel('Average Claim Amount (INR)', fontsize=12, color='#34495E')

# Displaying values on top of bars with spacing
for bar in plt.gca().patches:
    height = bar.get_height()
    plt.text(
        bar.get_x() + bar.get_width() / 2.,
        height + 500,
        f'{height:.2f}',
        ha='center',
        va='bottom',
        fontsize=10,
        fontweight='semibold',
        color='black'
    )

# Removed grid lines

# Clean spacing
plt.tight_layout()
plt.show()

### **Answer:**

From the bar plot, it's evident that **average claim amounts differ across regions**, indicating potential regional disparities in healthcare utilization or costs.

The **Southeast region records the highest average claim amount** at approximately **₹14,735.41**, suggesting higher medical expenses or more frequent claims in that area. In contrast, the **Southwest shows the lowest average** at around **₹12,346.94**, implying lower healthcare spending or fewer claims filed.

These differences may stem from various **regional factors**, such as:
- Cost of medical services,
- Access to healthcare facilities,
- Lifestyle and demographic differences,
- Prevalence of chronic conditions.

This variation highlights the potential value in implementing **region-specific pricing models or policy structures**, which could help balance risk, improve affordability, and ensure more accurate premium calculations.

In short, understanding these **regional patterns** can aid insurers in developing **more customized and fair policy offerings**.
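
Before acting on regional gaps, it helps to check whether they exceed sampling noise. The sketch below computes the mean, the standard error, and a rough 95% interval per region. The six rows are hypothetical, and the 1.96 multiplier is a normal approximation assumed here for simplicity.

```python
import pandas as pd

# Hypothetical rows for illustration only
sample = pd.DataFrame({
    'region': ['southeast'] * 3 + ['southwest'] * 3,
    'charges in INR': [12000.0, 14000.0, 16000.0, 10000.0, 12000.0, 14000.0],
})

# Count, mean, and standard error of the mean for each region
region_stats = sample.groupby('region')['charges in INR'].agg(['count', 'mean', 'sem'])

# Approximate 95% interval around each regional mean
region_stats['ci_low'] = region_stats['mean'] - 1.96 * region_stats['sem']
region_stats['ci_high'] = region_stats['mean'] + 1.96 * region_stats['sem']

print(region_stats.round(2))
```

Overlapping intervals between two regions suggest that the difference in their averages may not be large enough to justify separate pricing on its own.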

In [None]:
# Set figure size
plt.figure(figsize=(10, 8))

# Counting Regions
region_counts = df['region'].value_counts()

# Plotting pie chart with a new color palette
plt.pie(
    region_counts,
    labels=region_counts.index,
    autopct='%1.1f%%',
    colors=sns.color_palette('Set2'),  # Changed palette
    startangle=140
)

# Adding title
plt.title('Policyholders by Region', fontsize=18, fontweight='bold')

# Add legend for clarity
plt.legend(title='Region', loc='upper left', bbox_to_anchor=(1, 0, 0.5, 1))

# Show Plot
plt.tight_layout()
plt.show()

### **Answer:**

From the pie chart, we observe that the **distribution of policyholders across regions is relatively balanced**, with only slight variations. The **Southeast region holds the largest share at approximately 27.2%**, while the **Southwest, Northwest, and Northeast** each account for around **24%** of the total policyholders.

This indicates that **regional population density or policy reach is not drastically uneven**, suggesting that the insurer has a **fairly consistent presence across all regions**.

Given this balanced distribution, the company should focus less on the number of policyholders per region and more on the **regional differences in claim behavior**. For instance:
- **Southeast** has both a **higher share of policyholders** *and* **higher average claims**, signaling the need for **careful risk assessment** and **possibly adjusted premiums**.
- **Regions with lower average claims** may allow for more competitive or subsidized pricing.

Thus, **policy customization based on claim patterns rather than customer volume** could help improve profitability while maintaining fairness.

### **Question - 4. Does the no. of dependents make a difference in the amount claimed?**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Correctly load the dataset
df = pd.read_csv('/content/cleaned_insurance_data.csv')

# Optional: check the column names
# print(df.columns)

# Reuse the 'Children_Group' column created during data wrangling;
# rebuild it only if the CSV was saved without it
if 'Children_Group' not in df.columns:
    def categorize_children(n):
        if n == 0:
            return 'No Children'
        elif n in [1, 2]:
            return '1-2 Children'
        else:
            return '3+ Children'

    df['Children_Group'] = df['children'].apply(categorize_children)

# Set Figure Size
plt.figure(figsize=(16, 5))

# Create Bar Plot with categories ordered by family size
sns.barplot(
    x='Children_Group',
    y='charges in INR',
    data=df,
    palette='crest',
    order=['No Children', '1-2 Children', '3+ Children']
)

# Add title and labels
plt.title('Average Claim Amount by Children Category', fontsize=18, fontweight='bold', color='black')
plt.xlabel('Children Category', fontsize=12, color='#34495E')
plt.ylabel('Average Claim Amount (INR)', fontsize=12, color='#34495E')

# Annotate values on bars
for bar in plt.gca().patches:
    height = bar.get_height()
    plt.text(
        bar.get_x() + bar.get_width() / 2.,
        height + 500,
        f'{height:.2f}',
        ha='center',
        va='bottom',
        fontsize=10,
        fontweight='medium',
        color='black'
    )

# Remove grid if you want a cleaner look
plt.grid(False)

# Clean layout
plt.tight_layout()
plt.show()


### **Answer:**

The analysis shows a moderate upward trend in the average claim amount as the number of dependent children increases.

- **Policyholders with no children** have the lowest average claim amount at **₹12,365.98**.
- Those with **1–2 children** average **₹13,727.93**, about ₹1,362 more than policyholders with no children.
- Policyholders with **3 or more children** claim the highest on average at **₹14,576.00**, about ₹2,210 above the no-children group.

These gaps are of the same order as the gender difference seen in Question 1, and the bar chart shows an association rather than a cause. Insurance providers may still consider family size when designing personalized premium structures, but it is best treated as a secondary factor alongside stronger drivers such as smoking status.

### **Question - 5. Does a study of a person's BMI give the company any idea of the insurance claim that it would extend?**

In [None]:
# Set aesthetic theme
sns.set(style="whitegrid")

# Set figure size
plt.figure(figsize=(14, 6))

# Scatter plot coloured by a continuous gradient on 'bmi'
scatter = sns.scatterplot(
    x='bmi',
    y='charges in INR',
    data=df,
    palette='flare',   # You can try 'coolwarm', 'rocket', 'Spectral' for variation
    hue='bmi',         # Adds a color gradient based on BMI value
    s=40,              # Fixed marker size ('size=' expects a data column, not a number)
    alpha=0.7,
    edgecolor='w',
    legend=False
)

# Title and labels
plt.title('BMI vs. Claim Amount', fontsize=18, fontweight='bold', color='black')
plt.xlabel('Body Mass Index (BMI)', fontsize=12, color='#34495E')
plt.ylabel('Claim Amount (INR)', fontsize=12, color='#34495E')

# Grid and layout
plt.grid(True, which='both', linestyle='--', linewidth=0.5, alpha=0.5)
plt.tight_layout()

# Show plot
plt.show()


### **Answer:**

The scatter plot analysis of **BMI vs. Claim Amount** reveals a noticeable trend: as BMI increases, particularly beyond **30 (classified as obese)**, the **insurance claim amounts also tend to rise**. While there's some variability, the upper range of BMI consistently aligns with **higher claim values**.

This correlation suggests that individuals with elevated BMI may be at greater risk of health issues, leading to more frequent or costlier medical interventions. For insurers, this highlights BMI as a **valuable risk indicator**. Incorporating BMI into **risk assessment models or premium calculations** could help better align policy pricing with expected healthcare costs.
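
The strength of the BMI trend can be quantified rather than judged by eye. A minimal sketch, using four hypothetical policyholders (invented values) and the BMI threshold of 30 mentioned above:

```python
import pandas as pd

# Hypothetical rows for illustration only
sample = pd.DataFrame({
    'bmi': [20.0, 25.0, 30.0, 35.0],
    'charges in INR': [5000.0, 9000.0, 7000.0, 15000.0],
})

# Pearson correlation between BMI and claim amount (ranges from -1 to 1)
r = sample['bmi'].corr(sample['charges in INR'])

# Average claim for obese (BMI >= 30) versus other policyholders
is_obese = sample['bmi'] >= 30
obese_mean = sample.loc[is_obese, 'charges in INR'].mean()
other_mean = sample.loc[~is_obese, 'charges in INR'].mean()

print(f"Pearson r: {r:.3f}")
print(f"Mean claim, BMI >= 30: {obese_mean:.2f}")
print(f"Mean claim, BMI < 30:  {other_mean:.2f}")
```

On the real data, a correlation close to 0 would mean BMI alone explains little, even if the group means differ.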

### **Question - 6. Is it needed for the company to understand whether the person covered is a smoker or a non-smoker?**

In [None]:
# Set aesthetic style
sns.set(style="whitegrid")

# Set Figure Size
plt.figure(figsize=(14, 5))

# Create Bar Plot with updated palette and edge color
sns.barplot(
    x='smoker',
    y='charges in INR',
    data=df,
    palette='Set2',
    edgecolor='black'
)

# Add labels and title with enhanced styling
plt.title('Average Claim Amount by Smoker Status', fontsize=18, fontweight='bold', color='#2C3E50')
plt.xlabel('Smoker Status', fontsize=12, color='#34495E')
plt.ylabel('Average Claim Amount (INR)', fontsize=12, color='#34495E')

# Displaying values on top of bars
for bar in plt.gca().patches:
    height = bar.get_height()
    plt.text(
        bar.get_x() + bar.get_width() / 2,
        height + 500,
        f'{height:.2f}',
        ha='center',
        va='bottom',
        fontsize=10,
        fontweight='semibold'
    )

# Remove grid lines for a cleaner look or comment this out to keep them
plt.grid(False)

# Tight layout for neatness
plt.tight_layout()

# Show plot
plt.show()


### **Answer:**

The comparison between smokers and non-smokers clearly highlights a **substantial disparity in average claim amounts**. Smokers incur **over ₹32,000** on average, whereas non-smokers average just **above ₹8,000** in claims.

This sharp contrast signals that **smoker status is a major risk factor**, likely due to increased susceptibility to chronic illnesses and lifestyle-related conditions. For the insurance company, this insight is critical: it supports **higher premiums or tailored policy terms** for smokers to balance the elevated risk and expected healthcare costs. Incorporating this factor supports **fair pricing and sustainable risk management**.
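The averages quoted above can be cross-checked numerically rather than read off the bar chart. The snippet below is a minimal sketch: `mean_claim_by_group` is a hypothetical helper, the `yes`/`no` coding of `smoker` is an assumption about the dataset, and the toy rows are made up for illustration (they are not values from the project data).

```python
import pandas as pd


def mean_claim_by_group(df: pd.DataFrame, group_col: str,
                        value_col: str = "charges in INR") -> pd.Series:
    """Return the mean of value_col for each level of group_col, highest first."""
    return df.groupby(group_col)[value_col].mean().sort_values(ascending=False)


# Toy rows for illustration only; NOT values from the project dataset.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "yes", "no"],
    "charges in INR": [30000.0, 8000.0, 34000.0, 9000.0],
})

means = mean_claim_by_group(toy, "smoker")
ratio = means["yes"] / means["no"]  # how many times larger the smoker average is

print(means)
print(f"smoker / non-smoker ratio: {ratio:.2f}")
```

On the real data, calling `mean_claim_by_group(df, "smoker")` should reproduce the bar heights shown in the plot, and the ratio gives a single number for how much more smokers claim on average.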

### **Question - 7. Does age have any barrier on the insurance claimed?**

In [None]:
# Set Seaborn style
sns.set(style="whitegrid")

# Set Figure Size
plt.figure(figsize=(14, 6))

# Create Scatter Plot with enhanced styling
sns.scatterplot(
    x='age',
    y='charges in INR',
    data=df,
    color='#1f77b4',  # Modern blue tone
    s=90,             # Slightly larger markers
    edgecolor='black',
    linewidth=0.5,
    alpha=0.8         # Transparency for overlapping points
)

# Add titles and axis labels
plt.title('Age vs. Claim Amount', fontsize=18, fontweight='bold', color='#2C3E50')
plt.xlabel('Age (Years)', fontsize=12, color='#34495E')
plt.ylabel('Claim Amount (INR)', fontsize=12, color='#34495E')

# Tweak tick parameters
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

# Optional: Remove gridlines for cleaner look
plt.grid(False)

# Adjust layout
plt.tight_layout()

# Show Plot
plt.show()


**Answer –**

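Yes, **age influences the insurance claim amount**, although the effect is moderate rather than dominant.

The scatter plot shows a **positive relationship between age and claim values**: as individuals grow older, their claim amounts tend to increase. The correlation heatmap later in this notebook reports a coefficient of 0.30 for age and charges, so age explains only part of the variation in claims. The trend is consistent with **higher medical needs and expenses associated with aging**, such as chronic illness management, hospitalization, and age-related conditions.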

For the insurance company, this insight is crucial for **risk-based pricing**. Implementing **age-tiered premium structures** ensures that the coverage aligns with the projected healthcare needs, maintaining **both fairness for policyholders and financial sustainability for the insurer**.
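To see how an age-tiered view of the data might look, the sketch below bins ages into bands and summarises claims per band, alongside the Pearson correlation. The band edges, the `claims_by_age_band` helper, and the toy rows are all assumptions chosen for illustration; they are not taken from the project dataset or from any actual pricing scheme.

```python
import pandas as pd


def claims_by_age_band(df: pd.DataFrame, age_col: str = "age",
                       value_col: str = "charges in INR",
                       bins=(17, 30, 45, 60, 100)) -> pd.DataFrame:
    """Count and mean of value_col per age band (bands are right-inclusive)."""
    # Label each band by its inclusive integer range, e.g. (17, 30] -> "18-30".
    labels = [f"{lo + 1}-{hi}" for lo, hi in zip(bins[:-1], bins[1:])]
    bands = pd.cut(df[age_col], bins=list(bins), labels=labels)
    return df.groupby(bands, observed=False)[value_col].agg(["count", "mean"])


# Toy rows for illustration only; NOT values from the project dataset.
toy = pd.DataFrame({
    "age": [19, 25, 40, 52, 64],
    "charges in INR": [2000.0, 3000.0, 6000.0, 9000.0, 14000.0],
})

summary = claims_by_age_band(toy)
r = toy["age"].corr(toy["charges in INR"])  # Pearson correlation by default

print(summary)
print(f"Pearson r (age vs. charges): {r:.2f}")
```

Band means make the trend easier to compare against candidate premium tiers than a raw scatter plot does, while the correlation coefficient indicates how much scatter remains around that trend.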

### **Question - 8. Can the company extend certain discounts after checking the health status (BMI) in this case?**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set a style
sns.set_style("whitegrid")

# Set figure size
plt.figure(figsize=(14, 5))

# Bar plot with a vibrant color palette
custom_palette = sns.color_palette("Set2")
barplot = sns.barplot(
    x='BMI_Group',
    y='charges in INR',
    data=df,
    palette=custom_palette,
    estimator='mean'  # Ensure it plots average
)

# Title and labels with custom font styling
plt.title('Average Claim Amount by BMI Category', fontsize=18, fontweight='bold', color='#333333')
plt.xlabel('BMI Category', fontsize=13, fontweight='bold', color='#555555')
plt.ylabel('Average Claim Amount (INR)', fontsize=13, fontweight='bold', color='#555555')

# Displaying data labels with better alignment
for p in barplot.patches:
    barplot.annotate(f'{p.get_height():,.2f}',
                     (p.get_x() + p.get_width() / 2., p.get_height()),
                     ha='center', va='bottom',
                     fontsize=10, color='black', fontweight='medium')

# Gridlines only on y-axis for cleaner look
plt.grid(axis='y', linestyle='--', alpha=0.6)

# Remove top and right spines for a cleaner look
sns.despine()

# Final layout and display
plt.tight_layout()
plt.show()


### **Answer:**

Yes, the company can strategically offer discounts based on BMI categories to encourage healthier lifestyles and manage claim-related expenses.

From the data analysis:

- **Obese individuals** have the **highest average claim amount (₹15,552.34)**. This suggests they are at a higher risk of health issues, leading to increased medical costs and insurance claims.
- **Underweight individuals** show the **lowest average claim amount (₹8,852.20)**, indicating fewer medical expenses.
- **Normal-weight individuals** also have relatively lower claim amounts compared to the obese category.

**Implications for policy design:**

- The company could offer **premium discounts** or **wellness incentives** to policyholders in the **underweight and normal BMI groups**, encouraging healthy habits and lowering risk.
- On the other hand, **obese policyholders** may be considered a higher risk group and could be subject to **higher premiums** or required to enroll in wellness programs to improve their health status.

This segmentation not only promotes **preventive healthcare** but also helps the insurer **reduce long-term claim liabilities** while rewarding customers who maintain a healthier lifestyle.
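The `BMI_Group` column used in the plot has to be derived from the raw BMI values. One possible way to do this is sketched below, assuming the commonly used cut-offs of 18.5, 25 and 30 (the answer to Question 5 treats 30 and above as obese). The column name `bmi`, the `Overweight` label, and the `add_bmi_group` helper are assumptions; the notebook's actual feature-engineering step may differ.

```python
import pandas as pd


def add_bmi_group(df: pd.DataFrame, bmi_col: str = "bmi") -> pd.DataFrame:
    """Return a copy of df with a categorical BMI_Group column."""
    # Left-closed intervals: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
    bins = [0, 18.5, 25, 30, float("inf")]
    labels = ["Underweight", "Normal", "Overweight", "Obese"]
    out = df.copy()
    out["BMI_Group"] = pd.cut(out[bmi_col], bins=bins, labels=labels, right=False)
    return out


# Toy rows for illustration only; NOT values from the project dataset.
toy = pd.DataFrame({"bmi": [17.0, 22.0, 27.5, 30.0, 35.2]})
grouped = add_bmi_group(toy)
print(grouped)
```

Using `right=False` places a BMI of exactly 30 in the obese group, which matches the "30 and above" convention; with the default `right=True` it would fall into the overweight group instead.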

#### **Correlation Heatmap**

In [None]:
# Set seaborn style
sns.set(style='white')

# Set figure size
plt.figure(figsize=(20, 10))

# Create the heatmap with a diverging color palette
heatmap = sns.heatmap(
    df.corr(numeric_only=True),
    annot=True,
    fmt='.2f',
    cmap='coolwarm',         # You can also try 'viridis', 'magma', 'Spectral'
    linewidths=0.5,
    linecolor='gray',
    cbar_kws={"shrink": 0.75, "label": "Correlation Coefficient"},
    square=True,
    annot_kws={"size": 10, "color": "black", "weight": "bold"}
)

# Title with custom styling
plt.title('Correlation Heatmap', fontsize=18, fontweight='bold', color='#333333')

# Rotate axis labels for better readability
plt.xticks(rotation=45, ha='right', fontsize=10)
plt.yticks(rotation=0, fontsize=10)

# Clean up spines for a sleek look
sns.despine(left=True, bottom=True)

# Adjust layout
plt.tight_layout()
plt.show()


### **Insights from the Correlation Heatmap**

#### 🔹 **Strong Positive Correlations**

- **Charges in INR & Claim_per_BMI (0.96):**  
  There is a very strong positive correlation between total claim amount and claim per BMI. This is largely expected if Claim_per_BMI is computed from the charges column, as its name suggests, so the value mainly reflects how the feature was built rather than an independent driver of claims.

- **Smoker Flag & Charges in INR (0.79):**  
  Smokers tend to incur significantly higher medical expenses, confirming that smoking is a major contributor to increased insurance claim costs.

- **Smoker Flag & Claim_per_BMI (0.81):**  
  The correlation shows that smokers also have a higher claim per BMI, suggesting that smoking not only affects overall charges but amplifies risk relative to body composition.

#### 🔸 **Moderate Positive Correlations**

- **Age & Charges in INR (0.30):**  
  Older individuals generally have higher claim amounts, reflecting the increased likelihood of health issues with age.

- **Age & Claim_per_BMI (0.31):**  
  A similar trend is observed with claim per BMI, reinforcing the idea that age contributes to elevated claim ratios, particularly when BMI is factored in.

#### ⚪ **Weak or Negligible Correlations**

- **BMI & Charges in INR (0.20):**  
  Although BMI is often linked to health risks, the relatively low correlation suggests that BMI alone is not a strong standalone predictor of claim amounts.

- **Children & Charges in INR (0.07):**  
  The number of children has minimal impact on claim costs, indicating that dependents do not significantly influence the policyholder's medical expenses.
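The pairs discussed above can also be read programmatically by ranking every numeric feature by the strength of its correlation with the claim amount. The sketch below uses a hypothetical `rank_correlations` helper and made-up toy rows, and it assumes `Claim_per_BMI` is defined as charges divided by BMI; that definition is a guess based on the column name, not something confirmed by this section.

```python
import pandas as pd


def rank_correlations(df: pd.DataFrame, target: str = "charges in INR") -> pd.Series:
    """Pearson correlation of each numeric column with target, strongest first."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy rows for illustration only; NOT values from the project dataset.
toy = pd.DataFrame({
    "age": [20, 50, 30, 60, 40],
    "bmi": [22.0, 31.0, 25.0, 28.0, 33.0],
    "charges in INR": [2000.0, 4000.0, 9000.0, 12000.0, 30000.0],
})
# Assumed definition of the engineered feature (hypothetical).
toy["Claim_per_BMI"] = toy["charges in INR"] / toy["bmi"]

ranked = rank_correlations(toy)
print(ranked)
```

Even on toy data the ratio feature ranks first, which illustrates why a feature derived from the target tends to show a near-perfect correlation with it and should be interpreted with care.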

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

### **Strategic Recommendations for Achieving Business Objectives**

#### 1. **Risk-Based Premium Structuring**
Implement a dynamic premium model that adjusts based on risk factors. Specifically:
- **Higher premiums** for individuals who smoke or fall into the obese BMI category, as data shows they are associated with significantly higher claim amounts.
- This pricing model can help offset higher healthcare costs while encouraging at-risk policyholders to adopt healthier lifestyles.

#### 2. **Wellness Rewards for Healthier Policyholders**
Introduce **discounts and incentives** for individuals with a normal or underweight BMI, who statistically pose lower health risks.
- Offering **lower premiums or cashback benefits** to this group not only promotes healthier living but can also reduce overall claim volume in the long run.

#### 3. **Region-Sensitive Policy Design**
Geographical analysis reveals that policyholders from the **southeast region incur the highest average claims**.
- Develop **location-specific pricing** or create **customized policy packages** that align with the healthcare cost trends of each region.
- This can enhance fairness and improve profitability across different markets.

#### 4. **Tailored Family Insurance Plans**
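The correlation between the number of dependents and claim amounts is weak (0.07), so family size is better treated as a product-design consideration than as a major cost driver.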
- Launch **family-centric insurance plans** that offer flexible benefits, dependent coverage options, and bundled pricing.
- These plans can attract customers seeking comprehensive family protection while allowing better management of risk exposure.

#### 5. **Age-Specific Policy Segmentation**
Since older policyholders are more likely to make higher claims, it's crucial to:
- Design **age-specific insurance products** with benefits tailored to different life stages (e.g., preventive care for younger groups, chronic disease management for seniors).
- This segmentation allows for better customer satisfaction and more accurate pricing.

#### 6. **Preventive Health Incentives**
Promote wellness and reduce future claims by offering:
- **Incentives for regular health check-ups**, gym memberships, fitness milestones, or wellness program participation.
- These initiatives can lower the risk profile of insured individuals while fostering long-term customer engagement and loyalty.

# **Conclusion**

The analysis highlights several key factors that significantly impact insurance claim amounts. Among these, smoker status and BMI category emerge as the most influential. Smokers and individuals classified as obese consistently exhibit higher claim amounts, underscoring the direct relationship between unhealthy lifestyle choices and increased healthcare costs.

Geographical location also proves to be a crucial variable, with the southeast region recording the highest average claims. This could be attributed to regional disparities in healthcare access, lifestyle patterns, or local health risks. Furthermore, age shows a moderate positive correlation with claim amounts (0.30): older policyholders tend to submit higher claims, likely due to more frequent or complex medical needs. The number of dependents, by contrast, has only a weak association with claim amounts (0.07).

To capitalize on these insights and enhance business performance, the company should consider adopting a risk-adjusted pricing strategy. By charging higher premiums to higher-risk groups (such as smokers and individuals with high BMI), while offering incentives or discounts to healthier individuals, the insurer can encourage positive behavior change and reduce the overall claims burden.

Additionally, implementing region-specific policies would allow the company to better align premiums and coverage options with the unique healthcare cost dynamics of different areas. Customized insurance plans based on age brackets and family composition can further increase customer satisfaction and retention by offering more relevant and affordable options.

Overall, by personalizing insurance offerings based on demographic and health profiles, the company stands to not only improve profitability but also strengthen customer engagement, manage risks more effectively, and contribute to better public health outcomes through incentivized wellness.
