# 📘 Notebook Title: Insurance Data Analysis
---

**Note:** If you have not yet gone through our introductory pandas and NumPy notebooks, review them first.

---

## 🧭 Goal of the Notebook
The primary goal of this notebook is to explore how insurance charges vary across different categories, with a special focus on smoker status. We aim to:

* Analyze the columns of the dataset and how they might inform policy-making decisions.
* Identify which age groups hold the most insurance policies.
* Understand how smoking affects insurance charges.
* Examine how many smokers fall into each charge range.
* and more.
---
## 📌 Introduction
Insurance charges are determined by several factors including age, BMI, number of children, region, and whether the person is a smoker. In this notebook, we narrow our focus to study how smoker status correlates with insurance charges.

### 📂 Load and Inspect the Dataset
We begin by loading the insurance dataset and inspecting a few rows to understand the structure.<br>
Data Source: https://github.com/stedy/Machine-Learning-with-R-datasets

In [1]:
# Note: If you've not installed pandas and numpy, then uncomment the line below
# !pip install numpy pandas --quiet

# import packages
import numpy as np
import pandas as pd

In [2]:
# Loading the dataset
insurance_df = pd.read_csv("insurance.csv")

### Viewing the data in the dataframe

In [56]:
# Let's see the first 5 rows of the dataset
insurance_df.head()

|   | age | sex | bmi | children | smoker | region | charges | charge_range |
|---|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 | 10k-20k |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 | 0-5k |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 | 0-5k |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 | 20k-30k |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 | 0-5k |


### Viewing the information of dataframe

In [4]:
insurance_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


There are no null values in any column, and the dtype of each column is listed above.

### Looking at the summary statistics of each numeric column

In [5]:
insurance_df.describe()

|   | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


**Task:** Look at the min and max values of each column and note what you can infer from them.

### Checking for null values in each column

In [57]:
# Recheck whether null values exist in our dataset
insurance_df.isnull().sum()


age             0
sex             0
bmi             0
children        0
smoker          0
region          0
charges         0
charge_range    0
dtype: int64

### Shape and Columns of the dataset

In [60]:
print("Shape of dataset (row,columns): ",insurance_df.shape)
print("Column names in dataset ",insurance_df.columns.tolist())

Shape of dataset (row,columns):  (1338, 8)
Column names in dataset  ['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges', 'charge_range']


### Unique values under age column

In [61]:
insurance_df['age'].unique()

array([19, 18, 28, 33, 32, 31, 46, 37, 60, 25, 62, 23, 56, 27, 52, 30, 34,
       59, 63, 55, 22, 26, 35, 24, 41, 38, 36, 21, 48, 40, 58, 53, 43, 64,
       20, 61, 44, 57, 29, 45, 54, 49, 47, 51, 42, 50, 39], dtype=int64)

### Let's look at the age distribution of policyholders

In [9]:
insurance_df['age'].describe()

count    1338.000000
mean       39.207025
std        14.049960
min        18.000000
25%        27.000000
50%        39.000000
75%        51.000000
max        64.000000
Name: age, dtype: float64

**From the output above, the youngest policyholder is 18, the oldest is 64, and the mean age is about 39.**

### Let's check which age group holds the most insurance policies and what insights can be drawn

In [62]:
# Age 20 and under (between() includes both endpoints here)
age_under_20 = insurance_df[insurance_df['age'].between(1,20,inclusive='both')]

# Between 21 and 30 age
age_btw_21_30 = insurance_df[insurance_df['age'].between(21,30,inclusive='both')]

# Between 31 and 40 age
age_btw_31_40 = insurance_df[insurance_df['age'].between(31,40,inclusive='both')]

# Between 41 and 50 age
age_btw_41_50 = insurance_df[insurance_df['age'].between(41,50,inclusive='both')]

# Between 51 and 60 age
age_btw_51_60 = insurance_df[insurance_df['age'].between(51,60,inclusive='both')]

# Age 61 and above
age_above_61 = insurance_df[insurance_df['age']>60]

In [63]:
print("Age under 20: ",age_under_20.shape[0])
print("Age btw 21 and 30: ",age_btw_21_30.shape[0])
print("Age btw 31 and 40: ",age_btw_31_40.shape[0])
print("Age btw 41 and 50: ",age_btw_41_50.shape[0])
print("Age btw 51 and 60: ",age_btw_51_60.shape[0])
print("Age above 61: ",age_above_61.shape[0])

Age under 20:  166
Age btw 21 and 30:  278
Age btw 31 and 40:  257
Age btw 41 and 50:  281
Age btw 51 and 60:  265
Age above 61:  91


**The 41 to 50 age group holds the most policies (281), followed closely by the 21 to 30 group (278). The first and last groups cover fewer ages (18 to 20 and 61 to 64), so their lower counts are partly a matter of bin width. This insight can help in developing customer-friendly policies targeted at specific age groups.**
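
The six filters above can also be expressed as a single binning step with `pd.cut`. The following is a minimal sketch, assuming a small made-up `ages` series in place of `insurance_df['age']`; the bin edges and labels are illustrative choices that mirror the groups used above, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample of ages; in the notebook this would be insurance_df['age']
ages = pd.Series([18, 20, 21, 30, 31, 45, 50, 51, 60, 61, 64], name="age")

# Right-closed bin edges reproduce the groups above: (0, 20], (20, 30], ..., (60, 100]
age_bins = [0, 20, 30, 40, 50, 60, 100]
age_labels = ["<=20", "21-30", "31-40", "41-50", "51-60", "61+"]

age_group = pd.cut(ages, bins=age_bins, labels=age_labels)

# Count rows per group, keeping the groups in their natural order
age_counts = age_group.value_counts().sort_index()
print(age_counts)
```

Storing the result as a new column (for example `insurance_df['age_group']`, a hypothetical name) would let later cells group by it instead of repeating the filters.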

### Let's check how BMI can help in planning policies for customers

In [12]:
# Minimum and Maximum value of bmi of the customer 
print(f"Minimum value of bmi {insurance_df['bmi'].min()}")
print(f"Maximum value of bmi {insurance_df['bmi'].max()}")

Minimum value of bmi 15.96
Maximum value of bmi 53.13


**Let's check the BMI range within each age group**

In [13]:
# Minimum and Maximum value of BMI of the customer under 20 age
print(f"Minimum value of bmi {age_under_20['bmi'].min()}")
print(f"Maximum value of bmi {age_under_20['bmi'].max()}")

Minimum value of bmi 15.96
Maximum value of bmi 53.13


In [14]:
# Minimum and Maximum value of BMI of the customer between 31 and 40 age
print(f"Minimum value of bmi {age_btw_31_40['bmi'].min()}")
print(f"Maximum value of bmi {age_btw_31_40['bmi'].max()}")

Minimum value of bmi 16.815
Maximum value of bmi 47.6


In [15]:
# Minimum and Maximum value of BMI of the customer between 41 and 50 age
print(f"Minimum value of bmi {age_btw_41_50['bmi'].min()}")
print(f"Maximum value of bmi {age_btw_41_50['bmi'].max()}")

Minimum value of bmi 19.19
Maximum value of bmi 48.07


**Do it yourself:** Compute the BMI range for the remaining age groups and compare the groups.<br>**Insights:** Across the three groups shown, the minimum BMI rises with age (15.96 for ages 20 and under, 16.815 for 31 to 40, 19.19 for 41 to 50), which may be worth considering for planning purposes.
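
One possible way to compare BMI across all age groups at once is a `groupby` on a binned age column. This is a sketch under stated assumptions: `toy_df` and its values are made up for illustration, and the `age_group` column name and labels are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; in the notebook, use insurance_df instead
toy_df = pd.DataFrame({
    "age": [18, 19, 35, 38, 45, 49],
    "bmi": [22.0, 31.5, 19.0, 27.4, 24.1, 35.2],
})

toy_df["age_group"] = pd.cut(
    toy_df["age"],
    bins=[0, 20, 30, 40, 50, 60, 100],
    labels=["<=20", "21-30", "31-40", "41-50", "51-60", "61+"],
)

# observed=True drops age groups that have no rows in this toy sample
bmi_by_age = toy_df.groupby("age_group", observed=True)["bmi"].agg(["min", "max", "mean"])
print(bmi_by_age)
```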

In [65]:
bmi_bins = [0, 18.5, 25, 30, 40, 100]
bmi_labels = ['Underweight', 'Normal', 'Overweight', 'Obese', 'Severely Obese']

insurance_df['bmi_category'] = pd.cut(insurance_df['bmi'], bins=bmi_bins, labels=bmi_labels)

bmi_stats = insurance_df.groupby('bmi_category').agg(
    count=('charges', 'count'),
    avg_charges=('charges', 'mean')
).sort_index()

print(bmi_stats)


                count   avg_charges
bmi_category                       
Underweight        21   8657.620652
Normal            226  10435.440719
Overweight        386  10997.803881
Obese             614  15379.565215
Severely Obese     91  16784.615546


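**The Obese category (BMI above 30, up to 40) contains the most policyholders (614 of 1,338), and average charges rise with each BMI category, from about 8,658 for Underweight to about 16,785 for Severely Obese.**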
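
One detail worth knowing when reading these categories: `pd.cut` treats bins as right-closed by default, so a BMI of exactly 30 is labelled Overweight rather than Obese. The sketch below uses a few made-up edge values to show the difference between the default and `right=False`; which convention is appropriate is a choice for the analyst.

```python
import pandas as pd

bmi_bins = [0, 18.5, 25, 30, 40, 100]
bmi_labels = ['Underweight', 'Normal', 'Overweight', 'Obese', 'Severely Obese']

# Hypothetical BMI values chosen to sit exactly on the bin edges
edge_values = pd.Series([18.5, 25.0, 30.0, 40.0])

# Default (right=True): intervals are (0, 18.5], (18.5, 25], (25, 30], ...
right_closed = pd.cut(edge_values, bins=bmi_bins, labels=bmi_labels)

# right=False: intervals are [0, 18.5), [18.5, 25), [25, 30), ...
left_closed = pd.cut(edge_values, bins=bmi_bins, labels=bmi_labels, right=False)

print(pd.DataFrame({
    "bmi": edge_values,
    "right=True": right_closed,
    "right=False": left_closed,
}))
```

With real-valued BMI data only a handful of rows usually sit exactly on an edge, so the counts above are unlikely to change much either way.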

### Let's check how smokers contribute to insurance policies

In [26]:
insurance_df['smoker'].value_counts()

no     1064
yes     274
Name: smoker, dtype: int64

Smokers are a minority: 274 of 1,338 policyholders (about 20%) smoke.

**Let's examine which age group has the most smokers among policyholders.**


In [27]:
age_under_20['smoker'].value_counts()

no     127
yes     39
Name: smoker, dtype: int64

In [28]:
age_btw_21_30['smoker'].value_counts()

no     222
yes     56
Name: smoker, dtype: int64

In [29]:
age_btw_31_40['smoker'].value_counts()

no     203
yes     54
Name: smoker, dtype: int64

In [30]:
age_btw_41_50['smoker'].value_counts()

no     220
yes     61
Name: smoker, dtype: int64

In [31]:
age_btw_51_60['smoker'].value_counts()

no     223
yes     42
Name: smoker, dtype: int64

In [32]:
age_above_61['smoker'].value_counts()

no     69
yes    22
Name: smoker, dtype: int64

By count, the 41-50 (61) and 21-30 (56) age groups contain the most smokers, with 31-40 (54) close behind. As a share of each group, smokers range from about 16% (51-60) to about 24% (above 60), so non-smokers clearly outnumber smokers in every age group.
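
The six separate `value_counts()` calls above could also be collapsed into a single table with `pd.crosstab`. A minimal sketch, assuming a small made-up `toy_df` in place of `insurance_df` and hypothetical age-group labels:

```python
import pandas as pd

# Hypothetical toy rows; in the notebook, use insurance_df instead
toy_df = pd.DataFrame({
    "age":    [18, 19, 20, 25, 28, 45, 47, 62],
    "smoker": ["yes", "no", "no", "no", "yes", "yes", "no", "no"],
})

toy_df["age_group"] = pd.cut(
    toy_df["age"],
    bins=[0, 20, 30, 40, 50, 60, 100],
    labels=["<=20", "21-30", "31-40", "41-50", "51-60", "61+"],
)

# Raw counts of smokers and non-smokers per age group
smoker_counts = pd.crosstab(toy_df["age_group"], toy_df["smoker"])

# normalize="index" turns each row into shares that sum to 1,
# which makes groups of different sizes comparable
smoker_share = pd.crosstab(toy_df["age_group"], toy_df["smoker"], normalize="index")

print(smoker_counts)
print(smoker_share.round(2))
```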
Let's check the split between male and female policyholders.

In [39]:
insurance_df['sex'].value_counts()

male      676
female    662
Name: sex, dtype: int64

The split is nearly even (676 male, 662 female), so neither sex dominates the customer base. Let's check how many male and female policyholders are smokers.

In [40]:
insurance_df[insurance_df['smoker'] == 'yes']['sex'].value_counts()

male      159
female    115
Name: sex, dtype: int64

There are more male smokers than female smokers (159 versus 115). Because the two groups are nearly the same size, the smoking rate is also higher among males, though the gap is moderate.
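
The smoking rate for each sex follows directly from the two `value_counts()` outputs above. A short sketch of the arithmetic, using those counts (the variable names are illustrative):

```python
import pandas as pd

# Counts taken from the two value_counts() outputs above
total_by_sex = pd.Series({"male": 676, "female": 662})
smokers_by_sex = pd.Series({"male": 159, "female": 115})

# Share of each sex that smokes, as a percentage rounded to one decimal
smoker_rate = (smokers_by_sex / total_by_sex * 100).round(1)
print(smoker_rate)
```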
Let's check how smoking affects insurance charges.

**Let's start with the whole charges column**

In [51]:
bins = [0, 5000, 10000, 20000, 30000, 40000, 50000, 70000]
labels = ['0-5k', '5k-10k', '10k-20k', '20k-30k', '30k-40k', '40k-50k', '50k+']

insurance_df['charge_range'] = pd.cut(insurance_df['charges'], bins=bins, labels=labels)
print(insurance_df['charge_range'].value_counts().sort_index())


0-5k       359
5k-10k     353
10k-20k    353
20k-30k    111
30k-40k     83
40k-50k     72
50k+         7
Name: charge_range, dtype: int64


Most policyholders (1,065 of 1,338, about 80%) are charged 20k or less, so this segment deserves the most attention when improving policies. **Let's check the situation for smokers.**

In [54]:
# SMOKERS

# Define bins and labels
bins = [0, 5000, 10000, 20000, 30000, 40000, 50000, 70000]
labels = ['0-5k', '5k-10k', '10k-20k', '20k-30k', '30k-40k', '40k-50k', '50k+']

# Create a new column with binned charge ranges
insurance_df['charge_range'] = pd.cut(insurance_df['charges'], bins=bins, labels=labels)

# Filter for smoker == 'yes' and count value ranges
smoker_yes_ranges = insurance_df[insurance_df['smoker'] == 'yes']['charge_range'].value_counts().sort_index()

print(smoker_yes_ranges)


0-5k        0
5k-10k      0
10k-20k    62
20k-30k    60
30k-40k    73
40k-50k    72
50k+        7
Name: charge_range, dtype: int64


In [55]:
# NON-SMOKERS

# Define bins and labels
bins = [0, 5000, 10000, 20000, 30000, 40000, 50000, 70000]
labels = ['0-5k', '5k-10k', '10k-20k', '20k-30k', '30k-40k', '40k-50k', '50k+']

# Create a new column with binned charge ranges
insurance_df['charge_range'] = pd.cut(insurance_df['charges'], bins=bins, labels=labels)

# Filter for smoker == 'no' and count value ranges
smoker_no_ranges = insurance_df[insurance_df['smoker'] == 'no']['charge_range'].value_counts().sort_index()

print(smoker_no_ranges)


0-5k       359
5k-10k     353
10k-20k    291
20k-30k     51
30k-40k     10
40k-50k      0
50k+         0
Name: charge_range, dtype: int64


The data indicate that smokers pay more for insurance than non-smokers, despite their smaller numbers. No smoker is charged under 10k, and 212 of 274 smokers (about 77%) are charged above 20k, compared with only 61 of 1,064 non-smokers (about 6%).

The goal of this notebook was to show why EDA matters and how patterns and insights can be extracted from data.
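
The smoker and non-smoker tables above can be placed side by side, together with a typical charge per group. The following is a sketch only: `toy_df` and its charges are made up for illustration, and in the notebook `insurance_df` would take its place.

```python
import pandas as pd

# Hypothetical toy rows; in the notebook, use insurance_df instead
toy_df = pd.DataFrame({
    "smoker":  ["yes", "yes", "yes", "no", "no", "no", "no"],
    "charges": [15000, 25000, 45000, 2000, 4000, 8000, 22000],
})

bins = [0, 5000, 10000, 20000, 30000, 40000, 50000, 70000]
labels = ['0-5k', '5k-10k', '10k-20k', '20k-30k', '30k-40k', '40k-50k', '50k+']
toy_df["charge_range"] = pd.cut(toy_df["charges"], bins=bins, labels=labels)

# One table with a column per smoker status
range_by_smoker = pd.crosstab(toy_df["charge_range"], toy_df["smoker"])

# Typical charge per group; the median is less sensitive to a few very large bills
charge_summary = toy_df.groupby("smoker")["charges"].agg(["count", "mean", "median"])

print(range_by_smoker)
print(charge_summary)
```

Comparing the medians of the two groups is one way to check how large the gap in charges really is.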

### Your Task  
Your task is to find patterns in the other columns and summarize the findings in a paragraph so that the company can improve its policy-making decisions.

# ✅ Summary of Findings

* The largest number of smokers (73) falls into the 30k–40k charge range.
* Most smokers (212 of 274) are charged above 20k, indicating high insurance costs for this group.
* No smoker is charged under 10k, and only 62 smokers are charged 20k or less.
* Sex has little bearing on who buys a policy: males (676) and females (662) are almost equally represented.
* The Obese category has the most policyholders, and average charges rise with BMI category.
* 
#### Add your own insights to this notebook and perform more EDA to support better policy-making decisions

### Keep Learning !!