<h2><strong>Predictive Modeling of Health Insurance Charges: Understanding the factors influencing costs</strong></h2>

<img style="width: 1000px; height: 500px; border: 1px solid #ccc;" src="images/IMG_0597.JPG">

Let's start by loading the data into a pandas DataFrame. This will let us explore the data by filtering rows, selecting columns, and creating new columns.


First, let's take a look at the variables:
   - `age`: Age of the primary beneficiary.
   - `sex`: Gender of the insurance contractor (female / male).
   - `bmi`: Body mass index of the primary beneficiary, a measure of body weight relative to height.
   - `children`: Number of children covered by health insurance (number of dependents).
   - `smoker`: Whether the beneficiary smokes (yes / no).
   - `region`: The beneficiary's residential area in the US: northeast, southeast, southwest, or northwest.
   - `charges`: The estimated medical insurance costs for the individual.
   

In [27]:
#Importing....
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


ModuleNotFoundError: No module named 'matplotlib'
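The `ModuleNotFoundError` above means `matplotlib` was not installed in the environment where that cell ran (it is typically installed with `pip install matplotlib`). A minimal sketch for checking the notebook's imports up front, using only the standard library; the helper name `missing_packages` is hypothetical:

```python
import importlib.util

# Packages imported by this notebook's import cell.
required = ["pandas", "matplotlib", "seaborn"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(required)
if missing:
    print("Install before continuing:", ", ".join(missing))
else:
    print("All required packages are available.")
```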

In [None]:
df = pd.read_csv('insurance[1].csv')
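Before exploring, it can help to confirm that the loaded columns match the variable list. The sketch below is an illustration only: it reads three rows of the dataset from inline text instead of the CSV file, and the names `sample` and `expected_columns` are assumptions introduced here.

```python
import io
import pandas as pd

# Inline text stands in for the CSV file; the rows are the first three records of the dataset.
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
    "28,male,33.0,3,no,southeast,4449.462\n"
)
sample = pd.read_csv(sample_csv)

# Column names taken from the variable list.
expected_columns = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]
missing_columns = [col for col in expected_columns if col not in sample.columns]
print("Missing columns:", missing_columns)
```

The same check can be run against `df` by replacing `sample` with `df`.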

In [6]:
#Glimpse of the data
df.head()


   age     sex     bmi  children  smoker     region      charges
0   19  female    27.9         0     yes  southwest    16884.924
1   18    male   33.77         1      no  southeast    1725.5523
2   28    male    33.0         3      no  southeast     4449.462
3   33    male  22.705         0      no  northwest  21984.47061
4   32    male   28.88         0      no  northwest    3866.8552


In [9]:
#Last five rows
df.tail()

      age     sex    bmi  children  smoker     region     charges
1333   50    male  30.97         3      no  northwest  10600.5483
1334   18  female  31.92         0      no  northeast   2205.9808
1335   18  female  36.85         0      no  southeast   1629.8335
1336   21  female   25.8         0      no  southwest    2007.945
1337   61  female  29.07         0     yes  northwest  29141.3603


In [7]:
df.describe()

             age        bmi  children       charges
count     1338.0     1338.0    1338.0        1338.0
mean   39.207025  30.663397  1.094918  13270.422265
std     14.04996   6.098187  1.205493  12110.011237
min         18.0      15.96       0.0     1121.8739
25%         27.0   26.29625       0.0    4740.28715
50%         39.0       30.4       1.0      9382.033
75%         51.0   34.69375       2.0  16639.912515
max         64.0      53.13       5.0   63770.42801


In [10]:
#Check for unique values
df.nunique()

age           47
sex            2
bmi          548
children       6
smoker         2
region         4
charges     1337
dtype: int64

In [11]:
# Checking for duplicate values
df.duplicated().value_counts()

False    1337
True        1
Name: count, dtype: int64
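The count shows one duplicated row, but not which one. A small sketch of how the duplicate could be inspected before dropping it; the `toy` frame and its values are made up for illustration and are not rows from the dataset:

```python
import pandas as pd

# Made-up rows for illustration; rows 0 and 1 are identical.
toy = pd.DataFrame({
    "age": [30, 30, 45],
    "region": ["northwest", "northwest", "southeast"],
    "charges": [1000.0, 1000.0, 2500.0],
})

# keep=False flags every member of a duplicate group, not only the later copies.
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# drop_duplicates() keeps the first occurrence of each group by default.
deduped = toy.drop_duplicates()
print(deduped.shape)
```

On the real data, `df[df.duplicated(keep=False)]` would show both copies of the duplicated record side by side.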

In [12]:
# Dropping duplicate values
df.drop_duplicates(inplace=True)

In [13]:
df.shape

(1337, 7)

In [8]:
#Checking data types and null values 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [15]:
#row*cols
df.size

9359
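`df.size` is the number of rows multiplied by the number of columns, so 1337 rows × 7 columns gives 9359. A quick sketch of that relationship on a made-up two-column frame (`toy` is a hypothetical name):

```python
import pandas as pd

# Made-up frame: 3 rows, 2 columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = toy.shape
print(toy.size == n_rows * n_cols)

# The same arithmetic for the shape reported above, (1337, 7).
expected_size = 1337 * 7
print(expected_size)
```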

Let us now create a new column to categorize the beneficiaries. We will separate smokers from non-smokers and assign each beneficiary a weight category based on their BMI.

In [16]:
# Selecting specific columns
selected_data = df[['age', 'sex', 'charges']]

# Filtering df based on certain conditions
smokers = df[df['smoker'] == 'yes']

# Creating a new column for BMI category
def bmi_category(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif 18.5 <= bmi < 25:
        return 'Normal Weight'
    elif 25 <= bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

df['bmi_category'] = df['bmi'].apply(bmi_category)
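As an alternative to applying a Python function row by row, the same thresholds can be expressed with `pd.cut`. This is a sketch rather than the notebook's method: the sample BMI values and the name `bmi_category_cut` are assumptions chosen to exercise each boundary.

```python
import numpy as np
import pandas as pd

# Sample values chosen to sit on and around each threshold.
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 36.85])

bins = [-np.inf, 18.5, 25, 30, np.inf]
labels = ["Underweight", "Normal Weight", "Overweight", "Obese"]

# right=False makes each bin include its lower edge and exclude its upper edge,
# matching the `18.5 <= bmi < 25` style comparisons in bmi_category().
bmi_category_cut = pd.cut(bmi, bins=bins, labels=labels, right=False)
print(bmi_category_cut.tolist())
```

Unlike `apply`, `pd.cut` returns an ordered categorical column, which keeps the categories in a sensible order when grouping or plotting.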


In [18]:
# Average charges for smokers and non-smokers
avg_charges_smokers = df.loc[df['smoker'] == 'yes', 'charges'].mean()
avg_charges_non_smoker = df.loc[df['smoker'] == 'no', 'charges'].mean()
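The average charge for each smoking status can also be computed in a single step with `groupby`. A minimal sketch, assuming a frame with `smoker` and `charges` columns; the four rows below are copied from the `head()` and `tail()` outputs and are far too few to be representative:

```python
import pandas as pd

# Four rows taken from the head()/tail() outputs, for illustration only.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes"],
    "charges": [16884.924, 1725.5523, 4449.462, 29141.3603],
})

# One mean per smoker group, indexed by the group label ('no', 'yes').
avg_by_smoker = toy.groupby("smoker")["charges"].mean()
print(avg_by_smoker)
```

On the full data, `df.groupby('smoker')['charges'].mean()` would return both averages at once.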
