# Insurance Cost Analysis

- Load the data as a pandas dataframe
- Clean the data, taking care of the blank entries
- Run exploratory data analysis (EDA) and identify the attributes that most affect the charges
- Develop single-variable and multi-variable linear regression models for predicting the charges
- Use Ridge regression to refine the performance of the linear regression models

To simplify the analysis, categorical values are encoded as numbers:

| Gender | Assigned Value |
| --- | --- |
| Female | 1 |
| Male | 2 |


In [4]:
import pandas as pd
import numpy as np

In [5]:
file_name = "insurance.csv"

In [6]:
# Load the CSV; the file has no header row
df = pd.read_csv(file_name, header=None)

In [7]:
# Print df
print(df.head(10))

    0  1       2  3  4  5            6
0  19  1  27.900  0  1  3  16884.92400
1  18  2  33.770  1  0  4   1725.55230
2  28  2  33.000  3  0  4   4449.46200
3  33  2  22.705  0  0  1  21984.47061
4  32  2  28.880  0  0  1   3866.85520
5  31  1  25.740  0  ?  4   3756.62160
6  46  1  33.440  1  0  4   8240.58960
7  37  1  27.740  3  0  1   7281.50560
8  37  2  29.830  2  0  2   6406.41070
9  60  1  25.840  0  0  1  28923.13692


The data set has no header row, so we add the column names ourselves.

In [8]:
# Add headers
headers = ["age", "gender", "bmi", "no_of_children", "smoker", "region", "charges"]
df.columns = headers
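
As an alternative to assigning `df.columns` afterwards, the column names can be supplied while reading the file. The following is a minimal sketch, assuming a small in-memory sample (`sample_csv` and `sample_df` are hypothetical stand-ins, not the real `insurance.csv`):

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for insurance.csv (no header row)
sample_csv = io.StringIO(
    "19,1,27.9,0,1,3,16884.924\n"
    "18,2,33.77,1,0,4,1725.5523\n"
)

headers = ["age", "gender", "bmi", "no_of_children", "smoker", "region", "charges"]

# header=None tells pandas the first line is data, not a header;
# names= assigns the column labels at load time
sample_df = pd.read_csv(sample_csv, header=None, names=headers)
print(sample_df.columns.tolist())
```

Both approaches should give the same frame; passing `names=` simply avoids the intermediate state with integer column labels.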

In [9]:
# Check df
df

Unnamed: 0,age,gender,bmi,no_of_children,smoker,region,charges
0,19,1,27.900,0,1,3,16884.92400
1,18,2,33.770,1,0,4,1725.55230
2,28,2,33.000,3,0,4,4449.46200
3,33,2,22.705,0,0,1,21984.47061
4,32,2,28.880,0,0,1,3866.85520
...,...,...,...,...,...,...,...
2767,47,1,45.320,1,0,4,8569.86180
2768,21,1,34.600,0,0,3,2020.17700
2769,19,2,26.030,1,1,1,16450.89470
2770,23,2,18.715,0,0,1,21595.38229


In [10]:
# Check column info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2772 entries, 0 to 2771
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             2772 non-null   object 
 1   gender          2772 non-null   int64  
 2   bmi             2772 non-null   float64
 3   no_of_children  2772 non-null   int64  
 4   smoker          2772 non-null   object 
 5   region          2772 non-null   int64  
 6   charges         2772 non-null   float64
dtypes: float64(2), int64(3), object(2)
memory usage: 151.7+ KB


As we can see, age and smoker are stored as objects, so we need to convert both columns to int.

In [11]:
# Update data types
df["smoker"] = df["smoker"].astype("int")

ValueError: invalid literal for int() with base 10: '?'

In [None]:
# Update data types
df[["age","smoker"]] = df[["age","smoker"]].astype("int")

As you can see above, the first attempt to convert the column values raises an error:

<b>ValueError:</b> invalid literal for int() with base 10: '?'

So we have to find the columns that contain missing data (marked with '?'), and either remove those rows or fill them with a replacement value such as the column mean.
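
Because '?' is an ordinary string, `isna()` and `isnull()` do not report it as missing. A minimal sketch of one way to locate the placeholders directly, assuming a hypothetical toy frame (`toy`) that mimics the '?' entries in the data:

```python
import pandas as pd

# Hypothetical toy frame mimicking the '?' placeholders in the data
toy = pd.DataFrame({
    "age": ["19", "?", "28"],
    "smoker": ["1", "0", "?"],
    "bmi": [27.9, 33.77, 33.0],
})

# isna() does not see '?' because it is a regular string value
na_counts = toy.isna().sum()

# Comparing against the placeholder counts the real gaps per column
placeholder_counts = (toy == "?").sum()

# Row labels where any column holds the placeholder
rows_with_placeholder = toy.index[(toy == "?").any(axis=1)].tolist()

print(na_counts)
print(placeholder_counts)
print(rows_with_placeholder)
```

Applied to the real `df`, the same comparison would point at the affected rows without printing the whole frame.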

In [None]:
# Count NaN values per column (note: '?' strings are not counted as missing)

null_count = df.isna().sum()
print(null_count)

In [None]:
missing_data = df.isnull()
missing_data.head(5)

In [None]:
# checking for missing values:

for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")  

In [None]:
# Not ideal, but printing every row of the df showed where the error was: row 234 of the age column holds a '?' placeholder.
# To fix it, we fill the missing age values with the column mean and the missing smoker values with the most frequent value.

In [None]:
# Replace '?' with NaN in every column (both age and smoker contain '?')
df = df.replace("?", np.nan)

In [None]:
# Convert the age column to float for mean calculation
df["age"] = df["age"].astype('float')

In [None]:
# Calculate the mean of the age column, excluding NaN values
mean_age = df["age"].mean()

In [None]:
# Replace NaN values in the age column with the mean age
# (assign back instead of inplace=True, which is unreliable on a column selection)
df["age"] = df["age"].fillna(mean_age)

In [None]:
# Replace missing values in the smoker column with the most frequent value
smoker = df["smoker"].replace("?", np.nan)  # no-op if '?' was already replaced
is_smoker = smoker.value_counts().idxmax()
df["smoker"] = smoker.fillna(is_smoker)

In [None]:
# Convert data types
df[["age", "smoker"]] = df[["age", "smoker"]].astype("int")

In [None]:
# Output to check if the operation was successful
df.dtypes
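
The cleaning steps above can also be condensed with `pd.to_numeric`, which converts unparseable entries to NaN in one pass. This is a sketch under stated assumptions, not the notebook's own method: `toy` is a hypothetical frame, and filling age with the mean and smoker with the mode mirrors the choices made above.

```python
import pandas as pd

# Hypothetical toy frame with '?' placeholders in both columns
toy = pd.DataFrame({
    "age": ["20", "?", "40"],
    "smoker": ["0", "0", "?"],
})

# errors="coerce" turns anything unparseable (here '?') into NaN
toy["age"] = pd.to_numeric(toy["age"], errors="coerce")
toy["smoker"] = pd.to_numeric(toy["smoker"], errors="coerce")

# Continuous column: fill with the mean; categorical flag: fill with the mode
toy["age"] = toy["age"].fillna(toy["age"].mean())
toy["smoker"] = toy["smoker"].fillna(toy["smoker"].mode()[0])

# Both columns are now free of NaN and can be cast to integers
toy = toy.astype({"age": "int", "smoker": "int"})
print(toy.dtypes)
```

Note that casting a fractional mean to int truncates it, which is the same behaviour as the `astype("int")` call above.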

In [None]:
# calling df
df

<p>As you can see, the bmi and charges columns have many digits after the decimal point.</p>

<p>We want a consistent format, keeping only two digits after the decimal point.</p>

In [12]:
# Round bmi and charges to two decimal places
df[["charges"]] = np.round(df[["charges"]],2)
df[["bmi"]] = np.round(df[["bmi"]],2)
print(df.head())

  age  gender    bmi  no_of_children smoker  region   charges
0  19       1  27.90               0      1       3  16884.92
1  18       2  33.77               1      0       4   1725.55
2  28       2  33.00               3      0       4   4449.46
3  33       2  22.70               0      0       1  21984.47
4  32       2  28.88               0      0       1   3866.86
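
The same rounding can be expressed with `DataFrame.round`, which accepts a mapping from column name to number of decimals. A minimal sketch, assuming a hypothetical toy frame (`toy`) with a few values in the style of the data:

```python
import pandas as pd

# Hypothetical toy frame; region is included to show it is left untouched
toy = pd.DataFrame({
    "bmi": [27.9, 33.77],
    "charges": [16884.924, 1725.5523],
    "region": [3, 4],
})

# Columns named in the dict are rounded; all others are returned unchanged
rounded = toy.round({"bmi": 2, "charges": 2})
print(rounded)
```

This keeps the rounding rules for all columns in one place, which may be easier to maintain than one `np.round` call per column.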


In [19]:
df.tail(8)

Unnamed: 0,age,gender,bmi,no_of_children,smoker,region,charges
2764,22,1,31.02,3,1,4,35595.59
2765,47,2,36.08,1,1,4,42211.14
2766,18,2,23.32,1,0,4,1711.03
2767,47,1,45.32,1,0,4,8569.86
2768,21,1,34.6,0,0,3,2020.18
2769,19,2,26.03,1,1,1,16450.89
2770,23,2,18.72,0,0,1,21595.38
2771,54,2,31.6,0,0,3,9850.43


In [21]:
# Check the statistical summary of each column
df.describe()

Unnamed: 0,gender,bmi,no_of_children,region,charges
count,2772.0,2772.0,2772.0,2772.0,2772.0
mean,1.507215,30.701522,1.101732,2.559885,13261.369957
std,0.500038,6.129228,1.214806,1.130761,12151.76897
min,1.0,15.96,0.0,1.0,1121.87
25%,1.0,26.22,0.0,2.0,4687.8
50%,2.0,30.45,1.0,3.0,9333.015
75%,2.0,34.77,2.0,4.0,16577.78
max,2.0,53.13,5.0,4.0,63770.43
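
The summary above has no age or smoker column. One likely explanation is that the output was captured while those columns were still non-numeric, since `describe()` summarises only numeric columns by default when a frame has mixed types. A minimal sketch of that behaviour, assuming a hypothetical toy frame (`toy`):

```python
import pandas as pd

# Hypothetical toy frame: age is stored as strings, bmi as floats
toy = pd.DataFrame({
    "age": ["19", "18", "28"],
    "bmi": [27.9, 33.77, 33.0],
})

# With mixed dtypes, describe() summarises only the numeric columns
before = toy.describe().columns.tolist()

# After conversion, age is numeric and appears in the summary
toy["age"] = toy["age"].astype("int")
after = toy.describe().columns.tolist()

print(before)
print(after)
```

Re-running `df.describe()` after the type conversion should therefore include age and smoker as well.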
