In this notebook I will clean up the data downloaded from:
https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/insurance.csv

Description:
Simulated data on hypothetical medical expenses for patients in the United States, from Packt Publishing.

The goal is to use the other features to predict the insurance charges.

In [1]:
import pandas as pd

In [2]:
# Read in the data to a dataframe:
ins_data = pd.read_csv("insurance.csv")

In [3]:
# Check the shape of the dataframe
ins_data.shape

(1338, 7)

In [4]:
# Check the first few rows of the data to get an idea for how best to prepare the data
ins_data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [5]:
# See if there are any null values that need to be dealt with:
ins_data.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

Charges will be the target variable.

BMI and Age can stay as they are, but sex, smoker, and region will need to be changed into dummy variables.

It seems unlikely to me that the number of children will be a useful feature, but it might be interesting to check it anyway. We will keep it for now.
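
The split between numeric and categorical columns described above can also be read off the dataframe rather than by eye. This is a small sketch, not part of the original notebook: `sample` is a hypothetical stand-in built from the five rows printed by `head()`, so it runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for ins_data: the five rows shown by head() above.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Columns that are not numeric are the ones that need dummy encoding.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

print(categorical_cols)
print(numeric_cols)
```

On the real data the same two calls on `ins_data` should list sex, smoker, and region as the columns to encode.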

In [6]:
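# Convert the categorical variables into dummy ones.
# dtype=int keeps the 0/1 output shown below; newer pandas versions default to True/False.
ins_data = pd.get_dummies(ins_data, dtype=int)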
ins_data.head()

Unnamed: 0,age,bmi,children,charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,27.9,0,16884.924,1,0,0,1,0,0,0,1
1,18,33.77,1,1725.5523,0,1,1,0,0,0,1,0
2,28,33.0,3,4449.462,0,1,1,0,0,0,1,0
3,33,22.705,0,21984.47061,0,1,1,0,0,1,0,0
4,32,28.88,0,3866.8552,0,1,1,0,0,1,0,0


In [7]:
# Drop the redundant dummy columns. All four region columns are kept for now for simplicity, even though one of them could be dropped
ins_data.drop(['sex_male','smoker_no'],axis=1,inplace=True)
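
As an alternative to dropping columns by hand, `pd.get_dummies` accepts `drop_first=True`, which removes the first level of every categorical column, including one of the regions. The sketch below is not what this notebook does; it uses a hypothetical `sample` frame built from the five rows shown earlier. Note that it keeps `sex_male` rather than `sex_female`, because the first level in sorted order is the one that gets dropped.

```python
import pandas as pd

# Hypothetical stand-in for the raw data: the five rows shown by head().
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# drop_first=True drops one level per categorical column in a single step.
encoded = pd.get_dummies(sample, drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

Only three regions appear in these five rows, so the sample yields two region columns; on the full data there would be three.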

In [8]:
ins_data.head()

Unnamed: 0,age,bmi,children,charges,sex_female,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,27.9,0,16884.924,1,1,0,0,0,1
1,18,33.77,1,1725.5523,0,0,0,0,1,0
2,28,33.0,3,4449.462,0,0,0,0,1,0
3,33,22.705,0,21984.47061,0,0,0,1,0,0
4,32,28.88,0,3866.8552,0,0,0,1,0,0


In [9]:
# Let's round the charges and bmi variables to 2 decimals for simplicity (round() applies to every float column). No useful data should be lost:
ins_data = ins_data.round(decimals=2)
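
If the rounding should be limited to specific columns, `DataFrame.round` also accepts a dict mapping column names to decimal places. A minimal sketch on a hypothetical `sample` frame (values taken from the rows shown above), not a change to the notebook's own cell:

```python
import pandas as pd

# Hypothetical stand-in with the two float columns and one integer column.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Round only the named columns; columns not listed are left untouched.
rounded = sample.round({"bmi": 2, "charges": 2})

print(rounded)
```

Here the result is the same as `round(decimals=2)`, since bmi and charges are the only float columns, but the dict form makes the intent explicit.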

In [10]:
ins_data

Unnamed: 0,age,bmi,children,charges,sex_female,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,27.90,0,16884.92,1,1,0,0,0,1
1,18,33.77,1,1725.55,0,0,0,0,1,0
2,28,33.00,3,4449.46,0,0,0,0,1,0
3,33,22.70,0,21984.47,0,0,0,1,0,0
4,32,28.88,0,3866.86,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...
1333,50,30.97,3,10600.55,0,0,0,1,0,0
1334,18,31.92,0,2205.98,1,0,1,0,0,0
1335,18,36.85,0,1629.83,1,0,0,0,1,0
1336,21,25.80,0,2007.94,1,0,0,0,0,1


I think this data is now ready to be used for some EDA!

In [13]:
# Let's write the clean dataframe to the local directory:
ins_data.to_csv("ins_data_clean.csv")
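
One thing to be aware of when reading this file back: by default `to_csv` also writes the row index, which `read_csv` then loads as an extra column named `Unnamed: 0`. The sketch below shows the difference using an in-memory buffer and a hypothetical two-row `sample` frame; passing `index=False` is one way to avoid the extra column, offered here as an option rather than a change to the cell above.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the cleaned dataframe.
sample = pd.DataFrame({"age": [19, 18], "charges": [16884.92, 1725.55]})

# Default: the index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(sample.to_csv()))

# index=False: only the data columns are written.
without_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```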