Importing the libraries needed for data manipulation and normalization: pandas to load and work with the data, and MinMaxScaler to scale numeric features later on. This step sets up what I need to clean, transform, and prepare the insurance data for modeling.

In [4]:
#importing required libraries
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

Loading the insurance dataset from a CSV file and showing the first few rows to make sure the data loaded correctly and to see what columns and values I’m working with. This helps me confirm that there are no obvious issues before I start transforming the data.

In [7]:
#loading the insurance dataset
df = pd.read_csv('Downloads/insurance.csv')

#displaying the first few rows
df.head()

   age     sex     bmi  children  smoker     region      charges
0   19  female    27.9         0     yes  southwest    16884.924
1   18    male   33.77         1      no  southeast    1725.5523
2   28    male    33.0         3      no  southeast     4449.462
3   33    male  22.705         0      no  northwest  21984.47061
4   32    male   28.88         0      no  northwest    3866.8552
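
Beyond eyeballing the first rows, the same check can be done in code. Here is a minimal sketch, assuming the seven columns shown in the output above. The sample frame is rebuilt by hand from those five rows so the sketch runs without the CSV file, and check_columns is a hypothetical helper of my own, not part of the original notebook.

```python
import pandas as pd

# The five rows shown above, rebuilt by hand so the sketch runs without the CSV file.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})


def check_columns(frame, expected):
    """Hypothetical helper: return the expected columns missing from the frame."""
    return [col for col in expected if col not in frame.columns]


expected_columns = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]
missing = check_columns(sample, expected_columns)

# Columns that will be binned or scaled later should have been parsed as numbers.
numeric_ok = all(
    pd.api.types.is_numeric_dtype(sample[col])
    for col in ["age", "bmi", "children", "charges"]
)

print("missing columns:", missing)
print("numeric columns parsed as numbers:", numeric_ok)
```

Running the same two checks on the real df right after pd.read_csv would catch a renamed column or a numeric column that was read in as text.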


Binning the continuous age and bmi variables into labeled categories. This lets the model pick up patterns tied to age ranges and BMI levels rather than only to the raw numbers, and it makes the results easier to interpret. pd.cut uses right-closed intervals by default, so the age bins are (17, 25], (25, 35], (35, 50], and (50, 65].

In [10]:
#creating age bins
df['age_group'] = pd.cut(df['age'],
                         bins=[17, 25, 35, 50, 65],
                         labels=['young_adult', 'adult', 'middle_aged', 'senior'])

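#creating bmi bins based on cdc categories (normal is 18.5 to <25, overweight is 25 to <30)
#right=False makes each bin include its lower edge, so 18.5 is normal, 25 is overweight, and 30 is obese
df['bmi_category'] = pd.cut(df['bmi'],
                            bins=[0, 18.5, 25, 30, 100],
                            labels=['underweight', 'normal', 'overweight', 'obese'],
                            right=False)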

#displaying age and bmi with new categories
df[['age', 'age_group', 'bmi', 'bmi_category']].head()

   age    age_group     bmi  bmi_category
0   19  young_adult    27.9    overweight
1   18  young_adult   33.77         obese
2   28        adult    33.0         obese
3   33        adult  22.705        normal
4   32        adult   28.88    overweight
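
Bin edges decide where boundary values land, so it is worth testing them on a few values that sit on or near the edges. This is a minimal sketch that compares right-closed bins with edges at 24.9 and 29.9 against left-closed bins with edges at 25 and 30 (right=False). The sample BMI values are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Made-up BMI values that sit on or near the category boundaries.
boundary_bmi = pd.Series([18.5, 24.95, 25.0, 29.95, 30.0])
labels = ["underweight", "normal", "overweight", "obese"]

# Right-closed intervals (the pd.cut default): (0, 18.5], (18.5, 24.9], (24.9, 29.9], (29.9, 100]
right_closed = pd.cut(boundary_bmi, bins=[0, 18.5, 24.9, 29.9, 100], labels=labels)

# Left-closed intervals: [0, 18.5), [18.5, 25), [25, 30), [30, 100)
left_closed = pd.cut(boundary_bmi, bins=[0, 18.5, 25, 30, 100], labels=labels, right=False)

comparison = pd.DataFrame({
    "bmi": boundary_bmi,
    "right_closed": right_closed,
    "left_closed": left_closed,
})
print(comparison)
```

The two schemes disagree on 18.5, 24.95, and 29.95. Only the left-closed version matches the usual "18.5 to under 25" and "25 to under 30" definitions.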


Converting the categorical columns (sex, smoker, region, and the new age group and BMI category) into numeric indicator columns with one-hot encoding, so the model can read these categories as numbers without treating any category as greater or less than another. With drop_first=True, the first level of each column is dropped and serves as the baseline, which is why sex appears only as sex_male and smoker only as smoker_yes.

In [13]:
#one-hot encoding for categorical columns
df_encoded = pd.get_dummies(df,
                            columns=['sex', 'smoker', 'region', 'age_group', 'bmi_category'],
                            drop_first=True)

#displaying the new shape and column names
print("new dataframe shape:", df_encoded.shape)
print("encoded feature columns:")
print(df_encoded.columns.tolist())

new dataframe shape: (1338, 15)
encoded feature columns:
['age', 'bmi', 'children', 'charges', 'sex_male', 'smoker_yes', 'region_northwest', 'region_southeast', 'region_southwest', 'age_group_adult', 'age_group_middle_aged', 'age_group_senior', 'bmi_category_normal', 'bmi_category_overweight', 'bmi_category_obese']
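
To see what drop_first=True does, here is a small sketch on a toy frame with one row per region. It assumes the fourth region is northeast. The column list above does not show that level, because it is the one that gets dropped.

```python
import pandas as pd

# Toy frame with one row per region; 'northeast' is an assumed level (see note above).
toy = pd.DataFrame({"region": ["northeast", "northwest", "southeast", "southwest"]})

# Full one-hot encoding: one indicator column per level.
full = pd.get_dummies(toy, columns=["region"])

# Dummy coding: the first level (alphabetically) is dropped and becomes the baseline.
reduced = pd.get_dummies(toy, columns=["region"], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())

# The baseline row has every remaining region indicator switched off.
baseline_row = reduced.iloc[0].astype(int).tolist()
print(baseline_row)
```

Dropping one level removes a redundant column, since the dropped level is implied whenever all the other indicators are zero.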


Applying min-max scaling to age, bmi, and charges to bring their values into a 0–1 range. Keeping these columns on a comparable scale stops a column with large values, such as charges, from dominating algorithms that are sensitive to feature magnitude. The children column is left on its original count scale. If charges is later used as the prediction target, predictions from a model trained on this data will also be on the 0–1 scale.

In [16]:
#initializing min-max scaler
scaler = MinMaxScaler()

#normalizing selected numeric features
df_encoded[['age', 'bmi', 'charges']] = scaler.fit_transform(df_encoded[['age', 'bmi', 'charges']])

#displaying normalized values
df_encoded[['age', 'bmi', 'charges']].head()

        age       bmi   charges
0  0.021739  0.321227  0.251611
1       0.0   0.47915  0.009636
2  0.217391  0.458434  0.053115
3  0.326087  0.181464   0.33301
4  0.304348  0.347592  0.043816
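
Min-max scaling maps each value x to (x - min) / (max - min). This sketch applies that formula by hand to the five ages shown earlier, plus an assumed maximum age of 64. The scaled values above are consistent with a minimum of 18 and a range of 46, but the maximum itself is my inference, not something the notebook prints. The min_max function is a hypothetical helper for illustration.

```python
import pandas as pd

# Ages from the first five rows, plus an assumed column maximum of 64.
ages = pd.Series([19, 18, 28, 33, 32, 64], name="age")


def min_max(series):
    """Hypothetical helper: rescale a numeric series to the 0-1 range."""
    return (series - series.min()) / (series.max() - series.min())


scaled = min_max(ages)
print(scaled.round(6).tolist())

# The transformation is reversible as long as the original min and max are kept.
restored = scaled * (ages.max() - ages.min()) + ages.min()
print(restored.round().astype(int).tolist())
```

The same reversal is what scaler.inverse_transform does in scikit-learn, which is useful for turning scaled charges back into dollar amounts.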


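Checking whether feature engineering introduced any missing values, so that I do not carry gaps into modeling. One caveat: pd.get_dummies encodes a missing category as all zeros rather than as NaN, so this check on the encoded frame would not flag an age or BMI value that fell outside the bin edges. Checking df['age_group'] and df['bmi_category'] for nulls before encoding covers that case.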

In [19]:
#checking for missing values in the final dataset
print("missing values after feature engineering:")
print(df_encoded.isnull().sum())

missing values after feature engineering:
age                        0
bmi                        0
children                   0
charges                    0
sex_male                   0
smoker_yes                 0
region_northwest           0
region_southeast           0
region_southwest           0
age_group_adult            0
age_group_middle_aged      0
age_group_senior           0
bmi_category_normal        0
bmi_category_overweight    0
bmi_category_obese         0
dtype: int64
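
A value that falls outside the bin edges becomes NaN in pd.cut, but that NaN disappears once the column is dummy-encoded. This is a minimal sketch of the effect on a toy frame with the same age bins. The age of 70 is a made-up out-of-range value.

```python
import pandas as pd

# Toy ages; 70 is a made-up value that falls outside the last bin edge (65).
toy = pd.DataFrame({"age": [18, 40, 70]})
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[17, 25, 35, 50, 65],
    labels=["young_adult", "adult", "middle_aged", "senior"],
)

# Before encoding, the out-of-range age shows up as a missing category.
unbinned_before = int(toy["age_group"].isnull().sum())

# After encoding, the missing category is an all-zero row, not a NaN.
encoded = pd.get_dummies(toy, columns=["age_group"], drop_first=True)
missing_after = int(encoded.isnull().sum().sum())

# With drop_first=True, the unbinned row looks exactly like the baseline (young_adult) row.
dummy_cols = [col for col in encoded.columns if col.startswith("age_group_")]
all_zero_rows = int((encoded[dummy_cols].sum(axis=1) == 0).sum())

print("unbinned before encoding:", unbinned_before)
print("missing after encoding:", missing_after)
print("rows with all age_group dummies at zero:", all_zero_rows)
```

For this reason, a null count on the binned columns before calling get_dummies is the more reliable check.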


Saving the cleaned and encoded dataset to a new CSV file, with index=False so the row index is not written as an extra column.

In [22]:
#saving the cleaned and encoded dataframe to the Downloads folder (path is relative to the working directory)
df_encoded.to_csv('Downloads/insurance_cleaned.csv', index=False)
print("Cleaned dataset saved to Downloads as 'insurance_cleaned.csv'")

Cleaned dataset saved to Downloads as 'insurance_cleaned.csv'
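
A quick way to confirm the save worked is to read the file back and compare columns. This sketch uses a tiny stand-in frame and a temporary directory instead of the real Downloads path. The file names are placeholders of my own.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for df_encoded; the values are placeholders.
frame = pd.DataFrame({"age": [0.0, 0.5], "smoker_yes": [True, False]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    # Default behavior writes the row index as an unnamed first column.
    frame.to_csv(with_index)
    # index=False writes only the data columns.
    frame.to_csv(without_index, index=False)

    cols_with = pd.read_csv(with_index).columns.tolist()
    reloaded = pd.read_csv(without_index)
    cols_without = reloaded.columns.tolist()

print(cols_with)
print(cols_without)
print("shape preserved:", reloaded.shape == frame.shape)
```

Without index=False, the reloaded file picks up an extra 'Unnamed: 0' column, which would then have to be dropped before modeling.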


Brief Summary:

In the first steps, I load the insurance dataset, look at the first few rows, and create new categories to make the data more useful. I group ages into young adult, adult, middle aged, and senior, and split BMI into underweight, normal, overweight, and obese. Then I turn the categorical columns (sex, smoker, region, and the new age and BMI groups) into dummy variables so the model can read them, and use MinMaxScaler to scale age, BMI, and charges to the 0–1 range. Finally, I check for missing values, find none, and save the result. These steps give me a clean dataset of 1,338 rows and 15 columns that is ready for modeling.