# U.S. Medical Insurance Costs

In this project, I will be analyzing the 'insurance.csv' file provided by Codeacademy. This file contains insurance information about individuals. Lets take a look by reading the csv using pandas.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Acquire

In [2]:
raw_csv_data = pd.read_csv('insurance.csv')

In [3]:
raw_csv_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


From intial inspection, there do not appear to be any null values, though we will need to confirm that all entries are meaningful. Lets take a look at the type of values that are in each of these columns.

In [4]:
raw_csv_data.sample(1)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
240,23,female,36.67,2,yes,northeast,38511.6283


In [5]:
raw_csv_data['sex'].unique()

array(['female', 'male'], dtype=object)

In [6]:
raw_csv_data['smoker'].unique()

array(['yes', 'no'], dtype=object)

In [7]:
raw_csv_data['region'].unique()

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

Each record (row) is information about a single patient. There are 7 columns, each representing a different characteristic:
   - **Age**: Age of the individual
        - Presumably measured in years (rounded down)
        - Presumably measured according to western standards (age at birth = 0)
   - **Sex**: Biological sex of the individual
        - Binary (male/female)
   - **BMI**: Body mass index
        - Broadly used to categorize an individuals weight
        - Formula: body mass divided by the square of the body height in units of (kg/m$^2$)
        - Continuous decimal values are commonly categorized as:
            - Underweight (<18.5 kg/m$^2$)
            - Normal Weight (>= 18.5 & < 25)
            - Overweight (>= 25 & < 30)
            - Obese (>= 30)
   - **Children**: Number of children
       - Presumably biological
   - **Smoker**: Whether or not an individual is a 'smoker' or not
       - Binary, but not additional information on what constitutes a smoker for this data set
   - **Region**: Geographical location
       - Categorized into four regions (Southwest, Southeast, Northwest, Northeast)
   - **Charges**: A float value in some type of currency
       - The type of currency is unclear, and the number of decimal places represented is unlike most common currency conventions
       - This may represent the premiums that the individual paid during some common time window, or may represent a yearly premium, or may represent the charges that the insurance company had to pay on behalf of the patient for medical services and goods. 

# Goal
Given the nature of this data, there are a few things that we could pursue.

1) Predicting charges based on the other factors
- Depending on what charges might be, we may be able to develop a model that can predict charges for future customers based on their other characteristics.
- An issue with this is not knowing what charges precisely represent. If the value for charges is developed from the application of a simple, but hidden, formula, we may be able to identify that hidden formula, but this may provide little help to an insurance company seeking to better understand their customers (potential or actual).
- From prior Codeacademy tutorials using similar data, charges was determined using a simple formula like:

    $charges$ = ($Coef_{age}$ * $age$) + ($Coef_{sex}$ * $sex$) + ($Coef_{bmi}$ * $bmi$) + ($Coef_{children}$ * $children$) + ($Coef_{smoker}$ * $smoker$) + ($Coef_{region}$ * $region$) + $C$

2) Predicting one of the variables (age, sex, BMI, children, smoker, or region) based on any combination of the other variables. 
- Of the variables represented, arguably the one that may have both the biggest impact on the health costs of an individual and the one that consumers have the highest incentive to withhold (and the easiest success in hiding) is the smoking characteristic. 
- If we assume that the charges are built off of a formula, we will not want to use this as a feature in our model. 
        
For this reason our goal for this exploration will be as follows:
1) Split the dataset into train, validate, and test data sets  
2) Explore the train dataset and identify possible characteristics to utilize in a model  
3) Develop several classification models utilizing different algorithms and select the best performing model to evaluate on the training data set.

# Prepare
Steps:
1) Remove the `charges` column  
2) Add a categorical description of bmi based on binned values  
3) Use one-hot encoding to convert categorical variables to numeric   
4) Split the data into train, validate, and test sets  
5) Seperate the target variable  
6) Create a scaled version of the dataframes for clustering algorithms  

#### Remove the `charges` column

In [8]:
raw_csv_data.drop(columns=['charges'], inplace=True)

#### Add Categorical Variable for BMI

In [9]:
def categorize_bmi(bmi):
    if bmi < 18.5:
        return 'underweight'
    elif bmi < 25:
        return 'normal_weight'
    elif bmi < 30:
        return 'overweight'
    else:
        return 'obese'

In [10]:
raw_csv_data['bmi_category'] = raw_csv_data.bmi.apply(categorize_bmi)

In [11]:
raw_csv_data.bmi_category.unique()

array(['overweight', 'obese', 'normal_weight', 'underweight'],
      dtype=object)

#### One-hot Encoding for Categorical Variables

In [12]:
def one_hot_encoding(df, features):
    '''
    Takes in a dataframe (df) and a list of categorical (object type) features (features) to encode as numeric dummy variables, then drops the
    original listed feature columns from the dataframe.
    
    Returns the dataframe
    '''
    for feature in features:
        df[feature] = df[feature].astype(object)
    obj_df = df[features]
    dummy_df = pd.get_dummies(obj_df, dummy_na=False, drop_first=True)
    df = pd.concat([df, dummy_df], axis=1)
    df.drop(columns=features, inplace=True)
    return df

In [13]:
raw_csv_data = one_hot_encoding(raw_csv_data, ['bmi_category', 'sex', 'smoker', 'region'])

In [14]:
raw_csv_data

Unnamed: 0,age,bmi,children,bmi_category_obese,bmi_category_overweight,bmi_category_underweight,sex_male,smoker_yes,region_northwest,region_southeast,region_southwest
0,19,27.900,0,0,1,0,0,1,0,0,1
1,18,33.770,1,1,0,0,1,0,0,1,0
2,28,33.000,3,1,0,0,1,0,0,1,0
3,33,22.705,0,0,0,0,1,0,1,0,0
4,32,28.880,0,0,1,0,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...
1333,50,30.970,3,1,0,0,1,0,1,0,0
1334,18,31.920,0,1,0,0,0,0,0,0,0
1335,18,36.850,0,1,0,0,0,0,0,1,0
1336,21,25.800,0,0,1,0,0,0,0,0,1


#### Split into train, validate, and test sets

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
train_validate, test = train_test_split(raw_csv_data, test_size=.10, random_state = 123)
train, validate = train_test_split(train_validate, test_size=.20, random_state = 123)

In [17]:
raw_csv_data.shape, train.shape, validate.shape, test.shape

((1338, 11), (963, 11), (241, 11), (134, 11))

In [18]:
raw_csv_data.shape[0] == train.shape[0] + validate.shape[0] + test.shape[0]

True

#### Separate the target variable

In [20]:
y_train = train['smoker_yes']
x_train = train.drop(columns=['smoker_yes'])

In [21]:
x_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 963 entries, 1145 to 140
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       963 non-null    int64  
 1   bmi                       963 non-null    float64
 2   children                  963 non-null    int64  
 3   bmi_category_obese        963 non-null    uint8  
 4   bmi_category_overweight   963 non-null    uint8  
 5   bmi_category_underweight  963 non-null    uint8  
 6   sex_male                  963 non-null    uint8  
 7   region_northwest          963 non-null    uint8  
 8   region_southeast          963 non-null    uint8  
 9   region_southwest          963 non-null    uint8  
dtypes: float64(1), int64(2), uint8(7)
memory usage: 36.7 KB


In [22]:
y_validate = validate['smoker_yes']
x_validate = validate.drop(columns=['smoker_yes'])

In [23]:
x_validate.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 241 entries, 389 to 283
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       241 non-null    int64  
 1   bmi                       241 non-null    float64
 2   children                  241 non-null    int64  
 3   bmi_category_obese        241 non-null    uint8  
 4   bmi_category_overweight   241 non-null    uint8  
 5   bmi_category_underweight  241 non-null    uint8  
 6   sex_male                  241 non-null    uint8  
 7   region_northwest          241 non-null    uint8  
 8   region_southeast          241 non-null    uint8  
 9   region_southwest          241 non-null    uint8  
dtypes: float64(1), int64(2), uint8(7)
memory usage: 9.2 KB


In [24]:
y_test = validate['smoker_yes']
x_test = validate.drop(columns=['smoker_yes'])

In [25]:
x_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 241 entries, 389 to 283
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       241 non-null    int64  
 1   bmi                       241 non-null    float64
 2   children                  241 non-null    int64  
 3   bmi_category_obese        241 non-null    uint8  
 4   bmi_category_overweight   241 non-null    uint8  
 5   bmi_category_underweight  241 non-null    uint8  
 6   sex_male                  241 non-null    uint8  
 7   region_northwest          241 non-null    uint8  
 8   region_southeast          241 non-null    uint8  
 9   region_southwest          241 non-null    uint8  
dtypes: float64(1), int64(2), uint8(7)
memory usage: 9.2 KB


#### Create scaled version of the `x_train`, `x_validate`, and `x_test` sets

In [26]:
from sklearn.preprocessing import MinMaxScaler

In [27]:
def min_max_scaler(train, validate, test):
    '''
    Accepts three dataframes and applies a linear transformer to convert values in each dataframe
    to a value from 0 to 1 while mantaining relative distance between values. 
    Columns containing object data types are dropped, as strings cannot be directly scaled.

    Parameters (train, validate, test) = three dataframes being scaled
    
    Returns (scaler, train_scaled, validate_scaled, test_scaled)
    '''
    train = train.select_dtypes(exclude=['object'])
    validate = validate.select_dtypes(exclude=['object'])
    test = test.select_dtypes(exclude=['object'])    
    scaler = MinMaxScaler(copy=True, feature_range=(0,1)).fit(train)
    train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns.values).set_index([train.index.values])
    validate_scaled = pd.DataFrame(scaler.transform(validate), columns=validate.columns.values).set_index([validate.index.values])
    test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns.values).set_index([test.index.values])
    return scaler, train_scaled, validate_scaled, test_scaled 

In [28]:
scaler, x_train_scaled, x_validate_scaled, x_test_scaled = min_max_scaler(x_train, x_validate, x_test)

In [29]:
x_train_scaled.head()

Unnamed: 0,age,bmi,children,bmi_category_obese,bmi_category_overweight,bmi_category_underweight,sex_male,region_northwest,region_southeast,region_southwest
1145,0.73913,0.438793,0.6,1.0,0.0,0.0,1.0,1.0,0.0,0.0
619,0.804348,0.561349,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
776,0.478261,0.425333,0.4,1.0,0.0,0.0,1.0,1.0,0.0,0.0
176,0.434783,0.29881,0.4,0.0,1.0,0.0,1.0,1.0,0.0,0.0
805,0.586957,0.524936,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0


# Explore
Let's start with summary statistics

In [30]:
x_train.describe()

Unnamed: 0,age,bmi,children,bmi_category_obese,bmi_category_overweight,bmi_category_underweight,sex_male,region_northwest,region_southeast,region_southwest
count,963.0,963.0,963.0,963.0,963.0,963.0,963.0,963.0,963.0,963.0
mean,39.364486,30.653178,1.094496,0.530633,0.292835,0.011423,0.499481,0.25649,0.271028,0.244029
std,14.112595,5.852029,1.210832,0.49932,0.4553,0.10632,0.50026,0.436922,0.444722,0.429733
min,18.0,17.29,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,26.5,26.41,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,40.0,30.495,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,51.5,34.59,2.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0
max,64.0,52.58,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
