# Medical insurance dataset analysis

The course I am taking includes a mini project on this dataset. To keep my learning notes clean, I have put my work on the project in a separate notebook.


In [3]:
# Import required libraries
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read in the insurance dataset
insurance = pd.read_csv('https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv')
insurance.head()

URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

Looking at the top 5 rows of the dataset, we can see columns of the following types: integers, floats, binary categories and multi-class categories. The categorical data will need to be encoded, and the floats and integers may also need some level of transformation.

Before any of these are transformed, empty values should be checked for and dealt with.

In [4]:
insurance.info()

NameError: name 'insurance' is not defined

No empty values, perfect.
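
The same check can be made explicit by counting missing entries per column. This is a minimal sketch on a small hypothetical frame (`sample` is made up for illustration and has one gap added on purpose); on the real data the same calls would run against `insurance`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the insurance table, with one gap added on purpose
sample = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "bmi": [27.9, np.nan, 33.0, 22.7],
    "smoker": ["yes", "no", "no", "no"],
})

# Count missing entries per column; all zeros means nothing to impute or drop
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_gaps = sample[sample.isna().any(axis=1)]
print(rows_with_gaps)
```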

Work can now begin on exploring these variables.

In [None]:
# Summary statistics of numerical vars
print(insurance.describe())

# Value counts of categorical
cols_to_count = ['sex', 'smoker', 'region']
for col in cols_to_count:
    print(f'---- {col} ----')
    print(insurance[col].value_counts())

Initial exploration suggests that the numerical variables, except charges, may be normally distributed, since their means and medians are similar. The slight differences between mean and median may also indicate the direction of skew in each variable.
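
The mean-versus-median comparison can be backed up with a skewness statistic. The sketch below uses made-up right-skewed values (not the real charges) and pandas' `Series.skew()`; a mean above the median together with a positive skew points to a longer right tail.

```python
import pandas as pd

# Hypothetical right-skewed values standing in for a column such as charges
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 20, 45])

mean_value = values.mean()
median_value = values.median()
skewness = values.skew()  # sample skewness; positive means a longer right tail

print(f"mean={mean_value:.2f} median={median_value:.2f} skew={skewness:.2f}")
```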

The categorical data also looks fairly clean: the only variable with radically different value counts across its classes is smoker, with a difference of 790 between them. This will have to be monitored carefully, as leaving too few smokers in the training set may cause inaccuracies when predicting smokers' charges.
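
One possible guard against ending up with too few smokers in either part of a split is stratification. This is only a sketch, assuming scikit-learn's `stratify` argument to `train_test_split` and a hypothetical imbalanced frame; it is an option to consider rather than a step taken elsewhere in these notes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with an imbalanced smoker column (20 yes, 80 no)
sample = pd.DataFrame({
    "smoker": ["yes"] * 20 + ["no"] * 80,
    "charges": range(100),
})

# stratify keeps the yes/no proportions the same in both parts
train_part, test_part = train_test_split(
    sample, train_size=0.8, random_state=42, stratify=sample["smoker"]
)

train_share = (train_part["smoker"] == "yes").mean()
test_share = (test_part["smoker"] == "yes").mean()
print(train_share, test_share)
```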

In [None]:
sns.histplot(data=insurance, x='age', kde=True)

In [None]:
sns.histplot(data=insurance, x='bmi', kde=True)

In [None]:
sns.histplot(data=insurance, x='children', kde=True)

In [None]:
sns.histplot(data=insurance, x='charges', kde=True)

My initial thoughts about the distributions have proved false for all but bmi; I did not expect the distributions to be so far from normal.

The variables that are not normally distributed will have to be normalised through some method.

## Exploring relationships between the dependent and independent variables
### Categorical variables

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(
    data=insurance,
    y='sex', x='charges'
)

The plot shows a difference in the distribution of charges, but the medians and lower quartiles remain similar. The biggest differences between the sexes are the interquartile range and the upper limit: men appear to have a wider range of most probable charges than women, and a higher upper limit.

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(
    data=insurance,
    y='smoker', x='charges'
)

This difference is huge, so smoker will be a hugely influential variable for predicting charges. What draws the eye most is that 75% of the non-smoker data does not even touch the smoker distribution. Naively, this means that using this variable alone would give a good probability of predicting the correct band of charges for a person.
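
The overlap read from the boxplot can also be checked numerically by comparing group quartiles. The values below are hypothetical, chosen only to illustrate the calculation; on the real data the same `groupby` would run on `insurance`.

```python
import pandas as pd

# Hypothetical charges; the real values come from the insurance table
sample = pd.DataFrame({
    "smoker": ["no"] * 5 + ["yes"] * 5,
    "charges": [2000, 4000, 6000, 8000, 10000,
                15000, 20000, 30000, 40000, 45000],
})

# 25th, 50th and 75th percentiles of charges for each group
quartiles = (
    sample.groupby("smoker")["charges"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)

# Compare the non-smokers' upper quartile with the smokers' minimum
non_smoker_q3 = quartiles.loc["no", 0.75]
smoker_min = sample.loc[sample["smoker"] == "yes", "charges"].min()
print(non_smoker_q3 < smoker_min)
```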

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(
    data=insurance,
    y='region', x='charges'
)

Differences between the regions are limited to the spread of the charges, with some regions having a wider spread than others. The southeast region has by far the widest spread.

### Numerical variables

In [None]:
plt.figure(figsize=(10,7))
sns.scatterplot(
    data=insurance,
    x='age', y='charges'
)

The relationship between charges and age is clear: as age increases, so do the charges. This relationship also looks fairly linear.

In [None]:
plt.figure(figsize=(10,7))
sns.scatterplot(
    data=insurance,
    x='bmi', y='charges'
)

Another positive relationship: as bmi increases, so do charges. However, this time the relationship is less consistent over the whole set.

In [None]:
plt.figure(figsize=(10,7))
sns.scatterplot(
    data=insurance,
    x='children', y='charges'
)

This is our first negatively correlated relationship: as the number of children goes up, the charges go down.
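
Directions read from scatter plots can be double-checked with correlation coefficients. A minimal sketch on made-up numbers (the frame `sample` is hypothetical and says nothing about the real data): the sign of each coefficient gives the direction and its magnitude the strength of the linear relationship.

```python
import pandas as pd

# Hypothetical values; on the real data this would be the insurance frame
sample = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "children": [0, 1, 2, 1, 0],
    "charges": [2000, 4000, 5000, 9000, 12000],
})

# Pearson correlation of every numeric column with charges
correlations = sample.corr()["charges"].drop("charges")
print(correlations)
```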

## Transforming our data for modelling

In [None]:
# Import required libraries
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [None]:
# Split insurance into a test and train set
train, test = train_test_split(
    insurance,
    train_size=0.8,
    random_state=42
)

In [None]:
# Transform the binary category variables into 1 and 0 values
# Work on explicit copies to avoid pandas' SettingWithCopyWarning
train, test = train.copy(), test.copy()
binary_maps = {'smoker': {'yes': 1, 'no': 0}, 'sex': {'male': 1, 'female': 0}}
for col, mapping in binary_maps.items():
    train[col] = train[col].map(mapping)
    test[col] = test[col].map(mapping)

In [None]:
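# Transform the multiple category variables into one hot encoded format
# dtype=float keeps every column numeric (recent pandas versions default to bool)
train = pd.get_dummies(
    data=train,
    prefix='',
    prefix_sep='',
    dtype=float
)

test = pd.get_dummies(
    data=test,
    prefix='',
    prefix_sep='',
    dtype=float
)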
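
Encoding the train and test sets separately only yields the same columns if every category appears in both. A hedged sketch of one way to guard against that, using hypothetical frames where the test part is missing two regions: reindex the test columns against the training columns and fill the gaps with 0.

```python
import pandas as pd

# Hypothetical frames: the test part happens to lack two of the regions
train_part = pd.DataFrame(
    {"region": ["southwest", "southeast", "northwest", "northeast"]}
)
test_part = pd.DataFrame({"region": ["southwest", "southeast"]})

train_encoded = pd.get_dummies(train_part, prefix="", prefix_sep="", dtype=float)
test_encoded = pd.get_dummies(test_part, prefix="", prefix_sep="", dtype=float)

# Add any column the test set lacks (filled with 0) and match the training order
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0.0)
print(test_encoded)
```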

The distributions of the variables vary and thus require different transformation techniques before they can be used in a neural network. The variable bmi is already approximately normally distributed, so it is standardised; all the others are normalised with min-max scaling.

Since neural networks are affected by varying scales, I will normalise the data to between -1 and +1 so that these variables share a similar scale with the standardised variable.
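
For reference, min-max scaling to a range of -1 to +1 is a simple linear map. The sketch below writes it out by hand with NumPy on three made-up ages; `scale_to_range` is a hypothetical helper for illustration, and it assumes the column is not constant (otherwise the division is by zero).

```python
import numpy as np

def scale_to_range(values, low=-1.0, high=1.0):
    """Linearly map values so their minimum becomes low and maximum becomes high."""
    values = np.asarray(values, dtype=float)
    v_min, v_max = values.min(), values.max()
    unit = (values - v_min) / (v_max - v_min)  # first map to [0, 1]
    return unit * (high - low) + low           # then stretch to [low, high]

ages = np.array([18, 41, 64])
scaled_ages = scale_to_range(ages)
print(scaled_ages)
```

The minimum and maximum come from the data passed in, which is why a scaler fitted on the training set can map test values slightly outside the target range.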

In [None]:
# Transform variables into better input and outputs for a neural network to learn from

## Init transformers
stdScaler = StandardScaler()
mmScaler = MinMaxScaler(feature_range=(-1, 1))
chargesScaler = MinMaxScaler(feature_range=(-1, 1))

## Fit transform values in training set
train['bmi'] = stdScaler.fit_transform(
    train['bmi'].to_numpy().reshape(-1, 1)
)
train[['age', 'children']] = mmScaler.fit_transform(
    train[['age', 'children']]
)
train['charges'] = chargesScaler.fit_transform(
    train['charges'].to_numpy().reshape(-1, 1)
)


## Transform values in test set
test['bmi'] = stdScaler.transform(
    test['bmi'].to_numpy().reshape(-1, 1)
)
test[['age', 'children']] = mmScaler.transform(
    test[['age', 'children']]
)
test['charges'] = chargesScaler.transform(
    test['charges'].to_numpy().reshape(-1, 1)
)


## Modelling our neural network

The first model will have an input layer sized to match the number of features in the dataset, followed by one hidden layer. The output layer contains a single unit. The model will train for 100 epochs.

In [None]:
# Import required libraries
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

In [None]:
# Create model
model_1 = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(9, input_shape=[9], name='Input_Layer'),
        tf.keras.layers.Dense(100, name='Hidden_Layer_1'),
        tf.keras.layers.Dense(1,name='Output_Layer')
    ]
)

# Compile the model
model_1.compile(
    loss=tf.keras.losses.mae,
    optimizer=tf.keras.optimizers.SGD(),
    metrics=['mae']
)

# Fit the model
model_1.fit(train.drop(columns=['charges']), train.charges, epochs=100)
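
As a sanity check on the architecture, the number of trainable parameters can be worked out by hand and compared with what `model_1.summary()` reports. This sketch assumes the layer sizes from the cell above (9 input features, then Dense layers of 9, 100 and 1 units); `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

# 9 input features, then Dense(9), Dense(100), Dense(1)
layer_sizes = [9, 9, 100, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)  # [90, 1000, 101] 1191
```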

In [None]:
# Evaluate model
y_pred = model_1.predict(test.drop(columns=['charges']))

print('----- Normalised output values -----')
print('MSE:', mean_squared_error(test.charges, y_pred))
print('MAE:', mean_absolute_error(test.charges, y_pred))

print('----- Original-scale output values -----')
print(
    mean_squared_error(
        chargesScaler.inverse_transform(test['charges'].to_numpy().reshape(-1, 1)),
        chargesScaler.inverse_transform(y_pred)
    )
)
print(
    mean_absolute_error(
        chargesScaler.inverse_transform(test['charges'].to_numpy().reshape(-1, 1)),
        chargesScaler.inverse_transform(y_pred)
    )
)



# Plot ground truth against predictions
plt.figure(figsize=(10,7))
plt.scatter(test.charges, y_pred)
plt.plot([-1, 1], [-1, 1], '--', c='red')
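
Because min-max scaling of charges is a linear map, the errors on the two scales are directly related: the MAE in original units is the scaled MAE multiplied by half the original range, and the MSE scales by the square of that factor. A small sketch with made-up numbers, writing the scaling by hand with NumPy rather than using the fitted `chargesScaler`:

```python
import numpy as np

# Hypothetical charges used to define the scaling range
charges = np.array([1000.0, 5000.0, 21000.0])
c_min, c_max = charges.min(), charges.max()

def to_scaled(v):
    """Map original values to the [-1, 1] range."""
    return 2 * (v - c_min) / (c_max - c_min) - 1

def to_original(s):
    """Invert the scaling back to original units."""
    return (s + 1) / 2 * (c_max - c_min) + c_min

y_true_scaled = to_scaled(charges)
y_pred_scaled = y_true_scaled + np.array([0.1, -0.2, 0.05])  # made-up errors

mae_scaled = np.mean(np.abs(y_true_scaled - y_pred_scaled))
mae_original = np.mean(
    np.abs(to_original(y_true_scaled) - to_original(y_pred_scaled))
)

# Half the original range converts one into the other
print(mae_scaled, mae_original, mae_scaled * (c_max - c_min) / 2)
```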