Load some data from a CSV file into Python so that it can be used to train a model.

This data contains information on customers' annual spending on health.

In [1]:
import pandas as pd

df_health = pd.read_csv('https://www.benivade.com/datasets/health_insurance_trimmed.csv', index_col=0)
df_health

Unnamed: 0,age,gender,bmi,children,smoker,region,charges
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.00,3,no,southeast,4449.4620
4,32,male,28.88,0,no,northwest,3866.8552
5,31,female,25.74,0,no,southeast,3756.6216
6,46,female,33.44,1,no,southeast,8240.5896
...,...,...,...,...,...,...,...
1332,52,female,44.70,3,no,southwest,11411.6850
1333,50,male,30.97,3,no,northwest,10600.5483
1334,18,female,31.92,0,no,northeast,2205.9808
1335,18,female,36.85,0,no,southeast,1629.8335


First, convert the output (health charges) from a continuous number into four categories, coded 0 to 3:

* less than \$3000

* \$3000-\$5000

* \$5000-\$10000

* greater than \$10000.

In [2]:
df_health['charge_category'] = 0

In [3]:
df_health.loc[ df_health['charges'] < 3000 , 'charge_category'] = 0
df_health.loc[ (df_health['charges'] >= 3000) &  (df_health['charges'] < 5000), 'charge_category'] = 1
df_health.loc[ (df_health['charges'] >= 5000) &  (df_health['charges'] < 10000), 'charge_category'] = 2
df_health.loc[ df_health['charges'] >= 10000, 'charge_category'] = 3
df_health

Unnamed: 0,age,gender,bmi,children,smoker,region,charges,charge_category
1,18,male,33.77,1,no,southeast,1725.5523,0
2,28,male,33.00,3,no,southeast,4449.4620,1
4,32,male,28.88,0,no,northwest,3866.8552,1
5,31,female,25.74,0,no,southeast,3756.6216,1
6,46,female,33.44,1,no,southeast,8240.5896,2
...,...,...,...,...,...,...,...,...
1332,52,female,44.70,3,no,southwest,11411.6850,3
1333,50,male,30.97,3,no,northwest,10600.5483,3
1334,18,female,31.92,0,no,northeast,2205.9808,0
1335,18,female,36.85,0,no,southeast,1629.8335,0
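The four `.loc` assignments above can also be written as a single binning step. The following is a minimal sketch using `pd.cut`, not the notebook's own method; the `charges` series is a small hand-copied sample of the rows shown above, and the lower bin edge of 0 assumes charges are never negative.

```python
import pandas as pd

# Sample charges copied from the rows displayed above (illustrative only)
charges = pd.Series([1725.5523, 4449.4620, 3866.8552, 8240.5896, 11411.6850])

# Bin edges for the four categories; right=False makes each bin [low, high),
# which matches the ">= lower and < upper" conditions used in the notebook
bins = [0, 3000, 5000, 10000, float("inf")]
categories = pd.cut(charges, bins=bins, labels=[0, 1, 2, 3], right=False)

# pd.cut returns a categorical column; convert it to plain integers
categories = categories.astype(int)
print(categories.tolist())
```

One difference to keep in mind: a value outside all bins (for example a negative charge) would become missing with `pd.cut`, whereas the `.loc` version leaves it at the default of 0.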


Replace the text values (gender and smoker) with numeric codes, because the model needs numeric inputs

In [4]:
df_health['gender_code'] = 0
df_health.loc[ df_health['gender'] == 'male', 'gender_code'] = 1
df_health

Unnamed: 0,age,gender,bmi,children,smoker,region,charges,charge_category,gender_code
1,18,male,33.77,1,no,southeast,1725.5523,0,1
2,28,male,33.00,3,no,southeast,4449.4620,1,1
4,32,male,28.88,0,no,northwest,3866.8552,1,1
5,31,female,25.74,0,no,southeast,3756.6216,1,0
6,46,female,33.44,1,no,southeast,8240.5896,2,0
...,...,...,...,...,...,...,...,...,...
1332,52,female,44.70,3,no,southwest,11411.6850,3,0
1333,50,male,30.97,3,no,northwest,10600.5483,3,1
1334,18,female,31.92,0,no,northeast,2205.9808,0,0
1335,18,female,36.85,0,no,southeast,1629.8335,0,0


In [5]:
df_health['smoker_code'] = 0
df_health.loc[ df_health['smoker'] == 'yes', 'smoker_code'] = 1
df_health

Unnamed: 0,age,gender,bmi,children,smoker,region,charges,charge_category,gender_code,smoker_code
1,18,male,33.77,1,no,southeast,1725.5523,0,1,0
2,28,male,33.00,3,no,southeast,4449.4620,1,1,0
4,32,male,28.88,0,no,northwest,3866.8552,1,1,0
5,31,female,25.74,0,no,southeast,3756.6216,1,0,0
6,46,female,33.44,1,no,southeast,8240.5896,2,0,0
...,...,...,...,...,...,...,...,...,...,...
1332,52,female,44.70,3,no,southwest,11411.6850,3,0,0
1333,50,male,30.97,3,no,northwest,10600.5483,3,1,0
1334,18,female,31.92,0,no,northeast,2205.9808,0,0,0
1335,18,female,36.85,0,no,southeast,1629.8335,0,0,0
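The two encoding cells above follow the same pattern, so they could also be expressed with `Series.map` and an explicit dictionary. This is a sketch on a tiny made-up frame (`df` is hypothetical, not the health dataset), shown only to illustrate the idea.

```python
import pandas as pd

# Tiny made-up frame standing in for df_health (illustrative only)
df = pd.DataFrame({
    "gender": ["male", "female", "male"],
    "smoker": ["no", "yes", "no"],
})

# Each dictionary spells out the code assigned to every text value
df["gender_code"] = df["gender"].map({"female": 0, "male": 1})
df["smoker_code"] = df["smoker"].map({"no": 0, "yes": 1})
print(df)
```

Unlike the default-to-0 approach in the notebook, `map` produces a missing value for any text not listed in the dictionary, which can make unexpected values easier to spot.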


Separate the dataframe into:

* X: independent variable matrix - the five numeric input columns (age, gender_code, bmi, children, smoker_code)
* y: dependent variable vector - just the 'charge_category' column

In [6]:
X = df_health.loc[:, ['age', 'gender_code', 'bmi', 'children', 'smoker_code']]

y = df_health[ ['charge_category'] ]

display(X.head())
display(y.head())

Unnamed: 0,age,gender_code,bmi,children,smoker_code
1,18,1,33.77,1,0
2,28,1,33.0,3,0
4,32,1,28.88,0,0
5,31,0,25.74,0,0
6,46,0,33.44,1,0


Unnamed: 0,charge_category
1,0
2,1
4,1
5,1
6,2


Split our dataset into training and test sets

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
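The split above is random, so the confusion matrix further down will differ from run to run. If repeatable results are wanted, `train_test_split` accepts `random_state`, and `stratify` keeps the category proportions similar in both sets. The sketch below uses made-up arrays (`X_demo`, `y_demo`) and an arbitrary seed of 42; neither comes from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features, 4 equally sized classes (illustrative only)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1, 2, 3] * 5)

# random_state fixes the shuffle; stratify keeps class proportions in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
print(sorted(y_te.tolist()))
```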

Create an MLPClassifier object and train with the training data

In [8]:
from sklearn.neural_network import MLPClassifier

model = MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=1000)
model.fit(X_train, y_train.values.ravel())

MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=1000)
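Neural networks such as `MLPClassifier` are generally sensitive to the scale of their inputs, and here age, bmi and the 0/1 codes sit on quite different ranges. One common option, which the notebook itself does not use, is to standardise the features inside a pipeline. The sketch below runs on synthetic stand-in data with made-up labels, so its predictions say nothing about the real dataset.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the real dataset):
# columns are age, gender_code, bmi, children, smoker_code
rng = np.random.default_rng(0)
n = 200
X_demo = np.column_stack([
    rng.integers(18, 65, n),
    rng.integers(0, 2, n),
    rng.uniform(18.0, 45.0, n),
    rng.integers(0, 4, n),
    rng.integers(0, 2, n),
])
# Made-up labels, purely so the example has something to fit
y_demo = (X_demo[:, 0] > 40).astype(int) + X_demo[:, 4].astype(int)

# The scaler is fitted on the training data and re-applied at predict time
scaled_model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=1000, random_state=0),
)
scaled_model.fit(X_demo, y_demo)
demo_predictions = scaled_model.predict(X_demo[:3])
print(demo_predictions)
```

Because the scaler lives inside the pipeline, saving the pipeline object would save the scaling step together with the network.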

A quick check of accuracy with the test data, using a confusion matrix (rows are the true categories, columns are the predicted categories)

In [9]:
from sklearn.metrics import confusion_matrix

predictions = model.predict( X_test )

print(confusion_matrix(y_test, predictions))

[[32  3  0  0]
 [ 3 32  2  0]
 [ 0  1 73  7]
 [ 2  2  1 38]]
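The diagonal of the confusion matrix holds the correct predictions, so overall accuracy and per-category recall can be read straight from it. The sketch below copies the matrix printed above into a NumPy array; a different random split would give different numbers.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([
    [32, 3, 0, 0],
    [3, 32, 2, 0],
    [0, 1, 73, 7],
    [2, 2, 1, 38],
])

# Accuracy: correct predictions (diagonal) divided by all test rows
accuracy = np.trace(cm) / cm.sum()

# Recall per category: correct predictions divided by the true count in each row
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(f"correct: {np.trace(cm)} of {cm.sum()}")
print(f"accuracy: {accuracy:.3f}")
print("recall per category:", np.round(recall_per_class, 3))
```

For this particular run, 175 of the 196 test rows fall on the diagonal.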


In [10]:
display(y_test.head(2))
X_test.head(2)

Unnamed: 0,charge_category
249,1
836,1


Unnamed: 0,age,gender_code,bmi,children,smoker_code
249,29,1,28.975,1,0
836,36,1,31.5,0,0


Predict the category for a single new customer, giving the values in the same column order as X

In [11]:
prediction = model.predict([[54, 0, 21.47, 3, 0]])

print(prediction)

[3]


We can use the `predict_proba` method to get the probability of each category for a particular input. The category with the highest probability is the one that `predict` returns.

In [12]:
probabilities = model.predict_proba([[54, 1, 32.77, 0, 0]])

print(probabilities)

print(max(probabilities[0]))

print(list(probabilities[0]).index(max(probabilities[0])))

[[1.25886903e-10 6.90816785e-04 3.92217719e-01 6.07091464e-01]]
0.6070914643631088
3
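The `max` and `index` calls above can be replaced with `np.argmax`. Strictly, the columns of `predict_proba` follow the order of `model.classes_`, so looking the winning column up in that array is the safer habit. In this sketch the probabilities are copied from the output above, and `classes` is assumed to equal `model.classes_` (that is, `[0, 1, 2, 3]`).

```python
import numpy as np

# Probabilities copied from the predict_proba output above
probabilities = np.array(
    [[1.25886903e-10, 6.90816785e-04, 3.92217719e-01, 6.07091464e-01]]
)

# Assumption: this matches model.classes_ for the trained classifier
classes = np.array([0, 1, 2, 3])

# Column with the highest probability for each input row
best_column = np.argmax(probabilities, axis=1)

# Translate the column position into the category label
predicted = classes[best_column]
confidence = probabilities[0, best_column[0]]
print(predicted, confidence)
```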


Save the trained model to a file so that it can be reused later without retraining

In [13]:
import joblib

joblib.dump(model, 'health_charges_classifier_model.joblib')

['health_charges_classifier_model.joblib']
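A saved model can be read back with `joblib.load`. The sketch below is self-contained, so it trains a tiny throwaway classifier (`model_demo`, on made-up data) and writes it to a temporary folder instead of touching `health_charges_classifier_model.joblib`; with the real file, only the `joblib.load` line would be needed.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neural_network import MLPClassifier

# Tiny made-up dataset and throwaway model (illustrative only)
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]] * 5)
y_demo = np.array([0, 1, 1, 0] * 5)
model_demo = MLPClassifier(hidden_layer_sizes=(5,), max_iter=200, random_state=0)
model_demo.fit(X_demo, y_demo)

# Save to a temporary folder, then load the model back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(model_demo, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original
same = np.array_equal(model_demo.predict(X_demo), restored.predict(X_demo))
print(same)
```

Only load joblib files from sources you trust, since loading runs pickle under the hood and can execute arbitrary code.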