<a href="https://colab.research.google.com/github/andrybrew/sma-health/blob/master/02_structured_data_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Clustering Medical Cost**

Clustering is the task of dividing data points into a number of groups so that points in the same group are more similar to each other than to points in other groups.
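To make "more similar within a group than across groups" concrete, here is a minimal sketch in plain Python. The two groups and their coordinates are made-up values for illustration only, not taken from the insurance data, and Euclidean distance is assumed as the similarity measure.

```python
from itertools import combinations, product
from math import dist

# Two hypothetical groups of 2-D points (illustrative values only)
group_a = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)]
group_b = [(5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]

def mean_distance(pairs):
    """Average Euclidean distance over an iterable of point pairs."""
    pairs = list(pairs)
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

# Average distance between distinct points inside group A
within_a = mean_distance(combinations(group_a, 2))
# Average distance between a point in group A and a point in group B
between = mean_distance(product(group_a, group_b))

print(within_a < between)
```

A good clustering keeps the within-group distance small relative to the between-group distance.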

The dataset contains the following columns:

1. age: age of the primary beneficiary
2. sex: gender of the insurance contractor (female, male)
3. bmi: body mass index, an objective index of body weight relative to height, computed as weight divided by height squared (kg / m^2); the ideal range is 18.5 to 24.9
4. children: number of children covered by health insurance (number of dependents)
5. smoker: whether the beneficiary smokes (yes, no)
6. region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
7. charges: individual medical costs billed by health insurance
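The BMI column in the list above is weight in kilograms divided by height in metres squared. A small sketch of that arithmetic follows; the function name `bmi` and the sample weight and height are hypothetical and chosen only for illustration.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg, 1.75 m
value = bmi(70.0, 1.75)
print(round(value, 2))  # 22.86, inside the 18.5 to 24.9 range
```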

Source: https://www.kaggle.com/mirichoi0218/insurance


### **Install and Import Libraries**

***Install Library***

In [None]:
# Install Category Encoders
! pip install category_encoders

***Import Libraries***

In [None]:
# Import Library for Data Manipulation
import pandas as pd
import numpy as np

# Import Library for Visualization
import matplotlib.pyplot as plt
import seaborn as sns

### **Import Data**

***Insurance Data***

In [None]:
# Import Dataset
df_insurance = pd.read_csv('https://raw.githubusercontent.com/andrybrew/sma-health/master/data/insurance.csv', sep=',')
df_insurance

In [None]:
# Prints the Dataset Information
df_insurance.info()

In [None]:
# Prints Descriptive Statistics
df_insurance.describe().transpose()

### **Explore the Dataset**

***Visualize Data using Pairplot***

In [None]:
# Set Graph Size
plt.rcParams['figure.figsize'] = (15, 8)

# Visualize Pair Plot with Colors
sns.pairplot(df_insurance, hue = 'smoker')

***Visualize Data using Scatterplot***

In [None]:
# Draw Scatter Plot
sns.relplot(x='bmi', y='charges', data=df_insurance)
plt.title('Medical Cost')
plt.xlabel('Body Mass Index')
plt.ylabel('Charges')

***Visualize Correlation between Features***

In [None]:
# Draw Correlation (numeric columns only; text columns such as sex or region cannot be correlated)
sns.clustermap(df_insurance.corr(numeric_only=True), center=0, cmap='vlag', linewidths=.75)

### **Data Preprocessing**

Before clustering, we check the data for missing values and then standardize the features. Standardization puts features measured on different scales (here, bmi and charges) on a comparable range: for each feature, we subtract its mean and divide by its standard deviation, so the result has zero mean and unit variance.
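As a rough illustration of the standardization step, the sketch below computes z-scores by hand in plain Python. The sample values are made up. It uses the population standard deviation, which is also what scikit-learn's `StandardScaler` uses.

```python
import statistics

# Hypothetical feature values (illustrative only)
values = [20.0, 25.0, 30.0, 35.0]

mean = statistics.fmean(values)   # arithmetic mean of the feature
std = statistics.pstdev(values)   # population standard deviation

# z-score: subtract the mean, then divide by the standard deviation
z_scores = [(v - mean) / std for v in values]

print(z_scores)
print(statistics.fmean(z_scores), statistics.pstdev(z_scores))
```

After the transformation the feature has a mean of (approximately) 0 and a standard deviation of (approximately) 1, up to floating-point rounding.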

***Handling Missing Values***

In [None]:
# Check for Missing Values
df_insurance.isnull().sum()

***Data Standardization***

In [None]:
# Import the StandardScaler Class
from sklearn.preprocessing import StandardScaler

# Create the StandardScaler Instance
scaler = StandardScaler()

# Select Data (copy so the assignment below does not write to a slice of df_insurance)
df_standardized = df_insurance[['bmi', 'charges']].copy()

# Fit and Apply Standardization
column_names = df_standardized.columns.tolist()
df_standardized[column_names] = scaler.fit_transform(df_standardized[column_names])
df_standardized.sort_index(inplace=True)
df_standardized

### **Modeling**

***Search for the Optimum Number of Clusters (k)***

In [None]:
# Transform Data Frame to Numpy Array
insurance = df_standardized.to_numpy()
insurance

In [None]:
# Elbow Method
from sklearn.cluster import KMeans
wcss = []
for i in range(1,11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(insurance)
    wcss.append(kmeans.inertia_)

# Visualize the WCSS (Within-Cluster Sum of Squares) for Each k
plt.plot(range(1,11),wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.xticks(np.arange(1,11,1))
plt.show()

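Reading the elbow plot, the optimal number of clusters is k = 3: beyond this point, adding more clusters reduces the WCSS (within-cluster sum of squares) only slightly.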
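The quantity on the y-axis of the elbow plot is the within-cluster sum of squares: the sum of squared distances from each point to the centroid of its cluster, which scikit-learn exposes as `inertia_`. The sketch below computes it by hand for a fixed, hypothetical cluster assignment. The points, labels, and helper names are illustrative assumptions, and no K-Means fitting is performed.

```python
# Hypothetical 2-D points with a fixed cluster assignment (illustrative values)
points = [(1.0, 1.0), (1.0, 3.0), (5.0, 5.0), (7.0, 5.0)]
labels = [0, 0, 1, 1]

def centroid(cluster_points):
    """Coordinate-wise mean of the points in one cluster."""
    n = len(cluster_points)
    return tuple(sum(coords) / n for coords in zip(*cluster_points))

def wcss(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        center = centroid(members)
        total += sum((a - b) ** 2 for p in members for a, b in zip(p, center))
    return total

two_clusters = wcss(points, labels)       # points split into two groups
one_cluster = wcss(points, [0, 0, 0, 0])  # all points in a single group
print(two_clusters, one_cluster)  # 4.0 38.0
```

Splitting the points into two clusters drops the WCSS sharply here. The elbow method looks for the k after which such drops become small.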

***Modeling K-Means Clustering***

In [None]:
# Apply the K-Means Model to the Data
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
clusters = kmeans.fit_predict(insurance)

We visualize the clusters in a 2D scatter plot. Both axes show the standardized values.

In [None]:
# Visualising Clusters
sns.scatterplot(x='bmi', y='charges', data=df_standardized)
plt.scatter(insurance[clusters == 0, 0], insurance[clusters == 0, 1], s = 50, label = 'Cluster 1')
plt.scatter(insurance[clusters == 1, 0], insurance[clusters == 1, 1], s = 50, label = 'Cluster 2')
plt.scatter(insurance[clusters == 2, 0], insurance[clusters == 2, 1], s = 50, label = 'Cluster 3')
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1],s=200,marker='s', alpha=0.7, label='Centroids')
plt.title('Medical Cost Clusters')
plt.xlabel('Body Mass Index (standardized)')
plt.ylabel('Charges (standardized)')
plt.legend()
plt.show()

In [None]:
# Add Cluster Information to the Raw Data
df_insurance['cluster'] = clusters
df_insurance

In [None]:
# Save Prediction Result
df_insurance.to_csv('customer_clusters.csv', index=False)