In [7]:
import pandas as pd
from urllib.request import urlretrieve


In [1]:
medical_charges_url = 'https://raw.githubusercontent.com/JovianML/opendatasets/master/data/medical-charges.csv'


## Loading the data
Download the data from GitHub and save it as a CSV file.
Then read that CSV into a new DataFrame.
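
The download only needs to happen once. A minimal sketch of a helper that skips the download when the file is already on disk is given here; `download_if_missing` is a hypothetical name introduced for illustration and is not part of the notebook.

```python
import os
from urllib.request import urlretrieve


def download_if_missing(url, path):
    """Download url to path unless path already exists; return path."""
    if not os.path.exists(path):
        # urlretrieve does not create missing directories, so do it first.
        os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
        urlretrieve(url, path)
    return path
```

With this helper, the loading step could be written as `pd.read_csv(download_if_missing(medical_charges_url, '../data/medical_charges.csv'))`, which avoids commenting the download line in and out.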

In [8]:
# Uncomment on the first run to download the CSV.
# urlretrieve(medical_charges_url, '../data/medical_charges.csv')
df = pd.read_csv('../data/medical_charges.csv')
df.head()


   age  sex     bmi     children  smoker  region     charges
0  19   female  27.9    0         yes     southwest  16884.924
1  18   male    33.77   1         no      southeast  1725.5523
2  28   male    33.0    3         no      southeast  4449.462
3  33   male    22.705  0         no      northwest  21984.47061
4  32   male    28.88   0         no      northwest  3866.8552


## Observations
There are 1338 patient records with 7 columns each.

The columns are age, sex, bmi, children, smoker, region, and charges.

In [10]:
df.shape


(1338, 7)

## Understanding the data better

### Checking the data type of each column

In [12]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


Four columns (age, bmi, children, charges) are numeric, so they are more or less ready to process. The other three, "sex", "smoker" and "region", are stored as strings and hold categorical data.

So, we will need to encode these categorical columns as numbers before we can train the model.
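
One possible way to turn the categorical columns into numbers is sketched here, assuming a 0/1 mapping for the two-valued columns and one indicator column per region. The toy rows and the `*_code` column names are made up for illustration; the notebook itself may choose a different encoding.

```python
import pandas as pd

# Toy rows shaped like the categorical columns of the dataset.
sample = pd.DataFrame({
    'sex': ['female', 'male', 'male'],
    'smoker': ['yes', 'no', 'no'],
    'region': ['southwest', 'southeast', 'northwest'],
})

# Two-valued categories can be mapped to 0/1 directly.
encoded = pd.DataFrame({
    'sex_code': sample['sex'].map({'female': 0, 'male': 1}),
    'smoker_code': sample['smoker'].map({'no': 0, 'yes': 1}),
})

# A category with more than two values gets one indicator column per value.
region_dummies = pd.get_dummies(sample['region'], prefix='region', dtype=int)
encoded = pd.concat([encoded, region_dummies], axis=1)
```

Indicator columns avoid implying an order between regions, which a single 0/1/2/3 code would do.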

In [13]:
df.describe()


       age        bmi        children  charges
count  1338.0     1338.0     1338.0    1338.0
mean   39.207025  30.663397  1.094918  13270.422265
std    14.04996   6.098187   1.205493  12110.011237
min    18.0       15.96      0.0       1121.8739
25%    27.0       26.29625   0.0       4740.28715
50%    39.0       30.4       1.0       9382.033
75%    51.0       34.69375   2.0       16639.912515
max    64.0       53.13      5.0       63770.42801


In [15]:
# sum() counts the missing values per column; count() would only return the number of rows.
df.isnull().sum()


age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

From the above information, it is clear that the data is in a good state. No column has null values, and the minimum values reported by `describe()` show that age, BMI, children and charges contain no negative numbers.
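
One pitfall when checking for missing values: `isnull().count()` returns the number of entries in each column whether or not they are null, while `isnull().sum()` returns the number of nulls. A small sketch on made-up data (the `toy` frame is not from the dataset) shows the difference:

```python
import pandas as pd

# Made-up frame with one missing age.
toy = pd.DataFrame({'age': [19, None, 28], 'bmi': [27.9, 33.77, 33.0]})

row_counts = toy.isnull().count()    # counts every entry of the boolean mask
missing_counts = toy.isnull().sum()  # counts only the True (missing) entries

print(row_counts.tolist())      # [3, 3]
print(missing_counts.tolist())  # [1, 0]
```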

In [23]:
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [24]:
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#000000'


## Age

In [27]:
df.age.describe()


count    1338.000000
mean       39.207025
std        14.049960
min        18.000000
25%        27.000000
50%        39.000000
75%        51.000000
max        64.000000
Name: age, dtype: float64

In [29]:
fig = px.histogram(df,
    x='age',
    marginal='box',
    nbins=47,
    title='Distribution of age')
fig.update_layout(bargap=0.1)
fig.show()


## Observations from Age

Age is distributed almost uniformly across customers.

Each age from 20 to 64 has roughly 20 to 30 customers. The exceptions are ages 18 and 19, which each have almost double that number.

Ages start at 18 and end at 64. There are no customers younger than 18 or older than 64.

## BMI

In [33]:
fig = px.histogram(df,
    x='bmi',
    marginal='box',
    color_discrete_sequence=['red'],
    title='Distribution of BMI')
fig.update_layout(bargap=0.1)
fig.show()


## Charges

In [42]:
fig = px.histogram(df,
    x='charges',
    marginal='box',
    color='smoker',
    color_discrete_sequence=['green', 'grey'],
    title='Annual medical charges')
fig.update_layout(bargap=0.1)
fig.show()


In [49]:
df.smoker.value_counts()


smoker
no     1064
yes     274
Name: count, dtype: int64

In [50]:
px.histogram(df, x='smoker', color='sex', title='Smokers')


## Age and Charges

In [54]:
fig = px.scatter(df,
    x='age',
    y='charges',
    color='smoker',
    opacity=0.8,
    hover_data=['sex'],
    title='Age vs Charges with Smokers')
fig.update_traces(marker_size=5)
fig.show()


## BMI and Charges

In [56]:
fig = px.scatter(df,
    x='bmi',
    y='charges',
    color='smoker',
    opacity=0.8,
    hover_data=['sex'],
    title='BMI vs charges')
fig.update_traces(marker_size=5)
fig.show()


## Correlation

Let's see how strongly charges correlate with the other features.

In [57]:
df.charges.corr(df.age)


0.2990081933306478

In [58]:
df.charges.corr(df.bmi)


0.19834096883362887

In [61]:
smoker_values = {'no': 0, 'yes': 1}
smoker_numeric = df.smoker.map(smoker_values)
df.charges.corr(smoker_numeric)


0.7872514304984778
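
`Series.corr` computes the Pearson correlation coefficient by default. As a rough illustration of what that number measures, here is a plain-Python sketch of the same formula; the `pearson` helper and the toy values are made up for this example and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values: a 0/1 smoker flag against invented charges.
smoker_flags = [0, 0, 1, 1]
toy_charges = [1000.0, 3000.0, 20000.0, 24000.0]
r = pearson(smoker_flags, toy_charges)
```

The result always lies between -1 and 1. Mapping `smoker` to 0 and 1, as in the cell above, is what makes it possible to apply this formula to a yes/no column.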