# US Medical Insurance Cost

In this project, a **CSV** file with medical insurance costs will be investigated using Python fundamentals. The goal of this project is to analyze the attributes within **insurance.csv** to learn more about the patient information in the file and gain insight into potential use cases for the dataset.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


pandas is imported to look through **insurance.csv**. The CSV file will be read into a DataFrame named `insurance`.

In [2]:
insurance = pd.read_csv('insurance.csv')

In [3]:
insurance.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [4]:
insurance.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

In [5]:
insurance.describe()

,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


I will perform multiple linear regression to develop a model that can predict medical insurance charges based on various attributes of an insurance holder.

I will start with all of the attributes in the dataset because:
1. They all seem to provide information associated with the insurance cost, our dependent variable.
2. Since there are only 6 variables, it won't take too much time to go through them.

I've noticed that some columns are categorical: they consist of strings. I need to convert them to numbers.<br>
There are three columns to work on: `sex`, `smoker`, and `region`.
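A quick way to confirm which columns hold strings, rather than spotting them by eye, might look like the sketch below. The small `sample` frame is a stand-in built from the first three rows shown by `insurance.head()`; on the real data the same call would be made on `insurance`.

```python
import pandas as pd

# Stand-in for the real DataFrame: the first three rows from insurance.head().
sample = pd.DataFrame({
    'age': [19, 18, 28],
    'sex': ['female', 'male', 'male'],
    'bmi': [27.9, 33.77, 33.0],
    'children': [0, 1, 3],
    'smoker': ['yes', 'no', 'no'],
    'region': ['southwest', 'southeast', 'southeast'],
    'charges': [16884.924, 1725.5523, 4449.462],
})

# Every column that is not numeric needs an encoding step before regression.
categorical = sample.select_dtypes(exclude='number').columns.tolist()
print(categorical)
```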

1. sex: I will assign 0 to male and 1 to female in a new column named 'sex_n'. I am creating a new column instead of overwriting the original one so that I can verify the conversion against it.

In [6]:
insurance['sex_n'] = insurance['sex'].map({'male':0,'female':1})

Checking if the conversion was performed properly

In [7]:
insurance['sex'].value_counts()

male      676
female    662
Name: sex, dtype: int64

In [8]:
insurance['sex_n'].value_counts()

0    676
1    662
Name: sex_n, dtype: int64

In [9]:
insurance[['sex','sex_n']]

,sex,sex_n
0,female,1
1,male,0
2,male,0
3,male,0
4,male,0
...,...,...
1333,male,0
1334,female,1
1335,female,1
1336,female,1
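The checks above compare counts and eyeball a few rows. A slightly stricter check could look like the sketch below. It relies on the fact that `Series.map` with a dictionary turns any label missing from the dictionary into `NaN` instead of raising an error. The toy `sex` series, with its deliberately miscapitalized last entry, is a made-up example and not data from **insurance.csv**.

```python
import pandas as pd

# Toy stand-in for the 'sex' column; the last label is deliberately miscapitalized.
sex = pd.Series(['female', 'male', 'male', 'Male'])

sex_n = sex.map({'male': 0, 'female': 1})

# Labels that are not keys of the dictionary silently become NaN.
n_unmapped = int(sex_n.isna().sum())

# Cross-tabulate labels against codes; -1 marks the unmapped rows.
check = pd.crosstab(sex, sex_n.fillna(-1).astype(int))

print(n_unmapped)
print(check)
```

On the real data, `n_unmapped` equal to 0 together with a table that has exactly one non-zero cell per row would confirm the conversion.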


2. smoker - I will assign 0 to 'no' and 1 to 'yes'

In [10]:
insurance['smoker_n'] = insurance['smoker'].map({'yes':1,'no':0 })

In [11]:
insurance['smoker'].value_counts()

no     1064
yes     274
Name: smoker, dtype: int64

In [12]:
insurance['smoker_n'].value_counts()

0    1064
1     274
Name: smoker_n, dtype: int64

In [13]:
insurance[['smoker','smoker_n']]

,smoker,smoker_n
0,yes,1
1,no,0
2,no,0
3,no,0
4,no,0
...,...,...
1333,no,0
1334,no,0
1335,no,0
1336,no,0


I've completed the conversion of two columns! <br>
The last column is 'region'. <br> By looking at the column name, I can tell this will contain more than 2 unique values! I will have to check what those are!

In [14]:
insurance['region'].unique()

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

I will assign 0-3 to the regions in the above output.

In [15]:
insurance['region_n']=insurance['region'].map({'southwest':0,'southeast':1,'northwest':2,'northeast':3})

In [16]:
insurance[['region','region_n']]

,region,region_n
0,southwest,0
1,southeast,1
2,southeast,1
3,northwest,2
4,northwest,2
...,...,...
1333,northwest,2
1334,northeast,3
1335,southeast,1
1336,southwest,0


In [17]:
insurance['region'].value_counts()

southeast    364
northwest    325
southwest    325
northeast    324
Name: region, dtype: int64

In [18]:
insurance['region_n'].value_counts()

1    364
2    325
0    325
3    324
Name: region_n, dtype: int64
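One thing to keep in mind: codes 0-3 tell a linear model that the regions are ordered and evenly spaced, which is probably not true for geographic regions. A common alternative, not the approach used in this notebook, is one-hot encoding with `pd.get_dummies`. The sketch below uses a toy `regions` frame as a stand-in for the real column.

```python
import pandas as pd

# Toy stand-in for the 'region' column.
regions = pd.DataFrame(
    {'region': ['southwest', 'southeast', 'northwest', 'northeast', 'southeast']}
)

# One 0/1 column per region; drop_first removes one level (the first in sorted
# order, here 'northeast') so the columns are not perfectly collinear with the
# model's intercept.
dummies = pd.get_dummies(regions['region'], prefix='region', drop_first=True, dtype=int)

print(dummies)
```

With this encoding, each remaining region gets its own coefficient, measured relative to the dropped baseline region.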

In [19]:
x = insurance[['age','sex_n','bmi','children','smoker_n','region_n']]

In [20]:
y = insurance['charges']

Now, all the predictors are ready to be analyzed!

Splitting the dataset into the training set and the test set (80%-20%).

In [21]:
x_train,x_test,y_train,y_test = train_test_split(x,y,train_size=.8)
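Without a `random_state`, `train_test_split` shuffles differently on every run, so scores computed later will change from run to run. A minimal sketch of a reproducible split is shown below; the arrays are toy data and the seed value 0 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The same seed gives the same partition every time.
split_a = train_test_split(X, y, train_size=0.8, random_state=0)
split_b = train_test_split(X, y, train_size=0.8, random_state=0)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same, split_a[0].shape, split_a[1].shape)
```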

Creating a linear regression model and fitting it to my x_train and y_train.

In [22]:
mlr = LinearRegression()

In [23]:
mlr.fit(x_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Let's find out what the coefficient of determination $r^{2}$ is for my model!

In [24]:
mlr.score(x_test,y_test)

0.741696518932536
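For reference, the value returned by `score` is $r^{2} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, computed on whatever data is passed in (here the test set). The sketch below recomputes it by hand on synthetic data to show that the two agree; the data, coefficients, and split are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem (not the insurance data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

ss_res = np.sum((y_test - pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = model.score(X_test, y_test)
print(r2_manual, r2_sklearn)
```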

Let's explore the coefficients of the predictors to see if there's anything we can do to improve the model's accuracy

In [25]:
mlr.coef_

array([  247.24320931,  -129.94819127,   344.00554351,   456.95362142,
       23501.31571898,   402.85397437])
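The raw array is easier to read when each coefficient is paired with its column name. The sketch below does that with the values printed above, in the column order used to build `x`. Note that each coefficient is in dollars per one unit of its own predictor (one year of age, one BMI point, smoker 0 to 1), so the magnitudes are not directly comparable across predictors.

```python
import pandas as pd

# Column order used for x, and the coefficients printed by mlr.coef_ above.
columns = ['age', 'sex_n', 'bmi', 'children', 'smoker_n', 'region_n']
coef = [247.24320931, -129.94819127, 344.00554351,
        456.95362142, 23501.31571898, 402.85397437]

labelled = pd.Series(coef, index=columns)

# Order by absolute size for easier reading.
labelled = labelled.reindex(labelled.abs().sort_values(ascending=False).index)
print(labelled)
```

With the fitted model in memory, `pd.Series(mlr.coef_, index=x.columns)` would give the same labelled view without retyping the numbers.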

In [26]:
person12=[[50,0,27,3,0,1],[50,1,27,3,0,1]]

In [27]:
mlr.predict(person12)

array([10397.3522441 , 10267.40405283])
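The two people in `person12` differ only in `sex_n` (0 versus 1), so in a linear model their predictions should differ by exactly the `sex_n` coefficient. The arithmetic below checks this using the numbers printed above.

```python
# Predictions printed above for sex_n = 0 and sex_n = 1, all else equal.
pred_sex_0 = 10397.3522441
pred_sex_1 = 10267.40405283

# Coefficient of sex_n from mlr.coef_ above.
coef_sex_n = -129.94819127

# Raising sex_n by one unit shifts the prediction by its coefficient.
diff = pred_sex_1 - pred_sex_0
print(diff, coef_sex_n)
```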

I would like to standardize the dataset and run the model again to see if the standardization can be helpful.

In [28]:
scaler = StandardScaler()

I need to remove columns with strings

In [29]:
insurance = insurance[['age', 'sex_n', 'bmi', 'children', 'smoker_n', 'region_n', 'charges']]

Converting the DataFrame to a NumPy array

In [30]:
data = insurance.values

In [31]:
scaled_insurance = scaler.fit_transform(data)

In [32]:
scaled_insurance = pd.DataFrame(scaled_insurance,columns=['age', 'sex_n', 'bmi', 'children', 'smoker_n', 'region_n', 'charges'])

In [33]:
scaled_insurance

,age,sex_n,bmi,children,smoker_n,region_n,charges
0,-1.438764,1.010519,-0.453320,-0.908614,1.970587,-1.343905,0.298584
1,-1.509965,-0.989591,0.509621,-0.078767,-0.507463,-0.438495,-0.953689
2,-0.797954,-0.989591,0.383307,1.580926,-0.507463,-0.438495,-0.728675
3,-0.441948,-0.989591,-1.305531,-0.908614,-0.507463,0.466915,0.719843
4,-0.513149,-0.989591,-0.292556,-0.908614,-0.507463,0.466915,-0.776802
...,...,...,...,...,...,...,...
1333,0.768473,-0.989591,0.050297,1.580926,-0.507463,0.466915,-0.220551
1334,-1.509965,1.010519,0.206139,-0.908614,-0.507463,1.372326,-0.914002
1335,-1.509965,1.010519,1.014878,-0.908614,-0.507463,-0.438495,-0.961596
1336,-1.296362,1.010519,-0.797813,-0.908614,-0.507463,-1.343905,-0.930362
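`StandardScaler` replaces each value with $z = (x - \mu) / \sigma$, where $\sigma$ is the population standard deviation (`ddof=0`), not the sample standard deviation that `describe()` reports. The sketch below checks the formula against `StandardScaler` on a toy column, then reproduces the first `age` value in the table above from the `describe()` statistics. The conversion from sample to population standard deviation is my own step, not something from the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column: the first five ages from insurance.head().
ages = np.array([[19.0], [18.0], [28.0], [33.0], [32.0]])

# np.std uses ddof=0 by default, matching StandardScaler.
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)
from_sklearn = StandardScaler().fit_transform(ages)
print(np.allclose(manual, from_sklearn))

# describe() reports the sample std (ddof=1); convert it to the population std.
n = 1338
sample_std = 14.04996
population_std = sample_std * np.sqrt((n - 1) / n)

# z-score of age 19 using the mean reported by describe().
z_age_19 = (19 - 39.207025) / population_std
print(z_age_19)  # compare with the first 'age' entry of scaled_insurance
```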


In [34]:
x1 = scaled_insurance[['age','sex_n','bmi','children','smoker_n','region_n']]

In [35]:
y1= scaled_insurance['charges']

I'm done with the standardization on my data. <br>
Let's perform multiple regression again! <br>

I'm splitting the data into the train and test data

In [36]:
x_train,x_test,y_train,y_test = train_test_split(x1,y1,train_size=.8)

In [41]:
mlr1 = LinearRegression()

In [42]:
mlr1.fit(x_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

### <font color='red'>I thought that the standardized data would give me a better $r^2$. However, it doesn't! Why?</font>

In [43]:
mlr1.score(x_test,y_test)

0.7491906460538403
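A likely explanation: for ordinary least squares with an intercept, shifting and rescaling the predictors and the target does not change the fit, so $r^2$ should be identical on the same rows. The small difference between the two scores above probably comes from the two `train_test_split` calls producing different random splits. The sketch below illustrates this on synthetic data with a fixed `random_state`. The data and the seed are made up, and the scalers are fit on the training rows only, which differs slightly from the notebook's approach of scaling the whole table first.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data on very different scales (not the insurance data).
rng = np.random.default_rng(1)
X = rng.normal(loc=30, scale=6, size=(300, 3))
y = 250 * X[:, 0] + 300 * X[:, 1] + rng.normal(scale=2000, size=300)

# One fixed split, reused for both experiments.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=42)

# Experiment 1: raw data.
r2_raw = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)

# Experiment 2: standardized predictors and target, same rows.
x_scaler = StandardScaler().fit(X_tr)
y_scaler = StandardScaler().fit(y_tr.reshape(-1, 1))
Xs_tr, Xs_te = x_scaler.transform(X_tr), x_scaler.transform(X_te)
ys_tr = y_scaler.transform(y_tr.reshape(-1, 1)).ravel()
ys_te = y_scaler.transform(y_te.reshape(-1, 1)).ravel()
r2_scaled = LinearRegression().fit(Xs_tr, ys_tr).score(Xs_te, ys_te)

print(r2_raw, r2_scaled)
```

Standardization still has uses, for example with regularized or distance-based models, or to compare coefficient magnitudes, but it is not expected to change the accuracy of plain linear regression.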

### <font color='red'>It's hard to interpret the predictions below. How can I do that nicely? Is there a quick way I can un-standardize the result for better interpretability?</font>

In [44]:
mlr1.predict(person12)

array([19.27623656, 19.28234844])
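Two things seem to be going on here. First, `person12` holds raw values (age 50, BMI 27, ...), while `mlr1` was trained on standardized predictors, so the inputs would need to go through the same scaling before `predict`. Second, the output is in standardized units of `charges` and has to be mapped back. One way to do both is to keep separate scalers for the predictors and the target and call `inverse_transform` on the prediction. The sketch below shows this on synthetic data. The names `x_scaler`, `y_scaler`, and `new_people` are hypothetical, and the notebook itself uses a single scaler for the whole table.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: two predictors on age-like and bmi-like scales.
rng = np.random.default_rng(2)
X = np.column_stack([rng.uniform(18, 64, 200), rng.uniform(16, 53, 200)])
y = 250 * X[:, 0] + 330 * X[:, 1] + rng.normal(scale=500, size=200)

# Separate scalers so the target can be inverted on its own.
x_scaler = StandardScaler().fit(X)
y_scaler = StandardScaler().fit(y.reshape(-1, 1))

model = LinearRegression().fit(
    x_scaler.transform(X),
    y_scaler.transform(y.reshape(-1, 1)).ravel(),
)

# New rows arrive in raw units: scale them, predict, then undo the target scaling.
new_people = np.array([[50.0, 27.0], [30.0, 35.0]])
pred_scaled = model.predict(x_scaler.transform(new_people))
pred_original = y_scaler.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()

# Sanity check: a model trained on raw data gives the same predictions.
reference = LinearRegression().fit(X, y).predict(new_people)
print(pred_original, reference)
```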