# PATIENT INFORMATION & HEALTH INSURANCE COST
## **DATA TRANSFORMATION (3)**  
**Transforming non-numeric columns into numeric columns.**

In [1]:
import numpy as np
import pandas as pd

In [2]:
cleaned_data = pd.read_csv("cleaned_data.csv")
data = cleaned_data.copy()

data.head()

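| | Unnamed: 0 | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |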


In [3]:
# Drop the leftover index column that was saved into the CSV.
data = data.drop(columns=["Unnamed: 0"])

In [4]:
data.head()

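| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |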
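
The `Unnamed: 0` column dropped above is most likely the row index that an earlier `to_csv` call wrote into the file. The following is a minimal sketch of that round trip, assuming a hypothetical two-row frame and in-memory buffers in place of the real `cleaned_data.csv`:

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the cleaned data.
frame = pd.DataFrame({"age": [19, 18], "sex": ["female", "male"]})

# Default behaviour: to_csv writes the row index as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False leaves the row index out of the file.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'age', 'sex']
print(cols_without_index)  # ['age', 'sex']
```

Reading the file with `pd.read_csv(..., index_col=0)` would be another way to avoid the extra column without changing how the file was written.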


In [5]:
# Collect the non-numeric (object-dtype) columns once and list their categories.
object_columns = data.select_dtypes("object").columns
print("Total non-numeric columns:", len(object_columns), "\n")

for column in object_columns:
    print(column, ":", list(data[column].unique()))

Total non-numeric columns: 3 

sex : ['female', 'male']
smoker : ['yes', 'no']
region : ['southwest', 'southeast', 'northwest', 'northeast']


## **Label Encoding**  
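**Label Encoding is a popular technique for handling categorical variables. Each distinct label is assigned a unique integer; scikit-learn's `LabelEncoder` assigns these integers in sorted (alphabetical) order of the labels. It is applied here to the two-category columns `sex` and `smoker`.**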

In [6]:
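from sklearn.preprocessing import LabelEncoder

# Use one encoder per column so each keeps its own classes_ for decoding later.
sex_encoder = LabelEncoder()
smoker_encoder = LabelEncoder()

data["sex"] = sex_encoder.fit_transform(data["sex"])
data["smoker"] = smoker_encoder.fit_transform(data["smoker"])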

|**sex**|**smoker**|  
|--------|----------|
| **0 : female**|**0 : no**|
| **1 : male**  | **1 : yes**|
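
The mapping in the table can be checked from a fitted encoder rather than written by hand. This is a minimal sketch, assuming a hypothetical list of sample values instead of the real `smoker` column; `classes_` stores the sorted labels, and a label's position in it is its integer code:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values using the same categories as the smoker column.
smoker_values = ["yes", "no", "no", "yes"]

encoder = LabelEncoder()
codes = encoder.fit_transform(smoker_values)

# classes_ is sorted, so enumerate() recovers the label -> integer mapping.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)         # {'no': 0, 'yes': 1}
print(codes.tolist())  # [1, 0, 0, 1]
```

The same fitted encoder can map codes back to labels with `encoder.inverse_transform(codes)`.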

In [7]:
data.head()

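| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | 0 | 27.9 | 0 | 1 | southwest | 16884.924 |
| 1 | 18 | 1 | 33.77 | 1 | 0 | southeast | 1725.5523 |
| 2 | 28 | 1 | 33.0 | 3 | 0 | southeast | 4449.462 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | northwest | 21984.47061 |
| 4 | 32 | 1 | 28.88 | 0 | 0 | northwest | 3866.8552 |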


## **One-Hot Encoding**  
**One-Hot Encoding is another popular technique for treating categorical variables. It creates one binary (0/1) feature for each unique value in the categorical feature. With `drop_first = True`, as used below, the first category (`northeast`) is omitted, so the four regions become three columns and a row with all three set to 0 represents `northeast`.**

In [8]:
# dtype=int keeps the 0/1 output shown below on pandas 2.x, where the default dtype is bool.
data = pd.get_dummies(data, columns=["region"], prefix=["region"], drop_first=True, dtype=int)

In [9]:
data.head()

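| | age | sex | bmi | children | smoker | charges | region_northwest | region_southeast | region_southwest |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 0 | 27.9 | 0 | 1 | 16884.924 | 0 | 0 | 1 |
| 1 | 18 | 1 | 33.77 | 1 | 0 | 1725.5523 | 0 | 1 | 0 |
| 2 | 28 | 1 | 33.0 | 3 | 0 | 4449.462 | 0 | 1 | 0 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | 21984.47061 | 1 | 0 | 0 |
| 4 | 32 | 1 | 28.88 | 0 | 0 | 3866.8552 | 1 | 0 | 0 |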
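
The effect of `drop_first` is easier to see on a small frame. This is a minimal sketch, assuming a hypothetical toy frame with one row per region; `dtype=int` is an added assumption so that the columns hold 0/1 integers regardless of pandas version:

```python
import pandas as pd

# Hypothetical toy frame with one row per region.
toy = pd.DataFrame({"region": ["southwest", "southeast", "northwest", "northeast"]})

# Full one-hot encoding: one column per category, in sorted order.
full = pd.get_dummies(toy, columns=["region"], dtype=int)

# drop_first=True removes the first (sorted) category, here northeast.
reduced = pd.get_dummies(toy, columns=["region"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))

# The northeast row (position 3) is encoded as all zeros in the reduced frame.
northeast_row = reduced.iloc[3].tolist()
print(northeast_row)  # [0, 0, 0]
```

Dropping one column removes the redundancy among the indicators, since the omitted category is implied whenever the remaining columns are all 0.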


### **Correlations Between Features**

In [10]:
corr = data.corr()
corr.style.background_gradient()

| | age | sex | bmi | children | smoker | charges | region_northwest | region_southeast | region_southwest |
|---|---|---|---|---|---|---|---|---|---|
| age | 1.0 | -0.021504 | 0.115607 | 0.04302 | -0.026247 | 0.309046 | -0.000689 | -0.011873 | 0.009739 |
| sex | -0.021504 | 1.0 | 0.040212 | 0.018849 | 0.077806 | 0.060057 | -0.009145 | 0.009508 | -0.003858 |
| bmi | 0.115607 | 0.040212 | 1.0 | 0.017578 | -0.009135 | 0.188131 | -0.130035 | 0.258627 | 7e-06 |
| children | 0.04302 | 0.018849 | 0.017578 | 1.0 | 0.009633 | 0.075426 | 0.022586 | -0.017936 | 0.021294 |
| smoker | -0.026247 | 0.077806 | -0.009135 | 0.009633 | 1.0 | 0.784121 | -0.038238 | 0.071084 | -0.041798 |
| charges | 0.309046 | 0.060057 | 0.188131 | 0.075426 | 0.784121 | 1.0 | -0.042784 | 0.077021 | -0.049583 |
| region_northwest | -0.000689 | -0.009145 | -0.130035 | 0.022586 | -0.038238 | -0.042784 | 1.0 | -0.344836 | -0.322338 |
| region_southeast | -0.011873 | 0.009508 | 0.258627 | -0.017936 | 0.071084 | 0.077021 | -0.344836 | 1.0 | -0.34413 |
| region_southwest | 0.009739 | -0.003858 | 7e-06 | 0.021294 | -0.041798 | -0.049583 | -0.322338 | -0.34413 | 1.0 |
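
A correlation matrix like the one above is often easier to read as a ranking against the target column. This is a minimal sketch, assuming a hypothetical toy frame whose values are illustrative and not taken from the dataset; `DataFrame.corr()` computes Pearson correlation by default, which the last lines cross-check with NumPy:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from the insurance dataset.
toy = pd.DataFrame({
    "smoker": [0, 0, 1, 1, 0, 1],
    "age": [20, 35, 25, 50, 60, 40],
    "charges": [2000.0, 5000.0, 18000.0, 40000.0, 12000.0, 30000.0],
})

corr = toy.corr()  # Pearson correlation by default

# Rank the features by the absolute strength of their correlation with charges.
ranked = corr["charges"].drop("charges").sort_values(
    key=lambda s: s.abs(), ascending=False
)
print(ranked)

# Cross-check one entry against NumPy's Pearson coefficient.
manual = np.corrcoef(toy["smoker"], toy["charges"])[0, 1]
print(np.isclose(manual, corr.loc["smoker", "charges"]))
```

Applied to the real matrix, the same ranking would put `smoker` (0.784121) first, followed by `age` (0.309046) and `bmi` (0.188131).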


In [11]:
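# index=False avoids writing the row index, which would reappear as an "Unnamed: 0" column on the next read.
data.to_csv("final_data.csv", index=False)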