
In [46]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

In [63]:
filename = "/content/drive/MyDrive/Coding Dojo/Machine Learning/Semana 1/insurance.csv"
df = pd.read_csv(filename)
df.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


Define the features (X) and the target (y)

In [64]:
y = df['charges']
X = df.drop(columns = 'charges')

Split the data with train_test_split() to prepare it for machine learning

In [65]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
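
When no test_size is given, train_test_split() holds out 25% of the rows for testing. The sketch below checks the resulting sizes on a placeholder frame with the same number of rows as the dataset; the `demo_*` names and the column values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data with as many rows as insurance.csv (values are invented).
n_rows = 1338
demo_X = pd.DataFrame({"age": np.arange(n_rows) % 47 + 18})
demo_y = pd.Series(np.linspace(1000.0, 50000.0, n_rows), name="charges")

# Same call as in the notebook: default test_size, fixed seed for reproducibility.
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, random_state=42
)

print(demo_X_train.shape, demo_X_test.shape)
# Features and target stay aligned: both halves share the same shuffled index labels.
print(demo_X_train.index.equals(demo_y_train.index))
```

Fixing random_state makes the shuffle repeatable, so the same rows land in the training set on every run.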

Identify each feature as numeric, ordinal or nominal.

In [66]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


**Numeric columns:** age, bmi, children, charges (the target).
**Nominal columns:** sex, smoker, region. None of these has a natural order, so none is truly ordinal.
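
The dtype only separates numeric from non-numeric columns. Whether a categorical column is ordinal or nominal depends on what its values mean, and here sex, smoker and region have no natural order. A minimal sketch of the dtype-based split, using the first two rows from the preview above (`demo_df` is an illustrative name):

```python
import pandas as pd

# First two rows of the df.head() preview, rebuilt by hand.
demo_df = pd.DataFrame({
    "age": [19, 18],
    "sex": ["female", "male"],
    "bmi": [27.9, 33.77],
    "children": [0, 1],
    "smoker": ["yes", "no"],
    "region": ["southwest", "southeast"],
    "charges": [16884.924, 1725.5523],
})

# Numeric columns by dtype; everything else is treated as categorical.
numeric_cols = demo_df.select_dtypes(include="number").columns.tolist()
categorical_cols = demo_df.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```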

Integer-encode the ordinal features (demonstrated here on sex, a two-category column)

In [67]:
df['sex'].value_counts()

sex
male      676
female    662
Name: count, dtype: int64

In [68]:
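# Map each category to an integer code. This changes df only: X_train and X_test
# were created before this cell and keep the original strings.
df['sex'] = df['sex'].map({'male': 1, 'female': 2})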
df.head()

,age,sex,bmi,children,smoker,region,charges
0,19,2,27.9,0,yes,southwest,16884.924
1,18,1,33.77,1,no,southeast,1725.5523
2,28,1,33.0,3,no,southeast,4449.462
3,33,1,22.705,0,no,northwest,21984.47061
4,32,1,28.88,0,no,northwest,3866.8552
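
One detail worth checking: the encoding above is applied to df after X was built with df.drop() and after the split, so the feature frames are not affected. That is consistent with the selector further down still listing sex among the object columns. A minimal sketch on the first three preview rows, with illustrative `demo_*` names:

```python
import pandas as pd

demo_df = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "charges": [16884.924, 1725.5523, 4449.462],
})
demo_X = demo_df.drop(columns="charges")  # drop() returns a new DataFrame

# Encode on the original frame only, as in the cell above.
sex_codes = {"male": 1, "female": 2}
demo_df["sex"] = demo_df["sex"].map(sex_codes)

print(demo_df["sex"].tolist())  # [2, 1, 1]
print(demo_X["sex"].tolist())   # ['female', 'male', 'male']

# To encode the features themselves, apply the mapping to the feature frame.
demo_X_encoded = demo_X.assign(sex=demo_X["sex"].map(sex_codes))
print(demo_X_encoded["sex"].tolist())  # [2, 1, 1]
```

In the notebook this would mean applying the mapping to X_train and X_test (or to X before the split). Since the next step one-hot encodes sex anyway, the integer codes are not needed downstream.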


One-hot encode the nominal features

In [76]:
cat_selector = make_column_selector(dtype_include = 'object')
cat_selector(X_train)

['sex', 'smoker', 'region']
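
make_column_selector() returns a callable that takes a DataFrame and returns a list of matching column names. The sketch below uses two preview rows; the `demo_*` names are illustrative, and the categorical selector uses dtype_exclude='number' instead of dtype_include='object', which is my own substitution so that it also matches string dtypes other than object.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Two feature rows from the preview (the target column is left out).
demo_features = pd.DataFrame({
    "age": [19, 18],
    "sex": ["female", "male"],
    "bmi": [27.9, 33.77],
    "children": [0, 1],
    "smoker": ["yes", "no"],
    "region": ["southwest", "southeast"],
})

demo_num_selector = make_column_selector(dtype_include="number")
demo_cat_selector = make_column_selector(dtype_exclude="number")

# Calling a selector on a DataFrame yields a plain list of column names.
demo_num_cols = demo_num_selector(demo_features)
demo_cat_cols = demo_cat_selector(demo_features)
print(demo_num_cols)
print(demo_cat_cols)
```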

In [77]:
train_cat_df = X_train[cat_selector(X_train)]
test_cat_df = X_test[cat_selector(X_test)]
train_cat_df

,sex,smoker,region
693,male,no,northwest
1297,female,no,southeast
634,male,no,southwest
1022,male,yes,southeast
178,female,no,southwest
...,...,...,...
1095,female,no,northeast
1130,female,no,southeast
1294,male,no,northeast
860,female,yes,southwest


In [78]:
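# sparse_output replaces the sparse argument, which was removed in scikit-learn 1.4;
# on versions older than 1.2, use sparse=False instead.
one_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')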
one_encoder.fit(train_cat_df)
train_one = one_encoder.transform(train_cat_df)
test_one = one_encoder.transform(test_cat_df)
train_one



array([[0., 1., 1., ..., 1., 0., 0.],
       [1., 0., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 0., 1.],
       ...,
       [0., 1., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 1.],
       [0., 1., 1., ..., 0., 0., 1.]])

In [79]:
one_column_names = one_encoder.get_feature_names_out(train_cat_df.columns)
train_one = pd.DataFrame(train_one, columns=one_column_names)
test_one = pd.DataFrame(test_one, columns=one_column_names)
train_one

,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
1,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
3,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
4,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...
998,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
999,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1000,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0
1001,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0


Scale the numeric features
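
The notebook imports StandardScaler but has no cell for this step, and the concatenated table further down still shows unscaled age and bmi values. The following is a minimal sketch of how the step could look, not the author's code: it uses a few age and bmi values from that table, and the `demo_*` and `scaler` names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# A handful of numeric training rows and one test row (values taken from the table below).
demo_train_nums = pd.DataFrame({"age": [24, 28, 51, 47], "bmi": [23.655, 26.51, 39.7, 36.08]})
demo_test_nums = pd.DataFrame({"age": [46], "bmi": [28.9]})

scaler = StandardScaler()
scaler.fit(demo_train_nums)  # learn mean and standard deviation from the training rows only

# Apply the same training statistics to both sets.
train_scaled = pd.DataFrame(scaler.transform(demo_train_nums), columns=demo_train_nums.columns)
test_scaled = pd.DataFrame(scaler.transform(demo_test_nums), columns=demo_test_nums.columns)

print(train_scaled.round(3))
print(test_scaled.round(3))
```

Fitting on the training data alone keeps information from the test set out of the preprocessing step.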

Concatenate all features into a single DataFrame

In [84]:
# create a numeric selector
num_selector = make_column_selector(dtype_include='number')

# reset the index so the rows line up with the one-hot frames, which have a fresh 0..n-1 index
train_nums = X_train[num_selector(X_train)].reset_index(drop=True)
test_nums = X_test[num_selector(X_test)].reset_index(drop=True)

X_train_processed = pd.concat([train_nums, train_one], axis=1)
X_test_processed = pd.concat([test_nums, test_one], axis=1)
X_train_processed

,age,bmi,children,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,24,23.655,0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
1,28,26.510,2,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,51,39.700,1,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
3,47,36.080,1,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
4,46,28.900,2,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
998,18,31.350,4,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
999,39,23.870,5,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1000,58,25.175,0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0
1001,37,47.600,2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
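
The reset_index(drop=True) calls matter because pd.concat(axis=1) aligns rows by index label, not by position. The numeric frames keep the shuffled labels from the split, while the one-hot frames were built from NumPy arrays and start at 0. A small sketch of the difference, using the first three training rows shown above (variable names are illustrative):

```python
import pandas as pd

# Numeric part: still carries the shuffled labels from train_test_split.
nums = pd.DataFrame({"age": [24, 28, 51]}, index=[693, 1297, 634])
# One-hot part: built from an array, so it has a fresh 0..n-1 index.
one_hot = pd.DataFrame({"smoker_yes": [0.0, 0.0, 0.0]})

# Without resetting, no labels match: rows are stacked and padded with NaN.
misaligned = pd.concat([nums, one_hot], axis=1)
# After resetting, rows are matched one to one.
aligned = pd.concat([nums.reset_index(drop=True), one_hot], axis=1)

print(misaligned)
print(aligned)
```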
