<a href="https://colab.research.google.com/github/dcreeder89/medical-charges-machine-learning/blob/main/Reeder_Pre_processing_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pre-processing Exercise
- Christina Reeder
- 13 Dec 2022

Task: Using the medical charges dataset, determine how well a patient's charges can be predicted from their age, sex, BMI, number of children, smoking status, and region.

In [78]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [79]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_selector
from sklearn.impute import SimpleImputer

In [80]:
filename='/content/drive/MyDrive/Coding Dojo/05 Week 5: Intro to ML/Practice Assignments/insurance.csv'
df=pd.read_csv(filename)
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


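|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |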


## Preprocessing

### Define features (X) and target (y)

We want to use all other columns to predict charges. Therefore, the `charges` column is our target (y) and all other columns are our features (X).

In [81]:
# target we are trying to predict
y = df['charges']
# features we will use to make the prediction
X = df.drop(columns='charges')

### Train test split the data and prepare for machine learning 

In [82]:
# train test split (default test_size=0.25; random_state makes the shuffle reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
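
To make the effect of the default split concrete, here is a minimal sketch on synthetic stand-in data. The names `X_demo` and `y_demo` and the random values are assumptions for illustration only; the row count of 1338 matches the `df.info()` output above. With no `test_size` given, `train_test_split` holds out 25% of the rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the insurance data
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"age": rng.integers(18, 65, size=1338)})
y_demo = pd.Series(rng.normal(size=1338), name="charges")

# No test_size given, so 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(len(X_tr), len(X_te))           # 1003 335
print(X_tr.index.equals(y_tr.index))  # True: features and target keep matching row labels
```

The 1003 training rows agree with the `count` values reported by `describe()` later in the notebook.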

### Identify each feature as numerical, ordinal, or nominal

**Numerical**

- Age
- BMI
- Children
- Charges (the target, not a feature)

**Ordinal**

- Smoker (binary yes/no, encoded as 1/0 below)

**Nominal**

- Sex
- Region
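
One way to support this classification is to look at each column's dtype and, for the non-numeric columns, its distinct values. The sketch below is only an illustration: the frame `sample` is rebuilt by hand from the five `df.head()` rows shown earlier, so it does not show every category in the full dataset.

```python
import pandas as pd

# First five rows of the features, copied from the df.head() output above
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
})

# Split the column names by dtype
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
# List the distinct values of each non-numeric column
for col in categorical_cols:
    print(col, sorted(sample[col].unique()))
```

Whether a categorical column is ordinal or nominal still has to be decided by reading its values; the dtype alone cannot tell.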

### Ordinal encode any ordinal features

In [83]:
# Check values in the 'smoker' column
X_train['smoker'].value_counts()

no     797
yes    206
Name: smoker, dtype: int64

In [84]:
# Create dictionary to replace 'smoker' column
smoke = {'yes':1, 'no':0}

In [85]:
# apply the dictionary to the 'smoker' column in the train and test set
# (.map looks up each value in the dictionary and returns an integer Series)
train_smoke = X_train['smoker'].map(smoke)
test_smoke = X_test['smoker'].map(smoke)

# view the value_counts() to make sure it worked
train_smoke.value_counts()

0    797
1    206
Name: smoker, dtype: int64
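
As an alternative to the dictionary approach, scikit-learn's `OrdinalEncoder` can do the same mapping with an explicit category order. This is a sketch on made-up rows, not the notebook's method; `train_demo` and `test_demo` are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Tiny illustrative frames standing in for the train and test 'smoker' column
train_demo = pd.DataFrame({"smoker": ["no", "yes", "no", "no"]})
test_demo = pd.DataFrame({"smoker": ["yes", "no"]})

# The categories list fixes the order: 'no' -> 0, 'yes' -> 1
encoder = OrdinalEncoder(categories=[["no", "yes"]])
encoder.fit(train_demo)

train_encoded = encoder.transform(train_demo)
test_encoded = encoder.transform(test_demo)

print(train_encoded.ravel())
print(test_encoded.ravel())
```

Passing `categories` explicitly avoids relying on alphabetical order, which only happens to give the intended result for `no`/`yes`.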

In [86]:
# reset the index of the training and testing smoker Series so their rows
# line up with the DataFrames rebuilt from arrays below (which start at 0)
train_smoke = train_smoke.reset_index(drop=True)
test_smoke = test_smoke.reset_index(drop=True)
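
The index reset matters because `pd.concat(..., axis=1)` aligns rows by index label, not by position. The sketch below shows what can go wrong without it, using made-up values; `smoke_demo` and `scaled_demo` are hypothetical names.

```python
import pandas as pd

# A Series that still carries the shuffled row labels left by a split
smoke_demo = pd.Series([1, 0, 0], index=[42, 7, 19], name="smoker")
# A frame rebuilt from an array gets a fresh 0..n-1 index
scaled_demo = pd.DataFrame({"age": [-1.0, 0.0, 1.0]})

# Labels do not match, so concat keeps the union of both indexes
misaligned = pd.concat([scaled_demo, smoke_demo], axis=1)
# After the reset both objects share the labels 0, 1, 2
aligned = pd.concat([scaled_demo, smoke_demo.reset_index(drop=True)], axis=1)

print(misaligned.shape)  # (6, 2): rows doubled and padded with NaN
print(aligned.shape)     # (3, 2)
```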

### One Hot Encode any nominal features

In [87]:
# tell a column_selector to select columns with type 'object'
cat_selector = make_column_selector(dtype_include = 'object')

In [88]:
# create subset of data for only categorical columns
# (the object selector also picks up 'smoker', so it is one-hot encoded too)
train_cat_data = X_train[cat_selector(X_train)]
test_cat_data = X_test[cat_selector(X_test)]

In [89]:
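# instantiate one hot encoder with dense output
# (scikit-learn versions before 1.2 use sparse=False instead of sparse_output=False)
ohe_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')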
# fit OneHotEncoder on the training data
ohe_encoder.fit(train_cat_data)
# transform both the training and testing data
train_ohe = ohe_encoder.transform(train_cat_data)
test_ohe = ohe_encoder.transform(test_cat_data)

In [90]:
# build the encoded column names from the original categorical column names
ohe_column_names = ohe_encoder.get_feature_names_out(train_cat_data.columns)
# convert the encoded arrays back to dataframes
train_ohe = pd.DataFrame(train_ohe, columns = ohe_column_names)
test_ohe = pd.DataFrame(test_ohe, columns = ohe_column_names)

### Scale any numeric features

In [91]:
# before scaling, explore the original data
X_train.describe().round(0)

|   | age | bmi | children |
|---|---|---|---|
| count | 1003.0 | 1003.0 | 1003.0 |
| mean | 39.0 | 31.0 | 1.0 |
| std | 14.0 | 6.0 | 1.0 |
| min | 18.0 | 16.0 | 0.0 |
| 25% | 27.0 | 26.0 | 0.0 |
| 50% | 39.0 | 30.0 | 1.0 |
| 75% | 51.0 | 34.0 | 2.0 |
| max | 64.0 | 53.0 | 5.0 |


In [92]:
# tell a column_selector to select columns with type 'number'
num_selector = make_column_selector(dtype_include = 'number')
# create subset of data for only numerical columns
train_num_data = X_train[num_selector(X_train)]
test_num_data = X_test[num_selector(X_test)]

In [93]:
# instantiate scaler
scaler = StandardScaler()

# fit scaler on training data
scaler.fit(train_num_data)

# transform the training and testing data
train_scaled = scaler.transform(train_num_data)
test_scaled = scaler.transform(test_num_data)

In [94]:
# transform back to a dataframe
train_scaled = pd.DataFrame(train_scaled, columns = train_num_data.columns)
test_scaled = pd.DataFrame(test_scaled, columns = test_num_data.columns)

In [95]:
# obtain descriptive statistics of the scaled data
train_scaled.describe().round(2)

|   | age | bmi | children |
|---|---|---|---|
| count | 1003.0 | 1003.0 | 1003.0 |
| mean | 0.0 | 0.0 | 0.0 |
| std | 1.0 | 1.0 | 1.0 |
| min | -1.51 | -2.42 | -0.92 |
| 25% | -0.87 | -0.72 | -0.92 |
| 50% | -0.02 | -0.05 | -0.09 |
| 75% | 0.84 | 0.65 | 0.74 |
| max | 1.76 | 3.76 | 3.24 |
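
As a check on what `StandardScaler` computes, the sketch below reproduces its output by hand: subtract the training mean and divide by the training standard deviation (the population form, `ddof=0`). The arrays `train_demo` and `test_demo` hold illustrative numbers, not values from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative single-feature arrays
train_demo = np.array([[18.0], [27.0], [39.0], [51.0], [64.0]])
test_demo = np.array([[30.0], [70.0]])

scaler = StandardScaler().fit(train_demo)

# Manual version: statistics come from the training rows only
train_mean = train_demo.mean(axis=0)
train_std = train_demo.std(axis=0)  # NumPy default ddof=0, as StandardScaler uses
manual_test = (test_demo - train_mean) / train_std

print(np.allclose(scaler.transform(test_demo), manual_test))  # True

# The scaled training data has mean 0 and standard deviation 1
train_scaled_demo = scaler.transform(train_demo)
print(np.isclose(train_scaled_demo.mean(), 0.0), np.isclose(train_scaled_demo.std(), 1.0))
```

The scaled test values are not guaranteed to have mean 0 and standard deviation 1, because they are transformed with the training statistics. `describe()` reports the sample standard deviation (`ddof=1`), which at 1003 rows still rounds to 1.0.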


### Concatenate all features back into one dataframe

In [96]:
# re-combine the scaled, one-hot encoded, and ordinal-encoded features column-wise (axis 1)
X_train_processed = pd.concat([train_scaled, train_ohe, train_smoke], axis=1)
X_test_processed = pd.concat([test_scaled, test_ohe, test_smoke], axis=1)
# view processed training dataframe
X_train_processed

|   | age | bmi | children | sex_female | sex_male | smoker_no | smoker_yes | region_northeast | region_northwest | region_southeast | region_southwest | smoker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.087167 | -1.140875 | -0.917500 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0 |
| 1 | -0.802106 | -0.665842 | 0.743605 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |
| 2 | 0.836992 | 1.528794 | -0.086947 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 |
| 3 | 0.551932 | 0.926476 | -0.086947 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 |
| 4 | 0.480667 | -0.268178 | 0.743605 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 998 | -1.514757 | 0.139468 | 2.404710 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0 |
| 999 | -0.018189 | -1.105101 | 3.235263 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |
| 1000 | 1.335848 | -0.887967 | -0.917500 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0 |
| 1001 | -0.160720 | 2.843247 | 0.743605 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 |
