# Transform Data

* In this notebook we'll create custom transformers and pipelines that turn the raw `training` data into features ready for ML training.
* We'll reuse the same pipeline to transform the data at prediction time.

## Import Libraries

In [2]:
## import the necessary libraries
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.metrics.pairwise import rbf_kernel


## Load Training Data

In [3]:
processed_data_path = Path("..", "data", "processed", "housing")

In [4]:
## read data
data = pd.read_csv(Path(processed_data_path, "train_set.csv"))

In [5]:
data.head()

| Unnamed: 0.1 | Unnamed: 0 | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | income_categories | population_categories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13096 | -122.42 | 37.8 | 52.0 | 3321.0 | 1115.0 | 1576.0 | 1034.0 | 2.0987 | 458300.0 | NEAR BAY | 2 | 1 |
| 1 | 14973 | -118.38 | 34.14 | 40.0 | 1965.0 | 354.0 | 666.0 | 357.0 | 6.0876 | 483800.0 | <1H OCEAN | 5 | 1 |
| 2 | 3785 | -121.98 | 38.36 | 33.0 | 1083.0 | 217.0 | 562.0 | 203.0 | 2.433 | 101700.0 | INLAND | 2 | 1 |
| 3 | 14689 | -117.11 | 33.75 | 17.0 | 4174.0 | 851.0 | 1845.0 | 780.0 | 2.2618 | 96100.0 | INLAND | 2 | 1 |
| 4 | 20507 | -118.15 | 33.77 | 36.0 | 4366.0 | 1211.0 | 1912.0 | 1172.0 | 3.5292 | 361800.0 | NEAR OCEAN | 3 | 1 |


We need the following data transformations, applied in this order:
* Fill in missing values
* Convert `ocean_proximity` to a one-hot encoding
* Engineer the features `rooms_per_house`, `bedroom_ratio` and `people_per_house`
* Add cluster similarity features, with `gamma` exposed as a hyperparameter
* Drop outliers
* Transform heavy-tailed features using the logarithm
* Scale all numeric features
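
As a rough preview, most of the steps above could be wired together along the following lines. This is only a sketch under several assumptions, not the pipeline built later in the notebook: `ClusterSimilarity`, `column_ratio` and `ratio_pipeline` are hypothetical helper names, the cluster centres come from `KMeans` on latitude and longitude, missing values are filled with the median, the choice of heavy-tailed columns is a guess, and `StandardScaler` is picked over `MinMaxScaler`. The outlier step is left out because a scikit-learn transformer inside a pipeline cannot remove rows. The toy frame reuses the five rows printed by `data.head()`.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler


class ClusterSimilarity(BaseEstimator, TransformerMixin):
    """Hypothetical helper: RBF similarity of each row to k-means centres."""

    def __init__(self, n_clusters=2, gamma=1.0, random_state=42):
        self.n_clusters = n_clusters
        self.gamma = gamma  # controls how fast similarity decays with distance
        self.random_state = random_state

    def fit(self, X, y=None):
        self.kmeans_ = KMeans(n_clusters=self.n_clusters, n_init=10,
                              random_state=self.random_state)
        self.kmeans_.fit(X)
        return self

    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)


def column_ratio(X):
    # X holds two columns; return first / second as a single column
    return X[:, [0]] / X[:, [1]]


def ratio_pipeline():
    # impute -> ratio -> scale (assumed order for the engineered features)
    return make_pipeline(SimpleImputer(strategy="median"),
                         FunctionTransformer(column_ratio),
                         StandardScaler())


log_pipeline = make_pipeline(SimpleImputer(strategy="median"),
                             FunctionTransformer(np.log),
                             StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"),
                             OneHotEncoder(handle_unknown="ignore"))

# assumed set of heavy-tailed columns
heavy_tailed = ["total_rooms", "total_bedrooms", "population",
                "households", "median_income"]

preprocessing = ColumnTransformer(
    [
        ("bedroom_ratio", ratio_pipeline(), ["total_bedrooms", "total_rooms"]),
        ("rooms_per_house", ratio_pipeline(), ["total_rooms", "households"]),
        ("people_per_house", ratio_pipeline(), ["population", "households"]),
        ("log", log_pipeline, heavy_tailed),
        ("geo", ClusterSimilarity(n_clusters=2, gamma=1.0), ["latitude", "longitude"]),
        ("cat", cat_pipeline, ["ocean_proximity"]),
    ],
    remainder="drop",
    sparse_threshold=0,  # always return a dense array
)

# the five rows shown by data.head(), restricted to the columns used here
toy = pd.DataFrame({
    "longitude": [-122.42, -118.38, -121.98, -117.11, -118.15],
    "latitude": [37.8, 34.14, 38.36, 33.75, 33.77],
    "total_rooms": [3321.0, 1965.0, 1083.0, 4174.0, 4366.0],
    "total_bedrooms": [1115.0, 354.0, 217.0, 851.0, 1211.0],
    "population": [1576.0, 666.0, 562.0, 1845.0, 1912.0],
    "households": [1034.0, 357.0, 203.0, 780.0, 1172.0],
    "median_income": [2.0987, 6.0876, 2.433, 2.2618, 3.5292],
    "ocean_proximity": ["NEAR BAY", "<1H OCEAN", "INLAND", "INLAND", "NEAR OCEAN"],
})

X_train = preprocessing.fit_transform(toy)      # fit on training rows
X_new = preprocessing.transform(toy.iloc[:2])   # reuse the fitted pipeline for prediction
print(X_train.shape, X_new.shape)
```

Because the fitted `preprocessing` object stores the imputation values, cluster centres and scaling statistics, calling `transform` on new rows applies exactly the same mapping that was learned from the training data.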

In [6]:
## before we create the pipeline, let's split the training data into features and labels
df_features = data.drop("median_house_value", axis=1)
df_labels = data["median_house_value"].copy()
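
The `Unnamed: 0.1` and `Unnamed: 0` columns in the output of `data.head()` look like row indices that were written to the CSV, which is an assumption rather than something the notebook states. If so, they would carry no signal and could be removed from the features before fitting. A minimal sketch on two of the rows shown above (`toy`, `features`, `labels` and `index_like` are illustrative names):

```python
import pandas as pd

# two rows copied from the data.head() output, with a subset of the columns
toy = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [13096, 14973],
    "longitude": [-122.42, -118.38],
    "latitude": [37.8, 34.14],
    "median_house_value": [458300.0, 483800.0],
    "ocean_proximity": ["NEAR BAY", "<1H OCEAN"],
})

# same split as in the cell above: features vs. label
features = toy.drop("median_house_value", axis=1)
labels = toy["median_house_value"].copy()

# drop columns that look like saved indices (assumption: they are not real features)
index_like = [col for col in features.columns if col.startswith("Unnamed")]
features = features.drop(columns=index_like)

print(list(features.columns))
```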