# DTSC670: Foundations of Machine Learning Models
## Module 2
## Assignment 4: Custom Transformer and Transformation Pipeline

#### Name:

Begin by writing your name above.

Your task in this assignment is to create a custom transformation pipeline that takes in raw data and returns fully prepared, clean data that is ready for model training.  However, we will not actually train any models in this assignment.  This pipeline will employ an imputer class, a user-defined transformer class, and a data-normalization class.

Please note that the order of features in the final feature matrix must be correct.  The figure below illustrates the input and output of the transformation pipeline.  The positions of features $x_1$ and $x_2$ do not change: they remain in the first and second columns, respectively, both before and after the transformation pipeline.  In the transformed dataset, the $x_5$ feature comes next, followed by the newly computed feature $x_6$.  Finally, the last two columns are the remaining one-hot vectors obtained from encoding the categorical feature $x_3$.

<img src="DataTransformation.png" width="500" />

# 670 FAQ Assignment 4
Q: How do I calculate the new x6 column?

This new column should be the cube of your x1 column divided by the x5 column. Remember that you are working with arrays within the transformer.
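As a rough illustration of that calculation on an array, the sketch below uses three rows copied from the data shown in this notebook. The names `X1_IDX` and `X5_IDX` are placeholders chosen for this example, and the column order x1, x2, x4, x5 assumes the categorical column has already been removed.

```python
import numpy as np

# Toy numeric array with columns in the order x1, x2, x4, x5
X = np.array([
    [1.5, 2.354153, 593.0, 0.75],
    [2.5, 3.314048, 340.0, 2.083333],
    [3.5, 4.021604, 551.0, 4.083333],
])

# Positions of x1 and x5 once the categorical x3 column is absent (assumed layout)
X1_IDX, X5_IDX = 0, 3

# Element-wise: cube each x1 value, then divide by the x5 value in the same row
x6 = X[:, X1_IDX] ** 3 / X[:, X5_IDX]

# Append x6 as a new last column
X_with_x6 = np.c_[X, x6]
```

The slicing `X[:, i]` selects a whole column, so the arithmetic runs on every row at once without a loop.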



Q: I get an error message when running my custom transformer?

90% of the problems we have seen are due to 2 issues: 1) Remember that only the numeric columns will be passed to your custom transformer. That will affect your column index numbers. 2) You will be working with arrays within the transformer so make sure all calculations and code work with arrays and not DataFrames.
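Both issues can be seen on a small made-up frame. The sketch below is only an illustration: the values are invented, and it assumes scikit-learn's default output setting, under which `SimpleImputer` returns a NumPy array.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented data with the same column layout as the assignment file
df = pd.DataFrame({
    "x1": [1.5, 2.5, 3.5],
    "x2": [2.35, np.nan, 4.02],
    "x3": ["COLD", "WARM", "COLD"],
    "x4": [593, 340, 551],
    "x5": [0.75, 2.08, 4.08],
})

# Issue 1: dropping the categorical column shifts the positions that follow it
numeric = df.drop(columns=["x3"])
x5_in_full = df.columns.get_loc("x5")          # position in the full frame
x5_in_numeric = numeric.columns.get_loc("x5")  # position in the numeric-only frame

# Issue 2: the imputer hands back a plain array, so columns are reached by position
imputed = SimpleImputer(strategy="mean").fit_transform(numeric)
x5_values = imputed[:, x5_in_numeric]
```

Here `x5` sits at position 4 in the full frame but at position 3 in the numeric-only data, and label-based access such as `imputed["x5"]` is not available on the array.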


# Import Data

Import data from the file called `CustomTransformerData.csv`.

In [1]:
import pandas as pd
import numpy as np

fileName = "CustomTransformerData.csv"
csvfile = pd.read_csv(fileName)
csvfile


Unnamed: 0,x1,x2,x3,x4,x5
0,1.5,2.354153,COLD,593,0.75
1,2.5,3.314048,WARM,340,2.083333
2,3.5,4.021604,COLD,551,4.083333
3,4.5,,COLD,2368,6.75
4,5.5,5.847601,WARM,2636,10.083333
5,6.5,7.22991,WARM,2779,14.083333
6,7.5,7.997255,HOT,1057,18.75
7,8.5,9.203947,COLD,819,24.083333
8,9.5,10.335348,WARM,3349,
9,10.5,11.112142,HOT,3235,36.75


# Separate the numeric data and categorical data into numpy arrays

In [2]:
# isolate numeric data
csvnumeric = csvfile.drop(["x3"], axis=1)
csvnumeric

Unnamed: 0,x1,x2,x4,x5
0,1.5,2.354153,593,0.75
1,2.5,3.314048,340,2.083333
2,3.5,4.021604,551,4.083333
3,4.5,,2368,6.75
4,5.5,5.847601,2636,10.083333
5,6.5,7.22991,2779,14.083333
6,7.5,7.997255,1057,18.75
7,8.5,9.203947,819,24.083333
8,9.5,10.335348,3349,
9,10.5,11.112142,3235,36.75


In [3]:
# isolate categorical data
csvcategorical = csvfile.drop(["x1", "x2", "x4", "x5"], axis=1)
csvcategorical

Unnamed: 0,x3
0,COLD
1,WARM
2,COLD
3,COLD
4,WARM
5,WARM
6,HOT
7,COLD
8,WARM
9,HOT


# Create Custom Transformer

Create a custom transformer, just as we did in the lecture video entitled "Custom Transformers", that performs two computations: 

1. Adds an attribute to the end of the data (i.e. new last column) that is equal to $\frac{x_1^3}{x_5}$ for each observation

2. Drops the entire $x_4$ feature column.  (See further instructions below.)

You must name your custom transformer class `Assignment4Transformer`. Your class should include an input parameter with a default value of `True` that deletes the $x_4$ feature column when its value is `True`, but preserves the $x_4$ feature column when its value is `False`.

This transformer will be used in a pipeline. In that pipeline, an imputer will be run *before* this transformer. Keep in mind that the imputer will output an array, so **this transformer must be written to accept an array.**

Additionally, this transformer will ONLY be given the numerical features of the data. The categorical feature will be handled elsewhere in the full pipeline. This means that your code for this transformer **must reflect the absence of the categorical $x_3$ column** when indexing data structures.
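For orientation, the general shape of a transformer with a boolean constructor parameter can be sketched on a deliberately different task. `ToyDropper` and `drop_last` are hypothetical names invented for this example; they are not part of the assignment or of scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ToyDropper(BaseEstimator, TransformerMixin):
    """Hypothetical example: optionally drops the last column of an array."""

    def __init__(self, drop_last=True):
        # Store the argument under the same name so get_params() can find it
        self.drop_last = drop_last

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        return X[:, :-1] if self.drop_last else X


X_toy = np.arange(6.0).reshape(2, 3)
params = ToyDropper().get_params()                        # constructor parameters as a dict
dropped = ToyDropper().fit_transform(X_toy)               # last column removed
untouched = ToyDropper(drop_last=False).fit_transform(X_toy)  # all columns preserved
```

Inheriting from `BaseEstimator` is what makes `get_params()` work, and `TransformerMixin` supplies `fit_transform()` from the `fit()` and `transform()` methods defined in the class.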

In [4]:
# remember this is for numerical data only

from sklearn.base import BaseEstimator, TransformerMixin

# column indices within the numeric-only array (the categorical x3 column is absent)
x1, x2, x4, x5 = 0, 1, 2, 3

# You must name your custom transformer class Assignment4Transformer
# TransformerMixin supplies fit_transform(); BaseEstimator supplies get_params() and
# set_params(), which are used for hyperparameter tuning later on
class Assignment4Transformer(BaseEstimator, TransformerMixin):
    def __init__(self, toggle_x4=True):  # no *args or **kwargs
        self.toggle_x4 = toggle_x4

    def fit(self, X, y=None):  # define fit method
        return self  # nothing else to do

    def transform(self, X):  # define transform method
        # x6 is the cube of the x1 column divided by the x5 column, for each observation
        x6 = X[:, x1] ** 3 / X[:, x5]
        if self.toggle_x4:
            # toggle_x4=True drops the x4 column; toggle_x4=False preserves it
            X = np.delete(X, x4, axis=1)
        # append x6 as the new last column, whether or not x4 was dropped
        return np.c_[X, x6]

In [5]:
# instantiate an object of class Assignment4Transformer
attr_adder = Assignment4Transformer(toggle_x4=True)

# pass the csvnumeric data (as a NumPy array) through the transformer we just created
data_extra_attribs = attr_adder.transform(csvnumeric.values)

In [6]:
data_extra_attribs = pd.DataFrame(data_extra_attribs)
data_extra_attribs

Unnamed: 0,0,1,2,3
0,1.5,2.354153,0.75,4.5
1,2.5,3.314048,2.083333,7.5
2,3.5,4.021604,4.083333,10.5
3,4.5,,6.75,13.5
4,5.5,5.847601,10.083333,16.5
5,6.5,7.22991,14.083333,19.5
6,7.5,7.997255,18.75,22.5
7,8.5,9.203947,24.083333,25.5
8,9.5,10.335348,,
9,10.5,11.112142,36.75,31.5


# Create Transformation Pipeline for Numerical Features

Create a custom transformation pipeline for numeric data only called `num_pipeline` that:

1. Applies the `SimpleImputer` class to the data, where the strategy is set to `mean`.

2. Applies the custom `Assignment4Transformer` class to the data.

3. Applies the `StandardScaler` class to the data.

In [7]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="mean")),
    ('attribs_adder', Assignment4Transformer()),
    ('std_scaler', StandardScaler()),
])
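To get a feel for what the `SimpleImputer` and `StandardScaler` steps do to a single column, the two operations can be reproduced by hand. This is a sketch on made-up numbers, not the pipeline itself; it relies on `StandardScaler` dividing by the population standard deviation, which is also what `np.std` computes by default.

```python
import numpy as np

# One made-up feature column with a missing value
col = np.array([2.0, np.nan, 4.0, 6.0])

# Mean imputation: replace NaN with the mean of the observed values
col_mean = np.nanmean(col)
filled = np.where(np.isnan(col), col_mean, col)

# Standardization: subtract the mean and divide by the population standard deviation
z = (filled - filled.mean()) / filled.std()
```

An entry that was filled with the column mean standardizes to exactly 0 when nothing changes the column between the two steps, which is why a `0.` shows up in the transformed output wherever `x2` or `x5` was missing.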

In [8]:
csvnumeric_transformed = num_pipeline.fit_transform(csvnumeric)
csvnumeric_transformed

array([[-1.63835604, -1.72914963, -1.19507691, -1.59050349],
       [-1.44560827, -1.52555901, -1.15738431, -1.39982426],
       [-1.2528605 , -1.37548847, -1.10084543, -1.20914502],
       [-1.06011273,  0.        , -1.02546024, -1.01846579],
       [-0.86736496, -0.9882004 , -0.93122876, -0.82778656],
       [-0.67461719, -0.69501705, -0.81815098, -0.63710732],
       [-0.48186942, -0.53226557, -0.68622691, -0.44642809],
       [-0.28912165, -0.27633017, -0.53545654, -0.25574886],
       [-0.09637388, -0.03636359,  0.        , -0.6099295 ],
       [ 0.09637388,  0.128392  , -0.17737691,  0.12560961],
       [ 0.28912165,  0.26571811,  0.02993235,  0.31628884],
       [ 0.48186942,  0.4501331 ,  0.25608791,  0.50696808],
       [ 0.67461719,  0.75841437,  0.50108976,  0.69764731],
       [ 0.86736496,  0.88038895,  0.76493791,  0.88832654],
       [ 1.06011273,  0.        ,  1.04763235,  1.07900578],
       [ 1.2528605 ,  1.41623801,  1.34917309,  1.26968501],
       [ 1.44560827,  1.

# Create Numeric and Categorical DataFrames

Create two new data frames.  Create one DataFrame called `data_num` that holds the numeric features.  Create another DataFrame called `data_cat` that holds the categorical features.

In [9]:
data_num = pd.DataFrame(csvnumeric_transformed)
data_num

Unnamed: 0,0,1,2,3
0,-1.638356,-1.72915,-1.195077,-1.590503
1,-1.445608,-1.525559,-1.157384,-1.399824
2,-1.252861,-1.375488,-1.100845,-1.209145
3,-1.060113,0.0,-1.02546,-1.018466
4,-0.867365,-0.9882,-0.931229,-0.827787
5,-0.674617,-0.695017,-0.818151,-0.637107
6,-0.481869,-0.532266,-0.686227,-0.446428
7,-0.289122,-0.27633,-0.535457,-0.255749
8,-0.096374,-0.036364,0.0,-0.60993
9,0.096374,0.128392,-0.177377,0.12561


In [10]:
data_cat = pd.DataFrame(csvcategorical)
data_cat

Unnamed: 0,x3
0,COLD
1,WARM
2,COLD
3,COLD
4,WARM
5,WARM
6,HOT
7,COLD
8,WARM
9,HOT


# Quick Testing

The full pipeline will be implemented with a `ColumnTransformer` class.  However, to be sure that our numeric pipeline is working properly, let's invoke the `fit_transform()` method of the `num_pipeline` object.  Then, take a look at the transformed data to be sure all is well.

### Run Pipeline and Create Transformed Numeric Data

In [11]:
# invoke the fit_transform() method of the num_pipeline object. 
# Then, take a look at the transformed data to be sure all is well.
num_pipeline.fit_transform(csvnumeric)

array([[-1.63835604, -1.72914963, -1.19507691, -1.59050349],
       [-1.44560827, -1.52555901, -1.15738431, -1.39982426],
       [-1.2528605 , -1.37548847, -1.10084543, -1.20914502],
       [-1.06011273,  0.        , -1.02546024, -1.01846579],
       [-0.86736496, -0.9882004 , -0.93122876, -0.82778656],
       [-0.67461719, -0.69501705, -0.81815098, -0.63710732],
       [-0.48186942, -0.53226557, -0.68622691, -0.44642809],
       [-0.28912165, -0.27633017, -0.53545654, -0.25574886],
       [-0.09637388, -0.03636359,  0.        , -0.6099295 ],
       [ 0.09637388,  0.128392  , -0.17737691,  0.12560961],
       [ 0.28912165,  0.26571811,  0.02993235,  0.31628884],
       [ 0.48186942,  0.4501331 ,  0.25608791,  0.50696808],
       [ 0.67461719,  0.75841437,  0.50108976,  0.69764731],
       [ 0.86736496,  0.88038895,  0.76493791,  0.88832654],
       [ 1.06011273,  0.        ,  1.04763235,  1.07900578],
       [ 1.2528605 ,  1.41623801,  1.34917309,  1.26968501],
       [ 1.44560827,  1.

### One-Hot Encode Categorical Features

Similarly, you will employ a `OneHotEncoder` class in the `ColumnTransformer` below to construct the final full pipeline.  However, let's instantiate an object of the `OneHotEncoder` class called `cat_encoder` that has the `drop` parameter set to `first`.  Next, call the `fit_transform()` method and pass it your categorical data.  Take a look at the transformed one-hot vectors to be sure all is well.

In [12]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder(drop="first") # instantiate an object of the OneHotEncoder class called cat_encoder 
                              # that has the drop parameter set to first
csvcategorical_cat_1hot = cat_encoder.fit_transform(csvcategorical) #call the fit_transform() method and pass it your categorical data
csvcategorical_cat_1hot

<18x2 sparse matrix of type '<class 'numpy.float64'>'
	with 13 stored elements in Compressed Sparse Row format>

In [13]:
csvcategorical_cat_1hot.toarray() #Take a look at the transformed one-hot vectors to be sure all is well

array([[0., 0.],
       [0., 1.],
       [0., 0.],
       [0., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.]])

In [14]:
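# Take a look at the transformed one-hot vectors as a dense array to be sure all is well
# (scikit-learn 1.2 renamed this parameter to sparse_output, and 1.4 removed sparse)
cat_encoder = OneHotEncoder(drop="first", sparse=False)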
csvcategorical_cat_1hot = cat_encoder.fit_transform(csvcategorical)
csvcategorical_cat_1hot

array([[0., 0.],
       [0., 1.],
       [0., 0.],
       [0., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.]])

In [15]:
# Inspect the categories the encoder learned; with drop="first", the first one (COLD) is dropped
cat_encoder.categories_

[array(['COLD', 'HOT', 'WARM'], dtype=object)]
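The link between the learned categories and the two output columns can be checked on a three-row toy input. This is a minimal sketch using one value per category; the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One observation per category, as a single-column 2D array
toy = np.array([["COLD"], ["WARM"], ["HOT"]])

enc = OneHotEncoder(drop="first")
onehot = enc.fit_transform(toy).toarray()  # convert the sparse result to a dense array

learned = list(enc.categories_[0])  # categories in sorted order
dropped_idx = enc.drop_idx_[0]      # index of the category removed by drop="first"
```

Because the categories are sorted, `COLD` is the one dropped, so the first remaining column flags `HOT` and the second flags `WARM`; a `COLD` row is encoded as all zeros.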

# Put it All Together with a Column Transformer

Now, we are finally ready to construct the full transformation pipeline called `full_pipeline` that will transform our raw data into clean, ready-to-train data.  Construct this ColumnTransformer below, then call the `fit_transform()` method to obtain the final, clean data.  Save this output data into a variable called `data_trans`.
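One detail worth seeing on a toy frame first is that a `ColumnTransformer` concatenates its outputs in the order the transformers are listed, not in the order the columns appear in the input. The sketch below uses invented column names (`temp`, `value`) and passes `sparse_threshold=0` only to make sure the toy result is a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented frame where the categorical column comes first
toy = pd.DataFrame({
    "temp": ["COLD", "HOT", "WARM", "COLD"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

ct = ColumnTransformer(
    [
        ("num", StandardScaler(), ["value"]),            # listed first, so output first
        ("cat", OneHotEncoder(drop="first"), ["temp"]),  # listed second, so output last
    ],
    sparse_threshold=0,  # always return a dense array
)
out = ct.fit_transform(toy)
```

Even though `temp` is the first column of the input, the scaled `value` column lands in position 0 of `out`, followed by the `HOT` and `WARM` indicator columns.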

In [21]:
from sklearn.compose import ColumnTransformer

num_attribs = list(csvnumeric)
cat_attribs = ["x3"]

# each tuple is (name, transformer, list of columns the transformer is applied to)
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(drop="first"), cat_attribs),
])


In [22]:
data_trans = full_pipeline.fit_transform(csvfile)
data_trans

array([[-1.63835604, -1.72914963, -1.19507691, -1.59050349,  0.        ,
         0.        ],
       [-1.44560827, -1.52555901, -1.15738431, -1.39982426,  0.        ,
         1.        ],
       [-1.2528605 , -1.37548847, -1.10084543, -1.20914502,  0.        ,
         0.        ],
       [-1.06011273,  0.        , -1.02546024, -1.01846579,  0.        ,
         0.        ],
       [-0.86736496, -0.9882004 , -0.93122876, -0.82778656,  0.        ,
         1.        ],
       [-0.67461719, -0.69501705, -0.81815098, -0.63710732,  0.        ,
         1.        ],
       [-0.48186942, -0.53226557, -0.68622691, -0.44642809,  1.        ,
         0.        ],
       [-0.28912165, -0.27633017, -0.53545654, -0.25574886,  0.        ,
         0.        ],
       [-0.09637388, -0.03636359,  0.        , -0.6099295 ,  0.        ,
         1.        ],
       [ 0.09637388,  0.128392  , -0.17737691,  0.12560961,  1.        ,
         0.        ],
       [ 0.28912165,  0.26571811,  0.02993235,  0.

# Prepare for Grading

Prepare your `data_trans` NumPy array for grading by using the NumPy [around()](https://numpy.org/doc/stable/reference/generated/numpy.around.html) function to round all the values to 2 decimal places - this will return a NumPy array.

Please note the final order of the features in your final numpy array, which is given at the top of this document.

___You MUST print your final answer, which is the NumPy array discussed above, using the `print()` function!  This MUST be the only `print()` statement in the entire notebook!  Do not print anything else using the print() function in this notebook!___

In [23]:
print(np.around(data_trans,decimals=2))

[[-1.64 -1.73 -1.2  -1.59  0.    0.  ]
 [-1.45 -1.53 -1.16 -1.4   0.    1.  ]
 [-1.25 -1.38 -1.1  -1.21  0.    0.  ]
 [-1.06  0.   -1.03 -1.02  0.    0.  ]
 [-0.87 -0.99 -0.93 -0.83  0.    1.  ]
 [-0.67 -0.7  -0.82 -0.64  0.    1.  ]
 [-0.48 -0.53 -0.69 -0.45  1.    0.  ]
 [-0.29 -0.28 -0.54 -0.26  0.    0.  ]
 [-0.1  -0.04  0.   -0.61  0.    1.  ]
 [ 0.1   0.13 -0.18  0.13  1.    0.  ]
 [ 0.29  0.27  0.03  0.32  0.    1.  ]
 [ 0.48  0.45  0.26  0.51  0.    1.  ]
 [ 0.67  0.76  0.5   0.7   0.    0.  ]
 [ 0.87  0.88  0.76  0.89  1.    0.  ]
 [ 1.06  0.    1.05  1.08  1.    0.  ]
 [ 1.25  1.42  1.35  1.27  0.    1.  ]
 [ 1.45  1.55  1.67  1.46  1.    0.  ]
 [ 1.64  1.71  2.01  1.65  1.    0.  ]]
