# OOP With Scikit-Learn (sklearn)

In [None]:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

# Objectives

- Understand the concept of object-oriented inheritance
- Understand the main object types of the Scikit-Learn API
- Extend and create custom Scikit-Learn Estimators

# Inheritance

We've already learned a lot about object-oriented programming and how to create our own classes.

We can also define classes in terms of _other_ classes, in which case the new (child) classes **inherit** the attributes and methods of the (parent) classes they're built on.

## Motivation: So What's the Benefit? 

_More abstraction is better_

Take a look at this code below. Look at how much we've already done:

In [None]:
# Look at all that code we wrote... do we have to do it all again...?
class Robot():
    purpose = 'To love humans'
    
    # We'd like to start off with some initial attributes
    def __init__(self, first_name='?', last_name=''):
        # Clean the names of extra spaces at beginning & end
        first_name = first_name.strip()
        last_name = last_name.strip()    
        # Setting attributes
        self.first_name_ = first_name
        self._last_name = last_name
        # Combine first and last names and remove any extra spacing
        self.name = ' '.join([first_name,last_name]).strip()

           
    def change_name(self, new_name):
        self.name = new_name
    
    def speak(self):
        print(f'I am {self.name}!')

Let's say we wanted to make another bot with some extra functionality like keeping track of its battery charge.

Do we have to copy and paste this and then add our new functionality? 

Nope! We can reuse everything we already wrote by inheriting from `Robot`!

In [None]:
class GarbageBot(Robot):
    """Robot that takes out the trash, has a battery charge"""
    battery = 100
    
    def speak_gar(self):
        self.battery -= 10
        print(f"I'm {self.name} and have {self.battery}% battery charged")

In [None]:
new_robot = GarbageBot('Wall-e')
new_robot.speak()

In [None]:
new_robot.speak()

In [None]:
new_robot.battery

In [None]:
new_robot.name

In [None]:
new_robot.change_name('Dan')

In [None]:
new_robot.speak_gar()

And I still keep the other functionality from the original class!

In [None]:
new_robot.change_name('E-llaw') # Note we never defined this in GarbageBot!
new_robot.speak()

In [None]:
test_bot = GarbageBot('Dan', 'Burdeno')
test_bot.name

In [None]:
test_bot.first_name_

In [None]:
test_bot._last_name

## Inheritance in Data Science

A lot of motivation in how we write our code can be summed up with, "Never reinvent the wheel". And using **inheritance** can make this really easy.

Later, we'll be taking Scikit-Learn's objects and customizing them to our particular needs. This can be a common practice as we use libraries and tools to write reproducible code.

Inheritance allows us to write some of this code quickly by avoiding a lot of "boilerplate" code (the same code we write over and over just to do a minor change).
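As a further illustration, a child class can also _extend_ a parent's methods rather than only adding new ones. The sketch below is a simplified stand-in for the `Robot` example above; the `BatteryBot` class and its `battery` argument are hypothetical names made up for this illustration. It assumes we want the battery stored per instance, so it calls `super().__init__()` to reuse the parent's setup.

```python
class Robot:
    """Simplified version of the Robot class, for illustration only."""

    def __init__(self, name='?'):
        self.name = name.strip()

    def speak(self):
        return f'I am {self.name}!'


class BatteryBot(Robot):
    """Hypothetical child class that adds a per-instance battery level."""

    def __init__(self, name='?', battery=100):
        # Reuse the parent's initialization instead of copying it
        super().__init__(name)
        self.battery = battery

    def speak(self):
        # Extend (rather than replace) the parent's behavior
        base_message = super().speak()
        return f'{base_message} Battery: {self.battery}%'


bot = BatteryBot(' Wall-e ', battery=80)
print(bot.speak())
print(isinstance(bot, Robot))  # a BatteryBot is also a Robot
```

Calling `super()` keeps the child class short: if `Robot.__init__` changes later, `BatteryBot` picks up the change without any edits.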

# Duck Typing

But we don't need inheritance to do everything. 

A different way of getting shared functionality out of different objects is called **duck typing**. The term comes from the saying: 
> **"If it walks like a duck and it quacks like a duck, then it must be a duck."**

![](img/duck.jpg)
> <a href="https://commons.wikimedia.org/wiki/File:Rubber_Duck_Front_View_in_Fine_Day_20140107.jpg">玄史生</a>, <a href="https://creativecommons.org/licenses/by-sa/3.0">CC BY-SA 3.0</a>, via Wikimedia Commons

When you're using duck typing, you don't care about an object's _type_ or whether that type is formally compatible.

All you care about are the object's **methods and properties**, not its type or class.
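Here is a minimal sketch of duck typing in plain Python. The class and function names (`RubberDuck`, `ToyRobot`, `make_it_speak`) are made up for this illustration. The two classes share no parent class, yet the same function works with both because each one has a `speak()` method.

```python
class RubberDuck:
    def speak(self):
        return 'Quack!'


class ToyRobot:
    def speak(self):
        return 'Beep boop!'


def make_it_speak(thing):
    # No isinstance() check: anything with a speak() method works
    return thing.speak()


sounds = [make_it_speak(thing) for thing in (RubberDuck(), ToyRobot())]
print(sounds)
```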

## Duck Typing in Scikit-Learn

Scikit-Learn relies more on duck typing than on pure inheritance. In general, if an object has the methods that `sklearn` expects, then it's mostly compatible!

However, inheritance in Scikit-Learn is typically used to avoid _boilerplate_ code. Usually this involves using [`sklearn.base`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.base) such as [`sklearn.base.BaseEstimator`](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html#sklearn.base.BaseEstimator).
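To get a feel for the boilerplate that `BaseEstimator` saves us, here is a small sketch. The `Clipper` class and its `lower` hyperparameter are hypothetical names invented for this example. By inheriting from `BaseEstimator` and storing constructor arguments as attributes of the same name, the class gets `get_params()` and `set_params()` for free and can be copied with `sklearn.base.clone`.

```python
from sklearn.base import BaseEstimator, clone


class Clipper(BaseEstimator):
    """Hypothetical estimator with a single hyperparameter."""

    def __init__(self, lower=0.0):
        # Convention: store each constructor argument under the same name
        self.lower = lower


clipper = Clipper(lower=1.5)
params = clipper.get_params()      # inherited from BaseEstimator
copy_of_clipper = clone(clipper)   # new, unfitted copy with the same params
clipper.set_params(lower=2.0)      # also inherited

print(params)
print(clipper.lower, copy_of_clipper.lower)
```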

# Scikit-Learn's API: (Estimators, Transformers, Predictors)

Scikit-Learn has a great [API](https://scikit-learn.org/stable/developers/develop.html) whose objects are consistent and easy to make compatible with objects you write yourself!

Let's go over the API's objects that will be most relevant to us in the near future.

## Estimator

> This is an object that can take in data and _estimate_ (or *learn*) some parameters. 

This means regression and classification models are estimators but so are objects that transform the original dataset ([Transformers](#Transformer)) such as `StandardScaler`.

### `fit`

All estimators estimate/learn by calling the `fit()` method and passing in the dataset. Other settings can be supplied to "help" the estimator learn. These are called **hyperparameters**: parameters used to tweak the learning process, which in Scikit-Learn are set when the estimator object is created rather than learned from the data.
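A minimal sketch of `fit()` in action, using `Ridge` regression as an arbitrary example estimator and a tiny made-up dataset. The hyperparameter `alpha` is chosen by us when the object is created, while `coef_` and `intercept_` are learned from the data during `fit()`.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Tiny made-up dataset: one feature, four data points
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])

# Hyperparameter (alpha) is set by us, not learned
model = Ridge(alpha=0.1)

# fit() learns from the data and returns the estimator itself
fitted = model.fit(X_train, y_train)

print(model.get_params()['alpha'])     # the hyperparameter we chose
print(model.coef_, model.intercept_)   # parameters learned during fit()
```

Note the trailing underscore on `coef_` and `intercept_`: Scikit-Learn uses that naming convention for attributes that only exist after fitting.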

## Transformer

> Some estimators can change the original data to something new, a **transformation**. 

You can think of examples of these **transformers** when you do scaling, data cleaning, or expanding/reducing a dataset.

### `transform`

Transformers will call the `transform()` method to apply the transformation to a dataset after a `fit()` call.

###  `fit_transform`

Remember that all estimators have a `fit()` method, so a transformer can use the `fit()` method to learn something about the given dataset. After learning with `fit()`, a transformation on the dataset can be made with the `transform()` method. 

An example of this would be a transformer that performs min-max normalization on the dataset; the `fit()` method would learn the minimum and maximum of each feature and the `transform()` method would scale the dataset using them.

When you call `fit` and `transform` with the same dataset, you can simply call the `fit_transform()` method. This essentially has the same results as calling `fit()` and then `transform()` on the dataset but possibly with some optimization and efficiencies baked in.
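The normalization example can be sketched with Scikit-Learn's `MinMaxScaler` on a small made-up array. `fit()` learns each column's minimum and maximum, `transform()` applies the scaling, and `fit_transform()` does both in one call. The last step shows that a fitted transformer can also be applied to data it has not seen before.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: two features, three data points
data = np.array([[1.0, 10.0],
                 [3.0, 20.0],
                 [5.0, 40.0]])

scaler = MinMaxScaler()
scaler.fit(data)                           # learn per-column min and max
print(scaler.data_min_, scaler.data_max_)

two_step = scaler.transform(data)          # apply what was learned
one_step = MinMaxScaler().fit_transform(data)
print(np.allclose(two_step, one_step))     # both routes agree

# A fitted transformer can be reused on new data
new_data = np.array([[2.0, 25.0]])
new_scaled = scaler.transform(new_data)
print(new_scaled)
```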

## Predictor

> We would use the `fit()` method to train our predictor object and then feed in new data to make predictions (based on what it learned in the fitting stage).

We've used **predictors** whenever we've made predictions like with a `LinearRegression` model.

### `predict`

As you can probably guess, the `predict()` method predicts results for a dataset given to it after the predictor has been trained with the `fit()` method.

### `score`

Predictors also have a `score()` method that can be used to evaluate how well the predictor performed on a dataset (such as the test set).
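Here is a short sketch of the predictor workflow with `LinearRegression` on made-up data. For Scikit-Learn regressors, `score()` returns the coefficient of determination ($R^2$); classifiers report mean accuracy instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data following y = 2x + 1
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])

# Made-up test data; the second target is slightly off the line
X_test = np.array([[4.0], [5.0]])
y_test = np.array([9.0, 11.5])

model = LinearRegression()
model.fit(X_train, y_train)        # learn from the training data

preds = model.predict(X_test)      # predictions for unseen data
r2 = model.score(X_test, y_test)   # R^2 on the test set

print(preds)
print(r2)
```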

## Observing a Scikit-Learn Class Definition from Source

Let's begin by taking a look at the source code for `sklearn`'s [StandardScaler](https://github.com/scikit-learn/scikit-learn/blob/fd237278e/sklearn/preprocessing/_data.py#L517)

Take a minute to peruse the source code on your own. What do you notice?
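One thing worth checking while reading the source is which classes `StandardScaler` inherits from. As a small sketch, we can ask Python directly. The exact list of parent classes may differ between Scikit-Learn versions, so no output is shown here.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

# Names of all classes in StandardScaler's inheritance chain
base_names = [cls.__name__ for cls in StandardScaler.__mro__]
print(base_names)

# StandardScaler builds on the same base classes we import in this notebook
is_estimator = issubclass(StandardScaler, BaseEstimator)
is_transformer = issubclass(StandardScaler, TransformerMixin)
print(is_estimator, is_transformer)
```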

# Creating a Scikit-Learn Transformer

> Sometimes we want to create our own Scikit-Learn objects to be used in our code.

Let's try to create a new _transformer_ that will transform the data in the following manner:

- If the value is **positive**, scale the value by the **largest value** in that column
- If the value is **negative**, change it to $0$

## Creating a New Transformer

First, we create our base estimator/transformer through inheritance of [`sklearn.base.BaseEstimator`](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html#sklearn.base.BaseEstimator):

In [None]:
class SpecialTransformer(BaseEstimator):
    pass

my_transformer = SpecialTransformer()
my_transformer

In [None]:
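# Try tab completion after the dot to explore what BaseEstimator provides,
# e.g. get_params() and set_params()
my_transformer.get_params()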

This by itself is pretty useless. But we can now add a new `fit()` method, which will find the maximum value for each column/feature.

## Creating a `fit` Method

In [None]:
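class SpecialTransformer(BaseEstimator):
    
    def fit(self, X, y=None):
        # Learn the maximum of each column/feature
        self.max_ = np.max(X, axis=0)
        return self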

In [None]:
my_transformer = SpecialTransformer()

In [None]:
## Let's use some test data
# Note each column is a feature, each row a data point
X = np.array([
    [-4, 400, 40],
    [10, -100, 1],
    [6, -800, 700],
    [2, 0, 400],
    [8, 200, 1000]
])

X

In [None]:
import pandas as pd

In [None]:
X_df = pd.DataFrame(X)

In [None]:
X_df.max()

In [None]:
X.max(axis=0)

In [None]:
np.max(X_df, axis=0)

In [None]:
X.flatten()

In [None]:
np.max(X)

> Quick check: What would be the max values for each column/feature?

In [None]:
# Not fitted yet, so max_ doesn't exist: expect an AttributeError
my_transformer.max_

In [None]:
# No transformation yet, but finds the maximum values
my_transformer.fit(X)
my_transformer.max_

Great! 

## Creating `transform` Method

Let's now actually implement a way to transform our data:

In [None]:
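class SpecialTransformer(BaseEstimator):
    
    def fit(self, X, y=None):
        self.max_ = np.max(X, axis=0) 
        return self
    
    def transform(self, X):
        X_copy = np.copy(X)
        X_copy[X_copy < 0] = 0      # negative values become 0
        return X_copy / self.max_   # scale by each column's maximum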

In [None]:
# Recall the data
X

In [None]:
# Create a SpecialTransformer and fit with the data
my_transformer = SpecialTransformer()
my_transformer.fit(X)

In [None]:
my_transformer.max_

In [None]:
# Transform the data
X_new = my_transformer.transform(X)
X_new

## Conclusion

We've now created our very own transformer! We could even feed in one dataset to _fit_ our object and then a different dataset to _transform_.

We should note that there's still a lot of customization we could have done. 

For example, we didn't consider what happens if the maximum value for a feature was $0$. We really should code how we want that to be handled (but we just ignored it for now).
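One possible way to handle that case is sketched below. This is only one design choice among several, not the definitive fix: the hypothetical `SafeSpecialTransformer` divides by $1$ whenever a column's maximum is not positive, so such a column comes out as all zeros instead of producing `nan` or `inf`. The sketch also fits on one made-up dataset and transforms a different one.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class SafeSpecialTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical variant that guards against a zero (or negative) maximum."""

    def fit(self, X, y=None):
        self.max_ = np.max(X, axis=0)
        return self

    def transform(self, X):
        # Negative values become 0; cast to float for the division
        X_clipped = np.clip(X, 0, None).astype(float)
        # Assumption: where the learned max is <= 0, divide by 1 instead
        divisor = np.where(self.max_ > 0, self.max_, 1)
        return X_clipped / divisor


# Second column has a maximum of 0
X_train = np.array([[-4, 0],
                    [10, -3],
                    [5, 0]])
safe = SafeSpecialTransformer().fit(X_train)

# Transform a *different* dataset with what was learned from X_train
X_other = np.array([[5, -1],
                    [20, 0]])
result = safe.transform(X_other)
print(result)
```

Other reasonable choices would be to raise an error during `fit()` or to leave such columns untouched; which one is right depends on the data.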

We also could have gotten the `fit_transform()` method automatically by also inheriting from [`TransformerMixin`](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html#sklearn.base.TransformerMixin). See the code below:

In [None]:
class SpecialTransformer(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        self.max_ = np.max(X,axis=0) 
        return self
    
    def transform(self, X):
        X_copy = np.copy(X)
        X_copy[X_copy < 0] = 0
        return X_copy / self.max_

In [None]:
my_transformer = SpecialTransformer()
# Note we can now do fit_transform()
X_new = my_transformer.fit_transform(X)
X_new

# Exercise: Create Your Own Transformer

Your turn! Let's try to recreate the [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) object!

Recall that standard scaling transforms the values in the following way:

$$x_i = \frac{x_i-\bar{x_i}}{\sigma_{x_i}}$$

where the $i$ subscript reminds us that it comes from a single column/feature.
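As a quick worked example of the formula on a single made-up column (not the full exercise), we can compute the scaling by hand with NumPy and compare it to `StandardScaler`. One detail worth knowing: `StandardScaler` uses the population standard deviation, which is what `np.std` computes by default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

column = np.array([2.0, 4.0, 6.0])

mean = column.mean()   # mean of the column
std = column.std()     # population standard deviation (ddof=0)
by_hand = (column - mean) / std

# StandardScaler expects a 2D array, so reshape to one column
by_sklearn = StandardScaler().fit_transform(column.reshape(-1, 1)).ravel()

print(by_hand)
print(np.allclose(by_hand, by_sklearn))
```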

In [None]:
## YOUR CODE HERE!
None

<details>
    <summary>Answer</summary>
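        <code>class MyStandardScaler(BaseEstimator, TransformerMixin):
    def fit(self, arr, y=None):
        self.mean_ = np.mean(arr, axis=0)
        self.scale_ = np.std(arr, axis=0)
        return self  # returning self lets TransformerMixin supply fit_transform()
    def transform(self, arr):
        return (arr - self.mean_) / self.scale_</code>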
</details>

## Test Your Code!

Once you have it, you can test it against the data below and Scikit-Learn's `StandardScaler`.

In [None]:
# Your test data
X = np.array([
    [-4, 400, 40],
    [10, -100, 1],
    [6, -800, 700],
    [2, 0, 400],
    [8, 200, 1000]
])
X

In [None]:
# Test against StandardScaler
sklearn_scaler = StandardScaler()
X_sklearn_scaled = sklearn_scaler.fit_transform(X)
X_sklearn_scaled

In [None]:
# Catches errors
try:
    # Your implementation
    my_scaler = MyStandardScaler()
    my_scaler.fit(X)
    X_my_scaled = my_scaler.transform(X)
    
    # Check against StandardScaler (allclose tolerates floating-point rounding)
    print('StandardScaler and MyStandardScaler same?')
    print(np.allclose(X_sklearn_scaled, X_my_scaled))
except Exception:
    print('Check your fit() and transform() methods!')

In [None]:
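# fit_transform() only works if your class inherits from TransformerMixin
# (with fit() returning self) or defines fit_transform() itself
my_scaler = MyStandardScaler()
my_scaler.fit_transform(X)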

## Objectives Recap

- Understand the concept of object-oriented inheritance
- Understand the main object types of the Scikit-Learn API
- Extend and create custom Scikit-Learn Estimators