# Tutorial - Implementing a custom mixer in Lightwood


## Introduction

Mixers are the centerpiece of lightwood. They are tasked with learning the mapping between the encoded feature representations and the encoded target representation.


## Objective

In this tutorial, we'll implement an sklearn random forest as a mixer that handles categorical and binary targets.

## Step 1: The Mixer Interface

The Mixer interface is defined by the `BaseMixer` class. A mixer needs methods for four tasks:
* fitting (`fit`)
* predicting (`__call__`)
* construction (`__init__`)
* partial fitting (`partial_fit`), though this one is optional
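
The four tasks can be sketched without lightwood installed. The block below is a minimal illustration, not lightwood's code: `SketchMixer` is a hypothetical stand-in for `BaseMixer`, datasets are plain lists of `(features, target)` pairs rather than `EncodedDs` objects, and `MajorityClassMixer` is an invented toy model.

```python
from abc import ABC, abstractmethod
from collections import Counter
from typing import Any, List, Sequence, Tuple

# One encoded row: (feature vector, target). A plain-list stand-in for EncodedDs.
Row = Tuple[Sequence[float], Any]


class SketchMixer(ABC):
    """Hypothetical stand-in for lightwood's BaseMixer (not the real class)."""

    def __init__(self, stop_after: float):
        # Construction: remember the training time budget, in seconds.
        self.stop_after = stop_after

    @abstractmethod
    def fit(self, train_data: List[Row], dev_data: List[Row]) -> None:
        """Fitting: learn the mapping from features to target."""

    @abstractmethod
    def __call__(self, ds: List[Row]) -> List[Any]:
        """Predicting: return one prediction per row of `ds`."""

    def partial_fit(self, train_data: List[Row], dev_data: List[Row]) -> None:
        """Partial fitting: optional, so this sketch defaults to a no-op."""


class MajorityClassMixer(SketchMixer):
    """Toy mixer that always predicts the most frequent training target."""

    def fit(self, train_data: List[Row], dev_data: List[Row]) -> None:
        counts = Counter(y for _, y in list(train_data) + list(dev_data))
        self.majority = counts.most_common(1)[0][0]

    def __call__(self, ds: List[Row]) -> List[Any]:
        return [self.majority for _ in ds]


mixer = MajorityClassMixer(stop_after=10)
mixer.fit(train_data=[([0.1], 'yes'), ([0.4], 'yes')], dev_data=[([0.9], 'no')])
predictions = mixer([([0.2], None), ([0.7], None)])
```

Only `fit` and `__call__` are abstract in this sketch, which mirrors the idea that `partial_fit` may be left out.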

## Step 2: Writing our mixer

I'm going to create a file called `random_forest_mixer.py` inside `/etc/lightwood_modules`, which is where lightwood sources custom modules from.

Inside of it I'm going to write the following code:

In [1]:
%%writefile random_forest_mixer.py

from lightwood.mixer import BaseMixer
from lightwood.api.types import PredictionArguments
from lightwood.data.encoded_ds import EncodedDs, ConcatedEncodedDs
from type_infer.dtype import dtype
from lightwood.encoder import BaseEncoder

import torch
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


class RandomForestMixer(BaseMixer):
    clf: RandomForestClassifier

    def __init__(self, stop_after: int, dtype_dict: dict, target: str, target_encoder: BaseEncoder):
        super().__init__(stop_after)
        self.target_encoder = target_encoder
        # Throw in case someone tries to use this for a problem that's not classification.
        # It'd fail anyway, but this way the error message is more intuitive.
        if dtype_dict[target] not in (dtype.categorical, dtype.binary):
            raise Exception(f'This mixer can only be used for classification problems! Got target dtype {dtype_dict[target]} instead!')

        # We could also initialize this in `fit` if some of the parameters depend on the input data, since `fit` is called exactly once
        self.clf = RandomForestClassifier(max_depth=30)

    def fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:
        X, Y = [], []
        # By default, mixers get some train data and a bit of dev data on which to do
        # early stopping or hyperparameter optimization. This mixer doesn't need dev data,
        # so we concatenate the two in order to get more training data.
        # Then we turn them into an sklearn-friendly format.
        for x, y in ConcatedEncodedDs([train_data, dev_data]):
            X.append(x.tolist())
            Y.append(y.tolist())
        self.clf.fit(X, Y)

    def __call__(self, ds: EncodedDs,
                 args: PredictionArguments = PredictionArguments()) -> pd.DataFrame:
        # Turn the data into an sklearn friendly format
        X = []
        for x, _ in ds:
            X.append(x.tolist())

        Yh = self.clf.predict(X)

        # Lightwood encoders are meant to decode torch tensors, so we have to cast the predictions first
        decoded_predictions = self.target_encoder.decode(torch.Tensor(Yh))

        # Finally, turn the decoded predictions into a dataframe with a single column called `prediction`. This is the standard behaviour all lightwood mixers use
        ydf = pd.DataFrame({'prediction': decoded_predictions})

        return ydf

    
    # We'll skip implementing `partial_fit`, thus making this mixer unsuitable for online training tasks

Overwriting random_forest_mixer.py
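
The concatenate-then-convert loop in `fit` can be tried in isolation. This is a rough sketch with stand-ins, not lightwood's implementation: `array.array` plays the role of the tensors (both offer `.tolist()`), `itertools.chain` plays the role of `ConcatedEncodedDs`, and the sample values are made up.

```python
from array import array
from itertools import chain

# Made-up encoded rows: (features, target) pairs, as a mixer would receive them.
train_data = [
    (array('d', [0.5, 1.0]), array('d', [1.0, 0.0])),
    (array('d', [0.25, 0.0]), array('d', [0.0, 1.0])),
]
dev_data = [
    (array('d', [0.75, 1.0]), array('d', [1.0, 0.0])),
]

X, Y = [], []
# Chain the two splits so no rows are held out, then convert each row
# into plain Python lists, which sklearn estimators accept.
for x, y in chain(train_data, dev_data):
    X.append(x.tolist())
    Y.append(y.tolist())
```

When each `y` is a vector rather than a scalar label, as in this sketch, `RandomForestClassifier.fit` treats `Y` as a multi-output target and `predict` returns an array of the same width.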


## Step 3: Using our mixer

We're going to use our mixer for diagnosing heart disease using this dataset: [https://github.com/mindsdb/benchmarks/blob/main/benchmarks/datasets/heart_disease/data.csv](https://github.com/mindsdb/benchmarks/blob/main/benchmarks/datasets/heart_disease/data.csv)

First, since we don't want to bother writing a Json AI for this dataset from scratch, we're going to let lightwood auto-generate one.

In [2]:
from lightwood.api.high_level import ProblemDefinition, json_ai_from_problem, load_custom_module
import pandas as pd

# load the code
load_custom_module('random_forest_mixer.py')

# read dataset
df = pd.read_csv('https://raw.githubusercontent.com/mindsdb/benchmarks/main/benchmarks/datasets/heart_disease/data.csv')

# define the predictive task
pdef = ProblemDefinition.from_dict({
    'target': 'target', # column you want to predict
})

# generate the Json AI intermediate representation from the data and its corresponding settings
json_ai = json_ai_from_problem(df, problem_definition=pdef)

# Print it (you can also put it in a file and edit it there)
print(json_ai.to_json())

INFO:lightwood-1468487:Dropping features: []
INFO:lightwood-1468487:Analyzing a sample of 298
INFO:lightwood-1468487:from a total population of 303, this is equivalent to 98.3% of your data.
INFO:lightwood-1468487:Using 7 processes to deduct types.
INFO:lightwood-1468487:Infering type for: cp
INFO:lightwood-1468487:Infering type for: age
INFO:lightwood-1468487:Infering type for: sex
INFO:lightwood-1468487:Infering type for: trestbps
INFO:lightwood-1468487:Infering type for: chol
INFO:lightwood-1468487:Infering type for: restecg
INFO:lightwood-1468487:Infering type for: fbs
INFO:lightwood-1468487:Column cp has data type categorical
INFO:lightwood-1468487:Column chol has data type integer
INFO:lightwood-1468487:Column sex has data type binary
INFO:lightwood-1468487:Column restecg has data type categorical
INFO:lightwood-1468487:Column trestbps has da

{
    "encoders": {
        "target": {
            "module": "BinaryEncoder",
            "args": {
                "is_target": "True",
                "target_weights": "$statistical_analysis.target_weights"
            }
        },
        "age": {
            "module": "NumericEncoder",
            "args": {}
        },
        "sex": {
            "module": "BinaryEncoder",
            "args": {}
        },
        "cp": {
            "module": "OneHotEncoder",
            "args": {}
        },
        "trestbps": {
            "module": "NumericEncoder",
            "args": {}
        },
        "chol": {
            "module": "NumericEncoder",
            "args": {}
        },
        "fbs": {
            "module": "BinaryEncoder",
            "args": {}
        },
        "restecg": {
            "module": "OneHotEncoder",
            "args": {}
        },
        "thalach": {
            "module": "NumericEncoder",
            "args": {}
        },
        "exang": {
        

Now we have to edit the `submodels` list in the `model` section of this Json AI to tell lightwood to use our custom mixer. We can use it together with the other mixers and have it ensembled with them at the end, or use it standalone. In this case, I'm going to replace all existing mixers with this one.

In [3]:
json_ai.model['args']['submodels'] = [{
    'module': 'random_forest_mixer.RandomForestMixer',
    'args': {
        'stop_after': '$problem_definition.seconds_per_mixer',
        'dtype_dict': '$dtype_dict',
        'target': '$target',
        'target_encoder': '$encoders[self.target]'
    }
}]
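
The two options, ensembling with the default mixers or running standalone, differ only in whether the `submodels` list is extended or replaced. Here is a minimal sketch on a plain dict that mimics the shape of `json_ai.model` used above; `with_custom_mixer` and the `SomeDefaultMixer` entry are hypothetical names for illustration, not part of lightwood.

```python
import copy


def with_custom_mixer(model: dict, spec: dict, standalone: bool) -> dict:
    """Return a copy of `model` whose submodels include `spec`."""
    updated = copy.deepcopy(model)  # leave the caller's dict untouched
    if standalone:
        # Replace every existing mixer with the custom one.
        updated['args']['submodels'] = [spec]
    else:
        # Keep the existing mixers and add the custom one alongside them.
        updated['args']['submodels'].append(spec)
    return updated


# Placeholder for an auto-generated model section (hypothetical content).
model = {'args': {'submodels': [{'module': 'SomeDefaultMixer', 'args': {}}]}}

custom = {
    'module': 'random_forest_mixer.RandomForestMixer',
    'args': {
        'stop_after': '$problem_definition.seconds_per_mixer',
        'dtype_dict': '$dtype_dict',
        'target': '$target',
        'target_encoder': '$encoders[self.target]',
    },
}

ensembled = with_custom_mixer(model, custom, standalone=False)
alone = with_custom_mixer(model, custom, standalone=True)
```

Working on a copy keeps the auto-generated configuration available for comparison while experimenting with either variant.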

Then we'll generate some code, and finally turn that code into a predictor object and fit it on the original data.

In [4]:
from lightwood.api.high_level import code_from_json_ai, predictor_from_code

code = code_from_json_ai(json_ai)
predictor = predictor_from_code(code)

In [5]:
predictor.learn(df)

INFO:lightwood-1468487:Dropping features: []
INFO:lightwood-1468487:Performing statistical analysis on data
INFO:lightwood-1468487:Starting statistical analysis
INFO:lightwood-1468487:Finished statistical analysis
DEBUG:lightwood-1468487: `analyze_data` runtime: 0.01 seconds
INFO:lightwood-1468487:Cleaning the data
DEBUG:lightwood-1468487: `preprocess` runtime: 0.01 seconds
INFO:lightwood-1468487:Splitting the data into train/test
DEBUG:lightwood-1468487: `split` runtime: 0.01 seconds
INFO:lightwood-1468487:Preparing the encoders
INFO:lightwood-1468487:Encoder prepping dict length of: 1
INFO:lightwood-1468487:Encoder prepping dict length of: 2
INFO:lightwood-1468487:Encoder prepping dict length of: 3
INFO:lightwood-1468487:Encoder prepping dict length of: 4
INFO:lightwood-1468487:Encoder prepping dict length of: 5
INFO:lightwood-1468487:Encoder pre

Finally, we can use the trained predictor to make some predictions, or save it to a pickle for later use.

In [6]:
predictions = predictor.predict(pd.DataFrame({
    'age': [63, 15, None],
    'sex': [1, 1, 0],
    'thal': [3, 1, 1]
}))
print(predictions)

predictor.save('my_custom_heart_disease_predictor.pickle')

INFO:lightwood-1468487:Dropping features: []
INFO:lightwood-1468487:Cleaning the data
DEBUG:lightwood-1468487: `preprocess` runtime: 0.0 seconds
INFO:lightwood-1468487:Featurizing the data
DEBUG:lightwood-1468487: `featurize` runtime: 0.0 seconds
INFO:lightwood-1468487:The block ICP is now running its explain() method
INFO:lightwood-1468487:The block AccStats is now running its explain() method
INFO:lightwood-1468487:AccStats.explain() has not been implemented, no modifications will be done to the data insights.
INFO:lightwood-1468487:The block ConfStats is now running its explain() method
INFO:lightwood-1468487:ConfStats.explain() has not been implemented, no modifications will be done to the data insights.
DEBUG:lightwood-1468487: `predict` runtime: 0.03 seconds


   original_index prediction  confidence
0               0          1    0.972203
1               1          0    0.989355
2               2          1    0.849581


That's all it takes to solve a predictive problem with lightwood using your own custom mixer.