# Introduction

In this tutorial, we will walk through an example of updating a preexisting model. This is useful when you come across additional data that you want the model to consider, without having to train a new model from scratch.

The main abstraction that Lightwood offers for this is the `BaseMixer.partial_fit()` method. To call it, you need to pass new training data and a held-out dev subset for internal mixer usage (e.g. early stopping). If you are using an aggregate ensemble, you will likely want to do this for every single mixer. The convenient `PredictorInterface.adjust()` method does this automatically for you.
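To make the role of the held-out dev subset concrete, here is a minimal, library-independent sketch of early stopping. It is not Lightwood's implementation: the function `train_with_early_stopping`, its `patience` parameter, and the scripted losses are all assumptions made up for illustration.

```python
from typing import Callable, List


def train_with_early_stopping(
    train_step: Callable[[], None],
    dev_loss: Callable[[], float],
    max_epochs: int = 100,
    patience: int = 3,
) -> List[float]:
    """Run `train_step` until the dev loss fails to improve `patience` times in a row."""
    history: List[float] = []
    best = float("inf")
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_step()            # one pass over the (new) training data
        loss = dev_loss()       # evaluate on the held-out dev subset
        history.append(loss)
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break           # dev loss has plateaued: stop updating
    return history


# Toy usage: scripted dev losses whose best value (0.4) is reached at the third epoch.
scripted = iter([0.9, 0.5, 0.4, 0.45, 0.41, 0.42, 0.3])
history = train_with_early_stopping(lambda: None, lambda: next(scripted), patience=3)
print(len(history))  # number of epochs actually run
```

Training stops after three consecutive epochs without a new best dev loss, so the final scripted value is never reached.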


# Initial model training

First, let's train a Lightwood predictor for the `concrete strength` dataset:

In [1]:
from lightwood.api.high_level import ProblemDefinition, json_ai_from_problem, predictor_from_json_ai
import pandas as pd

INFO:lightwood-2187:No torchvision detected, image helpers not supported.
INFO:lightwood-2187:No torchvision/pillow detected, image encoder not supported


In [2]:
# Load data
df = pd.read_csv('https://raw.githubusercontent.com/mindsdb/lightwood/staging/tests/data/concrete_strength.csv')

df = df.sample(frac=1, random_state=1)
train_df = df[:int(0.1*len(df))]
update_df = df[int(0.1*len(df)):int(0.8*len(df))]
test_df = df[int(0.8*len(df)):]

print(f'Train dataframe shape: {train_df.shape}')
print(f'Update dataframe shape: {update_df.shape}')
print(f'Test dataframe shape: {test_df.shape}')

Train dataframe shape: (103, 10)
Update dataframe shape: (721, 10)
Test dataframe shape: (206, 10)


Note that we have three different data splits.

We will use the `training` split for the initial model training. As you can see, it is only 10% of the total data we have. The `update` split (70% of the data) will be used as training data to adjust/update our model. Finally, the held-out `test` split (the remaining 20%) will give us a rough idea of the impact our updating procedure has on the model's predictive capabilities.
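The slicing arithmetic behind these three splits can be checked without loading the data. The sketch below reproduces the row counts printed above; the helper name `split_bounds` is hypothetical and only mirrors the `int(frac * len(df))` boundaries used in the cell.

```python
from typing import Tuple


def split_bounds(n_rows: int, train_frac: float, update_end_frac: float) -> Tuple[range, range, range]:
    """Return positional index ranges for the train, update and test splits."""
    train_end = int(train_frac * n_rows)         # same truncation as int(0.1 * len(df))
    update_end = int(update_end_frac * n_rows)   # same truncation as int(0.8 * len(df))
    return range(0, train_end), range(train_end, update_end), range(update_end, n_rows)


# The dataset in this tutorial has 1030 rows.
train_idx, update_idx, test_idx = split_bounds(1030, 0.1, 0.8)
print(len(train_idx), len(update_idx), len(test_idx))  # 103 721 206
```

Because the dataframe is shuffled with a fixed `random_state` before slicing, the three ranges are disjoint and together cover every row exactly once.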

In [3]:
# Define predictive task and predictor
target = 'concrete_strength'
pdef = ProblemDefinition.from_dict({'target': target, 'time_aim': 200})
jai = json_ai_from_problem(df, pdef)

# We will keep the architecture simple: a single neural mixer, and a `BestOf` ensemble:
jai.model = {
    "module": "BestOf",
    "args": {
        "args": "$pred_args",
        "accuracy_functions": "$accuracy_functions",
        "submodels": [{
            "module": "Neural",
            "args": {
                "fit_on_dev": False,
                "stop_after": "$problem_definition.seconds_per_mixer",
                "search_hyperparameters": False,
            }
        }]
    }
}

# Build and train the predictor
predictor = predictor_from_json_ai(jai)
predictor.learn(train_df)

INFO:type_infer-2187:Analyzing a sample of 979
INFO:type_infer-2187:from a total population of 1030, this is equivalent to 95.0% of your data.
INFO:type_infer-2187:Infering type for: id
INFO:type_infer-2187:Column id has data type integer
INFO:type_infer-2187:Infering type for: cement
INFO:type_infer-2187:Column cement has data type float
INFO:type_infer-2187:Infering type for: slag
INFO:type_infer-2187:Column slag has data type float
INFO:type_infer-2187:Infering type for: flyAsh
INFO:type_infer-2187:Column flyAsh has data type float
INFO:type_infer-2187:Infering type for: water
INFO:type_infer-2187:Column water has data type float
INFO:type_infer-2187:Infering type for: superPlasticizer
INFO:type_infer-2187:Column superPlasticizer has data type float
INFO:type_infer-2187:Infering type for: coarseAggregate
INFO:type_infer-2187:Column coarseAggrega

In [4]:
# Get predictions for the held-out test set (the predictor was already trained above)
predictions = predictor.predict(test_df)
predictions

INFO:dataprep_ml-2187:[Predict phase 1/4] - Data preprocessing
INFO:dataprep_ml-2187:Cleaning the data
DEBUG:lightwood-2187: `preprocess` runtime: 0.01 seconds
INFO:dataprep_ml-2187:[Predict phase 2/4] - Feature generation
INFO:dataprep_ml-2187:Featurizing the data
DEBUG:lightwood-2187: `featurize` runtime: 0.0 seconds
INFO:dataprep_ml-2187:[Predict phase 3/4] - Calling ensemble
INFO:dataprep_ml-2187:[Predict phase 4/4] - Analyzing output
INFO:lightwood-2187:The block ICP is now running its explain() method
INFO:lightwood-2187:The block ConfStats is now running its explain() method
INFO:lightwood-2187:ConfStats.explain() has not been implemented, no modifications will be done to the data insights.
INFO:lightwood-2187:The block AccStats is now running its explain() method
INFO:lightwood-2187:AccStats.explain() has not been implemented, no modifications will be done to the dat

| | original_index | prediction | confidence | lower | upper |
|---|---|---|---|---|---|
| 0 | 0 | 56.439272 | 0.9991 | 21.089785 | 91.788760 |
| 1 | 1 | 34.933219 | 0.9991 | 0.000000 | 70.282706 |
| 2 | 2 | 13.282833 | 0.9991 | 0.000000 | 48.632320 |
| 3 | 3 | 4.464484 | 0.9991 | 0.000000 | 39.813972 |
| 4 | 4 | 40.835308 | 0.9991 | 5.485821 | 76.184796 |
| ... | ... | ... | ... | ... | ... |
| 201 | 201 | 47.562860 | 0.9991 | 12.213373 | 82.912348 |
| 202 | 202 | 39.566722 | 0.9991 | 4.217235 | 74.916210 |
| 203 | 203 | 33.415344 | 0.9991 | 0.000000 | 68.764831 |
| 204 | 204 | 23.107278 | 0.9991 | 0.000000 | 58.456766 |


## Updating the predictor

For this, we have two options:

### `BaseMixer.partial_fit()`

Updates a single mixer. You need to pass the new data wrapped in `EncodedDs` objects.

**Arguments:** 
* `train_data: EncodedDs`
* `dev_data: EncodedDs`
* `adjust_args: Optional[dict]` - Any arguments the mixer needs in order to adjust to the new data.

If the mixer does not need a `dev_data` partition, pass a dummy:

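```
from lightwood.data import EncodedDs

# An empty dataframe stands in for the dev partition when the mixer does not use one
dev_data = EncodedDs(predictor.encoders, pd.DataFrame(), predictor.target)
```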

### `PredictorInterface.adjust()`

Updates all mixers inside the predictor by calling their respective `partial_fit()` methods. Any `adjust_args` will be transparently passed as well.

**Arguments:**

* `new_data: pd.DataFrame`
* `old_data: Optional[pd.DataFrame]`
* `adjust_args: Optional[dict]`
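The relationship between the two options can be pictured with a toy model. The sketch below is not Lightwood code: `ToyMixer` and `ToyPredictor` are hypothetical stand-ins, the "mixer" only tracks a running mean of the target, and using `old_data` as the dev partition is an assumption made for illustration.

```python
from typing import Dict, List, Optional


class ToyMixer:
    """Hypothetical mixer that only keeps a running mean of the targets it has seen."""

    def __init__(self) -> None:
        self.n_seen = 0
        self.mean = 0.0

    def partial_fit(self, train_data: List[float], dev_data: List[float],
                    adjust_args: Optional[Dict] = None) -> None:
        # Fold the new targets into the existing state instead of refitting from scratch.
        for y in train_data:
            self.n_seen += 1
            self.mean += (y - self.mean) / self.n_seen


class ToyPredictor:
    """Hypothetical predictor that fans a single `adjust` call out to every mixer."""

    def __init__(self, mixers: List[ToyMixer]) -> None:
        self.mixers = mixers

    def adjust(self, new_data: List[float], old_data: Optional[List[float]] = None,
               adjust_args: Optional[Dict] = None) -> None:
        dev_data = old_data or []  # assumption: old data plays the dev role here
        for mixer in self.mixers:
            mixer.partial_fit(new_data, dev_data, adjust_args)


toy = ToyPredictor([ToyMixer(), ToyMixer()])
toy.adjust([10.0, 20.0])                # stands in for the initial training data
toy.adjust([30.0, 40.0], [10.0, 20.0])  # update with new data
print([m.mean for m in toy.mixers])     # [25.0, 25.0]
```

The point of the sketch is the control flow: one call on the predictor reaches every mixer, and each mixer updates its existing state rather than starting over.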

Let's `adjust` our predictor:

In [5]:
predictor.adjust(update_df, train_df)  # data to adjust and original data

INFO:dataprep_ml-2187:Cleaning the data
DEBUG:lightwood-2187: `preprocess` runtime: 0.02 seconds
INFO:dataprep_ml-2187:Cleaning the data
DEBUG:lightwood-2187: `preprocess` runtime: 0.01 seconds
INFO:dataprep_ml-2187:Updating the mixers
INFO:lightwood-2187:Loss @ epoch 1: 0.019952305282155674
DEBUG:lightwood-2187: `adjust` runtime: 4.64 seconds


In [6]:
new_predictions = predictor.predict(test_df)
new_predictions

INFO:dataprep_ml-2187:[Predict phase 1/4] - Data preprocessing
INFO:dataprep_ml-2187:Cleaning the data
DEBUG:lightwood-2187: `preprocess` runtime: 0.01 seconds
INFO:dataprep_ml-2187:[Predict phase 2/4] - Feature generation
INFO:dataprep_ml-2187:Featurizing the data
DEBUG:lightwood-2187: `featurize` runtime: 0.0 seconds
INFO:dataprep_ml-2187:[Predict phase 3/4] - Calling ensemble
INFO:dataprep_ml-2187:[Predict phase 4/4] - Analyzing output
INFO:lightwood-2187:The block ICP is now running its explain() method
INFO:lightwood-2187:The block ConfStats is now running its explain() method
INFO:lightwood-2187:ConfStats.explain() has not been implemented, no modifications will be done to the data insights.
INFO:lightwood-2187:The block AccStats is now running its explain() method
INFO:lightwood-2187:AccStats.explain() has not been implemented, no modifications will be done to the dat

| | original_index | prediction | confidence | lower | upper |
|---|---|---|---|---|---|
| 0 | 0 | 56.719032 | 0.9991 | 21.369544 | 92.068519 |
| 1 | 1 | 35.560889 | 0.9991 | 0.211402 | 70.910377 |
| 2 | 2 | 13.867125 | 0.9991 | 0.000000 | 49.216613 |
| 3 | 3 | 5.049366 | 0.9991 | 0.000000 | 40.398853 |
| 4 | 4 | 41.410596 | 0.9991 | 6.061108 | 76.760083 |
| ... | ... | ... | ... | ... | ... |
| 201 | 201 | 47.838662 | 0.9991 | 12.489175 | 83.188150 |
| 202 | 202 | 39.390884 | 0.9991 | 4.041397 | 74.740372 |
| 203 | 203 | 33.992365 | 0.9991 | 0.000000 | 69.341852 |
| 204 | 204 | 22.908402 | 0.9991 | 0.000000 | 58.257889 |
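To get a feel for how much the update moved individual predictions, the sketch below compares the first five rows of the two prediction tables shown in this tutorial. The helper name `mean_abs_shift` is made up for illustration; with the real dataframes you would compare the full `prediction` columns instead of hand-copied values.

```python
from typing import Sequence


def mean_abs_shift(before: Sequence[float], after: Sequence[float]) -> float:
    """Average absolute change between two aligned sequences of predictions."""
    if len(before) != len(after):
        raise ValueError("prediction sequences must be aligned row by row")
    return sum(abs(a - b) for b, a in zip(before, after)) / len(before)


# First five predictions before and after the update, copied from the outputs above.
old_head = [56.439272, 34.933219, 13.282833, 4.464484, 40.835308]
new_head = [56.719032, 35.560889, 13.867125, 5.049366, 41.410596]

shift = mean_abs_shift(old_head, new_head)
print(f"Mean absolute shift on the first five rows: {shift:.3f}")
```

For these five rows the predictions moved by roughly half a unit of concrete strength on average, which is small compared with the width of the reported intervals.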


Nice! Our predictor was updated, and new predictions are looking good. Let's compare the old and new accuracies to complete the experiment:

In [7]:
from sklearn.metrics import r2_score

# "Accuracy" here is the R² score (coefficient of determination) on the held-out test set
old_acc = r2_score(test_df[target], predictions['prediction'])
new_acc = r2_score(test_df[target], new_predictions['prediction'])

print(f'Old Accuracy: {round(old_acc, 3)}\nNew Accuracy: {round(new_acc, 3)}')

Old Accuracy: 0.784
New Accuracy: 0.784
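The "accuracy" printed above is the R² score, which compares the model's squared error against that of always predicting the mean of the true values. As a reminder of what that number measures, here is a small dependency-free sketch of the same formula; the function name `r2` and the toy values are illustrative only.

```python
from typing import Sequence


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model's squared error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # squared error of the mean
    return 1 - ss_res / ss_tot


perfect = r2([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])    # predictions match exactly
mean_only = r2([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])  # always predicting the mean
toy_score = r2([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(perfect, mean_only, round(toy_score, 3))
```

A score of 1 means perfect predictions and 0 means no better than predicting the mean, so 0.784 indicates the model explains most, but not all, of the variance in the test targets.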


## Conclusion

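We have gone through a simple example of how Lightwood predictors can incorporate newly acquired data without being retrained from scratch. The interface for doing so is fairly simple, requiring only some new data and a single call to `adjust()`. In this particular run, the R² score on the test set was 0.784 both before and after the update, so the adjustment shifted individual predictions without a measurable effect on this metric.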

You can further customize the logic for updating your mixers by modifying the `partial_fit()` methods in them.