# Introduction

In this tutorial, we walk through an example of updating a preexisting model. This is useful when you come across additional data that you want the model to consider, without having to train a new model from scratch.

The main abstraction that Lightwood offers for this is the `BaseMixer.partial_fit()` method. To call it, you need to pass new training data and a held-out dev subset for internal mixer usage (e.g. early stopping). If you are using an aggregate ensemble, you will likely want to do this for every mixer. The convenient `PredictorInterface.adjust()` method does this for you automatically.


# Initial model training

First, let's train a Lightwood predictor for the `concrete strength` dataset:

In [1]:
from lightwood.api.high_level import ProblemDefinition, json_ai_from_problem, predictor_from_json_ai
import pandas as pd

[nltk_data] Downloading package punkt to /home/runner/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
INFO:lightwood-2308:No torchvision detected, image helpers not supported.
INFO:lightwood-2308:No torchvision/pillow detected, image encoder not supported


In [2]:
# Load data
df = pd.read_csv('https://raw.githubusercontent.com/mindsdb/lightwood/staging/tests/data/concrete_strength.csv')

df = df.sample(frac=1, random_state=1)
train_df = df[:int(0.1*len(df))]
update_df = df[int(0.1*len(df)):int(0.8*len(df))]
test_df = df[int(0.8*len(df)):]

print(f'Train dataframe shape: {train_df.shape}')
print(f'Update dataframe shape: {update_df.shape}')
print(f'Test dataframe shape: {test_df.shape}')

Train dataframe shape: (103, 10)
Update dataframe shape: (721, 10)
Test dataframe shape: (206, 10)


Note that we have three different data splits.

We will use the `train` split for the initial model training. As you can see, it is only 10% of the total data we have. The `update` split (the next 70%) will be used as training data to adjust/update our model. Finally, the held-out `test` split (the remaining 20%) will give us a rough idea of the impact our updating procedure has on the model's predictive capabilities.
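
To make the split arithmetic concrete, here is a minimal sketch in plain Python that mirrors the slicing used in the cell above, which cuts the shuffled rows at the 10% and 80% marks. The helper name `split_sizes` is an illustrative assumption and is not part of Lightwood or pandas.

```python
def split_sizes(n_rows, cut_points=(0.1, 0.8)):
    """Return the sizes of the consecutive slices obtained by cutting
    `n_rows` rows at the given fractional cut points.

    Hypothetical helper for illustration only; it mirrors df[:int(0.1*n)],
    df[int(0.1*n):int(0.8*n)] and df[int(0.8*n):] from the cell above.
    """
    # Convert fractional cut points to integer row indices, as the slicing does
    bounds = [0] + [int(frac * n_rows) for frac in cut_points] + [n_rows]
    # Each slice size is the gap between two consecutive boundaries
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


train_n, update_n, test_n = split_sizes(1030)
print(train_n, update_n, test_n)  # 103 721 206
```

These sizes match the dataframe shapes printed above for the 1030-row dataset.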

In [3]:
# Define predictive task and predictor
target = 'concrete_strength'
pdef = ProblemDefinition.from_dict({'target': target, 'time_aim': 200})
jai = json_ai_from_problem(df, pdef)

# We will keep the architecture simple: a single neural mixer, and a `BestOf` ensemble:
jai.model = {
    "module": "BestOf",
    "args": {
        "args": "$pred_args",
        "accuracy_functions": "$accuracy_functions",
        "submodels": [{
            "module": "Neural",
            "args": {
                "fit_on_dev": False,
                "stop_after": "$problem_definition.seconds_per_mixer",
                "search_hyperparameters": False,
            }
        }]
    }
}

# Build and train the predictor
predictor = predictor_from_json_ai(jai)
predictor.learn(train_df)

INFO:type_infer-2308:Analyzing a sample of 979
INFO:type_infer-2308:from a total population of 1030, this is equivalent to 95.0% of your data.
INFO:type_infer-2308:Infering type for: id
INFO:type_infer-2308:Column id has data type integer
INFO:type_infer-2308:Infering type for: cement
INFO:type_infer-2308:Column cement has data type float
INFO:type_infer-2308:Infering type for: slag
INFO:type_infer-2308:Column slag has data type float
INFO:type_infer-2308:Infering type for: flyAsh
INFO:type_infer-2308:Column flyAsh has data type float
INFO:type_infer-2308:Infering type for: water
INFO:type_infer-2308:Column water has data type float
INFO:type_infer-2308:Infering type for: superPlasticizer
INFO:type_infer-2308:Column superPlasticizer has data type float
INFO:type_infer-2308:Infering type for: coarseAggregate
INFO:type_infer-2308:Column coarseAggrega

In [4]:
# Get predictions for the held-out test set
predictions = predictor.predict(test_df)
predictions

INFO:lightwood-2308:[Predict phase 1/4] - Data preprocessing
INFO:lightwood-2308:Cleaning the data
DEBUG:lightwood-2308: `preprocess` runtime: 0.01 seconds
INFO:lightwood-2308:[Predict phase 2/4] - Feature generation
INFO:lightwood-2308:Featurizing the data
DEBUG:lightwood-2308: `featurize` runtime: 0.0 seconds
INFO:lightwood-2308:[Predict phase 3/4] - Calling ensemble
INFO:lightwood-2308:[Predict phase 4/4] - Analyzing output
INFO:lightwood-2308:The block ICP is now running its explain() method
INFO:lightwood-2308:The block ConfStats is now running its explain() method
INFO:lightwood-2308:ConfStats.explain() has not been implemented, no modifications will be done to the data insights.
INFO:lightwood-2308:The block AccStats is now running its explain() method
INFO:lightwood-2308:AccStats.explain() has not been implemented, no modifications will be done to the data insights.

| | original_index | prediction | confidence | lower | upper |
|---|---|---|---|---|---|
| 0 | 0 | 49.177926 | 0.9991 | 0.0 | 99.741836 |
| 1 | 1 | 18.238424 | 0.9991 | 0.0 | 68.802334 |
| 2 | 2 | 22.289925 | 0.9991 | 0.0 | 72.853835 |
| 3 | 3 | 20.362161 | 0.9991 | 0.0 | 70.926070 |
| 4 | 4 | 38.186154 | 0.9991 | 0.0 | 88.750063 |
| ... | ... | ... | ... | ... | ... |
| 201 | 201 | 47.799258 | 0.9991 | 0.0 | 98.363168 |
| 202 | 202 | 41.190272 | 0.9991 | 0.0 | 91.754182 |
| 203 | 203 | 37.798291 | 0.9991 | 0.0 | 88.362201 |
| 204 | 204 | 29.786581 | 0.9991 | 0.0 | 80.350491 |


## Updating the predictor

For this, we have two options:

### `BaseMixer.partial_fit()`

Updates a single mixer. You need to pass the new data wrapped in `EncodedDs` objects.

**Arguments:** 
* `train_data: EncodedDs`
* `dev_data: EncodedDs`

If the mixer does not need a `dev_data` partition, pass a dummy:

```
from lightwood.data import EncodedDs
dev_data = EncodedDs(predictor.encoders, pd.DataFrame(), predictor.target)
```
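
To illustrate why a mixer may want a `dev_data` partition at all, the following is a toy sketch in plain Python of an incremental update with early stopping. `ToyLinearMixer`, its learning rate, and its patience rule are all assumptions made for illustration. This is not how Lightwood's `Neural` mixer is implemented, and it operates on plain lists of `(x, y)` pairs rather than `EncodedDs` objects.

```python
class ToyLinearMixer:
    """Hypothetical one-feature linear model y = w * x (not a Lightwood class)."""

    def __init__(self, w=0.0, lr=0.01):
        self.w = w
        self.lr = lr

    def _mse(self, data):
        # Mean squared error over a list of (x, y) pairs
        return sum((self.w * x - y) ** 2 for x, y in data) / len(data)

    def partial_fit(self, train_data, dev_data, max_epochs=200, patience=5):
        """Continue training from the current weight; stop once the dev loss
        has not improved for `patience` consecutive epochs."""
        best_w, best_loss, stale = self.w, self._mse(dev_data), 0
        for _ in range(max_epochs):
            # One full-batch gradient step on the new training data
            grad = sum(2 * (self.w * x - y) * x for x, y in train_data) / len(train_data)
            self.w -= self.lr * grad
            dev_loss = self._mse(dev_data)
            if dev_loss < best_loss:
                best_w, best_loss, stale = self.w, dev_loss, 0
            else:
                stale += 1
                if stale >= patience:
                    break
        # Keep the weight that scored best on the held-out dev data
        self.w = best_w
        return self


# Synthetic data following y = 3x; the mixer starts from a stale weight of 1.0
new_train = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
new_dev = [(x, 3.0 * x) for x in (1.5, 2.5)]

mixer = ToyLinearMixer(w=1.0)
loss_before = mixer._mse(new_dev)
mixer.partial_fit(new_train, new_dev)
loss_after = mixer._mse(new_dev)
print(loss_before > loss_after)  # True
```

The dev partition is never used to compute gradients here; it only decides when to stop and which weight to keep.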

### `PredictorInterface.adjust()`

Updates all mixers inside the predictor by calling their respective `partial_fit()` methods.

**Arguments:**
* `new_data: Union[EncodedDs, ConcatedEncodedDs, pd.DataFrame]`
* `old_data: Optional[Union[EncodedDs, ConcatedEncodedDs, pd.DataFrame]]`
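
Conceptually, `adjust()` fans a single update out to every mixer. The sketch below shows that pattern with stand-in classes. `ToyPredictor` and `RunningMeanMixer` are hypothetical names, and passing the original data along as the dev partition is an assumption of this sketch rather than a statement about Lightwood's internals.

```python
class RunningMeanMixer:
    """Hypothetical mixer that predicts the mean of all targets seen so far."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def fit(self, targets):
        self.total, self.count = sum(targets), len(targets)

    def partial_fit(self, train_data, dev_data=None):
        # Fold the new targets into the running statistics; this toy mixer
        # has no use for a dev partition, so `dev_data` is ignored
        self.total += sum(train_data)
        self.count += len(train_data)

    def predict(self):
        return self.total / self.count


class ToyPredictor:
    """Stand-in for a predictor that owns several mixers (not Lightwood code)."""

    def __init__(self, mixers):
        self.mixers = mixers

    def adjust(self, new_data, old_data=None):
        # Assumption for this sketch: new data is the training partition and
        # the original data, when given, is handed over as the dev partition
        for mixer in self.mixers:
            mixer.partial_fit(new_data, old_data)


old_targets = [10.0, 20.0]
new_targets = [30.0, 40.0, 50.0]

toy_predictor = ToyPredictor([RunningMeanMixer(), RunningMeanMixer()])
for m in toy_predictor.mixers:
    m.fit(old_targets)

toy_predictor.adjust(new_targets, old_targets)
print([m.predict() for m in toy_predictor.mixers])  # [30.0, 30.0]
```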

Let's `adjust` our predictor:

In [5]:
from lightwood.data import EncodedDs

train_ds = EncodedDs(predictor.encoders, train_df, target)
update_ds = EncodedDs(predictor.encoders, update_df, target)

predictor.adjust(update_ds, train_ds)  # data to adjust and original data

INFO:lightwood-2308:Updating the mixers
torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
INFO:lightwood-2308:Loss @ epoch 1: 0.06682534888386726
DEBUG:lightwood-2308: `adjust` runtime: 4.21 seconds


In [6]:
new_predictions = predictor.predict(test_df)
new_predictions

INFO:lightwood-2308:[Predict phase 1/4] - Data preprocessing
INFO:lightwood-2308:Cleaning the data
DEBUG:lightwood-2308: `preprocess` runtime: 0.01 seconds
INFO:lightwood-2308:[Predict phase 2/4] - Feature generation
INFO:lightwood-2308:Featurizing the data
DEBUG:lightwood-2308: `featurize` runtime: 0.0 seconds
INFO:lightwood-2308:[Predict phase 3/4] - Calling ensemble
INFO:lightwood-2308:[Predict phase 4/4] - Analyzing output
INFO:lightwood-2308:The block ICP is now running its explain() method
INFO:lightwood-2308:The block ConfStats is now running its explain() method
INFO:lightwood-2308:ConfStats.explain() has not been implemented, no modifications will be done to the data insights.
INFO:lightwood-2308:The block AccStats is now running its explain() method
INFO:lightwood-2308:AccStats.explain() has not been implemented, no modifications will be done to the data insights.

| | original_index | prediction | confidence | lower | upper |
|---|---|---|---|---|---|
| 0 | 0 | 50.011687 | 0.9991 | 0.0 | 100.575597 |
| 1 | 1 | 19.743566 | 0.9991 | 0.0 | 70.307476 |
| 2 | 2 | 23.152284 | 0.9991 | 0.0 | 73.716193 |
| 3 | 3 | 21.111830 | 0.9991 | 0.0 | 71.675739 |
| 4 | 4 | 39.706134 | 0.9991 | 0.0 | 90.270044 |
| ... | ... | ... | ... | ... | ... |
| 201 | 201 | 48.683375 | 0.9991 | 0.0 | 99.247284 |
| 202 | 202 | 41.174847 | 0.9991 | 0.0 | 91.738757 |
| 203 | 203 | 39.265443 | 0.9991 | 0.0 | 89.829353 |
| 204 | 204 | 30.940978 | 0.9991 | 0.0 | 81.504888 |


Nice! Our predictor was updated, and new predictions are looking good. Let's compare the old and new accuracies:

In [7]:
from sklearn.metrics import r2_score

old_acc = r2_score(test_df['concrete_strength'], predictions['prediction'])
new_acc = r2_score(test_df['concrete_strength'], new_predictions['prediction'])

print(f'Old Accuracy: {round(old_acc, 3)}\nNew Accuracy: {round(new_acc, 3)}')

Old Accuracy: 0.442
New Accuracy: 0.472


After updating, the R² score of the predictions for the held-out test set rises from 0.442 to 0.472.
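
For reference, the R² score compares the model's squared error against that of a baseline that always predicts the mean of the true values: R² = 1 - SS_res / SS_tot. A minimal plain-Python sketch of that formula follows. The function name `r2` and the sample values are illustrative assumptions; in practice `sklearn.metrics.r2_score`, as used above, is the better choice.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of a baseline that always predicts the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]  # made-up values, not from the concrete dataset
print(r2(y_true, y_true))                # 1.0 (perfect predictions)
print(r2(y_true, [2.5, 2.5, 2.5, 2.5]))  # 0.0 (no better than the mean)
```

A score closer to 1 means the predictions explain more of the variance in the target, which is why the higher score after the update indicates an improvement.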

## Conclusion

We have gone through a simple example of how Lightwood predictors can leverage newly acquired data to improve their predictions. The interface for doing so is fairly simple, requiring only some new data and a single call to `adjust()`.

You can further customize the logic for updating your mixers by modifying the `partial_fit()` methods in them.