# POC Use Xingu

Use this notebook to get you started with Xingu.

**Do not use it here. Copy this notebook and feel free to edit, modify and experiment in there.**

**Never commit your changes to the repo**, since this notebook is just a standard example and will be used by others to play with Xingu.

<hr/>

Start by importing configuration bundles to train, or batch predict, or explore metrics.

In [None]:
# this is a simple `.py` file in notebooks folder, full of configuration parameters for Robson
import config_my_xingu

## Setup environment
### The `env` bundle controls locations for files and databases
#### `config_my_xingu.bundles['env']['alpha_explorer']`
Use when working on everyday Robson improvements
* Robson database: local SQLite
* DVC: off
* Query cache: on, in `../data`
* Trained models in: `../models`

#### `config_my_xingu.bundles['env']['beta_explorer']`
Use when working with staging database
* Robson database: staging PostgreSQL
* DVC: on
* Query cache: on, in `../data`
* Trained models in: staging S3

#### `config_my_xingu.bundles['env']['staging']`
Similar to `beta_explorer`, used in GitHub staging workflow
* Robson database: staging PostgreSQL
* DVC: on
* Query cache: off
* Trained models in: staging S3

#### `config_my_xingu.bundles['env']['production']`
Do not use on your laptop. It is documented here only to show how production is configured
* Robson database: production PostgreSQL
* DVC: on
* Query cache: off
* Trained models in: production S3

### The `parallel` bundle controls parallelism and modus operandi
#### `config_my_xingu.bundles['parallel']['train_and_predict']`
Use when working on everyday Robson improvements
* Train: yes
    * Train parallelism: maximum
    * Hyper-parameters optimization: use what is found in DB, or estimator default
* Post process (pickle, metrics etc): yes
    * Batch predict: yes
    * Post-process parallelism: maximum

#### `config_my_xingu.bundles['parallel']['predict_only']`
Use with pre-trained models
* Train: no
    * Hyper-parameters optimization: no
* Post process (pickle, metrics etc): no
    * Post-process parallelism: maximum
    * Batch predict: yes

#### `config_my_xingu.bundles['parallel']['hyper_optimize_only']`
Use when working on hyper-parameters optimization
* Train: yes
    * Train parallelism: one model at a time
    * Hyper-parameters optimization: compute
    * Hyper-parameters optimization parallelism: maximum
* Post process (pickle, metrics etc): no
    * Batch predict: no

#### `config_my_xingu.bundles['parallel']['do_all']`
Use when you want to optimize hyper-parameters, train and post-process in a single run
* Train: yes
    * Train parallelism: 3 models at a time
    * Hyper-parameters optimization: compute
    * Hyper-parameters optimization parallelism: 6
* Post process (pickle, metrics etc): yes
    * Post-process parallelism: 3 models at a time
    * Batch predict: yes

Choose one **env** bundle and one **parallel** bundle

In [None]:
import os
import sys
import pathlib

os.environ.update(config_my_xingu.bundles['env']['alpha_explorer'])
os.environ.update(config_my_xingu.bundles['parallel']['train_and_predict'])
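
Bundles are plain dictionaries of text values, so choosing a configuration amounts to merging dictionaries, with later updates winning. The sketch below uses a made-up `bundles` structure (the real one lives in `config_my_xingu.py`; keys and values here are illustrative assumptions) and merges into a private dictionary instead of `os.environ`, just to show the precedence.

```python
# Hypothetical stand-in for config_my_xingu.bundles; keys and values are
# illustrative assumptions, not the real file contents.
bundles = {
    'env': {
        'alpha_explorer': {
            'XINGU_DB_URL': 'sqlite:///xingu.db?check_same_thread=False',
            'TRAINED_MODELS_PATH': '../models',
        },
    },
    'parallel': {
        'train_and_predict': {
            'BATCH_PREDICT': 'true',
        },
    },
}

def compose(env_name, parallel_name, **overrides):
    """Merge one env bundle, one parallel bundle and ad hoc overrides, in that order."""
    merged = {}
    merged.update(bundles['env'][env_name])
    merged.update(bundles['parallel'][parallel_name])
    merged.update(overrides)  # applied last, so overrides win
    return merged

settings = compose('alpha_explorer', 'train_and_predict', BATCH_PREDICT='false')
print(settings)
```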

Amend anything you want to change. All values must be text.

In [None]:
os.environ.update(
    dict(
        HYPEROPT_STRATEGY     = 'dp',
        BATCH_PREDICT         = 'false',
        DEBUG                 = 'true'
    )
)
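
Values must be text because `os.environ` rejects anything that is not a string. A minimal sketch of a helper that converts Python values before updating the environment follows; `as_env_text` and `EXAMPLE_MAX_WORKERS` are hypothetical names used only for illustration, and the lowercase `'true'`/`'false'` spelling simply follows the cell above.

```python
import os

def as_env_text(settings):
    """Convert Python values to the text-only form that os.environ accepts."""
    converted = {}
    for key, value in settings.items():
        if isinstance(value, bool):
            converted[key] = 'true' if value else 'false'
        else:
            converted[key] = str(value)
    return converted

# A non-string value is refused by os.environ
try:
    os.environ['DEBUG'] = True
except TypeError as error:
    print(f'rejected: {error}')

# EXAMPLE_MAX_WORKERS is a made-up variable name
overrides = as_env_text(dict(DEBUG=True, BATCH_PREDICT=False, EXAMPLE_MAX_WORKERS=3))
os.environ.update(overrides)
print(overrides)
```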

## Import Xingu and configure Logging
The next cell is required if the `xingu` folder is not in `PYTHONPATH` and Xingu was not installed by pip.

In [None]:
# Give priority to local packages (not needed if Xingu was installed by pip)
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(''), '..')))

In [None]:
import logging
import decouple
import pandas as pd
import numpy
from xingu import DataProviderFactory
from xingu import ConfigManager
from xingu import Coach
from xingu import Model

%config InlineBackend.figure_format='retina'

# Configure logging for Xingu
logger=logging.getLogger('xingu')
FORMATTER = logging.Formatter("%(asctime)s|%(levelname)s|%(name)s|%(message)s")
HANDLER = logging.StreamHandler()
HANDLER.setFormatter(FORMATTER)
logger.addHandler(HANDLER)
logger.setLevel(logging.DEBUG)
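
Running the logging cell twice attaches a second handler to the same logger, and every message is then printed twice. A small guard avoids that. This is a sketch, with `configure_logger` being a hypothetical helper that is not part of Xingu.

```python
import logging

def configure_logger(name='xingu', level=logging.DEBUG):
    """Attach a formatted stream handler to the logger, but only once."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # skip if a previous run already added a handler
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s|%(levelname)s|%(name)s|%(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger

logger = configure_logger()
logger = configure_logger()  # simulates re-running the cell
logger.debug('printed only once')
```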

## POC 1. Train some Models
A `Coach` is needed to train anything. Add more DataProvider IDs to the `data_providers` list to train more models. If a model has pre-reqs that are not being trained in the same train session, the pre-trained pre-req models are efficiently loaded upfront.

In [None]:
data_providers=['datarisk_cartoes']

dpf=DataProviderFactory(providers_list=data_providers)
coach=Coach(dpf)

!rm xingu.db* ../models/*

In [None]:
%%time
coach.team_train()

Also try `config_my_xingu.bundles['parallel']['hyper_optimize_only']` config bundle to radically change what `team_train()` does.

Trained models are here:

In [None]:
coach.trained

Trained models can be used now to compute estimations.

## POC 2. Use Pre-Trained Models for Batch Predict

Reset this notebook before continuing. Then re-run only the cells before "POC 1", just to set up the environment.

A `Coach` is needed to efficiently load pre-trained models

In [None]:
data_providers=['datarisk_cartoes']
dpf=DataProviderFactory(providers_list=data_providers)
coach=Coach(dpf)

Pre-req models are loaded even when they are not in the `data_providers` list. For example, with `cartorios` in the list, `anuncios` would be loaded too, since it is a pre-req for `cartorios`.
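
The idea of pulling in pre-reqs can be sketched as a dependency walk. This is not Xingu's implementation; `with_pre_reqs` and the `pre_req` mapping are hypothetical, and the sketch assumes the dependency graph has no cycles.

```python
def with_pre_reqs(requested, pre_req):
    """Return requested IDs plus direct and indirect pre-reqs, dependencies first."""
    ordered = []

    def visit(dp_id):
        if dp_id in ordered:
            return
        for dependency in pre_req.get(dp_id, []):
            visit(dependency)  # load what this model depends on first
        ordered.append(dp_id)

    for dp_id in requested:
        visit(dp_id)
    return ordered

# Hypothetical mapping: `cartorios` depends on `anuncios`
pre_req = {'cartorios': ['anuncios']}
to_load = with_pre_reqs(['cartorios'], pre_req)
print(to_load)
```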

Pre-trained pickles will be searched for in, and loaded from, whatever is set in the `TRAINED_MODELS_PATH` environment variable. This is usually set to the local `models` folder or to some S3 URL.

Models will be loaded in parallel.

In [None]:
print(os.environ['TRAINED_MODELS_PATH'])
coach.team_load()
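
To give an idea of what loading pickles in parallel can look like, here is a generic sketch using a thread pool over files in a temporary folder. File names and contents are made up, and nothing here is claimed to match what `team_load()` does internally.

```python
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def load_pickle(path):
    """Load one pickled object from disk."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)

with tempfile.TemporaryDirectory() as models_path:
    # Create a few fake "trained models" to have something to load
    for name in ('model_a', 'model_b', 'model_c'):
        with open(Path(models_path) / f'{name}.pkl', 'wb') as handle:
            pickle.dump({'name': name}, handle)

    paths = sorted(Path(models_path).glob('*.pkl'))

    # Each file is read by a separate worker; map() keeps the input order
    with ThreadPoolExecutor(max_workers=3) as pool:
        loaded = {
            path.stem: obj
            for path, obj in zip(paths, pool.map(load_pickle, paths))
        }

print(sorted(loaded))
```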

Use the embedded DataProvider to load some data. The following logic is essentially what happens in the `Model::batch_predict()` method. See also the `Model::fit()` method for the training data preparation logic.

In [None]:
model=coach.trained['datarisk_cartoes']

# Following line is here just to force use of cached parquet, if available
model.context='batch_predict'

# Get DP’s batch predict SQL queries
dict_with_queries     = model.dp.get_dataset_sources_for_batch_predict()

# Use queries to get multiple DataFrames
dict_with_dataframes  = model.data_sources_to_data(dict_with_queries)

# Integrate into one DataFrame and apply logic to clean data
df                    = model.dp.clean_data_for_batch_predict(dict_with_dataframes)

# Feature engineering
df                    = model.dp.feature_engineering_for_batch_predict(df)

# Resulting DataFrame used for batch predict
df

In [None]:
dict_with_dataframes
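
The four steps above (named queries, one DataFrame per query, integration and cleaning, feature engineering) can be reproduced on a toy scale with an in-memory SQLite database. Table names, columns and the `price_per_m2` feature are invented for this sketch and have nothing to do with any real DataProvider.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')
con.executescript("""
    create table units  (unit_id integer, area real);
    create table prices (unit_id integer, price real);
    insert into units  values (1, 50.0), (2, 80.0), (3, null);
    insert into prices values (1, 500000.0), (2, 640000.0), (3, 300000.0);
""")

# 1. A dict of named SQL queries
queries = {'units': 'select * from units', 'prices': 'select * from prices'}

# 2. One DataFrame per query
frames = {name: pd.read_sql(sql, con) for name, sql in queries.items()}

# 3. Integrate into one DataFrame and clean (drop units without area)
df = frames['units'].merge(frames['prices'], on='unit_id').dropna(subset=['area'])

# 4. Feature engineering
df = df.assign(price_per_m2=lambda table: table.price / table.area)

print(df)
```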

Finally, compute estimations

In [None]:
# Illustrative only. For you to see what pred_quantiles() does internally
X_features=model.dp.get_features_list()

# Don't need to filter by X_features, it will be filtered internally
Y_pred=model.predict_proba(df)

Y_pred

### Compute metrics

Put data in the right places so we can use convenient internal methods

In [None]:
model.batch_predict_data=df
model.batch_predict_valuations=Y_pred

Compute all metrics available for model, including methods provided by its DataProvider

In [None]:
model.compute_model_metrics()

In [None]:
model.compute_estimation_metrics()

If `model.sets['train']`, `model.sets['val']` and `model.sets['test']` are defined and have data, this should work too:

In [None]:
model.compute_trainsets_model_metrics()

**POC 1** and **POC 2** unveil what happens in the [train workflow](https://github.com/loft-br/robson_avm/blob/main/.github/workflows/build_and_train_staging.yml).

---

## POC 3. Assess Metrics and create Comparative Reports
Since all metrics are stored in DB, they can be assessed and compared.
The `Coach` class has reporting tools.

In [None]:
# Get metrics from staging and development DB

os.environ.update(
    dict(
        XINGU_DB_URL=config_my_xingu.bundles['env']['beta_explorer']['XINGU_DB_URL']
    )
)

coach=Coach()

Retrieve all metrics and metadata about 2 specific `train_id`s and show them in a comparative way.

In [None]:
report=coach.report(train_ids=['salmon-participant','wise-jacquard'])

display(report['meta'])
display(report['metrics'])

### Display a subset of metrics: only the m² values for São Paulo.

In [None]:
report['metrics'][['value per m²:São Paulo' in s for s in report['metrics'].index]].xs('global', level='set', axis=1).dropna()

### Display a subset of metrics: only the ones related to the `test` split part.

In [None]:
report['metrics'].xs('test', level='set', axis=1).dropna()
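
If `.xs()` on columns is new to you, this toy example shows what the cell above does: it keeps only the columns whose `set` level equals `test` and drops that level. The metric names, values and the `train_id` level name are assumptions made for the sketch.

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [['train', 'test'], ['salmon-participant', 'wise-jacquard']],
    names=['set', 'train_id'],
)

nan = float('nan')
metrics = pd.DataFrame(
    [
        [0.10, 0.11, 0.20, 0.21],   # metric_a exists for both sets
        [0.30, 0.31, nan, nan],     # metric_b has no value for the test set
    ],
    index=['metric_a', 'metric_b'],
    columns=columns,
)

# Cross-section on the column axis, then drop rows with missing values
test_only = metrics.xs('test', level='set', axis=1).dropna()
print(test_only)
```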

### Save all metrics as Excel file

In [None]:
# Excel won't support time with timezone - how typical. Make it naïve.
report['meta'].loc['time_utc']=report['meta'].loc['time_utc'].apply(lambda x: x.tz_convert(None))

with pd.ExcelWriter(f'Metrics for Committee Report - {pd.Timestamp.now().strftime("%Y.%m.%d-%H.%M.%S")}.xlsx') as writer:

    report_aux = report['meta'].sort_values("dataprovider_id", axis=1)
    report_aux.to_excel(writer, sheet_name="meta")

    dataprovider_list = list(set(report_aux.loc["dataprovider_id", :]))

    for dataprovider_id in dataprovider_list:

        train_ids = list(report_aux.loc[:, report_aux.loc["dataprovider_id", :] == dataprovider_id].columns)
        train_session_ids = report_aux.loc["train_session_id", report_aux.loc["dataprovider_id", :] == dataprovider_id]

        sheet = report["metrics"].loc[:, report["metrics"].columns.get_level_values(1).isin(train_ids)]

        aux_list = {id: id + '|'+ train_session_ids[id] for id in sheet.columns.get_level_values(1)}

        sheet = sheet.rename(columns=aux_list)

        sheet.to_excel(writer, sheet_name=dataprovider_id)
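
About the timezone trick used at the top of the cell above: `tz_convert(None)` first converts the timestamp to UTC and then drops the timezone, while `tz_localize(None)` drops the timezone keeping the local wall-clock time. A small sketch with a made-up timestamp at a fixed UTC-3 offset:

```python
from datetime import timedelta, timezone
import pandas as pd

# 09:00 at a fixed UTC-3 offset (made-up example value)
aware = pd.Timestamp('2023-01-01 09:00', tz=timezone(timedelta(hours=-3)))

naive_utc = aware.tz_convert(None)     # converted to UTC, then made naive
naive_local = aware.tz_localize(None)  # made naive, wall-clock time kept

print(naive_utc, naive_local)
```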

## POC 4. Check and report how Metrics evolved

This example reports how metrics of the same estimator evolved over time. We’ll use the production database.

In [None]:
os.environ.update(config_my_xingu.bundles['env']['production'])

coach=Coach()

In [None]:
dp='vitrine_sp'

query="""
    select * from metrics_model
    where dataprovider_id = '{dp}'
    -- and set='global';
"""

In [None]:
# Extract from DB
report=pd.read_sql(query.format(dp=dp),con=coach.get_db_connection('xingu'))

# Make time human readable
report['time']=pd.to_datetime(report['time'], unit='s', utc=True)

# Display a simple evolution report with just OKRs
print(f"Evolution of metrics for {dp}")

(
    report[report['name'].str.contains('OKR')]
    .set_index(['name','time'])
    .drop(columns='dataprovider_id train_session_id train_id set value_text'.split())
    .unstack()
    .sort_index()
)
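
As an alternative to formatting values into the SQL text, the value can be passed as a bound parameter. The sketch below does that against an in-memory SQLite table. The table layout is a simplified guess at `metrics_model`, the rows are made up, and the `?` placeholder is specific to the `sqlite3` driver (other drivers use other placeholder styles).

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')
con.executescript("""
    create table metrics_model (
        dataprovider_id text, name text, time integer, value_number real
    );
    insert into metrics_model values
        ('vitrine_sp', 'OKR error > 15%:proportion', 1672531200, 0.25),
        ('vitrine_sp', 'OKR error > 15%:proportion', 1675209600, 0.22),
        ('other_dp',   'OKR error > 15%:proportion', 1672531200, 0.40);
""")

# The DataProvider ID travels as a parameter, not as part of the SQL text
query = "select * from metrics_model where dataprovider_id = ?"
report = pd.read_sql(query, con=con, params=('vitrine_sp',))

# Epoch seconds become timezone-aware timestamps
report['time'] = pd.to_datetime(report['time'], unit='s', utc=True)

print(report[['time', 'value_number']])
```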

How the `OKR error > 15%:proportion` metric evolved over time

In [None]:
KPI="OKR error > 15%:proportion"

(
    report
    .query('name==@KPI')
    [['time','value_number']]
    .sort_values('time')
    .set_index('time')
    .plot
    .line(title=f'{KPI} @ {dp}')
)

## POC 5. Play with Xingu barebones

`Coach` is handy to coordinate full trains, full batch predict process (including metrics computation) and multi-model loading. But you can play with `Model` objects directly too. A coach is still needed for DB access, though.

In [None]:
data_providers=['datarisk_cartoes']

dpf=DataProviderFactory(providers_list=data_providers)

coach=Coach(dpf)

Get an untrained `Model` object for the first DataProvider in the list.

In [None]:
model=Model(
    dp                     = next(coach.dp_factory.produce()),
    coach                  = coach,
    trained                = False,
    delayed_prereq_binding = True
)

Manually load and bind pre-req models

In [None]:
# Use the coach to load them efficiently
# coach.team_load(explicit_list=model.dp.pre_req)

# Bind them to current model
model.load_pre_req_model()

# See result
model.dp.pre_req_model

Get DP’s SQL queries and related data, clean, integrate and engineer some features

In [None]:
# Following line is here just to force use of cached parquet, if available
model.context='train_dataprep'

# Get DP’s train SQL queries
dict_with_queries     = model.dp.get_dataset_sources_for_train()

# Use queries to get multiple DataFrames
dict_with_dataframes  = model.data_sources_to_data(dict_with_queries)

# Integrate into one DataFrame and apply logic to clean data
df                    = model.dp.clean_data_for_train(dict_with_dataframes)

In [None]:
df

In [None]:
# Feature engineering
df=model.dp.feature_engineering_for_train(df)

# Resulting DataFrame used for train
df

In [None]:
model.dp.data_split_for_train(df)

## POC 6. Play with `ConfigManager`

Reset this notebook before continuing. Then re-run only the cells **before “POC 1”**, just to set up the environment.

Here is `XINGU_DB_URL` env var with AWS secrets and parameters. Use `ConfigManager` to resolve them.

In [None]:
config_my_xingu.bundles['env']['beta_explorer']['XINGU_DB_URL']

In [None]:
os.environ.update(
    dict(
        XINGU_DB_URL = config_my_xingu.bundles['env']['beta_explorer']['XINGU_DB_URL']
    )
)

In [None]:
ConfigManager.get('XINGU_DB_URL')

One more try. Reset its cache first.

In [None]:
ConfigManager.cache={}

In [None]:
os.environ.update(
    dict(
        XYZ = '{%AWS_PARAM:robson-avm-staging-url%}/{%AWS_PARAM:robson-avm-staging-database-name%}'
    )
)

In [None]:
ConfigManager.get('XYZ')
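
To make the placeholder idea concrete, here is a toy resolver for the `{%AWS_PARAM:...%}` and `{%AWS_SECRET:...%}` syntax, with a small cache. It is only a sketch: it looks values up in a made-up dictionary instead of calling AWS, and it does not claim to reproduce how `ConfigManager` works internally.

```python
import re

PLACEHOLDER = re.compile(r'\{%(AWS_PARAM|AWS_SECRET):([^%]+)%\}')

# Hypothetical stand-in for the AWS lookups; names and values are made up
FAKE_STORE = {
    ('AWS_PARAM', 'example-url'): 'db.example.internal:5432',
    ('AWS_PARAM', 'example-database-name'): 'xingu',
}

cache = {}

def resolve(text):
    """Replace each placeholder by its looked-up value, caching the result."""
    if text not in cache:
        cache[text] = PLACEHOLDER.sub(
            lambda match: FAKE_STORE[(match.group(1), match.group(2))],
            text,
        )
    return cache[text]

resolved = resolve('{%AWS_PARAM:example-url%}/{%AWS_PARAM:example-database-name%}')
print(resolved)
```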

## POC 7. Xingu Estimators in the Command Line

In staging and production environments, Xingu training is invoked in the command line. Inspect workflow files for [staging train](https://github.com/loft-br/robson_avm/blob/main/.github/workflows/build_and_train_staging.yml) and [staging hyper-param optimization](https://github.com/loft-br/robson_avm/blob/main/.github/workflows/build_and_hyperopt_staging.yml) to see how this is simply a matter of setting environment variables (like the ones in `config_my_xingu.py`) and running the `xingu` command as below. The `xingu` command source code is in [`/xingu/__main__.py`](https://github.com/loft-br/robson_avm/blob/main/robson/__main__.py).

### All Xingu features can be controlled in the command line; see them all here

```shell
python3 -m xingu -h
```
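
One common way to let the same setting come from either an environment variable or a command line option is to use the variable as the option's default. The sketch below illustrates that pattern with `argparse`. It is not Xingu's parser: the `DPS` variable name and the helper are assumptions, and only the option names (`--dps`, `--debug`, `--no-train`) are borrowed from the examples in this notebook.

```python
import argparse
import os

def build_parser(environ=os.environ):
    """Build a toy parser whose defaults come from environment variables."""
    parser = argparse.ArgumentParser(prog='example')
    parser.add_argument('--dps', default=environ.get('DPS', ''))
    parser.add_argument(
        '--debug',
        action='store_true',
        default=environ.get('DEBUG', 'false').lower() == 'true',
    )
    parser.add_argument('--no-train', dest='train', action='store_false', default=True)
    return parser

# Environment provides DPS and DEBUG; the command line only adds --no-train
fake_environ = {'DPS': 'datarisk_cartoes', 'DEBUG': 'true'}
args = build_parser(fake_environ).parse_args(['--no-train'])
print(args)
```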

### Train and Batch Predict 2 models in your laptop:

This is fully parallel. One model will execute post-train actions (batch predict, data and pickle saving, metrics etc) while another model is being trained.

```shell
python3 -m xingu \
    --xingu-db "sqlite:///xingu.db?check_same_thread=False" \
    --datalake-athena "awsathena+rest://athena.us-east-1.amazonaws.com:443/robson_valuation?work_group=mlops&compression=snappy" \
    --datalake-databricks "databricks+connector://token:dapi170fe70c366410b94bc76d2082ca01a3@dbc-da926df9-ab65.cloud.databricks.com/default?http_path=/sql/1.0/endpoints/b49aee71843b4d3e" \
    --query-cache-path data \
    --trained-models-path models \
    --debug \
    --project-home . \
    --dps datarisk_cartoes,leves
```

### Batch Predict only in Production environment

Note the `--no-train` parameter.

```shell
python3 -m xingu \
    --no-train \
    --xingu-db "postgresql+psycopg2://{%AWS_PARAM:robson-avm-production-user%}:{%AWS_SECRET:robson-avm-production-rds-secret%}@{%AWS_PARAM:robson-avm-production-url%}/{%AWS_PARAM:robson-avm-production-database-name%}" \
    --datalake-athena "awsathena+rest://athena.us-east-1.amazonaws.com:443/robson_valuation?work_group=mlops&compression=snappy" \
    --datalake-databricks "databricks+connector://token:dapi170fe70c366410b94bc76d2082ca01a3@dbc-da926df9-ab65.cloud.databricks.com/default?http_path=/sql/1.0/endpoints/b49aee71843b4d3e" \
    --query-cache-path data \
    --trained-models-path "s3://{%AWS_PARAM:xingu-production-bucket%}/trained-models" \
    --debug \
    --project-home . \
    --dps anuncios_rj,anuncios_bh,anuncios_sp,cartorios,anuncios_gru,vitrine
```

### Hyper-parameters optimization only

Notice how post-processing and batch predict are turned off and train parallelism is limited to one worker, to let Ray/SKOpt/Optimizer consume all CPUs

```shell
python3 -m xingu \
    --xingu-db "sqlite:///xingu.db?check_same_thread=False" \
    --datalake-athena "awsathena+rest://athena.us-east-1.amazonaws.com:443/robson_valuation?work_group=mlops&compression=snappy" \
    --datalake-databricks "databricks+connector://token:dapi170fe70c366410b94bc76d2082ca01a3@dbc-da926df9-ab65.cloud.databricks.com/default?http_path=/sql/1.0/endpoints/b49aee71843b4d3e" \
    --query-cache-path data \
    --trained-models-path models \
    --debug \
    --project-home . \
    --no-post-process \
    --no-batch-predict \
    --hyperopt-strategy self \
    --parallel-train-max-workers 1 \
    --dps cartorios,anuncios_scs,listings
```

### Control Parallelism

Explore these options to avoid over-subscribing and over-loading your CPU and RAM.

```shell
python3 -m xingu \
    --xingu-db "sqlite:///xingu.db?check_same_thread=False" \
    --datalake-athena "awsathena+rest://athena.us-east-1.amazonaws.com:443/robson_valuation?work_group=mlops&compression=snappy" \
    --datalake-databricks "databricks+connector://token:dapi170fe70c366410b94bc76d2082ca01a3@dbc-da926df9-ab65.cloud.databricks.com/default?http_path=/sql/1.0/endpoints/b49aee71843b4d3e" \
    --query-cache-path data \
    --trained-models-path models \
    --debug \
    --project-home . \
    --hyperopt-strategy self \
    --parallel-train-max-workers 3 \
    --parallel-hyperopt-max-workers 6 \
    --parallel-post-process-max-workers 3 \
    --parallel-estimators-max-workers 3
```



## POC 8. Deploy Robson Data and Estimators between environments
### Staging to Production
This is exactly what the [Deploy ⛔Production from ✅Staging GitHub Action](https://github.com/loft-br/robson_avm/actions/workflows/deploy_production_from_staging.yml) does.

```shell
python3 -m xingu.deploy \
    --source-xingu-db "postgresql+psycopg2://{%AWS_PARAM:robson-avm-staging-user%}:{%AWS_SECRET:robson-avm-staging-rds-secret%}@{%AWS_PARAM:robson-avm-staging-url%}/{%AWS_PARAM:robson-avm-staging-database-name%}" \
    --target-xingu-db "postgresql+psycopg2://{%AWS_PARAM:robson-avm-production-user%}:{%AWS_SECRET:robson-avm-production-rds-secret%}@{%AWS_PARAM:robson-avm-production-url%}/{%AWS_PARAM:robson-avm-production-database-name%}" \
    --source-trained-models-path "s3://{%AWS_PARAM:robson-avm-staging-bucket%}/trained-models" \
    --target-trained-models-path "s3://{%AWS_PARAM:robson-avm-production-bucket%}/trained-models" \
    --project-home . \
    --debug
```
### Build API Container with Production Estimators
Note how `--dps` is not being used, causing it to act on all DataProviders. Note the `--no-db` parameter, to not copy DB entries, because the production API doesn’t use the Xingu database.
```shell
git clone git@github.com:avibrazil/xingu.git;
cd xingu;
# Change to production branch
git checkout deploy-command;

python3 -m xingu.deploy \
    --source-xingu-db "postgresql+psycopg2://{%AWS_PARAM:robson-avm-production-user%}:{%AWS_SECRET:robson-avm-production-rds-secret%}@{%AWS_PARAM:robson-avm-production-url%}/{%AWS_PARAM:robson-avm-production-database-name%}" \
    --source-trained-models-path "s3://{%AWS_PARAM:robson-avm-production-bucket%}/trained-models" \
    --target-trained-models-path models \
    --project-home . \
    --no-db \
    --debug;
```

### Production to Laptop or SageMaker
```shell
git clone git@github.com:avibrazil/xingu.git;
cd xingu;
# Change to production branch
git checkout deploy-command;

python3 -m xingu.deploy \
    --dps anuncios_sp,vitrine_sp \
    --source-xingu-db "postgresql+psycopg2://{%AWS_PARAM:robson-avm-production-user%}:{%AWS_SECRET:robson-avm-production-rds-secret%}@{%AWS_PARAM:robson-avm-production-url%}/{%AWS_PARAM:robson-avm-production-database-name%}" \
    --target-xingu-db "sqlite:///robson.db?check_same_thread=False" \
    --source-trained-models-path "s3://{%AWS_PARAM:robson-avm-production-bucket%}/trained-models" \
    --target-trained-models-path models \
    --project-home . \
    --debug;
```

### Staging to Laptop or SageMaker
Manually edit `inventory.yaml` to correctly map desired `train_ids` to `dataprovider_ids`, and then:
```shell
python3 -m xingu.deploy \
    --dps anuncios_sp,anuncios_rj,anuncios_scs,cartorios \
    --source-xingu-db "postgresql+psycopg2://{%AWS_PARAM:robson-avm-staging-user%}:{%AWS_SECRET:robson-avm-staging-rds-secret%}@{%AWS_PARAM:robson-avm-staging-url%}/{%AWS_PARAM:robson-avm-staging-database-name%}" \
    --target-xingu-db "sqlite:///robson.db?check_same_thread=False" \
    --source-trained-models-path "s3://{%AWS_PARAM:robson-avm-staging-bucket%}/trained-models" \
    --target-trained-models-path models \
    --project-home . \
    --debug;
```

### Laptop or SageMaker to Staging (go to committee)
Your `inventory.yaml` has the `train_id` of an estimator that you just trained for a certain `dataprovider_id`.
```shell
python3 -m xingu.deploy \
    --dps vitrine \
    --source-xingu-db "sqlite:///robson.db?check_same_thread=False" \
    --target-xingu-db "postgresql+psycopg2://{%AWS_PARAM:robson-avm-staging-user%}:{%AWS_SECRET:robson-avm-staging-rds-secret%}@{%AWS_PARAM:robson-avm-staging-url%}/{%AWS_PARAM:robson-avm-staging-database-name%}" \
    --source-trained-models-path models \
    --target-trained-models-path "s3://{%AWS_PARAM:robson-avm-staging-bucket%}/trained-models" \
    --project-home . \
    --debug;
```

### Partial deployment or deployment failed?
Low RAM can hurt data extraction because `SELECT`s might return several million rows of data. The deploy command tries to transfer data in chunks of variable size, based on the detected RAM. If it fails, use the `--db-page-size` parameter with values as low as 200000. It will take longer but it won’t fail.

```shell
python3 -m xingu.deploy \
    ...
    --db-page-size 200000 \
    ...
```
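
The idea behind paging is to never hold the full result set in memory. A generic sketch of copying a table between two databases in fixed-size pages follows; it uses in-memory SQLite and a made-up `estimations` table, and it is not the deploy command's actual code.

```python
import sqlite3

def copy_in_pages(source, target, table, page_size):
    """Copy all rows of `table` from source to target, page_size rows at a time."""
    cursor = source.execute(f'select * from {table}')
    pages = 0
    while True:
        rows = cursor.fetchmany(page_size)  # at most page_size rows in memory
        if not rows:
            break
        placeholders = ','.join('?' * len(rows[0]))
        target.executemany(f'insert into {table} values ({placeholders})', rows)
        pages += 1
    target.commit()
    return pages

source = sqlite3.connect(':memory:')
target = sqlite3.connect(':memory:')
for con in (source, target):
    con.execute('create table estimations (unit_id integer, value real)')
source.executemany(
    'insert into estimations values (?, ?)',
    [(i, i * 1.5) for i in range(10)],
)

pages = copy_in_pages(source, target, 'estimations', page_size=4)
copied = target.execute('select count(*) from estimations').fetchone()[0]
print(pages, copied)
```

A smaller page size means more round trips and a longer run, but a lower peak memory use.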

## POC 9. Explain Estimations with Shapley Values
A Shapley value is a number assigned to each feature used as input to an estimation. It has 2 dimensions:

1. Strength, which tells how much this feature influenced the estimation
2. Sign, which tells if the feature influenced the estimation to go above (+) or below (-) the model average

The SHAP module extracts information from a trained model, computes Shapley values per feature per estimation and is capable of producing high impact graphics to help explain the forces that influenced that estimation or the model modus operandi.
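
To see strength and sign in numbers, here is an exact Shapley computation for a tiny hand-made model, done by brute force over all feature orderings. It is a teaching sketch under simplifying assumptions: the model, the feature names and the single `baseline` point (standing in for "the model average") are made up, and tree explainers use a much faster algorithm than this enumeration.

```python
from itertools import permutations
from math import factorial

def toy_model(x):
    """Made-up model with an interaction between the two features."""
    return 2 * x['area'] + 3 * x['rooms'] + x['area'] * x['rooms']

def shapley_values(model, instance, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    features = list(instance)
    totals = dict.fromkeys(features, 0.0)
    for order in permutations(features):
        current = dict(baseline)
        previous = model(current)
        for feature in order:
            current[feature] = instance[feature]  # switch feature to its real value
            now = model(current)
            totals[feature] += now - previous
            previous = now
    count = factorial(len(features))
    return {feature: total / count for feature, total in totals.items()}

instance = {'area': 3.0, 'rooms': 0.0}
baseline = {'area': 1.0, 'rooms': 1.0}

phi = shapley_values(toy_model, instance, baseline)
base_value = toy_model(baseline)

print(phi)
print(base_value + sum(phi.values()), toy_model(instance))
```

Here `area` pushes the estimation up and `rooms` pushes it down with the same strength, and the base value plus all Shapley values adds up to the estimation for the instance.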

To start, import high performance modules capable of using multiple cores (the original shap module is inefficient in this regard).

In [None]:
import fasttreeshap

Load estimator and its pre-reqs

In [None]:
dp='vitrine_sp'

data_providers=[dp]
dpf=DataProviderFactory(providers_list=data_providers)
coach=Coach(dpf)
coach.team_load()
model=coach.trained[dp]

Get some data

In [None]:
city='São Paulo'

# `table` is a placeholder table name; the city is text, so it must be quoted in SQL
query=f"select * from table where city='{city}'"

df=pd.read_sql_query(query,con=coach.get_db_connection('datalake_athena'))

df

Do a simple data cleanup, compute features with the pre-req model and estimate prices for all rows

In [None]:
org=df
# df=org

In [None]:
df=(
    # Estimate with pre-req model
    model.dp.feature_engineering_for_predict(
        # clean it first
        df.dropna(
            subset=model.dp.get_features_list()
        )
    )
)

# Estimate prices and errors
df=(
    pd.concat([df,model.predict(df)], axis=1)
    .assign(
        estimation  =lambda table: numpy.exp(table['loc']),
        error_abs_1p=lambda table: table.estimation-table.last_transaction_value_1p_per_meter,
        error_abs_3p=lambda table: table.estimation-table.last_transaction_value_3p_per_meter,
        error_pct_1p=lambda table: (table.estimation-table.last_transaction_value_1p_per_meter)/table.last_transaction_value_1p_per_meter,
        error_pct_3p=lambda table: (table.estimation-table.last_transaction_value_3p_per_meter)/table.last_transaction_value_3p_per_meter
    )
)

df

Distribution of error

In [None]:
(
    df
    .dropna(subset=['last_transaction_value_1p_per_meter'])
    .query("last_transaction_value_1p!=0")
    .error_pct_1p
    .plot
    .hist(bins=80,figsize=(20, 6))
)

In [None]:
(
    df
    .dropna(subset=['last_transaction_value_3p_per_meter'])
    .query("last_transaction_value_3p!=0")
    .error_pct_3p
    .plot
    .hist(bins=200, figsize=(20, 6), range=(-1,1))
)

### Compute and visualize Shapley values
Get a fast Shapley explainer. `model_output=0` tells NGBoost to work with `loc`; with `model_output=1` it works with `scale`.

In [None]:
shap_explainer = fasttreeshap.TreeExplainer(
    model                   = model.estimator.bagging_members[0],
#     data                    = df[model.dp.get_estimator_features_list()],
#     feature_perturbation    ="interventional",
    model_output    = 0
)

# Expected value is usually close to the mean of a Y_pred computed in the train process
shap_explainer.expected_value

Compute Shapley values for a few `unit_id`s

In [None]:
unit_id=[133759,5850832]
df[df.unit_id.isin(unit_id)]

In [None]:
df[df.unit_id.isin(unit_id)][model.dp.get_estimator_features_list()]

In [None]:
shap_values=shap_explainer.shap_values(df[df.unit_id.isin(unit_id)][model.dp.get_estimator_features_list()])
shap_values

Same thing, but now inspect values in a more semantic way

In [None]:
s=pd.DataFrame(
    data     = shap_explainer.shap_values(df[df.unit_id.isin(unit_id)][model.dp.get_estimator_features_list()]),
    columns  = model.dp.get_estimator_features_list(),
    index    = df[df.unit_id.isin(unit_id)].unit_id
)

s

### Visualize Explanations for 1 Datapoint

In [None]:
fasttreeshap.initjs()

* Red features push the target above the base value
* Blue features push target below base value

In [None]:
fasttreeshap.force_plot(
    base_value     = shap_explainer.expected_value,
    shap_values    = s.iloc[[1]].to_numpy(),
    features       = df[df.unit_id==s.iloc[[1]].index[0]][model.dp.get_estimator_features_list()]
)

### Same, but for pre-req estimator

In [None]:
model_prereq=model.dp.pre_req_model['anuncios_sp']

prereq_shap_explainer = fasttreeshap.TreeExplainer(
    model=model_prereq.estimator.bagging_members[0],
    model_output=0
)

s=pd.DataFrame(
    data=prereq_shap_explainer.shap_values(df[df.unit_id.isin(unit_id)][model_prereq.dp.get_estimator_features_list()]),
    columns=model_prereq.dp.get_estimator_features_list(),
    index=df[df.unit_id.isin(unit_id)].unit_id
)

fasttreeshap.force_plot(
    base_value=prereq_shap_explainer.expected_value,
    shap_values=s.iloc[[0]].to_numpy(),
    features=df[df.unit_id==s.iloc[[0]].index[0]][model_prereq.dp.get_estimator_features_list()]
)

### Visualize Explanations for Many Datapoints

In [None]:
%%time

size=2000
seed=40

s=pd.DataFrame(
    data         = shap_explainer.shap_values(df.sample(size, random_state=seed)[model.dp.get_estimator_features_list()]),
    columns      = model.dp.get_estimator_features_list(),
    index        = pd.Index(df.sample(size, random_state=seed).unit_id,name='unit_id')
)

fasttreeshap.force_plot(
    base_value   = shap_explainer.expected_value,
    shap_values  = s.to_numpy(),
    features     = df.sample(size, random_state=seed)[model.dp.get_estimator_features_list()]
)

### Visualize Global Model Feature Importance

In [None]:
# import shap
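# Pass feature values (not SHAP values) as `features`, so that colors reflect feature values
fasttreeshap.summary_plot(s.to_numpy(), df.sample(size, random_state=seed)[model.dp.get_estimator_features_list()], plot_type='violin')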

* Features/columns are ordered from most to least important, top to bottom
* SHAP value indicates importance of the feature to force target up (+) or down (-)
* Colors indicate value of feature, having blue as low value and red as high value

### Numerical Feature Importance from Shapley

Feature importance for the sample dataset above

In [None]:
(
    pd.DataFrame(s, columns=model.dp.get_estimator_features_list())
    .abs()
    .mean()
    .sort_values(ascending=False)
)
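
The reason for `.abs()` before `.mean()` is that positive and negative Shapley values of the same feature would otherwise cancel out. A toy example with made-up values shows the effect:

```python
import pandas as pd

# Made-up Shapley values for 3 units and 3 features
s = pd.DataFrame(
    {
        'area':  [0.30, -0.50, 0.10],
        'rooms': [-0.05, 0.10, 0.00],
        'age':   [0.20, 0.20, -0.20],
    },
    index=pd.Index([101, 102, 103], name='unit_id'),
)

# Mean absolute Shapley value per feature, most important first
importance = s.abs().mean().sort_values(ascending=False)

# Plain mean, for comparison: signs cancel and `area` looks unimportant
plain_mean = s.mean()

print(importance)
print(plain_mean)
```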

### Analyze Estimations where Error is higher than 30%

In [None]:
asample=(
    df
    .dropna(subset=['last_transaction_value_3p_per_meter'])
    .query("error_pct_3p>=0.3")
)

s=pd.DataFrame(
    data         = shap_explainer.shap_values(asample[model.dp.get_estimator_features_list()]),
    columns      = model.dp.get_estimator_features_list(),
    index        = pd.Index(asample.unit_id,name='unit_id')
)

fasttreeshap.force_plot(
    base_value   = shap_explainer.expected_value,
    shap_values  = s.to_numpy(),
    features     = asample[model.dp.get_estimator_features_list()]
)

In [None]:
s

In [None]:
def explain_unit(unit_id):
    print(f'Shap base value: R${numpy.exp(shap_explainer.expected_value)[0]:.2f}')
    print(f'Estimation:      R${df.query("unit_id == @unit_id")["estimation"].values[0]:.2f}')
    print(f'Sold:            R${df.query("unit_id == @unit_id")["last_transaction_value_3p_per_meter"].values[0]:.2f}')
    print(f'Error%:            {100*df.query("unit_id == @unit_id")["error_pct_3p"].values[0]:.2f}%')

    return fasttreeshap.force_plot(
        base_value=shap_explainer.expected_value,
        shap_values=s.loc[unit_id].to_numpy(),
        features=df[df.unit_id==s.loc[[unit_id]].index[0]][model.dp.get_estimator_features_list()]
    )

In [None]:
unit_id=2886639

explain_unit(unit_id)