# Running TalkToModel On Your Own Model & Dataset!

In this tutorial, we describe how to run TalkToModel (TTM) on your own model and dataset. For the sake of this tutorial, I'm going to set up TTM on a [Dermatology Dataset](https://datahub.io/machine-learning/dermatology) and train a sklearn random forest classifier, though it's assumed you're bringing your own model and dataset.

Your model must
- be saved in a `.pkl` file that can be loaded with `pkl.load` and support both `.predict(X)` and `.predict_proba(X)`, in the same style as sklearn

Your dataset must
- be saved in `.csv` files that can be read with `pd.read_csv(your_data_set_location)`, where one of the columns is the target variable. This column name can be specified in the configuration (see `ExplainBot.target_variable_name` in the config further down). The configuration also supports passing an `index_col` argument to `read_csv` to specify an index column in the data.
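
As a quick sanity check of the two requirements above, something like the following can be run before wiring anything into TTM. It is only a sketch: `missing_model_methods`, `check_dataset`, and `ConstantModel` are hypothetical names made up for this illustration and are not part of TTM, and the tiny in-memory CSV stands in for your own file.

```python
import io

import pandas as pd


def missing_model_methods(model, required=("predict", "predict_proba")):
    """Return the names of required methods that `model` does not provide."""
    return [name for name in required if not callable(getattr(model, name, None))]


def check_dataset(df, target_column):
    """Raise a ValueError if the target column is absent from the DataFrame."""
    if target_column not in df.columns:
        raise ValueError(f"target column {target_column!r} not found in {list(df.columns)}")
    return df


# Hypothetical stand-in for a user model; any object with these two methods qualifies.
class ConstantModel:
    def predict(self, X):
        return [0 for _ in X]

    def predict_proba(self, X):
        return [[1.0, 0.0] for _ in X]


# Tiny in-memory CSV standing in for your_data_set_location.
csv_text = "id,feature_a,feature_b,y\n0,1.0,2.0,0\n1,3.0,4.0,1\n"
df = check_dataset(pd.read_csv(io.StringIO(csv_text), index_col=0), target_column="y")

print(missing_model_methods(ConstantModel()))   # []
print(missing_model_methods(object()))          # ['predict', 'predict_proba']
print(df.shape)                                 # (2, 3)
```

For a real model, you would pass the object returned by `pkl.load` to `missing_model_methods` and expect an empty list back.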

In [1]:
import pandas as pd
import numpy as np
np.random.seed(0)

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [3]:
# Loading + splitting the data
data = pd.read_csv("./data/derm.csv", index_col=None)
y = data.pop('class')

# Class labels that do not run from 0 to N-1 can trigger bugs in some of the explanation packages we use.
# I suggest shifting them to start at 0. We developed this project mostly on binary classification tasks, so that setting is better tested.
print(y.min(), y.max())
y -= 1
print(y.min(), y.max())

1 6
0 5
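
Subtracting 1 works here because the labels are already consecutive. If your labels have gaps or are not integers, a more general remapping may be needed. The following is a minimal sketch with toy labels; the variable names are illustrative only. It also keeps the inverse mapping, which is handy when writing the class names into the config later.

```python
import pandas as pd

# Toy labels on a 1..6 scale with gaps, standing in for the `class` column.
y = pd.Series([1, 3, 6, 3, 1, 2])

# Sorted unique labels -> consecutive integers starting at 0.
original_labels = sorted(int(value) for value in y.unique())
label_to_index = {label: index for index, label in enumerate(original_labels)}
y_zero_based = y.map(label_to_index)

# Keep the inverse mapping so class names can be attached to the new ids later.
index_to_label = {index: label for label, index in label_to_index.items()}

print(y_zero_based.tolist())   # [0, 2, 3, 2, 0, 1]
print(index_to_label)          # {0: 1, 1: 2, 2: 3, 3: 6}
```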


In [4]:
X_train, X_test, y_train, y_test = train_test_split(data, y)
X_train = X_train.fillna(X_train.mean())
X_test = X_test.fillna(X_train.mean())

In [5]:
# fitting the model
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Model
rf_pipeline = Pipeline([('scaler', StandardScaler()),
                        ('rf', RandomForestClassifier())])
rf_pipeline.fit(X_train.values, y_train.values)
print(f"Score: {rf_pipeline.score(X_test.values, y_test.values)}")

Score: 0.9565217391304348


Next, we'll save the dataset with the target as 'y' and the model in a .pkl.

In [6]:
import pickle as pkl
X_train['y'] = y_train
X_train.to_csv("./data/background_derm.csv")
X_test['y'] = y_test
X_test.to_csv("./data/dataset_derm.csv")
with open("./data/derm_model.pkl", "wb") as f:
    pkl.dump(rf_pipeline, f)
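
Note that `to_csv` writes the DataFrame index as the first column, which is why the config below sets `ExplainBot.dataset_index_column = 0`. The following sketch, using a toy frame rather than the dermatology data, shows what happens when the file is read back with and without `index_col=0`.

```python
import io

import pandas as pd

# Toy frame with a non-default index, standing in for X_test with its 'y' column.
frame = pd.DataFrame({"feature_a": [0.5, 1.5], "y": [0, 1]}, index=[17, 42])

# to_csv writes the index as an unnamed first column by default.
buffer = io.StringIO()
frame.to_csv(buffer)
csv_text = buffer.getvalue()

# Without index_col, the old index comes back as a feature named 'Unnamed: 0'.
naive = pd.read_csv(io.StringIO(csv_text))
# With index_col=0, the first column is restored as the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))        # ['Unnamed: 0', 'feature_a', 'y']
print(list(restored.columns))     # ['feature_a', 'y']
print(restored.index.tolist())    # [17, 42]
```

If you would rather not carry the index at all, saving with `to_csv(path, index=False)` and leaving the index column unset in the config is another option, assuming your rows do not need stable ids.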

## Using Few-Shot Parsing Model

Next, we'll write a gin configuration file for the model and dataset. This can be done by copying `diabetes-config.gin` and changing the paths to the model and data splits.

Because we don't have a fine-tuned parsing model (yet) for this dataset, I'm using a few-shot model. It builds a set of prompts for your dataset and uses them to prompt a GPT-style model, few-shot, to do the parsing task. This approach is quicker to get started with but performs worse than fine-tuning. You could also try one of the diabetes, german, or compas fine-tuned models; these work OK even though the dataset is different, thanks to the guided-decoding strategy used for parsing (see the [paper](https://arxiv.org/abs/2207.04154) for more details).
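
To give a rough feel for few-shot prompting with similarity-based example selection, here is a conceptual sketch. It is not TTM's implementation: the function names, the `input:`/`parsed:` prompt layout, the placeholder parses, and the hand-written two-dimensional "embeddings" are all assumptions made for illustration. A real setup would embed utterances with a sentence encoder and send the prompt to a language model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_few_shot_prompt(query, query_vec, examples, k=2):
    """Pick the k examples closest to the query and format them as a prompt.

    `examples` is a list of (utterance, parse, embedding) triples.
    """
    ranked = sorted(examples, key=lambda ex: cosine_similarity(query_vec, ex[2]))
    chosen = ranked[-k:]  # ascending similarity: the closest example sits last
    blocks = [f"input: {utt}\nparsed: {parse}" for utt, parse, _ in chosen]
    blocks.append(f"input: {query}\nparsed:")
    return "\n\n".join(blocks)


# Hand-written toy embeddings and placeholder parses (hypothetical values).
examples = [
    ("show me the data", "<parse A>", [1.0, 0.0]),
    ("why did you predict this", "<parse B>", [0.0, 1.0]),
    ("explain the prediction", "<parse C>", [0.2, 0.9]),
]
prompt = build_few_shot_prompt("why this prediction", [0.1, 1.0], examples, k=2)
print(prompt)
```

The least similar example ("show me the data") is dropped, and the prompt ends with the unanswered query so the language model completes the parse.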

Here is the modified config (remove the surrounding Python triple quotes `"""` before using):

In [7]:
"""
##########################################
# The new dermatology dataset conversation config
##########################################

# for few shot, e.g., "EleutherAI/gpt-neo-2.7B"
ExplainBot.parsing_model_name = "EleutherAI/gpt-neo-2.7B"

# Set skip_prompts to True for quicker startup with fine-tuned models
# Make sure to set it to False when using few-shot models
ExplainBot.skip_prompts = False

# t5 configuration file
ExplainBot.t5_config = "./parsing/t5/gin_configs/t5-large.gin"

# User provided prediction model file path
ExplainBot.model_file_path = "./tutorials/data/derm_model.pkl"

# Seed
ExplainBot.seed = 0

# The dataset to run the conversation on
ExplainBot.dataset_file_path = "./tutorials/data/dataset_derm.csv"

# The background dataset for the conversation
ExplainBot.background_dataset_file_path = "./tutorials/data/background_derm.csv"
ExplainBot.name = "dermatology"

# Dataset feature information
ExplainBot.dataset_index_column = 0
ExplainBot.target_variable_name = "y"
ExplainBot.categorical_features = None
ExplainBot.numerical_features = None
ExplainBot.remove_underscores = True

# Few-shot settings
ExplainBot.prompt_metric = "cosine"
ExplainBot.prompt_ordering = "ascending"

# Prompt params
Prompts.prompt_cache_size = 1_000_000
Prompts.prompt_cache_location = "./cache/dermatology-prompts.pkl"
Prompts.max_values_per_feature = 2
Prompts.sentence_transformer_model_name = "all-mpnet-base-v2"
Prompts.prompt_folder = "./explain/prompts"
Prompts.num_per_knn_prompt_template = 1
Prompts.num_prompt_template = 7

# Explanation Params
Explanation.max_cache_size = 1_000_000

# MegaExplainer Params
MegaExplainer.cache_location = "./cache/dermatology-explainer.pkl"
MegaExplainer.use_selection = False

# Tabular Dice Params
TabularDice.cache_location = "./cache/dermatology-dice-tabular.pkl"

# Conversation params
Conversation.class_names = {0: "psoriasis", 1: "seborrheic dermatitis", 2: "lichen planus", 3: "pityriasis rosea", 4: "chronic dermatitis", 5: "pityriasis rubra pilaris"}

# Dataset description
DatasetDescription.dataset_objective = "predict whether someone has certain types of skin conditions"
DatasetDescription.dataset_description = "dermatology prediction"
DatasetDescription.model_description = "random forest"

# Feature definitions
ExplainBot.feature_definitions = None
"""

We'd place this in the `./configs` directory as `./configs/derm-config.gin`. Finally, we can set the Flask app to run this model and dataset by setting
```
# Model + dataset configuration specific file
GlobalArgs.config = "./configs/derm-config.gin"
```
in the global configuration file `./global_config.gin`.
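
The notebook above saved its files under `./data` relative to where it ran, while the config refers to them relative to the repository root (for example `./tutorials/data/derm_model.pkl`), so it can be worth confirming that the paths resolve before launching. The helper below is a hypothetical sketch, not part of TTM: it scans config text with a regular expression rather than using gin itself, and it skips bindings whose name contains `cache` on the assumption that cache files are created on first run.

```python
import os
import re

# Matches gin bindings of the form  Name.attr = "./some/path.ext"
PATH_BINDING = re.compile(r'^\s*([\w.]+)\s*=\s*"(\.{1,2}/[^"]+)"\s*$')


def find_missing_paths(config_text, root="."):
    """Return (binding, path) pairs whose relative path does not exist under root."""
    missing = []
    for line in config_text.splitlines():
        match = PATH_BINDING.match(line)
        if match is None:
            continue
        binding, path = match.groups()
        # Assumption: cache files are created on first run, so skip them.
        if "cache" in binding:
            continue
        if not os.path.exists(os.path.join(root, path)):
            missing.append((binding, path))
    return missing


sample = '''
ExplainBot.model_file_path = "./definitely/not/here/model.pkl"
ExplainBot.name = "dermatology"
MegaExplainer.cache_location = "./cache/dermatology-explainer.pkl"
'''
print(find_missing_paths(sample))
```

Run from the repository root, you would pass the contents of `./configs/derm-config.gin` instead of `sample`; an empty list means every non-cache path was found.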

After that, we should be good to go! We can run
```shell
python flask_app.py
```
to start the application.