In [1]:
import io
import os
import pathlib
import zipfile

import category_encoders
import numpy as np
import pandas as pd
import requests
import sklearn.metrics
import sklearn.model_selection
import sklearn.neighbors
import sklearn.pipeline
import sklearn.preprocessing

In [2]:
%load_ext autoreload
%autoreload 2

# Intro
Welcome to the Fiddler notebook experience! This notebook will demonstrate how to effectively get started with the Fiddler platform by uploading your models and data. The notebook is organized into two sections:
1. Loading data and building a scikit-learn model
2. Uploading your data and model to the Fiddler platform

Section 1 does not use any Fiddler code, so if you are familiar with Pandas and scikit-learn, you should feel comfortable skimming through and jumping into Section 2.

# Section 1: Loading data and building a model

### Working with Data
Being an effective data scientist involves using the right tool for the job. When it comes to importing, cleaning, and exploring your data in Jupyter, we don't want to interrupt your normal workflow, so we integrate our tools with the popular Pandas DataFrame object. As long as your data can be loaded into a DataFrame, there is nothing else you need to do to get it ready to upload to Fiddler.
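
As a minimal sketch of what "loaded into a DataFrame" can look like, the snippet below builds a frame from plain Python records and declares the categorical and boolean columns explicitly. The records are made up for illustration; only the column names are borrowed from the bikeshare data used later in this notebook.

```python
import pandas as pd

# Hypothetical records; the column names mirror the bikeshare data used below.
records = [
    {"season": 1, "holiday": 0, "temp": 0.24, "cnt": 16},
    {"season": 1, "holiday": 0, "temp": 0.22, "cnt": 40},
    {"season": 2, "holiday": 1, "temp": 0.56, "cnt": 32},
]

frame = pd.DataFrame.from_records(records)

# Declare categorical and boolean columns explicitly so the dtypes carry
# the intended meaning instead of defaulting to int64.
frame = frame.astype({"season": "category", "holiday": "bool"})

print(frame.dtypes)
```

Setting dtypes up front serves the same purpose as the `dtype=` argument passed to `pd.read_csv` in the next cell.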

### Downloading the UCI bikeshare dataset

In [3]:
zip_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip'
z = zipfile.ZipFile(io.BytesIO(requests.get(zip_url).content))

# here we pre-configure the datatypes for our dataframe
# so it doesn't require any datatype modification after import
bikeshare_dtypes = dict(season='category', holiday='bool',
                        workingday='bool', weathersit='category')
bikeshare_datetime_columns = ['dteday']
bikeshare_index_column = 'instant'
with z.open('hour.csv') as csv:
    df = pd.read_csv(csv, 
                     dtype=bikeshare_dtypes, 
                     parse_dates=bikeshare_datetime_columns,
                     index_col=bikeshare_index_column)

# split train/test by year
is_2011 = df['yr'] == 0
df_2011 = df[is_2011].reset_index(drop=True)
df_2012 = df[~is_2011].reset_index(drop=True)

# peek at the data
display(df.sample(3, random_state=0).T)

# print info about train-test split
print(f'Train set (bikeshare rentals in 2011) has {df_2011.shape[0]} rows,'
      f' test set (bikeshare rentals in 2012) has {df_2012.shape[0]} rows')

instant,3440,6543,15471
dteday,2011-05-28 00:00:00,2011-10-05 00:00:00,2012-10-11 00:00:00
season,2,4,4
yr,0,0,1
mnth,5,10,10
hr,5,4,19
holiday,False,False,False
weekday,6,3,4
workingday,False,True,True
weathersit,1,1,1
temp,0.56,0.44,0.44


Train set (bikeshare rentals in 2011) has 8645 rows, test set (bikeshare rentals in 2012) has 8734 rows


### Building a model
Just like with data work, we believe in the right tools for the job. We currently integrate tightly with models supporting the `sklearn` API, including non-`sklearn` packages that support the `sklearn` API, like `xgboost` and `LightGBM`. Since encoding categorical variables can be a pain in `sklearn`, we also support the `category_encoders` package. Please note that if you introduce any custom classes or transformation functions into your modeling, it may become difficult to get your models running in Fiddler. We therefore recommend using the `Transformer` objects provided by `sklearn` (and the `category_encoders` package) and combining preprocessing and inference steps using the `sklearn` `Pipeline` API.
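
To illustrate the "no custom code" recommendation, here is a hedged sketch of a pipeline built only from stock `sklearn` objects. It swaps in `sklearn`'s own `OneHotEncoder` inside a `ColumnTransformer` where this notebook uses `category_encoders`, and it runs on synthetic data. The names `toy`, `toy_target`, and `toy_model` are illustrative, and this notebook does not confirm whether a given Fiddler version accepts `ColumnTransformer`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the bikeshare features (not real data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "season": rng.integers(1, 5, size=40),
    "temp": rng.random(40),
    "hum": rng.random(40),
})
toy_target = rng.integers(0, 100, size=40)

# One-hot encode the categorical column and pass the numeric columns through.
# sparse_threshold=0 forces a dense result so StandardScaler can center it.
encode = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["season"])],
    remainder="passthrough",
    sparse_threshold=0,
)

# Preprocessing and inference live in one object, with no custom classes.
toy_model = make_pipeline(encode, StandardScaler(), KNeighborsRegressor(n_neighbors=5))
toy_model.fit(toy, toy_target)

toy_predictions = toy_model.predict(toy.head(3))
print(toy_predictions)
```

Because every step is a library-provided transformer or estimator, the fitted pipeline can be serialized and run elsewhere without shipping extra source files.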

In [4]:
# specify which columns are features and which are not
target = 'cnt'
not_used_as_features = ['dteday', 'yr', 'casual', 'registered']
non_feature_columns = [target] + not_used_as_features
# keep the original column order (a set difference would scramble it)
feature_columns = [col for col in df_2011.columns if col not in non_feature_columns]

# split our data into features and targets
x_train = df_2011.drop(columns=non_feature_columns)
x_test = df_2012.drop(columns=non_feature_columns)
y_train = df_2011[target]
y_test = df_2012[target]

In [5]:
# modeling approach: 
# 1) onehot encode categorical variables
# 2) standard scale all variables
# 3) fit a distance-weighted k-Nearest-Neighbors model with k=10 and l1 (Manhattan) distance as the metric
onehot = category_encoders.OneHotEncoder(cols=df.select_dtypes('category').columns.tolist())
standard_scaler = sklearn.preprocessing.StandardScaler()
knn = sklearn.neighbors.KNeighborsRegressor(
    n_neighbors=10, 
    weights='distance', metric='l1',
    n_jobs=-1)
model = sklearn.pipeline.make_pipeline(onehot, standard_scaler, knn)

In [6]:
# fit the model
model.fit(x_train, y_train)

# score the model
train_r2 = sklearn.metrics.r2_score(y_train, model.predict(x_train))
test_r2 = sklearn.metrics.r2_score(y_test, model.predict(x_test))
print(f'r2 scores: {train_r2:.2f} Train | {test_r2:.2f} Test')

r2 scores: 1.00 Train | 0.38 Test
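
A training r2 of 1.00 is largely an artifact of `weights='distance'`: each training row is its own zero-distance neighbor, so the model reproduces the training targets. The sketch below shows the effect on synthetic data (an assumption, not the bikeshare data) and uses `sklearn.model_selection.cross_val_score` as one way to get a less optimistic estimate. It spells the metric as `'manhattan'`, which is the same distance as `'l1'`.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem with heavy noise (illustrative only).
rng = np.random.default_rng(0)
features = rng.random((200, 3))
noisy_target = features[:, 0] + rng.normal(scale=0.5, size=200)

weighted_knn = KNeighborsRegressor(n_neighbors=10, weights="distance", metric="manhattan")
weighted_knn.fit(features, noisy_target)

# Scoring on the rows the model memorized looks perfect...
in_sample_r2 = r2_score(noisy_target, weighted_knn.predict(features))

# ...while held-out folds reveal how much of that is memorization.
cv_r2 = cross_val_score(weighted_knn, features, noisy_target, cv=5, scoring="r2").mean()

print(f"in-sample r2: {in_sample_r2:.2f} | 5-fold CV r2: {cv_r2:.2f}")
```

For this notebook, the 2012 test score is the number to read; the training score says little about fit quality.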


# Section 2: Uploading to Fiddler
Up until now, we haven't done anything Fiddler-specific. Now we'll go ahead and change that. Let's begin by importing the Fiddler package.

In [7]:
import fiddler as fdl

## Before you start: set up your API connection

### Launch onebox or authenticate with a remote server
Before you can start working with a Fiddler-integrated Jupyter environment, you should set up access to a running instance of Fiddler.

#### Onebox
In onebox, this means running the `start.sh` script to launch onebox locally.

#### Cloud
For the cloud version of our product, this means looking up your authentication token in the [Fiddler settings dashboard](https://app.fiddler.ai/settings/credentials).

### Create a FiddlerApi object

To get your data and models into the Fiddler Engine, you'll need to connect using the API. The `FiddlerApi` object handles most of the nitty-gritty for you, so all you have to do is specify some details about the Fiddler system you're connecting to.

In [8]:
# NOTE: the API URL for your running instance of Fiddler is typically
# "https://api.fiddler.ai" (or "http://localhost:4100" for onebox).
# Use "http://host.docker.internal:4100" instead if Jupyter is running in a
# Docker VM on the same macOS machine as onebox.
url = 'http://host.docker.internal:4100'

# see <Fiddler URL>/settings/credentials to find, create, or change this token
token = os.getenv('FIDDLER_API_TOKEN')

# see <Fiddler URL>/settings/general to find this id (listed as "Organization Name")
org_id = 'onebox'

fiddler_api = fdl.FiddlerApi(url=url, org_id=org_id, auth_token=token)
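
Note that `os.getenv` returns `None` when the variable is unset, so a missing `FIDDLER_API_TOKEN` would only surface later as an authentication problem. A small guard like the one sketched below fails earlier with a clearer message. The helper `read_required_env` and the variable `EXAMPLE_API_TOKEN` are hypothetical names used for illustration and are not part of the `fiddler` package.

```python
import os


def read_required_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before "
            "starting Jupyter, or assign the token directly in the notebook."
        )
    return value


# Demonstration with a hypothetical variable, so no real credentials are touched.
os.environ["EXAMPLE_API_TOKEN"] = "not-a-real-token"
example_token = read_required_env("EXAMPLE_API_TOKEN")
print(example_token)
```

In this notebook the same check could be applied with `read_required_env('FIDDLER_API_TOKEN')` before constructing the `FiddlerApi` object.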

### Dataset Upload
Now that we have our dataset in working order, let's upload it to the Fiddler platform. As mentioned above, Fiddler integrates directly with Pandas DataFrames to make this a snap.

In [9]:
fiddler_api.list_datasets()

['imdb_rnn', 'iris', 'bank_churn', '20news', 'p2p_loans', 'winequality']

In [10]:
# pass the train and test DataFrames to the FiddlerApi,
# keyed by split name, to perform an upload
upload_result = fiddler_api.upload_dataset(
    dataset={'train': df_2011, 'test': df_2012}, 
    dataset_id='bikeshare')
upload_result

Heads up! We are inferring the details of your dataset from the dataframe(s) provided. Please take a second to check our work.

If the following DatasetInfo is an incorrect representation of your data, you can construct a DatasetInfo with the DatasetInfo.from_dataframe() method and modify that object to reflect the correct details of your dataset.

After constructing a corrected DatasetInfo, please re-upload your dataset with that DatasetInfo object explicitly passed via the `info` parameter of FiddlerApi.upload_dataset().

You may need to delete the initially uploaded versionvia FiddlerApi.delete_dataset('bikeshare').

Inferred DatasetInfo to check:
  DatasetInfo:
    display_name: bikeshare
    files: []
    columns:
              column     dtype count(possible_values)
      0       dteday    STRING                      -
      1       season  CATEGORY                      4
      2           yr   INTEGER                      -
      3         mnth   INTEGER                      -
 

{'row_count': 17379,
 'col_count': 16,
 'log': ['Importing dataset bikeshare',
  'Found old data. Deleting it',
  'Creating table for bikeshare',
  'Importing data file: test.csv',
  'Importing data file: train.csv']}

In [11]:
fiddler_api.delete_dataset('bikeshare')

'Dataset deleted bikeshare'

In [12]:
# after deletion, the 'bikeshare' dataset no longer shows up in the list of all datasets
fiddler_api.list_datasets()

['imdb_rnn', 'iris', 'bank_churn', '20news', 'p2p_loans', 'winequality']

In [13]:
# Upload example with custom DatasetInfo
bikeshare_info = fdl.DatasetInfo.from_dataframe(df_2011, display_name='Bikeshare Dataset')
bikeshare_info['weathersit'].possible_values.extend([123, 456, 789])
print('We customized the DatasetInfo for this dataset '
      'with a custom display_name and more `weathersit` possible-values.')
print(bikeshare_info)

# upload
upload_result = fiddler_api.upload_dataset(
    dataset={'train': df_2011, 'test': df_2012},
    dataset_id='bikeshare',
    info=bikeshare_info
)
upload_result

We customized the DatasetInfo for this dataset with a custom display_name and more `weathersit` possible-values.
DatasetInfo:
  display_name: Bikeshare Dataset
  files: []
  columns:
            column     dtype count(possible_values)
    0       dteday    STRING                      -
    1       season  CATEGORY                      4
    2           yr   INTEGER                      -
    3         mnth   INTEGER                      -
    4           hr   INTEGER                      -
    5      holiday   BOOLEAN                      -
    6      weekday   INTEGER                      -
    7   workingday   BOOLEAN                      -
    8   weathersit  CATEGORY                      7
    9         temp     FLOAT                      -
    10       atemp     FLOAT                      -
    11         hum     FLOAT                      -
    12   windspeed     FLOAT                      -
    13      casual   INTEGER                      -
    14  registered   INTEGER         

{'row_count': 17379,
 'col_count': 16,
 'log': ['Importing dataset bikeshare',
  'Creating table for bikeshare',
  'Importing data file: test.csv',
  'Importing data file: train.csv']}

In [14]:
# we see that the 'bikeshare' dataset now shows up in the list of all datasets
fiddler_api.list_datasets()

['imdb_rnn',
 'iris',
 'bank_churn',
 '20news',
 'p2p_loans',
 'winequality',
 'bikeshare']

### Accessing the data on Fiddler
We can also verify everything worked by looking at the web UI:
- http://localhost:4100/datasets

(or if you used cloud instead of onebox)
- https://app.fiddler.ai/datasets

### Model Upload
We currently support the upload of scikit-learn models directly through the `fiddler` package. While custom code is tricky to deploy to Fiddler, we support a number of additional packages beyond `sklearn` that enable the deployment of powerful black-box models. These include:
1. `xgboost` (as long as the scikit-learn API is used)
2. `lightgbm` (as long as the scikit-learn API is used)
3. `category_encoders`

For best explainability results, we recommend organizing your modeling pipeline using the scikit-learn `Pipeline` API so that your feature transformations are integrated with your model. This is because pre-transforming your data can have a negative effect on explanation interpretability.
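
The interpretability point can be seen in a small sketch: a bare model fit on pre-scaled data only understands z-scores, while a pipeline accepts inputs in their original units and makes the same predictions. The data below is synthetic and the temperature column is a hypothetical feature in degrees Celsius.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one feature in physical units (illustrative only).
rng = np.random.default_rng(0)
raw_temp = rng.uniform(0, 40, size=(50, 1))
rentals = 5 * raw_temp[:, 0] + rng.normal(size=50)

# Option 1: pre-transform, then fit the bare model on z-scores.
scaler = StandardScaler().fit(raw_temp)
bare_knn = KNeighborsRegressor(n_neighbors=5).fit(scaler.transform(raw_temp), rentals)

# Option 2: the pipeline owns the scaler, so it accepts raw degrees.
piped_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
piped_knn.fit(raw_temp, rentals)

query = np.array([[25.0]])
print("bare model input:", scaler.transform(query))  # a z-score, hard to interpret
print("pipeline input:  ", query)                    # degrees, as a person would read it

# Both routes give the same prediction; only the input space differs.
same = np.allclose(bare_knn.predict(scaler.transform(query)), piped_knn.predict(query))
print(same)
```

An explanation computed against the pipeline can attribute a prediction to "temperature of 25 degrees", whereas one computed against the bare model can only refer to the scaled value.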

In [15]:
# To organize our models, let's first create a project on Fiddler.
fiddler_api.create_project('bikeshare_forecasting')

{'project_name': 'bikeshare_forecasting'}

In [16]:
# we see that the 'bikeshare_forecasting' project now shows up in the list of all projects
fiddler_api.list_projects()

['imdb_rnn',
 'bank_churn',
 'newsgroup_text_topics',
 'lending',
 'bikeshare_forecasting',
 'iris_classification',
 'wine_quality']

#### ModelInfo
For Fiddler to properly run and explain your model, you need to provide some information about model inputs and outputs that is not captured by the `sklearn` object itself. Luckily, the dataset we uploaded above has a `DatasetInfo` component that can help us infer the `ModelInfo` of models trained on that dataset.

In [17]:
model_info = fdl.ModelInfo.from_dataset_info(
    dataset_info=fiddler_api.get_dataset_info('bikeshare'),
    target=target, 
    features=feature_columns,
    display_name='Bikeshare kNN',
    description='A kNN model trained for predict the `cnt` feature of the bikeshare dataset.'
)
model_info

ModelInfo:
  display_name: Bikeshare kNN
  description: A kNN model trained for predict the `cnt` feature of the bikeshare dataset.
  input_type: ModelInputType.TABULAR
  model_task: ModelTask.REGRESSION
  inputs and outputs:
               column column_type     dtype count(possible_values)
    0          season       input  CATEGORY                      4
    1            mnth       input   INTEGER                      -
    2              hr       input   INTEGER                      -
    3         holiday       input   BOOLEAN                      -
    4         weekday       input   INTEGER                      -
    5      workingday       input   BOOLEAN                      -
    6      weathersit       input  CATEGORY                      7
    7            temp       input     FLOAT                      -
    8           atemp       input     FLOAT                      -
    9             hum       input     FLOAT                      -
    10      windspeed       input    

In [18]:
fiddler_api.upload_model_sklearn(
    model=model,
    info=model_info,
    project_id='bikeshare_forecasting',
    model_id='knn_model',
    associated_dataset_ids=['bikeshare'])

You are uploading a scikit-learn model using the Fiddler API.
If this model uses any custom (non-sklearn) code, it will not run properly on the Fiddler Engine.
The Fiddler engine may not be able to detect this in advance.


{'model': {'display name': 'Bikeshare kNN',
  'input-type': 'structured',
  'model-task': 'regression',
  'inputs': [{'column-name': 'season',
    'data-type': 'category',
    'possible-values': ['1', '2', '3', '4']},
   {'column-name': 'mnth', 'data-type': 'int'},
   {'column-name': 'hr', 'data-type': 'int'},
   {'column-name': 'holiday', 'data-type': 'bool'},
   {'column-name': 'weekday', 'data-type': 'int'},
   {'column-name': 'workingday', 'data-type': 'bool'},
   {'column-name': 'weathersit',
    'data-type': 'category',
    'possible-values': ['1', '2', '3', '4', '123', '456', '789']},
   {'column-name': 'temp', 'data-type': 'float'},
   {'column-name': 'atemp', 'data-type': 'float'},
   {'column-name': 'hum', 'data-type': 'float'},
   {'column-name': 'windspeed', 'data-type': 'float'}],
  'outputs': [{'column-name': 'predicted_cnt', 'data-type': 'float'}],
  'description': 'A kNN model trained for predict the `cnt` feature of the bikeshare dataset.',
  'datasets': ['bikeshare']}

We can now look at explanations!
- http://localhost:4100/projects/bikeshare_forecasting/explain

(or if you used cloud instead of onebox)
- https://app.fiddler.ai/projects/bikeshare_forecasting/explain