# Demand Curve Pipeline

This worksheet demonstrates a Wallaroo pipeline with data preprocessing, a model, and data postprocessing.

The model is a "demand curve" that predicts the expected number of units of a product that will be sold to a customer as a function of unit price and facts about the customer. Such models can be used for price optimization or sales volume forecasting.

Data preprocessing is required to create the features used by the model. Simple postprocessing prevents nonsensical estimates (e.g. negative units sold).
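To make the postprocessing idea concrete, here is a minimal sketch of clamping negative estimates to zero. The contents of `postprocess.py` are not shown in this worksheet, so the function name `clamp_predictions` and its `floor` parameter are assumptions for illustration only.

```python
def clamp_predictions(predictions, floor=0.0):
    """Replace any prediction below `floor` with `floor`.

    Hypothetical helper: the real postprocess module may work differently.
    """
    return [max(floor, float(p)) for p in predictions]


# A negative "units sold" estimate is nonsensical, so it becomes 0.0.
raw = [6.68, -1.25, 0.0, 33.1]
cleaned = clamp_predictions(raw)
```

Values at or above the floor pass through unchanged, so the step only alters estimates that could not occur in practice.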


In [1]:
import json
import wallaroo
import pandas
import numpy
import conversion
import os

Start up the Wallaroo client. The `WALLAROO_SDK_CREDENTIALS` environment variable names the credentials file the client will use.

In [2]:
os.environ["WALLAROO_SDK_CREDENTIALS"] = 'creds.json'
wl = wallaroo.Client(auth_type="user_password")

Create the workspace and set it as the current workspace.

In [3]:
new_workspace = wl.create_workspace("demandcurve-workspace")
_ = wl.set_current_workspace(new_workspace)

Just to make sure, let's check our current workspace. If everything went well, it will show that we're in `demandcurve-workspace`.

In [4]:
wl.get_current_workspace()

{'name': 'demandcurve-workspace', 'id': 2, 'archived': False, 'created_by': '7dbb3754-4c14-4730-8b77-33caeea7a2a0', 'created_at': '2022-03-28T16:28:21.625896+00:00', 'models': [], 'pipelines': []}

Upload the models to Wallaroo:

* `demand_curve_v1.onnx`: Our demand curve model. We'll store the configured upload in `demand_curve_model`.
* `preprocess`: Takes the raw data and prepares it for the demand curve model. We'll store the configured upload in `module_pre`.
* `postprocess`: Takes the results from our demand curve model and prepares them for display. We'll store the configured upload in `module_post`.

In [5]:
# upload to wallaroo
demand_curve_model = wl.upload_model('demandcurve', "./demand_curve_v1.onnx").configure()

In [6]:
# load the preprocess module
module_pre = wl.upload_model("preprocess", "./preprocess.py").configure('python')

In [7]:
# load the postprocess module
module_post = wl.upload_model("postprocess", "./postprocess.py").configure('python')

With our models uploaded, we're going to create our own pipeline and give it three steps:

* First, we run the preprocess module, `module_pre`, to prepare the data.
* Second, we pass the prepared data to our `demand_curve_model`.
* Finally, we prepare the model's results for output with `module_post`.

In [8]:
# now make a pipeline
demandcurve_pipeline = (wl.build_pipeline("demand-curve-pipeline")
                        .add_model_step(module_pre)
                        .add_model_step(demand_curve_model)
                        .add_model_step(module_post))
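The order of the steps matters because each step consumes the previous step's output. The plain-Python sketch below imitates that chaining locally. It does not use the Wallaroo SDK, and `run_steps` and the three `toy_*` functions are hypothetical stand-ins, not the uploaded modules.

```python
from functools import reduce


def run_steps(steps, data):
    """Feed data through each step in order; each output is the next input."""
    return reduce(lambda value, step: step(value), steps, data)


# Hypothetical stand-ins for the three pipeline steps.
def toy_preprocess(row):
    # Turn a raw purchase row into a numeric feature list.
    return [row["UnitPrice"], 1.0 if row["cust_known"] else 0.0]


def toy_model(features):
    # A made-up linear demand curve: demand falls as price rises.
    price, known = features
    return 10.0 - 3.0 * price + 2.0 * known


def toy_postprocess(prediction):
    # Negative demand makes no sense, so report zero instead.
    return max(0.0, prediction)


steps = [toy_preprocess, toy_model, toy_postprocess]
estimate = run_steps(steps, {"UnitPrice": 4.21, "cust_known": False})
```

With these made-up coefficients the raw model value for a price of 4.21 is negative, so the final step returns `0.0`.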

With the pipeline defined, let's deploy it. Deployment usually finishes within about 45 seconds.

In [9]:
demandcurve_pipeline.deploy()

Waiting for deployment - this will take up to 45s ........ ok


{'name': 'demand-curve-pipeline', 'create_time': datetime.datetime(2022, 3, 28, 16, 28, 22, 313553, tzinfo=tzutc()), 'definition': "[{'ModelInference': {'models': [{'name': 'preprocess', 'version': 'b1f51290-ac47-4289-8a55-310507d52af5', 'sha': 'c328e2d5bf0adeb96f37687ab4da32cecf5f2cc789fa3a427ec0dbd2c3b8b663'}]}}, {'ModelInference': {'models': [{'name': 'demandcurve', 'version': '9cd1fcae-1fa1-4e12-8e67-d4a67f240a46', 'sha': '2820b42c9e778ae259918315f25afc8685ecab9967bad0a3d241e6191b414a0d'}]}}, {'ModelInference': {'models': [{'name': 'postprocess', 'version': '06e79dfe-623e-482e-95f6-bd6fa1b26264', 'sha': '4bd3109602e999a3a5013893cd2eff1a434fd9f06d6e3e681724232db6fdd40d'}]}}]"}

We can check the status of our pipeline to make sure everything was set up correctly:

In [10]:
demandcurve_pipeline.status()

{'status': 'Running',
 'details': None,
 'engines': [{'ip': '10.12.1.227',
   'name': 'engine-7cbf9b8d6d-xs64b',
   'status': 'Running',
   'reason': None,
   'pipeline_statuses': {'pipelines': [{'id': 'demand-curve-pipeline',
      'status': 'Running'}]},
   'model_statuses': {'models': [{'name': 'demandcurve',
      'version': '9cd1fcae-1fa1-4e12-8e67-d4a67f240a46',
      'sha': '2820b42c9e778ae259918315f25afc8685ecab9967bad0a3d241e6191b414a0d',
      'status': 'Running'},
     {'name': 'preprocess',
      'version': 'b1f51290-ac47-4289-8a55-310507d52af5',
      'sha': 'c328e2d5bf0adeb96f37687ab4da32cecf5f2cc789fa3a427ec0dbd2c3b8b663',
      'status': 'Running'},
     {'name': 'postprocess',
      'version': '06e79dfe-623e-482e-95f6-bd6fa1b26264',
      'sha': '4bd3109602e999a3a5013893cd2eff1a434fd9f06d6e3e681724232db6fdd40d',
      'status': 'Running'}]}}],
 'engine_lbs': [{'ip': '10.12.1.226',
   'name': 'engine-lb-85846c64f8-6l9rr',
   'status': 'Running',
   'reason': None}]}
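Reading a nested status by eye gets tedious, so a small check can help. The sketch below assumes the status is available as a plain dictionary with the layout printed above; `all_running` is a hypothetical helper, not part of the Wallaroo SDK.

```python
def all_running(status):
    """Return True only if the pipeline, engines, load balancers, and models all report 'Running'."""
    if status.get("status") != "Running":
        return False
    for engine in status.get("engines", []):
        if engine.get("status") != "Running":
            return False
        pipelines = engine.get("pipeline_statuses", {}).get("pipelines", [])
        models = engine.get("model_statuses", {}).get("models", [])
        if any(item.get("status") != "Running" for item in pipelines + models):
            return False
    # Load balancers are listed separately from the engines.
    return all(lb.get("status") == "Running" for lb in status.get("engine_lbs", []))


# Abbreviated copy of the status shown above (versions and hashes omitted).
example_status = {
    "status": "Running",
    "details": None,
    "engines": [{
        "name": "engine-7cbf9b8d6d-xs64b",
        "status": "Running",
        "pipeline_statuses": {"pipelines": [{"id": "demand-curve-pipeline", "status": "Running"}]},
        "model_statuses": {"models": [
            {"name": "demandcurve", "status": "Running"},
            {"name": "preprocess", "status": "Running"},
            {"name": "postprocess", "status": "Running"},
        ]},
    }],
    "engine_lbs": [{"name": "engine-lb-85846c64f8-6l9rr", "status": "Running"}],
}

ready = all_running(example_status)
```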

Everything is ready. Let's feed our pipeline some data. We have some purchase records prepared in the `daily_purchases.csv` spreadsheet. We'll start with just one row to make sure that everything is working correctly.

In [11]:
# read in some purchase data
purchases = pandas.read_csv('daily_purchases.csv')

# start with a one-row data frame for testing
subsamp_raw = purchases.iloc[0:1, :]
subsamp_raw

# create the input dictionary from the original one-line dataframe
input_dict = conversion.pandas_to_dict(subsamp_raw)

result = demandcurve_pipeline.infer(input_dict)
result

Waiting for inference response - this will take up to 45s .. ok


[InferenceResult({'check_failures': [],
  'elapsed': 479657,
  'model_name': 'postprocess',
  'model_version': '06e79dfe-623e-482e-95f6-bd6fa1b26264',
  'original_data': {'colnames': ['Date',
                                 'cust_known',
                                 'StockCode',
                                 'UnitPrice',
                                 'UnitsSold'],
                    'query': [['2010-12-01', False, '21928', 4.21, 1]]},
  'outputs': [{'Json': {'data': [{'original': {'outputs': [{'Double': {'data': [6.68025518653071],
                                                                       'dim': [1,
                                                                               1],
                                                                       'v': 1}}]},
                                  'prediction': [6.68025518653071]}],
                        'dim': [1],
                        'v': 1}}],
  'pipeline_name': 'demand-curve-pipeline',
  'time': 1648484
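The `original_data` field above shows the input layout the pipeline received: a `colnames` list plus a `query` list holding one list per row. The `conversion` module is a local helper whose source is not shown here, so the sketch below is only a guess at an equivalent conversion, and `records_to_input_dict` is a hypothetical name. It works on a list of row dictionaries to stay dependency-free.

```python
def records_to_input_dict(records):
    """Convert a list of row dicts into the {'colnames', 'query'} layout.

    Hypothetical helper: column order is taken from the first record.
    """
    if not records:
        return {"colnames": [], "query": []}
    colnames = list(records[0].keys())
    query = [[row[name] for name in colnames] for row in records]
    return {"colnames": colnames, "query": query}


# The same single row used in the test inference above.
records = [{
    "Date": "2010-12-01",
    "cust_known": False,
    "StockCode": "21928",
    "UnitPrice": 4.21,
    "UnitsSold": 1,
}]
input_sketch = records_to_input_dict(records)
```

For a pandas DataFrame, one way to obtain such row dictionaries is `DataFrame.to_dict("records")`.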

In [12]:
result[0].data()

[array([6.68025519])]
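The same number also appears in the raw result, nested under `outputs`, then `Json`, `data`, and `prediction`. The sketch below walks that nesting for a plain dictionary copied from the output above. `extract_predictions` is a hypothetical helper, and it assumes every output entry follows the `Json` layout shown.

```python
def extract_predictions(outputs):
    """Collect every value stored under a 'prediction' key in the outputs list."""
    predictions = []
    for output in outputs:
        for item in output["Json"]["data"]:
            predictions.extend(item["prediction"])
    return predictions


# The 'outputs' field from the inference result shown earlier.
example_outputs = [{
    "Json": {
        "data": [{
            "original": {"outputs": [{"Double": {"data": [6.68025518653071],
                                                 "dim": [1, 1],
                                                 "v": 1}}]},
            "prediction": [6.68025518653071],
        }],
        "dim": [1],
        "v": 1,
    }
}]

values = extract_predictions(example_outputs)
```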

In [13]:
demandcurve_pipeline.logs()

Timestamp,Output,Input,Anomalies
2022-28-Mar 16:25:09,[array([6.68025519])],"{'colnames': ['Date', 'cust_known', 'StockCode', 'UnitPrice', 'UnitsSold'], 'query': [['2010-12-01', False, '21928', 4.21, 1]]}",0
2022-28-Mar 16:25:20,"[array([ 6.77154593, 49.73419364, 6.77154593, 0. , 49.73419364,  9.11087115, 6.77154593, 6.77154593, 9.11087115, 33.12532316])]","{'colnames': ['Date', 'cust_known', 'StockCode', 'UnitPrice', 'UnitsSold'], 'query': [['2011-02-01', False, '85099F', 4.13, 1], ['2011-09-22', True, '85099F', 1.79, 20], ['2011-07-13', False, '22386', 4.13, 7], ['2011-10-10', True, '21931', 4.13, 1], ['2011-06-10', True, '22386', 1.79, 100], ['2011-11-30', False, '85099B', 2.08, 13], ['2011-09-23', False, '23343', 4.13, 1], ['2011-04-04', False, '21928', 4.13, 1], ['2011-06-27', False, '23199', 2.08, 9], ['2011-06-20', True, '22411', 2.08, 20]]}",0


# Bulk Inference

The initial test went perfectly. Now let's throw some more data into our pipeline. We'll draw 10 random rows from our spreadsheet and run a single inference on all of them.

In [14]:
# Let's do 10 rows at once (drawn randomly)
ix = numpy.random.choice(purchases.shape[0], size=10, replace=False)
output = demandcurve_pipeline.infer(conversion.pandas_to_dict(purchases.iloc[ix, :]))

In [15]:
output[0].data()

[array([33.12532316,  6.77154593,  6.77154593, 40.57067889, 40.57067889,
         6.77154593, 33.12532316,  6.77154593,  9.11087115, 40.57067889])]
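Because the draw above is unseeded, each run selects different rows and returns different predictions. If repeatable results are wanted, the indices can be drawn from a seeded generator. This is a standard-library sketch of sampling without replacement; `sample_row_indices` and its `seed` parameter are assumptions for illustration.

```python
import random


def sample_row_indices(n_rows, k, seed=None):
    """Pick k distinct row positions from range(n_rows); a fixed seed repeats the draw."""
    rng = random.Random(seed)
    return rng.sample(range(n_rows), k)


# Two draws with the same seed select the same rows.
first = sample_row_indices(1000, 10, seed=42)
second = sample_row_indices(1000, 10, seed=42)
```

In NumPy, a comparable seeded draw is `numpy.random.default_rng(seed).choice(n_rows, size=10, replace=False)`.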

In [16]:
demandcurve_pipeline.logs()

Timestamp,Output,Input,Anomalies
2022-28-Mar 16:25:09,[array([6.68025519])],"{'colnames': ['Date', 'cust_known', 'StockCode', 'UnitPrice', 'UnitsSold'], 'query': [['2010-12-01', False, '21928', 4.21, 1]]}",0
2022-28-Mar 16:25:20,"[array([ 6.77154593, 49.73419364, 6.77154593, 0. , 49.73419364,  9.11087115, 6.77154593, 6.77154593, 9.11087115, 33.12532316])]","{'colnames': ['Date', 'cust_known', 'StockCode', 'UnitPrice', 'UnitsSold'], 'query': [['2011-02-01', False, '85099F', 4.13, 1], ['2011-09-22', True, '85099F', 1.79, 20], ['2011-07-13', False, '22386', 4.13, 7], ['2011-10-10', True, '21931', 4.13, 1], ['2011-06-10', True, '22386', 1.79, 100], ['2011-11-30', False, '85099B', 2.08, 13], ['2011-09-23', False, '23343', 4.13, 1], ['2011-04-04', False, '21928', 4.13, 1], ['2011-06-27', False, '23199', 2.08, 9], ['2011-06-20', True, '22411', 2.08, 20]]}",0
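The log display above has four comma-separated columns, with the `Input` column quoted because it contains commas. Assuming the same text is available as a CSV string (the exact return type of `logs()` is not shown here), it could be parsed as sketched below. `parse_log_rows` is a hypothetical helper, and the `Output` column is left as text because `array(...)` is not a plain Python literal.

```python
import ast
import csv
import io
from datetime import datetime

# First log row from the display above, copied as CSV text.
LOG_TEXT = """Timestamp,Output,Input,Anomalies
2022-28-Mar 16:25:09,[array([6.68025519])],"{'colnames': ['Date', 'cust_known', 'StockCode', 'UnitPrice', 'UnitsSold'], 'query': [['2010-12-01', False, '21928', 4.21, 1]]}",0
"""


def parse_log_rows(text):
    """Parse log text into dicts with a datetime, the input dict, and the anomaly count."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            # Timestamps are written as year-day-month, e.g. 2022-28-Mar.
            "timestamp": datetime.strptime(record["Timestamp"], "%Y-%d-%b %H:%M:%S"),
            "output_text": record["Output"],
            "input": ast.literal_eval(record["Input"]),
            "anomalies": int(record["Anomalies"]),
        })
    return rows


log_rows = parse_log_rows(LOG_TEXT)
```

The `%b` month abbreviation is locale-dependent, so this format string assumes English month names.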
