# Summarization (text) Pipeline Example

## Imports

In [1]:
import json
import os

import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework

import pyarrow as pa
import numpy as np
import pandas as pd


## Login

In [2]:
#wl = wallaroo.Client(auth_type="sso", interactive=True)

wallarooPrefix = "doc-test."
wallarooSuffix = "wallaroocommunity.ninja"

wl = wallaroo.Client(api_endpoint=f"https://{wallarooPrefix}api.{wallarooSuffix}", 
                    auth_endpoint=f"https://{wallarooPrefix}keycloak.{wallarooSuffix}", 
                    auth_type="sso")

Please log into the following URL in a web browser:

	https://doc-test.keycloak.wallaroocommunity.ninja/auth/realms/master/device?user_code=WVNJ-KUKM

Login successful!


### Configure PyArrow Schema

The available inputs are documented in [TextSummarizationInputs](https://github.com/WallarooLabs/platform/blob/main/conductor/model-auto-conversion/flavors/hugging-face/src/io/pipeline_inputs/text_summarization_inputs.py#L14) and in the [official source code](https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/pipelines/text2text_generation.py#L241) from `🤗 Hugging Face`. The input schema declares one field per accepted input; the output schema declares the single `summary_text` field the model returns.

In [3]:
input_schema = pa.schema([
    pa.field('inputs', pa.string()),
    pa.field('return_text', pa.bool_()),
    pa.field('return_tensors', pa.bool_()),
    pa.field('clean_up_tokenization_spaces', pa.bool_()),
    # pa.field('generate_kwargs', pa.map_(pa.string(), pa.null())), # dictionaries are not currently supported by the engine
])

output_schema = pa.schema([
    pa.field('summary_text', pa.string()),
])
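
As a rough illustration of what the input schema expects, the sketch below checks a single record against the same field names and types using only the standard library. `EXPECTED_FIELDS`, `REQUIRED_FIELDS`, and `validate_record` are hypothetical names introduced here and are not part of the Wallaroo SDK. Treating only `inputs` as required is an assumption based on the comments in the inference cell of this notebook.

```python
# Hypothetical mirror of input_schema: field name -> expected Python type.
EXPECTED_FIELDS = {
    "inputs": str,
    "return_text": bool,
    "return_tensors": bool,
    "clean_up_tokenization_spaces": bool,
}

# Assumption: only 'inputs' is required; the boolean flags are optional.
REQUIRED_FIELDS = {"inputs"}


def validate_record(record):
    """Return a list of problems found in one input record (empty if none)."""
    problems = []
    for name in REQUIRED_FIELDS:
        if name not in record:
            problems.append(f"missing required field: {name}")
    for name, value in record.items():
        expected = EXPECTED_FIELDS.get(name)
        if expected is None:
            problems.append(f"unexpected field: {name}")
        elif not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems
```

Running a check like this before building the DataFrame can surface a mistyped column name or a string passed where a boolean is expected, before the request reaches the pipeline.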

### Get Model

In [5]:
model = wl.upload_model('hf-summarization-demoyns2', 
                        'model-auto-conversion_hugging-face_complex-pipelines_hf-summarisation-bart-large-samsun.zip', 
                        framework=Framework.HUGGING_FACE_SUMMARIZATION, 
                        input_schema=input_schema, 
                        output_schema=output_schema
                        )
model

Waiting for model loading - this will take up to 10.0min.
Model is pending loading to a container runtime..
Model is attempting loading to a container runtime.....................successful

Ready


0,1
Name,hf-summarization-demoyns2
Version,48d72805-dbc0-42be-a4b9-4af0e11568bc
File Name,model-auto-conversion_hugging-face_complex-pipelines_hf-summarisation-bart-large-samsun.zip
SHA,ee71d066a83708e7ca4a3c07caf33fdc528bb000039b6ca2ef77fa2428dc6268
Status,ready
Image Path,proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mlflow-deploy:v2023.3.0-main-3731
Updated At,2023-25-Aug 15:51:53


In [10]:
# def get_model(mname):
#     modellist = wl.get_current_workspace().models()
#     model = [m.versions()[0] for m in modellist if m.name() == mname]
#     if len(model) <= 0:
#         raise KeyError(f"model {mname} not found in this workspace")
#     return model[0]

In [11]:
# model = get_model('hf-summarization-demoyns')
# model

0,1
Name,hf-summarization-demoyns
Version,12ed6066-1708-42c1-906d-46e1c7673f36
File Name,hf-summarisation-bart-large-samsun.zip
SHA,ee71d066a83708e7ca4a3c07caf33fdc528bb000039b6ca2ef77fa2428dc6268
Status,error
Image Path,
Updated At,2023-22-Aug 17:57:21


In [13]:
model.status()

'ready'

## Configure Model

In [13]:
#model.configure(runtime="mlflow", input_schema=input_schema, output_schema=output_schema)


## Deploy Pipeline

In [14]:
deployment_config = DeploymentConfigBuilder() \
    .cpus(0.25).memory('1Gi') \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, "8Gi") \
    .build()

In [15]:
pipeline_name = "hf-summarization-pipeline-edge"

In [16]:
pipeline = wl.build_pipeline(pipeline_name)
pipeline.add_model_step(model)

pipeline.deploy(deployment_config=deployment_config)

{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.244.3.87',
   'name': 'engine-845875546b-zdwxp',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'hf-summarization-pipeline-edge',
      'status': 'Running'}]},
   'model_statuses': {'models': [{'name': 'hf-summarization-demoyns2',
      'version': '48d72805-dbc0-42be-a4b9-4af0e11568bc',
      'sha': 'ee71d066a83708e7ca4a3c07caf33fdc528bb000039b6ca2ef77fa2428dc6268',
      'status': 'Running'}]}}],
 'engine_lbs': [{'ip': '10.244.4.104',
   'name': 'engine-lb-584f54c899-spzfs',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.244.3.88',
   'name': 'engine-sidekick-hf-summarization-demoyns2-4-5fd489bb6d-x2v8b',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}

In [17]:
pipeline.status()

{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.244.3.87',
   'name': 'engine-845875546b-zdwxp',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'hf-summarization-pipeline-edge',
      'status': 'Running'}]},
   'model_statuses': {'models': [{'name': 'hf-summarization-demoyns2',
      'version': '48d72805-dbc0-42be-a4b9-4af0e11568bc',
      'sha': 'ee71d066a83708e7ca4a3c07caf33fdc528bb000039b6ca2ef77fa2428dc6268',
      'status': 'Running'}]}}],
 'engine_lbs': [{'ip': '10.244.4.104',
   'name': 'engine-lb-584f54c899-spzfs',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.244.3.88',
   'name': 'engine-sidekick-hf-summarization-demoyns2-4-5fd489bb6d-x2v8b',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}

In [35]:
# pipeline=wl.pipelines_by_name(pipeline_name)[0]
# pipeline

0,1
name,hf-summarization-pipeline-ynstest
created,2023-08-22 17:37:02.588226+00:00
last_updated,2023-08-23 01:45:08.347653+00:00
deployed,False
tags,
versions,"1d15fa20-47f3-40a4-8c85-101eda0639db, daf4209c-b789-4519-a541-1ff586ef428b, 63033389-d2f4-4b7c-8da6-926e52e2560f, 5dc8db1a-f722-4dc5-aba9-60c5fd0ec8a9, 11802776-e2c7-48b7-8f9c-13c118d3d16e, 51f88106-c710-41ac-8d19-c317da3f92b4, 34b633ab-bfe4-4c61-97d1-eeeefb94f33a, 684f985c-05c3-4b5d-bc28-7ade9ba60ac6, 82612dbb-eae7-49a4-b6b9-175e53ec4cc9"
steps,hf-summarization-demo
published,True


In [18]:
# This may still show an error status, but if both containers show as running it should be good to go
pipeline.publish(deployment_config)

Waiting for pipeline publish... It may take up to 600 sec.
Pipeline is Publishing........Published.


0,1
ID,4
Pipeline Version,c8d94cce-b237-4d03-bef4-eca89d8d5c88
Status,Published
Engine URL,ghcr.io/wallaroolabs/doc-samples/engine:v2023.3.0-main-3731
Pipeline URL,ghcr.io/wallaroolabs/doc-samples/pipelines/hf-summarization-pipeline-edge:c8d94cce-b237-4d03-bef4-eca89d8d5c88
Helm Chart URL,ghcr.io/wallaroolabs/doc-samples/charts/hf-summarization-pipeline-edge
Helm Chart Reference,ghcr.io/wallaroolabs/doc-samples/charts@sha256:f69f767dab856c6507ad8686e5f8fdf3e4b3698181735bdd8b7a3601cb5a3e2a
Helm Chart Version,0.0.1-c8d94cce-b237-4d03-bef4-eca89d8d5c88
Engine Config,"{'engine': {'resources': {'limits': {'cpu': 0.25, 'memory': '1Gi'}, 'requests': {'cpu': 0.25, 'memory': '1Gi'}}}, 'engineAux': {'images': {'hf-summarization-demoyns2-4': {'resources': {'limits': {'cpu': 4.0, 'memory': '8Gi'}, 'requests': {'cpu': 4.0, 'memory': '8Gi'}}}}}, 'enginelb': {}}"
Created By,john.hummel@wallaroo.ai
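
The `Engine Config` row appears to reflect the deployment configuration built earlier: the `cpus(0.25)` and `memory('1Gi')` settings show up under `engine`, and the sidekick settings show up under `engineAux`. The sketch below parses the displayed string to make that mapping easier to inspect. It assumes the displayed value is a Python-literal dictionary, which is an assumption about the display format rather than a documented guarantee; `engine_config_text` is simply the string copied from the output above.

```python
import ast

# The Engine Config value copied from the publish output above.
engine_config_text = (
    "{'engine': {'resources': {'limits': {'cpu': 0.25, 'memory': '1Gi'}, "
    "'requests': {'cpu': 0.25, 'memory': '1Gi'}}}, "
    "'engineAux': {'images': {'hf-summarization-demoyns2-4': {'resources': "
    "{'limits': {'cpu': 4.0, 'memory': '8Gi'}, "
    "'requests': {'cpu': 4.0, 'memory': '8Gi'}}}}}, 'enginelb': {}}"
)

# literal_eval only accepts Python literals, so it is safe for untrusted text.
engine_config = ast.literal_eval(engine_config_text)

# Resource limits for the engine container.
engine_limits = engine_config["engine"]["resources"]["limits"]

# Resource limits for each sidekick image, keyed by image name.
sidekick_limits = {
    name: image["resources"]["limits"]
    for name, image in engine_config["engineAux"]["images"].items()
}

print(engine_limits)
print(sidekick_limits)
```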


## Run inference

In [31]:
input_data = {
        "inputs": ["LinkedIn (/lɪŋktˈɪn/) is a business and employment-focused social media platform that works through websites and mobile apps. It launched on May 5, 2003. It is now owned by Microsoft. The platform is primarily used for professional networking and career development, and allows jobseekers to post their CVs and employers to post jobs. From 2015 most of the company's revenue came from selling access to information about its members to recruiters and sales professionals. Since December 2016, it has been a wholly owned subsidiary of Microsoft. As of March 2023, LinkedIn has more than 900 million registered members from over 200 countries and territories. LinkedIn allows members (both workers and employers) to create profiles and connect with each other in an online social network which may represent real-world professional relationships. Members can invite anyone (whether an existing member or not) to become a connection. LinkedIn can also be used to organize offline events, join groups, write articles, publish job postings, post photos and videos, and more"], # required
        "return_text": [True], # optional: using the defaults, similar to not passing this parameter
        "return_tensors": [False], # optional: using the defaults, similar to not passing this parameter
        "clean_up_tokenization_spaces": [False], # optional: using the defaults, similar to not passing this parameter
}
dataframe = pd.DataFrame(input_data)
dataframe

Unnamed: 0,inputs,return_text,return_tensors,clean_up_tokenization_spaces
0,LinkedIn (/lɪŋktˈɪn/) is a business and employ...,True,False,False


In [32]:
dataframe.to_json('./data/test_summarization.df.json', orient="records")
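
`orient="records"` writes one JSON object per DataFrame row. The sketch below reproduces that layout with the standard library so the shape of the saved file is visible. `to_records` is a hypothetical helper, and the short input text is a stand-in for the full paragraph used in this notebook.

```python
import json


def to_records(columns):
    """Convert a column-oriented dict of equal-length lists into a list of row dicts."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    n_rows = lengths.pop() if lengths else 0
    return [
        {name: values[i] for name, values in columns.items()}
        for i in range(n_rows)
    ]


# Stand-in data with the same columns as input_data.
sample = {
    "inputs": ["Some text to summarize."],
    "return_text": [True],
    "return_tensors": [False],
    "clean_up_tokenization_spaces": [False],
}

records = to_records(sample)
payload = json.dumps(records)  # JSON text with one object per row
```

This is the same layout that the `format=pandas-records` content type in the HTTP requests later in this notebook refers to.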

In [20]:
# Adjust the timeout as needed; this starts with a generous 10-minute (600 s) timeout
out = pipeline.infer(dataframe, timeout=600)
out

Unnamed: 0,time,in.clean_up_tokenization_spaces,in.inputs,in.return_tensors,in.return_text,out.summary_text,check_failures
0,2023-08-25 16:35:18.658,False,LinkedIn (/lɪŋktˈɪn/) is a business and employ...,False,True,LinkedIn is a business and employment-focused ...,0


In [21]:
out = pipeline.infer_from_file('test_summarization.json', timeout=600)
out

Unnamed: 0,time,in.clean_up_tokenization_spaces,in.inputs,in.return_tensors,in.return_text,out.summary_text,check_failures
0,2023-08-24 18:01:19.372,False,LinkedIn (/lɪŋktˈɪn/) is a business and employ...,False,True,LinkedIn is a business and employment-focused ...,0


In [21]:
out["out.summary_text"][0]

'LinkedIn is a business and employment-focused social media platform that works through websites and mobile apps. It launched on May 5, 2003. LinkedIn allows members (both workers and employers) to create profiles and connect with each other in an online social network which may represent real-world professional relationships.'

## Undeploy Pipeline

In [22]:
pipeline.undeploy()

0,1
name,hf-summarization-pipeline-edge
created,2023-08-25 15:52:02.329988+00:00
last_updated,2023-08-25 16:24:04.045981+00:00
deployed,False
tags,
versions,"c8d94cce-b237-4d03-bef4-eca89d8d5c88, c7a067bc-997b-47c2-89c7-29ddd507cf7d, c1164da4-e044-49d3-a079-2c6c6a8cdc3f, 28176ea4-5717-4c60-b9c0-91a695bfb78d, 2d55d49d-45d6-4d88-9c6b-a3225a2ba565, 55760fa6-3919-4790-93a2-121be29d1962"
steps,hf-summarization-demoyns2
published,True


### Edge: Publish pipeline

In [23]:
pipeline.url()

'http://engine-lb.hf-summarization-pipeline-edge-6:29502/pipelines/hf-summarization-pipeline-edge'

In [31]:
!curl -X POST http://testboy.hf-summarization-pipeline-edge-6:8080/pipelines/hf-summarization-pipeline-edge -H "Content-Type: application/json; format=pandas-records" -d @test_summarization.json

[{"time":1692900759318,"in":{"clean_up_tokenization_spaces":[false],"inputs":["LinkedIn (/lɪŋktˈɪn/) is a business and employment-focused social media platform that works through websites and mobile apps. It launched on May 5, 2003. It is now owned by Microsoft. The platform is primarily used for professional networking and career development, and allows jobseekers to post their CVs and employers to post jobs. From 2015 most of the company's revenue came from selling access to information about its members to recruiters and sales professionals. Since December 2016, it has been a wholly owned subsidiary of Microsoft. As of March 2023, LinkedIn has more than 900 million registered members from over 200 countries and territories. LinkedIn allows members (both workers and employers) to create profiles and connect with each other in an online social network which may represent real-world professional relationships. Members can invite anyone (whether an existing member or not) to become a co

In [24]:
!curl http://testboy.local/pipelines

curl: (7) Failed to connect to testboy.local port 80 after 23 ms: Couldn't connect to server


In [36]:
import json
import requests

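# Set the content type header for pandas-records JSON
headers = {
    'Content-Type': 'application/json; format=pandas-records'
}

# Load the pandas-records JSON file saved earlier
dataFile = "./data/test_summarization.df.json"

# Use a context manager so the file handle is closed after reading
with open(dataFile) as f:
    data = json.load(f)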

host = 'http://testboy.local:8080'

deployurl = f'{host}/pipelines/hf-summarization-pipeline-edge'

response = requests.post(
                    deployurl, 
                    headers=headers, 
                    json=data, 
                    verify=True
                )

# display(response)
display(response.json()[0]['outputs'][0]['String']['data'][0])


'LinkedIn is a business and employment-focused social media platform that works through websites and mobile apps. It launched on May 5, 2003. LinkedIn allows members (both workers and employers) to create profiles and connect with each other in an online social network which may represent real-world professional relationships.'
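
The nested lookup in the request cell above raises an exception if the response has a different shape, for example when the pipeline returns an error body. A minimal defensive sketch follows. `extract_summary` is a hypothetical helper, and `sample_body` is a made-up payload shaped only after the access path used above (`[0]['outputs'][0]['String']['data'][0]`); the real response may carry additional fields or differ between response formats.

```python
def extract_summary(response_body):
    """Return the first summary string from a parsed response, or None if absent."""
    try:
        return response_body[0]["outputs"][0]["String"]["data"][0]
    except (IndexError, KeyError, TypeError):
        # The body did not have the expected nesting.
        return None


# Hypothetical payload mirroring the access path used in the cell above.
sample_body = [{"outputs": [{"String": {"data": ["A short summary."]}}]}]

summary = extract_summary(sample_body)
```

With a live response, this would be called as `extract_summary(response.json())`, and a `None` result signals that the raw body should be inspected.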