## Pachyderm preprocessing -> training -> artifacts -> Seldon deployment

In [225]:
!pip3 install numpy seaborn pandas scikit-learn pyarrow seldon-core werkzeug==2.0.3

In [111]:
!pachctl version

COMPONENT           VERSION             
pachctl             2.2.2               
pachd               2.2.2               


In [11]:
!python3 regression.py --help

usage: regression.py [-h] [--input INPUT] [--target-col TARGET_COL]
                     [--output DIR]

Structured data regression

options:
  -h, --help            show this help message and exit
  --input INPUT         csv file with all examples
  --target-col TARGET_COL
                        column with target values
  --output DIR          output directory
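For orientation, the help text above can be reproduced with a small `argparse` parser. This is only a sketch of the interface, not the actual contents of `regression.py`: the help output shows no defaults or required flags, so none are set here, and the function name `build_parser` is made up for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the CLI from the help output (hypothetical helper, not the real script)."""
    parser = argparse.ArgumentParser(
        prog="regression.py",
        description="Structured data regression",
    )
    # Option names, metavars and help strings are copied from the help output.
    parser.add_argument("--input", help="csv file with all examples")
    parser.add_argument("--target-col", help="column with target values")
    parser.add_argument("--output", metavar="DIR", help="output directory")
    return parser


# Parse the same arguments the pipeline spec passes later in this notebook.
args = build_parser().parse_args(
    ["--input", "/pfs/housing_data/", "--target-col", "MEDV", "--output", "/pfs/out/"]
)
print(args.input, args.target_col, args.output)
```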


## Step 1: Create an input data repository

In [10]:
!pachctl create repo housing_data

In [11]:
!pachctl list repo

NAME         CREATED       SIZE (MASTER) DESCRIPTION                       
housing_data 7 seconds ago ≤ 0B                                            
count        5 hours ago   ≤ 22B         Output repo for pipeline count.   
data         6 hours ago   ≤ 728B                                          
reduce       22 hours ago  ≤ 6.545KiB    Output repo for pipeline reduce.  
map          22 hours ago  ≤ 8.583KiB    Output repo for pipeline map.     
scraper      22 hours ago  ≤ 333.5KiB    Output repo for pipeline scraper. 
urls         22 hours ago  ≤ 119B                                          


## Step 2: Create the regression pipeline

In [12]:
!cat regression.json

{
    "pipeline": {
        "name": "regression"
    },
    "description": "A pipeline that trains a regression model for housing prices.",
    "input": {
        "pfs": {
            "glob": "/*",
            "repo": "housing_data"
        }
    },
    "transform": {
        "cmd": [
            "python", "regression.py",
            "--input", "/pfs/housing_data/",
            "--target-col", "MEDV",
            "--output", "/pfs/out/"
        ],
        "image": "pachyderm/housing-prices:1.11.0"
    }
}

In [13]:
!pachctl create pipeline -f regression.json

The pipeline writes its results to /pfs/out/ (the --output argument in the pipeline JSON). Pachyderm stores them in an output repo that it creates with the same name as the pipeline, here regression.
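The relationship between the spec and the paths seen inside the container can be sketched in a few lines of Python. The helper name `pfs_layout` is hypothetical. It assumes the usual Pachyderm conventions: an input repo is mounted under /pfs/ using its repo name (or the input's `name` field if one is set), and output goes to /pfs/out.

```python
import json

# Trimmed copy of regression.json, keeping only the fields used below.
SPEC = json.loads("""
{
    "pipeline": {"name": "regression"},
    "input": {"pfs": {"glob": "/*", "repo": "housing_data"}}
}
""")


def pfs_layout(spec: dict) -> dict:
    """Return where a job reads its input and where its output ends up (hypothetical helper)."""
    pfs = spec["input"]["pfs"]
    # Assumption: the mount name defaults to the repo name unless "name" is set.
    mount = pfs.get("name", pfs["repo"])
    return {
        "input_dir": f"/pfs/{mount}",
        "output_dir": "/pfs/out",
        # The output repo takes the pipeline's name.
        "output_repo": spec["pipeline"]["name"],
    }


layout = pfs_layout(SPEC)
print(layout)
```

The resulting `input_dir` and `output_dir` match the `--input` and `--output` arguments in the `transform.cmd` of the spec.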

## Step 3: Add the housing dataset to the repo

Now we can add the data, which kicks off the pipeline automatically. Every later commit that changes the data triggers a re-run.

In [24]:
!pachctl put file housing_data@master:housing-simplified.csv -f data/housing-simplified-1.csv



In [26]:
!pachctl list file housing_data@master

NAME                    TYPE SIZE     
/housing-simplified.csv file 2.482KiB 


In [28]:
!pachctl list job

ID                               SUBJOBS PROGRESS CREATED            MODIFIED
f8fa49a2838c495eaa51c1675684f82c 1       ▇▇▇▇▇▇▇▇ About a minute ago About a minute ago 
aa9373420c4146d393164a2857c0385a 1       ▇▇▇▇▇▇▇▇ 3 minutes ago      3 minutes ago      
e168bea3fdbf49d2849354c2dc833dd9 1       ▇▇▇▇▇▇▇▇ 6 hours ago        6 hours ago        
e35d00004c5b4288b6580c5c0519cc80 1       ▇▇▇▇▇▇▇▇ 6 hours ago        6 hours ago        
ebee0fc1176c4e01a8093559cb893a5c 3       ▇▇▇▇▇▇▇▇ 6 hours ago        6 hours ago        
f3b7d09fb53f49acb727ce0010027b9f 3       ▇▇▇▇▇▇▇▇

## Step 4: Download files once the pipeline has finished

In [30]:
!pachctl list file regression@master

NAME                                  TYPE SIZE     
/housing-simplified_corr_matrix.png   file 18.66KiB 
/housing-simplified_cv_reg_output.png file 77.1KiB  
/housing-simplified_model.sav         file 798.5KiB 
/housing-simplified_pairplot.png      file 100.8KiB 


In [31]:
!pachctl get file regression@master:/ --recursive --output .

## Step 5: Update Dataset

This is where Pachyderm's data versioning pays off. To update the dataset, run the command below. We could also append new examples to the existing file, but here we simply overwrite it with a larger one:

In [33]:
!pachctl put file housing_data@master:housing-simplified.csv -f data/housing-simplified-2.csv



In [47]:
!pachctl list commit housing_data@master

REPO         BRANCH COMMIT                           FINISHED           SIZE     ORIGIN DESCRIPTION
housing_data master 68e7175eac3f4654b141e05c3769807d About a minute ago 12.14KiB USER    
housing_data master f8fa49a2838c495eaa51c1675684f82c 4 minutes ago      2.482KiB USER    
housing_data master aa9373420c4146d393164a2857c0385a 7 minutes ago      0B       AUTO    


In [45]:
!pachctl list file housing_data@master

NAME                    TYPE SIZE     
/housing-simplified.csv file 12.14KiB 


In [46]:
!pachctl list file housing_data@master^1

NAME                    TYPE SIZE     
/housing-simplified.csv file 2.482KiB 


In [49]:
!pachctl list commit regression@master

REPO       BRANCH COMMIT                           FINISHED      SIZE     ORIGIN DESCRIPTION
regression master 68e7175eac3f4654b141e05c3769807d 2 minutes ago 4.029MiB AUTO    
regression master f8fa49a2838c495eaa51c1675684f82c 5 minutes ago 995.1KiB AUTO    
regression master aa9373420c4146d393164a2857c0385a 7 minutes ago 0B       AUTO    
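The two commit listings above carry the same commit IDs in the input repo and in the output repo. That is what lets us trace a model back to the exact data it was trained on. The sketch below checks this using IDs copied from the listings; the helper name `commit_ids` is made up for illustration.

```python
# Repo, branch and commit columns copied from the two `pachctl list commit` outputs.
INPUT_COMMITS = """\
housing_data master 68e7175eac3f4654b141e05c3769807d
housing_data master f8fa49a2838c495eaa51c1675684f82c
housing_data master aa9373420c4146d393164a2857c0385a
"""
OUTPUT_COMMITS = """\
regression master 68e7175eac3f4654b141e05c3769807d
regression master f8fa49a2838c495eaa51c1675684f82c
regression master aa9373420c4146d393164a2857c0385a
"""


def commit_ids(listing: str) -> list:
    """Extract the third column (the commit ID) from each non-empty line."""
    return [line.split()[2] for line in listing.splitlines() if line.strip()]


input_ids = set(commit_ids(INPUT_COMMITS))
# Output commits whose ID also exists in the input repo, newest first.
shared = [cid for cid in commit_ids(OUTPUT_COMMITS) if cid in input_ids]
print(shared)
```

Since every output commit has a matching input commit, a command such as `pachctl list file regression@f8fa49a2838c495eaa51c1675684f82c` should show the model trained on the first, smaller dataset.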


## Test the model

In [110]:
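import joblib

# Load the model downloaded from the regression repo in Step 4.
model = joblib.load('housing-simplified_model.sav')
# Smoke test with one dummy row of three feature values.
model.predict([[1,1,1]])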

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


array([542787.])

## Deploy Seldon service

More information on SKlearn server: https://docs.seldon.io/projects/seldon-core/en/latest/servers/sklearn.html

In [85]:
%%writefile secret.yaml
    
apiVersion: v1
kind: Secret
metadata:
  name: seldon-init-container-secret
type: Opaque
stringData:
  RCLONE_CONFIG_S3_TYPE: s3
  RCLONE_CONFIG_S3_PROVIDER: minio
  RCLONE_CONFIG_S3_ENV_AUTH: "false"
  RCLONE_CONFIG_S3_ACCESS_KEY_ID: ""
  RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: ""
  RCLONE_CONFIG_S3_ENDPOINT: http://pachd.pachyderm.svc.cluster.local:30600

Overwriting secret.yaml


In [86]:
!kubectl -n seldon apply -f secret.yaml

secret/seldon-init-container-secret configured
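The keys in the secret follow rclone's environment-variable convention, `RCLONE_CONFIG_<REMOTE>_<OPTION>`, so together they define one rclone remote named `s3`. The sketch below shows how the names break down. The helper `rclone_remote_options` is hypothetical and only illustrates the naming scheme; it does not call rclone.

```python
# stringData copied from secret.yaml above.
STRING_DATA = {
    "RCLONE_CONFIG_S3_TYPE": "s3",
    "RCLONE_CONFIG_S3_PROVIDER": "minio",
    "RCLONE_CONFIG_S3_ENV_AUTH": "false",
    "RCLONE_CONFIG_S3_ACCESS_KEY_ID": "",
    "RCLONE_CONFIG_S3_SECRET_ACCESS_KEY": "",
    "RCLONE_CONFIG_S3_ENDPOINT": "http://pachd.pachyderm.svc.cluster.local:30600",
}


def rclone_remote_options(env: dict, remote: str) -> dict:
    """Collect the options that the environment defines for one remote (hypothetical helper)."""
    prefix = f"RCLONE_CONFIG_{remote.upper()}_"
    # Strip the prefix and lowercase the rest to get the rclone option name.
    return {
        key[len(prefix):].lower(): value
        for key, value in env.items()
        if key.startswith(prefix)
    }


options = rclone_remote_options(STRING_DATA, "s3")
print(options)
```

The empty access key and secret match a Pachyderm S3 gateway that runs without authentication, as in this notebook.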


In [None]:
The Seldon SKLearn server requires the model file to be named model.joblib, so we upload the trained model under that name.

In [72]:
!pachctl create repo seldon_models
!pachctl put file seldon_models@master:model.joblib -f housing-simplified_model.sav

cannot start a commit on an output branch: regression@master


In [117]:
!pachctl list file seldon_models@master

NAME          TYPE SIZE     
/model.joblib file 798.5KiB 



If you want to build your own Docker image, remember to point your shell at minikube's Docker daemon first (eval $(minikube docker-env)); otherwise the cluster cannot see the image. See:
https://stackoverflow.com/questions/42564058/how-to-use-local-docker-images-with-minikube

In [185]:
%%writefile deploy.yaml

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: housing-regressor
spec:
  predictors:
  - name: default
    replicas: 1
    graph:
      name: regressor
      implementation: SKLEARN_SERVER
      modelUri: s3://master.seldon_models
      storageInitializerImage: seldonio/rclone-storage-initializer:1.14.0-dev
      envSecretRefName: seldon-init-container-secret
      parameters:
        - name: method
          type: STRING
          value: predict

Overwriting deploy.yaml
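The `modelUri` in the deployment follows the bucket naming of the Pachyderm S3 gateway, where each branch of a repo is exposed as a bucket named `<branch>.<repo>`. A minimal sketch follows; the helper name `pachyderm_model_uri` is made up for illustration, and it assumes a Pachyderm 2.x gateway without projects, as used in this notebook.

```python
def pachyderm_model_uri(repo: str, branch: str = "master") -> str:
    """Build the s3:// URI for one branch of a Pachyderm repo (hypothetical helper)."""
    # Assumption: the S3 gateway exposes buckets as "<branch>.<repo>".
    return f"s3://{branch}.{repo}"


model_uri = pachyderm_model_uri("seldon_models")
print(model_uri)
```

The SKLearn server then looks for `model.joblib` inside that bucket, which is why the model was uploaded under that name.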


In [195]:
!kubectl -n seldon apply -f deploy.yaml

seldondeployment.machinelearning.seldon.io/housing-regressor created


In [217]:
!kubectl -n seldon get po

NAME                                                     READY   STATUS    RESTARTS   AGE
housing-regressor-default-0-regressor-64c95db89f-tdbmq   2/2     Running   0          26s


In [181]:
!kubectl logs housing-regressor-default-0-regressor-64c95db89f-mq4s4 -n seldon

Defaulted container "regressor" out of: regressor, seldon-container-engine, regressor-model-initializer (init)
starting microservice
2022-06-15 10:09:08,444 - seldon_core.microservice:main:203 - INFO:  Starting microservice.py:main
2022-06-15 10:09:08,444 - seldon_core.microservice:main:204 - INFO:  Seldon Core version: 1.9.0
2022-06-15 10:09:08,446 - seldon_core.microservice:main:345 - INFO:  Parse JAEGER_EXTRA_TAGS []
2022-06-15 10:09:08,446 - seldon_core.microservice:load_annotations:155 - INFO:  Found annotation kubernetes.io/config.seen:2022-06-15T10:09:05.225418137Z 
2022-06-15 10:09:08,446 - seldon_core.microservice:load_annotations:155 - INFO:  Found annotation kubernetes.io/config.source:api 
2022-06-15 10:09:08,446 - seldon_core.microservice:load_annotations:155 - INFO:  Found annotation prometheus.io/path:/prometheus 
2022-06-15 10:09:08,446 - seldon_core.microservice:load_annotations:155 - INFO:  Found annotation prometheus.io/scrape:true 
2022-06-15 10:09:08,446 - seldon_c

## Don't forget to forward the HTTP and gRPC services!

In [221]:
!kubectl describe sdep housing-regressor -n seldon

Name:         housing-regressor
Namespace:    seldon
Labels:       <none>
Annotations:  <none>
API Version:  machinelearning.seldon.io/v1
Kind:         SeldonDeployment
Metadata:
  Creation Timestamp:  2022-06-15T10:14:07Z
  Generation:          1
  Managed Fields:
    API Version:  machinelearning.seldon.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2022-06-15T10:14:07Z
    API Version:  machinelearning.seldon.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:predictors:
      f:status:
        .:
        f:address:
          .:
          f:url:
        f:deploymentStatus:
          .:
          f:housing-regressor-default-0-regressor:
            .:
            f:availableReplicas:
            f:replicas:
        f:replicas:
        f:serviceStatus:

In [222]:
!kubectl get svc -n seldon

NAME                                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
housing-regressor-default             ClusterIP   10.101.139.187   <none>        8000/TCP,5001/TCP   97s
housing-regressor-default-regressor   ClusterIP   10.105.72.26     <none>        9000/TCP,9500/TCP   2m2s


In [None]:
# In terminal: kubectl port-forward svc/housing-regressor-default-regressor 7000:9000 7500:9500 -n seldon

In [152]:
!head data/housing-simplified-1.csv -n 2

RM,LSTAT,PTRATIO,MEDV
6.575,4.98,15.3,504000.0
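The first data row can be turned into a prediction request by dropping the target column and wrapping the rest in Seldon's `ndarray` payload. The sketch below assumes the model expects the features in CSV column order (RM, LSTAT, PTRATIO); the helper name `rows_to_payload` is hypothetical.

```python
import csv
import io
import json

# The two lines printed by `head` above.
SAMPLE = "RM,LSTAT,PTRATIO,MEDV\n6.575,4.98,15.3,504000.0\n"


def rows_to_payload(csv_text: str, target_col: str = "MEDV") -> dict:
    """Convert CSV rows into a Seldon ndarray payload (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumption: feature order is the CSV column order without the target.
    features = [name for name in reader.fieldnames if name != target_col]
    rows = [[float(record[name]) for name in features] for record in reader]
    return {"data": {"ndarray": rows}}


payload = rows_to_payload(SAMPLE)
print(json.dumps(payload))
```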


### REST

In [8]:
%%bash
curl -s -X POST -H 'Content-Type: application/json' \
    -d '{"data":{"ndarray":[[6.575, 4.98, 15.3]]}}' \
    http://localhost:7000/api/v1.0/predictions

{"data":{"names":[],"ndarray":[522921.0]},"meta":{"requestPath":{"regressor":"seldonio/sklearnserver:1.9.0"}}}
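The same request can be sent from Python using only the standard library. This is a sketch, assuming the port-forward from the previous cell is active on localhost:7000. The helper names `build_request`, `parse_predictions` and `predict` are made up for illustration, and the parsing is demonstrated on the response shown above rather than on a live call.

```python
import json
import urllib.request

# Assumption: port 7000 is forwarded to the model's REST port, as set up earlier.
URL = "http://localhost:7000/api/v1.0/predictions"


def build_request(rows, url=URL):
    """Create the POST request that the curl command sends."""
    body = json.dumps({"data": {"ndarray": rows}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_predictions(response_text):
    """Pull the predicted values out of a Seldon response body."""
    return json.loads(response_text)["data"]["ndarray"]


def predict(rows, url=URL):
    """Send the request and return the predictions (needs the port-forward)."""
    with urllib.request.urlopen(build_request(rows, url), timeout=10) as resp:
        return parse_predictions(resp.read().decode("utf-8"))


# Response body copied from the curl output above.
SAMPLE_RESPONSE = (
    '{"data":{"names":[],"ndarray":[522921.0]},'
    '"meta":{"requestPath":{"regressor":"seldonio/sklearnserver:1.9.0"}}}'
)
print(parse_predictions(SAMPLE_RESPONSE))
```

With the port-forward running, `predict([[6.575, 4.98, 15.3]])` sends the same request as the curl command.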


### GRPC

In [10]:
import grpc
from google.protobuf import struct_pb2
from seldon_core.proto import prediction_pb2, prediction_pb2_grpc

# Local port 7500 is forwarded to the model container's gRPC port (9500).
channel = grpc.insecure_channel("localhost:7500")
stub = prediction_pb2_grpc.ModelStub(channel)

# One row of feature values, the same as in the REST example.
batch = struct_pb2.ListValue()
batch.append([6.575, 4.98, 15.3])
data = prediction_pb2.DefaultData(ndarray=batch)
seldon_request = prediction_pb2.SeldonMessage(data=data)
response = stub.Predict(seldon_request)
response

meta {
  requestPath {
    key: "regressor"
    value: "seldonio/sklearnserver:1.9.0"
  }
}
data {
  ndarray {
    values {
      number_value: 522921.0
    }
  }
}