In [126]:
!pip3 install numpy seaborn pandas scikit-learn pyarrow

In [111]:
!pachctl version

COMPONENT           VERSION             
pachctl             2.2.2               
pachd               2.2.2               


In [9]:
!python3 regression.py --help

usage: regression.py [-h] [--input INPUT] [--target-col TARGET_COL]
                     [--output DIR]

Structured data regression

options:
  -h, --help            show this help message and exit
  --input INPUT         csv file with all examples
  --target-col TARGET_COL
                        column with target values
  --output DIR          output directory
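
The help text above is enough to sketch what the command-line interface of regression.py might look like. The following is a minimal reconstruction with argparse, not the script's actual source: the flag names, the DIR metavar, and the help strings are copied from the output above, while defaults and required flags are not shown there and are left unset.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild a parser that matches the --help output (a sketch, not the real script)."""
    parser = argparse.ArgumentParser(
        prog="regression.py",
        description="Structured data regression",
    )
    # Flag names and help strings mirror the usage text; no defaults are assumed.
    parser.add_argument("--input", help="csv file with all examples")
    parser.add_argument("--target-col", help="column with target values")
    parser.add_argument("--output", metavar="DIR", help="output directory")
    return parser


# Parse the same arguments that the pipeline spec in Step 2 passes to the script.
args = build_parser().parse_args(
    ["--input", "/pfs/housing_data/", "--target-col", "MEDV", "--output", "/pfs/out/"]
)
print(args.input, args.target_col, args.output)
```

argparse turns the dash in --target-col into an underscore, so the value is read as args.target_col.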


## Step 1: Create an input data repository

In [10]:
!pachctl create repo housing_data

In [11]:
!pachctl list repo

NAME         CREATED       SIZE (MASTER) DESCRIPTION                       
housing_data 7 seconds ago ≤ 0B                                            
count        5 hours ago   ≤ 22B         Output repo for pipeline count.   
data         6 hours ago   ≤ 728B                                          
reduce       22 hours ago  ≤ 6.545KiB    Output repo for pipeline reduce.  
map          22 hours ago  ≤ 8.583KiB    Output repo for pipeline map.     
scraper      22 hours ago  ≤ 333.5KiB    Output repo for pipeline scraper. 
urls         22 hours ago  ≤ 119B                                          


## Step 2: Create the regression pipeline

In [12]:
!cat regression.json

{
    "pipeline": {
        "name": "regression"
    },
    "description": "A pipeline that trains a regression model for housing prices.",
    "input": {
        "pfs": {
            "glob": "/*",
            "repo": "housing_data"
        }
    },
    "transform": {
        "cmd": [
            "python", "regression.py",
            "--input", "/pfs/housing_data/",
            "--target-col", "MEDV",
            "--output", "/pfs/out/"
        ],
        "image": "pachyderm/housing-prices:1.11.0"
    }
}

In [13]:
!pachctl create pipeline -f regression.json

The pipeline writes its results to /pfs/out/ (the --output argument in the pipeline JSON). Pachyderm stores everything written there in an output repo that it creates with the same name as the pipeline, here regression.
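
To make the input and output paths concrete, here is a small sketch of how output file names could be derived from an input file under /pfs/housing_data/. The naming rule (input base name plus a suffix) is inferred from the file listing in Step 4 and is an assumption about regression.py, not its confirmed logic. The helper output_paths is hypothetical.

```python
from pathlib import PurePosixPath

# Suffixes copied from the `pachctl list file regression@master` listing in Step 4.
SUFFIXES = ("corr_matrix.png", "cv_reg_output.png", "model.sav", "pairplot.png")


def output_paths(input_file: str, output_dir: str = "/pfs/out/") -> list:
    """Return the paths a run would write for one input CSV (assumed naming rule)."""
    stem = PurePosixPath(input_file).stem  # "housing-simplified" for the demo file
    return [str(PurePosixPath(output_dir) / f"{stem}_{suffix}") for suffix in SUFFIXES]


for path in output_paths("/pfs/housing_data/housing-simplified.csv"):
    print(path)
```

Anything written below /pfs/out/ ends up in the regression repo with the /pfs/out prefix removed.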

## Step 3: Add the housing dataset to the repo

Now we can add the data, which kicks off processing automatically. Whenever we update the data with a new commit, the pipeline re-runs automatically.

In [24]:
!pachctl put file housing_data@master:housing-simplified.csv -f data/housing-simplified-1.csv



In [26]:
!pachctl list file housing_data@master

NAME                    TYPE SIZE     
/housing-simplified.csv file 2.482KiB 


In [28]:
!pachctl list job

ID                               SUBJOBS PROGRESS CREATED            MODIFIED
f8fa49a2838c495eaa51c1675684f82c 1       ▇▇▇▇▇▇▇▇ About a minute ago About a minute ago 
aa9373420c4146d393164a2857c0385a 1       ▇▇▇▇▇▇▇▇ 3 minutes ago      3 minutes ago      
e168bea3fdbf49d2849354c2dc833dd9 1       ▇▇▇▇▇▇▇▇ 6 hours ago        6 hours ago        
e35d00004c5b4288b6580c5c0519cc80 1       ▇▇▇▇▇▇▇▇ 6 hours ago        6 hours ago        
ebee0fc1176c4e01a8093559cb893a5c 3       ▇▇▇▇▇▇▇▇ 6 hours ago        6 hours ago        
f3b7d09fb53f49acb727ce0010027b9f 3       ▇▇▇▇▇▇▇▇

## Step 4: Download files once the pipeline has finished

In [30]:
!pachctl list file regression@master

NAME                                  TYPE SIZE     
/housing-simplified_corr_matrix.png   file 18.66KiB 
/housing-simplified_cv_reg_output.png file 77.1KiB  
/housing-simplified_model.sav         file 798.5KiB 
/housing-simplified_pairplot.png      file 100.8KiB 


In [31]:
!pachctl get file regression@master:/ --recursive --output .

## Step 5: Update the dataset

Here's where Pachyderm truly starts to shine. To update our dataset, we run the following command. (We could also append new examples to the existing file, but in this example we simply overwrite the previous file with one that contains more data.)

In [33]:
!pachctl put file housing_data@master:housing-simplified.csv -f data/housing-simplified-2.csv



In [47]:
!pachctl list commit housing_data@master

REPO         BRANCH COMMIT                           FINISHED           SIZE     ORIGIN DESCRIPTION
housing_data master 68e7175eac3f4654b141e05c3769807d About a minute ago 12.14KiB USER    
housing_data master f8fa49a2838c495eaa51c1675684f82c 4 minutes ago      2.482KiB USER    
housing_data master aa9373420c4146d393164a2857c0385a 7 minutes ago      0B       AUTO    


In [45]:
!pachctl list file housing_data@master

NAME                    TYPE SIZE     
/housing-simplified.csv file 12.14KiB 


In [46]:
!pachctl list file housing_data@master^1

NAME                    TYPE SIZE     
/housing-simplified.csv file 2.482KiB 
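
The master^1 reference above selects the commit one step before the head of master, which is why the file shows its earlier size of 2.482KiB. The sketch below mimics that lookup against the commit history printed by `pachctl list commit housing_data@master`. The helper resolve is hypothetical and only handles the explicit `^N` form used here; it is not pachctl's own resolution logic.

```python
# Commit IDs copied from the `pachctl list commit housing_data@master` output, newest first.
COMMITS = [
    "68e7175eac3f4654b141e05c3769807d",  # 12.14KiB, second upload
    "f8fa49a2838c495eaa51c1675684f82c",  # 2.482KiB, first upload
    "aa9373420c4146d393164a2857c0385a",  # 0B, created automatically
]


def resolve(ref: str, history: list) -> str:
    """Resolve 'branch' or 'branch^N' against a newest-first commit history."""
    _branch, caret, steps = ref.partition("^")
    if caret and not steps.isdigit():
        raise ValueError(f"expected a number after '^' in {ref!r}")
    offset = int(steps) if caret else 0
    if offset >= len(history):
        raise ValueError(f"{ref!r} points past the oldest commit")
    return history[offset]


print(resolve("master", COMMITS))
print(resolve("master^1", COMMITS))
```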


In [49]:
!pachctl list commit regression@master

REPO       BRANCH COMMIT                           FINISHED      SIZE     ORIGIN DESCRIPTION
regression master 68e7175eac3f4654b141e05c3769807d 2 minutes ago 4.029MiB AUTO    
regression master f8fa49a2838c495eaa51c1675684f82c 5 minutes ago 995.1KiB AUTO    
regression master aa9373420c4146d393164a2857c0385a 7 minutes ago 0B       AUTO    


## Test the model

In [110]:
import joblib

model = joblib.load('housing-simplified_model.sav')
model.predict([[1,1,1]])

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


array([542787.])

## Deploy Seldon service

In [85]:
%%writefile secret.yaml
    
apiVersion: v1
kind: Secret
metadata:
  name: seldon-init-container-secret
type: Opaque
stringData:
  RCLONE_CONFIG_S3_TYPE: s3
  RCLONE_CONFIG_S3_PROVIDER: minio
  RCLONE_CONFIG_S3_ENV_AUTH: "false"
  RCLONE_CONFIG_S3_ACCESS_KEY_ID: ""
  RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: ""
  RCLONE_CONFIG_S3_ENDPOINT: http://pachd.pachyderm.svc.cluster.local:30600

Overwriting secret.yaml


In [86]:
!kubectl -n seldon apply -f secret.yaml

secret/seldon-init-container-secret configured


In [None]:
The Seldon scikit-learn server requires the model file to be named model.joblib.

In [72]:
!pachctl create repo seldon_models
!pachctl put file seldon_models@master:model.joblib -f housing-simplified_model.sav

cannot start a commit on an output branch: regression@master


In [117]:
!pachctl list file seldon_models@master

NAME          TYPE SIZE     
/model.joblib file 798.5KiB 


In [88]:
%%writefile deploy.yaml

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: housing-regressor
spec:
  predictors:
  - name: default
    replicas: 1
    graph:
      name: regressor
      implementation: SKLEARN_SERVER
      modelUri: s3://master.seldon_models
      storageInitializerImage: seldonio/rclone-storage-initializer:1.14.0-dev
      envSecretRefName: seldon-init-container-secret

Overwriting deploy.yaml
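
The modelUri in deploy.yaml appears to address the model through a bucket named after the branch and the repo, which matches the seldon_models@master commit uploaded earlier. Assuming that branch.repo convention holds for this Pachyderm version, the two hypothetical helpers below build and split such a URI. They are an illustration of the naming pattern only, not part of Pachyderm or Seldon.

```python
def gateway_uri(repo: str, branch: str = "master") -> str:
    """Build an s3:// URI for a repo branch, assuming buckets are named '<branch>.<repo>'."""
    return f"s3://{branch}.{repo}"


def split_gateway_uri(uri: str) -> tuple:
    """Split 's3://<branch>.<repo>[/path]' into (branch, repo) under the same assumption."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an s3 URI: {uri!r}")
    bucket = uri[len(prefix):].split("/", 1)[0]
    # Split on the first dot only; everything after it is treated as the repo name.
    branch, dot, repo = bucket.partition(".")
    if not (dot and branch and repo):
        raise ValueError(f"bucket is not of the form branch.repo: {bucket!r}")
    return branch, repo


print(gateway_uri("seldon_models"))
print(split_gateway_uri("s3://master.seldon_models"))
```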


In [89]:
!kubectl -n seldon apply -f deploy.yaml

seldondeployment.machinelearning.seldon.io "housing-regressor" deleted


In [91]:
!kubectl -n seldon get po

NAME                                                     READY   STATUS    RESTARTS   AGE
housing-regressor-default-0-regressor-846bb666f7-kczjm   2/2     Running   0          74s


In [105]:
%%bash
curl -s -X POST -H 'Content-Type: application/json' \
    -d '{"data":{"ndarray":[[1,1,1]]}}' \
    http://localhost:7000/api/v1.0/predictions
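
The same request can be issued from Python instead of curl. The sketch below uses only the standard library and reuses the URL and JSON body from the cell above. It assumes the service is reachable on localhost:7000 (for example through a port-forward, which this notebook does not show), so the predict function is defined but not called here. The function names are hypothetical.

```python
import json
import urllib.request

# Endpoint taken from the curl call; reachability on localhost is an assumption.
PREDICT_URL = "http://localhost:7000/api/v1.0/predictions"


def build_request(rows, url: str = PREDICT_URL) -> urllib.request.Request:
    """Wrap feature rows in the {"data": {"ndarray": ...}} body used by the curl call."""
    body = json.dumps({"data": {"ndarray": rows}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(rows, url: str = PREDICT_URL, timeout: float = 10.0):
    """Send the request and return the decoded JSON response (needs a running service)."""
    with urllib.request.urlopen(build_request(rows, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Inspect the request without sending it.
request = build_request([[1, 1, 1]])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling predict([[1, 1, 1]]) against a running deployment sends the same three feature values used in the local joblib test.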

## Upload files for the next demo

In [127]:
import pandas as pd
pd.read_parquet('data/house_dataset_main.parquet').head()

Unnamed: 0,HouseId,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,MedHouseVal,EventTimestamp,Created
0,1,2.4792,24.0,3.454704,1.134146,2251.0,3.921603,2.0,2021-12-11 18:40:03,2022-04-12 12:04:13
1,2,3.463,8.0,6.363636,1.166297,1307.0,2.898004,2.017,2021-12-11 18:57:30,2022-04-12 12:04:13
2,3,3.75,16.0,5.768719,1.023295,1478.0,2.459235,1.473,2021-12-11 19:00:05,2022-04-12 12:04:13
3,4,2.8542,34.0,3.858779,1.045802,1164.0,4.442748,1.469,2021-12-11 19:23:36,2022-04-12 12:04:13
4,5,1.3375,18.0,4.567625,1.087327,2707.0,2.882854,0.596,2021-12-11 19:23:53,2022-04-12 12:04:13


In [128]:
import pandas as pd
pd.read_parquet('data/house_dataset_lat_lon.parquet').head()

Unnamed: 0,HouseId,Latitude,Longitude,EventTimestamp,Created
0,1,34.18,-118.38,2021-12-11 18:40:03,2022-02-13 22:43:53
1,2,39.08,-121.04,2021-12-11 18:57:30,2022-02-13 22:43:53
2,3,38.68,-121.28,2021-12-11 19:00:05,2022-02-13 22:43:53
3,4,34.04,-118.19,2021-12-11 19:23:36,2022-02-13 22:43:53
4,5,39.13,-121.54,2021-12-11 19:23:53,2022-02-13 22:43:53


In [120]:
!pachctl create repo feast
!pachctl put file feast@master:house_dataset_lat_lon.parquet -f data/house_dataset_lat_lon.parquet
!pachctl put file feast@master:house_dataset_main.parquet -f data/house_dataset_main.parquet



In [123]:
!pachctl list file feast@master

NAME                           TYPE SIZE     
/house_dataset_lat_lon.parquet file 354.4KiB 
/house_dataset_main.parquet    file 1.012MiB 
