CDSW Experiments and Models

Although this workshop doesn’t involve CDF components, we have made it available to explain how the CDSW model endpoint used in other workshops is implemented.

In this workshop you will run Experiments in CDSW, choose the model that yielded the best experiment results and deploy that model in production.

Labs summary

  • Lab 1 - CDSW: Train the model.

  • Lab 2 - Deploy the Model.

Lab 1 - CDSW: Train the model

In this and the following lab, you will wear the hat of a Data Scientist. You will write the model code, train it several times and finally deploy the model to Production. All within 30 minutes!

STEP 1: Configure CDSW

  1. Open the CDSW Web UI and log in as admin, if you haven’t done so yet.

  2. Navigate to the CDSW Admin page to fine-tune the environment:

    1. In the Engines tab, under Engine Profiles, add a new engine (Docker image) with 2 vCPUs and 4 GB RAM, and delete the default engine.

    2. Check whether the following variable already exists under Environment Variables. If not, add it:

      HADOOP_CONF_DIR=/etc/hadoop/conf/
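
      To verify later that the variable is visible inside a session, you can run this optional shell-escape check from a workbench session (the same ! convention as the commands used in the next step):

      !env | grep HADOOP_CONF_DIR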

STEP 2: Create the project

  1. Return to the main page and click on New Project, using this GitHub project as the source: https://github.com/cloudera-labs/edge2ai-workshop

  2. Now that your project has been created, click on Open Workbench and start a Python3 session:

  3. Once the Engine is ready, run the following command to install some required libraries:

    !pip3 install --upgrade pip scikit-learn
  4. The project comes with a historical dataset. Copy this dataset into HDFS (an optional check to confirm the upload is shown after this list):

    !hdfs dfs -put -f data/historical_iot.txt /user/$HADOOP_USER_NAME
  5. You’re now ready to run the Experiment to train the model on your historical data.

  6. You can stop the Engine at this point.
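
If you want to confirm the upload from step 4, you can list the target directory from the same session before stopping the engine:

!hdfs dfs -ls /user/$HADOOP_USER_NAME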

STEP 3: Examine cdsw.iot_exp.py

Open the file cdsw.iot_exp.py. This is a Python program that builds a model to predict machine failure (the likelihood that a given machine is going to fail). A dataset is available on HDFS with customer data, including a failure indicator field.

The program builds a failure prediction model using the Random Forest algorithm. Random forests are ensembles of decision trees and are among the most successful machine learning models for classification and regression. They combine many decision trees in order to reduce the risk of overfitting. Like decision trees, random forests handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.

spark.mllib supports random forests for binary and multiclass classification and for regression, using both continuous and categorical features. spark.mllib implements random forests using the existing decision tree implementation. Please see the decision tree guide for more information on trees.

The Random Forest algorithm expects a couple of parameters:

  • numTrees: Number of trees in the forest.

    Increasing the number of trees will decrease the variance in predictions, improving the model’s test-time accuracy. Training time increases roughly linearly in the number of trees.

  • maxDepth: Maximum depth of each tree in the forest.

    Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting. In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree. One tree is more likely to overfit than a random forest (because of the variance reduction from averaging multiple trees in the forest).
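
To make the role of these two parameters concrete, here is a minimal, self-contained sketch of training a random forest with the RDD-based spark.mllib API. The tiny in-memory dataset and three-value features are stand-ins for illustration only; cdsw.iot_exp.py may structure its training code differently.

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

sc = SparkContext.getOrCreate()

# Toy stand-in for the historical dataset, just to keep the sketch runnable.
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 65.0, 0.0]),
    LabeledPoint(1.0, [0.0, 95.0, 1.0]),
] * 50)

model = RandomForest.trainClassifier(data,
                                     numClasses=2,           # binary: failure / no failure
                                     categoricalFeaturesInfo={},
                                     numTrees=20,             # number of trees in the forest
                                     maxDepth=20)             # maximum depth of each tree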

In the cdsw.iot_exp.py program, these parameters are passed in at runtime and read into the following Python variables:

param_numTrees = int(sys.argv[1])
param_maxDepth = int(sys.argv[2])

Also note that the quality indicators for the Random Forest model are written back to the Data Science Workbench repository:

cdsw.track_metric("auroc", auroc)
cdsw.track_metric("ap", ap)

These indicators will show up later in the Experiments dashboard.
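
Putting these pieces together, an experiment script of this shape looks roughly as follows. This is a sketch, not the verbatim contents of cdsw.iot_exp.py: it uses scikit-learn (installed in STEP 2) on a synthetic stand-in dataset rather than the historical data on HDFS, but the argument parsing and metric tracking follow the lines quoted above.

import sys
import cdsw
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Hyperparameters arrive as space-separated command-line arguments
# from the Run Experiment dialog, e.g. "20 20".
param_numTrees = int(sys.argv[1])
param_maxDepth = int(sys.argv[2])

# Synthetic stand-in for the 12-feature historical IoT data, used here
# only to keep the sketch self-contained.
X, y = make_classification(n_samples=1000, n_features=12, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

model = RandomForestClassifier(n_estimators=param_numTrees,
                               max_depth=param_maxDepth)
model.fit(X_train, y_train)

# Quality indicators computed on the held-out test set.
scores = model.predict_proba(X_test)[:, 1]
auroc = roc_auc_score(y_test, scores)
ap = average_precision_score(y_test, scores)

# Report the metrics back to CDSW, where they appear on the
# Experiments dashboard.
cdsw.track_metric("auroc", auroc)
cdsw.track_metric("ap", ap)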

STEP 4: Run the experiment for the first time

  1. Now, run the experiment using the following parameters:

    numTrees = 20
    maxDepth = 20

    (These map to sys.argv[1] and sys.argv[2] in the script, so enter them as the space-separated arguments 20 20.)
  2. From the menu, select Run → Run Experiments…​. Now, in the background, the Data Science Workbench environment will spin up a new docker container, where this program will run.

  3. Go back to the Projects page in CDSW, and hit the Experiments button.

  4. If the Status indicates Running, wait until the run completes. If the status is Build Failed or Failed, check the log information, which is accessible by clicking on the run number of your experiment. There you can find the session log as well as the build information.

  5. If the status indicates Success, you should be able to see the auroc (Area Under the ROC Curve) model quality indicator. This value may be hidden by the CDSW user interface; in that case, click on the ‘3 metrics’ link and select the auroc field. You may need to deselect some other fields, since the interface can only show 3 metrics at a time.

  6. In this example, the auroc is ~0.8383. Not bad, but maybe there are better hyperparameter values available.

STEP 5: Re-run the experiment several times

  1. Go back to the Workbench and run the experiment two more times, trying different values for numTrees and maxDepth:

    numTrees  maxDepth
    15        25
    25        20
  2. When all runs have completed successfully, check which parameter combination yielded the best quality (best predictive value), represented by the highest auroc metric.


STEP 6: Save the best model to your environment

  1. Select the run number with the best predictive value (in the example above, experiment 4).

  2. In the Overview screen of the experiment, you can see that the model, in Pickle format (.pkl), is captured in the file iot_model.pkl. Select this file and hit the Add to Project button. This will copy the model to your project directory.


Lab 2 - CDSW: Deploy the model

STEP 1: Examine the program cdsw.iot_model.py

  1. Open the project you created in the previous lab and examine the file in the Workbench. This PySpark program uses the pickle.load mechanism to deploy models. The model is loaded from the iot_model.pkl file, which was saved in the previous lab from the experiment with the best predictive model.

    The program also contains the predict definition, which is the function that calls the model, passing the features as parameters, and returns a result variable. A minimal sketch of this structure is shown after this list.

  2. Before deploying the model, try it out in the Workbench: launch a Python3 engine and run the code in the file cdsw.iot_model.py. Then call the predict() function from the prompt:

    predict({"feature": "0, 65, 0, 137, 21.95, 83, 19.42, 111, 9.4, 6, 3.43, 4"})
  3. The function returns successfully, so we know the model can be deployed. You can now stop the engine.
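
For reference, the load-and-predict pattern described in this step looks roughly like this (a minimal sketch, assuming the pickled object is a scikit-learn-style classifier and that "feature" is a comma-separated string of numeric readings):

import pickle

# Load the model saved from the best experiment run in Lab 1.
with open("iot_model.pkl", "rb") as f:
    model = pickle.load(f)

def predict(args):
    # Turn the comma-separated feature string into a numeric vector.
    features = [float(x) for x in args["feature"].split(",")]
    # predict() returns an array of predictions; unwrap the first one.
    return {"result": int(model.predict([features])[0])}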

STEP 2: Deploy the model

  1. From the main page of your project, select the Models button. Select New Model and specify the following configuration:

    Name:          IoT Prediction Model
    Description:   IoT Prediction Model
    File:          cdsw.iot_model.py
    Function:      predict
    Example Input: {"feature": "0, 65, 0, 137, 21.95, 83, 19.42, 111, 9.4, 6, 3.43, 4"}
    Kernel:        Python 3
    Engine:        2 vCPU / 4 GB Memory
    Replicas:      1


  2. After all parameters are set, click on the Deploy Model button and wait until the model is deployed. This can take several minutes.

STEP 3: Test the deployed model

  1. When your model status changes to Deployed, click on the model name link to go to the model’s Overview page. From that page, click on the Test button to check whether the model is working.

  2. The green circle with the success status indicates that our REST call to the model is working. The 1 in the response {"result": 1} means that the machine from which these temperature readings were collected is unlikely to experience a failure.

  3. Now, let’s change the input parameters and call the predict function again. Enter the following values in the Input field:

    {
      "feature": "0, 95, 0, 88, 26.62, 75, 21.05, 115, 8.65, 5, 3.32, 3"
    }
  4. With these input parameters, the model returns 0, which means that the machine is likely to break.
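
Beyond the built-in Test button, the deployed model can be called from any HTTP client. Below is a hypothetical example using the Python requests library; the URL and access key are placeholders, and the real values must be copied from the model’s Overview page in CDSW:

import requests

# Placeholder endpoint and key: copy the real values from the model's
# Overview page.
MODEL_URL = "https://modelservice.cdsw.example.com/model"
ACCESS_KEY = "<your-model-access-key>"

response = requests.post(MODEL_URL, json={
    "accessKey": ACCESS_KEY,
    "request": {"feature": "0, 95, 0, 88, 26.62, 75, 21.05, 115, 8.65, 5, 3.32, 3"}
})
# The prediction is wrapped in the model service's response JSON.
print(response.json())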