
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Applying ML with UDFs
## Module 4, Lesson 6

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this notebook you:<br>
* Apply a pre-trained Linear Regression model to predict response times
* Identify which types of calls or neighborhoods are anticipated to have the longest response time

In [0]:
%run ../Includes/Classroom-Setup

## Create UDF

MLflow can create a user-defined function (UDF) for us to use in PySpark or SQL. A UDF lets custom code (that is, functionality not built into core Spark) run on Spark.

Use `spark.udf.register` to register this Python UDF in the SQL namespace under the name `predictUDF`.

In [0]:
%python
try:
  import mlflow
  from mlflow.pyfunc import spark_udf

  model_path = "/dbfs/mnt/davis/fire-calls/models/firecalls_pipeline"
  predict = spark_udf(spark, model_path, result_type="string")

  spark.udf.register("predictUDF", predict)
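except Exception as e:
  # Catch only regular exceptions and surface the underlying error for debugging.
  print("ERROR: This cell did not run, likely because you're not running the correct version of software. Please use a cluster with `DBR 5.5 ML` rather than `DBR 5.5` or a different cluster version.")
  print("Original error: {}".format(e))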
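
To make the idea of a UDF more concrete, here is a minimal pure-Python sketch of wrapping a model as a function of column values. It is not MLflow's implementation and does not use Spark: `ToyModel`, its coefficients, the `priority_level` feature, and `make_predict_fn` are all hypothetical stand-ins. The wrapper returns a string to mirror `result_type="string"` in the cell above, which is why the SQL that calls `predictUDF` casts the result to a double.

```python
# Hypothetical stand-in for a trained linear model; coefficients are made up.
class ToyModel:
    def __init__(self, intercept, weights):
        self.intercept = intercept
        self.weights = weights  # maps feature name -> coefficient

    def predict(self, features):
        # Linear prediction: intercept + sum(coefficient * feature value).
        return self.intercept + sum(
            self.weights[name] * value for name, value in features.items()
        )


def make_predict_fn(model, feature_names):
    """Wrap model.predict as a function that takes column values positionally."""
    def predict_fn(*column_values):
        features = dict(zip(feature_names, column_values))
        # Return a string, mirroring result_type="string".
        return str(model.predict(features))
    return predict_fn


model = ToyModel(intercept=2.0,
                 weights={"Number_of_Alarms": 0.5, "priority_level": 0.25})
predict_fn = make_predict_fn(model, ["Number_of_Alarms", "priority_level"])

raw = predict_fn(1, 3)      # the wrapped function returns a string
prediction = float(raw)     # analogous to cast(... as double) in SQL
print(raw, prediction)
```

Registering a function by name, as `spark.udf.register` does, is what makes it callable from a `%sql` cell.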

## Import the Data

Create a temporary view called `fireCallsParquet`.

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW fireCallsParquet
USING Parquet 
OPTIONS (
    path "/mnt/davis/fire-calls/fire-calls-1p.parquet"
  )

## Save Predictions

We are going to save our predictions to a temporary view called `predictions`.

In [0]:
%sql
USE Databricks;
DROP TABLE IF EXISTS predictions;

CREATE TEMPORARY VIEW predictions AS (
  SELECT cast(predictUDF(Call_Type, Fire_Prevention_District, `Neighborhooods_-_Analysis_Boundaries`, 
                Number_of_Alarms, Original_Priority, Unit_Type) as double) as prediction, *
  FROM fireCallsParquet
  LIMIT 10000)

## Average Prediction by Neighborhood

Let's see which neighborhood in San Francisco has the highest average predicted response time! Do you remember why we are setting the shuffle partitions here?

In [0]:
%sql
SET spark.sql.shuffle.partitions=8;

| key | value |
| --- | --- |
| spark.sql.shuffle.partitions | 8 |
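
As a hint for the question above: a `GROUP BY` shuffles rows so that all rows with the same key land in the same partition, and `spark.sql.shuffle.partitions` (200 by default) controls how many partitions that shuffle produces. The sketch below imitates this in plain Python with a hash-and-modulo assignment. It is only an illustration under simplifying assumptions: Spark uses its own hash function, the helper names are hypothetical, and the list holds just a handful of neighborhood names from this dataset.

```python
import zlib
from collections import defaultdict


def assign_partition(key, num_partitions):
    # Deterministic hash of the key, then modulo the partition count.
    # (Illustrative only; Spark's internal hash function differs.)
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_keys(keys, num_partitions):
    # Group keys by the partition they would be sent to.
    partitions = defaultdict(list)
    for key in keys:
        partitions[assign_partition(key, num_partitions)].append(key)
    return partitions


neighborhoods = ["Treasure Island", "Twin Peaks", "Lakeshore", "Excelsior",
                 "Visitacion Valley", "Noe Valley", "Russian Hill", "Tenderloin"]

for n in (200, 8):
    used = partition_keys(neighborhoods, n)
    print("{} shuffle partitions: {} non-empty, {} empty".format(
        n, len(used), n - len(used)))
```

With a small dataset and few distinct keys, most of the 200 default partitions would hold no data, yet each one still costs a task to schedule. A smaller setting such as 8 avoids that overhead.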


In [0]:
%sql
SELECT avg(prediction) as avgPrediction, `Neighborhooods_-_Analysis_Boundaries`
FROM predictions
GROUP BY `Neighborhooods_-_Analysis_Boundaries`
ORDER BY avgPrediction DESC

| avgPrediction | `Neighborhooods_-_Analysis_Boundaries` |
| --- | --- |
| 3.959233790123725 | Treasure Island |
| 3.5527230505189418 | Twin Peaks |
| 3.5325006367259086 | Lakeshore |
| 3.5072855376879155 | |
| 3.443746473732504 | Excelsior |
| 3.349685528889479 | Visitacion Valley |
| 3.318567531352264 | Presidio Heights |
| 3.3070010336485427 | Noe Valley |
| 3.3006035218794 | Russian Hill |
| 3.300176490789972 | West of Twin Peaks |
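
The query above is a group-then-aggregate followed by a sort. For readers who want the logic spelled out, the following sketch reproduces `avg(...) GROUP BY ... ORDER BY ... DESC` in plain Python. The rows are made up for illustration and are not model output, and `average_by_key` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up (neighborhood, prediction) rows for illustration only.
rows = [
    ("Twin Peaks", 3.0),
    ("Twin Peaks", 4.0),
    ("Excelsior", 3.0),
    ("Excelsior", 3.5),
    ("Lakeshore", 4.5),
]


def average_by_key(rows):
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for key, value in rows:
        totals[key][0] += value
        totals[key][1] += 1
    averages = [(total / count, key) for key, (total, count) in totals.items()]
    # Highest average first, like ORDER BY avgPrediction DESC.
    return sorted(averages, reverse=True)


for avg_prediction, neighborhood in average_by_key(rows):
    print(avg_prediction, neighborhood)
```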


## San Francisco Neighborhoods

![Map of San Francisco neighborhoods](https://files.training.databricks.com/images/eLearning/ucdavis/sfneighborhoods.gif)

## Standard Deviation on Prediction by Neighborhood

In [0]:
%sql
SELECT stddev(prediction) as stddevPrediction, `Neighborhooods_-_Analysis_Boundaries`
FROM predictions
GROUP BY `Neighborhooods_-_Analysis_Boundaries`
ORDER BY stddevPrediction DESC

| stddevPrediction | `Neighborhooods_-_Analysis_Boundaries` |
| --- | --- |
| 0.8260478153912544 | Golden Gate Park |
| 0.7583865587486988 | Outer Mission |
| 0.737081989048434 | Outer Richmond |
| 0.7160910787325376 | Bayview Hunters Point |
| 0.7148216073697868 | Treasure Island |
| 0.7109234596130071 | Inner Sunset |
| 0.7057118301150535 | Haight Ashbury |
| 0.7054040836188077 | Tenderloin |
| 0.6993739017096834 | Nob Hill |
| 0.6954477553180514 | South of Market |
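
Note that Spark SQL's `stddev` is an alias for `stddev_samp`, the sample standard deviation, which divides by n - 1 rather than n. The sketch below computes both the sample and the population version by hand for one group and checks them against Python's `statistics` module. The prediction values are made up for illustration.

```python
import math
import statistics

# Made-up predictions for a single neighborhood, for illustration only.
predictions = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = sum(predictions) / len(predictions)
squared_deviations = sum((p - mean) ** 2 for p in predictions)

# Sample standard deviation (n - 1 in the denominator), like stddev / stddev_samp.
sample_std = math.sqrt(squared_deviations / (len(predictions) - 1))
# Population standard deviation (n in the denominator), like stddev_pop.
population_std = math.sqrt(squared_deviations / len(predictions))

print(sample_std, population_std)
# Cross-check against the standard library's implementations.
print(math.isclose(sample_std, statistics.stdev(predictions)))
print(math.isclose(population_std, statistics.pstdev(predictions)))
```

The sample version is always slightly larger than the population version, and the gap matters most for neighborhoods with few calls.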


## Average Prediction by Call Type

In [0]:
%sql
SELECT avg(prediction) as avgPrediction, Call_Type
FROM predictions
GROUP BY Call_Type
ORDER BY avgPrediction DESC


| avgPrediction | Call_Type |
| --- | --- |
| 5.466770296408227 | HazMat |
| 4.340380483229641 | Odor (Strange / Unknown) |
| 4.125713275500113 | Fuel Spill |
| 4.038747753239171 | Electrical Hazard |
| 3.9464883853063553 | Citizen Assist / Service Call |
| 3.895505267898613 | Smoke Investigation (Outside) |
| 3.880954755578602 | Industrial Accidents |
| 3.857064063900725 | Elevator / Escalator Rescue |
| 3.8277058379189777 | Gas Leak (Natural and LP Gases) |
| 3.5508992914355084 | Other |


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>