# Training and scoring within a SQL Big Data Cluster

In this notebook you will train a model, use it to score data that has been uploaded to HDFS, and save the scored result to an external table.

Wide World Importers (WWI) uses refrigerated trucks to deliver temperature-sensitive products. These are high-profit, high-expense items. In the past, the cooling systems have failed, and the primary culprit has been the deep-cycle batteries used in the system.

WWI began replacing the batteries every three months as a preventative measure, but this has a high cost. Recently, the taxes on recycling batteries have increased dramatically. The CEO has asked the Data Science team to investigate creating a predictive maintenance system that tells the maintenance staff more accurately how long a battery will last, rather than relying on a flat three-month cycle.

The trucks have sensors that transmit data to a file location, and the trips are also logged. In this Jupyter notebook, you'll create, train, and store a machine learning model using scikit-learn so that it can be deployed to multiple hosts.

Begin by running the following cell. You can run any code cell by placing your cursor within its region and then selecting the play icon (a triangle within a circle) that appears on the left.

In [18]:
# Import the standard modules we need
import pickle 
import pandas as pd
import numpy as np
import datetime as dt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

First, download the sensor data from the location where it is transmitted from the trucks, and load it into a pandas DataFrame.

In [19]:
df = pd.read_csv('https://raw.githubusercontent.com/solliancenet/tech-immersion-data-ai/master/environment-setup/data/2/training-formatted.csv', header=0)
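# Note: dropna() returns a new DataFrame. The result is not assigned here, so df keeps all rows;
# missing feature values are filled later with interpolate().
df.dropna()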
print(df.shape)
print(list(df.columns))

(10000, 74)
['Survival_In_Days', 'Province', 'Region', 'Trip_Length_Mean', 'Trip_Length_Sigma', 'Trips_Per_Day_Mean', 'Trips_Per_Day_Sigma', 'Battery_Rated_Cycles', 'Manufacture_Month', 'Manufacture_Year', 'Alternator_Efficiency', 'Car_Has_EcoStart', 'Twelve_hourly_temperature_history_for_last_31_days_before_death_last_recording_first', 'Sensor_Reading_1', 'Sensor_Reading_2', 'Sensor_Reading_3', 'Sensor_Reading_4', 'Sensor_Reading_5', 'Sensor_Reading_6', 'Sensor_Reading_7', 'Sensor_Reading_8', 'Sensor_Reading_9', 'Sensor_Reading_10', 'Sensor_Reading_11', 'Sensor_Reading_12', 'Sensor_Reading_13', 'Sensor_Reading_14', 'Sensor_Reading_15', 'Sensor_Reading_16', 'Sensor_Reading_17', 'Sensor_Reading_18', 'Sensor_Reading_19', 'Sensor_Reading_20', 'Sensor_Reading_21', 'Sensor_Reading_22', 'Sensor_Reading_23', 'Sensor_Reading_24', 'Sensor_Reading_25', 'Sensor_Reading_26', 'Sensor_Reading_27', 'Sensor_Reading_28', 'Sensor_Reading_29', 'Sensor_Reading_30', 'Sensor_Reading_31', 'Sensor_Reading_32'
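Note that `DataFrame.dropna()` returns a new DataFrame rather than modifying the one it is called on, so calling it without assigning the result leaves the data unchanged. The following is a minimal sketch on a hypothetical toy frame (`toy` and its columns are illustrative and not part of the lab data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the sensor data
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Result discarded: toy still contains the row with the missing value
toy.dropna()
rows_unassigned = len(toy)

# Assigning the result keeps the filtered copy
cleaned = toy.dropna()
rows_cleaned = len(cleaned)

print(rows_unassigned, rows_cleaned)
```

Whether to drop incomplete rows or fill them (as the notebook later does with `interpolate()`) is a modeling choice; the sketch only shows the difference between the two call styles.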

After examining the data, the Data Science team selects certain columns that they believe are highly predictive of the battery life.

Now, you will pick out the features and labels from the training data. Run the following cell.

In [20]:
# Select the features used for predicting battery life
x = df.iloc[:,1:74]
x = x.iloc[:,np.r_[2:7, 9:73]]
x = x.interpolate() 

# Select the labels only (the measured battery life) 
y = df.iloc[:,0].values.flatten()
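The column selection above relies on `np.r_`, which concatenates several slices into one array of integer positions that `iloc` can consume. A small sketch on a hypothetical 12-column frame (the names `c0` to `c11` are made up for illustration) shows the mechanics, and also counts how many positions the notebook's own expression produces:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 12 columns named c0..c11
toy = pd.DataFrame(np.zeros((2, 12)), columns=[f"c{i}" for i in range(12)])

# np.r_ joins the slices 2:5 and 8:11 into a single index array
positions = np.r_[2:5, 8:11]
selected = toy.iloc[:, positions]

print(positions.tolist())
print(list(selected.columns))

# The notebook's expression selects 5 + 64 positions
n_features = len(np.r_[2:7, 9:73])
print(n_features)
```

Positional selection is compact but brittle: if the column order of the source file changes, the same positions silently pick different features, so selecting by name is worth considering outside a lab setting.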

Run the following cell to view the features that will be used to train the model.

In [21]:
# Examine the features selected 
print(list(x.columns))

['Trip_Length_Mean', 'Trip_Length_Sigma', 'Trips_Per_Day_Mean', 'Trips_Per_Day_Sigma', 'Battery_Rated_Cycles', 'Alternator_Efficiency', 'Car_Has_EcoStart', 'Twelve_hourly_temperature_history_for_last_31_days_before_death_last_recording_first', 'Sensor_Reading_1', 'Sensor_Reading_2', 'Sensor_Reading_3', 'Sensor_Reading_4', 'Sensor_Reading_5', 'Sensor_Reading_6', 'Sensor_Reading_7', 'Sensor_Reading_8', 'Sensor_Reading_9', 'Sensor_Reading_10', 'Sensor_Reading_11', 'Sensor_Reading_12', 'Sensor_Reading_13', 'Sensor_Reading_14', 'Sensor_Reading_15', 'Sensor_Reading_16', 'Sensor_Reading_17', 'Sensor_Reading_18', 'Sensor_Reading_19', 'Sensor_Reading_20', 'Sensor_Reading_21', 'Sensor_Reading_22', 'Sensor_Reading_23', 'Sensor_Reading_24', 'Sensor_Reading_25', 'Sensor_Reading_26', 'Sensor_Reading_27', 'Sensor_Reading_28', 'Sensor_Reading_29', 'Sensor_Reading_30', 'Sensor_Reading_31', 'Sensor_Reading_32', 'Sensor_Reading_33', 'Sensor_Reading_34', 'Sensor_Reading_35', 'Sensor_Reading_36', 'Sensor_R

The lead Data Scientist believes that a standard regression algorithm would produce the best predictions.

In the following cell, you train a model using a `GradientBoostingRegressor`, providing it the features (`x`) and the label values (`y`). Run the following cell.

In [22]:
# Train a regression model 
from sklearn.ensemble import GradientBoostingRegressor 
model = GradientBoostingRegressor() 
model.fit(x,y)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, n_iter_no_change=None, presort='auto',
             random_state=None, subsample=1.0, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)
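The cell above fits the model on every available row, so it gives no indication of how well the model generalizes. One common way to check this, which is not part of the lab itself, is to hold out a portion of the rows and measure the error on them. The sketch below uses synthetic data from `make_regression` as a stand-in for `x` and `y`; the names `X_demo`, `y_demo`, and `demo_model` are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and labels used in the notebook
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=5.0, random_state=0
)

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit on the training portion only
demo_model = GradientBoostingRegressor(random_state=0)
demo_model.fit(X_train, y_train)

# Mean absolute error on rows the model has not seen
mae = mean_absolute_error(y_test, demo_model.predict(X_test))
print(f"Held-out MAE: {mae:.2f}")
```

For the battery data, the error would be expressed in days, which makes it straightforward to compare against the flat three-month replacement cycle.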

Now try making a single prediction with the trained model. Run the following cell.

In [23]:
# Try making a single prediction and observe the result
model.predict(x.iloc[0:1])

array([1323.39791998])

With a trained model in hand, you are now ready to score battery life predictions against a new set of vehicle telemetry data. The output of the cell will be predicted battery life for each vehicle. Run the following cell.

In [24]:
# read the test data into a pandas DataFrame (this cell loads it from the GitHub URL below)
test_data = pd.read_csv('https://raw.githubusercontent.com/solliancenet/tech-immersion-data-ai/master/environment-setup/data/2/fleet-formatted.csv', header=0)
# Note: dropna() returns a new DataFrame; the result is not assigned, so test_data keeps all rows
test_data.dropna()

# prepare the test data (dropping unused columns) 
test_data = test_data.drop(columns=["Car_ID", "Battery_Age"])
test_data = test_data.iloc[:,np.r_[2:7, 9:73]]
# rename the forecast column so it matches the name the model was trained with
test_data.rename(columns={'Twelve_hourly_temperature_forecast_for_next_31_days_reversed': 'Twelve_hourly_temperature_history_for_last_31_days_before_death_last_recording_first'}, inplace=True) 

# make the battery life predictions for each of the vehicles in the test data 
battery_life_predictions = model.predict(test_data) 
# examine the prediction 
battery_life_predictions

array([1472.91111228, 1340.08897725, 1421.38601032, 1473.79033215,
       1651.66584142, 1412.85617044, 1842.81351408, 1264.22762055,
       1930.45602533, 1681.86345995])
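Because the model was fitted on a specific set of columns in a specific order, the scoring data must present the same columns in that order. Recent scikit-learn releases also check feature names when a model has been fitted on a DataFrame. The sketch below shows one way to make that alignment explicit; the shortened column names and the `train_cols` and `fleet` variables are hypothetical stand-ins, not the lab's actual schema:

```python
import pandas as pd

# Hypothetical column order the model was trained with
train_cols = ["Trip_Length_Mean", "temperature_history", "Sensor_Reading_1"]

# Hypothetical scoring data: different column order, one differently named column
fleet = pd.DataFrame(
    {
        "Sensor_Reading_1": [0.3],
        "temperature_forecast": [21.0],
        "Trip_Length_Mean": [12.5],
    }
)

# Rename the forecast column to the name used during training
fleet = fleet.rename(columns={"temperature_forecast": "temperature_history"})

# Fail loudly if any training column is absent from the scoring data
missing = [c for c in train_cols if c not in fleet.columns]
if missing:
    raise ValueError(f"Scoring data is missing columns: {missing}")

# Reorder the columns to match the training order
aligned = fleet[train_cols]
print(list(aligned.columns))
```

Selecting by name in this way avoids depending on the two files sharing the same column positions.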

Now you can package up the predictions along with the vehicle telemetry into a single DataFrame so that you can export it back out to HDFS as a CSV.

In [28]:
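# prepare one data frame that includes predictions for each vehicle
# copy() keeps test_data unchanged when the prediction column is added
scored_data = test_data.copy()
scored_data["Estimated_Battery_Life"] = battery_life_predictions

# `spark` is assumed to be the SparkSession provided by the notebook's Spark kernel
df_scored = spark.createDataFrame(scored_data)

# write a single CSV part file (with a header row) to HDFS
df_scored.coalesce(1).write.option("header", "true").csv("/data/battery-life.csv")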

The command above creates a folder named `battery-life.csv` that contains a single CSV file. You can create an external table over this file, which lets you query the predictions for each vehicle from SQL. Return to the lab instructions to learn how to create that external table.


## Optional - export and operationalize trained model

Once you are satisfied with the model, you can save it with the `pickle` library for deployment to other systems.

In [29]:
import os

# Serialize the trained model; the with block closes the file when done
with open('/tmp/pdm.pkl', 'wb') as pickle_file:
    pickle.dump(model, pickle_file)

print(os.getcwd())
os.listdir('/tmp')

/tmp/nm-local-dir/usercache/root/appcache/application_1561062262695_0003/container_1561062262695_0003_01_000001
['Jetty_localhost_45411_datanode____.dlby3q', 'tmp1a_q3d2g', 'pdm.pkl', 'hsperfdata_root', 'Jetty_0_0_0_0_8042_node____19tj0x', 'nm-local-dir']
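On the receiving system, the saved file can be loaded back with `pickle.load` and used for prediction. The sketch below illustrates the round trip with a small model trained on synthetic data and a temporary file; `demo_model`, `X_demo`, and the `demo.pkl` path are illustrative assumptions rather than the lab's artifacts:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the training data
rng = np.random.default_rng(0)
X_demo = rng.random((50, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])

demo_model = GradientBoostingRegressor(n_estimators=20, random_state=0)
demo_model.fit(X_demo, y_demo)

# Save the model to a temporary location
path = os.path.join(tempfile.mkdtemp(), "demo.pkl")
with open(path, "wb") as f:
    pickle.dump(demo_model, f)

# Load it back, as a consuming system would
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions
same = np.allclose(demo_model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Two caveats apply to this approach: only unpickle files from a trusted source, because loading a pickle can execute arbitrary code, and load the model with the same scikit-learn version that saved it where possible.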

You could export this model and [run it at the edge or in SQL Server directly](https://azure.microsoft.com/en-us/services/sql-database-edge/). Here's an example of what that code could look like:

```sql
DECLARE @query_string nvarchar(max) -- Query Truck Data
SET @query_string='
SELECT Trip_Length_Mean, Trip_Length_Sigma, Trips_Per_Day_Mean, Trips_Per_Day_Sigma, Battery_Rated_Cycles, Alternator_Efficiency, Car_Has_EcoStart, Twelve_hourly_temperature_history_for_last_31_days_before_death_last_recording_first, Sensor_Reading_1, Sensor_Reading_2, Sensor_Reading_3, Sensor_Reading_4, Sensor_Reading_5, Sensor_Reading_6, Sensor_Reading_7, Sensor_Reading_8, Sensor_Reading_9, Sensor_Reading_10, Sensor_Reading_11, Sensor_Reading_12, Sensor_Reading_13, Sensor_Reading_14, Sensor_Reading_15, Sensor_Reading_16, Sensor_Reading_17, Sensor_Reading_18, Sensor_Reading_19, Sensor_Reading_20, Sensor_Reading_21, Sensor_Reading_22, Sensor_Reading_23, Sensor_Reading_24, Sensor_Reading_25, Sensor_Reading_26, Sensor_Reading_27, Sensor_Reading_28, Sensor_Reading_29, Sensor_Reading_30, Sensor_Reading_31, Sensor_Reading_32, Sensor_Reading_33, Sensor_Reading_34, Sensor_Reading_35, Sensor_Reading_36, Sensor_Reading_37, Sensor_Reading_38, Sensor_Reading_39, Sensor_Reading_40, Sensor_Reading_41, Sensor_Reading_42, Sensor_Reading_43, Sensor_Reading_44, Sensor_Reading_45, Sensor_Reading_46, Sensor_Reading_47, Sensor_Reading_48, Sensor_Reading_49, Sensor_Reading_50, Sensor_Reading_51, Sensor_Reading_52, Sensor_Reading_53, Sensor_Reading_54, Sensor_Reading_55, Sensor_Reading_56, Sensor_Reading_57, Sensor_Reading_58, Sensor_Reading_59, Sensor_Reading_60, Sensor_Reading_61
FROM Truck_Sensor_Readings'
EXEC [dbo].[PredictBattLife] 'pdm', @query_string;
```