### General Instructions
The UCI Machine Learning Repository makes available a popular dataset recording various properties of three cultivars of Italian wine grapes: https://archive.ics.uci.edu/ml/datasets/Wine. These measurements can be used to build a multi-class classifier that predicts which cultivar is being observed.

The values in this dataset are:

0. Cultivar
1. Alcohol
2. Malic acid
3. Ash
4. Alcalinity of ash
5. Magnesium
6. Total phenols
7. Flavanoids
8. Nonflavanoid phenols
9. Proanthocyanins
10. Color intensity
11. Hue
12. OD280/OD315 of diluted wines
13. Proline

For this exercise, use sklearn.tree.DecisionTreeClassifier to build a multi-class predictor that identifies the grape cultivar from the provided attributes. To detect overfitting, train on 70% of the provided data and test on the remaining 30%.
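As a rough check on how many rows each split receives, the sketch below reproduces the rounding that scikit-learn's train_test_split is understood to apply to fractional sizes (a fractional test size is rounded up, a fractional train size is rounded down). The helper name `split_sizes` is hypothetical, and the row count of 178 is an assumption based on the published UCI Wine dataset rather than something stated in this notebook.

```python
import math

def split_sizes(n_samples, test_size=None, train_size=None):
    """Return (n_train, n_test) for a fractional split.

    Assumed rounding convention: a fractional test_size is rounded up,
    a fractional train_size is rounded down, and the other side
    receives whatever rows remain.
    """
    if test_size is not None:
        n_test = math.ceil(test_size * n_samples)
        return n_samples - n_test, n_test
    n_train = math.floor(train_size * n_samples)
    return n_train, n_samples - n_train

N_ROWS = 178  # assumption: row count of the UCI Wine dataset

# 70/30 split of the full dataset
n_train, n_test = split_sizes(N_ROWS, test_size=0.3)
# 80:20 split of the training portion, used only for tuning
n_fit, n_validate = split_sizes(n_train, train_size=0.8)

print(n_train, n_test)    # 124 54
print(n_fit, n_validate)  # 99 25
```

With splits this small, a single misclassified validation row moves the accuracy by 0.04, so tuning scores should be read with that granularity in mind.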

For tuning, use hyperopt to distribute your training work. Split your original training dataset into 80:20 training and validation sets for the purposes of tuning. An exhaustive tuning method would also work, but hyperopt can make the search more efficient. If you use hyperopt, make sure that values drawn from the search space are converted to integers before they are applied to your model.
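For comparison, an exhaustive search simply scores every combination of the candidate values. The sketch below is a minimal, library-free illustration over the ranges the tuning cell specifies (max_depth 1 to 10, max_features 1 to 13). The names `grid_search` and `toy_score` are hypothetical, and `toy_score` is an arbitrary stand-in for "fit on the training split and return validation accuracy", not a real model score.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Score every combination in `grid`; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:  # keep the first combination with the top score
            best_params, best_score = params, score
    return best_params, best_score

grid = {
    "max_depth": range(1, 11),     # 1..10 inclusive
    "max_features": range(1, 14),  # 1..13 inclusive
}

def toy_score(params):
    # hypothetical scoring function with an arbitrary peak at (4, 5)
    return -abs(params["max_depth"] - 4) - abs(params["max_features"] - 5)

n_candidates = len(grid["max_depth"]) * len(grid["max_features"])
print(n_candidates)                  # 130
print(grid_search(toy_score, grid))  # ({'max_depth': 4, 'max_features': 5}, 0)
```

The grid holds 130 candidates, whereas the hyperopt call in this notebook evaluates only 20, which is the efficiency trade-off the instructions refer to.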

Once the model is tuned, train your final model using the optimized hyperparameter values. Use the training and testing sets from your first split to train and then evaluate this model.
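Because hyperopt's quniform returns floats (for example `{'max_depth': 3.0, 'max_features': 6.0}`), the tuned values need an integer cast before they reach the classifier: scikit-learn expects an integer max_depth and treats a float max_features as a fraction of the features rather than a count. A minimal sketch of that cast follows; the helper name `cast_int_params` is hypothetical.

```python
def cast_int_params(params, int_keys=("max_depth", "max_features")):
    """Return a copy of `params` with the named values converted to int."""
    cast = dict(params)  # copy so the caller's dict is left unchanged
    for key in int_keys:
        if key in cast:
            cast[key] = int(cast[key])
    return cast

raw = {"max_depth": 3.0, "max_features": 6.0}  # floats, as quniform yields them
print(cast_int_params(raw))  # {'max_depth': 3, 'max_features': 6}
```

The returned dictionary can then be unpacked into the model constructor, as in `DecisionTreeClassifier(**params)`.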

No data transformations should be performed for this exercise. There are no missing values in the dataset and the dataset is well stratified across the three cultivars. Be sure to provide accuracy scores where indicated below.
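Accuracy here means the fraction of predictions that match the true cultivar. Since hyperopt's fmin minimizes its objective, the tuning step reports the negated accuracy as the loss. The sketch below shows both ideas in plain Python; the function name `accuracy` and the label lists are made up for illustration and are not taken from the dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [1, 1, 2, 2, 3, 3, 3, 2]  # made-up cultivar labels
y_pred = [1, 1, 2, 3, 3, 3, 3, 2]  # made-up predictions with one mistake

acc = accuracy(y_true, y_pred)
loss = -acc  # a minimizer needs higher accuracy to mean lower loss
print(acc, loss)  # 0.875 -0.875
```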

In [0]:
# notebook config
USER_NAME = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')
FILE_STORE_ROOT = '/FileStore/shared_uploads/'+USER_NAME

In [0]:
# examine the file
file_name = FILE_STORE_ROOT+'/wine/wine.csv'
dbutils.fs.head(file_name)

Out[2]: 'cultivar,alchohol,malicacid,ash,alcalinity,magnesium,phenols,flavanoids,nonflavanoids,proanthocyanins,colorintensity,hue,od280,proline\n1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065\n1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050\n1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185\n1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480\n1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735\n1,14.2,1.76,2.45,15.2,112,3.27,3.39,.34,1.97,6.75,1.05,2.85,1450\n1,14.39,1.87,2.45,14.6,96,2.5,2.52,.3,1.98,5.25,1.02,3.58,1290\n1,14.06,2.15,2.61,17.6,121,2.6,2.51,.31,1.25,5.05,1.06,3.58,1295\n1,14.83,1.64,2.17,14,97,2.8,2.98,.29,1.98,5.2,1.08,2.85,1045\n1,13.86,1.35,2.27,16,98,2.98,3.15,.22,1.85,7.22,1.01,3.55,1045\n1,14.1,2.16,2.3,18,105,2.95,3.32,.22,2.38,5.75,1.25,3.17,1510\n1,14.12,1.48,2.32,16.8,95,2.2,2.43,.26,1.57,5,1.17,2.82,1280\n1,13.75,1.73,2.41,16,89,2.6,2.76,.29,1.81,5.6,1.15,2.9,1320\n1,14.75,1.73,2.39,11.4,9

In [0]:
# read the data into a pandas DataFrame using an explicit schema
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType

schema = StructType([
  StructField('cultivar', IntegerType()),
  StructField('alcohol', FloatType()),
  StructField('malicacid', FloatType()),
  StructField('ash', FloatType()),
  StructField('alcalinity', FloatType()),
  StructField('magnesium', FloatType()),
  StructField('phenols', FloatType()),
  StructField('flavanoids', FloatType()),
  StructField('nonflavanoids', FloatType()),
  StructField('proanthocyanins', FloatType()),
  StructField('colorintensity', FloatType()),
  StructField('hue', FloatType()),
  StructField('od280', FloatType()),
  StructField('proline', FloatType())
  ])
 
wine = (
  (
   spark
    .read
    .csv(file_name, sep=',', header=True, schema=schema)
  ).toPandas()
  )
 
wine.head()

,cultivar,alcohol,malicacid,ash,alcalinity,magnesium,phenols,flavanoids,nonflavanoids,proanthocyanins,colorintensity,hue,od280,proline
0,1,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0
2,1,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
3,1,14.37,1.95,2.5,16.799999,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0
4,1,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0


In [0]:
# split the data into training and test data sets
from sklearn.model_selection import train_test_split
 
# extract data for fitting
X = wine.drop('cultivar', axis=1) # features
y = wine['cultivar'] # labels
 
X_train, X_test, y_train, y_test = train_test_split(
  X, y, 
  test_size = 0.3, 
  random_state=42, 
  stratify=y
  )

In [0]:
# split the training data into training and validation datasets
X_train_train, X_train_validate, y_train_train, y_train_validate = train_test_split(
  X_train, y_train, 
  stratify=y_train, train_size=0.8
  ) 

In [0]:
# tune your model on the training/validation sets, leveraging hyperopt
# determine an optimal value for:
# max_depth between 1 and 10
# max_features between 1 and 13
# all other hyperparameters are allowed to remain at their defaults
from hyperopt import hp, fmin, tpe, SparkTrials, STATUS_OK, space_eval
 
# define hyperopt search space
search_space = {
  'max_depth' : hp.quniform('max_depth', 1, 10, 1),
  'max_features' : hp.quniform('max_features', 1, 13, 1)
}
 
#send copies of training and validation sets to workers in cluster 
X_train_train_broadcast = sc.broadcast(X_train_train)
y_train_train_broadcast = sc.broadcast(y_train_train)
X_train_validate_broadcast = sc.broadcast(X_train_validate)
y_train_validate_broadcast = sc.broadcast(y_train_validate)

In [0]:
# define the objective function and run the hyperopt search
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score

def evaluate_model(hyperopt_params):
  
  # access replicated input data
  X_train_input = X_train_train_broadcast.value
  y_train_input = y_train_train_broadcast.value
  X_validate_input = X_train_validate_broadcast.value
  y_validate_input = y_train_validate_broadcast.value  
  
  # configure model parameters
  params = hyperopt_params
 
  # adjust hyperopt-supplied params
  if 'max_depth' in params: params['max_depth']=int(params['max_depth'])   # hyperopt supplies values as float but must be int
  if 'max_features' in params: params['max_features']=int(params['max_features']) # hyperopt supplies values as float but must be int
  
  # instantiate model with parameters 
  dt = DecisionTreeClassifier(**params)
  
  # train and predict X validate input
  dt.fit(X_train_input, y_train_input)
  pred = dt.predict(X_validate_input)
  
  # get accuracy score
  acc = accuracy_score(y_validate_input, pred)
  
  # invert metric for hyperopt
  loss = -1 * acc
  
  # return results 
  return {'loss': loss, 'status': STATUS_OK}
 
# utilize fmin 
argmin = fmin(
  fn=evaluate_model,
  space=search_space,
  algo=tpe.suggest,  # algorithm controlling how hyperopt navigates the search space
  max_evals=20,
  trials=SparkTrials(parallelism=4),
  verbose=True
  )
 
print(argmin)

Hyperopt with SparkTrials will automatically track trials in MLflow. To view the MLflow experiment associated with the notebook, click the 'Runs' icon in the notebook context bar on the upper right. There, you can view all runs.
To view logs from trials, please check the Spark executor logs. To view executor logs, expand 'Spark Jobs' above until you see the (i) icon next to the stage from the trial job. Click it and find the list of tasks. Click the 'stderr' link for a task to view trial logs.


  0%|          | 0/20 [00:00<?, ?trial/s, best loss=?]  5%|▌         | 1/20 [00:06<01:56,  6.13s/trial, best loss: -0.88] 10%|█         | 2/20 [00:09<01:18,  4.34s/trial, best loss: -0.92] 15%|█▌        | 3/20 [00:10<00:48,  2.85s/trial, best loss: -0.96] 20%|██        | 4/20 [00:12<00:40,  2.52s/trial, best loss: -0.96] 25%|██▌       | 5/20 [00:13<00:29,  1.97s/trial, best loss: -0.96] 30%|███       | 6/20 [00:16<00:32,  2.33s/trial, best loss: -0.96] 35%|███▌      | 7/20 [00:18<00:28,  2.22s/trial, best loss: -0.96] 45%|████▌     | 9/20 [00:20<00:17,  1.63s/trial, best loss: -0.96] 50%|█████     | 10/20 [00:23<00:19,  1.99s/trial, best loss: -0.96] 55%|█████▌    | 11/20 [00:25<00:17,  1.99s/trial, best loss: -0.96] 60%|██████    | 12/20 [00:27<00:15,  2.00s/trial, best loss: -0.96] 65%|██████▌   | 13/20 [00:28<00:12,  1.72s/trial, best loss: -0.96] 70%|███████   | 14/20 [00:30<00:10,  1.80s/trial, best loss: -0.96] 75%|███████▌  | 15/20 [00:32<00:09,  1.86s/trial, best

Total Trials: 20: 20 succeeded, 0 failed, 0 cancelled.


{'max_depth': 3.0, 'max_features': 6.0}


In [0]:
# train the final model on the full training set and score it on the test data
# map hyperopt's result back to actual search-space values
params = space_eval(search_space, argmin)

# adjust hyperopt-supplied params 
if 'max_depth' in params: params['max_depth']=int(params['max_depth'])   # hyperopt supplies values as float but must be int
if 'max_features' in params: params['max_features']=int(params['max_features']) # hyperopt supplies values as float but must be int
dt = DecisionTreeClassifier(**params)
dt.fit(X_train, y_train)
 
# score model 
pred = dt.predict(X_test)
acc = accuracy_score(y_test, pred)
 
print(acc)
 
# release broadcast from memory 
X_train_train_broadcast.destroy()
y_train_train_broadcast.destroy()
X_train_validate_broadcast.destroy()
y_train_validate_broadcast.destroy()

0.9259259259259259
