### General Instructions
The UCI Machine Learning Repository makes available a popular dataset identifying various properties of three cultivars of Italian wine grapes: https://archive.ics.uci.edu/ml/datasets/Wine. These can be used to build a multi-class identifier with which measurements of these properties can be used to predict which cultivar is being observed.

We have not explicitly addressed multi-class classification in our labs, but scikit-learn provides multiple algorithms for building multi-class predictors. For this exercise, use sklearn.tree.DecisionTreeClassifier to build a multi-class predictor that identifies the grape cultivar from the provided attributes. To check for overfitting, train on 70% of the provided data and test on the remaining 30%. No data transformations should be performed for this exercise. There are no missing values in the dataset and the dataset is well stratified across the three cultivars. Be sure to provide accuracy scores where indicated below.

In [0]:
# notebook config
USER_NAME = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')
FILE_STORE_ROOT = '/FileStore/shared_uploads/'+USER_NAME

In [0]:
df = spark.read.csv(FILE_STORE_ROOT + '/wine/wine.csv', sep=',', header=True, inferSchema=True)
display(df)

cultivar,alchohol,malicacid,ash,alcalinity,magnesium,phenols,flavanoids,nonflavanoids,proanthocyanins,colorintensity,hue,od280,proline
1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735
1,14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450
1,14.39,1.87,2.45,14.6,96,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290
1,14.06,2.15,2.61,17.6,121,2.6,2.51,0.31,1.25,5.05,1.06,3.58,1295
1,14.83,1.64,2.17,14.0,97,2.8,2.98,0.29,1.98,5.2,1.08,2.85,1045
1,13.86,1.35,2.27,16.0,98,2.98,3.15,0.22,1.85,7.22,1.01,3.55,1045


In [0]:
# read the data to a pandas DataFrame and assemble feature and label arrays
import pandas as pd

# Create pandas DataFrame
df = df.toPandas()
display(df)

# Features
X = df.drop('cultivar', axis=1).values

# Labels
y = df['cultivar'].values



cultivar,alchohol,malicacid,ash,alcalinity,magnesium,phenols,flavanoids,nonflavanoids,proanthocyanins,colorintensity,hue,od280,proline
1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735
1,14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450
1,14.39,1.87,2.45,14.6,96,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290
1,14.06,2.15,2.61,17.6,121,2.6,2.51,0.31,1.25,5.05,1.06,3.58,1295
1,14.83,1.64,2.17,14.0,97,2.8,2.98,0.29,1.98,5.2,1.08,2.85,1045
1,13.86,1.35,2.27,16.0,98,2.98,3.15,0.22,1.85,7.22,1.01,3.55,1045
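
The instructions state that the dataset has no missing values and is well stratified across the three cultivars. A minimal sketch of how both claims could be checked is shown here. It assumes scikit-learn's bundled copy of the UCI Wine data (`load_wine`) as a stand-in for `wine.csv`; the bundled copy names its label column `target` and codes the cultivars as 0, 1, 2 rather than 1, 2, 3.

```python
from sklearn.datasets import load_wine

# Assumption: the bundled UCI Wine copy stands in for the wine.csv file used in this notebook.
# Its label column is 'target' (values 0, 1, 2) instead of 'cultivar' (values 1, 2, 3).
wine_demo = load_wine(as_frame=True).frame

# Total number of missing values across every column
n_missing = int(wine_demo.isna().sum().sum())

# Number of samples per cultivar, ordered by class label
class_counts = wine_demo['target'].value_counts().sort_index()

print('Missing values:', n_missing)
print(class_counts)
```

With the notebook's own DataFrame, the same two checks would be `df.isna().sum().sum()` and `df['cultivar'].value_counts()`.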


In [0]:
# split the data into training and test data sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
  X, y,
  test_size=0.3,
  random_state=42,
  stratify=y
)
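
The `stratify=y` argument asks `train_test_split` to keep the class proportions roughly equal in the training and test sets. The following sketch compares those proportions after a 70/30 split. It assumes scikit-learn's bundled `load_wine` data in place of `wine.csv`, and the `_demo` names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Assumption: bundled UCI Wine copy used as a stand-in for wine.csv (labels are 0, 1, 2)
X_demo, y_demo = load_wine(return_X_y=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,
    random_state=42,
    stratify=y_demo,
)

# Share of each cultivar in the full data, the training set, and the test set
full_share = np.bincount(y_demo) / len(y_demo)
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)

print('train/test sizes:', len(y_tr), len(y_te))
print('full :', full_share.round(3))
print('train:', train_share.round(3))
print('test :', test_share.round(3))
```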

In [0]:
# train your model using the training data
from sklearn.tree import DecisionTreeClassifier

# Create the decision tree
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

Out[5]: DecisionTreeClassifier()
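
`DecisionTreeClassifier` randomly permutes the features at each split, so ties between equally good splits can be broken differently from run to run and the test score may vary slightly. The sketch below shows one way to make the fitted tree repeatable by fixing `random_state`. The value `0` is an arbitrary choice, and the bundled `load_wine` data is assumed as a stand-in for `wine.csv`.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled UCI Wine copy used as a stand-in for wine.csv
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Two trees fitted with the same seed (0 is an arbitrary, illustrative value)
tree_a = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
tree_b = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# With a fixed seed, both trees make identical predictions on the test set
same_predictions = bool((tree_a.predict(X_te) == tree_b.predict(X_te)).all())

print('Identical predictions:', same_predictions)
print('Tree depth:', tree_a.get_depth(), '| leaves:', tree_a.get_n_leaves())
```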

In [0]:
# score your model using the test data (score() returns mean accuracy for classifiers)
score = dtc.score(X_test, y_test)
print('Decision Tree Classifier Accuracy: %.2f%%' % (score*100))

Decision Tree Classifier Accuracy: 94.44%
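
A single accuracy figure does not show which cultivars are confused with one another. As a hedged illustration, the sketch below breaks the test accuracy down with a confusion matrix and a per-class report. It assumes the bundled `load_wine` data in place of `wine.csv` and an arbitrary `random_state=0` for the tree, so its numbers will not necessarily match the score reported in this notebook.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled UCI Wine copy used as a stand-in for wine.csv (labels are 0, 1, 2)
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# random_state=0 is an illustrative choice to keep the sketch repeatable
clf_demo = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = clf_demo.predict(X_te)

# Rows are true cultivars, columns are predicted cultivars
cm = confusion_matrix(y_te, y_pred)
acc = accuracy_score(y_te, y_pred)

print(cm)
print('Accuracy: %.2f%%' % (acc * 100))
# Precision, recall, and F1 for each cultivar
print(classification_report(y_te, y_pred))
```

The diagonal of the confusion matrix counts correct predictions, so accuracy equals the diagonal sum divided by the number of test samples.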
