# Demo 2: Naive Bayes and DataStax Analytics
------
<img src="images/drinkWine.jpeg" width="300" height="500">


#### Dataset: https://archive.ics.uci.edu/ml/datasets/Wine+Quality

## What are we trying to learn from this dataset? 

# QUESTION: Can Naive Bayes classify whether a wine is good (score 9+) from its attributes?

In [101]:
%matplotlib inline
import matplotlib.pyplot as plt

In [102]:
import pandas
import cassandra
import pyspark
import re
import os
import random
from random import randint, randrange
import matplotlib.pyplot as plt
from IPython.display import display, Markdown
from pyspark.sql import SparkSession
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer

#### Helper function for nicer formatting of Spark DataFrames

In [103]:
# Helper for pretty-printing Spark DataFrames via pandas
def showDF(df, limitRows=5, truncate=True):
    if truncate:
        pandas.set_option('display.max_colwidth', 50)
    else:
        # None disables truncation (the -1 sentinel is deprecated since pandas 1.0)
        pandas.set_option('display.max_colwidth', None)
    pandas.set_option('display.max_rows', limitRows)
    display(df.limit(limitRows).toPandas())
    pandas.reset_option('display.max_rows')

# DataStax Enterprise Analytics
<img src="images/dselogo.png" width="400" height="200">

## Creating Tables and Loading Tables

### Connect to DSE Analytics Cluster

In [123]:
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

### Create Demo Keyspace 

In [124]:
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS accelerate 
    WITH REPLICATION = 
    { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }"""
)

<cassandra.cluster.ResultSet at 0x107effb38>

### Set keyspace 

In [125]:
session.set_keyspace('accelerate')

### Create a table called `wines`. Our PRIMARY KEY will be a unique key (wineid) that we generate for each row. This table holds both datasets, "white" and "red".

In [126]:
query = "CREATE TABLE IF NOT EXISTS wines \
                                   (wineid int, fixedAcidity float, volatileAcidity float, citricAcid float, sugar float, \
                                   chlorides float, freeSulfur float, totalSulfur float, density float, ph float, \
                                   sulphates float, alcohol float, quality float, \
                                   PRIMARY KEY (wineid))"
session.execute(query)

<cassandra.cluster.ResultSet at 0x107f08080>

### Create a table called `wineWhite`. Our PRIMARY KEY will be a unique key (wineid) that we generate for each row. This table is only for the white wines.

In [131]:
query = "CREATE TABLE IF NOT EXISTS wineWhite \
                                   (wineid int, fixedAcidity float, volatileAcidity float, citricAcid float, sugar float, \
                                   chlorides float, freeSulfur float, totalSulfur float, density float, ph float, \
                                   sulphates float, alcohol float, quality float, \
                                   PRIMARY KEY (wineid))"
session.execute(query)

<cassandra.cluster.ResultSet at 0x107f26198>

### Create a table called `wineRed`. Our PRIMARY KEY will be a unique key (wineid) that we generate for each row. This table is only for the red wines.

In [127]:
query = "CREATE TABLE IF NOT EXISTS wineRed \
                                   (wineid int, fixedAcidity float, volatileAcidity float, citricAcid float, sugar float, \
                                   chlorides float, freeSulfur float, totalSulfur float, density float, ph float, \
                                   sulphates float, alcohol float, quality float, \
                                   PRIMARY KEY (wineid))"
session.execute(query)

<cassandra.cluster.ResultSet at 0x107f05470>
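
The three `CREATE TABLE` statements above share one column list and differ only in the table name. As a sketch, they could be generated from a single schema definition. The names `WINE_COLUMNS` and `create_table_cql` are hypothetical helpers introduced here for illustration; they are not part of the original notebook or of the DataStax driver.

```python
# Hypothetical schema definition: (column name, CQL type), in the order used above.
WINE_COLUMNS = [
    ("wineid", "int"), ("fixedAcidity", "float"), ("volatileAcidity", "float"),
    ("citricAcid", "float"), ("sugar", "float"), ("chlorides", "float"),
    ("freeSulfur", "float"), ("totalSulfur", "float"), ("density", "float"),
    ("ph", "float"), ("sulphates", "float"), ("alcohol", "float"),
    ("quality", "float"),
]


def create_table_cql(table):
    """Build a CREATE TABLE statement with wineid as the primary key."""
    columns = ", ".join(name + " " + cql_type for name, cql_type in WINE_COLUMNS)
    return ("CREATE TABLE IF NOT EXISTS " + table +
            " (" + columns + ", PRIMARY KEY (wineid))")


# One statement per table used in this demo.
statements = [create_table_cql(t) for t in ("wines", "wineWhite", "wineRed")]
for statement in statements:
    print(statement)
```

Each string could then be passed to `session.execute(...)` exactly as in the cells above.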

### What do these 12 columns represent?

* **Fixed acidity**
* **Volatile acidity**
* **Citric Acid**
* **Residual Sugar** 
* **Chlorides**
* **Free sulfur dioxide**     
* **Total sulfur dioxide**
* **Density** 
* **pH**
* **Sulphates**
* **Alcohol**
* **Quality**

### Load Two Wine Datasets: White and Red
<img src="images/whiteAndRed.jpeg" width="300" height="300">

### Load the wine datasets from the CSV files (winequality-red.csv, winequality-white.csv)
* No cleanup was required! How nice :)

#### Insert all the Wine Data into the DSE table `wines`

In [128]:
fileName = 'data/winequality-red.csv'
input_file = open(fileName, 'r')
i = 1
for line in input_file:
    wineid = i
    row = line.split(';')
        
    query = "INSERT INTO wines (wineid, fixedAcidity, volatileAcidity, citricAcid, sugar, \
                               chlorides, freeSulfur, totalSulfur, density, ph, \
                               sulphates, alcohol, quality)"
    query = query + " VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
    session.execute(query, (wineid, float(row[0]), float(row[1]), float(row[2]), float(row[3]), float(row[4]), float(row[5]), float(row[6]), float(row[7]), float(row[8]), float(row[9]), float(row[10]), float(row[11])))
    i = i + 1

fileName = 'data/winequality-white.csv'
input_file = open(fileName, 'r')

for line in input_file:
    wineid = i
    row = line.split(';')
        
    query = "INSERT INTO wines (wineid, fixedAcidity, volatileAcidity, citricAcid, sugar, \
                               chlorides, freeSulfur, totalSulfur, density, ph, \
                               sulphates, alcohol, quality)"
    query = query + " VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
    session.execute(query, (wineid, float(row[0]), float(row[1]), float(row[2]), float(row[3]), float(row[4]), float(row[5]), float(row[6]), float(row[7]), float(row[8]), float(row[9]), float(row[10]), float(row[11])))
    i = i + 1
    

### Load the red wine dataset from its CSV file (winequality-red.csv)
* No cleanup was required! How nice :)

#### Insert all the Wine Data into the DSE table `wineRed`

In [129]:
fileName = 'data/winequality-red.csv'
input_file = open(fileName, 'r')
i = 1
for line in input_file:
    wineid = i
    row = line.split(';')
        
    query = "INSERT INTO wineRed (wineid, fixedAcidity, volatileAcidity, citricAcid, sugar, \
                               chlorides, freeSulfur, totalSulfur, density, ph, \
                               sulphates, alcohol, quality)"
    query = query + " VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
    session.execute(query, (wineid, float(row[0]), float(row[1]), float(row[2]), float(row[3]), float(row[4]), float(row[5]), float(row[6]), float(row[7]), float(row[8]), float(row[9]), float(row[10]), float(row[11])))
    i = i + 1

### Load the white wine dataset from its CSV file (winequality-white.csv)
* No cleanup was required! How nice :)

#### Insert all the Wine Data into the DSE table `wineWhite`

In [132]:
fileName = 'data/winequality-white.csv'
input_file = open(fileName, 'r')

# `i` is not reset here: it continues from the red-wine cell above, so white wineids start at 1600
for line in input_file:
    wineid = i
    row = line.split(';')
        
    query = "INSERT INTO wineWhite (wineid, fixedAcidity, volatileAcidity, citricAcid, sugar, \
                               chlorides, freeSulfur, totalSulfur, density, ph, \
                               sulphates, alcohol, quality)"
    query = query + " VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
    session.execute(query, (wineid, float(row[0]), float(row[1]), float(row[2]), float(row[3]), float(row[4]), float(row[5]), float(row[6]), float(row[7]), float(row[8]), float(row[9]), float(row[10]), float(row[11])))
    i = i + 1
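
The four loading loops above repeat the same steps: split a semicolon-separated line, convert the 12 fields to floats, and prepend the generated `wineid`. A minimal sketch of that logic as reusable helpers follows. `COLUMNS`, `insert_cql`, and `parse_wine_line` are hypothetical names, and the sample line is only illustrative of the file format (the files are assumed to have no header row, as the loops above imply).

```python
# Hypothetical helpers; the column order follows the INSERT statements above.
COLUMNS = ["wineid", "fixedAcidity", "volatileAcidity", "citricAcid", "sugar",
           "chlorides", "freeSulfur", "totalSulfur", "density", "ph",
           "sulphates", "alcohol", "quality"]


def insert_cql(table):
    """Build a parameterized INSERT statement with one %s per column."""
    column_list = ", ".join(COLUMNS)
    placeholders = ", ".join(["%s"] * len(COLUMNS))
    return "INSERT INTO " + table + " (" + column_list + ") VALUES (" + placeholders + ")"


def parse_wine_line(line, wineid):
    """Turn one semicolon-separated line into the parameter tuple for the INSERT."""
    fields = line.strip().split(";")
    if len(fields) != len(COLUMNS) - 1:
        raise ValueError("expected 12 fields, got " + str(len(fields)))
    return (wineid, *(float(field) for field in fields))


# Illustrative line in the dataset's format (11 attributes followed by quality).
sample = "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5"
params = parse_wine_line(sample, 1)
print(insert_cql("wineRed"))
print(params)
```

With these in place, each loop body would reduce to something like `session.execute(insert_cql(table), parse_wine_line(line, i))`.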

## Machine Learning with DSE Analytics and Apache Spark
<img src="images/sparklogo.png" width="150" height="200">

#### Create a Spark session that is connected to DSE. From there, load each table into a Spark DataFrame and count the rows in each.

In [133]:
spark = SparkSession.builder.appName('demo').master("local").getOrCreate()


wineDF = spark.read.format("org.apache.spark.sql.cassandra").options(table="wines", keyspace="accelerate").load()

print ("Table Row Count: ")
print (wineDF.count())

Table Row Count: 
6497


In [134]:
wineWhiteDF = spark.read.format("org.apache.spark.sql.cassandra").options(table="winewhite", keyspace="accelerate").load()

print ("Table Row Count: ")
print (wineWhiteDF.count())

wineRedDF = spark.read.format("org.apache.spark.sql.cassandra").options(table="winered", keyspace="accelerate").load()

print ("Table Row Count: ")
print (wineRedDF.count())

Table Row Count: 
4898
Table Row Count: 
1599


In [135]:
showDF(wineDF)

Unnamed: 0,wineid,alcohol,chlorides,citricacid,density,fixedacidity,freesulfur,ph,quality,sugar,sulphates,totalsulfur,volatileacidity
0,3088,9.5,0.044,0.74,0.9972,6.5,68.0,3.18,6.0,13.3,0.54,224.0,0.26
1,6395,10.7,0.035,0.29,0.99142,6.4,44.0,3.17,7.0,1.1,0.55,140.0,0.105
2,381,9.4,0.08,0.42,0.9974,8.3,11.0,3.21,6.0,2.0,0.8,27.0,0.26
3,3638,9.0,0.029,0.27,0.9949,5.5,22.0,3.34,5.0,4.6,0.44,104.0,0.14
4,4845,9.8,0.059,0.29,0.99328,7.6,37.0,3.09,5.0,2.5,0.37,115.0,0.27


In [147]:
wineDF.select('quality').distinct().show()

+-------+
|quality|
+-------+
|    9.0|
|    5.0|
|    7.0|
|    3.0|
|    6.0|
|    8.0|
|    4.0|
+-------+
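
The question at the top of this notebook defines a good wine as one scoring 9 or higher, while the cells that follow use the multi-valued `quality` column as the label. As a sketch of the binary alternative, the scores listed above could be mapped to a good/not-good label. The function `to_binary_label` and its `threshold` parameter are assumptions made for illustration and do not appear in the original notebook.

```python
def to_binary_label(quality, threshold=9.0):
    """Return 1.0 for a 'good' wine (quality >= threshold), otherwise 0.0."""
    return 1.0 if quality >= threshold else 0.0


# The distinct quality values shown in the output above.
distinct_qualities = [9.0, 5.0, 7.0, 3.0, 6.0, 8.0, 4.0]

labels = {q: to_binary_label(q) for q in sorted(distinct_qualities)}
print(labels)
```

With a threshold of 9, only one of the seven observed scores counts as good, so a binary model trained this way would likely face a heavily imbalanced label.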



In [186]:
wine6DF = wineDF.filter("quality > 5")
showDF(wine6DF)

Unnamed: 0,wineid,alcohol,chlorides,citricacid,density,fixedacidity,freesulfur,ph,quality,sugar,sulphates,totalsulfur,volatileacidity
0,5691,10.0,0.057,0.28,0.99425,6.4,21.0,3.26,6.0,7.9,0.36,82.0,0.14
1,6490,11.8,0.036,0.29,0.98938,6.1,25.0,3.06,6.0,2.2,0.44,100.0,0.34
2,1939,10.2,0.049,0.35,0.9934,6.6,49.0,3.43,7.0,1.5,0.85,141.0,0.18
3,1958,10.4,0.05,0.39,0.994,10.0,19.0,3.0,6.0,1.4,0.42,152.0,0.2
4,4641,9.4,0.037,0.41,0.99882,6.2,58.0,3.25,6.0,16.799999,0.57,173.0,0.33


In [187]:
assembler = VectorAssembler(
    inputCols=['alcohol', 'chlorides', 'citricacid', 'density', 'fixedacidity', 'ph', 'freesulfur', 'sugar', 'sulphates', 'totalsulfur', 'volatileacidity'],
    outputCol='features')

trainingData = assembler.transform(wine6DF)

labelIndexer = StringIndexer(inputCol="quality", outputCol="label", handleInvalid='keep')
trainingData1 = labelIndexer.fit(trainingData).transform(trainingData)

showDF(trainingData1)
print(trainingData1.count())

Unnamed: 0,wineid,alcohol,chlorides,citricacid,density,fixedacidity,freesulfur,ph,quality,sugar,sulphates,totalsulfur,volatileacidity,features,label
0,4317,10.8,0.046,0.29,0.99518,6.8,59.0,3.2,6.0,10.4,0.4,143.0,0.16,"[10.800000190734863, 0.04600000008940697, 0.28...",0.0
1,3372,10.9,0.059,0.26,0.9955,7.8,32.0,3.04,6.0,9.5,0.43,178.0,0.4,"[10.899999618530273, 0.05900000035762787, 0.25...",0.0
2,4830,9.4,0.056,0.57,0.99548,6.7,60.0,2.96,6.0,6.6,0.43,150.0,0.13,"[9.399999618530273, 0.0560000017285347, 0.5699...",0.0
3,2731,9.7,0.047,0.34,0.9944,6.9,24.0,3.2,6.0,4.0,0.52,128.0,0.23,"[9.699999809265137, 0.04699999839067459, 0.340...",0.0
4,769,9.7,0.082,0.02,0.99744,7.1,24.0,3.55,6.0,2.3,0.53,94.0,0.59,"[9.699999809265137, 0.0820000022649765, 0.0199...",0.0


4113
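
The `features` vectors above show values such as 10.800000190734863 for an `alcohol` of 10.8. This is consistent with the columns being declared as CQL `float`, a 32-bit type, whose values are then widened to 64-bit doubles inside the vector. A small standard-library sketch reproduces the effect; `to_float32` is a hypothetical helper.

```python
import struct


def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float, then widen it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Values taken from the first row of the table above.
print(to_float32(10.8))   # alcohol
print(to_float32(0.046))  # chlorides
print(to_float32(9.5))    # exactly representable, so it is unchanged
```

If exact decimal display matters, declaring the columns as CQL `double` is one option, at the cost of more storage per value.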


In [188]:
# Split the data into train and test
splits = trainingData1.randomSplit([0.8, 0.2], 1234)
train = splits[0]
test = splits[1]

print ("Train Dataframe Row Count: ")
print (train.count())
print ("Test Dataframe Row Count: ")
print (test.count())

Train Dataframe Row Count: 
3361
Test Dataframe Row Count: 
747
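
`randomSplit` does not cut the DataFrame at an exact 80/20 boundary: rows are assigned to a split by random sampling against the normalized weights, so the split sizes are only approximately proportional. The sketch below imitates that idea in plain Python. `random_split` is a hypothetical helper and does not reproduce Spark's actual sampler or its per-partition seeding.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split on the [0, 1) interval.
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < bound or index == len(bounds) - 1:
                splits[index].append(row)
                break
    return splits


train_rows, test_rows = random_split(range(1000), [0.8, 0.2], seed=1234)
print(len(train_rows), len(test_rows))
```

One possible explanation for the counts above not adding up to 4113 is that the DataFrame is not cached, so each action may re-read the table and re-run the split; calling `.cache()` before splitting is a common way to keep the counts stable.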


In [189]:
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
model = nb.fit(train)

predictions = model.transform(test)
#predictions.show()
print (predictions.count())
showDF(predictions)

754


Unnamed: 0,wineid,alcohol,chlorides,citricacid,density,fixedacidity,freesulfur,ph,quality,sugar,sulphates,totalsulfur,volatileacidity,features,label,rawPrediction,probability,prediction
0,4,9.8,0.075,0.56,0.998,11.2,17.0,3.16,6.0,1.9,0.58,60.0,0.28,"[9.800000190734863, 0.07500000298023224, 0.560...",0.0,"[-151.74201897512327, -151.70634753693966, -15...","[0.47485839019681814, 0.49210301362791764, 0.0...",1.0
1,20,9.2,0.341,0.51,0.9969,7.9,17.0,3.04,6.0,1.8,1.08,56.0,0.32,"[9.199999809265137, 0.3409999907016754, 0.5099...",0.0,"[-141.99093131088983, -142.08960604375696, -14...","[0.5027512220566402, 0.4555113921185903, 0.037...",0.0
2,36,9.6,0.086,0.0,0.9986,7.8,5.0,3.4,6.0,5.5,0.55,18.0,0.645,"[9.600000381469727, 0.0860000029206276, 0.0, 0...",0.0,"[-114.19035526960248, -114.23438228169499, -11...","[0.49207924558317717, 0.47088446170473713, 0.0...",0.0
3,149,10.2,0.074,0.1,0.9959,6.9,12.0,3.42,6.0,2.3,0.58,30.0,0.49,"[10.199999809265137, 0.07400000095367432, 0.10...",0.0,"[-118.79690164273661, -118.53986826348948, -12...","[0.4157655516238488, 0.5376215655052325, 0.044...",1.0
4,236,9.0,0.097,0.0,0.99675,7.2,14.0,3.37,6.0,1.9,0.58,38.0,0.63,"[9.0, 0.09700000286102295, 0.0, 0.996749997138...",0.0,"[-122.04033351455976, -121.99281667417738, -12...","[0.46622145223103306, 0.4889095881184109, 0.04...",1.0
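
The `smoothing=1.0` argument corresponds to additive (Laplace) smoothing, and `modelType="multinomial"` treats each non-negative feature value as a count-like weight. The following is a minimal pure-Python sketch of that technique on made-up toy data. The function names are hypothetical, the class priors are left unsmoothed for simplicity, and the code is not intended to match Spark's implementation detail for detail.

```python
import math


def multinomial_nb_fit(X, y, smoothing=1.0):
    """Estimate log priors and smoothed per-class log feature probabilities."""
    n_features = len(X[0])
    log_prior, log_theta = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Sum each feature over the rows of class c, then apply additive smoothing.
        totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denominator = sum(totals) + smoothing * n_features
        log_theta[c] = [math.log((t + smoothing) / denominator) for t in totals]
    return log_prior, log_theta


def multinomial_nb_predict(x, log_prior, log_theta):
    """Return the class with the highest log posterior score."""
    scores = {c: log_prior[c] + sum(v * w for v, w in zip(x, log_theta[c]))
              for c in log_prior}
    return max(scores, key=scores.get)


# Toy data (made up): class 0 is heavy on feature 0, class 1 on feature 1.
X = [[3, 0], [4, 1], [0, 3], [1, 4]]
y = [0, 0, 1, 1]
log_prior, log_theta = multinomial_nb_fit(X, y, smoothing=1.0)
print(multinomial_nb_predict([5, 0], log_prior, log_theta))
print(multinomial_nb_predict([0, 5], log_prior, log_theta))
```

Because the multinomial model weights each feature by its raw magnitude, large-valued columns such as `totalsulfur` can dominate the scores, which may contribute to the modest accuracies reported in this notebook.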


In [190]:
showDF(predictions.select("quality", "label", "prediction", "probability"))

Unnamed: 0,quality,label,prediction,probability
0,6.0,0.0,1.0,"[0.47485839019681814, 0.49210301362791764, 0.0..."
1,6.0,0.0,0.0,"[0.5027512220566402, 0.4555113921185903, 0.037..."
2,6.0,0.0,0.0,"[0.49207924558317717, 0.47088446170473713, 0.0..."
3,6.0,0.0,1.0,"[0.4157655516238488, 0.5376215655052325, 0.044..."
4,6.0,0.0,1.0,"[0.46622145223103306, 0.4889095881184109, 0.04..."


### We can now use the `MulticlassClassificationEvaluator` to evaluate the accuracy of our predictions.

In [191]:
# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

Test set accuracy = 0.6184738955823293
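
The accuracy metric is the fraction of test rows whose `prediction` equals `label`. As a sketch, here is the same calculation in plain Python on the five (label, prediction) pairs displayed earlier. The helper name `accuracy_score` is hypothetical, and because only five displayed rows are used, the result differs from the full test-set figure.

```python
def accuracy_score(pairs):
    """Fraction of (label, prediction) pairs where the two values match."""
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)


# (label, prediction) pairs from the five rows shown in the table above.
sample_pairs = [(0.0, 1.0), (0.0, 0.0), (0.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
sample_accuracy = accuracy_score(sample_pairs)
print(sample_accuracy)  # 2 of 5 rows match -> 0.4
```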


In [144]:
assembler = VectorAssembler(
    inputCols=['alcohol', 'chlorides', 'citricacid', 'density', 'fixedacidity', 'ph', 'freesulfur', 'sugar', 'sulphates', 'totalsulfur', 'volatileacidity'],
    outputCol='features')

trainingData = assembler.transform(wineRedDF)

labelIndexer = StringIndexer(inputCol="quality", outputCol="label", handleInvalid='keep')
trainingData1 = labelIndexer.fit(trainingData).transform(trainingData)

showDF(trainingData1)
print(trainingData1.count())

Unnamed: 0,wineid,alcohol,chlorides,citricacid,density,fixedacidity,freesulfur,ph,quality,sugar,sulphates,totalsulfur,volatileacidity,features,label
0,728,9.5,0.067,0.02,0.997,6.4,4.0,3.46,5.0,1.8,0.68,11.0,0.57,"[9.5, 0.06700000166893005, 0.01999999955296516...",0.0
1,208,9.3,0.069,0.31,0.99625,7.8,26.0,3.29,5.0,1.8,0.53,120.0,0.57,"[9.300000190734863, 0.0689999982714653, 0.3100...",0.0
2,1501,9.6,0.076,0.04,0.99508,7.5,8.0,3.26,5.0,1.5,0.53,15.0,0.725,"[9.600000381469727, 0.07599999755620956, 0.039...",0.0
3,156,10.5,0.071,0.42,0.9973,7.1,28.0,3.42,5.0,5.5,0.71,128.0,0.43,"[10.5, 0.07100000232458115, 0.4199999868869781...",0.0
4,522,9.1,0.088,0.49,0.998,7.6,16.0,3.48,5.0,2.0,0.64,43.0,0.41,"[9.100000381469727, 0.08799999952316284, 0.490...",0.0


1599


In [68]:
# Split the data into train and test
splits = trainingData1.randomSplit([0.8, 0.2], 1234)
train = splits[0]
test = splits[1]

print ("Train Dataframe Row Count: ")
print (train.count())
print ("Test Dataframe Row Count: ")
print (test.count())

Train Dataframe Row Count: 
1338
Test Dataframe Row Count: 
267


In [69]:
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
model = nb.fit(train)

predictions = model.transform(test)
#predictions.show()
print (predictions.count())
showDF(predictions)

264


Unnamed: 0,wineid,alcohol,chlorides,citricacid,density,fixedacidity,freesulfur,ph,quality,sugar,sulphates,totalsulfur,volatileacidity,features,label,rawPrediction,probability,prediction
0,4,9.8,0.075,0.56,0.998,11.2,17.0,3.16,6.0,1.9,0.58,60.0,0.28,"[9.800000190734863, 0.07500000298023224, 0.560...",1.0,"[-146.51336539130497, -147.93001649210237, -15...","[0.7867106496176279, 0.19079688206878553, 0.01...",0.0
1,6,9.4,0.075,0.0,0.9978,7.4,13.0,3.51,5.0,1.8,0.56,40.0,0.66,"[9.399999618530273, 0.07500000298023224, 0.0, ...",0.0,"[-116.9175112611004, -116.66580062673108, -118...","[0.38220684615257977, 0.49160353993109357, 0.0...",1.0
2,7,9.4,0.069,0.06,0.9964,7.9,15.0,3.3,5.0,1.6,0.46,59.0,0.6,"[9.399999618530273, 0.0689999982714653, 0.0599...",0.0,"[-130.58018488336597, -132.96108995954322, -13...","[0.908754724481785, 0.08402968687088798, 0.004...",0.0
3,47,9.2,0.114,0.43,0.997,7.7,22.0,3.25,5.0,2.2,0.73,114.0,0.935,"[9.199999809265137, 0.11400000005960464, 0.430...",0.0,"[-181.01989601043678, -190.76675085111285, -19...","[0.9999411649397871, 5.847485779054613e-05, 7....",0.0
4,85,10.3,0.069,0.48,0.9959,6.3,18.0,3.44,6.0,1.8,0.78,61.0,0.3,"[10.300000190734863, 0.0689999982714653, 0.479...",1.0,"[-138.8991527417724, -141.13373873835337, -144...","[0.8975943117329699, 0.09607529442426721, 0.00...",0.0


In [70]:
# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

Test set accuracy = 0.43018867924528303


In [71]:
assembler = VectorAssembler(
    inputCols=['alcohol', 'chlorides', 'citricacid', 'density', 'fixedacidity', 'ph', 'freesulfur', 'sugar', 'sulphates', 'totalsulfur', 'volatileacidity'],
    outputCol='features')

trainingData = assembler.transform(wineWhiteDF)

labelIndexer = StringIndexer(inputCol="quality", outputCol="label", handleInvalid='keep')
trainingData1 = labelIndexer.fit(trainingData).transform(trainingData)

showDF(trainingData1)
print(trainingData1.count())

Unnamed: 0,wineid,alcohol,chlorides,citricacid,density,fixedacidity,freesulfur,ph,quality,sugar,sulphates,totalsulfur,volatileacidity,features,label
0,5691,10.0,0.057,0.28,0.99425,6.4,21.0,3.26,6.0,7.9,0.36,82.0,0.14,"[10.0, 0.05700000002980232, 0.2800000011920929...",0.0
1,6490,11.8,0.036,0.29,0.98938,6.1,25.0,3.06,6.0,2.2,0.44,100.0,0.34,"[11.800000190734863, 0.035999998450279236, 0.2...",0.0
2,1939,10.2,0.049,0.35,0.9934,6.6,49.0,3.43,7.0,1.5,0.85,141.0,0.18,"[10.199999809265137, 0.04899999871850014, 0.34...",2.0
3,1958,10.4,0.05,0.39,0.994,10.0,19.0,3.0,6.0,1.4,0.42,152.0,0.2,"[10.399999618530273, 0.05000000074505806, 0.38...",0.0
4,4641,9.4,0.037,0.41,0.99882,6.2,58.0,3.25,6.0,16.799999,0.57,173.0,0.33,"[9.399999618530273, 0.03700000047683716, 0.409...",0.0


4898
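
In the table above, quality 6.0 maps to label 0.0 and quality 7.0 to label 2.0. This matches how `StringIndexer` behaves by default: the column is treated as strings and indexed by descending frequency, so the most common value gets index 0. A minimal sketch of that ordering follows. `fit_string_indexer` is a hypothetical helper, the counts are made up, and the alphabetical tie-break is an assumption about the default behavior.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each distinct value (as a string) to an index, most frequent first."""
    counts = Counter(str(v) for v in values)
    # Sort by descending count; break ties alphabetically (assumed default).
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    return {s: float(index) for index, s in enumerate(ordered)}


# Made-up counts chosen so that 6.0 is most common, then 5.0, then 7.0.
qualities = [6.0] * 5 + [5.0] * 3 + [7.0] * 2
mapping = fit_string_indexer(qualities)
print(mapping)
```

One consequence is that the label indices carry no ordering of quality: index 1.0 is not "better" than index 0.0, it is only less frequent.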


In [72]:
# Split the data into train and test
splits = trainingData1.randomSplit([0.8, 0.2], 1234)
train = splits[0]
test = splits[1]

print ("Train Dataframe Row Count: ")
print (train.count())
print ("Test Dataframe Row Count: ")
print (test.count())

Train Dataframe Row Count: 
4009
Test Dataframe Row Count: 
895


In [73]:
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
model = nb.fit(train)

predictions = model.transform(test)
#predictions.show()
print (predictions.count())
showDF(predictions)

889


Unnamed: 0,wineid,alcohol,chlorides,citricacid,density,fixedacidity,freesulfur,ph,quality,sugar,sulphates,totalsulfur,volatileacidity,features,label,rawPrediction,probability,prediction
0,1600,8.8,0.045,0.36,1.001,7.0,45.0,3.0,6.0,20.700001,0.45,170.0,0.27,"[8.800000190734863, 0.04500000178813934, 0.360...",0.0,"[-290.47397982261174, -289.610438844386, -294....","[0.2926049106735601, 0.6939252526290564, 0.007...",1.0
1,1605,10.1,0.05,0.4,0.9951,8.1,30.0,3.26,6.0,6.9,0.44,97.0,0.28,"[10.100000381469727, 0.05000000074505806, 0.40...",0.0,"[-197.6379193125336, -199.03367226695718, -198...","[0.5087274925467365, 0.12598458279059105, 0.27...",0.0
2,1606,9.6,0.045,0.16,0.9949,6.2,30.0,3.18,6.0,7.0,0.47,136.0,0.32,"[9.600000381469727, 0.04500000178813934, 0.159...",0.0,"[-203.73923332288857, -203.90106817316314, -20...","[0.46606092411104416, 0.39642287741368154, 0.0...",0.0
3,1629,12.3,0.033,0.36,0.9906,7.2,37.0,3.1,7.0,2.0,0.71,114.0,0.32,"[12.300000190734863, 0.032999999821186066, 0.3...",2.0,"[-203.77522316691557, -205.6659199399471, -203...","[0.3808210122870497, 0.057491246859710986, 0.4...",2.0
4,1661,8.9,0.048,0.26,0.9972,6.0,50.0,3.3,6.0,12.4,0.36,147.0,0.19,"[8.899999618530273, 0.04800000041723251, 0.259...",0.0,"[-258.1470729407925, -258.6094812111399, -260....","[0.5407530770802788, 0.3405474551392079, 0.069...",0.0


In [74]:
# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

Test set accuracy = 0.387458006718925


In [122]:
session.execute("""drop table wines""")
session.execute("""drop table wineRed""")
session.execute("""drop table wineWhite""")

<cassandra.cluster.ResultSet at 0x107f0f668>