<img src="images/datastaxdevs_banner.png" width="600" height="200">

# Algorithm 2: Naive Bayes
------
<img src="images/drinkWine.jpeg" width="300" height="500">


#### Dataset: https://archive.ics.uci.edu/ml/datasets/Wine+Quality

## What are we trying to learn from this dataset? 

### Can Naive Bayes be used to guess a wine's rating score from its attributes?

In [None]:
import os
import pandas
from pyspark.sql import SparkSession
#
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import NaiveBayes
#
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
#
from dotenv import load_dotenv, find_dotenv

from tools import showDF, examineCassandraTable

In [None]:
# read .env file for connection params
dotenv_file = find_dotenv('.env')
load_dotenv(dotenv_file)
astraUsername = os.environ['ASTRA_DB_CLIENT_ID']
astraPassword = os.environ['ASTRA_DB_CLIENT_SECRET']
astraSecureConnect = os.environ['ASTRA_DB_SECURE_BUNDLE_PATH']
astraKeyspace = os.environ['ASTRA_DB_KEYSPACE']

## Inspect input data: Table(s)

### Connect to Cassandra

In [None]:
cloud_config = {
    'secure_connect_bundle': '/home/jovyan/' + astraSecureConnect
}
auth_provider = PlainTextAuthProvider(username=astraUsername, password=astraPassword)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect()

### Set keyspace 

In [None]:
session.set_keyspace(astraKeyspace)

### Examine table `wines` (structure and contents)

In [None]:
print(examineCassandraTable(session, astraKeyspace, 'wines'))

### What these 12 columns represent:

* **Fixed acidity**
* **Volatile acidity**
* **Citric acid**
* **Residual sugar**
* **Chlorides**
* **Free sulfur dioxide**     
* **Total sulfur dioxide**
* **Density** 
* **pH**
* **Sulphates**
* **Alcohol**
* **Quality**

<img src="images/whiteAndRed.jpeg" width="300" height="300">

# Machine Learning with Apache Cassandra & Apache Spark
<img src="images/sparklogo.png" width="150" height="200">

### Create a Spark session that is connected to the database. Then load the `wines` table into a Spark DataFrame and count its rows.

In [None]:
spark = SparkSession \
    .builder \
    .appName('demo') \
    .master('local') \
    .config( \
        'spark.cassandra.connection.config.cloud.path', \
        'file:' + '/home/jovyan/' + astraSecureConnect) \
    .config('spark.cassandra.auth.username', astraUsername) \
    .config('spark.cassandra.auth.password', astraPassword) \
    .getOrCreate()

In [None]:
wineDF = spark.read \
    .format('org.apache.spark.sql.cassandra') \
    .options(table='wines', keyspace=astraKeyspace) \
    .load()

print('Table Wine Row Count:')
print(wineDF.count())

In [None]:
showDF(wineDF)

#### Let's keep only the wines that have been rated 6 or higher and create a new DataFrame with them

In [None]:
wine6DF = wineDF.filter('quality > 5')
showDF(wine6DF)

#### Assemble all attributes of a wine into a single feature vector, and index the `quality` column as the label
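
To make the two transformations in the next cell concrete, here is a minimal plain-Python sketch of what they conceptually do to each row. It is an illustration, not Spark's implementation. The helper names `assemble` and `index_labels` and the sample rows are hypothetical, and the sketch assumes the default `StringIndexer` ordering, in which the most frequent label receives index 0.

```python
from collections import Counter

def assemble(row, input_cols):
    """Collect the named columns of a row, in order, into one feature list."""
    return [float(row[col]) for col in input_cols]

def index_labels(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(str(v) for v in values)
    # sort by descending frequency, then by label text to break ties
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# hypothetical sample rows, using two of the notebook's column names
rows = [
    {'alcohol': 10.5, 'ph': 3.2, 'quality': 6},
    {'alcohol': 12.0, 'ph': 3.1, 'quality': 7},
    {'alcohol': 11.0, 'ph': 3.3, 'quality': 6},
]
label_index = index_labels(r['quality'] for r in rows)
prepared = [
    {'features': assemble(r, ['alcohol', 'ph']),
     'label': label_index[str(r['quality'])]}
    for r in rows
]
```

Because indices follow label frequency, a label index of 0 means "the most common quality value", not "quality 0".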

In [None]:
assembler = VectorAssembler(
    inputCols=['alcohol', 'chlorides', 'citricacid', 'density', 'fixedacidity', 'ph',
               'freesulfur', 'sugar', 'sulphates', 'totalsulfur', 'volatileacidity'],
    outputCol='features',
)

trainingData = assembler.transform(wine6DF)

labelIndexer = StringIndexer(inputCol='quality', outputCol='label', handleInvalid='keep')
trainingData1 = labelIndexer.fit(trainingData).transform(trainingData)

showDF(trainingData1)
print(trainingData1.count())

We need to split our dataset into a training set and a test set. We'll split 80/20.
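
Note that the resulting row counts are only approximately 80/20: `randomSplit` assigns rows to splits by random sampling rather than cutting at an exact position. A minimal plain-Python sketch of that idea follows. The function `random_split` is hypothetical and only mimics the behaviour; it does not reproduce Spark's sampling, nor does it select the same rows for the same seed.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to a split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # the last split also catches any floating-point leftover
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train_rows, test_rows = random_split(list(range(1000)), [0.8, 0.2], seed=1234)
print(len(train_rows), len(test_rows))  # roughly 800 and 200
```

Fixing the seed makes the split reproducible across runs, which is why the cell below passes `1234`.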

In [None]:
# Split the data into train and test
splits = trainingData1.randomSplit([0.8, 0.2], 1234)
train = splits[0]
test = splits[1]

print('Train Dataframe Row Count:')
print(train.count())
print('Test Dataframe Row Count:')
print(test.count())

### Now it's time to use Naive Bayes. We will train the model, then run it on the test dataset to measure the accuracy of our predictions.
https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#naive-bayes
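
For intuition about what the next cell trains, here is a minimal plain-Python sketch of a multinomial Naive Bayes classifier with additive smoothing. It follows the usual textbook formulation and is not Spark's implementation; the function names and the toy data are hypothetical. Feature values are treated as non-negative counts or weights, which the multinomial variant requires.

```python
import math

def fit_multinomial_nb(features, labels, smoothing=1.0):
    """Estimate log priors and smoothed log feature likelihoods per class."""
    classes = sorted(set(labels))
    n_features = len(features[0])
    log_prior = {}
    log_likelihood = {}
    for c in classes:
        rows = [x for x, y in zip(features, labels) if y == c]
        log_prior[c] = math.log(len(rows) / len(features))
        # total weight of each feature within the class
        sums = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(sums) + smoothing * n_features
        log_likelihood[c] = [math.log((s + smoothing) / denom) for s in sums]
    return log_prior, log_likelihood

def predict_multinomial_nb(model, x):
    """Return the class with the highest log posterior score for one feature vector."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(v * w for v, w in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# hypothetical toy data: class 0 is heavy in feature 0, class 1 in feature 1
X = [[3.0, 0.0], [4.0, 1.0], [0.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
nb_model = fit_multinomial_nb(X, y, smoothing=1.0)
```

With this toy data, `predict_multinomial_nb(nb_model, [5.0, 0.0])` picks class 0 and `predict_multinomial_nb(nb_model, [0.0, 5.0])` picks class 1. The smoothing term keeps a feature that never appears in a class from driving that class's likelihood to zero.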

In [None]:
# multinomial Naive Bayes with additive (Laplace) smoothing
nb = NaiveBayes(smoothing=1.0, modelType='multinomial')

# train the model
model = nb.fit(train)

predictions = model.transform(test)
print(predictions.count())
showDF(predictions)

In [None]:
showDF(predictions.select('quality', 'label', 'prediction', 'probability'), limitRows=15)

### We can now use the `MulticlassClassificationEvaluator` to evaluate the accuracy of our predictions.
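
With `metricName='accuracy'`, the reported value is the fraction of test rows whose predicted label equals the true label. A minimal plain-Python sketch of that computation, with a hypothetical helper name and made-up label pairs:

```python
def accuracy_score(pairs):
    """Fraction of (label, prediction) pairs in which the two values match."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError('cannot compute accuracy of an empty set')
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)

# made-up (label, prediction) pairs: three of the four match
example_pairs = [(0.0, 0.0), (1.0, 0.0), (0.0, 0.0), (2.0, 2.0)]
print('accuracy = %.4f' % accuracy_score(example_pairs))  # accuracy = 0.7500
```

Accuracy alone can flatter a model when one class dominates, so it is worth comparing the result against the share of the most frequent label in the test set.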

In [None]:
# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(
    labelCol='label',
    predictionCol='prediction',
    metricName='accuracy',
)
accuracy = evaluator.evaluate(predictions)
print('Test set accuracy = %.4f' % accuracy)

## Example model usage

_Note: in real life, your input is probably massive (as opposed to a single row); also, it is likely read from the database._
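
Keep in mind that the `prediction` value returned by the function below is a label index produced by the `StringIndexer`, not a quality score, and that `probability` is ordered by the same indices. The following plain-Python sketch shows how such a result could be translated back. The `labels` list and the `result` dictionary are made-up stand-ins; in the notebook, the real list would come from the fitted indexer's `labels` attribute, which would require keeping the result of `labelIndexer.fit(...)` in a variable.

```python
def describe_prediction(result, labels):
    """Translate a label index and its probability vector back to original labels."""
    predicted_label = labels[int(result['prediction'])]
    # pair every original label with its probability, most likely first
    ranked = sorted(
        zip(labels, result['probability']),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return predicted_label, ranked

# made-up stand-ins for the indexer's labels and a model output
labels = ['6', '7', '8', '9']
result = {'prediction': 1.0, 'probability': [0.30, 0.55, 0.10, 0.05]}

predicted_label, ranked = describe_prediction(result, labels)
print(predicted_label)  # 7
print(ranked[0])        # ('7', 0.55)
```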

In [None]:
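def predict_wine_quality(**kwargs):
    """Predict the label index for one wine, given its attributes as keyword arguments."""
    # build a one-row Spark DataFrame from the keyword arguments
    input_df = pandas.DataFrame([kwargs])
    spark_input = spark.createDataFrame(input_df)
    # assemble the feature vector the same way as for the training data
    spark_with_features = assembler.transform(spark_input)
    predicted = model.transform(spark_with_features)
    collected = predicted.collect()
    # 'prediction' is the StringIndexer label index, not the quality score itself
    return {
        'prediction': collected[0].prediction,
        'probability': list(collected[0].probability),
    }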

In [None]:
predict_wine_quality(
    alcohol=15.0,
    chlorides=0.017,
    citricacid=0.31,
    density=0.99100,
    fixedacidity=5.1,
    freesulfur=12.0,
    ph=4.14,
    sugar=1.90,
    sulphates=0.50,
    totalsulfur=103.0,
    volatileacidity=0.170,
)

#### Stop the Spark session

In [None]:
spark.stop()