# Classification and Clustering Algorithms paired with Wine and Chocolate
------
<img src="images/wineAndChocolate.jpg" width="500" height="500">

## A demo using DataStax Enterprise Analytics, Apache Cassandra, Apache Spark, Python, Jupyter Notebooks, Spark MLlib, and Naive Bayes.

#### Real dataset: https://www.kaggle.com/rtatman/chocolate-bar-ratings

#### Most recent updates by Manhattan Chocolate Society: http://flavorsofcacao.com/index.html

## What are we trying to learn from this dataset? 

# QUESTION: If I have some information about a chocolate bar, can I predict which country produced this chocolate?


## Import Python packages -- all are required
* Jupyter needs the `%matplotlib inline` magic; otherwise plots are generated but not displayed

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
import pandas
import cassandra
import pyspark
import re
import os
import matplotlib.pyplot as plt
from IPython.display import display, Markdown
from pyspark.sql import SparkSession
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer

### Helper function to have nicer formatting of Spark DataFrames
* Credit to the folks at Netflix for this pretty code.

In [3]:
#Helper for pretty formatting for Spark DataFrames
def showDF(df, limitRows =  5, truncate = True):
    if(truncate):
        pandas.set_option('display.max_colwidth', 50)
    else:
        pandas.set_option('display.max_colwidth', -1)
    pandas.set_option('display.max_rows', limitRows)
    display(df.limit(limitRows).toPandas())
    pandas.reset_option('display.max_rows')

# DataStax Enterprise Analytics
<img src="images/dselogo.png" width="400" height="200">

## Creating Tables and Loading Tables

## Connect to DSE Analytics Cluster

In [4]:
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

## Create Demo Keyspace 

In [5]:
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS wineChocolate 
    WITH REPLICATION = 
    { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }"""
)

<cassandra.cluster.ResultSet at 0x119f11450>

## Set keyspace 

In [6]:
session.set_keyspace('winechocolate')

### Create a table called `chocolate`. Our PRIMARY KEY will be a unique key we generate for each row. This results in an even distribution of the data, but we will have to use that PRIMARY KEY in the WHERE clause of our CQL queries.

In [7]:
query = "CREATE TABLE IF NOT EXISTS chocolate \
                                   (chocolateid int, company text, bar_location text, ref int, \
                                   review_date int, cocoa_percent float, company_location text, rating float, \
                                   bean_type text, bean_origin text, \
                                   PRIMARY KEY (chocolateid))"
session.execute(query)


<cassandra.cluster.ResultSet at 0x117f7c510>

## Columns 
* Chocolate Id: Unique id for each bar
* Company: Name of the company manufacturing the bar.
* Bar Location: Specific Bean Origin
* REF: Id when the review was entered in the database. Higher = more recent.
* Review Date: Review Date
* Cocoa Percentage: Cocoa percentage (darkness) of the chocolate bar being reviewed.
* Company Location: Manufacturer base country.
* Rating: Expert rating for the bar. (1-5)
* Bean Type: The variety (breed) of bean used, if provided.
* Bean Origin: The broad geo-region of origin for the bean.
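
Because `chocolateid` is the partition key and is unique for each bar, Cassandra can spread the rows evenly: it hashes the key to decide which node owns each row. A minimal sketch of that idea, assuming a hypothetical 3-node cluster and using MD5 as a stand-in for Cassandra's default Murmur3 partitioner (the real token ring and replication are more involved):

```python
import hashlib
from collections import Counter

NUM_NODES = 3  # hypothetical cluster size, for illustration only

def owning_node(chocolate_id, num_nodes=NUM_NODES):
    """Map a partition key to a node by hashing it (MD5 as a stand-in for Murmur3)."""
    digest = hashlib.md5(str(chocolate_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

# Distribute ids 1..1795 (the row count reported for the chocolate table).
rows_per_node = Counter(owning_node(i) for i in range(1, 1796))
print(sorted(rows_per_node.items()))
```

Each node should end up with roughly a third of the rows, which is the "even distribution" a unique key buys you.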

## Load Flavors of Cacao Dataset
<img src="images/chocolatePic.jpeg" width="300" height="300">


## Load from CSV file (chocolateFinal.csv)

### Had to do some data cleaning to load this data: removed `%`, removed ``"``, and removed extra commas, including those inside company names. This preprocessing was done by me, and the result is available in chocolateFinal.csv
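
This kind of cleanup could also be scripted. A minimal Python 3 sketch, assuming the raw rows quote any field that contains a comma and write cocoa percentages like `70%`; the helper name `clean_row` and the sample line are hypothetical, not taken from the real file:

```python
import csv
import io

def clean_row(fields):
    """Strip quotes, percent signs, and embedded commas from one parsed row."""
    cleaned = []
    for field in fields:
        # Commas inside a field would break a later line.split(',').
        field = field.replace('"', "").replace("%", "").replace(",", " ")
        cleaned.append(" ".join(field.split()))  # collapse repeated whitespace
    return cleaned

# Hypothetical raw line: the company name contains a comma, so it is quoted.
raw = '"Acme, Inc.",Some Bar,1011,2013,70%,France,3.75,,Peru\n'
row = clean_row(next(csv.reader(io.StringIO(raw))))
print(",".join(row))
```

The `csv` module handles the quoted comma during parsing, so the cleaned row has exactly the nine fields that the loader expects.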

#### Insert all the Chocolate Data into the Apache Cassandra table `chocolate`

In [8]:
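fileName = 'data/chocolateFinal.csv'

# Build the INSERT statement once; the driver fills in the %s placeholders.
query = "INSERT INTO chocolate (chocolateid, company, bar_location, ref, review_date, cocoa_percent, \
                                company_location, rating, bean_type, bean_origin)"
query = query + " VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"

# The file is closed automatically when the with block exits.
with open(fileName, 'r') as input_file:
    # Use the 1-based line number as the unique chocolateid.
    for chocolateId, line in enumerate(input_file, 1):
        columns = line.split(',')
        # A bean type holding only a non-breaking space means "not provided"
        # ("\xc2\xa0" is its UTF-8 byte form in Python 2, "\xa0" the decoded character in Python 3).
        if columns[7] in ("\xc2\xa0", "\xa0"):
            columns[7] = ""
        columns[8] = columns[8].rstrip("\r\n")

        session.execute(query, (chocolateId, columns[0], columns[1], int(columns[2]), int(columns[3]),
                        float(columns[4]), columns[5], float(columns[6]), columns[7], columns[8]))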

### Select a row from the table to verify that the data has been inserted into the Apache Cassandra table

In [9]:
query = 'SELECT * FROM chocolate WHERE chocolateid=200'
rows = session.execute(query)
for row in rows:
    print (row.chocolateid, row.bar_location, row.ref, row.review_date, row.cocoa_percent, row.company_location,
          row.rating, row.bean_type, row.bean_origin)

(200, u'Sambiao 2009', 565, 2010, 70.0, u'U.S.A.', 3.0, u'Tiitaio', u'Madagasca')


## DSE Analytics with Apache Spark
<img src="images/sparklogo.png" width="150" height="200">

### Finally time for Apache Spark! 

#### Create a Spark session that is connected to Apache Cassandra. From there, load the table into a Spark DataFrame and count its rows.

In [10]:
spark = SparkSession.builder.appName('demo').master("local").getOrCreate()


chocolateTable = spark.read.format("org.apache.spark.sql.cassandra").options(table="chocolate", keyspace="winechocolate").load()

print ("Table Row Count: ")
print (chocolateTable.count())

Table Row Count: 
1795


In [11]:
showDF(chocolateTable)

Unnamed: 0,chocolateid,bar_location,bean_origin,bean_type,cocoa_percent,company,company_location,rating,ref,review_date
0,728,Kaua'I Alea Estate +wold,Hawaii,Bled,85.0,Gade Islad,U.S.A.,2.5,1367,2014
1,208,Pueto Plata,Domiica Republic,,68.0,Bittesweet Oigis,U.S.A.,3.75,233,2008
2,1501,Solomo Islad w/ ibs,Solomo Islads,,75.0,Solomos Gold,New Zealad,3.25,1796,2016
3,156,Sambiao,Madagasca,Tiitaio,70.0,Ba Au Chocolat,U.S.A.,3.75,983,2012
4,522,Budibugyo Distict,Ugada,Foasteo,70.0,De Villies,South Afica,2.5,1832,2016


### We will be training a model with Naive Bayes, so we need to split our dataset into a training set and a test set. We will split 80/20.

In [12]:
# Split the data into train and test
splits = chocolateTable.randomSplit([0.8, 0.2], 1234)
train = splits[0]
test = splits[1]

print ("Train Dataframe Row Count: ")
print (train.count())
print ("Test Dataframe Row Count: ")
print (test.count())

Train Dataframe Row Count: 
1493
Test Dataframe Row Count: 
304
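
The counts are only approximately 80/20 because `randomSplit` assigns rows to splits at random instead of cutting the table at an exact position. A rough plain-Python sketch of that behaviour, assuming each row independently draws a uniform number and lands in the split whose cumulative-weight range contains it (the function name `random_split` is illustrative, and this is not Spark's actual implementation):

```python
import random

def random_split(rows, weights, seed):
    """Randomly assign each row to one split, with probability proportional to its weight."""
    total = float(sum(weights))
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)          # cumulative upper bound of each split

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding leftovers.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train_ids, test_ids = random_split(range(1, 1796), [0.8, 0.2], seed=1234)
print(len(train_ids), len(test_ids))
```

Every row lands in exactly one split, but the split sizes vary from seed to seed around the requested proportions.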


### Naive Bayes is a classification algorithm that predicts a label using a model built from known labels.

### It requires that all values passed to it be floats assembled into a vector. We will create unique indexes for the Bean Origin, and our label will be Company Location (what we are trying to predict).

https://spark.apache.org/docs/latest/ml-features.html#stringindexer

#### StringIndexers of Bean Origin and Company Location 

In [13]:
# Convert target into numerical categories

labelIndexer = StringIndexer(inputCol="bean_origin", outputCol="origin", handleInvalid='keep')
training1 = labelIndexer.fit(train).transform(train)

labelIndexer2 = StringIndexer(inputCol="company_location", outputCol="label", handleInvalid='keep')
training2 = labelIndexer2.fit(training1).transform(training1)
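
By default, `StringIndexer` orders values by descending frequency, so the most common value gets index 0.0, and `handleInvalid='keep'` sends values not seen during fitting to one extra index at the end. A plain-Python sketch of that mapping; the helper names and sample values are made up for illustration, and the alphabetical tie-break is an assumption about recent Spark versions:

```python
from collections import Counter

def fit_string_index(values):
    """Assign indices by descending frequency (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

def transform_string_index(mapping, values):
    """Look up each value; unseen values share one extra 'keep' index."""
    unseen_index = float(len(mapping))
    return [mapping.get(v, unseen_index) for v in values]

origins = ["Peru", "Ecuador", "Peru", "Venezuela", "Ecuador", "Peru"]
origin_index = fit_string_index(origins)
print(origin_index)
print(transform_string_index(origin_index, ["Peru", "Togo"]))
```

Here `Peru` is the most frequent value and maps to 0.0, while the unseen `Togo` falls into the extra bucket.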

#### Vectorization -- in this case we will assume that if we have the cocoa percentage, the rating, the bean origin, and the review date, we can use that data to figure out the country that produced this chocolate. We end up with a new DataFrame called `trainingData`.
https://spark.apache.org/docs/latest/ml-features.html#vectorassembler

In [14]:
assembler = VectorAssembler(
    inputCols=['cocoa_percent', 'rating', 'origin', 'review_date'],
    outputCol='features')

trainingData = assembler.transform(training2)
showDF(trainingData)

Unnamed: 0,chocolateid,bar_location,bean_origin,bean_type,cocoa_percent,company,company_location,rating,ref,review_date,origin,label,features
0,14,Equateu,Ecuado,,70.0,A. Moi,Face,3.75,1011,2013,1.0,1.0,"[70.0, 3.75, 1.0, 2013.0]"
1,17,Papua New Guiea,Papua New Guiea,,70.0,A. Moi,Face,3.25,1015,2013,10.0,1.0,"[70.0, 3.25, 10.0, 2013.0]"
2,21,Chachamayo Povice,Peu,,63.0,A. Moi,Face,4.0,1019,2013,2.0,1.0,"[63.0, 4.0, 2.0, 2013.0]"
3,24,Chulucaas El Plataal,Peu,,70.0,Acalli,U.S.A.,3.75,1462,2015,2.0,0.0,"[70.0, 3.75, 2.0, 2015.0]"
4,25,Tumbes Noadio,Peu,Ciollo,70.0,Acalli,U.S.A.,3.75,1470,2015,2.0,0.0,"[70.0, 3.75, 2.0, 2015.0]"
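
Conceptually, `VectorAssembler` just concatenates the chosen columns, in order, into one numeric vector per row. A plain-Python sketch using the first row of the output above (the `assemble` helper is illustrative, not a Spark API):

```python
def assemble(row, input_cols):
    """Concatenate the selected columns into a single list of floats."""
    return [float(row[col]) for col in input_cols]

first_row = {"cocoa_percent": 70.0, "rating": 3.75, "origin": 1.0, "review_date": 2013}
features = assemble(first_row, ["cocoa_percent", "rating", "origin", "review_date"])
print(features)  # [70.0, 3.75, 1.0, 2013.0]
```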


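#### We must also do this with the testing set. Ideally the indexers fitted on the training set are reused here, so that each country maps to the same index in both sets.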

In [15]:
labelIndexer3 = StringIndexer(inputCol="bean_origin", outputCol="origin", handleInvalid='keep')
testing1 = labelIndexer3.fit(test).transform(test)

labelIndexer4 = StringIndexer(inputCol="company_location", outputCol="label", handleInvalid='keep')
testing2 = labelIndexer4.fit(testing1).transform(testing1)
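
Frequency-ordered indices depend on the data they are fitted on, so indexers fitted separately on the training and test sets can give the same country different numbers. A small plain-Python sketch of the effect, using made-up label lists and a hypothetical `frequency_index` helper that mimics frequency-ordered indexing:

```python
from collections import Counter

def frequency_index(values):
    """Index values by descending frequency (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

train_labels = ["U.S.A.", "U.S.A.", "U.S.A.", "France", "France", "Italy"]
test_labels = ["Italy", "Italy", "France", "U.S.A."]

train_index = frequency_index(train_labels)
test_index = frequency_index(test_labels)

# The same country ends up with a different index in each split.
print(train_index["Italy"], test_index["Italy"])
```

In PySpark, keeping the fitted models from training (for example `originModel = labelIndexer.fit(train)`, where `originModel` is just an illustrative name) and calling `originModel.transform(test)` avoids this mismatch.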

In [16]:
assembler1 = VectorAssembler(
    inputCols=['cocoa_percent', 'rating', 'origin', 'review_date'],
    outputCol='features')

testingData = assembler1.transform(testing2)

showDF(testingData)

Unnamed: 0,chocolateid,bar_location,bean_origin,bean_type,cocoa_percent,company,company_location,rating,ref,review_date,origin,label,features
0,4,Akata,Togo,,70.0,A. Moi,Face,3.5,1680,2015,26.0,1.0,"[70.0, 3.5, 26.0, 2015.0]"
1,6,Caeeo,Veezuela,Ciollo,70.0,A. Moi,Face,2.75,1315,2014,1.0,1.0,"[70.0, 2.75, 1.0, 2014.0]"
2,7,Cuba,Cuba,,70.0,A. Moi,Face,3.5,1315,2014,24.0,1.0,"[70.0, 3.5, 24.0, 2014.0]"
3,47,CIAAB Coop,Bolivia,,60.0,Altus aka Cao Atisa,U.S.A.,2.5,1732,2016,17.0,0.0,"[60.0, 2.5, 17.0, 2016.0]"
4,85,Tiidad,Tiidad,Tiitaio,70.0,Amedei,Italy,3.5,129,2007,13.0,6.0,"[70.0, 3.5, 13.0, 2007.0]"


### Now it's time to use NaiveBayes. We will train the model, then use that model with our testing data to get our predictions.
https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#naive-bayes

In [17]:
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
model = nb.fit(trainingData)

predictions = model.transform(testingData)
#predictions.show()
print (predictions.count())
showDF(predictions)

305


Unnamed: 0,chocolateid,bar_location,bean_origin,bean_type,cocoa_percent,company,company_location,rating,ref,review_date,origin,label,features,rawPrediction,probability,prediction
0,3,Atsae,Togo,,70.0,A. Moi,Face,3.0,1676,2015,26.0,1.0,"[70.0, 3.0, 26.0, 2015.0]","[-478.5409844021181, -475.45945158361974, -476...","[0.0026919031923872787, 0.05866136121298531, 0...",11.0
1,9,Pueto Cabello,Veezuela,Ciollo,70.0,A. Moi,Face,3.75,1319,2014,1.0,1.0,"[70.0, 3.75, 1.0, 2014.0]","[-350.2989441658598, -355.65440421331976, -355...","[0.008310619648087082, 3.924521021984312e-05, ...",5.0
2,12,Madagasca,Madagasca,Ciollo,70.0,A. Moi,Face,3.0,1011,2013,4.0,1.0,"[70.0, 3.0, 4.0, 2013.0]","[-361.35121907085187, -365.7159195570009, -365...","[0.3308132751730429, 0.004207436718091643, 0.0...",0.0
3,35,Mote Alege D. Badeo,Bazil,Foasteo,75.0,Akesso's (Palus),Switzelad,2.75,508,2010,5.0,10.0,"[75.0, 2.75, 5.0, 2010.0]","[-381.8288587139367, -385.84861985986, -385.43...","[0.4875144584590192, 0.008754420759597814, 0.0...",0.0
4,87,Toscao Black,,Bled,70.0,Amedei,Italy,5.0,40,2006,7.0,6.0,"[70.0, 5.0, 7.0, 2006.0]","[-390.033451120477, -393.29388602039035, -393....","[0.6029937735273052, 0.023137900121070202, 0.0...",0.0
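
To make the `rawPrediction` and `probability` columns less opaque: multinomial Naive Bayes scores each class as a log prior plus the feature values weighted by per-feature log likelihoods, and the probabilities come from normalizing those log scores. A self-contained sketch with smoothing of 1.0 on a tiny made-up dataset; the function names are illustrative, and the details of Spark's implementation may differ:

```python
import math

def train_multinomial_nb(features, labels, smoothing=1.0):
    """Return {class: (log_prior, [log_theta per feature])}."""
    classes = sorted(set(labels))
    num_features = len(features[0])
    model = {}
    for c in classes:
        rows = [f for f, y in zip(features, labels) if y == c]
        feature_sums = [sum(r[j] for r in rows) for j in range(num_features)]
        total = sum(feature_sums)
        log_prior = math.log((len(rows) + smoothing) /
                             (len(labels) + smoothing * len(classes)))
        log_theta = [math.log((s + smoothing) / (total + smoothing * num_features))
                     for s in feature_sums]
        model[c] = (log_prior, log_theta)
    return model

def predict(model, x):
    """Return (predicted class, raw log scores, normalized probabilities)."""
    raw = {c: lp + sum(t * v for t, v in zip(lt, x))
           for c, (lp, lt) in model.items()}
    top = max(raw.values())                       # subtract max for numerical stability
    exp_scores = {c: math.exp(s - top) for c, s in raw.items()}
    norm = sum(exp_scores.values())
    probability = {c: e / norm for c, e in exp_scores.items()}
    return max(raw, key=raw.get), raw, probability

# Made-up rows: [cocoa_percent, rating, origin index]
X = [[70.0, 3.75, 1.0], [72.0, 3.5, 1.0], [60.0, 2.5, 9.0], [63.0, 2.75, 8.0]]
y = [0.0, 0.0, 1.0, 1.0]
nb_model = train_multinomial_nb(X, y)
prediction, raw, probability = predict(nb_model, [71.0, 3.5, 1.0])
print(prediction, probability)
```

The raw scores are large negative numbers because each feature value multiplies a log probability, which is consistent with the magnitudes shown in the `rawPrediction` column.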


#### Let's just look at a few. 

In [18]:
showDF(predictions.select("company_location", "label", "prediction", "probability"))

Unnamed: 0,company_location,label,prediction,probability
0,Face,1.0,11.0,"[0.0026413761427303696, 0.05862142316959855, 0..."
1,Face,1.0,5.0,"[0.008214428098094532, 3.73993255231028e-05, 7..."
2,Face,1.0,11.0,"[0.010321946934497134, 0.11661989061094824, 0...."
3,U.S.A.,0.0,0.0,"[0.2129656778473464, 0.21197638091380983, 0.12..."
4,Italy,6.0,0.0,"[0.4472119921620541, 0.12366782638216954, 0.10..."


### We can now use the MulticlassClassificationEvaluator to evaluate the accuracy of our predictions.

In [19]:
# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

Test set accuracy = 0.184466019417
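
Accuracy here is simply the fraction of test rows whose `prediction` equals their `label`. As a sanity check in plain Python, using only the five rows displayed earlier (so the result will not match the full test set figure):

```python
# label and prediction values copied from the five displayed prediction rows
labels = [1.0, 1.0, 1.0, 0.0, 6.0]
predictions = [11.0, 5.0, 11.0, 0.0, 0.0]

correct = sum(1 for y, p in zip(labels, predictions) if y == p)
sample_accuracy = correct / float(len(labels))
print(sample_accuracy)  # 0.2 (only the U.S.A. row is predicted correctly)
```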


### About 18% of the time (test set accuracy of roughly 0.18), if we know the cocoa percentage, where the bean was grown, the rating, and the review date, we can figure out which country produced this chocolate bar. Pretty interesting.

#### Go and see what else you can predict. Can you predict the rating based on these attributes?

### References:
* https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/3783546674231736/4413065072037724/latest.html
* https://www.kaggle.com/rtatman/chocolate-bar-ratings
* https://en.wikipedia.org/wiki/Naive_Bayes_classifier