# Classification and Clustering Algorithms paired with Wine and Chocolate
------
<img src="images/wineAndChocolate.jpg" width="500" height="500">

## A demo using Apache Cassandra, Apache Spark, the Cassandra-Spark Connector, Python, Jupyter Notebooks, Spark MLlib, and Naive Bayes.

#### Real dataset: https://www.kaggle.com/rtatman/chocolate-bar-ratings

#### Most recent updates by Manhattan Chocolate Society: http://flavorsofcacao.com/index.html

## What are we trying to learn from this dataset? 

# QUESTION: If I have some information about a chocolate bar, can I predict which country produced this chocolate?


## Import Python packages -- all are required
* Tell Jupyter to display plots inline with `%matplotlib inline`; otherwise the plot is generated but not displayed.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
import pandas
import cassandra
import pyspark
import re
import os
import matplotlib.pyplot as plt
from IPython.display import display, Markdown
from pyspark.sql import SparkSession
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer

### Helper function to have nicer formatting of Spark DataFrames
* Credit to the folks at Netflix for this pretty code.

In [3]:
# Helper for pretty formatting of Spark DataFrames
def showDF(df, limitRows=5, truncate=True):
    # Cap the column width when truncating; None means no limit
    # (older pandas versions used -1 here, which is now deprecated)
    if truncate:
        pandas.set_option('display.max_colwidth', 50)
    else:
        pandas.set_option('display.max_colwidth', None)
    pandas.set_option('display.max_rows', limitRows)
    display(df.limit(limitRows).toPandas())
    pandas.reset_option('display.max_rows')

# Apache Cassandra 
<img src="images/cassandralogo.png" width="200" height="200">

## Creating Tables and Loading Tables

## Connect to Apache Cassandra Local Instance

In [4]:
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

## Create Demo Keyspace 

In [5]:
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS wineChocolate 
    WITH REPLICATION = 
    { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }"""
)

<cassandra.cluster.ResultSet at 0x112b366d8>

## Set keyspace 

In [6]:
session.set_keyspace('winechocolate')

### Create a table called `chocolate`. Our PRIMARY KEY will be a unique id we generate for each row. This results in an even distribution of the data, but we will have to use that PRIMARY KEY in the WHERE clause of any CQL query.

In [7]:
query = "CREATE TABLE IF NOT EXISTS chocolate \
                                   (chocolateid int, company text, bar_location text, ref int, \
                                   review_date int, cocoa_percent float, company_location text, rating float, \
                                   bean_type text, bean_origin text, \
                                   PRIMARY KEY (chocolateid))"
session.execute(query)


<cassandra.cluster.ResultSet at 0x113cb85f8>

## Columns 
* Chocolate Id: Unique id for each bar.
* Company: Name of the company manufacturing the bar.
* Bar Location: Specific Bean Origin
* REF: Id assigned when the review was entered in the database. Higher = more recent.
* Review Date: Year of the review.
* Cocoa Percentage: Cocoa percentage (darkness) of the chocolate bar being reviewed.
* Company Location: Manufacturer base country.
* Rating: Expert rating for the bar. (1-5)
* Bean Type: The variety (breed) of bean used, if provided.
* Bean Origin: The broad geo-region of origin for the bean.
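
Returning to the primary key choice above: a rough way to see why a unique `chocolateid` spreads rows evenly is to hash each id and assign it to one of a few imaginary nodes. This is only an illustration. Cassandra's default partitioner uses Murmur3 tokens on a token ring, whereas this sketch substitutes `hashlib.md5` and a simple modulo, and the node count of 3 is an assumption (the demo itself runs against a single local node). The helper names are hypothetical.

```python
import hashlib
from collections import Counter

def node_for_key(key, node_count):
    """Map a partition key to a node index via a hash (a stand-in for a token ring)."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count

def distribution(keys, node_count):
    """Count how many keys land on each hypothetical node."""
    return Counter(node_for_key(k, node_count) for k in keys)

# 1795 unique ids (the number of rows this demo loads); 3 nodes is a made-up cluster size.
counts = distribution(range(1, 1796), node_count=3)
print(counts)
```

Each node should receive roughly a third of the ids. The trade-off is the one noted above: with a key like this, efficient reads have to name the `chocolateid` in the WHERE clause.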

## Load Flavors of Cacao Dataset
<img src="images/chocolatePic.jpeg" width="300" height="300">


## Load from CSV file (chocolateFinal.csv)

### Some data cleaning was needed to load this data: removed `%`, removed ``"``, and removed extra commas, including those in company names. I did this preprocessing myself, and the result is available in chocolateFinal.csv
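
The cleaned file ships with the demo, so the exact cleaning steps are not shown. As a rough sketch of that kind of cleanup, assuming a raw row where the cocoa percentage carries a `%` sign and the company name is quoted with an embedded comma, the `csv` module can handle the quoting and the leftover characters can be stripped by hand. The sample row and the `clean_row` helper are hypothetical, not the author's script.

```python
import csv
import io

def clean_row(fields):
    """Strip quotes, percent signs, embedded commas, and surrounding whitespace from each field."""
    cleaned = []
    for field in fields:
        field = field.replace('"', "").replace("%", "").replace(",", "")
        cleaned.append(field.strip())
    return cleaned

# Hypothetical raw line: a quoted company name containing a comma, and a percent sign.
raw = '"Acme, Inc.",Madagascar,1000,2012,70%,U.S.A.,3.5,Trinitario,Madagascar\n'
reader = csv.reader(io.StringIO(raw))
rows = [clean_row(fields) for fields in reader]
print(",".join(rows[0]))
```

After cleaning, every row has exactly nine comma-separated fields, which is what the simple `line.split(',')` in the loading cell relies on.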

#### Insert all the Chocolate Data into the Apache Cassandra table `chocolate`

In [8]:
fileName = 'data/chocolateFinal.csv'

query = "INSERT INTO chocolate (chocolateid, company, bar_location, ref, review_date, cocoa_percent, \
                                company_location, rating, bean_type, bean_origin)"
query = query + " VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"

# The with-block closes the file once the loop finishes
with open(fileName, 'r') as input_file:
    # chocolateid is a running counter starting at 1
    for chocolateId, line in enumerate(input_file, start=1):
        columns = line.split(',')
        # A lone non-breaking space means the bean type is missing
        # ("\xc2\xa0" is its UTF-8 byte form, as read under Python 2)
        if columns[7] in ("\xa0", "\xc2\xa0"):
            columns[7] = ""
        columns[8] = columns[8].rstrip("\r\n")

        session.execute(query, (chocolateId, columns[0], columns[1], int(columns[2]), int(columns[3]),
                        float(columns[4]), columns[5], float(columns[6]), columns[7], columns[8]))

### Run a SELECT on the table to verify that the data has been inserted into the Apache Cassandra table

In [9]:
query = 'SELECT * FROM chocolate WHERE chocolateid=200'
rows = session.execute(query)
for row in rows:
    print (row.chocolateid, row.bar_location, row.ref, row.review_date, row.cocoa_percent, row.company_location,
          row.rating, row.bean_type, row.bean_origin)

200 Sambiao 2009 565 2010 70.0 U.S.A. 3.0 Tiitaio Madagasca


## Machine Learning with Apache Spark
<img src="images/sparklogo.png" width="150" height="200">

### Finally time for Apache Spark! 

#### Create a Spark session that is connected to Apache Cassandra. From there, load the table into a Spark DataFrame and count its rows.

In [10]:
spark = SparkSession.builder.appName('demo').master("local").getOrCreate()


chocolateTable = spark.read.format("org.apache.spark.sql.cassandra").options(table="chocolate", keyspace="winechocolate").load()

print ("Table Row Count: ")
print (chocolateTable.count())

Table Row Count: 
1795


In [11]:
showDF(chocolateTable)

Unnamed: 0,chocolateid,bar_location,bean_origin,bean_type,cocoa_percent,company,company_location,rating,ref,review_date
0,678,Maao Cayo Fotuato No. 4,Peu,Foasteo (Nacioal),70.0,Fech Boad,U.S.A.,3.5,781,2011
1,665,Costa Rica,Costa Rica,,70.0,Fedeic Blodeel,Belgium,3.5,1538,2015
2,455,Los Rios H. Iaa,Ecuado,Nacioal,72.0,Coppeeu,Gemay,2.5,558,2010
3,1439,Atilles (Ti/Ge/DR/Ve),Caibea,Bled,75.0,Schaffe Bege,U.S.A.,3.0,188,2007
4,1764,Madagasca Batch 2,Madagasca,Tiitaio,70.0,Zak's,U.S.A.,3.25,1578,2015


### Naive Bayes is a classification algorithm that predicts a label using a model built from known labels.

### It requires that all values passed to it be floats and vectorized. We will create unique indexes for Bean Origin, and our label will be Company Location (what we are trying to predict).

https://spark.apache.org/docs/latest/ml-features.html#stringindexer

#### StringIndexers of Bean Origin and Company Location 

In [12]:
# Convert the bean origin (a feature) and the company location (the target) into numerical categories

labelIndexer = StringIndexer(inputCol="bean_origin", outputCol="origin", handleInvalid='keep')
training1 = labelIndexer.fit(chocolateTable).transform(chocolateTable)

labelIndexer2 = StringIndexer(inputCol="company_location", outputCol="label", handleInvalid='keep')
training2 = labelIndexer2.fit(training1).transform(training1)
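
To make the indexing step concrete, the sketch below mimics what `StringIndexer` does with its default ordering: the most frequent string gets index 0.0, the next most frequent gets 1.0, and so on. This is a plain-Python approximation, not Spark's implementation. The `index_strings` helper and the sample values are made up, and breaking ties alphabetically is an assumption made here for determinism.

```python
from collections import Counter

def index_strings(values):
    """Assign 0.0 to the most frequent value, 1.0 to the next, and so on."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

sample = ["U.S.A.", "France", "U.S.A.", "Belgium", "France", "U.S.A."]
mapping = index_strings(sample)
print(mapping)
```

This frequency ordering is consistent with the outputs in this notebook, where `U.S.A.` carries label 0.0.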

#### Vectorization -- here we assume that, given the cocoa percentage, the rating, the bean origin, and the review date, we can work out the country that produced the chocolate. We end up with a new DataFrame called `trainingData`.
https://spark.apache.org/docs/latest/ml-features.html#vectorassembler

In [13]:
assembler = VectorAssembler(
    inputCols=['cocoa_percent', 'rating', 'origin', 'review_date'],
    outputCol='features')

trainingData = assembler.transform(training2)
showDF(trainingData)
print(trainingData.count())

Unnamed: 0,chocolateid,bar_location,bean_origin,bean_type,cocoa_percent,company,company_location,rating,ref,review_date,origin,label,features
0,678,Maao Cayo Fotuato No. 4,Peu,Foasteo (Nacioal),70.0,Fech Boad,U.S.A.,3.5,781,2011,2.0,0.0,"[70.0, 3.5, 2.0, 2011.0]"
1,665,Costa Rica,Costa Rica,,70.0,Fedeic Blodeel,Belgium,3.5,1538,2015,13.0,7.0,"[70.0, 3.5, 13.0, 2015.0]"
2,455,Los Rios H. Iaa,Ecuado,Nacioal,72.0,Coppeeu,Gemay,2.5,558,2010,1.0,9.0,"[72.0, 2.5, 1.0, 2010.0]"
3,1439,Atilles (Ti/Ge/DR/Ve),Caibea,Bled,75.0,Schaffe Bege,U.S.A.,3.0,188,2007,30.0,0.0,"[75.0, 3.0, 30.0, 2007.0]"
4,1764,Madagasca Batch 2,Madagasca,Tiitaio,70.0,Zak's,U.S.A.,3.25,1578,2015,3.0,0.0,"[70.0, 3.25, 3.0, 2015.0]"


1795


### We will be training a model with Naive Bayes, so we need to split our dataset into a training set and a test set. We will split 80/20.

In [14]:
# Split the data into train and test
splits = trainingData.randomSplit([0.8, 0.2], 1234)
train = splits[0]
test = splits[1]

print ("Train DataFrame Row Count: ")
print (train.count())
print ("Test DataFrame Row Count: ")
print (test.count())

Train DataFrame Row Count: 
1456
Test DataFrame Row Count: 
339
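
Note that the split above came out as 1456/339, roughly 81/19 rather than exactly 80/20. `randomSplit` assigns rows by random sampling, so the weights are targets rather than guarantees. The sketch below imitates that behaviour in plain Python with one random draw per row. The `random_split` helper is hypothetical, and its seed handling does not reproduce Spark's, so the exact counts will differ from the ones above.

```python
import random

def random_split(rows, train_weight, seed):
    """Send each row to train with probability train_weight, otherwise to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1795))  # stand-in for the 1795 chocolate rows
train_rows, test_rows = random_split(rows, train_weight=0.8, seed=1234)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable from run to run, which is why the notebook passes `1234`.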


### Now it's time to use NaiveBayes. We will train the model, then use that model with our test data to get our predictions.
https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#naive-bayes

In [15]:
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
model = nb.fit(train)

predictions = model.transform(test)
#predictions.show()
print (predictions.count())
showDF(predictions)

339


Unnamed: 0,chocolateid,bar_location,bean_origin,bean_type,cocoa_percent,company,company_location,rating,ref,review_date,origin,label,features,rawPrediction,probability,prediction
0,4,Akata,Togo,,70.0,A. Moi,Face,3.5,1680,2015,49.0,1.0,"[70.0, 3.5, 49.0, 2015.0]","[-604.0545643021429, -590.1303001840056, -598....","[1.34200518155597e-14, 1.496184887452294e-08, ...",52.0
1,18,Chuao,Veezuela,Tiitaio,70.0,A. Moi,Face,4.0,1015,2013,0.0,1.0,"[70.0, 4.0, 0.0, 2013.0]","[-346.63124271474675, -353.64051466795036, -35...","[0.0005523351135647801, 4.9901605902197e-07, 8...",14.0
2,47,CIAAB Coop,Bolivia,,60.0,Altus aka Cao Atisa,U.S.A.,2.5,1732,2016,7.0,0.0,"[60.0, 2.5, 7.0, 2016.0]","[-340.45439733661397, -344.52226105550676, -34...","[0.6150256804353286, 0.01052549390014852, 0.03...",0.0
3,314,Veezuela,Veezuela,,71.0,Cacao Sampaka,Spai,3.5,537,2010,0.0,11.0,"[71.0, 3.5, 0.0, 2010.0]","[-346.63723989821494, -353.6492612700649, -350...","[0.0005581755339317312, 5.029080735234311e-07,...",5.0
4,523,Sambiao Valley batch 2477,Madagasca,,70.0,De Villies,South Afica,3.0,1832,2016,3.0,44.0,"[70.0, 3.0, 3.0, 2016.0]","[-356.21325300976605, -361.97196724289165, -35...","[0.12331549688324803, 0.0003890808382743227, 0...",13.0
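
For intuition about the `rawPrediction` column, here is a minimal multinomial Naive Bayes in plain Python. It treats each feature value as a count, estimates smoothed per-class feature probabilities, and scores a row as the log prior plus the feature-weighted sum of log probabilities. It is a sketch of the textbook algorithm with the same `smoothing=1.0` setting, not Spark's code, and the function names and toy rows are invented for illustration.

```python
import math
from collections import defaultdict

def fit_multinomial_nb(features, labels, smoothing=1.0):
    """Return (log_priors, log_thetas), each keyed by class label."""
    n_features = len(features[0])
    class_counts = defaultdict(int)
    feature_sums = defaultdict(lambda: [0.0] * n_features)
    for row, label in zip(features, labels):
        class_counts[label] += 1
        for i, value in enumerate(row):
            feature_sums[label][i] += value
    n_rows, n_classes = len(labels), len(class_counts)
    log_priors, log_thetas = {}, {}
    for label, count in class_counts.items():
        # Smoothed class prior.
        log_priors[label] = math.log((count + smoothing) / (n_rows + smoothing * n_classes))
        # Smoothed share of each feature within the class.
        total = sum(feature_sums[label]) + smoothing * n_features
        log_thetas[label] = [math.log((s + smoothing) / total) for s in feature_sums[label]]
    return log_priors, log_thetas

def raw_prediction(row, log_priors, log_thetas):
    """Unnormalised log score per class: log prior + sum(x_i * log theta_i)."""
    return {
        label: log_priors[label] + sum(x * lt for x, lt in zip(row, log_thetas[label]))
        for label in log_priors
    }

def predict(row, log_priors, log_thetas):
    """Pick the class with the highest raw score."""
    scores = raw_prediction(row, log_priors, log_thetas)
    return max(scores, key=scores.get)

# Toy rows: [cocoa_percent, rating, origin_index, review_date]; labels are class indices.
X = [[70.0, 3.5, 2.0, 2011.0], [72.0, 3.0, 2.0, 2012.0],
     [60.0, 2.5, 30.0, 2010.0], [62.0, 2.0, 28.0, 2009.0]]
y = [0.0, 0.0, 1.0, 1.0]
priors, thetas = fit_multinomial_nb(X, y)
print(raw_prediction([71.0, 3.25, 2.0, 2013.0], priors, thetas))
print(predict([71.0, 3.25, 2.0, 2013.0], priors, thetas))
```

Scores of a few hundred in magnitude, like those in the output above, arise naturally once feature values such as a `review_date` near 2010 are multiplied by log probabilities.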


#### Let's just look at a few. 

In [16]:
showDF(predictions.select("company_location", "label", "prediction", "probability"))

| | company_location | label | prediction | probability |
|---|---|---|---|---|
| 0 | Face | 1.0 | 52.0 | [1.34200518155597e-14, 1.496184887452294e-08, ... |
| 1 | Face | 1.0 | 14.0 | [0.0005523351135647801, 4.9901605902197e-07, 8... |
| 2 | U.S.A. | 0.0 | 0.0 | [0.6150256804353286, 0.01052549390014852, 0.03... |
| 3 | Spai | 11.0 | 5.0 | [0.0005581755339317312, 5.029080735234311e-07,... |
| 4 | South Afica | 44.0 | 13.0 | [0.12331549688324803, 0.0003890808382743227, 0... |


### We can now use the MulticlassClassificationEvaluator to evaluate the accuracy of our predictions.

In [17]:
# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

Test set accuracy = 0.22418879056047197


### About 22% of the time, if we know the cocoa percentage, where the bean was grown, the rating, and the review date, we can figure out which country produced the chocolate bar. Pretty interesting.

### Let's try to get more accurate results by looking at just the USA chocolate bars.

In [18]:
USAtrain=train.filter(train.label == 0.0)
USAtest=test.filter(test.label == 0.0)
showDF(USAtrain)
print(USAtrain.count())
showDF(USAtest)
print(USAtest.count())

Unnamed: 0,chocolateid,bar_location,bean_origin,bean_type,cocoa_percent,company,company_location,rating,ref,review_date,origin,label,features
0,51,Coacado,Domiica Republic,Tiitaio,60.0,Altus aka Cao Atisa,U.S.A.,3.0,1125,2013,4.0,0.0,"[60.0, 3.0, 4.0, 2013.0]"
1,94,Elvesia,Domiica Republic,,75.0,Aahata,U.S.A.,3.0,1259,2014,4.0,0.0,"[75.0, 3.0, 4.0, 2014.0]"
2,143,Xocousco,Mexico,Tiitaio,75.0,Askiosie,U.S.A.,2.5,141,2007,18.0,0.0,"[75.0, 2.5, 18.0, 2007.0]"
3,218,Kokoa Kamili,Tazaia,,75.0,Blue Badaa,U.S.A.,3.5,1752,2016,17.0,0.0,"[75.0, 3.5, 17.0, 2016.0]"
4,266,Maya Moutai,Belize,Tiitaio,80.0,Baze,U.S.A.,3.25,1518,2015,9.0,0.0,"[80.0, 3.25, 9.0, 2015.0]"


610


Unnamed: 0,chocolateid,bar_location,bean_origin,bean_type,cocoa_percent,company,company_location,rating,ref,review_date,origin,label,features
0,47,CIAAB Coop,Bolivia,,60.0,Altus aka Cao Atisa,U.S.A.,2.5,1732,2016,7.0,0.0,"[60.0, 2.5, 7.0, 2016.0]"
1,533,Bolivia,Bolivia,,80.0,DeVies,U.S.A.,2.75,241,2008,7.0,0.0,"[80.0, 2.75, 7.0, 2008.0]"
2,652,Domiica Republic,Domiica Republic,,80.0,Fica,U.S.A.,3.0,1283,2014,4.0,0.0,"[80.0, 3.0, 4.0, 2014.0]"
3,794,Los Rios Hacieda Limo Oecao 2014,Ecuado,EET,67.0,Heiloom Cacao Pesevatio (Guittad),U.S.A.,3.75,1243,2014,1.0,0.0,"[67.0, 3.75, 1.0, 2014.0]"
4,1189,Veezuela,Veezuela,,70.0,Noi d' Ebie,U.S.A.,3.0,837,2012,0.0,0.0,"[70.0, 3.0, 0.0, 2012.0]"


154


In [19]:
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
modelUSA = nb.fit(USAtrain)

predictionsUSA = modelUSA.transform(USAtest)
#predictionsUSA.show()
print (predictionsUSA.count())
showDF(predictionsUSA)

154


Unnamed: 0,chocolateid,bar_location,bean_origin,bean_type,cocoa_percent,company,company_location,rating,ref,review_date,origin,label,features,rawPrediction,probability,prediction
0,47,CIAAB Coop,Bolivia,,60.0,Altus aka Cao Atisa,U.S.A.,2.5,1732,2016,7.0,0.0,"[60.0, 2.5, 7.0, 2016.0]",[-339.5469838617882],[1.0],0.0
1,533,Bolivia,Bolivia,,80.0,DeVies,U.S.A.,2.75,241,2008,7.0,0.0,"[80.0, 2.75, 7.0, 2008.0]",[-408.40657650990767],[1.0],0.0
2,652,Domiica Republic,Domiica Republic,,80.0,Fica,U.S.A.,3.0,1283,2014,4.0,0.0,"[80.0, 3.0, 4.0, 2014.0]",[-394.32443286292363],[1.0],0.0
3,794,Los Rios Hacieda Limo Oecao 2014,Ecuado,EET,67.0,Heiloom Cacao Pesevatio (Guittad),U.S.A.,3.75,1243,2014,1.0,0.0,"[67.0, 3.75, 1.0, 2014.0]",[-339.3242550110049],[1.0],0.0
4,1189,Veezuela,Veezuela,,70.0,Noi d' Ebie,U.S.A.,3.0,837,2012,0.0,0.0,"[70.0, 3.0, 0.0, 2012.0]",[-339.18598432573134],[1.0],0.0


In [20]:
showDF(predictionsUSA.select("company_location", "label", "prediction", "probability"))

| | company_location | label | prediction | probability |
|---|---|---|---|---|
| 0 | U.S.A. | 0.0 | 0.0 | [1.0] |
| 1 | U.S.A. | 0.0 | 0.0 | [1.0] |
| 2 | U.S.A. | 0.0 | 0.0 | [1.0] |
| 3 | U.S.A. | 0.0 | 0.0 | [1.0] |
| 4 | U.S.A. | 0.0 | 0.0 | [1.0] |


In [21]:
# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictionsUSA)
print("Test set accuracy = " + str(accuracy))

Test set accuracy = 1.0


## USA!!! But wait ... perfect accuracy??

#### I don't show it here, but I tried this again with France ("Face" in this dataset) and got 0.0% accuracy... that can't be right. To the internet: https://stackoverflow.com/questions/33708532/why-does-spark-ml-naivebayes-output-labels-that-are-different-from-the-training
#### Long story short, your labels must start at 0.0 and go up from there. There is no validation of this, and every result returned will have label "0.0"; this just HAPPENED to be the USA's label. So let's try with USA and France and see if we can get a more believable accuracy.

In [22]:
USAtrain=train.filter(train.label == 0.0)
francetrain = train.filter(train.label == 1.0)
franceUSA = USAtrain.union(francetrain)

USAtest=test.filter(test.label == 0.0)
francetest=test.filter(test.label == 1.0)
franceUSAtest = USAtest.union(francetest)
#showDF(franceUSA)
print(franceUSA.count())
#showDF(USAtest)
print(franceUSAtest.count())

741
179
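
The USA and France subset works because its labels, 0.0 and 1.0, already start at zero with no gaps. For other subsets, one way to avoid the label trap described above is to remap the surviving labels to consecutive values starting at 0.0 before training. The sketch below is plain Python for illustration, and `remap_labels` is a hypothetical helper. In Spark, a similar effect could presumably be had by fitting a fresh `StringIndexer` on the filtered rows.

```python
def remap_labels(labels):
    """Map the distinct labels to 0.0, 1.0, ... in ascending order of the original label."""
    distinct = sorted(set(labels))
    mapping = {old: float(new) for new, old in enumerate(distinct)}
    return [mapping[label] for label in labels], mapping

# France alone carries label 1.0 in the full dataset; after remapping it becomes 0.0.
france_only, france_map = remap_labels([1.0, 1.0, 1.0])
print(france_only, france_map)

# A subset with gaps (labels 0.0, 7.0, 11.0) becomes 0.0, 1.0, 2.0.
subset, subset_map = remap_labels([0.0, 11.0, 7.0, 0.0])
print(subset, subset_map)
```

Keeping the returned mapping around makes it possible to translate predictions back to the original country labels afterwards.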


In [23]:
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
model1 = nb.fit(franceUSA)

predictions = model1.transform(franceUSAtest)
#predictions.show()
print (predictions.count())
showDF(predictions)

179


Unnamed: 0,chocolateid,bar_location,bean_origin,bean_type,cocoa_percent,company,company_location,rating,ref,review_date,origin,label,features,rawPrediction,probability,prediction
0,47,CIAAB Coop,Bolivia,,60.0,Altus aka Cao Atisa,U.S.A.,2.5,1732,2016,7.0,0.0,"[60.0, 2.5, 7.0, 2016.0]","[-339.74258294733437, -343.8104466662271]","[0.983174048212234, 0.016825951787765945]",0.0
1,533,Bolivia,Bolivia,,80.0,DeVies,U.S.A.,2.75,241,2008,7.0,0.0,"[80.0, 2.75, 7.0, 2008.0]","[-408.60217559545384, -412.6266434149103]","[0.9824408993831146, 0.017559100616885256]",0.0
2,652,Domiica Republic,Domiica Republic,,80.0,Fica,U.S.A.,3.0,1283,2014,4.0,0.0,"[80.0, 3.0, 4.0, 2014.0]","[-394.5200319484698, -399.8380369532251]","[0.9951213930015532, 0.00487860699844666]",0.0
3,794,Los Rios Hacieda Limo Oecao 2014,Ecuado,EET,67.0,Heiloom Cacao Pesevatio (Guittad),U.S.A.,3.75,1243,2014,1.0,0.0,"[67.0, 3.75, 1.0, 2014.0]","[-339.51985409655106, -346.112556965846]","[0.9986315441100022, 0.0013684558899978252]",0.0
4,1189,Veezuela,Veezuela,,70.0,Noi d' Ebie,U.S.A.,3.0,837,2012,0.0,0.0,"[70.0, 3.0, 0.0, 2012.0]","[-339.3815834112775, -346.4118354736507]","[0.9991160732670148, 0.000883926732985361]",0.0


In [24]:
showDF(predictions.select("company_location", "label", "prediction", "probability"))

| | company_location | label | prediction | probability |
|---|---|---|---|---|
| 0 | U.S.A. | 0.0 | 0.0 | [0.983174048212234, 0.016825951787765945] |
| 1 | U.S.A. | 0.0 | 0.0 | [0.9824408993831146, 0.017559100616885256] |
| 2 | U.S.A. | 0.0 | 0.0 | [0.9951213930015532, 0.00487860699844666] |
| 3 | U.S.A. | 0.0 | 0.0 | [0.9986315441100022, 0.0013684558899978252] |
| 4 | U.S.A. | 0.0 | 0.0 | [0.9991160732670148, 0.000883926732985361] |


In [25]:
# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

Test set accuracy = 0.6983240223463687
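
One way to put these accuracy figures in context is to compare them with a naive baseline that always guesses `U.S.A.`. The counts printed earlier are enough for that: 154 of the 339 rows in the full test set are U.S.A. bars, and so are 154 of the 179 rows in the France/USA test set. The sketch below does the arithmetic; the helper names are illustrative.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for l, p in zip(labels, predictions) if l == p)
    return matches / len(labels)

def always_one_class_baseline(class_count, total_count):
    """Accuracy obtained by predicting the same class for every row."""
    return class_count / total_count

# Counts taken from the notebook output: 154 U.S.A. rows in each test set.
all_countries = always_one_class_baseline(154, 339)
france_usa = always_one_class_baseline(154, 179)
print(round(all_countries, 3), round(france_usa, 3))  # 0.454 0.86

# The same accuracy definition applied to the five France/USA rows displayed above.
print(accuracy([0.0] * 5, [0.0] * 5))  # 1.0
```

Both baselines (about 0.45 and 0.86) are higher than the model's 0.22 and 0.70, which suggests there is room to improve the features or the model before leaning on these predictions.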


#### Go and see what else you can predict. Can you predict the rating based on these attributes?

### References:
* https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/3783546674231736/4413065072037724/latest.html
* https://www.kaggle.com/rtatman/chocolate-bar-ratings
* https://en.wikipedia.org/wiki/Naive_Bayes_classifier