# Classification and Clustering Algorithms paired with Wine and Chocolate
------
<img src="images/wineAndChocolate.jpg" width="500" height="500">

## A demo using DataStax Enterprise Analytics, Apache Cassandra, Apache Spark, Python, Jupyter Notebooks, Spark MLlib, and Naive Bayes.

#### Real dataset: https://www.kaggle.com/rtatman/chocolate-bar-ratings

#### Most recent updates by Manhattan Chocolate Society: http://flavorsofcacao.com/index.html

## What are we trying to learn from this dataset? 

# QUESTION:  If I have some information about a chocolate bar, can I predict which country produced this chocolate?


## Import Python packages -- all are required
* Jupyter needs the `%matplotlib inline` magic; without it, plots are generated but not displayed.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
import pandas
import cassandra
import pyspark
import re
import os
#from uuid import uuid1
import matplotlib.pyplot as plt
from IPython.display import display, Markdown
from pyspark.sql import SparkSession
#from pyspark.ml.feature import Tokenizer, RegexTokenizer, StopWordsRemover
#from pyspark.sql.functions import col, udf
#from pyspark.sql.types import IntegerType
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer

### Helper function to have nicer formatting of Spark DataFrames
* Credit to the folks at Netflix for this pretty code.

In [3]:
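# Helper for pretty formatting of Spark DataFrames
def showDF(df, limitRows=5, truncate=True):
    # Cap column width at 50 characters when truncating.
    # pandas 1.0+ expects None for "no limit"; older releases used -1 instead.
    if truncate:
        pandas.set_option('display.max_colwidth', 50)
    else:
        pandas.set_option('display.max_colwidth', None)
    pandas.set_option('display.max_rows', limitRows)
    # Only the first limitRows rows are pulled to the driver and converted to pandas
    display(df.limit(limitRows).toPandas())
    pandas.reset_option('display.max_rows')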

# DataStax Enterprise Analytics
<img src="images/dselogo.png" width="400" height="200">

## Creating Tables and Loading Tables

## Connect to DSE Analytics Cluster

In [4]:
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

## Create Demo Keyspace 

In [5]:
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS wineChocolate 
    WITH REPLICATION = 
    { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }"""
)

<cassandra.cluster.ResultSet at 0x116f16450>

## Set keyspace 

In [6]:
session.set_keyspace('winechocolate')

### Create a table called chocolate. Our PRIMARY KEY will be a unique id that we generate for each row. This results in an even distribution of the data, but we will have to use that PRIMARY KEY in the WHERE clause of any of our CQL queries.

In [7]:
query = "CREATE TABLE IF NOT EXISTS chocolate \
                                   (chocolateid int, company text, bar_location text, ref int, \
                                   review_date int, cocoa_percent float, company_location text, rating float, \
                                   bean_type text, bean_origin text, \
                                   PRIMARY KEY (chocolateid))"
session.execute(query)


<cassandra.cluster.ResultSet at 0x11751d690>
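
Because `chocolateid` is the whole primary key, a lookup has to name it in the WHERE clause. A minimal sketch of such a lookup, assuming a hypothetical helper named `chocolate_by_id` (not part of the original notebook) and the same `%s` placeholder style that this demo passes to `session.execute`:

```python
def chocolate_by_id(chocolate_id):
    """Return a CQL string and parameter tuple for a primary-key lookup."""
    query = "SELECT * FROM chocolate WHERE chocolateid = %s"
    # The value travels separately from the CQL text so the driver can bind it
    return query, (int(chocolate_id),)


query, params = chocolate_by_id(200)
print(query, params)
# With a live session this would be run as: session.execute(query, params)
```

Keeping the value out of the query string avoids hand-built CQL and makes the primary-key restriction explicit.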

## Columns 
* Chocolate Id: Unique id for each bar.
* Company: Name of the company manufacturing the bar.
* Bar Location: Specific bean origin.
* REF: Id assigned when the review was entered in the database. Higher = more recent.
* Review Date: Year the review was published.
* Cocoa Percentage: Cocoa percentage (darkness) of the chocolate bar being reviewed.
* Company Location: Manufacturer base country.
* Rating: Expert rating for the bar. (1-5)
* Bean Type: The variety (breed) of bean used, if provided.
* Bean Origin: The broad geo-region of origin for the bean.

## Load Flavors of Cacao Dataset
<img src="images/chocolatePic.jpeg" width="300" height="300">


## Load from CSV file (chocolateFinal.csv)

### Some data cleaning was needed to load this data: removed `%`, removed ``"``, and removed extra commas, including those inside company names. This preprocessing was done by me, and the result is available in chocolateFinal.csv
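
The cleaning script itself is not part of this notebook. As a rough sketch of the kind of cleanup described, assuming a raw line in the same column order as chocolateFinal.csv and a hypothetical helper named `clean_line` (the company name "Acme, Inc." is made up for illustration):

```python
import csv
import io


def clean_line(raw_line):
    """Parse one raw CSV line and return a cleaned, comma-joined line."""
    # csv.reader honors double quotes, so a comma inside a quoted field
    # does not split that field in two
    fields = next(csv.reader(io.StringIO(raw_line)))
    cleaned = []
    for field in fields:
        field = field.replace('%', '')    # "70%" becomes "70"
        field = field.replace('"', '')    # drop any stray quote characters
        field = field.replace(',', ' ')   # embedded commas would break split(',')
        cleaned.append(' '.join(field.split()))  # collapse repeated spaces
    return ','.join(cleaned)


raw = '"Acme, Inc.",Sambirano,565,2010,70%,U.S.A.,3.0,Trinitario,Madagascar'
print(clean_line(raw))
```

After this step every line splits into exactly nine fields on a plain `,`, which is what the loading loop relies on.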

#### Insert all the Chocolate Data into the Apache Cassandra table `chocolate`

In [8]:
fileName = 'data/chocolateFinal.csv'

# The statement is the same for every row, so build it once outside the loop
query = "INSERT INTO chocolate (chocolateid, company, bar_location, ref, review_date, cocoa_percent, \
                                company_location, rating, bean_type, bean_origin)"
query = query + " VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"

# The with-block closes the file once the loop finishes
with open(fileName, 'r') as input_file:
    # The 1-based line number doubles as the generated chocolateid
    for chocolateId, line in enumerate(input_file, start=1):
        columns = line.split(',')
        # A lone non-breaking space (UTF-8 bytes C2 A0) marks a missing bean type
        if columns[7] == "\xc2\xa0":
            columns[7] = ""
        columns[8] = columns[8].rstrip("\r\n")

        session.execute(query, (chocolateId, columns[0], columns[1], int(columns[2]), int(columns[3]),
                        float(columns[4]), columns[5], float(columns[6]), columns[7], columns[8]))

### Run a SELECT on the table and verify that the data has been inserted into the Apache Cassandra table

In [9]:
query = 'SELECT * FROM chocolate WHERE chocolateid=200'
rows = session.execute(query)
for row in rows:
    print (row.chocolateid, row.bar_location, row.ref, row.review_date, row.cocoa_percent, row.company_location,
          row.rating, row.bean_type, row.bean_origin)

(200, u'Sambiao 2009', 565, 2010, 70.0, u'U.S.A.', 3.0, u'Tiitaio', u'Madagasca')


## DSE Analytics with Apache Spark
<img src="images/sparklogo.png" width="150" height="200">

### Finally time for Apache Spark! 

#### Create a Spark session that is connected to Apache Cassandra. From there, load the table into a Spark DataFrame and count its rows.

In [10]:
spark = SparkSession.builder.appName('demo').master("local").getOrCreate()


chocolateTable = spark.read.format("org.apache.spark.sql.cassandra").options(table="chocolate", keyspace="winechocolate").load()

print ("Table Row Count: ")
print (chocolateTable.count())

Table Row Count: 
1795


In [11]:
showDF(chocolateTable)

Unnamed: 0,chocolateid,bar_location,bean_origin,bean_type,cocoa_percent,company,company_location,rating,ref,review_date
0,381,la Amistad,Costa Rica,,70.0,Chequessett,U.S.A.,3.5,1235,2014
1,671,Guatemala,Guatemala,,73.0,Fech Boad,U.S.A.,3.5,1634,2015
2,622,Wampusipi batch 007,Hoduas,,75.0,ENNA,U.S.A.,3.25,1916,2016
3,444,Ocumae,Veezuela,Ciollo,70.0,Compaia de Chocolate (Salgado),Agetia,3.75,292,2008
4,746,Ghaa,Ghaa,Foasteo,64.0,Guido Castaga,Italy,3.0,355,2009


### We will be training a model with Naive Bayes, so we need to split our dataset into a training set and a test set. We will split 80/20.

In [12]:
# Split the data into train and test
splits = chocolateTable.randomSplit([0.8, 0.2], 1234)
train = splits[0]
test = splits[1]

print ("Train Dataframe Row Count: ")
print (train.count())
print ("Test Dataframe Row Count: ")
print (test.count())

Train Dataframe Row Count: 
1490
Test Dataframe Row Count: 
302
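
The split sizes are only approximately 80/20 because a weighted random split assigns each row by its own random draw instead of cutting the table at a fixed position. A plain-Python sketch of that idea, assuming a hypothetical `random_split` helper (this is an illustration of the technique, not Spark's own code):

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by an independent seeded random draw."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds of each split, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds, running = [], 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            # The last split also catches any rounding error in the bounds
            if draw < bound or index == len(bounds) - 1:
                splits[index].append(row)
                break
    return splits


rows = list(range(1795))  # stand-in for the 1795 rows in the table
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=1234)
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one split, and the same seed over the same rows in the same order reproduces the same split.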


### Naive Bayes is a classification algorithm that predicts a label using a model built from known labels.

### It requires that all values passed to it be floats assembled into a vector. We will create unique indexes for the Bean Origin, and our label will be Company Location (what we are trying to predict).

https://spark.apache.org/docs/latest/ml-features.html#stringindexer

#### StringIndexers of Bean Origin and Company Location 

In [13]:
# Convert the bean origin feature and the company location target into numerical categories

labelIndexer = StringIndexer(inputCol="bean_origin", outputCol="origin", handleInvalid='keep')
training1 = labelIndexer.fit(train).transform(train)

labelIndexer2 = StringIndexer(inputCol="company_location", outputCol="label", handleInvalid='keep')
training2 = labelIndexer2.fit(training1).transform(training1)
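
To make the indexing step concrete, here is a plain-Python sketch of frequency-based string indexing, where the most frequent value gets index 0.0 and, in the spirit of `handleInvalid='keep'`, any value not seen while fitting goes into one extra bucket. The helper names `fit_index` and `apply_index` are hypothetical, and the alphabetical tie-break is an assumption made here to keep the output deterministic:

```python
from collections import Counter


def fit_index(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Descending frequency; ties are broken alphabetically (an assumption)
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {value: float(i) for i, value in enumerate(ordered)}


def apply_index(values, mapping):
    """Index values; anything unseen during fitting shares one extra bucket."""
    unseen_bucket = float(len(mapping))
    return [mapping.get(value, unseen_bucket) for value in values]


origins = ["Venezuela", "Venezuela", "Peru", "Ecuador", "Peru", "Venezuela"]
mapping = fit_index(origins)
print(mapping)
print(apply_index(["Peru", "Togo"], mapping))
```

Here "Togo" never appeared while fitting, so it receives the extra index 3.0 instead of raising an error.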

#### Vectorization -- here we assume that, given the cocoa percentage, the rating, the bean origin, and the review date, we can figure out the country that produced the chocolate. We end up with a new DataFrame called `trainingData`.

In [14]:
assembler = VectorAssembler(
    inputCols=['cocoa_percent', 'rating', 'origin', 'review_date'],
    outputCol='features')

trainingData = assembler.transform(training2)
showDF(trainingData)

Unnamed: 0,chocolateid,bar_location,bean_origin,bean_type,cocoa_percent,company,company_location,rating,ref,review_date,origin,label,features
0,15,Colombie,Colombia,,70.0,A. Moi,Face,2.75,1015,2013,13.0,1.0,"[70.0, 2.75, 13.0, 2013.0]"
1,18,Chuao,Veezuela,Tiitaio,70.0,A. Moi,Face,4.0,1015,2013,0.0,1.0,"[70.0, 4.0, 0.0, 2013.0]"
2,20,Chachamayo Povice,Peu,,70.0,A. Moi,Face,3.5,1019,2013,2.0,1.0,"[70.0, 3.5, 2.0, 2013.0]"
3,22,Bolivia,Bolivia,,70.0,A. Moi,Face,3.5,797,2012,7.0,1.0,"[70.0, 3.5, 7.0, 2012.0]"
4,27,Vaua Levu Toto-A,Fiji,Tiitaio,80.0,Adi,Fiji,3.25,705,2011,28.0,35.0,"[80.0, 3.25, 28.0, 2011.0]"
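
The `features` column is simply the chosen input columns placed side by side as floats. A plain-Python sketch of that assembly for the first row of the output above, using a hypothetical `assemble` helper:

```python
def assemble(row, input_cols):
    """Collect the named columns of one row into a list of floats."""
    return [float(row[col]) for col in input_cols]


# Values copied from the first row of the training output
row = {'cocoa_percent': 70.0, 'rating': 2.75, 'origin': 13.0, 'review_date': 2013}
features = assemble(row, ['cocoa_percent', 'rating', 'origin', 'review_date'])
print(features)
```

Note that the integer `review_date` becomes `2013.0`, matching the vector shown in the table.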


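#### We must also do this with the testing set. Note that the indexers below are fit on the test set itself, so the index assigned to a given value is not guaranteed to match the one assigned during training (Peu, for example, is 2.0 in the training output and 4.0 in the test output); reusing the indexer models fitted on the training set would keep the mapping consistent.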

In [15]:
labelIndexer3 = StringIndexer(inputCol="bean_origin", outputCol="origin", handleInvalid='keep')
testing1 = labelIndexer3.fit(test).transform(test)

labelIndexer4 = StringIndexer(inputCol="company_location", outputCol="label", handleInvalid='keep')
testing2 = labelIndexer4.fit(testing1).transform(testing1)
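
Fitting a fresh indexer on the test set can assign a different number to the same country than the training indexer did, because the frequencies differ between the two sets. A small plain-Python illustration with made-up country lists and a hypothetical `fit_labels` helper (frequency-ordered, alphabetical tie-break assumed):

```python
from collections import Counter


def fit_labels(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {value: float(i) for i, value in enumerate(ordered)}


# Made-up samples: Italy is rare in training but common in the test set
train_countries = ["U.S.A.", "U.S.A.", "U.S.A.", "France", "France", "Italy"]
test_countries = ["Italy", "Italy", "France"]

train_mapping = fit_labels(train_countries)
test_mapping = fit_labels(test_countries)
print(train_mapping["Italy"], test_mapping["Italy"])

# Reusing the training mapping keeps test labels comparable to the model's labels;
# values unseen in training share one extra bucket
unseen_bucket = float(len(train_mapping))
consistent = [train_mapping.get(c, unseen_bucket) for c in test_countries]
print(consistent)
```

With separate fits, "Italy" is 2.0 in training and 0.0 in testing, so a correct prediction of 2.0 would be scored as wrong. Reusing the training mapping avoids that mismatch.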

In [16]:
assembler1 = VectorAssembler(
    inputCols=['cocoa_percent', 'rating', 'origin', 'review_date'],
    outputCol='features')

testingData = assembler1.transform(testing2)

showDF(testingData)

Unnamed: 0,chocolateid,bar_location,bean_origin,bean_type,cocoa_percent,company,company_location,rating,ref,review_date,origin,label,features
0,1,Agua Gade,Sao Tome,,63.0,A. Moi,Face,3.75,1876,2016,51.0,1.0,"[63.0, 3.75, 51.0, 2016.0]"
1,2,Kpime,Togo,,70.0,A. Moi,Face,2.75,1676,2015,29.0,1.0,"[70.0, 2.75, 29.0, 2015.0]"
2,5,Quilla,Peu,,70.0,A. Moi,Face,3.5,1704,2015,4.0,1.0,"[70.0, 3.5, 4.0, 2015.0]"
3,42,La Dalia Matagalpa,"Tiitaio""","""Ciollo",70.0,Alexade,Nethelads,3.5,1944,2017,20.0,42.0,"[70.0, 3.5, 20.0, 2017.0]"
4,72,Domiica Republic,Domiica Republic,,75.0,Ambosia,Caada,3.25,1498,2015,2.0,2.0,"[75.0, 3.25, 2.0, 2015.0]"


### Now it's time to use NaiveBayes. We will train the model, then use that model with our testing data to get our predictions.

In [17]:
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
model = nb.fit(trainingData)

predictions = model.transform(testingData)
#predictions.show()
print (predictions.count())
showDF(predictions)

305


Unnamed: 0,chocolateid,bar_location,bean_origin,bean_type,cocoa_percent,company,company_location,rating,ref,review_date,origin,label,features,rawPrediction,probability,prediction
0,3,Atsae,Togo,,70.0,A. Moi,Face,3.0,1676,2015,29.0,1.0,"[70.0, 3.0, 29.0, 2015.0]","[-494.17894241814815, -489.6389009437287, -490...","[0.00040624835244631704, 0.03806331180893738, ...",32.0
1,9,Pueto Cabello,Veezuela,Ciollo,70.0,A. Moi,Face,3.75,1319,2014,0.0,1.0,"[70.0, 3.75, 0.0, 2014.0]","[-345.1534570458592, -351.46319758086054, -350...","[0.0005383699666437076, 9.790284850230996e-07,...",14.0
2,12,Madagasca,Madagasca,Ciollo,70.0,A. Moi,Face,3.0,1011,2013,3.0,1.0,"[70.0, 3.0, 3.0, 2013.0]","[-356.1543960082415, -361.3653635555305, -360....","[0.13769173165316512, 0.0007513000376837426, 0...",0.0
3,35,Mote Alege D. Badeo,Bazil,Foasteo,75.0,Akesso's (Palus),Switzelad,2.75,508,2010,5.0,13.0,"[75.0, 2.75, 5.0, 2010.0]","[-381.9132058595776, -386.3733212292389, -385....","[0.4531109859949299, 0.00523842943892278, 0.00...",0.0
4,87,Toscao Black,,Bled,70.0,Amedei,Italy,5.0,40,2006,6.0,5.0,"[70.0, 5.0, 6.0, 2006.0]","[-384.77849504114675, -388.7684266746865, -388...","[0.5220959518827601, 0.009659286201578963, 0.0...",0.0
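
For intuition about what the model computes, here is a textbook multinomial Naive Bayes with additive smoothing in plain Python. It is a sketch under stated assumptions, not Spark's implementation (details such as how the class priors are smoothed may differ), and the helper names and toy data are made up. Each class gets a log prior plus a weighted sum of log feature probabilities, which is the same kind of per-class log score that appears as large negative numbers in `rawPrediction`:

```python
import math
from collections import defaultdict


def train_multinomial_nb(features, labels, smoothing=1.0):
    """Return {label: (log_prior, log_theta)} from non-negative feature vectors."""
    num_features = len(features[0])
    class_counts = defaultdict(int)
    feature_sums = defaultdict(lambda: [0.0] * num_features)
    for x, y in zip(features, labels):
        class_counts[y] += 1
        for j, value in enumerate(x):
            feature_sums[y][j] += value
    total = float(len(labels))
    model = {}
    for y, count in class_counts.items():
        log_prior = math.log(count / total)
        # Additive smoothing keeps every feature probability above zero
        denom = sum(feature_sums[y]) + smoothing * num_features
        log_theta = [math.log((s + smoothing) / denom) for s in feature_sums[y]]
        model[y] = (log_prior, log_theta)
    return model


def predict(model, x):
    """Pick the label with the highest log prior + sum(x_j * log theta_j)."""
    def score(y):
        log_prior, log_theta = model[y]
        return log_prior + sum(v * t for v, t in zip(x, log_theta))
    return max(model, key=score)


# Toy data: class 0.0 is heavy on the first feature, class 1.0 on the second
toy_features = [[3.0, 0.0], [4.0, 1.0], [0.0, 3.0], [1.0, 4.0]]
toy_labels = [0.0, 0.0, 1.0, 1.0]
toy_model = train_multinomial_nb(toy_features, toy_labels, smoothing=1.0)
print(predict(toy_model, [5.0, 0.0]), predict(toy_model, [0.0, 5.0]))
```

Because the score multiplies each feature value by a log probability, large raw values such as a review year dominate the sum, which is worth keeping in mind when choosing features for this model type.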


#### Let's just look at a few. 

In [18]:
showDF(predictions.select("company_location", "label", "prediction", "probability"))

Unnamed: 0,company_location,label,prediction,probability
0,Face,1.0,32.0,"[0.0003991177552727826, 0.03813408584982848, 0..."
1,Face,1.0,14.0,"[0.0005411606073564617, 9.46337425274115e-07, ..."
2,Face,1.0,54.0,"[4.00613856396407e-12, 1.5536045038919817e-07,..."
3,U.S.A.,0.0,1.0,"[0.22001870744766508, 0.2221413678357579, 0.13..."
4,Italy,5.0,1.0,"[0.11365312043965578, 0.2598296948109482, 0.15..."


### We can now use the MulticlassClassificationEvaluator to evaluate the accuracy of our predictions.

In [19]:
# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

Test set accuracy = 0.200657894737
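
Accuracy here is the fraction of rows whose predicted index equals the label index. A plain-Python sketch with a hypothetical `fraction_correct` helper, applied first to the five label/prediction pairs displayed in the sample output above and then to a made-up list:

```python
def fraction_correct(labels, predictions):
    """Share of positions where the prediction equals the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / float(len(labels))


# The five rows shown in the "look at a few" output
shown_labels = [1.0, 1.0, 1.0, 0.0, 5.0]
shown_predictions = [32.0, 14.0, 54.0, 1.0, 1.0]
print(fraction_correct(shown_labels, shown_predictions))

# A made-up example where three of four predictions match
print(fraction_correct([0.0, 1.0, 2.0, 1.0], [0.0, 1.0, 1.0, 1.0]))
```

None of the five displayed rows is predicted correctly, which is consistent with a model that is right only about one time in five on the full test set.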


### About 20% of the time, if we know the cocoa percentage, where the bean was grown, the rating, and the review date, we can figure out what country produced this chocolate bar. Pretty interesting.

#### Go and see what else you can predict. Can you predict the rating based on these attributes?

### Reference: 
* https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/3783546674231736/4413065072037724/latest.html
* https://www.kaggle.com/rtatman/chocolate-bar-ratings
* https://en.wikipedia.org/wiki/Naive_Bayes_classifier