<img src="images/datastaxdevs_banner.png" width="600" height="200">

# Algorithm 1: K-Means Clustering
------
<img src="images/socialMedia.jpeg" width="400" height="500">

#### Dataset: https://archive.ics.uci.edu/ml/datasets/Facebook+Live+Sellers+in+Thailand

## What are we trying to learn from this dataset? 

### Can K-Means be used to do social media analysis?
### Can we group together different types of media by the reaction they received?

In [None]:
%matplotlib inline

In [None]:
import os
import pandas
from pyspark.sql import SparkSession
#
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
from pyspark.ml.clustering import KMeans
#
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
#
from dotenv import load_dotenv, find_dotenv

from tools import showDF, examineCassandraTable

In [None]:
# read .env file for connection params
dotenv_file = find_dotenv('.env')
load_dotenv(dotenv_file)
astraUsername = os.environ['ASTRA_DB_CLIENT_ID']
astraPassword = os.environ['ASTRA_DB_CLIENT_SECRET']
astraSecureConnect = os.environ['ASTRA_DB_SECURE_BUNDLE_PATH']
astraKeyspace = os.environ['ASTRA_DB_KEYSPACE']
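
Reading `os.environ[...]` directly raises a `KeyError` on the first missing variable. As a minimal sketch, the check below lists every missing setting at once. The helper name `missing_settings` is hypothetical and not part of this notebook or of `python-dotenv`; only the four variable names come from the cell above.

```python
import os

# The four settings read from the .env file in the cell above.
REQUIRED_VARS = [
    'ASTRA_DB_CLIENT_ID',
    'ASTRA_DB_CLIENT_SECRET',
    'ASTRA_DB_SECURE_BUNDLE_PATH',
    'ASTRA_DB_KEYSPACE',
]

def missing_settings(env=os.environ, required=REQUIRED_VARS):
    """Return the names of required settings that are absent or empty."""
    return [name for name in required if not env.get(name)]

missing = missing_settings()
if missing:
    print('Missing settings in .env: ' + ', '.join(missing))
```

Running a check like this right after `load_dotenv` would surface an incomplete `.env` file before any connection is attempted.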

## Inspect input data: Table(s)

### Connect to Cassandra

In [None]:
cloud_config = {
    'secure_connect_bundle': '/home/jovyan/' + astraSecureConnect
}
auth_provider = PlainTextAuthProvider(username=astraUsername, password=astraPassword)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect()

### Connectivity test

In [None]:
rows = session.execute('SELECT key, cluster_name, data_center FROM system.local;')
local = rows.one()
if local:
    print('    ** Connected to cluster \'%s\' at data center \'%s\' **' % (
        local.cluster_name,
        local.data_center,
    ))
else:
    print('Error: could not read \'system.local\' table!')

### Set keyspace 

In [None]:
session.set_keyspace(astraKeyspace)

### Examine table `socialmedia` (structure and contents)

In [None]:
print(examineCassandraTable(session, astraKeyspace, 'socialmedia'))

### What do these 11 columns represent?

* **Status_id**: Unique key created for each row
* **Num Reactions**
* **Num Comments**
* **Num Shares**
* **Num Likes**
* **Num Loves**
* **Num Wows**
* **Num Hahas**
* **Num Sads**
* **Num Angrys**
* **Social Type**: Picture or Video


<img src="images/getTheLikes.png" width="300" height="300">

# Machine Learning with Apache Cassandra & Apache Spark
<img src="images/sparklogo.png" width="150" height="200">

### Create a Spark session that is connected to the database. From there, load the `socialmedia` table into a Spark DataFrame and count its rows.

In [None]:
spark = SparkSession \
    .builder \
    .appName('demo') \
    .master('local') \
    .config( \
        'spark.cassandra.connection.config.cloud.path', \
        'file:' + '/home/jovyan/' + astraSecureConnect) \
    .config('spark.cassandra.auth.username', astraUsername) \
    .config('spark.cassandra.auth.password', astraPassword) \
    .getOrCreate()

socialDF = spark \
    .read \
    .format('org.apache.spark.sql.cassandra') \
    .options(table='socialmedia', keyspace=astraKeyspace) \
    .load()

print('Table Row Count:')
print(socialDF.count())

In [None]:
showDF(socialDF)

**Note**: **K-Means** in **Apache Spark** works on numeric data, so `STRING` columns must first be turned into numeric (floating-point) values. Here `StringIndexer` maps each distinct `social_type` value to a numeric index in a new `label` column.

In [None]:
labelIndexer = StringIndexer(inputCol='social_type', outputCol='label', handleInvalid='keep')
training = labelIndexer.fit(socialDF).transform(socialDF)

showDF(training)

In [None]:
showDF(training.select('social_type', 'label'))

In [None]:
training.groupBy('social_type').count().show()
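
To make the label mapping concrete, here is a minimal pure-Python sketch of the idea behind `StringIndexer`, assuming its default ordering (most frequent value gets index `0.0`) and the `handleInvalid='keep'` behavior of sending unseen values to one extra index. The function names and sample values are hypothetical and invented for illustration; this is not Spark's implementation.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    return {s: float(i) for i, s in enumerate(ordered)}

def to_label(mapping, value):
    """Look up a label; unseen values go to one extra index, like 'keep'."""
    return mapping.get(value, float(len(mapping)))

# Hypothetical sample values, not taken from the dataset.
sample = ['photo', 'video', 'photo', 'photo', 'video']
mapping = fit_string_index(sample)
labels = [to_label(mapping, s) for s in sample]
unseen_label = to_label(mapping, 'link')
```

With this sample, `photo` is the most frequent value and therefore gets the lowest index, which is why the majority class ends up at one end of the color scale in a plot colored by `label`.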

### Let's visualize this data with a scatter plot:

- The x axis will be the number of likes
- The y axis will be the number of comments
- The color of each dot will be assigned based on its "cluster": Picture or Video

Note 1: These two attributes are likely to be strong indicators for finding the clusters (Picture vs. Video).

Note 2: The data must be moved to a pandas DataFrame for this visualization. Be aware that this is not always possible: `toPandas()` collects the whole dataset into memory, so it depends on your data size.

In [None]:
smPanda = training.toPandas()
smPanda.plot.scatter(x='num_likes', y='num_comments', c='label', figsize=(12,8), colormap='viridis')

### Two clusters here: Yellow = Videos and Purple = Pictures

Judging by these two attributes, videos get fewer likes but more comments, while pictures get fewer comments but more likes.

## Let's see if K-Means can give us the same clustering

K-Means clustering is a simple unsupervised learning algorithm used to solve clustering problems. Despite its simplicity, it is powerful and works well even on large datasets. In Spark, it requires that all input columns be combined into a single vector column, which is what `VectorAssembler` does.

https://spark.apache.org/docs/latest/ml-features.html#vectorassembler
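
For intuition, the core loop of K-Means can be sketched in plain Python: assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. This is a textbook sketch, not Spark's implementation (which, among other things, initializes centroids differently); the function names and the sample points are hypothetical and invented for illustration.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def kmeans(points, k, seed=1, max_iter=20):
    """Return (centroids, assignments) after iterating to convergence."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        assignments = [nearest(p, centroids) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                mean = tuple(sum(c) / len(members) for c in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # empty cluster: keep as is
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    assignments = [nearest(p, centroids) for p in points]
    return centroids, assignments

# Hypothetical (num_likes, num_comments) pairs, invented for illustration.
points = [(900, 10), (1000, 20), (1100, 15), (50, 800), (60, 900), (40, 850)]
centroids, assignments = kmeans(points, k=2)
```

Each input point here is a tuple of numbers, which mirrors why the Spark version needs the feature columns assembled into one vector per row. Which group receives cluster id 0 depends on the random starting centroids, so the ids themselves carry no meaning.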

In [None]:
assembler = VectorAssembler(
    inputCols=['num_likes', 'num_comments'],
    outputCol='features',
)

trainingData = assembler.transform(training)

### We need to set the K for K-Means, which we will set to 2. One of the downsides of unsupervised learning is that we normally do not have predefined clusters (although in this case, secretly, we do). K-Means will happily split the data into as many clusters as you ask for.

### First we will generate the model and then make predictions based on that model 

In [None]:
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(trainingData)

# Make predictions
predictions = model.transform(trainingData)

showDF(predictions)

### In this case, because we already know the true labels (K-Means itself never used them), we can do some comparisons to see if our predictions are correct.

Here we simply compare the counts for each cluster for the labels vs. the prediction:

In [None]:
predictions.groupBy('prediction').count().show()
training.groupBy('social_type').count().show()
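
Matching counts alone do not prove that the same rows ended up in the same group, and K-Means numbers its clusters arbitrarily. A contingency table of label versus prediction gives a sharper picture. The sketch below shows the idea in plain Python on invented values; the function names are hypothetical, and the agreement score assumes exactly two classes and `k=2`, both encoded as 0 and 1. In the notebook itself, `predictions.crosstab('social_type', 'prediction')` should produce a comparable table.

```python
from collections import Counter

def contingency(labels, predictions):
    """Count how often each (label, prediction) pair occurs."""
    return Counter(zip(labels, predictions))

def best_agreement(labels, predictions):
    """Fraction of rows that agree under the better of the two possible
    label-to-cluster mappings (valid for two classes and k=2 only)."""
    same = sum(1 for l, p in zip(labels, predictions) if l == p)
    swapped = len(labels) - same  # agreement if cluster ids 0 and 1 are swapped
    return max(same, swapped) / len(labels)

# Hypothetical labels and predictions, invented for illustration.
labels      = [0, 0, 0, 0, 1, 1, 1, 1]
predictions = [1, 1, 1, 0, 0, 0, 0, 1]
table = contingency(labels, predictions)
score = best_agreement(labels, predictions)
```

In this invented example the cluster ids are swapped relative to the labels, yet six of the eight rows land in the matching group once the swap is taken into account.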

### Let's create another scatter plot to see if this lines up with our original scatter plot.

Everything is the same, except that now the color of each dot represents the prediction (instead of the original cluster).

In [None]:
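# Convert the predictions to pandas and color each dot by its predicted cluster
predictionsPanda = predictions.toPandas()
predictionsPanda.plot.scatter(x='num_likes', y='num_comments', c='prediction', figsize=(12,8), colormap='viridis')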

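### Videos are represented in yellow and pictures in purple. Keep in mind that K-Means assigns cluster ids arbitrarily, so the colors may appear swapped compared to the first plot.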

K-Means struggles when you add many variables, so adding more variables is unlikely to help. 

#### _Remember: Data Science/Analytics is an iterative process! It's a science! Hypothesis, test, analysis, and loop again!_

## Example model usage

_Note: in real life, your input is probably massive (as opposed to a single row); also, it is likely read from the database._

In [None]:
def predict_socialmedia_cluster(**kwargs):
    input_df = pandas.DataFrame([kwargs])
    spark_input = spark.createDataFrame(input_df)
    spark_with_features = assembler.transform(spark_input)
    predicted = model.transform(spark_with_features)
    collected = predicted.collect()
    #
    return {
        'prediction': collected[0].prediction,
    }

In [None]:
predict_socialmedia_cluster(
    num_likes=120,
    num_comments=3200,
)

#### Stop the Spark session

In [None]:
spark.stop()