# SI 618 - Lab 10 - Classification using MLlib

## Objectives

- Be able to perform classifications and regressions using the following methods and interpret their results.
  - Naive Bayes (classification)
  - Random Forest (classification & regression) 
- Understand why and how to split the data into training and testing sets.
- Know how to choose and rank features using Random Forest's feature importance measure.

## Submission Instructions:
Please turn in your completed Databricks notebook in .html format as well as the link to the published version of it via Canvas.

In [2]:
import pandas as pd  # needed later to build the feature importance tables
import numpy as np
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
ACCESS_KEY = 
SECRET_KEY = 
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
AWS_BUCKET_NAME = "umsi-data-science-west"
MOUNT_NAME = "umsi-data-science"
try:
  dbutils.fs.unmount("/mnt/%s/" % MOUNT_NAME)
except:
  print("Could not unmount %s, but that's ok." % MOUNT_NAME)
dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
display(dbutils.fs.ls("/mnt/umsi-data-science/si618wn2017"))

## 1. Naive Bayes Classifier

### 1.1 Example code

### Goal: 
We will show you an example of how to train a naive Bayes classifier to classify iris species. Here are the specifications:
- __Objective__: predict which species an iris instance belongs to.
- __Possible classes__: "setosa", "versicolor", and "virginica"
- __Features__: all four features: sepal_length, sepal_width, petal_length, petal_width

#### Data Summary:
The iris data set contains 3 classes (setosa, versicolor, and virginica) of 50 instances each, where each class refers to a type of iris plant.

The data is downloaded from UCI Machine Learning Repository. For more information, refer to:
https://archive.ics.uci.edu/ml/datasets/iris

#### Load and show the iris dataset.

**NOTE** The Iris data is most easily obtained through the Seaborn package, which provides it as one of its example datasets. `sns.load_dataset('iris')` returns a pandas DataFrame, which we then convert to a Spark DataFrame.

In [9]:
import seaborn as sns
df_iris = sns.load_dataset('iris')

In [10]:
df_iris.head(5)

In [11]:
pandas_df_iris = sns.load_dataset('iris')
df_iris = sqlContext.createDataFrame(pandas_df_iris)
df_iris.show(5)

#### Train and test a Naive Bayes classifier on iris data.

In [13]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [14]:
# Split the data into train and test, let's try a 60/40 split and set a random seed so we can reproduce the results
splits = df_iris.randomSplit([0.6, 0.4], 1234)

# Training gets the 60%
iris_train = splits[0]

# Testing gets the 40%
iris_test = splits[1]

# encodes the column of species to a column of species indices
iris_string_indexer = StringIndexer(inputCol="species", outputCol="indexed_species")

In [15]:
iris_string_indexer
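
To make the encoding step concrete, here is a minimal pure-Python sketch of what a string indexer and its inverse do. It assumes the default frequency-descending ordering (the most frequent label gets index 0) with ties broken alphabetically. The helper names `fit_label_index`, `string_to_index`, and `index_to_string` are hypothetical and are not part of pyspark, and the sample rows are made up for illustration.

```python
from collections import Counter

def fit_label_index(values):
    """Return labels ordered by descending frequency (ties broken alphabetically).

    This mimics the idea behind a frequency-ordered string indexer; it is an
    illustrative assumption, not Spark's implementation.
    """
    counts = Counter(values)
    return sorted(counts, key=lambda label: (-counts[label], label))

def string_to_index(values, labels):
    """Encode each string label as a float index."""
    lookup = {label: float(i) for i, label in enumerate(labels)}
    return [lookup[v] for v in values]

def index_to_string(indices, labels):
    """Map numeric indices back to the original string labels."""
    return [labels[int(i)] for i in indices]

# Made-up sample of a species column
species = ["virginica", "setosa", "virginica", "versicolor", "setosa", "virginica"]
labels = fit_label_index(species)
encoded = string_to_index(species, labels)
decoded = index_to_string(encoded, labels)

print(labels)
print(encoded)
```

The classifier only sees the numeric indices, which is why a converter back to strings is useful at the end of a pipeline: it turns a prediction such as 0.0 into a readable species name.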

In [16]:
# merges multiple columns into a vector column
iris_assembler = VectorAssembler(
    inputCols=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    outputCol="features"
)

# create the trainer and set its parameters; smoothing=1.0 gives add-one (Laplace) smoothing
iris_nb = NaiveBayes(featuresCol='features', labelCol='indexed_species', smoothing=1.0, modelType="multinomial")

# maps the column of prediction indices back to the original species labels, so predictions are readable as species names
iris_labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=iris_string_indexer.fit(df_iris).labels)

# Chain the indexer, assembler, classifier, and label converter in a Pipeline
iris_pipeline = Pipeline(stages=[iris_string_indexer, iris_assembler, iris_nb, iris_labelConverter])

# Train the pipeline on the train dataset
iris_model = iris_pipeline.fit(iris_train)

# make predictions on the test set
iris_predictions = iris_model.transform(iris_test)

# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="indexed_species", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(iris_predictions)
print("Test set accuracy = " + str(accuracy))

In [17]:
display(iris_predictions.select("prediction","indexed_species"))
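
As a rough illustration of what the accuracy metric measures, the sketch below computes the fraction of rows whose prediction matches the true label. The `accuracy` helper and the sample pairs are hypothetical; they are not output from the model above.

```python
def accuracy(pairs):
    """Fraction of (prediction, label) pairs where the prediction equals the label."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("cannot compute accuracy of an empty prediction set")
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return correct / len(pairs)

# Hypothetical (prediction, indexed_species) pairs, not real model output
sample = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (2.0, 2.0), (0.0, 0.0)]
sample_accuracy = accuracy(sample)
print("Sample accuracy = " + str(sample_accuracy))
```

Here four of the five predictions match, so the sample accuracy is 0.8.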

### 1.2 Do it yourself:  Use a Naive Bayes Classifier to identify spam email messages

### Goal:
Use the example above (1.1) as the basis for your work. We will train a Naive Bayes classifier to detect e-mail spam. Here are the specifications:
- __Objective__: predict whether each e-mail is spam or not.
- __Possible classes__: spam: 1, non-spam: 0
- __Features__: all 57 features

#### Data Summary:
The last column of the spam dataset denotes whether the e-mail was
considered spam (1), i.e. unsolicited commercial e-mail, or not (0).
Most of the attributes indicate whether a particular word or
character occurred frequently in the e-mail.

The data is downloaded from UCI Machine Learning Repository. For more information, refer to:
https://archive.ics.uci.edu/ml/datasets/spambase

#### Load and show the spam dataset.

In [22]:
# a bunch of columns of data, no header
spam = spark.read.csv("/mnt/umsi-data-science/si618wn2017/spam.csv", inferSchema=True)
# variable names are stored in another csv file
spam_variable_name = spark.read.csv("/mnt/umsi-data-science/si618wn2017/spam_variable_name.csv")
# collect the variable names once, then rename each column in order
spam_names = spam_variable_name.toPandas()
for i in range(len(spam.columns)):
  spam = spam.withColumnRenamed(spam.columns[i], spam_names.iloc[i, 0].strip())
display(spam)

### Step 1: Split the data using a 70%-30% training-testing split, and set the seed number of the random number generator to be 1.

In [24]:
splits = spam.randomSplit([0.7, 0.3], 1)
spam_train = splits[0]
spam_test = splits[1]
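
The idea behind a seeded random split can be sketched in plain Python: each row is assigned to the training or testing set by one random draw, and fixing the seed makes the assignment reproducible. This is an illustrative assumption about the general technique, not Spark's implementation, and the `random_split` helper is hypothetical.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with one seeded random draw per row."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))  # stand-in for 1000 e-mail records
train_a, test_a = random_split(rows, 0.7, seed=1)
train_b, test_b = random_split(rows, 0.7, seed=1)

# Same seed, same split; the sizes are close to 70/30 but not exact
print(len(train_a), len(test_a))
```

Because rows are assigned independently, the split sizes are only approximately 70% and 30%. Evaluating on rows the model never saw during training is what makes the test accuracy an honest estimate.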

### Step 2: Combine all 57 features into one column named "features" using VectorAssembler(), excluding the target column "spam_or_not".

In [26]:
cols = spam.columns[:-1]
spam_assembler = VectorAssembler(
    inputCols=cols,
    outputCol="features")
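
Conceptually, assembling features means collecting the chosen columns of each row into one ordered vector. The sketch below shows that idea on a single made-up row; the `assemble` helper and the row values are hypothetical, and the column names are used only for illustration.

```python
def assemble(row, input_cols):
    """Collect the named columns of one row into a single feature list, in order."""
    return [float(row[col]) for col in input_cols]

# Hypothetical row with two feature columns and the target column last
row = {"word_freq_free": 0.32, "char_freq_$": 0.18, "spam_or_not": 1}
columns = list(row.keys())
feature_cols = columns[:-1]  # every column except the last one (the target)

features = assemble(row, feature_cols)
print(feature_cols)
print(features)
```

Slicing off the last column keeps the target out of the feature vector, which matters because a model that sees the label as a feature would look perfect and learn nothing useful.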

### Step 3: Create a NaiveBayes classifier with add-one smoothing.

In [28]:
spam_nb = NaiveBayes(featuresCol='features', labelCol='spam_or_not', smoothing=1.0, modelType="multinomial")
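
To see what add-one smoothing changes, here is a small sketch of the standard additive (Laplace) smoothing formula for a multinomial model: each feature's total within a class is increased by the smoothing value before normalizing. The `smoothed_likelihoods` helper and the totals are hypothetical and only illustrate the formula; they do not reproduce Spark's internals.

```python
def smoothed_likelihoods(feature_totals, smoothing=1.0):
    """Estimate P(feature | class) from per-class feature totals with additive smoothing.

    feature_totals: summed feature values for one class, one entry per feature.
    """
    n_features = len(feature_totals)
    denominator = sum(feature_totals) + smoothing * n_features
    return [(total + smoothing) / denominator for total in feature_totals]

# Hypothetical totals for three features within one class; the third never occurs
totals = [6.0, 3.0, 0.0]

unsmoothed = smoothed_likelihoods(totals, smoothing=0.0)
add_one = smoothed_likelihoods(totals, smoothing=1.0)

print(unsmoothed)  # the unseen feature gets probability 0
print(add_one)     # every feature keeps a small non-zero probability
```

Without smoothing, a feature that never appeared with a class during training would force that class's probability to zero for any message containing it. Adding one to every total avoids that.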

### Step 4: Create a pipeline that includes assembler and nb defined above in Step 2 - Step 3.

In [30]:
spam_pipeline = Pipeline(stages=[spam_assembler, spam_nb])

### Step 5: Train the pipeline object on the training set, and make predictions on the testing set.

In [32]:
spam_model = spam_pipeline.fit(spam_train)
spam_predictions = spam_model.transform(spam_test)

### Step 6: Display the predictions and the true spam labels.

In [34]:
display(spam_predictions.select("prediction","spam_or_not"))

### Step 7: Evaluate how good the spam detection classifier is.

In [36]:
evaluator = MulticlassClassificationEvaluator(labelCol="spam_or_not", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(spam_predictions)
print("Test set accuracy = " + str(accuracy))

## 2. Random Forest

### 2.1 Example code:

In [38]:
from pyspark.ml.classification import RandomForestClassifier

#### Let's use the iris data again as an example.

In [40]:
df_iris.show(5)

#### Train and test a Random Forest classifier on iris data.

In [42]:
# create the random forest trainer and set its parameters; numTrees=10 builds a forest of 10 trees
iris_rf = RandomForestClassifier(labelCol="indexed_species", featuresCol="features", numTrees=10)

# Chain indexers and forest in a Pipeline
iris_pipeline = Pipeline(stages=[iris_string_indexer, iris_assembler, iris_rf, iris_labelConverter])

# Train the pipeline on the train dataset
iris_model = iris_pipeline.fit(iris_train)

# make predictions on the test set
iris_predictions = iris_model.transform(iris_test)

# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="indexed_species", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(iris_predictions)
print("Test set accuracy = " + str(accuracy))

# compute feature importance of each predictor
# featureImportances is a Spark vector, so convert it to a NumPy array for pandas
featureImportances = pd.DataFrame({"index":df_iris.columns[:-1],"featureImportances":iris_model.stages[2].featureImportances.toArray()})\
  .sort_values("featureImportances", ascending=False)

# visualize the feature importance using seaborn
f, ax = plt.subplots(figsize=(12, 7))
sns.set(style="darkgrid")
sns.barplot(y="index",x="featureImportances",data=featureImportances)

display(f.figure)

### 2.2 Do it yourself!

### Goal:
Model your work after the example code above. Try applying a Random Forest classifier to detect the spam e-mails.

### Step 1: Create a RandomForest classifier.

In [45]:
spam_rf = RandomForestClassifier(labelCol="spam_or_not", featuresCol="features", numTrees=10)

### Step 2: Create a pipeline that includes spam_assembler and spam_rf defined above.

In [47]:
spam_pipeline = Pipeline(stages=[spam_assembler, spam_rf])

### Step 3: Train the pipeline object on the training set, and make predictions on the testing set.

In [49]:
spam_model = spam_pipeline.fit(spam_train)
spam_predictions = spam_model.transform(spam_test)

### Step 4: Display the predictions and the true spam labels.

In [51]:
# You might need to change the following line
display(spam_predictions.select("prediction","spam_or_not"))

### Step 5: Evaluate how good the spam detection classifier is.

In [53]:
evaluator = MulticlassClassificationEvaluator(labelCol="spam_or_not", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(spam_predictions)
print("Test set accuracy = " + str(accuracy))

### Step 6: List every predictor and their feature importance in descending order.

In [55]:
# featureImportances is a Spark vector, so convert it to a NumPy array for pandas
featureImportances = pd.DataFrame({"index":spam.columns[:-1],"featureImportances":spam_model.stages[1].featureImportances.toArray()})\
  .sort_values("featureImportances", ascending=False)
featureImportances.head(20)  
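
The ranking itself is just a sort of (name, importance) pairs in descending order. A minimal sketch in plain Python follows; the `rank_features` helper, the feature names, and the scores are all hypothetical and are not output from the lab's model.

```python
def rank_features(names, importances):
    """Pair each feature name with its importance and sort from most to least important."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)

# Hypothetical names and importance scores, not output from the lab's model
names = ["feature_a", "feature_b", "feature_c", "feature_d"]
importances = [0.10, 0.55, 0.05, 0.30]

ranked = rank_features(names, importances)
for name, score in ranked:
    print(name, score)
```

The order of the names must match the order in which the columns were assembled into the feature vector; otherwise the scores are attached to the wrong predictors.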

### Step 7: Visualize the feature importance using Seaborn. Briefly interpret the plot and describe what you think are the important factors that determine if an e-mail is spam or not.

#### For your reference, here is some of the data description:
|Feature|Description|
|--|--|
|word_freq_WORD|percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail.  A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.|
|char_freq_CHAR|percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail|
|capital_run_length_average|average length of uninterrupted sequences of capital letters|
|capital_run_length_longest|length of longest uninterrupted sequence of capital letters|
|capital_run_length_total|total number of capital letters in the e-mail |
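
The two frequency definitions in the table can be worked through on a short message. The sketch below follows the formulas as stated; matching words case-insensitively is an assumption made here for illustration, and the `word_freq` and `char_freq` helpers and the sample message are hypothetical.

```python
import re

def word_freq(text, word):
    """100 * (occurrences of word) / (total words); words are runs of alphanumerics."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    # Assumption: compare case-insensitively
    matches = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * matches / len(words)

def char_freq(text, char):
    """100 * (occurrences of char) / (total characters in the text)."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)

# Made-up message for illustration
message = "Free money! Claim your free prize now!"
free_pct = word_freq(message, "free")
bang_pct = char_freq(message, "!")
print(free_pct, bang_pct)
```

The message has seven words, two of which are "free", so `word_freq` is 100 * 2 / 7, roughly 28.6. It has two exclamation marks out of its total characters, which gives the `char_freq` value.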

In [58]:
# visualize the feature importance using seaborn
f, ax = plt.subplots(figsize=(12, 14))
sns.set(style="darkgrid")
sns.barplot(y="index",x="featureImportances", data=featureImportances)

display(f.figure)

The most important factors that predict whether an e-mail message is spam are the frequency of characters that are uncommon in ordinary writing, such as $ and !, and of words like free, money, and credit. This makes sense, as most spam messages are trying to sell you something.

## End of Lab 10