# Part 3) Text Classification

In this part, we will train a text classifier from the features extracted in Part 2. The goal is to learn a model that can predict the product category from a review's text.

To this end, we will extend the pipeline from Part 2 so that a **LinearSVC** classifier is trained. Since this is a multi-class problem and LinearSVC is a binary classifier, we need a strategy that makes binary classifiers applicable to more than two classes. Before the feature vectors are fed into the classifier, we apply vector length normalization using `Normalizer` with the L2 norm.

We use machine learning experiment design to investigate the effects of parameter settings, with the help of the functions provided by Spark:

- Split the review data into training, validation, and test set
- Make experiments reproducible
- Use a grid search for parameter optimization:
    - Compare different LinearSVC settings by varying the regularization parameter, standardization of training features and maximum number of iterations
- Use the MulticlassClassificationEvaluator to estimate performance of trained classifiers on the test set, using F1 measure as criterion.

*Commands used in terminal in order to execute this notebook:*
- **jupyter nbconvert Part3.ipynb --to script** *(for converting jupyter notebook file to python file in order to execute it via spark-submit)*
- **spark-submit --executor-memory 8G --num-executors 4 --total-executor-cores 16 --conf spark.ui.port=5051 Part3.py**

For this part, we will first set the configurations and initialize the session and context which we will be using.

In [None]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName("Part 3) Text Classification")
spark = SparkSession.builder.config(conf=conf).getOrCreate()

We will read the data as a JSON file and select *category* and *reviewText* for further analysis. We will also load the PipelineModel we created in Part 2 of this assignment and store it in a variable called `pipeline_load`.

In [None]:
from pyspark.ml import PipelineModel

# Read the data as json file and select category and reviewText
data = spark.read.json("hdfs:///user/data/reviews_devset.json").select('category', 'reviewText')
# Load PipelineModel created in part 2 of the assignment
pipeline_load = PipelineModel.load('/user/Solution/pipeline')

In order to make the experiments reproducible, we set a random seed `rnd = 1234`, which we use for sampling and splitting.

We split the dataset into training, validation, and test sets:
- 80% goes to the training portion and 20% to the test set. The training portion is then split again into 80% for training and 20% for validation. That leaves 64% of the sampled data for training, 16% for validation, and 20% for testing, which is what the weights passed to `randomSplit` express.
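
The split fractions can be checked with a few lines of plain Python. This is only a sketch of the arithmetic; the variable names are chosen for illustration and are not part of the notebook's pipeline.

```python
# Two-stage 80/20 split expressed as fractions of the whole (sampled) dataset.
train_portion, test_fraction = 0.8, 0.2

train_fraction = train_portion * 0.8      # 80% of the training portion
validate_fraction = train_portion * 0.2   # 20% of the training portion

# Round away floating-point noise to obtain the weights for randomSplit.
weights = [round(train_fraction, 2), round(validate_fraction, 2), test_fraction]
print(weights)  # [0.64, 0.16, 0.2]
```

Note that `TrainValidationSplit` with `trainRatio=0.8` holds out a further 20% of whatever is passed to `fit`, so the models compared during the grid search are fitted on roughly 0.64 × 0.8 ≈ 51% of the sampled rows.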

*Note*: we use only 15000 randomly sampled rows from the dataset, because training on the full dataset takes too long.

In [None]:
# set a seed for reproducibility
rnd = 1234
# split the data into train, validation and test sets -> 64% for training, 16% for validation and 20% for testing
# Note: we use only 15000 randomly sampled rows from the dataset because of the time it takes to train the classifier
train, validate, test = spark.createDataFrame(data.rdd.takeSample(False, 15000, seed=rnd)).randomSplit([0.64, 0.16, 0.2], seed=rnd)

We will normalize the *tfidf* features extracted in Part 2 of the assignment using an L2 *Normalizer*.
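
To illustrate what L2 normalization does to a single feature vector, here is a minimal sketch in plain Python. The helper `l2_normalize` is hypothetical and only mirrors the idea behind `Normalizer(p=2.0)`; returning zero vectors unchanged is an assumption of this sketch.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

normalized = l2_normalize([3.0, 4.0])
print(normalized)  # [0.6, 0.8]
```

After normalization every non-zero review vector has length 1, so long and short reviews contribute on the same scale to the classifier.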

In [None]:
from pyspark.ml.feature import Normalizer

# L2 Normalizer for the tfidf features (applied later as a pipeline stage)
normalizer = Normalizer(inputCol="tfidf", outputCol="features", p=2.0)

In this part we will initialize the `LinearSVC` classifier that we are going to use. 
In order to handle multi-class problems, we will use `OneVsRest` classifier.
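
The one-vs-rest idea can be sketched without Spark: one binary scorer is trained per class, and the class whose scorer is most confident wins. The scorers below are hypothetical hand-written linear functions, not trained models, and `one_vs_rest_predict` is an illustrative helper rather than Spark's implementation.

```python
def make_linear_scorer(weights, bias):
    """Build a scorer computing w.x + b (hypothetical stand-in for a trained binary SVM)."""
    return lambda x: sum(w * xi for w, xi in zip(weights, x)) + bias

def one_vs_rest_predict(binary_scorers, x):
    """Return the index of the class whose binary scorer gives the highest score."""
    scores = [scorer(x) for scorer in binary_scorers]
    return max(range(len(scores)), key=lambda k: scores[k])

scorers = [
    make_linear_scorer([1.0, 0.0], 0.0),    # class 0 favours a large first feature
    make_linear_scorer([0.0, 1.0], 0.0),    # class 1 favours a large second feature
    make_linear_scorer([-1.0, -1.0], 0.5),  # class 2 favours small features
]

print(one_vs_rest_predict(scorers, [0.9, 0.1]))  # 0
print(one_vs_rest_predict(scorers, [0.1, 0.9]))  # 1
print(one_vs_rest_predict(scorers, [0.0, 0.0]))  # 2
```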

In [None]:
from pyspark.ml.classification import LinearSVC, OneVsRest

# set the classifier to SVC
classifier = LinearSVC()
# wrap the binary LinearSVC in OneVsRest in order to handle multiple classes
ml_classifier = OneVsRest(classifier=classifier)

We will extend the PipelineModel created in part 2 by adding normalizer and ml_classifier to it.

In [None]:
from pyspark.ml import Pipeline

# Extend pipeline from Part 2 by adding normalizer and ml_classifier
pipeline = Pipeline(stages=[pipeline_load, normalizer, ml_classifier])

In order to apply different parameters to the *LinearSVC* classifier we will use `ParamGridBuilder`. We will set
- maxIter = [10, 15]
- regParam = [0.001, 0.01, 0.1]
- standardization = [True, False]
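
The grid is the Cartesian product of these value lists, so 2 × 3 × 2 = 12 parameter combinations are evaluated. The following sketch enumerates them with `itertools.product`; it only illustrates the size and shape of the grid and does not use `ParamGridBuilder` itself.

```python
from itertools import product

grid = {
    "maxIter": [10, 15],
    "regParam": [0.001, 0.01, 0.1],
    "standardization": [True, False],
}

# One dict per parameter combination, in the order the lists are given.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(combinations))  # 12
print(combinations[0])    # {'maxIter': 10, 'regParam': 0.001, 'standardization': True}
```

Each combination requires fitting one binary classifier per category, which is one reason the search is run on a sample of the data.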

For tuning, we will use *TrainValidationSplit* because it evaluates each parameter combination only once and is therefore less expensive than *CrossValidator*. In order to evaluate the performance of the trained classifiers, we will use `MulticlassClassificationEvaluator` with the `f1` measure. We will set `trainRatio=0.8` so that 80% of the data passed to `fit` is used for training and 20% for validation.

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# Apply different parameters to the SVC classifier
paramGrid = ParamGridBuilder() \
    .addGrid(classifier.maxIter, [10, 15]) \
    .addGrid(classifier.regParam, [0.001, 0.01, 0.1]) \
    .addGrid(classifier.standardization, [True, False]) \
    .build()

# Initializing evaluator for TrainValidationSplit
evaluator = MulticlassClassificationEvaluator(metricName="f1")
tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=evaluator,
                           trainRatio=0.8, # 80% will be used for training, 20% for validation
                           seed=rnd) # fixed seed so the internal split is reproducible

Now we will fit the model with the train set we created and then save the best model for future use in the specified path:

In [None]:
tvsModel = tvs.fit(train)
tvsBestModel = tvsModel.bestModel
tvsBestModel.write().overwrite().save("/user/Solution/tvsBestModel")

We will use the best model to transform validation and test sets.

In [None]:
# Create predictions for the validation data
validation = tvsBestModel.transform(validate)
# Create predictions for the test data
prediction = tvsBestModel.transform(test)

To measure the performance of the best model, we use the `MulticlassClassificationEvaluator` we initialized before. Because it was created with `metricName="f1"`, the reported values are F1 scores rather than accuracies:
- F1 score on the validation set using the best model => *evaluator.evaluate(validation)*
- F1 score on the test set using the best model => *evaluator.evaluate(prediction)*

We will also collect the parameters of the best model and save everything to the *output_model.txt* file.
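
As far as we understand, the evaluator's `f1` metric is a weighted F-measure: the F1 score is computed per class and then averaged, with each class weighted by its share of the true labels. The following plain-Python sketch shows that computation on a tiny made-up example; `weighted_f1` is a hypothetical helper and the labels are invented for illustration.

```python
from collections import Counter

def weighted_f1(labels, predictions):
    """Average per-class F1, weighting each class by its share of the true labels."""
    total = len(labels)
    true_counts = Counter(labels)
    pred_counts = Counter(predictions)
    correct = Counter(l for l, p in zip(labels, predictions) if l == p)
    score = 0.0
    for cls, n_true in true_counts.items():
        tp = correct[cls]
        precision = tp / pred_counts[cls] if pred_counts[cls] else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (n_true / total) * f1
    return score

labels      = [0, 0, 1, 1, 2, 2]   # made-up true classes
predictions = [0, 0, 1, 2, 2, 2]   # made-up predicted classes
print(round(weighted_f1(labels, predictions), 4))  # 0.8222
```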

In [None]:
writeToFile = "F1 score on the validation set using the best model is = " + str(round(evaluator.evaluate(validation) * 100, 2)) + "%"
writeToFile += "\nF1 score on the test set using the best model is = " + str(round(evaluator.evaluate(prediction) * 100, 2)) + "%"
writeToFile += "\n\nBest model parameters:"
writeToFile += "\nMaximum iterations = " + str(tvsBestModel.stages[-1].getClassifier().getMaxIter())
writeToFile += "\nRegularization = " + str(tvsBestModel.stages[-1].getClassifier().getRegParam())
writeToFile += "\nStandardization = " + str(tvsBestModel.stages[-1].getClassifier().getStandardization())

from pyspark.sql.types import StringType
# convert the writeToFile string to a single-row Spark DataFrame
writeFile = spark.createDataFrame([writeToFile], StringType())
# save the writeFile DataFrame as a text file
writeFile.coalesce(1).write.format("text").mode('overwrite').save("/user/Solution/output_model.txt")