# Validation 

In the validation step, the trained model and pipeline were tested on data the model had not seen before. The metrics used to examine the results were ROC, accuracy, and F1 score. Recall and precision for both classes (one and zero) were also examined.

### Import pyspark using Docker

In [1]:
import pyspark 
from pyspark.sql import SparkSession
from pyspark.sql.functions import when
import matplotlib.pyplot as plt
import numpy as np
from pyspark.ml.classification import LogisticRegression,LogisticRegressionModel
from pyspark.ml.evaluation import BinaryClassificationEvaluator
import warnings
warnings.filterwarnings("ignore")

### Start Spark Session

In [2]:
spark = SparkSession.builder.appName('val').getOrCreate()

### Load Data

In [3]:
df = spark.read.csv('clean_val/part-00000-2661d739-2781-4738-9b1b-6b4c69096d9d-c000.csv', header = True).select('Text', 'verified')

In [4]:
### View data
df.show(10)

+--------------------+--------+
|                Text|verified|
+--------------------+--------+
|   really good movie|    true|
|review didnt like...|    true|
|shabby zombieposs...|    true|
|disturbing good a...|    true|
|                null|    true|
|love plot story l...|    true|
|great revenge mov...|    true|
|          worth time|    true|
|         great movie|    true|
|great movie inter...|    true|
+--------------------+--------+
only showing top 10 rows



In [5]:
#### Count null values in each column
print('Null Text:', df.where((df["Text"].isNull())).count())
print('Null verified:', df.where((df["verified"].isNull())).count())

Null Text: 21625
Null verified: 0


In [6]:
### Drop rows with null values
df = df.na.drop()
df.count()

1729882
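
As a rough check on how much data the null filter removes, the counts printed above can be combined. This is a minimal sketch that assumes the 21,625 null `Text` rows are the only rows dropped, which is consistent with `verified` having no nulls; the variable names are illustrative and not part of the notebook.

```python
# Counts copied from the notebook output above
null_text_rows = 21625      # rows where Text is null
rows_after_drop = 1729882   # df.count() after df.na.drop()

# Assumption: only the null-Text rows were removed by na.drop()
rows_before_drop = rows_after_drop + null_text_rows
dropped_share = null_text_rows / rows_before_drop

print(f"Rows before drop: {rows_before_drop}")
print(f"Share of rows dropped: {dropped_share:.2%}")
```

Under that assumption the filter removes a little over 1% of the validation rows, so it should have little effect on the metrics.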

In [7]:
### create a Label column
df = df.withColumn('label', when(df.verified == 'true', 1.0).otherwise(0.0)).select('Text', 'label')
df.show(10)

+--------------------+-----+
|                Text|label|
+--------------------+-----+
|   really good movie|  1.0|
|review didnt like...|  1.0|
|shabby zombieposs...|  1.0|
|disturbing good a...|  1.0|
|love plot story l...|  1.0|
|great revenge mov...|  1.0|
|          worth time|  1.0|
|         great movie|  1.0|
|great movie inter...|  1.0|
|        agreed titty|  0.0|
+--------------------+-----+
only showing top 10 rows



### Load Pipeline & Model

The fitted pipeline and the trained model were loaded so they could be tested on the validation data.

In [8]:
### Import pipeline 
from pyspark.ml import PipelineModel, Pipeline
load_pipline = PipelineModel.read().load('pipline_train')

In [9]:
### import model 
model = LogisticRegressionModel.load('LGmodel')

### Transform validation data

In [10]:
val = load_pipline.transform(df)
val.show(10)

+--------------------+-----+--------------------+--------------------+--------------------+--------------------+
|                Text|label|          token_text|         rawFeatures|                 idf|            features|
+--------------------+-----+--------------------+--------------------+--------------------+--------------------+
|   really good movie|  1.0|[really, good, mo...|(262144,[0,3,9],[...|(262144,[0,3,9],[...|(262144,[0,3,9],[...|
|review didnt like...|  1.0|[review, didnt, l...|(262144,[4,57,59,...|(262144,[4,57,59,...|(262144,[4,57,59,...|
|shabby zombieposs...|  1.0|[shabby, zombiepo...|(262144,[0,56,87,...|(262144,[0,56,87,...|(262144,[0,56,87,...|
|disturbing good a...|  1.0|[disturbing, good...|(262144,[3,47,114...|(262144,[3,47,114...|(262144,[3,47,114...|
|love plot story l...|  1.0|[love, plot, stor...|(262144,[5,7,56,6...|(262144,[5,7,56,6...|(262144,[5,7,56,6...|
|great revenge mov...|  1.0|[great, revenge, ...|(262144,[0,2,4,9,...|(262144,[0,2,4,9,...|(2621

### Predict with validation data

In [11]:
pred = model.transform(val)
pred.select('label', 'prediction', 'probability').show(10)

+-----+----------+--------------------+
|label|prediction|         probability|
+-----+----------+--------------------+
|  1.0|       1.0|[0.16254599207776...|
|  1.0|       1.0|[0.17328587700396...|
|  1.0|       1.0|[0.20860411877465...|
|  1.0|       1.0|[0.17162704128013...|
|  1.0|       1.0|[0.17139272015373...|
|  1.0|       1.0|[0.20543779656975...|
|  1.0|       1.0|[0.16300693263487...|
|  1.0|       1.0|[0.15696064858238...|
|  1.0|       1.0|[0.20239410797914...|
|  0.0|       1.0|[0.16685376261038...|
+-----+----------+--------------------+
only showing top 10 rows
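
In the output above, `probability` is a two-element vector whose first element is the probability of class 0 and whose second is the probability of class 1. The sketch below illustrates how the displayed values could map to the `prediction` column. It assumes the model uses a decision threshold of 0.5 (the Spark default, which may not be what was set at training time), and the helper `predict_from_p0` is hypothetical. The inputs are the displayed first elements rounded to three decimals.

```python
def predict_from_p0(p0, threshold=0.5):
    """Return 1.0 if P(class 1) exceeds the threshold, else 0.0.

    Assumption: binary model, so P(class 1) = 1 - P(class 0).
    """
    p1 = 1.0 - p0
    return 1.0 if p1 > threshold else 0.0

# First elements of the probability vectors shown above, rounded
displayed_p0 = [0.163, 0.173, 0.209, 0.172, 0.171,
                0.205, 0.163, 0.157, 0.202, 0.167]

predictions = [predict_from_p0(p) for p in displayed_p0]
print(predictions)
```

Every row shown has P(class 0) near 0.2 or below, so all ten are predicted as 1.0, including the last row whose true label is 0.0.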



### Metrics 

The metrics used were ROC, accuracy, and F1 score. All three metrics showed better results than on the training set.

In [12]:
#### ROC
evaluator = BinaryClassificationEvaluator()
print('Test Area Under ROC', evaluator.evaluate(pred))

Test Area Under ROC 0.8001782387143311
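
One way to read this number: the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that pairwise quantity directly on made-up toy data. It is an illustration of the interpretation, not how Spark computes the metric internally, and `pairwise_auc` is a hypothetical helper.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data (not taken from the validation set)
toy_labels = [1, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.2]

auc = pairwise_auc(toy_labels, toy_scores)
print(f"Toy AUC: {auc:.3f}")
```

Read this way, a value of about 0.80 means that the model ranks a verified review above an unverified one in roughly four out of five such pairs.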


In [13]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
#### Accuracy 
acc = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label', metricName='accuracy')
print('Accuracy:', acc.evaluate(pred))

Accuracy: 0.8516534653808757


In [14]:
#### F1 Score 
ff = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label', metricName='f1')
print('F1 score:', ff.evaluate(pred))

F1 score: 0.8217714404524895
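
Note that `metricName='f1'` in `MulticlassClassificationEvaluator` reports a support-weighted average of the per-class F1 scores, not the F1 of a single class. The sketch below reproduces that averaging from the rounded per-class values in the classification report at the end of this section. Because the inputs are rounded to two decimals, the result only approximates the value printed above.

```python
# Per-class F1 and support, copied from the classification report
per_class = {
    0.0: {"f1": 0.36, "support": 294987},
    1.0: {"f1": 0.92, "support": 1434895},
}

total_support = sum(c["support"] for c in per_class.values())

# Weighted average: each class contributes in proportion to its support
weighted_f1 = sum(c["f1"] * c["support"] for c in per_class.values()) / total_support

# Macro average: each class contributes equally
macro_f1 = sum(c["f1"] for c in per_class.values()) / len(per_class)

print(f"Weighted F1 (from rounded inputs): {weighted_f1:.4f}")
print(f"Macro F1 (from rounded inputs):    {macro_f1:.2f}")
```

The weighted figure is dominated by class 1, which makes up most of the rows, so it hides the much weaker F1 of class 0.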


### Recall, Precision, F1 score

Looking at recall and precision for both classes, the recall for zero is poor: the model correctly identifies only about 25% of the actual zero (false) labels.

In [15]:
import pandas as pd
from sklearn import metrics as skmetrics

In [16]:
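# Collect both columns in a single pass so labels and predictions stay row-aligned
labels_and_preds = pred.select('label', 'prediction').toPandas()
y_true = labels_and_preds['label']
y_pred = labels_and_preds['prediction']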

In [17]:
#### Classification Report 
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

         0.0       0.68      0.25      0.36    294987
         1.0       0.86      0.98      0.92   1434895

    accuracy                           0.85   1729882
   macro avg       0.77      0.61      0.64   1729882
weighted avg       0.83      0.85      0.82   1729882
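
The support column also shows how imbalanced the classes are, which helps put the accuracy in context. The sketch below compares the reported accuracy with a majority-class baseline, that is, a hypothetical classifier that always predicts the most frequent label. The baseline is an illustration added here, not something the notebook evaluated.

```python
# Support per class, copied from the classification report
support = {0.0: 294987, 1.0: 1434895}
total = sum(support.values())

# A classifier that always predicts the most frequent label
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / total

# Accuracy reported by MulticlassClassificationEvaluator above
reported_accuracy = 0.8516534653808757
lift = reported_accuracy - baseline_accuracy

print(f"Majority label:    {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Lift over baseline: {lift:.4f}")
```

Always predicting 1.0 would already be right for about 83% of the rows, so the model's accuracy is roughly two percentage points above that baseline. This is consistent with the low recall for class 0.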

