**Sentiment Analysis with [Spark NLP](https://nlp.johnsnowlabs.com/?gclid=CjwKCAjwr7X4BRA4EiwAUXjbt8SXPLqhOytb-o6ZpGC67FuhfJkiaI3GR2EvdTItYmQXEK2gIRfmlBoCzt8QAvD_BwE)** 


In this second part of our tutorial, we will use Spark NLP, an industry-grade, open-source NLP library. Last time we implemented the preprocessing steps with NLTK; this time we will use the pretrained `analyze_sentiment` pipeline from Spark NLP to see an example of how to use a pretrained model for sentiment analysis.

Our goal is to introduce you to one of the most robust NLP tools and libraries that you can continue learning more about as you keep experimenting with NLP techniques. 


*Please note: to get a full grasp of Spark NLP, or of any other NLP library or tool, you will first need to familiarize yourself with its documentation and concepts. To learn more about Spark NLP, visit the [documentation](https://nlp.johnsnowlabs.com/docs/en/concepts).*

+ We will first set up the necessary Colab environment. 
  + Run this block only if you are inside Google Colab.

In [None]:
import os

# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed spark-nlp==2.5.0

+ Next, we will mount Google Drive in Colab by running the cell below and clicking on the URL to get the authorization code. 
  + If you are coding along, copy the authorization code from the URL that appears after you run the cell below and paste it into the provided box. If you are using Jupyter Notebook, you can skip this step. 

In [None]:
from google.colab import drive
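# The mount point is relative: with Colab's default working directory (/content),
# Drive becomes available under /content/content
drive.mount("content")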

+ We have now set up our environment on Google Colab and can continue with the next steps, using Spark NLP to do sentiment analysis. 

### 1. Sentiment Analysis Using the pretrained Pipeline

A pretrained pipeline can also be applied to a Spark DataFrame. We first create a Spark DataFrame with a column named "text", which serves as the input for the pipeline, and then call the `.transform()` method to run the pipeline over that DataFrame. The outputs of the different pipeline components are stored as columns of the resulting Spark DataFrame.

In this example, we are not training anything or using a model that we created ourselves. We simply use Spark NLP, out of the box, to tell us the sentiment of a text that we give it. 

In [None]:
import sys
import sparknlp
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_contains
from pyspark.ml import Pipeline, PipelineModel

from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline

+ We will now start a Spark NLP session and check the versions of both Apache Spark and Spark NLP. If this cell runs without an error, we have installed the necessary packages correctly. 

In [None]:
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

+ Load the predefined pipeline provided in Spark NLP, which contains all the annotators we need to run sentiment analysis on a piece of raw text.

  + The next step in the process is to initialize the pretrained pipeline from Spark NLP. For sentiment analysis, we will use the pipeline named `analyze_sentiment` for the English language. 

+ In this example, we can simply use a text provided by a user or a client, or any piece of text whose sentiment you would like to know.

In [None]:
pipeline = PretrainedPipeline("analyze_sentiment", lang="en")

+ Create a list of sentences that you would like the model to analyze. 




In [None]:
dataset = ["Since there is No Vaccine for COVID-19 I have no choice but to wear my mask to protect, my family, myself and others. fact is many people have died from COVID-19 are you willing to take that risk, and possibly even put your family in harms way?", "Their is NO Vaccine so wear the MASK!"]

# Alternatively, you can put this tiny dataset into a Spark DataFrame, one row per text
# data = spark.createDataFrame([[text] for text in dataset]).toDF('text')

In [None]:
# Annotate our tiny dataset
result = pipeline.annotate(dataset)
[(r['sentence'], r['sentiment']) for r in result]

In [None]:
# We can also inspect the output of every stage in the pipeline by displaying the full result.
result
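
For each input text, `annotate()` returns a dictionary that maps each stage's output column to a list of results, so the sentence and sentiment lists can be paired item by item. The sketch below assumes that the pipeline emits one sentiment label per detected sentence. `pair_sentences_with_sentiment` is a hypothetical helper, and `sample` is a hand-written stand-in for one element of `result`, not real model output.

```python
def pair_sentences_with_sentiment(annotation):
    """Pair each detected sentence with its sentiment label.

    Assumes the 'sentence' and 'sentiment' lists are aligned one-to-one.
    """
    return list(zip(annotation["sentence"], annotation["sentiment"]))


# Hand-written stand-in for one element of `result`; not real model output.
sample = {
    "sentence": ["I love this library.", "The setup was painful."],
    "sentiment": ["positive", "negative"],
}

pairs = pair_sentences_with_sentiment(sample)
for sentence, sentiment in pairs:
    print(f"{sentiment:>8}: {sentence}")
```

In the notebook, the same helper could be applied to every element of `result` to get a flat, per-sentence view of the predictions.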

### 2. Sentiment Analysis Using A Pretrained Model, SentimentDL



In [None]:
import time
import sys
import os
import pandas as pd

from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

from sparknlp.annotator import *
from sparknlp.base import DocumentAssembler, Finisher

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

+ Let's pull in the Twitter dataset that we used last time. 

In [None]:
df = pd.read_csv('/content/content/My Drive/Colab Notebooks/nlp-tutorial-part-ii/data/covid19_tweets.csv')

In [None]:
df.head()

+ Keep only the columns relevant to the sentiment analysis task to reduce the data size. 
  + Pretrained pipelines expect the input column to be named “text”.

In [None]:
df = df[['tweet', 'sentiment']]
df = df.rename(columns={'tweet': 'text'})

In [None]:
df.head()

+ Split the dataset into train and test sets using pandas and NumPy, and save these subsets into two separate CSV files.
  + This can also be done with the `scikit-learn` library, as we did last time. This is simply another way of splitting our data if you are trying to reduce the overhead of your code.

In [None]:
import numpy as np

# Randomly select roughly 80% of the dataset and use it for training.
mask = np.random.rand(len(df)) < 0.8
trainDataset = df[mask]

# Take the complement of the training set we split off above (i.e., roughly 20% of the data for testing).
testDataset = df[~mask]

# Save these subsets (train and test) to CSV files
trainDataset.to_csv('/content/content/My Drive/Colab Notebooks/nlp-tutorial-part-ii/data/trainDataset.csv', index=False)
testDataset.to_csv('/content/content/My Drive/Colab Notebooks/nlp-tutorial-part-ii/data/testDataset.csv', index=False)

+ See how many rows of data we have in training and testing sets

In [None]:
trainDataset.shape

In [None]:
testDataset.shape

In [None]:
trainDataset.head()

+ Convert the data into a pyspark dataframe to make it compatible with Spark NLP

In [None]:
spark_train = spark.createDataFrame(trainDataset.astype(str))

In [None]:
spark_test = spark.createDataFrame(testDataset.astype(str))

In [None]:
spark_train.show(n=10, truncate=True)

+ Set up the pipeline for the model

+ With any new tool or library, there is often some specific terminology that you need to learn. In this case, the term we need to pay attention to is "pipeline".
    + *In Machine Learning, a pipeline is often defined as a sequence of algorithms to process and learn from data. It is a sequence of stages, and in Spark NLP, each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. That is, the data are passed through the fitted pipeline in order. For more details on Spark Pipelines that Spark NLP uses, please visit [here](http://spark.apache.org/docs/latest/ml-pipeline.html).*
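
To make the Transformer/Estimator distinction concrete, here is a toy sketch in plain Python. It is not Spark's implementation, and every class name in it is made up for illustration. It only mirrors the idea that fitting a pipeline turns each Estimator into a fitted Transformer while the data flows through the stages in order.

```python
class Lowercase:
    """Transformer: only has transform() and needs no fitting."""

    def transform(self, rows):
        return [dict(row, text=row["text"].lower()) for row in rows]


class ConstantLabelModel:
    """Fitted Transformer produced by the MajorityLabel estimator."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [dict(row, prediction=self.label) for row in rows]


class MajorityLabel:
    """Estimator: fit() learns from the data and returns a Transformer."""

    def fit(self, rows):
        labels = [row["label"] for row in rows]
        return ConstantLabelModel(max(set(labels), key=labels.count))


class ToyPipelineModel:
    """Fitted pipeline: every stage is a Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Unfitted pipeline: an ordered list of Transformers and Estimators."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # Estimator -> fitted Transformer
            rows = stage.transform(rows)  # output feeds the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)


train_rows = [
    {"text": "Great Movie", "label": "positive"},
    {"text": "Loved it", "label": "positive"},
    {"text": "Terrible", "label": "negative"},
]
model = ToyPipeline([Lowercase(), MajorityLabel()]).fit(train_rows)
output = model.transform([{"text": "GREAT Day"}])
print(output)
```

The Spark NLP pipeline built in the next cells follows the same pattern: the document assembler and the sentence encoder only transform data, while the classifier stage is trained when `fit` is called.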

In [None]:
from pyspark.ml import Pipeline
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
use = UniversalSentenceEncoder.pretrained() \
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

# the classes/labels/categories are in sentiment column
sentimentdl = SentimentDLApproach()\
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("class")\
  .setLabelColumn("sentiment")\
  .setMaxEpochs(3)\
  .setEnableOutputLogs(True)


pipeline = Pipeline(
    stages = [
        document,
        use,
        sentimentdl
    ])

+ Train the model on our training dataset

In [None]:
pipelineModel = pipeline.fit(spark_train)

### Save and load pre-trained SentimentDL model


In [None]:
pipelineModel.stages[-1].write().overwrite().save('./tmp_sentimentdl_model')

+ Use our pre-trained SentimentDLModel in a pipeline

In [None]:
# In a new pipeline we can load it for prediction
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained() \
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

sentimentdl = SentimentDLModel.load("./tmp_sentimentdl_model") \
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("class")

pipeline = Pipeline(
    stages = [
        document,
        use,
        sentimentdl
    ])

In [None]:
from pyspark.sql.types import StringType
dfTest = spark.createDataFrame([
    "I am glad I read this book on the latest trends in Natural Language Processing.",
    "This movie is ridiculous. I wish I hadn't come to watch it."
], StringType()).toDF("text")

In [None]:
prediction = pipeline.fit(dfTest).transform(dfTest)


In [None]:
prediction.select("class.result").show()

prediction.select("class.metadata").show(truncate=False)

## Evaluation 

As with other NLP libraries, we can evaluate our Spark NLP SentimentDL model with standard classification metrics. To do so, we first run the model on our test set. We leave it to you as practice to experiment with the evaluation metrics in the `scikit-learn` library. (Hint: revisit the Part I notebook.)

In [None]:
predictions = pipelineModel.transform(spark_test)

In [None]:
predictions.select('sentiment','text',"class.result").show(20, truncate=50)

+ SentimentDL accepts a threshold and assigns a fallback label to any result whose score is below that number. By default, the threshold is set to 0.6 and everything below it is labeled neutral. You can change this label with the `setThresholdLabel` method.

+ We need to filter neutral results since we don't have any in the original test dataset to compare with.
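
The thresholding rule described above can be sketched in plain Python. This only illustrates the idea and is not Spark NLP's actual implementation. `apply_threshold` is a hypothetical helper, and the scores are made-up numbers.

```python
def apply_threshold(scores, threshold=0.6, threshold_label="neutral"):
    """Return the top-scoring label, or threshold_label if its score is too low."""
    label, score = max(scores.items(), key=lambda item: item[1])
    return label if score >= threshold else threshold_label


# Made-up class scores for three texts; not real model output.
examples = [
    {"positive": 0.92, "negative": 0.08},
    {"positive": 0.55, "negative": 0.45},
    {"positive": 0.30, "negative": 0.70},
]

labels = [apply_threshold(scores) for scores in examples]
print(labels)  # ['positive', 'neutral', 'negative']

# Drop the neutral predictions before comparing against the gold labels.
confident = [label for label in labels if label != "neutral"]
print(confident)  # ['positive', 'negative']
```

Raising the threshold sends more predictions to the fallback label, which leaves fewer rows for evaluation after the neutral results are filtered out.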

In [None]:
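predictions_df = predictions.select('sentiment','text',"class.result").toPandas()
# "class.result" is an array column; keep its single label as a plain string
predictions_df['result'] = predictions_df['result'].apply(lambda x: x[0])

In [None]:
predictions_df = predictions_df[predictions_df['result'] != 'neutral']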

In [None]:
predictions_df.head()

In [1]:
from sklearn.metrics import accuracy_score

#alternatively
from sklearn.metrics import classification_report


# Your code here
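
As a hint for the exercise, the sketch below computes accuracy, precision, and recall by hand, which shows what the `scikit-learn` metrics measure. The labels are made up for illustration and are not results from the tutorial's model. `accuracy` and `precision_recall` are hypothetical helper names.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall(y_true, y_pred, label):
    """Precision and recall for a single class label."""
    true_positives = sum(
        t == label and p == label for t, p in zip(y_true, y_pred)
    )
    predicted = sum(p == label for p in y_pred)  # times the model said `label`
    actual = sum(t == label for t in y_true)  # times `label` was correct
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Made-up labels for five texts; not the tutorial's results.
y_true = ["positive", "negative", "positive", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "negative", "positive"]

acc = accuracy(y_true, y_pred)
precision, recall = precision_recall(y_true, y_pred, "positive")
print(acc)  # 0.8
print(precision, recall)
```

In the notebook, the true labels would come from the `sentiment` column of `predictions_df` and the predicted labels from its `result` column.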