**Sentiment Analysis on IMDB Movie Reviews**


In [None]:
# Mount Google Drive Directory
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
# This is only needed to set up PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2021-04-22 21:50:01--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.26
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.26|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2021-04-22 21:50:01--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1594 (1.6K) [text/plain]
Saving to: ‘STDOUT’


2021-04-22 21:50:02 (28.6 MB/s) - written to stdout [1594/1594]

setup Colab for PySpark 3.0.2 and Spark NLP 3.0.2

In [None]:
# Import SparkNLP and Start SparkNLP Session
import sparknlp
spark = sparknlp.start()

print("Spark NLP version: {}".format(sparknlp.version()))
print("Apache Spark version: {}".format(spark.version))

from sparknlp.base import *
from sparknlp.annotator import *           

from pyspark.sql.functions import *

from sklearn.metrics import classification_report, accuracy_score

import pandas as pd

Spark NLP version: 3.0.2
Apache Spark version: 3.0.2


**IMDB Dataset Info**

https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?select=IMDB+Dataset.csv

The IMDB dataset contains 50,000 movie reviews for natural language processing and text analytics.
It is a binary sentiment classification dataset with substantially more data than earlier benchmark datasets: 25,000 highly polar movie reviews for training and 25,000 for testing. The task is to predict whether each review is positive or negative, using either classical classification or deep learning algorithms.
For more information about the dataset, see:
http://ai.stanford.edu/~amaas/data/sentiment/

In [None]:
# Load the dataset into a Pandas DataFrame
csv_file = "/content/drive/MyDrive/Colab Notebooks/imdb_dataset.csv"
reviews = pd.read_csv(csv_file, usecols=[0, 1], names=["text", "actual_sentiment"], skiprows=1)
reviews

Unnamed: 0,text,actual_sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [None]:
# Create a Spark DataFrame from the Pandas DataFrame and keep the index column.
# Map the dataset's "positive"/"negative" labels to "pos"/"neg", the labels the
# sentiment model outputs, so that the model's accuracy can be checked directly later on.

reviews_df = spark.createDataFrame(reviews.reset_index(drop=False))\
.withColumn("actual_sentiment", when(col("actual_sentiment") == "positive", "pos").otherwise("neg"))
reviews_df.show(truncate=100)

+-----+----------------------------------------------------------------------------------------------------+----------------+
|index|                                                                                                text|actual_sentiment|
+-----+----------------------------------------------------------------------------------------------------+----------------+
|    0|One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. ...|             pos|
|    1|A wonderful little production. <br /><br />The filming technique is very unassuming- very old-tim...|             pos|
|    2|I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air ...|             pos|
|    3|Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his ...|             neg|
|    4|Petter Mattei's "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offe...|          

**DocumentAssembler**

---

Before any NLP processing can happen, the raw data must be annotated. A special transformer does this for us: DocumentAssembler creates the first annotation, of type Document, which annotators further down the pipeline can use. It can read either a String column or an Array[String] column.

In [None]:
# Document Assembler
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document").setCleanupMode("shrink_full")

**DocumentNormalizer (Text cleaning)**

---
DocumentNormalizer is an annotator that normalizes raw text from tagged sources, e.g. scraped web pages or XML documents, held in Document-type columns. Here it strips HTML tags such as <br /> from the reviews.


In [None]:
# Document Normalizer
clean_up_patterns = ["<[^>]*>"]

document_normalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalized_document") \
    .setAction("clean") \
    .setPatterns(clean_up_patterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(False)
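
To see what the clean-up pattern matches, here is a minimal plain-Python sketch using the standard `re` module. It only approximates the effect of the "clean" action with a single-space replacement; it is not Spark NLP's implementation, and the helper name `strip_tags` is hypothetical.

```python
import re

# Same regular expression as clean_up_patterns above: "<", then any run of
# characters that are not ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]*>")

def strip_tags(text):
    """Replace each tag with a space, then collapse runs of whitespace.

    Hypothetical helper for illustration only; the whitespace collapsing is an
    assumption about the end result, not a description of the "pretty_all" policy.
    """
    without_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(without_tags.split())

sample = "A wonderful little production. <br /><br />The filming technique"
print(strip_tags(sample))
```

The cleaned string matches the shape of the normalized_document column shown later in the notebook, where the `<br />` tags no longer appear.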

**UniversalSentenceEncoder**

---

The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.
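
As a rough illustration of how such vectors support semantic similarity, the sketch below computes cosine similarity between toy vectors. The three-dimensional vectors are made-up stand-ins, not real encoder output, and the function name is hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real sentence embeddings have far more dimensions.
great = [0.9, 0.1, 0.3]
wonderful = [0.8, 0.2, 0.35]
awful = [-0.7, 0.6, 0.1]

# Sentences with similar meaning should point in similar directions.
print(cosine_similarity(great, wonderful) > cosine_similarity(great, awful))
```

A downstream classifier such as SentimentDL works on these vectors rather than on the raw text.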

In [None]:
# Universal Sentence Encoder
use = UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en").setInputCols(["normalized_document"]).setOutputCol("sentence_embeddings")

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


**SentimentDL (Multi-class Sentiment Analysis annotator)**

---

SentimentDL is an annotator for multi-class sentiment analysis. It comes with two pre-trained models, trained on the IMDB and Twitter datasets.

In [None]:
# SentimentDL Model
sentimentdl = SentimentDLModel.pretrained(name="sentimentdl_use_imdb", lang="en").setInputCols(["sentence_embeddings"]).setOutputCol("sentiment")

sentimentdl_use_imdb download started this may take some time.
Approximate size to download 12 MB
[OK!]


**Pipeline**

---

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer’s transform() method is called on the DataFrame.
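
The fit/transform logic described above can be sketched in plain Python. This is a toy model of the idea that operates on lists of strings; every class name below is hypothetical and none of it is the PySpark API.

```python
class Transformer:
    def transform(self, rows):
        raise NotImplementedError

class Estimator:
    def fit(self, rows):
        raise NotImplementedError

class Lowercase(Transformer):
    """A stage with nothing to learn: it only transforms."""
    def transform(self, rows):
        return [row.lower() for row in rows]

class KeepKnownWords(Transformer):
    """The fitted result of VocabularyEstimator."""
    def __init__(self, vocabulary):
        self.vocabulary = vocabulary
    def transform(self, rows):
        return [" ".join(w for w in row.split() if w in self.vocabulary) for row in rows]

class VocabularyEstimator(Estimator):
    """A stage that must see data first: fit() returns a Transformer."""
    def fit(self, rows):
        return KeepKnownWords({w for row in rows for w in row.split()})

class ToyPipelineModel(Transformer):
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class ToyPipeline(Estimator):
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)      # Estimator -> Transformer
            rows = stage.transform(rows)     # pass transformed data to the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)

model = ToyPipeline([Lowercase(), VocabularyEstimator()]).fit(["Good Movie", "bad movie"])
print(model.transform(["A GOOD film"]))
```

When no stage has anything to learn from the data, `fit()` only assembles the fitted pipeline, which is presumably why this notebook can fit on an empty DataFrame.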

In [None]:
# Pipeline
nlp_pipeline = Pipeline(
      stages = [
          document_assembler,
          document_normalizer,
          use,
          sentimentdl
      ])

In [None]:
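# Build the PipelineModel and get the results. The stages are pretrained or rule-based,
# so the pipeline is fitted on an empty DataFrame and then applied to the reviews.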
empty_df = spark.createDataFrame([[""]]).toDF("text")

pipeline_model = nlp_pipeline.fit(empty_df)

results = pipeline_model.transform(reviews_df)

**Index | Review | Sentiment | Score**

In [None]:
results.select("index", explode(arrays_zip("normalized_document.result", "sentiment.result", "sentiment.metadata")).alias("cols")) \
.select("index",
        expr("cols['0']").alias("normalized_document"),
        expr("cols['1']").alias("sentiment"), 
        when(expr("cols['1']") == "pos", format_number(expr("cols['2'].pos") * 100, 1))\
        .otherwise(format_number(expr("cols['2'].neg") * 100, 1)).alias("score")
        ).show(truncate=100)

+-----+----------------------------------------------------------------------------------------------------+---------+-----+
|index|                                                                                 normalized_document|sentiment|score|
+-----+----------------------------------------------------------------------------------------------------+---------+-----+
|    0|One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. ...|      pos|100.0|
|    1|A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashio...|      pos|100.0|
|    2|I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air ...|      pos|100.0|
|    3|Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his ...|      neg|100.0|
|    4|Petter Mattei's "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offe...|      pos|100.0|
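
The score column is the probability the model assigned to the predicted label, shown as a percentage with one decimal. A plain-Python sketch of that calculation follows; the metadata values are invented for illustration and the helper name is hypothetical.

```python
def confidence_percent(label, metadata):
    """Return the predicted label's probability as a percentage string.

    `metadata` maps each label to its probability; the values are treated as
    strings here, hence the float() conversion (an assumption for this sketch).
    """
    return format(float(metadata[label]) * 100, ".1f")

# Hypothetical metadata for one review, not taken from the model's real output.
example_metadata = {"pos": "0.9987", "neg": "0.0013"}

print(confidence_percent("pos", example_metadata))
print(confidence_percent("neg", example_metadata))
```

Because the score always refers to the predicted label, a confident negative review also shows a score near 100.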


**Classification Report & Accuracy Score**

In [None]:
sentiments_df = results.select("actual_sentiment", explode("sentiment.result").alias("sentiment")).toPandas()

In [None]:
print("\033[1m", "Classification Report\n\n", classification_report(sentiments_df["actual_sentiment"], sentiments_df["sentiment"]))
print("\033[1m", "Accuracy Score:", accuracy_score(sentiments_df["actual_sentiment"], sentiments_df["sentiment"]))



Classification Report

               precision    recall  f1-score   support

         neg       0.90      0.83      0.86     25000
     neutral       0.00      0.00      0.00         0
         pos       0.85      0.90      0.88     25000

    accuracy                           0.87     50000
   macro avg       0.58      0.58      0.58     50000
weighted avg       0.88      0.87      0.87     50000

Accuracy Score: 0.86636
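
The neutral row has support 0 because the ground truth contains only pos and neg; it appears in the report because the model predicted neutral for some reviews. That empty class also pulls the macro average down to 0.58. The arithmetic can be checked from the rounded per-class rows, as in this sketch (the helper name is hypothetical, and the two-class figures are only approximate because they start from rounded values):

```python
# Per-class rows copied from the classification report above (rounded to 2 decimals).
report_rows = {
    "neg":     {"precision": 0.90, "recall": 0.83, "f1": 0.86},
    "neutral": {"precision": 0.00, "recall": 0.00, "f1": 0.00},
    "pos":     {"precision": 0.85, "recall": 0.90, "f1": 0.88},
}

def macro_average(rows, labels):
    """Unweighted mean of each metric over the given labels."""
    metrics = ("precision", "recall", "f1")
    return {m: sum(rows[label][m] for label in labels) / len(labels) for m in metrics}

with_neutral = macro_average(report_rows, ["neg", "neutral", "pos"])
without_neutral = macro_average(report_rows, ["neg", "pos"])

print({m: round(v, 2) for m, v in with_neutral.items()})     # all three come out at 0.58
print({m: round(v, 3) for m, v in without_neutral.items()})  # averages over the two real classes
```

scikit-learn's `classification_report` accepts a `labels` argument, so passing `labels=["neg", "pos"]` is one way to keep the empty class out of the report.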


**Sentiment Analysis of a Single Review**


In [None]:
# LightPipeline model
light_model = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

In [None]:
# Helper function to analyze one review and print the results
def analyze_sentiment(review):  
  annotations = light_model.fullAnnotate(review)
  sentiment = annotations[0]["sentiment"][0].result
  print(review)
  if sentiment == "neutral":
    print("This seems like a neutral review.\U0001F610")
    return
  score = format((float(annotations[0]["sentiment"][0].metadata[sentiment]) * 100), ".1f")
  if sentiment == "pos":
    print("This seems like a positive review.\U0001F601")
    print("Sentiment Score:", score + "%", "Positive.")
  else:
    print("This seems like a negative review.\U0001F621")
    print("Sentiment Score:", score + "%", "Negative.")

In [None]:
# You can choose any review from our IMDB dataset and analyze it
review = reviews["text"][0]
analyze_sentiment(review)

One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fac

In [None]:
# You can enter any review to analyze it
review = "Harry Potter is a good movie!"
analyze_sentiment(review)


In [None]:
# You can enter any review to analyze it
review = "Harry Potter is a bad movie!"
analyze_sentiment(review)

Harry Potter is a bad movie!
This seems like a negative review.😡
Sentiment Score: 99.9% Negative.
