In this module, we'll explore Named Entity Recognition (NER) and Sentiment Analysis using Spark NLP with the Amazon reviews dataset. We'll identify the most common named entities mentioned in the reviews and analyze the overall sentiment of those reviews.

Import the necessary modules from PySpark and Spark NLP. These include functions and types from PySpark, and various annotators and base classes from Spark NLP.

In [0]:
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *

In [0]:
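# Load the Amazon reviews sample that ships with Databricks (stored as Parquet).
# Parquet files carry their own schema, so CSV options such as header/inferSchema are not needed.
text_data_path = "dbfs:/databricks-datasets/amazon/data20K"
text_df = spark.read.parquet(text_data_path)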

### Named Entity Recognition

Once our data is loaded, we're ready to start processing it. For our first task, we're going to identify named entities in the reviews. Named entities are real-world objects such as persons, locations, organizations, and so on, that can be denoted with a proper name.

Named Entity Recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities in text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

We'll create a processing pipeline with Spark NLP to do this. A pipeline is a sequence of stages where each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage.
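To make the Transformer/Estimator distinction concrete, here is a minimal plain-Python sketch of how such a pipeline runs its stages in order. It is not Spark's implementation: `Lowercase`, `VocabularyEstimator`, `VocabularyIndexer`, `ToyPipeline`, and `ToyPipelineModel` are hypothetical names invented for illustration, and rows are plain dictionaries instead of a DataFrame.

```python
class Lowercase:
    """Transformer: nothing to learn, so it only implements transform()."""

    def transform(self, rows):
        return [dict(row, text=row["text"].lower()) for row in rows]


class VocabularyEstimator:
    """Estimator: fit() learns from the data and returns a Transformer."""

    def fit(self, rows):
        vocab = sorted({word for row in rows for word in row["text"].split()})
        return VocabularyIndexer(vocab)


class VocabularyIndexer:
    """Transformer produced by VocabularyEstimator.fit()."""

    def __init__(self, vocab):
        self.index = {word: i for i, word in enumerate(vocab)}

    def transform(self, rows):
        return [
            dict(row, ids=[self.index.get(w, -1) for w in row["text"].split()])
            for row in rows
        ]


class ToyPipeline:
    """Runs stages in order; estimators are fitted on the data seen so far."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # estimator -> fitted transformer
            rows = stage.transform(rows)  # pass the transformed rows onward
            fitted.append(stage)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    """The fitted pipeline: every stage is now a transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


rows = [{"text": "Great Book"}, {"text": "great price"}]
model = ToyPipeline([Lowercase(), VocabularyEstimator()]).fit(rows)
result = model.transform(rows)
```

The same two-step pattern (fit the pipeline, then transform the data with the fitted model) is what we use with the Spark NLP stages in this module.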

In [0]:
# NER pipeline to identify and classify named entities in the reviews:

document_assembler = DocumentAssembler() \
    .setInputCol("review") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = Tokenizer() \
    .setInputCols(["sentences"]) \
    .setOutputCol("tokens")

embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["sentences", "tokens"]) \
    .setOutputCol("embeddings")

ner_model = NerDLModel.pretrained("ner_dl", "en") \
    .setInputCols(["sentences", "tokens", "embeddings"]) \
    .setOutputCol("ner_tags")

ner_converter = NerConverter() \
    .setInputCols(["sentences", "tokens", "ner_tags"]) \
    .setOutputCol("entities")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])

We're using the GloVe embeddings pretrained on 6 billion words ("glove_100d") to generate **word embeddings** which are then provided as input during named entity recognition. You can replace "glove_100d" with [any other embeddings](https://nlp.johnsnowlabs.com/models?type=model&q=glove) you prefer.

The **NerDLModel** is a Named Entity Recognition model trained with a deep learning approach. It assigns every token a tag that indicates whether the token is part of a named entity and, if so, which category it belongs to. The underlying architecture is Char CNNs - BiLSTM - CRF, fed with word embeddings.  
In the code above, we load a pretrained NER model (named "ner_dl") for English ("en"). We set the input columns to "sentences", "tokens", and "embeddings", so the model performs NER on the tokenized sentences using the embedding vector of each token. The output of this stage is a new column, "ner_tags", which contains the NER tag assigned by the model to each token.

The **NerConverter** is used to convert the output from the NerDLModel into a more readable format. It groups together consecutive tokens with the same NER tag into single entities and classifies them according to their tag.  
We're setting the input columns to be "sentences", "tokens", and "ner_tags". This means that the NerConverter will take the tokenized sentences and their corresponding NER tags as input. The output of this stage is a new column, "entities", which contains the named entities extracted from the text. Each entity is represented as a chunk of text along with its NER tag.
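The grouping step can be sketched in plain Python. This is only an illustration of the idea, not the NerConverter implementation: `group_iob_tags` is a hypothetical helper, and the tokens and tags are made up, assuming IOB-style tags such as "B-LOC" / "I-LOC" / "O".

```python
def group_iob_tags(tokens, tags):
    """Merge consecutive B-/I- tagged tokens into (chunk_text, label) pairs."""
    entities = []
    current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")

        # An I- tag with the same label extends the entity that is open.
        if prefix == "I" and label == current_label:
            current_tokens.append(token)
            continue

        # Any other tag closes the open entity, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))

        if prefix in ("B", "I"):
            current_tokens, current_label = [token], label  # start a new entity
        else:
            current_tokens, current_label = [], None        # "O": outside any entity

    # Close an entity that runs to the end of the sentence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


# Made-up example input, not actual model output.
tokens = ["I", "bought", "a", "Kindle", "from", "Amazon", "in", "New", "York"]
tags = ["O", "O", "O", "B-MISC", "O", "B-ORG", "O", "B-LOC", "I-LOC"]
chunks = group_iob_tags(tokens, tags)
```

Here "New" and "York" are merged into a single "New York" chunk labeled LOC, which is the kind of result we expect to see in the "entities" column.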

We will now transform the reviews DataFrame using the NER pipeline:

In [0]:
pipeline_model = pipeline.fit(text_df)
ner_result_df = pipeline_model.transform(text_df)

In [0]:
ner_result_df.show(4)

# ner_result_df.select("review", "entities").show(2, truncate=False)

Spark NLP also offers pretrained pipelines: pre-built sequences of stages that have already been trained on a large dataset and saved, so they can be reused. This is a great way to get started quickly with NLP tasks like Named Entity Recognition (NER), sentiment analysis, and more.

To use a pretrained pipeline, you don't need to assemble each individual component. Instead, you can load the entire pipeline at once. One catch is that pretrained pipelines assume that the input column is named "text".

Here's an example of how you can do this:

In [0]:
# from sparknlp.pretrained import PretrainedPipeline

# Load a pretrained pipeline
# pipeline = PretrainedPipeline("recognize_entities_dl", lang="en")

# Rename the target column - "review" -> "text"
# text_df_2 = text_df.withColumnRenamed("review", "text")

# Use the pipeline to annotate a DataFrame
# result_df = pipeline.transform(text_df_2)

- We import the PretrainedPipeline class from the sparknlp.pretrained module.
- We create an instance of the PretrainedPipeline class by calling the constructor and passing the name of the pretrained pipeline we want to use ("recognize_entities_dl") and the language ("en").
- We rename the input column from "review" to "text".
- We call the transform method of the PretrainedPipeline instance on the renamed DataFrame. PretrainedPipeline also has an annotate method; when given a DataFrame, it takes the name of the column that contains the text to process as a second argument.

Our next step is to extract the most common named entities from our processed data. 

We'll use groupBy and count to find the most common ones.

In [0]:
named_entities = ner_result_df.select("entities.result").withColumnRenamed("result", "named_entities")

named_entities.show(5, truncate=False)

In [0]:
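from pyspark.sql.functions import count, desc, explode

# Each row holds a list of entities, so explode it into one row per entity before counting.
top_entities = named_entities \
    .select(explode("named_entities").alias("named_entities")) \
    .groupBy("named_entities") \
    .agg(count("*").alias("count")) \
    .sort(desc("count")) \
    .limit(10)
top_entities.show()

- We rename the column containing the list of NER results to "named_entities".
- We explode that list so each entity gets its own row; grouping on the list itself would count identical lists rather than individual entities.
- Then we count the number of occurrences for each entity and sort them in descending order.
- Finally, we limit our output to the top 10 entities.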
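Because the "named_entities" column holds one list per review, it matters whether we count whole lists or individual entities. The plain-Python sketch below illustrates the difference on made-up data; flattening the lists first is what `explode` does in PySpark. The entity values are hypothetical, not results from the dataset.

```python
from collections import Counter

# Hypothetical entity lists, one per review.
entity_lists = [
    ["Amazon", "Kindle"],
    ["Amazon"],
    ["Amazon", "Kindle"],
]

# Counting whole lists: identical lists are grouped together.
list_counts = Counter(tuple(entities) for entities in entity_lists)

# Counting individual entities: flatten first, then count.
entity_counts = Counter(
    entity for entities in entity_lists for entity in entities
)

top_two = entity_counts.most_common(2)
```

`list_counts` reports that the list ("Amazon", "Kindle") occurs twice, while `entity_counts` reports that "Amazon" occurs three times, which is the number we are after.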

### Sentiment Analysis


Next, we build another pipeline, this time for sentiment analysis. It starts with the same DocumentAssembler as our NER pipeline, but the pretrained sentiment model ("sentimentdl_use_imdb") expects sentence embeddings from the Universal Sentence Encoder as its input rather than tokens, so an embedding stage replaces the sentence detector and tokenizer. Because we embed the whole "document" column, we get one sentiment label per review.

In [0]:
document_assembler = DocumentAssembler() \
    .setInputCol("review") \
    .setOutputCol("document")

# SentimentDLModel consumes sentence embeddings, so we encode each review
# with the Universal Sentence Encoder first.
sentence_embeddings = UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

sentiment_model = SentimentDLModel.pretrained("sentimentdl_use_imdb", "en") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("sentiment")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_embeddings,
    sentiment_model
])

We can now fit and transform our reviews DataFrame using this new pipeline.

In [0]:
pipeline_model = pipeline.fit(text_df)
sentiment_result_df = pipeline_model.transform(text_df)
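
To get at the overall sentiment of the reviews, we can tally the predicted labels. The sketch below shows the idea in plain Python on made-up data; in PySpark the labels would come from the "sentiment.result" column. The label names "pos" and "neg" and the sample values are assumptions for illustration, not output from the model.

```python
from collections import Counter

# Hypothetical per-review labels, standing in for the "sentiment.result" column.
# Each review carries a list of labels (here, one label per review).
review_labels = [["pos"], ["neg"], ["pos"], ["pos"]]

# Flatten the lists and count how often each label occurs.
label_counts = Counter(
    label for labels in review_labels for label in labels
)

# Convert the counts into shares of all predicted labels.
total = sum(label_counts.values())
label_shares = {label: n / total for label, n in label_counts.items()}
```

With this sample, three of the four labels are "pos", so `label_shares` gives 0.75 for "pos" and 0.25 for "neg".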