In this module, we'll use PySpark for NLP tasks: loading, preprocessing, and analyzing text data. We'll also discuss when PySpark is a good fit for NLP and when other Python NLP libraries are worth considering.

We'll introduce Spark NLP, a popular open-source NLP library from John Snow Labs that is built on top of Apache Spark and can be used from PySpark. The hands-on exercise demonstrates how to perform text preprocessing and feature extraction with Spark NLP.

In [0]:
# !pip install spark-nlp

Now, let's perform the necessary imports:

In [0]:
from pyspark.ml.feature import StopWordsRemover, CountVectorizer, IDF
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# import sparknlp
# spark = sparknlp.start()

In [0]:
%fs

head databricks-datasets/amazon/README.md

In [0]:
# dbutils.fs.ls('/amazon')

**Load a text dataset as a DataFrame**

Let's load a text dataset from the Databricks file system as a DataFrame:

In [0]:
text_data_path = "dbfs:/databricks-datasets/amazon/data20K/"
# Parquet files store their own schema, so CSV-style options such as
# header and inferSchema are not needed here.
text_df = spark.read.parquet(text_data_path)

In [0]:
text_df.show(5, truncate=False)

+------+------+
|rating|review|
+------+------+

**Preprocess and analyze text data using PySpark**

Now, we'll preprocess and analyze the text data using PySpark. First, tokenize the text.

Tokenization is the process of breaking text into individual words or tokens. It's one of the essential steps in NLP to convert unstructured text data into a structured format.
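
To make the idea concrete, here is a minimal pure-Python sketch of whitespace tokenization. It assumes roughly the same approach as `pyspark.ml.feature.Tokenizer` (lowercase the text, then split on whitespace); the helper name `simple_tokenize` and the sample sentence are hypothetical and only serve as illustration.

```python
import re


def simple_tokenize(text):
    """Lowercase the text and split it on runs of whitespace."""
    # Punctuation is not stripped, so "amazing," keeps its trailing comma.
    return [tok for tok in re.split(r"\s+", text.lower().strip()) if tok]


sample = "This mouse is amazing, works great"
tokens = simple_tokenize(sample)
print(tokens)
# ['this', 'mouse', 'is', 'amazing,', 'works', 'great']
```

Because this style of tokenizer only splits on whitespace, punctuation stays attached to the neighboring word, which is why tokens such as `amazing,` can show up in the output.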

In [0]:
from pyspark.ml.feature import Tokenizer

tokenizer = Tokenizer(inputCol="review", outputCol="tokens")
tokenized_df = tokenizer.transform(text_df)

In [0]:
tokenized_df.show(5, truncate=False)

+------+------+------+
|rating|review|tokens|
+------+------+------+

##### Remove stop words

Stop words are common words that carry little meaning on their own and are often removed from text data to reduce noise and computational cost. Examples include "a", "an", "the", and "and".
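
As a rough illustration of the filtering step, the following pure-Python sketch drops tokens found in a stop-word set. The tiny `STOP_WORDS` set and the helper `remove_stop_words` are hypothetical; `StopWordsRemover` ships with a much longer default list.

```python
# A tiny, hypothetical stop-word list used only for this illustration.
STOP_WORDS = {"a", "an", "the", "and", "is", "this", "as"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = ["this", "mouse", "is", "amazing,", "and", "the", "price", "is", "fair"]
filtered = remove_stop_words(tokens)
print(filtered)
# ['mouse', 'amazing,', 'price', 'fair']
```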

In [0]:
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered_tokens")
filtered_df = remover.transform(tokenized_df)

In [0]:
filtered_df.show(5)

+------+--------------------+--------------------+--------------------+
|rating|              review|              tokens|     filtered_tokens|
+------+--------------------+--------------------+--------------------+
|   4.0|Worked as expecte...|[worked, as, expe...|[worked, expected...|
|   5.0|This mouse is ama...|[this, mouse, is,...|[mouse, amazing,,...|
|   4.0|we recently had a...|[we, recently, ha...|[recently, baby, ...|
|   3.0|Works good for a ...|[works, good, for...|[works, good, boy...|
|   2.0|Fabric is nice an...|[fabric, is, nice...|[fabric, nice, so...|
+------+--------------------+--------------------+--------------------+
only showing top 5 rows

##### Compute TF-IDF features

CountVectorizer builds a vocabulary from the filtered tokens and converts each review into a sparse vector of term counts. IDF then rescales those counts so that terms appearing in many reviews receive less weight.



In [0]:
cv = CountVectorizer(inputCol="filtered_tokens", outputCol="raw_features")
cv_model = cv.fit(filtered_df)
featurized_df = cv_model.transform(filtered_df)

idf = IDF(inputCol="raw_features", outputCol="features")
idf_model = idf.fit(featurized_df)
result_df = idf_model.transform(featurized_df)

In [0]:
result_df.show(5)

+------+--------------------+--------------------+--------------------+--------------------+--------------------+
|rating|              review|              tokens|     filtered_tokens|        raw_features|            features|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+
|   4.0|Worked as expecte...|[worked, as, expe...|[worked, expected...|(84704,[0,67,98,3...|(84704,[0,67,98,3...|
|   5.0|This mouse is ama...|[this, mouse, is,...|[mouse, amazing,,...|(84704,[2,17,19,3...|(84704,[2,17,19,3...|
|   4.0|we recently had a...|[we, recently, ha...|[recently, baby, ...|(84704,[0,2,3,5,6...|(84704,[0,2,3,5,6...|
|   3.0|Works good for a ...|[works, good, for...|[works, good, boy...|(84704,[4,8,17,27...|(84704,[4,8,17,27...|
|   2.0|Fabric is nice an...|[fabric, is, nice...|[fabric, nice, so...|(84704,[5,6,7,16,...|(84704,[5,6,7,16,...|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+
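
The `raw_features` and `features` columns look alike in the truncated output because they share the same vocabulary indices; only the weights differ. The sketch below works through the arithmetic on a toy corpus in plain Python. It assumes the smoothed formula `idf = ln((N + 1) / (df + 1))` described in the Spark MLlib documentation for `IDF`; the toy documents and helper names are hypothetical.

```python
import math
from collections import Counter

# Three tiny, made-up "reviews" that have already been tokenized and filtered.
docs = [
    ["mouse", "amazing", "mouse"],
    ["fabric", "nice", "soft"],
    ["mouse", "nice"],
]

n_docs = len(docs)
# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))


def idf(term):
    """Smoothed inverse document frequency: ln((N + 1) / (df + 1))."""
    return math.log((n_docs + 1) / (doc_freq[term] + 1))


def tf_idf(doc):
    """Map each term in the document to (raw count) * (IDF weight)."""
    counts = Counter(doc)
    return {term: count * idf(term) for term, count in counts.items()}


weights = tf_idf(docs[0])
print(weights)
```

In the first toy document, "mouse" occurs twice but appears in two of the three documents, while "amazing" occurs once and appears in only one, so "amazing" ends up with the higher weight.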

**Build a Spark NLP pipeline**

A pipeline is a sequence of NLP operations applied to text data. In Spark NLP, you create a pipeline by chaining together various annotators and transformers.

In [0]:
import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *

In [0]:
# document_assembler_demo = DocumentAssembler().setInputCol("review").setOutputCol("document")
# document_assembler_demo_df = document_assembler_demo.transform(text_df)

# document_assembler_demo_df.show(1, truncate=False)


In [0]:
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler().setInputCol("review").setOutputCol("document")

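# This is Spark NLP's Tokenizer (imported from sparknlp.annotator above),
# not pyspark.ml.feature.Tokenizer used earlier.
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens")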

normalizer = Normalizer() \
    .setInputCols(["tokens"]) \
    .setOutputCol("normalized")

lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(["normalized"]) \
    .setOutputCol("lemmas")

finisher = Finisher() \
    .setInputCols(["lemmas"]) \
    .setCleanAnnotations(False)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    normalizer,
    lemmatizer,
    finisher
])

lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[ | ][OK!]


- DocumentAssembler: The first stage in a Spark NLP pipeline. It converts the raw input text into Spark NLP's "document" annotation format.
- Tokenizer: Takes the document annotations as input and splits the text into individual words or tokens.
- Normalizer: Removes punctuation, numbers, and other non-alphabetic characters from the tokens, producing clean tokens.
- Lemmatizer: Reduces words to their base or dictionary form, known as the lemma (for example, "is" becomes "be"). Standardizing inflected forms this way can improve text analysis.
- Finisher: The final stage in the Spark NLP pipeline. It converts the annotations produced by the previous stages into plain arrays of strings (here, the `finished_lemmas` column) that can be used for further analysis or machine learning tasks.
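
The flow of data through these stages can be sketched in plain Python. This is a simplified illustration, not Spark NLP's implementation: annotations are modeled as dictionaries with `annotatorType`, `begin`, `end`, and `result` keys, the tokenizer only splits on whitespace, and the `LEMMAS` lookup table is a hypothetical stand-in for a pretrained lemmatizer model.

```python
import re


def assemble(text):
    # Wrap the raw text in a single "document" annotation (end index is inclusive).
    return [{"annotatorType": "document", "begin": 0, "end": len(text) - 1, "result": text}]


def tokenize(documents):
    # Emit one "token" annotation per whitespace-separated chunk.
    return [
        {"annotatorType": "token", "begin": m.start(), "end": m.end() - 1, "result": m.group()}
        for doc in documents
        for m in re.finditer(r"\S+", doc["result"])
    ]


def normalize(tokens):
    # Strip non-alphabetic characters and drop tokens that become empty.
    cleaned = []
    for tok in tokens:
        text = re.sub(r"[^A-Za-z]", "", tok["result"])
        if text:
            cleaned.append({**tok, "result": text})
    return cleaned


# Hypothetical lookup table standing in for a pretrained lemma dictionary.
LEMMAS = {"is": "be", "was": "be", "mice": "mouse"}


def lemmatize(tokens):
    # Replace each token with its lemma when one is known.
    return [{**tok, "result": LEMMAS.get(tok["result"], tok["result"])} for tok in tokens]


def finish(tokens):
    # Keep only the text of each annotation, as plain strings.
    return [tok["result"] for tok in tokens]


data = "This mouse is amazing, 100%!"
for stage in [assemble, tokenize, normalize, lemmatize, finish]:
    data = stage(data)
print(data)
# ['This', 'mouse', 'be', 'amazing']
```

Each stage consumes the output of the previous one, which is the reason every annotator in the real pipeline declares its input columns explicitly.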


Transform the text DataFrame using the Spark NLP pipeline:

In [0]:
pipeline_model = pipeline.fit(text_df)
processed_df = pipeline_model.transform(text_df)

In [0]:
processed_df.show()

+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|rating|              review|            document|              tokens|          normalized|              lemmas|     finished_lemmas|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|   4.0|Worked as expecte...|[{document, 0, 10...|[{token, 0, 5, Wo...|[{token, 0, 5, Wo...|[{token, 0, 5, Wo...|[Worked, as, expe...|
|   5.0|This mouse is ama...|[{document, 0, 46...|[{token, 0, 3, Th...|[{token, 0, 3, Th...|[{token, 0, 3, Th...|[This, mouse, be,...|
|   4.0|we recently had a...|[{document, 0, 79...|[{token, 0, 1, we...|[{token, 0, 1, we...|[{token, 0, 1, we...|[we, recently, ha...|
|   3.0|Works good for a ...|[{document, 0, 23...|[{token, 0, 4, Wo...|[{token, 0, 4, Wo...|[{token, 0, 4, Wo...|[Works, good, for...|
|   2.0|Fabric is nice an...|[{document, 0, 14...|[{tok

In [0]:
processed_df.select("review", "finished_lemmas").show(4, truncate=False)