# Tests For Natural Language Processing

This notebook tests Spark-NLP for text processing.

## Initialize Spark-NLP
To initialize Spark-NLP, import the library and call its `start()` method.

Spark-NLP also requires the `pyspark` library. If it was not already installed from the requirements.txt file, install it with pip:

```sh
> pip install pyspark==3.3.1
``` 

Java must also be installed. For example, on Ubuntu:

```sh
> sudo apt install openjdk-17-jdk-headless
``` 
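Before starting a session, it can help to confirm that these prerequisites are visible from Python. The following is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical, and the check only looks for a `java` executable on the PATH and for importable `pyspark` and `sparknlp` packages, without verifying their versions.

```python
import importlib.util
import shutil


def check_environment() -> dict:
    """Report whether a Java launcher and the required Python packages are visible."""
    return {
        # shutil.which returns None when no `java` executable is on the PATH
        "java": shutil.which("java") is not None,
        # find_spec returns None when a top-level package cannot be found
        "pyspark": importlib.util.find_spec("pyspark") is not None,
        "sparknlp": importlib.util.find_spec("sparknlp") is not None,
    }


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Any entry reported as missing points to the corresponding install step above.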


In [None]:
import sparknlp

spark = sparknlp.start()
sparknlp.version()

Check the pretrained models available.

## Pretrained Pipeline
The following code shows how to use a pretrained pipeline.
This example is taken from the sparknlp.org site.

In this case, a pretrained model performs a series of NLP tasks on a given text, generally as a preprocessing step.
For example, we would preprocess (annotate) the text before classifying or translating it with a model, or before using it to train one. Of course, some cases need further steps so that the model can process the input, such as converting the words/tokens to numeric values; this is called [word embedding](https://www.turing.com/kb/guide-on-word-embeddings-in-nlp).
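The idea of turning tokens into numbers can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the vectors are random placeholders, and real embeddings are learned from data rather than generated like this.

```python
import random


def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def make_embedding_table(vocab_size, dim, seed=0):
    """Create one random vector per id (placeholder for learned embeddings)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


tokens = ["we", "are", "very", "happy", "we", "are", "learning"]
vocab = build_vocab(tokens)
table = make_embedding_table(len(vocab), dim=4)

ids = [vocab[tok] for tok in tokens]   # tokens -> integer ids
vectors = [table[i] for i in ids]      # ids -> dense vectors

print(ids)  # [0, 1, 2, 3, 0, 1, 4]
```

Repeated tokens share the same id and therefore the same vector, which is what lets a model treat every occurrence of a word consistently.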

Some of these tasks are:


- Token: divides sentences into tokens. A token generally corresponds to a word delimited by spaces, with punctuation marks split off as separate tokens. Under conventions such as the Penn Treebank's, a word like "don't" is split into "do" and "n't", which makes it easier for algorithms to recognize the negation (the pipeline used below keeps "don't" as a single token).

- lemmatization: reduces inflected tokens to a common base form. For example, organize, organizes, and organizing are all converted to organize.

- pos: determines each token's part of speech and assigns it a tag. For example: JJ (adjective), NNP (proper noun, singular), VBP (verb, non-3rd person singular present)... [TAGS](https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html)

- Stemming: similar to lemmatization, but instead of looking up the base form of a word, it simply chops off the end of some words, so the result is not always a real word. For example, in the output below, happy becomes happi and use becomes us.
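The difference between these tasks can be sketched with a few toy functions. None of this is how Spark-NLP implements them: the regular expression, the suffix list, and the lemma dictionary are all assumptions chosen for illustration, and `naive_stem` is not the Porter algorithm.

```python
import re

# Priority order: the clitic "n't", a word stopping just before "n't",
# any other word, or a single punctuation mark.
TOKEN_RE = re.compile(r"n't|\w+?(?=n't)|\w+|[^\w\s]")

# Hypothetical lemma dictionary; real lemmatizers use much larger resources.
LEMMAS = {"organizes": "organize", "organizing": "organize", "are": "be"}


def tokenize(text):
    """Split text into words, "n't" clitics, and punctuation marks."""
    return TOKEN_RE.findall(text)


def naive_stem(token):
    """Chop a few common suffixes without consulting any dictionary."""
    word = token.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lookup_lemma(token):
    """Return the dictionary base form, or the lowercased token if unknown."""
    return LEMMAS.get(token.lower(), token.lower())


words = ["organize", "organizes", "organizing"]
print(tokenize("We don't know."))         # ['We', 'do', "n't", 'know', '.']
print([naive_stem(w) for w in words])     # ['organize', 'organiz', 'organiz']
print([lookup_lemma(w) for w in words])   # ['organize', 'organize', 'organize']
```

The stemmer produces the non-word "organiz" because it only strips characters, while the lemma lookup returns a real base form for every variant.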


In [2]:
from sparknlp.pretrained import PretrainedPipeline
import json

explain_document_pipeline = PretrainedPipeline("explain_document_ml")
annotations = explain_document_pipeline.annotate("We are very happy about SparkNLP. But We don't know how to use it yet.")

for key, array in annotations.items():
    formatted_array = ', '.join(json.dumps(item) for item in array)
    print(f"{key}: [{formatted_array}]")


explain_document_ml download started this may take some time.
Approx size to download 9 MB
Download done! Loading the resource.
[OK!]
document: ["We are very happy about SparkNLP. But We don't know how to use it yet."]
spell: ["We", "are", "very", "happy", "about", "SparkNLP", ".", "But", "We", "don't", "know", "how", "to", "use", "it", "yet", "."]
pos: ["PRP", "VBP", "RB", "JJ", "IN", "NNP", ".", "CC", "PRP", "VBP", "VB", "WRB", "TO", "VB", "PRP", "RB", "."]
lemmas: ["We", "be", "very", "happy", "about", "SparkNLP", ".", "But", "We", "don't", "know", "how", "to", "use", "it", "yet", "."]
token: ["We", "are", "very", "happy", "about", "SparkNLP", ".", "But", "We", "don't", "know", "how", "to", "use", "it", "yet", "."]
stems: ["we", "ar", "veri", "happi", "about", "sparknlp", ".", "but", "we", "don't", "know", "how", "to", "us", "it", "yet", "."]
sentence: ["We are very happy about SparkNLP.", "But We don't know how to use it yet."]
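
Because the `token`, `pos`, and `lemmas` lists are aligned by position, they can be combined into one row per token. The sketch below copies those three lists from the output above into a plain dictionary (rather than calling the pipeline again) and lists the tokens whose lemma differs from the surface form.

```python
# Lists copied from the annotate() output shown above.
annotations = {
    "token": ["We", "are", "very", "happy", "about", "SparkNLP", ".", "But",
              "We", "don't", "know", "how", "to", "use", "it", "yet", "."],
    "pos": ["PRP", "VBP", "RB", "JJ", "IN", "NNP", ".", "CC",
            "PRP", "VBP", "VB", "WRB", "TO", "VB", "PRP", "RB", "."],
    "lemmas": ["We", "be", "very", "happy", "about", "SparkNLP", ".", "But",
               "We", "don't", "know", "how", "to", "use", "it", "yet", "."],
}

# One (token, tag, lemma) row per position.
rows = list(zip(annotations["token"], annotations["pos"], annotations["lemmas"]))
for token, pos, lemma in rows:
    print(f"{token:<10}{pos:<6}{lemma}")

# Tokens that the lemmatizer actually changed.
changed = [(token, lemma) for token, _, lemma in rows if token != lemma]
print(changed)  # [('are', 'be')]
```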


## Classify Text Sentiment
The following code shows how to use a pretrained Spark-NLP model to classify text.

In this case, the classifier detects sentiment.

In [3]:
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from pyspark.ml import Pipeline

# Initialize Spark NLP
spark = sparknlp.start()

# Example using a pre-trained sentiment analysis pipeline
# Replace with a relevant pipeline if available
pipeline = PretrainedPipeline("analyze_sentiment", lang="en")

# Sample data
data = spark.createDataFrame([
    (1, "I hate you"),
    (2, "I love this product"),
    (3, "You are an idiot"),
    (4, "This is amazing")
], ["id", "text"])

# Apply pipeline
result = pipeline.transform(data)

result.select("id", "text", "sentiment.result").show(truncate=False)


analyze_sentiment download started this may take some time.


Approx size to download 4,8 MB
Download done! Loading the resource.
[OK!]


                                                                                

+---+-------------------+----------+
|id |text               |result    |
+---+-------------------+----------+
|1  |I hate you         |[negative]|
|2  |I love this product|[positive]|
|3  |You are an idiot   |[positive]|
|4  |This is amazing    |[positive]|
+---+-------------------+----------+
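
A quick way to sanity-check a pretrained classifier is to compare its output with labels assigned by hand. The sketch below copies the rows from the table above; the `expected` labels are an assumption made for illustration, not part of the pipeline or of any dataset.

```python
# Rows copied from the output table above: (id, text, sentiment.result)
rows = [
    (1, "I hate you", ["negative"]),
    (2, "I love this product", ["positive"]),
    (3, "You are an idiot", ["positive"]),
    (4, "This is amazing", ["positive"]),
]

# Hand-assigned labels (an assumption for this spot check).
expected = {1: "negative", 2: "positive", 3: "negative", 4: "positive"}


def first_label(result):
    """The result column is an array; take its first element if present."""
    return result[0] if result else None


mismatches = [
    (row_id, text, first_label(result))
    for row_id, text, result in rows
    if first_label(result) != expected[row_id]
]
accuracy = 1 - len(mismatches) / len(rows)

print(mismatches)  # [(3, 'You are an idiot', 'positive')]
print(accuracy)    # 0.75
```

Four sentences are far too few to judge a model, but a check like this makes surprising predictions easy to spot.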



## Classify Text Hate Speech with pretrained models
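The following code shows how to use a pretrained Spark-NLP model to classify text. Although the code calls `fit`, this does not train the model: the classifier is loaded already trained, and `fit` only assembles the stages into a pipeline model.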

In this case, the classifier checks for hate speech.

In [8]:
import sparknlp
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, BertForSequenceClassification
from pyspark.ml import Pipeline

# 0. Configure the step for document assembler
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

# 1. Configure the step that will get tokens from text
tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")


# 2. Configure the step that will classify the text
sequenceClassifier = BertForSequenceClassification \
    .pretrained('bert_sequence_classifier_dehatebert_mono', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('class') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)

# 3. Configure the step that will obtain the result from the process
#finisher = Finisher() \
#    .setInputCols(["class"]) \
#    .setOutputCols(["prediction"]) \
#    .setCleanAnnotations(False)

# 4. Organize the steps
pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
    #finisher
])

# 5. Classify sample data
# Sample data
example = spark.createDataFrame([
    (1, "I hate you"),
    (2, "You are an amazing person."),
    (3, "This is terrible."),
    (4, "What a wonderful day!"),
    (5, "You are a moron."), 
    (6,"** are cool"), 
    (7,"** are not cool"),
    (8,"RT @TheBeardedOak: R.I.P *****.\r\nDied first on May 16th 1943 when tragically hit by a car.\r\nDied second in July 2020 due to Political Corr…")
], ["id", "text"])

# If we want to modify the model
#model = pipeline.fit(example)

result = pipeline.fit(example).transform(example)
result.select("id", "text", "class").show(truncate=False)
#result.select("id", "text", "prediction").show(truncate=False)

bert_sequence_classifier_dehatebert_mono download started this may take some time.


Approximate size to download 599 MB
[OK!]
+---+------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------+
|id |text                                                                                                                                            |class                                                                                          |
+---+------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------+
|1  |I hate you                                                                                                                                      |[{category, 0, 9, NON_HATE, {sentence -> 0, NON_HATE -> 0.58358485,
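
The cell above calls `fit` even though the classifier is pretrained. The toy classes below sketch why that does not retrain it, following the general Spark ML convention that a pipeline fits only its estimator stages and passes already-fitted transformers through unchanged. All class names here are hypothetical stand-ins, not Spark or Spark-NLP classes, and treating the tokenizer as an estimator is an assumption made for the illustration.

```python
class Transformer:
    """A stage that is ready to use: it only needs transform()."""

    def __init__(self, name):
        self.name = name

    def transform(self, data):
        return data + [self.name]


class Estimator:
    """A stage that must be fitted to produce a Transformer."""

    def __init__(self, name):
        self.name = name
        self.fit_calls = 0

    def fit(self, data):
        self.fit_calls += 1
        return Transformer(f"{self.name}_model")


class ToyPipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(data)   # only estimators are fitted
            fitted.append(stage)          # transformers pass through as-is
            data = stage.transform(data)
        return ToyPipelineModel(fitted)


tokenizer = Estimator("tokenizer")
pretrained = Transformer("pretrained_classifier")
model = ToyPipeline([Transformer("document_assembler"), tokenizer, pretrained]).fit([])

print(model.stages[2] is pretrained)  # True
print(model.transform([]))
# ['document_assembler', 'tokenizer_model', 'pretrained_classifier']
```

The pretrained stage in the fitted model is the very same object that went in, which is the sense in which `fit` leaves the classifier untouched.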

In [9]:
import sparknlp
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, BertForSequenceClassification
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

seq_classifier = BertForSequenceClassification.pretrained("bert_classifier_bert_base_uncased_hatexplain","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, seq_classifier])

example = spark.createDataFrame([
    (1, "I hate you"),
    (2, "You are an amazing person."),
    (3, "This is terrible."),
    (4, "What a wonderful day!"),
    (5, "You are a moron."), 
    (6,"** are cool"), 
    (7,"RT @TheBeardedOak: R.I.P *****.\r\nDied first on May 16th 1943 when tragically hit by a car.\r\nDied second in July 2020 due to Political Corr…")
], ["id", "text"])

result = pipeline.fit(example).transform(example)
result.select("text", "class").show(truncate=False)

data = spark.createDataFrame(
    [(1,"I hate you"), (2,"I love you"), (3,"** are cool"), (4,"** are not cool")], 
    ["id","text"]
)

result = pipeline.fit(data).transform(data)
result.select("text", "class").show(truncate=False)




bert_classifier_bert_base_uncased_hatexplain download started this may take some time.


Approximate size to download 390,4 MB
[OK!]
+-------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                                             |class                                                                                                                           |
+-------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
|I hate you                                                                                                                  