![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Use pretrained `explain_document_ml` Pipeline

### Stages

 * DocumentAssembler
 * SentenceDetector
 * Tokenizer
 * Lemmatizer
 * Stemmer
 * Part of Speech
 * SpellChecker (Norvig)

In [None]:
import sys
import time

# Spark ML and SQL
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql.functions import array_contains
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Spark NLP
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import DocumentAssembler, Finisher

### Let's create a Spark Session for our app

In [None]:
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

#### This is our test document. Its misspellings are deliberate; we'll use it to exercise every pipeline stage.

In [None]:
testDoc = [
"Frenchg author who helped pioner the science-fiction genre. \
Verne wrate about space, aisr, and underwater travel befdaore \
navigable aircrast and practical submarines were invented, \
and before any means of space travel had been devised. "
]

In [None]:
pipeline = PretrainedPipeline('explain_document_ml', lang='en')

#### We are not handling big datasets here, so let's use `annotate`, which runs the pipeline as a LightPipeline for speed.

In [None]:
result = pipeline.annotate(testDoc)

#### Let's analyze these results - first let's see what sentences we detected

In [None]:
[content['sentence'] for content in result]

#### Now let's see how those sentences were tokenized

In [None]:
[content['token'] for content in result]

#### Notice some spelling errors? The pipeline takes care of those as well

In [None]:
[content['spell'] for content in result]

#### Now let's see the lemmas

In [None]:
[content['lemmas'] for content in result]

#### Let's check the stems. Any difference from the lemmas shown before?

In [None]:
[content['stems'] for content in result]
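
To answer the question above without reading the two lists by eye, here is a minimal sketch that lists only the positions where lemma and stem disagree. It is plain Python and does not call Spark NLP. The helper name `compare_lemmas_and_stems` and the sample values are assumptions for illustration, and it assumes the `lemmas` and `stems` lists are aligned token by token.

```python
def compare_lemmas_and_stems(annotation):
    """Return (lemma, stem) pairs that differ within one annotated document."""
    return [
        (lemma, stem)
        for lemma, stem in zip(annotation["lemmas"], annotation["stems"])
        if lemma != stem
    ]

# Hand-written stand-in for one element of `result`; the values are
# illustrative, not actual pipeline output.
sample = {
    "lemmas": ["travel", "invent", "submarine"],
    "stems": ["travel", "invent", "submarin"],
}
print(compare_lemmas_and_stems(sample))
```

With the real output, `compare_lemmas_and_stems(result[0])` would list the differing pairs for the first document.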

#### Now it's the turn of Part of Speech (POS)

In [None]:
pos = [content['pos'] for content in result]
token = [content['token'] for content in result]
# let's put token and tag together
list(zip(token[0], pos[0]))
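
The cell above pairs tokens and tags for the first document only. When several documents are annotated at once, the same pairing can be applied to each of them. This is a minimal sketch in plain Python; the helper name `pair_tokens_with_tags` and the sample data are assumptions for illustration, and it expects a list of dicts with `token` and `pos` keys like the `result` used above.

```python
def pair_tokens_with_tags(annotations):
    """Pair each token with its POS tag, one list of pairs per document."""
    return [list(zip(doc["token"], doc["pos"])) for doc in annotations]

# Hand-written stand-in for `result`; the tags are illustrative,
# not actual pipeline output.
sample_result = [
    {"token": ["Verne", "wrote"], "pos": ["NNP", "VBD"]},
    {"token": ["The", "book"], "pos": ["DT", "NN"]},
]
for pairs in pair_tokens_with_tags(sample_result):
    print(pairs)
```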

# Use pretrained `match_chunks` Pipeline to Extract Individual Noun Phrases

### Stages

 * DocumentAssembler
 * SentenceDetector
 * Tokenizer
 * Part of Speech
 * Chunker

The pipeline uses the regex `<DT>?<JJ>*<NN>+`, which states that whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then one or more nouns (NN), a Noun Phrase (NP) chunk is formed.
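
To make the pattern concrete, here is a minimal pure-Python sketch that applies the same grammar to hand-tagged tokens using the standard `re` module. It only illustrates how the pattern reads; it is not the Spark NLP Chunker implementation. The helper name `np_chunks` and the tags in the example are assumptions.

```python
import re

# Illustration of the <DT>?<JJ>*<NN>+ grammar only; this is a hypothetical
# helper, not how the Spark NLP Chunker is implemented.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN>)+")

def np_chunks(tagged):
    """Return noun-phrase chunks from a list of (token, tag) pairs."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    # Map the character offset where each tag starts to its token index.
    starts = {}
    offset = 0
    for index, (_, tag) in enumerate(tagged):
        starts[offset] = index
        offset += len(tag) + 2  # the tag plus its "<" and ">"
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        first = starts[match.start()]
        count = match.group().count("<")  # number of tags matched
        chunks.append(" ".join(tok for tok, _ in tagged[first:first + count]))
    return chunks

# Hand-assigned tags for illustration; a real tagger may label words differently.
tagged = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]
print(np_chunks(tagged))  # ['the little yellow dog', 'the cat']
```

Note that `<NN>` matches only the exact tag `NN`, so a plural noun tagged `NNS` does not start or extend a chunk under this pattern.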

In [None]:
pipeline = PretrainedPipeline('match_chunks', lang='en')

In [None]:
result = pipeline.annotate("The book has many chapters") # single noun phrase

In [None]:
result['chunk']

In [None]:
result = pipeline.annotate("the little yellow dog barked at the cat") # multiple noun phrases

In [None]:
result['chunk']

In [None]:
result