## Task
In this notebook you will work with the spark-nlp library to extract information from the `body` column of the questions dataset.

* spark-nlp [docs](https://nlp.johnsnowlabs.com/docs/en/quickstart)

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import size, col, sum, expr, desc, length
from pyspark.ml import Pipeline

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, NerConverter, WordEmbeddingsModel, PerceptronModel, NerCrfModel

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('NLP I')
    .config('spark.jars.packages', 'com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.3')
    .config('spark.executor.memory', '20g')  # the memory is needed to run various parts of this notebook
    .config('spark.driver.memory', '10g')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

# the project root is three levels above the current working directory
project_path = '/'.join(base_path.split('/')[0:-3])

data_input_path = os.path.join(project_path, 'data/questions-json')
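
The slicing in `project_path` drops the last three components of the working directory, so it walks three levels up. Below is a minimal sketch of the same computation with `pathlib`. The directory `/home/user/project/notebooks/nlp/solutions` is a hypothetical layout chosen for illustration, not the real one.

```python
from pathlib import PurePosixPath

# Hypothetical working directory, for illustration only
base_path = '/home/user/project/notebooks/nlp/solutions'

# Approach used in this notebook: drop the last three path components
project_path = '/'.join(base_path.split('/')[0:-3])

# Equivalent with pathlib: parents[2] is the third ancestor of the path
project_path_alt = str(PurePosixPath(base_path).parents[2])

# Build the input path from the project root
data_input_path = str(PurePosixPath(project_path_alt) / 'data' / 'questions-json')

print(project_path)
print(data_input_path)
```

If the notebook is moved to a different depth in the project tree, the number of dropped components has to change accordingly.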

In [None]:
# we will take only a small sample of the data (1%) to speed up the transformations

dataDF = (
    spark
    .read
    .format('json')
    .option('path', data_input_path)
    .load()
    .withColumnRenamed('title', 'Text')
    .sample(0.01)
)
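
Note that `sample(0.01)` keeps each row independently with probability 0.01, so the sample size is only approximately 1% of the data and, without a seed, differs between runs. The plain-Python sketch below imitates that behaviour on a hypothetical list of row ids. It illustrates the idea and is not Spark's implementation; `bernoulli_sample` is a made-up helper name.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100_000))  # hypothetical row ids

# Two draws with the same seed select exactly the same rows
sample_a = bernoulli_sample(rows, 0.01, seed=42)
sample_b = bernoulli_sample(rows, 0.01, seed=42)

print(len(sample_a))         # close to 1000, but generally not exactly 1000
print(sample_a == sample_b)  # same seed, same sample
```

In Spark, passing a seed, for example `.sample(fraction=0.01, seed=42)`, is the usual way to make the sample repeatable, so the counts computed later in the notebook stay comparable between runs.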

## Compute the number of sentences in the dataset.
### Hint
* use [documentAssembler](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/document_assembler/index.html) as the entry point in the Spark NLP lib
* use [sentenceDetector](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/sentence/sentence_detector/index.html) to split the text into sentences
* use [Pipeline](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Pipeline.html) to specify both steps and fit it on the DataFrame to create a model
* use the model to transform the DataFrame; this adds a new array-type column to the DataFrame
* use [size](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.size.html#pyspark.sql.functions.size) to compute number of elements in the array
* sum the size across the entire DataFrame using [agg](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.agg.html#pyspark.sql.DataFrame.agg) and [sum](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.sum.html#pyspark.sql.functions.sum)
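
After the transformation, each row holds an array of sentence annotations. `size` gives the array length per row and `sum` adds those lengths up. Here is a plain-Python sketch of that arithmetic on hypothetical rows that are already split into sentences. The sentences are made up, and only the `result` field of each annotation is modelled.

```python
import builtins

# Hypothetical rows, each already split into sentence annotations
rows = [
    {'sentence': [{'result': 'How do I read JSON?'}, {'result': 'I use Spark.'}]},
    {'sentence': [{'result': 'What is a DataFrame?'}]},
    {'sentence': []},
]

# Equivalent of size('sentence'): number of annotations per row
sizes = [len(row['sentence']) for row in rows]

# Equivalent of agg(sum(...)): total across all rows.
# builtins.sum is used explicitly because this notebook imports
# `sum` from pyspark.sql.functions, which shadows the built-in.
total_sentences = builtins.sum(sizes)

print(sizes, total_sentences)
```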

In [None]:
# your code here

documentAssembler = (
    DocumentAssembler()
    .setInputCol('Text')
    .setOutputCol('document')
)

sentenceDetector = (
    SentenceDetector()
    .setInputCols(['document'])
    .setOutputCol('sentence')
)

In [None]:
model = Pipeline().setStages([documentAssembler, sentenceDetector]).fit(dataDF)

In [None]:
(
    model.transform(dataDF)
    .withColumn('sentences', size('sentence'))
    .agg(sum('sentences'))
).show()

In [None]:
# check the schema of the dataframe transformed by the model:

model.transform(dataDF).printSchema()

In [None]:
# check the extracted sentences:

model.transform(dataDF).select('sentence').show(truncate=100)

## Convert the `Text` column to tokens
### Hint
* use [Tokenizer](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/token/tokenizer/index.html)
* use [Pipeline](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Pipeline.html) and set the stages with the documentAssembler and Tokenizer
* fit the pipeline on the DataFrame to create a model
* use the model to transform the DataFrame


In [None]:
tokenizer = (
    Tokenizer()
    .setInputCols(['document'])
    .setOutputCol('token')
)

model = Pipeline().setStages([documentAssembler, tokenizer]).fit(dataDF)

model.transform(dataDF).select('token').show(truncate=100)

## Compute NER (Named Entity Recognition)
### Hint
* compute POS (part-of-speech tags)
  * use `PerceptronModel.pretrained("pos_anc", "en")`
  * see [docs](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/pos/perceptron/index.html#sparknlp.annotator.pos.perceptron.PerceptronModel)
* compute embeddings
  * use `WordEmbeddingsModel.pretrained("glove_100d", "en")`
  * see [docs](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/embeddings/word_embeddings/index.html#sparknlp.annotator.embeddings.word_embeddings.WordEmbeddingsModel)
* compute NER
  * use `NerCrfModel.pretrained("ner_crf", "en")`
  * see [docs](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/ner/ner_crf/index.html#sparknlp.annotator.ner.ner_crf.NerCrfModel)
  * use [NerConverter](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/ner/ner_converter/index.html#sparknlp.annotator.ner.ner_converter.NerConverter) to convert the NER tags into a user-friendly shape
* build a Pipeline from all the stages
* fit the pipeline on the data to create a model
* transform the data using the model
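
The NER model emits one tag per token, typically in IOB style (`B-PER`, `I-PER`, `O`, ...), and `NerConverter` groups those tags into whole entities. The sketch below is a simplified plain-Python illustration of that grouping on made-up tokens and tags. It is not Spark NLP's implementation, and `iob_to_chunks` is a hypothetical helper name.

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        # 'B-PER' -> ('B', 'PER'); 'O' -> ('O', '')
        prefix, _, label = tag.partition('-')

        # An I- tag with the same label continues the running entity
        if prefix == 'I' and current_tokens and label == current_label:
            current_tokens.append(token)
            continue

        # Any other tag closes the running entity, if there is one
        if current_tokens:
            chunks.append((' '.join(current_tokens), current_label))

        if prefix in ('B', 'I'):
            # Start a new entity
            current_tokens, current_label = [token], label
        else:
            # Outside of any entity
            current_tokens, current_label = [], None

    # Close an entity that runs until the end of the text
    if current_tokens:
        chunks.append((' '.join(current_tokens), current_label))
    return chunks


tokens = ['John', 'Snow', 'works', 'in', 'New', 'York', '.']
tags = ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'O']

entities = iob_to_chunks(tokens, tags)
print(entities)
```

This is why the pipeline ends with a converter stage: the raw `ner` column is token-level, while the `entities` column contains one annotation per recognized entity.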

In [None]:
# POS tagger:
pos_tagger = (
    PerceptronModel.pretrained('pos_anc', 'en') 
    .setInputCols(['document', 'token']) 
    .setOutputCol('pos')
)

# WordEmbeddings:
embeddings = (
    WordEmbeddingsModel.pretrained('glove_100d', 'en') 
    .setInputCols(['document', 'token']) 
    .setOutputCol('word_embeddings')
)

# NerCrfModel:
ner = (
    NerCrfModel.pretrained('ner_crf', 'en') 
    .setInputCols(['document', 'token', 'pos', 'word_embeddings']) 
    .setOutputCol('ner')
)

# NerConverter:
ner_converter = (
    NerConverter()
    .setInputCols(['document', 'token', 'ner'])
    .setOutputCol('entities')
)

In [None]:
# build the pipeline from all the stages

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    pos_tagger,
    embeddings,
    ner,
    ner_converter
])

In [None]:
# fit the pipeline to the data to create the model

model = pipeline.fit(dataDF)

In [None]:
# transform the data using the model

results = model.transform(dataDF)

In [None]:
# see the result

(
    results.select('entities')
).show(truncate=100)

In [None]:
spark.stop()