## Task
In this notebook you will use the spark-nlp library to extract information from the `body` column of the questions dataset.

* spark-nlp [docs](https://nlp.johnsnowlabs.com/docs/en/quickstart)

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import size, col, sum, expr, explode, desc, length
from pyspark.ml import Pipeline

import os
import re

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, BertEmbeddings, NerDLModel, NerConverter

In [None]:
spark = (
    SparkSession
    .builder
    .appName('NLP I')
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2")
    .config("spark.executor.memory", "20g")  # the memory is needed to run various parts of this notebook
    .config("spark.driver.memory", "10g")
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

# the project root is three directory levels above the current working directory
project_path = '/'.join(base_path.split('/')[:-3])

data_input_path = os.path.join(project_path, 'data/questions')

In [None]:
dataDF = (
    spark
    .read
    .option('path', data_input_path)
    .load()
    .withColumnRenamed('title', 'Text')
)

## Compute the number of sentences in the dataset.
### Hint
* use [DocumentAssembler](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/document_assembler/index.html) as the entry point to the Spark NLP library
* use [SentenceDetector](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/sentence/sentence_detector/index.html) to split the text into sentences
* use [Pipeline](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Pipeline.html) to specify both steps and fit it on the DataFrame to create a model
* use the model to transform the DataFrame. This adds a new column of array type to the DataFrame
* use [size](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.size.html#pyspark.sql.functions.size) to compute the number of elements in the array
* sum the sizes across the entire DataFrame using [agg](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.agg.html#pyspark.sql.DataFrame.agg) and [sum](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.sum.html#pyspark.sql.functions.sum)
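
One possible shape for these steps is sketched below. It is not the reference solution: the function name `count_sentences` and the intermediate column names `document` and `sentence` are assumptions made for illustration, and the input column is assumed to be the `Text` column created when the data was loaded.

```python
def count_sentences(df, text_col="Text"):
    """Return the total number of sentences detected in `text_col` of `df`."""
    # Local imports keep this sketch standalone; the notebook's import cell
    # already provides the same names.
    from pyspark.ml import Pipeline
    from pyspark.sql import functions as F
    from sparknlp.annotator import SentenceDetector
    from sparknlp.base import DocumentAssembler

    # Entry point: wrap the raw text in a "document" annotation column.
    document_assembler = (
        DocumentAssembler()
        .setInputCol(text_col)
        .setOutputCol("document")
    )
    # Split every document into sentences (one array element per sentence).
    sentence_detector = (
        SentenceDetector()
        .setInputCols(["document"])
        .setOutputCol("sentence")
    )

    pipeline = Pipeline(stages=[document_assembler, sentence_detector])
    annotated = pipeline.fit(df).transform(df)

    # size() counts the sentences per row; sum() adds them up over all rows.
    row = annotated.agg(F.sum(F.size("sentence")).alias("sentence_count")).first()
    return row["sentence_count"]
```

Under these assumptions, `count_sentences(dataDF)` would return a single number for the whole dataset.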

In [None]:
# your code here


## Convert the `Text` column to tokens
### Hint
* use [Tokenizer](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/token/tokenizer/index.html)
* use [Pipeline](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Pipeline.html) and set the stages to the DocumentAssembler and the Tokenizer
* fit the pipeline on the DataFrame to create a model
* use the model to transform the DataFrame
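
The steps above could look roughly like the following sketch. The function name `tokenize` and the output column names `document` and `token` are illustrative assumptions, not names required by the exercise.

```python
def tokenize(df, text_col="Text"):
    """Return `df` with an extra `token` annotation column built from `text_col`."""
    # Local imports keep this sketch standalone; the notebook's import cell
    # already provides the same names.
    from pyspark.ml import Pipeline
    from sparknlp.annotator import Tokenizer
    from sparknlp.base import DocumentAssembler

    document_assembler = (
        DocumentAssembler()
        .setInputCol(text_col)
        .setOutputCol("document")
    )
    # The Tokenizer reads the document annotations and emits one annotation per token.
    tokenizer = (
        Tokenizer()
        .setInputCols(["document"])
        .setOutputCol("token")
    )

    pipeline = Pipeline(stages=[document_assembler, tokenizer])
    return pipeline.fit(df).transform(df)
```

To inspect only the token strings, something like `tokenize(dataDF).select("token.result").show(5)` should work, since `result` holds the text of each annotation.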



In [None]:
# your code here


## Compute embeddings for the tokens.
### Hint
* load the pretrained BERT model `bert_base_cased` with [BertEmbeddings](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/embeddings/bert_embeddings/index.html) by calling `BertEmbeddings.pretrained('bert_base_cased', 'en')`
* then define the Pipeline as in the previous questions and add the embeddings as another stage to create a new model
* finally, use the model to transform the DataFrame
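
A minimal sketch of such a pipeline follows. The function name `embed_tokens` and the column names `document`, `token` and `embeddings` are assumptions chosen for illustration; the call to `pretrained` downloads the model on first use, so it needs network access.

```python
def embed_tokens(df, text_col="Text"):
    """Return `df` with `token` and `embeddings` annotation columns added."""
    # Local imports keep this sketch standalone; the notebook's import cell
    # already provides the same names.
    from pyspark.ml import Pipeline
    from sparknlp.annotator import BertEmbeddings, Tokenizer
    from sparknlp.base import DocumentAssembler

    document_assembler = (
        DocumentAssembler()
        .setInputCol(text_col)
        .setOutputCol("document")
    )
    tokenizer = (
        Tokenizer()
        .setInputCols(["document"])
        .setOutputCol("token")
    )
    # Pretrained BERT model named in the hint; it consumes documents and tokens.
    embeddings = (
        BertEmbeddings.pretrained("bert_base_cased", "en")
        .setInputCols(["document", "token"])
        .setOutputCol("embeddings")
    )

    pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings])
    return pipeline.fit(df).transform(df)
```

Each element of the resulting `embeddings` column carries one vector per token, so the output is large; selecting only a few rows is a reasonable way to look at it.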

In [None]:
# your code here


## Compute NER (Named Entity Recognition)
### Hint
* use a pretrained [NerDLModel](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/ner/ner_dl/index.html#sparknlp.annotator.ner.ner_dl.NerDLModel) model.
* specifically, use `ner_conll_bert_base_cased`, which is compatible with the `bert_base_cased` embeddings computed in the previous question. Pass it as the argument to the pretrained method: `NerDLModel.pretrained('ner_conll_bert_base_cased', 'en')`
* the display function may fail on the embeddings because they are large. In that case use show() instead, which truncates the output by default. (You can also drop the embeddings column before calling display.)
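
As a rough sketch under the same assumptions as before (hypothetical function name `tag_entities`, illustrative column names `document`, `token`, `embeddings` and `ner`), the NER stage can be appended to the embedding pipeline like this:

```python
def tag_entities(df, text_col="Text"):
    """Return `df` with a `ner` column holding one tag per token."""
    # Local imports keep this sketch standalone; the notebook's import cell
    # already provides the same names.
    from pyspark.ml import Pipeline
    from sparknlp.annotator import BertEmbeddings, NerDLModel, Tokenizer
    from sparknlp.base import DocumentAssembler

    document_assembler = (
        DocumentAssembler()
        .setInputCol(text_col)
        .setOutputCol("document")
    )
    tokenizer = (
        Tokenizer()
        .setInputCols(["document"])
        .setOutputCol("token")
    )
    embeddings = (
        BertEmbeddings.pretrained("bert_base_cased", "en")
        .setInputCols(["document", "token"])
        .setOutputCol("embeddings")
    )
    # NER model named in the hint; it must be fed the matching BERT embeddings.
    ner = (
        NerDLModel.pretrained("ner_conll_bert_base_cased", "en")
        .setInputCols(["document", "token", "embeddings"])
        .setOutputCol("ner")
    )

    stages = [document_assembler, tokenizer, embeddings, ner]
    return Pipeline(stages=stages).fit(df).transform(df)
```

To keep the output readable, the large column can be dropped before printing, for example `tag_entities(dataDF).drop("embeddings").select("Text", "ner.result").show(5)`.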

In [None]:
# your code here


## Extract the entities from the result and find the most frequent ones.
### Hint
* use [NerConverter](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/ner/ner_converter/index.html#sparknlp.annotator.ner.ner_converter.NerConverter) as another step in the pipeline. It converts the NER output to a friendlier representation.
* fit the pipeline again and transform the DataFrame
* filter only for rows where the output is not empty using [size](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.size.html#pyspark.sql.functions.size)('entities') > 0
* use the higher-order function [TRANSFORM](https://spark.apache.org/docs/latest/api/sql/index.html#transform) to extract the `result` and `entity` fields from the `entities` array (the `entity` label is stored in each element's `metadata` map)
* [explode](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.explode.html#pyspark.sql.functions.explode) the final array
* finally, group by entity, count the number of occurrences, and sort the result in descending order using [orderBy](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.orderBy.html#pyspark.sql.DataFrame.orderBy) and [desc](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.desc.html#pyspark.sql.functions.desc)
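
The whole chain might be sketched as follows. This is one interpretation rather than the official solution: the names `most_frequent_entities` and `ENTITY_PAIRS_SQL`, the output field names `text` and `entity`, and the choice to group by both the entity text and its label are assumptions made here.

```python
# TRANSFORM expression: keep the chunk text (`result`) and its label, which
# NerConverter stores under the "entity" key of the `metadata` map.
ENTITY_PAIRS_SQL = (
    "transform(entities, e -> "
    "named_struct('text', e.result, 'entity', e.metadata['entity']))"
)


def most_frequent_entities(df, text_col="Text"):
    """Return (text, entity, count) rows sorted by descending count."""
    # Local imports keep this sketch standalone; the notebook's import cell
    # already provides the same names.
    from pyspark.ml import Pipeline
    from pyspark.sql import functions as F
    from sparknlp.annotator import (
        BertEmbeddings, NerConverter, NerDLModel, Tokenizer,
    )
    from sparknlp.base import DocumentAssembler

    stages = [
        DocumentAssembler().setInputCol(text_col).setOutputCol("document"),
        Tokenizer().setInputCols(["document"]).setOutputCol("token"),
        BertEmbeddings.pretrained("bert_base_cased", "en")
        .setInputCols(["document", "token"])
        .setOutputCol("embeddings"),
        NerDLModel.pretrained("ner_conll_bert_base_cased", "en")
        .setInputCols(["document", "token", "embeddings"])
        .setOutputCol("ner"),
        # Merge token-level tags into whole entity chunks.
        NerConverter()
        .setInputCols(["document", "token", "ner"])
        .setOutputCol("entities"),
    ]
    annotated = Pipeline(stages=stages).fit(df).transform(df)

    return (
        annotated
        .filter(F.size("entities") > 0)                         # rows with entities
        .select(F.explode(F.expr(ENTITY_PAIRS_SQL)).alias("pair"))
        .select("pair.text", "pair.entity")                     # flatten the struct
        .groupBy("text", "entity")
        .count()
        .orderBy(F.desc("count"))
    )
```

Calling `most_frequent_entities(dataDF).show(20)` would then list the most common entities first; grouping by `entity` alone instead would give counts per label.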



In [None]:
# your code here

