# Text Encoding with DJL Spark Support

In this example, we use a Jupyter Notebook to run text encoding with the DJL Spark extension in Scala. To run the Scala kernel, you need to install [Almond](https://almond.sh/), a Scala kernel for Jupyter Notebook. Almond provides extensive functionality for Scala and Spark applications.

[Almond installation instructions](https://almond.sh/docs/quick-start-install) (Note: only Scala 2.12 has been tested)

After that, you can start with DJL's Scala notebook.


## Import dependencies

First, let's import the dependencies we need.

In [None]:
import $ivy.`org.apache.spark::spark-sql:3.3.2`
import $ivy.`ai.djl:api:0.24.0`
import $ivy.`ai.djl.spark:spark:0.24.0`

Then we can import the packages we need. The last two lines turn off logging for the `org` and `ai` packages (Spark and DJL) to avoid cluttering the cell outputs.

In [None]:
import org.apache.spark.sql.NotebookSparkSession
import ai.djl.spark.task.text.{TextDecoder, TextEncoder}
import org.apache.spark.sql.SparkSession

import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF) // silence loggers under org.* (Spark and its dependencies)
Logger.getLogger("ai").setLevel(Level.OFF) // silence loggers under ai.* (DJL)

## Start Spark application

We can create a `NotebookSparkSession` through the Almond Spark plugin. Internally, it makes all the necessary jars available to each worker node.

In [None]:
// Create Spark session
val spark = {
  NotebookSparkSession.builder()
    .master("local[*]")
    .getOrCreate()
}
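
If you adapt this notebook to a standalone application, `NotebookSparkSession` is not available because it comes from Almond. A minimal sketch of the usual substitute, a plain `SparkSession`, is shown below. The application name and the `plainSpark` variable are hypothetical, and in a standalone setup you would have to put the DJL jars on the classpath yourself (for example through your build tool or `spark-submit`).

```scala
import org.apache.spark.sql.SparkSession

// Plain Spark session for use outside Almond (a sketch, not part of the notebook flow).
// "djl-text-encoding" is a hypothetical application name.
val plainSpark = SparkSession.builder()
  .appName("djl-text-encoding")
  .master("local[*]") // run locally, using all available cores
  .getOrCreate()      // reuses an already running session if there is one
```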

Let's create a DataFrame with text values using Spark:

In [None]:
val df = spark.createDataFrame(Seq(
  (1, "Hello, y'all! How are you?"),
  (2, "Hello to you too!"),
  (3, "I'm fine, thank you!")
)).toDF("id", "text")
df.show(truncate=false)

Then we can run encoding on the text. All we need to do is create a `TextEncoder`, select the "bert-base-cased" tokenizer with `setHfModelId`, and run the encoding with DJL. The output is a StructType column named "encoded".

In [None]:
val encoder = new TextEncoder()
  .setInputCol("text")
  .setOutputCol("encoded")
  .setHfModelId("bert-base-cased")
var encDf = encoder.encode(df)
encDf.printSchema()
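
The nested fields of the `encoded` struct can be addressed with dot notation. The sketch below assumes the `encDf` DataFrame from the cell above and the `ids` field that the decoding step reads later; the `num_tokens` alias is only illustrative. Check the `printSchema()` output for the full list of fields.

```scala
import org.apache.spark.sql.functions.{col, size}

// Assumes `encDf` from the previous cell, with a struct column "encoded"
// that contains an array field "ids".
encDf
  .select(
    col("id"),
    col("encoded.ids").as("ids"),               // pull one field out of the struct
    size(col("encoded.ids")).as("num_tokens")   // number of token ids per row
  )
  .show(truncate = false)
```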

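Then we can run decoding on the encoded text above. The cell first flattens the "encoded" struct with `encoded.*` so that its `ids` field becomes a top-level column the decoder can read. After that, all we need to do is create a `TextDecoder`, select the "bert-base-cased" tokenizer with `setHfModelId`, and run the decoding with DJL. The output is a StringType column named "decoded".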

In [None]:
encDf = encDf.select("id", "text", "encoded.*")
val decoder = new TextDecoder()
  .setInputCol("ids")
  .setOutputCol("decoded")
  .setHfModelId("bert-base-cased")
val decDf = decoder.decode(encDf)
decDf.printSchema()
decDf.select("id", "text", "ids", "decoded").show(truncate=false)
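
To sanity-check a single string without Spark, you can call DJL's `HuggingFaceTokenizer` directly. This is a minimal sketch, assuming the `ai.djl.huggingface:tokenizers` artifact at the same version as the other DJL dependencies, and assuming the Spark encoder and decoder delegate to this tokenizer. The sample sentence is taken from the DataFrame above; the first call downloads the tokenizer definition, so it needs network access.

```scala
// Assumption: the tokenizers artifact version matches the DJL version used above.
import $ivy.`ai.djl.huggingface:tokenizers:0.24.0`
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer

// Load the tokenizer for the same model id used by the Spark encoder.
val tokenizer = HuggingFaceTokenizer.newInstance("bert-base-cased")

val encoding = tokenizer.encode("Hello to you too!")
val ids = encoding.getIds                  // token ids as an Array[Long]
println(encoding.getTokens.mkString(" "))  // the tokens behind those ids
println(ids.mkString(", "))

val roundTrip = tokenizer.decode(ids)      // map the ids back to text
println(roundTrip)

tokenizer.close()                          // release the native tokenizer handle
```

Comparing `roundTrip` with the "decoded" column for the same row is a quick way to confirm that both paths agree.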