![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/openvino/HuggingFace_OpenVINO_in_Spark_NLP_Bart.ipynb)

# Import OpenVINO GPT2 models from HuggingFace 🤗 into Spark NLP 🚀

This notebook provides a detailed walkthrough on optimizing and exporting GPT2 models from HuggingFace for use in Spark NLP, leveraging the various tools provided in the [Intel OpenVINO toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html) ecosystem.

Let's keep in mind a few things before we start 😊

- OpenVINO support was introduced in `Spark NLP 5.4.0`, enabling high-performance inference for models. Please make sure you have upgraded to the latest Spark NLP release.


## 1. Export and Save the HuggingFace model

- Let's install `transformers` and `openvino` along with the other dependencies. Spark NLP itself does not need `openvino` to be installed, but we need it here to load and export models from HuggingFace.
- We pin `transformers` to version `4.39.3`. This doesn't mean it won't work with future releases, but we want you to know which versions have been tested successfully.

In [1]:
!pip install -q --upgrade transformers==4.39.3
!pip install -q --upgrade openvino==2024.3
!pip install -q --upgrade optimum-intel==1.18.3
!pip install -q --upgrade onnx==1.12.0
!pip install --upgrade huggingface-hub

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
openvino-dev 2024.6.0 requires openvino==2024.6.0, but you have openvino 2024.3.0 which is incompatible.

[Optimum Intel](https://github.com/huggingface/optimum-intel?tab=readme-ov-file#openvino) is the interface between the Transformers library and the various model optimization and acceleration tools provided by Intel. HuggingFace models loaded with optimum-intel are automatically optimized for OpenVINO, while being compatible with the Transformers API.
- To load a HuggingFace model directly for inference/export, just replace the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class. We can use this to import and export OpenVINO models with `from_pretrained` and `save_pretrained`.
- By setting `export=True`, the source model is converted to OpenVINO IR format on the fly.
- We'll use the [openai-community/gpt2](https://huggingface.co/openai-community/gpt2) model from HuggingFace and export it to OpenVINO format.
- In addition to the exported OpenVINO model, we also need the tokenizer files. The `optimum-cli` export below writes them alongside the model. This is the same for every model: these are assets (saved in `/assets`) needed for tokenization inside Spark NLP.
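
The `OVModelForXxx` route described above can be sketched in Python as an alternative to the `optimum-cli` command used in the next cell. This is a minimal sketch, assuming `optimum-intel` is installed with OpenVINO support and that `OVModelForCausalLM` is the matching class for a GPT2 text-generation model. The helper name `export_gpt2_to_openvino` and its arguments are illustrative and not part of the original notebook, and passing `use_cache=False` is an assumption intended to mirror the `--task text-generation` export (no past key values).

```python
from pathlib import Path


def export_gpt2_to_openvino(model_name: str, export_path: str) -> Path:
    """Convert a HuggingFace causal LM to OpenVINO IR and save it with its tokenizer.

    Hypothetical helper for illustration. Imports are deferred so the function
    can be defined even when optimum-intel is not installed.
    """
    from optimum.intel import OVModelForCausalLM
    from transformers import AutoTokenizer

    out_dir = Path(export_path)
    out_dir.mkdir(parents=True, exist_ok=True)

    # export=True converts the source weights to OpenVINO IR on the fly.
    # use_cache=False is an assumption meant to match an export without
    # past key values.
    model = OVModelForCausalLM.from_pretrained(
        model_name, export=True, use_cache=False
    )
    model.save_pretrained(out_dir)

    # Save the tokenizer files (vocab.json, merges.txt, ...) next to the model.
    AutoTokenizer.from_pretrained(model_name).save_pretrained(out_dir)
    return out_dir
```

Calling `export_gpt2_to_openvino("openai-community/gpt2", "ov_models/openai-community/gpt2")` would download the model and write the converted files to that directory. The rest of this notebook uses the command-line export instead.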

In [2]:
MODEL_NAME = "openai-community/gpt2"
EXPORT_PATH = f"ov_models/{MODEL_NAME}"

# Export the model to OpenVINO IR; the tokenizer files are written alongside it
! optimum-cli export openvino --model {MODEL_NAME} --task text-generation {EXPORT_PATH}
# Create the directory that will hold the tokenizer assets
!mkdir -p {EXPORT_PATH}/assets

config.json: 100%|█████████████████████████████| 665/665 [00:00<00:00, 55.2kB/s]
Framework not specified. Using pt to export the model.
model.safetensors: 100%|█████████████████████| 548M/548M [00:42<00:00, 12.9MB/s]
generation_config.json: 100%|██████████████████| 124/124 [00:00<00:00, 10.2kB/s]
The task `text-generation` was manually specified, and past key values will not be reused in the decoding. if needed, please pass `--task text-generation-with-past` to export using the past key values.
tokenizer_config.json: 100%|█████████████████| 26.0/26.0 [00:00<00:00, 9.07kB/s]
vocab.json: 100%|██████████████████████████| 1.04M/1.04M [00:00<00:00, 2.87MB/s]
merges.txt: 100%|████████████████████████████| 456k/456k [00:00<00:00, 8.77MB/s]
tokenizer.json: 100%|██████████████████████| 1.36M/1.36M [00:00<00:00, 3.13MB/s]
Using framework PyTorch: 2.6.0+cu124
Overriding 1 configuration item(s)
	- use_cache -> False
OpenVINO Tokenizers is not available. To deploy models in pr

In [3]:
! mv -t {EXPORT_PATH}/assets {EXPORT_PATH}/*.json {EXPORT_PATH}/*.txt
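
The `-t` flag of `mv` is a GNU coreutils option and may not be available on macOS or other non-GNU systems. A portable way to do the same move from Python could look like the sketch below. The helper name `move_tokenizer_assets` is hypothetical, and the temporary directory with empty files only stands in for a real export directory.

```python
import shutil
import tempfile
from pathlib import Path


def move_tokenizer_assets(export_path: Path) -> list:
    """Move top-level *.json and *.txt files into export_path/assets."""
    assets = export_path / "assets"
    assets.mkdir(parents=True, exist_ok=True)
    moved = []
    for pattern in ("*.json", "*.txt"):
        # glob() only matches direct children, so files already in assets/ are skipped.
        for src in sorted(export_path.glob(pattern)):
            shutil.move(str(src), str(assets / src.name))
            moved.append(src.name)
    return moved


# Demonstration on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    export_dir = Path(tmp)
    for name in ("config.json", "merges.txt", "openvino_model.xml"):
        (export_dir / name).write_text("", encoding="utf-8")
    moved = move_tokenizer_assets(export_dir)
    remaining = sorted(p.name for p in export_dir.iterdir())
```

The model files (`.xml` and `.bin`) stay in the export directory, and only the JSON and text files end up under `assets`.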

In [4]:
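import json

# vocab.json maps each token to its id; write the tokens one per line to vocab.txt
with open(f"{EXPORT_PATH}/assets/vocab.json", encoding="utf-8") as f:
    output_json = json.load(f)

with open(f"{EXPORT_PATH}/assets/vocab.txt", "w", encoding="utf-8") as f:
    for key in output_json.keys():
        print(key, file=f)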
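
The conversion above writes tokens in the order they appear in `vocab.json`. If the line number in `vocab.txt` is expected to correspond to the token id (an assumption here, not something this notebook states), sorting by id explicitly makes the result independent of the key order in the JSON file. The helper name `write_vocab_txt` and the toy vocabulary are illustrative only.

```python
import json
import tempfile
from pathlib import Path


def write_vocab_txt(vocab_json: Path, vocab_txt: Path) -> list:
    """Write one token per line, ordered by token id, and return the ordered tokens."""
    with open(vocab_json, encoding="utf-8") as f:
        vocab = json.load(f)  # maps token -> id
    tokens = sorted(vocab, key=vocab.get)
    with open(vocab_txt, "w", encoding="utf-8") as f:
        for token in tokens:
            print(token, file=f)
    return tokens


# Toy vocabulary whose keys are deliberately out of id order.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "vocab.json"
    dst = Path(tmp) / "vocab.txt"
    src.write_text(json.dumps({"world": 2, "!": 0, "hello": 1}), encoding="utf-8")
    ordered = write_vocab_txt(src, dst)
    lines = dst.read_text(encoding="utf-8").splitlines()
```

For a `vocab.json` that is already stored in id order, this produces the same file as the cell above.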

In [5]:
!ls -l {EXPORT_PATH}/assets

total 3736
-rw-rw----+ 1 alegp97 alegp97     896 may  9 14:28 config.json
-rw-rw----+ 1 alegp97 alegp97     119 may  9 14:28 generation_config.json
-rw-rw----+ 1 alegp97 alegp97  456318 may  9 14:28 merges.txt
-rw-rw----+ 1 alegp97 alegp97      99 may  9 14:28 special_tokens_map.json
-rw-rw----+ 1 alegp97 alegp97     444 may  9 14:28 tokenizer_config.json
-rw-rw----+ 1 alegp97 alegp97 2107652 may  9 14:28 tokenizer.json
-rw-rw----+ 1 alegp97 alegp97  798156 may  9 14:28 vocab.json
-rw-rw----+ 1 alegp97 alegp97  406992 may  9 14:28 vocab.txt


## 2. Import and Save GPT2 in Spark NLP

- Let's install and set up Spark NLP in Google Colab.
- This part is pretty easy with our simple script.

In [6]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Installing PySpark 3.2.3 and Spark NLP 6.0.0
setup Colab for PySpark 3.2.3 and Spark NLP 6.0.0


Let's start Spark with Spark NLP included via our simple `start()` function

In [7]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()
print("Apache Spark version: {}".format(spark.version))

Apache Spark version: 3.5.1


- Let's use the `loadSavedModel` function in `GPT2Transformer`, which allows us to load the OpenVINO model.
- Most params will be set automatically. They can also be set later, after loading the model in `GPT2Transformer` at runtime, so don't worry about setting them now.
- `loadSavedModel` accepts two params: the first is the path to the exported model, and the second is the SparkSession, which is the `spark` variable we started earlier via `sparknlp.start()`.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.

In [8]:
from sparknlp.annotator import *

gpt2 = GPT2Transformer.loadSavedModel(EXPORT_PATH, spark)\
  .setInputCols(["document"])\
  .setMaxOutputLength(50)\
  .setDoSample(True)\
  .setTopK(50)\
  .setTemperature(0)\
  .setBatchSize(5)\
  .setNoRepeatNgramSize(3)\
  .setOutputCol("generation")

25/05/09 14:29:53 WARN NativeLibrary: Failed to load library null: java.lang.UnsatisfiedLinkError: Can't load library: /tmp/openvino-native10225245130208108791/libtbb.so.2
25/05/09 14:29:53 WARN NativeLibrary: Failed to load library null: java.lang.UnsatisfiedLinkError: /tmp/openvino-native10225245130208108791/libopenvino.so: libtbb.so.12: cannot open shared object file: No such file or directory
25/05/09 14:29:53 WARN NativeLibrary: Failed to load library null: java.lang.UnsatisfiedLinkError: /tmp/openvino-native10225245130208108791/libinference_engine_java_api.so: libopenvino.so.2410: cannot open shared object file: No such file or directory
25/05/09 14:29:53 ERROR OpenvinoWrapper$: Could not initialize OpenVINO Core. Please make sure the jsl-openvino JAR is loaded and Intel oneTBB is installed.
(See https://www.intel.com/content/www/us/en/docs/onetbb/get-started-guide/2021-12/overview.html)


Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.annotators.seq2seq.GPT2Transformer.loadSavedModel.
: java.lang.UnsatisfiedLinkError: 'long org.intel.openvino.Core.GetCore()'
	at org.intel.openvino.Core.GetCore(Native Method)
	at org.intel.openvino.Core.<init>(Core.java:22)
	at com.johnsnowlabs.ml.openvino.OpenvinoWrapper$.liftedTree1$1(OpenvinoWrapper.scala:79)
	at com.johnsnowlabs.ml.openvino.OpenvinoWrapper$.<init>(OpenvinoWrapper.scala:78)
	at com.johnsnowlabs.ml.openvino.OpenvinoWrapper$.<clinit>(OpenvinoWrapper.scala)
	at com.johnsnowlabs.nlp.annotators.seq2seq.ReadGPT2TransformerDLModel.loadSavedModel(GPT2Transformer.scala:614)
	at com.johnsnowlabs.nlp.annotators.seq2seq.ReadGPT2TransformerDLModel.loadSavedModel$(GPT2Transformer.scala:571)
	at com.johnsnowlabs.nlp.annotators.seq2seq.GPT2Transformer$.loadSavedModel(GPT2Transformer.scala:632)
	at com.johnsnowlabs.nlp.annotators.seq2seq.GPT2Transformer.loadSavedModel(GPT2Transformer.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
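
The log above reports that `libtbb.so.12` could not be opened and asks for Intel oneTBB to be installed. A rough preflight check before calling `loadSavedModel` could look like the sketch below. It assumes that `ctypes.util.find_library` sees the same libraries as the JVM's native loader, which is not guaranteed, so treat a positive result as a hint and not as proof. The helper name `find_tbb` is hypothetical.

```python
import ctypes.util


def find_tbb():
    """Return the name of a TBB shared library visible to the system linker, or None."""
    # "tbb" covers the usual Linux/macOS naming; "tbb12" is tried as a fallback.
    for name in ("tbb", "tbb12"):
        found = ctypes.util.find_library(name)
        if found:
            return found
    return None


tbb_library = find_tbb()
if tbb_library is None:
    print("oneTBB not found on the linker path; install it before loading OpenVINO models.")
else:
    print(f"Found TBB library: {tbb_library}")
```

If the check reports that the library is missing, installing oneTBB through the system package manager and restarting the Spark session is the usual next step.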


Let's save it to disk so it is easier to move around and can be loaded later with the `.load` function.

In [None]:
gpt2.write().overwrite().save(f"{MODEL_NAME}_spark_nlp")

Let's clean up the files we don't need anymore.

In [None]:
!rm -rf {EXPORT_PATH}

Awesome 😎!

This is your OpenVINO GPT2 model from HuggingFace 🤗 loaded and saved by Spark NLP 🚀

In [None]:
! ls -l {MODEL_NAME}_spark_nlp

total 487664
drwxr-xr-x 4 root root      4096 Sep  7 19:43 fields
-rw-r--r-- 1 root root 499355270 Sep  7 19:44 gpt2_onnx
drwxr-xr-x 2 root root      4096 Sep  7 19:43 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny GPT2 model 😊

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

test_data = spark.createDataFrame([
    ["Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a " +
       "downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness" +
       " of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this " +
       "paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework " +
       "that converts all text-based language problems into a text-to-text format. Our systematic study compares " +
       "pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens " +
       "of language understanding tasks. By combining the insights from our exploration with scale and our new " +
       "Colossal Clean Crawled Corpus, we achieve state-of-the-art results on many benchmarks covering " +
       "summarization, question answering, text classification, and more. To facilitate future work on transfer " +
       "learning for NLP, we release our data set, pre-trained models, and code."]
]).toDF("text")


document_assembler = DocumentAssembler() \
    .setInputCol("text")\
    .setOutputCol("document")

gpt2 = GPT2Transformer.load(f"{MODEL_NAME}_spark_nlp")\
      .setInputCols(["document"])\
      .setMaxOutputLength(50)\
      .setDoSample(True)\
      .setTopK(50)\
      .setTemperature(0)\
      .setBatchSize(5)\
      .setNoRepeatNgramSize(3)\
      .setOutputCol("generation")

pipeline = Pipeline().setStages([document_assembler, gpt2])

result = pipeline.fit(test_data).transform(test_data)
result.show(truncate=False)
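
`result.show()` prints every annotation field, which makes the table very wide. To look only at the generated text, the `result` field of the `generation` column can be exploded into one row per string. This is a minimal sketch: the helper name `generated_texts` is hypothetical, and it assumes the output column holds the usual Spark NLP annotation structs with a `result` field.

```python
def generated_texts(result_df, output_col="generation"):
    """Collect the generated strings from a transformed Spark DataFrame."""
    # `<output_col>.result` is an array with one string per annotation;
    # explode() turns it into one row per generated string.
    rows = result_df.selectExpr(
        f"explode({output_col}.result) AS generated_text"
    ).collect()
    return [row["generated_text"] for row in rows]
```

With the pipeline above, `generated_texts(result)` would return a Python list with one string per input row. Note that `collect()` brings the data to the driver, so this is only suitable for small results.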


That's it! You can now go wild and use hundreds of GPT2 models from HuggingFace 🤗 in Spark NLP 🚀
