#**A Sample BERT Model Filtering Interface**

This is a demonstration of how the BERT UI interacts with the user.
Disclaimer: Due to GitHub's file size limitations, we could only host a smaller dataset, so this notebook does not fully demonstrate our model.

Please refer to BERT_Embeddings_Filter.ipynb for the full documentation.

#**Set Up Spark NLP and PySpark**

In [None]:
! wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2022-05-15 03:48:13--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://setup.johnsnowlabs.com/colab.sh [following]
--2022-05-15 03:48:13--  https://setup.johnsnowlabs.com/colab.sh
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2022-05-15 03:48:13--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:44

In [None]:
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import numpy as np
from pyspark.ml.linalg import *
from pyspark.sql.types import * 
from pyspark.sql.functions import *
from pyspark.ml.feature import *
import sparknlp

spark = sparknlp.start(gpu=True)

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  3.4.4
Apache Spark version:  3.0.3


#**Retrieving Dataset Sample**


In [None]:
!wget https://raw.githubusercontent.com/azraf-a/BERT_SparkNLP_Filter/main/final_data_small.csv -O final_data_small.csv

final_data = spark\
.read\
.option("inferSchema","true")\
.option("header", "true")\
.csv("final_data_small.csv")

df = (final_data.withColumn("Description", regexp_replace('Description', '\\.(?=\\s|$)', '')))

--2022-05-15 04:03:36--  https://raw.githubusercontent.com/azraf-a/BERT_SparkNLP_Filter/main/final_data_small.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1565040 (1.5M) [text/plain]
Saving to: ‘final_data_small.csv’


2022-05-15 04:03:36 (27.2 MB/s) - ‘final_data_small.csv’ saved [1565040/1565040]
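
The `regexp_replace` call in the cell above removes every period that is followed by whitespace or by the end of the string, while periods inside a token (such as `4.5`) are left alone. The sketch below applies the same pattern with Python's `re` module. Spark evaluates the pattern with Java regex semantics, which should agree with Python for this expression. The helper name and the sample sentence are made up for illustration and do not come from the dataset.

```python
import re

# Same pattern as in the regexp_replace call: a literal period, followed by
# whitespace or the end of the string. The lookahead keeps the whitespace.
PERIOD_BEFORE_SPACE_OR_END = re.compile(r"\.(?=\s|$)")

def strip_sentence_periods(text: str) -> str:
    """Drop periods that precede whitespace or end the string; keep inner ones."""
    return PERIOD_BEFORE_SPACE_OR_END.sub("", text)

# Hypothetical description, not taken from final_data_small.csv.
sample = "Dr. Jones rates it 4.5 stars. A must-see."
cleaned = strip_sentence_periods(sample)
print(cleaned)
```

The period in `4.5` survives because the character after it is a digit, not whitespace.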



##**Generate BERT Recommendation Pipeline**

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("Description") \
    .setOutputCol("document")
sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_128") \
    .setInputCols(["sentence"]) \
    .setOutputCol("sentence_bert_embeddings")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["sentence_bert_embeddings"]) \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True)
    
pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    embeddings,
    embeddingsFinisher
])

brp = BucketedRandomProjectionLSH(
    inputCol="result",
    outputCol="hashes",
    bucketLength=2.0,
    numHashTables=3)

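# Embed every description, then flatten to one row per sentence embedding.
movielens_bert = pipeline.fit(df).transform(df)
movielens_bert = movielens_bert.selectExpr("title","Description","explode(finished_embeddings) as result")
# Fit the LSH model on the embedding column ("result").
model = brp.fit(movielens_bert)
movielens_bert.cache()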

sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]


DataFrame[title: string, Description: string, result: vector]
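
To give a feel for what `BucketedRandomProjectionLSH` computes, here is a minimal pure-Python sketch of bucketed random projection hashing. Each hash table projects the vector onto a random unit vector and takes `floor(projection / bucketLength)` as the bucket index. This is an illustration only, not Spark's implementation: the function names, the seed, and the toy vectors are assumptions, and the 128 dimensions are inferred from the model name `sent_small_bert_L2_128`.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a Gaussian vector and scale it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def lsh_hashes(vector, projections, bucket_length):
    """One bucket index per hash table: floor(dot(vector, r) / bucket_length)."""
    return [
        math.floor(sum(a * b for a, b in zip(vector, r)) / bucket_length)
        for r in projections
    ]

rng = random.Random(0)   # arbitrary fixed seed so reruns give the same buckets
dim = 128                # assumed embedding size, inferred from the model name
projections = [random_unit_vector(dim, rng) for _ in range(3)]  # numHashTables=3

# Toy vectors standing in for sentence embeddings (not real BERT output).
base = [rng.gauss(0.0, 1.0) for _ in range(dim)]
near = [c + 0.001 for c in base]   # tiny perturbation of base
far = [c + 50.0 for c in base]     # large shift away from base

for name, vec in (("base", base), ("near", near), ("far", far)):
    print(name, lsh_hashes(vec, projections, bucket_length=2.0))
```

Vectors that are close in Euclidean distance tend to fall into the same buckets, which is what lets the approximate search look only at candidates sharing a bucket with the query. A larger `bucketLength` puts more vectors into each bucket, and more hash tables give a query more chances to collide with a true neighbor.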

##**Sampling Random Movies from the Smaller Dataset**

In [None]:
movielens_bert.orderBy(rand()).limit(5).show(truncate=False)

+----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

##**User Queries**

In [None]:
print("Try running queries here. Please note that queries are run against a smaller subset of the total MovieLens dataset.")
print("To ensure you're searching for a movie inside the dataset, try running the cell above to get 5 randomly selected movies that exist inside this smaller dataset!")
print()
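title = input("Enter a movie you've watched recently and please use the title as posted at https://www.themoviedb.org/ (for example, try Thor or Saw VI or Inception or Son of God) :")
# Look up the embedding for the requested title; stop with a clear message if it is missing.
matches = movielens_bert.filter(movielens_bert.title == title).select("result").take(1)
if not matches:
    raise ValueError(f"'{title}' was not found in the smaller dataset; run the cell above to see titles that exist.")
query_vec1 = matches[0][0]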

print("You are searching for neighbors to : ", title)
print()
print("Approximately searching df for 3 nearest neighbors of query 1 (The first item listed is the queried movie):")
model.approxNearestNeighbors(movielens_bert, query_vec1, 4).select("title", "Description").show(truncate=False)

Try running queries here. Please note that queries are run against a smaller subset of the total MovieLens dataset.
To ensure you're searching for a movie inside the dataset, try running the cell above to get 5 randomly selected movies that exist inside this smaller dataset!

Enter a movie you've watched recently and please use the title as posted at https://www.themoviedb.org/ (for example, try Thor or Saw VI or Inception) :Son of God
You are searching for neighbors to :  Son of God

Approximately searching df for 3 nearest neighbors of query 1 (The first item listed is the queried movie):
+-----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
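
For comparison, the sketch below shows the exact search that `approxNearestNeighbors` approximates: rank every item by Euclidean distance to the query vector and keep the closest `k`. It also shows why the cell above asks for 4 results to get 3 recommendations, since the queried movie is at distance 0 from itself and comes first. The catalog, titles, and 2-dimensional vectors are hypothetical placeholders, not data from the notebook.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_nearest(items, query_vec, k):
    """Return the k (title, distance) pairs closest to query_vec."""
    scored = [(title, euclidean(vec, query_vec)) for title, vec in items]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Hypothetical 2-d vectors; real sentence embeddings have many more dimensions.
catalog = [
    ("Movie A", [0.0, 0.0]),
    ("Movie B", [1.0, 0.0]),
    ("Movie C", [0.0, 2.0]),
    ("Movie D", [5.0, 5.0]),
]

query_title = "Movie A"
query_vec = dict(catalog)[query_title]

# Ask for k = 3: the query itself plus its two closest neighbors.
neighbors = exact_nearest(catalog, query_vec, 3)
for title, dist in neighbors:
    print(f"{title}: {dist:.2f}")
```

The exact scan compares the query against every row, which is fine for a small sample but grows linearly with the catalog. The LSH model trades a little accuracy for speed by comparing only candidates that share a hash bucket with the query.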