# Find similar question

Use the Spark NLP library (spark-nlp) to compute a sentence embedding for each question, then use cosine similarity to find the question that is most similar to a reference question.

See the [docs](https://sparknlp.org/) for spark-nlp.

In [None]:
import os

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import size, col, sum, expr, desc, concat, lit, regexp_replace, lower, trim
from pyspark.ml import Pipeline

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import *

In [None]:
spark = (
    SparkSession
    .builder
    .appName('sparkNLP')
    .config('spark.jars.packages', 'com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.3')
    .config('spark.executor.memory', '10g')  # the memory is needed to run various parts of this notebook
    .config('spark.driver.memory', '10g')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

project_path = '/'.join(base_path.split('/')[0:-3])  # three directory levels above the working directory

data_input_path = os.path.join(project_path, 'data/questions-json')

In [None]:
dataDF = (
    spark
    .read
    .format('json')
    .option('path', data_input_path)
    .load()
)

In [None]:
# Clean the question body: strip HTML tags, replace newline/tab sequences with spaces, and collapse whitespace

def clean_text(df: DataFrame) -> DataFrame:
    return (
        df.withColumn('body', regexp_replace('body', '<[^>]*>', ''))  # Remove HTML tags
        .withColumn('body', regexp_replace('body', '\\\\n|\\\\r|\\\\t|\\n|\\r|\\t', ' '))  # Replace newlines, carriage returns and tabs (real or escaped as text) with a space
        .withColumn('body', regexp_replace('body', '\\s+', ' '))  # Collapse multiple spaces
        .withColumn('body', trim('body'))  # Trim leading/trailing spaces
    )
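
To see what the three regular expressions in `clean_text` do, it can help to apply the same patterns to a single string outside Spark. The sketch below uses Python's `re` module as a stand-in; the helper name `clean_text_py` and the sample string are made up for illustration, and Python's regex engine is only assumed to behave like Spark's Java regex for these simple patterns.

```python
import re


def clean_text_py(body: str) -> str:
    """Plain-Python approximation of the clean_text steps, for one string."""
    body = re.sub(r'<[^>]*>', '', body)                # remove HTML tags
    body = re.sub(r'\\n|\\r|\\t|\n|\r|\t', ' ', body)  # escaped or real \n, \r, \t -> space
    body = re.sub(r'\s+', ' ', body)                   # collapse runs of whitespace
    return body.strip()                                # trim leading/trailing whitespace


# Hypothetical sample: HTML tags, a real newline and tab, and a literal backslash-n.
sample = '<p>Use <code>repartition</code>\n\tthen\\nsort.</p>  '
print(clean_text_py(sample))  # Use repartition then sort.
```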

1) First filter the questions DataFrame to questions whose tags contain the expression `spark`. This reduces the amount of data and speeds up the embedding computation.
2) Apply the `clean_text` function to the filtered questions.
3) Next overwrite the `title` column with the concatenation of `title` and `body`, so that the embedding has more context.
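
The three steps above could be sketched roughly as follows. This is one possible approach, not the reference solution: the function name `prepare_questions` is hypothetical, the columns `tags`, `title` and `body` are taken from the text, and since the type of `tags` is not stated here, the sketch casts it to a string so that the same filter works for a string or an array column.

```python
def prepare_questions(df, clean_fn):
    """Filter to Spark-related questions, clean the body, merge title and body.

    df       -- the questions DataFrame (assumed columns: tags, title, body)
    clean_fn -- a DataFrame -> DataFrame cleaning function such as clean_text
    """
    # Imported inside the function so the sketch can be read without PySpark installed.
    from pyspark.sql.functions import col, concat, lit, lower

    # Assumption: casting `tags` to string makes a substring match possible
    # regardless of whether the column is a string or an array.
    spark_only = df.filter(lower(col('tags').cast('string')).contains('spark'))

    cleaned = clean_fn(spark_only)

    # concat returns NULL if either input is NULL; rows without a body
    # would need extra handling.
    return cleaned.withColumn('title', concat(col('title'), lit(' '), col('body')))
```

In the notebook this would be called as `prepare_questions(dataDF, clean_text)`.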

In [None]:
# your code here:



## Compute the embeddings for the questions
### Hint
* use [DocumentAssembler](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/document_assembler/index.html) as the entry point in the Spark NLP lib
* use [BertSentenceEmbeddings](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/embeddings/bert_sentence_embeddings/index.html#) to compute the embeddings
* use [Pipeline](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Pipeline.html) to specify both steps and fit it on the DataFrame to create a model
* use the model to transform the DataFrame. This adds a new array-type column to the DataFrame that contains the embeddings
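
A minimal sketch of how the hinted pieces might fit together is shown below. The function names, the intermediate column names `document` and `sentence_embeddings`, and the final `embedding` column are assumptions chosen for illustration; the sketch also assumes that the annotator output is an array of annotations whose `embeddings` field holds the vector.

```python
def build_embedding_pipeline(input_col='title'):
    """Return an unfitted Pipeline: DocumentAssembler -> BertSentenceEmbeddings."""
    # Imported lazily so the sketch can be read without Spark NLP installed.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import BertSentenceEmbeddings

    document_assembler = (
        DocumentAssembler()
        .setInputCol(input_col)
        .setOutputCol('document')
    )
    sentence_embeddings = (
        BertSentenceEmbeddings
        .pretrained('sent_bert_base_uncased', 'en')  # downloads the model on first use
        .setInputCols(['document'])
        .setOutputCol('sentence_embeddings')
    )
    return Pipeline(stages=[document_assembler, sentence_embeddings])


def add_embeddings(df, input_col='title'):
    """Fit the pipeline on df, transform it, and expose the vector as `embedding`."""
    from pyspark.sql.functions import expr

    model = build_embedding_pipeline(input_col).fit(df)
    return (
        model.transform(df)
        # Take the vector of the first (and here only) annotation per row.
        .withColumn('embedding', expr('sentence_embeddings[0].embeddings'))
    )
```

Extracting the vector into its own `embedding` column keeps the later similarity expressions short, because they can then operate on a plain array of numbers.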

In [None]:
# use DocumentAssembler:



In [None]:
# use BertSentenceEmbeddings with the sent_bert_base_uncased model
# use this: BertSentenceEmbeddings.pretrained('sent_bert_base_uncased', 'en') 



In [None]:
# define the pipeline and fit it:



Below is the precomputed embedding of the reference question. If you want to test a different question or use a different model, you will need to compute the embedding yourself.

In [None]:
reference_question = "How can I get the first and last row of each partition in PySpark after using repartition and sortWithinPartitions?"
reference_embedding = [
    -0.30178409814834595,
     -0.19708700478076935,
     0.15369994938373566,
     0.0939711257815361,
     0.3758930265903473,
     0.08161608874797821,
     -0.03837691247463226,
     0.02317805029451847,
     -0.09031324833631516,
     -0.378640741109848,
     0.0442027673125267,
     -0.2808622121810913,
     -0.3001766800880432,
     0.3082522451877594,
     -0.34695732593536377,
     0.5568943023681641,
     -0.3406010866165161,
     0.08446189016103745,
     -0.4601571559906006,
     -0.011745428666472435,
     0.24236643314361572,
     0.09660684317350388,
     -0.7524130344390869,
     0.47307536005973816,
     0.6893370151519775,
     -0.04322933778166771,
     0.28004246950149536,
     0.04987960308790207,
     -0.1138516217470169,
     0.010363995097577572,
     0.22644250094890594,
     0.2790714502334595,
     0.2975863516330719,
     -0.26814553141593933,
     -0.07684861123561859,
     -0.2711309790611267,
     -0.10638158768415451,
     -0.06883128732442856,
     -0.03899205103516579,
     0.12820465862751007,
     -0.44228196144104004,
     -0.23025663197040558,
     0.34290367364883423,
     0.0378534160554409,
     -0.025388848036527634,
     -0.05891880765557289,
     0.22706739604473114,
     -0.0017251564422622323,
     0.09438951313495636,
     -0.16475561261177063,
     -0.6486769914627075,
     0.17522284388542175,
     0.12571455538272858,
     0.15245836973190308,
     0.04244507849216461,
     0.4613300561904907,
     0.21268592774868011,
     -0.3897077739238739,
     -0.06714511662721634,
     -0.2076021134853363,
     -0.1019292101264,
     -0.09873979538679123,
     0.17479419708251953,
     -0.2835703194141388,
     0.08135509490966797,
     -0.0925772413611412,
     0.1329706460237503,
     0.45464026927948,
     -0.7338437438011169,
     -0.1317150741815567,
     -0.30478397011756897,
     -0.13949401676654816,
     -0.07301975041627884,
     0.187718465924263,
     -0.42080044746398926,
     0.010787555947899818,
     -0.2980436682701111,
     0.17510899901390076,
     0.21055014431476593,
     -0.1472564935684204,
     -0.2557067573070526,
     0.38267338275909424,
     -0.13468363881111145,
     0.33781224489212036,
     0.5624389052391052,
     -0.048148758709430695,
     -0.20927508175373077,
     0.04072866588830948,
     -0.3709409236907959,
     0.382205605506897,
     -0.074553944170475,
     0.012394404970109463,
     -0.11037153750658035,
     0.19901372492313385,
     0.4230853319168091,
     -0.11323663592338562,
     -0.1489659696817398,
     -0.11554533988237381,
     0.17790982127189636,
     0.4730387330055237,
     0.04123180732131004,
     -0.22823508083820343,
     -0.04965947940945625,
     -0.11652124673128128,
     -0.26018470525741577,
     -0.21879711747169495,
     -0.28960639238357544,
     0.22966213524341583,
     0.303631991147995,
     0.3668553829193115,
     0.0485733263194561,
     0.08010977506637573,
     -0.06236279010772705,
     -0.14096732437610626,
     -0.03195207566022873,
     0.11987362802028656,
     -0.19482292234897614,
     -0.06172975152730942,
     0.3109027147293091,
     -0.43834832310676575,
     -0.0810699611902237,
     0.27466005086898804,
     0.12789691984653473,
     1.0642324686050415,
     -0.09696020185947418,
     0.14424200356006622,
     0.07426901906728745,
     0.5947192907333374,
     0.05770682543516159,
     -0.24017831683158875,
     0.2424989938735962,
     0.45836976170539856,
     0.6285819411277771,
     -0.34213992953300476,
     -0.20424826443195343,
     -0.17910034954547882,
     -0.1767565757036209,
     -0.027041004970669746,
     -0.40715140104293823,
     -0.3159913420677185,
     -0.017789583653211594,
     -0.22005297243595123,
     0.37197282910346985,
     0.17415502667427063,
     0.3394246995449066,
     0.27434372901916504,
     0.058024775236845016,
     -0.13786494731903076,
     -0.1266781985759735,
     0.3782951533794403,
     0.1418980211019516,
     0.4133802652359009,
     -0.3590308427810669,
     -0.46262872219085693,
     -0.33027780055999756,
     -0.07705603539943695,
     0.2263164520263672,
     0.4525968134403229,
     -0.2578897178173065,
     -0.025954008102416992,
     0.5023742914199829,
     -0.005888945888727903,
     -0.15454652905464172,
     0.08619049936532974,
     -0.08527686446905136,
     -0.45395180583000183,
     0.27677324414253235,
     0.44782835245132446,
     0.03332305699586868,
     0.1514490246772766,
     -0.13049805164337158,
     -0.15472108125686646,
     0.37226447463035583,
     -0.007784188725054264,
     0.07711733877658844,
     -0.008034148253500462,
     0.37832310795783997,
     -0.06661960482597351,
     0.5237884521484375,
     0.0002052822383120656,
     -0.723609983921051,
     0.41940319538116455,
     0.015614881180226803,
     0.14085455238819122,
     -0.003545263549312949,
     0.12105090171098709,
     0.2052862048149109,
     -0.5683361291885376,
     0.20738841593265533,
     0.2350941002368927,
     -0.6172229647636414,
     -0.16668610274791718,
     -0.2311754822731018,
     -0.41962048411369324,
     0.48334628343582153,
     0.02290220744907856,
     -0.13690407574176788,
     -0.22479164600372314,
     -0.3712921440601349,
     0.010634449310600758,
     0.11918144673109055,
     0.057880718261003494,
     0.07639604061841965,
     -0.32835593819618225,
     0.13521726429462433,
     -0.0173820648342371,
     -0.34187400341033936,
     -0.034068457782268524,
     -0.8430806994438171,
     0.15938173234462738,
     -0.7061365842819214,
     0.46103137731552124,
     -0.03631477430462837,
     -0.00376415834762156,
     -0.10413558781147003,
     0.12354352325201035,
     -0.01792198233306408,
     -0.013701935298740864,
     0.2207133024930954,
     -0.07175825536251068,
     0.2969892919063568,
     -0.062346234917640686,
     -0.36725932359695435,
     0.27192914485931396,
     -0.3424476385116577,
     1.0353460311889648,
     0.00010838912567123771,
     0.07233232259750366,
     0.7770789861679077,
     0.5194998979568481,
     -0.16262266039848328,
     -0.1409246176481247,
     0.13407061994075775,
     0.08759194612503052,
     0.2556590139865875,
     0.13831673562526703,
     -0.20415447652339935,
     -0.03554953634738922,
     0.3417550325393677,
     0.0707956850528717,
     -0.26129940152168274,
     0.470436692237854,
     0.42138808965682983,
     -0.1972520798444748,
     -0.018833095207810402,
     0.030662912875413895,
     -0.3953421115875244,
     0.03783306106925011,
     0.1619846522808075,
     -0.40780341625213623,
     -0.48519596457481384,
     -0.36192891001701355,
     -0.0718427449464798,
     -0.8042348027229309,
     -0.35646429657936096,
     -0.035953715443611145,
     0.023174477741122246,
     -0.009908843785524368,
     -0.2737221419811249,
     0.3925217390060425,
     0.04849648475646973,
     0.03553785756230354,
     -0.050270937383174896,
     0.09138578176498413,
     -0.38934463262557983,
     -0.5464734435081482,
     -0.24541090428829193,
     0.11259450763463974,
     0.6439579725265503,
     0.030625293031334877,
     -0.21863700449466705,
     -0.5109617114067078,
     -0.1834828406572342,
     0.45458468794822693,
     -0.5699089169502258,
     -0.2553756833076477,
     0.14749039709568024,
     0.17908428609371185,
     -0.21878157556056976,
     -0.17523381114006042,
     0.12926413118839264,
     0.4236261248588562,
     -0.39215847849845886,
     0.03584416210651398,
     -0.31356868147850037,
     -0.12153778225183487,
     0.3253122866153717,
     -0.06431534886360168,
     -0.5588430762290955,
     -0.5334948897361755,
     0.05076585337519646,
     0.38722744584083557,
     -0.2927577495574951,
     -0.061412159353494644,
     0.6947567462921143,
     0.39683955907821655,
     0.09949979931116104,
     -0.017063476145267487,
     -0.36704376339912415,
     -0.31324779987335205,
     -0.2758723199367523,
     -0.3866256773471832,
     0.13976570963859558,
     0.11628952622413635,
     -0.6111676096916199,
     -0.0792013481259346,
     -0.3392672836780548,
     -0.1233425959944725,
     -4.351192474365234,
     -0.2740230858325958,
     -0.14115290343761444,
     0.40219393372535706,
     0.0683966726064682,
     -0.04185524210333824,
     -0.2069893628358841,
     -0.21592971682548523,
     -0.3439629077911377,
     -0.006271071266382933,
     -0.05685687065124512,
     -0.4705599546432495,
     -0.01927715539932251,
     0.23416060209274292,
     0.10558763891458511,
     0.28999269008636475,
     0.14920561015605927,
     0.12155339121818542,
     0.1061423122882843,
     0.5223740935325623,
     -0.467450350522995,
     -0.5195315480232239,
     0.18833674490451813,
     -0.333377867937088,
     0.05877644568681717,
     0.2661980390548706,
     -0.17847467958927155,
     -0.005600540433079004,
     -0.27504977583885193,
     -0.2921060025691986,
     -0.1497303694486618,
     -0.3774574398994446,
     0.3317543566226959,
     0.338726282119751,
     -0.3306295573711395,
     0.08458831906318665,
     -0.003188084112480283,
     -0.3284481465816498,
     -0.24468472599983215,
     -0.2608281672000885,
     -0.058811403810977936,
     0.07763990014791489,
     0.3455803096294403,
     -0.3010557293891907,
     0.9316583275794983,
     -0.1723729819059372,
     0.14263151586055756,
     -0.2382686883211136,
     0.10605166852474213,
     0.464462548494339,
     -0.11621169000864029,
     -0.0585365854203701,
     -0.03646261245012283,
     -0.11100076884031296,
     -0.03132642060518265,
     -0.1075524166226387,
     0.18750527501106262,
     0.2657329738140106,
     -0.24904517829418182,
     -0.2233189195394516,
     0.0857217013835907,
     -0.339830219745636,
     0.019875047728419304,
     -0.21782056987285614,
     -0.32245996594429016,
     -0.276040256023407,
     -0.8557724952697754,
     -0.3633749783039093,
     -0.18469157814979553,
     -0.040419802069664,
     -0.1913837045431137,
     -0.07136143743991852,
     -0.4066154956817627,
     -0.8621860146522522,
     -0.06405483186244965,
     0.009469076059758663,
     0.2031208872795105,
     -0.059074144810438156,
     0.21657155454158783,
     0.21551959216594696,
     -0.8516822457313538,
     -0.6391984224319458,
     0.14272074401378632,
     -0.19050255417823792,
     -0.23636826872825623,
     -0.2567095160484314,
     -0.12715648114681244,
     -0.2548961043357849,
     -0.20068560540676117,
     -0.31933459639549255,
     -0.07239078730344772,
     -0.1441214680671692,
     0.09570792317390442,
     0.283069908618927,
     0.2579619288444519,
     0.06800274550914764,
     0.38475143909454346,
     -0.2825963795185089,
     0.42042645812034607,
     0.06362414360046387,
     0.4188879728317261,
     -0.27621331810951233,
     0.32387450337409973,
     0.14399230480194092,
     -0.022722771391272545,
     0.40372347831726074,
     -0.5916458368301392,
     0.17026938498020172,
     0.3527926206588745,
     0.049775559455156326,
     0.13768984377384186,
     -0.1721336543560028,
     0.5541738271713257,
     -0.6056422591209412,
     -0.2442193478345871,
     0.1549926996231079,
     0.2686009705066681,
     0.2340429276227951,
     0.2839624881744385,
     -0.17411015927791595,
     -0.34005850553512573,
     0.7093517780303955,
     -0.48208311200141907,
     -0.19984294474124908,
     -0.4038844704627991,
     -0.07171820849180222,
     0.031830061227083206,
     0.05998101830482483,
     -0.12145014852285385,
     -0.2987116575241089,
     -0.38342347741127014,
     -0.19925452768802643,
     -0.3514118492603302,
     0.10924828797578812,
     0.17192865908145905,
     -0.3793765902519226,
     -0.11992333829402924,
     0.05825672671198845,
     -0.21609941124916077,
     0.21551533043384552,
     0.3110406696796417,
     0.22568874061107635,
     -0.06536965817213058,
     -0.13788224756717682,
     -0.058358460664749146,
     0.32137125730514526,
     0.2981645464897156,
     -0.06410913169384003,
     0.03175748884677887,
     0.06027095392346382,
     -0.4819735586643219,
     -0.5521308183670044,
     0.14261330664157867,
     -0.2868664860725403,
     0.11616923660039902,
     -0.11875956505537033,
     0.18264451622962952,
     -0.3759809136390686,
     0.2868663966655731,
     -0.3624308705329895,
     0.3771396279335022,
     0.2750980257987976,
     0.3630715310573578,
     -0.10467895120382309,
     0.20167218148708344,
     0.5495109558105469,
     0.14963148534297943,
     -0.039929475635290146,
     -0.2797887623310089,
     0.04892564192414284,
     -0.4964466392993927,
     -0.06616495549678802,
     0.30336740612983704,
     0.210565447807312,
     0.1900172233581543,
     0.4428388476371765,
     0.2698492109775543,
     0.07720645517110825,
     0.3583415150642395,
     0.6034777164459229,
     0.08027783036231995,
     0.08055121451616287,
     -0.21678221225738525,
     0.052599936723709106,
     0.5353500247001648,
     0.11582808941602707,
     0.024658387526869774,
     -0.06596594303846359,
     0.12291833013296127,
     -0.0692330151796341,
     -0.5107376575469971,
     0.4438939690589905,
     -0.3641442358493805,
     -0.49152690172195435,
     -0.15208156406879425,
     -0.4302481710910797,
     0.3103487193584442,
     -0.0483727864921093,
     0.14224064350128174,
     0.18511050939559937,
     0.04011574015021324,
     7.188176095951349e-05,
     -0.4560277760028839,
     0.20052938163280487,
     0.16825532913208008,
     -0.17861898243427277,
     0.09495014697313309,
     0.023161254823207855,
     0.08930686861276627,
     -0.09670156985521317,
     0.1165153905749321,
     -0.25785666704177856,
     -0.3927076458930969,
     -0.6332149505615234,
     -0.08897460252046585,
     0.06734783202409744,
     0.07748738676309586,
     -0.4597119390964508,
     -0.06269870698451996,
     0.19610099494457245,
     -0.3053820729255676,
     -0.10050918906927109,
     0.45423024892807007,
     0.7255757451057434,
     -0.16754382848739624,
     -0.047957565635442734,
     0.012037741020321846,
     -0.05578304082155228,
     -0.44798731803894043,
     0.36419299244880676,
     0.23835599422454834,
     -0.7418327927589417,
     0.03412292152643204,
     0.11836604028940201,
     -0.1337360441684723,
     -0.5700799822807312,
     0.12565985321998596,
     -0.36039528250694275,
     -0.2997814118862152,
     -0.2791460156440735,
     -0.12145042419433594,
     -0.22347380220890045,
     -0.14263325929641724,
     -0.2555077373981476,
     0.24259242415428162,
     0.07142355293035507,
     -0.16354475915431976,
     0.4129716455936432,
     0.06430640816688538,
     0.10465867072343826,
     0.03798418119549751,
     -0.06637220829725266,
     -0.4250990152359009,
     -0.1588599979877472,
     0.33922672271728516,
     0.04830508679151535,
     -0.09463021904230118,
     -0.5035303831100464,
     -0.39160633087158203,
     0.14749857783317566,
     0.020472293719649315,
     -0.229701966047287,
     0.3536834120750427,
     0.19881847500801086,
     0.47724032402038574,
     0.29931995272636414,
     0.31341445446014404,
     0.20526790618896484,
     0.15528157353401184,
     -0.508894681930542,
     -0.13146942853927612,
     -0.2249491959810257,
     0.05408334732055664,
     -0.0905483141541481,
     0.19103674590587616,
     -0.16986088454723358,
     0.08692745864391327,
     -0.15372233092784882,
     0.36479052901268005,
     -0.4337369501590729,
     -0.07572540640830994,
     -0.065556101500988,
     -0.10045193135738373,
     0.1364922821521759,
     -0.005465127062052488,
     -0.01554494071751833,
     -0.29556313157081604,
     -0.07178344577550888,
     -0.08714015781879425,
     0.23290285468101501,
     -0.00952592771500349,
     0.2955721914768219,
     -0.07441406697034836,
     -0.02279754728078842,
     0.11213318258523941,
     0.09127560257911682,
     0.5753017067909241,
     0.0138353630900383,
     0.008958166465163231,
     0.02044711261987686,
     -0.197453111410141,
     -0.037188511341810226,
     0.45102545619010925,
     0.22150875627994537,
     0.010170169174671173,
     0.22649328410625458,
     0.015783775597810745,
     -0.3795439600944519,
     0.0906098261475563,
     -0.09935794770717621,
     0.16198572516441345,
     -0.4694459140300751,
     0.7353531718254089,
     0.3254466652870178,
     -0.6833875775337219,
     -0.3282569646835327,
     0.25172337889671326,
     -0.42638012766838074,
     -0.024639790877699852,
     0.22115133702754974,
     -0.2033555805683136,
     0.15040968358516693,
     0.6740664839744568,
     0.11636605113744736,
     -0.3653099536895752,
     0.48466140031814575,
     -0.2539305090904236,
     -0.09066807478666306,
     -0.02698243409395218,
     -0.01585509441792965,
     -0.17528286576271057,
     0.4735703766345978,
     -0.30612465739250183,
     0.47502008080482483,
     -0.24407802522182465,
     -0.42239055037498474,
     0.12963378429412842,
     0.21380950510501862,
     0.3332408666610718,
     -0.023892709985375404,
     0.38725370168685913,
     0.2339586466550827,
     -0.07940857112407684,
     0.2403808981180191,
     0.056671980768442154,
     0.1246364563703537,
     0.41994598507881165,
     -0.04105700924992561,
     0.4012738764286041,
     0.8672699928283691,
     0.10142245143651962,
     0.08793169260025024,
     0.016662154346704483,
     0.14274461567401886,
     0.14089924097061157,
     0.2876858413219452,
     0.4487122595310211,
     0.012396144680678844,
     0.05036242678761482,
     0.6006976366043091,
     0.4272596538066864,
     0.1003025621175766,
     -0.1288764923810959,
     -0.7848197221755981,
     -0.2591199278831482,
     0.3385956883430481,
     0.692954957485199,
     0.033151835203170776,
     0.11135072261095047,
     0.029529564082622528,
     -0.06353388726711273,
     0.3550261855125427,
     0.08704766631126404,
     0.07120222598314285,
     -0.1971062421798706,
     0.2999005615711212,
     0.03351316601037979,
     0.1155720204114914,
     -0.30866220593452454,
     -0.056267306208610535,
     0.1305721253156662,
     -0.26073014736175537,
     -0.09032974392175674,
     -0.0860048159956932,
     -0.5295311212539673,
     -0.33239781856536865,
     0.1855258345603943,
     -0.07846595346927643,
     0.015187432058155537,
     0.009269935078918934,
     -0.19864806532859802,
     -0.11138765513896942,
     0.3690583109855652,
     0.03456612676382065,
     0.2179807871580124,
     -0.05846627056598663,
     -0.11212410777807236,
     0.6375910043716431,
     0.8951848745346069,
     0.01939748227596283,
     0.17632444202899933,
     -0.10800936818122864,
     0.18149809539318085,
     -0.0593816339969635,
     0.23547646403312683,
     0.07167836278676987,
     -0.036985963582992554,
     0.1331740766763687,
     -0.5822545289993286,
     0.06244172528386116,
     0.07475005835294724,
     0.455001562833786,
     -0.5827022790908813,
     -0.22429786622524261,
     -0.11691593378782272,
     0.21198339760303497,
     -0.11463500559329987,
     -0.11045360565185547,
     -0.16137658059597015,
     0.12750712037086487,
     -0.3730361759662628,
     -0.16031292080879211,
     -0.0022220159880816936,
     0.32379350066185,
     -0.27941253781318665,
     0.22098945081233978,
     0.0988348200917244,
     0.37546274065971375,
     -0.13330362737178802,
     0.2573034167289734,
     0.42543983459472656,
     0.2494048774242401,
     0.46989431977272034,
     0.27441170811653137,
     0.08098154515028,
     -0.34427985548973083,
     -0.07821826636791229,
     -2.689467328309547e-05,
     -0.12856777012348175,
     0.1549813151359558,
     -0.4812433421611786,
     -0.07131695002317429,
     -0.09001703560352325,
     -0.011225216090679169,
     -0.1257323920726776,
     0.2589960992336273,
     0.06275302171707153,
     -0.26387819647789,
     -0.7300661206245422,
     -0.43463435769081116,
     0.24543005228042603,
     0.17081515491008759,
     -0.11892806738615036,
     -0.2170266956090927,
     -0.027800224721431732,
     0.14713403582572937,
     -0.218027725815773,
     -0.4322946071624756,
     0.01697184517979622,
     0.29630568623542786
]

1) Add the embedding for the reference question as a new column to the DataFrame. Then compute the similarity between the reference question and all other questions.

This SQL expression uses higher-order functions to calculate the dot product of two vectors. Make sure `embedding` and `ref_embedding` are columns in the DataFrame and contain the arrays of doubles for the embeddings of the two questions whose similarity you want to compute.
```
aggregate(
    zip_with(embedding, ref_embedding, (x, y) -> x * y),
    0D,
    (acc, x) -> acc + x
)
```
This SQL expression calculates the Euclidean norm of a vector. It is needed because the model does not return normalized vectors.
```
sqrt(aggregate(embedding, 0D, (acc, x) -> acc + x * x))
```
To compute the cosine similarity, plug the expressions above into this formula:
```
similarity(u, v) = dot_product(u, v) / (norm(u) * norm(v))
```
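
One way to plug the two expressions into the formula is to compose them as strings before handing the result to `expr`. The sketch below does only the string composition in plain Python; the helper names `dot_sql`, `norm_sql` and `cosine_sql` are hypothetical, and the column names `embedding` and `ref_embedding` follow the text above.

```python
def dot_sql(a: str, b: str) -> str:
    """SQL text for the dot product of two array columns."""
    return f"aggregate(zip_with({a}, {b}, (x, y) -> x * y), 0D, (acc, x) -> acc + x)"


def norm_sql(a: str) -> str:
    """SQL text for the Euclidean norm of an array column."""
    return f"sqrt(aggregate({a}, 0D, (acc, x) -> acc + x * x))"


def cosine_sql(a: str, b: str) -> str:
    """SQL text for dot(a, b) / (norm(a) * norm(b))."""
    return f"({dot_sql(a, b)}) / ({norm_sql(a)} * {norm_sql(b)})"


similarity_sql = cosine_sql('embedding', 'ref_embedding')

# In the notebook, assuming df already has an `embedding` array column
# (array is from pyspark.sql.functions):
#   df = df.withColumn('ref_embedding', array(*[lit(x) for x in reference_embedding]))
#   df = df.withColumn('similarity', expr(similarity_sql)).orderBy(desc('similarity'))
```

Note that a row with an all-zero vector has norm 0 and would produce a division by zero, so such rows may need to be filtered out first.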
For more info about cosine similarity see [Wikipedia](https://en.wikipedia.org/wiki/Cosine_similarity) or [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html).
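
As a sanity check, the same formula can be evaluated in plain Python on small vectors. The function names below are illustrative only; they mirror the dot product and norm used in the formula and could be used to spot-check a few rows of the Spark result.

```python
import math


def dot_product(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def norm(u):
    """Euclidean norm: square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in u))


def cosine_similarity(u, v):
    """dot_product(u, v) / (norm(u) * norm(v)); undefined for zero vectors."""
    return dot_product(u, v) / (norm(u) * norm(v))


# Vectors at a 45 degree angle: the similarity is 1 / sqrt(2), about 0.7071.
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

Vectors pointing in the same direction give a value of 1 (up to rounding), orthogonal vectors give 0, and opposite vectors give -1, which is why sorting by similarity in descending order puts the closest question first.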

2) Finally, sort the result in descending order by the computed similarity and find the question that is most similar to the reference question.


In [None]:
# your code here:



In [None]:
spark.stop()