# Outlier analysis on prompt embeddings

If you use the retrieval-augmented generation (RAG) pattern, it is worth checking whether your reference data adequately covers the questions users actually ask. This notebook shows a relatively simple technique for doing so.

A Glue job has already calculated the distance from each prompt's embedding to the centroid of the closest reference embedding cluster, and stored the mean and standard deviation of those distances. You can look at the absolute values of the mean, median, and standard deviation, and track how those statistics trend over time.
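To make the distance metric concrete, here is a minimal pure-Python sketch of the idea. It is not the Glue job's code, which this notebook does not show. It assumes Euclidean distance and a sample standard deviation, and the function name `nearest_centroid_distance`, the centroids, and the prompt vectors are all hypothetical toy values.

```python
import math
import statistics


def nearest_centroid_distance(embedding, centroids):
    """Return the Euclidean distance from an embedding to its closest centroid."""
    return min(math.dist(embedding, centroid) for centroid in centroids)


# Hypothetical 2-D data for illustration; real embeddings have many more dimensions.
centroids = [(0.0, 0.0), (10.0, 0.0)]
prompts = [(3.0, 4.0), (10.0, 1.0), (4.0, 0.0)]

# One distance per prompt, measured to whichever centroid is closest.
distances = [nearest_centroid_distance(p, centroids) for p in prompts]

# Summary statistics of the kind the Glue job is described as storing.
mean_dist = statistics.mean(distances)
stdev_dist = statistics.stdev(distances)  # sample standard deviation (an assumption)

print(distances, mean_dist, stdev_dist)
```

A prompt that sits far from every centroid gets a large distance, and that is the signal the rest of this notebook looks for.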

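In this notebook we'll just calculate the z-score of each prompt's distance and count how many prompts are outliers. A prompt whose distance is unusually large probably asks about something the reference data covers poorly.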
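The z-score calculation can be sketched in plain Python before running it on Spark. The mean of 670, the standard deviation of 370, and the threshold of 2 are the values this notebook uses in later cells. The `z_score` helper and the sample distances are hypothetical and only serve as illustration.

```python
def z_score(dist, mean, stdev):
    """Number of standard deviations that dist lies above the mean."""
    return (dist - mean) / stdev


mean = 670
stdev = 370
threshold = 2

# A z-score above the threshold is equivalent to a distance above this cutoff.
cutoff = mean + threshold * stdev  # 670 + 2 * 370 = 1410

# Hypothetical distances, not taken from the real data set.
sample_distances = [300.0, 670.0, 1040.0, 1500.0]
outliers = [d for d in sample_distances if z_score(d, mean, stdev) > threshold]

print(cutoff, outliers)
```

The comparison is one-sided on purpose: only prompts that are unusually far from their nearest centroid are of interest, not those that are unusually close.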

### Set up Glue interactive session


In [1]:
%idle_timeout 2880
%glue_version 3.0
%worker_type G.1X
%number_of_workers 5

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.38.1 
Current idle_timeout is 2800 minutes.
idle_timeout has been set to 2880 minutes.
Setting Glue version to: 3.0
Previous worker type: G.1X
Setting new worker type to: G.1X
Previous number of workers: 5
Setting new number of workers to: 5
Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::102165494304:role/glueinteractive
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: 5dae2f9f-42c4-44b4-8357-104eed1b9ab8
Job Type: glueetl
Applying the following default arguments:
--glue_kernel_version 0.38.1
--enable-glue-datacatalog true
Waiting for session 5dae2f9f-42c4-4

### Read saved prompt embedding distance data


In [2]:
df = spark.read.parquet("s3://cdkstack-documentsbucket9ec9deb9-sbbf9n4wdhze/promptdistance/")




### Specify the mean and standard deviation as calculated by the Glue distance analysis job

In [4]:
mean = 670
stdev = 370




### Calculate z-scores

In [7]:
from pyspark.sql.functions import col
df = df.withColumn("z", (col("dist") - mean) / stdev)




### Keep only prompts with a z-score greater than 2

In [12]:
dfOutliers = df.where(df.z > 2)




### Count outliers

In [13]:
dfOutliers.count()

0
