# 5 - Glossary

#### Table of Contents

1. [Knowledge Graphs and Software Bill of Materials (SBOM)](#glossary-know-graphs)
2. [Graph Databases](#glossary-graph-db)
    1. [Amazon Neptune](#glossary-neptune)
3. [Generative AI and Large Language Models](#glossary-gen-ai)
    1. [Retrieval Augmented Generation](#glossary-rag)
4. [Vector Embeddings and Similarity Search](#glossary-vector-embeddings)
    1. [Similarity Search](#glossary-sim-search)
5. [Knowledge Graph Enhanced RAG](#glossary-graph-rag)
    1. [Provenance and lineage](#glossary-prov-and-line)
    2. [Grounding](#glossary-grounding)
    3. [Context](#glossary-context)

<a id='glossary-know-graphs'></a>
## 1. Knowledge Graphs

>A knowledge graph is a collection of entities – objects, events, and concepts – and relationships that describe a specific domain, the things in that domain, and the ways in which those things are connected or related. 
>
>In this workshop we're going to build a **security knowledge graph** based on Software Bill of Materials (SBOMs).  An SBOM describes the different relationships between software components, the dependencies of those components, and related processes and information.  
>
>Using this knowledge graph, you'll be able to perform software vulnerability and security analyses, asking questions such as 'How many components with high and critical vulnerabilities are shared across 2 or more packages?', and grouping vulnerabilities based on these properties.
>
![image.png](attachment:c899d7c6-ea0d-4fb1-8871-db961848494b.png)

### What is a Software Bill of Materials (SBOM)?
A software bill of materials (SBOM) is a critical component of software development and management, helping organizations to improve the transparency, security, and reliability of their software applications. An SBOM acts as an "ingredient list" of libraries and components of a software application that:

* Enables software creators to track dependencies within their applications 
* Provides security personnel the ability to examine and assess the risk of potential vulnerabilities within an environment 
* Provides legal personnel with the information needed to confirm that a particular piece of software complies with all licensing requirements.

However, the true power of an SBOM lies in its ability to represent the intricate relationships and dependencies between the various components of a software system. This is where graphs come into play, offering an excellent way to model these interconnected relationships.

In a graph representation, nodes represent individual components, while edges depict the dependencies and relationships between them. This structure can handle complex hierarchies and recursive relationships with ease, making it an ideal choice for analyzing software systems.

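As a minimal sketch of why this structure handles recursive relationships well, the snippet below stores dependency edges in a plain Python dictionary and walks them to collect every direct and indirect dependency of a component. The component names and the `transitive_dependencies` helper are hypothetical and exist only for illustration; they are not part of the workshop's data or tooling.

```python
# Hypothetical component names, used for illustration only.
# Each key is a node; each list holds the nodes it depends on (the edges).
DEPENDS_ON = {
    "web-app": ["http-client", "json-parser"],
    "http-client": ["tls-lib"],
    "json-parser": [],
    "tls-lib": ["crypto-core"],
    "crypto-core": [],
}


def transitive_dependencies(graph, start):
    """Return every component reachable from `start` by following edges."""
    seen = set()
    stack = list(graph.get(start, []))
    while stack:
        component = stack.pop()
        if component in seen:
            continue  # already visited; this also guards against cycles
        seen.add(component)
        stack.extend(graph.get(component, []))
    return seen


print(sorted(transitive_dependencies(DEPENDS_ON, "web-app")))
# ['crypto-core', 'http-client', 'json-parser', 'tls-lib']
```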
Graph visualizations can illustrate the origin and propagation of open-source components from lower-level suppliers to the final product, aiding in the identification of vulnerabilities or licensing issues throughout the supply chain. By harnessing the power of graphs, organizations can gain a deeper understanding of their software systems, enabling them to make informed decisions, mitigate risks, and ensure compliance more effectively.

<div class="alert alert-block alert-warning"> 
<details>
    <summary>💡 <b><i>Click here to view the schema of our SBOM Graph</i></b></summary>

![image.png](attachment:c899d7c6-ea0d-4fb1-8871-db961848494b.png)


**What does our graph look like?**

Let's take a look at the types of data that we are storing in our graph. The plugin uses the opinionated graph data model shown in the diagram above to represent SBOM data files. This model contains the following elements:

**Node Types**
* **Document** - This represents the application that the SBOM document describes, as well as the metadata for that SBOM, such as date created, version, format, etc. e.g. boto3, AWS CLI
* **Component** - This represents a specific library/component used by an application and contains properties such as name, version, type, etc. e.g. libssl3, openjdk
* **Reference** - This represents a reference to any external system that the SBOM author wanted to include. e.g. references to package managers, URLs to external websites, GitHub repos
* **Vulnerability** - This represents a specific known vulnerability for a component and contains properties such as the severity, risk score, source/advisory URL, etc. e.g. CVE, ALAS
* **License** - The license for the component or package. e.g. Apache 2.0, MIT, CC, etc.

**Edge Types**
* **DESCRIBES/DEPENDS_ON/DEPENDENCY_OF/DESCRIBED_BY/CONTAINS** - These represent the types of relationship between a Document and a Component in the system.
* **REFERS_TO** - This represents a reference between a Component and a Reference.
* **AFFECTS** - This represents that a particular Component is affected by the connected Vulnerability

The properties associated with each element will vary depending on the input format used, and the optional information contained in each file.
    </details>
</div>

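To make the node and edge types concrete, here is a small in-memory sketch of the kind of question mentioned earlier: which components with high or critical vulnerabilities are shared by two or more documents. It uses plain Python rather than a graph query language. The identifiers, the `severity` property name, and the edge directions (Document to Component, Vulnerability to Component) are assumptions made for illustration, not the workshop's actual data.

```python
# All identifiers and property names below are hypothetical sample data.
VULNERABILITIES = {
    "VULN-1": {"severity": "critical"},
    "VULN-2": {"severity": "low"},
    "VULN-3": {"severity": "high"},
}

# Edges as (source, edge type, target) tuples; directions are assumed.
EDGES = [
    ("doc-a", "DEPENDS_ON", "libfoo"),
    ("doc-b", "DEPENDS_ON", "libfoo"),
    ("doc-a", "DEPENDS_ON", "libbar"),
    ("doc-b", "DEPENDS_ON", "libbaz"),
    ("doc-c", "DEPENDS_ON", "libbaz"),
    ("VULN-1", "AFFECTS", "libfoo"),
    ("VULN-2", "AFFECTS", "libbaz"),
    ("VULN-3", "AFFECTS", "libbar"),
]


def shared_vulnerable_components(edges, vulnerabilities, min_documents=2):
    """Components with a high/critical vulnerability used by >= min_documents."""
    severe = {"high", "critical"}
    documents_by_component = {}
    severe_components = set()
    for source, label, target in edges:
        if label == "DEPENDS_ON":
            documents_by_component.setdefault(target, set()).add(source)
        elif label == "AFFECTS" and vulnerabilities[source]["severity"] in severe:
            severe_components.add(target)
    return sorted(
        component
        for component in severe_components
        if len(documents_by_component.get(component, ())) >= min_documents
    )


print(shared_vulnerable_components(EDGES, VULNERABILITIES))
# ['libfoo']
```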
<a id='glossary-graph-db'></a>
>## 2. Graph Databases
>
>A graph database allows you to model, store and query your data in the form of a graph, or network structure. Graphs and graph databases are ideal for applications where you need to understand the connections between things, the nature or semantics of those connections, and the strength, weight or quality of each connection. 
>
>With a graph database you can identify patterns of connected data, follow paths that connect entities, identify indirect and transitive relationships that might not otherwise be apparent in the data, and analyze the communities to which entities belong and the relative importance or influence of different entities within a connected structure.
>
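To illustrate what "following paths that connect entities" means, the sketch below runs a breadth-first search over a tiny in-memory graph and returns the shortest chain of connections between two entities. A graph database does this kind of work for you at scale; the entity names and the `shortest_path` helper here are hypothetical.

```python
from collections import deque

# Hypothetical undirected connections between entities.
CONNECTIONS = {
    "alice": ["device-1"],
    "device-1": ["alice", "ip-10"],
    "ip-10": ["device-1", "device-2"],
    "device-2": ["ip-10", "bob"],
    "bob": ["device-2"],
    "carol": [],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the nodes from start to goal, or None."""
    previous = {start: None}  # maps each visited node to the node before it
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # the two entities are not connected


print(shortest_path(CONNECTIONS, "alice", "bob"))
# ['alice', 'device-1', 'ip-10', 'device-2', 'bob']
```

Here `alice` and `bob` never appear in the same record, yet the path through a shared IP address reveals an indirect relationship between them.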
>Example graph database workloads include:
>
> - **Knowledge graphs** We discussed knowledge graphs above. Today, knowledge graphs are being used more and more across industries and verticals to generate insights into complex business and technical domains.
>
> - **Fraud detection** By connecting seemingly isolated facts, graphs can help us find patterns of fraudulent behavior.
>
> - **Identity graphs** Used for real-time personalization and targeted advertising, these graphs connect devices, browsing data, and identity information to create a single unified view of customers and prospects based on their interactions with a product or website.
>
> - **Security graphs** Graphs are ideal for modelling the connections between assets and roles within your organization. Security graphs can be used for proactive detection, reactive investigation, and as part of a defense-in-depth strategy for improving your IT security.
>
<a id='glossary-neptune'></a>
>### 2A. Amazon Neptune 
>
>[Amazon Neptune](https://aws.amazon.com/neptune/) is a high-performance graph analytics and serverless database for graph workloads. In this workshop you'll be using Amazon Neptune Analytics to store and query your knowledge graph. Neptune Analytics uses built-in algorithms, vector search, and in-memory computing to run queries on data with tens of billions of relationships in seconds.

<a id='glossary-gen-ai'></a>

>## 3. Generative AI and Large Language Models
>
>Large Language Models (LLMs), such as ChatGPT, are increasingly being used at the interfaces between humans, technology and data for complex tasks that involve reasoning, recognition, prediction, summarization and translation.
>
>But there are problems with LLMs – most notably, **hallucinations**. LLMs are trained on very large datasets – typically, a vast amount of internet data up to a certain point in time – and their responses very often incorporate knowledge derived from this underlying training data. But if an answer, or part of an answer, can't be sourced from the LLM's intrinsic knowledge, the LLM can sometimes create a plausible, but incorrect, response, based on its ability to predict the most likely next tokens in the output it generates. These incorrect responses are called hallucinations. In some domains we might treat these hallucinations as mild distractions, but in other domains we should consider them dangerous to the health and safety of anyone depending on the answer generated by the LLM.
>
>

<a id='glossary-rag'></a>

>### 3A. Retrieval Augmented Generation
>
>To address these issues, a technique has emerged of supplying the LLM with additional, external data whenever a question is posed to the model. The LLM is then instructed to use its language reasoning capabilities to answer the question based on this new contextual information. This technique, which is called **Retrieval Augmented Generation** (RAG), allows companies to use existing data repositories to enhance the decision-making and information processing capabilities of their generative AI applications.
>
>A RAG application uses **vector embeddings** and **similarity search** to query an external repository for concepts and data that are similar and relevant to the question being asked. This data, or facts, then accompany the question and are passed to the LLM to generate a response to the question. 

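The "facts accompany the question" step can be sketched as simple prompt assembly. The function below is a hypothetical illustration, not the workshop's implementation: it assumes retrieval has already happened and that the retrieved facts arrive as a list of strings. The prompt wording is also just one possible choice.

```python
def build_rag_prompt(question, retrieved_facts):
    """Combine retrieved facts and the user's question into a single prompt."""
    # Number each fact so the model (and a reader) can refer back to it.
    numbered = "\n".join(
        f"[{i}] {fact}" for i, fact in enumerate(retrieved_facts, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
    )


# Hypothetical facts standing in for the results of a similarity search.
facts = [
    "An SBOM lists the components of a software application.",
    "An SBOM can record the license of each component.",
]
print(build_rag_prompt("What does an SBOM contain?", facts))
```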
<a id='glossary-vector-embeddings'></a>

>## 4. Vector Embeddings and Similarity Search
>
>A vector embedding is a numerical representation of a complex, high-dimensional object such as a piece of text or an image. Using an embedding model, this high-dimensional object is transformed to a lower-dimension representation – a vector, or numerical array. This vector can then be used to perform complex calculations that identify patterns and relationships in the underlying data, including calculations that determine the similarities between two or more pieces of content.
>
>![image.png](attachment:image.png)
>
>Amazon Bedrock offers several different embedding models, including Titan models from Amazon and Embed models from Cohere. In this workshop you'll be using a Titan embedding model to create vectors from sections of the source documents.

<a id='glossary-sim-search'></a>
>### 4A. Similarity Search
>
>A vector is a numerical array. This array can be thought of as a set of coordinates that represent a point in vector space. It's pretty easy to imagine a 2D or 3D vector space, with X, Y (and Z) coordinates. It's perhaps less easy to picture a vector space with 1536 dimensions, but the idea remains the same. In this vector space, similarity is determined by the distance between points (which in turn represent different pieces of underlying content). The closer the points are together, the more similar the content.
>
>This is the basis of similarity search – sometimes called vector search. Under the hood, similarity search uses methods such as [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) and [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) to measure the proximity between vectors, and identify those pieces of content most similar to the search term. In your knowledge graph application, you'll have several documents about SBOMs that have been indexed, chunked, and vectorized. If you ask the chatbot a question such as 'My vendor isn't giving me an SBOM, what do I do now?', the application will first vectorize the question, and then compare this question embedding with the embeddings for each of the document chunks in order to identify those chunks whose embeddings suggest the content is related to the question.
>
>![image.png](attachment:image.png)
>
>You don't have to know the details of these embedding or similarity search methods: the embedding models provided by Amazon Bedrock do the work of creating vectors for each piece of content, and most similarity search APIs will typically expose something like a `top_k` method that allows you to find a specified number of 'similar' entities within your dataset.

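For readers who do want a feel for the mechanics, here is a minimal sketch of cosine similarity and a `top_k` lookup in plain Python. The three-dimensional vectors and chunk names are made up for illustration; real embeddings have far more dimensions and come from an embedding model. The sketch also assumes no vector is all zeros, since that would cause a division by zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, embeddings, k=2):
    """Return the names of the k stored vectors most similar to the query."""
    scored = [
        (cosine_similarity(query, vector), name)
        for name, vector in embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical embeddings for three document chunks.
EMBEDDINGS = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}

question_embedding = [1.0, 0.2, 0.0]
print(top_k(question_embedding, EMBEDDINGS, k=2))
# ['chunk-a', 'chunk-b']
```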
<a id='glossary-graph-rag'></a>

>## 5. Graph Enhanced RAG (GraphRAG)
>
>As mentioned earlier, a RAG application uses an external source of data to supply an LLM with additional content at question answering time. A graph database can act as one such source of external data for a RAG application. 
>
>But besides supplying connected content, graphs bring several other benefits to RAG applications. For example, they can provide comprehensive and explainable responses to questions that rely on concepts that are closely related in the domain but distant from one another in vector space.




<a id='glossary-prov-and-line'></a>
> ### 5A. Provenance and Lineage
>
>Traditional RAG applications can sometimes conflate sources, or combine extracts from different sources, without regard to the provenance or lineage of the underlying data. This has implications when evaluating things such as medical and insurance policies: though the details in the response may have been sourced directly from the supplied data, the response might combine them in ways that don’t reflect reality, creating a fictitious super policy whose details come from several different providers. 
>
>   In a knowledge graph, we can track the lineage and provenance of each fact, which makes it easier to distinguish mutually exclusive, contradictory, or otherwise separate streams of reasoning. Graphs help keep RAG applications honest, not only with the details, but also with regard to how they combine those details to generate a response.
>   
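One simple way to picture provenance tracking is to keep a source label on every retrieved fact and group facts by that label before they reach the LLM, so that details from different providers are never silently merged. The sketch below assumes each fact is a dictionary with hypothetical `source` and `text` keys; the sample policies are invented.

```python
from collections import defaultdict

# Hypothetical retrieved facts, each tagged with the source it came from.
FACTS = [
    {"source": "provider-a", "text": "Plan covers dental."},
    {"source": "provider-b", "text": "Plan covers vision."},
    {"source": "provider-a", "text": "Deductible is 500."},
]


def group_by_source(facts):
    """Group fact texts by their source so each source stays separate."""
    grouped = defaultdict(list)
    for fact in facts:
        grouped[fact["source"]].append(fact["text"])
    return dict(grouped)


for source, texts in group_by_source(FACTS).items():
    print(source, texts)
```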
<a id='glossary-grounding'></a>
> ### 5B. Grounding 
>
>One of the purposes of RAG is to reduce hallucinations in LLM-based applications, by supplementing the LLM's intrinsic knowledge with recent or private data that was not available to the LLM when it was trained. But RAG applications can still suffer from hallucinations – fictitious assertions, invented details, and nonsensical conclusions. Furthermore, the mechanism for finding relevant information to supply to the LLM – similarity search – is itself difficult to reason about. Given a set of similarity results, and even access to the underlying vectors, it's difficult to determine exactly _what_ within the vector space leads X, Y or Z to appear similar or dissimilar.
>
>   A well-curated knowledge graph contains asserted facts that can be traversed or navigated by following the relations that connect entities. If we ask a graph what links entities X and Y, the results are usually self-explanatory: the facts along the paths that connect X and Y provide the basis for explaining why or how we should consider these entities as being related (or not related). Furthermore, we can take an existing answer from an LLM, and break it up into facts that we can then check in the graph. Graphs, therefore, can help squash hallucinations even further, by providing a grounding of asserted facts that can be used to fact-check an answer.
>
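The fact-checking idea can be sketched as a set lookup. The snippet assumes the LLM's answer has already been broken into (subject, relation, object) triples, which is a separate step not shown here, and that the graph's asserted facts are available in the same form. All names are hypothetical.

```python
# Hypothetical asserted facts as (subject, relation, object) triples.
GRAPH_FACTS = {
    ("libfoo", "DEPENDS_ON", "libbar"),
    ("VULN-1", "AFFECTS", "libbar"),
}


def check_claims(claims, graph_facts):
    """Map each claimed triple to True if the graph asserts it, else False."""
    return {claim: claim in graph_facts for claim in claims}


# Claims extracted from a hypothetical LLM answer.
claims = [
    ("VULN-1", "AFFECTS", "libbar"),  # asserted in the graph
    ("VULN-1", "AFFECTS", "libfoo"),  # not asserted: possible hallucination
]

for claim, supported in check_claims(claims, GRAPH_FACTS).items():
    print(claim, "supported" if supported else "unsupported")
```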
<a id='glossary-context'></a>
> ### 5C. Context
>
>RAG applications work well if the answer to a question can be sourced from a few discrete pieces of content. But they don't work so well when an answer requires connecting disparate pieces of information scattered across the underlying data, or depends on an understanding of the indirect or hidden relations between parts of a response and some other data not directly referenced in the question. This kind of contextual knowledge is important in supply chain applications, for example, where an understanding of upstream and downstream dependencies can often determine how we reason about each element in the chain.
>
>Graphs can provide additional local context to supplement that found through similarity search. A vector search might unearth one or more press releases that directly mention a particular organization; a graph-based traversal that starts from each of these organizations might also uncover business relationships that point to important facts not covered in the found content.
>
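A rough sketch of this local-context step: take the entities surfaced by vector search as seeds, then collect every edge that touches one of them. The organizations, relation names, and the `expand_context` helper are hypothetical, and a real application would run this as a graph query rather than a list scan.

```python
# Hypothetical edges as (source, relation, target) tuples.
EDGES = [
    ("org-a", "SUPPLIES", "org-b"),
    ("org-b", "PARTNERS_WITH", "org-c"),
    ("org-d", "SUPPLIES", "org-e"),
]


def expand_context(seed_entities, edges):
    """Return the edges that touch any seed entity (one hop, either direction)."""
    seeds = set(seed_entities)
    return [edge for edge in edges if edge[0] in seeds or edge[2] in seeds]


# Suppose vector search surfaced content that mentions only org-b.
print(expand_context(["org-b"], EDGES))
# [('org-a', 'SUPPLIES', 'org-b'), ('org-b', 'PARTNERS_WITH', 'org-c')]
```

The two returned edges are facts about `org-b` that the similarity search alone would not have supplied.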
>Besides enriching local context, graphs can also be used to provide global context. There are many graph algorithms that can be used to identify communities and groups within a dataset, and determine the influence and importance of individual entities relative to the whole. These hierarchical summaries on top of low-level detail in the graph can be used to answer questions that refer to the entire corpus: 'Which are the most connected organizations?', 'How are organizations distributed across industries?', 'What are the recurring themes in recent events?', and so on.