# Uploading Data to ChromaDB

## Objective
This notebook uploads MITRE ATT&CK and CISA advisory data into Chroma, a vector database, using Python. The work aligns with the MITRE embed project and supports Retrieval-Augmented Generation (RAG) for cybersecurity use cases.

## Step 1: Set Up the Environment

In [1]:
!pip install chromadb pandas transformers

Collecting chromadb
  Downloading chromadb-1.0.15-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.0 kB)
Collecting pybase64>=1.4.1 (from chromadb)
  Downloading pybase64-1.4.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.4 kB)
Collecting posthog<6.0.0,>=2.4.0 (from chromadb)
  Downloading posthog-5.4.0-py3-none-any.whl.metadata (5.7 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.22.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.6 kB)
Collecting opentelemetry-api>=1.2.0 (from chromadb)
  Downloading opentelemetry_api-1.35.0-py3-none-any.whl.metadata (1.5 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentelemetry_exporter_otlp_proto_grpc-1.35.0-py3-none-any.whl.metadata (2.4 kB)
Collecting opentelemetry-sdk>=1.2.0 (from chromadb)
  Downloading opentelemetry_sdk-1.35.0-py3-none-any.whl.metadata (1.5 k

## Step 2: Load the MITRE ATT&CK and CISA Data

In [2]:
import pandas as pd

In [15]:
# Load the MITRE ATT&CK datasets
mitre_data = pd.read_csv('/content/mitreembed_master_Chroma.csv', sep='\t', on_bad_lines='skip')
print("MITRE Data Sample:\n", mitre_data.head())

MITRE Data Sample:
                                              Subject  \
0  Analyze network data for uncommon data flows (...   
1  Analyze network data for uncommon data flows (...   
2  Analyze network data for uncommon data flows (...   
3  Analyze network data for uncommon data flows (...   
4  #  Windows #\n\nMonitor for unexpected process...   

                                        filepath    Date  \
0      https://attack.mitre.org/techniques/T1001  4/5/24   
1  https://attack.mitre.org/techniques/T1001/001  4/5/24   
2  https://attack.mitre.org/techniques/T1001/002  4/5/24   
3  https://attack.mitre.org/techniques/T1001/003  4/5/24   
4      https://attack.mitre.org/techniques/T1003  4/5/24   

                                                Body Source  
0  Adversaries may obfuscate command and control ...  MITRE  
1  Adversaries may add junk data to protocols use...  MITRE  
2  Adversaries may use steganographic techniques ...  MITRE  
3  Adversaries may impersonate leg
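
Because `on_bad_lines='skip'` drops malformed rows without reporting them, it can be worth checking how many rows would be lost. The sketch below is one possible check using only the standard `csv` module. The helper name `find_overlong_rows` and the sample text are hypothetical, and the check assumes that pandas treats a row with more fields than the header as a bad line.

```python
import csv
import io


def find_overlong_rows(text, delimiter="\t"):
    """Return (line_number, field_count) for rows with more fields than the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    overlong = []
    for row in reader:
        if len(row) > len(header):
            # reader.line_num is the number of source lines read so far
            overlong.append((reader.line_num, len(row)))
    return overlong


# Hypothetical tab-separated sample: the third line has one field too many
sample = (
    "Subject\tfilepath\tBody\n"
    "Monitor RDP\thttps://attack.mitre.org/techniques/T1563/002\tAdversaries may hijack...\n"
    "Broken\trow\twith\textra\n"
)

overlong = find_overlong_rows(sample)
print(overlong)
```

For a real file, the same function could be given the file's text (for example, read with `open(path, newline='')`), and the reported line numbers inspected before deciding whether skipping is acceptable.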

In [6]:
# Load the CISA advisories dataset (comma-separated)
cisa_data = pd.read_csv('CISA_combo_features_new.csv')
print("CISA Data Sample:\n", cisa_data.head())

CISA Data Sample:
                                              Subject  \
0  CISA Releases Malware Analysis Reports on Barr...   
1  CISA Adds One Known Exploited Vulnerability to...   
2  Adobe Releases Security Updates for Multiple P...   
3      Fortinet Releases Security Update for FortiOS   
4    Microsoft Releases August 2023 Security Updates   

                              Date  \
0  Wed, 09 Aug 2023 19:19:13 +0000   
1  Wed, 09 Aug 2023 16:23:59 +0000   
2  Tue, 08 Aug 2023 21:16:14 +0000   
3  Tue, 08 Aug 2023 19:14:52 +0000   
4  Tue, 08 Aug 2023 19:13:07 +0000   

                                                Body  \
0      CISA Releases Malware Analysis Reports on ...   
1      CISA Adds One Known Exploited Vulnerabilit...   
2      Adobe Releases Security Updates for Multip...   
3      Fortinet Releases Security Update for Fort...   
4      Microsoft Releases August 2023 Security Up...   

                                            filepath  
0  /Users/Jupiter/Deskt
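
The two samples store dates differently: the MITRE rows use a short form such as `4/5/24`, while the CISA rows use an email-style timestamp such as `Wed, 09 Aug 2023 19:19:13 +0000`. If both datasets are later stored in one collection, a common date format makes metadata easier to compare. The following is a minimal sketch using only the standard library; the helper name `to_iso_date` is hypothetical, and it assumes the MITRE dates are month/day/two-digit-year, which the sample alone does not confirm.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def to_iso_date(value):
    """Convert either date style seen in the samples to 'YYYY-MM-DD'."""
    value = value.strip()
    try:
        # CISA style, e.g. 'Wed, 09 Aug 2023 19:19:13 +0000'
        return parsedate_to_datetime(value).date().isoformat()
    except (TypeError, ValueError):
        # MITRE style, e.g. '4/5/24' (ASSUMED to be month/day/two-digit year)
        return datetime.strptime(value, "%m/%d/%y").date().isoformat()


print(to_iso_date("4/5/24"))
print(to_iso_date("Wed, 09 Aug 2023 19:19:13 +0000"))
```

With pandas, this could be applied per column, for example `mitre_data['Date'].map(to_iso_date)`.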

## Step 3: Initialize the Sentence Transformer

In [7]:
!pip install -U langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.27-py3-none-any.whl.metadata (2.9 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.10.1-py3-none-any.whl.metadata (3.4 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading mypy_extensions-1.1.0-py3-n

In [8]:
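# Sentence embedding model and LangChain wrappers
# (imported from langchain_community, the package installed above)
from sentence_transformers import SentenceTransformer
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma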

In [9]:
# Download the 'all-MiniLM-L12-v2' Sentence Transformer model
original_model = SentenceTransformer('all-MiniLM-L12-v2')

# Save the downloaded model to the current directory
original_model.save('./')

# Path where the embedding model was saved
embedding_model_path = './'
# Wrap the saved model with HuggingFaceEmbeddings for use with LangChain
embedding_model = HuggingFaceEmbeddings(model_name=embedding_model_path)

# Example sentences to embed
sentences = ["This is an example sentence", "Each sentence is converted"]

# Encode the example sentences into embeddings with the model loaded above
embeddings = original_model.encode(sentences)

# Print the generated embeddings
print(embeddings)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.




[[-2.02660376e-04  8.14801678e-02  3.13617773e-02  2.92065088e-03
   2.61564665e-02  2.90739723e-02  7.82618299e-02 -1.80417008e-03
   1.01344392e-01 -4.51711752e-02  5.84349968e-02 -1.53200626e-02
   5.49956299e-02 -9.86434743e-02 -3.50253358e-02  8.45681224e-03
   1.58608723e-02  1.05626173e-02 -3.42710204e-02 -4.75065550e-03
   9.99021381e-02 -2.06017476e-02 -4.47837450e-02  3.12135555e-02
  -1.19240899e-02 -5.15015051e-02 -1.33605618e-02  1.89622138e-02
   9.76809859e-02 -5.44110276e-02 -3.43314074e-02  8.12904835e-02
   4.88119870e-02 -1.10284016e-02  2.13519260e-02  1.27190053e-02
  -1.43967532e-02  3.62864546e-02 -7.61233121e-02  3.23293582e-02
   2.08102744e-02 -4.22016494e-02  9.12907198e-02  2.08530556e-02
  -3.08016893e-02 -8.38504806e-02  1.30890906e-02 -3.00631300e-02
   4.11228687e-02 -1.27495304e-01 -7.78027922e-02 -3.93412225e-02
   1.52596645e-03 -2.80108545e-02  3.41662504e-02  1.46713192e-02
  -7.71654397e-02  1.63619593e-01  4.11295556e-02 -5.24459854e-02
  -4.18772
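
Each row printed above is one sentence's embedding, and a vector database answers queries by comparing such vectors. As a rough illustration of that comparison, the sketch below computes cosine similarity in plain Python. The four-dimensional vectors are hypothetical stand-ins (the real model output is much longer), and the function is an illustration rather than the code Chroma runs internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 4-dimensional vectors standing in for real embeddings
vec_query = [0.1, 0.3, 0.5, 0.1]
vec_close = [0.1, 0.29, 0.52, 0.08]
vec_far = [0.9, -0.2, 0.0, 0.4]

print(cosine_similarity(vec_query, vec_close))  # near 1: similar meaning
print(cosine_similarity(vec_query, vec_far))    # much lower: dissimilar
```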

## Step 4: Set Up ChromaDB

In [10]:
# Import necessary libraries
import re  # Regular expression operations

import chromadb  # The ChromaDB client library
import pandas as pd  # For data manipulation and analysis with DataFrames
from langchain_community.document_loaders import DataFrameLoader  # Loads pandas DataFrame rows as documents
from langchain_community.embeddings import HuggingFaceEmbeddings  # Wraps Hugging Face models for creating embeddings
from langchain_community.vectorstores import Chroma  # LangChain interface to the Chroma vector database
from sentence_transformers import SentenceTransformer  # For creating sentence embeddings

In [18]:
# Define the path for storing ChromaDB data (the current directory)
chromadb_path = './'
# Initialize an in-memory ChromaDB client, used here only to report the installed version
chroma_client = chromadb.Client()
# Print the ChromaDB version
print(chroma_client.get_version())

# Replace missing values in the 'Body' column with empty strings
mitre_data['Body'] = mitre_data['Body'].fillna('')

# Assemble the documents in the format LangChain expects:
# the 'Body' column becomes the document text and the remaining columns become metadata
loader = DataFrameLoader(
    mitre_data,
    page_content_column='Body'
)

1.0.15


In [19]:
# Load the data from the mitre_data DataFrame into a list of Document objects
documents = loader.load()
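
To make the row-to-document mapping concrete, the sketch below mimics it in plain Python: the `Body` value becomes the text and every other column becomes metadata. `SimpleDocument` and `rows_to_documents` are hypothetical stand-ins written for illustration, not LangChain's own classes, and the row values are the truncated strings from the sample output above.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(rows, page_content_column="Body"):
    """Turn row dictionaries into documents: one column is text, the rest is metadata."""
    docs = []
    for row in rows:
        metadata = {k: v for k, v in row.items() if k != page_content_column}
        docs.append(SimpleDocument(page_content=row[page_content_column], metadata=metadata))
    return docs


# One row shaped like the MITRE sample (values truncated as in the printed output)
rows = [{
    "Subject": "Analyze network data for uncommon data flows (...",
    "filepath": "https://attack.mitre.org/techniques/T1001",
    "Date": "4/5/24",
    "Body": "Adversaries may obfuscate command and control ...",
    "Source": "MITRE",
}]

docs = rows_to_documents(rows)
print(docs[0].page_content)
print(docs[0].metadata)
```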

## Step 5: Define the Logic for Embeddings Storage

In [21]:
chromadb_path = './'

# Create a Chroma vector store from the documents and embeddings
vectordb = Chroma.from_documents(
  documents=documents, # The list of documents to embed and store
  embedding=embedding_model, # The embedding model to use
  persist_directory=chromadb_path, # The directory where the ChromaDB data will be stored
  collection_name = 'CISA_MITRE' # Optional: specify a name for the collection
  )

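# Persist the vector DB to storage.
# Since Chroma 0.4.x, data is written automatically when persist_directory is set,
# so this call is only needed on older versions and triggers a deprecation warning on newer ones.
vectordb.persist()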

  vectordb.persist()
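
One thing to watch when re-running the cell above against the same `persist_directory`: if no IDs are supplied, new random IDs are generated and the same rows can be stored again. A possible mitigation, sketched below, is to derive a stable ID from each row's content. The helper `make_document_id` is hypothetical, and the approach assumes that the combination of `filepath` and `Body` is unique per row.

```python
import hashlib


def make_document_id(filepath, body):
    """Derive a stable 32-character ID from a row's file path and text."""
    digest = hashlib.sha256(f"{filepath}\n{body}".encode("utf-8")).hexdigest()
    return digest[:32]


# Values taken from the (truncated) MITRE sample output
records = [
    ("https://attack.mitre.org/techniques/T1001", "Adversaries may obfuscate command and control ..."),
    ("https://attack.mitre.org/techniques/T1001/001", "Adversaries may add junk data to protocols use..."),
]

ids = [make_document_id(filepath, body) for filepath, body in records]
print(ids)

# With the notebook's variables this could look like the following (not run here):
# ids = [make_document_id(d.metadata["filepath"], d.page_content) for d in documents]
# vectordb = Chroma.from_documents(documents=documents, embedding=embedding_model,
#                                  ids=ids, persist_directory=chromadb_path,
#                                  collection_name='CISA_MITRE')
```

Because the same input always yields the same ID, a second run should overwrite existing records rather than add copies.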


## Step 6: Check the Number of Documents in the Vector DB

In [22]:
#count documents
vectordb._collection.count()

3628

## Step 7: Query ChromaDB for MITRE-Related Records

In [23]:
query_text = "remote desktop attack"
vectordb.similarity_search_with_score(query_text)
# Response: the rows that most closely match "remote desktop attack", each with
# its row metadata (including the file path, which contains the technique ID)
# and a distance score, where lower means more similar

[(Document(metadata={'Date': '4/5/24', 'Subject': 'Consider monitoring processes for `tscon.exe` usage and monitor service creation that uses `cmd.exe /k` or `cmd.exe /c` in its arguments to detect RDP session hijacking.\n\nUse of RDP may be legitimate, depending on the network environment and how it is used. Other factors, such as access patterns and activity that occurs after a remote login, may indicate suspicious or malicious behavior with RDP.', 'filepath': 'https://attack.mitre.org/techniques/T1563/002', 'Source': 'MITRE'}, page_content="Adversaries may hijack a legitimate user’s remote desktop session to move laterally within an environment. Remote desktop is a common feature in operating systems. It allows a user to log into an interactive session with a system desktop graphical user interface on a remote system. Microsoft refers to its implementation of the Remote Desktop Protocol (RDP) as Remote Desktop Services (RDS).(Citation: TechNet Remote Desktop Services)\n\nAdversari
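
The query returns `(document, score)` pairs, and the score is a distance, so smaller values indicate closer matches. The sketch below shows one way such results could be filtered and the score related to cosine similarity. It assumes the collection uses squared L2 distance and that the embeddings are unit length. The result tuples and the `max_distance` threshold are hypothetical values chosen for illustration.

```python
def filter_by_distance(results, max_distance):
    """Keep (document, score) pairs at or below max_distance, closest first."""
    ordered = sorted(results, key=lambda pair: pair[1])
    return [(doc, score) for doc, score in ordered if score <= max_distance]


def squared_l2_to_cosine(distance):
    """For unit-length vectors, squared L2 distance d corresponds to cosine similarity 1 - d/2."""
    return 1.0 - distance / 2.0


# Hypothetical results: plain strings stand in for Document objects
hypothetical_results = [
    ("doc about RDP session hijacking", 0.62),
    ("doc about steganography", 1.35),
    ("doc about remote services", 0.80),
]

kept = filter_by_distance(hypothetical_results, max_distance=1.0)
for doc, score in kept:
    print(doc, score, squared_l2_to_cosine(score))
```

A suitable threshold depends on the data and the embedding model, so it is best chosen by inspecting real query results.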

## Step 8: Inspect a record

In [24]:
# Examine a vector DB record
# peek(1) returns the first record stored in the collection; it is not tied to the earlier query
rec = vectordb._collection.peek(1)
# Print the metadata of the record
print('Metadata:  ', rec['metadatas'])
# Print the document content of the record
print('Documents:  ', rec['documents'])
# Print the ID of the record
print('ids:        ', rec['ids'])
# Print the embedding of the record
print('embeddings: ', rec['embeddings'])

Metadata:   [{'filepath': 'https://attack.mitre.org/techniques/T1001', 'Source': 'MITRE', 'Date': '4/5/24', 'Subject': 'Analyze network data for uncommon data flows (e.g., a client sending significantly more data than it receives from a server). Processes utilizing the network that do not normally have network communication or have never been seen before are suspicious. Analyze packet contents to detect communications that do not follow the expected protocol behavior for the port that is being used. (Citation: University of Birmingham C2)'}]
Documents:   ['Adversaries may obfuscate command and control traffic to make it more difficult to detect. Command and control (C2) communications are hidden (but not necessarily encrypted) in an attempt to make the content more difficult to discover or decipher and to make the communication less conspicuous and hide commands from being seen. This encompasses many methods, such as adding junk data to protocol traffic, using steganography, or imperso
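
The record comes back as a dictionary of parallel lists, where position `i` in `ids`, `documents`, and `metadatas` all describe the same record. The sketch below regroups such a result into one dictionary per record. The helper `records_from_result` and the sample values (including the ID) are hypothetical, and embeddings are left out to keep the example small.

```python
def records_from_result(result):
    """Regroup a dict of parallel lists into a list of per-record dicts."""
    keys = [k for k in ("ids", "documents", "metadatas") if result.get(k) is not None]
    columns = [result[k] for k in keys]
    return [dict(zip(keys, values)) for values in zip(*columns)]


# Hypothetical result shaped like the output above (text truncated, ID invented)
sample_result = {
    "ids": ["example-id-1"],
    "documents": ["Adversaries may obfuscate command and control traffic ..."],
    "metadatas": [{"filepath": "https://attack.mitre.org/techniques/T1001", "Source": "MITRE"}],
}

records = records_from_result(sample_result)
print(records[0]["ids"], records[0]["metadatas"]["filepath"])
```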

## Step 9: Real-World Use Cases
By integrating ChromaDB with MITRE ATT&CK data, cybersecurity analysts can:

1. Rapidly map alerts to known techniques.
2. Cross-reference threat intelligence feeds.
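
For the first use case, the technique identifier can be read from the `filepath` metadata of a retrieved record. The sketch below is one way to do that with a regular expression. The helper `technique_id_from_url` is hypothetical, and it assumes URLs shaped like the ones in the sample output (`/techniques/T1563/002`).

```python
import re

# Matches '/techniques/T1234' with an optional '/001' sub-technique suffix
TECHNIQUE_URL = re.compile(r"/techniques/(T\d{4})(?:/(\d{3}))?/?$")


def technique_id_from_url(url):
    """Return an ID such as 'T1563.002' from an ATT&CK technique URL, or None if absent."""
    match = TECHNIQUE_URL.search(url)
    if match is None:
        return None
    technique, sub_technique = match.groups()
    return f"{technique}.{sub_technique}" if sub_technique else technique


print(technique_id_from_url("https://attack.mitre.org/techniques/T1563/002"))
print(technique_id_from_url("https://attack.mitre.org/techniques/T1001"))
```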

## Conclusion
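By loading MITRE ATT&CK and CISA data into ChromaDB and retrieving it through a RAG framework, static cybersecurity reference data becomes searchable by meaning, so analysts can pull the techniques and advisories relevant to a given alert or question.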