# Building a Vector DB for RAG from a JSON File

Trying out RAG with LangChain #AI - Qiita
> https://qiita.com/tinymouse/items/4d359674f6b2494bb22d

LangChain for LLM Application Development, Part 2 (5): Loading, Splitting, and Storing External Documents - Qiita
> https://qiita.com/utanesuke/items/6efc03eca94f7de3b9cd#json-%E3%83%AD%E3%83%BC%E3%83%80%E3%83%BC


## install

In [None]:
!pip install unsloth langchain langchain_community langchain-huggingface sentence-transformers transformers accelerate chromadb

In [None]:
!pip install jq  # JSONLoader depends on the jq package

## imports

In [28]:
## main models
# unsloth must be imported first
from unsloth import FastLanguageModel
import transformers
from sentence_transformers import SentenceTransformer
import torch

## langchain (needed to build the RAG pipeline)
from langchain_huggingface import HuggingFacePipeline
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.document_loaders import JSONLoader
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma

import chromadb
import json
from pathlib import Path
import pprint


## prepare tokenizer & model

In [2]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    max_seq_length = 40960,
    load_in_4bit = True,            # 4bit uses much less memory
    load_in_8bit = False,           # A bit more accurate, uses 2x memory
)

==((====))==  Unsloth 2025.9.11: Fast Qwen3 patching. Transformers: 4.56.2.
   \\   /|    NVIDIA RTX A6000. Num GPUs = 1. Max memory: 47.431 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.6. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

## prepare pipeline

In [3]:
pipe = transformers.pipeline(
    'text-generation',
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    dtype=torch.float16
)

llm = HuggingFacePipeline(
    pipeline = pipe
)

Device set to use cuda:0


## load JSON data

In [35]:
# need to avoid Unicode escapes and merge instruction and output
loader = JSONLoader(
    file_path="./jvn_results_merged.json",
#    jq_schema=".[]",
    jq_schema=".[] | .instruction, .output",
    text_content=False
)
docs_raw = loader.load()

print(docs_raw[0])
print(docs_raw[1])

docs = []
for i in range(0, len(docs_raw), 2):
    inst = docs_raw[i].page_content
    out = docs_raw[i+1].page_content
    inst_meta = docs_raw[i].metadata["seq_num"]
    out_meta = docs_raw[i+1].metadata["seq_num"]

    # Combine instruction and output into page_content
    content = (
        "### Instruction:\n"
        f"{inst}\n\n"
        "### Output:\n"
        f"{out}"
    )

    # Keep the source seq_num values as metadata (optional)
    metadata = {
        "instruction": inst_meta,
        "output": out_meta
    }

    docs.append(Document(page_content=content, metadata=metadata))

page_content='クロスサイトスクリプティングの脆弱性の例を教えて' metadata={'source': '/home/isusers/b223r030p@kochi-u.ac.jp/cifshome/Ubuntu/Unsloth/jvn_results_merged.json', 'seq_num': 1}
page_content='Iqbolshoh Ilhomjonov の PHP Education Management におけるクロスサイトスクリプティングの脆弱性が起きているようです。PHP Education Manager v1.0 is vulnerable to Cross Site Scripting (XSS) in the worksheet.php file via the participant_name parameter. 該当するのはIqbolshoh Ilhomjonov
PHP Education Management 1.0とのことです。' metadata={'source': '/home/isusers/b223r030p@kochi-u.ac.jp/cifshome/Ubuntu/Unsloth/jvn_results_merged.json', 'seq_num': 2}


In [4]:
# # Try this for now
# file_path = "./jvn_results_merged.json"
# docs = json.loads(Path(file_path).read_text())

In [36]:
pprint.pprint(docs[1])

Document(metadata={'instruction': 3, 'output': 4}, page_content='### Instruction:\nSQL インジェクションの脆弱性の例を教えて\n\n### Output:\ncarmelogarcia の Simple Leave Manager In PHP With Source Code における SQL インジェクションの脆弱性が発表された。A flaw has been found in code-projects Simple Leave Manager 1.0. This vulnerability affects unknown code of the file /user.php. This manipulation of the argument table causes sql injection. Remote exploitation of the attack is possible. The exploit has been published and may be used. この脆弱性はcarmelogarcia\nSimple Leave Manager In PHP With Source Code 1.0に影響を及ぼす。')
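The pairing loop above depends on the loader emitting instruction and output as alternating documents. As a loader-free comparison, here is a minimal sketch that builds the same merged text with only the standard library. It assumes the file is a JSON array of objects with `instruction` and `output` keys (inferred from the `jq_schema`); `SAMPLE_JSON`, `merge_records`, and the `record` metadata key are hypothetical names used only for illustration.

```python
import json

# Hypothetical sample mirroring the assumed layout of the JSON file:
# an array of objects with "instruction" and "output" keys.
SAMPLE_JSON = """
[
  {"instruction": "Give an example of an XSS vulnerability", "output": "Example output A"},
  {"instruction": "Give an example of an SQL injection vulnerability", "output": "Example output B"}
]
"""

def merge_records(records):
    """Return one dict per record with the merged text and its 1-based position."""
    merged = []
    for index, record in enumerate(records, start=1):
        content = (
            "### Instruction:\n"
            f"{record['instruction']}\n\n"
            "### Output:\n"
            f"{record['output']}"
        )
        merged.append({"page_content": content, "metadata": {"record": index}})
    return merged

merged_docs = merge_records(json.loads(SAMPLE_JSON))
print(merged_docs[0]["page_content"])
```

Because each record is read as a whole, there is no index arithmetic over pairs, and `json.loads` decodes any `\uXXXX` escapes into plain characters.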


## prepare embedding model

In [15]:
embeddings = SentenceTransformerEmbeddings(
    model_name="./infloat_multilingual-e5-large"
)

## save DB

In [None]:
# The error no longer appears
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings
)

In [None]:
# https://github.com/dodeeric/json-files-ingestion-into-chroma-vector-db
# Pass the on-disk location to the client so the data persists there
persistent_client = chromadb.PersistentClient(path="./chromadb/CVR_RAG-json")
collection_name = "CVE_RAG"
collection = persistent_client.get_or_create_collection(collection_name)

langchain_chroma = Chroma(
    client = persistent_client,
    collection_name = collection_name,
    embedding_function = embeddings,
)

# from_documents is a classmethod: called here it would build a separate store
# and ignore the client and collection above, so add to the existing store instead
langchain_chroma.add_documents(docs)
db = langchain_chroma

## search DB
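
This section is still empty in the notebook, so what follows is only a sketch of one way the lookup could go. LangChain vector stores expose `similarity_search(query, k=...)`, and the `retrieve` helper below wraps that call. To keep the example runnable without the GPU or the database, it uses `ToyStore` and `Doc`, hypothetical stand-ins that rank by shared words rather than embeddings.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document (hypothetical, sketch only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

class ToyStore:
    """Hypothetical stand-in for the Chroma store; ranks by shared words, not embeddings."""
    def __init__(self, docs):
        self.docs = docs

    def similarity_search(self, query, k=4):
        words = set(query.lower().split())
        ranked = sorted(
            self.docs,
            key=lambda d: len(words & set(d.page_content.lower().split())),
            reverse=True,
        )
        return ranked[:k]

def retrieve(store, query, k=2):
    """Return the page_content of the top-k hits from any store with similarity_search."""
    return [doc.page_content for doc in store.similarity_search(query, k=k)]

store = ToyStore([
    Doc("SQL injection in /user.php via the table argument"),
    Doc("Cross site scripting in worksheet.php via participant_name"),
    Doc("Buffer overflow in an image parser"),
])
hits = retrieve(store, "example of SQL injection", k=1)
print(hits)
```

With the real store from the previous section, the same helper would presumably be called as `retrieve(db, "...")`, with ranking then done by embedding similarity.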

## add context for prompt
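
This section is also empty, so here is a minimal sketch of how retrieved passages might be folded into a prompt. The `build_prompt` name, the section labels, and the instruction wording are all assumptions for illustration, not the notebook's actual template.

```python
def build_prompt(question, contexts):
    """Join retrieved passages into a numbered context block and append the question."""
    context_block = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"### Context:\n{context_block}\n\n"
        f"### Question:\n{question}\n\n"
        "### Answer:\n"
    )

prompt = build_prompt(
    "Give an example of an SQL injection vulnerability",
    ["passage one", "passage two"],  # placeholder passages, not real search results
)
print(prompt)
```

The resulting string could then be passed to the pipeline wrapper defined earlier, for example `llm.invoke(prompt)`.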