<a href="https://colab.research.google.com/github/UchidaYasuto/DH/blob/main/Execute_RAG_JTwp_pynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Environment setup and preparation

In [None]:
# Set the system encoding to UTF-8 so that LangChain installs correctly on Google Colab
import locale
locale.getpreferredencoding = lambda: "UTF-8"

# Additional requirement to use LLM from OpenAI and other LangChain components
!pip install -qU langchain-openai langchain langchain-community langchain_text_splitters langchain_chroma langchain-core

# Additional requirement to use open-source LLM from hugging face community
!pip install -q torch transformers accelerate sentence-transformers faiss-cpu

# Non-free LLM: OpenAI gpt-4o-mini
import os
from google.colab import userdata
from langchain_openai import ChatOpenAI

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)  # generator LLM; embeddings and retrievers are configured separately below

Text file extracted and compiled from the web links  
* [Joho-Tsushin_whitepaper_1973-2025.txt](https://drive.google.com/file/d/1cdGCh1VhyhS695MitsK6gX4HFLrqFyFj/view?usp=drive_link)


Preparing the vector database Choma_db: reuse an existing, already-built Chroma database
* [Choma_db](https://drive.google.com/drive/folders/1cFD3QW7_fHty7Y5XEslH9WM1LwP8aJvR?usp=sharing)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Download the existing Chroma DB from Google Drive (URL quoted so the shell does not interpret "?")
!gdown --folder "https://drive.google.com/drive/folders/1cFD3QW7_fHty7Y5XEslH9WM1LwP8aJvR?usp=sharing"
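
Before the vector store is opened, it can help to confirm that the download produced a usable directory. The check below is only a sketch and not part of the original notebook: the helper name `looks_like_chroma_dir` is hypothetical, it assumes the folder ends up at `./chroma_db` (the path passed as `persist_directory` in the next section), and it assumes a Chroma version that keeps a top-level `chroma.sqlite3` file in a persisted directory.

```python
from pathlib import Path


def looks_like_chroma_dir(path: str) -> bool:
    """Return True if `path` is a directory containing chroma.sqlite3.

    Assumption: the database was persisted by a Chroma version that stores
    its metadata in a top-level `chroma.sqlite3` file.
    """
    directory = Path(path)
    return directory.is_dir() and (directory / "chroma.sqlite3").is_file()


# Hypothetical usage right after the download step.
if not looks_like_chroma_dir("./chroma_db"):
    print("chroma_db not found or incomplete; check the download step.")
```

If the message is printed, the downloaded folder may have a different name or location than the one the vector store expects.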

Building the RAG pipeline

In [None]:
# Configure the embedding model and initialize the vector database
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

embedding = OpenAIEmbeddings(
    model="text-embedding-3-large",
    chunk_size=5,
)

vectorstore = Chroma(
    embedding_function=embedding,
    persist_directory="./chroma_db"
)


# Retriever using vector DB
retriever = vectorstore.as_retriever()

retriever_sim = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 10},
)

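retriever_mmr = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 20,
        # fetch_k defaults to 20 in LangChain; with k=20 MMR could only reorder
        # the same 20 nearest chunks. 60 is an assumed value, tune as needed.
        "fetch_k": 60,
        "lambda_mult": 0.5,  # tune between 0 and 1 (closer to 0 favors diversity)
    },
)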


# Prompt Template
from langchain_core.prompts import PromptTemplate

prompt_template = """
<|system|>
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
keep the answer as concise as possible.
Use markdown formatting when displaying code.
Emphasis should be used to terminologies.
Always say "thanks for asking!" at the end of the answer.

{context}

</s>
<|user|>
{question}
</s>
<|assistant|>

"""

# Create the PromptTemplate instance.
# input_variables must match the template placeholders; the template has no {sources}.
prompt = PromptTemplate(
    input_variables=[
        "context",
        "question",
        ],
    template=prompt_template,
)


# Output formatter

# Document (context) formatter
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Source information formatter
def format_sources(docs):
    sources = [doc.metadata.get('source', 'Unknown Source') for doc in docs]
    return "Sources:\n" + "\n".join(sources)

# Answer formatter
from html import escape
from IPython.display import Markdown, display, HTML

def display_answer(answer):
    # Wrap the answer in an HTML div whose CSS forces long lines to wrap.
    # The text is HTML-escaped so that characters such as "<" in the answer display literally.
    wrapped_html = f"<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>{escape(answer)}</div>"
    display(HTML(wrapped_html))


# Overall workflow by combining the above components

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

llm_chain = prompt | llm | StrOutputParser()

# Similarity RAG chain
rag_chain_sim = (
    {
        "context": retriever_sim | format_docs,
        "sources": retriever_sim | format_sources,
        "question": RunnablePassthrough(),
    }
    | llm_chain
)

# MMR RAG chain
rag_chain_mmr = (
    {
        "context": retriever_mmr | format_docs,
        "sources": retriever_mmr | format_sources,
        "question": RunnablePassthrough(),
    }
    | llm_chain
)
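
To make the data flow of `rag_chain_sim` and `rag_chain_mmr` concrete, here is a minimal stand-in written in plain Python, without LangChain. `FakeDoc`, `fake_retriever`, `TEMPLATE`, and `run_rag_step` are hypothetical names introduced only for illustration; the real chains call the Chroma retriever and the OpenAI model instead. The sketch shows how the dictionary step hands the same question to every branch and how the results are substituted into the template.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    # Same logic as the notebook's formatter: join chunk texts with blank lines.
    return "\n\n".join(doc.page_content for doc in docs)


def fake_retriever(question):
    # Hypothetical retriever that ignores the question and returns fixed chunks.
    return [FakeDoc("chunk A"), FakeDoc("chunk B")]


TEMPLATE = "Context:\n{context}\n\nQuestion: {question}"


def run_rag_step(question):
    # The dict step: every branch receives the same input (the question).
    inputs = {
        "context": format_docs(fake_retriever(question)),
        "question": question,  # what RunnablePassthrough does: forward the input as-is
    }
    # The prompt step: fill the template; the real chain then sends this to the LLM.
    return TEMPLATE.format(**inputs)


print(run_rag_step("What is Teletopia?"))
```

The printed string is what the model would receive as its prompt in this simplified picture.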

---
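### **Q&A (format)**

**1) RAG answers**: two retrieval methods are used side by side and their results compared
*   **Similarity RAG** (cosine similarity): a simple method that takes the chunks most similar to the question, in descending order of similarity
*   **MMR RAG** (MMR: Maximal Marginal Relevance): a method that selects chunks that are close to the question but not too similar to one another

**2) Comparison with the LLM's answer**

**3) Presenting the RAG source information**
*   **Source list for the RAG answers**: item title, URL
*   **Source details for the RAG answers**: year, item title, URL, content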
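
The difference between the two retrieval methods can be seen on a toy example. The code below is a sketch of the generic greedy MMR formulation in plain Python, not LangChain's or Chroma's actual implementation; the function names and the three 2-dimensional vectors are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_k_similarity(query, vectors, k):
    """Indices of the k vectors most similar to the query, best first."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]


def mmr(query, vectors, k, lambda_mult=0.5):
    """Greedy Maximal Marginal Relevance selection.

    Each step picks the candidate that maximizes
    lambda_mult * sim(query, c) - (1 - lambda_mult) * max sim(c, selected).
    """
    candidates = list(range(len(vectors)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, vectors[i])
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 1.0]
vectors = [
    [1.0, 0.1],   # 0: most similar to the query
    [1.0, 0.05],  # 1: near-duplicate of 0
    [0.0, 1.0],   # 2: slightly less similar, but very different from 0
]
print(top_k_similarity(query, vectors, 2))     # [0, 1]
print(mmr(query, vectors, 2, lambda_mult=0.5)) # [0, 2]
```

Plain similarity returns the two near-duplicates, while MMR replaces the second one with the chunk that adds new information.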

---

## **＜Enter your question here＞**

In [None]:
# question
query = "　？"

---
## ＜Comparing the LLM answer and the RAG answers＞

### **RAG answers** (Domain-specific answering)
*   **Similarity RAG** (cosine similarity): selects the 10 chunks most similar to the question, in descending order of similarity
*   **MMR RAG** (MMR: Maximal Marginal Relevance): selects 20 chunks that are close to the question but not too similar to one another

In [None]:
# answer with the context from vector DB
answer_sim = rag_chain_sim.invoke(query)
print("=== Similarity RAG ===")
display_answer(answer_sim)

answer_mmr = rag_chain_mmr.invoke(query)
print("\n=== MMR RAG ===")
display_answer(answer_mmr)

### **LLM answer** (Domain-general answering)

In [None]:
# answer without external knowledge
answer_without_knowledge = llm_chain.invoke({"context":"", "question": query})
display_answer(answer_without_knowledge)

---
## ＜Source information (Extracting sources)＞

### **Source list for the RAG answers**

In [None]:
# Retrieve once per retriever and reuse the results here and in the next cell
sim_retrieved_docs = retriever_sim.invoke(query)
mmr_retrieved_docs = retriever_mmr.invoke(query)
docs_sim, docs_mmr = sim_retrieved_docs, mmr_retrieved_docs

sim_sources = format_sources(sim_retrieved_docs)
mmr_sources = format_sources(mmr_retrieved_docs)


print("=== Similarity RAG ===")
print()
for d in docs_sim:
    print(d.metadata.get("year"), d.metadata.get("title"))
    print(d.metadata.get("url"))
    print()

print("\n=== MMR RAG ===")
print()
for d in docs_mmr:
    print(d.metadata.get("year"), d.metadata.get("title"))
    print(d.metadata.get("url"))
    print()

### **Source details for the RAG answers**

In [None]:
# Display the sources separately
from IPython.display import HTML

print("=== Similarity Source metainfo (Metadata and Content) ===")
for i, doc in enumerate(sim_retrieved_docs):
    url = doc.metadata.get('url', 'N/A')
    title = doc.metadata.get('title', 'N/A')
    year = doc.metadata.get('year', 'N/A')
    display(HTML(f"<b>Document {i+1} (Similarity):</b><br>" +
                 f"Year: {year}<br>" +
                 f"Title: {title}<br>" +
                 f"URL: <a href='{url}' target='_blank'>{url}</a><br>" +
                 "<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>" +
                 f"Content: {doc.page_content}</div><br>"))

print("\n=== MMR Source metainfo (Metadata and Content) ===")
for i, doc in enumerate(mmr_retrieved_docs):
    url = doc.metadata.get('url', 'N/A')
    title = doc.metadata.get('title', 'N/A')
    year = doc.metadata.get('year', 'N/A')
    display(HTML(f"<b>Document {i+1} (MMR):</b><br>" +
                 f"Year: {year}<br>" +
                 f"Title: {title}<br>" +
                 f"URL: <a href='{url}' target='_blank'>{url}</a><br>" +
                 "<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>" +
                 f"Content: {doc.page_content}</div><br>"))

---
# **Showcase**

---
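## Q1  Numeric / situational question (progress of the Teletopia policy)

Query the RAG system about the following passage from the external knowledge source (Showa 63 (1988) edition of the 情報白書, 1-1-3 "Current state of regional informatization promotion policy"):

> The Ministry of Posts and Telecommunications is promoting informatization centered on community-based information and communications through the Teletopia project, the development of facilities under the Minkatsu Act (民活法), and other measures. This section surveys the current state of regional informatization promotion policy centered on information and communications.  
> (1) Promotion of Teletopia  
> **The Teletopia plan promotes regional informatization by intensively introducing new media such as CATV and videotex into model cities, and develops the infrastructure that will form the core of information and communications in each region.**
> At present, **63 areas have been designated** as Teletopia areas, and of the 260 systems planned nationwide, **92 systems were already in operation at the end of fiscal Showa 62 (1987)**. In addition, 41 third-sector (joint public-private) entities have been established as operators of these systems.
> The media introduced in the operating systems are mainly data communications, videotex, and CATV, which together account for 80% of the total.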
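
For a numeric question like this one, a quick way to compare the RAG and LLM answers is to check which of the figures from the passage actually appear in each answer. The helper below is a hypothetical sketch, not part of the original notebook; `missing_figures` and `sample_answer` are illustrative names, and the sample answer is invented rather than real model output.

```python
import re
import unicodedata


def missing_figures(answer, expected):
    """Return the expected numbers that do not appear in the answer.

    A number counts as present only when it is not part of a longer digit
    run, so "63" does not match inside "1963". Full-width digits are
    normalized to ASCII first.
    """
    normalized = unicodedata.normalize("NFKC", answer)
    missing = []
    for number in expected:
        pattern = rf"(?<!\d){re.escape(number)}(?!\d)"
        if not re.search(pattern, normalized):
            missing.append(number)
    return missing


# Figures stated in the passage: 63 designated areas, 260 planned systems,
# 92 systems in operation, 41 third-sector entities.
expected = ["63", "260", "92", "41"]
sample_answer = "As of the end of FY1987, 63 areas were designated and 92 systems were running."
print(missing_figures(sample_answer, expected))  # ['260', '41']
```

In the notebook, the same check could be applied to the strings returned by the RAG chains and by the LLM-only chain.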


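## **＜Enter your question here＞**

In [None]:
# question (English gloss: "As of Showa 62 (1987), how many areas had been designated as Teletopia areas, and what was the state of system deployment?")
query = "昭和62年時点で、テレトピアの指定地域数やシステムの整備状況はどうなっていますか。"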

---
## ＜Comparing the LLM answer and the RAG answers＞

### **RAG answers** (Domain-specific answering)
*   **Similarity RAG** (cosine similarity): selects the 10 chunks most similar to the question, in descending order of similarity
*   **MMR RAG** (MMR: Maximal Marginal Relevance): selects 20 chunks that are close to the question but not too similar to one another

In [None]:
# answer with the context from vector DB
answer_sim = rag_chain_sim.invoke(query)
print("=== Similarity RAG ===")
display_answer(answer_sim)

answer_mmr = rag_chain_mmr.invoke(query)
print("\n=== MMR RAG ===")
display_answer(answer_mmr)

### **LLM answer** (Domain-general answering)

In [None]:
# answer without external knowledge
answer_without_knowledge = llm_chain.invoke({"context":"", "question": query})
display_answer(answer_without_knowledge)

---
## ＜Source information (Extracting sources)＞

### **Source list for the RAG answers**

In [None]:
# Retrieve once per retriever and reuse the results here and in the next cell
sim_retrieved_docs = retriever_sim.invoke(query)
mmr_retrieved_docs = retriever_mmr.invoke(query)
docs_sim, docs_mmr = sim_retrieved_docs, mmr_retrieved_docs

sim_sources = format_sources(sim_retrieved_docs)
mmr_sources = format_sources(mmr_retrieved_docs)


print("=== Similarity RAG ===")
print()
for d in docs_sim:
    print(d.metadata.get("year"), d.metadata.get("title"))
    print(d.metadata.get("url"))
    print()

print("\n=== MMR RAG ===")
print()
for d in docs_mmr:
    print(d.metadata.get("year"), d.metadata.get("title"))
    print(d.metadata.get("url"))
    print()

### **Source details for the RAG answers**

In [None]:
# Display the sources separately
from IPython.display import HTML

print("=== Similarity Source metainfo (Metadata and Content) ===")
for i, doc in enumerate(sim_retrieved_docs):
    url = doc.metadata.get('url', 'N/A')
    title = doc.metadata.get('title', 'N/A')
    year = doc.metadata.get('year', 'N/A')
    display(HTML(f"<b>Document {i+1} (Similarity):</b><br>" +
                 f"Year: {year}<br>" +
                 f"Title: {title}<br>" +
                 f"URL: <a href='{url}' target='_blank'>{url}</a><br>" +
                 "<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>" +
                 f"Content: {doc.page_content}</div><br>"))

print("\n=== MMR Source metainfo (Metadata and Content) ===")
for i, doc in enumerate(mmr_retrieved_docs):
    url = doc.metadata.get('url', 'N/A')
    title = doc.metadata.get('title', 'N/A')
    year = doc.metadata.get('year', 'N/A')
    display(HTML(f"<b>Document {i+1} (MMR):</b><br>" +
                 f"Year: {year}<br>" +
                 f"Title: {title}<br>" +
                 f"URL: <a href='{url}' target='_blank'>{url}</a><br>" +
                 "<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>" +
                 f"Content: {doc.page_content}</div><br>"))

---

---
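## Q2  Vocabulary / concept explanation question (explaining "Teletopia")

## **＜Enter your question here＞**

In [None]:
# question 2 (English gloss: "What is Teletopia?")
query = "テレトピアとは何ですか。"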

---
## ＜Comparing the LLM answer and the RAG answers＞

### **RAG answers** (Domain-specific answering)
*   **Similarity RAG** (cosine similarity): selects the 10 chunks most similar to the question, in descending order of similarity
*   **MMR RAG** (MMR: Maximal Marginal Relevance): selects 20 chunks that are close to the question but not too similar to one another

In [None]:
# answer with the context from vector DB
answer_sim = rag_chain_sim.invoke(query)
print("=== Similarity RAG ===")
display_answer(answer_sim)

answer_mmr = rag_chain_mmr.invoke(query)
print("\n=== MMR RAG ===")
display_answer(answer_mmr)

### **LLM answer** (Domain-general answering)

In [None]:
# answer without external knowledge
answer_without_knowledge = llm_chain.invoke({"context":"", "question": query})
display_answer(answer_without_knowledge)

---
## ＜Source information (Extracting sources)＞

### **Source list for the RAG answers**

In [None]:
# Retrieve once per retriever and reuse the results here and in the next cell
sim_retrieved_docs = retriever_sim.invoke(query)
mmr_retrieved_docs = retriever_mmr.invoke(query)
docs_sim, docs_mmr = sim_retrieved_docs, mmr_retrieved_docs

sim_sources = format_sources(sim_retrieved_docs)
mmr_sources = format_sources(mmr_retrieved_docs)


print("=== Similarity RAG ===")
print()
for d in docs_sim:
    print(d.metadata.get("year"), d.metadata.get("title"))
    print(d.metadata.get("url"))
    print()

print("\n=== MMR RAG ===")
print()
for d in docs_mmr:
    print(d.metadata.get("year"), d.metadata.get("title"))
    print(d.metadata.get("url"))
    print()

### **Source details for the RAG answers**

In [None]:
# Display the sources separately
from IPython.display import HTML

print("=== Similarity Source metainfo (Metadata and Content) ===")
for i, doc in enumerate(sim_retrieved_docs):
    url = doc.metadata.get('url', 'N/A')
    title = doc.metadata.get('title', 'N/A')
    year = doc.metadata.get('year', 'N/A')
    display(HTML(f"<b>Document {i+1} (Similarity):</b><br>" +
                 f"Year: {year}<br>" +
                 f"Title: {title}<br>" +
                 f"URL: <a href='{url}' target='_blank'>{url}</a><br>" +
                 "<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>" +
                 f"Content: {doc.page_content}</div><br>"))

print("\n=== MMR Source metainfo (Metadata and Content) ===")
for i, doc in enumerate(mmr_retrieved_docs):
    url = doc.metadata.get('url', 'N/A')
    title = doc.metadata.get('title', 'N/A')
    year = doc.metadata.get('year', 'N/A')
    display(HTML(f"<b>Document {i+1} (MMR):</b><br>" +
                 f"Year: {year}<br>" +
                 f"Title: {title}<br>" +
                 f"URL: <a href='{url}' target='_blank'>{url}</a><br>" +
                 "<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>" +
                 f"Content: {doc.page_content}</div><br>"))

---

---
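## Q3  Numeric / explanatory question (international comparison of the informatization index as of 1970)

Query the RAG system about the following passage from the external knowledge source (『昭和48年版 情報白書』 (Showa 48 (1973) edition), international comparison of the informatization index):

> An international comparison using the informatization index shows that, as of 1970 and taking Japan as 100, the United States stands at 199, the United Kingdom at 108, West Germany at 101, and France at 96; overall, Japan is on a par with Western Europe but remains far behind the United States.
> The items on which Japan shows particularly large values compared with other countries are annual calls per person (Japan 382, US 779, UK 195, West Germany 166, France 105), average daily newspaper circulation per 100 people (Japan 51.1 copies, US 30.2, UK 46.3, West Germany 31.9, France 23.8), population density (Japan 280 people/km², US 22, UK 228, West Germany 240, France 93), and university students per 100 people (Japan 1.56, US 3.82, UK 0.62, West Germany 0.74, France 1.20).
> Conversely, the items on which Japan shows smaller values than other countries are book titles published annually per 10,000 people (Japan 3.03 titles, US 3.88, UK 5.97, West Germany 7.69, France 4.50) and computers per 10,000 people (Japan 0.65 units, US 3.06, UK 1.06, West Germany 1.08, France 0.88). This shows that, besides the very small number of book titles published per 10,000 people compared with other countries, informatization driven by human factors (call volume, population density, university enrollment) is high, whereas what might be called the information equipment rate, such as televisions and computers, is characteristically low.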


In [None]:
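# question (English gloss: "As of 1970, how does Japan's progress in informatization compare internationally according to the informatization index?")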
query = "1970年時点における日本の情報化の進捗状況について，情報化指数により国際比較を行うと，どのような状況になっていますか。"
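
The per-item figures quoted in the Q3 passage can be put on the same footing as the headline index by rescaling each item so that Japan = 100. The snippet below is only a worked arithmetic example for two of the items; it does not reproduce how the white paper computes the composite informatization index, which the passage does not describe. The function name `relative_to_japan` is hypothetical.

```python
# Per-item figures quoted in the passage (1970).
items = {
    "annual calls per person": {
        "Japan": 382, "US": 779, "UK": 195, "West Germany": 166, "France": 105,
    },
    "computers per 10,000 people": {
        "Japan": 0.65, "US": 3.06, "UK": 1.06, "West Germany": 1.08, "France": 0.88,
    },
}


def relative_to_japan(values):
    """Rescale one item so that Japan = 100, rounded to the nearest integer."""
    base = values["Japan"]
    return {country: round(value / base * 100) for country, value in values.items()}


for name, values in items.items():
    print(name, relative_to_japan(values))
```

On this scale the United States comes out at 204 for calls and 471 for computers, which illustrates the passage's point that the gap is far larger for information equipment than for the overall index of 199.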

---
## ＜Comparing the LLM answer and the RAG answers＞

### **RAG answers** (Domain-specific answering)
*   **Similarity RAG** (cosine similarity): selects the 10 chunks most similar to the question, in descending order of similarity
*   **MMR RAG** (MMR: Maximal Marginal Relevance): selects 20 chunks that are close to the question but not too similar to one another

In [None]:
# answer with the context from vector DB
answer_sim = rag_chain_sim.invoke(query)
print("=== Similarity RAG ===")
display_answer(answer_sim)

answer_mmr = rag_chain_mmr.invoke(query)
print("\n=== MMR RAG ===")
display_answer(answer_mmr)

### **LLM answer** (Domain-general answering)

In [None]:
# answer without external knowledge
answer_without_knowledge = llm_chain.invoke({"context":"", "question": query})
display_answer(answer_without_knowledge)

---
## ＜Source information (Extracting sources)＞

### **Source list for the RAG answers**

In [None]:
# Retrieve once per retriever and reuse the results here and in the next cell
sim_retrieved_docs = retriever_sim.invoke(query)
mmr_retrieved_docs = retriever_mmr.invoke(query)
docs_sim, docs_mmr = sim_retrieved_docs, mmr_retrieved_docs

sim_sources = format_sources(sim_retrieved_docs)
mmr_sources = format_sources(mmr_retrieved_docs)


print("=== Similarity RAG ===")
print()
for d in docs_sim:
    print(d.metadata.get("year"), d.metadata.get("title"))
    print(d.metadata.get("url"))
    print()

print("\n=== MMR RAG ===")
print()
for d in docs_mmr:
    print(d.metadata.get("year"), d.metadata.get("title"))
    print(d.metadata.get("url"))
    print()

### **Source details for the RAG answers**

In [None]:
# Display the sources separately
from IPython.display import HTML

print("=== Similarity Source metainfo (Metadata and Content) ===")
for i, doc in enumerate(sim_retrieved_docs):
    url = doc.metadata.get('url', 'N/A')
    title = doc.metadata.get('title', 'N/A')
    year = doc.metadata.get('year', 'N/A')
    display(HTML(f"<b>Document {i+1} (Similarity):</b><br>" +
                 f"Year: {year}<br>" +
                 f"Title: {title}<br>" +
                 f"URL: <a href='{url}' target='_blank'>{url}</a><br>" +
                 "<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>" +
                 f"Content: {doc.page_content}</div><br>"))

print("\n=== MMR Source metainfo (Metadata and Content) ===")
for i, doc in enumerate(mmr_retrieved_docs):
    url = doc.metadata.get('url', 'N/A')
    title = doc.metadata.get('title', 'N/A')
    year = doc.metadata.get('year', 'N/A')
    display(HTML(f"<b>Document {i+1} (MMR):</b><br>" +
                 f"Year: {year}<br>" +
                 f"Title: {title}<br>" +
                 f"URL: <a href='{url}' target='_blank'>{url}</a><br>" +
                 "<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>" +
                 f"Content: {doc.page_content}</div><br>"))

---

---
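## Q4  Numeric / situational question (characteristics of television viewing time by age group in Showa 48 (1973))

Try querying the RAG system about the following passage from the external knowledge source (『昭和48年版 情報白書』 (Showa 48 (1973) edition), television viewing time):

> Looking at television viewing time by age group, the pattern of change differs considerably between age groups. In particular, while people in their 20s show almost no change, teenagers show a slight decline and, conversely, people in their 50s and 60s show a large increase. In detail, teenagers went from 2 hours 18 minutes in Showa 40 (1965) to 2 hours 9 minutes in Showa 45 (1970), a decrease of 9 minutes, whereas people in their 60s, who show the most marked increase, went from 3 hours 16 minutes in Showa 40 to 4 hours 1 minute in Showa 45, an increase of 45 minutes. The trends in television viewing time thus differ considerably by age group. According to the "National Attitude Survey" (全国意向調査) conducted by the NHK Public Opinion Research Institute (ＮＨＫ放送世論調査所), among both men and women, younger generations are more selective about television programs and more often "watch when there is a program they want to see," whereas the share answering "watching has become a habit" rises with age, a striking contrast.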
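
The differences stated in the passage can be checked with a few lines of arithmetic. This is only a verification sketch using the figures quoted above; the names `to_minutes` and `viewing` are illustrative.

```python
def to_minutes(hours, minutes):
    """Convert an hours-and-minutes duration to total minutes."""
    return hours * 60 + minutes


# Viewing times quoted in the passage: (Showa 40, Showa 45).
viewing = {
    "teens": (to_minutes(2, 18), to_minutes(2, 9)),
    "60s": (to_minutes(3, 16), to_minutes(4, 1)),
}

for group, (before, after) in viewing.items():
    print(group, after - before)  # teens -9, 60s 45
```

Both results match the passage: a 9-minute decrease for teenagers and a 45-minute increase for people in their 60s.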



In [None]:
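# question (English gloss: "What were the characteristics of television viewing time by age group in Showa 48 (1973)?")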
query = "昭和48年におけるテレビ視聴時間は、年代別にみてどのような特徴がありましたか。"

---
## ＜Comparing the LLM answer and the RAG answers＞

### **RAG answers** (Domain-specific answering)
*   Similarity RAG (cosine similarity): a simple method that takes the chunks most similar to the question, in descending order of similarity
*   MMR RAG (MMR: Maximal Marginal Relevance): a method that selects chunks that are close to the question but not too similar to one another

In [None]:
# answer with the context from vector DB
answer_sim = rag_chain_sim.invoke(query)
print("=== Similarity RAG ===")
display_answer(answer_sim)

answer_mmr = rag_chain_mmr.invoke(query)
print("\n=== MMR RAG ===")
display_answer(answer_mmr)

### **LLM answer** (Domain-general answering)

In [None]:
# answer without external knowledge
answer_without_knowledge = llm_chain.invoke({"context":"", "question": query})
display_answer(answer_without_knowledge)

---
## ＜Source information (Extracting sources)＞

### **Source list for the RAG answers**

In [None]:
# Retrieve once per retriever and reuse the results here and in the next cell
sim_retrieved_docs = retriever_sim.invoke(query)
mmr_retrieved_docs = retriever_mmr.invoke(query)
docs_sim, docs_mmr = sim_retrieved_docs, mmr_retrieved_docs

sim_sources = format_sources(sim_retrieved_docs)
mmr_sources = format_sources(mmr_retrieved_docs)


print("=== Similarity RAG ===")
print()
for d in docs_sim:
    print(d.metadata.get("year"), d.metadata.get("title"))
    print(d.metadata.get("url"))
    print()

print("\n=== MMR RAG ===")
print()
for d in docs_mmr:
    print(d.metadata.get("year"), d.metadata.get("title"))
    print(d.metadata.get("url"))
    print()

### **Source details for the RAG answers**

In [None]:
# Display the sources separately
from IPython.display import HTML

print("=== Similarity Source metainfo (Metadata and Content) ===")
for i, doc in enumerate(sim_retrieved_docs):
    url = doc.metadata.get('url', 'N/A')
    title = doc.metadata.get('title', 'N/A')
    year = doc.metadata.get('year', 'N/A')
    display(HTML(f"<b>Document {i+1} (Similarity):</b><br>" +
                 f"Year: {year}<br>" +
                 f"Title: {title}<br>" +
                 f"URL: <a href='{url}' target='_blank'>{url}</a><br>" +
                 "<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>" +
                 f"Content: {doc.page_content}</div><br>"))

print("\n=== MMR Source metainfo (Metadata and Content) ===")
for i, doc in enumerate(mmr_retrieved_docs):
    url = doc.metadata.get('url', 'N/A')
    title = doc.metadata.get('title', 'N/A')
    year = doc.metadata.get('year', 'N/A')
    display(HTML(f"<b>Document {i+1} (MMR):</b><br>" +
                 f"Year: {year}<br>" +
                 f"Title: {title}<br>" +
                 f"URL: <a href='{url}' target='_blank'>{url}</a><br>" +
                 "<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>" +
                 f"Content: {doc.page_content}</div><br>"))

---

---
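## Q5 [Time series / longitudinal] Numeric / situational question (trends in smartphone ownership and changes in usage)

Query the RAG system about the following passage from the external knowledge source (『令和7年版 情報通信白書』 (Reiwa 7 (2025) edition), Part II "Current state and issues of the information and communications field", Section 11 "Trends in digital utilization", 1 "Trends in digital utilization in daily life", (1) "Information and communications devices and terminals"):

> Regarding the terminals needed to connect to the Internet and similar networks when using digital services, the household ownership rate of information and communications devices in 2024 was 97.0% for "mobile terminals overall", of which "smartphones" accounted for 90.5%. The rate for personal computers was 66.4% (Figure II-1-11-1).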



In [None]:
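# question (English gloss: "How has the smartphone usage rate changed over time so far? And how have smartphone usage patterns changed?")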
query = "これまでにスマートフォンの利用率はどのように推移してきましたか。また、スマートフォン利用状況はどのように変化してきましたか。"

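* **So far**, how has the smartphone usage rate (ownership rate) changed over time? And how have smartphone usage patterns changed?
* **Up to 2025** ～
* **Up to 2024** ～  
⇒ For long-term, longitudinal (time-series) content, it may be better to increase the number of retrieved documents.
* What is the smartphone ownership rate in **2024**? (Please answer with reference to the **Reiwa 7 (2025) edition of the 情報通信白書**.)
* **2023** ～  
⇒ It is unclear why the answers come out this way; the cause is unknown!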
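
One way to start investigating such answers is to look at which editions the retrieved chunks come from. The sketch below counts chunks per `year` metadata value; it assumes the chunks carry a `year` key, as the source-list cells in this notebook do, and uses a stand-in `FakeDoc` class with invented data instead of real retrieval results. `year_histogram` is a hypothetical helper name.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for a retrieved LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def year_histogram(docs):
    """Count retrieved chunks per `year` metadata value ('N/A' if missing)."""
    return Counter(doc.metadata.get("year", "N/A") for doc in docs)


# Invented example: most chunks come from an older edition.
docs = [
    FakeDoc("a", {"year": 2013}),
    FakeDoc("b", {"year": 2013}),
    FakeDoc("c", {"year": 2025}),
    FakeDoc("d"),
]
for year, count in sorted(year_histogram(docs).items(), key=lambda kv: str(kv[0])):
    print(year, count)
```

In the notebook this could be called as `year_histogram(retriever_sim.invoke(query))`. If few or no chunks come from the edition the question targets, the retrieval step rather than the generator is the likely place to look.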

In [None]:
# question (English gloss: "What is the smartphone ownership rate in 2024?")
query = "2024年におけるスマートフォンの保有率はどうなっていますか。"

---
## ＜Comparing the LLM answer and the RAG answers＞

### **RAG answers** (Domain-specific answering)
*   Similarity RAG (cosine similarity): a simple method that takes the chunks most similar to the question, in descending order of similarity
*   MMR RAG (MMR: Maximal Marginal Relevance): a method that selects chunks that are close to the question but not too similar to one another

In [None]:
# answer with the context from vector DB
answer_sim = rag_chain_sim.invoke(query)
print("=== Similarity RAG ===")
display_answer(answer_sim)

answer_mmr = rag_chain_mmr.invoke(query)
print("\n=== MMR RAG ===")
display_answer(answer_mmr)

### **LLM answer** (Domain-general answering)

In [None]:
# answer without external knowledge
answer_without_knowledge = llm_chain.invoke({"context":"", "question": query})
display_answer(answer_without_knowledge)

---
## ＜Source information (Extracting sources)＞

### **Source list for the RAG answers**

In [None]:
# Retrieve once per retriever and reuse the results here and in the next cell
sim_retrieved_docs = retriever_sim.invoke(query)
mmr_retrieved_docs = retriever_mmr.invoke(query)
docs_sim, docs_mmr = sim_retrieved_docs, mmr_retrieved_docs

sim_sources = format_sources(sim_retrieved_docs)
mmr_sources = format_sources(mmr_retrieved_docs)


print("=== Similarity RAG ===")
print()
for d in docs_sim:
    print(d.metadata.get("year"), d.metadata.get("title"))
    print(d.metadata.get("url"))
    print()

print("\n=== MMR RAG ===")
print()
for d in docs_mmr:
    print(d.metadata.get("year"), d.metadata.get("title"))
    print(d.metadata.get("url"))
    print()

### **Source details for the RAG answers**

In [None]:
# Display the sources separately
from IPython.display import HTML

print("=== Similarity Source metainfo (Metadata and Content) ===")
for i, doc in enumerate(sim_retrieved_docs):
    url = doc.metadata.get('url', 'N/A')
    title = doc.metadata.get('title', 'N/A')
    year = doc.metadata.get('year', 'N/A')
    display(HTML(f"<b>Document {i+1} (Similarity):</b><br>" +
                 f"Year: {year}<br>" +
                 f"Title: {title}<br>" +
                 f"URL: <a href='{url}' target='_blank'>{url}</a><br>" +
                 "<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>" +
                 f"Content: {doc.page_content}</div><br>"))

print("\n=== MMR Source metainfo (Metadata and Content) ===")
for i, doc in enumerate(mmr_retrieved_docs):
    url = doc.metadata.get('url', 'N/A')
    title = doc.metadata.get('title', 'N/A')
    year = doc.metadata.get('year', 'N/A')
    display(HTML(f"<b>Document {i+1} (MMR):</b><br>" +
                 f"Year: {year}<br>" +
                 f"Title: {title}<br>" +
                 f"URL: <a href='{url}' target='_blank'>{url}</a><br>" +
                 "<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>" +
                 f"Content: {doc.page_content}</div><br>"))

---

---
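## Q6 [Time series / longitudinal] Overview / change explanation question (trends and overview of informatization)

## **＜Enter your question here＞**

In [None]:
# question (English gloss: "How can the trends in informatization over the roughly 50 years from 1973 to 2025 be summarized? Describe the overall flow and direction. Also list about five major trends and describe how each has developed and changed.")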
query = "1973年から2025年までの50年ほどの情報化をめぐる潮流について、どのようにまとめることができますか。全体的な流れ・動向を述べてください。また、主たる潮流を5つ程度挙げ、その動向や変容についても述べてください。"

---
## ＜Comparing the LLM answer and the RAG answers＞

### **RAG answers** (Domain-specific answering)
*   Similarity RAG (cosine similarity): a simple method that takes the chunks most similar to the question, in descending order of similarity
*   MMR RAG (MMR: Maximal Marginal Relevance): a method that selects chunks that are close to the question but not too similar to one another

In [None]:
# answer with the context from vector DB
answer_sim = rag_chain_sim.invoke(query)
print("=== Similarity RAG ===")
display_answer(answer_sim)

answer_mmr = rag_chain_mmr.invoke(query)
print("\n=== MMR RAG ===")
display_answer(answer_mmr)

### **LLM answer** (Domain-general answering)

In [None]:
# answer without external knowledge
answer_without_knowledge = llm_chain.invoke({"context":"", "question": query})
display_answer(answer_without_knowledge)

---
## ＜Source information (Extracting sources)＞

### **Source list for the RAG answers**

In [None]:
# Retrieve once per retriever and reuse the results here and in the next cell
sim_retrieved_docs = retriever_sim.invoke(query)
mmr_retrieved_docs = retriever_mmr.invoke(query)
docs_sim, docs_mmr = sim_retrieved_docs, mmr_retrieved_docs

sim_sources = format_sources(sim_retrieved_docs)
mmr_sources = format_sources(mmr_retrieved_docs)


print("=== Similarity RAG ===")
print()
for d in docs_sim:
    print(d.metadata.get("year"), d.metadata.get("title"))
    print(d.metadata.get("url"))
    print()

print("\n=== MMR RAG ===")
print()
for d in docs_mmr:
    print(d.metadata.get("year"), d.metadata.get("title"))
    print(d.metadata.get("url"))
    print()

### **Source details for the RAG answers**

In [None]:
# Display the sources separately
from IPython.display import HTML

print("=== Similarity Source metainfo (Metadata and Content) ===")
for i, doc in enumerate(sim_retrieved_docs):
    url = doc.metadata.get('url', 'N/A')
    title = doc.metadata.get('title', 'N/A')
    year = doc.metadata.get('year', 'N/A')
    display(HTML(f"<b>Document {i+1} (Similarity):</b><br>" +
                 f"Year: {year}<br>" +
                 f"Title: {title}<br>" +
                 f"URL: <a href='{url}' target='_blank'>{url}</a><br>" +
                 "<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>" +
                 f"Content: {doc.page_content}</div><br>"))

print("\n=== MMR Source metainfo (Metadata and Content) ===")
for i, doc in enumerate(mmr_retrieved_docs):
    url = doc.metadata.get('url', 'N/A')
    title = doc.metadata.get('title', 'N/A')
    year = doc.metadata.get('year', 'N/A')
    display(HTML(f"<b>Document {i+1} (MMR):</b><br>" +
                 f"Year: {year}<br>" +
                 f"Title: {title}<br>" +
                 f"URL: <a href='{url}' target='_blank'>{url}</a><br>" +
                 "<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>" +
                 f"Content: {doc.page_content}</div><br>"))

---

---
## Q

Query the RAG system about the following passage from the external knowledge source (『？?年版 情報通信白書』？):

>



---
## **＜Enter your question here＞**

In [None]:
# question
query = "　？"

## ＜Comparing the LLM answer and the RAG answers＞

### **RAG answers** (Domain-specific answering)
*   Similarity RAG (cosine similarity): a simple method that takes the chunks most similar to the question, in descending order of similarity
*   MMR RAG (MMR: Maximal Marginal Relevance): a method that selects chunks that are close to the question but not too similar to one another

In [None]:
# answer with the context from vector DB
answer_sim = rag_chain_sim.invoke(query)
print("=== Similarity RAG ===")
display_answer(answer_sim)

answer_mmr = rag_chain_mmr.invoke(query)
print("\n=== MMR RAG ===")
display_answer(answer_mmr)

### **LLM answer** (Domain-general answering)

In [None]:
# answer without external knowledge
answer_without_knowledge = llm_chain.invoke({"context":"", "question": query})
display_answer(answer_without_knowledge)

---
## ＜Source information (Extracting sources)＞

### **Source list for the RAG answers**

In [None]:
# Retrieve once per retriever and reuse the results here and in the next cell
sim_retrieved_docs = retriever_sim.invoke(query)
mmr_retrieved_docs = retriever_mmr.invoke(query)
docs_sim, docs_mmr = sim_retrieved_docs, mmr_retrieved_docs

sim_sources = format_sources(sim_retrieved_docs)
mmr_sources = format_sources(mmr_retrieved_docs)


print("=== Similarity RAG ===")
print()
for d in docs_sim:
    print(d.metadata.get("year"), d.metadata.get("title"))
    print(d.metadata.get("url"))
    print()

print("\n=== MMR RAG ===")
print()
for d in docs_mmr:
    print(d.metadata.get("year"), d.metadata.get("title"))
    print(d.metadata.get("url"))
    print()

### **Source details for the RAG answers**

In [None]:
# Display the sources separately
from IPython.display import HTML

print("=== Similarity Source metainfo (Metadata and Content) ===")
for i, doc in enumerate(sim_retrieved_docs):
    url = doc.metadata.get('url', 'N/A')
    title = doc.metadata.get('title', 'N/A')
    year = doc.metadata.get('year', 'N/A')
    display(HTML(f"<b>Document {i+1} (Similarity):</b><br>" +
                 f"Year: {year}<br>" +
                 f"Title: {title}<br>" +
                 f"URL: <a href='{url}' target='_blank'>{url}</a><br>" +
                 "<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>" +
                 f"Content: {doc.page_content}</div><br>"))

print("\n=== MMR Source metainfo (Metadata and Content) ===")
for i, doc in enumerate(mmr_retrieved_docs):
    url = doc.metadata.get('url', 'N/A')
    title = doc.metadata.get('title', 'N/A')
    year = doc.metadata.get('year', 'N/A')
    display(HTML(f"<b>Document {i+1} (MMR):</b><br>" +
                 f"Year: {year}<br>" +
                 f"Title: {title}<br>" +
                 f"URL: <a href='{url}' target='_blank'>{url}</a><br>" +
                 "<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>" +
                 f"Content: {doc.page_content}</div><br>"))

---

---
## Q

Query the RAG system about the following passage from the external knowledge source (『？?年版 情報通信白書』？):

>



## **＜Enter your question here＞**

In [None]:
# question
query = "　？"

---
## ＜Comparing the LLM answer and the RAG answers＞

### **RAG answers** (Domain-specific answering)
*   Similarity RAG (cosine similarity): a simple method that takes the chunks most similar to the question, in descending order of similarity
*   MMR RAG (MMR: Maximal Marginal Relevance): a method that selects chunks that are close to the question but not too similar to one another

In [None]:
# answer with the context from vector DB
answer_sim = rag_chain_sim.invoke(query)
print("=== Similarity RAG ===")
display_answer(answer_sim)

answer_mmr = rag_chain_mmr.invoke(query)
print("\n=== MMR RAG ===")
display_answer(answer_mmr)

### **LLM answer** (Domain-general answering)

In [None]:
# answer without external knowledge
answer_without_knowledge = llm_chain.invoke({"context":"", "question": query})
display_answer(answer_without_knowledge)

---
## ＜Source information (Extracting sources)＞

### **Source list for the RAG answers**

In [None]:
# Retrieve once per retriever and reuse the results here and in the next cell
sim_retrieved_docs = retriever_sim.invoke(query)
mmr_retrieved_docs = retriever_mmr.invoke(query)
docs_sim, docs_mmr = sim_retrieved_docs, mmr_retrieved_docs

sim_sources = format_sources(sim_retrieved_docs)
mmr_sources = format_sources(mmr_retrieved_docs)


print("=== Similarity RAG ===")
print()
for d in docs_sim:
    print(d.metadata.get("year"), d.metadata.get("title"))
    print(d.metadata.get("url"))
    print()

print("\n=== MMR RAG ===")
print()
for d in docs_mmr:
    print(d.metadata.get("year"), d.metadata.get("title"))
    print(d.metadata.get("url"))
    print()

### **Source details for the RAG answers**

In [None]:
# Display the sources separately
from IPython.display import HTML

print("=== Similarity Source metainfo (Metadata and Content) ===")
for i, doc in enumerate(sim_retrieved_docs):
    url = doc.metadata.get('url', 'N/A')
    title = doc.metadata.get('title', 'N/A')
    year = doc.metadata.get('year', 'N/A')
    display(HTML(f"<b>Document {i+1} (Similarity):</b><br>" +
                 f"Year: {year}<br>" +
                 f"Title: {title}<br>" +
                 f"URL: <a href='{url}' target='_blank'>{url}</a><br>" +
                 "<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>" +
                 f"Content: {doc.page_content}</div><br>"))

print("\n=== MMR Source metainfo (Metadata and Content) ===")
for i, doc in enumerate(mmr_retrieved_docs):
    url = doc.metadata.get('url', 'N/A')
    title = doc.metadata.get('title', 'N/A')
    year = doc.metadata.get('year', 'N/A')
    display(HTML(f"<b>Document {i+1} (MMR):</b><br>" +
                 f"Year: {year}<br>" +
                 f"Title: {title}<br>" +
                 f"URL: <a href='{url}' target='_blank'>{url}</a><br>" +
                 "<div style='word-wrap: break-word; overflow-wrap: break-word; white-space: pre-wrap;'>" +
                 f"Content: {doc.page_content}</div><br>"))

---