<a href="https://colab.research.google.com/github/franlin1860/llm/blob/main/ingestion_nodes_20240827.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://colab.research.google.com/github/jerryjliu/llama_index/blob/main/docs/docs/examples/low_level/ingestion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building Data Ingestion from Scratch

In this tutorial, we show you how to build a data ingestion pipeline into a vector database.

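The original version of this tutorial uses Pinecone as the vector database. This notebook does not set up Pinecone; it relies on the default in-memory vector store that `VectorStoreIndex` creates.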

We will show how to:

1. Load documents.
2. Split documents with a text splitter.
3. **Manually** construct nodes from each text chunk.
4. [Optional] Add metadata to each Node.
5. Generate embeddings for each text chunk.
6. Insert the nodes into a vector database.

Reference: https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_nodes/

# Prevent disconnection

In [1]:
#@markdown <h3>← Run this cell after entering the code to prevent disconnection</h3>
import IPython
from google.colab import output

display(IPython.display.Javascript('''
 function ClickConnect(){
   btn = document.querySelector("colab-connect-button")
   if (btn != null){
     console.log("Click colab-connect-button");
     btn.click()
     }

   btn = document.getElementById('ok')
   if (btn != null){
     console.log("Click reconnect");
     btn.click()
     }
  }

setInterval(ClickConnect,60000)
'''))

print("Done.")

<IPython.core.display.Javascript object>

Done.


In [None]:
function ConnectButton(){
    console.log("Connect pushed");
    document.querySelector("#connect").click()
}
setInterval(ConnectButton,60000);

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [2]:
!pip install llama-index

Collecting llama-index
  Downloading llama_index-0.11.1-py3-none-any.whl.metadata (11 kB)
Collecting llama-index-agent-openai<0.4.0,>=0.3.0 (from llama-index)
  Downloading llama_index_agent_openai-0.3.0-py3-none-any.whl.metadata (728 bytes)
Collecting llama-index-cli<0.4.0,>=0.3.0 (from llama-index)
  Downloading llama_index_cli-0.3.0-py3-none-any.whl.metadata (1.5 kB)
Collecting llama-index-core<0.12.0,>=0.11.1 (from llama-index)
  Downloading llama_index_core-0.11.1-py3-none-any.whl.metadata (2.4 kB)
Collecting llama-index-embeddings-openai<0.3.0,>=0.2.0 (from llama-index)
  Downloading llama_index_embeddings_openai-0.2.3-py3-none-any.whl.metadata (635 bytes)
Collecting llama-index-indices-managed-llama-cloud>=0.3.0 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.3.0-py3-none-any.whl.metadata (3.8 kB)
Collecting llama-index-legacy<0.10.0,>=0.9.48 (from llama-index)
  Downloading llama_index_legacy-0.9.48.post3-py3-none-any.whl.metadata (8.5 kB)

## LLM



## Environment

First we add our dependencies.

#### Set Environment Variables


In [25]:
import os

os.environ["DEEPSEEK_API_KEY"] = ""

Set your DeepSeek API key in the `DEEPSEEK_API_KEY` environment variable above before running the setup cells.
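An empty key only surfaces as an error once the first request is sent, so it can help to check the variable up front. The sketch below is one way to do that with the standard library; `require_env` is a hypothetical helper, not part of LlamaIndex.

```python
import os


def require_env(name: str) -> str:
    """Return a non-empty environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty")
    return value


# Hypothetical usage: fail fast instead of sending an empty key to the API.
# api_key = require_env("DEEPSEEK_API_KEY")
```

The returned value could then be passed as `api_key` when constructing the LLM client.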

## Setup

In [4]:
!pip install llama_index-llms-openai_like
!pip install llama_index-embeddings-huggingface

Collecting llama_index-llms-openai_like
  Downloading llama_index_llms_openai_like-0.2.0-py3-none-any.whl.metadata (753 bytes)
Downloading llama_index_llms_openai_like-0.2.0-py3-none-any.whl (3.1 kB)
Installing collected packages: llama_index-llms-openai_like
Successfully installed llama_index-llms-openai_like-0.2.0
Collecting llama_index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.3.1-py3-none-any.whl.metadata (718 bytes)
Collecting sentence-transformers>=2.6.1 (from llama_index-embeddings-huggingface)
  Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Collecting minijinja>=1.0 (from huggingface-hub[inference]>=0.19.0->llama_index-embeddings-huggingface)
  Downloading minijinja-2.2.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.8 kB)
Downloading llama_index_embeddings_huggingface-0.3.1-py3-none-any.whl (8.6 kB)
Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)

In [26]:
import os
import logging
import sys
from llama_index.llms.openai_like import OpenAILike
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Configure logging
logging.basicConfig(stream=sys.stdout, level=logging.INFO)

# Define the DeepSeek model (accessed through its OpenAI-compatible API)
llm = OpenAILike(model="deepseek-chat",
                 api_base="https://api.deepseek.com/v1",
                 api_key=os.environ["DEEPSEEK_API_KEY"],
                 temperature=0.6,
                 is_chat_model=True)

# Configure the global settings
Settings.llm = llm

# Set the embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-zh-v1.5")
Settings.embed_model = embed_model
Settings.chunk_size = 256

In [28]:
# Smoke test: send "你好" ("Hello") to the model
results = llm.complete("你好")
print(results)

你好！欢迎使用人工智能助手。有什么我可以帮助你的吗？


## Build an Ingestion Pipeline from Scratch

We show how to build an ingestion pipeline as mentioned in the introduction.

Note that steps (2) and (3) can be handled via our `NodeParser` abstractions, which handle splitting and node creation.

For the purposes of this tutorial, we show you how to create these objects manually.

### 1. Load Data

In [7]:
!mkdir data
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"

--2024-08-27 07:50:06--  https://arxiv.org/pdf/2307.09288.pdf
Resolving arxiv.org (arxiv.org)... 151.101.131.42, 151.101.195.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.131.42|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://arxiv.org/pdf/2307.09288 [following]
--2024-08-27 07:50:07--  http://arxiv.org/pdf/2307.09288
Connecting to arxiv.org (arxiv.org)|151.101.131.42|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: ‘data/llama2.pdf’


2024-08-27 07:50:07 (85.4 MB/s) - ‘data/llama2.pdf’ saved [13661300/13661300]



In [9]:
!pip install pymupdf

Collecting pymupdf
  Downloading PyMuPDF-1.24.9-cp310-none-manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting PyMuPDFb==1.24.9 (from pymupdf)
  Downloading PyMuPDFb-1.24.9-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.4 kB)
Downloading PyMuPDF-1.24.9-cp310-none-manylinux2014_x86_64.whl (3.5 MB)
Downloading PyMuPDFb-1.24.9-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.9 MB)
Installing collected packages: PyMuPDFb, pymupdf
Successfully installed PyMuPDFb-1.24.9 pymupdf-1.24.9


In [10]:
import fitz

In [11]:
file_path = "./data/llama2.pdf"
doc = fitz.open(file_path)

### 2. Use a Text Splitter to Split Documents

Here we import our `SentenceSplitter` to split document texts into smaller chunks, while preserving paragraphs/sentences as much as possible.

In [12]:
from llama_index.core.node_parser import SentenceSplitter

In [13]:
text_parser = SentenceSplitter(
    chunk_size=1024,
    # separator=" ",
)

In [14]:
text_chunks = []
# maintain relationship with source doc index, to help inject doc metadata in (3)
doc_idxs = []
for doc_idx, page in enumerate(doc):
    page_text = page.get_text("text")
    cur_text_chunks = text_parser.split_text(page_text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))
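
The loop above keeps two parallel lists: `text_chunks[i]` is a chunk and `doc_idxs[i]` is the index of the page it came from. The sketch below shows the same bookkeeping on plain strings. It assumes a toy word-count splitter (`split_words`) as a stand-in for `SentenceSplitter`, so it illustrates only the index tracking, not the real splitting rules.

```python
from typing import List, Tuple


def split_words(text: str, max_words: int) -> List[str]:
    """Toy splitter: group whitespace-separated words into chunks of at most max_words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def split_pages(pages: List[str], max_words: int) -> Tuple[List[str], List[int]]:
    """Split every page and record, for each chunk, the index of its source page."""
    text_chunks: List[str] = []
    doc_idxs: List[int] = []
    for doc_idx, page_text in enumerate(pages):
        cur_chunks = split_words(page_text, max_words)
        text_chunks.extend(cur_chunks)
        # one index entry per chunk keeps both lists the same length
        doc_idxs.extend([doc_idx] * len(cur_chunks))
    return text_chunks, doc_idxs


pages = ["one two three four five", "six seven", ""]
chunks, idxs = split_pages(pages, max_words=2)
print(list(zip(idxs, chunks)))
# [(0, 'one two'), (0, 'three four'), (0, 'five'), (1, 'six seven')]
```

An empty page contributes no chunks and no index entries, so the two lists stay aligned.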

### 3. Manually Construct Nodes from Text Chunks

We convert each chunk into a `TextNode` object, a low-level data abstraction in LlamaIndex that stores content but also allows defining metadata + relationships with other Nodes.

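In the original tutorial, metadata from the source document is injected into each node. Here the source page is looked up but nothing is copied from it, so each node starts with empty metadata; step 4 adds metadata through extractors.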

This essentially replicates logic in our `SentenceSplitter`.

In [15]:
from llama_index.core.schema import TextNode

In [16]:
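nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(
        text=text_chunk,
    )
    # look up the source page for this chunk; it is not used further here,
    # but this is where page-level metadata could be attached to the node
    src_doc_idx = doc_idxs[idx]
    src_page = doc[src_doc_idx]
    nodes.append(node)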

In [17]:
print(nodes[0].metadata)

{}
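
The metadata is empty because the node-construction loop never assigns any. As a rough sketch of how page-level metadata could be attached, the example below uses the tracked page index. `SimpleNode` is a hypothetical stand-in for `TextNode` so the example runs without LlamaIndex, and the `file_path` and `page_number` keys are illustrative choices, not names the library requires.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleNode:
    """Hypothetical stand-in for TextNode: text plus a metadata dict."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def build_nodes(text_chunks: List[str], doc_idxs: List[int], file_path: str) -> List[SimpleNode]:
    """Create one node per chunk and attach the source file and 1-based page number."""
    nodes = []
    for text_chunk, src_doc_idx in zip(text_chunks, doc_idxs):
        node = SimpleNode(text=text_chunk)
        node.metadata = {"file_path": file_path, "page_number": src_doc_idx + 1}
        nodes.append(node)
    return nodes


demo_nodes = build_nodes(["intro", "more intro", "methods"], [0, 0, 1], "./data/llama2.pdf")
print(demo_nodes[2].metadata)
# {'file_path': './data/llama2.pdf', 'page_number': 2}
```

With the real `TextNode`, the same dict could be passed through its `metadata` argument or assigned to `node.metadata`.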


In [18]:
# print a sample node
print(nodes[0].get_content(metadata_mode="all"))

Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗
Louis Martin†
Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang
Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang
Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic
Sergey Edunov

### 4. Extract Metadata from each Node

We extract metadata from each Node using our Metadata extractors.

This will add more metadata to each Node.

In [29]:
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.core.ingestion import IngestionPipeline

llm = Settings.llm

extractors = [
    TitleExtractor(nodes=5, llm=llm),
    QuestionsAnsweredExtractor(questions=3, llm=llm),
]

In [30]:
pipeline = IngestionPipeline(
    transformations=extractors,
)
nodes = await pipeline.arun(nodes=nodes, in_place=False)




  0%|          | 0/5 [00:00<?, ?it/s]
 20%|██        | 1/5 [00:04<00:16,  4.12s/it]
 40%|████      | 2/5 [00:07<00:10,  3.47s/it]
 60%|██████    | 3/5 [00:07<00:04,  2.13s/it]
 80%|████████  | 4/5 [00:15<00:04,  4.44s/it]
100%|██████████| 5/5 [00:24<00:00,  4.99s/it]

  0%|          | 0/107 [00:00<?, ?it/s]
  1%|          | 1/107 [00:26<46:17, 26.21s/it]
  2%|▏         | 2/107 [00:40<33:26, 19.11s/it]
  3%|▎         | 3/107 [00:50<25:55, 14.96s/it]
  4%|▎         | 4/107 [00:53<17:56, 10.45s/it]
  5%|▍         | 5/107 [00:56<12:45,  7.51s/it]
  6%|▌         | 6/107 [00:59<10:00,  5.95s/it]
  7%|▋         | 7/107 [01:01<08:11,  4.91s/it]
  7%|▋         | 8/107 [01:09<09:21,  5.67s/it]
  8%|▊         | 9/107 [01:10<07:17,  4.46s/it]
  9%|▉         | 10/107 [01:11<05:09,  3.19s/it]
 10%|█         | 11/107 [01:1

In [31]:
print(nodes[0].metadata)

{'document_title': '"Llama 2: Comprehensive Insights into Pretraining, Fine-Tuning, Safety, Ethical Considerations, and Open-Source Contributions for Advanced Dialogue Optimization"', 'questions_this_excerpt_can_answer': '1. **What is the range of parameter sizes for the Llama 2 collection of large language models?**\n   - The excerpt specifies that Llama 2 includes large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.\n\n2. **How do the fine-tuned Llama 2-Chat models perform compared to other open-source chat models on benchmarks?**\n   - The document states that the fine-tuned Llama 2-Chat models outperform open-source chat models on most benchmarks tested.\n\n3. **What are the primary objectives of the Llama 2-Chat models in terms of their application and safety?**\n   - The abstract mentions that Llama 2-Chat models are optimized for dialogue use cases and are evaluated for helpfulness and safety, potentially serving as a suitable substitute for clo

### 5. Generate Embeddings for each Node

Generate an embedding for each Node using the embedding model configured in `Settings` (here the Hugging Face model `BAAI/bge-small-zh-v1.5`).

Store the result in the `embedding` property of each Node.

In [21]:
embed_model = Settings.embed_model

In [32]:
for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding
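
The loop above makes one embedding call per node. With many nodes, sending texts in batches is a common alternative. The sketch below shows only the batching logic: `embed_in_batches` and `toy_embed_batch` are hypothetical names, and the toy function returns made-up two-dimensional vectors rather than real embeddings.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def toy_embed_batch(texts: Sequence[str]) -> List[Vector]:
    """Stand-in for a real model: maps each text to [length, number of spaces]."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch_fn: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 32,
) -> List[Vector]:
    """Embed texts batch by batch, preserving the input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_batch_fn(batch))
    return embeddings


texts = ["a b", "c", "d e f", "g", "h i"]
vectors = embed_in_batches(texts, toy_embed_batch, batch_size=2)
print(len(vectors), vectors[0])
# 5 [3.0, 1.0]
```

In the notebook, the batch function would wrap the configured embedding model, and each resulting vector would be assigned to the matching node's `embedding` property.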

### 6. Load Nodes into a Vector Store

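The original tutorial inserts these nodes into a `PineconeVectorStore`. This notebook does not set up Pinecone, so there is no separate insertion step: the nodes are loaded into the default in-memory vector store when the `VectorStoreIndex` is built in the next section.

**NOTE**: `VectorStoreIndex` is a higher-level abstraction that can also handle ingestion. We use it in the next section to fast-track retrieval/querying.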
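
To make the role of a vector store concrete, here is a minimal in-memory sketch: it stores `(id, embedding, text)` rows and returns the rows most similar to a query embedding by cosine similarity. `TinyVectorStore` is a hypothetical illustration, not a LlamaIndex or Pinecone class, and real stores add persistence, indexing, and metadata filtering on top of this idea.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Keeps all rows in a list and scans them linearly at query time."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, Vector, str]] = []

    def add(self, node_id: str, embedding: Vector, text: str) -> None:
        self._rows.append((node_id, embedding, text))

    def query(self, query_embedding: Vector, top_k: int = 2) -> List[Tuple[str, float]]:
        """Return (node_id, score) pairs for the top_k most similar rows."""
        scored = [
            (node_id, cosine_similarity(query_embedding, emb))
            for node_id, emb, _ in self._rows
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


store = TinyVectorStore()
store.add("a", [1.0, 0.0], "first chunk")
store.add("b", [0.0, 1.0], "second chunk")
store.add("c", [1.0, 1.0], "third chunk")
top = store.query([1.0, 0.1], top_k=2)
print([node_id for node_id, _ in top])
# ['a', 'c']
```

The linear scan is fine for a few hundred nodes; dedicated vector databases exist largely to avoid it at larger scale.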

## Retrieve and Query from the Vector Store

Now that our ingestion is complete, we can retrieve/query this vector store.

**NOTE**: We can use our high-level `VectorStoreIndex` abstraction here. See the next section to see how to define retrieval at a lower-level!

In [33]:
from llama_index.core import VectorStoreIndex
from llama_index.core import StorageContext

In [37]:
index = VectorStoreIndex(nodes=nodes)

In [38]:
query_engine = index.as_query_engine()

In [39]:
query_str = "Can you tell me about the key concepts for safety finetuning"

In [40]:
response = query_engine.query(query_str)

In [41]:
print(str(response))

The key concepts for safety fine-tuning in Llama 2 include:

1. **Supervised Safety Fine-Tuning**: This involves gathering adversarial prompts and safe demonstrations to teach the model to align with safety guidelines before integrating human feedback.

2. **Safety RLHF (Reinforcement Learning from Human Feedback)**: This technique integrates safety into the RLHF pipeline, including training a safety-specific reward model and gathering more challenging adversarial prompts for fine-tuning.

3. **Safety Context Distillation**: This method refines the RLHF pipeline by generating safer model responses using safety preprompts, such as "You are a safe and responsible assistant," and then fine-tuning the model on these responses.

4. **Safety Categories and Annotation Guidelines**: These guidelines help in creating adversarial prompts based on risk categories (illicit and criminal activities, hateful and harmful activities, unqualified advice) and attack vectors to cover different varieties o