In this notebook, we use the LangChain library to extract information from a PDF file created from the Wikipedia article on steel.

In [171]:
import os

# Load data

In [1]:
from langchain_community.document_loaders import PyPDFLoader

In [2]:
filepath = "../pdf/Steel_Wikipedia.pdf"

In [3]:
loader = PyPDFLoader(filepath)
docs = loader.load()
print("Number of pages:", len(docs))

Number of pages: 17


`PyPDFLoader` returns a list of `Document` objects, each containing the text of one page as a string and its metadata as a dict (PDF file and page number). Let us concatenate all pages into a single string.

In [4]:
text = ""
for doc in docs:
    text += doc.page_content
print("Number of characters:", len(text))

Number of characters: 52481


In [209]:
text[:100]

'Steel Steel  is an alloy  of iron  and carbon  with improved strength  and fracture resistance  comp'

"New line" and "new paragraph" are both represented by "\n". It may be better to replace "\n" with a space, so that the whole PDF file is read as one long paragraph.

In [6]:
text = text.replace("\n", " ")
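
The extracted text shown earlier also contains doubled spaces. As an optional extra step that is not applied in this notebook (all outputs below were produced without it), runs of whitespace could be collapsed in one pass. The helper name `normalize_whitespace` and the sample string are illustrative assumptions.

```python
import re


def normalize_whitespace(raw: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", raw).strip()


# Made-up sample that mimics the spacing seen in the extracted PDF text.
sample = "Steel  is an alloy \nof iron  and carbon"
print(normalize_whitespace(sample))
```

Because this changes the character count, the number of chunks produced by the splitter would likely differ from the figures reported in this notebook.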

# Split text into smaller chunks

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [9]:
chunk_size = 1000  # Maximum size of chunks
chunk_overlap = 200  # Overlap in characters between chunks

separators = [".", " ", ""]
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=separators,
    keep_separator=False,
)

In [10]:
splits = text_splitter.split_text(text)

In [11]:
print("Number of text chunks:", len(splits))

Number of text chunks: 66
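
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a minimal fixed-window sketch. It is deliberately simpler than `RecursiveCharacterTextSplitter`, which additionally tries to cut at the given separators; the function name `sliding_window_chunks` and the demo string are assumptions made for this example.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


demo = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(demo, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, which is the role the overlap plays in the splitter above: a sentence cut at a chunk boundary is still fully visible in at least one chunk.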


In [12]:
splits[0:3]

['Steel Steel  is an alloy  of iron  and carbon  with improved strength  and fracture resistance  compared to other forms of iron. Many other elements may be present or added. Stainless steels , which are resistant to corrosion  and oxidation , typically  need an additional 11% chromium . Because  of its high tensile strength  and low cost, steel is used in buildings, infrastructure, tools, ships, trains, cars, bicycles, machines, electrical appliances, furniture, and weapons. Iron is the base metal of steel. Depending on the temperature, it can take two crystalline forms (allotropic forms):  body-centred cubi c and face-centred cubic . The interaction of the allotropes of iron  with the alloying elements, primari ly carbon, gives steel and cast iron  their range of unique properties. In pure iron, the crystal structure  has relatively little resistance to the iron atoms slipping past one another, and so pure iron is quite ductile , or soft and easil y formed',
 "In pure iron, the crys

# Embedding/Vectorization

Vector embeddings are numeric representations of the meaning of texts. They make it possible to measure the similarity in meaning between two texts and thus to perform semantic search. In retrieval-augmented generation (RAG), the meaning of each text chunk is embedded into a vector (sentence embedding). Different text2vec models are available for this, such as `text-embedding-3-small`, `text-embedding-3-large` and `text-embedding-ada-002` provided by `OpenAI`, or `all-MiniLM-L12-v2` and many others provided by the `sentence-transformers` framework. In this work, we use the `all-MiniLM-L12-v2` model. Inference is performed using the Hugging Face Hosted Inference API (free of charge but rate limited).
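
To make "similarity in meaning" concrete, the sketch below computes the cosine similarity between toy 3-dimensional vectors. The values are invented for illustration and are not real embeddings; real vectors from `all-MiniLM-L12-v2` have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values): "steel" and "iron" point in similar directions,
# "banana" points elsewhere.
steel = [0.9, 0.1, 0.0]
iron = [0.8, 0.2, 0.1]
banana = [0.0, 0.1, 0.9]

print("steel vs iron:  ", cosine_similarity(steel, iron))
print("steel vs banana:", cosine_similarity(steel, banana))
```

A value close to 1 indicates vectors pointing in nearly the same direction, which an embedding model is trained to produce for texts with similar meaning.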

In [13]:
from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings

In [114]:
model_id = "sentence-transformers/all-MiniLM-L12-v2"
hf_token = os.environ["HF_TOKEN"]  # assumed variable name; read the token from the environment instead of hard-coding it
embeddings = HuggingFaceInferenceAPIEmbeddings(
    api_key=hf_token, model_name=model_id)

We can embed text chunks directly using the `embeddings` object:

In [115]:
%%time
vectors = embeddings.embed_documents(splits[:2])

CPU times: total: 15.6 ms
Wall time: 600 ms


In [116]:
print("Vector size:", len(vectors[0]))

Vector size: 384


In [117]:
print(vectors[0])

[-0.07919583469629288, -0.0014829323627054691, -0.027909070253372192, 0.08472666889429092, 0.08796290308237076, 0.04938700050115585, -0.015877781435847282, 0.03698257729411125, -0.053582895547151566, -0.04986926168203354, -0.0657147541642189, -0.12852558493614197, 0.06819683313369751, -0.025681251659989357, -0.014892508275806904, 0.034175895154476166, 0.05304539576172829, -0.02160452865064144, 0.0222077164798975, -0.0599568635225296, -0.027940066531300545, -0.08698154240846634, -0.019495679065585136, 0.06383048743009567, 0.029610035941004753, 0.037859853357076645, -0.0321282260119915, 0.05028742924332619, 0.033865783363580704, 0.03830917552113533, 0.03487708792090416, 0.004239337984472513, 0.01860380731523037, 0.03008398413658142, -0.022422118112444878, -0.08121604472398758, 0.02826480008661747, -0.030367454513907433, -0.052014824002981186, 0.08601373434066772, 0.005202345084398985, 0.04636238515377045, 0.0904047042131424, 0.07268722355365753, 0.05832554027438164, 0.023639652878046036,

Alternatively, we can pass the embedding engine to a `vector store`, which is a Python class for storing embedded texts and performing semantic search. Many options are available: FAISS, Chroma, LanceDB, Weaviate... In this work, we use Chroma. Note that the `chromadb` Python package must be installed.

In [118]:
from langchain_community.vectorstores import Chroma

In [145]:
vdb = Chroma.from_texts(
    splits,
    embedding=embeddings,
    persist_directory="./chroma_db",
    # Distance metric; valid options are "l2", "ip" or "cosine"
    collection_metadata={"hnsw:space": "l2"},
)

No embedding_function provided, using default embedding function: DefaultEmbeddingFunction https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2


Below are the first stored text chunk and the corresponding embedding vector. Note that the order of text chunks is not preserved.

In [146]:
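idx = 0  # named idx to avoid shadowing the built-in id()
stored = vdb.get(include=["documents", "embeddings"])  # ids are always returned
print("Index:\n", stored["ids"][idx])
print("Text:\n", stored["documents"][idx])
print("Embedding vector:\n", stored["embeddings"][idx])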

Index:
 aab68a62-c73b-11ee-a82e-48a472072d20
Text:
 Steel Steel  is an alloy  of iron  and carbon  with improved strength  and fracture resistance  compared to other forms of iron. Many other elements may be present or added. Stainless steels , which are resistant to corrosion  and oxidation , typically  need an additional 11% chromium . Because  of its high tensile strength  and low cost, steel is used in buildings, infrastructure, tools, ships, trains, cars, bicycles, machines, electrical appliances, furniture, and weapons. Iron is the base metal of steel. Depending on the temperature, it can take two crystalline forms (allotropic forms):  body-centred cubi c and face-centred cubic . The interaction of the allotropes of iron  with the alloying elements, primari ly carbon, gives steel and cast iron  their range of unique properties. In pure iron, the crystal structure  has relatively little resistance to the iron atoms slipping past one another, and so pure iron is quite ductile , or 

# Retrieval

The idea of `retrieval`, or semantic search, is that the user provides a question or query, which is then embedded into a vector. Next, the `k` text chunks whose embedding vectors are closest to the query vector are identified. Common distance metrics are `Squared Euclidean` (squared L2 norm), `Manhattan` (L1 norm), `Cosine` (measures the angle) and `Dot product`. In this work, retrieval is performed by a method of the vector store. The distance metric is specified when the vector store is constructed.
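
The nearest-neighbour step can be sketched in plain Python as a brute-force search. This is only an illustration of the idea: the vector store used here relies on an approximate HNSW index rather than an exhaustive scan, and the function names and 2-dimensional toy vectors below are assumptions.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a, b):
    """L1 distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity, so that smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def top_k(query_vec, chunk_vecs, k, distance=squared_l2):
    """Return the indices of the k chunk vectors closest to query_vec."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: distance(query_vec, chunk_vecs[i]))
    return ranked[:k]


# Toy 2-d "embeddings" of four chunks and one query (assumed values).
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2], [-1.0, 0.0]]
query_vec = [0.9, 0.2]

for metric in (squared_l2, manhattan, cosine_distance):
    print(metric.__name__, top_k(query_vec, chunk_vecs, k=2, distance=metric))
```

With these toy vectors all three metrics rank chunk 2 first and chunk 0 second; on real embeddings the metrics can disagree, which is why the choice is fixed when the store is built.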

In [175]:
retriever = vdb.as_retriever(search_type="similarity", search_kwargs={"k": 10})

In [179]:
query = "How steel is produced?"
contexts = retriever.invoke(query)
contexts

[Document(page_content='To become steel, it must be reprocessed to reduce the carbon to the correct amount, at which point other elements can be added. In the past, steel facilities would cast the raw steel prod uct into ingots  whic h would beHeat treatment ProductionIron ore pellets used in the production of steel Bloomery smelting during the Middle Ages in the 5th to 15th centuriesstored until use in further refinement processes that resulted in the finished product. In modern facilities, the initial product is close to the final composition and is continuously cast into long slabs, cut and shaped into bars and extru sions and heat treated to produce a final product. Today, approximately 96% of steel is continuously cast, while only 4% is produced as ingots.[16] The ingots are then heated in a soaking pit and hot rolled into slabs, billets , or blooms . Slab s are hot or cold rolled  into sheet metal  or plates . Billets are hot or cold rolled into bars, rods, and wire'),
 Document(

# Answer generation

In [203]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_community.chat_models import ChatOpenAI

First, we need to instantiate an LLM-backed chat model. Here we use the `gpt-3.5-turbo` model from OpenAI.

In [204]:
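# The OpenAI API key is read from the OPENAI_API_KEY environment variable; set it outside the notebook instead of hard-coding it.
assert "OPENAI_API_KEY" in os.environ, "Set OPENAI_API_KEY before running this cell"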
llm = ChatOpenAI(model_name="gpt-3.5-turbo")

Next, we need to create the prompt/message that is given to the chatbot. For this, we create a template with the `context` and `question` variables.

In [205]:
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use five sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.

{context}

Question: {question}

Helpful Answer:"""

rag_prompt = PromptTemplate.from_template(template)
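
To see what the model eventually receives, the placeholders can be filled by hand. The sketch below uses Python's built-in `str.format` as a stand-in for `PromptTemplate`, with a shortened template and made-up context and question; all `demo_*` names are assumptions for this example.

```python
# Shortened stand-in for the template above, with the same two placeholders.
demo_template = """Use the following pieces of context to answer the question at the end.

{context}

Question: {question}

Helpful Answer:"""

demo_context = "Steel is an alloy of iron and carbon."
demo_question = "What is steel?"

# str.format substitutes {context} and {question}, as the prompt template does
# when the chain is invoked.
filled = demo_template.format(context=demo_context, question=demo_question)
print(filled)
```

In the actual chain, `context` is filled with the retrieved text chunks and `question` with the user query.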

In [206]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [207]:
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

The cell above combines the retriever, the prompt, the chat model and the output parser into a single chain.
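
Conceptually, the chain passes the question through four steps: retrieve, format, fill the prompt, call the model. The plain-Python sketch below mirrors that data flow with stand-in functions; `fake_retrieve`, `fake_llm` and the other names are hypothetical stubs, not LangChain APIs, and no model is called.

```python
def fake_retrieve(question):
    """Stub retriever: returns fixed chunks regardless of the question."""
    return ["Steel is an alloy of iron and carbon.",
            "Most steel is continuously cast."]


def join_chunks(chunks):
    """Same role as format_docs: join the chunks with blank lines."""
    return "\n\n".join(chunks)


def fill_prompt(context, question):
    """Stub prompt: places the context before the question."""
    return f"{context}\n\nQuestion: {question}\n\nHelpful Answer:"


def fake_llm(prompt):
    """Stub model: reports the prompt size instead of generating an answer."""
    return f"(stub answer to a prompt of {len(prompt)} characters)"


def run_chain(question):
    """Apply the steps in the same order as the piped chain."""
    context = join_chunks(fake_retrieve(question))
    prompt = fill_prompt(context, question)
    return fake_llm(prompt)


print(run_chain("How is steel produced?"))
```

The question is used twice, once to retrieve the context and once verbatim in the prompt, which is why the real chain routes it through both the retriever and `RunnablePassthrough()`.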

In [208]:
query = "How steel is produced?"
answer = rag_chain.invoke(query)
print(answer)

Steel is produced through a process called steelmaking, which involves reducing the carbon content of iron and adding other elements. In modern facilities, the initial product is continuously cast into long slabs and then cut and shaped into various forms. The increase in steel's strength compared to iron is achieved by reducing iron's ductility. The production of steel began on a large scale in the 17th century with the introduction of more efficient methods like the blast furnace and the Bessemer process. Further refinements in the process, such as basic oxygen steelmaking, have lowered the cost and increased the quality of steel production. Thanks for asking!
