# Semantic retrieval with `LangChain`

Quote from the [website](https://github.com/langchain-ai/langchain): `LangChain` is a framework for building agents and LLM-powered applications. It helps you chain together interoperable components and third-party integrations to simplify AI application development – all while future-proofing decisions as the underlying technology evolves.

`LangChain` works as a kind of facade: it does not offer functionality of its own.
However, it nicely encapsulates existing functions and allows implementations that are
independent of the specific language model.

In this notebook, we try to use as many `LangChain` functions as possible and rebuild
our previous solution. You will see that we have to change some things and also get
different results.

## Load data

`LangChain` offers functions for loading and preprocessing data. The code is quite
intuitive to read. However, discovering the `DirectoryLoader` and knowing that it
needs a `loader_cls` parameter is not so easy, as it is not standard Python.

Note: I had to change this code quite frequently as the API of `LangChain` changed.
The last change was needed because `document_loaders` migrated from `langchain`
to `langchain_community`. Please check your notebooks regularly!
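
One way to soften such migrations is to try several import locations in order. The following is only a sketch: the helper `import_first_available` and the candidate list are hypothetical names of mine, not part of `LangChain`, and the older path is listed purely as a fallback.

```python
import importlib


def import_first_available(candidates):
    """Return the first module in `candidates` that can be imported.

    Raises ImportError listing every path tried if none is importable.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # this location does not exist in the installed version
    raise ImportError(f"None of these modules could be imported: {candidates}")


# Newest location first, the older location as a fallback (assumed order).
LOADER_MODULE_CANDIDATES = [
    "langchain_community.document_loaders",
    "langchain.document_loaders",
]
```

With `LangChain` installed, `import_first_available(LOADER_MODULE_CANDIDATES).DirectoryLoader` should then give the loader class from whichever location exists. This hides the problem rather than solving it, so checking the notebooks regularly is still necessary.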

In [None]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders.text import TextLoader

# Load every .txt file below the session directory, one document per file
loader = DirectoryLoader("un/TXT/Session 78 - 2023/", glob="**/*.txt", loader_cls=TextLoader)
data = loader.load()

In [None]:
data[0:5]

The same applies here: the text splitter moved from `langchain` to `langchain_text_splitters`.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunks of at most 500 characters; neighbouring chunks share up to 50 characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
all_splits = text_splitter.split_documents(data)

In [None]:
all_splits[0:10]
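
To get a feeling for what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch with a fixed sliding window. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter tries to break at natural boundaries such as paragraph and line breaks, so its chunks will look different. The function name `split_with_overlap` is my own.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut `text` into windows of `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbouring chunks share `chunk_overlap` characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Tiny example so the overlap is visible: windows of 4 characters, overlap of 1
example_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
```

In the tiny example, every chunk starts with the last character of the previous one. The overlap exists so that a sentence cut at a chunk border still appears in full in at least one of the neighbouring chunks more often.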

Similarly, `vectorstores` moved to `langchain_community`. The `HuggingFaceEmbeddings` class has also lived in three different packages so far.

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores.sklearn import SKLearnVectorStore

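# Embed all chunks, store the vectors, and return the 100 nearest chunks per query
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = SKLearnVectorStore.from_documents(all_splits, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 100})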
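
Conceptually, the retriever embeds the question and returns the `k` chunks whose vectors are closest to it. The sketch below illustrates that idea in plain Python with toy two-dimensional vectors standing in for real embeddings. The use of cosine similarity and the helper names are assumptions for illustration. The metric and index that `SKLearnVectorStore` actually uses are not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, doc_vectors, k):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vector, v), i) for i, v in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings": three documents and one query in two dimensions
toy_docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
toy_query = [1.0, 0.1]
nearest = top_k(toy_query, toy_docs, k=2)
```

Here `nearest` contains the indices of the two documents pointing in roughly the same direction as the query. In the notebook, `k` is set to 100 on purpose: we retrieve generously first and narrow the list down later.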

## Retrieval

In [None]:
question = "Is the climate crisis worse for poorer countries?"
docs = retriever.invoke(question)
len(docs)

In [None]:
docs[0:10]

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', 0)
df = pd.DataFrame([{"source": d.metadata["source"], "text": d.page_content} for d in docs])
df

## Integrate cross encoder

This **used to work**, and according to the documentation of [CrossEncoderReranker](https://api.python.langchain.com/en/latest/langchain/retrievers/langchain.retrievers.document_compressors.cross_encoder_rerank.CrossEncoderReranker.html) and [ContextualCompressionRetriever](https://api.python.langchain.com/en/latest/langchain/retrievers/langchain.retrievers.contextual_compression.ContextualCompressionRetriever.html), it should still work. However, after cloning the GitHub project today (2025-11-17), it turns out that `retrievers` is only present in `langchain-classic`, not in the latest version of `langchain`.

In [None]:
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
# These two imports fail with the latest `langchain`, where `retrievers` is no longer present
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.retrievers import ContextualCompressionRetriever


cross_encoder = HuggingFaceCrossEncoder(model_name="mixedbread-ai/mxbai-rerank-large-v1")
compressor = CrossEncoderReranker(model=cross_encoder, top_n=20)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

Taking a look at the source and the tests, we find out that `retrievers` is really only present in `langchain-classic`.

In [None]:
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_classic.retrievers.document_compressors import CrossEncoderReranker
from langchain_classic.retrievers import ContextualCompressionRetriever


cross_encoder = HuggingFaceCrossEncoder(model_name="mixedbread-ai/mxbai-rerank-large-v1")
compressor = CrossEncoderReranker(model=cross_encoder, top_n=20)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

In [None]:
compressed_docs = compression_retriever.invoke(question)

In [None]:
pd.DataFrame([{"source": d.metadata["source"], "text": d.page_content} for d in compressed_docs])
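
The pattern used above is "retrieve broadly, then re-rank": the vector store returns many candidates cheaply, and a slower but more precise model scores each question/candidate pair and keeps only the best `top_n`. The sketch below shows just this control flow. The function `overlap_score` is a hypothetical stand-in for the cross encoder (it simply counts shared words) and the candidate texts are made up, so the scores say nothing about what `mxbai-rerank-large-v1` would return.

```python
def overlap_score(query, text):
    """Hypothetical stand-in for a cross encoder: share of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(query, candidates, score_fn, top_n):
    """Score every candidate against the query and keep the top_n best."""
    ranked = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ranked[:top_n]


# Made-up candidates, as if they came from the first retrieval stage
candidates = [
    "trade agreements and tariffs",
    "climate change hits poorer countries hardest",
    "the climate crisis is a global challenge",
]
best = rerank("climate crisis poorer countries", candidates, overlap_score, top_n=2)
```

The important design choice is that the expensive scorer only sees the candidates from the first stage (100 in this notebook) instead of every chunk in the corpus, and reduces them to the 20 most relevant ones.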