# Headline


In [1]:
from langchain import hub
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama import ChatOllama
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams


## Load documents

In [2]:
loader = PyPDFDirectoryLoader("PDFs/")
documents = loader.load()
print(len(documents))

107


Each document corresponds to one page of a PDF file. Let us look at the content of the first document:

In [3]:
print(f"{documents[0].page_content[:500]}\n")


Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FOR THE FISCAL YEAR ENDED MAY 31, 2023
OR
☐  TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FOR THE TRANSITION PERIOD FROM                         TO                         .
Commission File No. 1-10635
NIKE, Inc.
(Exact name of Registrant as specified in it



And here is the corresponding metadata:

In [4]:
print(documents[0].metadata)


{'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': 'PDFs/nke-10k-2023.pdf', 'total_pages': 107, 'page': 0, 'page_label': '1'}


More advanced methods are available for extracting text and multimodal content.

## Text splitter

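Split each page into smaller, overlapping chunks.

add_start_index=True records the character offset at which each chunk begins within its source page, stored as the start_index metadata field. The page's existing metadata (source, page, and so on) is copied to every chunk either way.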
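To make chunk size, overlap, and start index concrete, here is a minimal sketch of a fixed-size sliding window. It is a simplification, not what `RecursiveCharacterTextSplitter` does internally: the real splitter breaks on separators such as paragraphs, lines, and words, so its chunks vary in length. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Return (start_index, chunk) pairs from a fixed-size sliding window.

    Simplified illustration only: it cuts at fixed character offsets
    instead of preferring paragraph, line, or word boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Small example: each chunk shares one character with its predecessor.
for start, chunk in split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1):
    print(start, chunk)
```

With the notebook's settings (1000 and 200), consecutive windows in this sketch would start 800 characters apart.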

In [5]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
# https://python.langchain.com/docs/how_to/split_by_token/

# from langchain_text_splitters import RecursiveCharacterTextSplitter

# text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
#     model_name="gpt-4",
#     chunk_size=1000,
#     chunk_overlap=200,
#     add_start_index=True
# )

all_splits = text_splitter.split_documents(documents)

len(all_splits)

516

In [8]:
all_splits[12]

Document(metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': 'PDFs/nke-10k-2023.pdf', 'total_pages': 107, 'page': 3, 'page_label': '4', 'start_index': 2264}, page_content='We also sell sports apparel, which features the same trademarks and are sold predominantly through the same marketing and distribution channels as athletic footwear.\nOur sports apparel, similar to our athletic footwear products, is designed primarily for athletic use, although many of the products are worn for casual or leisure purposes,\nand demonstrates our commitment to innovation and high-quality construction. Our Men\'s and Women\'s apparel produc

## Embeddings

In [16]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

In [9]:
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])

Generated vectors of length 768

[0.04747234284877777, 0.021675756201148033, -0.00901807937771082, 0.005356709938496351, 0.025557713583111763, -0.0102302897721529, -0.008413960225880146, 0.03930390253663063, 0.021570511162281036, -0.02409539930522442]
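Vectors like the two generated above can be compared directly. The following is a minimal sketch of cosine similarity, the same measure this notebook's Qdrant collection is configured with (`Distance.COSINE`). The function name is hypothetical and only the standard library is used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")

    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# In the notebook this could be called as:
# cosine_similarity(vector_1, vector_2)
```

Values close to 1 indicate that two chunks point in nearly the same direction in embedding space; values near 0 indicate little relation.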


## How does embed_query work? A deeper dive
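One way to picture what happens inside `embed_query` for a sentence-transformers model: the text is tokenized, the transformer produces one vector per token, those vectors are averaged (mean pooling), and the result is scaled to unit length. To my understanding `all-mpnet-base-v2` is configured this way, but that is an assumption worth checking against the model card. The sketch below uses toy two-dimensional numbers in place of real 768-dimensional token vectors, and both function names are hypothetical.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


# Toy input: three token vectors, the last one being padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

sentence_vector = l2_normalize(mean_pool(token_vectors, attention_mask))
print(sentence_vector)
```

Because the final vector has unit length, cosine similarity between two such embeddings reduces to a plain dot product.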

## Vector store

In [10]:
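# Connect to a Qdrant instance running locally
client = QdrantClient("http://localhost:6333")

collection_exists = client.collection_exists(collection_name="test2")

# Create the collection on first run; size must equal the embedding length (768 here)
if not collection_exists:
    client.create_collection(
        collection_name="test2",
        vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    )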

vector_store = QdrantVectorStore(
    client=client,
    collection_name="test2",
    embedding=embeddings,
)
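The collection's vector size has to match the length of the vectors the embedding model produces (768 in this notebook). As a hedged alternative to hard-coding that number, the size could be probed from the model once. The helper name `embedding_dimension` is hypothetical; it relies only on the `embed_query` method already used above.

```python
def embedding_dimension(embedder, probe_text="dimension probe"):
    """Return the vector length produced by any object exposing embed_query()."""
    return len(embedder.embed_query(probe_text))


# Possible use when creating the collection (not executed here):
# VectorParams(size=embedding_dimension(embeddings), distance=Distance.COSINE)
```

This keeps the collection definition correct if the embedding model is later swapped for one with a different output size, although an existing collection would still need to be recreated.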


In [11]:
ids = vector_store.add_documents(documents=all_splits)

In [12]:
results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)

print(results[0])

page_content='operations. We also lease an office complex in Shanghai, China, our headquarters for our Greater China geography, occupied by employees focused on implementing our
wholesale, NIKE Direct and merchandising strategies in the region, among other functions.
In the United States, NIKE has eight significant distribution centers. Five are located in or near Memphis, Tennessee, two of which are owned and three of which are
leased. Two other distribution centers, one located in Indianapolis, Indiana and one located in Dayton, Tennessee, are leased and operated by third-party logistics
providers. One distribution center for Converse is located in Ontario, California, which is leased. NIKE has a number of distribution facilities outside the United States,
some of which are leased and operated by third-party logistics providers. The most significant distribution facilities outside the United States are located in Laakdal,' metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 
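To see how strong each match is and which page it came from, the store's `similarity_search_with_score` method can be used instead. The sketch below wraps it in a small hypothetical helper, `summarize_hits`, which reads the `source` and `page` metadata keys shown earlier in this notebook.

```python
def summarize_hits(vector_store, query, k=3, preview_chars=80):
    """Return (score, source, page, preview) tuples for the k best matches."""
    hits = vector_store.similarity_search_with_score(query, k=k)
    rows = []
    for doc, score in hits:
        rows.append(
            (
                round(score, 3),
                doc.metadata.get("source"),
                doc.metadata.get("page"),
                # Flatten newlines so each preview fits on one line
                doc.page_content[:preview_chars].replace("\n", " "),
            )
        )
    return rows


# In the notebook this could be called as:
# for row in summarize_hits(
#     vector_store, "How many distribution centers does Nike have in the US?"
# ):
#     print(row)
```

With a cosine-distance collection, higher scores should indicate closer matches, which makes it easier to judge whether the retrieved context is actually relevant before passing it to the LLM.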

In [13]:
llm = ChatOllama(
    model="deepseek-r1:1.5b", temperature=0, base_url="http://localhost:11434"
)

In [15]:
# See full prompt at https://smith.langchain.com/hub/rlm/rag-prompt
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


qa_chain = (
    {
        "context": vector_store.as_retriever() | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)

qa_chain.invoke("Does Nike have an office in China?")



"<think>\nOkay, so I need to figure out if Nike has an office in China. Let me start by looking at the context provided. The context is divided into three main sections: PROPERTIES, LEGAL PROCEEDINGS, and a table of contents.\n\nIn the PROPERTIES section, there's a bullet point that mentions NIKE has an office complex in Shanghai, China, which serves as their Greater China geography headquarters. That sounds like they have a physical presence there. \n\nI don't see any information about offices or locations outside of China mentioned here. The other sections talk about international markets and distribution centers, but nothing about offices specifically in China.\n\nSo, based on the context given, it seems that Nike does have an office in China, specifically in Shanghai. I should make sure to mention that without any additional details.\n</think>\n\nYes, NIKE has an office in China, specifically located in Shanghai."
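The answer above still contains the model's reasoning inside `<think>...</think>` tags. A minimal sketch for removing that block so only the final answer remains; the helper name `strip_reasoning` is hypothetical, and it assumes the tags appear exactly as in the output above.

```python
import re

# Non-greedy match across newlines, so each <think>...</think> block is removed
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_reasoning(text):
    """Remove <think>...</think> blocks and surrounding whitespace."""
    return _THINK_BLOCK.sub("", text).strip()


raw = "<think>\nChecking the PROPERTIES section.\n</think>\n\nYes, NIKE has an office in China."
print(strip_reasoning(raw))

# Assumption: the function could also be appended to the chain,
# e.g. `... | StrOutputParser() | strip_reasoning`, since LCEL
# should wrap a plain callable piped after a runnable.
```

Text without a reasoning block passes through unchanged apart from trimmed whitespace.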