# Get started
This is the Get Started section of
https://python.langchain.com/docs/tutorials/



## Chat models and prompts
This contains the Chat models and prompts bullet point of Get started

More on how to interact with chat models, and which ones are supported, can be found in the [chat models concept guide](https://python.langchain.com/docs/concepts/chat_models/).

(The `Runnable` interface is implemented by a lot of classes, including the different prompt classes. Here is its [concept guide](https://python.langchain.com/docs/concepts/runnables/))

In [None]:
!pip install langchain



In [None]:
!pip install -qU langchain-openai

Enter API key for OpenAI: ··········
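
The `Enter API key` prompt above comes from a setup cell that is not included here, and `model` is used in the following cells without being defined. A minimal sketch of what such a cell might look like, assuming the `ChatOpenAI` class from `langchain-openai` and the `gpt-4o-mini` model that appears in the response metadata of the first call (`build_model` is a hypothetical helper, not part of LangChain):

```python
import getpass
import os


def build_model(model_name: str = "gpt-4o-mini"):
    """Return a chat model, prompting for the OpenAI API key if it is not set.

    Assumption: the notebook used ChatOpenAI; the original setup cell is missing.
    """
    # Imported lazily so that defining this helper does not require the package.
    from langchain_openai import ChatOpenAI

    if not os.environ.get("OPENAI_API_KEY"):
        os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")
    return ChatOpenAI(model=model_name)


# In the notebook, run once before the cells below:
# model = build_model()
```
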


Invoke our model with two messages. This is the basic way to use it.

In [None]:
from langchain_core.messages import HumanMessage, SystemMessage

messages = [
    SystemMessage("Identify the language or dialect of the user input"),
    HumanMessage("Griaß di")
]

model.invoke(messages)

AIMessage(content='The language of the user input is German, specifically a dialect from Austria or Bavaria, as "Griaß di" is a colloquial greeting meaning "Hello" or "Greetings to you."', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 39, 'prompt_tokens': 24, 'total_tokens': 63, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_6fc10e10eb', 'finish_reason': 'stop', 'logprobs': None}, id='run-c4a3986c-a224-4bb8-8781-fc15752c6574-0', usage_metadata={'input_tokens': 24, 'output_tokens': 39, 'total_tokens': 63, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})

Message objects like the ones listed above convey conversational [roles](https://python.langchain.com/docs/concepts/messages/#role) and hold important data, such as [tool](https://python.langchain.com/docs/concepts/tool_calling/) calls and token usage counts.

Instead of receiving the whole output at once, we can stream it as it is generated.


In [None]:
# Streaming
for token in model.stream(messages):
    print(token.content, end="|")

|The| language| is| Austrian| German|,| specifically| a| dialect| from| the| region| of| Austria|.| "|Gr|ia|ß| di|"| is| a| common| informal| greeting| in| this| dialect|,| equivalent| to| "|Hello|"| in| standard| German|.||
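
Streamed chunks can also be collected back into the full reply while they are printed. A minimal sketch with a stand-in generator: `Chunk` and `fake_stream` are hypothetical and only mimic the shape of what `model.stream` yields, so no model is called here.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Chunk:
    """Stand-in for a streamed message chunk; only `content` is modeled."""
    content: str


def fake_stream(text: str) -> Iterator[Chunk]:
    """Hypothetical replacement for model.stream: yields one chunk per word."""
    for i, word in enumerate(text.split(" ")):
        # Keep the leading space on every word except the first,
        # similar to how tokens appear in the streamed output above.
        yield Chunk(word if i == 0 else " " + word)


def collect(stream: Iterator[Chunk]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    for chunk in stream:
        print(chunk.content, end="|")
        parts.append(chunk.content)
    print()
    return "".join(parts)


full_text = collect(fake_stream("Griaß di is a Bavarian greeting"))
```

With a real model, `model.stream(messages)` would take the place of `fake_stream(...)`.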

### Prompt templates
Prompt templates format messages, either by filling variables in the text or by slotting a variable-length list of messages into a given position. See the [prompt templates concept guide](https://python.langchain.com/docs/concepts/prompt_templates/).

#### ChatPromptTemplate
A `ChatPromptTemplate` can contain messages with different roles. We use it to define a template with variables. Invoking the template with a dictionary that maps each variable name (key) to a value produces a prompt, which can then be passed to the model.

In [None]:
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate

system_template = "Name {number} typical dishes of the country"


# Both templates produce the same ChatPromptValue when invoked with the same input
prompt_template = ChatPromptTemplate.from_messages(
    [("system", system_template), ("user", "{text}")]
)

prompt_template_uses_classes = ChatPromptTemplate.from_messages(
    [
        SystemMessagePromptTemplate.from_template(system_template),
        HumanMessagePromptTemplate.from_template("{text}")
    ]
)
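
To make the substitution step concrete, here is a plain-Python sketch of what filling a list of `(role, template)` pairs involves. `fill_templates` is a hypothetical helper written for illustration; it relies on `str.format` and is not how `ChatPromptTemplate` is implemented.

```python
from typing import Dict, List, Tuple

Message = Tuple[str, str]  # (role, text)


def fill_templates(templates: List[Message], values: Dict[str, str]) -> List[Message]:
    """Substitute {variables} in every template; raises KeyError if one is missing."""
    return [(role, text.format(**values)) for role, text in templates]


templates = [
    ("system", "Name {number} typical dishes of the country"),
    ("user", "{text}"),
]

filled = fill_templates(templates, {"number": "two", "text": "India!"})
for role, text in filled:
    print(f"{role}: {text}")
```
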


#### ChatPromptValue

In [None]:
prompt = prompt_template.invoke({"number": "two", "text": "India!"})

prompt

ChatPromptValue(messages=[SystemMessage(content='Name two typical dishes of the country', additional_kwargs={}, response_metadata={}), HumanMessage(content='India!', additional_kwargs={}, response_metadata={})])

In [None]:
prompt = prompt_template_uses_classes.invoke({"number": "two", "text": "India!"})

prompt

ChatPromptValue(messages=[SystemMessage(content='Name two typical dishes of the country', additional_kwargs={}, response_metadata={}), HumanMessage(content='India!', additional_kwargs={}, response_metadata={})])

This `ChatPromptValue` contains two messages of the same types that we worked with before.


In [None]:
prompt.to_messages()

[SystemMessage(content='Name two typical dishes of the country', additional_kwargs={}, response_metadata={}),
 HumanMessage(content='India!', additional_kwargs={}, response_metadata={})]

In [None]:
print(prompt)
response = model.invoke(prompt)
print(response.content)
print(response)

messages=[SystemMessage(content='Name two typical dishes of the country', additional_kwargs={}, response_metadata={}), HumanMessage(content='India!', additional_kwargs={}, response_metadata={})]
Two typical dishes from India are:

1. **Biryani** - A flavorful and aromatic rice dish made with basmati rice, meat (such as chicken, mutton, or beef), and a blend of spices. There are various regional variations of biryani, such as Hyderabadi and Lucknowi (Awadhi) biryani.

2. **Paneer Tikka** - A popular vegetarian dish made from marinated cubes of paneer (Indian cottage cheese) that are skewered and grilled or roasted. It's often served with mint chutney and is a popular appetizer or snack.

These dishes showcase the rich and diverse culinary heritage of India!
content="Two typical dishes from India are:\n\n1. **Biryani** - A flavorful and aromatic rice dish made with basmati rice, meat (such as chicken, mutton, or beef), and a blend of spices. There are various regional variations of birya

## [Build a semantic search engine](https://python.langchain.com/docs/tutorials/retrievers/)
A search engine over a PDF document that retrieves passages similar to an input query.

In [None]:
!pip install langchain-community pypdf


#### Document class
The `Document` class is LangChain's abstraction for a unit of text together with its metadata.

In [None]:
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Chana Masala is a classic Indian dish",
        metadata={"source": "alberts-opinion"},
    ),
    Document(
        page_content="Samosa Chaat is a great Indian street food",
        metadata={"source": "alberts-opinion"},
    ),
]

#### Loading documents

In [None]:
# Let's load a sample PDF from the LangChain repository into Colab:
!wget -O document.pdf https://raw.githubusercontent.com/langchain-ai/langchain/master/docs/docs/example_data/nke-10k-2023.pdf

--2024-12-20 15:02:08--  https://raw.githubusercontent.com/langchain-ai/langchain/master/docs/docs/example_data/nke-10k-2023.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2397936 (2.3M) [application/octet-stream]
Saving to: ‘document.pdf’


2024-12-20 15:02:08 (27.7 MB/s) - ‘document.pdf’ saved [2397936/2397936]




We load the PDF with [PyPDFLoader](https://python.langchain.com/docs/integrations/document_loaders/pypdfloader/), which by default returns one `Document` per page.




In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("document.pdf")
docs = loader.load()
print(f"PyPDFLoader loaded {len(docs)} documents, defaulting to one document per page")
print(f"The first 50 chars of document 10's content are:\n{docs[10].page_content[:50]}")
print(f"Document 10 has the following metadata:\n{docs[10].metadata}")

PyPDFLoader loaded 107 documents, defaulting to one document per page
The first 50 chars of document 10's content are:
Table of Contents
INFORMATION ABOUT OUR EXECUTIVE 
Document 10 has the following metadata:
{'source': 'document.pdf', 'page': 10}


#### Splitting
PyPDFLoader's default of one document per page, with no overlap between documents, is too coarse. We use a **text splitter** to split each document into chunks of about 1000 characters with about 200 characters of overlap. Because each page is split separately, chunks still never overlap across pages: a sentence that starts on one page and ends on the next will never be in a single chunk.

[RecursiveCharacterTextSplitter](https://python.langchain.com/docs/how_to/recursive_text_splitter/) tries a list of separators in order (by default paragraph breaks, then newlines, then spaces) and recursively splits the text until every chunk fits within the specified size.



In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True,  # store each chunk's start offset within its source document in the metadata
)

# docs is zero-indexed: docs[0] is the first page, docs[50] the fifty-first
document_0_split = text_splitter.split_documents([docs[0]])
document_50_split = text_splitter.split_documents([docs[50]])
all_documents_splits = text_splitter.split_documents(docs)

print(f"Documents 0 and 50 have {len(document_0_split)} and {len(document_50_split)} chunks. All {len(docs)} docs together have {len(all_documents_splits)} chunks")

Documents 0 and 50 have 5 and 4 chunks. All 107 docs together have 516 chunks
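
The effect of `chunk_size` and `chunk_overlap` can be sketched with a fixed-width sliding window. This is a deliberate simplification: `split_with_overlap` is a hypothetical helper that ignores separators entirely, whereas `RecursiveCharacterTextSplitter` prefers to cut at natural boundaries. It records each chunk's start offset, similar in spirit to `add_start_index=True`.

```python
from typing import List, Tuple


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[Tuple[int, str]]:
    """Return (start_index, chunk) pairs; consecutive chunks share chunk_overlap chars."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


page = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(page, chunk_size=10, chunk_overlap=4)
for start, chunk in chunks:
    print(start, chunk)
```

Because the helper is applied to one page at a time, no chunk can span two pages, which mirrors the limitation described above.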


#### Embeddings
Embed the content of the document chunks with an embedding model, for example one from OpenAI.

See conceptual guide [Embedding models](https://python.langchain.com/docs/concepts/embedding_models/)

In [None]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

embedding_of_first_chunk = embeddings.embed_query(all_documents_splits[0].page_content)
embedding_of_second_chunk = embeddings.embed_query(all_documents_splits[1].page_content)

print(f"Embedding size is {len(embedding_of_first_chunk)}")

Embedding size is 3072
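
Embeddings are typically compared by how closely their directions align. A small sketch of cosine similarity on toy vectors (the three-dimensional vectors are made up for illustration; the embeddings computed above have 3072 dimensions):

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors: the second is a scaled copy of the first, the last two are perpendicular.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(same_direction, 6), round(orthogonal, 6))
```

In the notebook, the same function could be applied to `embedding_of_first_chunk` and `embedding_of_second_chunk`.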


#### Vector Stores
We want to store the computed embeddings for all chunks. LangChain integrates with many vector stores, for example PGVector (a PostgreSQL extension) or, for testing, an in-memory store. See [the vector stores section of the tutorial](https://python.langchain.com/docs/tutorials/retrievers/#vector-stores).

See the conceptual guide [Vector stores](https://python.langchain.com/docs/concepts/vectorstores/) for more depth, e.g. on advanced search and retrieval techniques.

Adding documents to the vector store:


In [None]:
!pip install -qU langchain-core

In [None]:
from langchain_core.vectorstores import InMemoryVectorStore
vector_store = InMemoryVectorStore(embeddings)

first_fifty_chunks = all_documents_splits[:50]
ids = vector_store.add_documents(documents=first_fifty_chunks)

In [None]:
ids[:5]

['d9131143-0fc9-43b6-93f5-38dad2714d81',
 '9da7a518-2eff-4eb6-8c9d-5465a1bf3488',
 'd328af62-1320-49b9-b676-eb697a906db2',
 'a2870561-ed93-4bcb-b559-865d8fa86312',
 '26b2ee64-b776-4f5e-bec1-3c9118ea9d1e']

Querying the vector store


In [None]:
first_fifty_chunks[30].page_content

'In recent years, uncertain global and regional economic and political conditions have affected international trade and increased protectionist actions around the\nworld. These trends are affecting many global manufacturing and service sectors, and the footwear and apparel industries, as a whole, are not immune. Companies in our\nindustry are facing trade protectionism in many different regions, and, in nearly all cases, we are working together with industry groups to address trade issues and reduce\nthe impact to the industry, while observing applicable competition laws. Notwithstanding our efforts, protectionist measures have resulted in increases in the cost of our\nproducts, and additional measures, if implemented, could adversely affect sales and/or profitability for NIKE, as well as the imported footwear and apparel industry as a\nwhole.'

Performing a similarity search on all chunks in our vector store

In [None]:
results = vector_store.similarity_search("What problems necessitated working together?", k=2)

In [None]:
results[0]

Document(id='fa8bf488-14ef-4029-a82b-6072a280d703', metadata={'source': 'document.pdf', 'page': 6, 'start_index': 4929}, page_content='restrictions, we work with a broad coalition of global businesses and trade associations representing a wide variety of sectors to help ensure that any legislation enacted\nand implemented (i) addresses legitimate and core concerns, (ii) is consistent with international trade rules and (iii) reflects and considers domestic economies and the\nimportant role they may play in the global economic community.\nWhere trade protection measures are implemented, we believe we have the ability to develop, over a period of time, adequate alternative sources of supply for the\nproducts obtained from our present suppliers. If events prevented us from acquiring products from our suppliers in a particular country, our operations could be temporarily\ndisrupted and we could experience an adverse financial impact. However, we believe we could abate any such disruption, a

Async version:

In [None]:
results = await vector_store.asimilarity_search("When was Nike incorporated?")

print(results[0])

page_content='Table of Contents
PART I
ITEM 1. BUSINESS
GENERAL
NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"
"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.
Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is
the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores
and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales' metadata={'source': 'document.pdf', 'page': 3, 'start_index': 0}


Return scores:

In [None]:
results_scored = vector_store.similarity_search_with_score("What problems necessitated working together?", k=2)

In [None]:
for doc, score in results_scored:
    print(f"The score for document {doc.id} is {score:.4f}")

The score for document fa8bf488-14ef-4029-a82b-6072a280d703 is 0.2087
The score for document 28200f85-05c3-4d73-9ad8-a5e1ca5fd9f3 is 0.2039
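
A similarity search with scores boils down to scoring every stored vector against the query vector and keeping the top `k`. The sketch below assumes cosine similarity as the score, which, as far as I can tell, is what `InMemoryVectorStore` uses. `ToyVectorStore` and its two-dimensional vectors are hypothetical and only illustrate the idea.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


class ToyVectorStore:
    """Keeps (text, vector) pairs in a list and searches them by brute force."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, List[float]]] = []

    def add(self, text: str, vector: List[float]) -> None:
        self._items.append((text, vector))

    def search_with_score(self, query: List[float], k: int = 2) -> List[Tuple[str, float]]:
        """Return the k texts most similar to the query, best first."""
        scored = [(text, cosine(query, vec)) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = ToyVectorStore()
store.add("trade restrictions", [0.9, 0.1])  # made-up vectors, not real embeddings
store.add("executive officers", [0.1, 0.9])
store.add("supply chain risk", [0.7, 0.3])

top = store.search_with_score([1.0, 0.0], k=2)
for text, score in top:
    print(f"{score:.4f}  {text}")
```
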


Search embeddings directly

In [None]:
sample_embedding = embeddings.embed_query("Unrelated text to see the score")

results_sample_embedding = vector_store.similarity_search_with_score_by_vector(sample_embedding)
for doc, score in results_sample_embedding:
    print(f"The score for document {doc.id} is {score:.4f}")

The score for document 7fb145a7-a722-4d11-82f5-e26248ce6ead is 0.1876
The score for document d328af62-1320-49b9-b676-eb697a906db2 is 0.1760
The score for document a2870561-ed93-4bcb-b559-865d8fa86312 is 0.1707
The score for document 4ff0ea3b-5427-4c94-9de6-bd2f55a2179b is 0.1676


In [None]:
# Each scored result is a (Document, score) tuple, so [0][0] is the top document
results_sample_embedding[0][0].page_content

"Table of Contents\nNIKE, INC.ANNUAL REPORT ON FORM 10-KTABLE OF CONTENTS\nPAGE\nPART I 1\nITEM 1. Business 1\nGeneral 1\nProducts 1\nSales and Marketing 2\nOur Markets 2\nSignificant Customer 3\nProduct Research, Design and Development 3\nManufacturing 3\nInternational Operations and Trade 4\nCompetition 5\nTrademarks and Patents 5\nHuman Capital Resources 6\nAvailable Information and Websites 7\nInformation about our Executive Officers 8\nITEM 1A.Risk Factors 9\nITEM 1B.Unresolved Staff Comments 24\nITEM 2. Properties 24\nITEM 3. Legal Proceedings 24\nITEM 4. Mine Safety Disclosures 24\nPART II 25\nITEM 5. Market for Registrant's Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities 25\nITEM 6. Reserved 27\nITEM 7. Management's Discussion and Analysis of Financial Condition and Results of Operations 28\nITEM 7A.Quantitative and Qualitative Disclosures about Market Risk 49\nITEM 8. Financial Statements and Supplementary Data 51"

#### Retrievers
`VectorStore` objects do not implement the `Runnable` interface. However, we can construct `Retriever`s from a `VectorStore`, and retrievers are `Runnable`s.

The same functionality can also be achieved without subclassing, like this:

In [None]:
from typing import List
from langchain_core.documents import Document
from langchain_core.runnables import chain

# @chain is a decorator: after decoration, retreiver is no longer a plain function
# but a RunnableLambda object, which implements the Runnable interface.
@chain
def retreiver(query: str) -> List[Document]:
    return vector_store.similarity_search(query, k=1)

retreiver.batch(
    [
        "What year was this document created in",
        "Which shoe brand is relevant here"
    ]
)


[[Document(id='26646b59-0c65-4aa1-abcf-cce59c6658a4', metadata={'source': 'document.pdf', 'page': 1, 'start_index': 0}, page_content="Table of Contents\nAs of July 12, 2023, the number of shares of the Registrant's Common Stock outstanding were:\nClass A 304,897,252 \nClass B 1,225,074,356 \n1,529,971,608 \nDOCUMENTS INCORPORATED BY REFERENCE:\nParts of Registrant's Proxy Statement for the Annual Meeting of Shareholders to be held on September 12, 2023, are incorporated by reference into Part III of this report.")],
 [Document(id='a8a25008-8a41-467c-bcd4-ef7cf34137fc', metadata={'source': 'document.pdf', 'page': 7, 'start_index': 757}, page_content='leisure footwear companies, athletic and leisure apparel companies, sports equipment companies and large companies having diversified lines of athletic and leisure\nfootwear, apparel and equipment, including adidas, Anta, ASICS, Li Ning, lululemon athletica, New Balance, Puma, Under Armour and V.F. Corporation, among others.\nThe intense co

In [None]:
retreiver

RunnableLambda(retreiver)
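
Roughly, the decorator wraps the function in an object that exposes `invoke` and `batch`. A toy version of that idea follows. `MiniRunnable` and `mini_chain` are hypothetical names for this sketch; the real `RunnableLambda` does considerably more (for example configuration and async support).

```python
from typing import Callable, Generic, List, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


class MiniRunnable(Generic[In, Out]):
    """Wraps a plain function so it can be invoked singly or over a batch."""

    def __init__(self, func: Callable[[In], Out]) -> None:
        self.func = func

    def invoke(self, value: In) -> Out:
        return self.func(value)

    def batch(self, values: List[In]) -> List[Out]:
        # Sequential for simplicity; one output per input, in order.
        return [self.invoke(v) for v in values]

    def __repr__(self) -> str:
        return f"MiniRunnable({self.func.__name__})"


def mini_chain(func: Callable[[In], Out]) -> MiniRunnable[In, Out]:
    """Decorator: the decorated name becomes an object, not a function."""
    return MiniRunnable(func)


@mini_chain
def shout(query: str) -> str:
    return query.upper()


batched = shout.batch(["which brand", "what year"])
print(shout, batched)
```
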

Using LangChain's `as_retriever` method of `VectorStore`:



In [None]:
retriever_from_vector_store = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever_from_vector_store.batch(
    [
        "What year was this document created in",
        "Which shoe brand is relevant here"
    ]
)

`as_retriever` is defined on the `VectorStore` base class, so it is also available on `InMemoryVectorStore`. Note the spelling: the misspelled `as_retreiver` raises `AttributeError: 'InMemoryVectorStore' object has no attribute 'as_retreiver'`.

**Next steps:** learn more in [Build a semantic search engine](https://python.langchain.com/docs/tutorials/retrievers/#learn-more).


