In this example, we will download the LangChain docs from [langchain.readthedocs.io/](https://langchain.readthedocs.io/latest/en/).
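
The download step itself is not shown in this example. One possible way to mirror the site is a recursive `wget` call, sketched below. The helper name `build_wget_command` is hypothetical, and the choice of flags is an assumption about how the files were fetched, not the command that was actually run.

```python
import shlex
from typing import List


def build_wget_command(url: str, out_dir: str = "rtdocs") -> List[str]:
    """Build a recursive wget command that keeps only HTML files.

    Hypothetical helper: the flags are standard wget options, but the
    exact command used for this example is an assumption.
    """
    return [
        "wget",
        "-r",           # follow links recursively
        "-A", ".html",  # accept only files ending in .html
        "-P", out_dir,  # save everything under out_dir
        url,
    ]


cmd = build_wget_command("https://langchain.readthedocs.io/latest/en/")
print(shlex.join(cmd))
# To actually run the download: subprocess.run(cmd, check=True)
```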

The download saves all of the HTML files into the `rtdocs` directory. We can now use LangChain itself to process these docs with the `ReadTheDocsLoader`:

In [3]:
from langchain.document_loaders import ReadTheDocsLoader

loader = ReadTheDocsLoader('rtdocs')  # directory that holds the downloaded HTML
docs = loader.load()





In [4]:
docs

[Document(page_content='', metadata={'source': 'rtdocs\\python.langchain.com\\robots.txt.tmp'}),
 Document(page_content='.md\n.pdf\nDeployments\n Contents \nStreamlit\nGradio (on Hugging Face)\nBeam\nVercel\nDigitalocean App Platform\nSteamShip\nLangchain-serve\nBentoML\nDeployments#\nSo you’ve made a really cool chain - now what? How do you deploy it and make it easily sharable with the world?\nThis section covers several options for that.\nNote that these are meant as quick deployment options for prototypes and demos, and not for production systems.\nIf you are looking for help with deployment of a production system, please contact us directly.\nWhat follows is a list of template GitHub repositories aimed that are intended to be\nvery easy to fork and modify to use your chain.\nThis is far from an exhaustive list of options, and we are EXTREMELY open to contributions here.\nStreamlit#\nThis repo serves as a template for how to deploy a LangChain with Streamlit.\nIt implements a chatb

In [3]:
print(docs[0].page_content)




In [4]:
# normalise Windows separators, then swap the local root for the URL scheme
docs[5].metadata['source'].replace('\\', '/').replace('rtdocs/', 'https://')

'https://python.langchain.com/en/latest/glossary.html'
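
Local paths like the one above can be mapped back to URLs with a small helper that normalises path separators first, so the same code works on Windows and POSIX systems. This is a minimal sketch; `source_to_url` is a hypothetical name and not part of LangChain.

```python
def source_to_url(source: str, root: str = "rtdocs") -> str:
    """Turn a local file path under `root` into an https:// URL."""
    # normalise Windows separators so the prefix check works on any OS
    path = source.replace("\\", "/")
    prefix = root.rstrip("/") + "/"
    if path.startswith(prefix):
        path = path[len(prefix):]
    return "https://" + path


print(source_to_url("rtdocs\\python.langchain.com\\en\\latest\\glossary.html"))
# https://python.langchain.com/en/latest/glossary.html
```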

In [2]:
import tiktoken

tokenizer = tiktoken.get_encoding('p50k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)
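
To give a feel for what the `separators` list does, here is a simplified sketch of recursive splitting: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. This is not LangChain's implementation. The function `recursive_split` is hypothetical, it ignores `chunk_overlap`, and it measures length in characters unless another `length_fn` (such as `tiktoken_len`) is passed in.

```python
def recursive_split(text, max_len, separators, length_fn=len):
    """Split text into pieces of at most max_len, coarse separators first."""
    if length_fn(text) <= max_len or not separators:
        return [text] if text.strip() else []
    sep, finer = separators[0], separators[1:]
    # an empty separator means "split into single characters"
    parts = list(text) if sep == "" else text.split(sep)
    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if length_fn(candidate) <= max_len:
            current = candidate  # keep merging small parts together
            continue
        if current:
            chunks.append(current)
        if length_fn(part) <= max_len:
            current = part
        else:
            # this part is too long on its own: retry with finer separators
            chunks.extend(recursive_split(part, max_len, finer, length_fn))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd eee", 7, ["\n\n", "\n", " ", ""]))
# ['aaa bbb', 'ccc ddd', 'eee']
```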

In [7]:
from uuid import uuid4
from tqdm.auto import tqdm

chunks = []

for record in tqdm(docs):
    texts = text_splitter.split_text(record.page_content)
    # paths may use Windows separators, so normalise before building the URL
    url = record.metadata['source'].replace('\\', '/').replace('rtdocs/', 'https://')
    chunks.extend([{
        'id': str(uuid4()),
        'text': text,
        'chunk': i,
        'url': url
    } for i, text in enumerate(texts)])

100%|██████████| 326/326 [00:02<00:00, 158.44it/s]


In [7]:
import openai
import os

# initialize openai API key
openai.api_key = os.getenv("OPENAI_API_KEY_FREE")  # key from platform.openai.com
embed_model = "text-embedding-ada-002"



In [None]:


res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=embed_model
)

In [9]:
res.keys()

dict_keys(['object', 'data', 'model', 'usage'])

In [10]:
len(res['data'])

2

In [11]:
len(res['data'][0]['embedding']), len(res['data'][1]['embedding'])

(1536, 1536)

In [8]:
import pinecone

index_name = 'langchain-docs'

# initialize connection to pinecone
pinecone.init(
    api_key=os.getenv("pinecone_api_key"),  # app.pinecone.io (console)
    environment="asia-northeast1-gcp"  # next to API key in console
)

# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=len(res['data'][0]['embedding']),
        metric='dotproduct'
    )
# connect to index
index = pinecone.Index(index_name)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 5040}},
 'total_vector_count': 5040}

In [14]:
from tqdm.auto import tqdm
from time import sleep

batch_size = 100  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(chunks), batch_size)):
    # find end of batch
    i_end = min(len(chunks), i+batch_size)
    meta_batch = chunks[i:i_end]
    # get ids
    ids_batch = [x['id'] for x in meta_batch]
    # get texts to encode
    texts = [x['text'] for x in meta_batch]
    # create embeddings, retrying the same batch whenever we hit a rate limit
    while True:
        try:
            res = openai.Embedding.create(input=texts, engine=embed_model)
            break
        except openai.error.RateLimitError:
            # wait a few seconds before trying again
            sleep(5)
    embeds = [record['embedding'] for record in res['data']]
    # cleanup metadata
    meta_batch = [{
        'text': x['text'],
        'chunk': x['chunk'],
        'url': x['url']
    } for x in meta_batch]
    to_upsert = list(zip(ids_batch, embeds, meta_batch))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)

100%|██████████| 18/18 [01:02<00:00,  3.45s/it]
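
The batching and retry logic in the loop above can also be factored into two small helpers. The sketch below uses only the standard library; `batched` and `call_with_retry` are hypothetical names, and the retry count and delay are arbitrary assumptions rather than recommended values.

```python
import time


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def call_with_retry(fn, retry_on=(Exception,), max_attempts=5,
                    delay=5.0, sleep=time.sleep):
    """Call fn(), retrying on the given exception types with a fixed delay."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up instead of looping forever
            sleep(delay)


# demo: a function that fails twice before succeeding
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print([len(b) for b in batched(list(range(250)), 100)])  # [100, 100, 50]
print(call_with_retry(flaky, retry_on=(RuntimeError,), sleep=lambda s: None))  # ok
```

In the upsert loop, `fn` would wrap the embedding request and `retry_on` would name the rate-limit exception of whichever client version is installed.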


In [8]:
query = "how do I use the tiktoken"

res = openai.Embedding.create(
    input=[query],
    engine=embed_model
)

# retrieve from Pinecone
xq = res['data'][0]['embedding']

# get relevant contexts (including the questions)
res = index.query(xq, top_k=16, include_metadata=True)

In [16]:
res

{'matches': [{'id': 'ac55e4b2-b80e-48bb-a07a-dd4648dc18b0',
              'metadata': {'chunk': 3.0,
                           'text': 'for all LLMs, and common utilities for '
                                   'working with LLMs.\\n\\nð\\x9f”\\x97 '
                                   'Chains:\\n\\nChains go beyond just a '
                                   'single LLM call, and are sequences of '
                                   'calls (whether to an LLM or a different '
                                   'utility). LangChain provides a standard '
                                   'interface for chains, lots of integrations '
                                   'with other tools, and end-to-end chains '
                                   'for common applications.\\n\\nð\\x9f“\\x9a '
                                   'Data Augmented Generation:\\n\\nData '
                                   'Augmented Generation involves specific '
                                   'types of cha

In [9]:
# get list of retrieved text
contexts = [item['metadata']['text'] for item in res['matches']]

augmented_query = "\n\n---\n\n".join(contexts)+"\n\n-----\n\n"+query
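
With `top_k=16` the joined contexts can grow large, so it may be useful to cap the prompt size. The sketch below keeps the highest-ranked contexts that fit in a budget and then appends the query using the same delimiters as above. `build_augmented_query` and its `max_len` default are hypothetical; length is counted in characters unless a token counter is passed as `length_fn`, in which case the total is only approximate.

```python
def build_augmented_query(contexts, query, max_len=3000, length_fn=len,
                          sep="\n\n---\n\n", end="\n\n-----\n\n"):
    """Join as many contexts as fit in the budget, then append the query."""
    kept, used = [], 0
    for ctx in contexts:
        # every context after the first also pays for one separator
        cost = length_fn(ctx) + (length_fn(sep) if kept else 0)
        if used + cost > max_len:
            break  # contexts are ranked, so stop at the first that does not fit
        kept.append(ctx)
        used += cost
    return sep.join(kept) + end + query


print(build_augmented_query(["aa", "bb", "cc"], "q?", max_len=5, sep="|", end="#"))
# aa|bb#q?
```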

In [18]:
print(augmented_query)

for all LLMs, and common utilities for working with LLMs.\n\nð\x9f”\x97 Chains:\n\nChains go beyond just a single LLM call, and are sequences of calls (whether to an LLM or a different utility). LangChain provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.\n\nð\x9f“\x9a Data Augmented Generation:\n\nData Augmented Generation involves specific types of chains that first interact with an external datasource to fetch data to use in the generation step. Examples of this include summarization of long pieces of text and question/answering over specific data sources.\n\nð\x9f¤\x96 Agents:\n\nAgents involve an LLM making decisions about which Actions to take, taking that Action, seeing an Observation, and repeating that until done. LangChain provides a standard interface for agents, a selection of agents to choose from, and examples of end to end

---

Chains: Chains go beyond just a single LLM call, and are sequences 

In [10]:
# system message to 'prime' the model
primer = """You are a Q&A bot: a highly intelligent system that answers
user questions based on the information provided by the user above
each question. If the answer cannot be found in the information
provided by the user, you truthfully say "I don't know".
"""

res = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": augmented_query}
    ]
)

In [11]:
from IPython.display import Markdown, display

display(Markdown(res['choices'][0]['message']['content']))

You can use the tiktoken tokenizer package for Python to estimate the number of tokens in a piece of text. To use it, you can install it with pip install tiktoken, and import it with import tiktoken. Once imported, you can use it to count tokens in text by calling tiktoken.encoding_for_model("model_name").encode(text), where "model_name" is the name of the tokenizer model you want to use and "text" is the text you want to count the tokens for.

In [22]:
res = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": query}
    ]
)
display(Markdown(res['choices'][0]['message']['content']))

I'm sorry, but I don't have enough information to answer your question. Could you provide me more detail about LangChain and LLMChain?

In [23]:
res = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a Q&A bot: a highly intelligent system that answers user questions"},
        {"role": "user", "content": query}
    ]
)
display(Markdown(res['choices'][0]['message']['content']))

LLMChain is a blockchain-based platform developed by LangChain that is designed to help language learners improve their language proficiency skills. Here are the basic steps to use the platform:

1. Navigate to the LLMChain page on LangChain's website and create an account.
2. Choose the language you want to learn and find the right course for your level.
3. Start learning by watching video lessons, reading texts, and practicing exercises.
4. Earn LLMChain tokens (LLM) by completing courses and participating in various activities.
5. Use your LLM tokens to buy additional courses or exchange them for other cryptocurrencies on supported exchanges.

Additionally, LLMChain is also designed to provide a social learning experience. You can interact with other learners and language experts through the platform's messaging system, participate in language challenges, and receive feedback and tips from native speakers.

In [9]:
# uncomment to delete the index once you are finished with it
# pinecone.delete_index(index_name)