# Working with LangChain

### Setup

In [1]:
import os
import ast
import json
import requests
import weaviate

import pandas as pd

from dotenv import load_dotenv, find_dotenv
from langchain_weaviate.vectorstores import WeaviateVectorStore

In [2]:
_ = load_dotenv(find_dotenv()) # read local .env file

weaviate_url = os.getenv("WEAVIATE_URL") 
weaviate_key = os.getenv("WEAVIATE_API_KEY")
openai_key = os.getenv("OPENAI_API_KEY")
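
`os.getenv` returns `None` when a variable is unset, so a missing key would only surface later as a connection or authentication error. A minimal sketch of an early check, assuming a hypothetical helper `require_env` that is not part of the original notebook:

```python
import os


def require_env(names, env=None):
    """Return {name: value} for each name, or raise if any are unset or empty.

    `require_env` is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Example with a stand-in mapping instead of the real environment.
example_env = {"WEAVIATE_URL": "http://localhost:8080", "WEAVIATE_API_KEY": "secret"}
settings = require_env(["WEAVIATE_URL", "WEAVIATE_API_KEY"], env=example_env)
print(settings["WEAVIATE_URL"])
```

In the notebook itself, the call would omit `env` so that the values loaded by `load_dotenv` are checked.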

In [3]:
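# Connect to the Weaviate instance (running locally in Docker) using the URL and key from .env
# Note: weaviate.Client is the v3 client API; the v4 client uses weaviate.WeaviateClient instead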
weaviate_client = weaviate.Client(
    url=weaviate_url,  
    auth_client_secret=weaviate.auth.AuthApiKey(api_key=weaviate_key),  
    additional_headers={
        "X-OpenAI-Api-Key": openai_key
    }
)
weaviate_client.is_ready()

True

### Document loading and splitting

Try splitting text on a character separator

In [4]:
from langchain.text_splitter import (
    CharacterTextSplitter,
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
)

In [78]:
# text - split on characters
with open("../data/paul_graham_essay.txt", "r") as f:
    essay = f.read()

text_splitter = CharacterTextSplitter(separator=".\n\n", chunk_size=4000, chunk_overlap=0)
chunks = text_splitter.split_text(essay)

print(f"Number of chunks: {len(chunks)}")
print(f"Typical length of a chunk: {len(chunks[0])} characters")

Number of chunks: 21
Typical length of a chunk: 3740 characters
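
The cell above reports only the length of the first chunk, which may not be representative of the rest. A minimal sketch of a fuller summary, assuming a hypothetical helper `summarize_lengths` and stand-in data in place of the notebook's `chunks` list:

```python
import statistics


def summarize_lengths(chunks):
    """Return basic length statistics (in characters) for a list of text chunks."""
    lengths = [len(chunk) for chunk in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


# Stand-in chunks; in the notebook this would be the `chunks` list from above.
sample_chunks = ["a" * 10, "b" * 30, "c" * 20]
print(summarize_lengths(sample_chunks))
```

Comparing `max` against `chunk_size` is a quick way to spot chunks that exceeded the limit because no separator fell inside the window.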


Try splitting by token. Here `chunk_size` counts tokens, while the printed length is still measured in characters.

In [83]:
token_text_splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=0)
token_chunks = token_text_splitter.split_text(essay)

print(f"Number of chunks: {len(token_chunks)}")
print(f"Typical length of a chunk: {len(token_chunks[0])} characters")

Number of chunks: 17
Typical length of a chunk: 4394 characters
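
Because `chunk_size` is measured in tokens here, a 1000-token chunk can be well over 1000 characters long. The windowing idea can be sketched as follows. This is an illustration under stated assumptions, not LangChain's implementation: `split_by_tokens` is a hypothetical helper, and whitespace-separated words stand in for the output of a real tokenizer.

```python
def split_by_tokens(tokens, chunk_size, chunk_overlap=0):
    """Split a token list into windows of at most `chunk_size` tokens.

    Consecutive windows share `chunk_overlap` tokens.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return windows


# Whitespace "tokens" are an assumption for illustration, not a real tokenizer.
tokens = "the quick brown fox jumps over the lazy dog".split()
for window in split_by_tokens(tokens, chunk_size=4, chunk_overlap=1):
    print(" ".join(window))
```

With `chunk_overlap=0`, as in the cell above, the windows simply tile the token sequence without repetition.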


In [84]:
print(token_chunks[0])



What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in t

Try the Markdown header splitter, which returns `Document` objects rather than plain strings

In [87]:
with open("../data/weaviate_readme.md", "r") as f:
    readme = f.read()

In [89]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]
)
markdown_chunks = markdown_splitter.split_text(readme)
# print(f"Number of chunks: {len(markdown_chunks)}")
# print(f"Typical length of a chunk: {len(markdown_chunks[0])} characters")

In [93]:
markdown_chunks[0].dict()

{'page_content': "<h1>Weaviate <img alt='Weaviate logo' src='https://weaviate.io/img/site/weaviate-logo-light.png' width='148' align='right' /></h1>  \n[![Go Reference](https://pkg.go.dev/badge/github.com/weaviate/weaviate.svg)](https://pkg.go.dev/github.com/weaviate/weaviate)\n[![Build Status](https://github.com/weaviate/weaviate/actions/workflows/.github/workflows/pull_requests.yaml/badge.svg?branch=main)](https://github.com/weaviate/weaviate/actions/workflows/.github/workflows/pull_requests.yaml)\n[![Go Report Card](https://goreportcard.com/badge/github.com/weaviate/weaviate)](https://goreportcard.com/report/github.com/weaviate/weaviate)\n[![Coverage Status](https://codecov.io/gh/weaviate/weaviate/branch/main/graph/badge.svg)](https://codecov.io/gh/weaviate/weaviate)\n[![Slack](https://img.shields.io/badge/slack--channel-blue?logo=slack)](https://weaviate.io/slack)\n[![GitHub Tutorials](https://img.shields.io/badge/Weaviate_Tutorials-green)](https://github.com/weaviate-tutorials/)",

In [94]:
markdown_chunks[1].dict()

{'page_content': 'Weaviate is a cloud-native, **open source vector database** that is robust, fast, and scalable.  \nTo get started quickly, have a look at one of these pages:  \n- [Quickstart tutorial](https://weaviate.io/developers/weaviate/quickstart) To see Weaviate in action\n- [Contributor guide](https://weaviate.io/developers/contributor-guide) To contribute to this project  \nFor more details, read through the summary on this page or see the system [documentation](https://weaviate.io/developers/weaviate/).  \n> [!NOTE]\n> **Help us improve your experience** by sharing your feedback, ideas and thoughts: Fill out our [Community Experience Survey](https://forms.gle/hrFGMqtVkdSG6ne48), preferably by June 14th, 2024.  \n---',
 'metadata': {'Header 2': 'Overview'},
 'type': 'Document'}

In [95]:
markdown_chunks[2].dict()

{'page_content': 'Weaviate uses state-of-the-art machine learning (ML) models to turn your data - text, images, and more - into a searchable vector database.  \nHere are some highlights.',
 'metadata': {'Header 2': 'Why Weaviate?'},
 'type': 'Document'}
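
The outputs above show each chunk carrying the headers it sits under as metadata. A simplified sketch of that grouping, assuming a hypothetical function `split_on_headers` that returns plain dicts; it is not the library's implementation and ignores edge cases such as headers inside code fences:

```python
def split_on_headers(markdown, headers_to_split_on):
    """Group markdown lines into sections tagged with the headers above them."""
    # Check longer markers first so "###" is not mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    sections = []
    active = {}  # header label -> text of the header currently in effect
    lines = []   # body lines collected since the last header

    def flush():
        content = "\n".join(lines).strip()
        if content:
            sections.append({"page_content": content, "metadata": dict(active)})
        lines.clear()

    for line in markdown.splitlines():
        for marker, label in markers:
            if line.startswith(marker + " "):
                flush()
                # A new header replaces headers at the same or a deeper level.
                for other_marker, other_label in markers:
                    if len(other_marker) >= len(marker):
                        active.pop(other_label, None)
                active[label] = line[len(marker):].strip()
                break
        else:
            lines.append(line)  # not a header line, so it is body text
    flush()
    return sections


doc = "# Title\nIntro text\n## Overview\nFirst part\n## Why?\nSecond part"
for section in split_on_headers(doc, [("#", "Header 1"), ("##", "Header 2")]):
    print(section)
```

Keeping the header path in `metadata` means it can later be stored alongside each chunk and used for filtering or for showing where a result came from.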

Try with a PDF. The loader returns one `Document` per page, which is then split further.

In [5]:
from langchain.document_loaders import PyPDFLoader

In [6]:
pdf_loader = PyPDFLoader(file_path="../data/CS229_Lecture_Notes.pdf")
pdf_chunks = pdf_loader.load()

Ignoring wrong pointing object 42 0 (offset 0)
Ignoring wrong pointing object 79 0 (offset 0)
Ignoring wrong pointing object 81 0 (offset 0)
Ignoring wrong pointing object 83 0 (offset 0)
Ignoring wrong pointing object 109 0 (offset 0)
Ignoring wrong pointing object 127 0 (offset 0)
Ignoring wrong pointing object 133 0 (offset 0)
Ignoring wrong pointing object 156 0 (offset 0)
Ignoring wrong pointing object 181 0 (offset 0)
Ignoring wrong pointing object 244 0 (offset 0)
Ignoring wrong pointing object 259 0 (offset 0)
Ignoring wrong pointing object 362 0 (offset 0)
Ignoring wrong pointing object 364 0 (offset 0)
Ignoring wrong pointing object 401 0 (offset 0)
Ignoring wrong pointing object 403 0 (offset 0)
Ignoring wrong pointing object 406 0 (offset 0)
Ignoring wrong pointing object 416 0 (offset 0)
Ignoring wrong pointing object 419 0 (offset 0)
Ignoring wrong pointing object 424 0 (offset 0)
Ignoring wrong pointing object 433 0 (offset 0)
Ignoring wrong pointing object 473 0 (offset

In [7]:
pdf_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)
pdf_splits = pdf_text_splitter.split_documents(pdf_chunks)

In [8]:
len(pdf_splits)

231

In [9]:
pdf_splits[0].dict()

{'page_content': 'CS229 Lecture notes\nAndrew Ng\nSupervised learning\nLet’s start by talking about a few examples of supervised learning pr oblems.\nSuppose we have a dataset giving the living areas and prices of 47 hou ses\nfrom Portland, Oregon:\nLiving area (feet2)\nPrice (1000$s)\n2104\n 400\n1600\n 330\n2400\n 369\n1416\n 232\n3000\n 540\n...\n...\nWe can plot this data:\n500 1000 1500 2000 2500 3000 3500 4000 4500 500001002003004005006007008009001000housing prices\nsquare feetprice (in $1000)Given data like this, how can we learn to predict the prices of other ho uses\nin Portland, as a function of the size of their living areas?\n1',
 'metadata': {'source': '../data/CS229_Lecture_Notes.pdf', 'page': 0},
 'type': 'Document'}
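
Each split keeps the `source` and `page` metadata of the page it came from, so the splits can be traced back to the PDF. A minimal sketch that counts splits per page, assuming a hypothetical helper `splits_per_page` and plain dicts shaped like the `.dict()` output above as stand-ins for the `Document` objects:

```python
from collections import Counter


def splits_per_page(splits):
    """Count how many splits came from each source page."""
    return Counter(split["metadata"]["page"] for split in splits)


# Stand-in data shaped like the `.dict()` output above (contents are made up).
sample_splits = [
    {"page_content": "first half of page 0", "metadata": {"page": 0}},
    {"page_content": "second half of page 0", "metadata": {"page": 0}},
    {"page_content": "all of page 1", "metadata": {"page": 1}},
]
for page, count in sorted(splits_per_page(sample_splits).items()):
    print(f"page {page}: {count} split(s)")
```

With the notebook's `Document` objects, the lookup would be `split.metadata["page"]` instead of the dict access used here.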

### Embeddings and database loading

First, create the embedding model

In [10]:
from langchain.vectorstores import Weaviate
from langchain_openai import OpenAIEmbeddings

In [13]:
embeddings = OpenAIEmbeddings(model = "text-embedding-ada-002", api_key=openai_key)

Then embed the PDF splits and load them into the vector store

In [110]:
weaviate_instance = Weaviate(client=weaviate_client, index_name="test", text_key="test")

In [16]:
weaviate_db = Weaviate.from_documents(pdf_splits, embeddings, client=weaviate_client)
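
Once the splits are stored with their embeddings, retrieval comes down to comparing a query vector against the stored vectors. A brute-force sketch of that idea, using cosine similarity and made-up 3-dimensional vectors; the helpers `cosine_similarity` and `top_k` are hypothetical, and a vector database would typically use an approximate index rather than a full scan like this:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy vectors for illustration; real embeddings have far more dimensions.
stored = [
    ("linear regression", [0.9, 0.1, 0.0]),
    ("logistic regression", [0.7, 0.3, 0.1]),
    ("punch cards", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.0, 0.0], stored, k=2))
```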

### Vector search