## Requirements

In [None]:
!pip install langchain
!pip install pypdf
!pip install langchain_community


## Document Loading

Load the PDF with pypdf into a list of documents, where each document holds the content of one page plus metadata that includes the page number.

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("/paul_graham_essay.pdf")
pages = loader.load()

In [None]:
len(pages)

20

In [None]:
pages[0]

Document(metadata={'producer': 'LibreOffice 7.3', 'creator': 'Writer', 'creationdate': '2024-06-14T16:51:51+00:00', 'source': '/paul_graham_essay.pdf', 'total_pages': 20, 'page': 0, 'page_label': '1'}, page_content='What I Worked On\nFebruary 2021\nBefore college the two main things I worked on, outside of school, were writing \nand programming. I didn\'t write essays. I wrote what beginning writers were \nsupposed to write then, and probably still are: short stories. My stories were \nawful. They had hardly any plot, just characters with strong feelings, which I \nimagined made them deep.\nThe first programs I tried writing were on the IBM 1401 that our school district\nused for what was then called "data processing." This was in 9th grade, so I was\n13 or 14. The school district\'s 1401 happened to be in the basement of our \njunior high school, and my friend Rich Draves and I got permission to use it. It\nwas like a mini Bond villain\'s lair down there, with all these alien-looking 

In [None]:
pages[1]

Document(metadata={'producer': 'LibreOffice 7.3', 'creator': 'Writer', 'creationdate': '2024-06-14T16:51:51+00:00', 'source': '/paul_graham_essay.pdf', 'total_pages': 20, 'page': 1, 'page_label': '2'}, page_content='that I kept taking philosophy courses and they kept being boring. So I decided \nto switch to AI.\nAI was in the air in the mid 1980s, but there were two things especially that \nmade me want to work on it: a novel by Heinlein called The Moon is a Harsh \nMistress, which featured an intelligent computer called Mike, and a PBS \ndocumentary that showed Terry Winograd using SHRDLU. I haven\'t tried rereading \nThe Moon is a Harsh Mistress, so I don\'t know how well it has aged, but when I \nread it I was drawn entirely into its world. It seemed only a matter of time \nbefore we\'d have Mike, and when I saw Winograd using SHRDLU, it seemed like that\ntime would be a few years at most. All you had to do was teach SHRDLU more \nwords.\nThere weren\'t any classes in AI at Cornell

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
pages[5]

Document(page_content='Combinator, is that the low end eats the high end: that it\'s good to be the \n"entry level" option, even though that will be less prestigious, because if \nyou\'re not, someone else will be, and will squash you against the ceiling. Which\nin turn means that prestige is a danger sign.\nWhen I left to go back to RISD the next fall, I arranged to do freelance work \nfor the group that did projects for customers, and this was how I survived for \nthe next several years. When I came back to visit for a project later on, \nsomeone told me about a new thing called HTML, which was, as he described it, a \nderivative of SGML. Markup language enthusiasts were an occupational hazard at \nInterleaf and I ignored him, but this HTML thing later became a big part of my \nlife.\nIn the fall of 1992 I moved back to Providence to continue at RISD. The \nfoundation had merely been intro stuff, and the Accademia had been a (very \ncivilized) joke. Now I was going to see what real a

An advantage of this approach is that each document carries its page number in its metadata, so retrieved text can be traced back to the page it came from.

In [None]:
pages[0].metadata["source"]

'/paul_graham_essay.pdf'
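
As a rough sketch of why the page number is useful, the snippet below finds which pages mention a keyword. To stay runnable without the PDF, it uses a hypothetical stand-in class `PageDoc` with the same `page_content` and `metadata` fields seen in the outputs above. The helper `pages_mentioning` is also illustrative and is not part of LangChain.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in for a loaded page (same two fields as above)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_mentioning(docs, keyword):
    """Return the 'page_label' of every document whose text contains keyword."""
    keyword = keyword.lower()
    return [
        doc.metadata["page_label"]
        for doc in docs
        if keyword in doc.page_content.lower()
    ]


# Two tiny sample pages that mimic the metadata shape shown above.
sample_pages = [
    PageDoc("What I Worked On\nFebruary 2021", {"page": 0, "page_label": "1"}),
    PageDoc("So I decided to switch to AI.", {"page": 1, "page_label": "2"}),
]

print(pages_mentioning(sample_pages, "switch to AI"))
```

With the real `pages` list, the same helper should work as long as each document exposes `page_label` in its metadata, as the outputs above do.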

In [None]:
pages[0].page_content

'What I Worked On\nFebruary 2021\nBefore college the two main things I worked on, outside of school, were writing \nand programming. I didn\'t write essays. I wrote what beginning writers were \nsupposed to write then, and probably still are: short stories. My stories were \nawful. They had hardly any plot, just characters with strong feelings, which I \nimagined made them deep.\nThe first programs I tried writing were on the IBM 1401 that our school district\nused for what was then called "data processing." This was in 9th grade, so I was\n13 or 14. The school district\'s 1401 happened to be in the basement of our \njunior high school, and my friend Rich Draves and I got permission to use it. It\nwas like a mini Bond villain\'s lair down there, with all these alien-looking \nmachines — CPU, disk drives, printer, card reader — sitting up on a raised floor\nunder bright fluorescent lights.\nThe language we used was an early version of Fortran. You had to type programs \non punch cards, 

## Text Splitters

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
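
To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of a plain fixed-size sliding window. This is a simplification and not how `RecursiveCharacterTextSplitter` works internally (it prefers to break on separators), and the function name `sliding_window_chunks` is made up for illustration.

```python
def sliding_window_chunks(text, chunk_size=100, chunk_overlap=20):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# 250 characters of repeating digits, just to have predictable input.
demo_text = "".join(str(i % 10) for i in range(250))
demo_chunks = sliding_window_chunks(demo_text, chunk_size=100, chunk_overlap=20)

print([len(c) for c in demo_chunks])
# The last 20 characters of one chunk reappear at the start of the next.
print(demo_chunks[0][-20:] == demo_chunks[1][:20])
```

The overlap means that a sentence cut at a chunk boundary still appears whole, or nearly whole, in at least one chunk.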

In [None]:
# Default separators, tried in order: paragraph break, line break, space, then single characters
["\n\n", "\n", " ", ""]
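
The separator list is tried from left to right: the splitter prefers the coarsest separator that still yields pieces within `chunk_size`. The sketch below is a simplified stand-in for that idea, assuming no overlap and a made-up function name `recursive_split`. It is not the library's actual implementation.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into pieces of at most chunk_size characters (simplified, no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, rest, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "What I Worked On\nFebruary 2021\n"
    "Before college the two main things I worked on, outside of school, were writing"
)
pieces = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=100)
for piece in pieces:
    print(repr(piece))
```

Here the text has no paragraph break, so the sketch falls back to line breaks and merges the two short lines into one chunk, which mirrors the first chunk printed later in this notebook.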

## Split the text into chunks

In [None]:
chunks = []
for page in pages:
    page_text = page.page_content
    page_chunks = text_splitter.split_text(page_text)
    chunks.extend(page_chunks)
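
Note that the loop above keeps only the chunk text, so the page number is no longer attached to each chunk. One possible way to keep it is sketched below. The helper `chunk_with_page_numbers` and the record layout are assumptions for illustration, and `str.splitlines` stands in for the real splitter so the snippet runs on its own.

```python
def chunk_with_page_numbers(page_texts, split_text):
    """Split each page and remember which page every chunk came from."""
    records = []
    for page_number, text in enumerate(page_texts):
        for chunk in split_text(text):
            records.append({"page": page_number, "text": chunk})
    return records


# Stand-in page texts; with the notebook's data this would be
# [page.page_content for page in pages].
sample_texts = [
    "What I Worked On\nFebruary 2021",
    "So I decided\nto switch to AI.",
]
records = chunk_with_page_numbers(sample_texts, str.splitlines)
for record in records:
    print(record)
```

In the notebook, `text_splitter.split_text` could be passed as the second argument. LangChain splitters also offer `split_documents`, which returns documents that keep the original metadata.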

## Print the first 10 chunks

In [None]:
# Print the first 10 chunks, numbered from 1
for i, chunk in enumerate(chunks[:10], start=1):
    print(f"Chunk {i}:")
    print(chunk)
    print("\n")

Chunk 1:
What I Worked On
February 2021


Chunk 2:
February 2021
Before college the two main things I worked on, outside of school, were writing


Chunk 3:
and programming. I didn't write essays. I wrote what beginning writers were


Chunk 4:
supposed to write then, and probably still are: short stories. My stories were


Chunk 5:
awful. They had hardly any plot, just characters with strong feelings, which I


Chunk 6:
imagined made them deep.


Chunk 7:
The first programs I tried writing were on the IBM 1401 that our school district


Chunk 8:
used for what was then called "data processing." This was in 9th grade, so I was


Chunk 9:
13 or 14. The school district's 1401 happened to be in the basement of our


Chunk 10:
junior high school, and my friend Rich Draves and I got permission to use it. It




## Internal Architecture

Each chunk of English text is converted into an embedding vector (the numbers here are only illustrative):

What I Worked On February 2021 ---> [0.1, 0.4, 0.6, 0.8, ...]

The vector database then stores each chunk together with its embedding:

What I Worked On February 2021 ---> [0.1, 0.4, 0.6, 0.8, ...]
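
The chunk, vector, and store flow described above can be sketched in a few lines. This is a toy under stated assumptions: `toy_embed` is a simple word-count vector rather than a real embedding model, and the "vector database" is just a Python list. All names here are made up for illustration.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Toy embedding: one word-count per vocabulary entry (not a real model)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


sample_chunks = [
    "What I Worked On February 2021",
    "The first programs I tried writing were on the IBM 1401",
]
vocabulary = sorted({word for chunk in sample_chunks for word in chunk.lower().split()})

# "Vector store": each entry pairs a chunk with its embedding.
vector_store = [(chunk, toy_embed(chunk, vocabulary)) for chunk in sample_chunks]


def search(query, top_k=1):
    """Return the top_k stored chunks most similar to the query."""
    query_vector = toy_embed(query, vocabulary)
    ranked = sorted(
        vector_store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:top_k]]


print(search("IBM 1401 programs"))
```

A real pipeline would replace `toy_embed` with an embedding model and the list with a vector database, but the store-then-rank-by-similarity shape stays the same.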

## Congratulations, you completed Module-1!