In [9]:
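from langchain.document_loaders import DirectoryLoader, PyPDFLoader


def extract_pdf_data(directory):
    """Load the PDFs found in `directory` and return them as Documents."""
    # DirectoryLoader applies PyPDFLoader to every file matching the glob.
    loader = DirectoryLoader(
        path=directory,
        glob="*.pdf",
        loader_cls=PyPDFLoader,
    )
    documents = loader.load()

    return documents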

## Function: `extract_pdf_data`

This function loads the PDF files in a specified directory with LangChain's `DirectoryLoader`, using `PyPDFLoader` to parse each file.

### Parameters:
- **`directory`** (str): The path to the directory containing PDF files.

### Returns:
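- **`documents`** (list): A list of LangChain `Document` objects. `PyPDFLoader` yields one `Document` per PDF page, with the page text in `page_content` and the file path and page index in `metadata`.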

### Example:
```python
extracted_data = extract_pdf_data("data/")
```

In [10]:
data = extract_pdf_data("data/")

In [11]:
data

[Document(metadata={'source': 'data/Medical_book_compressed-15-20.pdf', 'page': 0}, page_content='Abdominal aorta ultrasound seeAbdominal\nultrasound\nAbdominal aortic aneurysm seeAortic\naneurysm\nAbdominal hernia seeHernia\nAbdominal thrust seeHeimlich maneuver\nAbdominal ultrasound\nDefinition\nUltrasound technology allows doctors to “see”\ninside a patient without resorting to surgery. A transmit-\nter sends high frequency sound waves into the body,where they bounce off the different tissues and organs toproduce a distinctive pattern of echoes. A receiver“hears” the returning echo pattern and forwards it to acomputer, which translates the data into an image on atelevision screen. Because ultrasound can distinguishsubtle variations between soft, fluid-filled tissues, it isparticularly useful in providing diagnostic images of theabdomen. Ultrasound can also be used in treatment.\nPurpose\nThe potential medical applications of ultrasound\nwere first recognized in the 1940s as an outgr
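The output above shows that each `Document` carries a `metadata` dict with `source` and `page` keys. As a minimal sketch of how one might summarize a loaded list, the snippet below counts pages per source file. The `Document` dataclass here is a hypothetical stand-in that mirrors only the `page_content` and `metadata` attributes, so the example runs without LangChain or any PDFs; `count_pages_by_source` is likewise an illustrative helper, not part of the library.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for LangChain's Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def count_pages_by_source(documents):
    """Return a Counter mapping each source path to its number of pages."""
    return Counter(doc.metadata.get("source", "<unknown>") for doc in documents)


# Made-up sample data shaped like the loader output above.
sample_docs = [
    Document("first page text", {"source": "data/a.pdf", "page": 0}),
    Document("second page text", {"source": "data/a.pdf", "page": 1}),
    Document("only page text", {"source": "data/b.pdf", "page": 0}),
]

page_counts = count_pages_by_source(sample_docs)
for source, n_pages in page_counts.items():
    print(f"{source}: {n_pages} page(s)")
```

Because the helper relies only on the `metadata` attribute, passing it the `data` list from the cell above should work the same way.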

In [12]:
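from langchain.text_splitter import RecursiveCharacterTextSplitter


def split_text_into_chunks(documents):
    """Split Documents into chunks of up to 500 characters with up to 20 characters of overlap."""
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
    # split_documents returns new Documents and carries over each source Document's metadata.
    text_chunks = text_splitter.split_documents(documents)

    return text_chunks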


## Function: `split_text_into_chunks`

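This function splits the text of each document into smaller, overlapping chunks with `RecursiveCharacterTextSplitter` (`chunk_size=500`, `chunk_overlap=20`, both measured in characters by default). Smaller chunks are easier to process in later steps than whole pages.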

### Parameters:
- **`documents`** (list): A list of LangChain `Document` objects, such as the output of `extract_pdf_data`.

### Returns:
- **`text_chunks`** (list): A list of `Document` chunks, each up to 500 characters long and sharing up to 20 characters with its neighboring chunk.
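
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is a sketch of the overlap idea only, not the algorithm `RecursiveCharacterTextSplitter` uses (the library first tries to split on separators such as paragraph and line breaks); the function name `sliding_window_chunks` is an illustrative assumption.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=20):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration: ignores separators entirely.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Synthetic 1200-character text, made up for the demonstration.
sample_text = "".join(str(i % 10) for i in range(1200))
demo_chunks = sliding_window_chunks(sample_text, chunk_size=500, chunk_overlap=20)
print([len(chunk) for chunk in demo_chunks])
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last 20 characters of one chunk reappear at the start of the next. That repetition is what keeps a sentence cut at a boundary partly visible in both chunks.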

### Example:
```python
text_chunks = split_text_into_chunks(extracted_data)
```

In [13]:
chunks = split_text_into_chunks(data)
print("Number of chunks:", len(chunks))

Number of chunks: 68
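
A quick sanity check after splitting is to look at the chunk lengths and confirm that none exceeds the configured size. The helper below is a hedged sketch: `chunk_length_stats` is a hypothetical name, and `SimpleNamespace` objects with made-up lengths stand in for the `Document` chunks so the snippet runs on its own.

```python
from statistics import mean
from types import SimpleNamespace


def chunk_length_stats(chunks):
    """Return count, min, max and mean length of each chunk's page_content."""
    lengths = [len(chunk.page_content) for chunk in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
    }


# Stand-in chunks with illustrative lengths; real chunks come from the splitter.
sample_chunks = [SimpleNamespace(page_content="a" * n) for n in (500, 480, 120)]
stats = chunk_length_stats(sample_chunks)
print(stats)
```

Since the helper reads only `page_content`, calling it on the `chunks` list from the cell above should report the real distribution.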
