Skip to content

bissaye/PdfScrapper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 

Repository files navigation

# PDF Text Extractor with Image Handling

This Python script allows you to extract text from a PDF file while handling images and other complex elements. It uses the pdfplumber library for text extraction.

## Usage

1. Make sure you have Python installed on your system.

2. Install the required packages by running the following command:

pip install -r requirements.txt


3. Run from the command line:

You can execute the script from the command line using the following syntax:

python script.py <pdf_file_path>

Replace `<pdf_file_path>` with the absolute or relative path to the PDF file you want to process.

the result will be registred in a file named output.docx


4. Using in another Python script:

You can also import the `extract_text_from_pdf` function into another Python script and use it as follows:

```python
from script import extract_text_from_pdf

pdf_path = "path/to/your/file.pdf"
extracted_text = extract_text_from_pdf(pdf_path)
print(extracted_text)


===========
|  NOTES  |
===========

The script uses the pdfplumber library to extract text from PDF pages and treats images as static text.

The quality of extraction depends on the PDF's structure and how images are integrated into the document.

If you encounter extraction issues or have specific requirements, you might consider using more advanced image processing techniques to extract content from images.

Ensure you have the necessary permissions to access the PDF files you want to process.

For more information on using this script, please refer to the pdfplumber documentation: https://github.com/jsvine/pdfplumber

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages