# Testing Different File Readers for LangChain

Files for testing:
* La_Dette_d_honneur (high-quality text, probably already OCR'd, but with a rose-colored watermark on each page)
* Ohé_Les_p_tits_agneaux_revue (a color scan; quality seems reasonable, but the weak contrast may be a problem)
* La_servante_justifiee (a plain scan of fairly poor quality, with many water spots on the paper)

Libraries:
* PyPDFLoader
* PDFPlumberLoader

In [1]:
from pathlib import Path
# Both loaders live in langchain_community; the langchain.document_loaders path is deprecated.
from langchain_community.document_loaders import PDFPlumberLoader, PyPDFLoader

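async def PyLoadPDF(filepath: str) -> list:
    """Load a PDF with PyPDFLoader and return the extracted Documents (one per page by default)."""
    loader = PyPDFLoader(filepath)
    # alazy_load() yields Documents asynchronously; collect them into a list.
    return [page async for page in loader.alazy_load()]

async def PlumberLoadPDF(filepath: str) -> list:
    """Load a PDF with PDFPlumberLoader and return the extracted Documents (one per page by default)."""
    loader = PDFPlumberLoader(filepath)
    # Same collection pattern as PyLoadPDF, so the two results are directly comparable.
    return [page async for page in loader.alazy_load()]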

def get_files_from_directory(directory_path: str) -> list[str]:
    directory = Path(directory_path)
    file_paths = [str(file) for file in directory.iterdir() if file.is_file()]
    return file_paths

directory_path: str = r"C:\Users\charl\Documents\VSCode\File Reader Tests\Files"
files: list[str] = get_files_from_directory(directory_path)
files.sort(reverse=True)
print(files)

['C:\\Users\\charl\\Documents\\VSCode\\File Reader Tests\\Files\\Un_monsieur_qui_suit_les_femmes_comédie.pdf', 'C:\\Users\\charl\\Documents\\VSCode\\File Reader Tests\\Files\\Ohé_Les_p_tits_agneaux_revue.pdf', 'C:\\Users\\charl\\Documents\\VSCode\\File Reader Tests\\Files\\La_servante_justifiee.pdf', 'C:\\Users\\charl\\Documents\\VSCode\\File Reader Tests\\Files\\La_Dette_d_honneur.pdf']
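
Note that `get_files_from_directory` returns every file in the folder, so a stray non-PDF file would also be handed to the loaders. A minimal sketch of a stricter listing follows; the name `list_pdf_files` is hypothetical and not part of the original notebook, and it sorts in ascending order rather than the reverse order used above.

```python
from pathlib import Path


def list_pdf_files(directory_path: str) -> list[str]:
    """Return the paths of all .pdf files directly inside directory_path, sorted."""
    directory = Path(directory_path)
    return sorted(
        str(path)
        for path in directory.iterdir()
        # Skip subdirectories and anything without a .pdf extension (case-insensitive).
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

It could be used as a drop-in replacement, for example `files = list_pdf_files(directory_path)`.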


In [2]:
# Load PDFs
Py_loaded_PDFs: list = []
for file in files:
    pages = await PyLoadPDF(file)
    Py_loaded_PDFs.append(pages)

Plumber_loaded_PDFs: list = []
for file in files:
    pages = await PlumberLoadPDF(file)
    Plumber_loaded_PDFs.append(pages)
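
The two loops above await each file one after another. As a sketch only, the same loading could be started for all files at once with `asyncio.gather`. The helpers `load_pages` and `load_all` are hypothetical names, `make_loader` is assumed to be a loader class such as `PyPDFLoader` or `PDFPlumberLoader`, and any actual speed-up depends on how the loader implements `alazy_load` (not measured here).

```python
import asyncio
from typing import Any, Callable


async def load_pages(make_loader: Callable[[str], Any], filepath: str) -> list:
    """Collect every Document yielded by the loader built for filepath."""
    loader = make_loader(filepath)
    return [page async for page in loader.alazy_load()]


async def load_all(make_loader: Callable[[str], Any], filepaths: list[str]) -> list:
    """Load several files concurrently; results keep the order of filepaths."""
    tasks = [load_pages(make_loader, path) for path in filepaths]
    # gather() returns the results in the same order as the tasks it was given.
    return await asyncio.gather(*tasks)
```

In a notebook cell this could be called as `Py_loaded_PDFs = await load_all(PyPDFLoader, files)`, and likewise with `PDFPlumberLoader`.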

In [3]:
# Print page content for each PDF
for i, pages in enumerate(Py_loaded_PDFs):
    print(f"PyPDFLoader - File {i+1}:")
    for page in pages:
        print(page.page_content)
    print("\n")



PyPDFLoader - File 1:
C H A QU E P1EC E, 20 CENT 1MEs.
64 v» 65 LIvRAisoNs. IllÉATREC0NTEMPORAINILLUSTRÉ-
MICIIEL LEvY FRÉRES, EDITEURS,
R U E v 1v 1 eNN a, 2 s1s.
UN
MONNIEUR()UISUITLEN FEMMES
COMÉDlE-VAUDEVILLE EN DEUX ACTES
MM. TH. BARRIERE r A. DECOURCELLE
REPRÉsENTÉE PoUR LA PREMIÈRE FoIs,A PARIs, sUR LE THÉATRE DE LA MONTANSIER, Le 18 NovEMBRE 1850.
DISTRIBUTION DE LA PIÈCE.
. MM. RAvEL.HECTOR DUCHEMIN, célibataire.: . .M. D'ERMONT, représentant... . .. .. . . .. . . PELLERIN.LE COLONEL GUERIN. .. , . .. ., ., . LHÉRITIER.
M. LEGROS . .. . .. . . . .. . . .. . . KALEKAIRE.
M. DE CERNY, gentleman ridicule. - - - - - LAcoUR 1ERE.
, M'* Rn Ass 1xp.
- - | 1 - 1
- - -CLÉMENCE, femme ded'Ermont.
MATHILDE, sa nièce... . .. . . . .. .EVELINA, femme de Legros - : ( 1 :: *1.Lr.
GEO A. lorette. .. .. . .. . .. . - . . A :1 .
FLORlNE, femme de chambre de Clémence. . -A M1 ,* T
Une Loueuse de chaises.
ACTE I.
Aux Tuileries.-Lesdeuxpremiersplansformentune allée;lesdeuxder
niersun massifd'arbre
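
Output like the above is hard to compare by eye. One rough way to put a number on it is the share of tokens that are a single character, since letter-spaced headings such as `C H A QU E` come out as isolated letters. This is an ad hoc heuristic, not a standard OCR quality metric, and the function name `single_char_token_ratio` is hypothetical. French has legitimate one-letter words (`à`, `y`, `a`), so even clean text will not score zero.

```python
def single_char_token_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that are exactly one character long."""
    tokens = text.split()
    if not tokens:
        # Empty pages (e.g. image-only scans) have no tokens to judge.
        return 0.0
    return sum(1 for token in tokens if len(token) == 1) / len(tokens)


# "C H A QU E" has 4 single-character tokens out of 5.
print(single_char_token_ratio("C H A QU E"))  # 0.8
```

Applied to each `page.page_content`, the ratio gives a per-page figure that can be compared across the two loaders for the same file.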

In [4]:
for i, pages in enumerate(Plumber_loaded_PDFs):
    print(f"PDFPlumberLoader - File {i+1}:")
    for page in pages:
        print(page.page_content)
    print("\n")

PDFPlumberLoader - File 1:
CHAQUE P1ECE, 20 CENT1MEs. IllÉATRE C0NTEMPORAIN ILLUSTRÉ MICIIEL LEvY FRÉRES, EDITEURS,
64 v» 65 LIvRAisoNs.
- RUEv1v1eNNa,2 s1s.
UN
MONNIEUR ()UI SUIT LEN FEMMES
COMÉDlE-VAUDEVILLE EN DEUXACTES
MM. TH. BARRIERE r A. DECOURCELLE
REPRÉsENTÉE PoUR LA PREMIÈRE FoIs,A PARIs,sUR LETHÉATRE DE LA MONTANSIER, Le 18NovEMBRE 1850.
DISTRIBUTION DE LAPIÈCE.
M H L M . . E E C C D L T ' O E O E L G R R R O M O N D O S E N U L . T C G , H .. U E E r M . e R I p I r . N . é N s , . e . n c t é a . . l . n i t b . . , at . a . . . i . re . . . . . . . . : . , . . . . . . , . . . . . . .. . . . MM. K L P R A H E A L É L v R L E E E I K L R T A . I I I N E R . R E . . C E G M L V A E É E T O L M H I E I N N L A C D , E E A , , f . e f l s m o e a r m m e n e t m i t e è e d . c d e e . . e . L d . . e ' . . g E r r . . o m s o . . n . . t. . . . .. . - . . - - - . . - . : . - , - M'*R ( | A n 1 : 1 1 A : s : s - . 1 * x 1 1 p . . Lr.
M. DECERNY,gentleman ridicule. LAcoUR1ERE. 