OCR-classif

A set of Python programs to extract text from PDF files. It requires the installation of

You also need the following modules:

The extraction is done by calling:

python3 pdf2txt.py

The path to the folder where the PDF files are stored must be specified inside the Python script (pdf2txt.py).

The texts can then be read from all the txt files, stored in dictionaries, and saved in a single pickle file:

python3 text_extractor.py

This program relies on the toolbox textbox.py. The path to the txt files and the name of the pickle file must be specified within the Python script.
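
For readers who want to see the general idea of this step, here is a minimal sketch using only the Python standard library. It is not the code of text_extractor.py or textbox.py: the function names `collect_texts`, `save_pickle` and `load_pickle`, and the dictionary keys `filename` and `text`, are assumptions made for illustration.

```python
import pickle
from pathlib import Path


def collect_texts(txt_dir):
    """Read every .txt file in txt_dir into a list of dictionaries.

    Each dictionary holds the file name and its text content.
    The keys 'filename' and 'text' are illustrative choices.
    """
    records = []
    for path in sorted(Path(txt_dir).glob("*.txt")):
        # errors="replace" keeps going if OCR output contains invalid bytes
        text = path.read_text(encoding="utf-8", errors="replace")
        records.append({"filename": path.name, "text": text})
    return records


def save_pickle(records, pickle_path):
    """Save all records in a single pickle file."""
    with open(pickle_path, "wb") as f:
        pickle.dump(records, f)


def load_pickle(pickle_path):
    """Load the records back, for example before analyzing the texts."""
    with open(pickle_path, "rb") as f:
        return pickle.load(f)
```

With these helpers, calling `save_pickle(collect_texts("txt_files"), "texts.pickle")` would write one pickle file for a whole folder, where `txt_files` and `texts.pickle` are placeholder names. Keeping the file name next to each text makes it possible to trace a text back to its source PDF later.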

The Grevia project can then be used to analyze the texts.


bricaud/OCR-classif