# PyMuPDF and Tesseract for OCR in Amazon SageMaker
Use the `conda_python3` kernel.

## Extract the `tessdata` folder

In [1]:
%%bash
tar -xf tessdata.tar

### (Optional) Install tesseract from scratch and get the `tessdata` folder
Although `tesseract` can be installed with `yum` (the solution provided [here](https://stackoverflow.com/a/74061696)), that route only installs version 3.x, while `pymupdf` requires version 4.x or above, so I compiled it myself.

The solution below is adapted from [here](https://gist.github.com/mdv3101/a1b75abd2ec09dc5f1fb4f7637738f8d).

Compiling takes around 20-25 minutes.

In [None]:
# %%bash
# bash install_tesseract.sh
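
Before committing to the long build, it may be worth checking which Tesseract version, if any, is already on the `PATH`. The sketch below is only an illustration: the helper names `parse_tesseract_major` and `installed_tesseract_major` are hypothetical, and the regular expression assumes the first line of `tesseract --version` looks like `tesseract 4.1.1` (optionally with a leading `v`).

```python
import re
import shutil
import subprocess


def parse_tesseract_major(version_output):
    """Return the major version found in `tesseract --version` text, or None."""
    # Assumption: the version line looks like "tesseract 4.1.1" or "tesseract v5.0.0".
    match = re.search(r"tesseract\s+v?(\d+)", version_output)
    return int(match.group(1)) if match else None


def installed_tesseract_major():
    """Return the major version of the `tesseract` binary on PATH, or None if absent."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    # Merge stderr into stdout in case the build prints its version there.
    result = subprocess.run(
        [exe, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,
    )
    return parse_tesseract_major(result.stdout)


major = installed_tesseract_major()
print("tesseract major version:", major)
```

If this prints `None` or a number below 4, the optional build step above is probably needed.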

## Check that the installation is complete

The output of `!ls tessdata` should read:
```
configs		 eng.user-patterns  Makefile	 Makefile.in	  pdf.ttf
eng.traineddata  eng.user-words     Makefile.am  osd.traineddata  tessconfigs
```

In [2]:
!ls tessdata

configs		 eng.user-patterns  Makefile	 Makefile.in	  pdf.ttf
eng.traineddata  eng.user-words     Makefile.am  osd.traineddata  tessconfigs
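
The same check can be done programmatically instead of reading the listing by eye. This is a minimal sketch, assuming that `eng.traineddata` and `osd.traineddata` (both visible in the listing above) are the files that matter for English OCR; the helper name `missing_tessdata_files` is hypothetical.

```python
from pathlib import Path


def missing_tessdata_files(folder, required=("eng.traineddata", "osd.traineddata")):
    """Return the names in `required` that are not present in `folder`."""
    base = Path(folder)
    return [name for name in required if not (base / name).is_file()]


# An empty list means every required file was found.
print(missing_tessdata_files("tessdata"))
```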


## Installing Python packages

In [3]:
!pip install -qU langchain pymupdf

## Setting the environment variable
This must be done before importing `fitz` (`pymupdf`).

In [4]:
import os
# This must be done before importing fitz (pymupdf)
os.environ["TESSDATA_PREFIX"] = "./tessdata"
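
A relative path such as `./tessdata` is resolved against the current working directory, so it may stop pointing at the right folder if the notebook later changes directory. One possible variant, sketched here with a hypothetical helper `set_tessdata_prefix`, stores an absolute path and fails early when the folder is missing. Like the cell above, it would have to run before `fitz` is imported.

```python
import os
from pathlib import Path


def set_tessdata_prefix(folder="./tessdata"):
    """Set TESSDATA_PREFIX to the absolute path of `folder` and return that path."""
    path = Path(folder).resolve()
    if not path.is_dir():
        raise FileNotFoundError(f"tessdata folder not found: {path}")
    os.environ["TESSDATA_PREFIX"] = str(path)
    return str(path)


# Usage (run before `import fitz`):
# set_tessdata_prefix("./tessdata")
```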

In [5]:
import fitz

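# The OCR calls used below need PyMuPDF 1.19.0 or later.
assert tuple(map(int, fitz.VersionBind.split("."))) >= (1, 19, 0), "Need PyMuPDF v1.19.0 or later"
# Fails if TESSDATA_PREFIX was not set before the import.
assert fitz.TESSDATA_PREFIX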

## OCR using `pymupdf`
The OCR steps follow the example notebook [here](https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/jupyter-notebooks/partial-ocr.ipynb).
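
The cells below OCR only the first page. As a rough sketch, the same two calls (`get_textpage_ocr` and `get_text`) could be wrapped in a helper that walks every page of an opened document. The function name `ocr_all_pages` is hypothetical, and the default `dpi=300` simply mirrors the value used in this notebook.

```python
def ocr_all_pages(doc, dpi=300):
    """Return a list with the OCR text of every page in `doc`.

    `doc` is expected to be an opened PyMuPDF document, i.e. an iterable
    of pages that provide `get_textpage_ocr` and `get_text`.
    """
    texts = []
    for page in doc:
        # full=True runs OCR on the whole page, as in the single-page cell.
        textpage = page.get_textpage_ocr(flags=0, dpi=dpi, full=True)
        texts.append(page.get_text(textpage=textpage, sort=True))
    return texts


# Usage, once a document has been opened with fitz.open(...):
# texts = ocr_all_pages(doc)
```

Full-page OCR at 300 dpi can be slow, so for long documents it may be preferable to process a subset of pages.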

In [6]:
doc = fitz.open("ocr_Immunotherapy in breast cancer.pdf")
page = doc[0]

In [7]:
# make the TextPage object. It does all the OCR.
full_tp = page.get_textpage_ocr(flags=0, dpi=300, full=True)

# now look at what we have got
print(page.get_text(textpage=full_tp, sort=True))

sey
Journal of
Jah
»
.
s
Browse articles
eo &
Instructions
‘
AFCINOQGENESIS
Biker.
J Carcinog. 2019; 18: 2.
PMCID: PMC6540776
Published online 2019 May 23. doi: 10.4103/jcar.JCar_2_19: 10.4103/jcar.JCar_2_19
PMID: 31160888
Immunotherapy in breast cancer
Soley Bayraktar,''? Sameer Batoo,' Scott Okuno,' and Stefan Glick?
'Department of Medicine, Division of Medical Oncology and Hematology, Mayo Clinic Health System, Eau Claire,
WI, USA
Department of Medicine, Division of Medical Oncology and Hematology, Biruni University School of Medicine,
Istanbul, Turkey
3Vice President Global Medical Affairs, Early Assets, Celgene Corporation, Summit, NJ, USA
Address for correspondence: Dr. Soley Bayraktar, Mayo Clinic Health System, Albert J. And Judith A. Dunlap
Cancer Center, 1221 Whipple St., Eau Claire, WI 54702, USA. E-mail: soley.bayraktar@gmail.com
Received 2019 Mar 2; Accepted 2019 Apr 9.
Copyright
: © 2019 Journal of Carcinogenesis
This is an open access journal, and articles are distribute