# 3. Highlighting Keywords in PDFs using Python

In this section, we will look at how to use Python to highlight keywords in a large number of PDFs. The goal is to make it easier to locate information across many papers. 

------------------

## Preamble: Download this Google Colab Notebook to your Google Drive

It is possible to save a copy of this Google Colab notebook to your Google Drive. This is useful if you would like to write notes, or to make changes or additions to the code in this notebook. 

**IF YOU WOULD LIKE to save a copy of this notebook to your Google Drive before beginning**, do the following: 

1. Click the ```Copy to Drive``` button at the top of the Google Colab webpage.
2. A message may appear saying that Google Colab wants to open the copied notebook in a new tab. If it does, click the button that opens the copied notebook in a new tab. 
3. **Close the original tab containing the original Google Colab notebook** so that we don't get the two notebooks mixed up. 

You can find this notebook in your Google Drive, in a folder called ```Colab Notebooks```. 

**IF YOU DO NOT want to save a copy of this notebook to your Google Drive**, note that you will get the following message when you begin to run code in this notebook (see below). **When this comes up, click the** ```Run anyway``` **button to allow the code to run**.

<center><img src=https://raw.githubusercontent.com/geoffreyweal/Literature_Mining_Tutorial/main/Notebooks/images/Running_notebook_from_github.png width="500"></center>


We are now ready to run this Google Colab Notebook. 

------------------

## Highlighting Keywords

**First**, we will install ```pymupdf```, which is the Python package that will highlight our PDFs. 

---



In [None]:
!pip install --upgrade pymupdf

**Second**, we want to upload your PDFs from arXiv/Scopus into Google Colab again. See the beginning of ```iDM_Tutorial_Literature_Mining_3.ipynb``` to learn how to do this. In summary, do the following:


1.   Click the Files folder icon to the left of the Google Colab page.
2.   Upload the PDF files to Google Colab by clicking the upload button. 
3.   **Wait until all the PDF files have been uploaded**.
4.   Once all your PDF files have been uploaded to Google Colab, **run the code below**.
5.   Click the refresh button, which will show that all your PDF files have been moved into a new folder called ```literature```. 

NOTE: If you could not obtain papers or download the ```literature.zip``` file from the ```iDM_LMT_1_Gathering_Literature.ipynb``` notebook, download an example of ```literature.zip``` by [clicking here](https://github.com/geoffreyweal/Literature_Mining_Tutorial/raw/main/Notebooks/literature.zip). 

In [None]:
import os, shutil, tqdm

# First, move into the /content folder (where Google Colab stores uploaded files)
os.chdir('/content')

# Second, make a folder called literature if it does not already exist
literature_folder_name = 'literature'
if not os.path.exists(literature_folder_name):
    os.makedirs(literature_folder_name)

# Third, move all pdf files into the literature folder
pdf_filenames = sorted([a_file for a_file in os.listdir(".") if (os.path.isfile(a_file) and a_file.endswith('.pdf'))])
if len(pdf_filenames) > 0:
    pbar = tqdm.tqdm(pdf_filenames)
    for a_file in pbar:
        pbar.set_description('Moving '+str(a_file))
        shutil.move(a_file, literature_folder_name+'/'+a_file)
else:
    print('There were no pdfs to move to the '+str(literature_folder_name)+' folder.')
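If you uploaded the example ```literature.zip``` mentioned in the note above rather than individual PDFs, the files need to be unpacked before they can be highlighted. The following is a minimal sketch using only the Python standard library. The helper name ```extract_pdfs``` is hypothetical, and the code assumes the zip may store the PDFs inside a subfolder, so it copies every PDF directly into the ```literature``` folder.

```python
import zipfile
from pathlib import Path


def extract_pdfs(zip_path, destination="literature"):
    """Copy every PDF inside zip_path directly into destination, ignoring subfolders."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            # Drop any folder prefix stored inside the zip file
            name = Path(member.filename).name
            if member.is_dir() or not name.lower().endswith(".pdf"):
                continue
            # Skip hidden metadata entries (names starting with a dot)
            if name.startswith("."):
                continue
            with archive.open(member) as source, open(destination / name, "wb") as out:
                out.write(source.read())
            extracted.append(name)
    return sorted(extracted)


# Example usage (assumes literature.zip was uploaded to the current folder)
if Path("literature.zip").exists():
    print(extract_pdfs("literature.zip"))
```

PDFs with the same filename in different subfolders of the zip would overwrite each other in this sketch, so rename them first if that applies to your files.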

**Third**, we now want to create a new folder called ```highlighted_literature``` and fill it with a copy of every PDF in the ```literature``` folder, with each occurrence of the words in the ```keywords``` list highlighted.

In [None]:
import os, shutil
import fitz  # fitz is the import name of the pymupdf package
import tqdm

# First, add the keywords you would like to highlight in the PDFs.
keywords = ['exciton', 'diffusion', 'coefficient']

# Second, create a folder to add highlighted PDFs to.
highlighted_literature_folder = "highlighted_literature"
if os.path.exists(highlighted_literature_folder):
    shutil.rmtree(highlighted_literature_folder)
os.makedirs(highlighted_literature_folder)

# Third, create copies of the PDFs with keywords highlighted.
literature_folder_name = 'literature'
pbar = tqdm.tqdm([pdf_file for pdf_file in sorted(os.listdir(literature_folder_name)) if (os.path.isfile(literature_folder_name+'/'+pdf_file) and pdf_file.endswith('.pdf'))])
error_list = []
for pdf_file in pbar:
    pbar.set_description('Highlighting '+str(pdf_file))
    # 3.1: Read in the PDF file (the with-block closes the file when we are done)
    try:
        with fitz.open(literature_folder_name+'/'+pdf_file) as doc:
            for page in doc:
                for keyword in keywords:
                    # 3.2: Search for the keyword on this page of the PDF
                    text_instances = page.search_for(keyword)
                    # 3.3: Highlight each match found on this page
                    for inst in text_instances:
                        highlight = page.add_highlight_annot(inst)
                        highlight.update()
            # 3.4: Save the highlighted PDF into the highlighted_literature folder.
            doc.save(highlighted_literature_folder+'/'+pdf_file, garbage=4, deflate=True, clean=True)
    except Exception as exception:
        error_list.append((pdf_file, exception))
print()

# Fourth, report any errors with highlighting files.
if len(error_list) > 0:
    print('=========================================')
    print('The following pdf files have issues:')
    print()
    for pdf_file, exception in error_list:
        print(str(pdf_file)+':\t'+str(exception))
    print('=========================================')
else:
    print('This script ran with no issues')

If you click the refresh button, you will see a new ```highlighted_literature``` folder. This folder now contains PDF files that have had their keywords highlighted.

**Fourth**, we can now compress this ```highlighted_literature``` folder into a zip file and download it. **Run the code below**.

**NOTE**: It may take time to download the ```highlighted_literature.zip``` file to your computer.

In [None]:
!zip -r highlighted_literature.zip highlighted_literature > .output.txt
from google.colab import files
files.download('highlighted_literature.zip') 

You will now have downloaded a zip file called ```highlighted_literature.zip```. If you open this zip file, you will see it contains many pdf files which have had keywords highlighted. 
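If the ```zip``` command is not available, for example when running this notebook outside Google Colab, the archive can also be built with the Python standard library. The sketch below is one possible alternative, not the method used above: the helper name ```zip_folder``` is hypothetical, and it returns the list of PDFs stored in the archive so you can check that nothing is missing before downloading.

```python
import shutil
import zipfile
from pathlib import Path


def zip_folder(folder):
    """Zip folder into <folder>.zip next to it; return (zip path, PDF names inside)."""
    folder = Path(folder).resolve()
    zip_path = shutil.make_archive(
        str(folder),                  # archive is written to <folder>.zip
        "zip",
        root_dir=str(folder.parent),  # paths in the archive are relative to this
        base_dir=folder.name,         # only this folder is added to the archive
    )
    with zipfile.ZipFile(zip_path) as archive:
        pdf_names = sorted(n for n in archive.namelist() if n.lower().endswith(".pdf"))
    return zip_path, pdf_names


# Example usage (assumes the highlighted_literature folder from the step above exists)
if Path("highlighted_literature").is_dir():
    zip_path, pdf_names = zip_folder("highlighted_literature")
    print(zip_path, "contains", len(pdf_names), "PDF files")
```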

**Once you have completed this**, move on to the ```iDM_LMT_4_Putting_It_All_Together.ipynb``` notebook, called ```Putting It All Together```.