# 4. Putting It All Together: Summary of This Tutorial

This notebook summarises all the scripts we have learnt to use in this tutorial. 

This summary is packaged into three components:

1.   **Gathering**: We will look at how to collect lots of papers using various methods.
2.   **Extracting**: We will extract text from papers and show only the segments in our papers containing keywords of interest. 
3.   **Highlighting**: We will highlight the keywords across all of our papers automatically using Python.

---

## Preamble 1: Download this Google Colab Notebook to your Google Drive

It is possible to save a copy of this Google Colab notebook to your Google Drive. This is useful if you would like to write notes, or to make changes or additions to the code in this notebook. 

**IF YOU WOULD LIKE to save a copy of this notebook to your Google Drive before beginning**, do the following: 

1. Click the **Copy to Drive** button at the top of the Google Colab webpage.
2. A message may appear saying that Google Colab wants to open the copied notebook in a new tab. If it does, click the button to open the copied notebook in a new tab.
3. **Close the original tab containing the original Google Colab notebook** so that we don't get the two notebooks mixed up.

You can find this notebook in your Google Drive, in a folder called ```Colab Notebooks```. 

**IF YOU DO NOT want to save a copy of this notebook to your Google Drive**, note that the message shown below will appear when you begin to run code in this notebook.

* **When this comes up, click the** ```Run anyway``` **button to allow the code to run**.

<center><img src=https://raw.githubusercontent.com/geoffreyweal/Literature_Mining_Tutorial/main/Notebooks/images/Running_notebook_from_github.png width="500"></center>


We are now ready to run this Google Colab Notebook. 

---


## Preamble 2: Installing packages to Google Colab

We need to download and install a few Python packages into Google Colab to allow the code in this notebook to run.

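First: Install the packages needed to download papers from arXiv (```arxiv```), extract text from PDFs (```pypdf```), write Word docx files (```python-docx```), and highlight PDFs (```pymupdf```).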

In [None]:
!pip install --upgrade arxiv pypdf python-docx pymupdf

Second: Move into the content folder that we are working in (in the Files panel).

In [None]:
import os
os.chdir('/content')

-----------

## 1: Gathering Scientific Papers

You can either download scientific papers from arXiv, or upload PDF files to Google Colab.

### 1.1: Gathering Scientific Papers from arXiv using Python



First: Download papers from arXiv.

Troubleshooting Notes: If you get a message that includes ```HTTP Forbidden``` or ```connection reset```:
1.   Click ```Runtime > Disconnect and delete runtime``` at the top of the webpage.
2.   Then click ```Runtime > Run all``` at the top of the webpage.

**IMPORTANT TROUBLESHOOTING NOTE**: If something else goes wrong and it cannot be fixed easily, or you are not sure what to do, do this:
1.  Click ```Runtime > Disconnect and delete runtime``` at the top of the webpage.
2.  Then click ```Runtime > Run all``` at the top of the webpage.
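
Restarting the runtime as described above is the most reliable fix. For a short network hiccup, it can sometimes be enough to retry the failing call after a pause. The following is a minimal sketch of that idea and is not part of the original tutorial scripts; ```retry_with_backoff``` and its parameters are hypothetical names chosen for illustration.

```python
import time

def retry_with_backoff(func, max_attempts=3, initial_delay=1.0, sleep=time.sleep):
    """Call func(); if it raises an error, wait and try again, doubling the wait each time."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as error:
            if attempt == max_attempts:
                raise  # Out of attempts: let the original error be shown
            print('Attempt '+str(attempt)+' failed ('+str(error)+'); retrying in '+str(delay)+' s')
            sleep(delay)
            delay *= 2
```

Assuming the download loop in the cell below, one possible use would be ```retry_with_backoff(lambda: result.download_pdf(dirpath=literature_filename))``` in place of the direct call. If the error persists after the last attempt, fall back to restarting the runtime.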


In [None]:
import os, shutil, arxiv, tqdm

print('DOWNLOADING SCIENTIFIC PAPERS FROM ARXIV')

# First, set up the search in arXiv
search = arxiv.Search(
  query = "Exciton Diffusion Organic Photovoltaics", # Put the keywords of the research topic you are interested in here
  max_results = 50, # Limits the download to the 50 most relevant papers. Remove this line to download every paper found (including papers not relevant to your research topic).
)

# Second, create folders to save papers to. DO NOT MODIFY
literature_filename = 'literature'
if os.path.exists(literature_filename):
    shutil.rmtree(literature_filename)
os.makedirs(literature_filename)
os.makedirs(literature_filename+'/cond-mat')

# Third, download the papers into a folder called literature. DO NOT MODIFY
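# arxiv.Client().results(search) is used because search.results() is deprecated in recent versions of the arxiv package
pbar = tqdm.tqdm(arxiv.Client().results(search))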
for result in pbar:
    pbar.set_description('Saving '+str(result.title)+': '+str(result.pdf_url))
    result.download_pdf(dirpath=literature_filename)
for file in os.listdir(literature_filename+'/cond-mat'):
    shutil.copyfile(literature_filename+'/cond-mat/'+file, literature_filename+'/'+file)
shutil.rmtree(literature_filename+'/cond-mat')
pbar.set_description('Finished downloading papers from arXiv')
pbar.close()

If you press the refresh button in the ```Files``` panel, you will see the ```literature``` folder, which contains all your papers in PDF form.

Second: Download all of these PDFs to your computer.

**NOTE**: It may take time to download the ```literature.zip``` file to your computer.

In [None]:
!zip -r literature.zip literature > .output.txt
from google.colab import files
files.download('literature.zip') 
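
The ```!zip``` line above is a shell command that Colab runs for us. If you would rather stay in Python, the standard library can build a similar archive. This is a minimal sketch, assuming the folder already exists; ```zip_folder``` is a hypothetical helper name and not part of the original scripts.

```python
import os, shutil

def zip_folder(folder_name):
    """Create folder_name.zip next to the folder and return the path of the zip file."""
    folder_path = os.path.abspath(folder_name)
    parent_folder = os.path.dirname(folder_path)
    # base_dir keeps the folder name itself inside the archive (e.g. literature/paper.pdf)
    return shutil.make_archive(folder_path, 'zip', root_dir=parent_folder, base_dir=os.path.basename(folder_path))
```

In Colab, the returned path could then be passed to ```files.download```, for example ```files.download(zip_folder('literature'))```.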

### 1.2: Upload PDF files to Google Colab

Copy all your PDF files of interest to Google Colab and then run the code below to move all these PDF files into a folder called ```literature```. 

In [None]:
import os, shutil, tqdm

# First, make a folder called literature if it does not already exist
literature_folder_name = 'literature'
if not os.path.exists(literature_folder_name):
    os.makedirs(literature_folder_name)

# Second, move all pdf files into the literature folder
pdf_filenames = sorted([a_file for a_file in os.listdir(".") if (os.path.isfile(a_file) and a_file.endswith('.pdf'))])
if len(pdf_filenames) > 0:
    pbar = tqdm.tqdm(pdf_filenames)
    for a_file in pbar:
        pbar.set_description('Moving '+str(a_file))
        shutil.move(a_file, literature_folder_name+'/'+a_file)
else:
    print('There were no pdfs to move to the '+str(literature_folder_name)+' folder.')

---------

## 2. Extracting Text from PDFs using Python



First: Extract text from PDF files.

In [None]:
import os, tqdm
from pypdf import PdfReader

print('====================================')
print('EXTRACTING TEXT FROM PDFs')

def merge_pdf_pages(pages):
    """
    This is a method that merges the text together into a single string. 
    
    Do not worry about what is going on in this method.
    """
    pdf_text = ''
    for page in pages:
        text = page.extract_text()
        pdf_text += text + ' \n'
    pdf_text = pdf_text.replace('-\n','')
    return pdf_text

# First, set up some general variables for performing text extraction.
literature_folder_name = 'literature'
literature_text = {}
error_list = []

# Second, obtain the names of all the pdfs in the literature folder
pdf_filenames = sorted([a_file for a_file in os.listdir(literature_folder_name) if (os.path.isfile(literature_folder_name+'/'+a_file) and a_file.endswith('.pdf'))])

# Third, extract data from all the pdf files in the literature folder
pbar = tqdm.tqdm(pdf_filenames)
for a_file in pbar:
    pbar.set_description('Extracting '+str(a_file))
    try:
        reader = PdfReader(literature_folder_name+'/'+a_file)
        pdf_text = merge_pdf_pages(reader.pages)
        literature_text[a_file] = pdf_text
    except Exception:
        error_list.append(a_file)
pbar.close()
print()
print('Finished Extracting Text from PDFs')

# Fourth, indicate if there were any problems with extracting text from pdfs
if len(error_list) > 0:
    print('=========================================')
    print('The following pdfs could not have text extracted from them:')
    print()
    for error_file in error_list:
        print(error_file)
    print('=========================================')
else:
    print('Text could be extracted from all pdfs with no problems.')
print('====================================')
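
The ```merge_pdf_pages``` function above removes every hyphen that is directly followed by a line break, so that a word split across two lines is joined back together. The sketch below shows that one step on a plain string, so its effect can be seen without a PDF. The function name ```join_hyphenated_line_breaks``` and the sample sentence are invented for illustration.

```python
def join_hyphenated_line_breaks(text):
    """Remove each hyphen + newline pair, so 'diffu-\nsion' becomes 'diffusion'."""
    return text.replace('-\n', '')

sample_text = 'The exciton diffu-\nsion length was measured.\n'
print(join_hyphenated_line_breaks(sample_text))
```

One side effect to be aware of: a genuinely hyphenated term that happens to break at the end of a line (for example ```long-``` followed by ```range```) also loses its hyphen, so a keyword search for ```long-range``` could miss it.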

Second: Identify keywords from PDFs. 

In [None]:
import tqdm, re

print('====================================')
print('IDENTIFYING KEYWORDS FROM PDFs')

keywords = ['exciton', 'diffusion', 'coefficient']

def find_keywords(pdf_text, keywords):
    """
    This method is designed to locate all the keywords in a string of text.
    """
    all_keyword_indices_in_text = []
    for keyword in keywords:
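        # re.escape makes any special characters in a keyword (such as '.' or '+') match literally
        all_keyword_indices_in_text += [(m.start(), m.end()) for m in re.finditer(re.escape(keyword), pdf_text)]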
    return sorted(all_keyword_indices_in_text)

# First, set up some general variables for recording data about keywords
keywords_in_literature_text = {pdf_filename: None for pdf_filename in literature_text.keys()}

# Second, determine all the places in the PDF where keywords are found. 
pbar = tqdm.tqdm(sorted(literature_text.items()))
for pdf_filename, pdf_text in pbar:
    pbar.set_description('Analysing Keywords in '+str(pdf_filename))
    all_keyword_indices_in_text = find_keywords(pdf_text,keywords)
    keywords_in_literature_text[pdf_filename] = all_keyword_indices_in_text
pbar.close()
print()
print('Finished Identifying Keywords from PDFs')
print('====================================')
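
The ```find_keywords``` function above returns a sorted list of ```(start, end)``` character positions, and its search is case-sensitive, so ```exciton``` does not match ```Exciton``` at the start of a sentence. The sketch below is one possible variation that ignores upper and lower case. It is an assumption about what you might want rather than part of the original scripts, and ```find_keywords_ignoring_case``` is a hypothetical name.

```python
import re

def find_keywords_ignoring_case(text, keywords):
    """Return sorted (start, end) positions of every keyword, ignoring upper/lower case."""
    positions = []
    for keyword in keywords:
        # re.escape makes characters such as '.' or '(' in a keyword match literally
        pattern = re.compile(re.escape(keyword), re.IGNORECASE)
        positions += [(m.start(), m.end()) for m in pattern.finditer(text)]
    return sorted(positions)

sample_text = 'Exciton diffusion is slow. The exciton lifetime is short.'
print(find_keywords_ignoring_case(sample_text, ['exciton', 'diffusion']))
```

If you adopt a variation like this, note that the bolding step in the next cell compares keywords case-sensitively, so capitalised matches would appear in the snippets without being shown in bold.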

Third: Save snippets of the sentences that include the keywords into Word docx files. 

In [None]:
import os, tqdm
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH

print('====================================')
print('SAVE KEYWORDS FROM PDFs AS SNIPPETS IN DOCX FILES')

def already_been_printed(phsi, phei, printed_segments):
    """
    This method is used to determine what has already been printed. 

    Do not worry about what is going on in this method.
    """
    for start_index, end_index in printed_segments:
        if (start_index <= phsi <= end_index) and (start_index <= phei <= end_index):
            return True
    return False

# First, set up some general variables for running this script
literature_keyword_summary_name = 'literature_keyword_summary'
print_hangover_no = 200 # Number of characters shown either side of each keyword. This can be modified as you wish

# Second, create a new folder to save keyword summary docx files to.
if not os.path.exists(literature_keyword_summary_name):
    os.makedirs(literature_keyword_summary_name)

# Third, print the keywords that will be searched for in this script.
print('##################################')
print('keywords: '+str(keywords))
print('##################################')

# Fourth, save the summary of keywords found in the text into docx files.
pbar = tqdm.tqdm(sorted(literature_text.keys()))
for pdf_filename in pbar:

    # 4.1: Create the word document. 
    docx_filename = pdf_filename.replace('.pdf','.docx')
    pbar.set_description('Saving to file: '+str(docx_filename))
    document = Document()
    document.add_heading(pdf_filename)
    document.add_paragraph()

    # 4.2: Collect the text from the pdf, and the positions of keywords identified in the text
    pdf_text = literature_text[pdf_filename]
    all_keyword_indices_in_text = keywords_in_literature_text[pdf_filename]

    # 4.3: Write all the summaries involving keywords from the pdf into the docx file
    printed_segments = []
    for start_index, end_index in all_keyword_indices_in_text:
        # Take print_hangover_no characters either side of the keyword, staying inside the text
        phsi = max(start_index - print_hangover_no, 0)
        phei = min(end_index + print_hangover_no, len(pdf_text))
        if already_been_printed(phsi, phei, printed_segments):
            continue
        else:
            printed_segments.append((phsi,phei))
        insept_in_text = pdf_text[phsi:phei]
        insept_in_text = insept_in_text.replace('\n',' ')
        paragraph = document.add_paragraph()
        paragraph.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY
        for word in insept_in_text.split():
            try:
                run = paragraph.add_run(word+' ')
            except Exception as exception:
                continue
            if any([(keyword in word) for keyword in keywords]):
                run.bold = True

    # 4.4: Save the docx file
    document.save(literature_keyword_summary_name+'/'+docx_filename)
pbar.close()
print()
print('Finished Saving the DOCX files containing summaries of keywords')
print('====================================')
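
The loop in the cell above takes a window of ```print_hangover_no``` characters either side of each keyword, and skips a window if it lies entirely inside one that has already been written. The sketch below reproduces only that windowing logic on plain numbers, so it can be tried without creating any docx files. ```snippet_windows``` is a hypothetical name, and the sketch assumes the end of each window is limited to the length of the text.

```python
def snippet_windows(text_length, keyword_positions, hangover):
    """Return the (start, end) character windows that would be written out as snippets."""
    windows = []
    for start_index, end_index in keyword_positions:
        window_start = max(start_index - hangover, 0)
        window_end = min(end_index + hangover, text_length)
        # Skip a window that lies entirely inside one that was already kept
        if any(kept_start <= window_start and window_end <= kept_end for kept_start, kept_end in windows):
            continue
        windows.append((window_start, window_end))
    return windows

print(snippet_windows(text_length=1000, keyword_positions=[(100, 107), (100, 107), (400, 409)], hangover=50))
```

Windows that only partly overlap are both kept, so when keywords sit close together the same stretch of text can appear in two consecutive snippets. Increasing or decreasing ```print_hangover_no``` changes how often this happens.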

Fourth: Convert this ```literature_keyword_summary``` folder into a zip file and download it.

In [None]:
!zip -r literature_keyword_summary.zip literature_keyword_summary > .output.txt
from google.colab import files
files.download('literature_keyword_summary.zip') 

----------------

## 3. Highlighting Keywords in PDFs using Python

First: Physically highlight keywords in PDFs

In [None]:
import os, shutil
import fitz
import tqdm

# First, add the keywords you would like to highlight in the PDFs.
keywords = ['exciton', 'diffusion', 'coefficient']

# Second, create a folder to add highlighted PDFs to.
highlighted_literature_folder = "highlighted_literature"
if os.path.exists(highlighted_literature_folder):
    shutil.rmtree(highlighted_literature_folder)
os.makedirs(highlighted_literature_folder)

# Third, create copies of the PDFs with keywords highlighted.
literature_folder_name = 'literature'
pbar = tqdm.tqdm([pdf_file for pdf_file in sorted(os.listdir(literature_folder_name)) if (os.path.isfile(literature_folder_name+'/'+pdf_file) and pdf_file.endswith('.pdf'))])
error_list = []
for pdf_file in pbar:
    pbar.set_description('Highlighting '+str(pdf_file))
    # 3.1: Read in the pdf file
    try:
        doc = fitz.open(literature_folder_name+'/'+pdf_file)
        for page in doc:
            for keyword in keywords:
                # 3.2: Search for keywords on each page of the pdf
                text_instances = page.search_for(keyword)
                # 3.3: Highlight each keyword on each page of the pdf
                for inst in text_instances:
                    highlight = page.add_highlight_annot(inst)
                    highlight.update()
        # 3.4: Save the highlighted PDF into the highlighted_literature folder, then close it.
        doc.save(highlighted_literature_folder+'/'+pdf_file, garbage=4, deflate=True, clean=True)
        doc.close()
    except Exception as exception:
        error_list.append((pdf_file, exception))
pbar.close()
print()

# Fourth, report any errors with highlighting files.
if len(error_list) > 0:
    print('=========================================')
    print('The following pdf files had issues:')
    print()
    for pdf_file, exception in error_list:
        print(str(pdf_file)+':\t'+str(exception))
    print('=========================================')
else:
    print('This script ran with no issues')

Second: Convert this ```highlighted_literature``` folder into a zip file and download it.

In [None]:
!zip -r highlighted_literature.zip highlighted_literature > .output.txt
from google.colab import files
files.download('highlighted_literature.zip') 