# Part 2: Data processing

This part covers text processing of the files downloaded in Part 1, which are stored in your newly created folder **dataset**. The workflow is fully reproducible; where a step could not be automated, the decisions taken are documented so that it can still be reproduced.

Starting point: the newly created folder **dataset** (the result of running Notebook1_preprocessing).

The folder **dataset** should now contain 7 automatically downloaded pdfs and 1 manually downloaded pdf with an updated title (as per Notebook1_preprocessing).  

These files should be present within the folder **dataset**:
1. 1983__book_chapter__Ng_N_Q_M_Jacquot_A_Abifarin_K_Goli_A_Ghesquiere_and_K_Miezan_Rice_germplasm_collection_and_conservation_in_west_Africa_P.pdf
2. 1984__journal_article__Hargrove_Thomas_R_Trip_Report_Africa_and_India_Trip_report_Africa_and_India_T_R_Hargrove_February_7_to_march_3_1984_14_p.pdf
3. 1985__book_chapter__Alam_M_S_K_Alluri_T_M_Masajo_Kaung_Zan_and_V_T_John_Upland_rice_improvement_in_humid_and_subhumid_tropics_of_West_Africa.pdf
4. 1985__journal_article__Kaung_Zan_V_T_John_and_M_S_Alam_Rice_production_in_Africa_an_overview_In_Rice_Improvement_in_Eastern_Central_and_Souther.pdf
5. 1985__seminar__Alam_M_S_V_T_John_and_Kaung_Zan_Insect_pests_and_diseases_of_rice_in_Africa_In_Rice_Improvement_in_Eastern_Central_and_S.pdf
6. 1986__Roger_PA__Recent_studies_on_free_living_blue_green_algae_and_azolla_at_the_International_Rice_Research_Institute.pdf
7. 1986__seminar__Swaminathan_M_S_Can_Africa_feed_itself_an_application_of_lessons_learned_in_Asia_to_the_challenge_facing_Africa_Presente.pdf
8. 1986__seminar__Swaminathan_M_S_Sustainable_nutrition_security_in_Africa_lessons_from_Asia_Presented_at_the_Twelfth_Ministerial_Session_.pdf


## Notebook setup

Before working with this notebook, please create the correct environment as specified in **part2_requirements.yml**. If you need to register a kernel for Jupyter Notebook, use the following command:

`python -Xfrozen_modules=off -m ipykernel install --user --name=part2_env --display-name="part2_env"`



## Loading modules and functions

In [None]:
import pandas as pd
import re
import os
from pathlib import Path
from os import listdir
from sklearn.feature_extraction.text import CountVectorizer

def read_text(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()


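def preprocess_and_export_text(filename, filepath):
    # Read the raw OCR output
    text = read_text(os.path.join(filepath, filename))

    # Collapse any whitespace run that contains a line break into one newline,
    # then collapse the remaining runs of spaces/tabs into a single space
    text = re.sub(r'\s*\n\s*', '\n', text)
    text = re.sub(r'[^\S\n]+', ' ', text).strip()

    # Output name: original stem (at most 150 characters) + '_norm.txt'
    filename = filename.removesuffix('.txt')
    out = str(filename)[0:150] + '_norm.txt'

    # Write into a sibling folder, e.g. doctr -> doctr_norm
    new_filepath = Path(filepath + "_norm")
    os.makedirs(new_filepath, exist_ok=True)
    text_file = new_filepath / out

    with open(text_file, 'w', encoding='utf-8') as file:
        file.write(text)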


def wfa(file,filepath,tool):
    
    text = read_text(os.path.join(filepath, file))

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform([text])
    word_freq = dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0]))
    df = pd.DataFrame(list(word_freq.items()), columns=['Word', 'Frequency'])
    df['OCR'] = tool
    df['paper']= file
    return df


def split_and_save(df):
    # Write one Excel file per paper into ./papers (without changing the working directory)
    out_dir = os.path.join(os.getcwd(), "papers")
    os.makedirs(out_dir, exist_ok=True)
    for pap in df['paper'].drop_duplicates():
        chunk = df[df['paper'] == pap]
        out_name = f'{pap.removesuffix(".txt")}_real.xlsx'
        chunk.to_excel(os.path.join(out_dir, out_name), index=False)



## Extracting text

To extract text from the pdfs, we sought to identify the best open-source tools for this task. The following 2 tools have been used:
1. [tesseract](https://pypi.org/project/pytesseract/)
2. [doctr](https://github.com/mindee/doctr)

Due to hardware limitations and for speed, both of these tools have been run on our local high-performance computing (HPC) cluster, which uses the SLURM scheduler. For each of the tools, you can find the following:
1. Submission script (.sh) - information on the requirements needed to run these tasks in a parallel (job array) manner (doctr.sh, tesseract.sh)
2. Python script (.py) - code performing the task of text extraction (doctr.py, tesseract.py)
3. Requirements file (.yml) - information needed for creating virtual environments (doctr_requirements.yml, tesseract_requirements.yml)
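
The job-array idea can be sketched as follows: each array task reads its index from the `SLURM_ARRAY_TASK_ID` environment variable (set by SLURM for array jobs) and processes exactly one pdf from a sorted file list. This is only an illustration under stated assumptions, not the contents of doctr.py or tesseract.py; the helper name `pick_file_for_task` is hypothetical.

```python
import os
from pathlib import Path


def pick_file_for_task(folder, task_id=None):
    """Return the pdf assigned to one array task (hypothetical helper)."""
    # Sort so that every task sees the files in the same order.
    pdfs = sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".pdf")
    if task_id is None:
        # SLURM sets this variable for each task of a job array;
        # fall back to 0 when running outside the scheduler.
        task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    if not 0 <= task_id < len(pdfs):
        raise IndexError(f"task {task_id} has no file (found {len(pdfs)} pdfs)")
    return pdfs[task_id]
```

With 8 pdfs in the folder, a submission using `--array=0-7` would give each task one file under this scheme.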

To reproduce these processes on local hardware, start from the requirements files to build a conda environment, then run the respective Python script for each task. Caution: both requirements files were written for a Linux-based cluster, so check which libraries are unnecessary on a local computer (e.g. CUDA-related packages).

Furthermore, this repository includes the text files produced by both tools, which serve as the starting point for the next steps.  


## Loading data

The data for this notebook is found in folders **doctr** and **tesseract**.


In [None]:
## defining dataset paths

doctr_path = os.path.join(os.getcwd(),"doctr")
tess_path = os.path.join(os.getcwd(),"tesseract")

input_txts_doctr = listdir(doctr_path)
input_txts_tess = listdir(tess_path)
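
Later cells reuse file names listed from the doctr folders to read from the tesseract folders, so both folders are assumed to hold identically named .txt files. A quick check of that assumption could look like the sketch below; the helper name `compare_folders` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def compare_folders(folder_a, folder_b):
    """Return the .txt file names that exist in only one of the two folders."""
    names_a = {p.name for p in Path(folder_a).glob("*.txt")}
    names_b = {p.name for p in Path(folder_b).glob("*.txt")}
    return sorted(names_a - names_b), sorted(names_b - names_a)


# Example call with the paths defined above:
# only_doctr, only_tess = compare_folders(doctr_path, tess_path)
```

Two empty lists would indicate that every paper has output from both tools.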


## Text normalization

The first step of text post-processing is text normalization: standardizing whitespace by removing extra newlines and spaces.

In [21]:
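# Normalize the output of each OCR tool, using that tool's own file listing
for file in input_txts_doctr:
    preprocess_and_export_text(file, doctr_path)

for file in input_txts_tess:
    preprocess_and_export_text(file, tess_path)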
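
To make the effect of whitespace normalization concrete, here is a small standalone sketch on a made-up string. It shows one possible reading of the step (blank lines and repeated spaces collapsed, single line breaks kept); the function name `normalize_whitespace` and the sample text are illustrative assumptions, and the sketch is independent of the helper defined earlier.

```python
import re


def normalize_whitespace(text):
    """Collapse blank lines and repeated spaces, keeping single line breaks."""
    # Any whitespace run that contains a newline becomes exactly one newline.
    text = re.sub(r"\s*\n\s*", "\n", text)
    # Remaining runs of spaces or tabs become a single space.
    text = re.sub(r"[^\S\n]+", " ", text)
    return text.strip()


sample = "Rice   germplasm \n\n\n  collection\tin West  Africa\n"
print(repr(normalize_whitespace(sample)))
# -> 'Rice germplasm\ncollection in West Africa'
```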


## Word frequency analysis (WFA)

In this step, we try to identify unusual words that may indicate OCR errors. Furthermore, duplicated words are removed.

In [23]:
norm_doctr = doctr_path + "_norm"
norm_tess = tess_path + "_norm"
input_txts = listdir(norm_doctr)

# Collect one frequency table per paper and OCR tool, then combine them
frames = []
for file in input_txts:
    frames.append(wfa(file, norm_doctr, 'doctr'))
    frames.append(wfa(file, norm_tess, 'tesseract'))

df = pd.concat(frames, ignore_index=True)

## splitting our dataset per pdf
split_and_save(df)
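
As a rough illustration of how word frequencies can surface OCR suspects, the standard-library sketch below mimics the default tokenization of `CountVectorizer` (lowercasing, tokens of two or more word characters) and lists the words that occur only once. Treating a count of 1 as "unusual" is an assumption made for this example, and the helper names `word_frequencies` and `rare_words` are hypothetical.

```python
import re
from collections import Counter

# Same token pattern as CountVectorizer's default: words of 2+ characters.
TOKEN = re.compile(r"\b\w\w+\b")


def word_frequencies(text):
    """Count lowercased tokens in a text."""
    return Counter(TOKEN.findall(text.lower()))


def rare_words(freqs, max_count=1):
    """Words seen at most max_count times (assumed threshold)."""
    return sorted(w for w, c in freqs.items() if c <= max_count)


freqs = word_frequencies("Rice yield in Africa. R1ce yield in Asia.")
print(rare_words(freqs))
# -> ['africa', 'asia', 'r1ce', 'rice']
```

The misread token `r1ce` appears in the list, but so do legitimate words, which is why rare words are only candidates and still need a separate check.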


## Realness checks

The next preprocessing step checks whether each word is contextually appropriate, i.e. whether it is a "real" word. To do that, we map the previously identified words to 2 different corpora and a database:
1. words ([nltk](https://www.nltk.org/)) - containing approx. 236k English words taken from the Unix word list (`/usr/share/dict/words`). Includes British and American spellings, as well as some archaic and rare words.
2. en_core_web_lg ([spacy](https://spacy.io/models/en)) - containing approx 685k vectors of words from Common Crawl and Wikipedia. Includes web-scale data.
3. pubmed database ([pubmed](https://pubmed.ncbi.nlm.nih.gov/)) - containing approx 36M article records with adapted vocabulary for research (e.g. rice research).


Since checking against the PubMed database relies on API calls, this process has also been run on our local HPC cluster. The following files support this task: realness.sh, realness.py, realness_requirements.yml. The resulting files can be found in the folder **realness**. They have been used to estimate the quality of the processing and to determine whether further text cleanup steps are needed.
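
The script realness.py is not reproduced in this notebook. As a hedged illustration only, the sketch below assumes that a realness score is the share of distinct words found in a reference vocabulary. Both that definition and the toy vocabulary (a stand-in for lookups against nltk, spaCy or PubMed) are assumptions, and `realness_score` is a hypothetical name.

```python
def realness_score(words, vocabulary):
    """Fraction of distinct words found in the vocabulary (assumed definition)."""
    distinct = {w.lower() for w in words}
    if not distinct:
        return 0.0
    known = {w.lower() for w in vocabulary}
    return round(len(distinct & known) / len(distinct), 2)


# Toy stand-in for a real lexicon such as the nltk words corpus.
toy_vocabulary = {"rice", "yield", "africa", "asia"}
print(realness_score(["rice", "r1ce", "yield", "Africa"], toy_vocabulary))
# -> 0.75
```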

In [None]:
## reading in files from realness dataset
realness_path = os.path.join(os.getcwd(),"realness")

input_real = listdir(realness_path)

df_real = []
for file in input_real:
    score = pd.read_excel(os.path.join(realness_path, file))
    # The score of each OCR tool is read from the row with index label 1
    the_score_tess = score['realness_tess'][1]
    the_score_doctr = score['realness_doctr'][1]
    df_real.append({'Paper': file, 'Realness_tesseract': the_score_tess, 'Realness_doctr': the_score_doctr})

# One row per paper with the realness score of each OCR tool
df_real_table = pd.DataFrame(df_real)

print(df_real_table)


                                               Paper  Realness_tess  \
0  1983__seminar__Ng_N_Q_M_Jacquot_A_Abifarin_K_G...           0.88   
1  1985__book_chapter__Alam_M_S_V_T_John_and_Kaun...           0.86   
2  1985__book_chapter__Kaung_Zan_V_T_John_and_M_S...           0.92   
3  1985__proceeding__Alam_M_S_K_Alluri_T_M_Masajo...           0.89   
4  1986__seminar__Roger_PA__Recent_studies_on_fre...           0.82   
5  1986__seminar__Swaminathan_M_S_Sustainable_nut...           0.94   

   Realness_doctr  
0            0.89  
1            0.86  
2            0.91  
3            0.89  
4            0.84  
5            0.93  


With this step, we have completed the text processing pipeline that produces the input to the model (Part 3).

## (Optional) Manual editing

Manual editing is discussed directly in the paper.

## (Optional) Spelling correction

Already at this stage, the text quality is high enough to serve as input to the model; however, improving it further is work in progress.