# Part 1: Data pre-processing

This part covers fetching the data, specifically downloading the PDF files used in our pipeline. The process is reproducible in code wherever possible; where manual steps were unavoidable, the decisions taken are documented here.

Starting point: [IRRI Staff Publications](https://scientific-output.irri.org/) (accessed on 28.08.2023)
- The entire database was downloaded manually using the drop-down menu option "Downloaded as CSV". (worksheet original_data)
- The downloaded data was then manually filtered for the 1980s. (worksheet 1980s)
- Because the export lacks granularity on publication type, each section of the database (e.g. Reports) was selected, filtered for the 1980s (198 under the Year filter) and copied manually. A column Type was then added to record the section each entry came from. (e.g. worksheet reports)
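The decade filter was applied by hand, but the same rule can be expressed in code. The sketch below is only an illustration on toy data: the helper name `filter_1980s` and the toy rows are assumptions, and only the `Year` column name comes from the dataset.

```python
import pandas as pd

# Toy stand-in for the exported CSV; the real export has more columns.
publications = pd.DataFrame({
    "Year": [1979, 1983, 1989, 1990],
    "Article": ["A", "B", "C", "D"],
})


def filter_1980s(df, year_column="Year"):
    """Keep rows whose year starts with '198', mirroring the manual '198' filter."""
    years = df[year_column].astype(str).str.strip()
    return df[years.str.startswith("198")].reset_index(drop=True)


eighties = filter_1980s(publications)
print(eighties["Year"].tolist())
```

Comparing on the string form of the year keeps the rule identical to typing 198 into the Year filter.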


This is our initial dataset, stored in the Excel file **Data.xlsx**. This notebook processes the dataset and extracts the information needed to download all publications of interest.

**Data.xlsx** contains the following worksheets:
1. original_data
2. 1980s
3. journal_articles
4. book_chapter
5. proceedings
6. reports
7. seminar
8. working_papers
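Before running the notebook, it can help to confirm that the workbook really contains these worksheets. The following is a minimal sketch, not part of the original pipeline; `missing_worksheets` and `example_available` are hypothetical names introduced for illustration.

```python
EXPECTED_SHEETS = [
    "original_data", "1980s", "journal_articles", "book_chapter",
    "proceedings", "reports", "seminar", "working_papers",
]


def missing_worksheets(available, expected=EXPECTED_SHEETS):
    """Return the expected worksheet names that are absent from `available`."""
    present = set(available)
    return [name for name in expected if name not in present]


# With the real workbook, `available` could come from
# pd.ExcelFile("Data.xlsx").sheet_names (requires pandas and openpyxl).
example_available = ["original_data", "1980s", "reports"]
print(missing_worksheets(example_available))
```

An empty list means every worksheet that the later cells read is present.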


## Notebook setup

Before running this notebook, create the environment specified in **part1_requirements.yml**. If you need to register a kernel for Jupyter Notebook, use the following command:

`python -Xfrozen_modules=off -m ipykernel install --user --name=part1_env --display-name="part1_env"`
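Once the kernel is selected, a quick check that the third-party modules imported by this notebook can be found may save a failed run. This is a sketch under the assumption that the import names are `pandas`, `regex` and `gdown`, as used in the import cell; the contents of the requirements file are not inspected, and `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Third-party import names used by this notebook (assumed from its import cell).
REQUIRED_MODULES = ["pandas", "regex", "gdown"]


def missing_modules(names):
    """Return the module names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print(missing_modules(REQUIRED_MODULES))
```

An empty list indicates that all three modules can be imported from the active kernel.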



## Loading modules and functions

In [1]:
import pandas as pd
import regex as re
import time
import gdown
import os


def clean_spaces(text):
    #remove all whitespaces as part of processing strings
    
    text = (re.sub(r"\s+", "", text))
    return text


def clean_spaces_startend(text):
    #remove leading and trailing whitespace only
    
    text = (re.sub(r"^\s+", "", text))
    text = (re.sub(r"\s+$", "", text))
    return text

def extract_access(link):
    #extract information about access
    
    if pd.isna(link):
        return 'unknown'
    link = str(link)
    return re.sub(r'.*?(open|closed).*', r'\1', link)


def clean_links(link):
    #cleanup the original link
    
    link=(re.sub(r"\<a href=", "", link))
    link=(re.sub(r"target+.{10,110}", "", link))
    link=(re.sub(r'^\"+', "", link))
    return link


def clean_gdocs(link):
    #cleanup the original link
    
    link=(re.sub(r"https://docs.google.com/a/irri.org/file/d/", "", link))
    link=(re.sub(r"/view+.{10,110}", "", link))
    link=link[:-7]
    return link


def clean_dois(link):
    #cleanup the original link
    
    link=(re.sub(r"\s+", "", link))
    link=(re.sub(r'"', "", link))
    return link


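def process_link(link):
    #classify a link as Google Docs or DOI and return its cleaned snippet
    
    #check for missing values before converting to str: str(NaN) is "nan", which pd.notna treats as present
    if pd.isna(link):
        return link, "none"
    link = str(link)
    if re.search(r'google', link): 
        return clean_gdocs(link), "gdocs"
    return clean_dois(link), "doi"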


def african_papers(title):
    #sorting african papers
    
    title = str(title)
    if re.search(r'africa', title, re.IGNORECASE): 
        return "african"
    else:
        return "other"


def clean_title(title):
    #replace special characters in the title with underscores to build a safe file name
    
    title = re.sub(r"[ :`',;()/\-+\"&.´–?]", "_", title)
    title = re.sub(r"pdf", "", title)
    title = re.sub(r"_+", "_", title)
    return title[:120]

def download_pdf(row):
    #download one open-access Google Drive file into dataset/ as <year>__<type>__<title>.pdf
    
    access = row['Access']
    if pd.notna(access) and access == "open":
        title = clean_title((row['Article']))
        typ = row['Literature_type']
        year = row['Year']
        title_combo = f"{year}__{typ}__{title}"
        out = "dataset/" + title_combo + ".pdf"
        url = row['Link_snippet']
        
        
        if re.search(r'gdocs', row['Source'], re.IGNORECASE):
            try:
                gdown.download(id=url, output=out, quiet=False, fuzzy=True)
            except OSError as err:
                print("OS error:", err)
            except Exception as e:
                print(f"Unexpected error during download: {e}")
        else:
            print("download skipped: source is not gdocs")
        
        time.sleep(2)
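The helpers `clean_links` and `clean_gdocs` reduce a raw `Link` cell to the Google Drive file ID by stripping known prefixes and suffixes. As an illustration of the same goal, the sketch below captures the ID with a single pattern. The sample link is hypothetical (the real export may use a different suffix after the ID), and `extract_drive_id` is not part of the notebook.

```python
import re

# Hypothetical Link cell; only the URL prefix is taken from the notebook.
sample_link = (
    '<a href="https://docs.google.com/a/irri.org/file/d/'
    'EXAMPLE_file-ID_123/view?usp=sharing" target="_blank">open</a>'
)


def extract_drive_id(link):
    """Return the path segment that follows '/file/d/', or None if absent."""
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", link)
    return match.group(1) if match else None


print(extract_drive_id(sample_link))
```

Capturing the ID directly does not depend on the length of the trailing part of the link, which the fixed slice in `clean_gdocs` does.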

## Loading original data

Loading the **Data.xlsx** file.

In [2]:
doc = 'Data.xlsx'

all = pd.read_excel(doc, sheet_name='1980s')
all2 = all.copy()  #independent copy; plain assignment would only alias the same DataFrame

book_chapter = pd.read_excel(doc, sheet_name='book_chapter')
proceedings = pd.read_excel(doc, sheet_name='proceedings')
reports = pd.read_excel(doc, sheet_name='reports')
seminar = pd.read_excel(doc, sheet_name='seminar')
working_papers = pd.read_excel(doc, sheet_name='working_papers')
articles = pd.read_excel(doc, sheet_name='journal_articles')

## Preparing combined dataset

The cell below processes the Data.xlsx worksheets to:
- add Literature_type, marking missing values as no_type
- add access information (open/closed)
- add source information and the link snippet used for downloading (gdocs/doi)
- flag whether the title mentions Africa (african/other)
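The first step matches titles across worksheets after removing all whitespace, so that entries differing only in spacing still line up. The toy example below sketches that idea. The data are invented, the standard `re` module stands in for `regex`, and `how="left"` is an assumption of this sketch that keeps unmatched rows so they can be labelled `no_type`.

```python
import re
import pandas as pd


def clean_spaces(text):
    # same idea as the notebook helper: drop every whitespace character
    return re.sub(r"\s+", "", text)


# Toy data: the same title appears with different spacing in both frames.
all_1980s = pd.DataFrame({"Article": ["Rice in  West Africa", "Azolla studies"]})
typed = pd.DataFrame({
    "Article": ["Rice in West Africa "],
    "Literature_type": ["seminar"],
})

all_1980s["Clean_article"] = all_1980s["Article"].apply(clean_spaces)
typed["Clean_article"] = typed["Article"].apply(clean_spaces)

# how="left" (an assumption here) keeps rows without a match in `typed`.
toy = all_1980s.merge(
    typed[["Clean_article", "Literature_type"]], on="Clean_article", how="left"
)
toy["Literature_type"] = toy["Literature_type"].fillna("no_type")
print(toy[["Article", "Literature_type"]])
```

The first row receives its type despite the spacing difference, and the second row, which has no counterpart, is labelled `no_type`.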


In [3]:
## 1 - updating the literature_type
combined_list = pd.concat([book_chapter,proceedings,reports,seminar,working_papers,articles],ignore_index=True)
combined_list['Clean_article'] = combined_list['Article'].apply(clean_spaces)


all['Literature_type'] = None
all['Clean_article'] = all['Article'].apply(clean_spaces)

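#note: merge defaults to how='inner', so only rows with a matching Clean_article in combined_list are kept
merged = all.merge(combined_list[['Clean_article','Article','Literature_type']], on='Clean_article', suffixes=('', '_new'))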


del merged['Literature_type'], merged['Clean_article'], merged['Article_new']
merged = merged.rename(columns={'Literature_type_new': 'Literature_type'})
merged['Literature_type'] = merged['Literature_type'].fillna('no_type')


## 2 - updating the access & link
merged['Access'] = merged['Link'].apply(extract_access)

merged['Link'] = merged['Link'].fillna('no_link')


## 3 - updating the source
merged['Link2'] = merged['Link'].apply(clean_links)
merged[['Link_snippet', 'Source']] = merged['Link2'].apply(lambda x: pd.Series(process_link(x)))

del merged['Link2']


## 4 - selecting the african papers
merged['African'] = merged['Article'].apply(african_papers)



## Downloading the pdfs

Before downloading the PDFs, we manually checked the selected African papers. Some titles focused on countries outside West Africa (e.g. Zambia) or on other parts of Africa (e.g. Central Africa), were written in another language (e.g. French), or duplicated another entry under a different title ("Can Africa feed itself?"). We therefore manually removed the following titles:
1. "Garrity, D. P. Rice in Eastern and Southern Africa: the role of international testing In: Rice Improvement in Eastern, Central and Southern Africa, p. 29-45. Los Banos, Laguna, IRRI, 1985."
2. "International Rice Research Institute Monitoring tour on rice in Zambia and Tanzania In: Rice Improvement in Eastern, Central and Southern Africa, p. 149-154. Los Banos, Laguna, IRRI, 1985." 
3. "International Rice Research Institute Rice improvement in Eastern, Central, and Southern Africa: proceedings of the international rice workshop at Lusaka, Zambia, april 9-19,1984 Los Banos, Laguna, 1985. 159 p."
4. "Mharapara, I. M. and N. R. Mugabe Rice research in Zimbabwe In: Rice Improvement in Eastern, Central and Southern Africa, p. 113-118. Los Banos, Laguna, IRRI, 1985." 
5. "Shahi, B. B. Potential rice varieties for East Africa In: Rice Improvement in Eastern, Central and Southern Africa, p. 57-61. Los Banos, Laguna, IRRI, 1985." 
6. "Swaminathan, M. S. Evolution de la politique et des strategies de production alimentaires en inde Presented at the Workshop for African Food Policymakers, World Food Council-Government of India, New Delhi, 5 May 1986. 26 p."
7. "Swaminathan, M. S. Evolution of food production policies and strategies in India Presented at the Workshop for African Food Policymakers, World Food Council-Government of India, New Delhi, 5 May 1986. 18 p." 
8. "Swaminathan, M. S. Can Africa feed itself? an application of lessons learned in Asia to the challenge facing Africa Presented at the Twelfth Ministerial Session of the World Food Council, The Hunger Project First Annual Tanco Memorial Lecture, 17 June 1986, Rome. 35 p."

Furthermore, we focused only on open-access data, so one more paper was removed:

1. "Shukla, B. D., and A. U. Khan Parboiling of paddy with heated sand AMA, Agricultural Mechanization in Asia, Africa and Latin America 15, no. 4 (1984): 55-60. "
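Removals like these can also be expressed by matching title fragments instead of row positions, which stays valid if the row order changes. The sketch below runs on toy data built from shortened versions of three titles mentioned in this notebook; `drop_by_title` and `fragments_to_remove` are hypothetical names, not part of the pipeline.

```python
import pandas as pd

# Toy stand-in for the African subset; titles are shortened for illustration.
african = pd.DataFrame({
    "Article": [
        "Rice germplasm collection and conservation in west Africa",
        "Rice research in Zimbabwe In: Rice Improvement in Eastern, Central and Southern Africa",
        "Potential rice varieties for East Africa",
    ],
    "Access": ["open", "open", "open"],
})

# Fragments identifying two of the titles to drop.
fragments_to_remove = [
    "Rice research in Zimbabwe",
    "Potential rice varieties for East Africa",
]


def drop_by_title(df, fragments, column="Article"):
    """Drop rows whose title contains any of the given fragments (literal match)."""
    mask = pd.Series(False, index=df.index)
    for fragment in fragments:
        mask |= df[column].str.contains(fragment, regex=False)
    return df[~mask].reset_index(drop=True)


kept = drop_by_title(african, fragments_to_remove)
print(kept["Article"].tolist())
```

Passing `regex=False` treats each fragment literally, so punctuation in titles such as "?" or "(" needs no escaping.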

In [None]:
## creating new, african-only dataset

african = merged[merged['African']=='african'].reset_index(drop=True)

manual_removal_rows = [4, 5, 6, 8, 9, 12, 13, 14]

african_dataset = african.drop(index=manual_removal_rows).reset_index(drop=True)


## downloading data

os.makedirs("dataset", exist_ok=True)
african_dataset[african_dataset['Access'] == 'open'].apply(download_pdf, axis=1)

    Year                                               Link  \
0   1983  <a href="https://docs.google.com/a/irri.org/fi...   
1   1984  <a href="https://docs.google.com/a/irri.org/fi...   
2   1985  <a href="https://docs.google.com/a/irri.org/fi...   
3   1985  <a href="https://docs.google.com/a/irri.org/fi...   
4   1985  <a href="https://docs.google.com/a/irri.org/fi...   
5   1985  <a href="https://docs.google.com/a/irri.org/fi...   
6   1985  <a href="https://docs.google.com/a/irri.org/fi...   
7   1985  <a href="https://docs.google.com/a/irri.org/fi...   
8   1985  <a href="https://docs.google.com/a/irri.org/fi...   
9   1985  <a href="https://docs.google.com/a/irri.org/fi...   
10  1986  <a href="https://docs.google.com/a/irri.org/fi...   
11  1986  <a href="https://docs.google.com/a/irri.org/fi...   
12  1986  <a href="https://docs.google.com/a/irri.org/fi...   
13  1986  <a href="https://docs.google.com/a/irri.org/fi...   
14  1986  <a href="https://docs.google.com/a/irri.org/f

Downloading...
From: https://drive.google.com/uc?id=0Bw5K_GQLC_0RMXlzYkdkMkViRk0
To: c:\Users\pegerb\ricedb\dataset\1983__seminar__Ng_N_Q_M_Jacquot_A_Abifarin_K_Goli_A_Ghesquiere_and_K_Miezan_Rice_germplasm_collection_and_conservation_in_west_Africa_P.pdf
100%|██████████| 419k/419k [00:00<00:00, 6.85MB/s]
Downloading...
From: https://drive.google.com/uc?id=0Bw5K_GQLC_0RbXRaTWI1d291Z2M
To: c:\Users\pegerb\ricedb\dataset\1985__proceeding__Alam_M_S_K_Alluri_T_M_Masajo_Kaung_Zan_and_V_T_John_Upland_rice_improvement_in_humid_and_subhumid_tropics_of_West_Africa.pdf
100%|██████████| 244k/244k [00:00<00:00, 2.86MB/s]
Downloading...
From: https://drive.google.com/uc?id=0Bw5K_GQLC_0RT0E4TzlzcnVlRDA
To: c:\Users\pegerb\ricedb\dataset\1985__book_chapter__Alam_M_S_V_T_John_and_Kaung_Zan_Insect_pests_and_diseases_of_rice_in_Africa_In_Rice_Improvement_in_Eastern_Central_and_S.pdf
100%|██████████| 1.72M/1.72M [00:01<00:00, 1.71MB/s]
Downloading...
From: https://drive.google.com/uc?id=0Bw5K_GQLC_0RTXJ0

Unexpected error during download: Failed to retrieve file url:

	Cannot retrieve the public link of the file. You may need to change
	the permission to 'Anyone with the link', or have had many accesses.
	Check FAQ in https://github.com/wkentaro/gdown?tab=readme-ov-file#faq.

You may still be able to access the file from the browser:

	https://drive.google.com/uc?id=0B5pnIO3KEL14Tzdfck53UXJOODg

but Gdown can't. Please check connections and permissions.


Downloading...
From: https://drive.google.com/uc?id=0Bw5K_GQLC_0RdWt6eUJrYkRncGM
To: c:\Users\pegerb\ricedb\dataset\1986__seminar__Swaminathan_M_S_Sustainable_nutrition_security_in_Africa_lessons_from_Asia_Presented_at_the_Twelfth_Ministerial_Session_.pdf
100%|██████████| 2.03M/2.03M [00:00<00:00, 4.79MB/s]
Downloading...
From: https://drive.google.com/uc?id=0B07cvVCGaVVIamM5QmJxeEFwSkU
To: c:\Users\pegerb\ricedb\dataset\1986__seminar__Swaminathan_M_S_Can_Africa_feed_itself_an_application_of_lessons_learned_in_Asia_to_the_challenge_facing_Africa_Presente.pdf
100%|██████████| 1.98M/1.98M [00:00<00:00, 2.19MB/s]


0    None
2    None
3    None
4    None
5    None
6    None
7    None
dtype: object

One paper could not be downloaded automatically, but it could be accessed through the link listed in the error message. Once downloaded manually, its file name was adjusted to follow the naming rules used above, from:

"1986 _ Roger,PA _ Recent studies on free-living blue-green algae and azolla at the International Rice Research Institute"

to

"1986__seminar__Roger_PA__Recent_studies_on_free_living_blue_green_algae_and_azolla_at_the_International_Rice_Research_Institute"
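The rename itself was done by hand. If it needs to be repeated, a small helper along the following lines could do it. This is a sketch only: `rename_download` is a hypothetical helper, the file names are shortened, and the demonstration runs in a temporary folder so that no real files are touched.

```python
from pathlib import Path
import tempfile


def rename_download(folder, old_name, new_name):
    """Rename folder/old_name to folder/new_name and return the new path."""
    source = Path(folder) / old_name
    target = Path(folder) / new_name
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    return source.rename(target)


# Demonstration with shortened, hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    old = "1986 _ Roger,PA _ Recent studies.pdf"
    new = "1986__seminar__Roger_PA__Recent_studies.pdf"
    (Path(tmp) / old).write_bytes(b"")
    renamed = rename_download(tmp, old, new)
    renamed_name = renamed.name
    still_has_old = (Path(tmp) / old).exists()

print(renamed_name)
```

Refusing to overwrite an existing target protects files that were already downloaded automatically.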


With this process, we have downloaded the data needed for the text processing step (Part 2).