This notebook merges the CGSpace data (metadata and attachments) with the PRMS results. The output is a file containing all PRMS results along with the text of the evidence attached to each result.

Most evidence links point to CGSpace items, while some point to SharePoint articles. Only evidence from these two sources (CGSpace and SharePoint) is considered.

In [1]:
import os
import pandas as pd

### Reading the PRMS Results

In [2]:
prms_results_dir = "prms_results"
prms_results_file = os.path.join(prms_results_dir, "Result_AI.xlsx")
prms_comp_results = pd.read_excel(prms_results_file, sheet_name='fact_results', engine="openpyxl")
prms_evidences = pd.read_excel(prms_results_file, sheet_name='result_main_evidences', engine="openpyxl")

In [5]:
len(prms_comp_results), len(prms_evidences)

(19425, 14646)

Keeping only the `main` evidence rows that have a valid link

In [4]:
prms_evidences = prms_evidences[(prms_evidences['evidence_type']=='main') & (prms_evidences['valid_link'].notna())].reset_index(drop=True)
prms_evidences

Unnamed: 0,id,description,gender_related,link,youth_related,is_supplementary,result_id,knowledge_product_related,Index,is_valid_link,evidence_type,valid_link
0,1,,,https://hdl.handle.net/10568/125437,,0.0,4,,1,1,main,https://hdl.handle.net/10568/125437
1,29654,,,https://hdl.handle.net/10568/151784,,0.0,13282,,1,1,main,https://hdl.handle.net/10568/151784
2,22477,,,https://hdl.handle.net/10568/135090,,0.0,13337,,1,1,main,https://hdl.handle.net/10568/135090
3,22475,,,https://hdl.handle.net/10568/135209,,0.0,13335,,1,1,main,https://hdl.handle.net/10568/135209
4,22398,,,https://hdl.handle.net/10568/135271,,0.0,13314,,1,1,main,https://hdl.handle.net/10568/135271
...,...,...,...,...,...,...,...,...,...,...,...,...
14641,4445,Related 2022 Journal Article Publication - Mod...,0.0,https://hdl.handle.net/10568/127913,0.0,0.0,874,3294.0,1,1,main,https://hdl.handle.net/10568/127913
14642,10795,The whole document (2-page brief) is about the...,0.0,https://hdl.handle.net/10568/137383,0.0,0.0,7724,7695.0,1,1,main,https://hdl.handle.net/10568/137383
14643,17107,This thesis focuses on understanding everyday ...,0.0,https://hdl.handle.net/10568/131816,0.0,0.0,6982,7033.0,1,1,main,https://hdl.handle.net/10568/131816
14644,11946,The aim of the video is to promote healthy eat...,0.0,https://hdl.handle.net/10568/131464,0.0,0.0,7180,6643.0,1,1,main,https://hdl.handle.net/10568/131464


Mapping each result to its evidence. One result ID can have multiple evidence rows.

A left join is applied, so every result is kept. Results that have no `main` evidence with a valid link get empty evidence columns, as in the first rows of the output below.

In [11]:
prms_results = pd.merge(prms_comp_results, prms_evidences, how='left', on='result_id').reset_index(drop=True)
prms_results

Unnamed: 0,description_x,is_active,gender_tag_level_id,version_id,status,title,legacy_id,krs_url,is_krs,climate_change_score,...,description_y,gender_related,link,youth_related,is_supplementary,knowledge_product_related,Index,is_valid_link,evidence_type,valid_link
0,,1,,5,0,Encourage inter-ministrial discussions for cli...,,,,,...,,,,,,,,,,
1,,1,,5,0,Cambodia rice contract farming based on climat...,,,,,...,,,,,,,,,,
2,,1,,5,0,Program to (Advocacy) support GDA to implement...,,,,,...,,,,,,,,,,
3,,1,,5,0,Gender and youth sensitive capacity building p...,,,,,...,,,,,,,,,,
4,,1,,5,0,Extend trainer training programs across provin...,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21376,"Global food, fuel, and fertilizer prices have ...",1,1.0,1,1,Policy responses to the impacts of the Ukraine...,,https://www.cgiar.org/initiative-result/the-co...,1.0,1.0,...,https://hdl.handle.net/10568/127824,,https://hdl.handle.net/10568/127611,,0.0,2222.0,2.0,1.0,main,https://hdl.handle.net/10568/127611
21377,"Global food, fuel, and fertilizer prices have ...",1,1.0,1,1,Policy responses to the impacts of the Ukraine...,,https://www.cgiar.org/initiative-result/the-co...,1.0,1.0,...,https://hdl.handle.net/10568/125312_x000D_\nht...,,https://hdl.handle.net/10568/125314,,0.0,500.0,1.0,1.0,main,https://hdl.handle.net/10568/125314
21378,The strengthening and scaling-up of Local Tech...,1,1.0,1,1,Implementation of climate services at national...,,https://www.cgiar.org/initiative-result/implem...,1.0,3.0,...,,,https://hdl.handle.net/10568/126467,,0.0,,2.0,1.0,main,https://hdl.handle.net/10568/126467
21379,The strengthening and scaling-up of Local Tech...,1,1.0,1,1,Implementation of climate services at national...,,https://www.cgiar.org/initiative-result/implem...,1.0,3.0,...,"Entire document, see p. 14",0.0,https://hdl.handle.net/10568/126473,1.0,0.0,1612.0,1.0,1.0,main,https://hdl.handle.net/10568/126473
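
The effect of the join type on `result_id` can be checked on a small example. The sketch below uses made-up stand-ins for `prms_comp_results` and `prms_evidences` (the frame contents and the names `results`, `evidences`, `left`, `inner` are illustrative assumptions, not the real data). It also assumes `result_id` is unique on the results side, which is what `validate="one_to_many"` checks; drop that argument if the assumption does not hold.

```python
import pandas as pd

# Toy stand-ins for prms_comp_results and prms_evidences (values are made up).
results = pd.DataFrame({"result_id": [4, 5, 6], "title": ["A", "B", "C"]})
evidences = pd.DataFrame({
    "result_id": [4, 4, 6],
    "valid_link": ["link-1", "link-2", "link-3"],
})

# Left join: every result is kept. indicator=True adds a `_merge` column
# that records whether a matching evidence row was found.
left = pd.merge(results, evidences, how="left", on="result_id",
                validate="one_to_many", indicator=True)

# Inner join: only results with at least one evidence row survive.
inner = pd.merge(results, evidences, how="inner", on="result_id")

# Results that have no evidence row at all.
without_evidence = left.loc[left["_merge"] == "left_only", "result_id"].tolist()
print(len(left), len(inner), without_evidence)
```

With a left join, result 4 appears twice (two evidence rows) and result 5 is kept with empty evidence columns; with an inner join, result 5 is dropped.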


### Reading the CGSpace Results

In [12]:
cgspace_data_dir = "cgspace_data"
cgspace_data_list = os.listdir(cgspace_data_dir)
# Read every .xlsx export in the directory, then concatenate once at the end
cgspace_frames = [
    pd.read_excel(os.path.join(cgspace_data_dir, file), engine="openpyxl")
    for file in cgspace_data_list
    if file.endswith(".xlsx")
]
# Fall back to an empty frame when the directory holds no .xlsx files
cgspace_comp_data = pd.concat(cgspace_frames, ignore_index=True) if cgspace_frames else pd.DataFrame()

Applying a left join of `prms_results` with `cgspace_comp_data` on the item handles. A left join is used because not all links in PRMS are handle links; some are SharePoint links and must be kept.

In [36]:
prms_final_data = pd.merge(prms_results, cgspace_comp_data, how='left', left_on='valid_link', right_on='dc.identifier.uri')
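
Since only handle links can match `dc.identifier.uri`, it can help to label each link by its source before or after the merge. The following is a minimal sketch, not part of the original pipeline: the helper `link_source` is hypothetical, and the host names it checks (`hdl.handle.net`, `cgspace.cgiar.org`, anything ending in `sharepoint.com`) are assumptions about how the links look. The SharePoint and `example.org` URLs are invented for illustration.

```python
from urllib.parse import urlparse

import pandas as pd


def link_source(url):
    """Label a link as 'cgspace', 'sharepoint', 'other', or 'missing' by host name."""
    if not isinstance(url, str):
        return "missing"  # NaN / None from rows without evidence
    host = urlparse(url).netloc.lower()
    # Assumed hosts for CGSpace items: the handle resolver and the repository itself.
    if host == "hdl.handle.net" or host.endswith("cgspace.cgiar.org"):
        return "cgspace"
    # Assumed pattern for SharePoint links.
    if host.endswith("sharepoint.com"):
        return "sharepoint"
    return "other"


links = pd.Series([
    "https://hdl.handle.net/10568/125437",                # handle link from the data above
    "https://example.sharepoint.com/sites/demo/doc.docx", # hypothetical SharePoint link
    "https://example.org/page",                           # hypothetical unrelated link
    None,
])
sources = links.map(link_source)
print(sources.tolist())
```

On the real data this could be applied as `prms_results['valid_link'].map(link_source)` to count how many rows are expected to stay unmatched in the left join.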

In [37]:
len(prms_final_data)

31295
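
A left join returns more rows than the left frame whenever a key value occurs more than once on the right side. If the merged length is larger than `len(prms_results)`, one likely cause is that some handles appear several times in `cgspace_comp_data` (for example, the same item present in more than one export file). The sketch below reproduces this on made-up data and shows one possible way to deduplicate; whether keeping the first occurrence is appropriate for the real data is an assumption to verify.

```python
import pandas as pd

# Made-up stand-ins for prms_results and cgspace_comp_data.
left = pd.DataFrame({"valid_link": ["h/1", "h/2", "sharepoint/x"]})
right = pd.DataFrame({
    "dc.identifier.uri": ["h/1", "h/1", "h/2"],
    "dc.title": ["Item 1", "Item 1 (repeat)", "Item 2"],
})

# The duplicated handle "h/1" makes its left row appear twice.
merged = pd.merge(left, right, how="left",
                  left_on="valid_link", right_on="dc.identifier.uri")

# List the handles that occur more than once on the right side.
dup_mask = right["dc.identifier.uri"].duplicated(keep=False)
dup_handles = right.loc[dup_mask, "dc.identifier.uri"].unique().tolist()

# One option: keep the first row per handle, then let pandas verify
# that the right-side key is unique with validate="many_to_one".
right_unique = right.drop_duplicates(subset="dc.identifier.uri", keep="first")
merged_unique = pd.merge(left, right_unique, how="left",
                         left_on="valid_link", right_on="dc.identifier.uri",
                         validate="many_to_one")

print(len(left), len(merged), len(merged_unique), dup_handles)
```

After deduplication the merged frame has the same number of rows as the left frame, and the unmatched SharePoint-style link is still kept with empty CGSpace columns.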

In [38]:
prms_final_data

Unnamed: 0,description_x,is_active,gender_tag_level_id,version_id,status,title,legacy_id,krs_url,is_krs,climate_change_score,...,cg.subject.humidtropics,cg.subject.drylands,cg.subject.icarda,dc.description.abstract,dcterms.isVersionOf,dcterms.references,dc.contributor.advisor,dc.contributor.other,dcterms.isFormatOf,dcterms.isReplacedBy
0,,1,,5,0,Encourage inter-ministrial discussions for cli...,,,,,...,,,,,,,,,,
1,,1,,5,0,Cambodia rice contract farming based on climat...,,,,,...,,,,,,,,,,
2,,1,,5,0,Program to (Advocacy) support GDA to implement...,,,,,...,,,,,,,,,,
3,,1,,5,0,Gender and youth sensitive capacity building p...,,,,,...,,,,,,,,,,
4,,1,,5,0,Extend trainer training programs across provin...,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31290,The strengthening and scaling-up of Local Tech...,1,1.0,1,1,Implementation of climate services at national...,,https://www.cgiar.org/initiative-result/implem...,1.0,3.0,...,,,,,,,,,,
31291,The strengthening and scaling-up of Local Tech...,1,1.0,1,1,Implementation of climate services at national...,,https://www.cgiar.org/initiative-result/implem...,1.0,3.0,...,,,,,,,,,,
31292,The strengthening and scaling-up of Local Tech...,1,1.0,1,1,Implementation of climate services at national...,,https://www.cgiar.org/initiative-result/implem...,1.0,3.0,...,,,,,,,,,,
31293,The strengthening and scaling-up of Local Tech...,1,1.0,1,1,Implementation of climate services at national...,,https://www.cgiar.org/initiative-result/implem...,1.0,3.0,...,,,,,,,,,,
