This notebook merges the CGSpace data (metadata and attachments) with the PRMS results. The output file contains every PRMS result together with the text of the evidence attached to each result.

Most evidence links point to CGSpace items, and some point to SharePoint articles. Only evidence from these two sources (CGSpace and SharePoint) is considered.

In [1]:
import os
import pandas as pd

### Reading the PRMS Results

In [2]:
prms_results_dir = "prms_results"
prms_results_file = os.path.join(prms_results_dir, "Result_AI.xlsx")
prms_comp_results = pd.read_excel(prms_results_file, sheet_name='fact_results', engine="openpyxl")
prms_evidences = pd.read_excel(prms_results_file, sheet_name='result_main_evidences', engine="openpyxl")

In [3]:
len(prms_comp_results), len(prms_evidences)

(19425, 28743)

Keeping only the `main` evidence rows that have a valid link

In [4]:
prms_evidences = prms_evidences[(prms_evidences['evidence_type']=='main') & (prms_evidences['valid_link'].notna())].reset_index(drop=True)
prms_evidences

Unnamed: 0,id,description,gender_related,link,youth_related,is_supplementary,result_id,knowledge_product_related,Index,is_valid_link,evidence_type,valid_link
0,1,,,https://hdl.handle.net/10568/125437,,0.0,4,,1,1,main,https://hdl.handle.net/10568/125437
1,29654,,,https://hdl.handle.net/10568/151784,,0.0,13282,,1,1,main,https://hdl.handle.net/10568/151784
2,22477,,,https://hdl.handle.net/10568/135090,,0.0,13337,,1,1,main,https://hdl.handle.net/10568/135090
3,22475,,,https://hdl.handle.net/10568/135209,,0.0,13335,,1,1,main,https://hdl.handle.net/10568/135209
4,22398,,,https://hdl.handle.net/10568/135271,,0.0,13314,,1,1,main,https://hdl.handle.net/10568/135271
...,...,...,...,...,...,...,...,...,...,...,...,...
14641,4445,Related 2022 Journal Article Publication - Mod...,0.0,https://hdl.handle.net/10568/127913,0.0,0.0,874,3294.0,1,1,main,https://hdl.handle.net/10568/127913
14642,10795,The whole document (2-page brief) is about the...,0.0,https://hdl.handle.net/10568/137383,0.0,0.0,7724,7695.0,1,1,main,https://hdl.handle.net/10568/137383
14643,17107,This thesis focuses on understanding everyday ...,0.0,https://hdl.handle.net/10568/131816,0.0,0.0,6982,7033.0,1,1,main,https://hdl.handle.net/10568/131816
14644,11946,The aim of the video is to promote healthy eat...,0.0,https://hdl.handle.net/10568/131464,0.0,0.0,7180,6643.0,1,1,main,https://hdl.handle.net/10568/131464
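
The links shown above are handle URLs under the `10568` prefix. As a minimal sketch of how evidence links could be labelled by source before merging, the hypothetical helper `classify_link` below inspects the URL host. The `sharepoint.com` suffix and the label names are assumptions for illustration and are not taken from the PRMS data.

```python
from urllib.parse import urlparse

import pandas as pd


def classify_link(url):
    """Return a coarse source label for an evidence link (hypothetical helper)."""
    if pd.isna(url):
        return "missing"
    parsed = urlparse(str(url).strip())
    host = parsed.netloc.lower()
    # Handle URLs with the 10568 prefix, as seen in the evidence table.
    if host == "hdl.handle.net" and parsed.path.startswith("/10568/"):
        return "cgspace"
    # Assumption: SharePoint links are served from a *.sharepoint.com host.
    if host.endswith("sharepoint.com"):
        return "sharepoint"
    return "other"


# Toy links, not real PRMS evidence.
links = pd.Series([
    "https://hdl.handle.net/10568/125437",
    "https://example.sharepoint.com/sites/demo/doc.pdf",
    None,
    "https://example.org/report",
])
labels = links.apply(classify_link)
print(labels.value_counts())
```

Applied to `prms_evidences['valid_link']`, a count of these labels would show how many rows can be expected to match a CGSpace handle in the later merge.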


Mapping the complete results to their respective evidence. One result ID can have multiple evidence rows.

A left join is applied so that every PRMS result is kept. Results with no `main` evidence and valid link have empty evidence columns, and results with several evidence rows appear once per evidence row.

In [5]:
prms_results = pd.merge(prms_comp_results, prms_evidences, how='left', on='result_id', suffixes=('_comp', '_evidences')).reset_index(drop=True)
prms_results

Unnamed: 0,description_comp,is_active,gender_tag_level_id,version_id,status,title,legacy_id,krs_url,is_krs,climate_change_score,...,description_evidences,gender_related,link,youth_related,is_supplementary,knowledge_product_related,Index,is_valid_link,evidence_type,valid_link
0,,1,,5,0,Encourage inter-ministrial discussions for cli...,,,,,...,,,,,,,,,,
1,,1,,5,0,Cambodia rice contract farming based on climat...,,,,,...,,,,,,,,,,
2,,1,,5,0,Program to (Advocacy) support GDA to implement...,,,,,...,,,,,,,,,,
3,,1,,5,0,Gender and youth sensitive capacity building p...,,,,,...,,,,,,,,,,
4,,1,,5,0,Extend trainer training programs across provin...,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21376,"Global food, fuel, and fertilizer prices have ...",1,1.0,1,1,Policy responses to the impacts of the Ukraine...,,https://www.cgiar.org/initiative-result/the-co...,1.0,1.0,...,https://hdl.handle.net/10568/127824,,https://hdl.handle.net/10568/127611,,0.0,2222.0,2.0,1.0,main,https://hdl.handle.net/10568/127611
21377,"Global food, fuel, and fertilizer prices have ...",1,1.0,1,1,Policy responses to the impacts of the Ukraine...,,https://www.cgiar.org/initiative-result/the-co...,1.0,1.0,...,https://hdl.handle.net/10568/125312_x000D_\nht...,,https://hdl.handle.net/10568/125314,,0.0,500.0,1.0,1.0,main,https://hdl.handle.net/10568/125314
21378,The strengthening and scaling-up of Local Tech...,1,1.0,1,1,Implementation of climate services at national...,,https://www.cgiar.org/initiative-result/implem...,1.0,3.0,...,,,https://hdl.handle.net/10568/126467,,0.0,,2.0,1.0,main,https://hdl.handle.net/10568/126467
21379,The strengthening and scaling-up of Local Tech...,1,1.0,1,1,Implementation of climate services at national...,,https://www.cgiar.org/initiative-result/implem...,1.0,3.0,...,"Entire document, see p. 14",0.0,https://hdl.handle.net/10568/126473,1.0,0.0,1612.0,1.0,1.0,main,https://hdl.handle.net/10568/126473


In [6]:
prms_results['result_id'].nunique(), len(prms_results)

(19425, 21381)
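
The counts above show more rows than unique result IDs, which is what a one-to-many left join produces. The sketch below reproduces that behaviour on toy data (the frames and link values are made up) and uses the `indicator=True` option of `pd.merge` to list results that have no evidence. An inner join is included only for comparison.

```python
import pandas as pd

# Toy stand-ins for prms_comp_results and prms_evidences.
results = pd.DataFrame({"result_id": [1, 2, 3], "title": ["A", "B", "C"]})
evidences = pd.DataFrame({
    "result_id": [1, 1, 3],
    "valid_link": ["link_a", "link_b", "link_c"],
})

# Left join: every result is kept; result 1 appears twice, result 2 has no link.
merged = pd.merge(results, evidences, how="left", on="result_id", indicator=True)
print(merged["result_id"].nunique(), len(merged))

# Rows marked 'left_only' are results without any matching evidence.
without_evidence = merged.loc[merged["_merge"] == "left_only", "result_id"].tolist()
print(without_evidence)

# Inner join: only results that have at least one evidence row survive.
inner = pd.merge(results, evidences, how="inner", on="result_id")
print(inner["result_id"].nunique(), len(inner))
```

If only results with evidence are wanted downstream, filtering on `valid_link.notna()` after the left join, or switching to `how='inner'`, would both achieve that.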

### Reading the CGSpace Results

In [7]:
cgspace_data_dir = "cgspace_data"
cgspace_data_list = os.listdir(cgspace_data_dir)
# Read every .xlsx export, then concatenate once instead of growing a DataFrame in the loop
cgspace_frames = [
    pd.read_excel(os.path.join(cgspace_data_dir, file), engine="openpyxl")
    for file in cgspace_data_list
    if file.endswith(".xlsx")
]
cgspace_comp_data = pd.concat(cgspace_frames, ignore_index=True) if cgspace_frames else pd.DataFrame()

Applying a left join of `prms_results` with `cgspace_comp_data` on the item handle. A left join is used because not every PRMS link is a CGSpace handle; some are SharePoint links. Rows whose link is a SharePoint or other link that is not present in CGSpace will have empty CGSpace columns.

In [8]:
prms_final_data = pd.merge(prms_results, cgspace_comp_data, how='left', left_on='valid_link', right_on='dc.identifier.uri', suffixes=('_prms', '_cgspace'))

In [9]:
prms_final_data['result_id'].nunique(), len(prms_final_data)

(19425, 31295)
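
The row count grows from 21381 to 31295 even though this is a left join, which can only happen when some left rows match more than one right row. In pandas, two common causes are duplicate values in the right-hand key and null keys on both sides (pandas matches null keys to each other). The sketch below shows both effects on toy data and one possible clean-up. Which cause applies to the CGSpace exports is not established here, and dropping duplicates with `keep="first"` is an assumption that may discard legitimate records.

```python
import pandas as pd

# Toy stand-ins: one left row has no link, the right side has a repeated URI and null URIs.
left = pd.DataFrame({"result_id": [1, 2, 3], "valid_link": ["u1", "u2", None]})
right = pd.DataFrame({
    "dc.identifier.uri": ["u1", "u1", None, None],
    "dc.title": ["t1", "t1 (duplicate)", "x", "y"],
})

merged = left.merge(right, how="left", left_on="valid_link", right_on="dc.identifier.uri")
print(len(left), len(merged))  # the merged frame has more rows than the left frame

# Diagnostics on the right-hand key.
uri = right["dc.identifier.uri"]
print("duplicate URIs:", uri.dropna().duplicated().sum())
print("null URIs:", uri.isna().sum())

# One possible clean-up: drop null keys and keep a single row per URI.
right_clean = right.dropna(subset=["dc.identifier.uri"]).drop_duplicates(
    subset="dc.identifier.uri", keep="first"
)

# validate="many_to_one" raises MergeError if the right key is still not unique.
merged_clean = left.merge(
    right_clean,
    how="left",
    left_on="valid_link",
    right_on="dc.identifier.uri",
    validate="many_to_one",
)
print(len(merged_clean))
```

Running the same diagnostics on `cgspace_comp_data['dc.identifier.uri']` would indicate whether the extra rows come from repeated handles, null handles, or both.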

Storing the combined data of PRMS and CGSpace

In [16]:
merged_file_dir = "prms_cgspace_merged"
merged_file_name = "prms_cgspace_merged_data.xlsx"
os.makedirs(merged_file_dir, exist_ok=True)
merged_file_path = os.path.join(merged_file_dir, merged_file_name)
prms_final_data.to_excel(merged_file_path, index=False, engine="openpyxl")

Checking a few distributions in the combined data as a sanity check

In [17]:
prms_final_data['Type'].value_counts()

Type
Knowledge product                   16473
Innovation development               5490
Capacity sharing for development     3240
Other output                         2843
Innovation use                        938
Policy change                         825
Other outcome                         757
Complementary innovation              557
Innovation Package                    134
Capacity change                        35
Impact contribution                     3
Name: count, dtype: int64

In [18]:
prms_final_data['phase_year'].value_counts()

phase_year
2024    15516
2023    10512
2022     5267
Name: count, dtype: int64

In [19]:
prms_final_data['dcterms.issued'].value_counts()

dcterms.issued
2024-12       1166
2023           988
2024           755
2024-12-30     740
2023-12        547
              ... 
2022-01-13       1
2023-04-22       1
2023-03-11       1
2024-12-07       1
2022-08-05       1
Name: count, Length: 1053, dtype: int64

In [20]:
prms_final_data['dcterms.issued_year'] = prms_final_data['dcterms.issued'].apply(lambda x: str(x).split("-")[0] if pd.notna(x) else x).astype('Int64')
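
The `dcterms.issued` values above mix formats such as `2024-12`, `2023` and `2024-12-30`. As an alternative sketch, the year can be extracted with a regular expression and `pd.to_numeric`, so that a value without a leading four-digit year becomes missing instead of raising an error. The toy series below, including the `n.d.` entry, is made up for illustration.

```python
import pandas as pd

# Toy values covering the formats seen in dcterms.issued, plus a missing and a malformed entry.
issued = pd.Series(["2024-12", 2023, "2024-12-30", None, "n.d."])

year = (
    issued.astype(str)                        # ints become text; missing values never match the pattern
    .str.extract(r"^(\d{4})", expand=False)   # leading four digits, or missing if absent
)
# Unparseable entries become missing values rather than raising an error.
year = pd.to_numeric(year, errors="coerce").astype("Int64")
print(year)
```

The result is a nullable `Int64` series, so it can be compared with `phase_year` in the same way as the column created above.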

In [21]:
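# Rows where the CGSpace issue year differs from the PRMS phase year and predates 2020
year_mismatch = (
    (prms_final_data['phase_year'] != prms_final_data['dcterms.issued_year'])
    & prms_final_data['dcterms.issued_year'].notna()
    & (prms_final_data['dcterms.issued_year'] < 2020)
)
mismatch_cols = ['result_id', 'phase_year', 'dcterms.issued_year', 'valid_link', 'dc.identifier.uri']
prms_final_data[year_mismatch][mismatch_cols].drop_duplicates().iloc[:20, :]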

Unnamed: 0,result_id,phase_year,dcterms.issued_year,valid_link,dc.identifier.uri
5777,5209,2023,2019,https://hdl.handle.net/10568/105482,https://hdl.handle.net/10568/105482
6667,13201,2024,2015,https://hdl.handle.net/10568/71211,https://hdl.handle.net/10568/71211
6685,12814,2024,2017,https://hdl.handle.net/10568/82813,https://hdl.handle.net/10568/82813
6879,12816,2024,2017,https://hdl.handle.net/10568/82984,https://hdl.handle.net/10568/82984
6880,12816,2024,2017,https://hdl.handle.net/10568/89800,https://hdl.handle.net/10568/89800
6895,12825,2024,2018,https://hdl.handle.net/10568/103725,https://hdl.handle.net/10568/103725
6937,4853,2023,2017,https://hdl.handle.net/10568/82813,https://hdl.handle.net/10568/82813
6944,4852,2023,2019,https://hdl.handle.net/10568/79426,https://hdl.handle.net/10568/79426
6973,4855,2023,2017,https://hdl.handle.net/10568/82984,https://hdl.handle.net/10568/82984
6974,4855,2023,2017,https://hdl.handle.net/10568/89800,https://hdl.handle.net/10568/89800
