# Kumu and reference processing

* To show the references linked to adaptations in LCAT, we need to process the data downloaded from the Kumu project and link it with reference metadata scraped from the web via DOI.
* The references are stored in Google Sheets and are available as a `.csv` export. The Kumu data has been downloaded as `.json`.
* The outputs are a `.json` references file that can be added to the database, and a Kumu `.json` file that can be shipped with the client.

Note that web scraping by DOI is fairly brittle, so this might break at any time.
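
Because of this brittleness, it can help to collect failures rather than let one bad DOI stop the whole run. The sketch below is only an illustration of that idea, not the code in `src.process_references`: `lookup_all` and `fake_fetch` are hypothetical names, the DOIs are made-up placeholders, and the real scraper would be passed in as `fetch`.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def lookup_all(
    dois: Iterable[str],
    fetch: Callable[[str], dict],
) -> Tuple[Dict[str, dict], List[Tuple[str, str]]]:
    """Look up each DOI, recording failures instead of raising."""
    results: Dict[str, dict] = {}
    failures: List[Tuple[str, str]] = []
    for doi in dois:
        if not doi:
            # Entries without a DOI (e.g. web pages) cannot be looked up
            failures.append((doi, "missing DOI"))
            continue
        try:
            results[doi] = fetch(doi)
        except Exception as exc:  # scraping can fail in many different ways
            failures.append((doi, f"{type(exc).__name__}: {exc}"))
    return results, failures


def fake_fetch(doi: str) -> dict:
    """Stand-in for a real scraper (hypothetical); fails for one placeholder DOI."""
    if doi == "10.0000/broken":
        raise ValueError("no metadata found")
    return {"doi": doi}


results, failures = lookup_all(["10.0000/example", "", "10.0000/broken"], fake_fetch)
print(len(results), "succeeded;", len(failures), "failed")
```

The failure list can then be reviewed by hand, which is roughly how a list of known bad references could be built up over time.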

## Initialise

In [None]:
import os
import yaml

# The cwd should be the data folder root
os.chdir("..")

In [2]:
config_filepath = "./config.yml"

with open(config_filepath) as f:
    # safe_load is sufficient for a plain key/value config file
    conf = yaml.safe_load(f)
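
The rest of the notebook reads two keys from this config, `kumu_json` and `references_csv`. A quick check along the following lines could catch a missing key early. This is a minimal sketch: `missing_config_keys` is a hypothetical helper, and the path in the example is a placeholder.

```python
REQUIRED_KEYS = ("kumu_json", "references_csv")


def missing_config_keys(conf, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the config."""
    conf = conf or {}  # an empty YAML file loads as None
    return [key for key in required if not conf.get(key)]


# Hypothetical config with one key missing
example_conf = {"kumu_json": "./kumu.json"}
missing = missing_config_keys(example_conf)
if missing:
    print("config.yml is missing:", ", ".join(missing))
```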

## Process Kumu

In [None]:
from src.process_kumu import ProcessKumu

In [4]:
kumu_filepath = conf["kumu_json"]

In [5]:
kumu_processor = ProcessKumu(kumu_filepath)

In [6]:
kumu_processor.filter_data()

In [7]:
output_filepath = "./processed_kumu.json"
kumu_processor.save_json(output_filepath)

## Process references

* Some references are bad. The analysis code for these is rough, and won't be added here for a while.
* For now, a dict of bad references is provided.
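
To illustrate how such a dict might be applied, the sketch below drops records whose ID is listed under their reference type. It assumes the numbers in the dict are `Reference_ID` values (they could equally be row indices), and `drop_bad_references` is a hypothetical helper, not the method used by `ProcessReferences`.

```python
from typing import Dict, List


def drop_bad_references(records: List[dict], bad: Dict[str, List[int]]) -> List[dict]:
    """Keep only records whose ID is not listed as bad for their type."""
    bad_by_type = {ref_type: set(ids) for ref_type, ids in bad.items()}
    return [
        rec
        for rec in records
        if rec["Reference_ID"] not in bad_by_type.get(rec["Reference_Type"], set())
    ]


# Hypothetical records and bad-reference dict, for illustration only
records = [
    {"Reference_ID": 1, "Reference_Type": "Journal Article"},
    {"Reference_ID": 2, "Reference_Type": "Book"},
    {"Reference_ID": 3, "Reference_Type": "Report"},
]
bad = {"Journal Article": [], "Book": [2], "Report": []}

kept = drop_bad_references(records, bad)
print([rec["Reference_ID"] for rec in kept])
```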

In [None]:
from src.process_references import ProcessReferences

In [9]:
references_filepath = conf["references_csv"]

In [10]:
reference_processor = ProcessReferences(references_filepath)

In [11]:
# Known bad references, e.g. ones that cause the DOI lookup to fail
bad_references = {'Journal Article': [235, 252, 301, 385, 442, 465, 469, 504, 507, 583, 654, 675, 684, 695, 730, 772, 788, 847, 882, 902, 904, 905, 914, 919, 941, 960, 989, 1003, 1007, 1008, 1048, 1053, 1060, 1120, 1134, 1192, 1224, 1225, 1226, 1230, 1232, 1234, 1250, 1269, 1346, 1348, 1383, 1387, 1389, 1393, 1417, 1423, 1457, 1555, 1577], 'Book Section': [], 'Book': [212], 'Report': [332, 333, 334, 335, 338, 341, 342, 343, 344, 357, 358, 359, 360, 997, 998, 999, 1000, 1001, 1002, 1004, 1006, 336], 'Web Page': [337, 339, 340, 1005]}

In [12]:
reference_processor.load_bad_references(bad_references)

In [13]:
reference_processor.clean_references()

In [14]:
reference_processor.df.head()

Unnamed: 0,Reference_ID,Reference_Type,DOI,URL,Replacement_URL,Notes,Title for website/report/article,Unnamed: 7
0,1,Journal Article,10.1007/s11027-017-9778-4,https://doi.org/10.1007/s11027-017-9778-4,,,,
1,2,Book Section,10.1016/B978-0-12-849887-3.00004-6,https://www.google.co.uk/books/edition/Adaptin...,,,,
2,3,Journal Article,10.1016/j.cliser.2016.10.004,https://www.sciencedirect.com/science/article/...,,,,
3,4,Journal Article,10.1186/1476-069x-8-40,https://dx.doi.org/10.1186/1476-069x-8-40,,,,
4,5,Journal Article,10.1093/epirev/mxf007,https://dx.doi.org/10.1093/epirev/mxf007,,,,


In [15]:
reference_processor.perform_doi_lookups()

1582 references with DOIs found

DOI lookup: 0 / 1582
Scraping Journal Article: 10.1007/s11027-017-9778-4...
DOI lookup: 1 / 1582
Scraping Book Section: 10.1016/B978-0-12-849887-3.00004-6...
DOI lookup: 2 / 1582
Scraping Journal Article: 10.1016/j.cliser.2016.10.004...
DOI lookup: 3 / 1582
Scraping Journal Article: 10.1186/1476-069x-8-40...
DOI lookup: 4 / 1582
Scraping Journal Article: 10.1093/epirev/mxf007...
DOI lookup: 5 / 1582
Scraping Journal Article: 10.1289/ehp.1003198...
DOI lookup: 6 / 1582
Scraping Journal Article: 10.1016/j.ecss.2015.12.016...
DOI lookup: 7 / 1582
Scraping Journal Article: 10.1038/s41572-018-0005-8...
DOI lookup: 8 / 1582
Scraping Journal Article: 10.1289/ehp.0900683...
DOI lookup: 9 / 1582
Scraping Journal Article: 10.1016/j.wace.2013.07.004...
DOI lookup: 10 / 1582
Scraping Web Page: ...
Neither Journal Article nor Book found: no DOI lookup performed
DOI lookup: 11 / 1582
Scraping Journal Article: 10.1016/j.cliser.2016.07.001...
DOI lookup: 12 / 1582
Scra

In [17]:
reference_processor.process_references()

In [18]:
output_filepath = "./processed_references.json"
reference_processor.save_json(output_filepath)

## Conclusion

Once the Kumu `.json` export and the references `.csv` have been processed, the output files can be stored alongside the other data files and their paths added to `config.yml`.

For the processed files, the keys should be as follows:

* Processed Kumu output file: `processed_kumu_json`
* Processed references output file: `processed_references_json`

As mentioned, we ship the `processed_kumu_json` file with the front end, and use the `processed_references_json` file to create a references table.
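
The config update described above can be sketched as follows. `register_outputs` is a hypothetical helper, and the input paths in the example are placeholders; only the two key names and the output filenames come from this notebook.

```python
def register_outputs(conf, kumu_path, references_path):
    """Return a copy of the config with the processed file paths added."""
    updated = dict(conf or {})
    updated["processed_kumu_json"] = kumu_path
    updated["processed_references_json"] = references_path
    return updated


# Hypothetical existing config
conf = {"kumu_json": "./kumu.json", "references_csv": "./references.csv"}
new_conf = register_outputs(
    conf, "./processed_kumu.json", "./processed_references.json"
)
print(sorted(new_conf))
```

The result could then be written back with `yaml.safe_dump(new_conf, f)`, bearing in mind that any comments in the original `config.yml` would be lost, so editing the file by hand may be preferable.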