![Logo](../visualisations/logos/banner_colibri.png)

---
# 🏝️ Welcome to **colibri**'s playground!

Here you can test every feature of the package. Adapt the parameters, then run the cells for the function you want to try and inspect the result. Before starting, activate the `colibri` conda environment and select the matching Jupyter kernel. Each section gives a short description of the functionality; more technical information is available in each function's docstring.

---
<br/>
<br/>

In [None]:
# Import colibri

import sys

sys.path.append("..")
import src


<br/>

**0. Set up your umbrella review** <br/>
Choose the scientific field you want to synthesise by setting up a search query, then select the platforms from which to retrieve publications.<br/>
*Nota bene:*
- *only the scientific field studying Soil Organic Carbon is currently available*
- *only the Web of Science platform is currently available*

In [None]:
search_query = "ts = (('meta*analysis' or 'systematic review') and ('soil organic carbon' or 'SOC' or 'soil organic matter' or 'SOM' or 'soil carbon'))"
platforms = ["WoS"]
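
The query above follows a simple pattern: terms inside a group are joined with `or`, and the groups are joined with `and`. As a minimal sketch, the same string can be assembled from term lists, which makes it easier to add or remove synonyms. The helper `build_wos_query` and the variable `example_query` are hypothetical names used for illustration only; they are not part of colibri.

```python
def build_wos_query(term_groups, field="ts"):
    """Join each group of terms with 'or', then join the groups with 'and'."""
    clauses = []
    for terms in term_groups:
        quoted = " or ".join(f"'{term}'" for term in terms)
        clauses.append(f"({quoted})")
    return f"{field} = ({' and '.join(clauses)})"


# Term lists copied from the query used in this notebook
study_types = ["meta*analysis", "systematic review"]
soc_terms = ["soil organic carbon", "SOC", "soil organic matter", "SOM", "soil carbon"]

example_query = build_wos_query([study_types, soc_terms])
print(example_query)
```

With these two lists, `example_query` is identical to the `search_query` string defined in the cell above.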


<br/>

**1. Run the entire pipeline** <br/>
From scraping publications to characterising their content. More details are given in the `README.md` file.

In [None]:
src.wrapper.run_pipeline(search_query, platforms)

<br/>

**2. Scrape, merge and clean publications** <br/>
Scrape publications from the selected platforms with a specific search query. The DOI, title, abstract and keywords (when available) of each publication are stored in pickle files in the directory `colibri/data/<yyyy>-<mm>-<dd>_<hh>-<mm>-<ss>`, named after the timestamp of the scrape.

In [None]:
src.scrapper.merger_cleaner(src.scrapper.scrape(search_query, platforms))
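
In the directory pattern `<yyyy>-<mm>-<dd>_<hh>-<mm>-<ss>`, the first `<mm>` is the month and the second the minutes. The sketch below shows one way such a name could be produced with the standard library. It illustrates the naming scheme only; `timestamped_dir` is a hypothetical helper, and colibri may build the path differently.

```python
from datetime import datetime
from pathlib import Path


def timestamped_dir(base_dir, now=None):
    """Return base_dir/<yyyy>-<mm>-<dd>_<hh>-<mm>-<ss> for the given time."""
    now = now or datetime.now()
    return Path(base_dir) / now.strftime("%Y-%m-%d_%H-%M-%S")


# A fixed date is used here so that the result is reproducible
run_dir = timestamped_dir("data", datetime(2024, 1, 31, 9, 5, 7))
print(run_dir.as_posix())  # data/2024-01-31_09-05-07
```

Because every field is zero-padded and ordered from year to second, sorting these directory names alphabetically also sorts them chronologically.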

<br/>

**3. Plot scraping statistics** <br/>
Update the visualisation of the number of papers scraped over time for a given search query and set of platforms. The graph is stored in the `colibri/visualisations/scrapping_over_time.png` file.

In [None]:
src.scrapper.scrapping_over_time()

<br/>

**4. Fine-tune the classification model** <br/>
Fine-tune the DistilBERT model and save the weights in the `colibri/data/distilbert_runs/<yyyy>-<mm>-<dd>_<hh>-<mm>-<ss>/fine_tuned_model.pt` file.

In [None]:
config = {
    "epochs": 50,
    "batch_size": 32,
    "learning_rate": 1e-3,
    "dropout": 0.3,
    "padding_length": 100,
    "testset_size": 0.2,
    # Absolute path on the author's machine: adapt it to your local copy of the training set
    "distilbert_trainset_path": "/home/er/Documents/Cirad/colibri/data/distilbert_trainset/trainset.pkl",
}

src.filter.train_distilbert(config)
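
Since a fine-tuning run can take a long time, it may be worth checking the hyperparameters before launching it. The sketch below is a hypothetical pre-flight check, not something colibri provides: `check_config` and the accepted ranges are assumptions based only on the keys shown in the cell above.

```python
def check_config(config):
    """Return a list of human-readable problems found in a training config."""
    problems = []
    # Counts and lengths must be strictly positive integers
    for key in ("epochs", "batch_size", "padding_length"):
        value = config.get(key)
        if not isinstance(value, int) or value <= 0:
            problems.append(f"{key} must be a positive integer, got {value!r}")
    # The learning rate must be a strictly positive number
    lr = config.get("learning_rate")
    if not isinstance(lr, (int, float)) or lr <= 0:
        problems.append(f"learning_rate must be positive, got {lr!r}")
    # Dropout is a probability in [0, 1)
    dropout = config.get("dropout")
    if not isinstance(dropout, (int, float)) or not 0 <= dropout < 1:
        problems.append(f"dropout must be in [0, 1), got {dropout!r}")
    # The test set must be a non-empty, strict fraction of the data
    testset_size = config.get("testset_size")
    if not isinstance(testset_size, (int, float)) or not 0 < testset_size < 1:
        problems.append(f"testset_size must be in (0, 1), got {testset_size!r}")
    return problems


# Same values as in the cell above, without the dataset path
example_config = {
    "epochs": 50,
    "batch_size": 32,
    "learning_rate": 1e-3,
    "dropout": 0.3,
    "padding_length": 100,
    "testset_size": 0.2,
}
print(check_config(example_config))
print(check_config(dict(example_config, testset_size=1.5)))
```

An empty list means that no problem was detected; the second call reports the out-of-range `testset_size`.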

<br/>

**5. Download publication PDFs from DOIs** <br/>
From a list of DOIs, download the PDFs of the corresponding publications. The files are stored in the `colibri/data/pub_pdf` directory and mapped in the `colibri/data/pub_pdf/pdf_mapping.pkl` file.

In [None]:
doi_list = [
    "10.1016/j.earscirev.2022.104214",
    "10.1016/j.jaridenv.2017.02.001",
    "10.1016/j.soilbio.2018.06.014",
    "10.1186/s13750-021-00221-3",
    "10.1016/j.catena.2021.105227",
    "10.1111/ejss.12492",
    "10.1111/gcb.15489",
    "10.1016/j.catena.2023.107409",
    "10.5194/soil-7-785-2021",
    "10.1111/nph.18458",
]

src.characteriser.get_pdf(doi_list)
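
DOI lists collected by hand often contain duplicates, resolver prefixes or stray whitespace. As a minimal sketch, the list can be normalised before it is passed on. `clean_dois` is a hypothetical helper, the pattern is a deliberately loose check of the general `10.<registrant>/<suffix>` shape, and whether `get_pdf` already performs such cleaning is not known here.

```python
import re

# Loose shape check: "10.", a 4 to 9 digit registrant code, "/", then a suffix
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
# Common prefixes found in front of a bare DOI
PREFIX_PATTERN = re.compile(r"^(https?://(dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def clean_dois(dois):
    """Strip prefixes and whitespace, drop malformed entries and duplicates."""
    seen = set()
    cleaned = []
    for doi in dois:
        doi = PREFIX_PATTERN.sub("", doi.strip())
        key = doi.lower()  # DOIs are case-insensitive
        if DOI_PATTERN.match(doi) and key not in seen:
            seen.add(key)
            cleaned.append(doi)
    return cleaned


raw_dois = [
    "10.1111/gcb.15489",
    "https://doi.org/10.1111/GCB.15489",  # duplicate of the first entry
    " 10.5194/soil-7-785-2021 ",
    "not-a-doi",
]
print(clean_dois(raw_dois))
# ['10.1111/gcb.15489', '10.5194/soil-7-785-2021']
```

The order of first appearance is preserved, so the cleaned list can be used in place of the original one.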

<br/>

**6. pandas DataFrame to JSON file** <br/>
Convert a DataFrame with specific columns into the final JSON output database (cf. `data/template_output_database`).

In [None]:
import json
import pandas as pd

# Example of data
data = {
    "DOI": ["10.1111/gcbb.12234", "10.1890/10-0660.1", "10.1016/j.still.2020.104575"],
    "Title": [
        "Emission of CO2 from biochar-amended soils and implications for soil organic carbon",
        "Fire effects on temperate forest soil C and N storage",
        "A calculator to quantify cover crop effects on soil health and productivity",
    ],
    "Abstract": ["Abstract 1", "Abstract 2", "Abstract 3"],
    "Keywords": [
        [
            "additive effects",
            "carbon sequestration",
            "decomposition",
            "priming",
            "pyrogenic organic matter",
            "recalcitrance",
        ],
        [
            "carbon sinks",
            "fire",
            "forest management",
            "meta-analysis",
            "soil carbon",
            "soil",
            "nitrogen",
            "temperate forests",
        ],
        ["conservation agriculture", "soil quality", "meta-analysis"],
    ],
    "Authors": [
        ["Sagrilo, E", "Jeffery, S", "Hoffland, E", "Kuyper, TW"],
        ["Nave, LE", "Vance, ED", "Swanston, CW", "Curtis, PS"],
        ["Jian, JS", "Lester, BJ", "Du, X", "Reiter, MS", "Stewart, RD"],
    ],
    "Publication year": [2015, 2011, 2020],
    "Journal": [
        "Global change biology bioenergy",
        "Ecological applications",
        "Soil & tillage research",
    ],
    "Platforms origin": [["WoS"], ["WoS"], ["WoS"]],
    "Population": ["Cropland", "Forest land", "Cropland"],
    "Intervention": ["Management", "Management", "Management"],
    "Sub-intervention": ["Amendments pyrogenic", "Fire", "Cover crops"],
    "Sub-sub-intervention": ["Biochar", "N/A", "Annual crops"],
    "Control": ["No amendments pyrogenic", "No fire", "No cropland"],
    "Outcome": ["Bulk soil", "Bulk soil", "Soil fractions"],
    "Sub-outcome": ["SOC concentration", "SOC stock", "Dissolved organic carbon"],
    "Measure": [
        {"Lower CI": 0.9329, "Mean": 1.0335, "Upper CI": 1.1432},
        {"Lower CI": -58.05, "Mean": -46.15, "Upper CI": -34.05},
        {
            "Lower CI": -82.0462850182704,
            "Mean": -75.3227771010962,
            "Upper CI": -66.2606577344701,
        },
    ],
    "Paired data": [179, 38, 83],
    "PS DOIs": [
        [
            "10.14454/qn00-qx85",
            "10.14454/3w3z-sa82",
            "10.14454/3bpw-w381",
            "10.1590/s0102-69922012000200010",
            "...",
        ],
        [
            "10.1111/j.1468-1293.2012.01029_12.x",
            "10.1016/j.apcatb.2012.06.004",
            "10.1016/j.powtec.2011.12.057",
            "...",
        ],
        "N/A",
    ],
}

df = pd.DataFrame(data)
output_file_path = "path/to/your/file.json"

src.characteriser.df2json(df, output_file_path)

with open(output_file_path, "r") as json_file:
    json_data = json.load(json_file)
print(json.dumps(json_data, indent=4))
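
Because the conversion relies on specific column names, a quick check before the call can catch a misspelt or missing column. The sketch below assumes that the required columns are exactly those of the example data above; this is an assumption, as the authoritative list is the one in `data/template_output_database`. `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names.

```python
# Column names copied from the example data (assumed to be the required set)
EXPECTED_COLUMNS = [
    "DOI", "Title", "Abstract", "Keywords", "Authors", "Publication year",
    "Journal", "Platforms origin", "Population", "Intervention",
    "Sub-intervention", "Sub-sub-intervention", "Control", "Outcome",
    "Sub-outcome", "Measure", "Paired data", "PS DOIs",
]


def missing_columns(columns, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent, in template order."""
    present = set(columns)
    return [name for name in expected if name not in present]


print(missing_columns(["DOI", "Title", "Abstract"]))
```

With a DataFrame, the same check reads `missing_columns(df.columns)`; an empty list means that every expected column is present.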