![Logo](../visualisations/logos/banner_colibri.png)

---
# 🏝️ Welcome to **colibri**'s playground!

Here you can test every feature of the package. Adapt the parameters, then run the cells for the function you want to test and inspect the result. Before starting, make sure the conda environment `colibri` is activated and the matching Jupyter kernel is selected. Each section gives a short description of the functionality; more technical details are available in the docstring of each function.

---
<br/>
<br/>

In [None]:
# Import colibri

import sys

# Add the parent directory to the import path so that the `src` package can be found
sys.path.append("..")
import src


<br/>

**0. Set up your umbrella review** <br/>
Choose the scientific field you want to synthesise by setting up a search query, then select the platforms from which you want to retrieve publications.<br/>
*Nota bene:*
- *only the scientific field studying Soil Organic Carbon is currently available*
- *only the Web of Science platform is currently available*

In [None]:
search_query = "ts = (('meta*analysis' or 'systematic review') and ('soil organic carbon' or 'SOC' or 'soil organic matter' or 'SOM' or 'soil carbon'))"
platforms = ["WoS"]
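The query above follows a regular pattern: two groups of terms, each joined with `or`, combined with `and` inside a `ts = (...)` topic search. As a minimal sketch, a small helper can assemble such a string from two lists of terms. The function name `build_topic_query` and the variable `rebuilt_query` are hypothetical and not part of colibri.

```python
def build_topic_query(method_terms, subject_terms):
    """Assemble a topic-search string from two lists of terms.

    Hypothetical helper for illustration only; it is not part of colibri.
    """

    def or_group(terms):
        # Quote every term and join the group with `or`
        return "(" + " or ".join(f"'{term}'" for term in terms) + ")"

    return f"ts = ({or_group(method_terms)} and {or_group(subject_terms)})"


rebuilt_query = build_topic_query(
    ["meta*analysis", "systematic review"],
    ["soil organic carbon", "SOC", "soil organic matter", "SOM", "soil carbon"],
)
print(rebuilt_query)
```

With these two lists, the helper reproduces the `search_query` string defined above, which makes it easier to add or remove a term without breaking the parentheses.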


<br/>

**1. Run the entire pipeline** <br/>
From scraping publications to characterising their content. More details are available in the README.md file.

In [None]:
src.wrapper.run_pipeline(search_query, platforms)

<br/>

**2. Scrape, merge and clean publications** <br/>
Scrape publications from the selected platforms using the search query, then merge and clean the results. The DOI, title, abstract and keywords (when available) of each publication are stored in Pickle files in the directory `colibri/data/<yyyy>-<mm>-<dd>_<hh>-<mm>-<ss>`, named after the timestamp of your scraping run.

In [None]:
src.scrapper.merger_cleaner(src.scrapper.scrape(search_query, platforms))
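To make the directory naming concrete, here is a minimal sketch of how a timestamped path in the `<yyyy>-<mm>-<dd>_<hh>-<mm>-<ss>` format can be built with the standard library. The helper `run_directory` is hypothetical; how colibri builds the path internally is not shown in this notebook.

```python
from datetime import datetime
from pathlib import Path


def run_directory(base_dir, when=None):
    """Return `base_dir/<yyyy>-<mm>-<dd>_<hh>-<mm>-<ss>` for the given time.

    Hypothetical helper for illustration only; it is not part of colibri.
    """
    if when is None:
        when = datetime.now()
    return Path(base_dir) / when.strftime("%Y-%m-%d_%H-%M-%S")


# A fixed date is used here so that the result is reproducible
example_dir = run_directory("colibri/data", datetime(2024, 1, 31, 9, 5, 7))
print(example_dir.as_posix())  # colibri/data/2024-01-31_09-05-07
```

Because every field is zero-padded, sorting the directory names alphabetically also sorts the runs chronologically.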

<br/>

**3. Plot scraping statistics** <br/>
Update the visualisation of the number of papers scraped over time for a given search query and set of platforms. The graph is stored in the `colibri/visualisations/scrapping_over_time.png` file.

In [None]:
src.scrapper.scrapping_over_time()

<br/>

**4. Fine-tune the classification model** <br/>
Fine-tune the DistilBERT model and save the weights in the `colibri/data/distilbert_runs/<yyyy>-<mm>-<dd>_<hh>-<mm>-<ss>/fine_tuned_model.pt` file.

In [None]:
config = {
    "epochs": 50,
    "batch_size": 32,
    "learning_rate": 1e-3,
    "dropout": 0.3,
    "padding_length": 100,
    "testset_size": 0.2,
    # Absolute path from the author's machine; adapt it to your own setup
    "distilbert_trainset_path": "/home/er/Documents/Cirad/colibri/data/distilbert_trainset/trainset.pkl",
}

src.filter.train_distilbert(config)
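The role of `padding_length` is not detailed here. A common convention, assumed in the sketch below, is that every tokenised text is truncated or padded to that fixed number of token ids, with an attention mask marking the real tokens. The helper `pad_or_truncate` and the padding id `0` are assumptions for illustration and may differ from what `train_distilbert` does internally.

```python
def pad_or_truncate(token_ids, padding_length, pad_id=0):
    """Return (ids, attention_mask), both of length `padding_length`.

    Hypothetical helper for illustration only; `pad_id=0` is an assumption.
    """
    clipped = list(token_ids[:padding_length])
    n_missing = padding_length - len(clipped)
    ids = clipped + [pad_id] * n_missing
    # 1 marks a real token, 0 marks padding
    attention_mask = [1] * len(clipped) + [0] * n_missing
    return ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 2023, 102], padding_length=5)
print(short_ids)   # [101, 2023, 102, 0, 0]
print(short_mask)  # [1, 1, 1, 0, 0]
```

Under this assumption, a larger `padding_length` keeps more of each abstract at the cost of memory and training time per batch.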

<br/>

**5. Download publication PDFs from DOIs** <br/>
From a list of DOIs, download the PDFs of the corresponding publications. The files are stored in the `colibri/data/pub_pdf` directory and mapped in the `colibri/data/pub_pdf/pdf_mapping.pkl` file.

In [None]:
doi_list = [
    "10.1016/j.earscirev.2022.104214",
    "10.1016/j.jaridenv.2017.02.001",
    "10.1016/j.soilbio.2018.06.014",
    "10.1186/s13750-021-00221-3",
    "10.1016/j.catena.2021.105227",
    "10.1111/ejss.12492",
    "10.1111/gcb.15489",
    "10.1016/j.catena.2023.107409",
    "10.5194/soil-7-785-2021",
    "10.1111/nph.18458",
]

src.characteriser.get_pdf(doi_list)
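Before launching a download, it can be useful to check that every entry looks like a DOI. The sketch below uses a simple pattern (the `10.` prefix, a 4 to 9 digit registrant code, a slash, then a suffix). The helper `split_valid_dois` is hypothetical and not part of colibri, and the pattern is a pragmatic approximation rather than a full DOI grammar.

```python
import re

# Approximate DOI shape: "10." + 4 to 9 digits + "/" + non-empty suffix
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def split_valid_dois(dois):
    """Split DOIs into (valid, invalid) lists after stripping whitespace.

    Hypothetical helper for illustration only; it is not part of colibri.
    """
    valid, invalid = [], []
    for doi in dois:
        cleaned = doi.strip()
        if DOI_PATTERN.match(cleaned):
            valid.append(cleaned)
        else:
            invalid.append(cleaned)
    return valid, invalid


valid_dois, invalid_dois = split_valid_dois(
    ["10.1111/gcb.15489", " 10.5194/soil-7-785-2021 ", "not-a-doi"]
)
print(valid_dois)    # ['10.1111/gcb.15489', '10.5194/soil-7-785-2021']
print(invalid_dois)  # ['not-a-doi']
```

Only the entries in `valid_dois` would then be passed to `src.characteriser.get_pdf`.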