In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from daemon_analysis_tools.file_handling import load_and_process_csv
from daemon_analysis_tools.data.utils import group_questions_by_journal
from daemon_analysis_tools.data.publisher import Publisher
from daemon_analysis_tools.file_handling import (
    save_answers_to_yaml,
    load_answers_from_yaml,
)

Load and process data:
- Group answers by publisher and journal, normalizing names that were written in slightly different ways.
- Store in a DataFrame
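
As a rough illustration of the name normalization step, the sketch below collapses spelling variants of a journal name into a single key. The helper `normalize_name` is hypothetical and is not taken from `daemon_analysis_tools`; the lowercase, underscore-separated form is only inferred from the journal keys printed later in this notebook.

```python
import re


def normalize_name(name: str) -> str:
    """Hypothetical helper: map spelling variants of a name to one key."""
    # Drop surrounding whitespace and ignore capitalization.
    cleaned = name.strip().lower()
    # Collapse any run of whitespace into a single underscore.
    return re.sub(r"\s+", "_", cleaned)


variants = ["Current Nanoscience", "current  nanoscience ", "CURRENT NANOSCIENCE"]
keys = {normalize_name(variant) for variant in variants}
print(keys)  # all three variants map to the same key
```

A real implementation would probably also need to handle abbreviations and typos, which this sketch does not attempt.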

In [3]:
data = load_and_process_csv("../../data/raw/rdp.csv")

Build a `dict` keyed by publisher name, whose values are `dict`s keyed by journal name, whose values are in turn `dict`s of `Question` instances. The `.answer` attribute contains the answers given by the respondents, together with the explanation text that motivates each one.

In [4]:
grouped_questions = group_questions_by_journal(data)
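
To make the three-level layout concrete, here is a small sketch that walks a mapping of the same shape. It is only an approximation: the publisher and journal names are placeholders, and each leaf is a plain list of answers rather than a real `Question` instance. Integer question numbers are assumed as the innermost keys, since that is how the questions are indexed later in this notebook.

```python
# Placeholder data with the shape publisher -> journal -> question number -> answers.
grouped = {
    "Publisher A": {
        "journal_one": {
            3: ["optional", "required"],          # respondents disagree
            5: ["DOI required", "DOI required"],  # respondents agree
        },
        "journal_two": {
            3: ["optional", "optional"],
        },
    },
}

# Collect every (publisher, journal, question number) whose answers differ.
flagged = [
    (publisher, journal, number)
    for publisher, journals in grouped.items()
    for journal, questions in journals.items()
    for number, answers in questions.items()
    if len(set(answers)) > 1
]
print(flagged)
```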

## Resolve discrepancies

The `Question` class has a `.resolve_discrepancy` method, which updates `Question.answers` with the correct answer.

For example, consider the Bentham Science journals. The cell below prints every question on which the respondents disagree.

In [5]:
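# Print every question on which the respondents disagree.
# The loop variable is named `questions` to avoid shadowing the `data` DataFrame.
for journal, questions in grouped_questions["Bentham Science"].items():
    print("#############################################################")
    print(journal)
    for number, question in questions.items():
        if question.has_discrepancies():
            question.print_qa()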

#############################################################
current_nanoscience
3. Data sharing requirements in RDP
  Resp. 0:
    Answer: Data sharing encouraged but optional.
    Explanation: Bentham Science encourages authors to share the source of data and materials in the manuscript, in support of the findings.
  Resp. 1:
    Answer: Public data sharing of all data required.
    Explanation: All datasets on which the paper's conclusions are based must be made accessible to reviewers and readers, according to the journal's rules. Prior to peer review, authors must either deposit their datasets in publicly accessible repositories or provide them as supplementary materials with their submission.
5. Citability and findability of data 
  Resp. 0:
    Answer: DOIs or other persistent identifiers required for datasets or codes.
    Explanation: Whether the data was developed by the author(s) or researcher(s), all publicly available data referenced in the preparation of an article shoul

Discrepancies can be resolved manually by passing the index of the respondent whose answer is correct.

In [6]:
# Question number -> (index of the correct respondent, reason for the choice).
resolutions = {
    3: (0, "Selected answer is better interpretation of RDP."),
    5: (1, "Selected answer is better interpretation of RDP."),
    8: (0, "Selected answer is better interpretation of RDP."),
    10: (0, "Missed text."),
    11: (1, "Same answer but different wording."),
    12: (1, "Same answer but different wording."),
    16: (1, "Same answer but different wording."),
}

# Apply the same resolutions to every Bentham Science journal.
for journal, questions in grouped_questions["Bentham Science"].items():
    print("#############################################################")
    print(journal)
    for number, (correct_answer, reason) in resolutions.items():
        questions[number].resolve_discrepancy(
            correct_answer=correct_answer,
            discrepancy_reason=reason,
        )

#############################################################
current_nanoscience
Selected answer is better interpretation of RDP.
Selected answer is better interpretation of RDP.
Selected answer is better interpretation of RDP.
Missed text.
Same answer but different wording.
Same answer but different wording.
Same answer but different wording.
#############################################################
current_organic_chemistry
Selected answer is better interpretation of RDP.
Selected answer is better interpretation of RDP.
Selected answer is better interpretation of RDP.
Missed text.
Same answer but different wording.
Same answer but different wording.
Same answer but different wording.
#############################################################
current_organic_synthesis
Selected answer is better interpretation of RDP.
Selected answer is better interpretation of RDP.
Selected answer is better interpretation of RDP.
Missed text.
Same answer but different wording.
Same answer but d

In [7]:
# Check that no discrepancy is left without a recorded reason.
for journal, questions in grouped_questions["Bentham Science"].items():
    print("#############################################################")
    print(journal)
    for number, question in questions.items():
        if question.has_discrepancies() and question.discrepancy_reason is None:
            question.print_qa()

#############################################################
current_nanoscience
#############################################################
current_organic_chemistry
#############################################################
current_organic_synthesis
#############################################################
current_organocatalysis
#############################################################
letters_in_organic_chemistry
#############################################################
mini-reviews_in_organic_chemistry


In [8]:
save_answers_to_yaml(
    grouped_questions,
    parent_folder="../../data/processed/all_answers",
    save_only=["Bentham Science"],
)

Once a discrepancy has been resolved, the `.get_final_answer()` method returns the correct answer.
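
The sketch below shows one way the resolve-then-read workflow could fit together. `QuestionSketch` is a hypothetical stand-in, not the package's `Question` class: the method names mirror the ones used in this notebook, but the attribute `correct_index`, the error handling, and the internal logic are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QuestionSketch:
    """Hypothetical stand-in for the package's Question class."""

    answers: List[str]
    correct_index: Optional[int] = None       # assumed attribute
    discrepancy_reason: Optional[str] = None

    def has_discrepancies(self) -> bool:
        # Respondents disagree when more than one distinct answer was given.
        return len(set(self.answers)) > 1

    def resolve_discrepancy(self, correct_answer: int, discrepancy_reason: str) -> None:
        # `correct_answer` is the index of the respondent whose answer is kept.
        if not 0 <= correct_answer < len(self.answers):
            raise IndexError("correct_answer must be a valid respondent index")
        self.correct_index = correct_answer
        self.discrepancy_reason = discrepancy_reason
        # The notebook output suggests the real method echoes the reason.
        print(discrepancy_reason)

    def get_final_answer(self) -> str:
        if self.correct_index is not None:
            return self.answers[self.correct_index]
        if not self.has_discrepancies():
            return self.answers[0]
        raise ValueError("discrepancy has not been resolved yet")


question = QuestionSketch(
    answers=[
        "Data sharing encouraged but optional.",
        "Public data sharing of all data required.",
    ]
)
question.resolve_discrepancy(
    correct_answer=0,
    discrepancy_reason="Selected answer is better interpretation of RDP.",
)
final_answer = question.get_final_answer()
print(final_answer)
```

In this sketch `has_discrepancies()` stays true after resolution and `discrepancy_reason` marks the question as handled, which matches the check `has_discrepancies() and discrepancy_reason is None` used earlier to list unresolved questions.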