In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from daemon_analysis_tools.file_handling import load_and_process_csv
from daemon_analysis_tools.data.utils import group_questions_by_journal
from daemon_analysis_tools.data.publisher import Publisher
from daemon_analysis_tools.file_handling import (
    save_answers_to_yaml,
    load_answers_from_yaml,
)

Load and process data:
- Group answers by publisher and journal, normalizing names that are spelled in slightly different ways.
- Store the result in a DataFrame.

In [3]:
data = load_and_process_csv("../../data/raw/rdp.csv")
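
The name normalization mentioned above can be pictured with a small sketch. The helper `normalize_name` below is hypothetical and is not taken from `daemon_analysis_tools`; it only assumes that journal names are reduced to lowercase, underscore-separated keys such as `accounts_of_chemical_research`, which is the form the keys take later in this notebook.

```python
import re


def normalize_name(raw_name: str) -> str:
    """Turn a free-text journal name into a lowercase, underscore-separated key.

    Hypothetical helper for illustration; the library may normalize differently.
    """
    name = raw_name.strip().lower()
    # Spell out ampersands so that "ES&T" becomes "es and t".
    name = name.replace("&", " and ")
    # Collapse every run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^a-z0-9]+", "_", name)
    return name.strip("_")


# Three spellings of the same journal collapse to a single key.
variants = [
    "Accounts of Chemical Research",
    "accounts of  chemical research ",
    "Accounts-of-Chemical-Research",
]
keys = {normalize_name(v) for v in variants}
print(keys)
```

With a rule of this kind, answers entered under slightly different spellings end up grouped under the same journal.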

Get a `dict` keyed by publisher name, whose values are `dict`s keyed by journal name, whose values are in turn `dict`s of `Question` instances keyed by question number. The `.answers` attribute of each `Question` holds the answers given by the respondents, together with the explanation text motivating each answer.

In [4]:
grouped_questions = group_questions_by_journal(data)
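
To make the nesting concrete, here is a toy version of the structure built with plain dataclasses. `ToyAnswer` and `ToyQuestion` are stand-ins invented for this sketch; the real `Question` and `Answer` classes presumably carry more fields and methods. The answer texts are the ones reported for question 3 of ACS Accounts of Chemical Research.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAnswer:
    # Stand-in for the library's answer objects (assumed fields).
    text: str
    explanation: str = ""


@dataclass
class ToyQuestion:
    # Stand-in for the library's Question class (assumed fields).
    text: str
    answers: list = field(default_factory=list)


# publisher name -> journal name -> question number -> question
toy_grouped = {
    "ACS": {
        "accounts_of_chemical_research": {
            3: ToyQuestion(
                text="3. Data sharing requirements in RDP",
                answers=[
                    ToyAnswer("Data sharing encouraged but optional."),
                    ToyAnswer(
                        "Data sharing required but not publicly "
                        "(e.g. available upon request is allowed)."
                    ),
                ],
            ),
        },
    },
}

question = toy_grouped["ACS"]["accounts_of_chemical_research"][3]
for answer in question.answers:
    print(answer.text)
```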

Print the question numbers and the answers given by the encoders (e.g. for ACS Accounts of Chemical Research):

In [19]:
for q in grouped_questions["ACS"]["accounts_of_chemical_research"]:
    print(q)
    for a in grouped_questions["ACS"]["accounts_of_chemical_research"][q].answers:
        print(a.text)
    print()

1
Research Data Policy (RDP) exists.
Research Data Policy (RDP) exists.

3
Data sharing encouraged but optional.
Data sharing required but not publicly (e.g. available upon request is allowed).

4
Public data sharing on a FAIR repository required only for specific types of data (e.g. genetic data has to be shared on a FAIR repository but no other data).
Public data sharing on a FAIR repository encouraged.

2
Mentioned in the RDP but optional.
Mentioned in the RDP but optional.

5
DOIs or other persistent identifiers recommended for datasets or codes.
DOIs or other persistent identifiers recommended for datasets or codes.

7
Required data must be available after an embargo period.
Required data must be available prior to official publication.

8
Public online repositories recommended in RDP.
Data sharing in supplementary material or hosting by journal recommended in RDP.

9
Explicit mention of a certain license/license type in RDP.
Explicit mention of a certain license/license type in R

Questions and answers can be saved to `yaml` files.

In [20]:
save_answers_to_yaml(
    grouped_questions,
    parent_folder="../../data/processed/all_answers",
    save_only=["IOP"],
)

ACS/accounts_of_chemical_research.yaml already exists. No data was written to prevent overwriting files modified by users. Manually delete these files if necessary.
ACS/accounts_of_materials_research.yaml already exists. No data was written to prevent overwriting files modified by users. Manually delete these files if necessary.
ACS/acs_applied_materials_and_interfaces.yaml already exists. No data was written to prevent overwriting files modified by users. Manually delete these files if necessary.
ACS/acs_energy_letters.yaml already exists. No data was written to prevent overwriting files modified by users. Manually delete these files if necessary.
ACS/acs_es_and_t_engineering.yaml already exists. No data was written to prevent overwriting files modified by users. Manually delete these files if necessary.
ACS/acs_omega.yaml already exists. No data was written to prevent overwriting files modified by users. Manually delete these files if necessary.
ACS/applied_materials_and_interfaces.y

## Resolve discrepancies

The `Question` class has a `.resolve_discrepancy` method which updates `Question.answers` with the correct answer.

For example, let's consider IOP's 2D Materials. Question 7 has discrepancies.

In [21]:
grouped_questions["IOP"]["2d_materials"][7].has_discrepancies()

True

These are the answers given by the two encoders:

In [22]:
grouped_questions["IOP"]["2d_materials"][7].print_qa()

7. Timing of data release
  Resp. 0:
    Answer: Required data must be available prior to official publication.
    Explanation: Authors must specify the reason why they are unable to make their research data publicly available at the point of publication and this reason will be included in the published article.
  Resp. 1:
    Answer: Required data must be available after official publication.
    Explanation: "Authors must agree to make any data required to support or replicate claims made in an article available privately to the journal’s editors, reviewers and readers without undue restriction or delay if requested."
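
A discrepancy check of this kind plausibly boils down to asking whether the respondents chose different answer texts. The function below is a sketch under that assumption and is not the library's implementation of `has_discrepancies`; in particular, it ignores the explanations and compares only the selected answers.

```python
def texts_disagree(answer_texts: list) -> bool:
    """Return True when the respondents did not all pick the same answer text."""
    # Strip surrounding whitespace so trivial formatting differences are ignored.
    distinct = {text.strip() for text in answer_texts}
    return len(distinct) > 1


# Question 7 of IOP's 2D Materials: the two respondents disagree.
question_7 = [
    "Required data must be available prior to official publication.",
    "Required data must be available after official publication.",
]

# Question 1 of ACS Accounts of Chemical Research: the two respondents agree.
question_1 = [
    "Research Data Policy (RDP) exists.",
    "Research Data Policy (RDP) exists.",
]

print(texts_disagree(question_7), texts_disagree(question_1))
```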


Inconsistencies can be resolved manually by passing the index of the correct respondent and the reason for the discrepancy. The latter should be one of:
- Text missing
- Language understanding
- Difficulty in matching information and question
- Other: free text

In [23]:
grouped_questions["IOP"]["2d_materials"][7].resolve_discrepancy(
    correct_answer=0, discrepancy_reason="Language understanding"
)

After doing this, the `.get_final_answer()` method returns the correct answer.

In [24]:
grouped_questions["IOP"]["2d_materials"][7].get_final_answer()

Answer(text=Required data must be available prior to official publication., explanation=Authors must specify the reason why they are unable to make their research data publicly available at the point of publication and this reason will be included in the published article.)
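
The resolve-then-read workflow can be sketched as follows. Everything here is hypothetical: `ToyQuestion`, `is_valid_reason`, and the choice to compare reasons case-insensitively are assumptions made for illustration, not the behavior of the real `Question` class.

```python
from dataclasses import dataclass
from typing import Optional

# The fixed categories listed in this notebook; "Other: <free text>" is handled separately.
ALLOWED_REASONS = (
    "Text missing",
    "Language understanding",
    "Difficulty in matching information and question",
)


def is_valid_reason(reason: str) -> bool:
    """Accept one of the fixed categories, or 'Other:' followed by some text."""
    cleaned = reason.strip()
    if cleaned.casefold() in {r.casefold() for r in ALLOWED_REASONS}:
        return True
    prefix = "other:"
    return cleaned.casefold().startswith(prefix) and bool(cleaned[len(prefix):].strip())


@dataclass
class ToyQuestion:
    answers: list
    correct_answer: Optional[int] = None
    discrepancy_reason: Optional[str] = None

    def resolve_discrepancy(self, correct_answer: int, discrepancy_reason: str) -> None:
        # Validate both arguments before storing anything.
        if not 0 <= correct_answer < len(self.answers):
            raise IndexError(f"No respondent with index {correct_answer}")
        if not is_valid_reason(discrepancy_reason):
            raise ValueError(f"Unknown discrepancy reason: {discrepancy_reason!r}")
        self.correct_answer = correct_answer
        self.discrepancy_reason = discrepancy_reason

    def get_final_answer(self) -> str:
        # If everyone agrees there is nothing to resolve.
        if len(set(self.answers)) == 1:
            return self.answers[0]
        if self.correct_answer is None:
            raise ValueError("Discrepancy has not been resolved yet")
        return self.answers[self.correct_answer]


toy_q7 = ToyQuestion(
    answers=[
        "Required data must be available prior to official publication.",
        "Required data must be available after official publication.",
    ]
)
toy_q7.resolve_discrepancy(correct_answer=0, discrepancy_reason="Language understanding")
print(toy_q7.get_final_answer())
```

Validating the reason up front keeps the recorded categories consistent, which matters if the reasons are later aggregated across journals.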

Alternatively, inconsistencies can be resolved with small additions to the `yaml` files written earlier.

Each file, one per journal, has this form:

```
...
...
3:
  text: 3. Data sharing requirements in RDP
  has_discrepancies: true
  0:
    text: Data sharing encouraged but optional.
    explanation: All ACS journals strongly encourage authors to make the research
      data underlying their articles publicly available at the time of publication.
  1:
    text: Data sharing required but not publicly (e.g. available upon request is allowed).
    explanation: "Text from Journal Research Data Policy : \n\" All ACS journals strongly\
      \ encourage authors to make the research data underlying their articles publicly\
      \ available at the time of publication. \""
  correct_answer: null
...
...
```

Only questions where `has_discrepancies` is `true` need attention.

The correct answer can be chosen by writing the encoder id in `correct_answer`.

For example, in this case it seems that the first answer is correct, as the RDP only encourages (albeit strongly) data publication.

Therefore, one can write

```
  correct_answer: 0
```

and save the file. Once all questions with discrepancies are fixed, one can save the `yaml` file and proceed with the next journal.
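
When many files have to be edited, it can help to list which questions still need a `correct_answer`. The sketch below assumes the file has already been parsed into a plain `dict` with the layout shown above (for instance with PyYAML's `yaml.safe_load`, which would turn the keys `3`, `0`, `1` into integers and `null` into `None`). The helper `unresolved_questions` is hypothetical and not part of `daemon_analysis_tools`.

```python
def unresolved_questions(parsed: dict) -> list:
    """Return the ids of questions that have discrepancies but no chosen answer."""
    return [
        question_id
        for question_id, entry in parsed.items()
        if entry.get("has_discrepancies") and entry.get("correct_answer") is None
    ]


# Hand-written stand-in for one parsed journal file (explanations omitted).
parsed = {
    3: {
        "text": "3. Data sharing requirements in RDP",
        "has_discrepancies": True,
        0: {"text": "Data sharing encouraged but optional."},
        1: {
            "text": "Data sharing required but not publicly "
            "(e.g. available upon request is allowed)."
        },
        "correct_answer": None,
    },
    5: {
        # Both respondents agree here, so nothing needs to be chosen.
        "has_discrepancies": False,
        0: {"text": "DOIs or other persistent identifiers recommended for datasets or codes."},
        1: {"text": "DOIs or other persistent identifiers recommended for datasets or codes."},
    },
}

before = unresolved_questions(parsed)
parsed[3]["correct_answer"] = 0  # same effect as writing `correct_answer: 0` in the file
after = unresolved_questions(parsed)
print(before, after)
```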

For the sake of illustrating the next functions, let's assume the correct answer is always the first one, and manually "fix" all the questions and answers.

In [37]:
from copy import deepcopy

fixed_grouped_questions = deepcopy(grouped_questions)
for p in fixed_grouped_questions:
    for j in fixed_grouped_questions[p]:
        for q in fixed_grouped_questions[p][j]:
            fixed_grouped_questions[p][j][q].resolve_discrepancy(
                correct_answer=0,
                discrepancy_reason="Text missing",
            )

## `Publisher` and `Journal` classes

If all the q&a files have been fixed, uncomment the next cell to load the `yaml` files into a new, fixed dictionary, `fixed_grouped_questions`.

Otherwise, keep the `fixed_grouped_questions` dictionary built in the preceding cell.

In [None]:
# fixed_grouped_questions = load_answers_from_yaml(parent_folder='../../data/processed/all_answers')

Define a dictionary of `Publisher` instances containing `Journal` instances with the information taken from `fixed_grouped_questions`.

In [None]:
publishers = {}

# Build one Publisher per publisher name from the resolved questions.
for publisher_name, questions in fixed_grouped_questions.items():
    publishers[publisher_name] = Publisher.from_questions(publisher_name, questions)

For example, the ACS publisher instance is

In [None]:
publishers["ACS"]

and the journals it contains are

In [None]:
publishers["ACS"].list_journals()

and one of the journals, e.g. JACS, is

In [None]:
publishers["ACS"].get_journal("jacs")
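
As a rough mental model of these calls, the sketch below builds toy counterparts of the two classes. `ToyPublisher` and `ToyJournal` are invented for illustration; they only mirror the method names used in this notebook (`from_questions`, `list_journals`, `get_journal`) and assume that each journal simply stores one final answer per question. The real classes may hold and compute considerably more.

```python
from dataclasses import dataclass, field


@dataclass
class ToyJournal:
    name: str
    final_answers: dict  # question number -> final answer text (assumed layout)


@dataclass
class ToyPublisher:
    name: str
    journals: dict = field(default_factory=dict)

    @classmethod
    def from_questions(cls, name: str, questions: dict) -> "ToyPublisher":
        # `questions` maps journal name -> {question number: final answer text}.
        journals = {
            journal_name: ToyJournal(journal_name, dict(answers))
            for journal_name, answers in questions.items()
        }
        return cls(name, journals)

    def list_journals(self) -> list:
        return sorted(self.journals)

    def get_journal(self, journal_name: str) -> ToyJournal:
        return self.journals[journal_name]


# Only questions on which both respondents agreed are used here.
toy_questions = {
    "accounts_of_chemical_research": {
        1: "Research Data Policy (RDP) exists.",
        2: "Mentioned in the RDP but optional.",
        5: "DOIs or other persistent identifiers recommended for datasets or codes.",
    },
}

toy_publisher = ToyPublisher.from_questions("ACS", toy_questions)
print(toy_publisher.list_journals())
print(toy_publisher.get_journal("accounts_of_chemical_research").final_answers[1])
```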