In [105]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [106]:
from daemon_analysis_tools.file_handling import load_and_process_csv
from daemon_analysis_tools.data.utils import group_questions_by_journal
from daemon_analysis_tools.data.publisher import Publisher
from daemon_analysis_tools.file_handling import save_answers_to_yaml, load_answers_from_yaml

Load and process data:
- Group answers by publisher and journal, normalizing names that were written in slightly different ways.
- Store in a DataFrame

In [107]:
data = load_and_process_csv('../../data/raw/rdp.csv')

Get a `dict` keyed by publisher name, whose values are `dict`s keyed by journal name, whose values are in turn `dict`s of `Question` instances. Each `Question`'s `.answers` attribute contains the answers given by the respondents, together with the explanation text that motivates each answer.

In [108]:
grouped_questions = group_questions_by_journal(data)
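
To make the nesting concrete, here is a minimal sketch of the publisher → journal → question layout. The `Answer` and `Question` dataclasses below are hypothetical stand-ins written for illustration; the real classes in `daemon_analysis_tools` may have different fields and constructors.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Answer:
    """One respondent's answer (hypothetical stand-in, not the package's class)."""
    text: str
    explanation: str = ""


@dataclass
class Question:
    """A question holding one answer per respondent (hypothetical stand-in)."""
    text: str
    answers: List[Answer] = field(default_factory=list)


# publisher name -> journal name -> question number -> Question
grouped = {
    "ACS": {
        "accounts_of_chemical_research": {
            3: Question(
                text="3. Data sharing requirements in RDP",
                answers=[
                    Answer("Data sharing encouraged but optional."),
                    Answer("Data sharing required but not publicly "
                           "(e.g. available upon request is allowed)."),
                ],
            ),
        },
    },
}


def iter_questions(grouped):
    """Yield (publisher, journal, number, question) for every stored question."""
    for publisher, journals in grouped.items():
        for journal, questions in journals.items():
            for number, question in questions.items():
                yield publisher, journal, number, question


for publisher, journal, number, question in iter_questions(grouped):
    print(publisher, journal, number, [a.text for a in question.answers])
```

Flattening the three levels into a single generator keeps later loops short, at the cost of hiding the nesting from the reader.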

Visualize the questions and answers given by the encoders (e.g. for ACS Accounts of Chemical Research)

In [109]:
questions = grouped_questions['ACS']['accounts_of_chemical_research']
for number, question in questions.items():
    print(number)
    for answer in question.answers:
        print(answer.text)
    print()

1
 https://publish.acs.org/publish/data_policy
Research Data Policy (RDP) exists.

3
Research Data Policy
All ACS journals strongly encourage authors to make the research data underlying their articles publicly available at the time of publication.

Research data is defined as materials and information used in the experiments that enable the validation of the conclusions drawn in the article, including primary data produced by the authors for the study being reported, secondary data reused or analyzed by the authors for the study, and any other materials necessary to reproduce or replicate the results.

The ACS Research Data Policy provides additional information on Data Availability Statements, Data Citation, and Data Repositories.
Data sharing required but not publicly (e.g. available upon request is allowed).

4
All ACS journals strongly encourage authors to make the research data underlying their articles publicly available at the time of publication.
Public data sharing on a FAIR 

Questions and answers can be saved to `yaml` files.

In [110]:
save_answers_to_yaml(grouped_questions, parent_folder='../../data/processed/all_answers', save_only = ['APS', 'Taylor & Francis'])

APS
Taylor & Francis


## Resolve discrepancies

The `Question` class has a `.resolve_discrepancy` method, which updates `Question.answers` with the correct answer.

For example, let's consider the Taylor & Francis journals. In Advanced Composite Materials, question 7 has discrepancies.

In [111]:
# `questions` is used instead of `data` to avoid overwriting the DataFrame loaded earlier.
for journal, questions in grouped_questions['Taylor & Francis'].items():
    print('#############################################################')
    print(journal)
    for number, question in questions.items():
        if question.has_discrepancies():
            question.print_qa()

#############################################################
advanced_composite_materials
7. Timing of data release
  Resp. 0:
    Answer: Required data must be available prior to official publication.
    Explanation: At the point of submission, you will be asked if there is a data set associated with the paper. If you reply yes, you will be asked to provide the DOI, pre-registered DOI, hyperlink, or other persistent identifier associated with the data set(s).
  Resp. 1:
    Answer: Required data must be available prior to review process.
    Explanation: At the point of submission, you will be asked if there is a data set associated with the paper. If you reply yes, you will be asked to provide the DOI, pre-registered DOI, hyperlink, or other persistent identifier associated with the data set(s). If you have selected to provide a pre-registered DOI, please be prepared to share the reviewer URL associated with your data deposit, upon request by reviewers.
############################

The output above lists the answers given by the two encoders.

Inconsistencies can be resolved manually by passing the index of the correct respondent.

In [113]:
# Journals where respondent 1 is taken as correct for question 7.
journals_to_fix = [
    'advanced_composite_materials',
    'analytical_letters',
    'green_chemistry_letters_and_reviews',
    'journal_of_macromolecular_science_part_b',
    'materials_research_letters',
    'molecular_physics',
    'waves_in_random_and_complex_media',
    'science_and_technology_of_advanced_materials',
    'polycyclic_aromatic_compounds',
]
for journal in journals_to_fix:
    grouped_questions['Taylor & Francis'][journal][7].resolve_discrepancy(correct_answer=1)

# For Ferroelectrics, respondent 0 is taken as correct for questions 1 to 13, skipping question 6.
for i in range(1, 14):
    if i != 6:
        grouped_questions['Taylor & Francis']['ferroelectrics'][i].resolve_discrepancy(correct_answer=0)



In [115]:

save_answers_to_yaml(grouped_questions, parent_folder='../../data/processed/all_answers', save_only = ['APS', 'Taylor & Francis'])


APS
Taylor & Francis


After doing this, the `.get_final_answer()` method returns the correct answer.

In [116]:
grouped_questions['Taylor & Francis']['polycyclic_aromatic_compounds'][7].get_final_answer()

Answer(text=Required data must be available prior to review process., explanation=At the point of submission, you will be asked if there is a data set associated with the paper. If you reply yes, you will be asked to provide the DOI, pre-registered DOI, hyperlink, or other persistent identifier associated with the data set(s). If you have selected to provide a pre-registered DOI, please be prepared to share the reviewer URL associated with your data deposit, upon request by reviewers.)
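
The workflow above (detect a discrepancy, pick a respondent, read back the final answer) can be sketched as follows. This is a hypothetical re-implementation for illustration only: the method names follow the ones used in this notebook, but the bodies, the `correct_answer` attribute, and the rule "answers disagree when their texts differ" are assumptions, not the package's actual code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Answer:
    text: str
    explanation: str = ""


@dataclass
class Question:
    text: str
    answers: List[Answer] = field(default_factory=list)
    # Index of the respondent judged correct; None while unresolved (assumed attribute).
    correct_answer: Optional[int] = None

    def has_discrepancies(self) -> bool:
        # Assumption: respondents disagree when their answer texts are not all identical.
        return len({a.text for a in self.answers}) > 1

    def resolve_discrepancy(self, correct_answer: int) -> None:
        if not 0 <= correct_answer < len(self.answers):
            raise IndexError(f"No respondent with index {correct_answer}")
        self.correct_answer = correct_answer

    def get_final_answer(self) -> Answer:
        if self.correct_answer is not None:
            return self.answers[self.correct_answer]
        if self.answers and not self.has_discrepancies():
            # All respondents agree, so any answer can serve as the final one.
            return self.answers[0]
        raise ValueError("Correct answer is unknown. Resolve discrepancies, if present.")


q = Question(
    text="7. Timing of data release",
    answers=[
        Answer("Required data must be available prior to official publication."),
        Answer("Required data must be available prior to review process."),
    ],
)
assert q.has_discrepancies()
q.resolve_discrepancy(correct_answer=1)
print(q.get_final_answer().text)
# prints: Required data must be available prior to review process.
```

Storing only the index of the chosen respondent keeps both original answers available for later auditing.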

Alternatively, inconsistencies can be resolved with small additions to the `yaml` files written before.

Each file, one per journal, has this form:

```
...
...
3:
  text: 3. Data sharing requirements in RDP
  has_discrepancies: true
  0:
    text: Data sharing encouraged but optional.
    explanation: All ACS journals strongly encourage authors to make the research
      data underlying their articles publicly available at the time of publication.
  1:
    text: Data sharing required but not publicly (e.g. available upon request is allowed).
    explanation: "Text from Journal Research Data Policy : \n\" All ACS journals strongly\
      \ encourage authors to make the research data underlying their articles publicly\
      \ available at the time of publication. \""
  correct_answer: null
...
...
```

Only questions where `has_discrepancies` is `true` need attention.

The correct answer is chosen by writing the index of the corresponding encoder in `correct_answer`.

For example, in this case it seems that the first answer is correct, as the RDP only encourages (albeit strongly) data publication.

Therefore, one can write

```
  correct_answer: 0
```

Once all questions with discrepancies are fixed, save the `yaml` file and proceed to the next journal.
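
When working through many files, it can help to list which questions still need a `correct_answer`. The sketch below assumes a journal file has already been parsed into a Python mapping by a YAML loader (for example PyYAML's `yaml.safe_load`, which maps `true` and `null` to `True` and `None` and bare integer keys to `int`). The helper name `unresolved_questions` and the sample data are hypothetical.

```python
def unresolved_questions(journal_data):
    """Return the keys of questions flagged as discrepant but not yet resolved."""
    return [
        number
        for number, entry in journal_data.items()
        if isinstance(entry, dict)
        and entry.get("has_discrepancies")
        and entry.get("correct_answer") is None
    ]


# Hand-written stand-in for one parsed journal file, mirroring the layout shown above.
journal_data = {
    3: {
        "text": "3. Data sharing requirements in RDP",
        "has_discrepancies": True,
        0: {"text": "Data sharing encouraged but optional."},
        1: {"text": "Data sharing required but not publicly "
                    "(e.g. available upon request is allowed)."},
        "correct_answer": None,
    },
    # Placeholder entry without discrepancies (illustrative content only).
    4: {
        "text": "(placeholder question)",
        "has_discrepancies": False,
        "correct_answer": None,
    },
}

print(unresolved_questions(journal_data))  # prints: [3]

# After choosing the first encoder's answer, nothing is left to resolve.
journal_data[3]["correct_answer"] = 0
print(unresolved_questions(journal_data))  # prints: []
```

The check uses `is None` rather than truthiness because `0` is a valid encoder index.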

For the sake of illustrating the next functions, let's assume the correct answer is always the first, and manually "fix" all the q&a.

In [102]:
from copy import deepcopy
fixed_grouped_questions = deepcopy(grouped_questions)
for p in fixed_grouped_questions:
    for j in fixed_grouped_questions[p]:
        for q in fixed_grouped_questions[p][j]:
            fixed_grouped_questions[p][j][q].resolve_discrepancy(correct_answer = 0)

## `Publisher` and `Journal` classes

If all the q&a files have been fixed, load the `yaml` files into a new `fixed_grouped_questions` dictionary by uncommenting the next cell.

Otherwise, keep the `fixed_grouped_questions` dictionary built in the preceding cell.

In [103]:
# fixed_grouped_questions = load_answers_from_yaml(parent_folder='../../data/processed/all_answers')

Define a dictionary of `Publisher` instances containing `Journal` instances with the information taken from `fixed_grouped_questions`.

In [104]:
publishers = {}

for publisher_name, questions in fixed_grouped_questions.items():
    publishers[publisher_name] = Publisher.from_questions(publisher_name, questions)

Note that passing the unresolved `grouped_questions` dictionary here instead raises `ValueError: Correct answer is unknown. Resolve discrepancies, if present.`

For example, the ACS publisher instance is

In [None]:
publishers['ACS']

and the journals it contains are

In [None]:
publishers['ACS'].list_journals()

and one of the journals, e.g. JACS, is

In [None]:
publishers['ACS'].get_journal('jacs')