In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from daemon_analysis_tools.file_handling import load_and_process_csv
from daemon_analysis_tools.data.utils import group_questions_by_journal
from daemon_analysis_tools.data.publisher import Publisher
from daemon_analysis_tools.file_handling import save_answers_to_yaml, load_answers_from_yaml

Load and process data:
- Group answers by publisher and journal, normalizing names that are written in slightly different ways.
- Store in a DataFrame
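
As an illustration of the name normalization mentioned above, here is a minimal sketch. The helper `normalize_name` is hypothetical and is not part of `daemon_analysis_tools`; the rules it applies (lowercasing, trimming, collapsing separators into underscores) are an assumption inferred from the journal keys used in this notebook, such as `2d_materials`.

```python
import re


def normalize_name(name: str) -> str:
    """Hypothetical normalizer (not the library's own implementation).

    Lowercases and trims the name, then collapses every run of
    non-alphanumeric characters into a single underscore.
    """
    slug = re.sub(r"[^a-z0-9]+", "_", name.strip().lower())
    return slug.strip("_")


# Three spellings of the same journal that should end up under one key.
variants = [
    "Accounts of Chemical Research",
    "accounts of  chemical research ",
    "Accounts-of-Chemical-Research",
]
keys = {normalize_name(v) for v in variants}
print(keys)
```

With rules like these, all three spellings map to the same key, so their answers would be grouped together.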

In [None]:
data = load_and_process_csv('../../data/raw/rdp.csv')

`group_questions_by_journal` returns a nested `dict`: publisher names map to `dict`s keyed by journal name, which in turn map to `dict`s of `Question` instances. The `.answers` attribute of each `Question` contains the answers given by the respondents, together with the explanation text each respondent wrote to justify it.

In [None]:
grouped_questions = group_questions_by_journal(data)
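
The three-level nesting can be awkward to traverse by hand. The sketch below flattens any such nested `dict` into rows; `iter_questions` is a hypothetical helper, not a library function, and the toy data uses placeholder strings in place of real `Question` instances.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_questions(
    grouped: Dict[str, Dict[str, Dict[Any, Any]]]
) -> Iterator[Tuple[str, str, Any, Any]]:
    """Yield (publisher, journal, question_id, question) for every question."""
    for publisher, journals in grouped.items():
        for journal, questions in journals.items():
            for question_id, question in questions.items():
                yield publisher, journal, question_id, question


# Toy data: placeholder strings stand in for Question instances.
toy = {
    "ACS": {"jacs": {1: "q1", 2: "q2"}},
    "IOP": {"2d_materials": {7: "q7"}},
}
rows = list(iter_questions(toy))
print(len(rows))
```

Assuming `grouped_questions` is built from plain `dict`s as described, the same helper should work on it unchanged.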

Display the questions and the answers given by the encoders, e.g. for ACS Accounts of Chemical Research:

In [None]:
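# Look the journal up once instead of on every iteration
acs_questions = grouped_questions['ACS']['accounts_of_chemical_research']

for q, question in acs_questions.items():
    print(q)
    # Print the text of each encoder's answer to this question
    for a in question.answers:
        print(a.text)
    print()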

A `Publisher` can be instantiated from a `dict` or `pd.DataFrame` produced by `group_questions_by_journal`.

## Resolve discrepancies

The `Question` class has a `.resolve_discrepancy` method, which updates `Question.answers` with the correct answer.

For example, let's consider IOP's 2D Materials. Question 7 has discrepancies.

In [None]:
grouped_questions['IOP']['2d_materials'][7].has_discrepancies()

These are the answers given by the two encoders:

In [None]:
grouped_questions['IOP']['2d_materials'][7].print_qa()

Inconsistencies can be removed manually by passing the index of the respondent who gave the correct answer.

In [None]:
grouped_questions['IOP']['2d_materials'][7].resolve_discrepancy(correct_answer = 0)

After doing this, the `.get_final_answer()` method returns the correct answer.

In [None]:
grouped_questions['IOP']['2d_materials'][7].get_final_answer()
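
To make the workflow above concrete, here is a simplified stand-in. `ToyQuestion` and `ToyAnswer` are hypothetical classes that only mimic the method names used in this notebook; the real `Question` class may store and update its answers differently (this sketch records the chosen index in a `correct_answer` field, which is an assumption).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyAnswer:
    text: str
    explanation: str = ""


@dataclass
class ToyQuestion:
    """Hypothetical stand-in for Question, for illustration only."""

    text: str
    answers: List[ToyAnswer] = field(default_factory=list)
    correct_answer: Optional[int] = None

    def has_discrepancies(self) -> bool:
        # Encoders disagree when their answer texts are not all identical.
        return len({a.text for a in self.answers}) > 1

    def resolve_discrepancy(self, correct_answer: int) -> None:
        # Record which respondent (by index) gave the correct answer.
        if not 0 <= correct_answer < len(self.answers):
            raise IndexError("no respondent with that index")
        self.correct_answer = correct_answer

    def get_final_answer(self) -> ToyAnswer:
        if self.correct_answer is not None:
            return self.answers[self.correct_answer]
        if self.has_discrepancies():
            raise ValueError("discrepancy has not been resolved yet")
        return self.answers[0]


q = ToyQuestion(
    text="Toy question",
    answers=[ToyAnswer("encouraged"), ToyAnswer("required")],
)
print(q.has_discrepancies())
q.resolve_discrepancy(correct_answer=0)
print(q.get_final_answer().text)
```

Storing an index rather than overwriting the answers keeps both encoders' original responses available for later review, which is one possible design choice among several.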

Alternatively, the data can be saved to `yaml` files containing the questions and the answers given by the encoders. Inconsistencies can then be resolved by editing these files manually and writing the id of the encoder who gave the correct answer:

In [None]:
save_answers_to_yaml(grouped_questions, parent_folder='../../data/processed/all_answers')

Each file, one per journal, has this form:

```
...
...
3:
  text: 3. Data sharing requirements in RDP
  has_discrepancies: true
  0:
    text: Data sharing encouraged but optional.
    explanation: All ACS journals strongly encourage authors to make the research
      data underlying their articles publicly available at the time of publication.
  1:
    text: Data sharing required but not publicly (e.g. available upon request is allowed).
    explanation: "Text from Journal Research Data Policy : \n\" All ACS journals strongly\
      \ encourage authors to make the research data underlying their articles publicly\
      \ available at the time of publication. \""
  correct_answer: null
...
...
... continues similarly for other questions
```

Only questions where `has_discrepancies` is `true` need attention.

The correct answer is chosen by writing the id of the corresponding encoder in `correct_answer`.

For example, in this case it seems that the first answer is correct, as the RDP only encourages (albeit strongly) data publication.

Therefore, one can write

```
  correct_answer: 0
```

Once all questions with discrepancies in a file are fixed, save it and proceed to the next journal.
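
Before moving on to the next journal, it can help to check that no flagged question was missed. The sketch below works on an already parsed file, i.e. a plain `dict` with the same keys as the excerpt above (for example, as obtained with PyYAML's `yaml.safe_load`, which is an assumption about how the file is read). `unresolved_questions` is a hypothetical helper, and the second toy entry is invented for illustration.

```python
from typing import Any, Dict, List


def unresolved_questions(journal_answers: Dict[Any, Dict[Any, Any]]) -> List[Any]:
    """Return ids of questions flagged with discrepancies but still unresolved."""
    return [
        question_id
        for question_id, entry in journal_answers.items()
        if entry.get("has_discrepancies") and entry.get("correct_answer") is None
    ]


# Toy mapping that mimics one parsed journal file.
parsed = {
    3: {
        "text": "3. Data sharing requirements in RDP",
        "has_discrepancies": True,
        0: {"text": "Data sharing encouraged but optional."},
        1: {"text": "Data sharing required but not publicly (e.g. available upon request is allowed)."},
        "correct_answer": None,
    },
    # Hypothetical entry on which both encoders agree.
    4: {
        "text": "Toy question without discrepancies",
        "has_discrepancies": False,
        0: {"text": "Yes"},
        1: {"text": "Yes"},
        "correct_answer": None,
    },
}

before = unresolved_questions(parsed)
parsed[3]["correct_answer"] = 0  # same effect as editing the file by hand
after = unresolved_questions(parsed)
print(before, after)
```

An empty list means every question with a discrepancy in that file has a `correct_answer` set.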

To illustrate the functions that follow, let's assume the first answer is always the correct one and "fix" all the q&a accordingly.

In [None]:
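# Walk publishers -> journals -> questions and pick the first respondent's answer
for journals in grouped_questions.values():
    for journal_questions in journals.values():
        for question in journal_questions.values():
            question.resolve_discrepancy(correct_answer=0)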

## `Publisher` and `Journal` classes

If all the q&a files have been fixed, uncomment the next cell to load the `yaml` files and define a new, fixed `grouped_questions` dictionary.

Otherwise, keep the current `grouped_questions` dictionary as produced by the preceding cell.

In [None]:
# Reassign `grouped_questions` so that the cells below use the fixed answers
# grouped_questions = load_answers_from_yaml(parent_folder='../../data/processed/all_answers')

Define a dictionary of `Publisher` instances containing `Journal` instances with the information taken from `grouped_questions`.

In [None]:
publishers = {}

for publisher_name, questions in grouped_questions.items():
    publishers[publisher_name] = Publisher.from_questions(publisher_name, questions)

For example, the ACS publisher instance is

In [None]:
publishers['ACS']

and the journals it contains are

In [None]:
publishers['ACS'].list_journals()

and one of the journals, e.g. JACS, is

In [None]:
publishers['ACS'].get_journal('jacs')