
Allow for model evaluation directly from cdqa #104

Closed
fmikaelian opened this issue Apr 25, 2019 · 17 comments

Comments

@fmikaelian
Collaborator

fmikaelian commented Apr 25, 2019

The idea is to implement the evaluate.py script inside the package under /utils

@fmikaelian
Collaborator Author

fmikaelian commented May 6, 2019

You should now be able to evaluate with:

from cdqa.utils.metrics import evaluate, evaluate_from_files

evaluate(dataset, predictions) # as json objects
evaluate_from_files(dataset_file='data/dev-v1.1.json', prediction_file='logs/bert_qa_squad_v1.1_sklearn/predictions.json') # as json files
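
For reference, assuming evaluate() mirrors the official SQuAD v1.1 evaluation script (dataset being the 'data' list of a SQuAD-format JSON, predictions a dict mapping question id to answer text), a minimal illustration with made-up content would be:

# Minimal sketch of the expected input shapes (SQuAD v1.1 style); ids and texts are invented.
dataset = [
    {
        "title": "Example article",
        "paragraphs": [
            {
                "context": "The capital of France is Paris.",
                "qas": [
                    {
                        "id": "q-001",
                        "question": "What is the capital of France?",
                        "answers": [{"text": "Paris", "answer_start": 25}]
                    }
                ]
            }
        ]
    }
]
predictions = {"q-001": "Paris"}

evaluate(dataset, predictions)  # -> {'exact_match': 100.0, 'f1': 100.0}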

fmikaelian added a commit that referenced this issue May 6, 2019
* Update CONTRIBUTING.md with new tree structure #112

* Build REST API using QAPipeline() #118

* start updating README with cdqa pipeline method

* Allow for model evaluation directly from cdqa #104

* add filter script in utils for all data cleaning tasks

* update badges pypi

* Allow for model evaluation directly from cdqa #104

* api style fixing + update demo notebook

* Build REST API using QAPipeline() #118

* update README and naming

* update README

* update tree structure
@fmikaelian
Collaborator Author

@andrelmfarias Are we evaluating the reader only or the full QAPipeline?

@fmikaelian
Collaborator Author

fmikaelian commented May 19, 2019

The evaluation of the reader can be done with the all_predictions object:

https://github.com/fmikaelian/cdQA/blob/3583256cbf73e8f3674f57182e824a5ca7c4f7be/cdqa/reader/bertqa_sklearn.py#L623-L634

https://github.com/fmikaelian/cdQA/blob/3583256cbf73e8f3674f57182e824a5ca7c4f7be/cdqa/reader/bertqa_sklearn.py#L649-L650

Here is a reproducible example on the SQuAD dev set:

import os
from ast import literal_eval
import pandas as pd
import joblib
import json

from cdqa.utils.filters import filter_paragraphs
from cdqa.reader.bertqa_sklearn import BertProcessor, BertQA
from cdqa.utils.metrics import evaluate

df = pd.read_csv('../data/bnpp_newsroom_v1.1/bnpp_newsroom-v1.1.csv', converters={'paragraphs': literal_eval})

df['paragraphs'] = df['paragraphs'].apply(filter_paragraphs)
df['content'] = df['paragraphs'].apply(lambda x: ' '.join(x))

reader = joblib.load('../models/bert_qa_squad_v1.1_sklearn/bert_qa_squad_v1.1_sklearn.joblib')
reader.output_dir = '../logs/'

processor = BertProcessor(do_lower_case=True, is_training=False)
examples, features = processor.fit_transform(X='../data/dev-v1.1.json') # replace with a custom dataset
preds = reader.predict((examples, features))

with open('../data/dev-v1.1.json') as dataset_file:
    dataset_json = json.load(dataset_file)
    dataset = dataset_json['data']

with open('../logs/predictions.json') as prediction_file:
    all_predictions = json.load(prediction_file)

evaluate(dataset, all_predictions) # as json objects

{'exact_match': 81.2488174077578, 'f1': 88.43242225358777}

To evaluate on a custom dataset, we just need to replace dev-v1.1.json with the custom input dataset.

Evaluating the whole pipeline seems to be a different story though... We will need to use our brains for a bit!

@fmikaelian
Collaborator Author

fmikaelian commented May 19, 2019

I think if QAPipeline's predict() returns something in the same format as all_predictions, corresponding to the custom input dataset labelled in JSON, we're good.

@andrelmfarias
Collaborator

Didn't we agree on #135 to create a method prepare_evaluation() in https://github.com/fmikaelian/cdQA/blob/develop/cdqa/utils/metrics.py to handle this instead of doing it directly in QAPipeline.predict()?

@fmikaelian
Collaborator Author

fmikaelian commented May 19, 2019

Yes, true. It should use QAPipeline and return an object similar to all_predictions, linked to the custom JSON dataset.

@andrelmfarias
Collaborator

I think it should just take as input the output of QAPipeline.predict(), and we will need to run:

predictions = QAPipeline.predict()
pred_for_eval = prepare_evaluation(predictions)

@fmikaelian
Collaborator Author

@andrelmfarias

I think in that case prepare_evaluation() must convert predictions to the all_predictions format?

That way we could do:

evaluate(dataset, pred_for_eval)

instead of:

evaluate(dataset, all_predictions)
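
For illustration only, a minimal sketch of such a conversion, assuming predictions came back as (question_id, answer) pairs (a hypothetical shape, not the current output of QAPipeline.predict()):

def prepare_evaluation(predictions):
    # Hypothetical helper: map each question id to its predicted answer text,
    # i.e. the {question_id: answer} dict that evaluate() expects as all_predictions.
    return {question_id: answer for question_id, answer in predictions}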

@andrelmfarias
Collaborator

The predictions output from cdqa does not have the question_id of the question being evaluated. This id is needed to convert predictions to the all_predictions format in order to evaluate.

We need BertQA to send the question_id with its predictions.

I will do a pull request with that feature.
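
For context, all_predictions is keyed by question id (the SQuAD prediction format), which is why the id has to travel with each prediction; the ids and answers below are purely illustrative:

all_predictions = {
    "56be4db0acb8001400a502ec": "Denver Broncos",
    "56be4db0acb8001400a502ed": "Carolina Panthers",
}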

@andrelmfarias
Collaborator

Just realized that we have a problem...

Our annotated dataset (from the annotator) does not have a question_id. The evaluate method needs it in order to link questions and answers.

However, the BertQA class does send a question_id as output, but I don't understand which id it is... Do you know where it comes from, @fmikaelian?

What I understand is that it is not the question_id we need...

@andrelmfarias
Collaborator

Just found where it comes from... generate_squad_examples. It is a randomly generated id:

https://github.com/fmikaelian/cdQA/blob/f824db1ad5dab2c2b2527faaa4a6d9f97a308e2a/cdqa/utils/converters.py#L97

We have to discuss it tomorrow, I don't really have an idea of how we can handle it.

@fmikaelian
Collaborator Author

fmikaelian commented Jun 4, 2019

You're right. It is something I didn't think of when building the annotator. The question_id is not appended. See https://github.com/fmikaelian/cdQA-annotator/blob/271e005d8ba0bd21277ebc752fa10b1ad8839d37/src/components/AnnotationsPage.vue#L114

What we can do is the following:

  1. Manually add a question_id to the current annotated json
  2. Provide generate_squad_examples tuples of questions + their question_id
  3. In generate_squad_examples set the question_id to the one provided if any
  4. Let predict(), probably write_predictions() also output the question_id

We can rediscuss this afternoon

@andrelmfarias
Collaborator

> You're right. It is something I didn't think of when building the annotator. The question_id is not appended. See https://github.com/fmikaelian/cdQA-annotator/blob/271e005d8ba0bd21277ebc752fa10b1ad8839d37/src/components/AnnotationsPage.vue#L114
>
> What we can do is the following:
>
>   1. Manually add a question_id to the current annotated json
>   2. Provide generate_squad_examples tuples of questions + their question_id
>   3. In generate_squad_examples set the question_id to the one provided if any
>   4. Let predict(), probably write_predictions() also output the question_id
>
> We can rediscuss this afternoon

  1. I agree
  2. Yes, it will be a major change in the code. We have to be attentive when doing it
  3. Actually we will need both ids... because we use the current randomly generated id to sort the paragraphs chosen by the retriever and send them to the reader
  4. We could do it as an option on predict, send_id=True, with a default value of False, as we will only need it for evaluation (see the sketch below)
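
To make option 4. concrete, here is a rough, self-contained sketch of the idea; the class, names and internals below are assumptions for illustration, not the actual BertQA code:

class ReaderSketch:
    """Hypothetical stand-in for the reader, only to illustrate the proposed send_id flag."""

    def _run_reader(self, X):
        # Placeholder internals: pretend the reader produced one answer and its question id.
        return "Denver Broncos", "56be4db0acb8001400a502ec"

    def predict(self, X, send_id=False):
        final_prediction, question_id = self._run_reader(X)
        if send_id:
            # Also expose the question id so the prediction can be joined back to the
            # annotated dataset when building all_predictions for evaluation.
            return final_prediction, question_id
        return final_prediction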

@andrelmfarias
Collaborator

@mamrou will work on 1.

@andrelmfarias
Collaborator

@fmikaelian

I will work on 2., 3. and 4.

@fmikaelian
Collaborator Author

fmikaelian commented Jun 10, 2019

Reporting @mamrou's solution for 1. (Manually add a question_id to the current annotated json)

import json
import uuid

with open("/path_to_json") as json_file:
    data = json.load(json_file)

# Assign a randomly generated unique id to every question of the annotated dataset
for article in data["data"]:
    for paragraph in article['paragraphs']:
        for question in paragraph['qas']:
            question['id'] = str(uuid.uuid4())

with open('path_to_output_json', 'w') as outfile:
    json.dump(data, outfile)
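
If useful, a quick sanity check can be appended to the script above to confirm that every question got a unique id:

# Optional check: every question in the annotated dataset should now carry a unique id.
ids = [qa['id']
       for article in data['data']
       for paragraph in article['paragraphs']
       for qa in paragraph['qas']]
assert len(ids) == len(set(ids)), "some questions share an id or are missing one"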

@fmikaelian
Collaborator Author

fmikaelian commented Jun 16, 2019

Thanks for your help!

fmikaelian added a commit that referenced this issue Jun 25, 2019