
Created a sklearn wrapper for the QA Pipeline #101

Merged — 31 commits merged into develop from qa_pipeline on May 2, 2019

Conversation

@andrelmfarias (Collaborator)

Created a QAPipeline class that works as a sklearn estimator to ease the question-answering process.

An instance of this class can be created using a version of the BERT model with the sklearn wrapper. It will train a TF-IDF matrix on the documents fed as a dataframe to the QAPipeline().fit() method.

To obtain an answer to a question, one only needs to call the QAPipeline().answer() method with the question passed as a string.
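A hedged toy sketch of that API shape (all names here are illustrative, not cdQA's actual code; retrieval is plain TF-IDF and the BERT reader is omitted, so answer() returns the best-matching paragraph rather than an extracted span):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class ToyQAPipeline:
    """Toy stand-in for QAPipeline: fit() trains a TF-IDF matrix on a
    dataframe of paragraphs, answer() returns the best-matching
    paragraph (a real pipeline would run a BERT reader on it)."""

    def fit(self, df):
        self.paragraphs = df["paragraphs"].tolist()
        self.vectorizer = TfidfVectorizer()
        self.tfidf_matrix = self.vectorizer.fit_transform(self.paragraphs)
        return self

    def answer(self, question):
        question_vec = self.vectorizer.transform([question])
        scores = cosine_similarity(question_vec, self.tfidf_matrix)[0]
        return self.paragraphs[scores.argmax()]

df = pd.DataFrame({"paragraphs": [
    "Paris is the capital of France.",
    "BERT is a transformer-based language model.",
]})
pipeline = ToyQAPipeline().fit(df)
print(pipeline.answer("What is the capital of France?"))
```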

@fmikaelian (Collaborator)

Just made a quick review of this:

To make sure I understand: you want to implement a sklearn pipeline that can use fit() to train both the retriever and the reader, and predict() to output the final answer directly. Did I understand correctly?

@andrelmfarias (Collaborator, Author) commented Apr 23, 2019

Not exactly.

This pipeline has two different training methods: fit() and fit_bert().

The first trains the retriever and the second trains the reader (which we will rarely do, as we already have pre-trained models).

It also has a predict() method, which returns the same results as the predict() method of the BertQA class, and an answer() method, which returns only the answer to a question (what we are actually interested in).
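The four-method surface described here can be sketched as a stub (a hypothetical skeleton, not cdQA's code; the bodies and the four objects returned by predict() are illustrative placeholders):

```python
class QAPipelineSketch:
    """Stub of the method split: fit() trains the retriever,
    fit_bert() fine-tunes the reader (rare, since pre-trained weights
    are usually loaded), predict() mirrors BertQA().predict(), and
    answer() returns just the answer string."""

    def fit(self, df):
        self.retriever_trained = True   # stands in for TF-IDF training
        return self

    def fit_bert(self, X, y=None):
        self.reader_trained = True      # stands in for BERT fine-tuning
        return self

    def predict(self, question):
        # Placeholder for the four prediction objects returned by
        # BertQA().predict() (order and contents illustrative).
        return "placeholder answer", {}, {}, {}

    def answer(self, question):
        # Keep only the answer, the part users actually care about.
        return self.predict(question)[0]

p = QAPipelineSketch()
```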

@fmikaelian (Collaborator)

I see obvious benefits to the new class you built, among them better usability and less complexity thanks to a single pipeline. Can you see other reasons why you implemented this?

@andrelmfarias (Collaborator, Author) commented Apr 23, 2019

Those are the only reasons I had in mind when I implemented it.

I still have other ideas for future improvements.

For example, instead of passing a dataframe with such a specific structure as we are doing now, we could create this dataframe inside the fit() method and pass a list of documents (articles) to it. The fit() method would then create the paragraphs and the dataframe.

I think the package will be cleaner and easier to use with this feature. We can discuss it further later on.
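A minimal sketch of that idea, assuming articles arrive as plain strings and paragraphs are separated by blank lines (both assumptions; the real implementation may split differently):

```python
import pandas as pd

def build_paragraph_dataframe(documents):
    """Build the (doc_id, paragraph) dataframe inside fit() from a
    plain list of article strings, instead of asking the user to
    construct the specifically structured dataframe by hand."""
    rows = []
    for doc_id, text in enumerate(documents):
        # Assumption: paragraphs are delimited by blank lines.
        for paragraph in text.split("\n\n"):
            paragraph = paragraph.strip()
            if paragraph:
                rows.append({"doc_id": doc_id, "paragraph": paragraph})
    return pd.DataFrame(rows)

docs = [
    "First paragraph of article A.\n\nSecond paragraph of article A.",
    "Only paragraph of article B.",
]
df = build_paragraph_dataframe(docs)
```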

@fmikaelian (Collaborator)

Ok, very clear. I agree on the principle of this pipeline.

I would like to discuss some naming conventions and API details, though. Let me review this in more detail tomorrow, and we can do a quick catch-up call if needed.

Thank you for your work!

verbose : bool, optional
Parameter for BertProcessor. If true, all of the warnings related to data processing will be printed.

If parameter `model` is None:
@fmikaelian (Collaborator)

Can you explain this?

@andrelmfarias (Collaborator, Author) commented Apr 25, 2019

If there is no pre-trained model fed as input when creating an instance of QAPipeline, it will create an instance of BertQA using the following parameters and assign it to the self.model attribute.

Please see line 124 of the code.

@fmikaelian (Collaborator)

For this one, let's talk over the phone 📞

@andrelmfarias (Collaborator, Author)

OK.

I am available for a call tomorrow afternoon if that suits you.


return self.model.predict((examples, features))

def answer(self, question, paragraphs=None, top_n=3, verbose=False):
@fmikaelian (Collaborator) commented Apr 25, 2019

In my opinion, we should stick to the sklearn API's fit() and predict() methods only. Adding new API methods like fit_bert() or answer() to use our model adds complexity for the user when we could easily have merged these methods.

@andrelmfarias (Collaborator, Author) commented Apr 25, 2019

I understand your point.

The problem with merging fit() and fit_bert() is that it would retrain the BERT model each time we call fit(). We do not want this behavior, as it leads to needless computation time.

I agree with merging predict() and answer(). However, I would prefer that the predict() method return only the answer, instead of the four variables returned by BertQA().predict().

@fmikaelian (Collaborator)

Let's do an empathy exercise for the end users:

> The problem with merging fit() and fit_bert() is that it would retrain the BERT model each time we call fit(). We do not want this behavior, as it leads to needless computation time.

As a basic user of cdqa with access to pre-trained models, I would like to fit() the pipeline on my custom dataset and make predictions on it given my questions. Therefore the fit() method should not fit the reader but only the retriever, since I don't need to train or fine-tune the base model and I want to see results as quickly as possible.

As an advanced user of cdqa, I would like to fit() the pipeline on my custom dataset and make predictions on it given my questions with the best performance possible. I am confident that I can get better results by fine-tuning the reader on my dataset, and I have a GPU for that.

I therefore think that the fit() method should allow training both the retriever and the reader, at the user's choice.

> I agree with merging predict() and answer(). However, I would prefer that the predict() method return only the answer, instead of the four variables returned by BertQA().predict().

I agree with you on this one. Should we do this in BertQA().predict() directly?

@andrelmfarias (Collaborator, Author) commented Apr 25, 2019

> I therefore think that the fit() method should allow training both the retriever and the reader, at the user's choice.

So you are proposing to merge both fit() functions and add a parameter with which the user chooses which model to train? I agree with that.

> I agree with you on this one. Should we do this in BertQA().predict() directly?

Maybe. Honestly, I do not see any advantage in returning all_predictions, all_nbest_json, and scores_diff_json together with final_prediction.

@fmikaelian (Collaborator) commented Apr 25, 2019

> So you are proposing to merge both fit() functions and add a parameter with which the user chooses which model to train? I agree with that.

Yes. I actually don't know whether setting params in fit() is good sklearn practice, but I think it's a way to solve this temporarily.

> Maybe. Honestly, I do not see any advantage in returning all_predictions, all_nbest_json, and scores_diff_json together with final_prediction.

I actually implemented it this way for debugging purposes, because I could not differentiate all of them, but now that the workflow is validated, removing them makes sense 😅 I'll update bert_sklearn.py accordingly.
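The merged design agreed on above can be sketched as follows (hypothetical names; the fit_reader flag and the stubbed bodies are illustrative, not cdQA's actual code):

```python
class PipelineSketch:
    """Sketch of the merged fit(): the cheap retriever is always
    (re)fitted, while the expensive BERT reader is fine-tuned only
    when the caller explicitly asks for it."""

    def __init__(self):
        self.retriever_fitted = False
        self.reader_fitted = False

    def fit(self, X, y=None, fit_reader=False):
        self.retriever_fitted = True   # cheap TF-IDF fit, always done
        if fit_reader:
            self.reader_fitted = True  # stands in for BERT fine-tuning
        return self

basic = PipelineSketch().fit(["doc"])                      # retriever only
advanced = PipelineSketch().fit(["doc"], fit_reader=True)  # retriever + reader
```

On the sklearn-practice question raised above: the convention is to put configuration like this in `__init__` (so it survives `clone()` and grid search) and to reserve `fit()` keyword arguments for data-dependent inputs such as sample weights; the flag appears in `fit()` here only to mirror the discussion.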

@fmikaelian (Collaborator)

I also want to highlight that such a pipeline will impact the README.md documentation with the new (and simpler) workflow you introduced.

@andrelmfarias (Collaborator, Author)

@fmikaelian

Thank you for this.

So can I now create an instance of this class inside qa_pipeline and commit before the merge?

@fmikaelian (Collaborator) commented Apr 29, 2019

You can add your pipeline in the cdqa_sklearn.py script with the changes we discussed:

https://github.com/fmikaelian/cdQA/blob/develop/cdqa/pipeline/cdqa_sklearn.py

I am still not very clear about this: #101 (comment)

But this step is the last one before we build a REST API to serve a demo UI! I started implementing it here: https://github.com/fmikaelian/cdQA/blob/develop/api.py

@andrelmfarias (Collaborator, Author)

So should I create a new branch to add the pipeline to the cdqa_sklearn.py script?

What about this branch?

@andrelmfarias (Collaborator, Author) commented May 2, 2019

@fmikaelian
I have implemented the changes as discussed. I tested the fit() method for the retriever and the predict() method in a notebook included in examples. Everything is working fine.

Could you please test the fit() method for the reader on GPU to check that everything is OK?

Please note that the current implementation of QAPipeline still uses TfidfRetriever as you implemented it (passing the dataframe column as input: TfidfRetriever.fit(df['content'])). It should be changed once you have implemented the improvement I proposed in #95.
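That retriever call can be pictured with scikit-learn directly — a hedged sketch, assuming TfidfRetriever essentially wraps a TF-IDF model (the column name `content` comes from the thread; the data and everything else are illustrative):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({"content": [
    "The quick brown fox.",
    "Question answering with BERT.",
]})

# Mirrors TfidfRetriever.fit(df['content']): the TF-IDF model is
# fitted on the raw text column, not on the full dataframe.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df["content"])
```

The improvement proposed in #95 would move this column selection inside the retriever, so callers could pass the dataframe itself.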

@fmikaelian (Collaborator)

Thank you for sticking with this PR to make it viable. This is what a good data scientist should do.

Let's merge your work! 🙏

@fmikaelian fmikaelian merged commit 2fb5f06 into develop May 2, 2019
@fmikaelian fmikaelian deleted the qa_pipeline branch May 2, 2019 14:28
fmikaelian added a commit that referenced this pull request Jun 25, 2019
* initial structure 🏗

* Update requirements.txt

* Add config for packaging, CI and tests #11

* removing pytest to debug CI

* quiet pip install

* add pytest

* add dummy test

* Add structure for samples and examples #14

* Add Travis CI badge #16

* Implement download.py script (SQuAD fetch) #9

* Create LICENSE

* Add document retriever script #8

* fix typo name title column

* Move retriever example in /examples (currently in /samples) #24

* Add utils script to convert pandas df (title, content) to SQuAD #13

* fix typo

* Find a name for our QA software #23

* Find a name for our QA software #23

* Upload weights and metrics and update download.py script #21

* Add run_squad.py in cdqa/reader #34

* Add scrapper folder under /cdqa #30

* Upload weights and metrics and update download.py script #21

* Find a robust method to get articles paragraphs #1

* Update converter.py

* Update run_converter.py

* Find a robust method to get articles paragraphs #1

* Find a robust method to get articles paragraphs #1

* Initiate README structure #32

* add contributing guidelines

* precision on  repo structure and workflow

* small typo fixes

* continue fixes

* Add fetch of BNP Paribas Newsroom dataset v.1.0 to download.py #46

* small fixes download/converter

* Changed url address for squad/evaluate-v1.1.py script #35

* Adapt retriever.py to BNP Paribas Newsroom dataset v.1.0 #45

* Adapt retriever.py to BNP Paribas Newsroom dataset v.1.0 #45

* Split run_squad.py in processing/train/predict #47

* Split run_squad.py in processing/train/predict #47

* add sklearn wrapper to run_squad.py to be able to call model from cdqa/pipeline

* add sklearn wrapper to run_squad.py to be able to call model from cdqa/pipeline

*  Wrong dataset filename in examples #60

* small fixes

* add BertQA.fit() code

* add custom_weigths params in BertQA.fit()

* typo fixes

* remove uncessary args in BertQA() class

* Add scikit-learn wrapper interface for BertForQuestionAnswering

* update imports train/predict

* json file or object for input + combine doc retriever+reader in pipeline

* fix bad indents

* small fixes

* fix examples/features casting

* read_squad_examples() does not work with our custom input object #61

* small fixes in estimator/transformer classes

* Update run_squad with latest commits #64

* Split run squad.py in processing/train/predict (#66)

* small fixes and updates

* Adapt or disable verbose during model fit() #63

* Update run_squad.py

* Update run_squad with latest commits #64

* NameError: name 'device' is not defined in predict() method #68 (#69)

* #71 #65 (#73)

Small fixes

* #75 (#76)

* fix #74 #36 #33 (#78)

* continue fix best answer across paragraphs (#80)

* fix #74 #36 #33

* update predict.py results

*  Be compliant with the Github open source guide #81

* start new structure docs

* synchronise run-squad.py #82 (#83)

* Disable logger info for BertProcessor() #77 (#84)

* Disable logger info for BertProcessor() #77

* Adapt or disable verbose during model fit() #63

* Add comments + docstrings + changelog #79 (#86)

* Add comments + docstrings + changelog #79

* Add comments + docstrings + changelog #79

* Add comments + docstrings + changelog #79

* small typo fixes

* small typo in download.py

* fix typo readme (#88)

* Add comments + docstrings + changelog #79 (#89)

* Add example (#93)

* Add comments + docstrings + changelog #79

* add example notebook for prediction + small changes

* add example notebook for prediction + small changes

* add codecov

* add codecov badge

* sync with HF example (#94)

* Sync hf (#98)

* sync HF

* update docstring

* fix typo

* Added download of CPU version of model to download.py (#100)

* update example notebook and docstrings (#92, #90,  #79) (#102)

* update example notebook and docstrings (#92, #90,  #79)

* update docstrings #79

* continue #79

* add flake8 to pytest in CI

* start integrating rest api #35

* add info readme

* basic api #35

* update reqs

* add refs and badges #87 (#105)

* add refs and badges #87

* sync HF

* first version of paper

*  Add sklearn wrapper for retriever as well #95

* Add sklearn wrapper for retriever as well #95

* update readme and clean repo

* update evaluation section in README

* debug-minor-updates (#106)

* Add github badges #87

* Disable verbose during predictions #103

* fix typos and tests #95

* Rename variables and scripts #108

* adapt notebook to new retriever class (#109)

* adapt notebook to new retriever class

* remove samples dir

* clean up repo and rename #108

* Fix predict berqa (#113)

* Rename variables and scripts #108

* Rename variables and scripts #108

* BertQA().predict() should return only 1 final predictions object #110

* Created a sklearn wrapper for the QA Pipeline (#101)

* Implemented QAPipeline object that do the whole process for question-answering

* Added option to attribute model: path (string) or joblib object

* corrected typo

* Created example of jupyter notebook for use of qa_pipeline

* Update notebook example

* Added description of QAPipeline class

* Added descriptions to all methods of QAPipeline class

* Corrected typo


* Changed code from qa_pipeline.py to cdqa_sklearn.py

* seperated kwargs for declaration of different classes within QAPipeline

* removed qa_pipeline.py

* Implemented predict() and retriever part of fit()

* Implemented reader training in fit() and completed documentation

* Modified documentation for predict() method

* Deleted useless tutorial

* Created notebook example for pipeline

* Modified converter.py to correct for the creation of repeated articles… (#116)

* Modified converter.py to correct for the creation of repeated articles in generate_squad_examples

* included options for min and max length in filter_paragraphs()

* Implement automatic pypi upload on master release #107

* Debug, small fixes and doc updates (#117)

* Update CONTRIBUTING.md with new tree structure #112

* Build REST API using QAPipeline() #118

* start updating README with cdqa pipeline method

* Allow for model evaluation directly from cdqa #104

* add filter script in utils for all data cleaning tasks

* update badges pypi

* Allow for model evaluation directly from cdqa #104

* api style fixing + update demo notebook

* Build REST API using QAPipeline() #118

* update README and naming

* update README

* update tree structure

* corrected typo related to sync with HF (#126)

* Updated BertQA to enable multiple trainings and handled some errors (#130)

* modified BertQA class to enable multiple calls to fit()

* cerrected typo

* Deleted tokenizer saving inside BertQA.fit

* handled problem with self.output_dir

* Implemented fit_reader() method and fixed fit() method. (#131)

* replaced self.model by self.reader

* Implemented fit_reader(), fixed fit() and updated doc

* sync HF + auto export json in scrapper + move filters (#129)

* sync HF + auto export json in scrapper + move filters

* change wording converter => converters

* fix typo api

* update version of dataset

* update version of dataset (2)

* predict() method should also give back index of document + paragraph #91 (#132)

* update API and notebook

* Implemented multiple prediction in QAPipeline.predict() (#135)

* replaced self.model by self.reader

* Implemented fit_reader(), fixed fit() and updated doc

* --

* Implemented multiple predictions in qa_pipeline

* removed not used import

* Improved doc of predict method

* Handled error for predictions on GPU

* filter paragraphs script (#140)

* implemented methods to send reader to GPU or CPU inside QAPipeline (#143)

* debug and update filters (#141)

* small fixes

* Fixed some errors (#145)

* fixed typo

* added other needed changes when sending to different devices

* add instructions for reader training on SQuAD

* Put all the display of messages under the verbose condition (#147)

* Update issue templates (#148)

* Update issue templates

* remove old issue template method

* Be compliant with the Github open source guide #81

* Deleted not used folders and included option to save logs with default False (#150)

* Deleted useless folders for github repo

* Added option in BertQA to save logs, the default is false

* Implemented a better way to have the option to save logs

* Updated documentation

* Removed useless parameter in BertQA

*  Updated explanations to train and to evaluate reader on README (#151)

* Updated explanation to train reader on README file

* Updated explanation to evaluate reader on README file

* Added explanation to evaluate pipeline

* Implemented function to evaluate pipeline (#152)

* Implemented function to evaluate pipeline

* modified name of module from metrics.py to evaluate.py

* Changed evaluate_from_files to evaluate_reader / modified name of the module (evaluate.py to evaluation.py)

* README updates

* update run_squad.py

* create content column inside qa_pipeline

* update refs

* clean up docstrings after content col removal

* fix typos and start write unit tests #136 (#155)

* + pdf_converter (#149)

* + pdf_converter

* README updates

* Use the sys.argv and save the data on a csv

* change '\n'.join() by ' '.join() in order to correct the csv

* update run_squad.py

* create content column inside qa_pipeline

* update refs

* clean up docstrings after content col removal

* minor bugs

* fix typos and start write unit tests #136 (#155)

* + pdf_converter

* Use the sys.argv and save the data on a csv

* change '\n'.join() by ' '.join() in order to correct the csv

* minor bugs

* update README

* change param name to sth meaningful

* fix typos and start migration to org

* update URLs

* corrected bug when predicting with log and set do_lower_case do False as default (the default BERT model we use is uncased) (#160)

* change LICENSE + fix typo README

* Included the whole team in the author paramater in setup.py (#161)

* Prepare repo for release #159 (#162)

* Prepare repo for release #159

* add GPU saving to example of reader training on SQuAD

* remove useless dep

* update README

* Updated download.py and removed docs repo (#163)

* updated download.py

* deleted docs repo

* fix LICENSE badge bug (#166)

* Last updates on download / setup / NB example to official release (#168)

* moved download.py to root and updated it to download models and BNP Paribas dataset

* changed version in setup.py to 1.0.0

* updated tutorial example with changes in repository

* done some minor updates on download.py

* removed PyGithub from requirements as we do not use it anymore