Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to specify text col name in TextClassificationProcessor and RegressionProcessor #387

Merged
merged 18 commits into from
Jun 8, 2020

Conversation

PhilipMay
Copy link
Contributor

@PhilipMay PhilipMay commented May 30, 2020

PR for #379

TODO

  • check if and how tests can be written -> added parametrize to the tests

@Timoeller Timoeller self-requested a review May 31, 2020 16:09
Copy link
Contributor

@Timoeller Timoeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it. Having tests also helps a lot. Thanks for this great addition.

I added a small comment in the column renaming.
Lets wait for our CI issue to be fixed, than we can merge it safely.

farm/data_handler/utils.py Outdated Show resolved Hide resolved
@Timoeller
Copy link
Contributor

Could you do a dummy commit to restart the CI, please? (I do not have access to the pipeline, today is public holiday in Germany...)

@PhilipMay
Copy link
Contributor Author

rebased master

@PhilipMay
Copy link
Contributor Author

Fixed test_processor_saving_loading the same way as #390 does.

@PhilipMay
Copy link
Contributor Author

The CI system was gracious to me and all tests are green now. :-)

Can you please do a review?

Thanks
Philip

@PhilipMay PhilipMay requested a review from Timoeller June 3, 2020 12:24
Copy link
Contributor

@Timoeller Timoeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small check is needed, than we can merge.

I would also merge irrespective of our CI status, since the pipeline ran through already. The problems seem to come from Azure side. Apparently there are not enough resources for open source code at all times and we currently investigate alternatives like github actions.

farm/data_handler/processor.py Show resolved Hide resolved
@PhilipMay
Copy link
Contributor Author

PhilipMay commented Jun 6, 2020

Made the change discussed above.

I am passing usecols=columns_needed to pd.read_csv to read only those cols we need. Right from the beginning. This way the line df = df[columns_needed] is not needed anymore.

The advantage is memory usage and early crashing. If one of the cols does not exist (because the user mistyped it) pd.read_csv crashes with a meaningful error. Example: ValueError: Usecols do not match columns, columns expected but not found: ['text_other_']

I think this should adress the concerns from @Timoeller in the review.

@PhilipMay
Copy link
Contributor Author

rebased master (mainly to trigger CI again)

@PhilipMay
Copy link
Contributor Author

YAY! CI is green. :-)

@Timoeller
Copy link
Contributor

Timoeller commented Jun 8, 2020

CSV data reading is handled much more user friendly now.
At some point we need to adjust the read_tsv_sentence_pair() to reflect these changes as well.

Thanks @PhilipMay for your effort.

@Timoeller Timoeller merged commit 6edf91e into deepset-ai:master Jun 8, 2020
@PhilipMay PhilipMay deleted the text-col-name branch June 8, 2020 13:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants