Add HTML strip processor #41888

Merged
3 commits merged on May 9, 2019

@spinscale
Member

commented May 7, 2019

This processor uses Lucene's HTMLStripCharFilter class to strip HTML
markup from a field. It complements the existing char filter by making
it possible to store the stripped version as well and retrieve it from
the original JSON.

Note that the character filter replaces tags with a newline, so the
resulting text will differ slightly from the incoming HTML with regard
to newlines.

P.S. I created this for a small project and figured it might make sense to have it in core. Feel free to close if you do not agree.
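
One way to apply a `Reader`-based char filter to a single field value is to wrap the value in a `StringReader`, pass it through the filter, and drain the result into a string. The following is a minimal sketch of that pattern, not the code from this PR: `HtmlStripSketch`, `process`, and `NaiveTagFilter` are hypothetical names, and `NaiveTagFilter` is a deliberately naive stand-in that turns every `<...>` run into a newline so the example runs without Lucene on the classpath. Assuming Lucene's `HTMLStripCharFilter` is available with its `Reader`-taking constructor, it would be passed in as `HTMLStripCharFilter::new` instead.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Function;

class HtmlStripSketch {

    // Hypothetical helper: runs `value` through the given char filter and
    // collects every character the filter emits.
    static String process(String value, Function<Reader, Reader> charFilter) {
        StringBuilder out = new StringBuilder(value.length());
        try (Reader filtered = charFilter.apply(new StringReader(value))) {
            int ch;
            while ((ch = filtered.read()) != -1) {
                out.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Naive stand-in (an assumption for illustration, NOT Lucene's filter):
    // every "<...>" run becomes a single newline. An unclosed "<" swallows
    // the rest of the input. Entities and comments are not handled.
    static final class NaiveTagFilter extends FilterReader {
        private boolean inTag = false;

        NaiveTagFilter(Reader in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int ch;
            while ((ch = in.read()) != -1) {
                if (inTag) {
                    if (ch == '>') {
                        inTag = false;
                        return '\n'; // the tag is replaced with a newline
                    }
                } else if (ch == '<') {
                    inTag = true;
                } else {
                    return ch;
                }
            }
            return -1;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            // Route bulk reads through read() so they are filtered as well.
            int n = 0;
            int ch;
            while (n < len && (ch = read()) != -1) {
                buf[off + n++] = (char) ch;
            }
            return (n == 0 && len > 0) ? -1 : n;
        }
    }

    public static void main(String[] args) {
        String stripped = process("<p>Hello <b>world</b></p>", NaiveTagFilter::new);
        // Newlines are escaped so the effect is visible on a single line.
        System.out.println(stripped.replace("\n", "\\n")); // prints: \nHello \nworld\n\n
    }
}
```

The stand-in makes the newline effect visible: the stripped value contains line breaks that were not in the input. The exact output of the real Lucene filter may differ from this simplified version.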

Add HTML strip processor
This processor uses Lucene's HTMLStripCharFilter class to strip HTML
markup from a field. It complements the existing char filter by making
it possible to store the stripped version as well.

Note that the character filter replaces tags with a newline, so the
resulting text will differ slightly from the incoming HTML with regard
to newlines.

@martijnvg
Member

left a comment

I think this processor is a great addition! I left one minor comment, LGTM otherwise.

import java.io.StringReader;
import java.util.Map;

public class HtmlStripProcessor extends AbstractStringProcessor<String> {

martijnvg May 8, 2019

Member

make class final?

@spinscale

Member Author

commented May 8, 2019

@elasticmachine run elasticsearch-ci/packaging-sample

@spinscale

Member Author

commented May 9, 2019

@elasticmachine run elasticsearch-ci/packaging-sample

@spinscale merged commit 2a9da80 into elastic:master May 9, 2019

9 checks passed

CLA All commits in pull request signed
elasticsearch-ci/1 Build finished.
elasticsearch-ci/2 Build finished.
elasticsearch-ci/bwc Build finished.
elasticsearch-ci/default-distro Build finished.
elasticsearch-ci/docbldesx Build finished.
elasticsearch-ci/docs-check Build finished.
elasticsearch-ci/oss-distro-docs Build finished.
elasticsearch-ci/packaging-sample Build finished.

spinscale added a commit that referenced this pull request May 9, 2019

Add HTML strip processor (#41888)

Megamiun added a commit to Megamiun/elasticsearch that referenced this pull request May 18, 2019

Add HTML strip processor (elastic#41888)

gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019

Add HTML strip processor (elastic#41888)