Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RAT-54: Tika based document analyzer #240

Merged
merged 29 commits into from
May 4, 2024

Conversation

Claudenw
Copy link
Contributor

@Claudenw Claudenw commented Apr 18, 2024

Switches to Tika to determine document type.

** this change depends upon #233 **
After the above change is merged this change will be approx 10 files -- most of them tests.

json files are now STANDARD not BINARY

fixes RAT-20 Detection of binaries should be smarter
fixes RAT-54 MIME Detection Using Tika
fixes RAT-147 binary guesser design improvement
fixes RAT-150 RAT should use Apache Tika to simply guess ignored [application/X] file types and focus on the [text/Y] family as a sensible default
fixes RAT-211 Generated rat-output.xml must be well-formed, even if BinaryGuesser fails
fixes RAT-301 Chinese characters comments are not recognized as binary anymore (thanks to Tika)

Checklist

  • add documentation for default file exclusion listing *.json files as the excluded file. Also note that setting the exclusion will replace the existing entry, so it should be included in replacements. NOT NEEDED as .json is no longer excluded but again categorized as BINARY.
  • add documentation for the default directory exclusion of hidden (.*) files. Also note that setting the exclusion will replace the existing entry. (There is no non-programatic access to changing the directory)
  • Add tests for ArchiveWalker. (https://issues.apache.org/jira/browse/RAT-372)

@Claudenw Claudenw added enhancement work in progress Do not yet merge. java Pull requests that update Java code labels Apr 18, 2024
@Claudenw Claudenw self-assigned this Apr 18, 2024
@ottlinger
Copy link
Contributor

I'd prefer a file filter that allows ignoring no-comment plaintext files such as JSON.

@Claudenw Claudenw removed the work in progress Do not yet merge. label Apr 25, 2024
@Claudenw Claudenw changed the title Tika based document analyzer - DO NOT MERGE RAT-54: Tika based document analyzer Apr 25, 2024
@Claudenw
Copy link
Contributor Author

I am adding a file filter to remove json files. Initially this will be hard coded. I will open a subsequent ticket to generalize it so that we can define a list of extensions to remove. This will probably mean more command line options. So I'll have to open that can of worms as well.

@ottlinger
Copy link
Contributor

Pls add a reference to all the old tickets in the changelog & thanks for taking care of the old tickets/bugs.

@Claudenw Claudenw force-pushed the tika_based_document_analyzer branch from 438a352 to f55d091 Compare April 26, 2024 17:03
@Claudenw
Copy link
Contributor Author

Well this blew up to something bigger than I wanted but...

I added default exclusion for "*.json" files in the Default class and used that to configure the ReportConfiguration defaults.
In the future the default exclusions should be configured in Default class.

I changed the parameter names and instance variables to "filesToIgnore" and "directoriesToIgnore" to make it clear what the filters were doing.

I cleaned up the DirectoryWalker and Walker classes.

I ensured that the filesToIgnore and directoriesToIgnore alwasy have a value (not null). If set to null in the configuration the value is translated into a filter that always returns false.

I think there may be an issue in the directoreisToIgnore processing, but it has been there from the beginning. I will investigate and open a ticket if necessary.

@Claudenw Claudenw marked this pull request as ready for review April 26, 2024 17:17
@Claudenw Claudenw requested a review from ottlinger April 27, 2024 05:51
Copy link
Contributor

@ottlinger ottlinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the fixes while introducing Tika to RAT - big kudos.

Should we add a note to the webpage or is it just an entry in the release changelog concerning the use of Tika for file type detection? WDYT?

Changed to detecting binary by content not name.
</action>
<action issue="RAT-147" type="fix" dev="claudenw">
Change to detect non UTF-8 text as text not binary.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that what RAT-301 is all about? I got curious when I read your comment in the changelog here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

147 is improve the guesser so that non UTF-8 text is not detected as binary.
301 is extended (Chinese in the report) characters that are UTF-8 encoded being detected as binary.

related but not the same. I was waiting for example to test 301 with.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@ottlinger ottlinger May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Claudenw would you mind integrating above file in order to see if its properly detected with Tika? This could allow solving RAT-301 as well :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Claudenw I brought in changes related to RAT-301 - is that fine for you or should the Chinese character example go somewhere else? Thanks

@Claudenw Claudenw force-pushed the tika_based_document_analyzer branch from 9e2574d to 9de3ca4 Compare April 28, 2024 09:47
@Claudenw Claudenw requested a review from ottlinger April 29, 2024 10:48
@Claudenw Claudenw requested a review from ottlinger May 1, 2024 06:06
@ottlinger
Copy link
Contributor

@Claudenw the extraction into the Tika-class looks very nice - thanks.

@Claudenw
Copy link
Contributor Author

Claudenw commented May 2, 2024

@ottlinger If you approve I can merge this. If you want more eyes on it, lets's invite a few reviewers.

@ottlinger
Copy link
Contributor

@ottlinger If you approve I can merge this. If you want more eyes on it, lets's invite a few reviewers.

In the PR's main description you've created a check list - is that already done?
I'm still thinking if we should add more documentation about the change or leave to a bigger entry in the release notes about possibly changing behaviour of RAT.

@Claudenw
Copy link
Contributor Author

Claudenw commented May 4, 2024

I updated the checklist.

@ottlinger ottlinger self-requested a review May 4, 2024 13:29
@ottlinger
Copy link
Contributor

@Claudenw pls review my latest additions concerning RAT-301, after that go ahead with the merge. Thanks for your work and the cool addition of more functionality to RAT #kudos

@ottlinger ottlinger mentioned this pull request May 4, 2024
5 tasks
Copy link
Contributor

@ottlinger ottlinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your work.

@Claudenw Claudenw merged commit 8fcb1cf into apache:master May 4, 2024
16 checks passed
@Claudenw Claudenw deleted the tika_based_document_analyzer branch May 22, 2024 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement java Pull requests that update Java code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants