RAT-54: Tika based document analyzer #240
I'd prefer a file filter that allows ignoring plaintext files that cannot carry comments, such as JSON.
I am adding a file filter to remove JSON files. Initially this will be hard-coded. I will open a subsequent ticket to generalize it so that we can define a list of extensions to remove. This will probably mean more command line options, so I'll have to open that can of worms as well.
Please add a reference to all the old tickets in the changelog, and thanks for taking care of the old tickets/bugs.
438a352 to f55d091
Well, this blew up into something bigger than I wanted, but... I added a default exclusion for "*.json" files in the Default class and used that to configure the ReportConfiguration defaults. I changed the parameter names and instance variables to "filesToIgnore" and "directoriesToIgnore" to make it clear what the filters are doing. I cleaned up the DirectoryWalker and Walker classes. I ensured that filesToIgnore and directoriesToIgnore always have a value (not null). If set to null in the configuration, the value is translated into a filter that always returns false. I think there may be an issue in the directoriesToIgnore processing, but it has been there from the beginning. I will investigate and open a ticket if necessary.
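For readers following along, here is a minimal sketch of the null handling and the default "*.json" exclusion described above. It is not the code from this PR: the class `IgnoreFilters` and its members are hypothetical names, and it uses the JDK's `java.io.FilenameFilter` purely for illustration, which may differ from the filter type RAT uses.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Locale;

/** Hypothetical helper for illustration; not RAT's Defaults or ReportConfiguration. */
final class IgnoreFilters {
    /** Never matches; stands in for null so the walker needs no null checks. */
    static final FilenameFilter MATCH_NOTHING = (dir, name) -> false;

    /** Assumed default: ignore files whose name ends in ".json" (case-insensitive). */
    static final FilenameFilter DEFAULT_FILES_TO_IGNORE =
            (dir, name) -> name.toLowerCase(Locale.ROOT).endsWith(".json");

    private IgnoreFilters() { }

    /** Translates a null configuration value into the always-false filter. */
    static FilenameFilter orMatchNothing(FilenameFilter configured) {
        return configured == null ? MATCH_NOTHING : configured;
    }

    /** Returns true when the walker should skip the given file. */
    static boolean isIgnored(FilenameFilter filesToIgnore, File file) {
        return orMatchNothing(filesToIgnore).accept(file.getParentFile(), file.getName());
    }
}
```

With this shape, a configuration that sets the filter to null simply ignores nothing, while the default skips `config.json` but still reports on `Main.java`.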
apache-rat-core/src/main/java/org/apache/rat/report/claim/ClaimStatistic.java
Thanks for all the fixes while introducing Tika to RAT - big kudos.
Should we add a note to the webpage or is it just an entry in the release changelog concerning the use of Tika for file type detection? WDYT?
apache-rat-core/src/main/java/org/apache/rat/walker/DirectoryWalker.java
apache-rat-core/src/main/java/org/apache/rat/walker/Walker.java
src/changes/changes.xml
Changed to detecting binary by content not name.
</action>
<action issue="RAT-147" type="fix" dev="claudenw">
Change to detect non UTF-8 text as text not binary.
Isn't that what RAT-301 is all about? I got curious when I read your comment in the changelog here.
RAT-147 improves the guesser so that non-UTF-8 text is not detected as binary.
RAT-301 is about extended characters (Chinese in the report) that are UTF-8 encoded being detected as binary.
They are related but not the same. I was waiting for an example to test 301 with.
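To make the difference between the two tickets concrete, here is a rough sketch, assuming a simple byte-level heuristic. It is neither Tika's detection algorithm nor RAT's BinaryGuesser; `TextHeuristic` and `Kind` are hypothetical names. Valid UTF-8 with multi-byte characters (the RAT-301 case) and text in another encoding (the RAT-147 case) both count as text, and only content with control bytes is treated as binary.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical heuristic for illustration; not Tika's or RAT's algorithm. */
final class TextHeuristic {
    enum Kind { UTF8_TEXT, OTHER_TEXT, BINARY }

    private TextHeuristic() { }

    /** Strict UTF-8 check: malformed input is reported instead of replaced. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** True if any byte is a control code other than tab, LF, or CR. */
    static boolean hasControlBytes(byte[] bytes) {
        for (byte b : bytes) {
            int v = b & 0xFF;
            if (v < 0x20 && v != '\t' && v != '\n' && v != '\r') {
                return true;
            }
        }
        return false;
    }

    /** Control bytes mean binary; otherwise text, split by UTF-8 validity. */
    static Kind classify(byte[] bytes) {
        if (hasControlBytes(bytes)) {
            return Kind.BINARY;
        }
        return isValidUtf8(bytes) ? Kind.UTF8_TEXT : Kind.OTHER_TEXT;
    }
}
```

Under these assumptions, UTF-8 encoded Chinese characters classify as `UTF8_TEXT`, ISO-8859-1 text with accented letters classifies as `OTHER_TEXT`, and a PNG header classifies as `BINARY` because of its 0x1A byte.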
@Claudenw would you mind integrating the above file in order to see if it's properly detected with Tika? This could allow solving RAT-301 as well :)
@Claudenw I brought in changes related to RAT-301 - is that fine for you or should the Chinese character example go somewhere else? Thanks
9e2574d to 9de3ca4
apache-rat-core/src/main/java/org/apache/rat/analysis/DefaultAnalyserFactory.java
@Claudenw the extraction into the Tika class looks very nice - thanks.
@ottlinger If you approve I can merge this. If you want more eyes on it, let's invite a few reviewers.
apache-rat-core/src/test/java/org/apache/rat/document/impl/guesser/BinaryGuesserTest.java
In the PR's main description you've created a checklist - is that already done?
I updated the checklist.
… in IDE searches in RAT's codebase
@Claudenw please review my latest additions concerning RAT-301; after that, go ahead with the merge. Thanks for your work and the cool addition of more functionality to RAT #kudos
Thanks for all your work.
Switches to Tika to determine document type.
** this change depends upon #233 **
After the above change is merged this change will be approx 10 files -- most of them tests.
JSON files are now classified as STANDARD, not BINARY.
fixes RAT-20 Detection of binaries should be smarter
fixes RAT-54 MIME Detection Using Tika
fixes RAT-147 binary guesser design improvement
fixes RAT-150 RAT should use Apache Tika to simply guess ignored [application/X] file types and focus on the [text/Y] family as a sensible default
fixes RAT-211 Generated rat-output.xml must be well-formed, even if BinaryGuesser fails
fixes RAT-301 Chinese characters comments are not recognized as binary anymore (thanks to Tika)
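As an illustration of the idea behind RAT-150 and the JSON change, here is a minimal sketch of mapping a detected media type to a document type. In the actual change the media type comes from Tika; here it is passed in as a plain string so the sketch has no dependencies. `MediaTypeClassifier`, its two-value `DocumentType` enum, and the list of text-like `application/*` types are assumptions for this example, not RAT's real types or configuration.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical classifier for illustration; not RAT's actual mapping. */
final class MediaTypeClassifier {
    enum DocumentType { STANDARD, BINARY }

    /** Assumed set of application/* types that are really plain text. */
    private static final Set<String> TEXT_LIKE_APPLICATION_TYPES =
            Set.of("application/json", "application/xml", "application/javascript");

    private MediaTypeClassifier() { }

    /** Treats the text/* family and known text-like types as STANDARD. */
    static DocumentType classify(String mediaType) {
        // Drop parameters such as "; charset=UTF-8" and normalise case.
        String base = mediaType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (base.startsWith("text/") || TEXT_LIKE_APPLICATION_TYPES.contains(base)) {
            return DocumentType.STANDARD;
        }
        return DocumentType.BINARY;
    }
}
```

With this mapping, `application/json` and `text/plain; charset=ISO-8859-1` come out as STANDARD, while `image/png` and `application/octet-stream` come out as BINARY.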
Checklist