Check tika library for different metadata types in some videos #24051

Closed
3 tasks
Tracked by #21988
fmontes opened this issue Feb 7, 2023 · 7 comments · Fixed by dotCMS/plugin-com.dotcms.tika#2 or #24684

@fmontes
Member

fmontes commented Feb 7, 2023

Problem Statement

Some MP4 videos end up with application/mp4 and others with video/mp4 as the content type in their metadata, and this breaks some queries.

Steps to Reproduce

  1. Run master full starter
  2. Edit a blog
  3. Add two videos
  4. Edit the video, go to History, and click the JSON
  5. See the property assetMetaData.contentType.

Acceptance Criteria

  • Why is contentType different?
    • Is Tika parsing the files incorrectly? If so, fix it.
    • Are the videos actually different?
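
One way to check whether the videos themselves differ is to compare the major brand in each file's ftyp box. In the ISO base media file format, an MP4 file usually opens with an ftyp box: bytes 4-7 hold the box type and bytes 8-11 hold the major brand. The sketch below is only an illustration; the class and method names are hypothetical, it assumes the ftyp box is the first box in the file, and whether the brand explains the different contentType values here is an assumption, not something confirmed in this issue.

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper, not part of dotCMS or Tika.
final class Mp4Brand {

    /** Returns the 4-character major brand if the header starts with an "ftyp" box. */
    static Optional<String> majorBrand(byte[] header) {
        // Need 4 bytes of box size, 4 bytes of box type, 4 bytes of major brand.
        if (header == null || header.length < 12) {
            return Optional.empty();
        }
        String boxType = new String(header, 4, 4, StandardCharsets.US_ASCII);
        if (!"ftyp".equals(boxType)) {
            return Optional.empty();
        }
        return Optional.of(new String(header, 8, 4, StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws Exception {
        // Pass the paths of the two videos to compare their brands.
        for (String name : args) {
            byte[] header;
            try (InputStream in = Files.newInputStream(Path.of(name))) {
                header = in.readNBytes(12);
            }
            System.out.println(name + " -> "
                    + majorBrand(header).orElse("(no ftyp box at the start of the file)"));
        }
    }
}
```

If the two files report different brands, the difference may come from how the videos were recorded rather than from a parsing bug.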

dotCMS Version

master

Proposed Objective

Core Features

Proposed Priority

Priority 2 - Important

External Links... Slack Conversations, Support Tickets, Figma Designs, etc.

No response

Assumptions & Initiation Needs

We query on this property to add videos to the block editor, and the videos with application/mp4 do not show up in the query results.
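
A possible client-side workaround, sketched here under assumptions, is to treat both reported values as MP4 when filtering. The class and method names below are hypothetical and are not dotCMS APIs; the set of accepted values is taken only from the two content types mentioned in this issue.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of dotCMS.
final class Mp4ContentTypes {

    // Assumption: these are the only two values seen for MP4 files in this issue.
    private static final Set<String> MP4_TYPES = Set.of("video/mp4", "application/mp4");

    /** Strips parameters such as "; charset=UTF-8" and lower-cases the media type. */
    static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** True when the value is either of the MP4 content types listed above. */
    static boolean isMp4(String contentType) {
        return MP4_TYPES.contains(baseType(contentType));
    }

    public static void main(String[] args) {
        System.out.println(isMp4("video/mp4"));                 // true
        System.out.println(isMp4("application/mp4"));           // true
        System.out.println(isMp4("text/plain; charset=UTF-8")); // false
    }
}
```

This only masks the inconsistency on the consumer side; it does not change what is stored in assetMetaData.contentType.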

This is the info of the video file

image

And these are the settings that I used to record the video:

image

Sub-Tasks & Estimates

No response

@jdotcms
Contributor

jdotcms commented Mar 22, 2023

I have followed up on this issue: the MIME type is being returned by Tika. This might be resolved by:
#23934

@nollymar
Contributor

Tika was upgraded to version 2.7.0.

PRs:
dotCMS/plugin-com.dotcms.tika#2
#24684

@fabrizzio-dotCMS
Contributor

I tried several MP4 video files, and they now all have contentType: video/mp4.
I also tried uploading other media files, including PDF, JPG, GIF, WEBP, and ZIP. They all show a consistent contentType.
I noticed that some files now show a more detailed contentType. For example, site-search.vtl now shows text/velocity, where before it was text/plain; charset=UTF-8.

However, the new Tika failed to identify a .properties file as text: it shows unknown, whereas the old Tika identified it correctly.
This is not always the case, though. Most files that should be identified as text are identified correctly.

I don't think this is a big deal, so I'm passing it.

@wezell
Contributor

wezell commented May 1, 2023

Note for QA:
Tika is used for both file metadata AND full text indexing. It also drives site search. Please make sure to test:

  • that we can still search text in documents: .pdf, .doc, .ppt, .xls
  • that site search still works
  • that site search can still search documents.

@josemejias11

Approved QA - Tested on 23.06_1d391801_SNAPSHOT // Docker // macOS 13.0 // FF v113.0

@nollymar
Contributor

Internal QA: Site search is not resolving urlmap titles, so I'm putting this card back.

@bryanboza
Member

Fixed. I tested by searching for blogs in the site search index, and it now returns the blog post in the results.
Image

Tested on release-23.06 // Docker // FF
