Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added language detection for non-english youtube videos #1362

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

anantoj
Copy link

@anantoj anantoj commented May 9, 2024

Description

Invoking app.add() on non-english youtube videos would cause a NoTranscriptFound error. There is already a pull request that fixes the issue by specify an additional argument in the config file, but it has not been merged until today. Ideally app.add() also automatically detects the language in the youtube transcript, instead having to manually specify the language in the beginning.

Fixes #385

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

I simply invoked app.add() on various non-english youtube videos, and the NoTranscriptFound error no longer appears.

Please delete options that are not relevant.

  • Unit Test

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Maintainer Checklist

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Made sure Checks passed

@LeonieFreisinger
Copy link
Contributor

@anantoj thanks for providing that fix! I ran & reviewed your code locally ... it looks good to me.
However in my case, the test_youtube_video/test_load_data fails. Is this the same for you too?

@anantoj
Copy link
Author

anantoj commented May 14, 2024

@LeonieFreisinger Thanks for pointing out the issue. I was not aware that the test suite uses a dummy VIDEO_ID, which breaks the video_id extraction. There's also another issue caused a recent commit 78301ee that modifies the metadata. I fixed all the issues in my recent commit and test_youtube_video/test_load_data passed on my end. Let me know if it's all good for you

@LeonieFreisinger
Copy link
Contributor

@anantoj thanks for your fast answer!
I tested your code with this url: https://www.youtube.com/watch?v=RG7Edrbhdpk. I get an error "Could not retrieve a transcript for the video". Can you have a look into it?
Besides that, your code works fine for me. Thanks!

@anantoj
Copy link
Author

anantoj commented May 17, 2024

@LeonieFreisinger it seems that your video just doesn't have a transcript, which causes the issue.

It's also a warning raised by youtube_transcript_api instead of an error. The error, at least on my end, is a ValueError on embedchain's Youtube video loader here which is raised when the transcript is not found.

@LeonieFreisinger
Copy link
Contributor

@anantoj You are right! So it should be fine.
@deshraj @taranjeet The changes in the PR look good to me. Can you please re-trigger the workflows and merge the PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feature request: youtube video in other language
2 participants