Sanitize User Data from YTDL-generated JSON Metadata File #119

Closed
hunter0002 opened this issue Jun 7, 2020 · 21 comments

@hunter0002

hunter0002 commented Jun 7, 2020

Currently, the youtube-dl JSON uploaded to archive.org includes various metadata, including the full directory name of the video file and googlevideo.com source URLs. With default settings this leaks the user's home directory name and his/her OS, and often his/her IP address as well. tubeup also does not inform the user that the data will be stored publicly on archive.org.

Either:

  • tubeup should inform the user that identifying data is going to be made public
  • tubeup should somehow remove the identifying data from the JSON generated by youtube-dl
@hunter0002 hunter0002 changed the title from "tubeup leaks the name of the user directory (privacy problem)" to "tubeup leaks the name of the user directory in the JSON _filename field (privacy problem)" Jun 7, 2020
@vxbinaca
Collaborator

vxbinaca commented Jun 7, 2020

Also, your issue isn't correct: this isn't our problem but youtube-dl's. We don't actually create the JSON metadata; that program does.

@vxbinaca
Collaborator

vxbinaca commented Jun 7, 2020

@brandongalbraith thoughts?

@hunter0002
Author

Also your issue isn't correct

thanks, I've corrected the issue description

@vxbinaca
Collaborator

vxbinaca commented Jun 7, 2020

My testing with youtube-dl shows JSON is an all-or-nothing affair. We need the JSON to preserve important metadata, both for preservation and for the item's creation. You can't just tell youtube-dl not to collect the filename. You need to take this to them; it's not our issue. I don't want to get into the habit of post-processing metadata.

Closing, but I'm going to add a warning on the README.

@vxbinaca
Collaborator

vxbinaca commented Jun 7, 2020

This may have to do with how we point the file to a certain directory. It's not reproducing the full directory when I separately generate JSON with youtube-dl.

@jjjake Would you suggest handling this on our end (somehow), or y'all stripping directory information as a part of the deriving process?

@hunter0002
Author

hunter0002 commented Jun 7, 2020

The JSON generated by tubeup's youtube-dl download does also seem to expose some other identifying information, for example the URLs contain the user's IP address. This doesn't seem to happen for all requests.

"formats": [{"format_id": "249", "url": "[...]googlevideo.com[...]ip=xxx.xxx.xxx.xxx

@vxbinaca
Collaborator

vxbinaca commented Jun 7, 2020

The tubeup JSON does also seem to expose some other identifying information, for example the URLs contain the user's IP address, but I couldn't reproduce this with youtube-dl. Not sure why

That's not our JSON. Also, please start properly indenting examples so we can see what you're talking about.

@hunter0002
Author

hunter0002 commented Jun 7, 2020

I've tried to make the wording clearer. I'm not sure if it's related to the ip key/value being part of the URL only for certain googlevideo.com requests.

@vxbinaca
Collaborator

vxbinaca commented Jun 7, 2020

The JSON generated by tubeup's youtube-dl download does also seem to expose some other identifying information, for example the URLs contain the user's IP address, but I couldn't reproduce this with regular youtube-dl. Not sure why

"formats": [{"format_id": "249", "url": "[...]googlevideo.com[...]ip=xxx.xxx.xxx.xxx

That looks like the IP address resolved from YouTube, not yours. So not a major problem. The directory thing is a minor problem that can be handled either with a code fix (send a pull request) or on IA's end.

Hiding that IP makes it so I can't WHOIS to check for sure.

@hunter0002
Author

I can confirm it's my IP address, I tested with both searching DuckDuckGo for "ip" and running dig @resolver1.opendns.com ANY myip.opendns.com +short.
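
For anyone repeating this check, the addresses embedded in an info JSON can be pulled out and compared against the output of the `dig` command above. The following is a minimal sketch, not part of tubeup or youtube-dl: `embedded_ips` is a hypothetical helper, and it assumes the address appears as an `ip` query parameter of each `formats[].url`, as in the snippet quoted earlier in this thread.

```python
import ipaddress
from urllib.parse import parse_qs, urlsplit


def embedded_ips(info: dict) -> set:
    """Collect the distinct, valid IP literals found in `ip=` query parameters.

    Assumption: `info` is a parsed youtube-dl info JSON whose "formats"
    entries may carry a "url" string.
    """
    found = set()
    for fmt in info.get("formats", []):
        query = urlsplit(fmt.get("url") or "").query
        for value in parse_qs(query).get("ip", []):
            try:
                # Normalise the literal; this accepts both IPv4 and IPv6.
                found.add(str(ipaddress.ip_address(value)))
            except ValueError:
                pass  # not an IP literal, ignore it
    return found
```

Loading a `(video).info.json` file with `json.load` and passing the result to `embedded_ips` gives the set of addresses to compare against your own public IP.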

@vxbinaca
Collaborator

vxbinaca commented Jun 7, 2020

I concur, youtube-dl prints the public IP address and user directory of the video file. The directory thing is not a huge issue, but it is an issue. The IP, that's a larger problem.

Submit a pull request with a fix and I'll test.

@hunter0002
Author

I can't write in Python so unfortunately I'll have to leave a PR to someone else.

@vxbinaca
Collaborator

vxbinaca commented Jun 7, 2020

Neither can I. This goes beyond my skillset of minor tweaks.

@hunter0002 hunter0002 changed the title from "tubeup leaks the name of the user directory in the JSON _filename field (privacy problem)" to "tubeup leaks IP address and the name of the user directory in (video).info.json (privacy problem)" Jun 7, 2020
@brandongalbraith
Collaborator

Sorry I'm late, was in the mountains for a bit. Investigating sanitizing the JSON meta of anything that could be considered sensitive.

@vxbinaca Anyone chatting with IA patron services yet about this? The number of JSON files out there from tubeup is significant, and that personal data leaked in the metadata can't be left hanging out there forever

@brandongalbraith brandongalbraith changed the title from "tubeup leaks IP address and the name of the user directory in (video).info.json (privacy problem)" to "Sanitize User Data from YTDL-generated JSON Metadata File" Jun 10, 2020
@brandongalbraith
Collaborator

brandongalbraith commented Jun 10, 2020

@vxbinaca @hunter0002 Dropped some context in ytdl-org/youtube-dl#25576, asked if the youtube-dl folks are willing to sanitize the data (which is preferable versus future metadata spot checks and sanitization updates on our end). If not, should be trivial for us to regex out the IP addresses from the format links and drop the _filename k/v entirely.
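
A minimal sketch of what that regex approach might look like, offered as an illustration rather than as tubeup's actual code. `sanitize_info` and `IPV4_PARAM` are hypothetical names; the sketch assumes the address appears as an `ip=` query parameter in each `formats[].url` (as in the example quoted in this thread), and it only matches IPv4.

```python
import re

# Assumption: the address is the value of an `ip=` query parameter.
# The lookbehind keeps the parameter name and replaces only the IPv4 value.
IPV4_PARAM = re.compile(r"(?<=[?&]ip=)\d{1,3}(?:\.\d{1,3}){3}")


def sanitize_info(info: dict) -> dict:
    """Return a copy of the info dict without _filename and with IPv4 values masked."""
    # Drop the _filename key/value pair entirely.
    cleaned = {k: v for k, v in info.items() if k != "_filename"}
    if "formats" in cleaned:
        formats = []
        for fmt in cleaned["formats"]:
            fmt = dict(fmt)  # copy so the caller's data is left untouched
            if isinstance(fmt.get("url"), str):
                fmt["url"] = IPV4_PARAM.sub("REDACTED", fmt["url"])
            formats.append(fmt)
        cleaned["formats"] = formats
    return cleaned
```

Only the `formats` list is touched here; any other field of the info JSON that happens to carry a URL would need the same treatment.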

@vxbinaca
Collaborator

Sorry I'm late, was in the mountains for a bit. Investigating sanitizing the JSON meta of anything that could be considered sensitive.

@vxbinaca Anyone chatting with IA patron services yet about this? The number of JSON files out there from tubeup is significant, and that personal data leaked in the metadata can't be left hanging out there forever

You have to sanitize v6 IPs too, which is more complex. Doing the paths is easier.
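
One way to sidestep the IPv6 pattern-matching problem is to remove the parameter by name instead of matching the address itself. The sketch below uses only the standard library; `drop_ip_param` and `scrub` are hypothetical helpers, and the assumption that the address always travels in a query parameter called `ip` comes from the example in this thread, not from youtube-dl documentation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_ip_param(url: str, param: str = "ip") -> str:
    """Remove every `param` query pair from `url`, whatever its value looks like."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(k, v) for k, v in pairs if k != param]
    if len(kept) == len(pairs):
        return url  # nothing to remove, leave the URL byte-for-byte unchanged
    return urlunsplit(parts._replace(query=urlencode(kept)))


def scrub(value):
    """Walk a parsed JSON value and clean every http(s) URL string found in it."""
    if isinstance(value, dict):
        return {k: scrub(v) for k, v in value.items()}
    if isinstance(value, list):
        return [scrub(v) for v in value]
    if isinstance(value, str) and value.startswith(("http://", "https://")):
        return drop_ip_param(value)
    return value
```

Because the value is never inspected, IPv4 and IPv6 addresses are handled the same way. Note that `urlencode` may re-escape the remaining parameters of a URL it rewrites.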

@hunter0002
Author

I've filed a new issue (25681) since 25576 was closed, can't be reopened and is apparently being ignored

@hunter0002
Author

hunter0002 commented Jun 15, 2020

The new issue has been closed, which helpfully answers the open questions:

  • The path being different is probably tubeup's fault (or dstftw doesn't believe that it's happening)
  • The URLs will not be modified on the youtube-dl end, since the URLs don't work without the IP address parameter (a 403 is returned)

So the former will have to be changed on the tubeup end, and for the latter it would have to be decided whether or not the working URLs are worth keeping? Generating working URLs which don't contain the IP address might require e.g. sending all requests through Tor or another open proxy by default.

@vxbinaca
Collaborator

vxbinaca commented Jun 16, 2020

@hunter0002 I have some bad news for you: They need that information.

Right now it looks like, if you use Tubeup, you need to live with the reality that your public-facing IP and a path to a file will be in the metadata, unless IA processes it out. youtube-dl won't fix it, so stop making issues with youtube-dl. They will not fix it.

It's either gonna be:

  1. Fixed on our end via post-processing
  2. Done as a part of the deriving process on IA's end
  3. You live with that information being in metadata, which by the way wouldn't be a serious issue if you were using a VPS like the README recommends, especially since you'd be connecting to the box over SSH passwordlessly with well-generated keys, right?

I know I deserve a Pwnie Award for saying that. I don't care. You might not get this issue resolved is what I'm saying.

Edit: We, I, set path because I don't want morons who use this script to dump 100 gigabytes of video and metadata into the CWD. We also need a fixed directory for the archive file for the previously ripped videos.
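
If the fixed download directory has to stay, one option for post-processing is to keep `_filename` but reduce it to the bare file name before upload. This is a sketch under assumptions, not tubeup's behaviour: `strip_directory` is a hypothetical helper, and the paths used with it are made-up examples.

```python
from pathlib import PureWindowsPath


def strip_directory(info: dict) -> dict:
    """Return a copy of the info dict whose _filename holds only the file name."""
    cleaned = dict(info)
    filename = cleaned.get("_filename")
    if isinstance(filename, str):
        # PureWindowsPath treats both "/" and "\" as separators, so one call
        # handles POSIX-style and Windows-style paths regardless of host OS.
        # Caveat: a POSIX file name containing "\" would be cut at that point.
        cleaned["_filename"] = PureWindowsPath(filename).name
    return cleaned
```

This removes the home directory name and the path style that hints at the OS, while keeping the file name itself.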

@hunter0002
Author

hunter0002 commented Jun 17, 2020

I made the issue because Brandon's comment was apparently being ignored. Since the new issue was responded to very quickly we can move on from that, and I'm not going to make any further issues or comments in the tubeup repository.

I don't think it's reasonable for a developer of a command line program to expect every single user to set up a VPS even if it's best practice to do so. Especially considering that tubeup is still the easiest way to get a YouTube video to display in the Wayback Machine, I would expect the majority of tubeup users to inevitably have only uploaded a few videos each, with a very small minority of power users having uploaded the majority of videos (given that this sort of usage curve exists for most other software/projects for which data is available). And if you're only going to upload five videos, why the hell would you bother with a more complex setup like that? It might well take longer to set up the VPS than for the videos to be downloaded and then uploaded to IA.

@vxbinaca
Collaborator

vxbinaca commented Jun 17, 2020

I don't think it's reasonable for a developer of a command line program to expect every single user to set up a VPS

You could use a VPN too. Doesn't fix the file path issue but it's a start. Tor is slower than old people getting off a bus.

Either someone will come up with a way to post-process metadata or users will live with it. I'm simply too busy to learn Python to fix this. Edit: This is on top of the fact that Tubeup works with a destination website and a program with hundreds of possible source websites. This is a complex situation to deal with.

Closing, and I'm going to add a disclaimer to the front of the page that your example users never read anyway.
