Sanitize User Data from YTDL-generated JSON Metadata File #119

hunter0002 · 2020-06-07T19:09:48Z

Currently, the youtube-dl JSON uploaded to archive.org includes various metadata, including the full directory name of the video file and googlevideo.com source URLs. With default settings this leaks the user's home directory name and his/her OS, and often his/her IP address as well. tubeup also does not inform the user that the data will be stored publicly on archive.org.

Either:

tubeup should inform the user that identifying data is going to be made public
tubeup should somehow remove the identifying data from the JSON generated by youtube-dl

vxbinaca · 2020-06-07T20:19:33Z

Also your issue isn't correct; this isn't our problem but youtube-dl's. We don't actually create the JSON metadata; that program does.

vxbinaca · 2020-06-07T20:20:24Z

@brandongalbraith thoughts?

hunter0002 · 2020-06-07T20:26:20Z

Also your issue isn't correct

Thanks, I've corrected the issue description.

vxbinaca · 2020-06-07T20:28:41Z

My testing with youtube-dl shows the JSON is an all-or-nothing affair. We need the JSON to preserve important metadata, both for preservation and for the item's creation. You can't just tell youtube-dl not to collect the filename. You need to take this to them; it's not our issue. I don't want to get into the habit of post-processing metadata.

Closing, but I'm going to add a warning on the README.

vxbinaca · 2020-06-07T20:58:30Z

This may have to do with how we point the file to a certain directory. It's not reproducing the full directory when I separately generate JSON with youtube-dl.

@jjjake Would you suggest handling this on our end (somehow), or y'all stripping directory information as a part of the deriving process?

hunter0002 · 2020-06-07T21:01:55Z

The JSON generated by tubeup's youtube-dl download does also seem to expose some other identifying information, for example the URLs contain the user's IP address. This doesn't seem to happen for all requests.

"formats": [{"format_id": "249", "url": "[...]googlevideo.com[...]ip=xxx.xxx.xxx.xxx
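
For illustration, here is a minimal Python sketch of how one might check which format URLs in a youtube-dl info JSON carry an `ip` query parameter, given that it apparently does not show up for every request. The function name `formats_with_ip`, the sample metadata, and the host name are made up for this example; 203.0.113.7 is a reserved documentation address, not a real user's.

```python
from urllib.parse import urlsplit, parse_qs


def formats_with_ip(info):
    """Return (format_id, ip) pairs for format URLs that carry an `ip` query parameter."""
    hits = []
    for fmt in info.get("formats", []):
        query = parse_qs(urlsplit(fmt.get("url", "")).query)
        for ip in query.get("ip", []):
            hits.append((fmt.get("format_id"), ip))
    return hits


# Hypothetical metadata: one URL with an `ip` parameter, one without.
sample = {
    "formats": [
        {"format_id": "249",
         "url": "https://example.googlevideo.com/videoplayback?expire=1&ip=203.0.113.7&id=abc"},
        {"format_id": "18",
         "url": "https://example.googlevideo.com/videoplayback?expire=1&id=abc"},
    ]
}

print(formats_with_ip(sample))  # [('249', '203.0.113.7')]
```

This only inspects the query string, so an address embedded anywhere else in a URL would be missed.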

vxbinaca · 2020-06-07T21:02:48Z

The tubeup JSON does also seem to expose some other identifying information, for example the URLs contain the user's IP address, but I couldn't reproduce this with youtube-dl. Not sure why

That's not our JSON. Also, please start properly indenting examples so we can see what you're talking about.

hunter0002 · 2020-06-07T21:07:53Z

I've tried to make the wording clearer. I'm not sure if it's related to the ip key/value being part of the URL only for certain googlevideo.com requests.

vxbinaca · 2020-06-07T21:08:51Z

The JSON generated by tubeup's youtube-dl download does also seem to expose some other identifying information, for example the URLs contain the user's IP address, but I couldn't reproduce this with regular youtube-dl. Not sure why

"formats": [{"format_id": "249", "url": "[...]googlevideo.com[...]ip=xxx.xxx.xxx.xxx

That looks like the IP address resolved from YouTube, not yours. So not a major problem. The directory thing is a minor problem that can be handled either in a code fix (send a pull request) or on IA's end.

Hiding that IP makes it so I can't WHOIS to check for sure.

hunter0002 · 2020-06-07T21:12:23Z

I can confirm it's my IP address, I tested with both searching DuckDuckGo for "ip" and running dig @resolver1.opendns.com ANY myip.opendns.com +short.

vxbinaca · 2020-06-07T21:38:52Z

I concur, youtube-dl prints the public IP address and user directory of the video file. The directory thing is not a huge issue, but it is an issue. The IP, that's a larger problem.

Submit a pull request with a fix and I'll test.

hunter0002 · 2020-06-07T21:43:07Z

I can't write in Python so unfortunately I'll have to leave a PR to someone else.

vxbinaca · 2020-06-07T21:44:01Z

Neither can I. This goes beyond my skillset of minor tweaks.

brandongalbraith · 2020-06-10T01:51:39Z

Sorry I'm late, was in the mountains for a bit. Investigating sanitizing the JSON meta of anything that could be considered sensitive.

@vxbinaca Anyone chatting with IA patron services yet about this? The number of JSON files out there from tubeup is significant, and that personal data leaked in the metadata can't be left hanging out there forever

brandongalbraith · 2020-06-10T02:19:48Z

@vxbinaca @hunter0002 Dropped some context in ytdl-org/youtube-dl#25576, asked if the youtube-dl folks are willing to sanitize the data (which is preferable versus future metadata spot checks and sanitization updates on our end). If not, should be trivial for us to regex out the IP addresses from the format links and drop the _filename k/v entirely.
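
A rough Python sketch of what that post-processing could look like, assuming the metadata has already been loaded into a dict. The name `sanitize_info`, the `0.0.0.0` placeholder, and the sample data are assumptions for this example and not tubeup or youtube-dl code. The regex only covers dotted-quad IPv4 values of an `ip=` query parameter in the `formats` URLs.

```python
import copy
import json
import re

# Dotted-quad IPv4 only; IPv6 is deliberately out of scope for this sketch.
IPV4_PARAM = re.compile(r"([?&]ip=)\d{1,3}(?:\.\d{1,3}){3}")


def sanitize_info(info):
    """Return a copy of the metadata without `_filename` and with IPv4 `ip=` values masked."""
    clean = copy.deepcopy(info)
    clean.pop("_filename", None)  # local path of the downloaded file
    for fmt in clean.get("formats", []):
        if "url" in fmt:
            fmt["url"] = IPV4_PARAM.sub(r"\g<1>0.0.0.0", fmt["url"])
    return clean


# Hypothetical metadata; the path and host are invented for the example.
sample = {
    "_filename": "/home/alice/videos/video.mp4",
    "formats": [
        {"format_id": "249",
         "url": "https://example.googlevideo.com/videoplayback?expire=1&ip=203.0.113.7&id=abc"},
    ],
}

print(json.dumps(sanitize_info(sample), indent=2))
```

The input dict is left untouched, so the original metadata is still available locally if the item creation step needs it. Any other fields that hold URLs or paths would need the same treatment.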

vxbinaca · 2020-06-10T17:16:52Z

Sorry I'm late, was in the mountains for a bit. Investigating sanitizing the JSON meta of anything that could be considered sensitive.

@vxbinaca Anyone chatting with IA patron services yet about this? The number of JSON files out there from tubeup is significant, and that personal data leaked in the metadata can't be left hanging out there forever

You have to sanitize v6 IPs too which is more complex. Doing the paths is easier.
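
One way to sidestep a separate IPv6 regex, sketched here in Python under the assumption that the address always appears as a query parameter value: parse the query string and let the standard library's `ipaddress` module decide whether a value is an IPv4 or IPv6 address. `strip_ip_params` and the sample URLs are made up for this example.

```python
import ipaddress
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_ip_params(url):
    """Drop every query parameter whose value parses as an IPv4 or IPv6 address."""
    parts = urlsplit(url)
    kept = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        try:
            ipaddress.ip_address(value)  # raises ValueError for non-addresses
        except ValueError:
            kept.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical URLs using reserved documentation addresses.
v4 = "https://example.googlevideo.com/videoplayback?expire=1&ip=203.0.113.7&id=abc"
v6 = "https://example.googlevideo.com/videoplayback?expire=1&ip=2001%3Adb8%3A%3A1&id=abc"

print(strip_ip_params(v4))
print(strip_ip_params(v6))
```

Both calls should yield the same URL with the `ip` parameter gone. Note that `urlencode` re-encodes the remaining values, so the rest of the query string may not come back byte-for-byte identical.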

hunter0002 · 2020-06-15T20:19:56Z

I've filed a new issue (25681) since 25576 was closed, can't be reopened, and is apparently being ignored.

hunter0002 · 2020-06-15T20:30:25Z

The new issue has been closed, which helpfully answers the open questions:

The path being different is probably tubeup's fault (or dstftw doesn't believe that it's happening)
The URLs will not be modified on the youtube-dl end, since the URLs don't work without the IP address parameter (a 403 is returned)

So the former will have to be changed on the tubeup end, and for the latter it would have to be decided whether or not the working URLs are worth keeping? Generating working URLs which don't contain the IP address might require e.g. sending all requests through Tor or another open proxy by default.

vxbinaca · 2020-06-16T22:15:34Z

@hunter0002 I have some bad news for you: They need that information.

It's looking like right now, if you use Tubeup you need to live with the reality that your public facing IP and a path to a file will be in metadata. Unless IA processes it out, youtube-dl won't fix it so stop making issues with youtube-dl. They will not fix it.

It's either gonna be:

Fixed on our end via post-processing
Done as a part of a process on IA's end in the deriving process
You live with that information being in metadata which by the way if you were using a VPS like the README recommends wouldn't be a serious issue - especially since you'd be connecting to the box SSH keylessly with good generated keys, right?

I know I deserve a Pwnie Award for saying that. I don't care. You might not get this issue resolved is what I'm saying.

Edit: We, I, set path because I don't want morons who use this script to dump 100 gigabytes of video and metadata into the CWD. We also need a fixed directory for the archive file for the previously ripped videos.

hunter0002 · 2020-06-17T16:08:50Z

I made the issue because Brandon's comment was apparently being ignored. Since the new issue was responded to very quickly we can move on from that, and I'm not going to make any further issues or comments in the tubeup repository.

I don't think it's reasonable for a developer of a command line program to expect every single user to set up a VPS even if it's best practice to do so. Especially considering that tubeup is still the easiest way to get a YouTube video to display in the Wayback Machine, I would expect the majority of tubeup users to inevitably have only uploaded a few videos each, with a very small minority of power users having uploaded the majority of videos (given that this sort of usage curve exists for most other software/projects for which data is available). And if you're only going to upload five videos, why the hell would you bother with a more complex setup like that? It might well take longer to set up the VPS than for the videos to be downloaded and then uploaded to IA.

vxbinaca · 2020-06-17T22:05:38Z

I don't think it's reasonable for a developer of a command line program to expect every single user to set up a VPS

You could use a VPN too. Doesn't fix the file path issue but it's a start. Tor is slower than old people getting off a bus.

Either someone will come up with a way to post-process metadata or users will live with it. I'm simply too busy to learn Python to fix this. Edit: This is on top of the fact that Tubeup works with a destination website, and a program with hundreds of possible source websites. This is a complex situation to deal with.

Closing, and I'm going to add a disclaimer to the front of the page that your example users never read anyway.

hunter0002 changed the title ~~tubeup leaks the name of the user directory (privacy problem)~~ tubeup leaks the name of the user directory in the JSON _filename field (privacy problem) Jun 7, 2020

vxbinaca closed this as completed Jun 7, 2020

hunter0002 mentioned this issue Jun 7, 2020

JSON contains full file name (and possibly other identifying metadata) and this cannot be disabled ytdl-org/youtube-dl#25576

Closed

6 tasks

vxbinaca reopened this Jun 7, 2020

hunter0002 changed the title ~~tubeup leaks the name of the user directory in the JSON _filename field (privacy problem)~~ tubeup leaks IP address and the name of the user directory in (video).info.json (privacy problem) Jun 7, 2020

brandongalbraith changed the title ~~tubeup leaks IP address and the name of the user directory in (video).info.json (privacy problem)~~ Sanitize User Data from YTDL-generated JSON Metadata File Jun 10, 2020

vxbinaca closed this as completed Jun 17, 2020

pukkandan mentioned this issue Feb 1, 2021

Some queries regarding the code #151

Closed
