Sanitize User Data from YTDL-generated JSON Metadata File #119
Comments
Also, your issue isn't correct: this isn't our problem but youtube-dl's. We don't actually create the JSON metadata, that program does. |
@brandongalbraith thoughts? |
thanks, I've corrected the issue description |
My testing with youtube-dl shows the JSON is an all-or-nothing affair. We need the JSON to preserve important metadata, both for preservation and for the item's creation. You can't just tell youtube-dl not to collect the filename. You need to take this to them; it's not our issue. I don't want to get into the habit of post-processing metadata. Closing, but I'm going to add a warning to the README. |
This may have to do with how we point the file to a certain directory. It's not reproducing the full directory when I separately generate JSON with youtube-dl. @jjjake Would you suggest handling this on our end (somehow), or y'all stripping directory information as part of the deriving process? |
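For reference, stripping the directory on the tubeup side could look roughly like the sketch below. It is a minimal illustration, not tubeup's or youtube-dl's actual code: the key name `_filename` and the helper `strip_directory` are assumptions made for the example, and the sample paths are invented.

```python
import ntpath


def strip_directory(info, key="_filename"):
    """Return a copy of the metadata dict with only the file's base name kept.

    `key` is an assumed field name; adjust it to whatever field actually
    carries the download path in the JSON being uploaded.
    """
    cleaned = dict(info)  # shallow copy, the caller's dict is left untouched
    value = cleaned.get(key)
    if isinstance(value, str):
        # ntpath.basename splits on both "/" and "\\", so POSIX and
        # Windows style paths are handled the same way on any platform.
        cleaned[key] = ntpath.basename(value)
    return cleaned


if __name__ == "__main__":
    sample = {"_filename": "/home/alice/.tubeup/downloads/video.mp4", "title": "t"}
    print(strip_directory(sample))
```

One caveat with this approach: a POSIX file name that itself contains a backslash would be cut at that backslash, so a real fix might prefer `os.path.basename` when the path is known to be local.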
The JSON generated by tubeup's youtube-dl download does also seem to expose some other identifying information, for example the URLs contain the user's IP address. This doesn't seem to happen for all requests.
|
That's not our JSON. Also, please start properly indenting examples so we can see what you're talking about. |
I've tried to make the wording clearer. I'm not sure if it's related to the |
That looks like the IP address resolved from YouTube, not you. So not a major problem. The directory thing is a minor problem that can be handled either with a code fix (send a pull request) or on IA's end. Hiding that IP makes it so I can't WHOIS to check for sure. |
I can confirm it's my IP address, I tested with both searching DuckDuckGo for "ip" and running |
I concur, youtube-dl prints the public IP address and the directory of the video file. The directory thing is not a huge issue, but it is an issue. The IP, that's a larger problem. Submit a pull request with a fix and I'll test. |
I can't write in Python so unfortunately I'll have to leave a PR to someone else. |
Neither can I. This goes beyond my skillset of minor tweaks. |
Sorry I'm late, was in the mountains for a bit. Investigating sanitizing the JSON meta of anything that could be considered sensitive. @vxbinaca Anyone chatting with IA patron services yet about this? The number of JSON files out there from tubeup is significant, and that personal data leaked in the metadata can't be left hanging out there forever |
@vxbinaca @hunter0002 Dropped some context in ytdl-org/youtube-dl#25576, asked if the youtube-dl folks are willing to sanitize the data (which is preferable versus future metadata spot checks and sanitization updates on our end). If not, should be trivial for us to regex out the IP addresses from the format links and drop the |
You have to sanitize v6 IPs too which is more complex. Doing the paths is easier. |
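One way to sidestep the IPv4 versus IPv6 difference is to match on where the address sits in the URL rather than on what the address looks like. The sketch below assumes the address shows up either as an `ip=` query parameter or as an `/ip/<address>/` path segment in the format links; both patterns, the helper names `redact_ip` and `sanitize`, and the example host are assumptions for illustration, not something confirmed by youtube-dl. Note that a URL with the address redacted will most likely stop working as a download link.

```python
import re

# Assumed locations of the address inside a format URL.
_IP_QUERY = re.compile(r"([?&]ip=)[^&#]+")
_IP_PATH = re.compile(r"(/ip/)[^/]+")


def redact_ip(url, placeholder="REDACTED"):
    """Replace the value of an `ip` parameter or path segment in one URL."""
    url = _IP_QUERY.sub(lambda m: m.group(1) + placeholder, url)
    return _IP_PATH.sub(lambda m: m.group(1) + placeholder, url)


def sanitize(node):
    """Walk parsed JSON and redact every string that looks like a URL."""
    if isinstance(node, dict):
        return {k: sanitize(v) for k, v in node.items()}
    if isinstance(node, list):
        return [sanitize(v) for v in node]
    if isinstance(node, str) and node.startswith(("http://", "https://")):
        return redact_ip(node)
    return node


if __name__ == "__main__":
    meta = {"formats": [{"url": "https://h.example/v?expire=1&ip=203.0.113.7"}]}
    print(sanitize(meta))
```

Because the patterns never inspect the address itself, a percent-encoded IPv6 value is replaced the same way as a dotted IPv4 one.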
I've filed a new issue (25681) since 25576 was closed, can't be reopened and is apparently being ignored |
The new issue has been closed, which helpfully answers the open questions:
So the former will have to be changed on the tubeup end, and for the latter it would have to be decided whether or not the working URLs are worth keeping? Generating working URLs which don't contain the IP address might require e.g. sending all requests through Tor or another open proxy by default. |
@hunter0002 I have some bad news for you: They need that information. It's looking like right now, if you use Tubeup you need to live with the reality that your public facing IP and a path to a file will be in metadata. Unless IA processes it out, youtube-dl won't fix it so stop making issues with youtube-dl. They will not fix it. It's either gonna be:
I know I deserve a Pwnie Award for saying that. I don't care. You might not get this issue resolved is what I'm saying. Edit: We, I, set path because I don't want morons who use this script to dump 100 gigabytes of video and metadata into the CWD. We also need a fixed directory for the archive file for the previously ripped videos. |
I made the issue because Brandon's comment was apparently being ignored. Since the new issue was responded to very quickly we can move on from that, and I'm not going to make any further issues or comments in the tubeup repository. I don't think it's reasonable for a developer of a command line program to expect every single user to set up a VPS even if it's best practice to do so. Especially considering that tubeup is still the easiest way to get a YouTube video to display in the Wayback Machine, I would expect the majority of tubeup users to inevitably have only uploaded a few videos each, with a very small minority of power users having uploaded the majority of videos (given that this sort of usage curve exists for most other software/projects for which data is available). And if you're only going to upload five videos, why the hell would you bother with a more complex setup like that? It might well take longer to set up the VPS than for the videos to be downloaded and then uploaded to IA. |
You could use a VPN too. Doesn't fix the file path issue but it's a start. Tor is slower than old people getting off a bus. Either someone will come up with a way to post-process metadata or users will live with it. I'm simply too busy to learn Python to fix this. Edit: This is on top of the fact that Tubeup works with a destination website, and a program with hundreds of possible source websites. This is a complex situation to deal with. Closing, and I'm going to add a disclaimer to the front of the page that your examples users never read anyway. |
Currently, the youtube-dl JSON uploaded to archive.org includes various metadata, including the full directory name of the video file and googlevideo.com source URLs. With default settings this leaks the user's home directory name and his/her OS, and often his/her IP address as well. tubeup also does not inform the user that the data will be stored publicly on archive.org.
Either: