
dsc client with Windows and NAS #136

Closed
Gutomba opened this issue Feb 26, 2022 · 20 comments
Labels
question Further information is requested


@Gutomba

Gutomba commented Feb 26, 2022

Hi everyone

I'm trying to use the dsc export feature to get all data out of my docspell instance on my Synology NAS. I looked everywhere I could think of, but I didn't even manage to connect dsc to my NAS. I know there should be a config file somewhere, or it should be created somehow, but even with dsc --help I have no clue how to connect to my NAS using dsc on Windows.

Do you have any starting point on how to proceed?

@eikek eikek added the question Further information is requested label Feb 26, 2022
@eikek
Member

eikek commented Feb 26, 2022

Hi, I'm not sure what exactly you want to do. Where is docspell installed? Do you want to download the files to your Windows machine?

In general, dsc accepts the -d option to pass it the base URL of your docspell server, like http://mynas.home:7880 or similar; whatever you type in your browser. Of course, this depends on your network setup and on which machine you run the command.

@Gutomba
Author

Gutomba commented Feb 26, 2022

Hi eikek, thanks for your reply. Docspell is installed via Docker on my NAS. It is accessible at http://myNasIP:7880. I want to use the CLI export functionality to regularly export all documents with metadata. This is to have a sort of backup aside from the docker volume backups, and mainly to test how I would set up an exit strategy if I had to leave docspell.

I downloaded dsc.exe to my Windows machine and tried in CMD: dsc -d http://myNasIP:7880, but it just returns the general help. I also tried to log in with dsc, but that should make no sense at this point, as dsc would only point at the local Windows machine, not the NAS.

@eikek
Member

eikek commented Feb 26, 2022

dsc is a general client, that is, with just dsc -d http://myNasIp:7880 it doesn't know what you want :-) The help you mentioned says how to use it: dsc [OPTIONS] <SUBCOMMAND>, followed by a list of sub-commands you can choose from. There is also some information in this README and here; please check it out.

One of them is export, the one you want. Via dsc export --help you'll see the help for this specific command.

But you also need to decide where to run this. The dsc tool will download to the machine it is running on. If you want to download to your Windows machine, then run it there. When you later want to export to your NAS, I would recommend adding another container for this; but first I would test it interactively on some machine until it does what you want.

So… for doing the export, you must first log in via the login sub-command. Then you can run dsc export …. Here is a quick example:

# First login 
dsc -d http://myNasIP:7880 login --user YOUR_LOGIN --password YOUR_PASSWORD

# When using --*-links, I'd clean these first (I don't know how that works on Windows; the following line is for Linux…)
rm -rf /path/to/export-directory/by_*

# Then export all 
dsc -d http://myNasIP:7880 export --all --target /path/to/export-directory --link-naming name --tag-links --date-links --correspondent-links

Please see the sub-command's help for what the options mean. The --*-links options are optional and may not work on Windows. I would also recommend creating a separate user in docspell just for the export when you run it periodically.

@Gutomba
Author

Gutomba commented Feb 27, 2022

That was great help, thanks. I had already read through the links you provided. My error was that I didn't think of combining the arguments: I started with dsc -d http://myNasIP:7880 and thought dsc login --user xx --password yy had to follow. Now that's clear.

The export works, and the symlinks are also created on Windows. You suggest: # When using --*-links, I'd clean these first (don't know how that works on windows, the following line is for linux…) rm -rf /path/to/export-directory/by_*. I don't know what this is for; would you do this before every new export? I could manually delete the symlinks or find the corresponding Windows commands.

As I said, the export works, but I get an error message that does not seem to have any impact (at least none I can see): Cmd: Export: Error creating a file: Die Syntax für den Dateinamen, Verzeichnisnamen oder die Datenträgerbezeichnung ist falsch. (os error 123) (English: The syntax for the file name, directory name or volume label is incorrect). I'm not sure where this is coming from. My command was dsc -d http://myNasIP:7880 export --all --target C:\Users\myuser\path\dsc_export\ --link-naming name --tag-links --date-links --correspondent-links

But now that the export works, I wonder if I could get a single metadata.json file containing the metadata for all exported items. Or how would you process all exported items with individual metadata.json files in individual folders? What I mean is that I want to test a full migration scenario, where I need to recover all metadata as hassle-free as possible.

@eikek
Member

eikek commented Feb 27, 2022

Ah, this error is probably not good. Are you sure that everything was exported? You can run with dsc -vv -d … to turn on more logging; maybe it gives a clue. The command looks good. It could be that a name had unsupported characters…. You could try without all the --*-links and without the --link-naming option, to see if that removes the error.

The export produces a metadata.json per item in a defined directory structure. I'm not sure what exactly you want to do (where do you want to migrate to?). To process all exported items, I would traverse the directory tree and do whatever I need on each item. For example, with find export/items -name "metadata.json" you'll get all metadata.json files. From there you know that all files are in the files subdirectory and that the item id (should you need it) is the parent directory.
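The traversal described above can be sketched as a small shell loop. The layout (export/items/<itemid>/metadata.json with documents in a files/ subdirectory) is taken from the description in this thread; the demo setup at the top fakes a tiny tree so the loop has something to walk:

```shell
# Demo setup: fake a minimal export tree (layout assumed from the thread)
mkdir -p export/items/abc123/files
printf '{}' > export/items/abc123/metadata.json
printf 'pdf' > export/items/abc123/files/doc.pdf

# Walk the tree: each item directory holds metadata.json and a files/ subdir;
# the directory name is the item id.
find export/items -name metadata.json | while read -r meta; do
  itemdir=$(dirname "$meta")
  itemid=$(basename "$itemdir")
  echo "item $itemid: $(ls "$itemdir/files")"
done
```

From inside such a loop you can run whatever per-item processing the new tool needs.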

@Gutomba
Author

Gutomba commented Feb 28, 2022

The error also appears without --*-links and --link-naming. Two items, in fact, were not exported: the line Exported item: xyz.pdf does not appear in two cases. But I can see nothing useful in the log. One item has a "§" character in its name (at least in the original, not in the working item in docspell); for the other item that was not exported I can see no cause for the error.

About the migration: I don't have a migration destination yet, but in case I need to migrate, I want to be prepared.
So if I needed to migrate to another DMS or back to a simple folder system, I'd have to do this mostly manually, I think. With a full spreadsheet (Excel) I'd have the filenames, tags, correspondents and more in one big table that I could use to find and organize tags for all items. I'm not sure if this really works or is too cumbersome. But with single metadata.json files for maybe thousands of items, I'd never be able to extract metadata efficiently.

Or is there a way with partly automated commands or steps to convert the export into a normal file structure, that I don't see at this point? I don't want to lose all work that I'm about to invest in docspell some day to come.

@eikek
Member

eikek commented Feb 28, 2022

Regarding the error: I could imagine that some filesystems have problems with specific characters like §. The export only exports the original files (there is a feature request to have it export the converted PDFs instead). This would be mitigated by using the attachment id instead of the name when exporting (but that's not implemented yet). I think currently you would need to change the original file name in the database (docspell doesn't let you change the original filename), or wait until the 'export converted files' feature is there.

Regarding the migration: I think it is really hard to prepare for something you don't know, like the system you want to migrate to. The main thing, to me, is making sure you have access to all the data in a machine-readable way (because you don't want to migrate manually). The migration will then require manual steps anyway. So for me it doesn't matter at all whether the data is in an Excel file or in thousands of json files. Actually, having thousands of lines in an Excel file is much worse, because in my experience it is very hard to work with (you usually need to convert to CSV or something first). With the json files you first don't need to load everything into memory, and secondly you have the data in a machine-readable format right away. Imho it is much easier to work with the json files; I would argue you can extract data more efficiently from them than from a huge Excel file.

Or is there a way with partly automated commands or steps to convert the export into a normal file structure, that I don't see at this point? I don't want to lose all work that I'm about to invest in docspell some day to come.

You could write a script that unifies all the json files into a single file in whatever format you want. But I'm not sure why you would want this, especially when the target system you want to migrate to is unknown. The export produces a well-defined file structure. What do you mean by "normal"?
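Such a unifying script can be quite short with jq. This is only a sketch under the assumption that the export lays items out as items/<id>/metadata.json; the demo setup fakes two such items:

```shell
# Demo setup: two fake items in the assumed export layout
mkdir -p export/items/a1 export/items/b2
echo '{"id":"a1","name":"invoice"}' > export/items/a1/metadata.json
echo '{"id":"b2","name":"letter"}'  > export/items/b2/metadata.json

# Slurp (-s) every per-item metadata.json into one JSON array (requires jq)
find export/items -name metadata.json -print0 \
  | xargs -0 jq -s '.' > items.json

jq 'length' items.json   # number of items in the combined file
```

items.json then holds one array with every item's metadata, which is about as close to a "single metadata.json" as the export allows today.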

@Gutomba
Author

Gutomba commented Feb 28, 2022

Thanks for your further input.

The § character was the problem. I deleted this file within the docspell UI and then the export worked without errors. One thing to note: the other file that was missing before is now exported as well. It seems the problematic §-character file blocked the export of the (single) remaining file. Probably that's not the case, as it sounds strange.

Your thoughts about migration are very sound and drew me away from my former point of view. My idea of one full json file comes from what the paperless-ng export does (I came to docspell after comparing the two, and I like docspell). In a converted Excel file you could use filters etc. But you're right, I don't want to do a manual migration.

Now, what I mean by "normal file structure" (sorry for being fuzzy) is to simply have PDF files in the Windows (or any OS) file system, without DMS features. I'm about to have my PDF files well maintained in docspell, and if I had to leave some day (well, I hope this project lives a very long time ;) ) I don't want to end up with a ton of files without any organization. I'd then like to arrange a folder structure that, for example, includes tag names in the file names. I see that the export folder structure helps in this way, but you'd still have to extract the items into folders without symlinks and without a subfolder for each item.

Right now I wouldn't know how to do that without going crazy. That is my only concern about going with any DMS.

@eikek
Member

eikek commented Feb 28, 2022

Ah right, now I can totally see the idea with Excel: it makes it easy to change the whole data set at once. It is a good idea if you don't have too much data, I think; with more data it will be less convenient. With the json-file approach you need to script or program something, though hopefully not much. If one doesn't feel comfortable with scripting, a bunch of json files can look scary :-). A single huge one doesn't seem much better, I would say (for me personally both are fine; I tend to prefer many smaller files over one huge file, but that is probably a matter of taste :)). 🤔 Hm, I think the current metadata export really requires some sort of scripting to consume it later.

Regarding the file structure, I think I see what you are after in general; that is exactly what the --*-links options are for. They create several different symlink trees that group documents by tags/correspondents etc. Using symlinks, we still store each file just once. It is currently not possible to specify a filename pattern, but there is already a feature request for this. What kind of structure do you have in mind, or would help you? Maybe it can be added as well. (I think I don't understand your last sentences, sorry.)

Edit: regarding the failed file: this is probably a bug. The export works in batches; maybe when one file fails, the whole batch fails. Just a guess, I need to look into this.

@Gutomba
Author

Gutomba commented Feb 28, 2022

What I mean with the last sentences is that I would be overwhelmed by the export result if I had to quit using the DMS. Even with symlinks, I'd have to (re-)create a folder structure without them.

What file structure would help me? Well, this idea again comes from the paperless-ng export: I could choose how the file tree is set up, for example [correspondent]/[tag list]/[individual files]. Then I would have x folders for my x correspondents (like a bank or an insurance company); in each of those folders are subfolders named by tags or tag lists, and in those folders are the matching files. The file name would contain the issued date of the document and a title. This would not be perfect for sure, but from there I could manually sort the files and have most information about them usable. Whether this really is a good method when you have many files, I don't know; I might run into trouble then.

I see where you're coming from with individual json files, as they are machine readable if you have some script. I guess with a script it would be possible to get the result I've just suggested. Personally, I'd have to evaluate how difficult that would be for me.

I hope you don't mind me referring to paperless-ng. I like both docspell and paperless-ng. Docspell has really great features and a usability that appeals to me, and paperless-ng might be abandoned without further maintenance. Having an exit strategy available is really important, I think. I can't tell if one method is better than the other, now that you have explained your approach.

I see that there is no ready to use and worry free solution to migrate from a DMS, and I'm willing to put some work into it once I have to. But still I fear that this could be some kind of a trap for me if I failed at that. Maybe that's that. Do you have any other thoughts about this?

@eikek
Member

eikek commented Feb 28, 2022

No worries at all! I totally understand the importance of an exit strategy; I did this evaluation for myself, too, since I might also want to migrate some day. I think it's a great idea to look at the output of the export and consider whether it fits you or not; I would probably do the same. For me, having the data in some machine-readable form would allow me to shape it for the next tool. I also sometimes look at the DB schema to see if I can make sense of it: that would be another way to get the data out without relying on the application, though I see this approach is not suitable for everyone. I don't have good ideas for a general approach; the json files are a sort of middle way, a common format that can be parsed easily. I'm always open to ideas.

Regarding the folder structure: I think if #114 is done (might not be very soon), you should be able to create at least something close. It would still be a symlink tree, though. Creating directories for tags with real files in them could be problematic, since you can assign multiple tags; files would need to be duplicated. That's why symlinks are used; to me they feel much the same as "real" files. For example, if you share the directory via Samba, the user won't notice a difference. On Linux, at least, it's not hard to replace the links with a copy of their target file. (The folder structure is real; only the files at the end are symlinks. Just to make sure we're on the same page, I don't know how closely you looked at it.) The idea is that you can use the "items" folder as input to scripts, because it always has the same structure, while the symlink trees are for humans to look at.

But yes, you need to decide for yourself in the end; as you just found, a silly bug can also be a problem. With some work, you can create safety nets if needed. In general (with any self-hosted and more important application), I would try to back up not only the data but also the environment and the software binaries/packages. With docker, VMs and such things, this has become much more convenient (only testing the backups is, as always, a pain :)). The idea is that even if the software is abandoned, you can always keep using the last version, which may have served you reliably for some time (of course there are caveats: I would not expose it anymore, only use it internally). This could give you some breathing room for migrating, or you can just keep using it.
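On Linux, replacing the symlink trees with real copies is indeed a one-liner: `cp -L` follows symlinks while copying. A minimal sketch, where all the paths are made up for the demo and only the by_tag name mimics the export layout:

```shell
# Demo setup: one real file plus a symlink tree pointing at it,
# mimicking the by_* structure the export creates
mkdir -p items by_tag/invoice
printf 'pdf-bytes' > items/doc.pdf
ln -s ../../items/doc.pdf by_tag/invoice/doc.pdf

# -L dereferences symlinks, so the copy contains regular files only
cp -rL by_tag by_tag_real
```

Afterwards by_tag_real/invoice/doc.pdf is an ordinary file, independent of the items/ directory.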

@Gutomba
Author

Gutomba commented Mar 1, 2022

I'd absolutely use the software for some time, even if further development was stopped.

About the symlinks: I'll look into whether, on Windows, I can replace them with the files they point to.

I see a clear advantage of the individual json files, as they sit right next to the single files, so you don't need to rely on one big Excel file. But let's assume you came across a dsc export file structure before you first saw docspell, and now you want to put all the items and tags into docspell. Would you have an idea how to start? Docspell has an import functionality for the original paperless project, but coming from a plain export I think you would need to do everything manually. I don't expect a full manual, but if I had to do the task now, I wouldn't know where to start.

@eikek
Member

eikek commented Mar 1, 2022

This is hard to answer for the general case. I would look at what data is supported in the new tool and what ways exist to bring it in. Then I would start with one item in the items directory: there is a metadata.json containing all the metadata, and the files are in the files/ subdirectory. I would try to write a script that, given such a location, reads the json file, creates requests to the new tool to add as much as possible, and uploads the files. When I got stuck, I would ask for support. After some trial & error I would then apply it to the whole directory (maybe in 2 or 4 runs, to cover more scenarios).

For docspell specifically, you can first go through all items and upload the files. In a second pass, each id can be obtained by sending the sha256 of a file; with this id it is then possible to associate tags, correspondents etc. I would maybe start with 10 or so to see how it goes and then apply it to all. The dsc tool supports some of the operations required for this, but not all (for the rest you need to reach for a tool like curl). At some point it might support all of them, but it's still a long way off.
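A rough sketch of that second pass in shell. `dsc file-exists` is dsc's lookup-by-checksum; the raw curl route shown as an alternative is an assumption (endpoint path, auth header and $TOKEN are not verified here), so check the docspell OpenAPI documentation before relying on it:

```shell
BASE=http://myNasIP:7880

for f in export/items/*/files/*; do
  [ -f "$f" ] || continue
  # dsc looks the file up by its checksum and reports matching items
  dsc -d "$BASE" file-exists "$f"

  # alternative via the raw API (endpoint path and header are assumptions):
  sum=$(sha256sum "$f" | cut -d' ' -f1)
  curl -s -H "X-Docspell-Auth: $TOKEN" "$BASE/api/v1/sec/checkfile/$sum"
done
```

With the returned item id you could then fire further API requests to set tags and correspondents.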

About the symlinks: I'll have a look into it if I can replace them with the file that the symlink is pointed to in Windows.

If you are using a NAS, you could do the export there and share the directory via Samba. From Windows you should then be able to access and/or copy the files (i.e. the symlinks are not visible as such; they are represented as normal files, iirc).

@Gutomba
Author

Gutomba commented Mar 6, 2022

Hi, sorry, I was off for a few days and did not have proper input to answer.

create requests to the new tool to add as much as possible and upload the files.

You mean that you would request the feature to mass-import tags, or that you would request having attributes like correspondents in a tool?

In a second pass, each id can be obtained by sending a sha256 of a file

I didn't look this up, but is this a functionality of dsc to mass-extract IDs?

With this id it is then possible to associate tags, correspondents etc

This means there are commands in dsc that allow me to mass-upload e.g. tags or correspondents?

Edit: Corrected quotes

@eikek
Member

eikek commented Mar 7, 2022

No worries about being off; it is good to do that sometimes :)

I think I don't understand the questions, sorry.

You mean that you would request the feature to mass import tags, or you would request to have atrributes like correspondents in a tool?
This means there are commands in dsc that allow me to mass upload e.g. tags or corresponents?

dsc can set tags for an item, but otherwise there is not much functionality yet for changing or adding metadata. But since it is only a client to the quite comprehensive API, you can always use curl for things not yet in dsc.

I didn't look this up, but is this a functionaliy of dsc to mass extract IDs?

I'm not sure what you mean by "extracting IDs". You were asking how I would go about this for docspell: in this case you can upload files and later ask for the docspell id of each file in order to associate tags and other metadata.

@Gutomba
Author

Gutomba commented Mar 15, 2022

Sorry, I was off again :) Now I need to find time to finish my DMS project.

Basically I wanted to ask if there is a way to:

  1. Upload all files to docspell first
  2. Then extract all IDs (if needed for step 3)
  3. Then assign all available metadata in a mass upload to the uploaded files from the first step

If there is some way, then migration to docspell would be doable without too much manual effort.

I'll have a look at the API and curl, but this would be quite an unknown approach for me. I'll see whether it is too intimidating or something suitable for me to learn.

@eikek
Member

eikek commented Mar 16, 2022

Hi, and no worries :) Yes, this is exactly how it would work. You can first upload all documents, then ask for the id given the sha256 of a file, and then attach tags and other metadata. But you need to code something together, that is true. Uploading and getting the id for a file is supported in dsc; adding metadata is not really, for that you'd need to reach for curl.
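Step 1 (the bulk upload) can be done with dsc alone. A hedged sketch, assuming the export layout described earlier in the thread and that you are already logged in via `dsc login`:

```shell
BASE=http://myNasIP:7880

# Upload every document found under the per-item files/ directories.
# The "-exec ... +" form batches many files into few dsc invocations.
find export/items -path '*/files/*' -type f \
  -exec dsc -d "$BASE" upload {} +
```

Steps 2 and 3 (resolve each file's id via its sha256, then attach tags and correspondents) currently need the raw API via curl, as noted above.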

@Gutomba
Author

Gutomba commented Mar 17, 2022

Ok, thanks a lot for your help. I guess I'd have to dig deep into it and see how it would work for me. But I see that I can work with docspell, and if I had to migrate to another tool some day, there would likely be a tool that supports mass input of data; I'd just need some understanding of coding to get it done.

Do you think this is doable for a non-programmer? I know a bit of coding, but I have no feel for the skills and experience such a task needs.

@eikek
Member

eikek commented Mar 17, 2022

For a complete non-programmer it is probably a tough journey, but with a little coding experience I would say it is doable. A bit of experience in shell scripting helps, as does knowing tools like jq and curl and the JSON format. If you are experienced in other languages, say Python, even better.

The problem is that preparing something for a "future unknown tool" to process is difficult. The role of dsc is focused only on docspell (and it is still unfinished…). But how you get the data out of, or into, some tool is a problem that remains; I don't know a good way to provide something for the general case. Should a more concrete case come up, it could help to get a better feel for it.
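To give a feel for jq: extracting fields from a metadata.json is one line each. The field names here (name, tags[].name) are assumptions about the export schema made for the demo, not verified against it:

```shell
# A fake metadata.json with the assumed shape
cat > metadata.json <<'EOF'
{"name": "invoice.pdf", "tags": [{"name": "bank"}, {"name": "2022"}]}
EOF

jq -r '.name' metadata.json         # prints the item name
jq -r '.tags[].name' metadata.json  # prints one tag name per line
```

That is roughly the level of scripting a migration script would need, wrapped in a loop over the items directory.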

@Gutomba
Author

Gutomba commented Mar 18, 2022

That sounds like a good conclusion. Thanks a lot for your help and input, and for the great work you do! I'll see how I get on from here. I might have a closer look at jq, curl and Python, now that I have a real purpose to do so.

@Gutomba Gutomba closed this as completed Mar 18, 2022