input file list #55

yucongalicechen · 2024-05-10T18:37:24Z

Closes #23 (specify an inputs directory name) and #24 (check if args.input_file is specified).
File list tests passed.
For the "./input_dir" test case, we also read the two file lists in the directory. In my current commit I read the two file lists and append the data files in our input directories too. Please let me know if this is ideal.
In tools.py: I added _parse_input_paths in addition to _parse_file_list_file, and it returns a set of file paths for a file list or a single data file. list(set(input_paths)) removes the duplicates.

sbillinge · 2024-05-10T20:26:50Z

We can discuss, but this is probably undesirable behavior. I think we should only read a file_list.txt file if it is explicitly specified by the user as a file or in a list of files.


sbillinge · 2024-05-10T20:29:21Z

The reasoning is minimizing "magic"... I also removed set from the test to make the test stricter. There may be a bug that duplicates files for some reason and we wouldn't find it.
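To illustrate the point about test strictness, here is a minimal sketch showing how a set-based comparison can hide a duplicated file while a list-based comparison catches it. The path names are hypothetical and are not taken from the project's test suite.

```python
# Hypothetical paths, for illustration only.
expected_paths = ["data1.xy", "data2.xy"]
actual_paths = ["data1.xy", "data2.xy", "data2.xy"]  # simulated bug: duplicate entry

# Comparing as sets discards multiplicity, so the duplicate goes unnoticed.
loose_check_passes = set(actual_paths) == set(expected_paths)

# Comparing sorted lists keeps multiplicity, so the duplicate is caught.
strict_check_passes = sorted(actual_paths) == sorted(expected_paths)

print(loose_check_passes, strict_check_passes)
```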


yucongalicechen · 2024-05-10T20:39:52Z

I agree with what you said. Now the code has _parse_file_list_file to distinguish between a file list and a data file. I added an extra condition on glob to skip appending the file path if the name contains file_list.
In the test I had to change assert list(actual_args.input_paths).sort() == expected_paths.sort() to assert sorted(actual_args.input_paths) == sorted(expected_paths) because in the first form both sides evaluate to None (list.sort() sorts in place and returns None), so the assertion passed regardless of the contents.
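The difference between the two assertions can be seen in a small standalone sketch. The lists here are hypothetical and deliberately unequal, so a correct comparison should report a mismatch.

```python
# Two deliberately different lists (hypothetical file names).
actual = ["b.xy", "a.xy"]
expected = ["a.xy", "c.xy"]

# list.sort() sorts in place and returns None, so this compares None == None.
in_place_results_equal = list(actual).sort() == list(expected).sort()

# sorted() returns a new sorted list, so this compares the actual contents.
sorted_results_equal = sorted(actual) == sorted(expected)

print(in_place_results_equal, sorted_results_equal)
```

The first comparison is vacuously true, which is why an assertion written that way cannot fail.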

sbillinge

please see inline comments.

sbillinge · 2024-05-11T01:41:58Z

src/diffpy/labpdfproc/tools.py

@@ -28,6 +28,13 @@ def set_output_directory(args):
    return output_dir


+def _parse_file_list_file(input_path):


let's call this argument file_list_path to make it clearer

sbillinge · 2024-05-11T01:46:22Z

src/diffpy/labpdfproc/tools.py

@@ -28,6 +28,13 @@ def set_output_directory(args):
    return output_dir


+def _parse_file_list_file(input_path):
+    with open(input_path, "r") as f:
+        lines = [line.strip() for line in f]


I would move the "Path" to this line and, instead of line, call it filepath or something like that, so it reads more like file_paths = [Path(file_path.strip()) for file_path in f], which makes it clearer what the code is meant to be doing. I wonder if it is more readable if we use readlines too? Though this is ok too.
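A minimal sketch of the suggested comprehension in context follows. The function name read_file_list and the file names are hypothetical, and skipping blank lines is an added assumption that the review does not discuss.

```python
from pathlib import Path
import tempfile


def read_file_list(file_list_path):
    """Return the paths listed one per line in file_list_path."""
    with open(file_list_path, "r") as f:
        # Assumption: blank lines are skipped (not discussed in the review).
        return [Path(file_path.strip()) for file_path in f if file_path.strip()]


# Usage with a throwaway file list containing hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    file_list = Path(tmp) / "file_list.txt"
    file_list.write_text("good_data.chi\ngood_data.xy\n")
    file_paths = read_file_list(file_list)

print(file_paths)
```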

sbillinge · 2024-05-11T01:51:55Z

src/diffpy/labpdfproc/tools.py

@@ -51,16 +58,21 @@ def set_input_lists(args):
        input_path = Path(input).resolve()
        if input_path.exists():
            if input_path.is_file():
-                input_paths.append(input_path)
+                if "file_list" in input_path.name:


I think the code will be cleaner and easier to read if we just do a prefilter, so we don't change any code down here but we change the args.input up above to "expand" file_list into its list of files.

It could maybe even not be a private function but a public function that is called before set_input_file_list. We could call it expand_list_file and it just operates on strings.

yucongalicechen · 2024-05-11

If the objective is to handle duplicated files, would it be easier to use set(input_paths) at the end, so no other function is needed?

sbillinge · 2024-05-11

No, it is just about keeping the code easier to read, so it doesn't turn into spaghetti with lots of complicated conditionals handling edge cases all the way through it. It is about maintainability in the future. None of it matters that much, but it is also a kind of learning opportunity where we develop better practices in the group as a whole.

I think we will find if we do it this way the code will be easier to follow and the different tasks will be handled in a nicely separated way. For example we could call this function or not call it and the code would still work either way, but by calling it we get new functionality allowing the user to input lists of files from a text file.
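The prefilter idea can be sketched as a function that operates only on strings and expands any file list into the names it contains. This is an illustration under assumptions, not the code that was merged: the exact signature, the "file_list" name check, and the file names are all hypothetical.

```python
from pathlib import Path
import tempfile


def expand_list_file(input_names):
    """Replace each file-list entry with the names listed inside it."""
    expanded = []
    for name in input_names:
        # Assumption: a file list is recognized by "file_list" in its name.
        if "file_list" in Path(name).name:
            with open(name, "r") as f:
                expanded.extend(line.strip() for line in f if line.strip())
        else:
            expanded.append(name)
    return expanded


with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "file_list.txt"
    list_file.write_text("good_data.chi\ngood_data.xy\n")
    expanded_names = expand_list_file(["unlisted_data.txt", str(list_file)])

print(expanded_names)
```

Because the function only rewrites the list of names, downstream code that resolves files and directories does not need any file-list conditionals, and skipping the call simply disables the feature.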

yucongalicechen · 2024-05-11

I just pushed a version with expand_list_file. I think I get what you mean, but this code probably needs some more comments. expand_list_file identifies whether there are any file lists and parses each one it finds (so expand_list_file and parse_file_list both operate on strings now). That way, in set_input_lists we only have to consider whether an existing path is a file or a directory.
This is not passing one of our tests when a file in the file list is missing (i.e. it raises an error instead of skipping the file).


sbillinge

This is getting there.

But my suggestion is to move the expand.... function outside the set_input_lists function and have it operate on the args,...so args are passed in and modified args are passed back. Then later the modified args are passed into the set_input_lists for that function to do its work. tl;dr, remove the nesting of these functions.

Also, in this case I wouldn't move the file reading out of the expand... function. The code will be easier to read if we just have that read context manager inside the expand function explicitly. In summary: get rid of the private function.

Leave the functionality that removes duplicates where it is. This is a bit magic, but very mild magic and may be desirable behavior.

Thanks for the explanation of the change to the sort
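The structure suggested in this review (no nesting, args in and modified args out, file reading kept inside the expand function, duplicates removed at the end) might look roughly like the sketch below. It is an approximation under stated assumptions rather than the merged code: the attribute names args.input and args.input_paths, the error message, and the file names are illustrative.

```python
from argparse import Namespace
from pathlib import Path
import tempfile


def expand_list_file(args):
    """Expand any file list in args.input into the names it contains."""
    expanded = []
    for input_name in args.input:
        if "file_list" in Path(input_name).name:
            # The read context manager lives here, with no private helper.
            with open(input_name, "r") as f:
                expanded.extend(fp.strip() for fp in f if fp.strip())
        else:
            expanded.append(input_name)
    args.input = expanded
    return args


def set_input_lists(args):
    """Resolve args.input into a de-duplicated list of existing file paths."""
    input_paths = []
    for input_name in args.input:
        input_path = Path(input_name).resolve()
        if not input_path.exists():
            raise FileNotFoundError(f"Cannot find {input_name}.")
        if input_path.is_file():
            input_paths.append(input_path)
        else:
            # Directories contribute their files, except file lists.
            input_paths.extend(
                p for p in input_path.glob("*")
                if p.is_file() and "file_list" not in p.name
            )
    args.input_paths = list(set(input_paths))  # mild "magic": drop duplicates
    return args


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "good_data.chi").write_text("1 2\n")
    (tmp / "good_data.xy").write_text("1 2\n")
    file_list = tmp / "file_list.txt"
    file_list.write_text(f"{tmp / 'good_data.chi'}\n{tmp / 'good_data.xy'}\n")
    # good_data.chi is given twice: once directly and once via the file list.
    args = Namespace(input=[str(file_list), str(tmp / "good_data.chi")])
    args = expand_list_file(args)   # step 1: expand file lists
    args = set_input_lists(args)    # step 2: resolve and de-duplicate
    resolved_names = sorted(p.name for p in args.input_paths)

print(resolved_names)
```

The two calls are independent steps, so set_input_lists works the same whether or not expand_list_file ran first.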


yucongalicechen · 2024-05-11T17:50:38Z

I edited the code so that it calls expand_list_file first, which returns the modified args, and then set_input_lists. I changed the variable "input" to "input_name" because input() is a built-in function, so the old name might be confusing. Since we read files from file lists the same way as we do individual inputs, it raises an error if one file in a file list is invalid, so I moved the file-list case with a missing file to the bad cases.
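The resulting behavior for a file list that names a missing file can be sketched as follows. The helper resolve_input_names, the exception type, the message, and the file names are assumptions made for illustration; the merged code may differ.

```python
from pathlib import Path
import tempfile


def resolve_input_names(input_names):
    """Resolve each name to an existing path, or raise if one is missing."""
    input_paths = []
    # "input_name" rather than "input" avoids shadowing the built-in input().
    for input_name in input_names:
        input_path = Path(input_name).resolve()
        if not input_path.exists():
            raise FileNotFoundError(f"Cannot find {input_name}.")
        input_paths.append(input_path)
    return input_paths


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "good_data.chi").write_text("1 2\n")
    file_list = tmp / "file_list_with_missing_file.txt"
    file_list.write_text(f"{tmp / 'good_data.chi'}\n{tmp / 'missing_file.txt'}\n")
    listed_names = file_list.read_text().splitlines()
    try:
        resolve_input_names(listed_names)
        error_message = None
    except FileNotFoundError as e:
        # Listed files are treated like direct inputs, so a missing one is an error.
        error_message = str(e)

print(error_message)
```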

sbillinge · 2024-05-11T17:52:05Z

phew, that was a lot of work, but we got there....nice job!

sbillinge · 2024-05-11T17:53:22Z

Thanks, this is the desired behavior so all good! I merged! 🎉 S


added functionality of handling an input of a file list

99d73aa

used sorted for tests and edited input function to include file list …

93cf5b4

…only if user specifies it

sbillinge reviewed May 11, 2024


initial commit on using expand_list_file, not passing tests for file_…

a9facce

…list with missing files

sbillinge reviewed May 11, 2024


removed nesting functions and the private function, moved file list t…

13b1e4f

…est with missing file to errorous case

sbillinge merged commit c45022d into diffpy:main May 11, 2024
2 checks passed

This was referenced May 11, 2024

user input files UCs #48

Closed

Implement input function to handle skipped files and check ouputs and header #52

Open

yucongalicechen deleted the input_dir3 branch May 13, 2024 23:50
