-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
input file list #55
input file list #55
Conversation
yucongalicechen
commented
May 10, 2024
- closes specify an inputs directory name #23, check if args.input_file is specified #24.
- File list tests passed.
- For test case of "./input_dir", we also read the two file lists in the directory. In my current commit here I read the two file lists and append the data files in our input directories too. Please let me know if this is ideal.
- In tools.py: I added _parse_input_paths in addition to _parse_file_list_file and it returns a set of file paths for file list/single data file. list(set(input_paths)) removes the duplicates.
We can discuss, but this is probably undesirable behavior. I think we
should only read a file_list.txt file if it is explicitly specified by the
user as a file or in a list of files.
…On Fri, May 10, 2024, 2:37 PM Yucong (Alice) Chen ***@***.***> wrote:
- closes #23 <#23>,
#24 <#24>.
- File list tests passed.
- For test case of "./input_dir", we also read the two file lists in
the directory. In my current commit here I read the two file lists and
append the data files in our input directories too. Please let me know if
this is ideal.
- In tools.py: I added _parse_input_paths in addition to
_parse_file_list_file and it returns a set of file paths for file
list/single data file. list(set(input_paths)) removes the duplicates.
------------------------------
You can view, comment on, or merge this pull request online at:
#55
Commit Summary
- 99d73aa
<99d73aa>
added functionality of handling an input of a file list
File Changes
(3 files <https://github.com/diffpy/diffpy.labpdfproc/pull/55/files>)
- *M* src/diffpy/labpdfproc/labpdfprocapp.py
<https://github.com/diffpy/diffpy.labpdfproc/pull/55/files#diff-5dac184d5afea97c0b77258be37462d3d08aabaa60ba14882d8bf3a502d51927>
(2)
- *M* src/diffpy/labpdfproc/tests/test_tools.py
<https://github.com/diffpy/diffpy.labpdfproc/pull/55/files#diff-972495a76a6a469ac3d0228bccf570e688e77c3a5962d02ed2da0b2ffbd9fdc9>
(5)
- *M* src/diffpy/labpdfproc/tools.py
<https://github.com/diffpy/diffpy.labpdfproc/pull/55/files#diff-c982ef80888d672f298d53201af1be02db7e8713509ca9e05dbeb1bb8f71d211>
(26)
Patch Links:
- https://github.com/diffpy/diffpy.labpdfproc/pull/55.patch
- https://github.com/diffpy/diffpy.labpdfproc/pull/55.diff
—
Reply to this email directly, view it on GitHub
<#55>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABAOWUOH43BGOBOQBS4P2F3ZBUHXTAVCNFSM6AAAAABHRCMWP6VHI2DSMVQWIX3LMV43ASLTON2WKOZSGI4TAMRSGEZTIMY>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
The reasoning is minimizing "magic"...
I also removed set from the test to make the test stricter. There may be a
bug that duplicates files for some reason and we wouldn't find it.
…On Fri, May 10, 2024, 4:26 PM Simon Billinge ***@***.***> wrote:
We can discuss, but this is probably undesirable behavior. I think we
should only read a file_list.txt file if it is explicitly specified by the
user as a file or in a list of files.
On Fri, May 10, 2024, 2:37 PM Yucong (Alice) Chen <
***@***.***> wrote:
>
> - closes #23 <#23>,
> #24 <#24>.
> - File list tests passed.
> - For test case of "./input_dir", we also read the two file lists in
> the directory. In my current commit here I read the two file lists and
> append the data files in our input directories too. Please let me know if
> this is ideal.
> - In tools.py: I added _parse_input_paths in addition to
> _parse_file_list_file and it returns a set of file paths for file
> list/single data file. list(set(input_paths)) removes the duplicates.
>
> ------------------------------
> You can view, comment on, or merge this pull request online at:
>
> #55
> Commit Summary
>
> - 99d73aa
> <99d73aa>
> added functionality of handling an input of a file list
>
> File Changes
>
> (3 files <https://github.com/diffpy/diffpy.labpdfproc/pull/55/files>)
>
> - *M* src/diffpy/labpdfproc/labpdfprocapp.py
> <https://github.com/diffpy/diffpy.labpdfproc/pull/55/files#diff-5dac184d5afea97c0b77258be37462d3d08aabaa60ba14882d8bf3a502d51927>
> (2)
> - *M* src/diffpy/labpdfproc/tests/test_tools.py
> <https://github.com/diffpy/diffpy.labpdfproc/pull/55/files#diff-972495a76a6a469ac3d0228bccf570e688e77c3a5962d02ed2da0b2ffbd9fdc9>
> (5)
> - *M* src/diffpy/labpdfproc/tools.py
> <https://github.com/diffpy/diffpy.labpdfproc/pull/55/files#diff-c982ef80888d672f298d53201af1be02db7e8713509ca9e05dbeb1bb8f71d211>
> (26)
>
> Patch Links:
>
> - https://github.com/diffpy/diffpy.labpdfproc/pull/55.patch
> - https://github.com/diffpy/diffpy.labpdfproc/pull/55.diff
>
> —
> Reply to this email directly, view it on GitHub
> <#55>, or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/ABAOWUOH43BGOBOQBS4P2F3ZBUHXTAVCNFSM6AAAAABHRCMWP6VHI2DSMVQWIX3LMV43ASLTON2WKOZSGI4TAMRSGEZTIMY>
> .
> You are receiving this because you are subscribed to this thread.Message
> ID: ***@***.***>
>
|
…only if user specifies it
I agree with what you said. Now the code has _parse_file_list_file to distinguish between file list and a data file. I added another extra condition on glob to skip appending the file path if the name contains file_list. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please see inline comments.
src/diffpy/labpdfproc/tools.py
Outdated
@@ -28,6 +28,13 @@ def set_output_directory(args): | |||
return output_dir | |||
|
|||
|
|||
def _parse_file_list_file(input_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's call this argument file_list_path
to make it clearer
src/diffpy/labpdfproc/tools.py
Outdated
@@ -28,6 +28,13 @@ def set_output_directory(args): | |||
return output_dir | |||
|
|||
|
|||
def _parse_file_list_file(input_path): | |||
with open(input_path, "r") as f: | |||
lines = [line.strip() for line in f] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move the "Path" to this line and instead of line call it filepath or sthg like that, so it reads more like file_paths = [Path(file_path.strip()) for file_path in f]
which is more intelligible of what the code is meant to be doing. I wonder if it is more readable if we use readlines too? though this is ok too.
src/diffpy/labpdfproc/tools.py
Outdated
@@ -51,16 +58,21 @@ def set_input_lists(args): | |||
input_path = Path(input).resolve() | |||
if input_path.exists(): | |||
if input_path.is_file(): | |||
input_paths.append(input_path) | |||
if "file_list" in input_path.name: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the code will be cleaner and easier to read if we just do a prefilter, so we don't change any code down here but we change the args.input up above to "expand" file_list into its list of files.
It could maybe even not be a private function but a public function that is called before set_input_file_list
. We could call it expand_list_file
and it just operates on strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the objective is to handle duplicated files, would it be easier to use set(input_paths) at the end, so no other function is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, it is just keeping the code easier to read, so it doesn't get like spaghetti with lots of complicated conditionals handling edge cases all the way through it. it is about maintainability in the future. None of it matters that much, but it is also a kind of learning opportunity where we develop better practices in the group as a whole.
I think we will find if we do it this way the code will be easier to follow and the different tasks will be handled in a nicely separated way. For example we could call this function or not call it and the code would still work either way, but by calling it we get new functionality allowing the user to input lists of files from a text file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just pushed a version with expand_list_file. I think I get what you mean but this code probably needs some more comments. I have expand_list_file to identify if there are any file lists, and parse that list if so (so expand_list_file and parse_file_list both operate on string now). So that in the function for set_input_lists, we only have to consider if the existed path is a file or directory.
This is not passing one of our tests if we have a missing file in file list (i.e. it will raise an error instead of skipping it).
…list with missing files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is getting there.
But my suggestion is to move the expand....
function outside the set_input_lists
function and have it operate on the args,...so args are passed in and modified args are passed back. Then later the modified args are passed into the set_input_lists
for that function to do its work. tl;dr, remove the nesting of these functions.
Also, in this case I wouldn't move the file reading out of the expand...
function. The code will will be easier to read if we just have that read context manager inside the ewxpand
function explicitly. i.e., summary, get rid of the private function.
Leave the functionality that removes duplicates where it is. This is a bit magic, but very mild magic and may be desirable behavior.
Thanks for the explanation of the change to the sort
…est with missing file to errorous case
I edited the function so that it calls expand_list_file first to return args with modified args, and then set_input_lists. I changed the varibale "input" to "input_name" because I read that input() is a built-in function, so it might be confusing. Since we read files for file lists the same way as we do for individual inputs, it raises an error if one file in a file list is invalid, and I moved the file list case with a missing file to one of the bad cases. |
phew, that was a lot of work, but we got there....nice job! |
Thanks, this is the desired behavior so all good! I merged! 🎉
S
…On Sat, May 11, 2024 at 1:51 PM Yucong (Alice) Chen < ***@***.***> wrote:
I edited the function so that it calls expand_list_file first to return
args with modified args, and then set_input_lists. I changed the varibale
"input" to "input_name" because I read that input() is a built-in function,
so it might be confusing. Since we read files for file lists the same way
as we do for individual inputs, it raises an error if one file in a file
list is invalid, and I moved the file list case with a missing file to one
of the bad cases.
—
Reply to this email directly, view it on GitHub
<#55 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABAOWUO4WE374UG5T5GJKGTZBZLALAVCNFSM6AAAAABHRCMWP6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMBVHE3TEMRUGY>
.
You are receiving this because you commented.Message ID: <diffpy/diffpy.
***@***.***>
--
Simon Billinge
Professor,
Department of Applied Physics and Applied Mathematics
Columbia University
|