Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

input file list #55

Merged
merged 4 commits into from
May 11, 2024
Merged

input file list #55

merged 4 commits into from
May 11, 2024

Conversation

yucongalicechen
Copy link
Collaborator

  • closes specify an inputs directory name #23, check if args.input_file is specified #24.
  • File list tests passed.
  • For test case of "./input_dir", we also read the two file lists in the directory. In my current commit here I read the two file lists and append the data files in our input directories too. Please let me know if this is ideal.
  • In tools.py: I added _parse_input_paths in addition to _parse_file_list_file and it returns a set of file paths for file list/single data file. list(set(input_paths)) removes the duplicates.

@sbillinge
Copy link
Contributor

sbillinge commented May 10, 2024 via email

@sbillinge
Copy link
Contributor

sbillinge commented May 10, 2024 via email

@yucongalicechen
Copy link
Collaborator Author

I agree with what you said. Now the code has _parse_file_list_file to distinguish between file list and a data file. I added another extra condition on glob to skip appending the file path if the name contains file_list.
In the test I have to change assert list(actual_args.input_paths).sort() == expected_paths.sort() to assert sorted(actual_args.input_paths) == sorted(expected_paths) because in the first line both terms return None.

Copy link
Contributor

@sbillinge sbillinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please see inline comments.

@@ -28,6 +28,13 @@ def set_output_directory(args):
return output_dir


def _parse_file_list_file(input_path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's call this argument file_list_path to make it clearer

@@ -28,6 +28,13 @@ def set_output_directory(args):
return output_dir


def _parse_file_list_file(input_path):
with open(input_path, "r") as f:
lines = [line.strip() for line in f]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move the "Path" to this line and instead of line call it filepath or sthg like that, so it reads more like file_paths = [Path(file_path.strip()) for file_path in f] which is more intelligible of what the code is meant to be doing. I wonder if it is more readable if we use readlines too? though this is ok too.

@@ -51,16 +58,21 @@ def set_input_lists(args):
input_path = Path(input).resolve()
if input_path.exists():
if input_path.is_file():
input_paths.append(input_path)
if "file_list" in input_path.name:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code will be cleaner and easier to read if we just do a prefilter, so we don't change any code down here but we change the args.input up above to "expand" file_list into its list of files.

It could maybe even not be a private function but a public function that is called before set_input_file_list. We could call it expand_list_file and it just operates on strings.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the objective is to handle duplicated files, would it be easier to use set(input_paths) at the end, so no other function is needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, it is just keeping the code easier to read, so it doesn't get like spaghetti with lots of complicated conditionals handling edge cases all the way through it. it is about maintainability in the future. None of it matters that much, but it is also a kind of learning opportunity where we develop better practices in the group as a whole.

I think we will find if we do it this way the code will be easier to follow and the different tasks will be handled in a nicely separated way. For example we could call this function or not call it and the code would still work either way, but by calling it we get new functionality allowing the user to input lists of files from a text file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pushed a version with expand_list_file. I think I get what you mean but this code probably needs some more comments. I have expand_list_file to identify if there are any file lists, and parse that list if so (so expand_list_file and parse_file_list both operate on string now). So that in the function for set_input_lists, we only have to consider if the existed path is a file or directory.
This is not passing one of our tests if we have a missing file in file list (i.e. it will raise an error instead of skipping it).

Copy link
Contributor

@sbillinge sbillinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting there.

But my suggestion is to move the expand.... function outside the set_input_lists function and have it operate on the args,...so args are passed in and modified args are passed back. Then later the modified args are passed into the set_input_lists for that function to do its work. tl;dr, remove the nesting of these functions.

Also, in this case I wouldn't move the file reading out of the expand... function. The code will will be easier to read if we just have that read context manager inside the ewxpand function explicitly. i.e., summary, get rid of the private function.

Leave the functionality that removes duplicates where it is. This is a bit magic, but very mild magic and may be desirable behavior.

Thanks for the explanation of the change to the sort

@yucongalicechen
Copy link
Collaborator Author

I edited the function so that it calls expand_list_file first to return args with modified args, and then set_input_lists. I changed the varibale "input" to "input_name" because I read that input() is a built-in function, so it might be confusing. Since we read files for file lists the same way as we do for individual inputs, it raises an error if one file in a file list is invalid, and I moved the file list case with a missing file to one of the bad cases.

@sbillinge sbillinge merged commit c45022d into diffpy:main May 11, 2024
2 checks passed
@sbillinge
Copy link
Contributor

phew, that was a lot of work, but we got there....nice job!

@sbillinge
Copy link
Contributor

sbillinge commented May 11, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

specify an inputs directory name
2 participants