
Issue 157 warn about missing translations #564

Conversation

KeynesYouDigIt
Contributor

Closes #157

Why is this the best possible solution? Were any other approaches considered?

This approach ensures that the warning work happens alongside the work that puts in placeholders for missing translations. It avoids re-evaluating the form for missing languages and instead uses what is already implied in the form build.

What are the regression risks?

Last time I attempted to solve this issue, I assumed the existence of certain pieces of the path string and tried to build warning messages from those pieces. When the warnings function tried to parse paths that lacked what I assumed to be present, it broke. I do not believe this risk is still present, but I do assume things like path and content type are not null in this function. This feels like a safe assumption, so I don't think the risks are great.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

No

Before submitting this PR, please make sure you have:

  • included test cases for core behavior and edge cases in tests_v1
  • run nosetests and verified all tests pass
  • run black pyxform to format code
  • verified that any code or assets from external sources are properly credited in comments

@KeynesYouDigIt force-pushed the issue_157_warn_about_missing_translations branch 2 times, most recently from 8b8de98 to 5670deb on October 9, 2021 21:36
@KeynesYouDigIt
Contributor Author

@lindsay-stevens Thanks so much for that big PR fixing up the tests!!! It got CI to pass!

@lognaturel I was able to clean up and shorten this PR. Hoping it can get a look soon?

return path[path.find(":") + 1 :]

# if the ":" is not present, there may be a choice label present.
# First check for a "-"
Contributor Author

This relates to the regression that occurred on my first try
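
For context, a minimal sketch of the kind of defensive path parsing being discussed here (the path layout, function name, and fallback behaviour are assumptions for illustration, not the actual pyxform code):

def strip_language_prefix(path):
    """Hypothetical helper: return the XPath part of a translation path.

    Assumes paths look like "<lang>:<xpath>" and that a choice path may end
    with "-<choice name>"; both formats are assumptions for illustration.
    """
    colon = path.find(":")
    if colon != -1:
        # Language prefix present: keep everything after the first ":".
        return path[colon + 1:]
    dash = path.find("-")
    if dash != -1:
        # No language prefix, but a choice suffix may be present.
        return path[:dash]
    # Neither marker found: return the path unchanged instead of assuming
    # structure that may not be there (the cause of the earlier regression).
    return path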

lindsay-stevens and others added 12 commits October 23, 2021 12:59
- as noted in the comment, the check doesn't apply to questions anymore
  since PR XLSForm#543 makes pyxform always emit a label, and the only case
  that is not permissible is a guidance_hint with no label or hint.
- nesting this check inside the "control group" processing branch seems
  the only neat+easy way about this, since:
  - the context is a loop from which the various branches continue, so
    it can't check at the end,
  - each branch parses / determines the actual item type which is not
    apparent in the raw "type" data from the row dict, and
  - after the loop context exits, the data is in an array with nested
    dicts so it's no longer possible to determine the sheet row number
    for the error message to be as useful to users.
- refactored an old test which covers a lot of question types, into
  the tests_v1 / md style since this test is affected by the changes
  and having md style is more transparent
- added some tests to verify the assumption that for the label check,
  it's not going to encounter a None or empty value for "control_type".
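
A rough illustration of the label check described in the commit message above (a sketch only; the row structure, helper name, and error wording are assumptions, not the actual pyxform code):

from pyxform.errors import PyXFormError

def check_label_or_hint(row, row_number):
    # Only a guidance_hint with no label and no hint is not permissible.
    has_label = bool(row.get("label"))
    has_hint = bool(row.get("hint"))
    has_guidance = bool(row.get("guidance_hint"))
    if has_guidance and not (has_label or has_hint):
        raise PyXFormError(
            "Row {}: a guidance_hint is provided without a label or hint.".format(
                row_number
            )
        )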
@KeynesYouDigIt force-pushed the issue_157_warn_about_missing_translations branch from 5670deb to 2bd90f6 on October 23, 2021 18:59
@@ -703,8 +705,49 @@ def _add_empty_translations(self):
self._translations[lang][path] = {}
for content_type in content_types:
Contributor Author

The real "meat" of the PR is here. Attempt to figure out what column we are working with, then create the appropriate warning for that column.

@@ -326,43 +333,28 @@ def get_translations(self, default_language):
for display_element in ["label", "hint", "guidance_hint"]:
label_or_hint = self[display_element]

if (
Contributor Author

This is a small but, I think, helpful refactor that better organizes the logic in get_translations.

@KeynesYouDigIt force-pushed the issue_157_warn_about_missing_translations branch from 04fe844 to 1fa9f96 on October 23, 2021 19:16
@codecov-commenter

codecov-commenter commented Oct 23, 2021

Codecov Report

Merging #564 (ba6329e) into master (6ca3484) will decrease coverage by 0.01%.
The diff coverage is 90.32%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #564      +/-   ##
==========================================
- Coverage   84.20%   84.18%   -0.02%     
==========================================
  Files          25       25              
  Lines        3672     3687      +15     
  Branches      865      871       +6     
==========================================
+ Hits         3092     3104      +12     
- Misses        440      441       +1     
- Partials      140      142       +2     
Impacted Files Coverage Δ
pyxform/survey.py 93.40% <85.71%> (-0.39%) ⬇️
pyxform/survey_element.py 94.40% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@lindsay-stevens
Contributor

@KeynesYouDigIt If I understand the issue correctly, the goal is to warn users when there are one or more "missing" translation columns in the survey (or choices) sheet. A "missing" translation is one where that language appears for one or more of the translatable content types (label, media, etc.) specified on the form, but not for all of them. If no language specifies a certain type, there is no problem: for example, a missing French image is fine if no other language has an image column.

I would expect to find this kind of warning near the top of the processing pipeline, at the level of xls2json.py, where the parsed workbook sheets are processed separately and numerous structural validations of this nature are applied; for example, for the survey sheet here, and the choices sheet here. In particular, the function dealias_and_group_headers does something along these lines for the survey sheet. The current approach appears to be much lower in the pipeline, (possibly?) reverse engineering the output XPath, which seems likely to be error-prone to implement and fragile to maintain. Could you please clarify the current approach and whether the above xls2json.py locations were considered?

About the refactoring in SurveyElement.get_translations. Before a refactor like this I would prefer to have unit tests on the block to establish that the before/after produce the same result and ensure regressions are not introduced. At the moment the refactor does not appear to be equivalent, mainly on the point that the old code for guidance_hint only wants "guidance_hint" in self.keys() (i.e. it may be blank) but the new code wants self.get("guidance_hint") to be truthy (i.e. it's not blank). It seems a bit peripheral to this PR anyway so maybe it should be moved to a new PR?
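
To illustrate that non-equivalence with a standalone example (not pyxform code):

element = {"label": "Name?", "guidance_hint": ""}  # blank guidance_hint column

old_check = "guidance_hint" in element.keys()   # True: the key exists even though it is blank
new_check = bool(element.get("guidance_hint"))  # False: the blank value is falsy

assert old_check != new_check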

About the introduction of the internal__warnings set. I'm not clear on what this is for, since the output does not appear to be used directly. It's added to the existing warnings list in the end, so why not use warnings only?

Lastly, for me it would help with reviews if the commit messages for material changes were a bit more detailed, or included in-line comments. For example, in the commit to list on to json I'm not clear why the change is needed - what it might be fixing or improving. Or in better is not dict check, why type is considered better than isinstance - noting that some pyxform classes subclass dict, and isinstance considers subclasses while type doesn't.
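
A quick standalone illustration of that isinstance/type difference (the class name here is made up):

class SurveyElementLike(dict):
    """Stand-in for a pyxform class that subclasses dict."""

element = SurveyElementLike(label="Name?")

print(isinstance(element, dict))  # True: isinstance accepts subclasses of dict
print(type(element) is dict)      # False: type() only matches dict exactly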

@KeynesYouDigIt
Contributor Author

KeynesYouDigIt commented Oct 30, 2021

@lindsay-stevens

Thank you so much for the comment and feedback! I have been needing some guidance and collaboration badly here, as you can see 😄

Sorry in advance for so much text; the TL;DR is:

  • I will reconsider the approach and see if I can make something work more upstream. (Do you think we could move all the translation work up here?)
  • I will remove the refactor.
  • internal__warnings was added because of the approach I took, and to list on to json was added to support serializing internal__warnings.

Please feel free to @ me on Slack if you would rather decline the PR and move me to a simpler or more urgent set of work (I'll defer to you since you use this code much more in your day-to-day work), or if you want to chat more about what I've done here. I mostly only have time for this on weekends, but could get away from work on Fridays, Thursdays, and maybe Tuesdays.

Longer version -

Could you please clarify the current approach and whether the above xls2json.py locations were considered?

I considered and even preferred the upstream approach here, and to be honest now that you bring it up again I think you might be correct. But let me explain why I went downstream:

What caused me to pick this downstream approach is that we actually handle the empty translations themselves downstream (in _add_empty_translations). Since the "state" of a missing translation is detected and handled at this point, it seemed redundant to detect it and warn about it earlier. You are right that it forces us to reverse-engineer the XPath, which I thought would be easier than it turned out to be :/

I am going to spend some time today reconsidering that upstream path. What I would really like to do is move the translations upstream - when I first had that thought it sounded like too big of a change, but looking at it again I think it would at least be worth spending a few hours seeing what that would look like. Now, if we calculate missing translations upstream but only create translations downstream, that feels redundant - but I suppose it might not be (and redundancy might be better than reverse engineering?).

It seems a bit peripheral to this PR anyway so maybe it should be moved to a new PR?

It is; at my company we follow the "boy scout rule" (leave it better than you found it) as a way of packing refactors into PRs. I think the tests cover the case you mentioned, but I am happy to just back it out for simplicity's sake - maybe the "boy scout rule" doesn't make sense for this project? (Small team, less frequent updates, testing that's more situationally driven and doesn't cover small units of work like this particular chunk of code.)

so why not use warnings only?

I don't think there's a way to access that warnings object in the "guts" of where I do this work (this is another hazard of the downstream approach I am using). I do think it's a bit odd that the internal workings can't pass information back up to the user, so while adding another mutable/stateful attribute was hardly ideal, it was the quickest way I could make it work.

If I can make the more "upstream" approach work, we can remove this for now. My guess is that we will need something that achieves the same thing later, but I could be wrong. This is a pretty mature codebase and it hasn't needed it before 🤷

@KeynesYouDigIt
Contributor Author

@lindsay-stevens

Reviewing the sections you linked me to - dealias_and_group_headers is very agnostic as to what the "tokens" it works with actually are, which makes it hard to detect whether a token is a language. It also works row by row, but translations are added column by column, so we would have to review the aggregated data again.

I could certainly add something like find_uneven_translations(out_dict_array) at the end of that function, but I am still worried I'd be duplicating work with the stuff done in the Survey class (see _add_empty_translations and _setup_translations).
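
A rough sketch of what such a find_uneven_translations helper might look like (hypothetical: it assumes that after header grouping each row dict holds translated columns as nested {content_type: {lang: value}} mappings, which may not match the real out_dict_array shape):

def find_uneven_translations(out_dict_array, translatable=("label", "hint", "media")):
    """Hypothetical helper: list languages missing a content type that other
    languages provide somewhere on the sheet."""
    langs_by_type = {}
    for row in out_dict_array:
        for content_type in translatable:
            value = row.get(content_type)
            if isinstance(value, dict):
                langs_by_type.setdefault(content_type, set()).update(value.keys())
    all_langs = set().union(*langs_by_type.values()) if langs_by_type else set()
    return [
        "Language '{}' has no '{}' translations, but other languages do.".format(lang, content_type)
        for content_type, langs in langs_by_type.items()
        for lang in sorted(all_langs - langs)
    ]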

I haven't done enough work to see if it's possible, but what do you think of trying to move some or all of the work in _setup_translations and/or _add_empty_translations to this upstream point? Perhaps we could build JSON elements that are much easier to work with downstream. Before I spend time trying to code a proof of concept, is that what you were getting at with your earlier comment? Or are we strictly trying to detect unbalanced translations and move on? Again, I think that would require duplication of logic, but I am not sure.

I'll keep mulling this over; I'm still not sure what the best way forward is. LMK if you have any thoughts or if I've missed something here.

@lindsay-stevens
Contributor

Hi @KeynesYouDigIt

The dealias_and_group_headers reference was meant as more of an example of examining the XLSForm data and processing/validating. To implement this I think a new function would be appropriate, particularly since dealias_and_group_headers is used by multiple sheets (Settings, Survey, Choices, External Choices, OSM) which don't all have translatable columns.

There are finite translatable column types, as noted in this comment. So after splitting column names on ::, any columns not in a 'translatable' list / mapping could be ignored. Even if it's not currently possible to determine with all certainty what all the translatable columns are, the referenced comment identifies what are likely the most commonly used ones, so the warning would still be valuable.
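
A minimal sketch of that header-level check, working directly on the raw column names (the translatable set below is illustrative, not pyxform's definitive list):

TRANSLATABLE = {
    "label", "hint", "guidance_hint", "constraint_message",
    "required_message", "image", "audio", "video",
}

def uneven_translation_warnings(column_names):
    """Warn when a language appears for some translatable column types used on
    the sheet but not for all of them. Illustrative sketch only."""
    langs_by_column = {}
    for name in column_names:
        if "::" in name:
            column, lang = name.split("::", 1)
            if column in TRANSLATABLE:
                langs_by_column.setdefault(column, set()).add(lang)
    all_langs = set().union(*langs_by_column.values()) if langs_by_column else set()
    return [
        "Missing column '{}::{}': '{}' is translated for other languages.".format(column, lang, column)
        for column, langs in langs_by_column.items()
        for lang in sorted(all_langs - langs)
    ]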

About the duplication concerns, I think these may overlap code-wise but the purpose is different. The goal for #157 is a warning to the user, which they may ignore at their peril. The existing _add_empty_translations seems to be about structural integrity in the XForm XML. For example if I comment out the body of _add_empty_translations, then in pyxform_test_case.py set run_odk_validate = kwargs.get("run_odk_validate", True), then run all the tests, there are 5 failures from errors from ODK Validate.

I would recommend aiming to address #157 as directly as possible. If it looks like a refactor is needed or potentially beneficial, then please create a separate issue/PR to describe/discuss/implement it. My concern is that there aren't many tests around translations behaviour, so embarking on that at the same time could stymie the PR or cause regressions.

@lognaturel
Contributor

Thanks @KeynesYouDigIt for sticking with this one and exploring from multiple angles. I'd like to propose @lindsay-stevens take this one over and put up a new PR. Then you can use what you've learned to do an initial review and evaluate his approach.

As @lindsay-stevens says, if there's an area you feel really needs some refactor, please file an issue for now.

@lindsay-stevens
Contributor

@lognaturel Sounds good, I'll take a look at this in a new PR.

@KeynesYouDigIt
Contributor Author

@lindsay-stevens / @lognaturel Sounds good to me as well, thanks for all the help!

@lindsay-stevens
Contributor

Closing in favour of #571.

Successfully merging this pull request may close these issues.

Sometimes users don't apply consistent language tags across all translatable columns