Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use inclusive words in Apache Airflow project #15994

Closed
turbaszek opened this issue May 21, 2021 · 47 comments · Fixed by #23090
Closed

Use inclusive words in Apache Airflow project #15994

turbaszek opened this issue May 21, 2021 · 47 comments · Fixed by #23090
Assignees
Labels
good first issue kind:task A task that needs to be completed as part of a larger issue

Comments

@turbaszek
Copy link
Member

Description

Apache Software Foundation is discussing how we can improve inclusiveness of projects and raise awareness of conscious language. Related thread on diversity@a.o:
https://lists.apache.org/thread.html/r2d8845d9c37ac581046997d980464e8a7b6bffa6400efb0e41013171%40%3Cdiversity.apache.org%3E

Use case / motivation

We already have pre-commit check that checks for some word. However, on CLC (Conscious Language Checker) Apache Airflow seems to have problems with the following words:

  • he
  • her
  • him
  • his
  • master
  • sanity check
  • slave
  • whitelist (pylintrc)

Are you willing to submit a PR?

Related Issues

#12982 #9175

@turbaszek
Copy link
Member Author

CC @leahecole

@leahecole
Copy link
Contributor

Love it. A couple of clarifying questions

  • If I'm understanding correctly, the pre commit check is preventing the introduction of new instances of these words, but these are the words we have that were already in? (legacy non-inclusive words, I suppose? 😁 )
  • Is this issue to go through and remove said words, or to introduce more automation?

@turbaszek
Copy link
Member Author

  • If I'm understanding correctly, the pre commit check is preventing the introduction of new instances of these words, but these are the words we have that were already in? (legacy non-inclusive words, I suppose? 😁 )

That's one reason. The other one is that #12982 is still opened. Also I some words cannot be removed because are used by 3rd party tools

  • Is this issue to go through and remove said words, or to introduce more automation?

Both, when we remove the word we should add it to existing rule or create new one similar to this:

- id: language-matters
language: pygrep
name: Check for language that we do not accept as community
description: Please use "deny_list" or "allow_list" instead.
entry: "(?i)(black|white)[_-]?list"
pass_filenames: true
exclude: >
(?x)
^airflow/providers/apache/cassandra/hooks/cassandra\.py$|

@turbaszek turbaszek added kind:task A task that needs to be completed as part of a larger issue and removed kind:feature Feature Requests labels May 22, 2021
@Patil2099
Copy link

I would love to take this up @turbaszek

@turbaszek
Copy link
Member Author

@Patil2099 I assigned you to this ticket 👏

@Patil2099
Copy link

Patil2099 commented May 26, 2021

I am not an expert with regular expression. Can you help me with the regular expression? @turbaszek

@turbaszek
Copy link
Member Author

I am not an expert with regular expression. Can you help me with the regular expression? @turbaszek

Sure, what we would like to check? What words?

@Patil2099
Copy link

Patil2099 commented May 28, 2021

Sorry for the late reply, I have to edit the regex in pre-commit to solve this issue right? @turbaszek. I have to exclude all the words right? mentioned in #15994 (comment). I need a little help in that.

@ktmud
Copy link
Member

ktmud commented Jun 3, 2021

Thanks for taking on this! Is there any chance that DummyOperator can be renamed to something else, too? I've heard dummy may be a non-inclusive word, too.

@turbaszek
Copy link
Member Author

Sorry for the late reply, I have to edit the regex in pre-commit to solve this issue right? @turbaszek. I have to exclude all the words right? mentioned in #15994 (comment). I need a little help in that.

@Patil2099 yes, we should add similar hook to pre-commit-config as this one:

- id: language-matters
language: pygrep
name: Check for language that we do not accept as community
description: Please use "deny_list" or "allow_list" instead.
entry: "(?i)(black|white)[_-]?list"
pass_filenames: true
exclude: >
(?x)
^airflow/providers/apache/cassandra/hooks/cassandra\.py$|

The regexp should be like (?i)(\bmaster\b|\bhe\b|\bshe\b|\bhis\b|\bher\b) so we match exact word not substrings like HElp.
Also we will need to adjust all occurrences of non-inclusive words.

Thanks for taking on this! Is there any chance that DummyOperator can be renamed to something else, too? I've heard dummy may be a non-inclusive word, too.

@ktmud that's interesting suggestion. This would be a breaking change but I think it's worth trying. Any suggestions for a new name? CC @leahecole @kaxil @potiuk

@potiuk
Copy link
Member

potiuk commented Jun 4, 2021

Yeah. We should likely change DummyOperator but we should deprecate it not remove (and remove it in 3.0). The 'NoOpOperator' or similar could be a good name.

@eladkal
Copy link
Contributor

eladkal commented Jun 4, 2021

How about EmptyOperator or TasklessOperator (not sure if this may cause confusion as with Tasks)?

@turbaszek
Copy link
Member Author

How about EmptyOperator or TasklessOperator (not sure if this may cause confusion as with Tasks)?

I like EmptyOperator, I was wondering about NodeOperator because in most of production use cases this operator represent just a node in a DAG.

@kaxil
Copy link
Member

kaxil commented Jun 4, 2021

Hmmm.. If more people think that dummy is not an inclusive word then yes. What do others think, example Twitter Engg https://twitter.com/TwitterEng/status/1278733305190342656/photo/1 does think that "dummy value" term is non-inclusive.

How about EmptyOperator or TasklessOperator (not sure if this may cause confusion as with Tasks)?

@potiuk
Copy link
Member

potiuk commented Jun 4, 2021

Yeah. Dummy is not only non-inclusive but also well a bit dummy :). I like EmptyOperator

@uranusjr
Copy link
Member

uranusjr commented Jun 8, 2021

In my AIP-39 implementation I called a time table that never schedules (schedule_interval=None) a NullTimeTable so that's another possibility. It's probably a good idea to keep the naming consistent, so I'll change it if we decide to go with something else.

@turbaszek
Copy link
Member Author

+1 for NullOperator

@potiuk
Copy link
Member

potiuk commented Jun 8, 2021 via email

@leahecole
Copy link
Contributor

NullOperator works for me - I have also seen "Placeholder" as a substitution for dummy in our docs if you prefer that

@eladkal
Copy link
Contributor

eladkal commented Jun 11, 2021

As for dummy other than DummyOpeartor we also have TriggerRule.DUMMY

DUMMY = 'dummy'

@pateash
Copy link
Contributor

pateash commented Jul 9, 2021

+1 for NullOperator

@ashb
Copy link
Member

ashb commented Jul 15, 2021

NoOpOperator

What, not "Noperator"? 😉

+1 for NullOperator.

@eladkal
Copy link
Contributor

eladkal commented Jul 22, 2021

Opened PR #17144 to handle the Dummy trigger rule deprecation. A question was raised in the PR review if we want NULL trigger rule or something else.

@eladkal
Copy link
Contributor

eladkal commented Apr 7, 2022

PR for renaming DummyOperator: #22832

The only task left on this issue is the task that this issue was created for :)
We need a pre commit to improve inclusiveness as explained in the issue description.

@edithturn
Copy link
Contributor

edithturn commented Apr 19, 2022

@eladkal I would love to take this issue

@edithturn
Copy link
Contributor

edithturn commented Apr 19, 2022

Hello, I am taking this issue.
In this step in pre-commit, we have words with "black" and it won't pass "language matters" validation, also we have repositories links like this: - repo: https://github.com/psf/black and this: - id: trailing-whitespace in the same pre-commit file
Screenshot from 2022-04-19 10-47-31

Should we exclude .pre-commit-config.yam in validations?

Let me know if I missed something

@eladkal
Copy link
Contributor

eladkal commented Apr 19, 2022

I don't think the hook scans the file it's defined in because then you won't be able to place the words you want to scan for...?

We have language-matters hook that checks for white/black (though I don't know why we exclude the files listed in it):

- id: language-matters
language: pygrep
name: Check for language that we do not accept as community
description: Please use "deny_list" or "allow_list" instead.
entry: "(?i)(black|white)[_-]?list"
pass_filenames: true
exclude: >
(?x)
^airflow/www/fab_security/manager\.py$|
^airflow/providers/apache/cassandra/hooks/cassandra\.py$|
^airflow/providers/apache/hive/operators/hive_stats\.py$|
^airflow/providers/apache/hive/.*PROVIDER_CHANGES_*|
^airflow/providers/apache/hive/.*README\.md$|
^tests/providers/apache/cassandra/hooks/test_cassandra\.py$|
^docs/apache-airflow-providers-apache-cassandra/connections/cassandra\.rst$|
^docs/apache-airflow-providers-apache-hive/commits\.rst$|
git

@edithturn
Copy link
Contributor

edithturn commented Apr 19, 2022

This regex (?i)(black|white)[_-]?list just will find black and white followed with "list", if I am not wrong, for that reason it doesn't work completely with all words.

The bracketed characters [_-] (underscore and dash) indicate that the alphanumeric pattern must be followed by an underscore or a dash.

The correct could be this: (?ix)
entry: > (?ix) \bmaster\b| \bhe\b| \bshe\b| \bhis\b| \bher\b| \bslave\b| \bsanity\b| \bdummy\b| \bwhite\b| \bblack\b

i = Use case-insensitive matching. For more information, (edited)
x = Exclude unescaped white space from the pattern, and enable comments after a number sign (#).

I will try!

@edithturn
Copy link
Contributor

**@eladkal I am not sure why we are excluding these files. Some of them exist and some don't. Some are from "provider/docs", so I am assuming that they are being excluded because we are using third-party documentation that we cannot change after validation.

(Exist)  ^airflow/www/fab_security/manager\.py$|
(Exist)  ^airflow/providers/apache/cassandra/hooks/cassandra\.py$|
(Exist)  ^airflow/providers/apache/hive/operators/hive_stats\.py$|
(Exist)  ^.pre-commit-config\.yaml$|
(Doesn't Exist) ^airflow/providers/apache/hive/.*PROVIDER_CHANGES_*|
(Doesn't Exist)  ^airflow/providers/apache/hive/.*README\.md$|
(Exist)  ^tests/providers/apache/cassandra/hooks/test_cassandra\.py$|
(Exist)  ^docs/apache-airflow-providers-apache-cassandra/connections/cassandra\.rst$|
(Exist)  ^docs/apache-airflow-providers-apache-hive/commits\.rst$|
(Doesn't Exist) git**

@edithturn
Copy link
Contributor

edithturn commented Apr 19, 2022

It is working, should I work to change these words in CHANGELOG: master, dummy..etc?

No relative imports......................................................................Passed
Check for language that we do not accept as community....................................Failed
- hook id: language-matters
- exit code: 1

CHANGELOG.txt:358:- Deprecate dummy trigger rule in favor of always (#17144)
CHANGELOG.txt:1091:- Fix webserver exiting when gunicorn master crashes (#13518)(#13780)
CHANGELOG.txt:1176:- Fix link to Airflow master branch documentation (#13179)
CHANGELOG.txt:1232:- Fix webserver exiting when gunicorn master crashes (#13470)
CHANGELOG.txt:1603:- Move k8s executor out of contrib to closer match master (#8904)
CHANGELOG.txt:1654:- Update README to remove Python 3.8 limitation for Master (#9451)
CHANGELOG.txt:1752:- Dont schedule dummy tasks (#7880)
CHANGELOG.txt:[244](https://github.com/apache/airflow/runs/6082120417?check_suite_focus=true#step:11:244)9:- [AIRFLOW-5301] Some not-yet-available files from breeze are committed to master (#5901)
CHANGELOG.txt:2912:- [AIRFLOW-3817] Corrected task ids returned by BranchPythonOperator to match the dummy operator ids (#4659)
CHANGELOG.txt:3772:- [AIRFLOW-1314] Rebasing against master
CHANGELOG.txt:4708:- [AIRFLOW-899] Tasks in SCHEDULED state should be white in the UI instead of black
CHANGELOG.txt:4861:- [AIRFLOW-899] Tasks in SCHEDULED state should be white in the UI instead of black
CHANGELOG.txt:5058:- [AIRFLOW-425] Add white fill for null state tasks in tree view.
CHANGELOG.txt:5326:- Merge branch 'master' into hivemeta_sasl
CHANGELOG.txt:5372:- [hotfix] typo that made it in master
CHANGELOG.txt:5375:- Merge remote-tracking branch 'upstream/master' into minicluster
CHANGELOG.txt:5397:- Merge remote-tracking branch 'upstream/master'
CHANGELOG.txt:5418:- Merge branch 'airbnb/master'`

```

PR related: https://github.com/apache/airflow/pull/23090
any feedback is appreciated it 

@ashb
Copy link
Member

ashb commented Apr 19, 2022

It is working, should I work to change these words in CHANGELOG: master, dummy..etc?

No relative imports......................................................................Passed
Check for language that we do not accept as community....................................Failed
- hook id: language-matters
- exit code: 1

CHANGELOG.txt:358:- Deprecate dummy trigger rule in favor of always (#17144)

We can't reword that one - it's a notice to the users that the class has changed name.

The others were correct at the time the were written (master branch) so it doesn't make sense to change them, but we could do of we want

@edithturn
Copy link
Contributor

@ashb agree with this, then I'm going to add this file CHANGELOG to exclusions.

@leahecole
Copy link
Contributor

@edithturn thanks for taking this on! +1 to Ash's suggestion to exclude the CHANGELOG. Rewriting history may cause more confusion than it does good - as much as I'm always in favor of upgrading language, it might make it harder to tell in the changelog when this change actually occurred.

@edithturn
Copy link
Contributor

Hello! I have a doubt regarding this:

I am excluding these files:

^airflow/www/fab_security/manager.py$| ^airflow/providers/| ^tests/providers/apache/cassandra/hooks/test_cassandra.py$| ^docs/apache-airflow-providers-apache-cassandra/connections/cassandra.rst$| ^docs/apache-airflow-providers-apache-hive/commits.rst$| ^docs/*.*$| ^tests/providers/| ^.pre-commit-config\.yaml$| ^.*RELEASE_NOTES\.rst$| ^.*CHANGELOG\.txt$|^.*CHANGELOG\.rst$|

And the pre-commit is showing an error in these other files:

Screenshot from 2022-04-26 22-37-49

PR: #23090

I'm not sure how to deal with them, should I exclude them or fix them to change the terms?

@eladkal
Copy link
Contributor

eladkal commented Apr 27, 2022

We should fix what we can
For example sanity has many alternatives:
https://gist.github.com/seanmhanson/fe370c2d8bd2b3228680e38899baf5cc

@potiuk
Copy link
Member

potiuk commented Apr 27, 2022

I had no idea "sanity check" is a problem. TIL

josh-fell added a commit to josh-fell/airflow that referenced this issue Feb 22, 2023
Related: apache#15994 apache#23090

This PR addresses a few items related to inclusive language use and the CI check:

- There are several occurrences of "dummy" throughout documentation; however the current CI check for preventing non-inclusive language doesn't inspect `docs/` files. Ideally, the docs also include inclusive language.
- Even when removing `docs/` from the exclusion list, non-inclusive language was still escaping pygrep. Upon inspection, the `(?x)` inline modifier was missing from the regex (although intended in apache#23090). Adding this modifier revealed these "dummy" instances and others related non-inclusive occurrences which were previously uncaught.
- The exclusion list seemed too broad in places. There are still instances in which directories are excluded as a whole, but the list now is more tailored to non-inclusive occurrences that are beyond the purview of Airflow, history, dev/test files, etc.
josh-fell added a commit that referenced this issue Feb 22, 2023
* Fix and augment `check-for-inclusive-language` CI check

Related: #15994 #23090

This PR addresses a few items related to inclusive language use and the CI check:

- There are several occurrences of "dummy" throughout documentation; however the current CI check for preventing non-inclusive language doesn't inspect `docs/` files. Ideally, the docs also include inclusive language.
- Even when removing `docs/` from the exclusion list, non-inclusive language was still escaping pygrep. Upon inspection, the `(?x)` inline modifier was missing from the regex (although intended in #23090). Adding this modifier revealed these "dummy" instances and others related non-inclusive occurrences which were previously uncaught.
- The exclusion list seemed too broad in places. There are still instances in which directories are excluded as a whole, but the list now is more tailored to non-inclusive occurrences that are beyond the purview of Airflow, history, dev/test files, etc.

* Update links in get_pandas_df() of BigQueryHook

* dummy-command -> placeholder-command in _KubernetesDecoratedOperator
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue kind:task A task that needs to be completed as part of a larger issue
Projects
None yet