Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial implementation for fluid interface check #287

Merged
merged 11 commits into from
Dec 22, 2023

Conversation

sbrugman
Copy link
Contributor

@sbrugman sbrugman commented Sep 11, 2023

Implementation for #286

Basic code for what the implementation for the rule proposed could look like.
By design it only considers the simplest case for now: a sequential assignment to the same variable with method chaining.

I left the error code open (999). For the test cases I "mocked" the APIs, this works fine but could be cleaner.
Awaiting the discussion on the issues page, I have not extensively checked this implementation in the wild.

The rule could be extended:

  • to match assignments with statements in between that have no side effects.
def assign_multiple(df, df2):
   df = df.select("column")
   df2 = df2.select("another_column")
   df = df.withColumn("column2", F.lit("abc"))
   return df, df2
  • In the wild (e.g. pandas) there might also be examples with the slice syntax. (This requires assign)
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/ecommerce_sales_by_date.csv')
df = df.fillna('')
df = df.sort_values(by='date', ascending=False)
df['conversion_rate'] = ((df['transactions'] / df['sessions']) * 100).round(2)
df['revenuePerTransaction'] = (df['transactionRevenue'] / df['transactions']).round(2)
df['transactionsPerSession'] = (df['transactions'] / df['sessions']).round(2)
df['date'] = pd.to_datetime(df['date'])
df = df.rename(columns={'date': 'Date', 
                        'sessions': 'Sessions', 
                        'transactions': 'Transactions', 
                        'transactionRevenue': 'Revenue', 
                        'transactionsPerSession': 'Transactions Per Session', 
                        'revenuePerTransaction': 'AOV',
                        'conversion_rate': 'CR'}) 
df = df.drop(columns=['Unnamed: 0', 'Transactions Per Session'])
df.head()
  • The names do not have to match when referenced only once (e.g. if returned afterwards). (Partially covered by flake8-return R504)
def assign_multiple(df, df2):
   df = df.select("column")
   result_df = df.select("another_column")
   final_df = result_df.withColumn("column2", F.lit("abc"))
   return final_df
  • The second assignment may also be a return statement instead.
def projection(df_in: spark.DataFrame) -> spark.DataFrame:
    df = (
        df_in.select(["col1", "col2"])
        .withColumnRenamed("col1", "col1a")
    )
    return df.withColumn("col2a", spark.functions.col("col2").cast("date"))

Copy link
Owner

@dosisod dosisod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sbrugman ! Everything looks good for the most part, just a few changes and this should be good to merge.

refurb/checks/readability/fluid_interface.py Outdated Show resolved Hide resolved
refurb/checks/readability/fluid_interface.py Show resolved Hide resolved
refurb/checks/readability/fluid_interface.py Outdated Show resolved Hide resolved
refurb/checks/readability/fluid_interface.py Outdated Show resolved Hide resolved
test/data/err_999.py Outdated Show resolved Hide resolved
@sbrugman
Copy link
Contributor Author

Thanks for the review, will incorporate the changes

@sbrugman
Copy link
Contributor Author

sbrugman commented Dec 9, 2023

Finally got around to continuing on this PR. Updated the code to the next one available (184)

Handling returns could reuse logic from the assignment case, so I expanded the functionality to handle certain return statements.
Non-matching names are supported, with restrictions. For non-matching variable names, e.g. result_df = df.withColumns(...) we now check that df is not referenced anywhere else in the code block.

Also excluded multiple references to the assigned variable in the call expression itself, e.g.

def _(x):
    y = x.m()
    return y.operation(*[v for v in y])

@sbrugman sbrugman force-pushed the feat/call-chaining branch 5 times, most recently from 9c02deb to 6090977 Compare December 9, 2023 22:21
@dosisod
Copy link
Owner

dosisod commented Dec 11, 2023

Thank you for the new updates! Things look good for the most part, though this is causing one of the FURB177 test to fail:

import pathlib
from pathlib import Path

# these should match

_ = Path().resolve()
_ = Path("").resolve()  # noqa: FURB153
_ = Path(".").resolve()  # noqa: FURB153
_ = pathlib.Path().resolve()

...

FURB184 is being emitted on line 8:

test/data/err_177.py:6:5 [FURB177]: Replace `Path().resolve()` with `Path.cwd()`
test/data/err_177.py:7:5 [FURB177]: Replace `Path("").resolve()` with `Path.cwd()`
test/data/err_177.py:8:5 [FURB177]: Replace `Path(".").resolve()` with `Path.cwd()`
test/data/err_177.py:9:1 [FURB184]: Assignment statement should be chained
test/data/err_177.py:9:5 [FURB177]: Replace `Path().resolve()` with `Path.cwd()`

Though the error seems to go away if you use Path() instead of pathlib.Path(), don't know why.

@sbrugman
Copy link
Contributor Author

Thanks, I'll have a look

@sbrugman
Copy link
Contributor Author

Fixed. (The name argument was not correctly passed down in the recursion)

The example also made me think that we should exclude _ as variable name.

@dosisod
Copy link
Owner

dosisod commented Dec 18, 2023

Sorry for the late response, the holidays have been pretty chaotic this year!

The tests are passing now, but I found another false-positive:

def f(x):
    if x:
        name = "alice"
        stripped = name.strip()

    else:
        name = "bob"

    print(name)

Because name bleeds out into the next scope the name reference counter doesn't detect that name is used elsewhere. I think the easiest fix for this would be to only check top-level blocks inside function definitions of instead of checking every block.

Copy link
Owner

@dosisod dosisod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comment.

Also I just approved CI workflows for this PR, so CI is failing because of linter errors. I ran your changes locally, and the tests are passing.

@sbrugman
Copy link
Contributor Author

Thanks! I've updated the implementation to restrict it to top-level definitions, and fixed linting & typing

Copy link
Owner

@dosisod dosisod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good to me!

Also, the was_referenced property isn't being used anymore, so I'll remove that real quick and then merge this!

@dosisod dosisod merged commit 39cafb0 into dosisod:master Dec 22, 2023
3 checks passed
@sbrugman sbrugman deleted the feat/call-chaining branch December 22, 2023 09:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants