Replace Koalas with pandas API on Spark #1949

Merged
merged 23 commits into from
Mar 15, 2022

jeff-hernandez
Contributor

Closes #1864

- def replace_nan_with_flag(pdf, flag=-1):
+ def replace_nan_with_flag(pdf, flag=-1.):
Contributor Author

A pandas API on Spark Series doesn't support arrays that mix floats and integers, so the default `flag` must be a float (`-1.`), not an integer. For example:

ps.from_pandas(pd.Series([[0.0, 0.0], [7.0, 3.0], [14.0, 6.0], [-1, -1], [-1, -1]]))
TypeError: element in array field 0: DoubleType can not accept object -1 in type <class 'int'>
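
The error above comes from an integer flag sitting next to float values in the same array column. A minimal sketch of the idea behind the fix, in plain Python with nested lists standing in for the Series. `fill_missing_with_flag` is a hypothetical helper used only for illustration; it is not the featuretools `replace_nan_with_flag` implementation.

```python
import math


def fill_missing_with_flag(rows, flag=-1.0):
    """Replace None/NaN entries with a float flag so every element is a float.

    Hypothetical stand-in: plain lists are used instead of a Spark Series.
    """
    flag = float(flag)  # coerce an integer flag such as -1 to -1.0
    filled = []
    for row in rows:
        new_row = []
        for value in row:
            is_missing = value is None or (
                isinstance(value, float) and math.isnan(value)
            )
            new_row.append(flag if is_missing else float(value))
        filled.append(new_row)
    return filled


rows = [[0.0, 0.0], [7.0, 3.0], [float("nan"), None]]
filled = fill_missing_with_flag(rows, flag=-1)
# Collect the element types to confirm the arrays are homogeneous.
element_types = {type(value) for row in filled for value in row}
```

After the call, every element in `filled` is a `float`, which is the property an array column with a double element type requires.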

Comment on lines 1582 to 1584
- if isinstance(df, dd.DataFrame):
+ if isinstance(df, (dd.DataFrame, ps.DataFrame)):
      df[index] = 1
      df[index] = df[index].cumsum() - 1
Contributor Author

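The pandas API on Spark doesn't support assigning a `range` to a column, so Spark DataFrames now take the same path as Dask: fill the column with 1, then take the cumulative sum minus 1.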
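
The replacement in the diff above relies on a running count: fill the column with ones, take the cumulative sum, and subtract one. A small sketch of that arithmetic in plain Python, with a list standing in for the column. `positional_index` is a hypothetical name used only for illustration and is not part of featuretools.

```python
from itertools import accumulate


def positional_index(n_rows):
    """Build 0-based positions without assigning a range directly."""
    ones = [1] * n_rows  # mirrors: df[index] = 1
    # mirrors: df[index] = df[index].cumsum() - 1
    return [running_total - 1 for running_total in accumulate(ones)]


index_values = positional_index(5)
```

For `n_rows = 5` this yields the same values as `range(5)`.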

setup.cfg Outdated
- woodwork >= 0.8.1
+ woodwork @ git+https://github.com/alteryx/woodwork.git@migrate-to-pyspark-api#egg=woodwork
Contributor Author

We'll need to change this to the corresponding Woodwork version before merging.

@jeff-hernandez
Contributor Author

We'll need to switch the required unit tests from Koalas to Spark.

@codecov

codecov bot commented Mar 11, 2022

Codecov Report

Merging #1949 (db333c9) into main (55cb2be) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head db333c9 differs from pull request most recent head ed639ae. Consider uploading reports for the commit ed639ae to get more accurate results

@@            Coverage Diff             @@
##             main    #1949      +/-   ##
==========================================
- Coverage   98.99%   98.99%   -0.01%     
==========================================
  Files         146      146              
  Lines       16478    16437      -41     
==========================================
- Hits        16313    16271      -42     
- Misses        165      166       +1     
| Impacted Files | Coverage Δ |
| --- | --- |
| ...computational_backends/calculate_feature_matrix.py | 100.00% <100.00%> (ø) |
| ...s/computational_backends/feature_set_calculator.py | 98.69% <100.00%> (ø) |
| featuretools/computational_backends/utils.py | 96.44% <100.00%> (ø) |
| featuretools/entityset/entityset.py | 99.21% <100.00%> (-0.01%) ⬇️ |
| featuretools/entityset/serialize.py | 100.00% <100.00%> (ø) |
| ...ools/primitives/standard/aggregation_primitives.py | 96.60% <100.00%> (ø) |
| ...aturetools/primitives/standard/binary_transform.py | 100.00% <100.00%> (ø) |
| ...imitives/standard/datetime_transform_primitives.py | 100.00% <100.00%> (ø) |
| ...retools/primitives/standard/transform_primitive.py | 100.00% <100.00%> (ø) |
| featuretools/primitives/utils.py | 99.51% <100.00%> (ø) |
| ... and 23 more | |

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 55cb2be...ed639ae. Read the comment docs.

@jeff-hernandez jeff-hernandez marked this pull request as ready for review March 11, 2022 18:26
@jeff-hernandez jeff-hernandez requested a review from a team March 11, 2022 19:02
Contributor

@thehomebrewnerd thehomebrewnerd left a comment

LGTM!

@jeff-hernandez
Contributor Author

I think we'll need to merge and release the Woodwork changes before doing the same in Featuretools.

@thehomebrewnerd
Contributor

@jeff-hernandez Just a quick heads-up. The compatibility attribute for the primitives added by #1948 will need to be updated as well when you fix the release notes merge conflict.

Contributor

@thehomebrewnerd thehomebrewnerd left a comment

LGTM, assuming tests pass.

@jeff-hernandez jeff-hernandez enabled auto-merge (squash) March 15, 2022 17:55
@jeff-hernandez jeff-hernandez merged commit aa8e2e7 into main Mar 15, 2022
@thehomebrewnerd thehomebrewnerd mentioned this pull request Mar 15, 2022
@rwedge rwedge deleted the pyspark-api branch June 16, 2022 15:52
Development

Successfully merging this pull request may close these issues.

Update koalas code to pyspark pandas API instead in Featuretools
3 participants