
Spark Example for Featuretools #203

Closed
8bit-pixies opened this issue Aug 3, 2018 · 26 comments

Comments

@8bit-pixies

Bug/Feature Request Description

The notebook at https://github.com/Featuretools/predict-next-purchase/blob/master/Tutorial.ipynb and the documentation at https://docs.featuretools.com/usage_tips/scaling.html mention the ability to scale to Spark. Could an example be provided, like the one for Dask at https://github.com/Featuretools/predict-next-purchase?



@kmax12
Contributor

kmax12 commented Aug 3, 2018

@chappers Thanks for the request. This is something on our to-do list.

@kmax12 kmax12 changed the title Spark Example for Featuretools? Spark Example for Featuretools Aug 3, 2018
@ajing

ajing commented Aug 8, 2018

Is there any quick hacky way to use featuretools with spark?

@tropicalpb

I too would like to request a Spark implementation, preferably with a Java API as well.

@kmax12
Contributor

kmax12 commented Aug 13, 2018

@ajing We've had success integrating with PySpark. The basic approach is to partition your data using Spark DataFrames and then map each partition to an EntitySet and a calculate_feature_matrix(..) call. Then you can write out a feature matrix for each partition.

There is a basic outline of this approach using Dask in our Predict Next Purchase repo; see the end of the notebook here: https://github.com/Featuretools/predict-next-purchase/blob/master/Tutorial.ipynb

A more in-depth tutorial on this is coming soon, too.
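For illustration, here is a minimal sketch of that partition-and-map approach using the mock customer data. The partitioning key, the mapPartitions usage, and names like process_partition are assumptions for this sketch, not part of an official example, and column dtypes may need restoring after the Spark round-trip.

import pandas as pd
import featuretools as ft
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ft-partition-sketch").getOrCreate()

customer_df = ft.demo.load_mock_customer()['customers']

# Build the feature definitions once on the driver; no matrix is computed here
es = ft.EntitySet('customers', {'customers': (customer_df, 'customer_id')})
feature_defs = ft.dfs(entityset=es, target_entity='customers', features_only=True)

def process_partition(rows):
    # Rebuild a pandas DataFrame for this partition and compute its slice of the matrix
    pdf = pd.DataFrame([row.asDict() for row in rows])
    if pdf.empty:
        return iter([])
    part_es = ft.EntitySet('partition', {'customers': (pdf, 'customer_id')})
    fm = ft.calculate_feature_matrix(feature_defs, part_es).reset_index()
    return iter(fm.to_dict('records'))

# Partition by the entity index so each customer's rows stay together,
# then compute (and collect or write out) one feature matrix per partition
customer_sp = spark.createDataFrame(customer_df).repartition(4, 'customer_id')
feature_rows = customer_sp.rdd.mapPartitions(process_partition).collect()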

@hcz28

hcz28 commented Aug 21, 2018

I need that too, thanks

@bgweber

bgweber commented Sep 13, 2018

For applying learned features to a full data set, you can use Pandas UDFs that call calculate_feature_matrix(). With this approach, you group by a key that distributes your Spark DataFrame across nodes; portions of the DataFrame are passed as pandas DataFrames to a user-defined function, and the results are recombined into a Spark DataFrame.

Are there any plans to make dfs() work in a distributed way? Right now I'm calling toPandas() on my Spark DataFrame and doing feature generation on the driver node.

https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
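As a generic illustration of that grouped-map pattern (independent of Featuretools; the column names and the add_group_mean function below are made up for this sketch):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("pandas-udf-sketch").getOrCreate()
sdf = spark.createDataFrame([(1, 2.0), (1, 3.0), (2, 5.0)], ["group_key", "value"])

# Each group's rows arrive as a pandas DataFrame; the return value is
# recombined into a Spark DataFrame with the declared schema
@pandas_udf("group_key long, value double, value_mean double", PandasUDFType.GROUPED_MAP)
def add_group_mean(pdf):
    pdf = pdf.copy()
    pdf["value_mean"] = pdf["value"].mean()
    return pdf

result = sdf.groupby("group_key").apply(add_group_mean)

A Featuretools-specific version of this pattern appears later in this thread.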

@kmax12
Contributor

kmax12 commented Sep 14, 2018

@bgweber Thanks for outlining the approach with Pandas UDFs.

yep, there are plans for a better integration. stay tuned!

@WillKoehrsen
Contributor

There is now a Spark example notebook showing how to distribute feature engineering across a cluster of machines: Featuretools on Spark notebook. A write-up describing the approach is also available on the Feature Labs engineering blog: Featuretools on Spark article.

Hopefully this is helpful, and if anyone needs more clarification, we'd be glad to provide it!

@8bit-pixies
Author

@WillKoehrsen It certainly looks promising as a first pass. There does seem to be a fair amount of upfront work involved to ensure that each partition has the "full data" that is required to build the feature matrix. This might be tricky if the entity set has complex relationships (i.e. getting them partitioned in the first place might be a massive pain).

Happy to close this as "done" for now - might be worth testing it out there on our internal large datasets first and revisit if we have any questions.

@WillKoehrsen
Contributor

@chappers You're right, the example there is relatively simple to partition. With a more complex EntitySet (more parent-child relationships) this could be tough to implement.

If you do test out this approach, we'd enjoy hearing about it.

@bgweber

bgweber commented Oct 11, 2018

Here's a gist of the approach I'm taking; the goal is to be able to deal directly with Spark DataFrames:
https://gist.github.com/bgweber/6655508db34dffe7a63cfb95281fa700

There are a few things kept in memory on the driver node. I'm hoping I can eventually share a write-up I did on this approach, because it's a great way of mixing Python code that relies on pandas with PySpark.

@kmax12
Contributor

kmax12 commented Oct 11, 2018

@bgweber this looks like a great approach! please let us know how it works.

@8bit-pixies
Author

For the purpose of discussion, here's a version of @bgweber's approach using the Featuretools tutorial data:

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql import SparkSession
import featuretools as ft

spark = (SparkSession
    .builder
    .appName("myname")
    .getOrCreate())

# Build the feature matrix and feature definitions once on the driver;
# the driver-side output is also used to derive the Spark schema below
data = ft.demo.load_mock_customer()
customer_df = data['customers']
es = ft.EntitySet('name', {'customers': (customer_df, 'customer_id')})
output_df, feats = ft.dfs(entityset=es, target_entity='customers', max_depth=2)

# Distribute the raw data and infer the output schema for the UDF
customer_sp = spark.createDataFrame(customer_df)
target_schema = spark.createDataFrame(output_df.reset_index()).schema

# Each group arrives as a pandas DataFrame; rebuild an EntitySet per group
# and compute its slice of the feature matrix
@pandas_udf(target_schema, PandasUDFType.GROUPED_MAP)
def generate_features(df_sp):
    es = ft.EntitySet('name', {'customers': (df_sp, 'customer_id')})
    return ft.calculate_feature_matrix(feats, es).reset_index()

customer_ft = customer_sp.groupby("customer_id").apply(generate_features)
customer_ft.toPandas()

Comments: this approach will probably work well if you have a single table. We probably want something that works across the whole EntitySet (i.e. all of the tables in the mock data), not just one entity.

In my eyes, the value that Featuretools brings is that the objects you pass are relational; having to denormalise first somewhat diminishes the value of Featuretools in the first place. I think if we could have Spark working over EntitySets that would be really cool (not sure if that means a change in the Spark internals, etc.).

Of course, this is a massive step in the right direction, and I'd be keen to hear how other people use and scale Featuretools.

@FavioVazquez

I love this thread. How much do the Featuretools internals depend on pandas? I'm happy to help; the work of @chappers and @bgweber is an amazing step indeed.

@kmax12
Contributor

kmax12 commented Oct 15, 2018

@FavioVazquez Right now, the internals depend on the pandas DataFrame a little too tightly. There are also a few places where we drop down to pure Python code, which isn't amenable to a Spark DataFrame integration.

If you'd like to contribute, it probably makes sense for you to review the EntitySet class and the calculate_feature_matrix method to see how hard it'd be to adjust them to support both pandas DataFrames and Spark DataFrames.

Then we can make a plan to refactor the internals towards the goal of supporting dataframes from other libraries such as spark or dask.

@Tagar

Tagar commented Nov 13, 2018

@kmax12 would it be possible basically to implement a "SparkBackend" and reference it in calculate_feature_matrix() here:

https://github.com/Featuretools/featuretools/blob/b5c0b5e93b3a8586a731bebce2e4668c23ef135b/featuretools/computational_backends/calculate_feature_matrix.py#L174

and then update EntitySet to use an abstract dataframe too (e.g. it could be a pandas DataFrame or a Spark DataFrame)?

https://github.com/Featuretools/featuretools/blob/b5c0b5e93b3a8586a731bebce2e4668c23ef135b/featuretools/entityset/entityset.py

It seems the first step could be some refactoring to use an abstract dataframe, with an implementation for pandas DataFrames, so that a Spark DataFrame implementation could be added later.

@kmax12
Contributor

kmax12 commented Nov 13, 2018

Yep, that's the approach we had in mind, although much like we'd have a Spark backend, we'd have a SparkEntitySet.

I'd also like to refactor functionality that can be shared across backends into a BaseBackend class. For example, we have logic that decides the order in which to calculate features to avoid repeating calculations. Each backend could then subclass BaseBackend and implement the relevant functions.
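As a purely hypothetical sketch of that split (the class and method names here are illustrative, not actual Featuretools internals):

from abc import ABC, abstractmethod

class BaseBackend(ABC):
    """Backend-agnostic logic shared by all computational backends."""

    def calculate_all_features(self, features, entityset):
        # Order features so shared intermediates are only computed once,
        # then delegate each calculation to the concrete backend
        for feature in self.order_features(features):
            self.calculate_feature(feature, entityset)

    def order_features(self, features):
        # 'depth' is a stand-in for whatever ordering key the real logic uses
        return sorted(features, key=lambda f: f.depth)

    @abstractmethod
    def calculate_feature(self, feature, entityset):
        ...

class PandasBackend(BaseBackend):
    def calculate_feature(self, feature, entityset):
        # Operate on the pandas DataFrames held by a pandas EntitySet
        raise NotImplementedError

class SparkBackend(BaseBackend):
    def calculate_feature(self, feature, entityset):
        # Operate on the Spark DataFrames held by a SparkEntitySet
        raise NotImplementedError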

@Tagar

Tagar commented Nov 13, 2018

That makes sense.

Thank you @kmax12

Is this planned to be part of a particular Featuretools release?

@kmax12
Contributor

kmax12 commented Nov 13, 2018

We're working on it, but there's no ETA for putting it into the Featuretools open source package.

@sarvendras

@kmax12 @WillKoehrsen @chappers
Hi All,

Can you please provide code to partition data using Spark, and a link to a full implementation of Featuretools on Spark? We are thinking of using Featuretools for our fraud prediction system, so we just want to do a small PoC of the Spark implementation.

https://medium.com/feature-labs-engineering/featuretools-on-spark-e5aa67eaf807

I am following this link, but the data-partitioning code does not work.

Iteration through grouped partitions:

for partition, grouped in members.groupby('partition'):
    # Open file for appending
    with open(file_dir + f'p{partition}/members.csv', 'a') as f:
        # Write a new line and then the contents of the dataframe
        f.write('\n')
        grouped.to_csv(f, header=False, index=False)
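One hedged guess at why the append fails: the article creates the per-partition directories and writes a header row before this loop runs. A minimal setup sketch under that assumption (reusing members and file_dir from the snippet above):

import os

# Create each partition directory and write the header row once,
# so the append-mode loop above only has to add data rows
for partition in members['partition'].unique():
    os.makedirs(file_dir + f'p{partition}', exist_ok=True)
    members.head(0).to_csv(file_dir + f'p{partition}/members.csv', index=False)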

Thanks
Sarvendra

@lvjiujin

@chappers Hello, I see your code using PySpark's pandas_udf and Featuretools to calculate the feature matrix. Good work! But I want to ask why you first used ft.dfs to calculate the matrix; you could have added the argument features_only=True, but you didn't set it, so in fact the matrix is calculated twice. My question is: how do I get the schema from the features if I set features_only=True?

@8bit-pixies
Author

@lvjiujin I'm not a featuretools developer - best to ask the team.

@candalfigomoro

Because Featuretools is currently based on pandas, maybe it would be easier to add support for Koalas (https://github.com/databricks/koalas), which is the pandas API on Spark. With minimal changes we might be able to support Koalas DataFrames, which provide a pandas-like interface over Spark DataFrames.
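For illustration, a tiny (assumed) example of the Koalas idea, where pandas-style calls execute as Spark jobs:

import pandas as pd
import databricks.koalas as ks

# Build a Koalas DataFrame from pandas; subsequent operations run on Spark
kdf = ks.from_pandas(pd.DataFrame({"customer_id": [1, 2, 2], "amount": [10.0, 20.0, 5.0]}))
print(kdf.groupby("customer_id")["amount"].sum())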

@kmax12
Contributor

kmax12 commented Apr 9, 2020

@candalfigomoro It would be interesting to see if we can do this after we do #783.

@candalfigomoro

@kmax12 Yes, I think some of the changes performed to add Dask DataFrames compatibility could also be valid for Koalas DataFrames.

@rwedge
Contributor

rwedge commented May 14, 2020

Closing in favor of #887

@rwedge rwedge closed this as completed May 14, 2020