Merged
@@ -3,5 +3,5 @@ pytest==6.0.1
pytest-cov==2.10.1
wheel==0.33.1
fastparquet==0.5.0
featuretools==0.21.0
evalml==0.28.0
featuretools==1.4.0
evalml==0.41.0
2 changes: 1 addition & 1 deletion composeml/tests/test_featuretools.py
@@ -40,7 +40,7 @@ def test_dfs(labels):
es = ft.demo.load_mock_customer(return_entityset=True, random_seed=0)
feature_matrix, _ = ft.dfs(
entityset=es,
target_entity="customers",
target_dataframe_name="customers",
cutoff_time=labels,
cutoff_time_in_index=True,
)
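The only substantive change in this test is the keyword rename from Featuretools 1.0: `target_entity` became `target_dataframe_name`. For code that must run against either API, a minimal compatibility shim could dispatch on the installed signature. This is a sketch; `call_dfs` and `fake_dfs` are hypothetical names for illustration, not part of Featuretools:

```python
import inspect

def call_dfs(dfs_func, **kwargs):
    # Hypothetical shim: translate the pre-1.0 `target_entity` keyword
    # to the 1.x `target_dataframe_name` keyword when the callee no
    # longer accepts the old name.
    params = inspect.signature(dfs_func).parameters
    if "target_entity" in kwargs and "target_entity" not in params:
        kwargs["target_dataframe_name"] = kwargs.pop("target_entity")
    return dfs_func(**kwargs)

# Stand-in with the 1.x-style signature (not the real ft.dfs):
def fake_dfs(entityset=None, target_dataframe_name=None):
    return "dfs on " + target_dataframe_name

print(call_dfs(fake_dfs, entityset="es", target_entity="customers"))
# prints: dfs on customers
```

A shim like this is only worth it for libraries supporting both version ranges; this PR instead pins the new minimum and renames the keyword outright.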
49 changes: 26 additions & 23 deletions docs/source/examples/predict_bike_trips.ipynb
@@ -12,7 +12,9 @@
"- Feature Engineering\n",
"- Machine Learning\n",
"\n",
"In the first step, create new labels from the data by using [Compose](https://compose.alteryx.com/). In the second step, generate features for the labels by using [Featuretools](https://docs.featuretools.com/). In the third step, search for the best machine learning pipeline using [EvalML](https://evalml.alteryx.com/). After working through these steps, you should understand how to build machine learning applications for real-world problems like forecasting demand."
"In the first step, create new labels from the data by using [Compose](https://compose.alteryx.com/). In the second step, generate features for the labels by using [Featuretools](https://featuretools.alteryx.com/). In the third step, search for the best machine learning pipeline using [EvalML](https://evalml.alteryx.com/). After working through these steps, you should understand how to build machine learning applications for real-world problems like forecasting demand.\n",
"\n",
"**Note: To run this example, you need Featuretools 1.4.0 or newer and EvalML 0.41.0 or newer installed.**"
]
},
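The version floor in the note above can be checked at runtime. The sketch below assumes plain dotted numeric version strings; `meets_minimum` is a hypothetical helper, not a Featuretools or EvalML API:

```python
def meets_minimum(installed: str, required: str) -> bool:
    # Compare dotted numeric versions component-wise, e.g. "1.4.0" >= "1.4.0".
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(required)

print(meets_minimum("1.4.0", "1.4.0"))    # True  (the new pin qualifies)
print(meets_minimum("0.21.0", "1.4.0"))   # False (the old pin is too old)
print(meets_minimum("0.41.0", "0.41.0"))  # True
```

In practice one would compare `featuretools.__version__` against `"1.4.0"` the same way; note that real version strings can carry pre-release suffixes that this naive comparison does not handle.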
{
@@ -179,7 +181,7 @@
"\n",
"### Representing the Data\n",
"\n",
"Start by representing the data with an entity set. That way, you can generate features based on the relational structure of the dataset. You currently have a single table of trips where one station can have many trips. This one-to-many relationship can be represented by normalizing a station entity. The same can be done with other one-to-many relationships like weather-to-trips. Because you want to make predictions based on the station where the trips started from, you should use this station entity as the target entity for generating features. Also, you should use the stop times of the trips as the time index for generating features, since data about a trip would likely be unavailable until the trip is complete."
"Start by representing the data with an EntitySet. That way, you can generate features based on the relational structure of the dataset. You currently have a single table of trips where one station can have many trips. This one-to-many relationship can be represented by normalizing a station dataframe. The same can be done with other one-to-many relationships like weather-to-trips. Because you want to make predictions based on the station where the trips started, you should use this station dataframe as the target for generating features. Also, you should use the stop times of the trips as the time index for generating features, since data about a trip would likely be unavailable until the trip is complete."
]
},
{
@@ -190,36 +192,37 @@
"source": [
"es = ft.EntitySet('chicago_bike')\n",
"\n",
"es.entity_from_dataframe(\n",
"es.add_dataframe(\n",
" dataframe=df.reset_index(),\n",
" entity_id='trips',\n",
" dataframe_name='trips',\n",
" time_index='stoptime',\n",
" index='trip_id',\n",
")\n",
"\n",
"es.normalize_entity(\n",
" base_entity_id='trips',\n",
" new_entity_id='from_station_id',\n",
"es.normalize_dataframe(\n",
" base_dataframe_name='trips',\n",
" new_dataframe_name='from_station_id',\n",
" index='from_station_id',\n",
" make_time_index=False,\n",
")\n",
"\n",
"es.normalize_entity(\n",
" base_entity_id='trips',\n",
" new_entity_id='weather',\n",
"es.normalize_dataframe(\n",
" base_dataframe_name='trips',\n",
" new_dataframe_name='weather',\n",
" index='events',\n",
" make_time_index=False,\n",
")\n",
"\n",
"es.normalize_entity(\n",
" base_entity_id='trips',\n",
" new_entity_id='gender',\n",
"es.normalize_dataframe(\n",
" base_dataframe_name='trips',\n",
" new_dataframe_name='gender',\n",
" index='gender',\n",
" make_time_index=False,\n",
")\n",
"\n",
"es[\"trips\"][\"gender\"].interesting_values = ['Male', 'Female']\n",
"es[\"trips\"][\"events\"].interesting_values = ['tstorms']\n",
"es.add_interesting_values(dataframe_name='trips',\n",
" values={'gender': ['Male', 'Female'],\n",
" 'events': ['tstorms']})\n",
"es.plot()"
]
},
@@ -229,10 +232,10 @@
"source": [
"### Calculating the Features\n",
"\n",
"Generate features using a method called Deep Feature Synthesis (DFS). The method automatically builds features by stacking and applying mathematical operations called primitives across relationships in an entity set. The more structured an entity set is, the better DFS can leverage the relationships to generate better features. Run DFS with the following parameters:\n",
"Generate features using a method called Deep Feature Synthesis (DFS). The method automatically builds features by stacking and applying mathematical operations called primitives across relationships in an EntitySet. The more structured an EntitySet is, the better DFS can leverage the relationships to generate better features. Run DFS with the following parameters:\n",
"\n",
"- `entity_set` as the entity set we structured previously.\n",
"- `target_entity` as the station entity where the trips started from.\n",
"- `entityset` as the EntitySet we structured previously.\n",
"- `target_dataframe_name` as the station dataframe where the trips started.\n",
"- `cutoff_time` as the label times that we generated previously. The label values are appended to the feature matrix."
]
},
@@ -244,7 +247,7 @@
"source": [
"fm, fd = ft.dfs(\n",
" entityset=es,\n",
" target_entity='from_station_id',\n",
" target_dataframe_name='from_station_id',\n",
" trans_primitives=['hour', 'week', 'is_weekend'],\n",
" cutoff_time=lt,\n",
" cutoff_time_in_index=True,\n",
@@ -277,7 +280,7 @@
"outputs": [],
"source": [
"fm.reset_index(drop=True, inplace=True)\n",
"y = fm.pop('trip_count')\n",
"y = fm.ww.pop('trip_count')\n",
"\n",
"splits = evalml.preprocessing.split_data(\n",
" X=fm,\n",
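The change from `fm.pop` to `fm.ww.pop` in the lines above reflects that in Featuretools 1.x the feature matrix carries Woodwork typing information, so the label column is popped through the `ww` accessor to keep the typing schema in sync (our reading of this change). The plain-pandas behavior it mirrors is sketched below; the column names are made up for illustration:

```python
import pandas as pd

# Sketch with plain pandas: `pop` removes the label column in place and
# returns it as a Series, leaving only feature columns behind.
fm = pd.DataFrame({"feat_hour": [8, 17], "trip_count": [5, 7]})
y = fm.pop("trip_count")

print(list(fm.columns))  # ['feat_hour']
print(list(y))           # [5, 7]
```

Splitting the labels out this way is what lets `fm` and `y` be passed separately as `X` and `y` to `evalml.preprocessing.split_data`.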
@@ -431,13 +434,13 @@
"source": [
"### Next Steps\n",
"\n",
"You have completed this tutorial. You can revisit each step to explore and fine-tune the model using different parameters until it is ready for production. For more information about how to work with the features produced by Featuretools, take a look at [the Featuretools documentation](https://docs.featuretools.com/). For more information about how to work with the models produced by EvalML, take a look at [the EvalML documentation](https://evalml.alteryx.com/)."
"You have completed this tutorial. You can revisit each step to explore and fine-tune the model using different parameters until it is ready for production. For more information about how to work with the features produced by Featuretools, take a look at [the Featuretools documentation](https://featuretools.alteryx.com/). For more information about how to work with the models produced by EvalML, take a look at [the EvalML documentation](https://evalml.alteryx.com/)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -451,7 +454,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
"version": "3.8.5"
}
},
"nbformat": 4,
67 changes: 35 additions & 32 deletions docs/source/examples/predict_next_purchase.ipynb
@@ -12,7 +12,9 @@
"- Feature Engineering\n",
"- Machine Learning\n",
"\n",
"In the first step, you generate new labels from the data by using [Compose](https://compose.alteryx.com/). In the second step, you generate features for the labels by using [Featuretools](https://docs.featuretools.com/). In the third step, you search for the best machine learning pipeline by using [EvalML](https://evalml.alteryx.com/). After working through these steps, you should understand how to build machine learning applications for real-world problems like predicting consumer spending."
"In the first step, you generate new labels from the data by using [Compose](https://compose.alteryx.com/). In the second step, you generate features for the labels by using [Featuretools](https://featuretools.alteryx.com/). In the third step, you search for the best machine learning pipeline by using [EvalML](https://evalml.alteryx.com/). After working through these steps, you should understand how to build machine learning applications for real-world problems like predicting consumer spending.\n",
"\n",
"**Note: To run this example, you need Featuretools 1.4.0 or newer and EvalML 0.41.0 or newer installed.**"
]
},
{
@@ -186,7 +188,7 @@
"\n",
"### Representing the Data\n",
"\n",
"Start by representing the data with an entity set. That way, you can generate features based on the relational structure of the dataset. You currently have a single table of orders where one customer can have many orders. This one-to-many relationship can be represented by normalizing a customer entity. The same can be done for other one-to-many relationships like aisle-to-products. Because you want to make predictions based on the customer, you should use this customer entity as the target entity for generating features."
"Start by representing the data with an EntitySet. That way, you can generate features based on the relational structure of the dataset. You currently have a single table of orders where one customer can have many orders. This one-to-many relationship can be represented by normalizing a customer dataframe. The same can be done for other one-to-many relationships like aisle-to-products. Because you want to make predictions based on the customer, you should use this customer dataframe as the target for generating features."
]
},
{
@@ -197,53 +199,54 @@
"source": [
"es = ft.EntitySet('instacart')\n",
"\n",
"es.entity_from_dataframe(\n",
"es.add_dataframe(\n",
" dataframe=df.reset_index(),\n",
" entity_id='order_products',\n",
" dataframe_name='order_products',\n",
" time_index='order_time',\n",
" index='id',\n",
")\n",
"\n",
"es.normalize_entity(\n",
" base_entity_id='order_products',\n",
" new_entity_id='orders',\n",
"es.normalize_dataframe(\n",
" base_dataframe_name='order_products',\n",
" new_dataframe_name='orders',\n",
" index='order_id',\n",
" additional_variables=['user_id'],\n",
" additional_columns=['user_id'],\n",
" make_time_index=False,\n",
")\n",
"\n",
"es.normalize_entity(\n",
" base_entity_id='orders',\n",
" new_entity_id='customers',\n",
"es.normalize_dataframe(\n",
" base_dataframe_name='orders',\n",
" new_dataframe_name='customers',\n",
" index='user_id',\n",
" make_time_index=False,\n",
")\n",
"\n",
"es.normalize_entity(\n",
" base_entity_id='order_products',\n",
" new_entity_id='products',\n",
"es.normalize_dataframe(\n",
" base_dataframe_name='order_products',\n",
" new_dataframe_name='products',\n",
" index='product_id',\n",
" additional_variables=['aisle_id', 'department_id'],\n",
" additional_columns=['aisle_id', 'department_id'],\n",
" make_time_index=False,\n",
")\n",
"\n",
"es.normalize_entity(\n",
" base_entity_id='products',\n",
" new_entity_id='aisles',\n",
"es.normalize_dataframe(\n",
" base_dataframe_name='products',\n",
" new_dataframe_name='aisles',\n",
" index='aisle_id',\n",
" additional_variables=['department_id'],\n",
" additional_columns=['department_id'],\n",
" make_time_index=False,\n",
")\n",
"\n",
"es.normalize_entity(\n",
" base_entity_id='aisles',\n",
" new_entity_id='departments',\n",
"es.normalize_dataframe(\n",
" base_dataframe_name='aisles',\n",
" new_dataframe_name='departments',\n",
" index='department_id',\n",
" make_time_index=False,\n",
")\n",
"\n",
"es[\"order_products\"][\"department\"].interesting_values = ['produce']\n",
"es[\"order_products\"][\"product_name\"].interesting_values = ['Banana']\n",
"es.add_interesting_values(dataframe_name='order_products',\n",
" values={'department': ['produce'],\n",
" 'product_name': ['Banana']})\n",
"es.plot()"
]
},
@@ -253,10 +256,10 @@
"source": [
"### Calculating the Features\n",
"\n",
"Now you can generate features by using a method called Deep Feature Synthesis (DFS). That method automatically builds features by stacking and applying mathematical operations called primitives across relationships in an entity set. The more structured an entity set is, the better DFS can leverage the relationships to generate better features. Let’s run DFS using the following parameters:\n",
"Now you can generate features by using a method called Deep Feature Synthesis (DFS). That method automatically builds features by stacking and applying mathematical operations called primitives across relationships in an EntitySet. The more structured an EntitySet is, the better DFS can leverage the relationships to generate better features. Let’s run DFS using the following parameters:\n",
"\n",
"- `entity_set` as the entity set we structured previously.\n",
"- `target_entity` as the customer entity.\n",
"- `entityset` as the EntitySet we structured previously.\n",
"- `target_dataframe_name` as the customer dataframe.\n",
"- `cutoff_time` as the label times that we generated previously. The label values are appended to the feature matrix."
]
},
@@ -268,7 +271,7 @@
"source": [
"fm, fd = ft.dfs(\n",
" entityset=es,\n",
" target_entity='customers',\n",
" target_dataframe_name='customers',\n",
" cutoff_time=lt,\n",
" cutoff_time_in_index=True,\n",
" include_cutoff_time=False,\n",
@@ -300,7 +303,7 @@
"outputs": [],
"source": [
"fm.reset_index(drop=True, inplace=True)\n",
"y = fm.pop('bought_product')\n",
"y = fm.ww.pop('bought_product')\n",
"\n",
"splits = evalml.preprocessing.split_data(\n",
" X=fm,\n",
@@ -454,7 +457,7 @@
"source": [
"### Next Steps\n",
"\n",
"You have completed this tutorial. You can revisit each step to explore and fine-tune the model using different parameters until it is ready for production. For more information about how to work with the features produced by Featuretools, take a look at [the Featuretools documentation](https://docs.featuretools.com/). For more information about how to work with the models produced by EvalML, take a look at [the EvalML documentation](https://evalml.alteryx.com/)."
"You have completed this tutorial. You can revisit each step to explore and fine-tune the model using different parameters until it is ready for production. For more information about how to work with the features produced by Featuretools, take a look at [the Featuretools documentation](https://featuretools.alteryx.com/). For more information about how to work with the models produced by EvalML, take a look at [the EvalML documentation](https://evalml.alteryx.com/)."
]
}
],
@@ -465,7 +468,7 @@
"notebook_metadata_filter": "-all"
},
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
Expand All @@ -479,7 +482,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
"version": "3.8.5"
}
},
"nbformat": 4,