# Tuning Deep Feature Synthesis

There are several parameters that can be tuned to change the output of DFS.

In [None]:
import featuretools as ft
es = ft.demo.load_mock_customer(return_entityset=True)
es

## Using "Seed Features"

Seed features are manually defined, problem-specific features that a user provides to DFS. Deep Feature Synthesis will then automatically stack new features on top of these features when it can.

By using seed features, we can include domain-specific knowledge in feature engineering automation.

In [None]:
# Build the seed feature from the Woodwork-typed `amount` column of `transactions`
expensive_purchase = ft.Feature(es["transactions"].ww["amount"]) > 125

feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="customers",
                                      agg_primitives=["percent_true"],
                                      seed_features=[expensive_purchase])
feature_matrix[['PERCENT_TRUE(transactions.amount > 125)']]

We can now see that the ``PERCENT_TRUE`` primitive was automatically applied to the boolean `expensive_purchase` Feature from the `transactions` table.
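To make the stacking concrete, here is a minimal pure-Python sketch of what `PERCENT_TRUE` computes once it is applied to a boolean seed feature such as `amount > 125`. The transaction rows and the helper `percent_true_by_customer` are hypothetical and only illustrate the idea; this is not how Featuretools implements the primitive.

```python
from collections import defaultdict

# Hypothetical (customer_id, amount) rows standing in for the transactions table.
transactions = [
    (1, 150.0), (1, 20.0), (1, 130.0), (1, 90.0),
    (2, 10.0), (2, 200.0),
    (3, 50.0),
]


def percent_true_by_customer(rows, threshold=125):
    """Fraction of each customer's transactions whose amount exceeds threshold."""
    flags = defaultdict(list)
    for customer_id, amount in rows:
        # The seed feature: a boolean computed per transaction.
        flags[customer_id].append(amount > threshold)
    # The aggregation stacked on top: share of True values per customer.
    return {cid: sum(values) / len(values) for cid, values in flags.items()}


result = percent_true_by_customer(transactions)
```

In this toy data, customer 1 has two of four transactions above the threshold, so the aggregated value for that customer is `0.5`.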

## Add "interesting" values to columns

Sometimes we want a feature to be calculated only over the rows where a second column takes a particular value. We call this extra filter a "where clause".

By default, where clauses are built using the ``interesting_values`` of a column.

Interesting values can be automatically determined and added for each DataFrame in a pandas EntitySet by calling `EntitySet.add_interesting_values`.

Note that Dask and Koalas EntitySets cannot have interesting values determined for their DataFrames. For those EntitySets, or when interesting values are already known for columns, the `dataframe_name` and `values` parameters of `add_interesting_values` can be used to set interesting values for DataFrames in an EntitySet.

In [None]:
values_dict = {'device': ["desktop", "mobile", "tablet"]}
es.add_interesting_values(dataframe_name='sessions', values=values_dict)

Interesting values are stored in a DataFrame's Woodwork typing information.

In [None]:
es['sessions'].ww.columns['device'].metadata

Now that interesting values are set for the `device` column in the `sessions` table, we can use the ``where_primitives`` argument to specify which aggregation primitives should be combined with where clauses.

In [None]:
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="customers",
                                      agg_primitives=["count", "avg_time_between"],
                                      where_primitives=["count", "avg_time_between"],
                                      trans_primitives=[])
feature_matrix

We now have several new, potentially useful features. Here are two of them that are built on the where clause "where the device used was a tablet":

In [None]:
feature_matrix[["COUNT(sessions WHERE device = tablet)", "AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)"]]

The first Feature, `COUNT(sessions WHERE device = tablet)`, has the following description generated for it:

In [None]:
ft.describe_feature(feature_defs[8])

Another way of understanding this feature is *how many sessions a customer completed on a tablet*.

The second Feature, `AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)`, has the following description:

In [None]:
ft.describe_feature(feature_defs[5])

Another way of understanding this feature is *the average time between a customer's sessions on a tablet*.

We can see that customers who had only 0 or 1 sessions on a tablet have ``NaN`` values for the average time between such sessions.
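The effect of a where clause, and the reason for the ``NaN`` values, can be sketched in plain Python. The session rows and the helper `where_features` below are hypothetical, and the average gap is assumed to be the span between the first and last session divided by the number of gaps, expressed in seconds. Treat this as an illustration of the idea rather than the library's implementation.

```python
import math
from datetime import datetime

# Hypothetical (customer_id, device, session_start) rows.
sessions = [
    (1, "tablet", datetime(2014, 1, 1, 0, 0)),
    (1, "tablet", datetime(2014, 1, 1, 1, 0)),
    (1, "tablet", datetime(2014, 1, 1, 3, 0)),
    (1, "desktop", datetime(2014, 1, 1, 4, 0)),
    (2, "tablet", datetime(2014, 1, 2, 0, 0)),
    (3, "mobile", datetime(2014, 1, 3, 0, 0)),
]


def where_features(rows, customer_id, device):
    """Return (count, average gap in seconds) over one customer's sessions on one device."""
    # The where clause: keep only the rows that match the interesting value.
    times = sorted(
        start for cid, dev, start in rows if cid == customer_id and dev == device
    )
    count = len(times)
    if count < 2:
        # With fewer than two sessions there is no gap to average.
        return count, math.nan
    average_gap = (times[-1] - times[0]).total_seconds() / (count - 1)
    return count, average_gap


tablet_customer_1 = where_features(sessions, 1, "tablet")
tablet_customer_2 = where_features(sessions, 2, "tablet")
tablet_customer_3 = where_features(sessions, 3, "tablet")
```

Customer 1 has three tablet sessions spread over three hours, so the count is 3 and the average gap is 5400 seconds. Customers 2 and 3 have one and zero tablet sessions respectively, so their average gap is undefined.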


## Encoding categorical features

Machine learning algorithms typically expect all numeric data. When Deep Feature Synthesis generates categorical features, we need to encode them.

In [None]:
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="customers",
                                      agg_primitives=["mode"],
                                      trans_primitives=['time_since'],
                                      max_depth=1)

feature_matrix

In [None]:
# List the features whose column schema carries the 'category' semantic tag
{f.get_name(): f.column_schema for f in feature_defs if 'category' in f.column_schema.semantic_tags}

This feature matrix contains two columns that are categorical in nature, ``zip_code`` and ``MODE(sessions.device)``. We can use the feature matrix and feature definitions to encode these categorical values. Featuretools offers functionality to apply one-hot encoding to the output of DFS.

In [None]:
feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)
feature_matrix_enc

The returned feature matrix is now all numeric (as boolean values correspond to `0` and `1`). Notice how the columns that did not need encoding are still included. Additionally, we get a new set of feature definitions that contain the encoded values.

In [None]:
features_enc

These features can be used to calculate the same encoded values on new data. For more information on feature engineering in production, read :doc:`/guides/deployment`.
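As a rough illustration of why the encoded feature definitions matter, the pure-Python sketch below learns a set of categories from one batch of values and then applies that same, fixed set to new data. The helpers `fit_categories` and `one_hot`, the sample values, and the column naming are hypothetical and only loosely modeled on the encoded output; they are not the Featuretools API.

```python
def fit_categories(values):
    """Return the distinct categories in first-seen order."""
    return list(dict.fromkeys(values))


def one_hot(values, categories, name):
    """Encode values against a fixed category list, with a catch-all column."""
    rows = []
    for value in values:
        row = {f"{name} = {category}": value == category for category in categories}
        # Values not seen when the categories were learned fall into this column.
        row[f"{name} is unknown"] = value not in categories
        rows.append(row)
    return rows


# Categories are learned once, from the original data.
train_devices = ["desktop", "mobile", "desktop", "tablet"]
categories = fit_categories(train_devices)

# The same categories are reused on new data, so the columns stay identical.
new_devices = ["mobile", "smartwatch"]
encoded = one_hot(new_devices, categories, "MODE(sessions.device)")
```

Because the category list is fixed at fit time, the new data produces exactly the same columns as the original data, and the unseen value `smartwatch` is flagged only in the catch-all column.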