# Representing Data with EntitySets

An ``EntitySet`` is a collection of entities and the relationships between them. EntitySets are useful for preparing raw, structured datasets for feature engineering. While many functions in Featuretools take ``dataframes`` and ``relationships`` as separate arguments, we recommend creating an ``EntitySet`` so you can more easily manipulate your data as needed.

## The Raw Data

Below we have two tables of data (represented as Pandas DataFrames) related to customer transactions. The first is a merge of transactions, sessions, and customers so that the result looks like something you might see in a log file:

In [None]:
import featuretools as ft
data = ft.demo.load_mock_customer()
transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])

transactions_df.sample(10)

And the second dataframe is a list of products involved in those transactions.

In [None]:
products_df = data["products"]
products_df

## Creating an EntitySet

First, we initialize an EntitySet. If you'd like to give it a name, you can optionally provide an ``id`` to the constructor.

In [None]:
es = ft.EntitySet(id="customer_data")

## Adding entities

To get started, we load the transactions dataframe as an entity.

In [None]:
from woodwork.logical_types import Categorical, PostalCode

es = es.add_dataframe(
    dataframe_name="transactions",
    dataframe=transactions_df,
    index="transaction_id",
    time_index="transaction_time",
    # Use the logical types imported above; "ww" was never imported.
    logical_types={
        "product_id": Categorical,
        "zip_code": PostalCode,
    },
)

es

In [None]:
es["transactions"].ww.schema

In the call to ``add_dataframe``, we specified three important parameters:

* The ``index`` parameter specifies the column that uniquely identifies rows in the dataframe.
* The ``time_index`` parameter tells Featuretools when the data was created.
* The ``logical_types`` parameter indicates that "product_id" should be interpreted as a Categorical variable, even though it is just an integer in the underlying data. Likewise, "zip_code" is marked as a PostalCode.
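Because the ``index`` column must uniquely identify rows, it can help to check a candidate column before loading the data. The following is a minimal, dependency-free sketch of such a check. The helper ``duplicate_index_values`` and the toy rows are hypothetical and are not part of Featuretools.

```python
from collections import Counter


def duplicate_index_values(rows, index):
    """Return the sorted values of `index` that occur in more than one row."""
    counts = Counter(row[index] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical toy rows standing in for the transactions dataframe.
rows = [
    {"transaction_id": 1, "transaction_time": "2014-01-01 00:00:00", "product_id": 5},
    {"transaction_id": 2, "transaction_time": "2014-01-01 00:01:05", "product_id": 2},
    {"transaction_id": 3, "transaction_time": "2014-01-01 00:02:10", "product_id": 5},
]

# "transaction_id" has no repeated values, so it is a usable index.
print(duplicate_index_values(rows, "transaction_id"))

# "product_id" repeats, so it could not serve as the index of this table.
print(duplicate_index_values(rows, "product_id"))
```

An empty result means the column is a valid candidate for ``index``.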

Now, we can do the same thing with our products dataframe.

In [None]:
es = es.add_dataframe(
    dataframe_name="products",
    dataframe=products_df,
    index="product_id")

es

With two entities in our entity set, we can add a relationship between them.

## Adding a Relationship

We want to relate these two entities by the columns called "product_id" in each entity. Each product has multiple transactions associated with it, so it is called the **parent entity**, while the transactions entity is known as the **child entity**. When specifying relationships, we list the variable in the parent entity first. Note that each `ft.Relationship` must denote a one-to-many relationship rather than a relationship which is one-to-one or many-to-many.

In [None]:
es = es.add_relationship("products", "product_id", "transactions", "product_id")
es

Now, we see the relationship has been added to our entity set.
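The one-to-many requirement described above means that the key must be unique in the parent table, while it may repeat in the child table. A rough way to sanity-check this before adding a relationship is sketched below in plain Python. The helper ``check_one_to_many`` and the toy data are hypothetical, and reporting child keys with no matching parent is an extra check chosen for this sketch rather than documented Featuretools behavior.

```python
def check_one_to_many(parent_rows, parent_key, child_rows, child_key):
    """Validate a parent/child key pair.

    Raises ValueError if the parent key is not unique. Returns the sorted
    child key values that have no matching parent row.
    """
    parent_values = [row[parent_key] for row in parent_rows]
    if len(set(parent_values)) != len(parent_values):
        raise ValueError(f"{parent_key!r} is not unique in the parent table")
    child_values = {row[child_key] for row in child_rows}
    return sorted(child_values - set(parent_values))


# Hypothetical toy tables: each product appears once, transactions repeat it.
products = [
    {"product_id": 1, "brand": "A"},
    {"product_id": 2, "brand": "B"},
]
transactions = [
    {"transaction_id": 10, "product_id": 1},
    {"transaction_id": 11, "product_id": 1},
    {"transaction_id": 12, "product_id": 2},
]

# No unmatched child keys: products -> transactions is one-to-many.
unmatched = check_one_to_many(products, "product_id", transactions, "product_id")
print(unmatched)
```

Swapping the arguments, so that transactions act as the parent, raises ``ValueError`` because "product_id" repeats there.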

## Creating an entity from an existing table

When working with raw data, it is common to have sufficient information to justify the creation of new entities. In order to create a new entity and relationship for sessions, we "normalize" the transactions entity.

In [None]:
es = es.normalize_dataframe(
    base_dataframe_name="transactions",
    new_dataframe_name="sessions",
    index="session_id",
    make_time_index="session_start",
    additional_columns=[
        "device",
        "customer_id",
        "zip_code",
        "session_start",
        "join_date",
    ],
)
es

Looking at the output above, we see this method performed two operations:

1. It created a new entity called "sessions" based on the "session_id" and "session_start" variables in "transactions".
2. It added a relationship connecting "transactions" and "sessions".

If we look at the variables in transactions and the new sessions entity, we see two more operations that were performed automatically.

In [None]:
es["transactions"].ww.schema

In [None]:
es["sessions"].ww.schema

1. It removed "device", "customer_id", "zip_code" and "join_date" from "transactions" and created new variables in the sessions entity. This reduces redundant information, as those properties of a session don't change between transactions.
2. It copied and marked "session_start" as a time index variable into the new sessions entity to indicate the beginning of a session. If the base entity has a time index and ``make_time_index`` is not set, ``normalize_dataframe`` will create a time index for the new entity. In this case it would create a new time index called "first_transactions_time" using the time of the first transaction of each session. If we don't want this time index to be created, we can set ``make_time_index=False``.
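To make the two automatic operations above concrete, here is a simplified, dependency-free sketch of what normalization does conceptually. It moves the per-session columns into a new table and derives a time index from the first transaction of each session. The function ``normalize`` and the toy rows are hypothetical. This is not how Featuretools implements ``normalize_dataframe``, only an illustration of the idea.

```python
def normalize(rows, index, additional_columns, time_column,
              new_time_index="first_transactions_time"):
    """Split `rows` into (parent_rows, child_rows).

    Each distinct `index` value becomes one parent row holding the
    `additional_columns` plus the earliest `time_column` value seen for it.
    The `additional_columns` are removed from the child rows.
    """
    parents = {}
    children = []
    # Visit rows in time order so the first row seen per key is the earliest.
    for row in sorted(rows, key=lambda r: r[time_column]):
        key = row[index]
        if key not in parents:
            parent = {index: key}
            parent.update({col: row[col] for col in additional_columns})
            parent[new_time_index] = row[time_column]
            parents[key] = parent
        children.append(
            {k: v for k, v in row.items() if k not in additional_columns}
        )
    return list(parents.values()), children


# Hypothetical toy rows; ISO-formatted timestamps sort correctly as strings.
transactions = [
    {"transaction_id": 1, "session_id": 10,
     "transaction_time": "2014-01-01 00:00:00", "device": "desktop", "amount": 20.0},
    {"transaction_id": 2, "session_id": 10,
     "transaction_time": "2014-01-01 00:01:05", "device": "desktop", "amount": 5.5},
    {"transaction_id": 3, "session_id": 11,
     "transaction_time": "2014-01-01 00:02:10", "device": "mobile", "amount": 7.0},
]

sessions, slim_transactions = normalize(
    transactions, "session_id", ["device"], "transaction_time"
)
print(sessions)
print(slim_transactions)
```

The child rows keep "session_id", which is the column that links them back to the new parent table.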

If we look at the dataframes, we can see what ``normalize_dataframe`` did to the actual data.

In [None]:
es["sessions"].head(5)

In [None]:
es["transactions"].head(5)

To finish preparing this dataset, create a "customers" entity using the same method call.

In [None]:
es = es.normalize_dataframe(
    base_dataframe_name="sessions",
    new_dataframe_name="customers",
    index="customer_id",
    make_time_index="join_date",
    additional_columns=["zip_code", "join_date"],
)

es

## Using the EntitySet

Finally, we are ready to use this EntitySet with any functionality within Featuretools. For example, let's build a feature matrix for each product in our dataset.

In [None]:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="products")

feature_matrix
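Many of the columns in a feature matrix like this one are aggregations of child rows grouped by the parent key, for example a count or mean of transaction amounts per product. The sketch below illustrates that idea in plain Python. The helper ``aggregate_by_parent``, the feature names, and the toy data are hypothetical and are not the output of ``ft.dfs``.

```python
from collections import defaultdict


def aggregate_by_parent(child_rows, key, value_column):
    """Group child rows by `key` and compute simple aggregates of `value_column`."""
    grouped = defaultdict(list)
    for row in child_rows:
        grouped[row[key]].append(row[value_column])
    return {
        parent: {
            "COUNT": len(values),
            "SUM": sum(values),
            "MEAN": sum(values) / len(values),
        }
        for parent, values in grouped.items()
    }


# Hypothetical toy transactions with an "amount" column.
transactions = [
    {"transaction_id": 1, "product_id": 1, "amount": 10.0},
    {"transaction_id": 2, "product_id": 1, "amount": 30.0},
    {"transaction_id": 3, "product_id": 2, "amount": 5.0},
]

features = aggregate_by_parent(transactions, "product_id", "amount")
print(features)
```

Each key of ``features`` corresponds to one row of a per-product feature matrix, and each aggregate to one column.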