# Woodwork Typing in Featuretools

Featuretools relies on having consistent typing across the creation of EntitySets, Primitives, Features, and feature matrices. Previously, Featuretools used its own type system that contained objects called Variables. Now and moving forward, Featuretools will use an external package for its typing: [Woodwork](https://woodwork.alteryx.com/en/stable/index.html). Woodwork is also used by Alteryx's open source autoML tool, EvalML, allowing for a smooth transition from feature engineering to model-building via a common Woodwork type system.

Woodwork is a library that helps with the data typing of 2-dimensional tabular data structures. Initializing Woodwork on a DataFrame creates a namespace, `ww`, that contains physical, logical, and semantic data types that can be used within the Alteryx Open Source ecosystem as well as in general machine learning applications.

A more in-depth explanation of Woodwork's type system as it's integrated into Featuretools can be found below, but to start, here are quick definitions of the three kinds of typing information that are central to Woodwork:

- Physical Type: defines how the data is stored on disk or in memory.
- Logical Type: defines how the data should be parsed or interpreted.
- Semantic Tag(s): provides additional data about the meaning of the data or how it should be used.

A full-length explanation of Woodwork's Types and Tags that includes more instruction on building Woodwork DataFrames can be found in Woodwork's documentation [here](link). 

## Physical Types 
Physical types define how the data in a Woodwork DataFrame is stored on disk or in memory. You might also see the physical type for a column referred to as the column’s `dtype`.

Pandas, Dask, and Koalas DataFrames rely on these dtypes when performing DataFrame operations, so knowing a Woodwork table's physical types is important for any DataFrame operations that might be performed. Each `LogicalType` class has a single physical type associated with it, though multiple logical types can have the same physical type.
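The relationship described above, where each logical type has one physical type but several logical types can share it, can be sketched in plain Python. The lookup table below is hypothetical and hand-written from the pairings mentioned in this guide; it is not read from Woodwork and is not a complete list.

```python
from collections import defaultdict

# Hypothetical lookup table built from the pairings mentioned in this guide.
# It is illustrative only and is not generated by Woodwork.
LOGICAL_TO_PHYSICAL = {
    "Integer": "int64",
    "Age": "int64",
    "Double": "float64",
    "Categorical": "category",
    "Ordinal": "category",
    "Datetime": "datetime64[ns]",
    "EmailAddress": "string",
    "PhoneNumber": "string",
}


def group_by_physical_type(logical_to_physical):
    """Invert the mapping: physical type -> sorted list of logical types."""
    grouped = defaultdict(list)
    for logical, physical in logical_to_physical.items():
        grouped[physical].append(logical)
    return {physical: sorted(names) for physical, names in grouped.items()}


physical_to_logical = group_by_physical_type(LOGICAL_TO_PHYSICAL)
for physical, logical_names in sorted(physical_to_logical.items()):
    print(f"{physical}: {', '.join(logical_names)}")
```

Each key of `LOGICAL_TO_PHYSICAL` appears exactly once, which mirrors the "single physical type per logical type" rule, while the inverted mapping shows the sharing.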

## Logical Types

Multiple logical types may have the same physical types, because a logical type adds additional information about how data should be interpreted or parsed beyond what can be contained in a physical type. 

For example, email addresses and phone numbers would typically both be stored in a data column with a physical type of `string`. However, when reading, validating, and using these two types of information, different rules apply. For email addresses, the presence of the `@` symbol is important. For phone numbers, you might want to confirm that only a certain number of digits are present or use the first three digits to determine an area-code, and special characters might be restricted to `+`, `-`, `(` or `)`. In this particular example Woodwork defines two different logical types to separate these parsing needs: `EmailAddress` and `PhoneNumber`.
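To make the distinction concrete, here is a minimal sketch of two validators that apply different rules to values sharing the same Python `str` type. The function names and regular expressions are hypothetical simplifications written for this illustration; they are not Woodwork's own validation logic for `EmailAddress` or `PhoneNumber`.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
# Digits plus the special characters named above; spaces are also allowed (an assumption).
PHONE_ALLOWED_CHARS = re.compile(r"^[0-9+\-() ]+$")


def looks_like_email(value):
    """Return True if the string has exactly one '@' followed by a dotted domain."""
    return EMAIL_PATTERN.match(value) is not None


def looks_like_phone(value, expected_digits=10):
    """Return True if only allowed characters appear and the digit count matches."""
    if PHONE_ALLOWED_CHARS.match(value) is None:
        return False
    digits = re.sub(r"\D", "", value)
    return len(digits) == expected_digits


def area_code(value):
    """Return the first three digits, assuming there is no country code prefix."""
    return re.sub(r"\D", "", value)[:3]


for text in ["user@example.com", "(555) 123-4567"]:
    print(text, looks_like_email(text), looks_like_phone(text))
```

Both sample values are ordinary strings, yet each passes only one of the two checks, which is the kind of difference a logical type is meant to capture.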

Woodwork provides many different logical types, which can be seen with the `list_logical_types` function.

In [None]:
import featuretools as ft
ft.list_logical_types()

Featuretools will perform type inference to assign logical types to the data in EntitySets, but it is also possible to specify which logical types should be set for any column (provided that the data in that column is compatible with the logical type). 

To learn more about how logical types are used in EntitySets, see the [Creating EntitySets guide](link). To learn more about setting logical types directly on a DataFrame, see the Woodwork guide on [working with Logical Types](https://woodwork.alteryx.com/en/stable/guides/understanding_types_and_tags.html#Working-with-Logical-Types). 

## Semantic Tags
The `standard_tags` column in the `list_logical_types` output above contains semantic tags that Woodwork applies to specific logical types. These tags provide even more context about the meaning and potential uses of a column. 

One use of semantic tags is grouping certain columns together. For example, `Age` and `Integer` have the `int64` physical type, and `Double` has the `float64` physical type, but all three have the `numeric` tag, which ties them together in a way that physical types and logical types cannot do alone. 
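This grouping can be sketched with plain dictionaries. The column names and the `columns_with_tag` helper are hypothetical, and the typing information is written by hand rather than produced by Woodwork.

```python
# Hypothetical typing information for four columns, written by hand.
# Each entry records a physical type, a logical type, and a set of semantic tags.
COLUMN_TYPES = {
    "customer_age": {"physical": "int64", "logical": "Age", "tags": {"numeric"}},
    "num_orders": {"physical": "int64", "logical": "Integer", "tags": {"numeric"}},
    "total_spent": {"physical": "float64", "logical": "Double", "tags": {"numeric"}},
    "email": {"physical": "string", "logical": "EmailAddress", "tags": set()},
}


def columns_with_tag(column_types, tag):
    """Return the names of columns whose semantic tags include `tag`."""
    return [name for name, info in column_types.items() if tag in info["tags"]]


numeric_columns = columns_with_tag(COLUMN_TYPES, "numeric")
print(numeric_columns)
```

The three selected columns span two physical types and three logical types, so neither of those alone could have produced the same group. On a real Woodwork DataFrame, the `select` method on the `ww` namespace offers this kind of tag-based selection.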

Compared with physical types and logical types, semantic tags are much less restrictive. A column might contain many semantic tags or none at all. Regardless, when assigning semantic tags, users should take care not to assign tags that have conflicting meanings.

As another example of how semantic tags can be useful, consider a dataset with 2 date columns: a signup date and a user birth date. Both of these columns have the same physical type (`datetime64[ns]`), and both have the same logical type (`Datetime`), but the way one might interpret or use these columns is not the same. This time, semantic tags can be used to differentiate these columns. For example, you might want to add the `date_of_birth` semantic tag to the user birth date column to indicate this column has special meaning and could be used to compute a user’s age. Computing an age from the signup date column would not make sense, so the semantic tag can be used to differentiate between what the dates in these columns mean.
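A minimal sketch of that idea follows, using only the standard library. The column names, the sample row, and both helper functions are hypothetical; the point is only that the `date_of_birth` tag, not the physical or logical type, decides which column an age is computed from.

```python
from datetime import date

# Hypothetical semantic tags for two date columns that share a physical and logical type.
SEMANTIC_TAGS = {
    "signup_date": set(),
    "birth_date": {"date_of_birth"},
}

# A single hand-written row of data for illustration.
user = {"signup_date": date(2020, 3, 15), "birth_date": date(1990, 6, 1)}


def age_in_years(born, today):
    """Whole years elapsed between `born` and `today`."""
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


def ages_from_tagged_columns(row, semantic_tags, today):
    """Compute an age only for columns carrying the 'date_of_birth' tag."""
    return {
        name: age_in_years(row[name], today)
        for name, tags in semantic_tags.items()
        if "date_of_birth" in tags
    }


ages = ages_from_tagged_columns(user, SEMANTIC_TAGS, today=date(2024, 1, 1))
print(ages)
```

The signup date is skipped even though it holds the same kind of value, because it does not carry the tag.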

### Woodwork Standard Tags

Woodwork will add certain tags to columns at initialization. As mentioned above, one kind of tag is the standard tag, which is associated with sets of logical types that fall under certain predefined categories.

The standard tags that come from logical types are as follows:

* `'numeric'` - The tag applied to numeric Logical Types.
    * `Integer`
    * `IntegerNullable`
    * `Double`
    * `Age`
    * `AgeNullable`
    
* `'category'` - The tag applied to Logical Types that represent categorical variables.
    * `Categorical`
    * `CountryCode`
    * `Ordinal`
    * `PostalCode`
    * `SubRegionCode`
    
To learn more about working with Woodwork's standard semantic tags directly on a DataFrame, see the Woodwork guide on [working with semantic tags](https://woodwork.alteryx.com/en/stable/guides/understanding_types_and_tags.html#Standard-Tags).

There are also 2 tags that Woodwork adds only to index columns. If no index columns have been specified, these tags are not present:

* `'index'` - on the index column, when specified
* `'time_index'` - on the time index column, when specified

These `index` and `time_index` tags, which have special meaning, can be controlled by the user at EntitySet creation or through Woodwork methods directly on the DataFrame. This is discussed in more detail in the Woodwork guide on [setting the index](https://woodwork.alteryx.com/en/stable/guides/understanding_types_and_tags.html#Setting-the-index).

To get a list of the standard, index, and time index tags, you can use the `list_semantic_tags` function.

In [None]:
ft.list_semantic_tags()

Above we see the standard tags that are defined within Woodwork. These tags inform how Featuretools is able to interpret data, but in order to generate the full set of features, Featuretools must create a few tags of its own.

### Semantic Tags in Featuretools

Just like Woodwork specifies semantic tags internally, Featuretools also has certain semantic tags that have a specific meaning when they are present on a column.

- `'last_time_index'` - added by Featuretools to the last time index column of a DataFrame. Indicates that this column has been created by Featuretools.
- `'foreign_key'` - used to indicate that a column has a parent column that's a primary key of another dataframe in the EntitySet. Is added by Featuretools when a Relationship is created between two DataFrames.
- `'date_of_birth'` - indicates that a column can be used for Primitive operations that require a date of birth. Must be added at EntitySet creation to be present.


## Woodwork Throughout Featuretools

Now that we've described the elements that make up Woodwork's type system, let's see them in action in Featuretools.

### Woodwork in EntitySets
An EntitySet is a collection of DataFrames and the relationships between them. Previously, EntitySets were built of `Entity` objects that stored the typing information, but with Woodwork, the physical, logical, and semantic typing information gets stored within the DataFrames that make up an EntitySet. For more information on building these EntitySets using Woodwork, see the [EntitySet guide](link). Let's look at the typing information as it's stored in a demo EntitySet of flight data:

In [None]:
es = ft.demo.load_flight()
es

The EntitySet representation above contains no Woodwork typing information, as it is a representation of a relational dataset, and Woodwork currently supports single tables. To look at the Woodwork typing information, we first select a single DataFrame from the EntitySet, and then access the Woodwork information via the `ww` namespace:

In [None]:
df = es['flights']
df.head()

In [None]:
df.ww

Notice how the three columns showing this DataFrame's typing information are the three kinds of typing information outlined at the beginning of this guide. To reiterate: by defining physical types, logical types, and semantic tags for each column in a DataFrame, we've defined the DataFrame's Woodwork schema, and with it we can gain an understanding of the contents of each column.

In this case, we can see how every column with the `Categorical` logical type also has the `category` physical type, but so does the column that has an `Ordinal` logical type.

There's also a good showing of semantic tags in this DataFrame. The `'category'` standard semantic tag is present on the categorical columns, and the Featuretools-specific `foreign_key` tag is present on columns that are child columns of Relationships. This DataFrame also has the `index` and `time_index` tags.

All of the typing information shown here paints a picture of this table in the flights EntitySet, and it's useful as a way of summarizing the contents of a table. And on Featuretools' end, all of this information will get used to run DFS. To understand how Featuretools uses this information, we need to take a closer look at the structure of typing information in Woodwork.

The table above is a view of the Woodwork typing information on a DataFrame-wide level, but just as we can select a single Series from a DataFrame, we can get the Woodwork typing information for a single column.

In [None]:
df.ww['flight_id'].ww

The column-specific typing information that exists for every column in every table in an EntitySet is an integral part of Deep Feature Synthesis' ability to generate features for an EntitySet.

### Woodwork in DFS
As the units of computation in Featuretools, Primitives need to be able to specify the input types that they allow as well as have a predictable return type. For these, we use Woodwork `ColumnSchema` objects. To understand what a `ColumnSchema` is and how it gets used, we will dive a little deeper into the structure of a DataFrame that has Woodwork typing information.

A Woodwork DataFrame has the `ww` namespace, which is how users access the typing information. However, Woodwork also provides an object that contains just typing information and does not have any data associated with it: `woodwork.table_schema.TableSchema`.

If we take the DataFrame from above and access its Woodwork `TableSchema`, we can see that the major difference is that there are no physical types associated with the schema.

In [None]:
table_schema = df.ww.schema
table_schema

In [None]:
type(table_schema)

The lack of a physical type is due to the fact that a TableSchema has no data, and therefore no physical representation of the data. We often rely on physical typing information to know the exact pandas or Dask or Koalas operations that are valid for a DataFrame, but for a schema of typing information that is not tied to data, those operations are not relevant.

Now, let's look at the typing information for a single column, a `woodwork.column_schema.ColumnSchema`, which we can acquire in much the same way as we selected a Series from the DataFrame and had it maintain its typing information:

In [None]:
flight_id_col = df.ww['flight_id']
column_schema = flight_id_col.ww.schema
column_schema

The `column_schema` object above can be understood as typing information for a single column that is not tied to any data. In this case, we happen to know where the column schema came from - it was an index column from the `flights` dataframe in the flights EntitySet. But we can also create a `ColumnSchema` that exists without being associated with any individual column of data.

In order to understand how Featuretools uses ColumnSchemas to define input and return types for DFS, we have to be able to understand a `ColumnSchema` both in the context of a specific column that's part of a DataFrame (like `flight_id` above) as well as a more general type definition that's not associated with any data. The combination of Woodwork logical types and semantic tags that makes up a `ColumnSchema` is how Featuretools determines what features can be built from Primitives.

Let's look at a different column in a different DataFrame to see how this works:

In [None]:
es['trip_logs'].ww.columns['distance']

The `ColumnSchema` above has been pulled from the `distance` column in the `trip_logs` table in the flights EntitySet. If we look at the `trip_logs` table, we can see that there are a lot of columns that will have the same `ColumnSchema`. Any of those columns could be used by a Primitive that can take in any input, a Primitive that can use any numeric input, or a Primitive that requires specifically floating point inputs.

We can use `ColumnSchema` objects to define those input types.

In [None]:
es['trip_logs'].ww

Below are several `ColumnSchema`s that would all include our `distance` column, but each describes a different type space. These `ColumnSchema`s get more restrictive as we go down:

- **`<ColumnSchema >`** - No restrictions have been placed; any column falls into this definition.
- **`<ColumnSchema (Semantic Tags = ['numeric'])>`** - Only columns with the `numeric` tag apply. This can include columns with the `Double`, `Integer`, and `Age` logical types.
- **`<ColumnSchema (Logical Type = Double) >`** - Only columns with logical type of `Double` are included in this definition. Does not require the `numeric` tag, so an index column (which has its standard tags removed) would still apply.
- **`<ColumnSchema (Logical Type = Double) (Semantic Tags = ['numeric'])>`** - The column must have logical type `Double` and have the `numeric` semantic tag, excluding index columns.

In this way, a `ColumnSchema` can define a type space under which columns in a Woodwork DataFrame can fall. This is how we determine which columns in a DataFrame are valid for a Primitive in building Features during DFS.
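The four type spaces listed above can be checked against a few example columns with a small stand-in model. `SimpleColumnSchema` and `falls_within` are hypothetical names written for this sketch; they imitate the idea of a `ColumnSchema` as an optional logical type plus a set of tags, and are not Woodwork's or Featuretools' actual matching code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SimpleColumnSchema:
    """Hypothetical stand-in for a ColumnSchema: an optional logical type plus tags."""
    logical_type: Optional[str] = None
    semantic_tags: frozenset = frozenset()


def falls_within(column, type_space):
    """Return True if `column` satisfies every restriction in `type_space`."""
    if type_space.logical_type is not None and column.logical_type != type_space.logical_type:
        return False
    # Every tag required by the type space must be present on the column.
    return type_space.semantic_tags <= column.semantic_tags


# The four type spaces, from least to most restrictive.
ANY_COLUMN = SimpleColumnSchema()
NUMERIC = SimpleColumnSchema(semantic_tags=frozenset({"numeric"}))
DOUBLE = SimpleColumnSchema(logical_type="Double")
NUMERIC_DOUBLE = SimpleColumnSchema("Double", frozenset({"numeric"}))
TYPE_SPACES = [ANY_COLUMN, NUMERIC, DOUBLE, NUMERIC_DOUBLE]

# Three example columns described by their own schemas.
EXAMPLE_COLUMNS = {
    "distance": SimpleColumnSchema("Double", frozenset({"numeric"})),
    "integer_count": SimpleColumnSchema("Integer", frozenset({"numeric"})),
    # An index column keeps its logical type but loses its standard tags.
    "double_index": SimpleColumnSchema("Double", frozenset({"index"})),
}

matches = {
    name: [falls_within(column, space) for space in TYPE_SPACES]
    for name, column in EXAMPLE_COLUMNS.items()
}
for name, result in matches.items():
    print(name, result)
```

In this model `distance` falls within all four type spaces, the hypothetical `integer_count` column only within the first two, and the hypothetical `double_index` column within the unrestricted and `Double`-only spaces.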

Each Primitive has `input_types` and a `return_type` that are described by a Woodwork `ColumnSchema`. Every table in an EntitySet has Woodwork initialized on it. This means that when we pass an EntitySet into DFS, we can select the relevant columns in the DataFrame that are valid for the Primitive's `input_types`. We then get a Feature that has a `column_schema` property that indicates what that Feature's typing information is in a way that lets DFS stack features on top of one another.
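The selection and stacking described above can be imitated with a toy model. Everything here is hypothetical: `Schema`, `ToyPrimitive`, and `build_features` are stand-ins written for this sketch, the primitives take a single input (real Primitives declare a list of `input_types`), and none of it is the Featuretools implementation of DFS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Schema:
    """Hypothetical stand-in for a ColumnSchema."""
    logical_type: Optional[str] = None
    semantic_tags: frozenset = frozenset()

    def accepts(self, other):
        """True if `other` satisfies this schema's logical type and tag requirements."""
        type_ok = self.logical_type is None or self.logical_type == other.logical_type
        return type_ok and self.semantic_tags <= other.semantic_tags


@dataclass(frozen=True)
class ToyPrimitive:
    """Hypothetical single-input primitive with an input type and a return type."""
    name: str
    input_type: Schema
    return_type: Schema


def build_features(primitive, typed_columns):
    """Pair the primitive with every column whose schema is a valid input.

    Returns a dict mapping each new feature name to the primitive's return type,
    so the result can be passed back in as `typed_columns` to stack features.
    """
    return {
        f"{primitive.name}({column})": primitive.return_type
        for column, schema in typed_columns.items()
        if primitive.input_type.accepts(schema)
    }


NUMERIC = Schema(semantic_tags=frozenset({"numeric"}))
columns = {
    "distance": Schema("Double", frozenset({"numeric"})),
    "carrier": Schema("Categorical", frozenset({"category"})),
}
absolute = ToyPrimitive("ABSOLUTE", input_type=NUMERIC, return_type=NUMERIC)
negate = ToyPrimitive("NEGATE", input_type=NUMERIC, return_type=NUMERIC)

depth_one = build_features(absolute, columns)   # features built on raw columns
depth_two = build_features(negate, depth_one)   # features stacked on features
print(list(depth_one))
print(list(depth_two))
```

Because each toy feature carries its own schema, the second call needs no knowledge of where its inputs came from, which is the property that lets features stack.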

In this way, Featuretools is able to leverage the base unit of Woodwork typing information, the `ColumnSchema`, and use it in concert with an EntitySet of Woodwork DataFrames in order to build Features with Deep Feature Synthesis.