# Woodwork Typing in Featuretools

Featuretools relies on having consistent typing across the creation of EntitySets, Primitives, Features, and feature matrices. Previously, Featuretools used its own type system that contained objects called Variables. Featuretools now uses an external package for its typing: [Woodwork](https://woodwork.alteryx.com/en/stable/index.html).

Woodwork is a library that helps with the data typing of 2-dimensional tabular data structures. Initializing Woodwork on a DataFrame creates a namespace, `ww`, that contains physical, logical, and semantic data types that can be used within the Alteryx Open Source ecosystem as well as in general machine learning applications.

Understanding these Woodwork types and how they work together to create Woodwork's type system will allow users to build EntitySets that best represent their data, understand the possible input and return types for Featuretools' Primitives, and understand what features will get generated from a given set of data and primitives.

- Physical Type: defines how the data is stored on disk or in memory.
- Logical Type: defines how the data should be parsed or interpreted.
- Semantic Tag(s): provides additional data about the meaning of the data or how it should be used.

Read the [Understanding Woodwork Logical Types and Semantic Tags](https://woodwork.alteryx.com/en/stable/guides/logical_types_and_semantic_tags.html) guide for an in-depth walkthrough of the available Woodwork types.

Read the [Working with Woodwork Types and Tags](https://woodwork.alteryx.com/en/stable/guides/working_with_types_and_tags.html) guide to learn more about manipulating these types on a Woodwork DataFrame.

For users who are familiar with the old `Variable` objects, the [Transitioning from Variables to Woodwork](link) guide will be useful for converting Variable types to Woodwork types.

## Physical Types 
Physical types define how the data in a Woodwork DataFrame is stored on disk or in memory. You might also see the physical type for a column referred to as the column’s `dtype`.

Pandas, Dask, and Koalas DataFrames rely on these dtypes when performing DataFrame operations, so knowing a Woodwork table's physical types is important for any DataFrame operations that might be performed. Each `LogicalType` class has a single physical type associated with it, though multiple logical types can have the same physical type.

## Logical Types

Multiple logical types may have the same physical types, because a logical type adds additional information about how data should be interpreted or parsed beyond what can be contained in a physical type.

A column's logical type informs how data is read into an EntitySet and how it gets used down the line in Deep Feature Synthesis.

Woodwork provides many different logical types, which can be seen with the `list_logical_types` function.

In [None]:
import featuretools as ft
ft.list_logical_types()

Featuretools will perform type inference to assign logical types to the data in EntitySets, but it is also possible to specify which logical types should be set for any column (provided that the data in that column is compatible with the logical type). 

To learn more about how logical types are used in EntitySets, see the [Creating EntitySets](using_entitysets.ipynb) guide.

To learn more about setting logical types directly on a DataFrame, see the Woodwork guide on [working with Logical Types](https://woodwork.alteryx.com/en/stable/guides/working_with_types_and_tags.html#Working-with-Logical-Types). 

To learn more about the individual Logical Types, see the [Understanding Logical Types and Semantic Tags](https://woodwork.alteryx.com/en/stable/guides/logical_types_and_semantic_tags.html#Logical-Types) guide in Woodwork.

## Semantic Tags
Semantic tags provide additional information about columns beyond what Logical Types can provide alone.

### Woodwork Standard Tags

Woodwork will add certain semantic tags to columns at initialization. These can be standard tags that may be associated with different sets of logical types, index tags, or even tags that have some additional meaning in Woodwork.

For a full explanation of Woodwork's standard tags, see the Woodwork guide on [Understanding Logical Types and Semantic Tags](https://woodwork.alteryx.com/en/stable/guides/logical_types_and_semantic_tags.html#Semantic-Tags).

To get a list of the standard, index, and time index tags, you can use the `list_semantic_tags` function.

In [None]:
ft.list_semantic_tags()

Above we see the standard tags that are defined within Woodwork. These tags inform how Featuretools is able to interpret data, but in order to generate the full set of features, Featuretools has defined a few tags of its own.

### Semantic Tags in Featuretools

Just like Woodwork specifies semantic tags internally, Featuretools also has certain semantic tags that have a specific meaning when they are present on a column.

- `'last_time_index'` - added by Featuretools to the last time index column of a DataFrame. Indicates that this column has been created by Featuretools.
- `'foreign_key'` - used to indicate that a column has a parent column that's a primary key of another dataframe in the EntitySet. Is added by Featuretools when a Relationship is created between two DataFrames.
- `'date_of_birth'` - indicates that a column can be used for Primitive operations that require a date of birth. Must be added at EntitySet creation to be present.


## Woodwork Throughout Featuretools

Now that we've described the elements that make up Woodwork's type system, let's see them in action in Featuretools.

### Woodwork in EntitySets
For more information on building EntitySets using Woodwork, see the [EntitySet guide](using_entitysets.ipynb).

Let's look at the Woodwork typing information as it's stored in a demo EntitySet of retail data:

In [None]:
es = ft.demo.load_retail()

es

The EntitySet representation above contains no Woodwork typing information, as it is a representation of a relational dataset, and Woodwork currently supports single tables. To look at the Woodwork typing information, we first select a single DataFrame from the EntitySet, and then access the Woodwork information via the `ww` namespace:

In [None]:
df = es['products']
df.head()

In [None]:
df.ww

Notice how the three columns provided to show this DataFrame's typing information are the three types of typing information outlined at the beginning of this guide. To reiterate: by defining physical types, logical types, and semantic tags for each column in a DataFrame, we've defined a DataFrame's Woodwork schema, and with it, we can gain an understanding of the contents of each column.

This column-specific typing information that exists for every column in every table in an EntitySet is an integral part of Deep Feature Synthesis' ability to generate features for an EntitySet.

### Woodwork in DFS
As the units of computation in Featuretools, Primitives need to be able to specify the input types that they allow as well as have a predictable return type. For an in-depth explanation of Primitives in Featuretools, see the [Feature Primitives](primitives.ipynb) guide. Here, we'll look at how the Woodwork types come together into a `ColumnSchema` object to describe Primitive input and return types.

Below is a Woodwork `ColumnSchema` that we've obtained from the `'product_id'` column in the `products` DataFrame in the retail EntitySet.

In [None]:
products_df = es['products']
product_ids_series = products_df.ww['product_id']
column_schema = product_ids_series.ww.schema
column_schema

This combination of logical type and semantic tag typing information is a `ColumnSchema`. In the case above, the `ColumnSchema` describes the **type definition** for a single column of data.

However, it's important to note that there is no physical type in a `ColumnSchema`. This is because a `ColumnSchema` is a collection of Woodwork types that doesn't have any data tied to it and therefore has no physical representation. 

Because a `ColumnSchema` object is not tied to any data, it can also be used to describe a **type space** into which other columns may or may not fall.

This flexibility of the `ColumnSchema` class means that each column in an EntitySet ultimately can be described by a type definition `ColumnSchema`, and each Primitive defines the input and return types using a `ColumnSchema` type space.

Let's look at a different column in a different DataFrame to see how this works:

In [None]:
order_products_df = es['order_products']
order_products_df.head() 

In [None]:
quantity_series = order_products_df.ww['quantity']
column_schema = quantity_series.ww.schema
column_schema

The `ColumnSchema` above has been pulled from the `'quantity'` column in the `order_products` table in the retail EntitySet. This is a **type definition**. 

If we look at the `order_products` table's Woodwork types below, we can see that there are several columns that will have similar `ColumnSchema` type definitions. If we wanted to describe subsets of those columns, we could define several `ColumnSchema` **type spaces**.

In [None]:
es['order_products'].ww

Below are several `ColumnSchema`s that all would include our `quantity` column, but each of them describes a different type space. These `ColumnSchema`s get more restrictive as we go down:

##### Entire Table
No restrictions have been placed; any column falls into this definition. This would include the whole DataFrame.

In [None]:
from woodwork.column_schema import ColumnSchema

ColumnSchema()

##### By Standard Tag
Only columns with the `numeric` tag fall into this type space. This can include columns with the Double, Integer, and Age logical types. It will not include the `index` column which, despite containing integers, has had its standard tags replaced by the `'index'` tag.

In [None]:
ColumnSchema(semantic_tags=['numeric'])

In [None]:
df = es['order_products'].ww.select(include='numeric')
df.ww

##### By Logical Type
Only columns with logical type of `Integer` are included in this definition. Does not require the `numeric` tag, so an index column (which has its standard tags removed) would still apply.

In [None]:
from woodwork.logical_types import Integer

ColumnSchema(logical_type=Integer)

In [None]:
df = es['order_products'].ww.select(include='Integer')
df.ww

##### By Logical Type and Semantic Tag
The column must have logical type `Integer` and have the `numeric` semantic tag, excluding index columns.

In [None]:
ColumnSchema(logical_type=Integer, semantic_tags=['numeric'])

In [None]:
df = es['order_products'].ww.select(include='numeric')
df = df.ww.select(include='Integer')
df.ww

In this way, a `ColumnSchema` can define a type space under which columns in a Woodwork DataFrame can fall. This is how Featuretools determines which columns in a DataFrame are valid for a Primitive in building Features during DFS.
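The matching rule described here can be sketched in plain Python. This is a simplified illustration, not Woodwork's or Featuretools' actual implementation: `SchemaSketch` and `falls_within` are hypothetical names, and logical types are represented as plain strings.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class SchemaSketch:
    """Hypothetical stand-in for a ColumnSchema: a logical type name plus tags."""
    logical_type: Optional[str] = None
    semantic_tags: FrozenSet[str] = frozenset()


def falls_within(column: SchemaSketch, type_space: SchemaSketch) -> bool:
    """Return True if the column's type definition lies inside the type space."""
    # An unset logical type places no restriction on the column.
    if type_space.logical_type is not None:
        if column.logical_type != type_space.logical_type:
            return False
    # Every tag the type space requires must be present on the column.
    return type_space.semantic_tags <= column.semantic_tags


# Type definitions for two columns.
quantity = SchemaSketch("Integer", frozenset({"numeric"}))
index_col = SchemaSketch("Integer", frozenset({"index"}))

# Type spaces, from least to most restrictive.
anything = SchemaSketch()
numeric_space = SchemaSketch(semantic_tags=frozenset({"numeric"}))
integer_space = SchemaSketch(logical_type="Integer")
numeric_integer_space = SchemaSketch("Integer", frozenset({"numeric"}))

print(falls_within(index_col, anything))               # True
print(falls_within(index_col, numeric_space))          # False
print(falls_within(index_col, integer_space))          # True
print(falls_within(quantity, numeric_integer_space))   # True
```

The index column drops out as soon as a type space requires the `numeric` tag, which mirrors the behavior of the selections shown earlier.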

Each Primitive has `input_types` and a `return_type` that are described by a Woodwork `ColumnSchema`. Every table in an EntitySet has Woodwork initialized on it. This means that when an EntitySet is passed into DFS, Featuretools can select the relevant columns in the DataFrame that are valid for the Primitive's `input_types`. We then get a Feature that has a `column_schema` property that indicates what that Feature's typing definition is in a way that lets DFS stack features on top of one another.

In this way, Featuretools is able to leverage the base unit of Woodwork typing information, the `ColumnSchema`, and use it in concert with an EntitySet of Woodwork DataFrames in order to build Features with Deep Feature Synthesis.