In [2]:
import woodwork as ww
import featuretools as ft
import pandas as pd

# Woodwork Typing in Featuretools

Featuretools relies on consistent typing across the creation of EntitySets, Primitives, Features, and feature matrices. Previously, Featuretools used its own type system, Variables. Now and moving forward, Featuretools uses an external package for its typing: Woodwork. Woodwork is also used by Alteryx's open source AutoML tool, EvalML, allowing a smooth transition from feature generation to model building with a common type system.

[Woodwork](https://woodwork.alteryx.com/en/stable/index.html) is a library that helps with the data typing of 2-dimensional tabular data structures by creating a namespace on your DataFrame that contains physical, logical, and semantic data types that can be used within the Alteryx Open Source ecosystem as well as in general machine learning applications.

A more in-depth explanation of what is available as part of Woodwork's type system can be found below, but to start, here are quick definitions of the types that are central to Woodwork:

- Physical Type: defines how the data is stored on disk or in memory.
- Logical Type: defines how the data should be parsed or interpreted.
- Semantic Tag(s): provides additional data about the meaning of the data or how it should be used.

A full-length explanation of Woodwork's Types and Tags can be found [here](link). 

## Physical Types 
Physical types define how the data is stored on disk or in memory. You might also see the physical type for a column referred to as the column's `dtype`.

Pandas, Dask, and Koalas DataFrames rely on these dtypes when performing DataFrame operations, so knowing a Woodwork table's physical types is important for any DataFrame operations that might be performed. Each `LogicalType` has a single physical type associated with it, though multiple logical types can have the same physical type.

## Logical Types

Multiple logical types may share the same physical type, because a logical type adds information about how the data should be interpreted or parsed.

For example, email addresses and phone numbers would typically both be stored in a data column with a physical type of `string`. However, when reading, validating, and using these two types of information, different rules apply. For email addresses, the presence of the `@` symbol is important. For phone numbers, you might want to confirm that only a certain number of digits are present or use the first three digits to determine an area code, and special characters might be restricted to `+`, `-`, `(` or `)`. In this particular example, Woodwork defines two different logical types to separate these parsing needs: `EmailAddress` and `PhoneNumber`.
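
To make the distinction concrete, here is a minimal, library-independent sketch of how the same physical type (a plain string) can call for different parsing rules. The helper names (`looks_like_email`, `looks_like_phone`, `area_code`), the ten-digit default, and the decision to allow spaces are assumptions made for illustration; this is not how Woodwork validates `EmailAddress` or `PhoneNumber` columns.

```python
import re

# Hypothetical, deliberately loose rules; real-world validation is more involved.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+$")
PHONE_ALLOWED = re.compile(r"^[0-9+\-() ]+$")  # assumption: spaces are also allowed


def looks_like_email(value: str) -> bool:
    """True if the string has exactly one '@' with text on both sides."""
    return EMAIL_PATTERN.match(value) is not None


def looks_like_phone(value: str, digits: int = 10) -> bool:
    """True if only digits, '+', '-', '(', ')' or spaces appear and the
    number of digits equals `digits`."""
    if PHONE_ALLOWED.match(value) is None:
        return False
    return sum(ch.isdigit() for ch in value) == digits


def area_code(value: str) -> str:
    """Return the first three digits found in a phone number string."""
    return "".join(ch for ch in value if ch.isdigit())[:3]


email = "user@example.com"
phone = "(555) 123-4567"

# Both values are strings, but each one passes only its own rule set.
print(looks_like_email(email), looks_like_phone(email))
print(looks_like_email(phone), looks_like_phone(phone))
print(area_code(phone))
```

Both values share a physical type, yet each passes only the checks written for its own kind of data, which is the separation that distinct logical types are meant to capture.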

Woodwork uses many different logical types, which can be seen with the `list_logical_types` function.

In [45]:
ft.list_logical_types()

,name,type_string,description,physical_type,standard_tags,is_default_type,is_registered,parent_type
0,Address,address,Represents Logical Types that contain address ...,string,{},True,True,NaturalLanguage
1,Age,age,Represents Logical Types that contain non-nega...,int64,{numeric},True,True,Integer
2,AgeNullable,age_nullable,Represents Logical Types that contain non-nega...,Int64,{numeric},True,True,IntegerNullable
3,Boolean,boolean,Represents Logical Types that contain binary v...,bool,{},True,True,BooleanNullable
4,BooleanNullable,boolean_nullable,Represents Logical Types that contain binary v...,boolean,{},True,True,
5,Categorical,categorical,Represents Logical Types that contain unordere...,category,{category},True,True,
6,CountryCode,country_code,Represents Logical Types that contain categori...,category,{category},True,True,Categorical
7,Datetime,datetime,Represents Logical Types that contain date and...,datetime64[ns],{},True,True,
8,Double,double,Represents Logical Types that contain positive...,float64,{numeric},True,True,
9,EmailAddress,email_address,Represents Logical Types that contain email ad...,string,{},True,True,NaturalLanguage
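
The many-to-one relationship between logical and physical types can be read off the rows above. The sketch below copies the `name` and `physical_type` values from those ten rows into a plain dictionary (the variable names are illustrative) and inverts it to find physical types that back more than one logical type.

```python
from collections import defaultdict

# (logical type -> physical type) pairs copied from the rows shown above.
LOGICAL_TO_PHYSICAL = {
    "Address": "string",
    "Age": "int64",
    "AgeNullable": "Int64",
    "Boolean": "bool",
    "BooleanNullable": "boolean",
    "Categorical": "category",
    "CountryCode": "category",
    "Datetime": "datetime64[ns]",
    "Double": "float64",
    "EmailAddress": "string",
}

# Invert the mapping: one physical type can back several logical types.
physical_to_logical = defaultdict(list)
for logical, physical in LOGICAL_TO_PHYSICAL.items():
    physical_to_logical[physical].append(logical)

# Keep only the physical types shared by more than one logical type.
shared = {p: names for p, names in physical_to_logical.items() if len(names) > 1}
print(shared)
```

Among these ten rows, `string` and `category` are each shared by two logical types, while every logical type maps to exactly one physical type.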


## Semantic Tags
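The `standard_tags` column in the output above lists the semantic tags that each logical type carries by default. For example, `Age` and `Double` carry the `numeric` tag, and `Categorical` carries the `category` tag.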


### Woodwork in EntitySets

An EntitySet is a collection of DataFrames and the relationships between them. Previously, EntitySets were built from `Entity` objects that stored the typing information, but with Woodwork, the physical, logical, and semantic typing information is stored within the DataFrames that make up an EntitySet. For more information on representing data with EntitySets, see the [EntitySet guide](link). Let's look at the typing information as it's stored in an EntitySet:

In [14]:
es = ft.demo.load_flight()
es

Downloading data ...


Entityset: Flight Data
  DataFrames:
    trip_logs [Rows: 860457, Columns: 21]
    flights [Rows: 50557, Columns: 9]
    airlines [Rows: 12, Columns: 1]
    airports [Rows: 299, Columns: 3]
  Relationships:
    trip_logs.flight_id -> flights.flight_id
    flights.carrier -> airlines.carrier
    flights.dest -> airports.dest

The EntitySet representation above contains no Woodwork typing information, as it is a representation of a relational dataset, and Woodwork currently supports single tables. To look at the Woodwork typing information, we first select a single DataFrame from the EntitySet and then access its `ww` namespace:

In [31]:
df = es['flights']
df.head()

,flight_id,origin,origin_city,origin_state,dest,distance_group,carrier,flight_num,first_trip_logs_time
AA-494:RSW->CLT,AA-494:RSW->CLT,RSW,"Fort Myers, FL",FL,CLT,3,AA,494,2016-09-03
AA-495:ATL->PHX,AA-495:ATL->PHX,ATL,"Atlanta, GA",GA,PHX,7,AA,495,2016-09-03
AA-495:CLT->ATL,AA-495:CLT->ATL,CLT,"Charlotte, NC",NC,ATL,1,AA,495,2016-09-03
AA-495:TPA->CLT,AA-495:TPA->CLT,TPA,"Tampa, FL",FL,CLT,3,AA,495,2016-09-03
AA-496:PIT->DFW,AA-496:PIT->DFW,PIT,"Pittsburgh, PA",PA,DFW,5,AA,496,2016-09-03


In [32]:
df.ww

Column,Physical Type,Logical Type,Semantic Tag(s)
flight_id,string,NaturalLanguage,['index']
origin,category,Categorical,['category']
origin_city,string,NaturalLanguage,[]
origin_state,category,Categorical,['category']
dest,category,Categorical,"['foreign_key', 'category']"
distance_group,category,Ordinal,['category']
carrier,category,Categorical,"['foreign_key', 'category']"
flight_num,category,Categorical,['category']
first_trip_logs_time,datetime64[ns],Datetime,['time_index']


Notice how the three columns provided on the Woodwork accessor are the three types of typing information outlined at the beginning of this guide. Physical types, logical types, and semantic tags make up the typing information that Woodwork provides. By combining all three pieces of this information for a column of data, we can gain an understanding of the contents of each column.

The physical type is determined by the Logical Type (notice how every column with `Categorical` logical type also has the `category` physical type), and some semantic tags are determined by the logical type (notice how both `Ordinal` and `Categorical` have the `'category'` semantic tag).

However, semantic tags can provide other semantic information. For example, the `index` and `time_index` tags are added by Woodwork to the columns that are the DataFrame's index and time index. These tags allow Woodwork-specific logic to be performed on those columns. For example, Woodwork requires that the index column be fully unique.

Then, there is a semantic tag that has been added by Featuretools: `foreign_key`. This is a result of the relationships that Featuretools defines between tables in the EntitySet. The `foreign_key` tag lets us know that these columns have parent columns that are the primary key, or `index`, of another DataFrame.
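
The meaning of the `index` and `foreign_key` tags can be illustrated with a small, library-independent sketch. The flight rows are taken from the `flights` output above, and the parent relationship follows `flights.dest -> airports.dest`. The airport codes listed as the parent index are an assumed subset, and both helper functions are hypothetical. This shows what the tags imply about the data, not a check that Featuretools is claimed to run.

```python
# Sample values taken from the flights rows shown earlier.
flights = [
    {"flight_id": "AA-494:RSW->CLT", "dest": "CLT"},
    {"flight_id": "AA-495:ATL->PHX", "dest": "PHX"},
    {"flight_id": "AA-495:CLT->ATL", "dest": "ATL"},
]
# Assumption: a few index values of the airports table, for illustration only.
airports_index = ["ATL", "CLT", "PHX", "DFW"]


def is_unique(values):
    """An index column must not contain repeated values."""
    return len(values) == len(set(values))


def missing_parents(child_values, parent_index):
    """Return foreign-key values that have no matching parent index value."""
    parents = set(parent_index)
    return sorted({v for v in child_values if v not in parents})


flight_ids = [row["flight_id"] for row in flights]
dests = [row["dest"] for row in flights]

print(is_unique(flight_ids))                    # index requirement on flights
print(missing_parents(dests, airports_index))   # dest values without a parent
```

An empty list from `missing_parents` means every `dest` value in the sample points at a row of the parent table's index.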

In this context, we view the Woodwork typing information on a table-wide level, but just as we can select a single Series from a DataFrame, we can get the Woodwork typing information for a single column.

In [19]:
df.ww['flight_id'].ww

<Series: flight_id (Physical Type = string) (Logical Type = NaturalLanguage) (Semantic Tags = {'index'})>

This is useful because of the way that Featuretools calculates features on a by-column basis. In fact, the column-specific typing information is an integral part of Deep Feature Synthesis' ability to generate features for an EntitySet.


### Woodwork in DFS
As the units of computation in Featuretools, Primitives need to be able to specify the input types that they allow as well as have a predictable return type. For these, we use Woodwork ColumnSchemas. To understand how these get used, we will dive a little deeper into the structure of a DataFrame that has Woodwork typing information:

A Woodwork DataFrame has the `ww` namespace, which is how users access the typing information. However, Woodwork also provides an object that contains just typing information and does not have any data associated with it: `woodwork.TableSchema`.

If we take the DataFrame from above and access its Woodwork TableSchema, we can see that the major difference is that there's no physical type associated with the schema.

In [21]:
table_schema = df.ww.schema
table_schema

Column,Logical Type,Semantic Tag(s)
flight_id,NaturalLanguage,['index']
origin,Categorical,['category']
origin_city,NaturalLanguage,[]
origin_state,Categorical,['category']
dest,Categorical,"['foreign_key', 'category']"
distance_group,Ordinal,['category']
carrier,Categorical,"['foreign_key', 'category']"
flight_num,Categorical,['category']
first_trip_logs_time,Datetime,['time_index']


In [22]:
type(table_schema)

woodwork.table_schema.TableSchema

The lack of a physical type is due to the fact that a TableSchema has no data and, therefore, no physical representation of the data. We often rely on physical typing information to know the exact pandas, Dask, or Koalas operations that are valid, but for a schema of typing information that is not tied to data, those operations are not relevant.

Now, let's look at a single column of typing information, or a `ColumnSchema`, which we can acquire in much the same way as we selected a Series from the DataFrame and had it maintain its typing information:

In [27]:
column_schema = df.ww['flight_id'].ww.schema
column_schema

<ColumnSchema (Logical Type = NaturalLanguage) (Semantic Tags = ['index'])>

The `column_schema` object above can be understood as typing information for a single column that is not tied to any data. In this case, we happen to know where the column schema came from: it was the index column of the `flights` table in the flights EntitySet. But since a `ColumnSchema` is just typing information for a column, we can make it more specific or more general. Let's look at a different column and generate a few examples:


In [30]:
es['trip_logs'].ww.columns['distance']

<ColumnSchema (Logical Type = Double) (Semantic Tags = ['numeric'])>

The `ColumnSchema` above has been selected from the `distance` column in the `trip_logs` table in the flights EntitySet. If we look at the `trip_logs` table, we can see that there are a lot of columns with the same `ColumnSchema`.


In [44]:
es['trip_logs'].ww

Column,Physical Type,Logical Type,Semantic Tag(s)
trip_log_id,int64,Integer,['index']
flight_id,string,NaturalLanguage,['foreign_key']
date_scheduled,datetime64[ns],Datetime,['time_index']
scheduled_dep_time,datetime64[ns],Datetime,[]
scheduled_arr_time,datetime64[ns],Datetime,[]
dep_time,datetime64[ns],Datetime,[]
arr_time,datetime64[ns],Datetime,[]
dep_delay,float64,Double,['numeric']
taxi_out,float64,Double,['numeric']
taxi_in,float64,Double,['numeric']


To understand how Featuretools uses ColumnSchemas to define input and return types for DFS, we have to be able to understand a `ColumnSchema` both in the context of a specific column that's part of a DataFrame (like `distance` above) and as a more general type definition. Below are several `ColumnSchema`s that all would include our `distance` column, but each of them describes a slightly different set of columns.

These `ColumnSchema`s get more restrictive as we go down:

- **`<ColumnSchema >`** - No restrictions have been placed; any column falls into this definition.
- **`<ColumnSchema (Semantic Tags = ['numeric'])>`** - Only columns with the `numeric` tag apply. This can include Double, Integer, and Age logical type columns.
- **`<ColumnSchema (Logical Type = Double) >`** - Only columns with logical type of `Double` are included in this definition. Does not require the `numeric` tag, so an index column (which has its standard tags removed) would still match.
- **`<ColumnSchema (Logical Type = Double) (Semantic Tags = ['numeric'])>`** - The column must have logical type `Double` and have the `numeric` semantic tag, excluding index columns.
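
The narrowing described in the list above can be sketched without any library. `SchemaSketch` and `falls_under` below are hypothetical stand-ins written for this guide. They model a schema as an optional logical type name plus a set of required tags, and they are a simplification, not Woodwork's `ColumnSchema` or the matching logic Featuretools actually uses.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class SchemaSketch:
    """Hypothetical stand-in for a ColumnSchema: typing info only, no data."""
    logical_type: Optional[str] = None
    semantic_tags: FrozenSet[str] = field(default_factory=frozenset)


def falls_under(column: SchemaSketch, template: SchemaSketch) -> bool:
    """True if `column` satisfies every restriction that `template` sets."""
    if template.logical_type is not None and column.logical_type != template.logical_type:
        return False
    # Every tag the template asks for must be present on the column.
    return template.semantic_tags <= column.semantic_tags


# The four definitions from the list, least to most restrictive.
templates = [
    SchemaSketch(),
    SchemaSketch(semantic_tags=frozenset({"numeric"})),
    SchemaSketch(logical_type="Double"),
    SchemaSketch(logical_type="Double", semantic_tags=frozenset({"numeric"})),
]

distance = SchemaSketch("Double", frozenset({"numeric"}))
double_index = SchemaSketch("Double", frozenset({"index"}))   # standard tags removed
trip_log_id = SchemaSketch("Integer", frozenset({"index"}))

columns = [("distance", distance), ("double_index", double_index), ("trip_log_id", trip_log_id)]
for name, column in columns:
    print(name, [falls_under(column, t) for t in templates])
```

In this sketch `distance` satisfies all four definitions, a `Double` index column satisfies only the first and third, and the `Integer` index column satisfies only the unrestricted one.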

In this way, a `ColumnSchema` can define a type space under which columns in a Woodwork DataFrame can fall. This is how we determine which columns in a DataFrame are valid for a Primitive in building Features during DFS.

Each primitive has `input_types` and a `return_type` that are described by a Woodwork `ColumnSchema`. So when we pass an EntitySet into DFS, and every table in the EntitySet is a Woodwork-initialized DataFrame, we can select the relevant columns in the DataFrame that are valid for the Primitive's `input_types`. We then get a Feature that has a `column_schema` property that indicates what that Feature's typing information is in a way that lets DFS stack features on top of one another. 
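
As a rough illustration of that selection step, the sketch below filters a few `trip_logs` columns (typing information copied from the table above) against the input requirements of two made-up primitives. The helpers `matches` and `valid_inputs` and the two "primitives" are hypothetical. The real DFS logic works on Woodwork `ColumnSchema` objects and is more involved.

```python
# Typing information for a few trip_logs columns, copied from the table above:
# column name -> (logical type, semantic tags)
TRIP_LOGS_TYPES = {
    "trip_log_id": ("Integer", {"index"}),
    "flight_id": ("NaturalLanguage", {"foreign_key"}),
    "date_scheduled": ("Datetime", {"time_index"}),
    "dep_time": ("Datetime", set()),
    "dep_delay": ("Double", {"numeric"}),
    "taxi_out": ("Double", {"numeric"}),
    "taxi_in": ("Double", {"numeric"}),
}


def matches(column_type, required_logical_type=None, required_tags=frozenset()):
    """True if a column meets the logical type and tag requirements."""
    logical_type, tags = column_type
    if required_logical_type is not None and logical_type != required_logical_type:
        return False
    return set(required_tags) <= tags


def valid_inputs(table_types, required_logical_type=None, required_tags=frozenset()):
    """Names of the columns that a primitive with these requirements could use."""
    return [
        name
        for name, column_type in table_types.items()
        if matches(column_type, required_logical_type, required_tags)
    ]


# Hypothetical primitive that accepts any column tagged `numeric`.
numeric_inputs = valid_inputs(TRIP_LOGS_TYPES, required_tags={"numeric"})
# Hypothetical primitive that accepts Datetime columns regardless of tags.
datetime_inputs = valid_inputs(TRIP_LOGS_TYPES, required_logical_type="Datetime")

print(numeric_inputs)
print(datetime_inputs)
```

Note that `trip_log_id` is excluded from the numeric inputs even though `Integer` is a numeric logical type, because an index column has its standard tags removed.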