# Logical Types and Semantic Tags

In a Woodwork DataFrame, each column has three pieces of typing information: a physical type, a logical type, and semantic tags. When Woodwork is initialized on a DataFrame, it attempts to infer any of these types that are not supplied at initialization.

Defining specific logical types and semantic tags for your data is important for building a table that contains the most specific type information possible. Because Woodwork may transform your data to match the expected format of certain logical types, choosing a logical type is also a promise that the data matches that type's defined format.

Having the correct types set is also important for downstream usage of the data. The purpose of having a type system is so that data can be used confidently and effectively in general machine learning projects. For example, if Woodwork infers a categorical column as natural language, that will have an adverse impact on any model built using the resulting types.

This guide offers an in-depth walk-through of all of the logical types and semantic tags that Woodwork defines. As a reminder, here are quick definitions of Woodwork's types:

- Physical Type: defines how the data is stored on disk or in memory.
- Logical Type: defines how the data should be parsed or interpreted.
- Semantic Tag(s): provides additional data about the meaning of the data or how it should be used.

For an in-depth guide on how to set and manipulate these types, see the [understanding types and tags guide](https://woodwork.alteryx.com/en/stable/guides/understanding_types_and_tags.html).

For information on how to customize Woodwork's type system, see the [custom types and inference guide](https://woodwork.alteryx.com/en/stable/guides/custom_types_and_type_inference.html).

It's important to remember that Woodwork columns always have a logical type, and that any semantic tags Woodwork adds are meant to layer additional meaning onto that logical type. We'll start by looking at semantic tags in depth so that when we get to logical types, we can better understand what a semantic tag adds to each one.



# Semantic Tags

Here is the full set of Woodwork-defined semantic tags:

In [None]:
import woodwork as ww
ww.list_semantic_tags()

#### Standard Tags
Standard tags are associated with specific logical types. They are useful for indicating predefined categories that logical types might fall into.
- `'numeric'` - Is applied to any numeric logical type
    - **Uses**: Can select for just numeric columns when performing operations that require numeric data
    - **Related Properties**: `series.ww.is_numeric`
- `'category'` - Is applied to any logical type that is categorical in nature
    - **Uses**: Can select for just categorical columns when performing operations that require categories
    - **Related Properties**: `series.ww.is_categorical`

#### Index Tags
Index tags are added by Woodwork to a DataFrame when an `index` or `time_index` column is identified by the user. These tags have some special properties that are only confirmed to be true in the context of a DataFrame (so any Series with these tags may not have these properties).
- `'index'` - Indicates that a column is the DataFrame's index, or primary key
    - There will only be one index column
    - The contents of an index column will be unique
    - An index column will have any standard semantic tags removed
    - The data in an index column will be reflected in the DataFrame's underlying index
- `'time_index'` - Indicates that a column is the DataFrame's time index
    - There will only be one time index column
    - A time index column will either contain datetime or numeric data

#### Other Tags
This section contains tags that are not automatically added to a Woodwork DataFrame but that Woodwork recognizes as having a specific meaning, even if Woodwork takes no specific action on them. Additional tags beyond the ones Woodwork adds at initialization may be useful for a DataFrame's interpretability, so users are encouraged to add any tags that will allow them to use their data more efficiently.
- `'date_of_birth'` - Indicates that a datetime column should be parsed as a date of birth



# Logical Types

Below are all of the Logical Types that Woodwork defines. We will now walk through each of these types with some examples and suggested uses.

In [None]:
import woodwork as ww
ww.list_logical_types()

In the DataFrame above, we can see a `parent_type` column. This concept of `LogicalType` classes having parents is not indicative of a class-inheritance structure. We'll see below that all default `LogicalType`s that can show up in a Woodwork DataFrame are subclasses of the base `LogicalType` class. This parent-child relationship is something that comes from Woodwork's type system, which we describe in detail in the [custom types and type inference guide](https://woodwork.alteryx.com/en/stable/guides/custom_types_and_type_inference.html#Logical-Type-Relationships).

As a quick summary, the `parent_type` column indicates how type inference should rank inferred types. It's possible for data to be inferred as more than one Logical Type. For example, a column of all integers with no null values could be inferred as either `Integer` or `IntegerNullable`. `IntegerNullable` being the parent of `Integer` indicates that we know more about the contents of a column if it's able to be an `Integer` column (namely, that there are no null values), and Woodwork should infer a column as the Logical Type for which we know the most information.
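
The ranking idea can be illustrated with a small toy model in plain Python. This is not Woodwork's implementation: the `PARENT_OF` mapping and the helper functions are hypothetical, and only the `Integer`/`IntegerNullable` pair named above is modeled.

```python
# Toy model of parent/child relationships between logical types.
# Only the pair discussed above is included; any further ancestors are omitted.
PARENT_OF = {
    "IntegerNullable": None,
    "Integer": "IntegerNullable",
}


def depth(type_name):
    """Count the ancestors of a type; more ancestors means more specific."""
    count = 0
    parent = PARENT_OF[type_name]
    while parent is not None:
        count += 1
        parent = PARENT_OF[parent]
    return count


def most_specific(candidates):
    """Pick the candidate type that sits deepest in the hierarchy."""
    return max(candidates, key=depth)


# A null-free integer column matches both types; the child is preferred.
chosen = most_specific(["IntegerNullable", "Integer"])
print(chosen)  # Integer
```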

#### Base LogicalType Class

All logical types used by Woodwork are subclassed off of the base `LogicalType` class. Here we'll describe a bit about what the base LogicalType class (and therefore all subclasses) can do.
- Define a `dtype` that will get used for a given LogicalType - this is how the physical type for a column gets determined
- Allow for a basic transformation into the expected physical `dtype` - this is how Woodwork LogicalTypes act as a form of data transformer. Depending on its requirements, a LogicalType can define an expected format for any data applied to it and provide a method for transforming input data to that format.
    - This guide makes a note of any transformation that a LogicalType performs beyond the expected DataFrame behavior for an `astype` call.
- Have an empty set as its `LogicalType.standard_tags` attribute. This indicates that, by default, LogicalTypes do not have standard tags associated with them. Thus, if standard tags are not mentioned for a Logical Type, it can be assumed that there are none.

## Default

#### Unknown
When Woodwork inference does not return any LogicalTypes for a column, Woodwork will set the column's logical type as the default LogicalType, `Unknown`. A logical type being inferred as `Unknown` may be a good indicator that a more specific logical type can be chosen and set by the user.
- **physical type**: `string`

In [None]:
import pandas as pd

series = pd.Series(["AUS", "USA", "UKR"])
unknown_series = ww.init_series(series)
unknown_series.ww

In [None]:
countrycode_series = ww.init_series(unknown_series, 'CountryCode')
countrycode_series.ww

## Numeric

#### Age

Represents Logical Types that contain non-negative numbers indicating a person’s age.
- **physical type**: `int64`
- **standard tags**: `{'numeric'}`

#### AgeNullable

Represents Logical Types that contain non-negative numbers indicating a person’s age. May also contain null values.
- **physical type**: `Int64`
- **standard tags**: `{'numeric'}`

#### Double

Represents Logical Types that contain positive and negative numbers, some of which include a fractional component.
- **physical type**: `float64`
- **standard tags**: `{'numeric'}`

#### Integer

Represents Logical Types that contain positive and negative numbers without a fractional component, including zero (0).
- **physical type**: `int64`
- **standard tags**: `{'numeric'}`

#### IntegerNullable 
Represents Logical Types that contain positive and negative numbers without a fractional component, including zero (0). May contain null values. 
- **physical type**: `Int64`
- **standard tags**: `{'numeric'}`



Below is a DataFrame with examples of each of the numeric LogicalTypes.

In [None]:
numerics_df = pd.DataFrame({
    'ints' : [1, 2, 3, 4],
    'ints_nullable': pd.Series([1, 2, None, 4], dtype='Int64'),
    'floats' : [0.0, 1.1, 2.2, 3.3],
    'ages': [18, 22, 24, 34],
    'ages_nullable' : [None, 2, 22, 33]
})

numerics_df.ww.init(logical_types={'ages':'Age', 'ages_nullable':'AgeNullable'})
numerics_df.ww


## Categorical

#### Categorical 

Represents a Logical Type with few unique values relative to the size of the data.

- **physical type**: `category`
- **standard tags**: `{'category'}`
- **inference**: Woodwork defines a threshold for percentage unique values relative to the size of the series below which a series will be considered categorical. See [setting config options guide](https://woodwork.alteryx.com/en/stable/guides/setting_config_options.html#Categorical-Threshold) for more information on how to control this threshold.
- **koalas note**: Koalas does not support the `category` dtype, so for Koalas DataFrames and Series, the `string` dtype will be used.
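
The threshold rule described in the inference note can be sketched in plain Python. This is only an illustration of the idea: the helper name and the threshold value below are assumptions, not Woodwork's actual function or default.

```python
def looks_categorical(values, threshold):
    """Return True when the share of unique non-null values is below threshold."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False
    unique_ratio = len(set(non_null)) / len(non_null)
    return unique_ratio < threshold


# Illustrative value only; not necessarily Woodwork's default.
ASSUMED_THRESHOLD = 0.2

eye_colors = ["brown", "blue", "brown"] * 4    # 2 unique values out of 12
user_ids = [f"user-{i}" for i in range(12)]    # 12 unique values out of 12

colors_categorical = looks_categorical(eye_colors, ASSUMED_THRESHOLD)
ids_categorical = looks_categorical(user_ids, ASSUMED_THRESHOLD)
print(colors_categorical, ids_categorical)
```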


Some examples of data for which the Categorical logical type would apply:

- Gender
- Eye Color
- Nationality
- Hair Color
- Spoken Language

#### CountryCode

Represents Logical Types that contain categorical information specifically used to represent countries.
- **physical type**: `category`
- **standard tags**: `{'category'}`
- **koalas note**: Koalas does not support the `category` dtype, so for Koalas DataFrames and Series, the `string` dtype will be used.

For example: `'AUS'` for Australia, `'CHN'` for China, and `'CAN'` for Canada.

#### Ordinal

An Ordinal variable type can take ordered discrete values. Similar to Categorical, it usually has a limited, fixed number of possible values. However, these discrete values have a certain order, and the ordering is important to understanding the values. Ordinal variable types can be represented as strings or integers.
- **physical type**: `category`
    - **note**: the categories are ordered, using the values supplied in `order`
- **standard tags**: `{'category'}`
- **parameters**:
    - `order` - the order of the ordinal values in the column from low to high
- **validation**: an order must be defined for an Ordinal column on a DataFrame or Series, and every value in the data must be present in the order.
- **koalas note**: Koalas does not support the `category` dtype, so for Koalas DataFrames and Series, the `string` dtype will be used.
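
To make the ordered `category` physical type more concrete, here is a rough sketch in plain pandas rather than Woodwork itself. The explicit check reflects one reading of the validation rule (no value in the data may fall outside the order) and is written by hand here; it is not Woodwork's validation code.

```python
import pandas as pd

order = ["small", "medium", "large"]
values = ["small", "large", "large", "medium"]

# Hand-written validation: every value in the data must appear in the order.
unexpected = sorted(set(values) - set(order))
if unexpected:
    raise ValueError(f"values missing from order: {unexpected}")

# An ordered categorical keeps the low-to-high ranking with the data.
sizes = pd.Categorical(values, categories=order, ordered=True)
codes = [int(code) for code in sizes.codes]  # position of each value in order
largest = sizes.max()
print(codes, largest)
```

Because the categories are ordered, comparisons and aggregations such as `max` follow the supplied order rather than alphabetical order.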

Some examples of data for which the Ordinal logical type would apply:

- Educational Background (Elementary, High School, Undergraduate, Graduate)
- Satisfaction Rating (“Not Satisfied”, “Satisfied”, “Very Satisfied”)
- Spicy Level (Hot, Hotter, Hottest)
- Student Grade (A, B, C, D, F)
- Size (small, medium, large)

#### PostalCode

Represents Logical Types that contain a series of postal codes for representing a group of addresses.
- **physical type**: `category`
- **standard tags**: `{'category'}`
- **koalas note**: Koalas does not support the `category` dtype, so for Koalas DataFrames and Series, the `string` dtype will be used.

#### SubRegionCode

Represents Logical Types that contain codes representing a portion of a larger geographic region.
- **physical type**: `category`
- **standard tags**: `{'category'}`
- **koalas note**: Koalas does not support the `category` dtype, so for Koalas DataFrames and Series, the `string` dtype will be used.

For example: `'US-IL'` to represent Illinois in the United States or `'AU-TAS'` to represent Tasmania in Australia.

In [None]:
categoricals_df = pd.DataFrame({
    'categorical': pd.Series(['a', 'b', 'a', 'a'], dtype='category'),
    'ordinal' : ['small', 'large', 'large', 'medium'],
    'country_code' : ["AUS", "USA", "UKR", "AUS"],
    'postal_code': ["90210", "60035", "SW1A", "90210" ],
    'sub_region_code' : ["AU-NSW", "AU-TAS", "AU-QLD", "AU-QLD"]
})
categoricals_df.ww.init(logical_types={'ordinal':ww.logical_types.Ordinal(order=['small', 'medium', 'large']),
                                       'country_code':'CountryCode', 
                                       'postal_code':'PostalCode',
                                       'sub_region_code':'SubRegionCode'})

categoricals_df.ww

## Specific Format

#### Boolean

Represents Logical Types that contain binary values indicating true/false.
- **physical type**: `bool`

#### BooleanNullable
Represents Logical Types that contain binary values indicating true/false. May also contain null values.
- **physical type**: `boolean`

#### Datetime
A Datetime is a representation of a date and/or time. Datetime variable types can be represented as strings or integers. However, they should be in an interpretable format or properly cast before using Deep Feature Synthesis (DFS).
- **physical type**: `datetime64[ns]`
- **transformation**: Will convert valid strings or numbers to pandas datetimes, allowing more datetime formats with the `datetime_format` parameter.
- **parameters**:
    - `datetime_format` - the format of the datetimes in the column 

Some examples of Datetime include:

- transaction time
- flight departure time
- pickup time
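
The string-to-datetime transformation described above is similar in spirit to parsing with an explicit format in plain pandas. The sketch below uses `pd.to_datetime` directly as a stand-in; it is not a claim about how Woodwork performs the conversion internally.

```python
import pandas as pd

raw_dates = pd.Series(["2019/01/01", "2019/01/02", "2019/01/03"])

# The explicit format string plays the role of the `datetime_format` parameter.
parsed = pd.to_datetime(raw_dates, format="%Y/%m/%d")
days = parsed.dt.day.tolist()
print(parsed.dtype, days)
```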

#### EmailAddress

Represents Logical Types that contain email address values.
- **physical type**: `string`
- **inference**: Uses an email address regex; if the data matches, the column is considered to contain email addresses. To learn more about controlling the regex used, see the [setting config options guide](https://woodwork.alteryx.com/en/stable/guides/setting_config_options.html#Email-Inference-Regex).
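
Regex-based inference of this kind can be sketched with the standard library. The pattern below is a simplified, hypothetical one chosen for illustration; the regex Woodwork is configured with may differ, and the helper name is an assumption.

```python
import re

# Simplified, hypothetical pattern; Woodwork's configured regex may differ.
EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def looks_like_emails(values):
    """Return True when every non-null value fully matches the email pattern."""
    non_null = [v for v in values if v is not None]
    return bool(non_null) and all(EMAIL_PATTERN.match(v) for v in non_null)


all_emails = looks_like_emails(["john.smith@example.com", "support@example.com"])
mixed = looks_like_emails(["john.smith@example.com", "not an email"])
print(all_emails, mixed)
```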

#### LatLong
A LatLong represents an ordered pair (Latitude, Longitude) that identifies a location on Earth. The order of the tuple is important. LatLongs can be represented as a tuple of floating point numbers.
- **physical type**: `object`
- **transformation**: Will convert inputs into a tuple of floats. Any null values will be stored as `np.nan`
- **koalas note**: Koalas does not support tuples, so latlongs will be stored as a list of floats
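
A rough, standard-library sketch of the kind of normalization described in the transformation note is shown below. The `to_latlong` helper is hypothetical and handles only simple inputs (strings such as `'[lat, long]'` or `'lat, long'`, two-element sequences, and nulls); Woodwork's own conversion may accept more formats or behave differently at the edges.

```python
import math


def _to_float(part):
    """Convert one coordinate to float, mapping None to NaN."""
    if part is None:
        return math.nan
    return float(part)


def to_latlong(value):
    """Convert a raw value to a (latitude, longitude) tuple of floats.

    A null inside a pair becomes NaN; a null in place of the whole pair
    is returned as a single NaN.
    """
    if value is None:
        return math.nan
    if isinstance(value, str):
        # Accept '[lat, long]', '(lat, long)' and 'lat, long' strings.
        parts = value.strip("[]() ").split(",")
    else:
        parts = list(value)
    if len(parts) != 2:
        raise ValueError(f"expected two coordinates, got {value!r}")
    return tuple(_to_float(part) for part in parts)


bracketed = to_latlong("[33.670914, -117.841501]")
bare = to_latlong("40.423599, -86.921162")
partial = to_latlong((-45.031705, None))
missing = to_latlong(None)
print(bracketed, bare, partial, missing)
```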

#### Timedelta

Represents Logical Types that contain values specifying a duration of time.
- **physical type**: `timedelta64[ns]`


Examples could include:
- Days/months/years since some event
- How long a flight's arrival was delayed/early
- Days until birthday

In [None]:
df = pd.DataFrame({
    'dates': ["2019/01/01", "2019/01/02", "2019/01/03", "2019/01/03"],
    'latlongs': ['[33.670914, -117.841501]', '40.423599, -86.921162', (-45.031705, None), None],
    'booleans': [True, True, False, True],
    'bools_nullable': pd.Series([True, False, True, None], dtype='boolean'),
    'timedelta': [pd.Timedelta('1 days 00:00:00'), pd.Timedelta('-1 days +23:40:00'),
             pd.Timedelta('4 days 12:00:00'), pd.Timedelta('-1 days +23:40:00')],
    'emails':["john.smith@example.com", "support@example.com", "team@example.com", "help@example.com"]
})
df

In [None]:
df.ww.init(logical_types={'latlongs':'LatLong',
                          'dates':ww.logical_types.Datetime(datetime_format='%Y/%m/%d')})
df.ww

In [None]:
df

## String-physical type

#### NaturalLanguage

Represents Logical Types that contain text or characters representing natural human language.
- **physical type**: `string`

#### Address

Represents Logical Types that contain address values.
- **physical type**: `string`


#### Filepath

Represents Logical Types that specify locations of directories and files in a file system.
- **physical type**: `string`


#### PersonFullName

Represents Logical Types that may contain first, middle and last names, including honorifics and suffixes.
- **physical type**: `string`

#### PhoneNumber

Represents Logical Types that contain numeric digits and characters representing a phone number.
- **physical type**: `string`


#### URL

Represents Logical Types that contain URLs, which may include protocol, hostname, and file name.
- **physical type**: `string`

#### IPAddress

Represents Logical Types that contain IP addresses, including both IPv4 and IPv6 addresses.
- **physical type**: `string`


In [None]:
strings_df = pd.DataFrame({
    'natural_language':["This is a short sentence.",
                         "I like to eat pizza!",
                         "When will humans go to mars?",
                         "This entry contains two sentences. Second sentence."],
    'addresses':['1 Miller Drive, New York, NY 12345', '1 Berkeley Street, Boston, MA 67891',
                '26387 Russell Hill, Dallas, TX 34521', '54305 Oxford Street, Seattle, WA 95132'],
    'filepaths':["/usr/local/bin",
                 "/Users/john.smith/dev/index.html",
                 "/tmp",
                 "../woodwork"],
    'full_names':["Mr. John Doe, Jr.",
                 "Doe, Mrs. Jane",
                 "James Brown",
                 "John Smith"],
    'phone_numbers':["1-(555)-123-5495",
                     "+1-555-123-5495",
                     "5551235495",
                     "111-222-3333"],
    'urls': ["http://google.com",
             "https://example.com/index.html",
             "example.com",
             "https://woodwork.alteryx.com/"],
    'ip_addresses': ["172.16.254.1",
                     "192.0.0.0",
                     "2001:0db8:0000:0000:0000:ff00:0042:8329",
                     "192.0.0.0"],
})
strings_df.ww.init(logical_types={
    'natural_language':'NaturalLanguage',
    'addresses':'Address',
    'filepaths':'Filepath',
    'full_names':'PersonFullName',
    'phone_numbers':'PhoneNumber',
    'urls':'URL',
    'ip_addresses':'IPAddress',
})
strings_df.ww

## ColumnSchema objects

Now that we've gone in-depth on semantic tags and logical types, we can start to understand how they're used together to build Woodwork tables and define type spaces.

A `ColumnSchema` is the typing information for a single column. We can obtain a `ColumnSchema` from a Woodwork-initialized DataFrame as follows:

In [None]:
# Woodwork typing info for a DataFrame
combined_df = ww.concat_columns([numerics_df, categoricals_df])
combined_df.ww

Above is the typing information for a Woodwork DataFrame. If we want, we can access just the schema of typing information outside of the context of the actual data in the DataFrame.

In [None]:
# A Woodwork TableSchema
combined_df.ww.schema

The representation of the `woodwork.table_schema.TableSchema` is only different in that it does not have a column for the physical types.

This lack of a physical type is due to the fact that a TableSchema has no data, and therefore no physical representation of the data. We often rely on physical typing information to know the exact pandas or Dask or Koalas operations that are valid for a DataFrame, but for a schema of typing information that is not tied to data, those operations are not relevant.

Now, let's look at a single column of typing information, or a `woodwork.column_schema.ColumnSchema`, which we can acquire in much the same way as we can select a Series from the DataFrame:

In [None]:
# Woodwork typing info for a Series
ints_series = combined_df.ww['ints']
ints_series.ww

In [None]:
# A Woodwork ColumnSchema
ints_schema = ints_series.ww.schema
ints_schema

The `ints_schema` object above can be understood as typing information for a single column that is not tied to any data. In this case, we happen to know where the column schema came from - it was the `ints` column from the `combined_df` DataFrame. But we can also create a `ColumnSchema` that exists without being associated with any individual column of data.

If we look at the `combined_df` table as a whole, we can see the similarities and differences between the columns, and we can describe those subsets of the DataFrame with `ColumnSchema` objects, or type spaces.

In [None]:
combined_df.ww.schema

Below are several `ColumnSchema`s that would all include our `ints` column, but each of them describes a different type space. These `ColumnSchema`s get more restrictive as we go down:

- **`<ColumnSchema >`** - No restrictions have been placed; any column falls into this definition.
- **`<ColumnSchema (Semantic Tags = ['numeric'])>`** - Only columns with the `numeric` tag apply. This includes columns with the Double, Integer, or Age logical types, among others.
- **`<ColumnSchema (Logical Type = Integer) >`** - Only columns with logical type of `Integer` are included in this definition. Does not require the `numeric` tag, so an index column (which has its standard tags removed) would still apply.
- **`<ColumnSchema (Logical Type = Integer) (Semantic Tags = ['numeric'])>`** - The column must have logical type `Integer` and have the `numeric` semantic tag, excluding index columns.

In this way, a `ColumnSchema` can define a type space under which columns in a Woodwork DataFrame can fall.
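
The idea of a type space can be modeled with a small toy class in plain Python. The `TypeSpace` class and its `includes` method below are hypothetical stand-ins written for this illustration; they are not Woodwork's `ColumnSchema` API, and they only mimic the four cases listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TypeSpace:
    """Toy stand-in for a ColumnSchema used as a filter on columns."""
    logical_type: Optional[str] = None
    semantic_tags: frozenset = frozenset()

    def includes(self, column_logical_type, column_tags):
        """A column falls in the space if it meets every stated restriction."""
        if self.logical_type is not None and self.logical_type != column_logical_type:
            return False
        # Every required tag must be present on the column.
        return self.semantic_tags <= set(column_tags)


# The four type spaces, from least to most restrictive.
anything = TypeSpace()
numeric = TypeSpace(semantic_tags=frozenset({"numeric"}))
integer = TypeSpace(logical_type="Integer")
numeric_integer = TypeSpace("Integer", frozenset({"numeric"}))

# Columns described as (logical type, semantic tags).
ints_column = ("Integer", {"numeric"})
index_column = ("Integer", {"index"})      # standard tags removed
floats_column = ("Double", {"numeric"})

spaces = [anything, numeric, integer, numeric_integer]
ints_in_all = all(space.includes(*ints_column) for space in spaces)
print(ints_in_all)
print(integer.includes(*index_column), numeric_integer.includes(*index_column))
```

The index column is a useful test case: it satisfies the `Integer`-only space but not the space that also requires the `numeric` tag.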