In [1]:
import woodwork as ww

df = ww.demo.load_retail(nrows=100, return_dataframe=True)
df.head(5)



,order_product_id,order_id,product_id,description,quantity,order_date,unit_price,customer_name,country,total,cancelled
0,0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,4.2075,Andrea Brown,United Kingdom,25.245,False
1,1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,5.5935,Andrea Brown,United Kingdom,33.561,False
2,2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,4.5375,Andrea Brown,United Kingdom,36.3,False
3,3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,5.5935,Andrea Brown,United Kingdom,33.561,False
4,4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,5.5935,Andrea Brown,United Kingdom,33.561,False


As you can see, this DataFrame contains several different data types, including dates, categorical values, numeric values, and natural language descriptions. Next, initialize Woodwork on this DataFrame.

## Initializing Woodwork on a DataFrame
Importing Woodwork creates a special namespace on your DataFrames, `DataFrame.ww`, that can be used to set or update the typing information for the DataFrame. As long as Woodwork has been imported, initializing Woodwork on a DataFrame is as simple as calling `.ww.init()` on the DataFrame of interest. An optional name parameter can be specified to label the data.

In [2]:
df.ww.init(name="retail")
df.ww

Column,Physical Type,Logical Type,Semantic Tag(s)
order_product_id,Int64,Integer,['numeric']
order_id,Int64,Integer,['numeric']
product_id,category,Categorical,['category']
description,string,NaturalLanguage,[]
quantity,Int64,Integer,['numeric']
order_date,datetime64[ns],Datetime,[]
unit_price,float64,Double,['numeric']
customer_name,string,NaturalLanguage,[]
country,string,NaturalLanguage,[]
total,float64,Double,['numeric']


With this single call, Woodwork inferred the logical types present in the data by analyzing the DataFrame dtypes as well as the information contained in the columns. Woodwork also added semantic tags to some of the columns based on the logical types that were inferred.
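
The link between inferred logical types and the tags Woodwork adds can be pictured as a small lookup table. The sketch below is a plain-Python model, not Woodwork's implementation: `STANDARD_TAGS` and `standard_tags_for` are hypothetical names, and the mapping only covers logical types that appear in this guide's output.

```python
# Toy model (not Woodwork code): standard semantic tags per logical type,
# copied from the typing tables shown in this guide.
STANDARD_TAGS = {
    "Integer": {"numeric"},
    "Double": {"numeric"},
    "Categorical": {"category"},
    "NaturalLanguage": set(),
    "Datetime": set(),
    "Boolean": set(),
}


def standard_tags_for(logical_type):
    """Return a copy of the standard tags for a logical type name."""
    return set(STANDARD_TAGS.get(logical_type, set()))


# Logical types inferred for a few of the retail columns above.
inferred = {
    "quantity": "Integer",
    "product_id": "Categorical",
    "description": "NaturalLanguage",
}

tags = {column: standard_tags_for(ltype) for column, ltype in inferred.items()}
print(tags)
```

In this model, a column with no standard tags simply gets an empty set, which corresponds to the `[]` entries in the table above.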

All Woodwork methods and properties can be accessed through the `ww` namespace on the DataFrame. DataFrame methods called from the Woodwork namespace will be passed to the DataFrame, and whenever possible, Woodwork will be initialized on the returned object, assuming it is a Series or a DataFrame.
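
One way to picture this pass-through behavior is a namespace object that forwards method calls and copies typing information onto any result of the same kind. The sketch below is a toy model under that assumption, not Woodwork's implementation; `ToyFrame` and `ToyNamespace` are hypothetical stand-ins for a DataFrame and its `ww` namespace.

```python
# Toy model (not Woodwork code): a stand-in "frame" and a namespace object
# that forwards method calls and re-attaches typing info to the result.
class ToyFrame:
    def __init__(self, rows):
        self.rows = list(rows)
        self.typing = None  # set when typing information is attached

    def head(self, n=5):
        return ToyFrame(self.rows[:n])

    def count(self):
        return len(self.rows)


class ToyNamespace:
    def __init__(self, frame):
        self._frame = frame

    def __getattr__(self, name):
        # Only called for names not defined on the namespace itself.
        method = getattr(self._frame, name)

        def forward(*args, **kwargs):
            result = method(*args, **kwargs)
            if isinstance(result, ToyFrame):
                # Carry the typing information over to the new object.
                result.typing = dict(self._frame.typing)
            return result

        return forward


frame = ToyFrame(range(100))
frame.typing = {"quantity": "Integer"}

head = ToyNamespace(frame).head(5)
total_rows = ToyNamespace(frame).count()
print(head.count(), head.typing, total_rows)
```

Results that are not frames, such as the row count, are returned unchanged because there is nothing to attach typing information to.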

As an example, use the `head` method to create a new DataFrame containing the first 5 rows of the original data, with Woodwork typing information retained.

In [3]:
head_df = df.ww.head(5)
head_df.ww

Column,Physical Type,Logical Type,Semantic Tag(s)
order_product_id,Int64,Integer,['numeric']
order_id,Int64,Integer,['numeric']
product_id,category,Categorical,['category']
description,string,NaturalLanguage,[]
quantity,Int64,Integer,['numeric']
order_date,datetime64[ns],Datetime,[]
unit_price,float64,Double,['numeric']
customer_name,string,NaturalLanguage,[]
country,string,NaturalLanguage,[]
total,float64,Double,['numeric']


In [4]:
head_df

,order_product_id,order_id,product_id,description,quantity,order_date,unit_price,customer_name,country,total,cancelled
0,0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,4.2075,Andrea Brown,United Kingdom,25.245,False
1,1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,5.5935,Andrea Brown,United Kingdom,33.561,False
2,2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,4.5375,Andrea Brown,United Kingdom,36.3,False
3,3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,5.5935,Andrea Brown,United Kingdom,33.561,False
4,4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,5.5935,Andrea Brown,United Kingdom,33.561,False


## Updating Logical Types
If the initial inference does not suit your data, you can change a column's logical type to a more appropriate value. To illustrate this process, set the logical type for the `order_id` and `country` columns to `Categorical`, and set `customer_name` to have a logical type of `FullName`.

In [5]:
df.ww.set_types(logical_types={
    'customer_name': 'FullName',
    'country': 'Categorical',
    'order_id': 'Categorical'
})
df.ww.types

Column,Physical Type,Logical Type,Semantic Tag(s)
order_product_id,Int64,Integer,['numeric']
order_id,category,Categorical,['category']
product_id,category,Categorical,['category']
description,string,NaturalLanguage,[]
quantity,Int64,Integer,['numeric']
order_date,datetime64[ns],Datetime,[]
unit_price,float64,Double,['numeric']
customer_name,string,FullName,[]
country,category,Categorical,['category']
total,float64,Double,['numeric']


Inspect the information in the `types` output. There, you can see that the logical type for each of the three columns has been updated to the value you specified.

## Selecting Columns

Now that you've prepared logical types, you can select a subset of the columns based on their logical types. Select only the columns that have a logical type of `Integer` or `Double`.

In [6]:
numeric_df = df.ww.select(['Integer', 'Double'])
numeric_df.ww

Column,Physical Type,Logical Type,Semantic Tag(s)
order_product_id,Int64,Integer,['numeric']
quantity,Int64,Integer,['numeric']
unit_price,float64,Double,['numeric']
total,float64,Double,['numeric']


This selection process has returned a new Woodwork DataFrame containing only the columns that match the logical types you specified. After you have selected the columns you want, you can use the DataFrame containing just those columns as you normally would for any additional analysis.

In [7]:
numeric_df

,order_product_id,quantity,unit_price,total
0,0,6,4.2075,25.245
1,1,6,5.5935,33.561
2,2,8,4.5375,36.300
3,3,6,5.5935,33.561
4,4,6,5.5935,33.561
...,...,...,...,...
95,95,6,4.2075,25.245
96,96,120,0.6930,83.160
97,97,24,0.9075,21.780
98,98,24,0.9075,21.780


## Adding Semantic Tags

Next, let’s add semantic tags to some of the columns. Add the tag of `product_details` to the `description` column, and tag the `total` column with `currency`.

In [8]:
df.ww.set_types(semantic_tags={'description':'product_details', 'total': 'currency'})
df.ww

Column,Physical Type,Logical Type,Semantic Tag(s)
order_product_id,Int64,Integer,['numeric']
order_id,category,Categorical,['category']
product_id,category,Categorical,['category']
description,string,NaturalLanguage,['product_details']
quantity,Int64,Integer,['numeric']
order_date,datetime64[ns],Datetime,[]
unit_price,float64,Double,['numeric']
customer_name,string,FullName,[]
country,category,Categorical,['category']
total,float64,Double,"['numeric', 'currency']"


Select columns based on a semantic tag. Only select the columns tagged with `category`.

In [9]:
category_df = df.ww.select('category')
category_df.ww

Column,Physical Type,Logical Type,Semantic Tag(s)
order_id,category,Categorical,['category']
product_id,category,Categorical,['category']
country,category,Categorical,['category']


Select columns using multiple semantic tags or a mixture of semantic tags and logical types.

In [10]:
category_numeric_df = df.ww.select(['numeric', 'category'])
category_numeric_df.ww

Column,Physical Type,Logical Type,Semantic Tag(s)
order_product_id,Int64,Integer,['numeric']
order_id,category,Categorical,['category']
product_id,category,Categorical,['category']
quantity,Int64,Integer,['numeric']
unit_price,float64,Double,['numeric']
country,category,Categorical,['category']
total,float64,Double,"['numeric', 'currency']"


In [11]:
mixed_df = df.ww.select(['Boolean', 'product_details'])
mixed_df.ww

Column,Physical Type,Logical Type,Semantic Tag(s)
description,string,NaturalLanguage,['product_details']
cancelled,boolean,Boolean,[]
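
The selections above all behave like a union: a column is returned if its logical type or any of its semantic tags matches any of the selectors. The following plain-Python sketch models that rule. It is an illustration based on the outputs in this guide, not Woodwork's implementation, and the `schema` dictionary and `select` helper are hypothetical.

```python
# Toy model (not Woodwork code): each column has one logical type
# and a set of semantic tags.
schema = {
    "order_id": ("Categorical", {"category"}),
    "description": ("NaturalLanguage", {"product_details"}),
    "quantity": ("Integer", {"numeric"}),
    "total": ("Double", {"numeric", "currency"}),
    "cancelled": ("Boolean", set()),
}


def select(schema, include):
    """Return column names whose logical type or tags match any selector."""
    if isinstance(include, str):
        include = [include]
    wanted = set(include)
    return [
        name
        for name, (logical_type, tags) in schema.items()
        if logical_type in wanted or tags & wanted
    ]


by_tag = select(schema, "category")
by_tags = select(schema, ["numeric", "category"])
mixed = select(schema, ["Boolean", "product_details"])
print(by_tag, by_tags, mixed)
```

Columns come back in their original order, regardless of the order in which the selectors are listed.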


To select an individual column, specify the column name. Woodwork will be initialized on the returned Series and you can use the Series for additional analysis as needed.

In [12]:
total = df.ww['total']
total.ww

<Series: total (Physical Type = float64) (Logical Type = Double) (Semantic Tags = {'numeric', 'currency'})>

In [13]:
total

0     25.245
1     33.561
2     36.300
3     33.561
4     33.561
       ...  
95    25.245
96    83.160
97    21.780
98    21.780
99    21.780
Name: total, Length: 100, dtype: float64

Select multiple columns by supplying a list of column names.

In [14]:
multiple_cols_df = df.ww[['product_id', 'total', 'unit_price']]
multiple_cols_df.ww

Column,Physical Type,Logical Type,Semantic Tag(s)
product_id,category,Categorical,['category']
total,float64,Double,"['numeric', 'currency']"
unit_price,float64,Double,['numeric']


## Removing Semantic Tags
Remove specific semantic tags from a column if they are no longer needed. In this example, remove the `product_details` tag from the `description` column.

In [15]:
df.ww.remove_semantic_tags({'description':'product_details'})
df.ww

Column,Physical Type,Logical Type,Semantic Tag(s)
order_product_id,Int64,Integer,['numeric']
order_id,category,Categorical,['category']
product_id,category,Categorical,['category']
description,string,NaturalLanguage,[]
quantity,Int64,Integer,['numeric']
order_date,datetime64[ns],Datetime,[]
unit_price,float64,Double,['numeric']
customer_name,string,FullName,[]
country,category,Categorical,['category']
total,float64,Double,"['numeric', 'currency']"


Notice how the `product_details` tag has been removed from the `description` column. If you want to remove all user-added semantic tags from all columns, you can do that, too.

In [16]:
df.ww.reset_semantic_tags()
df.ww

Column,Physical Type,Logical Type,Semantic Tag(s)
order_product_id,Int64,Integer,['numeric']
order_id,category,Categorical,['category']
product_id,category,Categorical,['category']
description,string,NaturalLanguage,[]
quantity,Int64,Integer,['numeric']
order_date,datetime64[ns],Datetime,[]
unit_price,float64,Double,['numeric']
customer_name,string,FullName,[]
country,category,Categorical,['category']
total,float64,Double,['numeric']
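
The two outputs above show the difference between removing one tag and resetting all tags: a reset drops `currency` from `total` but keeps `numeric`, the tag that comes from the `Double` logical type. A minimal plain-Python model of that behavior is sketched below. The `ColumnTags` class is hypothetical, and it assumes only user-added tags are ever removed.

```python
# Toy model (not Woodwork code): standard tags come from the logical type,
# user tags are added later and are the only ones a reset clears.
class ColumnTags:
    def __init__(self, standard):
        self.standard = set(standard)
        self.user = set()

    @property
    def tags(self):
        return self.standard | self.user

    def add(self, tag):
        self.user.add(tag)

    def remove(self, tag):
        # Assumption: only user-added tags are removed in this sketch.
        self.user.discard(tag)

    def reset(self):
        self.user.clear()


total = ColumnTags({"numeric"})
total.add("currency")
before_reset = set(total.tags)

total.reset()
after_reset = set(total.tags)
print(before_reset, after_reset)
```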


## Set Index and Time Index
At any point, you can designate certain columns as the Woodwork `index` or `time_index` with the methods [set_index](generated/woodwork.table_accessor.WoodworkTableAccessor.set_index.rst) and [set_time_index](generated/woodwork.schema.Schema.set_time_index.rst). These methods can be used to assign these columns for the first time or to change the column being used as the index or time index.

Index and time index columns contain `index` and `time_index` semantic tags, respectively.

In [17]:
df.ww.set_index('order_product_id')
df.ww.index

'order_product_id'

In [18]:
df.ww.set_time_index('order_date')
df.ww.time_index

'order_date'

In [19]:
df.ww

Column,Physical Type,Logical Type,Semantic Tag(s)
order_product_id,Int64,Integer,['index']
order_id,category,Categorical,['category']
product_id,category,Categorical,['category']
description,string,NaturalLanguage,[]
quantity,Int64,Integer,['numeric']
order_date,datetime64[ns],Datetime,['time_index']
unit_price,float64,Double,['numeric']
customer_name,string,FullName,[]
country,category,Categorical,['category']
total,float64,Double,['numeric']


In [20]:
import pandas as pd

series = pd.Series([1, 2, 3], dtype='Int64')
series.ww.init(logical_type='Integer')
series.ww

<Series: None (Physical Type = Int64) (Logical Type = Integer) (Semantic Tags = {'numeric'})>

In the example above, we specified the `Integer` LogicalType for the Series. Because `Integer` has a physical type of `Int64` and this matches the dtype used to create the Series, no Series dtype conversion was needed and the initialization succeeds.

In cases where the LogicalType requires the Series dtype to change, a helper function `ww.init_series` must be used. This function will return a new Series object with Woodwork initialized and the dtype of the series changed to match the physical type of the LogicalType.
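
The choice between `.ww.init()` and `ww.init_series` comes down to comparing the Series dtype with the physical type of the requested logical type. The sketch below models that comparison in plain Python. It is not Woodwork code: `PHYSICAL_TYPES` and `needs_init_series` are hypothetical names, and the mapping is copied from the tables in this guide, so other Woodwork versions may use different physical types.

```python
# Toy model (not Woodwork code): physical dtype per logical type,
# as listed in the tables of this guide.
PHYSICAL_TYPES = {
    "Boolean": "boolean",
    "Categorical": "category",
    "Datetime": "datetime64[ns]",
    "Double": "float64",
    "Integer": "Int64",
    "NaturalLanguage": "string",
}


def needs_init_series(series_dtype, logical_type):
    """Return True when the dtype must change to match the logical type."""
    return PHYSICAL_TYPES[logical_type] != series_dtype


same_dtype = needs_init_series("Int64", "Integer")
changed_dtype = needs_init_series("string", "Categorical")
print(same_dtype, changed_dtype)
```

Here `same_dtype` is `False`, which corresponds to the `Int64` Series initialized with `Integer`, and `changed_dtype` is `True`, which corresponds to a `string` Series that should become `Categorical`.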

To demonstrate this case, first create a Series with a `string` dtype. Then, initialize a Woodwork Series with a `Categorical` logical type using the `init_series` function. Because `Categorical` uses a physical type of `category`, the dtype of the Series must change, which is why the `init_series` function is required here.

The series that is returned will have Woodwork initialized with the LogicalType set to `Categorical` as expected, with the expected dtype of `category`.

In [21]:
string_series = pd.Series(['a', 'b', 'a'], dtype='string')
ww_series = ww.init_series(string_series, logical_type='Categorical')
ww_series.ww

<Series: None (Physical Type = category) (Logical Type = Categorical) (Semantic Tags = {'category'})>

As with DataFrames, Woodwork provides several methods that can be used to update or change the typing information associated with the series. As an example, add a new semantic tag to the series.

In [22]:
series.ww.add_semantic_tags('new_tag')
series.ww

<Series: None (Physical Type = Int64) (Logical Type = Integer) (Semantic Tags = {'numeric', 'new_tag'})>

As you can see from the output above, the specified tag has been added to the semantic tags for the series.

You can also access Series properties and methods through the Woodwork namespace. When possible, Woodwork typing information will be retained on the value returned. As an example, you can access the Series `shape` property through Woodwork.

In [23]:
series.ww.shape

(3,)

You can also call Series methods such as `sample`. In this case, Woodwork typing information is retained on the Series returned by the `sample` method.

In [24]:
sample_series = series.ww.sample(2)
sample_series.ww

<Series: None (Physical Type = Int64) (Logical Type = Integer) (Semantic Tags = {'numeric', 'new_tag'})>

In [25]:
sample_series

1    2
2    3
dtype: Int64

## List Logical Types
Retrieve all the Logical Types present in Woodwork. These can be useful for understanding the Logical Types, as well as how they are interpreted.

In [26]:
from woodwork.type_sys.utils import list_logical_types

list_logical_types()

,name,type_string,description,physical_type,standard_tags,is_default_type,is_registered,parent_type
0,Boolean,boolean,Represents Logical Types that contain binary v...,boolean,{},True,True,
1,Categorical,categorical,Represents Logical Types that contain unordere...,category,{category},True,True,
2,CountryCode,country_code,Represents Logical Types that contain categori...,category,{category},True,True,Categorical
3,Datetime,datetime,Represents Logical Types that contain date and...,datetime64[ns],{},True,True,
4,Double,double,Represents Logical Types that contain positive...,float64,{numeric},True,True,
5,EmailAddress,email_address,Represents Logical Types that contain email ad...,string,{},True,True,NaturalLanguage
6,Filepath,filepath,Represents Logical Types that specify location...,string,{},True,True,NaturalLanguage
7,FullName,full_name,Represents Logical Types that may contain firs...,string,{},True,True,NaturalLanguage
8,IPAddress,ip_address,Represents Logical Types that contain IP addre...,string,{},True,True,NaturalLanguage
9,Integer,integer,Represents Logical Types that contain positive...,Int64,{numeric},True,True,
