# Start

In this guide, we walk through creating a Woodwork DataTable and show how to update logical types and how to add and remove semantic tags. We also demonstrate how to use the typing information to select subsets of the data.

In [1]:
import woodwork as ww

data = ww.demo.load_retail(nrows=100)
data.head(5)

Unnamed: 0,order_product_id,order_id,product_id,description,quantity,order_date,unit_price,customer_name,country,total,cancelled
0,0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,4.2075,Andrea Brown,United Kingdom,25.245,False
1,1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,5.5935,Andrea Brown,United Kingdom,33.561,False
2,2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,4.5375,Andrea Brown,United Kingdom,36.3,False
3,3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,5.5935,Andrea Brown,United Kingdom,33.561,False
4,4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,5.5935,Andrea Brown,United Kingdom,33.561,False


As we can see, this is a dataframe containing several different data types, including dates, categorical values, numeric values and natural language descriptions. Let's use Woodwork to create a DataTable from this data.

## Creating a DataTable
Creating a Woodwork DataTable is as simple as passing in a dataframe with the data of interest during initialization. An optional name parameter can be specified to label the DataTable.

In [2]:
dt = ww.DataTable(data, name="retail")
dt.types

Data Column,Physical Type,Logical Type,Semantic Tag(s)
order_product_id,Int64,WholeNumber,{numeric}
order_id,Int64,WholeNumber,{numeric}
product_id,category,Categorical,{category}
description,string,NaturalLanguage,{}
quantity,Int64,WholeNumber,{numeric}
order_date,datetime64[ns],Datetime,{}
unit_price,float64,Double,{numeric}
customer_name,string,NaturalLanguage,{}
country,string,NaturalLanguage,{}
total,float64,Double,{numeric}


With just this simple call, Woodwork inferred the logical types present in our data by analyzing the dataframe dtypes as well as the values contained in the columns. Woodwork also added semantic tags to some of the columns based on the logical types that were inferred.
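
To give a feel for how inference and standard tags relate, here is a toy sketch in plain Python. It is not Woodwork's inference code: the `infer_logical_type` helper and its rules are assumptions made for illustration, and the `STANDARD_TAGS` mapping is hypothetical, filled in from the type and tag pairings shown in this guide's outputs.

```python
from datetime import datetime

# Hypothetical mapping from logical type name to default semantic tags,
# filled in from this guide's outputs; it is not read from Woodwork itself.
STANDARD_TAGS = {
    "WholeNumber": {"numeric"},
    "Integer": {"numeric"},
    "Double": {"numeric"},
    "Categorical": {"category"},
    "NaturalLanguage": set(),
    "Datetime": set(),
    "Boolean": set(),
}


def infer_logical_type(values):
    """Guess a logical type name from a list of Python values (toy rules)."""
    non_null = [v for v in values if v is not None]
    # bool is checked first because bool is a subclass of int in Python
    if non_null and all(isinstance(v, bool) for v in non_null):
        return "Boolean"
    if non_null and all(isinstance(v, int) and not isinstance(v, bool) for v in non_null):
        return "WholeNumber" if all(v >= 0 for v in non_null) else "Integer"
    if non_null and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null):
        return "Double"
    if non_null and all(isinstance(v, datetime) for v in non_null):
        return "Datetime"
    # Fallback for anything else, including free text
    return "NaturalLanguage"


# A few values taken from the first rows of the retail data
columns = {
    "quantity": [6, 6, 8],
    "unit_price": [4.2075, 5.5935, 4.5375],
    "order_date": [datetime(2010, 12, 1, 8, 26)] * 3,
    "description": ["WHITE METAL LANTERN", "CREAM CUPID HEARTS COAT HANGER"],
    "cancelled": [False, False, False],
}

inferred = {name: infer_logical_type(vals) for name, vals in columns.items()}
tags = {name: STANDARD_TAGS[ltype] for name, ltype in inferred.items()}

for name in columns:
    print(name, inferred[name], sorted(tags[name]))
```

The point of the sketch is the two-step structure: a logical type is chosen first, and the default semantic tags then follow from that logical type.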

## Updating Logical Types
If the initial inference was not to our liking, the logical type can be changed to a more appropriate value. Let's change some of the columns to a different logical type to illustrate this process. Below we will set the logical type for the ``quantity``, ``customer_name`` and ``country`` columns to be ``Categorical``.

In [3]:
dt.set_logical_types({
    'quantity': 'Categorical',
    'customer_name': 'Categorical',
    'country': 'Categorical'
})
dt.types

Data Column,Physical Type,Logical Type,Semantic Tag(s)
order_product_id,Int64,WholeNumber,{numeric}
order_id,Int64,WholeNumber,{numeric}
product_id,category,Categorical,{category}
description,string,NaturalLanguage,{}
quantity,category,Categorical,{category}
order_date,datetime64[ns],Datetime,{}
unit_price,float64,Double,{numeric}
customer_name,category,Categorical,{category}
country,category,Categorical,{category}
total,float64,Double,{numeric}


If we now inspect the `types` output, we can see that the logical type of each of the three columns has been updated to `Categorical`, and that their physical types and semantic tags have changed to match.

## Selecting Columns

Now that we have logical types we are happy with, we can select a subset of the columns based on their logical types. Let's select only the columns that have a logical type of ``WholeNumber`` or ``Double``:

In [4]:
numeric_dt = dt.select_ltypes(['WholeNumber', 'Double'])
numeric_dt.types

Data Column,Physical Type,Logical Type,Semantic Tag(s)
order_product_id,Int64,WholeNumber,{numeric}
order_id,Int64,WholeNumber,{numeric}
unit_price,float64,Double,{numeric}
total,float64,Double,{numeric}


This selection process has returned a new ``DataTable`` containing only the columns that match the logical types we specified. After we have selected the columns we want, we can also access a dataframe containing just those columns if we need it for additional analysis.

In [5]:
numeric_dt.to_pandas()

Unnamed: 0,order_product_id,order_id,unit_price,total
0,0,536365,4.2075,25.245
1,1,536365,5.5935,33.561
2,2,536365,4.5375,36.3
3,3,536365,5.5935,33.561
4,4,536365,5.5935,33.561
5,5,536365,12.6225,25.245
6,6,536365,7.0125,42.075
7,7,536366,3.0525,18.315
8,8,536366,3.0525,18.315
9,9,536367,2.7885,89.232


## Adding Semantic Tags

Next, let's add semantic tags to some of the columns. We will add the ``product_details`` tag to the ``description`` column and the ``currency`` tag to the ``total`` column.

In [6]:
dt.set_semantic_tags({'description': 'product_details', 'total': 'currency'})
dt.types

Data Column,Physical Type,Logical Type,Semantic Tag(s)
order_product_id,Int64,WholeNumber,{numeric}
order_id,Int64,WholeNumber,{numeric}
product_id,category,Categorical,{category}
description,string,NaturalLanguage,{product_details}
quantity,category,Categorical,{category}
order_date,datetime64[ns],Datetime,{}
unit_price,float64,Double,{numeric}
customer_name,category,Categorical,{category}
country,category,Categorical,{category}
total,float64,Double,"{currency, numeric}"


We can also select columns based on a semantic tag. Perhaps we want to only select the columns tagged with ``category``:

In [7]:
category_dt = dt.select_semantic_tags('category')
category_dt.types

Data Column,Physical Type,Logical Type,Semantic Tag(s)
product_id,category,Categorical,{category}
quantity,category,Categorical,{category}
customer_name,category,Categorical,{category}
country,category,Categorical,{category}


We can also select columns using multiple semantic tags, or even a mixture of semantic tags and logical types:


In [8]:
category_numeric_dt = dt.select_semantic_tags(['numeric', 'category'])
category_numeric_dt.types

Data Column,Physical Type,Logical Type,Semantic Tag(s)
order_product_id,Int64,WholeNumber,{numeric}
order_id,Int64,WholeNumber,{numeric}
product_id,category,Categorical,{category}
quantity,category,Categorical,{category}
unit_price,float64,Double,{numeric}
customer_name,category,Categorical,{category}
country,category,Categorical,{category}
total,float64,Double,"{currency, numeric}"


In [9]:
mixed_dt = dt.select(['Boolean', 'product_details'])
mixed_dt.types

Data Column,Physical Type,Logical Type,Semantic Tag(s)
description,string,NaturalLanguage,{product_details}
cancelled,boolean,Boolean,{}
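
The selection behaviour shown above can be modelled in a few lines of plain Python. This is a simplified sketch, not Woodwork's implementation: the `TYPES` dictionary and the `select` helper are hypothetical stand-ins, and the matching rule (keep a column when its logical type or any of its semantic tags appears in the request) is an assumption inferred from the outputs above.

```python
# Hypothetical stand-in for the typing information shown in `dt.types`:
# column name -> (logical type, semantic tags)
TYPES = {
    "order_product_id": ("WholeNumber", {"numeric"}),
    "order_id": ("WholeNumber", {"numeric"}),
    "product_id": ("Categorical", {"category"}),
    "description": ("NaturalLanguage", {"product_details"}),
    "quantity": ("Categorical", {"category"}),
    "order_date": ("Datetime", set()),
    "unit_price": ("Double", {"numeric"}),
    "customer_name": ("Categorical", {"category"}),
    "country": ("Categorical", {"category"}),
    "total": ("Double", {"currency", "numeric"}),
    "cancelled": ("Boolean", set()),
}


def select(types, include):
    """Return the columns whose logical type or tags match any requested name."""
    if isinstance(include, str):
        include = [include]
    wanted = set(include)
    return {
        name: (ltype, tags)
        for name, (ltype, tags) in types.items()
        if ltype in wanted or tags & wanted
    }


# A single semantic tag
print(list(select(TYPES, "category")))
# A mixture of a logical type and a semantic tag
print(list(select(TYPES, ["Boolean", "product_details"])))
```

Because the helper returns a new dictionary and leaves `TYPES` untouched, it mirrors the way each selection above produces a new DataTable rather than modifying `dt`.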


To select an individual column, we just need to specify the column name. We can then access the data in the resulting DataColumn using the ``to_pandas`` method:

In [10]:
dc = dt['total']
dc

<DataColumn: total (Physical Type = float64) (Logical Type = Double) (Semantic Tags = {'currency', 'numeric'})>

In [11]:
dc.to_pandas()

0      25.2450
1      33.5610
2      36.3000
3      33.5610
4      33.5610
5      25.2450
6      42.0750
7      18.3150
8      18.3150
9      89.2320
10     20.7900
11     20.7900
12     49.5000
13     16.3350
14     42.0750
15     24.5025
16     32.8350
17     29.4525
18     29.4525
19     52.4700
20     52.4700
21     42.0750
22     24.5025
23     24.5025
24     24.5025
25     29.4525
26    148.5000
27    148.5000
28     74.2500
29     16.8300
30     25.7400
31     67.3200
32     49.5000
33     87.6150
34    116.8200
35     77.2200
36     77.2200
37     77.2200
38     28.0500
39     65.3400
40    116.8200
41     74.2500
42     16.6320
43     16.6320
44     38.6100
45     89.1000
46    336.6000
Name: total, dtype: float64

We can also access multiple columns by supplying a list of column names:

In [12]:
multiple_cols_dt = dt[['product_id', 'total', 'unit_price']]
multiple_cols_dt.types

Data Column,Physical Type,Logical Type,Semantic Tag(s)
product_id,category,Categorical,{category}
total,float64,Double,"{currency, numeric}"
unit_price,float64,Double,{numeric}


## Removing Semantic Tags
We can also remove specific semantic tags from a column if they are no longer needed. Let's remove the ``product_details`` tag from the ``description`` column:

In [13]:
dt.remove_semantic_tags({'description': 'product_details'})
dt.types

Data Column,Physical Type,Logical Type,Semantic Tag(s)
order_product_id,Int64,WholeNumber,{numeric}
order_id,Int64,WholeNumber,{numeric}
product_id,category,Categorical,{category}
description,string,NaturalLanguage,{}
quantity,category,Categorical,{category}
order_date,datetime64[ns],Datetime,{}
unit_price,float64,Double,{numeric}
customer_name,category,Categorical,{category}
country,category,Categorical,{category}
total,float64,Double,"{currency, numeric}"


Notice how the ``product_details`` tag has now been removed from the ``description`` column. If we want to remove all user-added semantic tags from all columns, we can do that as well:

In [14]:
dt.reset_semantic_tags()
dt.types

Data Column,Physical Type,Logical Type,Semantic Tag(s)
order_product_id,Int64,WholeNumber,{numeric}
order_id,Int64,WholeNumber,{numeric}
product_id,category,Categorical,{category}
description,string,NaturalLanguage,{}
quantity,category,Categorical,{category}
order_date,datetime64[ns],Datetime,{}
unit_price,float64,Double,{numeric}
customer_name,category,Categorical,{category}
country,category,Categorical,{category}
total,float64,Double,{numeric}
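
One way to picture the difference between removing a single tag and resetting all tags is to treat a column's semantic tags as the union of a fixed set of standard tags (determined by the logical type) and a set of user-added tags. The sketch below uses that model. The `Column` class and its methods are hypothetical and only mimic the behaviour visible in the outputs above; they are not Woodwork classes.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    """Toy column model: standard tags come from the logical type."""
    logical_type: str
    standard_tags: frozenset = frozenset()
    user_tags: set = field(default_factory=set)

    @property
    def semantic_tags(self):
        # Tags reported for the column: standard tags plus user additions
        return set(self.standard_tags) | self.user_tags

    def add_tags(self, *tags):
        self.user_tags.update(tags)

    def remove_tags(self, *tags):
        # In this toy model only user-added tags can be removed
        self.user_tags.difference_update(tags)

    def reset_tags(self):
        # Drop every user-added tag, leaving the standard tags in place
        self.user_tags.clear()


total = Column("Double", frozenset({"numeric"}))
total.add_tags("currency")
print(sorted(total.semantic_tags))  # ['currency', 'numeric']

total.reset_tags()
print(sorted(total.semantic_tags))  # ['numeric']
```

Under this model, a reset returns the ``total`` column to ``{numeric}``, which matches the table above.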


## Setting the Index and Time Index
At any point, we can designate columns as the DataTable's `index` and `time_index` using the `set_index` and `set_time_index` methods. These methods can assign the columns for the first time or change which column serves as the index or time index.

Index and time index columns contain `index` and `time_index` semantic tags, respectively.

In [16]:
dt.set_index('order_product_id')
dt.index

'order_product_id'

In [17]:
dt.set_time_index('order_date')
dt.time_index

'order_date'

In [19]:
dt.types

Data Column,Physical Type,Logical Type,Semantic Tag(s)
order_product_id,Int64,WholeNumber,"{numeric, index}"
order_id,Int64,WholeNumber,{numeric}
product_id,category,Categorical,{category}
description,string,NaturalLanguage,{}
quantity,category,Categorical,{category}
order_date,datetime64[ns],Datetime,{time_index}
unit_price,float64,Double,{numeric}
customer_name,category,Categorical,{category}
country,category,Categorical,{category}
total,float64,Double,{numeric}
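
Since at most one column is the index (and one the time index) at a time, changing it can be thought of as moving a tag from one column to another. The following sketch illustrates that idea with plain sets. The `assign_unique_tag` helper is hypothetical, and the assumption that the previous index column simply loses its tag is ours rather than something stated in this guide.

```python
def assign_unique_tag(tags_by_column, column, tag):
    """Give `tag` to `column`, removing it from every other column first."""
    if column not in tags_by_column:
        raise KeyError(f"unknown column: {column}")
    for tags in tags_by_column.values():
        tags.discard(tag)
    tags_by_column[column].add(tag)


# Semantic tags for three of the retail columns, as plain sets
tags = {
    "order_product_id": {"numeric"},
    "order_id": {"numeric"},
    "order_date": set(),
}

assign_unique_tag(tags, "order_product_id", "index")
assign_unique_tag(tags, "order_date", "time_index")
print(tags["order_product_id"] == {"numeric", "index"})  # True

# Changing the index moves the tag to the new column
assign_unique_tag(tags, "order_id", "index")
print(tags["order_product_id"] == {"numeric"})  # True
print(tags["order_id"] == {"numeric", "index"})  # True
```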


## Listing Logical Types
We can also retrieve all the logical types available in Woodwork. The listing is useful for seeing which pandas dtype and standard tags are associated with each logical type.

In [15]:
from woodwork.utils import list_logical_types

list_logical_types()

Unnamed: 0,name,type_string,description,pandas_dtype,standard_tags
0,Boolean,boolean,Represents Logical Types that contain binary v...,boolean,{}
1,Categorical,categorical,Represents Logical Types that contain unordere...,category,{category}
2,CountryCode,country_code,Represents Logical Types that contain categori...,category,{category}
3,Datetime,datetime,Represents Logical Types that contain date and...,datetime64[ns],{}
4,Double,double,Represents Logical Types that contain positive...,float64,{numeric}
5,Integer,integer,Represents Logical Types that contain positive...,Int64,{numeric}
6,EmailAddress,email_address,Represents Logical Types that contain email ad...,string,{}
7,Filepath,filepath,Represents Logical Types that specify location...,string,{}
8,FullName,full_name,Represents Logical Types that may contain firs...,string,{}
9,IPAddress,ip_address,Represents Logical Types that contain IP addre...,string,{}
