# Gain Statistical Insights into Your DataTable

Woodwork provides methods on DataTable that use the typing information stored in a DataTable to help users better understand their data.

Let's walk through how to use `describe`, `value_counts`, and `get_mutual_information` on a retail DataTable to see the full capabilities of these methods.

In [1]:
import pandas as pd
from woodwork import DataTable
from woodwork.demo import load_retail

dt = load_retail()
dt.types

Data Column,Physical Type,Logical Type,Semantic Tag(s)
order_product_id,category,Categorical,{index}
order_id,category,Categorical,{category}
product_id,category,Categorical,{category}
description,string,NaturalLanguage,{}
quantity,Int64,WholeNumber,{numeric}
order_date,datetime64[ns],Datetime,{time_index}
unit_price,float64,Double,{numeric}
customer_name,category,Categorical,{category}
country,category,Categorical,{category}
total,float64,Double,{numeric}


## DataTable.describe

We use `dt.describe()` to calculate statistics for the Data Columns in a DataTable. The result is a pandas DataFrame with the relevant calculations done for each Data Column.

In [2]:
dt.describe()

,order_id,product_id,description,quantity,order_date,unit_price,customer_name,country,total,cancelled
physical_type,category,category,string,Int64,datetime64[ns],float64,category,category,float64,boolean
logical_type,Categorical,Categorical,NaturalLanguage,WholeNumber,Datetime,Double,Categorical,Categorical,Double,Boolean
semantic_tags,{category},{category},{},{numeric},{time_index},{numeric},{category},{category},{numeric},{}
count,401604,401604,401604,401604,401604,401604,401604,401604,401604,401604
nunique,22190,3684,,436,20460,620,4372,37,3952,
nan_count,0,0,0,0,0,0,0,0,0,0
mean,,,,12.1833,2011-07-10 12:08:23.848567552,5.73221,,,34.0125,
mode,576339,85123A,WHITE HANGING HEART T-LIGHT HOLDER,1,2011-11-14 15:27:00,2.0625,Mary Dalton,United Kingdom,24.75,False
std,,,,250.283,,115.111,,,710.081,
min,,,,-80995,2010-12-01 08:26:00,0,,,-277975,


There are a few things to note in the above DataFrame:

- The DataTable's index, `order_product_id`, is not included.
- Each Data Column's typing information is provided according to Woodwork's typing system.
- Any statistic that cannot be calculated for a Data Column, for example `num_false` on a `Datetime` column, is filled with `NaN`.
- Null values are not counted in any of the calculations other than `nunique`.
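
To make the notes above concrete, here is a minimal standard-library sketch of a describe-style summary for a single column. It is not Woodwork's implementation: the function name `describe_column`, the `numeric` flag, and the use of `None` to represent a null value are all assumptions made for illustration, and `nunique` is left out on purpose.

```python
import math
import statistics
from collections import Counter


def describe_column(values, numeric):
    """Summarize one column (hypothetical helper); None stands in for a null."""
    present = [v for v in values if v is not None]  # nulls are skipped
    stats = {
        "count": len(present),
        "nan_count": len(values) - len(present),
        "mode": Counter(present).most_common(1)[0][0] if present else math.nan,
    }
    # Statistics that only make sense for numbers fall back to NaN otherwise.
    stats["mean"] = statistics.mean(present) if numeric and present else math.nan
    stats["std"] = statistics.stdev(present) if numeric and len(present) > 1 else math.nan
    stats["min"] = min(present) if numeric and present else math.nan
    return stats


quantity_stats = describe_column([1, 2, 2, None], numeric=True)
country_stats = describe_column(["France", "Germany", "France"], numeric=False)
```

In this sketch the null entry lowers `count` and raises `nan_count` but does not affect `mean` or `mode`, and the non-numeric column receives `NaN` for `mean`, `std`, and `min`.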

## DataTable.value_counts

We use `dt.value_counts()` to calculate the most frequently appearing values for each Categorical Data Column in a DataTable. This returns a dictionary that maps each Categorical Data Column to a list of dictionaries sorted by descending frequency. Each dictionary contains a `value` and its `count`, the number of times that value appears in the column.
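
The shape of that return value can be sketched with the standard library alone. This is an illustration rather than Woodwork's code: the helper `top_values` and its `top_n` parameter are hypothetical, and dropping `None` before counting is an assumption.

```python
from collections import Counter


def top_values(values, top_n=10):
    """Return the top_n most frequent values as a list of dicts (hypothetical helper)."""
    counts = Counter(v for v in values if v is not None)
    # most_common() yields (value, count) pairs ordered by descending count.
    return [{"value": value, "count": count} for value, count in counts.most_common(top_n)]


countries = ["United Kingdom", "France", "United Kingdom", "Germany", "France", "United Kingdom"]
result = {"country": top_values(countries, top_n=2)}
print(result)
# {'country': [{'value': 'United Kingdom', 'count': 3}, {'value': 'France', 'count': 2}]}
```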

In [3]:
dt.value_counts()

{'order_product_id': [{'value': 401603, 'count': 1},
  {'value': 133868, 'count': 1},
  {'value': 133867, 'count': 1},
  {'value': 133866, 'count': 1},
  {'value': 133865, 'count': 1},
  {'value': 133864, 'count': 1},
  {'value': 133863, 'count': 1},
  {'value': 133862, 'count': 1},
  {'value': 133861, 'count': 1},
  {'value': 133859, 'count': 1}],
 'order_id': [{'value': '576339', 'count': 542},
  {'value': '579196', 'count': 533},
  {'value': '580727', 'count': 529},
  {'value': '578270', 'count': 442},
  {'value': '573576', 'count': 435},
  {'value': '567656', 'count': 421},
  {'value': '567183', 'count': 392},
  {'value': '575607', 'count': 377},
  {'value': '571441', 'count': 364},
  {'value': '570488', 'count': 353}],
 'product_id': [{'value': '85123A', 'count': 2065},
  {'value': '22423', 'count': 1894},
  {'value': '85099B', 'count': 1659},
  {'value': '47566', 'count': 1409},
  {'value': '84879', 'count': 1405},
  {'value': '20725', 'count': 1346},
  {'value': '22720', 'count'

## DataTable.get_mutual_information

`dt.get_mutual_information` calculates the mutual information between all pairs of relevant Data Columns. Certain types, such as datetimes or strings, cannot have mutual information calculated.

The mutual information between columns `A` and `B` can be understood as the amount of knowledge we gain about column `A` if we have the values of column `B`. The more mutual information there is between `A` and `B`, the less uncertainty there is in `A` when `B` is known, and vice versa.
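
As a rough illustration of the quantity being described, the sketch below computes the textbook mutual information between two categorical sequences from their observed frequencies. The function `mutual_information` is a hypothetical helper, not Woodwork's method, and the scale of the values Woodwork reports may differ (for example, if the scores are normalized).

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information in nats between two equal-length sequences (illustrative)."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = 0.0
    for (x, y), joint in count_ab.items():
        # Adds p(x, y) * log(p(x, y) / (p(x) * p(y))) using observed frequencies.
        mi += (joint / n) * math.log(joint * n / (count_a[x] * count_b[y]))
    return mi


# Knowing one column fully determines the other: MI equals the column's entropy.
dependent = mutual_information(["a", "b", "a", "b"], ["x", "y", "x", "y"])
# The columns vary independently: knowing one says nothing about the other.
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Here `dependent` equals `math.log(2)` and `independent` is `0.0`, matching the intuition that more mutual information means less remaining uncertainty.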

If we call `dt.get_mutual_information()`, we'll see that `order_date` will be excluded from the resulting dataframe.

In [None]:
dt.get_mutual_information()

#### Available Parameters
`dt.get_mutual_information` provides two parameters for tuning the mutual information calculation.

- `num_bins` - In order to calculate mutual information on continuous data, we bin numeric data into categories. This parameter allows users to choose the number of bins with which to categorize data.
    - Defaults to using 10 bins.
    - The more bins there are, the more variety a column will have. The number of bins used should accurately portray the spread of the data.
- `nrows` - If `nrows` is set at a value below the number of rows in the DataTable, that number of rows will be randomly sampled from the underlying data.
    - Defaults to using all the available rows.
    - Decreasing the number of rows can speed up the mutual information calculation on a DataTable with many rows, though care should be taken that the number being sampled is large enough to accurately portray the data.
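
The two parameters above can be pictured with a small standard-library sketch. The helpers `bin_numeric` and `sample_rows` are hypothetical, and equal-width binning with a fixed random seed is an assumption made for illustration; Woodwork's actual binning and sampling strategies may differ.

```python
import random


def bin_numeric(values, num_bins=10):
    """Map each number to an equal-width bin index in [0, num_bins - 1]."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins or 1.0  # constant columns fall into bin 0
    # The maximum value would land on the upper edge, so clamp it into the last bin.
    return [min(int((v - low) / width), num_bins - 1) for v in values]


def sample_rows(rows, nrows=None, seed=0):
    """Randomly sample nrows rows when that is fewer than the total."""
    if nrows is None or nrows >= len(rows):
        return list(rows)
    return random.Random(seed).sample(rows, nrows)


prices = [0.5, 1.0, 2.5, 4.0, 10.0]
coarse = bin_numeric(prices, num_bins=2)   # few bins: most prices share a category
fine = bin_numeric(prices, num_bins=50)    # many bins: prices spread across categories
subset = sample_rows(list(range(1000)), nrows=100)
```

With two bins, four of the five prices collapse into the same category, while fifty bins separate them, which is the "more variety" effect described for `num_bins`.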

Now we'll explore changing the number of bins. Note that this will only impact numeric Data Columns `quantity` and `unit_price`. We're going to increase the number of bins from 10 to 50, only showing the impacted columns.

In [None]:
mi = dt.get_mutual_information()
mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]

In [None]:
mi = dt.get_mutual_information(num_bins=50)
mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]