# Gain Statistical Insights into Your DataTable

Woodwork provides two methods on DataTable to allow users to better understand their data: `describe` and `get_mutual_information`. The added typing information in a DataTable allows us to more specifically choose which operations to apply to each column according to what's possible for each type.

First, we'll build a DataTable with a variety of types so that we can see the full capabilities of `describe` and `get_mutual_information`. We'll set an index as well as define a few Logical Types.

In [None]:
import pandas as pd
from woodwork import DataTable

df = pd.DataFrame({
        'id': range(3),
        'country': ['AUS', 'GB', 'NZ'],
        'email': ['john.smith@example.com', None, 'team@featuretools.com'],
        'delta': (pd.Series([pd.to_datetime('2020-09-01')] * 3) - pd.to_datetime('2020-07-01')),
        'age': [33, 25, 31],
        'signup_date': [pd.to_datetime('2020-09-01')] * 3,
        'is_registered': [True, False, True],
    })

dt = DataTable(df, index='id')
dt = dt.set_logical_types({'email':'EmailAddress', 'country': 'CountryCode'})

To understand which Logical Types `describe` and `get_mutual_information` will include in their calculations, we can split the available Logical Types into six categories:

- Categorical
    - `Categorical`, `CountryCode`, `Ordinal`, `SubRegionCode`, `ZIPCode`
- Numeric
    - `Double`, `Integer`, `WholeNumber`
- String
    - `EmailAddress`, `FilePath`, `FullName`, `IPAddress`, `LatLong`, `NaturalLanguage`, `PhoneNumber`, `URL`
- Boolean - just the `Boolean` LogicalType
- Datetime - just the `Datetime` LogicalType
- Timedelta - just the `Timedelta` LogicalType
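
The grouping above can be written down as a small lookup table. This is only an illustrative sketch: `LOGICAL_TYPE_CATEGORIES` and `category_of` are hypothetical names, not part of Woodwork's API, and the contents are copied from the list above.

```python
# Hypothetical lookup table built from the categories listed above;
# an illustration only, not something Woodwork exposes.
LOGICAL_TYPE_CATEGORIES = {
    'Categorical': ['Categorical', 'CountryCode', 'Ordinal', 'SubRegionCode', 'ZIPCode'],
    'Numeric': ['Double', 'Integer', 'WholeNumber'],
    'String': ['EmailAddress', 'FilePath', 'FullName', 'IPAddress', 'LatLong',
               'NaturalLanguage', 'PhoneNumber', 'URL'],
    'Boolean': ['Boolean'],
    'Datetime': ['Datetime'],
    'Timedelta': ['Timedelta'],
}


def category_of(logical_type_name):
    """Return the category a Logical Type name belongs to, or None if unlisted."""
    for category, names in LOGICAL_TYPE_CATEGORIES.items():
        if logical_type_name in names:
            return category
    return None


print(category_of('CountryCode'))   # the type we set for `country`
print(category_of('EmailAddress'))  # the type we set for `email`
```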


## DataTable.describe

Using `dt.describe()` will calculate statistics for the columns in your DataTable and return a Pandas DataFrame with the relevant calculations done for each column.

The statistics calculated can be broken down into a few types:

- General - can be applied to all columns
    - `nan_count` and `mode`
- Aggregate
    - `count` - Categorical, Numeric, Datetime
    - `nunique` - Categorical, Numeric, Datetime
    - `mean` - Numeric, Datetime
    - `std` - Numeric
    - `min` - Numeric, Datetime
    - `max` - Numeric, Datetime
- Boolean - only relevant for columns of Booleans
    - `num_false` and `num_true`
- Quartile - calculated on Numeric columns
    - `first_quartile`
    - `second_quartile`
    - `third_quartile`
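
To make the statistic names concrete, here is a rough sketch that computes the Numeric ones for the `age` values using only Python's standard library. It is not how Woodwork computes them: the sample standard deviation (dividing by n - 1) and the linearly interpolated quartiles are assumptions, and other conventions would give slightly different numbers.

```python
import statistics

ages = [33, 25, 31]  # the `age` column from the DataTable above

# Nulls are left out of every statistic except nan_count.
non_null = [v for v in ages if v is not None]

# Assumption: quartiles use linear interpolation over the observed range.
q1, q2, q3 = statistics.quantiles(non_null, n=4, method='inclusive')

stats = {
    'nan_count': len(ages) - len(non_null),
    'count': len(non_null),
    'nunique': len(set(non_null)),
    'mean': statistics.mean(non_null),
    'std': statistics.stdev(non_null),  # assumption: sample standard deviation
    'min': min(non_null),
    'max': max(non_null),
    'first_quartile': q1,
    'second_quartile': q2,
    'third_quartile': q3,
}

for name, value in stats.items():
    print(name, value)
```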

Now, let's call `describe` and see how the above statistics get calculated.

In [None]:
dt.describe()

There are a few things to note in the above dataframe:

- The DataTable's `index` is not included
- We provide typing information for each column according to Woodwork's typing system
- Any statistic that cannot be calculated for a column will be filled with `NaN`.
- The `email` column contains a null value, which does not get counted in any of the calculations other than `nunique`. 

## DataTable.get_mutual_information

`dt.get_mutual_information` will calculate the mutual information between all pairs of columns whose Logical Types fall into one of the Numeric, Categorical, or Boolean categories described above. The mutual information between columns `A` and `B` can be understood as the amount of knowledge we can have about column `A` if we have the values of column `B`. The more mutual information there is between `A` and `B`, the less uncertainty there is in `A` knowing `B`, and vice versa.

The full list of Logical Types for which mutual information can be calculated is Boolean, Categorical, CountryCode, Double, Integer, Ordinal, SubRegionCode, WholeNumber, and ZIPCode.
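
For intuition, mutual information between two discrete columns can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Woodwork's implementation: it treats every value as a discrete label, and it normalizes by the mean of the two column entropies so that a column paired with itself scores 1. The function names are hypothetical, and how Woodwork handles numeric columns or normalization may differ.

```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete labels."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two equal-length label sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = 0.0
    for (x, y), c in count_ab.items():
        p_xy = c / n
        mi += p_xy * log(p_xy / ((count_a[x] / n) * (count_b[y] / n)))
    return mi


def normalized_mutual_information(a, b):
    """Scale mutual information to [0, 1] by the mean of the two entropies.

    Assumption: this normalization is chosen for illustration only.
    """
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        return 0.0  # a constant column tells us nothing about another column
    return mutual_information(a, b) / ((h_a + h_b) / 2)


col_a = ['x', 'y', 'x', 'y']
col_b = ['p', 'p', 'q', 'q']  # unrelated to col_a in this toy data

print(normalized_mutual_information(col_a, col_a))  # a column with itself
print(normalized_mutual_information(col_a, col_b))  # two independent columns
```

In this toy data every combination of `col_a` and `col_b` appears equally often, so knowing one column says nothing about the other and the score is zero.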

If we call `dt.get_mutual_information()`, we'll see that `delta`, `signup_date`, and `email` will be excluded from the resulting dataframe.

In [None]:
dt.get_mutual_information()

We see that the mutual information between a column and itself is 1, which makes sense, since we know everything about a column given itself.

While the above dataframe nicely shows how `get_mutual_information` includes only some types of columns, it doesn't actually provide much in terms of understanding mutual information because our dummy DataTable only has 3 rows of data.

If we use our demo retail DataTable, we can see the behavior of mutual information more clearly. We'll filter out the columns calculated with themselves so that we can more clearly see the relationships between other columns.

In [None]:
from woodwork.demo import load_retail

dt = load_retail(nrows=100)
mi_1 = dt.get_mutual_information()
mi_1[mi_1['column_1'] != mi_1['column_2']].sort_values('mutual_info', ascending=False).head(8)

Let's see what happens if we increase the number of rows that we pull for the DataTable:

In [None]:
dt = load_retail(nrows=500)
mi_2 = dt.get_mutual_information()
mi_2[mi_2['column_1'] != mi_2['column_2']].sort_values('mutual_info', ascending=False).head(8)

What's interesting to note here is that increasing the number of available rows with which to calculate mutual information increased the value of the top mutual information pair, `customer_name` and `order_id`. We see some fluctuation of the other values, but if we look at the average mutual information for the two runs, we see that the mutual information went down, on average, when we had more data. This likely points to a truly strong connection between `customer_name` and `order_id`. This makes sense, because the more data we have, the more accurate the exhibited relationships between columns should be.

In [None]:
'mutual information with 100 rows', mi_1['mutual_info'].mean(), 'mutual information with 500 rows', mi_2['mutual_info'].mean()

If we want just 500 rows from our data, taking the first 500 rows, which is what `nrows=500` does when we load the data, might not be the best choice, since those rows are not a random sample. If we'd like to randomly sample 500 rows from our data, we can set the `nrows` parameter in `get_mutual_information`. Note that the randomness involved in sampling means that different runs can produce different results.

In [None]:
dt = load_retail()
mi_sample = dt.get_mutual_information(nrows=500)
mi_sample[mi_sample['column_1'] != mi_sample['column_2']].sort_values('mutual_info', ascending=False).head(8)
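
The difference between taking the first 500 rows and sampling 500 rows at random can be sketched with made-up data. Everything here is hypothetical: the `rows` list stands in for a file that happens to be ordered by time, which may or may not be true of the retail data, and the fixed seed is used only to make the sketch repeatable.

```python
import random

# Hypothetical ordered data: 10,000 rows whose 'month' grows from 1 to 12
# as we move through the file.
rows = [{'row': i, 'month': 1 + (i * 12) // 10_000} for i in range(10_000)]

# Taking the first 500 rows only ever sees the start of the file.
first_500 = rows[:500]

# A random sample draws rows from across the whole file.
rng = random.Random(0)  # fixed seed so this sketch is repeatable
random_500 = rng.sample(rows, 500)

months_in_head = {r['month'] for r in first_500}
months_in_sample = {r['month'] for r in random_500}

print('months seen in first 500 rows:', sorted(months_in_head))
print('months seen in random sample:', sorted(months_in_sample))
```

With ordered data like this, the first 500 rows cover a single month, while the random sample covers a much wider range of months.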

In [None]:
dt.describe()

In [None]:

mi_sample = dt.get_mutual_information(nrows=500)
mi_sample[mi_sample['column_1'] != mi_sample['column_2']].sort_values('mutual_info', ascending=False).head(8)

In [None]:

mi_sample = dt.get_mutual_information()
mi_sample[mi_sample['column_1'] != mi_sample['column_2']].sort_values('mutual_info', ascending=False).head(8)