Rename Mutual Info #390

Merged
merged 8 commits into from Nov 16, 2020
Changes from 5 commits
2 changes: 1 addition & 1 deletion docs/source/api_reference.rst
@@ -22,7 +22,7 @@ DataTable
DataTable.set_time_index
DataTable.to_dataframe
DataTable.describe
DataTable.get_mutual_information
DataTable.mutual_information
DataTable.value_counts
DataTable.to_csv
DataTable.to_pickle
16 changes: 8 additions & 8 deletions docs/source/guides/statistical_insights.ipynb
@@ -8,7 +8,7 @@
"\n",
"Woodwork provides methods on DataTable to allow users to utilize the typing information inherent in a DataTable to better understand their data.\n",
"\n",
"Let's walk through how to use `describe` and `get_mutual_information` on a retail DataTable so that we can see the full capabilities of the functions."
"Let's walk through how to use `describe` and `mutual_information` on a retail DataTable so that we can see the full capabilities of the functions."
]
},
{
@@ -79,13 +79,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataTable.get_mutual_information()\n",
"## DataTable.mutual_information()\n",
"\n",
"`dt.get_mutual_information` will calculate the mutual information between all pairs of relevant Data Columns. Certain types such as datetimes or strings cannot have mutual information calculated.\n",
"`dt.mutual_information` will calculate the mutual information between all pairs of relevant Data Columns. Certain types such as datetimes or strings cannot have mutual information calculated.\n",
"\n",
"The mutual information between columns `A` and `B` can be understood as the amount of knowledge we can have about column `A` if we have the values of column `B`. The more mutual information there is between `A` and `B`, the less uncertainty there is in `A` knowing `B` or vice versa. \n",
"\n",
"If we call `dt.get_mutual_information()`, we'll see that `order_date` will be excluded from the resulting dataframe."
"If we call `dt.mutual_information()`, we'll see that `order_date` will be excluded from the resulting dataframe."
]
},
{
@@ -94,15 +94,15 @@
"metadata": {},
"outputs": [],
"source": [
"dt.get_mutual_information()"
"dt.mutual_information()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Available Parameters\n",
"`dt.get_mutual_information` provides two parameters for tuning the mutual information calculation.\n",
"`dt.mutual_information` provides two parameters for tuning the mutual information calculation.\n",
"\n",
"- `num_bins` - In order to calculate mutual information on continuous data, we bin numeric data into categories. This parameter allows users to choose the number of bins with which to categorize data.\n",
" - Defaults to using 10 bins\n",
@@ -125,7 +125,7 @@
"metadata": {},
"outputs": [],
"source": [
"mi = dt.get_mutual_information()\n",
"mi = dt.mutual_information()\n",
"mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]"
]
},
@@ -135,7 +135,7 @@
"metadata": {},
"outputs": [],
"source": [
"mi = dt.get_mutual_information(num_bins = 50)\n",
"mi = dt.mutual_information(num_bins = 50)\n",
"mi[mi['column_1'].isin(['unit_price', 'quantity']) | mi['column_2'].isin(['unit_price', 'quantity'])]"
]
}
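The guide text in this hunk describes mutual information as the reduction in uncertainty about one column given another, and says that numeric data is binned into categories (controlled by `num_bins`) before the calculation. A minimal standard-library sketch of that idea follows. It is not Woodwork's implementation: the helper names (`bin_values`, `entropy`, `mutual_info`, `normalized_mutual_info`), the equal-width binning, and the normalization by the mean of the two entropies are all assumptions made for illustration.

```python
import math
from collections import Counter


def bin_values(values, num_bins=10):
    """Map numeric values to equal-width bin indices in [0, num_bins - 1].

    Hypothetical helper; equal-width binning is an assumption of this sketch.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column collapses into a single bin.
        return [0] * len(values)
    width = (hi - lo) / num_bins
    # Clamp so the maximum value lands in the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_info(a, b):
    """I(A; B) = H(A) + H(B) - H(A, B), clamped at zero against rounding."""
    joint = list(zip(a, b))
    return max(0.0, entropy(a) + entropy(b) - entropy(joint))


def normalized_mutual_info(a, b):
    """Scale mutual information to [0, 1] by the mean of the two entropies.

    Returning 1.0 when both columns are constant is a convention chosen
    for this sketch.
    """
    denom = (entropy(a) + entropy(b)) / 2
    return 1.0 if denom == 0 else mutual_info(a, b) / denom
```

Under these assumptions, two identical columns score 1, two independent columns score 0, and a continuous column would first be passed through `bin_values` so that it can be treated as categorical. Changing `num_bins` changes the bin labels and therefore the score, which is why the guide compares the default with `num_bins = 50`.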
8 changes: 4 additions & 4 deletions docs/source/guides/using_woodwork_with_dask_and_koalas.ipynb
@@ -101,7 +101,7 @@
"metadata": {},
"source": [
"### Analyzing Underlying Data\n",
"There are three DataTable methods that also require bringing the underlying Dask DataFrame into memory: `describe`, `value_counts` and `get_mutual_information`. When called, these methods will call a `compute` operation on the DataFrame associated with the DataTable in order to calculate the desired information. This may be problematic for datasets that cannot fit in memory, so exercise caution when using these methods."
"There are three DataTable methods that also require bringing the underlying Dask DataFrame into memory: `describe`, `value_counts` and `mutual_information`. When called, these methods will call a `compute` operation on the DataFrame associated with the DataTable in order to calculate the desired information. This may be problematic for datasets that cannot fit in memory, so exercise caution when using these methods."
]
},
{
@@ -128,7 +128,7 @@
"metadata": {},
"outputs": [],
"source": [
"dt.get_mutual_information().head()"
"dt.mutual_information().head()"
]
},
{
@@ -233,7 +233,7 @@
"metadata": {},
"source": [
"### Analyzing Underlying Data\n",
"As with Dask, running `describe`, `value_counts` or `get_mutual_information` requires bringing the data into memory to perform the analysis. When called, these methods will call a `to_pandas` operation on the DataFrame associated with the DataTable in order to calculate the desired information. This may be problematic for very large datasets, so exercise caution when using these methods."
"As with Dask, running `describe`, `value_counts` or `mutual_information` requires bringing the data into memory to perform the analysis. When called, these methods will call a `to_pandas` operation on the DataFrame associated with the DataTable in order to calculate the desired information. This may be problematic for very large datasets, so exercise caution when using these methods."
]
},
{
@@ -260,7 +260,7 @@
"metadata": {},
"outputs": [],
"source": [
"dt.get_mutual_information().head()"
"dt.mutual_information().head()"
]
},
{
1 change: 1 addition & 0 deletions docs/source/release_notes.rst
@@ -9,6 +9,7 @@ Release Notes
* Fixes
* Rename ``data_column.py`` ``datacolumn.py`` (:pr:`386`)
* Rename ``data_table.py`` ``datatable.py`` (:pr:`387`)
* Rename ``get_mutual_information`` ``mutual_information`` (:pr:`390`)
* Changes
* Lower moto test requirement for serialization/deserialization (:pr:`376`)
* Make Koalas an optional dependency installable with woodwork[koalas] (:pr:`378`)
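The release note above records a plain rename of `get_mutual_information` to `mutual_information`; nothing in this diff shows a deprecated alias being kept. For downstream code that has to run against releases on both sides of the rename, one possible approach is a small lookup helper. The sketch below is hypothetical: `call_mutual_information`, `OldTable`, and `NewTable` are illustrative names, and the two classes are stand-ins rather than Woodwork's `DataTable`.

```python
def call_mutual_information(dt, **kwargs):
    """Call whichever mutual-information method the given table exposes."""
    method = getattr(dt, "mutual_information", None)
    if method is None:
        # Fall back to the pre-rename name used by older releases.
        method = getattr(dt, "get_mutual_information")
    return method(**kwargs)


# Stand-in classes for demonstration only; they mimic the method
# signature shown in this diff (num_bins=10, nrows=None).
class OldTable:
    def get_mutual_information(self, num_bins=10, nrows=None):
        return ("old", num_bins, nrows)


class NewTable:
    def mutual_information(self, num_bins=10, nrows=None):
        return ("new", num_bins, nrows)


print(call_mutual_information(OldTable(), num_bins=50))
print(call_mutual_information(NewTable()))
```

The helper prefers the new name and only falls back to the old one, so it keeps working once the old name is gone.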
2 changes: 1 addition & 1 deletion koalas-requirements.txt
@@ -1,2 +1,2 @@
pyspark>=3.0.0
koalas>=1.1.0
koalas>=1.1.0,<=1.3.0
8 changes: 4 additions & 4 deletions woodwork/datatable.py
@@ -764,11 +764,11 @@ def _make_categorical_for_mutual_info(self, data, num_bins):
data[col_name] = new_col.cat.codes
return data

def get_mutual_information(self, num_bins=10, nrows=None):
def mutual_information(self, num_bins=10, nrows=None):
"""
Calculates mutual information between all pairs of columns in the DataTable
that support mutual information. Logical Types that support mutual information are
as follows: Boolean, Categorical, CountryCode, Double, Integer, Ordinal, SubRegionCode, and ZIPCode
Calculates mutual information between all pairs of columns in the DataTable that
support mutual information. Logical Types that support mutual information are as
follows: Boolean, Categorical, CountryCode, Double, Integer, Ordinal, SubRegionCode, and ZIPCode

Args:
num_bins (int): Determines number of bins to use for converting
18 changes: 9 additions & 9 deletions woodwork/tests/datatable/test_datatable.py
@@ -2205,14 +2205,14 @@ def test_datatable_make_categorical_for_mutual_info():
assert formatted_num_bins_df['categories'].equals(pd.Series([0, 1, 1, 0], dtype='int8'))


def test_datatable_get_mutual_information(df_same_mi, df_mi):
def test_datatable_mutual_information(df_same_mi, df_mi):
# Only test if df_same_mi and df_mi are same type
if type(df_same_mi) != type(df_mi):
return

dt_same_mi = DataTable(df_same_mi, logical_types={'date': Datetime(datetime_format='%Y-%m-%d')})

mi = dt_same_mi.get_mutual_information()
mi = dt_same_mi.mutual_information()

cols_used = set(np.unique(mi[['column_1', 'column_2']].values))
assert 'nans' not in cols_used
@@ -2223,20 +2223,20 @@ def test_datatable_get_mutual_information(df_same_mi, df_mi):

dt = DataTable(df_mi)
original_df = dt.to_dataframe().copy()
mi = dt.get_mutual_information()
mi = dt.mutual_information()
assert mi.shape[0] == 6
np.testing.assert_almost_equal(mi_between_cols('ints', 'bools', mi), 0.734, 3)
np.testing.assert_almost_equal(mi_between_cols('ints', 'strs', mi), 0.0, 3)
np.testing.assert_almost_equal(mi_between_cols('strs', 'bools', mi), 0, 3)

mi_many_rows = dt.get_mutual_information(nrows=100000)
mi_many_rows = dt.mutual_information(nrows=100000)
pd.testing.assert_frame_equal(mi, mi_many_rows)

mi = dt.get_mutual_information(nrows=1)
mi = dt.mutual_information(nrows=1)
assert mi.shape[0] == 6
assert (mi['mutual_info'] == 1.0).all()

mi = dt.get_mutual_information(num_bins=2)
mi = dt.mutual_information(num_bins=2)
assert mi.shape[0] == 6
np.testing.assert_almost_equal(mi_between_cols('bools', 'ints', mi), .274, 3)
np.testing.assert_almost_equal(mi_between_cols('strs', 'ints', mi), 0, 3)
@@ -2248,19 +2248,19 @@ def test_datatable_get_mutual_information(df_same_mi, df_mi):

def test_mutual_info_does_not_include_index(sample_df):
dt = DataTable(sample_df, index='id')
mi = dt.get_mutual_information()
mi = dt.mutual_information()
assert 'id' not in mi['column_1'].values


def test_mutual_info_returns_empty_df_properly(sample_df):
dt = DataTable(sample_df.copy()[['id', 'age']], index='id')
mi = dt.get_mutual_information()
mi = dt.mutual_information()
assert mi.empty


def test_mutual_info_sort(df_mi):
dt = DataTable(df_mi)
mi = dt.get_mutual_information()
mi = dt.mutual_information()

for i in range(len(mi['mutual_info']) - 1):
assert mi['mutual_info'].iloc[i] >= mi['mutual_info'].iloc[i + 1]