# Confirm Mutual Info Calcs
The goal of this notebook is to compare computing Mutual Information manually versus using [sklearn's `mutual_info_classif`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif). This confirms the math behind it and ensures we know what we're doing.

In all cases the response variable is discrete (e.g. is_churn=T/F). The independent variable may be discrete or continuous, and each case uses a different approach.

We start with the independent variable in its simplest form, binary, then move to continuous, then dig into Discrete (non-binary) data and how sklearn treats it. Finally, we confirm that after one-hot encoding categorical variables we can find mutual information for each class using the binary approach.

## Mutual Information: Formula
<img src ='./mutual_info_formula.png'> 
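In case the image does not load, the standard definition of mutual information between two discrete variables is written out below. This is an assumption about what the image shows; it matches the `p_xy * np.log(p_xy / (p_x * p_y))` term used in the manual calculation later in this notebook, where the natural log gives a result in nats.

$$
I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\ln\frac{p(x,y)}{p(x)\,p(y)}
$$

Here $p(x,y)$ is the joint probability of a cell in the contingency table, and $p(x)$ and $p(y)$ are its row and column marginals. Cells with $p(x,y)=0$ contribute nothing to the sum.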

# Findings
### Previewed here, repeated at the end of notebook
A Manual Calculation of Mutual Information based on a well-known formula was created. 
 - When applied to Discrete (Binary) values, and compared with SKLearn's `mutual_info_classif`, the formula produces the same output and seems reliable.
 - When applied to Continuous values, the formula does not produce reliable results, and SKLearn's `mutual_info_classif` implementation should be used with the parameter `discrete_features=[False]`.
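 - When applied to Discrete (but not Binary) values (such as Ordinal columns), the runs in this notebook show SKLearn's `mutual_info_classif` matching the formula when `discrete_features=[True]` is specified (0.021 for `parch`, 0.019 for `sibsp`); the single `parch` run with `discrete_features=[False]` returned 0.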
 - When Discrete (but not Binary) values are one-hot encoded, the Manual and SKLearn approaches give equal results (as already shown). However, the downside is that we've measured mutual information PER VALUE of the categorical column, rather than the column's overall importance.

Clarified (and still clarifying) sklearn's use of the word 'discrete' here.

In [0]:
### Note this notebook was coded with and requires a pre 2.0.0 pandas that uses .append()
import numpy as np
import pandas as pd
print(pd.__version__)

1.3.4
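For anyone running a newer pandas, where `DataFrame.append` is no longer available, the summary tables in this notebook could be built another way. The following is a minimal sketch, not the notebook's own code: it collects rows in a plain list and builds the DataFrame once. The `results` list is a hypothetical stand-in whose two values are copied from outputs further down.

```python
import pandas as pd

# Hypothetical per-feature results, shaped like the summary rows built in this notebook.
results = [
    {'Response': 'survived', 'Feature': 'parch', 'manual_mutual_info': 0.021},
    {'Response': 'survived', 'Feature': 'sex', 'manual_mutual_info': 0.142},
]

rows = []
for result in results:
    rows.append(result)  # plain list.append, not DataFrame.append

# Build the DataFrame once, then sort once at the end.
summary = (
    pd.DataFrame(rows, columns=['Response', 'Feature', 'manual_mutual_info'])
    .sort_values(by='manual_mutual_info', ascending=False)
    .reset_index(drop=True)
)
print(summary)
```

Sorting once after the loop also avoids re-sorting the growing table on every iteration.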


# Data Prep

## Test Data: Titanic
Cross-checking with the example from Train in Data: https://www.blog.trainindata.com/mutual-information-with-python/
This is only necessary when use_test_data=True

In [0]:
def generate_test_data():
    import pandas as pd
    from sklearn.metrics import mutual_info_score
    from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
    # from feature_engine.encoding import RareLabelEncoder

    variables = [
        'pclass', 'survived', 'sex', 'age', 'sibsp',
        'parch', 'fare', 'cabin', 'embarked',
        ]
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl',
                    usecols=variables,
                    na_values='?',
                    dtype={'fare': float, 'age': float},
                    )
    
    print(f"before cleanup there are {len(data)} rows")
    data.dropna(subset=['embarked', 'fare'], inplace=True)
    data['age'] = data['age'].fillna(data['age'].mean())
    def get_first_cabin(row):
        ### Missing cabins arrive as NaN (a float), which has no .split()
        try:
            return row.split()[0]
        except (AttributeError, IndexError):
            return 'N'
    data['cabin'] = data['cabin'].apply(get_first_cabin).str[0]

    ### because feature_engine wouldn't load in databricks, this section is commented out. can do another way if need these columns
    """encoder = RareLabelEncoder(variables='cabin', n_categories=2)
    data = encoder.fit_transform(data)

    #convert categorical variables to numbers
    encoder = OrdinalEncoder(
                    encoding_method='arbitrary',
                    variables=["sex", "embarked"],
                    )
    data = encoder.fit_transform(data) """
    data['sex'] = data['sex'].replace({'female': 0, 'male': 1})
    return data

# Mutual Information For Binary Data

In [0]:
### This will be overwritten in later Continuous section. Choose here for Binary section:
### To make coding easier, standardize either input as "df", and choose columns.
### Make sure to set the response variable as col2 - important later when we loop through all columns


use_test_data=True
print(f"Use Test Data? {use_test_data}")

if use_test_data:
    data = generate_test_data()
    df = data.copy()
    col1='sex'      # Binary Values {'female': 0, 'male': 1}
    col2='survived'
else:
    print("No prod data configured here")

print(f"Sending in {len(data.columns)} columns which are: {data.columns}, and there are now {len(data)} rows")


Use Test Data? True
before cleanup there are 1309 rows
Sending in 9 columns which are: Index(['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'fare', 'cabin',
       'embarked'],
      dtype='object'), and there are now 1306 rows


In [0]:
#print(data['age'].to_list())

## Mutual Information: Contingency Table

In [0]:
def label_contingency_table(contingency_table):
    """ Create labeled version to confirm visually. Output is not used elsewhere """
    labeled_contingency_table = contingency_table.round(4)
    labeled_contingency_table.columns = [f" {col1} ({col})" for col in labeled_contingency_table.columns]
    labeled_contingency_table.index = [f"({index}) {col2}" for index in labeled_contingency_table.index]
    print(labeled_contingency_table)


In [0]:
def create_contingency_table(df, col1, col2='is_churn'):
    """ Given a full dataframe, build the joint-probability contingency table for 2 Discrete columns """

    ### Create contingency table using raw values 
    values_contingency_table = pd.crosstab(df[col1], df[col2], margins=True, margins_name='Total').transpose()
    label_contingency_table(values_contingency_table)  # prints the labeled table itself and returns None

    ### Show probabilities by dividing each value by total possible
    contingency_table = pd.crosstab(df[col1], df[col2], margins=True, margins_name='Total', normalize='all').transpose()

    # Exclude 'Total' row and column (could also remove "normalize" but that is helpful to view)
    contingency_table = contingency_table.iloc[:-1, :-1]

    return contingency_table


In [0]:
contingency_table = create_contingency_table(df, col1=col1, col2=col2)
display(contingency_table)

                   sex (0)   sex (1)   sex (Total)
(0) survived           127       681           808
(1) survived           337       161           498
(Total) survived       464       842          1306


0,1
0.0972434915773353,0.5214395099540582
0.2580398162327718,0.1232771822358346


## Mutual Information: Formula
<img src ='/files/dhislop/mutual_info_formula.png'> 
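The counts in the labeled table above are enough to check the formula by hand. The sketch below is a vectorized version of the same sum, assuming the natural log as in the rest of this notebook. It should land on the 0.142 that the manual and sklearn calculations report further down.

```python
import numpy as np

# Raw counts from the labeled table above: rows = survived (0, 1), columns = sex (0, 1)
counts = np.array([[127, 681],
                   [337, 161]], dtype=float)

p_xy = counts / counts.sum()               # joint probabilities
p_x = p_xy.sum(axis=1, keepdims=True)      # row marginals (survived)
p_y = p_xy.sum(axis=0, keepdims=True)      # column marginals (sex)

# One term per cell. Every count here is > 0, so no zero-cell guard is needed.
terms = p_xy * np.log(p_xy / (p_x * p_y))
mi = terms.sum()

print(terms.round(3))
print(round(mi, 3))
```

Individual terms can be negative (cells that occur less often than independence would predict), but the total is never negative.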

In [0]:
def get_mutual_info_from_contingency_tablex(contingency_table):
    """ Given a Contingency Table, Compute mutual information """
    mutual_info = 0

    print(f"contingency_table rows: {contingency_table.index.name}, columns: {contingency_table.columns.name}")

    for i, row in enumerate(contingency_table.index):
        for j, col in enumerate(contingency_table.columns):
            p_xy = contingency_table.at[row, col]
            p_x = contingency_table.sum(axis=1).at[row]
            p_y = contingency_table.sum(axis=0).at[col]
            # print(f"for row {row}, col {col}, the p_xy={p_xy}, p_x={p_x}, p_y={p_y}")

            if p_xy > 0:
                ### HERE IS THE FORMULA ###
                mutual_info += p_xy * np.log(p_xy / (p_x * p_y))

    print(f"Manual Mutual Information: {mutual_info:.3f}")
    return round(mutual_info,3)

In [0]:
def get_mutual_info_from_contingency_table(contingency_table, col1='feature_name'):
    """ Given a Contingency Table of joint probabilities, compute mutual information (in nats) """
    mutual_info = 0
    rows = []  # one record per cell; a plain list avoids calling DataFrame.append inside the loop

    ### Marginal probabilities only need to be computed once
    p_x_all = contingency_table.sum(axis=1)
    p_y_all = contingency_table.sum(axis=0)

    print(f"contingency_table rows: {contingency_table.index.name}, columns: {contingency_table.columns.name}")

    for row in contingency_table.index:
        for col in contingency_table.columns:
            p_xy = contingency_table.at[row, col]
            p_x = p_x_all.at[row]
            p_y = p_y_all.at[col]
            #print(f"for row {row}, col {col}, the p_xy={p_xy}, p_x={p_x}, p_y={p_y}")

            ### Reset per cell, so an empty cell (p_xy == 0) records 0 rather than the previous cell's value
            mutual_info_this_combination = 0
            if p_xy > 0:
                mutual_info_this_combination = p_xy * np.log(p_xy / (p_x * p_y))
                mutual_info += mutual_info_this_combination

            rows.append({
                'Feature': col1,
                'p_x': round(p_x, 2),
                'p_xy': round(p_xy, 2),
                'p_y': round(p_y, 2),
                'Calc_MI': round(mutual_info_this_combination, 3)
                })

    summary_df = pd.DataFrame(rows, columns=['Feature', 'p_x', 'p_xy', 'p_y', 'Calc_MI']).sort_values(by='Calc_MI', ascending=False)
    print(f"Manual Mutual Information: {mutual_info:.3f}")
    return round(mutual_info,3), summary_df

In [0]:
manual_mutual_info, summary_df = get_mutual_info_from_contingency_table(contingency_table=contingency_table, col1=col1)

contingency_table rows: survived, columns: sex
Manual Mutual Information: 0.142


### Show Details of Probability in Mutual Info Calc

In [0]:
display(summary_df)
print(f"Mutual Information for this feature: {summary_df.Calc_MI.sum().round(3)}")

Feature,p_x,p_xy,p_y,Calc_MI
sex,0.38,0.26,0.36,0.166
sex,0.62,0.52,0.64,0.14
sex,0.62,0.1,0.36,-0.079
sex,0.38,0.12,0.64,-0.085


Mutual Information for this feature: 0.142


## Mutual Information: SKLearn Methods

In [0]:
from sklearn.metrics import mutual_info_score
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
def get_sklearn_mutual_info(df, col1=col1, col2=col2, col1_discrete=True):
    """ Mutual information between feature col1 and response col2, using sklearn's mutual_info_classif """
    if col1_discrete:
        ### For binary columns, can use _score or _classif, same result
        # mi = round(mutual_info_score(df[col1], df[col2]), 3)
        mi = mutual_info_classif(df[col1].to_frame(), df[col2], discrete_features=[True]).round(3)[0]

    else:
        ### For continuous columns: specify False for discrete_features
        mi = mutual_info_classif(df[col1].to_frame(), df[col2], discrete_features=[False]).round(3)[0]

    return mi

In [0]:
sklearn_mutual_info = get_sklearn_mutual_info(df, col1=col1, col2=col2, col1_discrete=True)
print(f"sklearn's mutual info calculation: {sklearn_mutual_info}")

sklearn's mutual info calculation: 0.142
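The code comment above notes that `mutual_info_score` and `mutual_info_classif` give the same result for a binary column. The sketch below checks that claim without downloading the dataset, by rebuilding row-level `sex` and `survived` arrays from the counts in the labeled contingency table. The `cells` list is an illustrative construction, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.feature_selection import mutual_info_classif

# (sex, survived, count) for each cell of the labeled contingency table
cells = [(0, 0, 127), (1, 0, 681), (0, 1, 337), (1, 1, 161)]
sex = np.concatenate([np.full(n, s) for s, _, n in cells])
survived = np.concatenate([np.full(n, v) for _, v, n in cells])

# Contingency-table MI (nats) from sklearn.metrics
mi_score = mutual_info_score(sex, survived)

# Same pair of columns through the feature-selection API, flagged as discrete
mi_classif = mutual_info_classif(sex.reshape(-1, 1), survived, discrete_features=[True])[0]

print(round(mi_score, 3), round(mi_classif, 3))
```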


In [0]:
print(f"For columns {col1} and {col2}, Mutual Information is:")
print(f"Manual Calc {manual_mutual_info}")
print(f"Sklearn Calc {sklearn_mutual_info}")

For columns sex and survived, Mutual Information is:
Manual Calc 0.142
Sklearn Calc 0.142


## Iterate all Binary Columns

In [0]:
### Examine only columns with 2 values
columns_to_drop = [col for col in df.columns if len(df[col].unique()) != 2]
df_binary_iterate = df.drop(columns=columns_to_drop)
cols_binary_iterate = list(df_binary_iterate.columns)

### Remove the response variable - don't need to run it against itself
if col2 in cols_binary_iterate:
    cols_binary_iterate.remove(col2)

print("drop these ", columns_to_drop, "\nkeep these ",cols_binary_iterate)

drop these  ['pclass', 'age', 'sibsp', 'parch', 'fare', 'cabin', 'embarked'] 
keep these  ['sex']


In [0]:
### Iterate thru all Binary Columns getting Mutual Info with col2 (response variable - set in Mutual Info section)

mutual_info_summary_binary = pd.DataFrame(columns=['Response', 'Feature', 'manual_mutual_info', 'sklearn_mutual_info'])

for col1 in cols_binary_iterate:
    print(f"working on column {col1}")
    manual_mutual_info_value, summary_df = get_mutual_info_from_contingency_table(create_contingency_table(df_binary_iterate, col1=col1, col2=col2))

    sklearn_mutual_info_value = get_sklearn_mutual_info(df, col1=col1, col2=col2, col1_discrete=True)
    mutual_info_summary_binary = mutual_info_summary_binary.append({
        'Feature': col1,
        'Response': col2,
        'manual_mutual_info': manual_mutual_info_value,
        'sklearn_mutual_info': sklearn_mutual_info_value,
        }, ignore_index=True).sort_values(by='sklearn_mutual_info', ascending=False)

display(mutual_info_summary_binary)



working on column sex
                   sex (0)   sex (1)   sex (Total)
(0) survived           127       681           808
(1) survived           337       161           498
(Total) survived       464       842          1306
contingency_table rows: survived, columns: sex
Manual Mutual Information: 0.142


Response,Feature,manual_mutual_info,sklearn_mutual_info
survived,sex,0.142,0.142


## Result: For Binary variables: Manual or SKLearn give same result

# Mutual Information For Continuous Data

In [0]:
### This will overwrite earlier selections. Choose here for Continuous section:
### To make coding easier, standardize either input as "df", and choose columns.
### Make sure to set the response variable as col2 - important later when we loop through all columns
print(f"Use Test Data? {use_test_data}")
if use_test_data:
    data = generate_test_data()
    df = data.copy()
    col1='fare'     # Continuous Values
    col2='survived'
else:
    df = df_binary.copy()
    col1='is_churn_logo_lost' # 'is_churn_downgrade'
    col2='is_churn'


Use Test Data? True
before cleanup there are 1309 rows


In [0]:
df.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,cabin,embarked
0,1,1,0,29.0,0,0,211.3375,B,S
1,1,1,1,0.9167,1,2,151.55,C,S
2,1,0,0,2.0,1,2,151.55,C,S
3,1,0,1,30.0,1,2,151.55,C,S
4,1,0,0,25.0,1,2,151.55,C,S


## Mutual Information: SKLearn

In [0]:
sklearn_mutual_info = get_sklearn_mutual_info(df, col1=col1, col2=col2, col1_discrete=False)
print(f"sklearn's mutual info calculation for {col2}, {col1}: {sklearn_mutual_info}")

sklearn's mutual info calculation for survived, fare: 0.137


## Mutual Information: Formula
The contingency-table formula is not expected to work for Continuous Values: the table gets one column for every distinct value, so most cells hold only a handful of rows. Confirming:
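To see why a contingency table misleads on continuous data, consider a feature that is pure noise. The sketch below uses synthetic data and a hypothetical helper, `plug_in_mi`, that applies the same formula as the manual calculation. When every feature value is unique, each value "predicts" its one observed response perfectly, so the formula returns the full entropy of the response even though the two variables are independent.

```python
import numpy as np

def plug_in_mi(x, y):
    """Contingency-table mutual information in nats (hypothetical helper)."""
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    xi, yi = xi.ravel(), yi.ravel()
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1)              # count each (x, y) pair
    joint /= joint.sum()                       # counts -> joint probabilities
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0                             # skip empty cells
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)              # binary response
x = rng.random(1000)                           # continuous feature, independent of y

p = np.bincount(y) / len(y)
h_y = float(-(p * np.log(p)).sum())            # entropy of the response

mi_raw = plug_in_mi(x, y)                      # one table column per distinct value
mi_binned = plug_in_mi(np.floor(x * 4), y)     # same feature cut into 4 bins

print(round(mi_raw, 3), round(h_y, 3), round(mi_binned, 3))
```

Binning the feature first brings the estimate back toward zero, which is one reason a purpose-built estimator is preferable to a raw contingency table for continuous columns.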

In [0]:
contingency_table = create_contingency_table(df, col1=col1, col2=col2)
manual_mutual_info, summary_df = get_mutual_info_from_contingency_table(contingency_table=contingency_table, col1=col1)
print(col1)

                   fare (0.0)   fare (3.1708)   fare (4.0125)   fare (5.0)  \
(0) survived               15               0               1            1   
(1) survived                2               1               0            0   
(Total) survived           17               1               1            1   

                   fare (6.2375)   fare (6.4375)   fare (6.45)  \
(0) survived                   1               3             1   
(1) survived                   0               0             0   
(Total) survived               1               3             1   

                   fare (6.4958)   fare (6.75)   fare (6.8583)  ...  \
(0) survived                   3             2               1  ...   
(1) survived                   0             0               0  ...   
(Total) survived               3             2               1  ...   

                   fare (164.8667)   fare (211.3375)   fare (211.5)  \
(0) survived                     1                 0              

## Iterate all Continuous Columns

In [0]:
# check on unique/possible values in a column

unique_counts = df.nunique()
unique_counts_df = pd.DataFrame({'Column': unique_counts.index, 'Unique_Counts': unique_counts.values})
display(unique_counts_df)

Column,Unique_Counts
pclass,3
survived,2
sex,2
age,98
sibsp,7
parch,8
fare,280
cabin,9
embarked,3


In [0]:
### Examine only columns with more than 3 unique values
columns_to_drop_continuous = [col for col in df.columns if len(df[col].unique()) <= 3]

### Examine only float64 columns
non_numerical_columns = df.select_dtypes(exclude=['float64']).columns.to_list()

drop_these = non_numerical_columns + columns_to_drop_continuous
drop_these = list(set(drop_these))  # dedup

### Keep Response variable in dataframe, even though it only has 2 values
if col2 in drop_these:
    drop_these.remove(col2)

df_continuous_iterate = df.drop(columns=drop_these)
cols_continuous_iterate = list(df_continuous_iterate.columns)

print("drop these", drop_these, "\nkeep these", cols_continuous_iterate)

### Remove the response variable from iterated column list - don't need to run it against itself
if col2 in cols_continuous_iterate:
    cols_continuous_iterate.remove(col2)

drop these ['sibsp', 'pclass', 'sex', 'parch', 'cabin', 'embarked'] 
keep these ['survived', 'age', 'fare']


In [0]:
### Iterate thru all Continuous Columns getting Mutual Info with col2 (response variable - set in Mutual Info section)

mutual_info_summary_continuous = pd.DataFrame(columns=['Response', 'Feature', 'manual_mutual_info', 'sklearn_mutual_info'])

for col1 in cols_continuous_iterate:
    print(f"working on column {col1}")
    manual_mutual_info_value, summary_df = get_mutual_info_from_contingency_table(create_contingency_table(df_continuous_iterate, col1=col1, col2=col2))

    sklearn_mutual_info_value = get_sklearn_mutual_info(df, col1=col1, col2=col2, col1_discrete=False)
    mutual_info_summary_continuous = mutual_info_summary_continuous.append({
        'Feature': col1,
        'Response': col2,
        'manual_mutual_info': manual_mutual_info_value,
        'sklearn_mutual_info': sklearn_mutual_info_value,
        }, ignore_index=True).sort_values(by='sklearn_mutual_info', ascending=False)

display(mutual_info_summary_continuous)


working on column age
                   age (0.1667)   age (0.3333)   age (0.4167)   age (0.6667)  \
(0) survived                  0              1              0              0   
(1) survived                  1              0              1              1   
(Total) survived              1              1              1              1   

                   age (0.75)   age (0.8333)   age (0.9167)   age (1.0)  \
(0) survived                1              0              0           3   
(1) survived                2              3              2           7   
(Total) survived            3              3              2          10   

                   age (2.0)   age (3.0)  ...   age (65.0)   age (66.0)  \
(0) survived               8           2  ...            3            1   
(1) survived               4           5  ...            0            0   
(Total) survived          12           7  ...            3            1   

                   age (67.0)   age (70.0)   age (70.5)

Response,Feature,manual_mutual_info,sklearn_mutual_info
survived,fare,0.269,0.129
survived,age,0.064,0.014


## Result: Use SKLearn for Continuous variables
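One detail worth noting: `fare` scored 0.137 in the single-column run and 0.129 in the loop. A likely reason is that, for continuous features, `mutual_info_classif` adds a small amount of random noise to break ties between repeated values, controlled by its `random_state` parameter. The sketch below uses synthetic data (an assumption, not the Titanic columns) to show that fixing `random_state` makes repeated calls agree.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)

# Hypothetical continuous feature whose mean shifts with the class label
x = (rng.normal(size=500) + y).reshape(-1, 1)

# Same data, same seed: the two estimates should be identical
mi_a = mutual_info_classif(x, y, discrete_features=[False], random_state=0)[0]
mi_b = mutual_info_classif(x, y, discrete_features=[False], random_state=0)[0]

print(mi_a, mi_b)
```

Passing a fixed `random_state` inside `get_sklearn_mutual_info` would make the continuous results in this notebook reproducible from run to run.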


# Mutual Information For Discrete Data (non-binary)

Setting up a test to check whether we can just "throw in" Ordinal or Categorical data into a Mutual Information formula. At first glance, the `discrete_features=True` setting of sklearn's `mutual_info_classif` seems to really mean binary values; the runs below test that.

Here we input two ordinal columns from the test data, `sibsp` (values 0-8) and `parch` (values 0-9). We also tried the categorical columns `cabin` and `embarked`, which hold string data like `B` or `C`, but sklearn did not convert the strings to float.

In [0]:
### This will overwrite earlier selections. Choose here for Discrete (non-binary) section:
### To make coding easier, standardize either input as "df", and choose columns.
### Make sure to set the response variable as col2 - important later when we loop through all columns
print(f"Use Test Data? {use_test_data}")
if use_test_data:
    data = generate_test_data()
    df = data.copy()
    col1='parch'     # Discrete Values
    col2='survived'
else:
    df = df_binary.copy()
    col1='is_churn_logo_lost' # 'is_churn_downgrade'
    col2='is_churn'


Use Test Data? True
before cleanup there are 1309 rows


In [0]:
df.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,cabin,embarked
0,1,1,0,29.0,0,0,211.3375,B,S
1,1,1,1,0.9167,1,2,151.55,C,S
2,1,0,0,2.0,1,2,151.55,C,S
3,1,0,1,30.0,1,2,151.55,C,S
4,1,0,0,25.0,1,2,151.55,C,S


## Mutual Information: SKLearn

In [0]:
sklearn_mutual_info = get_sklearn_mutual_info(df, col1=col1, col2=col2, col1_discrete=False)
print(f"sklearn's mutual info calculation for {col2}, {col1}: {sklearn_mutual_info}")

sklearn's mutual info calculation for survived, parch: 0


## Mutual Information: Formula

In [0]:
contingency_table = create_contingency_table(df, col1=col1, col2=col2)
manual_mutual_info, summary_df = get_mutual_info_from_contingency_table(contingency_table=contingency_table, col1=col1)
print(col1)

                   parch (0)   parch (1)   parch (2)   parch (3)   parch (4)  \
(0) survived             665          70          56           3           5   
(1) survived             334         100          57           5           1   
(Total) survived         999         170         113           8           6   

                   parch (5)   parch (6)   parch (9)   parch (Total)  
(0) survived               5           2           2             808  
(1) survived               1           0           0             498  
(Total) survived           6           2           2            1306  
contingency_table rows: survived, columns: parch
Manual Mutual Information: 0.021
parch


## Iterate all Discrete Columns

In [0]:
# check on unique/possible values in a column
unique_counts = df.nunique()
unique_counts_df = pd.DataFrame({'Column': unique_counts.index, 'Unique_Counts': unique_counts.values})
display(unique_counts_df)

Column,Unique_Counts
pclass,3
survived,2
sex,2
age,98
sibsp,7
parch,8
fare,280
cabin,9
embarked,3


In [0]:
df.dtypes

Out[40]: pclass        int64
survived      int64
sex           int64
age         float64
sibsp         int64
parch         int64
fare        float64
cabin        object
embarked     object
dtype: object

In [0]:
### Examine only int64 columns (ignore strings and floats)
non_numerical_columns = df.select_dtypes(exclude=['int64']).columns.to_list()

### Examine only columns with more than 3 unique values
columns_to_drop_discrete = [col for col in df.columns if len(df[col].unique()) <= 3]

drop_these = non_numerical_columns + columns_to_drop_discrete
drop_these = list(set(drop_these))  # dedup

### Keep Response variable in dataframe, even though it only has 2 values
if col2 in drop_these:
    drop_these.remove(col2)
print("drop these", drop_these)

df_discrete_iterate = df.drop(columns=drop_these)
cols_discrete_iterate = list(df_discrete_iterate.columns)

### Remove the response variable from iterated column list - don't need to run it against itself
if col2 in cols_discrete_iterate:
    cols_discrete_iterate.remove(col2)
print(cols_discrete_iterate)

drop these ['fare', 'age', 'pclass', 'sex', 'cabin', 'embarked']
['sibsp', 'parch']


In [0]:
### Iterate thru all discrete Columns getting Mutual Info with col2 (response variable - set in Mutual Info section)

mutual_info_summary_discrete = pd.DataFrame(columns=['Response', 'Feature', 'manual_mutual_info', 'sklearn_mutual_info'])

for col1 in cols_discrete_iterate:
    print(f"working on column {col1}")
    manual_mutual_info_value, summary_df = get_mutual_info_from_contingency_table(create_contingency_table(df_discrete_iterate, col1=col1, col2=col2))

    sklearn_mutual_info_value = get_sklearn_mutual_info(df, col1=col1, col2=col2, col1_discrete=True)
    mutual_info_summary_discrete = mutual_info_summary_discrete.append({
        'Feature': col1,
        'Response': col2,
        'manual_mutual_info': manual_mutual_info_value,
        'sklearn_mutual_info': sklearn_mutual_info_value,
        }, ignore_index=True).sort_values(by='sklearn_mutual_info', ascending=False)

display(mutual_info_summary_discrete)


working on column sibsp
                   sibsp (0)   sibsp (1)   sibsp (2)   sibsp (3)   sibsp (4)  \
(0) survived             581         156          23          14          19   
(1) survived             307         163          19           6           3   
(Total) survived         888         319          42          20          22   

                   sibsp (5)   sibsp (8)   sibsp (Total)  
(0) survived               6           9             808  
(1) survived               0           0             498  
(Total) survived           6           9            1306  
contingency_table rows: survived, columns: sibsp
Manual Mutual Information: 0.019
working on column parch
                   parch (0)   parch (1)   parch (2)   parch (3)   parch (4)  \
(0) survived             665          70          56           3           5   
(1) survived             334         100          57           5           1   
(Total) survived         999         170         113           8         

Response,Feature,manual_mutual_info,sklearn_mutual_info
survived,parch,0.021,0.021
survived,sibsp,0.019,0.019


## Result: Discrete (non-binary) Variables are treated as Continuous 
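In the runs above, sklearn's `mutual_info_classif` matched the manual formula for `sibsp` and `parch` when `discrete_features=[True]` was specified (0.019 and 0.021). The single-column `parch` run with `discrete_features=[False]` returned 0. Discrete (non-binary) columns are therefore estimated as if Continuous unless `discrete_features=[True]` is set explicitly.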
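One way to probe how `discrete_features` behaves on a multi-valued integer column is to run both settings against the contingency-table value from `mutual_info_score`. The sketch below does this on synthetic data (a hypothetical ordinal feature, not `sibsp` or `parch`), so the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Hypothetical ordinal feature (values 0-5); the response is more likely 1 for larger values
x = rng.integers(0, 6, size=1000)
y = (rng.random(1000) < 0.2 + 0.1 * x).astype(int)
X = x.reshape(-1, 1)

mi_table = mutual_info_score(x, y)   # contingency-table MI, in nats
mi_discrete = mutual_info_classif(X, y, discrete_features=[True])[0]
mi_continuous = mutual_info_classif(X, y, discrete_features=[False], random_state=0)[0]

print(f"table={mi_table:.3f} discrete=True: {mi_discrete:.3f} discrete=False: {mi_continuous:.3f}")
```

With `discrete_features=[True]` the result should equal the contingency-table value. With `[False]` the column goes through the nearest-neighbor estimator for continuous data, so the number need not agree.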


# Mutual Information after One Hot Encoding

To alleviate the above concern, we convert our ordinal or categorical columns using One Hot Encoding.

In [0]:
### This will overwrite earlier selections. Choose here for One Hot Encoding section:
### To make coding easier, standardize either input as "df", and choose columns.
### Make sure to set the response variable as col2 - important later when we loop through all columns
print(f"Use Test Data? {use_test_data}")
if use_test_data:
    data = generate_test_data()
    df = data.copy()
    col1='cabin'     # Categorical Values
    col2='survived'
else:
    df = df_binary.copy()
    col1='is_churn_logo_lost' # 'is_churn_downgrade'
    col2='is_churn'

Use Test Data? True
before cleanup there are 1309 rows


In [0]:
df.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,cabin,embarked
0,1,1,0,29.0,0,0,211.3375,B,S
1,1,1,1,0.9167,1,2,151.55,C,S
2,1,0,0,2.0,1,2,151.55,C,S
3,1,0,1,30.0,1,2,151.55,C,S
4,1,0,0,25.0,1,2,151.55,C,S


In [0]:
### Examine only columns with more than 2 values
columns_to_drop_object = [col for col in df.columns if len(df[col].unique()) <= 2]
print(columns_to_drop_object)

### Examine only object columns (list every non-object column so it can be dropped)
non_object_columns = df.select_dtypes(exclude=['object']).columns.to_list()

### Examine only numerical columns
# non_numerical_columns = df.select_dtypes(exclude=['float64']).columns.to_list()


drop_these = non_object_columns + columns_to_drop_object
drop_these = list(set(drop_these))  # dedup

### Keep Response variable in dataframe, even though it only has 2 values
if col2 in drop_these:
    drop_these.remove(col2)

df_object_iterate = df.drop(columns=drop_these)
cols_object_iterate = list(df_object_iterate.columns)

### Remove the response variable from iterated column list - don't need to run it against itself
if col2 in cols_object_iterate:
    cols_object_iterate.remove(col2)

print("drop these", drop_these, "\nkeep these", cols_object_iterate)

['survived', 'sex']
drop these ['fare', 'sibsp', 'pclass', 'age', 'sex', 'parch'] 
keep these ['cabin', 'embarked']


In [0]:
df_encoded = pd.get_dummies(df, columns=cols_object_iterate)

In [0]:
display(df_encoded.head())
cols_object_iterate = [col for col in df_encoded.columns if '_' in col]
print(cols_object_iterate)

  Unable to convert the field cabin_A. If this column is not necessary, you may consider dropping it or converting to primitive type before the conversion.
Direct cause: Unsupported type in conversion from Arrow: uint8
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warn(msg)


pclass,survived,sex,age,sibsp,parch,fare,cabin_A,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G,cabin_N,cabin_T,embarked_C,embarked_Q,embarked_S
1,1,0,29.0,0,0,211.3375,0,1,0,0,0,0,0,0,0,0,0,1
1,1,1,0.9167,1,2,151.55,0,0,1,0,0,0,0,0,0,0,0,1
1,0,0,2.0,1,2,151.55,0,0,1,0,0,0,0,0,0,0,0,1
1,0,1,30.0,1,2,151.55,0,0,1,0,0,0,0,0,0,0,0,1
1,0,0,25.0,1,2,151.55,0,0,1,0,0,0,0,0,0,0,0,1


['cabin_A', 'cabin_B', 'cabin_C', 'cabin_D', 'cabin_E', 'cabin_F', 'cabin_G', 'cabin_N', 'cabin_T', 'embarked_C', 'embarked_Q', 'embarked_S']


In [0]:
print(df_encoded[cols_object_iterate].head(3))

   cabin_A  cabin_B  cabin_C  cabin_D  cabin_E  cabin_F  cabin_G  cabin_N  \
0        0        1        0        0        0        0        0        0   
1        0        0        1        0        0        0        0        0   
2        0        0        1        0        0        0        0        0   

   cabin_T  embarked_C  embarked_Q  embarked_S  
0        0           0           0           1  
1        0           0           0           1  
2        0           0           0           1  


In [0]:
### Iterate thru all object Columns getting Mutual Info with col2 (response variable - set in Mutual Info section)

mutual_info_summary_object = pd.DataFrame(columns=['Response', 'Feature', 'manual_mutual_info', 'sklearn_mutual_info'])

for col1 in cols_object_iterate:
    print(f"working on column {col1}")
    manual_mutual_info_value, summary_df = get_mutual_info_from_contingency_table(create_contingency_table(df_encoded, col1=col1, col2=col2))

    sklearn_mutual_info_value = get_sklearn_mutual_info(df_encoded, col1=col1, col2=col2, col1_discrete=True)
    mutual_info_summary_object = mutual_info_summary_object.append({
        'Feature': col1,
        'Response': col2,
        'manual_mutual_info': manual_mutual_info_value,
        'sklearn_mutual_info': sklearn_mutual_info_value,
        }, ignore_index=True).sort_values(by='sklearn_mutual_info', ascending=False)

display(mutual_info_summary_object)

working on column cabin_A
                   cabin_A (0)   cabin_A (1)   cabin_A (Total)
(0) survived               797            11               808
(1) survived               487            11               498
(Total) survived          1284            22              1306
contingency_table rows: survived, columns: cabin_A
Manual Mutual Information: 0.000
working on column cabin_B
                   cabin_B (0)   cabin_B (1)   cabin_B (Total)
(0) survived               790            18               808
(1) survived               453            45               498
(Total) survived          1243            63              1306
contingency_table rows: survived, columns: cabin_B
Manual Mutual Information: 0.012
working on column cabin_C
                   cabin_C (0)   cabin_C (1)   cabin_C (Total)
(0) survived               771            37               808
(1) survived               441            57               498
(Total) survived          1212            94              130

Response,Feature,manual_mutual_info,sklearn_mutual_info
survived,cabin_N,0.044,0.044
survived,embarked_C,0.016,0.016
survived,cabin_B,0.012,0.012
survived,embarked_S,0.011,0.011
survived,cabin_C,0.008,0.008
survived,cabin_E,0.008,0.008
survived,cabin_D,0.007,0.007
survived,cabin_F,0.002,0.002
survived,cabin_A,0.0,0.0
survived,cabin_G,0.0,0.0


## Result: Discrete (non-binary) Variables
One-hot encoding turns discrete/categorical variables into binary columns, which we've already shown are evaluated identically by the Manual and sklearn approaches, and we've re-confirmed that here. The downside is that we get mutual information PER VALUE of the categorical column, rather than the column's overall importance.
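If a single score for the whole categorical column is wanted, the contingency-table formula can be applied to the original column directly, as the manual calculation did for `parch`. The sketch below compares that whole-column value with the per-value scores after one-hot encoding. It uses synthetic data and a hypothetical `deck` column, and integer-codes the strings with `pd.factorize` before scoring.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)

# Hypothetical categorical feature and a binary response that depends on it
cat = pd.Series(rng.choice(['A', 'B', 'C'], size=1000), name='deck')
rate = cat.map({'A': 0.2, 'B': 0.5, 'C': 0.8}).to_numpy()
y = (rng.random(1000) < rate).astype(int)

# Whole-column MI: integer-code the categories, then score the full contingency table
codes, _ = pd.factorize(cat)
mi_column = mutual_info_score(codes, y)

# Per-value MI: one binary column per category, as in the one-hot section above
dummies = pd.get_dummies(cat).astype(int)
mi_per_value = {col: mutual_info_score(dummies[col], y) for col in dummies.columns}

print(round(mi_column, 3), {k: round(v, 3) for k, v in mi_per_value.items()})
```

Each one-hot column is a coarsened view of the original column, so its score cannot exceed the whole-column score.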


# Findings
A Manual Calculation of Mutual Information based on a well-known formula was created. 
 - When applied to Discrete (Binary) values, and compared with SKLearn's `mutual_info_classif`, the formula produces the same output. 
 - When applied to Continuous values, the formula does not produce reliable results, and SKLearn's `mutual_info_classif` implementation should be used with the parameter `discrete_features=[False]`.
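 - When applied to Discrete (but not Binary) values, the runs in this notebook show SKLearn's `mutual_info_classif` matching the formula when `discrete_features=[True]` is specified (0.021 for `parch`, 0.019 for `sibsp`); the single `parch` run with `discrete_features=[False]` returned 0.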
 - When Discrete (but not Binary) values are one-hot encoded, the Manual and SKLearn approaches give equal results (as already shown). However, the downside is that we've measured mutual information PER VALUE of the categorical column, rather than the column's overall importance.

Clarified (and still clarifying) sklearn's use of the word 'discrete' here.