# Assignment (Lesson 05)
# Data Preparation and Feature Selection
Steps in a data science project
1. Acquire data
2. Exploratory Data Analysis (EDA)
3. Data Processing
    1. Data Preparation
    2. Feature Selection
4. Predictive Analytics

### Import Packages
Python, like most programming languages, has pre-made software methods.  These pre-made software methods are organized and combined by topic into packages.  The packages that we want are:
- numpy (numerical python)
- pandas (panel data aka tables)
- sklearn (sci-kit learn for predictive analytics)
- matplotlib (data plotting for matrix-like data)  

We need to "import" these packages so that we can use their methods in our code.

In [1]:
# import packages
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mutual_info_score
# Allow inline plotting in Jupyter Notebook
%matplotlib inline

# Remove documentation warning
import warnings
warnings.filterwarnings('ignore')

## Data Preparation on the Mammographic Masses Dataset (Mamm)
### Acquire data
We will get our data from the University of California, Irvine Machine Learning Repository.  Our dataset was used to determine the effectiveness of radiological evaluations of breast cancer diagnoses in women who have breast tumors.  You can get some information on the data from here:  http://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.names

In [2]:
# csv file:
url = "../data/mammographic_masses.data"
# Alternate data source:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data"

# Download the data
mamm = pd.read_csv(url, header=None)

# Replace the default column names (0, 1, 2, 3, 4, 5) with meaningful names
mamm.columns = ["BI_RADS", "Age", "Shape", "Margin", "Density", "Severity"]

mamm.head()

Unnamed: 0,BI_RADS,Age,Shape,Margin,Density,Severity
0,5,67,3,5,3,1
1,4,43,1,1,?,1
2,5,58,4,5,3,1
3,4,28,1,1,3,0
4,5,74,1,5,?,1


### Some preliminary EDA:
"BI_RADS" and "Density" are ordinal columns.  We will assume that they are numeric.  
"Age" and "Severity" are numeric columns.    
"Shape" and "Margin" are category columns but they are encoded as integers.  

Show the actual data types of these columns.  Can you guess why the data types of these 5 columns are `object`?

<span style="color:red" float:right>[0 point]</span>

In [3]:
mamm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 961 entries, 0 to 960
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   BI_RADS   961 non-null    object
 1   Age       961 non-null    object
 2   Shape     961 non-null    object
 3   Margin    961 non-null    object
 4   Density   961 non-null    object
 5   Severity  961 non-null    int64 
dtypes: int64(1), object(5)
memory usage: 45.2+ KB


In [4]:
mamm.dtypes

BI_RADS     object
Age         object
Shape       object
Margin      object
Density     object
Severity     int64
dtype: object

In [5]:
# My guess is that these are being set to object because of nulls. I can see one just from the 5th row of the head above

mamm.isnull().any(axis=0)

BI_RADS     False
Age         False
Shape       False
Margin      False
Density     False
Severity    False
dtype: bool

In [6]:
# The data has filled in nulls with values other than NaN.
mamm.BI_RADS.value_counts()

4     547
5     345
3      36
2      14
6      11
0       5
?       2
55      1
Name: BI_RADS, dtype: int64

Guess: These columns are typed `object` because they mix numeric strings with a non-numeric placeholder, `?`, which marks missing values. pandas therefore reads each of these columns as text.
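The effect of the `?` placeholder can be reproduced on a small sample. The sketch below uses three made-up rows in the same layout as the data file (the rows are an assumption for illustration, not taken from the dataset) and shows that declaring `?` through the `na_values` argument of `pd.read_csv` lets pandas read the columns as numbers.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same layout as mammographic_masses.data
sample = io.StringIO("5,67,3,5,3,1\n4,43,1,1,?,1\n5,?,4,5,3,1\n")

# Without na_values, any column containing "?" is read as strings
raw = pd.read_csv(sample, header=None)
print(raw.dtypes)

# Rewind the buffer and declare "?" as a missing-value marker
sample.seek(0)
clean = pd.read_csv(sample, header=None, na_values="?")
print(clean.dtypes)
print(clean.isna().sum())
```

This is an alternative to coercing after loading; the assignment itself asks for `pd.to_numeric`, which is used in the next section.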

### Some Data Processing
In the following sections you will do the following to the Mamm dataframe:
- Replace unusable entries with null/nan  
- Change types of data.
- Correct unexpected values (outliers)
- Decode category data    
- Consolidate categories in category data 

#### Replace Missing Values with Nulls
Coerce all columns, even category columns, that contain missing values to numeric data using `pd.to_numeric`.  You might get an error, like `Unable to parse string`.  You need to tell `pd.to_numeric` that it should **coerce** the casting when it encounters a value that it cannot parse.  The category columns in this dataset are encoded as integers.  We will make use of that encoding.  Any non-numeric value will be replaced with a nan, so you will get nans for missing numeric and category values.  After you replace all the non-numeric values, present the first five rows with `mamm.head()`.

<span style="color:red" float:right>[1 point]</span>

In [7]:
# Coerce all the data to numeric data
# Coercion will introduce nans/nulls for the non-numeric values in all columns
# Because the categories are encoded as integers, the missing categories will also be nans/nulls after coercion.

for column in ["BI_RADS", "Age", "Shape", "Margin", "Density"]:
    mamm[column] = pd.to_numeric(mamm[column], errors='coerce', downcast="integer")

mamm.head()

Unnamed: 0,BI_RADS,Age,Shape,Margin,Density,Severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,,1


Reason: To coerce all columns, even category columns, that contain missing values to numeric data using pd.to_numeric.

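Conclusion: Coerced each of the five columns. The downcast to integer had no effect because every coerced column contains at least one NaN, and NaN is a floating-point value, so the columns became float64. The data is still numeric, which is what the later steps need.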
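The float result can be seen on a toy example. The sketch below uses a made-up two-column frame (the values are assumptions for illustration) and also shows pandas' nullable `Int64` dtype, which is one optional way to keep integer values alongside missing entries.

```python
import pandas as pd

# Hypothetical toy frame: numbers stored as strings, with "?" marking missing values
toy = pd.DataFrame({"BI_RADS": ["5", "4", "?"], "Age": ["67", "?", "58"]})

# Coerce every column; strings that cannot be parsed become NaN
for column in toy.columns:
    toy[column] = pd.to_numeric(toy[column], errors="coerce")

print(toy.dtypes)        # float64, because NaN is a floating-point value
print(toy.isna().sum())  # one NaN per column

# Optional: the nullable integer dtype can hold missing values
toy_int = toy.astype("Int64")
print(toy_int.dtypes)
```

The rest of this notebook keeps the float columns, which is sufficient for the imputation steps that follow.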

#### Replace Outliers
Values that are obviously incorrect are often replaced with averages.  However, replacing outliers with averages is often inappropriate because the extreme values carry some meaning.  For instance, from the data dictionary we know that BI_RADS should range from 1 to 5.  BI_RADS values outside that range were entered by physicians who did not adhere to the accepted scale.  In this case, BI_RADS greater than 5 should be "clipped" at 5 and BI_RADS less than 1 should be "clipped" at 1. 

<span style="color:red" float:right>[1 point]</span>

In [8]:
# Cap BI_RADS values to a range of 1 to 5
# Create boolean masks for values < 1 and > 5

flag_low = (mamm['BI_RADS'] < 1)
flag_high = (mamm['BI_RADS'] > 5)

# Replace flagged values with 1 or 5; .loc avoids chained assignment
mamm.loc[flag_low, 'BI_RADS'] = 1
mamm.loc[flag_high, 'BI_RADS'] = 5

# Check unique values
mamm['BI_RADS'].unique()

array([ 5.,  4.,  3., nan,  2.,  1.])

Reason: To clip BI_RADS greater than 5 at 5 and BI_RADS less than 1 at 1.

Conclusion: No more outliers in the data and all the unique values fall within the range specified.
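The same capping can be written in one step with the `clip` method of a pandas Series. The sketch below uses made-up BI_RADS values (an assumption for illustration) to show that out-of-range entries are pulled to the nearest bound while missing values stay missing.

```python
import numpy as np
import pandas as pd

# Hypothetical BI_RADS values, including out-of-range entries and a missing value
bi_rads = pd.Series([5.0, 4.0, 0.0, 55.0, 6.0, np.nan])

# Values below 1 become 1 and values above 5 become 5; NaN is left untouched
clipped = bi_rads.clip(lower=1, upper=5)
print(clipped.tolist())
```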

### Consolidate and decode category columns

Decoding a category is when categories are coded as numbers and we replace those numbers with actual categories.  
Consolidating (aka binning or grouping) categories means that multiple categories are renamed to a single category.  
The decoding and consolidating of categories can occur at the same time.  

- Shape
 - The original category codes are: round=1; oval=2; lobular=3; irregular=4;  
 - The proper consolidated category decoding is: 1 $\rightarrow$ oval; 2 $\rightarrow$ oval; 3 $\rightarrow$ lobular; 4 $\rightarrow$ irregular;  
- Margin
 - The original category codes are: circumscribed=1; microlobulated=2; obscured=3; ill-defined=4; spiculated=5  
 - The proper consolidated category decodes are: 1 $\rightarrow$ circumscribed; 2 $\rightarrow$ ill-defined; 3 $\rightarrow$ ill-defined; 4 $\rightarrow$ ill-defined; 5 $\rightarrow$ spiculated;

After you decode and consolidate, present the first five rows with `mamm.head()`. 

<span style="color:red" float:right>[1 point]</span>

In [9]:
# The category columns are decoded and categories are consolidated

# The Shape variable is decoded as follows:  1 and 2 to oval;  3 to lobular; 4 to irregular
mamm['Shape'] = mamm['Shape'].replace({1: "oval", 2: "oval", 3: "lobular", 4: "irregular"})

# The Margin variable is decoded as follows:  1 to circumscribed;  2, 3, 4 to ill_defined; 5 to spiculated
mamm["Margin"] = mamm["Margin"].replace({1: "circumscribed", 2: "ill_defined", 3: "ill_defined", 4: "ill_defined", 5: "spiculated"})

# Present the first few rows
mamm.head()

Unnamed: 0,BI_RADS,Age,Shape,Margin,Density,Severity
0,5.0,67.0,lobular,spiculated,3.0,1
1,4.0,43.0,oval,circumscribed,,1
2,5.0,58.0,irregular,spiculated,3.0,1
3,4.0,28.0,oval,circumscribed,3.0,0
4,5.0,74.0,oval,spiculated,,1


Reason: To decode and consolidate the category columns and present the first five rows with mamm.head().

Conclusion: Decoded and consolidated presenting the first 5 rows. 
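Decoding and consolidating can also be done with `Series.map`. The sketch below uses made-up Shape codes, including a hypothetical stray code 9 that is not in the data dictionary, to show one difference from `replace`: with `map`, any code missing from the dictionary becomes NaN instead of passing through unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical Shape codes; 9.0 is an invented invalid code, NaN is a missing value
shape_codes = pd.Series([3.0, 1.0, 4.0, 2.0, 9.0, np.nan])

# Codes 1 and 2 are consolidated into "oval"
shape_map = {1: "oval", 2: "oval", 3: "lobular", 4: "irregular"}

decoded = shape_codes.map(shape_map)
print(decoded.tolist())
```

Whether an unknown code should become NaN or be kept for inspection is a design choice; `map` suits the case where anything outside the dictionary should be treated as missing.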

### Some More EDA
- Show the shape of the dataframe
- Use the `pandas` `isna` method to show the distribution of nulls among the columns.
  
<span style="color:red" float:right>[0 point]</span>

In [10]:
# Show the shape of the data frame
print("The shape of the dataframe is", mamm.shape)
print("There are 6 columns and 961 rows")

The shape of the dataframe is (961, 6)
There are 6 columns and 961 rows


In [11]:
# Show the distribution of nulls among the columns
print("The distribution of nulls among the columns is:") 
mamm.isna().sum()

The distribution of nulls among the columns is:


BI_RADS      2
Age          5
Shape       31
Margin      48
Density     76
Severity     0
dtype: int64

### Drop Rows with Multiple Missing Values
When a row has too many missing values, then it should not be used.  We can stipulate a threshold requirement of available values per row.  We will require that each row contains at least 5 values.  This requirement means that no row is allowed more than 1 missing value.  
Remove the rows that have more than one missing value.  
- Use the `pandas` `dropna` method and set the `thresh` argument.  
- Show the shape of the dataframe after you drop the rows with multiple nulls. 
- Use the `pandas` `isna` method to show the number of nulls per column after dropping rows with multiple nulls

<span style="color:red" float:right>[1 point]</span>

In [12]:
# Drop rows
mamm = mamm.dropna(thresh=5)

# Show the shape of the data frame
display(mamm.shape)

# Show the distribution of nulls among the columns
display(mamm.isna().sum())

(931, 6)

BI_RADS      1
Age          5
Shape       17
Margin      22
Density     56
Severity     0
dtype: int64

Reason: Remove the rows that have more than one missing value.

Conclusion: Dropped the rows that had more than one missing value. With `thresh=5`, the row count decreased from 961 to 931, so 30 rows were removed. Margin saw the largest reduction in nulls, from 48 to 22.
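The meaning of `thresh` is easy to check on a small frame. The sketch below uses a made-up three-column frame (an assumption for illustration) and requires at least two non-null values per row. It also shows that the surviving rows keep their original index labels, so the index has gaps after the drop.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 has 3 values, row 1 has 1 value, row 2 has 2 values
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, np.nan, 2.0],
    "c": [3.0, 3.0, 3.0],
})

# Keep only rows with at least 2 non-null values
kept = toy.dropna(thresh=2)
print(kept)
print(kept.index.tolist())  # row label 1 is gone; labels are not renumbered
```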

## Impute Missing Values
Use the median values to impute missing values for true numerical columns (`Age`, `BI_RADS`, `Density`).  `Margin` and `Shape` originally looked numeric, but they are categorical.  Therefore, do not use median on `Margin` and `Shape`.  

### Determine the imputation values for Age

In [13]:
# Replace missing age values with the median 
MedianAge = np.nanmedian(mamm.loc[:,"Age"])
HasNanAge = pd.isnull(mamm.loc[:,"Age"])
print('Now we replace', HasNanAge.sum(),'missing age values with the age median (', MedianAge, ')')
mamm.loc[HasNanAge, "Age"] = MedianAge
mamm.isna().sum(axis=0)

Now we replace 5 missing age values with the age median ( 57.0 )


BI_RADS      1
Age          0
Shape       17
Margin      22
Density     56
Severity     0
dtype: int64

### Impute Missing values for BI_RADS and Density
Assign the column medians to the null values in the respective numeric columns.
- Use the `pandas` `isnull` method to identify the nulls
- Use the `numpy` `nanmedian` to determine the median for imputation
- Use the `pandas` `isna` method to show the number of nulls per column after the imputation   
  
<span style="color:red" float:right>[1 point]</span>

In [14]:
# Median Imputation for BI_RADS
median_rads = np.nanmedian(mamm.loc[:, "BI_RADS"])
has_null_rads = pd.isnull(mamm.loc[:, "BI_RADS"])
mamm.loc[has_null_rads, "BI_RADS"] = median_rads

# Median Imputation for Density
median_density = np.nanmedian(mamm.loc[:, "Density"])
has_null_density = pd.isnull(mamm.loc[:, "Density"])
mamm.loc[has_null_density, "Density"] = median_density

# Distribution of nulls
mamm.isna().sum(axis=0)

BI_RADS      0
Age          0
Shape       17
Margin      22
Density      0
Severity     0
dtype: int64

Reason: To assign the column medians to the null values in the respective numeric columns.

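Conclusion: The nulls in BI_RADS and Density were replaced with the column medians, as shown by the 0 counts for both columns. Only the categorical columns, Shape and Margin, still contain nulls.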
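The same imputation can be expressed with `fillna` and the `median` method of a Series, which skips NaN by default. The sketch below runs on a made-up frame (the values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns with one missing value each
toy = pd.DataFrame({
    "BI_RADS": [4.0, 5.0, np.nan, 4.0],
    "Density": [3.0, np.nan, 2.0, 3.0],
})

# Fill each column's NaNs with that column's median
for column in ["BI_RADS", "Density"]:
    toy[column] = toy[column].fillna(toy[column].median())

print(toy)
print(toy.isna().sum())
```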

### Replace missing values for the two categorical columns
- Use `pandas` `value_counts()` method to determine the distribution of categories in `Shape` and `Margin` before imputation.
- Use `pandas` `isnull()` method to identify the missing values
- Assign the most common value to the null values in the respective categorical columns. 
- After the imputation, use the `pandas` `isna` method to show the number of nulls after the imputation.
- Use `pandas` `value_counts()` method to determine the distribution of categories after imputation.

<span style="color:red" float:right>[1 point]</span>

In [15]:
# Determine the distribution of categories for Shape
display(mamm["Shape"].value_counts())

# Count the nulls in Shape
print("The number of nulls in the Shape column is:")
display(mamm["Shape"].isnull().sum())

# Replace nulls in Shape with the most common category of Shape
mamm.loc[mamm["Shape"].isnull(), "Shape"] = "oval"

print("All nulls in the Shape column eliminated")
               
mamm.isna().sum()

oval         422
irregular    399
lobular       93
Name: Shape, dtype: int64

The number of nulls in the Shape column is:


17

All nulls in the Shape column eliminated


BI_RADS      0
Age          0
Shape        0
Margin      22
Density      0
Severity     0
dtype: int64

Used the oval category to fill the null values because it occurs more often than the other two categories.

In [16]:
# Determine the distribution of categories for Margin and count its nulls
display("The Margin category counts are", mamm["Margin"].value_counts(), "Nulls", mamm["Margin"].isnull().sum())

'The Margin category counts are'

ill_defined      417
circumscribed    357
spiculated       135
Name: Margin, dtype: int64

'Nulls'

22

In [17]:
# Replacing nulls in Margin with the most common category of Margin
mamm.loc[mamm["Margin"].isnull(), "Margin"] = "ill_defined"

In [18]:
# Distribution of nulls
print("Now there are no more null values in the dataframe")
mamm.isnull().sum()

Now there are no more null values in the dataframe


BI_RADS     0
Age         0
Shape       0
Margin      0
Density     0
Severity    0
dtype: int64

In [19]:
# Determine the distribution of categories
display(mamm.loc[:, ["Margin"]].value_counts())
display(mamm.loc[:, ["Shape"]].value_counts())

Margin       
ill_defined      439
circumscribed    357
spiculated       135
dtype: int64

Shape    
oval         439
irregular    399
lobular       93
dtype: int64

Reason: Replace missing values for the two categorical columns.

Conclusion: Displayed the null counts before imputation to confirm that they were replaced. Nulls were filled with the most common category in each column: oval for Shape and ill_defined for Margin. The final distributions show that each of those two categories grew to 439, and no null values remain in the DataFrame.
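Rather than typing the most common category by hand, it can be looked up with `Series.mode`. The sketch below uses a made-up Margin column (an assumption for illustration); `mode` ignores NaN by default and returns a Series, so the first entry is taken.

```python
import numpy as np
import pandas as pd

# Hypothetical Margin column with one missing value
margin = pd.Series(["ill_defined", "circumscribed", np.nan, "ill_defined", "spiculated"])

# Look up the most common category instead of hard-coding it
most_common = margin.mode()[0]
filled = margin.fillna(most_common)

print(most_common)
print(filled.value_counts())
```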

### One hot encode the categorical variables
- Use `OneHotEncoder` from `sklearn.preprocessing` to one-hot encode the two categorical variables, `Shape` and `Margin`.
- Make sure that the new columns have descriptive hybrid names by using the `get_feature_names_out` method.
- Add the new binary columns to the dataframe.
- Drop the original columns, `Shape` and `Margin`
- Show the first few rows of the dataframe.

<span style="color:red" float:right>[3 point]</span>

In [20]:
# OneHotEncoder was imported at the top of the notebook

# Categorical data
mamm_cat = mamm.loc[:, ["Margin", "Shape"]]
mamm_cat.head()

Unnamed: 0,Margin,Shape
0,spiculated,lobular
1,circumscribed,oval
2,spiculated,irregular
3,circumscribed,oval
4,spiculated,oval


In [21]:
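# One-hot-encode (dense output)
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use that name on newer versions
one_hot = OneHotEncoder(sparse=False)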

# Fit the one-hot encoder
one_hot.fit(mamm_cat)

OneHotEncoder(sparse=False)

In [22]:
# Feature names
col_names = one_hot.get_feature_names_out(mamm_cat.columns)
col_names

array(['Margin_circumscribed', 'Margin_ill_defined', 'Margin_spiculated',
       'Shape_irregular', 'Shape_lobular', 'Shape_oval'], dtype=object)

In [23]:
# Create DataFrame
mamm_enc_df = pd.DataFrame(one_hot.transform(mamm_cat), columns=col_names)
mamm_enc_df.head()

Unnamed: 0,Margin_circumscribed,Margin_ill_defined,Margin_spiculated,Shape_irregular,Shape_lobular,Shape_oval
0,0.0,0.0,1.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0,0.0,1.0
2,0.0,0.0,1.0,1.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,1.0,0.0,0.0,1.0


In [24]:
# Add one_hot_encoded columns to dataframe
# mamm kept its original row labels after dropna, while mamm_enc_df is numbered 0..930,
# so reset mamm's index first; otherwise concat aligns on labels and introduces NaN rows
mamm_concat_df = pd.concat([mamm.reset_index(drop=True), mamm_enc_df], axis=1)

In [25]:
# Drop original categorical columns
mamm_concat_df = mamm_concat_df.drop(columns=["Shape", "Margin"])
mamm_concat_df.head()

Unnamed: 0,BI_RADS,Age,Density,Severity,Margin_circumscribed,Margin_ill_defined,Margin_spiculated,Shape_irregular,Shape_lobular,Shape_oval
0,5.0,67.0,3.0,1,0.0,0.0,1.0,0.0,1.0,0.0
1,4.0,43.0,3.0,1,1.0,0.0,0.0,0.0,0.0,1.0
2,5.0,58.0,3.0,1,0.0,0.0,1.0,1.0,0.0,0.0
3,4.0,28.0,3.0,0,1.0,0.0,0.0,0.0,0.0,1.0
4,5.0,74.0,3.0,1,0.0,0.0,1.0,0.0,0.0,1.0


Reason: To one hot encode the categorical variables.

Conclusion: One hot encoded the categorical variables and displayed the first few rows of the DataFrame.
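One pitfall when attaching encoded columns to a frame that has had rows dropped is index alignment: `pd.concat` with `axis=1` matches rows by index label, not by position. The sketch below illustrates this on a made-up frame with gaps in its index (an assumption for illustration). It uses `pd.get_dummies`, a pandas alternative to `OneHotEncoder` that keeps the original index, and then shows what happens when the encoded columns are renumbered from 0.

```python
import pandas as pd

# Hypothetical frame whose index has gaps, as happens after dropna()
toy = pd.DataFrame(
    {"Severity": [1, 0, 1], "Shape": ["oval", "lobular", "oval"]},
    index=[0, 2, 5],
)

# get_dummies keeps the index of its input, so the rows line up
dummies = pd.get_dummies(toy["Shape"], prefix="Shape", dtype=float)
encoded = pd.concat([toy.drop(columns=["Shape"]), dummies], axis=1)
print(encoded)

# Renumbering one side from 0 breaks the alignment: extra rows and NaNs appear
renumbered = dummies.reset_index(drop=True)
misaligned = pd.concat([toy.drop(columns=["Shape"]), renumbered], axis=1)
print(misaligned)
```

A quick check after any column-wise concat is to compare the row count with the original and to look for unexpected NaNs.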

## End of Data Preparation on the Mammographic Masses Dataset (Mamm)



## Feature Selection on the Indian  Liver Patient Dataset (ILPD)
Feature selection is the process of removing features that are redundant and that could lead to overfitting, singular matrices, and other problems associated with high dimensionality (Curse of dimensionality:  https://en.wikipedia.org/wiki/Curse_of_dimensionality).

### Acquire Data

We will get our data from the University of California, Irvine Machine Learning Repository. Our dataset was used to determine if blood test data could be sufficient to identify liver disease in rural areas with few physicians.

In [26]:
# csv file:
#url = "../data/Indian Liver Patient Dataset (ILPD).csv"
# Alternate data source:
#url = "http://archive.ics.uci.edu/ml/machine-learning-databases/00225/Indian Liver Patient Dataset (ILPD).csv"
#url = url.replace(" ", "%20")

# Download the data
ILPD = pd.read_csv("Indian Liver Patient Dataset (ILPD).csv", header=None)

# Replace the default column names (0 through 10) with meaningful names
ILPD.columns = ["Age","Gender","DB","TB","Alkphos","Sgpt","Sgot","TPr","ALB","AGRatio","Selector"]

ILPD

Unnamed: 0,Age,Gender,DB,TB,Alkphos,Sgpt,Sgot,TPr,ALB,AGRatio,Selector
0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.90,1
1,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.00,1
4,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.40,1
...,...,...,...,...,...,...,...,...,...,...,...
578,60,Male,0.5,0.1,500,20,34,5.9,1.6,0.37,2
579,40,Male,0.6,0.1,98,35,31,6.0,3.2,1.10,1
580,52,Male,0.8,0.2,245,48,49,6.4,3.2,1.00,1
581,31,Male,1.3,0.5,184,29,32,6.8,3.4,1.00,1


### Data Preparation for ILPD
- All columns should be numeric and continuous
    - Remove binary columns (numeric and categorical) because their mutual information scores will be lower
    - Remove any categorical columns
- Remove or impute any missing values

<span style="color:red" float:right>[1 point]</span>

In [27]:
# Investigate value counts of binary columns
display(ILPD["Gender"].value_counts())
display(ILPD["Selector"].value_counts())

Male      441
Female    142
Name: Gender, dtype: int64

1    416
2    167
Name: Selector, dtype: int64

In [28]:
# Drop the binary columns (Gender, Selector) and Age
ILPD = ILPD.drop(columns=["Gender", "Selector", "Age"])
ILPD.head()

Unnamed: 0,DB,TB,Alkphos,Sgpt,Sgot,TPr,ALB,AGRatio
0,0.7,0.1,187,16,18,6.8,3.3,0.9
1,10.9,5.5,699,64,100,7.5,3.2,0.74
2,7.3,4.1,490,60,68,7.0,3.3,0.89
3,1.0,0.4,182,14,20,6.8,3.4,1.0
4,3.9,2.0,195,27,59,7.3,2.4,0.4


Dropped the Gender and Selector columns because they are binary. Age was dropped as well, although it is numeric and could have been kept; the analysis below uses the eight remaining blood-test columns.

In [29]:
# Check for nulls
ILPD.isnull().sum()

DB         0
TB         0
Alkphos    0
Sgpt       0
Sgot       0
TPr        0
ALB        0
AGRatio    4
dtype: int64

In [30]:
# Impute the NaNs for the AGRatio
agr_median = np.nanmedian(ILPD["AGRatio"])

# Fill NaNs within the AGRatio column with median
ILPD.loc[pd.isnull(ILPD.loc[:, "AGRatio"]), "AGRatio"] = agr_median
display(ILPD["AGRatio"][:5])
display(ILPD.isnull().sum())

0    0.90
1    0.74
2    0.89
3    1.00
4    0.40
Name: AGRatio, dtype: float64

DB         0
TB         0
Alkphos    0
Sgpt       0
Sgot       0
TPr        0
ALB        0
AGRatio    0
dtype: int64

Reason: To prepare data for ILPD.

Conclusion: DataFrame is cleaned and no longer has NaNs.

### Mutual Information
https://en.wikipedia.org/wiki/Mutual_information
Below is a wrapper for determining the mutual information between two continuous (numeric) variables.

In [31]:
# x is the first input variable
# y is the second input variable
# bins is the number of discretized values that will be used for the two input variables
def calc_MI(x, y, bins=80):
    if (bins > 1):
        c_xy = np.histogram2d(x, y, bins)[0]
        mi = mutual_info_score(None, None, contingency = c_xy)
    else:
        mi = mutual_info_score(x, y)
    return mi

In [32]:
# Test mi function
display(calc_MI(ILPD["Sgot"], ILPD["TPr"]))
print(ILPD["Sgot"].name)

0.33394396951202393

Sgot
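To make the score less of a black box, the quantity can be computed directly from a 2-D histogram as the sum of p(x,y) * ln(p(x,y) / (p(x) p(y))) over the non-empty cells. The sketch below is a plain NumPy version written for illustration; the function name `mi_from_histogram`, the bin count, and the random test data are assumptions, and the result is in nats because the natural logarithm is used.

```python
import numpy as np

def mi_from_histogram(x, y, bins=10):
    """Estimate mutual information (in nats) from a 2-D histogram of x and y."""
    counts = np.histogram2d(x, y, bins)[0]
    p_xy = counts / counts.sum()            # joint probabilities per cell
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    expected = p_x @ p_y                    # joint probabilities if independent
    nonzero = p_xy > 0                      # empty cells contribute nothing
    return float(np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / expected[nonzero])))

# Hypothetical test data: two independent normal samples
rng = np.random.default_rng(0)
a = rng.normal(size=5000)
b = rng.normal(size=5000)

mi_same = mi_from_histogram(a, a)    # a variable fully determines itself
mi_indep = mi_from_histogram(a, b)   # independent variables share little information
print(mi_same, mi_indep)
```

The estimate depends on the number of bins: more bins raise the score of every pair, so scores are only comparable when the same bin count is used throughout.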


### Create method to list  all column pairs together with their mutual information score
Write a method called `listMutualInformationScores`.  It uses the above method (`calc_MI`) in a loop to find the mutual information between all possible pairs of columns in the data.  The input to the function is the dataframe of continuous variables, specifically the prepared ILPD dataset.  

The method returns a list of lists.  Each inner list contains three items:  the x-column, the y-column, and the mutual information score. The list of lists contains every possible pair of columns in the data.  The result should have a form similar to the following, except that the outer list is much longer and contains all possible column pairs:  
`[['Alkphos', 'Sgot', 0.33],
['Sgot', 'AGRatio', 0.23],
['Age', 'Sgot', 0.35],
['Sgpt', 'AGRatio', 0.30],
['Sgot', 'ALB', 0.29],
['Sgot', 'TPr', 0.33]]`

<span style="color:red" float:right>[3 point]</span>

### Present the mutual information results
- Package the output into a dataframe
- Sort the rows in descending order of mutual information
- Present the dataframe

The first column could be called x, the second column could be called y, and the third column could be called mi.  x and y are the pair of columns and mi is the pair's mutual information score.  The result should have a form similar to the following:

| x | y | mi |
| --- | --- | --- |
| Age | Sgot | 0.35 |
| Sgot | TPr | 0.33 |
| Alkphos | Sgot | 0.33 |
| Sgpt | AGRatio | 0.30 |
| Sgot | ALB | 0.29 |
| Sgot | AGRatio | 0.23 |

<span style="color:red" float:right>[1 point]</span>

In [35]:
# Import a tool to create combinations
import itertools

# Define the method listMutualInformationScores
def listMutualInformationScores(df):
    mfi_list = []
    for x, y in itertools.combinations(df.columns, 2):
        mfi = calc_MI(df[x], df[y])
        mfi_list.append([x, y, mfi])
    return mfi_list

# Loop through the entire DataFrame
mfi_list = listMutualInformationScores(ILPD)

# Create DataFrame from nested lists of x, y, and mfi
mfi_df = pd.DataFrame(mfi_list, columns=['x', 'y', 'mfi'])

# Show the DataFrame head
display(mfi_df.head())
display(mfi_df.shape)

Unnamed: 0,x,y,mfi
0,DB,TB,1.400968
1,DB,Alkphos,0.515721
2,DB,Sgpt,0.405067
3,DB,Sgot,0.350626
4,DB,TPr,0.539092


(28, 3)

Reason: To present the mutual information results.

Conclusion: Started by computing the score for one hard-coded pair of columns. Then used a for loop over every column pair to calculate the MI score and append the two column names and the score to a nested list, which was converted into a DataFrame of 28 rows.
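The task also asks for the rows in descending order of mutual information. A minimal sketch of that step with `sort_values` is shown below; it uses only a handful of the scores printed in this notebook rather than the full 28-row result, and the name `mi_sorted` is an assumption.

```python
import pandas as pd

# A few of the (x, y, score) triples printed in this notebook
scores = [
    ["DB", "TB", 1.400968],
    ["DB", "Alkphos", 0.515721],
    ["DB", "Sgpt", 0.405067],
    ["TPr", "ALB", 1.372006],
    ["ALB", "AGRatio", 1.002026],
]
mi_df = pd.DataFrame(scores, columns=["x", "y", "mfi"])

# Sort by score, highest first, and renumber the rows
mi_sorted = mi_df.sort_values(by="mfi", ascending=False).reset_index(drop=True)
print(mi_sorted)
```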

### Discussion on Mutual Information in ILPD
Let's assume a threshold of 1 for the mutual information score.
Which columns would you eliminate? Why?  To answer these questions, you may need to read-up on feature selection with mutual information score.

<span style="color:red" float:right>[1 point]</span>

In [36]:
# Select only values greater than or equal to 1
mfi_greater_one_df = mfi_df.loc[mfi_df["mfi"] >=1]
mfi_greater_one_df

Unnamed: 0,x,y,mfi
0,DB,TB,1.400968
25,TPr,ALB,1.372006
27,ALB,AGRatio,1.002026



Reason: To discuss Mutual Information in ILPD

Conclusion: The mutual information (MI) score measures how much knowing one variable tells us about another. An MI of 0 means the two variables are independent, and larger scores mean stronger association. Because MI is measured on a logarithmic scale, large values are uncommon, so a threshold of 1 singles out only strongly associated pairs. Here only 3 of the 28 pairs meet the threshold: DB and TB, TPr and ALB, and ALB and AGRatio. In each of these pairs one column carries much of the information in the other, so one column per pair is redundant and can be eliminated. ALB appears in two of the pairs, so dropping ALB together with either DB or TB leaves no pair at or above the threshold. Fewer features also make a predictive model cheaper to train.
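One simple way to turn the list of high-scoring pairs into a list of columns to drop is a greedy rule: repeatedly drop the column that appears in the most remaining pairs. The sketch below is a heuristic written for illustration, not a standard library routine; the function name `columns_to_drop` is an assumption, and when two columns are tied (as with DB and TB) the choice between them is arbitrary.

```python
from collections import Counter

# The three column pairs whose score met the threshold of 1
pairs = [("DB", "TB"), ("TPr", "ALB"), ("ALB", "AGRatio")]

def columns_to_drop(pairs):
    """Greedily pick columns until no high-scoring pair is left intact."""
    remaining = list(pairs)
    dropped = []
    while remaining:
        # Count how many remaining pairs each column takes part in
        counts = Counter(col for pair in remaining for col in pair)
        # Ties are broken by first appearance, which is an arbitrary choice
        column, _ = counts.most_common(1)[0]
        dropped.append(column)
        remaining = [pair for pair in remaining if column not in pair]
    return dropped

print(columns_to_drop(pairs))
```

In practice the tie between DB and TB would be better settled with domain knowledge or by checking which column is more useful for predicting the target.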