# Assignment (Lesson 05)
# Data Preparation and Feature Selection
Steps in a data science project
1. Acquire data
2. Exploratory Data Analysis (EDA)
3. Data Processing
    1. Data Preparation
    2. Feature Selection
4. Predictive Analytics

### Import Packages
Python, like most programming languages, comes with pre-made functions and methods.  These are grouped by topic into packages.  The packages that we want are:
- numpy (numerical python)
- pandas (panel data aka tables)
- sklearn (scikit-learn, for predictive analytics)
- matplotlib (data plotting for matrix-like data)  

We need to "import" these packages so that we can use their methods in our code.

In [1]:
# import packages
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
# Allow inline plotting in Jupyter Notebook
%matplotlib inline

## Data Preparation on the Mammographic Masses Dataset (mam)
### Acquire data
We will get our data from the University of California, Irvine Machine Learning Repository.  Our dataset was used to determine how effective radiological evaluations are at diagnosing breast cancer in women who have breast tumors.  You can get some information on the data from here:  http://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.names

In [2]:
# Data source (UCI Machine Learning Repository):
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data"

# Download the data
mam = pd.read_csv(url, header=None)

# Replace the default column names (0, 1, 2, 3, 4, 5) with meaningful names
mam.columns = ["BI_RADS", "Age", "Shape", "Margin", "Density", "Severity"]

mam.head()

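|  | BI_RADS | Age | Shape | Margin | Density | Severity |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 5 | 67 | 3 | 5 | 3 | 1 |
| 1 | 4 | 43 | 1 | 1 | ? | 1 |
| 2 | 5 | 58 | 4 | 5 | 3 | 1 |
| 3 | 4 | 28 | 1 | 1 | 3 | 0 |
| 4 | 5 | 74 | 1 | 5 | ? | 1 |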


### Some preliminary EDA:
"BI_RADS" and "Density" are ordinal columns.  We will assume that they are numeric.  
"Age" and "Severity" are numeric columns.    
"Shape" and "Margin" are category columns but they are encoded as integers.  

Show the actual data types of these columns.  Can you guess why the data types of these 5 columns are `object`?

<span style="color:red" float:right>[0 point]</span>

In [3]:
# Add code here
mam.dtypes

BI_RADS     object
Age         object
Shape       object
Margin      object
Density     object
Severity     int64
dtype: object

In [4]:
# My guess is that these are being set to object because of nulls. I can see one just from the 5th row of the head above

mam.isnull().any(axis=0)

BI_RADS     False
Age         False
Shape       False
Margin      False
Density     False
Severity    False
dtype: bool

In [5]:
# Ah, great, it's even worse than that. They've filled in nulls with values other than NaN.
mam.BI_RADS.value_counts()

4     547
5     345
3      36
2      14
6      11
0       5
?       2
55      1
Name: BI_RADS, dtype: int64

### Some Data Processing
In the following sections you will do the following to the mam dataframe:
- Replace unusable entries with null/nan
- Change the data types
- Correct unexpected values (outliers)
- Decode category data
- Consolidate categories in category data

#### Replace Missing Values with Nulls
Coerce all columns, even category columns, that contain missing values to numeric data using `pd.to_numeric`.  You might get an error, like `Unable to parse string`.  You need to tell `pd.to_numeric` that it should **coerce** the casting when it encounters a value that it cannot parse.  The category columns in this dataset are encoded as integers.  We will make use of that encoding.  Any non-numeric value will be replaced with a nan and you will get nans for missing numeric and category values.  After you replace all the non-numeric values, present the first five rows with `mam.head()`.

<span style="color:red" float:right>[1 point]</span>

In [6]:
# Coerce all the data to numeric data
# Coercion will introduce nans/nulls for the non-numeric values in all columns
# Because the categories are encoded as integers, the missing categories will also be nans/nulls after coercion.
# Add code here
mam.BI_RADS = pd.to_numeric(mam.BI_RADS, errors='coerce')
mam.Age = pd.to_numeric(mam.Age, errors='coerce')
mam.Shape = pd.to_numeric(mam.Shape, errors='coerce')
mam.Margin = pd.to_numeric(mam.Margin, errors='coerce')
mam.Density = pd.to_numeric(mam.Density, errors='coerce')

mam.head()

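|  | BI_RADS | Age | Shape | Margin | Density | Severity |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 5.0 | 67.0 | 3.0 | 5.0 | 3.0 | 1 |
| 1 | 4.0 | 43.0 | 1.0 | 1.0 | NaN | 1 |
| 2 | 5.0 | 58.0 | 4.0 | 5.0 | 3.0 | 1 |
| 3 | 4.0 | 28.0 | 1.0 | 1.0 | 3.0 | 0 |
| 4 | 5.0 | 74.0 | 1.0 | 5.0 | NaN | 1 |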


#### Replace Outliers
Values that are obviously incorrect are often replaced with averages.  However, replacing outliers with averages is often inappropriate because the extreme values carry meaning.  For instance, from the data dictionary we know that BI_RADS should range from 1 to 5.  BI_RADS values outside this range were added by physicians who did not adhere to the accepted range.  In this case, BI_RADS values greater than 5 should be "clipped" to 5 and values less than 1 should be "clipped" to 1. 

<span style="color:red" float:right>[1 point]</span>
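
As a side note, pandas also offers `Series.clip`, which applies both bounds in one call. The sketch below uses a small made-up series (the name `scores` and its values are assumptions for illustration, not part of the mam data) and gives the same kind of result as the boolean-mask approach.

```python
import pandas as pd

# Toy scores (hypothetical values) with out-of-range entries and one null
scores = pd.Series([0, 3, 5, 6, 55, None])

# clip() bounds every value to the closed range [lower, upper];
# null entries stay null
clipped = scores.clip(lower=1, upper=5)
print(clipped.tolist())
```

Either approach leaves nulls untouched, so clipping can be done before or after imputation.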

In [11]:
# Cap BI_RADS values to a range of 1 to 5
# Add code here
tooHighRads = mam.BI_RADS > 5
mam.loc[tooHighRads, 'BI_RADS'] = 5

tooLowRads = mam.BI_RADS < 1
mam.loc[tooLowRads, 'BI_RADS'] = 1

### Consolidate and decode category columns

Decoding a category means replacing the numbers that encode the categories with the actual category names.  
Consolidating (aka binning or grouping) categories means that multiple categories are renamed to a single category.  
The decoding and consolidating of categories can occur at the same time.  

- Shape
 - The original category codes are: round=1; oval=2; lobular=3; irregular=4;  
 - The proper consolidated category decoding is: 1 $\rightarrow$ oval; 2 $\rightarrow$ oval; 3 $\rightarrow$ lobular; 4 $\rightarrow$ irregular;  
- Margin
 - The original category codes are: circumscribed=1; microlobulated=2; obscured=3; ill-defined=4; spiculated=5  
 - The proper consolidated category decodes are: 1 $\rightarrow$ circumscribed; 2 $\rightarrow$ ill-defined; 3 $\rightarrow$ ill-defined; 4 $\rightarrow$ ill-defined; 5 $\rightarrow$ spiculated;

After you decode and consolidate, present the first five rows with `mam.head()`. 

<span style="color:red" float:right>[1 point]</span>
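
One common way to decode and consolidate in a single step is `Series.map` with a dictionary. The following is a minimal sketch on made-up data; the names `codes` and `code_to_label` and the size labels are hypothetical and are not the mam categories.

```python
import numpy as np
import pandas as pd

# Toy column of integer codes stored as floats (hypothetical data);
# NaN marks a missing value
codes = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])

# Hypothetical lookup table: codes 1 and 2 share one label (consolidation)
code_to_label = {1: "small", 2: "small", 3: "medium", 4: "large"}

# map() replaces each code with its label; codes that are not in the
# dictionary, including NaN, come out as NaN
labels = codes.map(code_to_label)
print(labels.tolist())
```

Because missing codes remain missing after the mapping, the nulls are still available for the imputation step later on.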

In [None]:
# The category columns are decoded and categories are consolidated


# The Shape variable is decoded as follows:  1 and 2 to oval;  3 to lobular; 4 to irregular
# Add code here

# The Margin variable is decoded as follows:  1 to circumscribed;  2, 3, 4 to ill_defined; 5 to spiculated
# Add code here

# Present the first few rows
# Add code here

### Some More EDA
- Show the shape of the dataframe
- Use the `pandas` `isna` method to show the distribution of nulls among the columns.
  
<span style="color:red" float:right>[0 point]</span>

In [14]:
# Show the shape of the data frame
# Add code here
mam.shape

(961, 6)

In [28]:
# Show the distribution of nulls among the columns
# Add code here
records = len(mam)

for col in mam.columns:
    colname = mam[col].name
    nulls = mam[col].isna().sum()
    print(f'{colname}: {nulls} null values out of {records} records ({round((nulls / records) * 100, 2)}%).')

BI_RADS: 2 null values out of 961 records (0.21%).
Age: 5 null values out of 961 records (0.52%).
Shape: 31 null values out of 961 records (3.23%).
Margin: 48 null values out of 961 records (4.99%).
Density: 76 null values out of 961 records (7.91%).
Severity: 0 null values out of 961 records (0.0%).


### Drop Rows with Multiple Missing Values
When a row has too many missing values, it should not be used.  We can stipulate a threshold requirement of available values per row.  We will require that each row contains at least 5 of its 6 values.  This requirement means that no row is allowed more than 1 missing value.  
Remove the rows that have more than one missing value.  
- Use the `pandas` `dropna` method and set the `thresh` argument.  
- Show the shape of the dataframe after you drop the rows with multiple nulls. 
- Use the `pandas` `isna` method to show the number of nulls per column after dropping rows with multiple nulls

<span style="color:red" float:right>[1 point]</span>
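
The `thresh` argument counts the values that are present, not the values that are missing, which is easy to mix up. A small sketch on a made-up three-column dataframe (the name `toy` and its contents are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical data) with 0, 1, 2, and 0 nulls per row
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# thresh is the minimum number of NON-null values a row needs to be kept.
# With 3 columns, thresh=2 allows at most 1 missing value per row.
kept = toy.dropna(thresh=2)

print(kept.shape)
print(kept.isna().sum())
```

Note that `dropna` returns a new dataframe, so the result has to be assigned to a variable to take effect.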

In [None]:
# Drop rows
# Add code here

# Show the shape of the data frame
# Add code here

# Show the distribution of nulls among the columns
# Add code here

## Impute Missing Values
Use the median values to impute missing values for true numerical columns (`Age`, `BI_RADS`, `Density`).  `Margin` and `Shape` originally looked numeric, but they are categorical.  Therefore, do not use median on `Margin` and `Shape`.  

### Determine the imputation values for Age

In [None]:
# Replace missing age values with the median 
MedianAge = np.nanmedian(mam.loc[:,"Age"])
HasNanAge = pd.isnull(mam.loc[:,"Age"])
print('Now we replace', HasNanAge.sum(),'missing age values with the age median (', MedianAge, ')')
mam.loc[HasNanAge, "Age"] = MedianAge
mam.isna().sum(axis=0)

### Impute Missing values for BI_RADS and Density
Assign the column medians to the null values in the respective numeric columns.
- Use the `pandas` `isnull` method to identify the nulls
- Use the `numpy` `nanmedian` to determine the median for imputation
- Use the `pandas` `isna` method to show the number of nulls per column after the imputation   
  
<span style="color:red" float:right>[1 point]</span>

In [None]:
# Median Imputation for BI_RADS
# Add code here

# Median Imputation for Density
# Add code here

# Distribution of nulls
# Add code here

### Replace missing values for the two categorical columns
- Use `pandas` `value_counts()` method to determine the distribution of categories in `Shape` and `Margin` before imputation.
- Use `pandas` `isnull()` method to identify the missing values
- Assign the most common value to the null values in the respective categorical columns. 
- After the imputation, use the `pandas` `isna` method to show the number of nulls after the imputation.
- Use `pandas` `value_counts()` method to determine the distribution of categories after imputation.

<span style="color:red" float:right>[1 point]</span>
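
The steps above can be sketched on a made-up categorical series. The name `colors` and its values are hypothetical, and taking the most common value from `value_counts().idxmax()` is one of several possible ways to find it.

```python
import numpy as np
import pandas as pd

# Toy categorical column (hypothetical data) with two missing values
colors = pd.Series(["red", "blue", "red", np.nan, "green", np.nan])

# Distribution before imputation; value_counts() ignores nulls by default
print(colors.value_counts())

# The most common category is the label with the highest count
most_common = colors.value_counts().idxmax()

# Identify the nulls and overwrite them with the most common category
is_missing = colors.isnull()
colors.loc[is_missing] = most_common

# Checks after imputation
print(colors.isna().sum())
print(colors.value_counts())
```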

In [None]:
# Determine the distribution of categories for Shape
# Add code here

# Replace nulls in Shape with the most common category of Shape
# Add code here

# Determine the distribution of categories for Margin
# Add code here

# Replace nulls in Margin with the most common category of Margin
# Add code here

# Distribution of nulls
# Add code here

# Determine the distribution of categories
# Add code here

### One hot encode the categorical variables
- Use `OneHotEncoder` from `sklearn.preprocessing` to one-hot encode the two categorical variables, `Shape` and `Margin`.
- Make sure that the new columns have descriptive hybrid names by using the `get_feature_names_out` method.
- Add the new binary columns to the dataframe.
- Drop the original columns, `Shape` and `Margin`.
- Show the first few rows of the dataframe.

<span style="color:red" float:right>[3 point]</span>

In [None]:
# get package
# Add code here

# One-hot-encode
# Add code here

# Create Column Names
# Add code here

# Add one-hot-encoded columns to dataframe
# Add code here

# Drop original categorical columns
# Add code here

# Show the first few rows
# Add code here

## End of Data Preparation on the Mammographic Masses Dataset (mam)



## Feature Selection on the Indian Liver Patient Dataset (ILPD)
Feature selection is a process of removing features that are redundant and that could lead to overfitting, singular matrices, and other problems associated with high dimensionality (curse of dimensionality:  https://en.wikipedia.org/wiki/Curse_of_dimensionality).

### Acquire Data

We will get our data from the University of California, Irvine Machine Learning Repository. Our dataset was used to determine if blood test data could be sufficient to identify liver disease in rural areas with few physicians.

In [None]:
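# Local csv file (only used if the alternate source below is commented out):
url = "../data/Indian Liver Patient Dataset (ILPD).csv"
# Alternate data source (overrides the local path above):
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/00225/Indian Liver Patient Dataset (ILPD).csv"
# Encode the spaces so that the web address is valid
url = url.replace(" ", "%20")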

# Download the data
ILPD = pd.read_csv(url, header=None)

# Replace the default column names (0 through 10) with meaningful names
ILPD.columns = ["Age","Gender","DB","TB","Alkphos","Sgpt","Sgot","TPr","ALB","AGRatio","Selector"]

ILPD

### Data Preparation for ILPD
- All columns should be numeric and continuous
    - Remove binary columns (numeric and categorical) because their mutual information scores will be lower
    - Remove any categorical columns
- Remove or impute any missing values

<span style="color:red" float:right>[1 point]</span>
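
One possible way to find binary columns is to count the distinct values per column with `nunique`. The sketch below uses a made-up dataframe (`toy` and its columns are hypothetical) and drops rows with nulls, which is only one of the two options mentioned above.

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical data)
toy = pd.DataFrame({
    "age": [30, 45, 52, 61],
    "sex": ["F", "M", "M", "F"],        # binary categorical
    "flag": [1, 2, 1, 2],               # binary numeric
    "level": [0.9, np.nan, 1.4, 2.0],   # continuous, one missing value
})

# nunique() counts distinct non-null values per column;
# a column with 2 or fewer distinct values is treated as binary here
is_binary = toy.nunique() <= 2
continuous = toy.loc[:, ~is_binary]

# One option for nulls: drop the affected rows
continuous = continuous.dropna()
print(continuous)
```

A distinct-value count does not catch categorical columns with more than two categories, so checking `dtypes` as well is a reasonable extra step.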

In [None]:
# Drop Binary Columns
# Add Code here

# Impute values or remove rows with nulls
# Add Code here

### Mutual Information
Background:  https://en.wikipedia.org/wiki/Mutual_information  
Below is a wrapper for determining the mutual information between two continuous (numeric) variables.  The wrapper discretizes each variable into bins before scoring.
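
For reference, the quantity being estimated can be written as follows for two discretized variables $X$ and $Y$. The symbols are notation introduced here and are not part of the notebook: $p(x, y)$ is the joint probability of a pair of bins, and $p(x)$ and $p(y)$ are the marginal bin probabilities.

$$I(X;Y) = \sum_{x}\sum_{y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}$$

The score is symmetric in $X$ and $Y$ and equals 0 when the two variables are independent.  `mutual_info_score` uses the natural logarithm, so the scores are in nats.  With 80 bins per variable, a score cannot exceed $\ln 80 \approx 4.38$.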

In [None]:
from sklearn.metrics import mutual_info_score
# x is the first input variable
# y is the second input variable
# bins is the number of discretized values that will be used for the two input variables
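def calc_MI(x, y, bins=80):
    if bins > 1:
        # Discretize both variables and count the observations per 2D bin
        c_xy = np.histogram2d(x, y, bins)[0]
        mi = mutual_info_score(None, None, contingency=c_xy)
    else:
        # Treat x and y as labels that are already discrete
        mi = mutual_info_score(x, y)
    return mi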

### Create a method to list all column pairs together with their mutual information scores
Write a method called `listMutualInformationScores`.  It uses the above method (`calc_MI`) in a loop to find the mutual information between all possible pairs of columns in the data.  The input to the function is the dataframe of continuous variables, specifically the prepared ILPD dataset.  

The method returns a list of lists.  Each inner list contains three items:  the x-column, the y-column, and the mutual information score. The list of lists contains every possible pair of columns in the data.  The result should have a form similar to the following, except that the outer list is much longer and contains all possible column pairs:  
`[['Alkphos', 'Sgot', 0.33],
['Sgot', 'AGRatio', 0.23],
['Age', 'Sgot', 0.35],
['Sgpt', 'AGRatio', 0.30],
['Sgot', 'ALB', 0.29],
['Sgot', 'TPr', 0.33]]`

<span style="color:red" float:right>[3 point]</span>
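
The general looping pattern can be sketched with `itertools.combinations`, which yields each unordered pair of columns once. This is an illustration under assumptions, not the required solution: the helper `list_pair_scores`, the dataframe `toy`, and the stand-in scoring function (absolute Pearson correlation) are all hypothetical.

```python
from itertools import combinations

import pandas as pd


def list_pair_scores(df, score):
    """Return [x_name, y_name, score(x, y)] for every unordered column pair."""
    results = []
    for x_name, y_name in combinations(df.columns, 2):
        results.append([x_name, y_name, score(df[x_name], df[y_name])])
    return results


# Toy data (hypothetical) and a stand-in scoring function
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]})
pairs = list_pair_scores(toy, lambda x, y: abs(x.corr(y)))
print(pairs)
```

For a symmetric score, each unordered pair carries the same information as its reversed pair, so 3 columns give 3 pairs in this sketch.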

In [None]:
# define the method listMutualInformationScores
# Add code here

In [None]:
# Run the method listMutualInformationScores
# Add code here

### Present the mutual information results
- Package the output into a dataframe
- Sort the rows in descending order of mutual information
- Present the dataframe

The first column could be called x, the second column could be called y, and the third column could be called mi.  x and y are the pair of columns, and mi is the pair's mutual information score.  The result should have a form similar to the following:

| x | y | mi |
| --- | --- | --- |
| Age | Sgot | 0.35 |
| Sgot | TPr | 0.33 |
| Alkphos | Sgot | 0.33 |
| Sgpt | AGRatio | 0.30 |
| Sgot | ALB | 0.29 |
| Sgot | AGRatio | 0.23 |

<span style="color:red" float:right>[1 point]</span>
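
Packaging and sorting can be sketched as follows. The rows are the illustrative values from the example list in the previous task, not real results, and the variable names `scores` and `mi_table` are assumptions.

```python
import pandas as pd

# Illustrative rows taken from the example list of lists (not real results)
scores = [
    ["Alkphos", "Sgot", 0.33],
    ["Sgot", "AGRatio", 0.23],
    ["Age", "Sgot", 0.35],
    ["Sgpt", "AGRatio", 0.30],
    ["Sgot", "ALB", 0.29],
    ["Sgot", "TPr", 0.33],
]

# Each inner list becomes one row; the columns are named explicitly
mi_table = pd.DataFrame(scores, columns=["x", "y", "mi"])

# Sort from highest to lowest score and renumber the rows
mi_table = mi_table.sort_values("mi", ascending=False).reset_index(drop=True)
print(mi_table)
```

The order of rows with equal scores is not guaranteed by the default sort.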

In [None]:
# Present the results as dataframe
# Add code here

### Discussion on Mutual Information in ILPD
Let's assume a threshold of 1 for the mutual information score.  
Which columns would you eliminate? Why?  To answer these questions, you may need to read up on feature selection with mutual information scores.

<span style="color:red" float:right>[1 point]</span>

Add discussion here 