# Assignment (Lesson 05)
# Data Preparation and Feature Selection
Steps in a data science project
1. Acquire data
2. Exploratory Data Analysis (EDA)
3. Data Processing
    1. Data Preparation
    2. Feature Selection
4. Predictive Analytics

### Import Packages
Python, like most programming languages, has pre-made software methods.  These pre-made software methods are organized and combined by topic into packages.  The packages that we want are:
- numpy (numerical python)
- pandas (panel data aka tables)
- sklearn (sci-kit learn for predictive analytics)
- matplotlib (data plotting for matrix-like data)  

We need to "import" these packages so that we can use their methods in our code.

In [None]:
# import packages
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
# Allow inline plotting in Jupyter Notebook
%matplotlib inline

## Data Preparation on the Mammographic Masses Dataset (Mamm)
### Acquire data
We will get our data from the University of California, Irvine Machine Learning Repository.  Our dataset was used to determine the effectiveness of radiological evaluations of breast cancer diagnoses in women who have breast tumors.  You can get some information on the data from here:  http://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.names

In [None]:
# csv file:
url = "../data/mammographic_masses.data"
# Alternate data source:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data"

# Download the data
Mamm = pd.read_csv(url, header=None)

# Replace the default column names (0, 1, 2, 3, 4, 5) with meaningful names
Mamm.columns = ["BI_RADS", "Age", "Shape", "Margin", "Density", "Severity"]

Mamm.head()

### Some preliminary EDA:
"BI_RADS" and "Density" are ordinal columns.  We will assume that they are numeric.  
"Age" and "Severity" are numeric columns.    
"Shape" and "Margin" are category columns but they are encoded as integers.  

Show the actual data types of these columns.  Can you guess why five of the six columns have the data type `object`?

<span style="color:red" float:right>[0 point]</span>

In [None]:
Mamm.shape

In [None]:
# Add code here
Mamm.dtypes

> The `object` columns probably contain NaNs or other non-numeric entries.

### Some Data Processing
In the following sections you will do the following to the Mamm dataframe:
- Replace unusable entries with null/nan  
- Change types of data.
- Correct unexpected values (outliers)
- decode category data    
- Consolidate categories in category data 

#### Replace Missing Values with Nulls
Coerce all columns, even category columns, that contain missing values to numeric data using `pd.to_numeric`.  You might get an error, like `Unable to parse string`.  You need to tell `pd.to_numeric` that it should **coerce** the casting when it encounters a value that it cannot parse.  The category columns in this dataset are encoded as integers.  We will make use of that encoding.  Any non-numeric value will be replaced with a nan and you will get nans for missing numeric and category values.  After you replace all the non-numeric values, present the first five rows with `Mamm.head()`.
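
A minimal sketch of what coercion does, using a toy column rather than rows from the dataset (the values and the `?` placeholder are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy column: "?" stands in for an unusable entry
raw = pd.Series(["5", "67", "?", "4"])

# errors="coerce" replaces anything that cannot be parsed with NaN
# instead of raising "Unable to parse string"
coerced = pd.to_numeric(raw, errors="coerce")

print(coerced.dtype)         # the column is now numeric
print(coerced.isna().sum())  # number of entries that became NaN
```

Because NaN is a floating-point value, a coerced column that contains any NaN ends up as a float column even when the remaining entries are whole numbers.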

<span style="color:red" float:right>[1 point]</span>

In [None]:
# Coerce all the data to numeric data
# Coercion will introduce nans/nulls for the non-numeric values in all columns
# Because the categories are encoded as integers, the missing categories will also be nans/nulls after coercion.
# Add code here
for col in Mamm.dtypes.items():
    if col[1] == 'object':
        Mamm[col[0]] = pd.to_numeric(Mamm[col[0]],errors='coerce')

In [None]:
Mamm

In [None]:
Mamm.dtypes

#### Replace Outliers
Values that are obviously incorrect are often replaced with averages.  However, replacing outliers with averages is often inappropriate because the extreme values carry some meaning.  For instance, from the data dictionary we know that BI_RADS should range from 1 to 5.  BI_RADS values outside the range of 1 to 5 were entered by physicians who did not adhere to the accepted range.  In this case, BI_RADS values greater than 5 should be "clipped" at 5 and values less than 1 should be "clipped" at 1.

<span style="color:red" float:right>[1 point]</span>

In [None]:
# Cap BI_RADS values to a range of 1 to 5
# Add code here
Mamm.loc[Mamm['BI_RADS'] > 5, 'BI_RADS'] = 5
Mamm.loc[Mamm['BI_RADS'] < 1, 'BI_RADS'] = 1
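
The same capping can probably be written in one step with `Series.clip`. The sketch below runs on a made-up toy series (an assumption, not the real column) and shows that missing values pass through unchanged:

```python
import numpy as np
import pandas as pd

# Hypothetical scores, including out-of-range entries and a missing value
scores = pd.Series([0, 3, 55, np.nan, 5])

# Values below 1 become 1, values above 5 become 5, NaN stays NaN
clipped = scores.clip(lower=1, upper=5)
print(clipped.tolist())
```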

### Consolidate and decode category columns

Decoding a category means replacing the numeric codes that stand for categories with the actual category names.  
Consolidating (aka binning or grouping) categories means that multiple categories are renamed to a single category.  
The decoding and consolidating of categories can occur at the same time.  

- Shape
  - The original category codes are: round=1; oval=2; lobular=3; irregular=4;  
  - The proper consolidated category decoding is: 1 $\rightarrow$ oval; 2 $\rightarrow$ oval; 3 $\rightarrow$ lobular; 4 $\rightarrow$ irregular;  
- Margin
  - The original category codes are: circumscribed=1; microlobulated=2; obscured=3; ill-defined=4; spiculated=5  
  - The proper consolidated category decodes are: 1 $\rightarrow$ circumscribed; 2 $\rightarrow$ ill-defined; 3 $\rightarrow$ ill-defined; 4 $\rightarrow$ ill-defined; 5 $\rightarrow$ spiculated;

After you decode and consolidate, present the first five rows with `Mamm.head()`. 
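
One way to sketch decoding and consolidating in a single step is `Series.map` with a dictionary. The toy codes below are an assumption for illustration; the dictionary follows the Shape decoding listed above:

```python
import numpy as np
import pandas as pd

# Hypothetical Shape codes, including one missing value
codes = pd.Series([1, 2, 3, 4, np.nan])

# Codes 1 and 2 both map to "oval", which consolidates two categories into one
shape_names = {1: "oval", 2: "oval", 3: "lobular", 4: "irregular"}
decoded = codes.map(shape_names)

print(decoded.tolist())
```

Unlike `replace`, `map` turns any code that is missing from the dictionary into NaN, which may or may not be what you want.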

<span style="color:red" float:right>[1 point]</span>

In [None]:
# The category columns are decoded and categories are consolidated

# The Shape variable is decoded as follows:  1 and 2 to oval;  3 to lobular; 4 to irregular
# Add code here
mapper = {1:"oval", 2:"oval", 3:"lobular", 4:"irregular"}
Mamm.replace({'Shape':mapper}, inplace=True)
# The Margin variable is decoded as follows:  1 to circumscribed;  2, 3, 4 to ill-defined; 5 to spiculated
# Add code here
mapper = {1:"circumscribed",2:"ill-defined",3:"ill-defined",4:"ill-defined",5:"spiculated"}
Mamm.replace({'Margin':mapper}, inplace=True)

#####
#> Converting to categories for later
Mamm[['Shape','Margin']] = Mamm[['Shape','Margin']].astype('category')

# Present the first few rows
# Add code here
Mamm

> I choose to show the first 5 and last 5 rows via the call/print method in a notebook so I can more readily detect any drift or oddities I create. That is not very applicable here, but it is useful to me when working with the larger datasets at work.

In [None]:
Mamm.dtypes

### Some More EDA
- Show the shape of the dataframe
- Use the `pandas` `isna` method to show the distribution of nulls among the columns.
  
<span style="color:red" float:right>[0 point]</span>

In [None]:
# Show the shape of the data frame
# Add code here
Mamm.shape

In [None]:
# Show the distribution of nulls among the columns
# Add code here
Mamm.isna().describe()
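
`describe()` on the boolean mask reports the most frequent value and its frequency per column. If explicit null counts are easier to read, summing the mask is a common alternative. A sketch on a toy frame (made-up values, not the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame with a few missing entries
toy = pd.DataFrame({"Age": [67, np.nan, 58],
                    "Density": [3, np.nan, np.nan]})

# True counts as 1, so summing the mask counts the nulls
nulls_per_column = toy.isna().sum()
nulls_per_row = toy.isna().sum(axis=1)

print(nulls_per_column)
print(nulls_per_row.tolist())
```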

### Drop Rows with Multiple Missing Values
When a row has too many missing values, then it should not be used.  We can stipulate a threshold requirement of available values per row.  We will require that each row contains at least 5 values.  This requirement means that no row is allowed more than 1 missing value.  
Remove the rows that have more than one missing value.  
- Use the `pandas` `dropna` method and set the `thresh` argument.  
- Show the shape of the dataframe after you drop the rows with multiple nulls. 
- Use the `pandas` `isna` method to show the number of nulls per column after dropping rows with multiple nulls
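
The `thresh` argument counts the values that are present, not the ones that are missing. A small sketch on a toy frame (the three-column layout is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Rows 0, 1 and 2 hold 3, 2 and 1 non-null values respectively
toy = pd.DataFrame({"a": [1, np.nan, np.nan],
                    "b": [2, 2, np.nan],
                    "c": [3, 3, 3]})

# Keep only rows with at least 2 non-null values
kept = toy.dropna(thresh=2)
print(kept.shape)
```

With six columns, as in the Mamm dataframe, `thresh=5` therefore tolerates at most one missing value per row.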

<span style="color:red" float:right>[1 point]</span>

In [None]:
# Drop rows
# Add code here
Mamm.dropna(thresh=5, inplace=True)

# Show the shape of the data frame
# Add code here
Mamm.shape

In [None]:
# Show the distribution of nulls among the columns
# Add code here
Mamm.isna().describe()

## Impute Missing Values
Use the median values to impute missing values for true numerical columns (`Age`, `BI_RADS`, `Density`).  `Margin` and `Shape` originally looked numeric, but they are categorical.  Therefore, do not use median on `Margin` and `Shape`.  

### Determine the imputation values for Age

In [None]:
for col in Mamm.columns:
    if pd.api.types.is_numeric_dtype(Mamm[col]):
        print(f"The column is {col} with medians\n\tpd.Series.median() : {Mamm[col].median()} | numpy.nanmedian : {np.nanmedian(Mamm[col].values)}")


> It appears, based on this dataset, that the `pandas.Series.median()` and `numpy.nanmedian()` methods give the same results. I'll rely on pandas alone from here.

> Also, I'll loop over the dataframe rather than explicitly call out columns.

```python
# Replace missing age values with the median 
MedianAge = np.nanmedian(Mamm.loc[:,"Age"])
HasNanAge = pd.isnull(Mamm.loc[:,"Age"])
print('Now we replace', HasNanAge.sum(),'missing age values with the age median (', MedianAge, ')')
Mamm.loc[HasNanAge, "Age"] = MedianAge
Mamm.isna().sum(axis=0)
```

### Impute Missing values for BI_RADS and Density
Assign the column medians to the null values in the respective numeric columns.
- Use the `pandas` `isnull` method to identify the nulls
- Use the `numpy` `nanmedian` to determine the median for imputation
- Use the `pandas` `isna` method to show the number of nulls per column after the imputation   
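
As a sketch of median imputation on a single column, assuming a made-up toy series rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry
ages = pd.Series([40.0, np.nan, 60.0, 80.0])

# Series.median() skips NaN by default, matching np.nanmedian
median_age = ages.median()
imputed = ages.fillna(median_age)

print(median_age, np.nanmedian(ages))
print(imputed.tolist())
```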
  
<span style="color:red" float:right>[1 point]</span>

In [None]:
Mamm.isna().describe()

In [None]:
for col in Mamm.columns:
    if pd.api.types.is_numeric_dtype(Mamm[col]):
        Mamm.loc[pd.isna(Mamm[col]), col] = Mamm[col].median()

In [None]:
Mamm.isna().describe()

### Replace missing values for the two categorical columns
- Use `pandas` `value_counts()` method to determine the distribution of categories in `Shape` and `Margin` before imputation.
- Use `pandas` `isnull()` method to identify the missing values
- Assign the most common value to the null values in the respective categorical columns. 
- After the imputation, use the `pandas` `isna` method to show the number of nulls after the imputation.
- Use `pandas` `value_counts()` method to determine the distribution of categories after imputation.
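
A minimal sketch of most-common-value imputation on a toy categorical series (the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical category column with one missing entry
shapes = pd.Series(["oval", "irregular", None, "irregular"], dtype="category")

# mode() returns a Series because ties are possible; take the first entry
most_common = shapes.mode().iloc[0]
filled = shapes.fillna(most_common)

print(filled.value_counts())
```

Filling a category column this way only works when the fill value is already one of its categories, which is always true for the mode.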

<span style="color:red" float:right>[1 point]</span>

In [None]:
Mamm['Shape'].value_counts(dropna=False), Mamm['Margin'].value_counts(dropna=False)

In [None]:
for col in Mamm.dtypes.items():
    if col[1] == 'category':
        Mamm.loc[Mamm[col[0]].isna(),col[0]] = Mamm[col[0]].mode().values[0]

In [None]:
Mamm['Shape'].value_counts(dropna=False), Mamm['Margin'].value_counts(dropna=False)

### One hot encode the categorical variables
- Use `OneHotEncoder` from `sklearn.preprocessing` to one-hot encode the two categorical variables, `Shape` and `Margin`.
- Make sure that the new columns have descriptive hybrid names by using the `get_feature_names_out` method.
- Add the new binary columns to the dataframe.
- drop the original columns, `Shape` and `Margin`
- Show the first few rows of the dataframe.
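
As a quick cross-check on what the result should look like, `pandas.get_dummies` builds the same style of hybrid `column_category` names. This is a swapped-in technique, not the `OneHotEncoder` the task asks for, and the toy rows are made up for illustration:

```python
import pandas as pd

# Hypothetical frame with one numeric and one category column
toy = pd.DataFrame({"Age": [67, 43, 58],
                    "Shape": ["oval", "irregular", "oval"]})

# One binary column per category; the original Shape column is dropped
encoded = pd.get_dummies(toy, columns=["Shape"], dtype=float)

print(encoded.columns.tolist())
```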

<span style="color:red" float:right>[3 point]</span>

> Review what the dataframe currently looks like.

In [None]:
Mamm

> Start the OneHotEncoding process

In [None]:
from sklearn.preprocessing import OneHotEncoder

> Ok... package is in. Now to create the object in memory, encode the category columns (Shape, Margin), add those to the dataframe, drop the original columns, and show the transformation process. Fun!

In [None]:
onehot = OneHotEncoder(sparse_output=False)
encode_cols = Mamm.select_dtypes('category').copy()
onehot.fit(encode_cols)
print(f"The new column names are: {onehot.get_feature_names_out()}")

In [None]:
Mamm[onehot.get_feature_names_out()] = onehot.transform(encode_cols)

In [None]:
Mamm

In [None]:
Mamm = Mamm.drop(['Shape','Margin'], axis=1)

In [None]:
Mamm

## End of Data Preparation on the Mammographic Masses Dataset (Mamm)



## Feature Selection on the Indian  Liver Patient Dataset (ILPD)
Feature selection is the process of removing redundant features that could lead to overfitting, singular matrices, and other problems associated with high dimensionality (curse of dimensionality: https://en.wikipedia.org/wiki/Curse_of_dimensionality).

### Acquire Data

We will get our data from the University of California, Irvine Machine Learning Repository. Our dataset was used to determine if blood test data could be sufficient to identify liver disease in rural areas with few physicians.

In [None]:
# csv file:
url = "../data/Indian Liver Patient Dataset (ILPD).csv"
# Alternate data source:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/00225/Indian Liver Patient Dataset (ILPD).csv"
url = url.replace(" ", "%20")

# Download the data
ILPD = pd.read_csv(url, header=None)

# Replace the default integer column names (0, 1, 2, ...) with meaningful names
ILPD.columns = ["Age","Gender","DB","TB","Alkphos","Sgpt","Sgot","TPr","ALB","AGRatio","Selector"]

ILPD

### Data Preparation for ILPD
- All columns should be numeric and continuous
    - Remove binary columns (numeric and categorical) because their mutual information scores will be lower
    - Remove any categorical columns
- Remove or impute any missing values
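
One possible way to spot binary columns is to count distinct values with `nunique`. The sketch below uses a toy frame whose values are assumptions, not rows from ILPD:

```python
import pandas as pd

# Hypothetical frame: Gender and Selector each take exactly two values
toy = pd.DataFrame({"Age": [65, 62, 58, 72],
                    "Gender": ["Female", "Male", "Male", "Male"],
                    "TPr": [6.8, 7.5, 7.0, 6.8],
                    "Selector": [1, 1, 2, 1]})

# nunique() ignores NaN, so missing values do not count as a category
binary_cols = [col for col in toy.columns if toy[col].nunique() == 2]
continuous = toy.drop(columns=binary_cols)

print(binary_cols)
```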

<span style="color:red" float:right>[1 point]</span>

In [None]:
ILPD

> I'll loop over the entire dataframe and run a test against the quantity of unique values to drop the columns

In [None]:
# Collect the columns that have exactly two distinct values, then drop them together
binary_cols = [col for col in ILPD.columns if ILPD[col].nunique() == 2]
ILPD = ILPD.drop(columns=binary_cols)

In [None]:
ILPD

> It appears that only the AGRatio column contains NaNs, so I'll impute that column.

In [None]:
ILPD.isna().any()

In [None]:
# Impute values or remove rows with nulls
# Add Code here
for col in ILPD.columns:
    if pd.api.types.is_numeric_dtype(ILPD[col]) and ILPD[col].isna().any():
        ILPD.loc[pd.isna(ILPD[col]), col] = ILPD[col].median()


In [None]:
ILPD.isna().any()

In [None]:
ILPD.describe()

### Mutual Information
https://en.wikipedia.org/wiki/Mutual_information
Below is a wrapper for determining the mutual information between two continuous (numeric) variables

In [None]:
from sklearn.metrics import mutual_info_score
# x is the first input variable
# y is the second input variable
# bins is the number of discretized values that will be used for the two input variables
def calc_MI(x, y, bins=80):
    if (bins > 1):
        c_xy = np.histogram2d(x, y, bins)[0]
        mi = mutual_info_score(None, None, contingency = c_xy)
    else:
        mi = mutual_info_score(x, y)
    return mi
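
To make the binned calculation less of a black box, here is a NumPy-only sketch of mutual information computed from a table of joint counts, in nats (natural logarithm). The helper name `mi_from_counts` and the synthetic data are assumptions for illustration; it is meant to mirror the idea behind the wrapper, not to replace it:

```python
import numpy as np

def mi_from_counts(counts):
    """Mutual information (in nats) from a 2-D table of joint counts."""
    joint = counts / counts.sum()            # joint probabilities p(x, y)
    px = joint.sum(axis=1, keepdims=True)    # marginal p(x), column vector
    py = joint.sum(axis=0, keepdims=True)    # marginal p(y), row vector
    nonzero = joint > 0                      # empty cells contribute nothing
    ratio = joint[nonzero] / (px @ py)[nonzero]
    return float(np.sum(joint[nonzero] * np.log(ratio)))

# Synthetic data: one strongly dependent pair and one independent pair
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
noise = rng.normal(size=2000)

mi_dep = mi_from_counts(np.histogram2d(x, x + 0.1 * noise, bins=10)[0])
mi_ind = mi_from_counts(np.histogram2d(x, noise, bins=10)[0])
print(mi_dep > mi_ind)
```

The score depends on the number of bins, so scores are only comparable when every pair is binned the same way.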

### Create method to list  all column pairs together with their mutual information score
Write a method called `listMutualInformationScores`.  It uses the above method (`calc_MI`) in a loop to find the mutual information between all possible pairs of columns in the data.  The input to the function is the dataframe of continuous variables, specifically the prepared ILPD dataset.  

The method returns a list of lists.  Each inner list contains three items:  the x-column, the y-column, and the mutual information score. The list of lists contains every possible pair of columns in the data.  The result should have a form similar to the following, except that the outer list is much longer and contains all possible column pairs:  
`[['Alkphos', 'Sgot', 0.33],
['Sgot', 'AGRatio', 0.23],
['Age', 'Sgot', 0.35],
['Sgpt', 'AGRatio', 0.30],
['Sgot', 'ALB', 0.29],
['Sgot', 'TPr', 0.33]]`
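
One way to enumerate every pair exactly once is `itertools.combinations`. The short column list below is a hypothetical subset used only to keep the output small:

```python
from itertools import combinations

# Hypothetical subset of column names
columns = ["Age", "Sgot", "TPr", "ALB"]

# Each unordered pair appears once; a column is never paired with itself
pairs = [list(pair) for pair in combinations(columns, 2)]

print(len(pairs))
print(pairs[0])
```

For n columns this yields n(n-1)/2 pairs, so four columns give six pairs.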

<span style="color:red" float:right>[3 point]</span>

In [None]:
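# define the method listMutualInformationScores

def listMutualInformationScores(df):
    """Return [x, y, mi] for every unordered pair of distinct columns in df."""

    output = []
    for n, col in enumerate(df.columns):
        # Start after position n so each pair appears once and no column is paired with itself
        for c in df.columns[n + 1:]:
            # Use the df argument, not the global ILPD, so the method works on any dataframe
            output.append([col, c, calc_MI(df[col], df[c])])

    return output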

In [None]:
# Run the method listMutualInformationScores
listMutualInformationScores(ILPD)

### Present the mutual information results
- Package the output into a dataframe
- Sort the rows in descending order of mutual information
- Present the dataframe

The first column could be called x, the second column y, and the third column mi.  x and y are the pair of columns and mi is the pair's mutual information score.  The result should have a form similar to the following:

| x | y | mi |
| --- | --- | --- |
| Age | Sgot | 0.35 |
| Sgot | TPr | 0.33 |
| Alkphos | Sgot | 0.33 |
| Sgpt | AGRatio | 0.30 |
| Sgot | ALB | 0.29 |
| Sgot | AGRatio | 0.23 |
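
A minimal sketch of the packaging and sorting step, using three of the illustrative rows from the example listing (these scores are placeholders, not computed results):

```python
import pandas as pd

# Illustrative [x, y, mi] rows, deliberately unsorted
scores = [["Alkphos", "Sgot", 0.33],
          ["Sgot", "AGRatio", 0.23],
          ["Age", "Sgot", 0.35]]

# Name the columns, sort by descending score, and renumber the rows
table = (pd.DataFrame(scores, columns=["x", "y", "mi"])
           .sort_values("mi", ascending=False)
           .reset_index(drop=True))

print(table)
```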

<span style="color:red" float:right>[1 point]</span>

In [None]:
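# Present the results as a dataframe, sorted in descending order of mutual information
ILPD_mutual = pd.DataFrame(listMutualInformationScores(ILPD), columns=['x','y','mi'])
ILPD_mutual = ILPD_mutual.sort_values('mi', ascending=False).reset_index(drop=True)
ILPD_mutual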

### Discussion on Mutual Information in ILPD
Let's assume a threshold of 1 for the mutual information score.

Which columns would you eliminate? Why?  To answer these questions, you may need to read-up on feature selection with mutual information score.

<span style="color:red" float:right>[1 point]</span>

> From reading a little about Mutual Information (or Information Gain) at https://machinelearningmastery.com/information-gain-and-mutual-information/, I believe I can say the following:
>
> Setting the threshold at 1 means that any pair of columns/variables scoring above 1 is sufficiently dependent to use going forward, while columns that only appear in pairs scoring below 1 are too independent, and thus too unpredictable, to carry into the construction of a model.
>
> Based on this idea and the code below, that leaves ['Alkphos', 'Sgpt', 'Sgot'] as independent variables that are probably too unpredictable to be effectively modeled.

In [None]:
ILPD.columns

In [None]:
ILPD_mutual[ILPD_mutual['mi'] > 1]

In [None]:
feature_cols = []
for row in ILPD_mutual[ILPD_mutual['mi'] > 1].iterrows():
    if row[1]['x'] not in feature_cols:
        feature_cols.append(row[1]['x'])
    if row[1]['y'] not in feature_cols:
        feature_cols.append(row[1]['y'])
        
not_features = [col for col in ILPD.columns.to_list() if col not in feature_cols]


feature_cols, not_features