<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/1_LowVarianceData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is the first in a series of notebooks on preparing data for machine learning.

In [None]:
# Clone the entire repo.
!git clone -s https://github.com/cagBRT/Data.git cloned-repo
%cd cloned-repo

The image below shows the basic categories of data issues that must be addressed before the data is used to train a machine learning model.

In [None]:
from IPython.display import Image
Image("dataflowChart.png" , width=640)

Messy data sets need to be cleaned before they are used to train models.<br>
This notebook explores methods for finding low-variance data columns. 

This notebook uses:<br>
>Oil spill dataset<br>
By Robert Holte.<br>
Kubat, M., Holte, R., & Matwin, S. (1998). Machine Learning for the Detection of Oil Spills in Satellite Radar Images. Machine Learning, 30, 195–215.<br>

- 41 minority (oil slick)<br>
- 896 majority (no oil slick)

**Import the libraries**

In [None]:
from urllib.request import urlopen
from numpy import loadtxt
from numpy import unique
from pandas import read_csv
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

**Get the data** 

In [None]:
# load the dataset
df = read_csv("oil-spill.csv", header=None)
print(df.shape)

Notice that the column names are integers (0 to 49), because the file has no header row. <br>
This can be a little confusing when deleting columns. 

In [None]:
df

In [None]:
df.describe()

Column 49 is the label column for spill/no spill

In [None]:
df.value_counts([49])

**Break off the labels from the features**

In [None]:
labels=df[49]
#The labels variable has the labels for the dataset
print(labels.head())
df_X=df.drop([49], axis=1)

**Data Cleaning**<br>
Step 1: Look for columns that have the same value for every row

In [None]:
# summarize the number of unique values in each column
col_values=df_X.nunique()
print(col_values)
#col_values holds the number of unique values in each column.
#There are 937 rows.
#Note that several columns have very few unique values.

Drop the columns that have only one value<br>
In this case, Column 22 has only one value. 

In [None]:
# record columns to delete
to_del=[]
for i in range(len(col_values)):
  if col_values[i]==1:
    to_del.append(i)
print("Column(s) with one value:",to_del)
# drop useless columns
df_X_good=df_X.drop(to_del, axis=1)

df_X_good is the dataset:<br>
 >without labels <br>
 with columns with only one value removed


In [None]:
df_X_good

Note that column 22 is gone, but the other columns keep their original integer labels as names. 

In [None]:
df_X_good.columns
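Because the labels are preserved, selecting by label and selecting by position can point at different columns after a drop. The sketch below illustrates this on a made-up three-column frame (`toy` and its values are hypothetical, not the oil spill data), and also shows a one-line alternative for removing single-value columns.

```python
import pandas as pd

# Toy frame with integer column labels, like read_csv(..., header=None) produces.
toy = pd.DataFrame([[1, 7, 3], [4, 7, 6], [8, 7, 9]])

# Keep only the columns with more than one unique value (column 1 is constant).
toy_good = toy.loc[:, toy.nunique() > 1]

print(list(toy_good.columns))        # remaining labels keep their original values
print(toy_good[2].tolist())          # label-based: the original column 2
print(toy_good.iloc[:, 1].tolist())  # position-based: the second remaining column
```

Here the label `2` and the position `1` refer to the same column, which is the source of the confusion mentioned above.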

**What about columns with very few unique values?**<br>
Method 1: look for columns where the ratio of unique values to rows is less than a set threshold.<br>
Method 2: use the VarianceThreshold Transform

**Method 1**<br>
Set a threshold for the ratio. <br>
In this case it is set at 0.05.<br>
Look at each column.<br>
Calculate the ratio: (number of unique values)/(number of rows)

In [None]:
#col_values has the number of unique values in each column
col_values=df_X_good.nunique()
#print(col_values)
n_rows=len(df_X_good)
threshold=0.05
print("A list of low ratio columns:\n")
#Column 22 is not in col_values because it was dropped above
for i, n_unique in col_values.items():
  #ratio of unique values to rows
  calc=n_unique/n_rows
  if calc <= threshold:
    print("unique values:%d column %d calc %.3f" %(n_unique,i, calc))
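The same ratio can be computed for all columns at once. The following is a minimal sketch on a made-up frame (the column names `a`, `b`, `c` and the 0.25 threshold are assumptions chosen for illustration only).

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],  # 2 unique values, almost constant
    "b": list(range(10)),                 # 10 unique values
    "c": [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],  # 2 unique values, evenly split
})

# Ratio of unique values to rows, one entry per column.
ratio = toy.nunique() / len(toy)

threshold = 0.25  # illustrative value
low_ratio = ratio[ratio <= threshold]
print(low_ratio)

# Population variance of each column, for comparison with the ratio.
print(toy.var(ddof=0))
```

Columns `a` and `c` have the same ratio, yet `c` has the larger variance. A low ratio often points to a categorical or ordinal column, so it is a prompt for a closer look and not proof that the column is uninformative.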

**Method 2**<br>
Finding low variance in columns

**Dropping Low Variance Columns**<br>

If a feature's variance is zero or close to zero, the feature is approximately constant and is unlikely to improve the performance of the model, so consider removing the column.<br>

The variance is also very low when only a handful of observations differ from a constant value.<br>
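As a small worked example, consider a column of zeros in which a few rows are set to one. The row count matches the oil spill data; the number of differing rows (5) is a made-up value for illustration.

```python
import numpy as np

n_rows = 937      # same number of rows as the oil spill data
n_different = 5   # hypothetical: five rows differ from the constant value

col = np.zeros(n_rows)
col[:n_different] = 1.0

# For a 0/1 column, the population variance equals p * (1 - p),
# where p is the fraction of rows that differ.
p = n_different / n_rows
print(col.var())
print(p * (1 - p))
```

Both numbers agree and are well below 0.01, so even a small variance threshold would flag this column.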

When a feature has been poorly evaluated, or carries little information because it is (almost) constant, removing the column can be justified.<br>

You may want to set an arbitrary variance threshold to determine which features are low variance and consider removing them. <br>

Use trial and error: compare the accuracy of the predictions with and without the feature to check that removing it is justified. 
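One possible way to run such a check is to put the transform and a model in a pipeline and compare cross-validated accuracy for different thresholds. This is only a sketch: the synthetic data, the logistic regression model, and the two thresholds are assumptions chosen for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (not the oil spill data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0)

# Append two near-constant columns: all zeros except for three rows.
rng = np.random.default_rng(0)
near_const = np.zeros((300, 2))
near_const[rng.choice(300, size=3, replace=False), :] = 1.0
X_demo = np.hstack([X_demo, near_const])

# Compare mean accuracy with (0.05) and without (0.0) the near-constant columns removed.
scores = {}
for t in (0.0, 0.05):
    model = make_pipeline(VarianceThreshold(threshold=t),
                          LogisticRegression(max_iter=1000))
    scores[t] = cross_val_score(model, X_demo, y_demo,
                                cv=5, scoring="accuracy").mean()
    print("threshold=%.2f mean accuracy=%.3f" % (t, scores[t]))
```

Placing the transform inside the pipeline means it is refitted on each training fold, so the comparison does not leak information from the held-out fold.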

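A suitable variance threshold depends on the distribution and scale of each feature. For example, a feature that takes only the values 0 and 1 has variance p(1 − p), where p is the fraction of ones, whereas the variance of a normally distributed feature is set by its spread and its units.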
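Variance is measured in squared units, so the same threshold can treat the same quantity differently depending on how it is scaled. The sketch below uses a made-up length feature expressed in metres and in millimetres (both names and values are hypothetical).

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
length_m = rng.normal(loc=1.0, scale=0.1, size=200)  # hypothetical feature in metres
length_mm = length_m * 1000.0                        # the same feature in millimetres
X_units = np.column_stack([length_m, length_mm])

selector = VarianceThreshold(threshold=0.5)
selector.fit(X_units)

print(selector.variances_)     # variance of each column
print(selector.get_support())  # True for columns that are kept
```

The two columns carry identical information, but only the millimetre version survives the threshold. It can therefore be worth checking the units or ranges of the features before choosing a single threshold for all of them.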

Below is a simple example of the VarianceThreshold transform. <br>
X_simple is a small dataset of 3 rows and 4 columns.<br>
The default threshold value is 0.<br>
When fit_transform is applied, the columns with only one value are removed. <br>
In this case these are columns 0 and 3. 

In [None]:
#Simple dataset to show VarianceThreshold
X_simple = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
selector = VarianceThreshold()
selector.fit_transform(X_simple)
print("Columns marked False have zero variance and are removed")
selector.get_support()
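When the data is in a DataFrame, the boolean mask from `get_support()` can be used to recover the names of the kept and dropped columns. A minimal sketch using the same values as X_simple, with hypothetical column names `a` to `d`:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame([[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]],
                   columns=["a", "b", "c", "d"])

selector = VarianceThreshold()
selector.fit(toy)

mask = selector.get_support()  # True for columns that are kept
kept = toy.columns[mask]
dropped = toy.columns[~mask]

# Select by name so the result is still a DataFrame with labelled columns.
toy_reduced = toy[kept]
print("kept:", list(kept))
print("dropped:", list(dropped))
```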

In [None]:
# load the dataset
df = read_csv('oil-spill.csv', header=None)
# split data into inputs and outputs
data = df.values
X = data[:, :-1]
y = data[:, -1]
print(X.shape, y.shape)

In [None]:
# define thresholds to check
thresholds = np.arange(0.0, 0.55, 0.05)

In [None]:
# apply transform with each threshold
results = list()
for t in thresholds:
  # define the transform
  transform = VarianceThreshold(threshold=t)
  # transform the input data
  #this will drop the low variance columns
  X_sel = transform.fit_transform(X)
  # determine the number of input features
  n_features = X_sel.shape[1]
  print('>Threshold=%.2f, Number of Features=%d' % (t, n_features))
  # store the result
  results.append(n_features) 

# 'transform' is the selector fitted in the last loop iteration (largest threshold)
print("\nColumns with low variance at threshold=%.2f" % thresholds[-1])
for i, keep in enumerate(transform.get_support()):
  if not keep:
    print("col #",i)


A line plot is then created showing the relationship between the threshold and the number of features in the transformed dataset.<br>

Even a small threshold, between 0.15 and 0.4, immediately removes a large number of features (14).

In [None]:
import matplotlib.pyplot as plt

# plot the threshold vs the number of selected features
plt.plot(thresholds, results)
plt.xlabel("thresholds")
plt.ylabel("number of features")
plt.show()
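The steps in the plot follow from how the transform works: a column is kept only if its variance is greater than the threshold, so the number of features drops each time the threshold passes a column's variance. The sketch below checks this on made-up data (six columns with increasing spread; the data and thresholds are assumptions for illustration).

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(1)
# Hypothetical data: six columns whose standard deviations increase left to right.
X_demo = rng.normal(size=(500, 6)) * np.array([0.1, 0.3, 0.5, 0.7, 1.0, 2.0])

variances = X_demo.var(axis=0)
print(np.round(variances, 3))

kept_counts = {}
for t in (0.05, 0.2, 0.4):
    n_kept = VarianceThreshold(threshold=t).fit_transform(X_demo).shape[1]
    kept_counts[t] = n_kept
    print("threshold=%.2f kept=%d, variances above threshold=%d"
          % (t, n_kept, (variances > t).sum()))
```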

**Assignment**<br>
1. Use the dataset called bank.csv<br>
2. Determine if there are any columns that have a single value<br>
3. Determine if there are any columns with low variance<br>
4. If there are columns with low variance, should the column be deleted?

<br>
Hint: read_csv("bank.csv", header='infer' , delimiter=';')

In [None]:
%cd /content/cloned-repo
!ls

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In [None]:
bankData=pd.read_csv("bank.csv", header='infer' , delimiter=';')

In [None]:
bankData.shape

In [None]:
percent_missing = bankData.isnull().sum() * 100 / len(bankData)
missing_values = pd.DataFrame({'percent_missing': percent_missing})
missing_values.sort_values(by ='percent_missing' , ascending=False)

In [None]:
bankData

Copy the labels (the 'y' column) from the data <br>
Then drop the label

In [None]:
labels =bankData['y']
bankData_X=bankData.drop(['y'], axis=1)

Convert the categorical data to one-hot encoding
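As a hint, `pandas.get_dummies` is one way to one-hot encode string columns. The sketch below uses a tiny made-up frame, not bank.csv; the column names and values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["married", "single", "married"],
})

# Replace the string column with one 0/1 column per category.
encoded = pd.get_dummies(toy, columns=["marital"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Each new 0/1 column has variance p(1 − p), where p is the fraction of rows in that category, so rare categories show up as low-variance columns.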

In [None]:
#Convert the categorical data to one-hot encoding

Now: Determine the columns that have low variance.<br>
The code given is for the oil spill data. <br>
Modify it for the bank data.

In [None]:
#Assignment

In [None]:
#@title 
#assign_col_values has the number of unique values in each column
assign_col_values=bankData_X.nunique()
#print(assign_col_values)
n_rows=len(bankData_X)
threshold=0.05
print("A list of low ratio columns:\n")
#the bank data has named columns, so iterate over (name, count) pairs
for name, n_unique in assign_col_values.items():
  #ratio of unique values to rows
  calc=n_unique/n_rows
  if calc <= threshold:
    print("unique values:%d col %s calc %.3f" %(n_unique,name, calc))