<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/1_LowVarianceData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Messy data sets need to be cleaned up before they are used to train models.<br>
This notebook explores methods for finding duplicate and low variance data.
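
For the duplicate-data part, a minimal sketch of finding and removing duplicate rows with pandas is shown here. The toy frame and its values are made up for illustration and are not taken from the oil spill dataset.

```python
import pandas as pd

# Toy frame (hypothetical values): rows 0 and 2 are identical
toy = pd.DataFrame({"a": [1, 2, 1, 3], "b": [0.5, 0.7, 0.5, 0.9]})

# duplicated() marks every repeat of an earlier row as True
dup_mask = toy.duplicated()
print("Number of duplicate rows:", int(dup_mask.sum()))

# drop_duplicates() keeps the first occurrence of each row
toy_unique = toy.drop_duplicates()
print(toy_unique.shape)
```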

In [None]:
# Clone the entire repo.
!git clone -s https://github.com/cagBRT/Data.git cloned-repo
%cd cloned-repo

This notebook uses:<br>
>Oil spill dataset<br>
By Robert Holte.<br>
Kubat, M., Holte, R., & Matwin, S. (1998). Machine Learning for the Detection of Oil Spills in Satellite Radar Images. Machine Learning, 30, 195–215.<br>

- 41 minority (oil slick)<br>
- 896 majority (no oil slick)

**Import the libraries**

In [None]:
from urllib.request import urlopen
from numpy import loadtxt
from numpy import unique
from pandas import read_csv
import numpy as np
from sklearn.feature_selection import VarianceThreshold

**Get the data** 

In [None]:
# load the dataset
df = read_csv("oil-spill.csv", header=None)
print(df.shape)

Notice that the column names are integers (0 through 49), not descriptive strings, because the file has no header row.<br>
This can be a little confusing when deleting columns.

In [None]:
df

In [None]:
df.describe()

Column 49 is the label column for spill/no spill

In [None]:
df.value_counts([49])

**Break off the labels from the features**

In [None]:
labels=df[49]
#The labels variable has the labels for the dataset
print(labels.head())
df_X=df.drop([49], axis=1)

**Data Cleaning**<br>
Step 1: Look for columns that have the same value for every row

In [None]:
# summarize the number of unique values in each column
col_values=df_X.nunique()
print(col_values)
# The output lists the number of unique values in each column.
# There are 937 rows.
# Note that several columns have very few unique values.

Drop the columns that have only one value

In [None]:
# record the columns that hold a single value
to_del=[col for col, n_unique in col_values.items() if n_unique==1]
print("Column(s) with one value:",to_del)
# drop all of the useless columns in one call
# (reassigning df_X_good inside a loop would keep only the last drop)
df_X_good=df_X.drop(columns=to_del)

df_X_good is the dataset:<br>
 >without labels <br>
 with columns with only one value removed
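
The single-value check can also be written without an explicit loop. The sketch below uses a small made-up frame (the column contents are assumptions chosen for illustration) to show the idea in isolation.

```python
import pandas as pd

# Toy frame with integer column labels, like a CSV read with header=None
toy = pd.DataFrame({
    0: [1.0, 2.0, 3.0, 4.0],
    1: [0, 0, 0, 0],          # a single value in every row
    2: [5, 6, 5, 6],
})

# Number of unique values per column
n_unique = toy.nunique()

# Labels of the columns that hold exactly one value
constant_cols = n_unique[n_unique == 1].index.tolist()
print("Column(s) with one value:", constant_cols)

# Drop them all in one call
toy_good = toy.drop(columns=constant_cols)
print(list(toy_good.columns))
```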


In [None]:
df_X_good

Note that column 22 is gone, but the remaining columns keep their original labels.
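
Because the labels are kept, selecting by label and selecting by position no longer agree after a drop. A small sketch on made-up data (not the oil spill dataset) shows the difference.

```python
import pandas as pd

# Toy frame; the integer labels 0..3 are assigned automatically
toy = pd.DataFrame([[10, 20, 30, 40],
                    [11, 21, 31, 41]])

# Drop the column labeled 1
toy_good = toy.drop(columns=[1])
print(list(toy_good.columns))       # remaining labels

# Label-based access: the column whose label is 2
by_label = toy_good[2].tolist()

# Position-based access: the third column, which now carries label 3
by_position = toy_good.iloc[:, 2].tolist()

print(by_label, by_position)
print(1 in toy_good.columns)        # the dropped label no longer exists
```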

In [None]:
df_X_good.columns

**What about columns with very few unique values?**<br>
Method 1: look for columns where the ratio of unique values to rows is less than a set threshold.<br>
Method 2: use the VarianceThreshold Transform

**Method 1**<br>
Set a threshold for the ratio. <br>
In this case it is set at .055<br>
Look at each column<br>
Calculate the ratio: (number of unique values)/(number of rows). This is a rough proxy for low variance, not the variance itself.
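
As a small illustration of this ratio before it is applied to the oil spill data, the sketch below computes it on a made-up frame. The data and the threshold of 0.15 are assumptions chosen so the result is easy to check by hand.

```python
import pandas as pd

# Toy frame with 20 rows
toy = pd.DataFrame({
    0: list(range(20)),          # 20 unique values -> ratio 1.00
    1: [0, 1] * 10,              # 2 unique values  -> ratio 0.10
    2: [0, 1, 2, 3, 4] * 4,      # 5 unique values  -> ratio 0.25
})

# (number of unique values) / (number of rows), per column
ratio = toy.nunique() / len(toy)
print(ratio)

# Hypothetical threshold for this toy example
threshold = 0.15
low_ratio_cols = ratio[ratio <= threshold].index.tolist()
print("Low ratio columns:", low_ratio_cols)
```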

In [None]:
# col_values holds the number of unique values in each remaining column
col_values=df_X_good.nunique()
#print(col_values)
# use the actual row count (937 for this dataset) instead of a hard-coded number
n_rows=df_X_good.shape[0]
threshold=.055
print("A list of low ratio columns:\n")
# iterate by column label, so dropped columns (such as 22) are skipped automatically
for col, n_unique in col_values.items():
  calc=n_unique/n_rows
  if calc <= threshold:
    print("unique values:%d column %d calc %.3f" %(n_unique, col, calc))

**Method 2**<br>
Finding low variance in columns

**Dropping Low Variance Columns**<br>

If a feature's variance is zero or close to zero, the feature is approximately constant and is unlikely to improve the performance of the model, so you should consider removing the column.<br>

The variance is also very low when only a handful of observations differ from a constant value.<br>

A feature that has been poorly evaluated, or that brings little information because it is (almost) constant, can justify removing the column.<br>
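
To make this concrete, the sketch below builds a made-up array with a constant column, a nearly constant column, and a varied column, then compares their variances. The data and the 0.1 threshold are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: 10 rows, 3 columns
X_toy = np.column_stack([
    np.full(10, 7.0),               # constant column
    np.array([0.0] * 9 + [1.0]),    # only one observation differs
    np.arange(1.0, 11.0),           # values 1..10
])

# Population variance of each column
variances = X_toy.var(axis=0)
print(variances)

# Default threshold of 0.0: only the constant column is dropped
keep_nonzero = VarianceThreshold(threshold=0.0).fit(X_toy).get_support()

# Hypothetical threshold of 0.1: the nearly constant column is dropped too
keep_above = VarianceThreshold(threshold=0.1).fit(X_toy).get_support()

print(keep_nonzero)
print(keep_above)
```

`get_support()` returns a boolean mask over the input columns, which makes it possible to see which columns survive each threshold.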

You can set a variance threshold to decide which features count as low variance and are candidates for removal. The choice of threshold is a judgment call.<br>

Confirm each removal by trial and error: compare the accuracy of the predictions with and without the feature to check that removing it was justified.
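
One way to run such a check is to compare cross-validated accuracy with and without the low variance filter. The sketch below is only an illustration: the synthetic data from `make_classification`, the logistic regression model, and the 0.05 threshold are all assumptions, not part of this notebook's dataset or method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption, not the oil spill dataset)
X_base, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                                n_redundant=0, random_state=0)

# Append two nearly constant columns: 3 ones among 300 zeros each
near_const = np.zeros((300, 2))
near_const[:3, 0] = 1.0
near_const[-3:, 1] = 1.0
X = np.hstack([X_base, near_const])

# Hypothetical threshold, chosen only for this illustration
threshold = 0.05
n_kept = int(VarianceThreshold(threshold=threshold).fit(X).get_support().sum())
print("Features kept: %d of %d" % (n_kept, X.shape[1]))

# Baseline: model trained on every column
baseline = LogisticRegression(max_iter=1000)
# Candidate: same model, with low variance columns dropped inside the pipeline
filtered = Pipeline([
    ("select", VarianceThreshold(threshold=threshold)),
    ("model", LogisticRegression(max_iter=1000)),
])

acc_all = cross_val_score(baseline, X, y, cv=5, scoring="accuracy").mean()
acc_filtered = cross_val_score(filtered, X, y, cv=5, scoring="accuracy").mean()
print("Accuracy, all features:      %.3f" % acc_all)
print("Accuracy, filtered features: %.3f" % acc_filtered)
```

Placing the selector inside the pipeline means it is refit on each training fold, so the comparison does not leak information from the held-out fold.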

The appropriate variance calculation depends on the probability distribution of the feature. For example, if a feature is normally distributed, use the variance of the normal distribution.

In [None]:
# define the transform (the default threshold of 0.0 removes only zero-variance columns)
transform = VarianceThreshold()
# transform the input data
X_sel = transform.fit_transform(df_X_good)

In [None]:
from sklearn.feature_selection import VarianceThreshold
# reload the full dataset (features and labels)
df = read_csv('oil-spill.csv', header=None)
# split data into inputs and outputs
data = df.values
X = data[:, :-1]
y = data[:, -1]
print(X.shape, y.shape)
# define the transform
transform = VarianceThreshold()
# transform the input data
X_sel = transform.fit_transform(X)
print(X_sel.shape)

In [None]:
import numpy as np
# define thresholds to check
thresholds = np.arange(0.0, 0.55, 0.05)

In [None]:
# apply transform with each threshold
results = list()
for t in thresholds:
 # define the transform
 transform = VarianceThreshold(threshold=t)
 # transform the input data
 #this will drop the low variance columns
 X_sel = transform.fit_transform(X)
 # determine the number of input features
 n_features = X_sel.shape[1]
 print('>Threshold=%.2f, Features=%d' % (t, n_features))
 # store the result
 results.append(n_features) 

The line plot below shows the relationship between the threshold and the number of features in the transformed dataset.<br>

Even a small threshold, between 0.15 and 0.4, removes a large number of features (14).

In [None]:
import matplotlib.pyplot as plt

# plot the threshold vs the number of selected features
plt.plot(thresholds, results)
plt.xlabel("Variance threshold")
plt.ylabel("Number of features kept")
plt.show()

**Assignment**<br>
1. Use the dataset called bank.csv<br>
2. Determine if there are any columns that have a single value<br>
3. Determine if there are any columns with low variance<br>
4. If there are columns with low variance, should they be deleted?