# Training feature selection

## Background

We started by extracting as many EO measurements as could feasibly be calculated for the growing season of interest. However, some of these measurements may be correlated because they describe similar or related properties of the crops.
It is good practice to remove highly correlated features because they increase the complexity of a model without improving its prediction performance.

Another common issue with labelled training data is class imbalance, i.e. skewed class proportions. During model training, the algorithm optimizes over all training data, so imbalanced training data may lead to a biased model. For example, the model is more likely to misclassify a minority class as a majority class than the other way around.
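
As a minimal sketch of how imbalance can be quantified, the snippet below counts samples per class and reports each class's share and the ratio between the largest and smallest class. The label list is hypothetical and only serves as an illustration; it is not the training data used in this notebook.

```python
from collections import Counter

# Hypothetical labels for illustration only; not the notebook's data
labels = ["Others"] * 80 + ["Maize"] * 12 + ["Sesame"] * 6 + ["Soy"] * 2

counts = Counter(labels)
total = sum(counts.values())

# Share of each class in the training set
proportions = {name: n / total for name, n in counts.items()}

# Ratio between the largest and smallest class, a simple imbalance indicator
imbalance_ratio = max(counts.values()) / min(counts.values())

for name, share in proportions.items():
    print(f"{name}: {counts[name]} samples ({share:.0%})")
print(f"Imbalance ratio (largest / smallest): {imbalance_ratio:.1f}")
```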

In some cases, the distribution of the training data is designed to match the true distribution of the classes, and the prediction is expected to be biased against an infrequent class. In other cases, accuracy requirements differ between classes. It is then desirable to adjust the proportions of the training labels to include more points for classes where commission error is preferred over omission error.

## Description

This notebook demonstrates how to explore class imbalance and correlations between the features extracted in the [feature extraction notebook](3_Training_feature_extraction.ipynb), and how to adjust the data before using it to train a model.


## Getting started
To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell.

### Load packages

In [1]:
%matplotlib inline

import json
import os
from collections import Counter
from pprint import pprint

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from joblib import dump
from odc.io.cgroups import get_cpu_quota

## Load training data and label dictionary

We will load the training data saved from the [feature extraction notebook](3_Training_feature_extraction.ipynb), along with the mapping between crop labels and the numerical classes.

In [2]:
# Training data file from previous step
data_path = "Results/single_crops_merged_training_features_2021_all.csv"

# Dictionary with class labels from previous step
labels_path = "Results/class_labels.json"

### Inspect the label and feature columns

As shown below, we have over 100 features extracted.

In [None]:
# load the data
model_input = np.loadtxt(data_path)

# load the column_names
with open(data_path, "r") as file:
    header = file.readline()

# Remove comment symbol from header, then extract label and feature names
column_names = header.split()[1:]

label_col = column_names[0]
feature_cols = column_names[1:]

print(f"Label column:\n{label_col}\n")
print(f"Feature columns:\n{feature_cols}\n")
print('number of features: ',len(feature_cols))

# Extract relevant indices from training data
model_col_indices = [column_names.index(var_name) for var_name in column_names[1:]]

In [None]:
# load the data
model_input_df = pd.read_csv(data_path)
model_input_df.columns

In [None]:
# Read the class label dictionary
with open(labels_path, "r") as json_file:
    labels_dict = json.load(json_file)

In [None]:
labels_dict

{'Maize': 0, 'Others': 1, 'Sesame': 2, 'Soy': 3}

## Class rebalancing

We first inspect the number of samples per class and notice that the "Others" class has significantly more samples than the single-crop classes.

In [2]:
# Insert data into a Pandas DataFrame
model_input_df = pd.DataFrame(model_input, columns=column_names)
# Investigate value counts for each class
class_counts = model_input_df[label_col].value_counts()
class_indices = class_counts.index
labels_dict_inv = {value: key for key, value in labels_dict.items()}
class_legends = [labels_dict_inv[class_index] for class_index in class_indices]
plt.figure(figsize=(15, 5))
bars = plt.bar(class_legends, height=class_counts.to_numpy())
plt.bar_label(bars)
plt.gca().set_ylabel('Number of training samples')
plt.gca().set_xlabel('Crop type')
plt.gca().tick_params(axis='x', rotation=45)

We then reduce the number of samples for any class that has more than ten times as many samples as the smallest class.

> We do not make all sample sizes the same because we want to preserve the large number of Maize and Sesame samples to capture intra-class diversity.

In [7]:
# value_counts() sorts in descending order, so the last entry is the smallest class
min_count = class_counts.values[-1]
for i in range(len(class_counts)):
    if class_counts.values[i] > 10 * min_count:
        n_samples_dropped = class_counts.values[i] - 10 * min_count
        print('dropping {} of {} samples'.format(n_samples_dropped, labels_dict_inv[class_indices[i]]))
        model_input_df.drop(
            model_input_df[model_input_df[label_col] == class_indices[i]].sample(n=n_samples_dropped).index,
            axis=0, inplace=True)
    else:
        print('no balancing needed')

dropping 11147 of Others samples
no balancing needed
no balancing needed
no balancing needed
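
The random subset drawn in the cell above changes on every run because no seed is set. The sketch below shows one possible way to make the same undersampling repeatable. The helper `cap_class_counts` and its parameters `max_ratio` and `random_state` are hypothetical names introduced for illustration, and the toy DataFrame stands in for `model_input_df`.

```python
import pandas as pd


def cap_class_counts(df, label_col, max_ratio=10, random_state=0):
    """Randomly undersample classes larger than max_ratio x the smallest class."""
    counts = df[label_col].value_counts()
    cap = int(max_ratio * counts.min())
    parts = []
    for class_value, n in counts.items():
        rows = df[df[label_col] == class_value]
        if n > cap:
            # A fixed seed makes the random subset repeatable between runs
            rows = rows.sample(n=cap, random_state=random_state)
        parts.append(rows)
    # Restore the original row order after concatenating the per-class parts
    return pd.concat(parts).sort_index()


# Toy data: class 1 has 50 rows, class 0 has 5 rows, class 3 has 2 rows
toy = pd.DataFrame({"label": [1] * 50 + [0] * 5 + [3] * 2, "feature": range(57)})
balanced = cap_class_counts(toy, "label", max_ratio=10)
print(balanced["label"].value_counts())
```

Unlike the in-place `drop`, this variant returns a new DataFrame, so the unbalanced data stays available for comparison.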


In [8]:
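# split into features and labels
# NOTE: columns_to_use is not defined earlier in this notebook; as an assumption,
# all extracted feature columns are used here
columns_to_use = feature_cols
X = model_input_df.drop(label_col, axis=1)[columns_to_use].values
y = model_input_df[[label_col]].values.ravel()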

## Drop correlated features


### Create and visualize the correlation matrix

In [None]:
X_ = model_input_df.drop(label_col, axis=1)[columns_to_use]
correlation_matrix = X_.corr().abs()

In [5]:
# plot correlation matrix
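
The plotting code for this cell is not included here. One possible way to draw an absolute correlation matrix as a heatmap is sketched below, using synthetic features so that it runs on its own. The names `toy_features` and `toy_corr` are hypothetical; in the notebook, `correlation_matrix` would take the place of `toy_corr`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic features for illustration only; "b" is built to correlate with "a"
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy_features = pd.DataFrame(
    {"a": a, "b": a + 0.05 * rng.normal(size=200), "c": rng.normal(size=200)}
)
toy_corr = toy_features.corr().abs()

# Draw the matrix as an image, with one tick per feature on each axis
fig, ax = plt.subplots(figsize=(5, 4))
image = ax.imshow(toy_corr.to_numpy(), vmin=0, vmax=1, cmap="viridis")
ax.set_xticks(range(len(toy_corr.columns)))
ax.set_xticklabels(toy_corr.columns, rotation=90)
ax.set_yticks(range(len(toy_corr.columns)))
ax.set_yticklabels(toy_corr.columns)
fig.colorbar(image, ax=ax, label="Absolute Pearson correlation")
fig.tight_layout()
```

With more than 100 features the tick labels become crowded, so a larger `figsize` or omitting the labels may be needed.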

We will define a function that drops any feature whose absolute correlation with an earlier, retained feature is at or above a threshold, and set that threshold to 0.9.

In [6]:
removal_threshold = 0.9

In [10]:
# function for removing correlated variables
def DropCorrelatedFeatures(X_, removal_threshold=0.9):
    to_drop = set()  # set of features to drop
    correlation_matrix = X_.corr().abs()
    for i in range(len(correlation_matrix.columns)):
        for j in range(i):
            if (correlation_matrix.iloc[i, j] >= removal_threshold) and (correlation_matrix.columns[j] not in to_drop):
                colname = correlation_matrix.columns[i]
                to_drop.add(colname)
    to_drop = list(to_drop)
    X_dropped = X_.copy()
    X_dropped = X_dropped.drop(to_drop, axis=1)
    return X_dropped
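
Before dropping anything, it can help to see which feature pairs exceed the threshold. The sketch below lists those pairs from the upper triangle of the absolute correlation matrix. The helper `list_correlated_pairs` and the toy column names are hypothetical and only illustrate the idea on synthetic data.

```python
import numpy as np
import pandas as pd


def list_correlated_pairs(X, threshold=0.9):
    """Return (feature_1, feature_2, |r|) for every pair at or above threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is reported once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    pairs = pairs[pairs >= threshold].sort_values(ascending=False)
    return [(f1, f2, float(r)) for (f1, f2), r in pairs.items()]


# Synthetic data for illustration only
rng = np.random.default_rng(1)
base = rng.normal(size=300)
toy = pd.DataFrame(
    {
        "ndvi_q1": base,
        "ndvi_q2": base + 0.05 * rng.normal(size=300),
        "swir_q1": rng.normal(size=300),
    }
)
for f1, f2, r in list_correlated_pairs(toy, threshold=0.9):
    print(f"{f1} ~ {f2}: |r| = {r:.3f}")
```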

In [11]:
X_ = model_input_df.drop(label_col, axis=1)[columns_to_use]
X_dropped = DropCorrelatedFeatures(X_, removal_threshold=removal_threshold)
print("# of features to keep:", len(X_dropped.columns))
X_dropped.columns

Index(['blue_s2_Q4_2021', 'green_s2_Q4_2021', 'red_s2_Q4_2021',
       'nir_s2_Q4_2021', 'swir_1_s2_Q4_2021', 'NDVI_s2_Q4_2021',
       'MNDWI_s2_Q4_2021', 'blue_s2_Q1_2022', 'red_s2_Q1_2022',
       'nir_s2_Q1_2022', 'swir_1_s2_Q1_2022', 'red_edge_1_s2_Q1_2022',
       'NDVI_s2_Q1_2022', 'MNDWI_s2_Q1_2022', 'blue_s2_Q2_2022',
       'green_s2_Q2_2022', 'red_s2_Q2_2022', 'nir_s2_Q2_2022',
       'swir_1_s2_Q2_2022', 'NDVI_s2_Q2_2022', 'MNDWI_s2_Q2_2022',
       'blue_s2_Q3_2022', 'red_s2_Q3_2022', 'nir_s2_Q3_2022',
       'swir_1_s2_Q3_2022', 'red_edge_1_s2_Q3_2022', 'NDVI_s2_Q3_2022',
       'MNDWI_s2_Q3_2022', 'blue_s2_annual_2021', 'red_s2_annual_2021',
       'nir_s2_annual_2021', 'swir_1_s2_annual_2021',
       'red_edge_1_s2_annual_2021', 'smad_s2_annual_2021',
       'emad_s2_annual_2021', 'bcmad_s2_annual_2021', 'NDVI_s2_annual_2021',
       'MNDWI_s2_annual_2021', 'blue_s2_semiannual_2021_07',
       'nir_s2_semiannual_2021_07', 'smad_s2_semiannual_2021_07',
       'emad_s2_se

## Save the selected features

In [None]:
output_features = pd.concat([model_input_df[label_col], X_dropped], axis = 1)

output_features.to_csv("Results/single_crops_merged_training_features_2021_selected.csv")
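
By default, `to_csv` also writes the row index as an unnamed first column. The sketch below shows, on a toy DataFrame and a temporary file, how a later step could read such a file back so that this column becomes the index again instead of an extra feature. The file name and values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for output_features; values are illustrative only
toy_output = pd.DataFrame(
    {"label": [0.0, 1.0, 3.0], "NDVI_s2_Q4_2021": [0.61, 0.35, 0.72]},
    index=[0, 4, 7],  # non-contiguous index, as left behind by dropping rows
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "selected_features.csv")
    toy_output.to_csv(csv_path)
    # index_col=0 turns the saved row index back into the index
    reloaded = pd.read_csv(csv_path, index_col=0)

print(reloaded)
```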