## Further feature engineering for Random Forest
Based on the exploration in section 3.6, Correlations between features, we decided to enrich the extracted features with color histograms and texture features.

In [None]:
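import numpy as np
from skimage.color import rgb2gray
from skimage.feature import local_binary_pattern
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Initialize the ResNet50 model with pretrained ImageNet weights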
feature_extractor = ResNet50(weights='imagenet', include_top=False, input_shape=(348, 348, 3), pooling='avg')

def extract_color_histograms(image, bins=32, hist_range=(0, 1)):
    histograms = [np.histogram(image[:, :, channel], bins=bins, range=hist_range)[0] for channel in range(3)]
    return np.concatenate(histograms)

def add_texture_features(image):
    # Define the settings for local binary patterns
    radius = 3
    n_points = 8 * radius
    method = 'uniform'
    texture = local_binary_pattern(rgb2gray(image), n_points, radius, method)
    # Flatten the LBP map to a vector with one value per pixel (121,104 for a 348x348 image)
    return texture.ravel()

def extract_features_and_labels(directory, image_size=(348, 348), batch_size=32, class_mode='categorical'):
    datagen = ImageDataGenerator(rescale=1.0/255)
    generator = datagen.flow_from_directory(
        directory,
        target_size=image_size,
        batch_size=batch_size,
        class_mode=class_mode,  # This should be 'categorical' if you are using np.argmax
        shuffle=False
    )

    all_features = []
    all_labels = []

    # Process each batch
    for batch_imgs, batch_labels in generator:
        # Extract features using ResNet50
        cnn_features = feature_extractor.predict(batch_imgs)

        # Extract color histograms and texture features
        histograms = np.array([extract_color_histograms(img) for img in batch_imgs])
        textures = np.array([add_texture_features(img) for img in batch_imgs])

        # Combine CNN features with histograms and textures
        combined_features = np.hstack([cnn_features, histograms, textures])

        all_features.append(combined_features)
        all_labels.append(batch_labels)

        # The generator loops indefinitely, so stop once every sample has been seen
        if len(all_features) * batch_size >= generator.samples:
            break

    return np.vstack(all_features), np.vstack(all_labels)  # Ensure labels are properly structured

# Set class_mode to 'categorical' for one-hot encoding
train_features_improved, train_labels_improved = extract_features_and_labels("GroceryStoreDataset-working/dataset/train", class_mode='categorical')
val_features_improved, val_labels_improved = extract_features_and_labels("GroceryStoreDataset-working/dataset/val", class_mode='categorical')

# Convert the one-hot encoded labels to integer class indices
train_labels_improved = np.argmax(train_labels_improved, axis=1)
val_labels_improved = np.argmax(val_labels_improved, axis=1)
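
As a quick sanity check on the combined feature vector built above, the sketch below works out how many columns each feature group contributes. It is only an illustration: `feature_layout` is a hypothetical helper, and the sizes are assumptions read off the settings in the cell above (348x348 input, 32 bins per channel, and the 2,048-wide output that ResNet50 produces with `include_top=False` and `pooling='avg'`).

```python
# Assumed sizes, taken from the settings used in the extraction cell.
IMAGE_SIDE = 348   # target_size=(348, 348)
HIST_BINS = 32     # bins per color channel
CNN_DIM = 2048     # ResNet50 output width with include_top=False, pooling='avg'

def feature_layout(image_side=IMAGE_SIDE, hist_bins=HIST_BINS, cnn_dim=CNN_DIM):
    """Return the number of columns each feature group contributes."""
    return {
        "cnn": cnn_dim,
        "color_histogram": 3 * hist_bins,          # one histogram per RGB channel
        "lbp_flattened": image_side * image_side,  # one LBP value per pixel
    }

layout = feature_layout()
total = sum(layout.values())
for name, width in layout.items():
    print(f"{name:>16}: {width:>7} columns ({width / total:.2%} of the vector)")
print(f"{'total':>16}: {total:>7} columns")
```

Under these assumptions the flattened texture map accounts for almost the entire vector, which is worth keeping in mind when reading the results below.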

In [None]:
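from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Initialize the RandomForest classifier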
rf_model_improved = RandomForestClassifier(n_estimators=150)

# Train the model once
rf_model_improved.fit(train_features_improved, train_labels_improved)

# Make predictions on the validation set
val_predictions_rf_improved = rf_model_improved.predict(val_features_improved)

# Get the classification report
classification_report_rf_improved = classification_report(val_labels_improved, val_predictions_rf_improved)

# Print the classification report
print(f"Classification report:\n{classification_report_rf_improved}")

Unfortunately, adding texture and color information did not noticeably improve the overall performance.

This might be due to several factors:
- Redundancy and Noise: The color histograms and texture features could be introducing redundant or irrelevant information that confuses the model instead of helping it generalize. Deep networks such as ResNet50 already capture both low-level cues (textures, colors) and high-level patterns in images. By adding hand-crafted features on top, we may be diluting the predictive power of the network's features with less informative or noisy data.

- Model Complexity and Overfitting: Adding more features increases the dimensionality of the input data. In this setup the flattened LBP map alone adds 348 × 348 = 121,104 values per image, compared with 2,048 ResNet50 features and 96 histogram bins. Such high dimensionality can lead to overfitting, especially when the number of training samples is too small to support it: the model learns details and noise in the training data that hurt its performance on new data.

- Feature Scaling and Distribution: Tree-based models like RandomForest are generally not sensitive to the scale of features, so differences in range between the deep features and the hand-crafted ones are unlikely to matter much on their own. The imbalance in the number of features per group may matter more: RandomForestClassifier considers only a random subset of features at each split (by default the square root of the feature count), so when one group vastly outnumbers the others, most split candidates come from that group and useful signals from the smaller groups can be overshadowed by less relevant information.
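
One common way to reduce the dimensionality discussed above is to summarize the LBP map as a histogram of codes rather than flattening it. The sketch below is a possible variant, not what was run in this notebook: `lbp_histogram` is a hypothetical helper, and the random array is only a stand-in for the output of `local_binary_pattern(..., method='uniform')`, whose codes range from 0 to `n_points + 1`.

```python
import numpy as np

def lbp_histogram(lbp_map, n_points):
    """Summarize a 'uniform' LBP code map as a normalized histogram.

    With method='uniform', codes fall in 0 .. n_points + 1, so the
    descriptor has n_points + 2 bins regardless of the image size.
    """
    n_bins = n_points + 2
    counts, _ = np.histogram(lbp_map, bins=n_bins, range=(0, n_bins))
    return counts / max(counts.sum(), 1)  # guard against an empty map

# Stand-in for a real LBP map (assumption: radius=3, so n_points=24).
rng = np.random.default_rng(0)
n_points = 24
fake_lbp_map = rng.integers(0, n_points + 2, size=(348, 348)).astype(float)

descriptor = lbp_histogram(fake_lbp_map, n_points)
print(descriptor.shape)  # (26,)
```

With this change the texture descriptor would shrink from one value per pixel to 26 values per image, which keeps it on a scale comparable to the color histograms. Whether it actually improves the classifier would need to be checked on the validation set.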