# Plant Detection using Traditional Model

## Introduction
This part of project aims to develop a model to identify plants and diagnose plant diseases using traditional machine learning algorithms, specifically Logistic Regression, SVM, Random Forest and XGBoost, instead of more complex neural networks and deep learning approaches. The dataset consists solely of images. The challenge is to find an effective method for image processing to extract meaningful features for accurate classification.

## Image Processing and Feature Extraction

### Methods Explored
Several image processing methods were experimented with to extract features from the images, including:
1. Color Histograms
2. Histogram of Oriented Gradients (HOG)
3. Local Binary Patterns (LBP)
4. Scale-Invariant Feature Transform (SIFT)
5. Speeded-Up Robust Features (SURF)
6. Oriented FAST and Rotated BRIEF (ORB)
7. Fourier Transform
8. Wavelet Transform

### Optimal Methods
After extensive experimentation, it was determined that the combination of HOG and color histogram, as well as using the pre-trained VGG16 model for feature extraction, were the most effective.

#### Color Histograms
- **Description**: Represents the distribution of colors in an image.
- **Strengths**: Simple and effective for color-based differentiation.
- **Weaknesses**: Does not capture shape or texture information, sensitive to illumination changes.
- **Suitability**: Effective for images with noticeable color changes.

#### Histogram of Oriented Gradients (HOG)
- **Description**: Captures gradient orientation information within an image.
- **Strengths**: Good for capturing edge and shape information, robust to small geometric transformations.
- **Weaknesses**: Sensitive to large variations in illumination and viewpoint.
- **Suitability**: Ideal for images with distinct edge patterns and textures.

#### VGG16
- **Description**: A pre-trained deep learning model used here solely for feature extraction.
- **Strengths**: Captures complex features by leveraging pre-trained deep neural network layers.
- **Suitability**: Efficient for extracting rich features from images, comparable to the combination of HOG and color histogram.

### Reasoning for Selected Methods
1. **HOG and Color Histogram Combination**: This approach leverages the strengths of both methods—HOG captures shape and edge information, while color histograms provide color distribution data. Together, they offer a comprehensive feature set that improves classification accuracy.
2. **VGG16 for Feature Extraction**: VGG16, a pre-trained convolutional neural network, efficiently extracts detailed features from images, benefiting from its deep architecture trained on a large dataset. It captures both high-level and low-level features, making it highly effective for our purposes.

## Model Selection and Training

### Algorithm Selection
Various algorithms were considered, but XGBoost (XGBClassifier) was chosen for its superior performance.

#### XGBClassifier
- **Description**: An optimized gradient boosting algorithm.
- **Strengths**: High performance, ability to handle missing data, regularization to avoid overfitting, supports parallel processing.
- **Suitability**: Excellent for structured/tabular data and well-suited for our extracted feature sets.

### Training and Results
Two models were trained using the extracted features:
1. **HOG and Color Histogram Features**: The model trained with XGBClassifier achieved an accuracy of 97%.
2. **VGG16 Extracted Features**: Similarly, the model trained with XGBClassifier also achieved an accuracy of 97%.

### Reasoning for XGBClassifier's Effectiveness
- **Accuracy**: Achieved the highest accuracy in our experiments.
- **Flexibility**: Can handle various types of features and scales well with the data.
- **Efficiency**: Benefits from fast training times and the ability to process large datasets efficiently.
- **Robustness**: Includes mechanisms to prevent overfitting, crucial for maintaining accuracy in diverse datasets.

## Conclusion
The combination of HOG and color histogram, as well as the use of VGG16 for feature extraction, proved to be the most effective methods for processing plant images in this project. The XGBClassifier algorithm was the best choice for training the model, resulting in a high accuracy rate of 97%. This approach demonstrates that traditional machine learning algorithms, when paired with effective feature extraction techniques, can achieve high performance in image classification tasks, such as plant detection and disease diagnosis.

In [None]:
import os
import cv2
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from skimage.feature import hog, local_binary_pattern

In [None]:

def extract_hog_features(image, resize=(256, 256)):
    image = cv2.resize(image, resize)
    gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Extract HOG features
    features, _ = hog(gray_image, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2), block_norm='L2', visualize=True)
    return features

data_dir = '/content/drive/MyDrive/raw_dataset/apple_recognition/'

labels = []
features = []

for label in os.listdir(data_dir):
    label_dir = os.path.join(data_dir, label)
    if not os.path.isdir(label_dir):
        continue
    for image_file in os.listdir(label_dir):
        image_path = os.path.join(label_dir, image_file)
        image = cv2.imread(image_path)
        if image is not None:
            hog_features = extract_hog_features(image)
            features.append(hog_features)
            labels.append(label)

# Convert to numpy arrays
features = np.array(features)
labels = np.array(labels)

In [None]:
np.unique(labels)

array(['Apple___Apple_scab', 'Apple___Black_rot',
       'Apple___Cedar_apple_rust', 'Apple___healthy'], dtype='<U24')

In [None]:
len(features)

1840

In [None]:

le = LabelEncoder()
labels_encoded = le.fit_transform(labels)

X_train, X_test, y_train, y_test = train_test_split(features, labels_encoded, test_size=0.2, random_state=42)

svm = SVC(kernel='linear')
svm.fit(X_train, y_train)

y_pred = svm.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=le.classes_))

print("Support Vectors:")
print(svm.support_vectors_)

Accuracy: 0.6766304347826086
Classification Report:
                          precision    recall  f1-score   support

      Apple___Apple_scab       0.57      0.57      0.57        99
       Apple___Black_rot       0.67      0.68      0.68        75
Apple___Cedar_apple_rust       0.75      0.64      0.69        95
         Apple___healthy       0.72      0.82      0.76        99

                accuracy                           0.68       368
               macro avg       0.68      0.68      0.68       368
            weighted avg       0.68      0.68      0.68       368

Support Vectors:
[[0.07622037 0.02259218 0.08195625 ... 0.00177103 0.00749179 0.01285752]
 [0.1460472  0.02968832 0.18962463 ... 0.05731626 0.20329637 0.12531182]
 [0.12690915 0.0362431  0.02562586 ... 0.0600003  0.04584171 0.10073551]
 ...
 [0.36779071 0.11662306 0.12202278 ... 0.10037863 0.074701   0.12797401]
 [0.21251074 0.073791   0.07540834 ... 0.07464273 0.12681983 0.05781883]
 [0.14821453 0.16689806 0.1267

In [None]:
# Combining HOG and Color Histogram to make more robust the process:

In [None]:
# Function to extract combined HOG and color histogram features
def extract_hog_color_hist_features(image, resize=(256, 256)):
    image = cv2.resize(image, resize)
    # Extract HOG features
    gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    hog_features = hog(gray_image, orientations=9, pixels_per_cell=(8, 8),
                       cells_per_block=(2, 2), block_norm='L2-Hys', visualize=False)

    # Extract color histogram features
    hist = cv2.calcHist([image], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
    hist = cv2.normalize(hist, hist).flatten()

    # Combine HOG and color histogram features
    combined_features = np.hstack((hog_features, hist))
    return combined_features


In [None]:
data_dir = '/content/drive/MyDrive/raw_dataset/apple_recognition/'

labels = []
features = []

for label in os.listdir(data_dir):
    label_dir = os.path.join(data_dir, label)
    if not os.path.isdir(label_dir):
        continue
    for image_file in os.listdir(label_dir):
        image_path = os.path.join(label_dir, image_file)
        image = cv2.imread(image_path)
        if image is not None:
            hog_features = extract_hog_color_hist_features(image)
            features.append(hog_features)
            labels.append(label)

# Convert to numpy arrays
features = np.array(features)
labels = np.array(labels)

In [None]:
np.unique(labels)

array(['Apple___Apple_scab', 'Apple___Black_rot',
       'Apple___Cedar_apple_rust', 'Apple___healthy'], dtype='<U24')

In [None]:
len(features)

1840

In [None]:
le = LabelEncoder()
labels_encoded = le.fit_transform(labels)

X_train, X_test, y_train, y_test = train_test_split(features, labels_encoded, test_size=0.2, random_state=42)

svm_combined = SVC(kernel='linear')
svm_combined.fit(X_train, y_train)

y_pred = svm_combined.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=le.classes_))


Accuracy: 0.779891304347826
Classification Report:
                          precision    recall  f1-score   support

      Apple___Apple_scab       0.75      0.58      0.65        99
       Apple___Black_rot       0.75      0.80      0.77        75
Apple___Cedar_apple_rust       0.89      0.80      0.84        95
         Apple___healthy       0.74      0.95      0.83        99

                accuracy                           0.78       368
               macro avg       0.78      0.78      0.78       368
            weighted avg       0.78      0.78      0.77       368



In [None]:
# Cobining the features of HOG and LBP

In [None]:
from skimage.feature import local_binary_pattern


# Parameters for LBP
lbp_radius = 1
lbp_n_points = 8 * lbp_radius


# Function to extract LBP features
def extract_lbp_features(image, radius=lbp_radius, n_points=lbp_n_points, resize=(256, 256)):
    image = cv2.resize(image, resize)):
    gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    lbp = local_binary_pattern(gray_image, n_points, radius, method='uniform')
    (hist, _) = np.histogram(lbp.ravel(), bins=np.arange(0, n_points + 3), range=(0, n_points + 2))
    hist = hist.astype("float")
    hist /= (hist.sum() + 1e-6)  # Normalize the histogram
    return hist

# Function to extract combined HOG and LBP features
def combined_features_hog_lbp(image):
    hog_features = extract_hog_features(image)
    lbp_features = extract_lbp_features(image)
    return np.hstack((hog_features, lbp_features))

In [None]:
# empty the list again for new features
labels = []
features = []

for label in os.listdir(data_dir):
    label_dir = os.path.join(data_dir, label)
    if not os.path.isdir(label_dir):
        continue
    for image_file in os.listdir(label_dir):
        image_path = os.path.join(label_dir, image_file)
        image = cv2.imread(image_path)
        if image is not None:
            combined_features = combined_features_hog_lbp(image)
            features.append(combined_features)
            labels.append(label)


features = np.array(features)
labels = np.array(labels)

In [None]:

le = LabelEncoder()
labels_encoded = le.fit_transform(labels)

X_train, X_test, y_train, y_test = train_test_split(features, labels_encoded, test_size=0.2, random_state=42)

svm = SVC(kernel='linear')
svm.fit(X_train, y_train)

y_pred = svm.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=le.classes_))

Accuracy: 0.6956521739130435
Classification Report:
                          precision    recall  f1-score   support

      Apple___Apple_scab       0.59      0.68      0.63        99
       Apple___Black_rot       0.73      0.68      0.70        75
Apple___Cedar_apple_rust       0.77      0.68      0.73        95
         Apple___healthy       0.72      0.74      0.73        99

                accuracy                           0.70       368
               macro avg       0.70      0.69      0.70       368
            weighted avg       0.70      0.70      0.70       368



In [None]:
# Function to extract and combine HOG, Color Histogram, and LBP features

In [None]:

def combined_features_hog_lbp_color_hg(image):
    hog_features = extract_hog_color_hist_features(image)
    lbp_features = extract_lbp_features(image)
    return np.hstack((hog_features, lbp_features))


In [None]:
labels = []
features = []

for label in os.listdir(data_dir):
    label_dir = os.path.join(data_dir, label)
    if not os.path.isdir(label_dir):
        continue
    for image_file in os.listdir(label_dir):
        image_path = os.path.join(label_dir, image_file)
        image = cv2.imread(image_path)
        if image is not None:
            combined_features = combined_features_hog_lbp_color_hg(image)
            features.append(combined_features)
            labels.append(label)

features = np.array(features)
labels = np.array(labels)

In [None]:
le = LabelEncoder()
labels_encoded = le.fit_transform(labels)

X_train, X_test, y_train, y_test = train_test_split(features, labels_encoded, test_size=0.2, random_state=42)

svm = SVC(kernel='linear')
svm.fit(X_train, y_train)

y_pred = svm.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=le.classes_))

Accuracy: 0.779891304347826
Classification Report:
                          precision    recall  f1-score   support

      Apple___Apple_scab       0.75      0.58      0.65        99
       Apple___Black_rot       0.75      0.80      0.77        75
Apple___Cedar_apple_rust       0.89      0.80      0.84        95
         Apple___healthy       0.74      0.95      0.83        99

                accuracy                           0.78       368
               macro avg       0.78      0.78      0.78       368
            weighted avg       0.78      0.78      0.77       368



In [None]:
# Conclusion:
# best method for tradional image processing: HOG + Color Histogram

In [None]:
# Since we have our best feature extractor now lets look for our best Algorithm

In [None]:
# we use GridSearchCV to find which algorithm is better:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

In [None]:
# get the data first:



data_dir = '/content/drive/MyDrive/raw_dataset/apple_recognition/'

labels = []
features = []

for label in os.listdir(data_dir):
    label_dir = os.path.join(data_dir, label)
    if not os.path.isdir(label_dir):
        continue
    for image_file in os.listdir(label_dir):
        image_path = os.path.join(label_dir, image_file)
        image = cv2.imread(image_path)
        if image is not None:
            hog_features = extract_hog_color_hist_features(image)
            features.append(hog_features)
            labels.append(label)

#
X = np.array(features)
target = np.array(labels)

In [None]:
model_params = {
    'svm': {
        'model': SVC(gamma='auto', probability=True),
        'params': {
            'svc__C': [1, 10, 100, 1000],
            'svc__kernel': ['rbf', 'linear']
        }
    },
    'random_forest': {
        'model': RandomForestClassifier(n_estimators=100),
        'params': {
            'randomforestclassifier__n_estimators': [1, 5, 10]
        }
    },
    'logistic_regression': {
        'model': LogisticRegression(solver='liblinear', multi_class='auto'),
        'params': {
            'logisticregression__C': [1, 5, 10]
        }
    }
}


In [None]:
le = LabelEncoder()
target_encoded = le.fit_transform(target)

X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.2, random_state=42)

In [None]:
# do not run this cell
scores = []
best_estimators = {}
for algo, mp in model_params.items():
    pipe = make_pipeline(StandardScaler(), mp['model'])
    clf =  GridSearchCV(pipe, mp['params'], cv=5, return_train_score=False)
    clf.fit(X_train, y_train)
    scores.append({
        'model': algo,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    best_estimators[algo] = clf.best_estimator_

df = pd.DataFrame(scores,columns=['model','best_score','best_params'])
df
# this part of training took around 4.5 hours

Unnamed: 0,model,best_score,best_params
0,svm,0.754761,"{'svc__C': 1, 'svc__kernel': 'linear'}"
1,random_forest,0.750677,{'randomforestclassifier__n_estimators': 10}
2,logistic_regression,0.712637,{'logisticregression__C': 10}


In [None]:
# SVM is the best alogrithm among the three.