In [None]:
# D604: Task 1 Neural Networks

A1.
Can a neural network model accurately classify RGB images of plant seedlings into their respective species to support automated weed detection and crop management in agricultural settings?

A2. 
One goal of the analysis is to build a neural network model that classifies the twelve seedling species with an accuracy of at least 85%. Since the dataset includes 4,750 images across twelve classes, the dataset’s size will support training for multi-class classification. Another goal of the analysis is to identify visual features that set crop seedlings apart from weeds to provide insights for decision-making. This is achievable since the RGB images contain detailed visual data that can be analyzed after training. A third goal is to minimize the misclassification of crops as weeds, which prevents the unnecessary removal of valuable plants. Tracking these false positives makes it possible to balance precision and recall, and the labeled data in labels.csv supports this approach.
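To make the third goal concrete, the sketch below computes per-class precision and recall from a confusion matrix. The 3x3 matrix is made up for illustration only and is not a result from this analysis; a real evaluation would use a 12x12 matrix built from the test-set predictions.

```python
import numpy as np

# Made-up 3-class confusion matrix (illustration only, not project results):
# rows = true class, columns = predicted class
cm = np.array([
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
])

true_positives = np.diag(cm)

# Precision: of everything predicted as class k, how much really was class k
precision = true_positives / cm.sum(axis=0)

# Recall: of everything that really was class k, how much was found
recall = true_positives / cm.sum(axis=1)

for k in range(cm.shape[0]):
    print(f"class {k}: precision={precision[k]:.2f}, recall={recall[k]:.2f}")
```

If "weed" is treated as the positive class, a crop seedling predicted as a weed is a false positive, so it lowers the precision of that weed class.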

A3.
I will use a Convolutional Neural Network (CNN) for this analysis. CNNs are well suited to multi-class image classification because they learn spatial features directly from pixel data. Here, the network will process the 4,750 RGB images and classify them into twelve categories.

A4.
CNNs use convolutional layers to find spatial patterns such as leaf edges, vein structures, and color variations. They can also use color information to distinguish plant species, which helps when differentiating a weed from a crop. CNNs are already widely used in agriculture for weed detection and crop monitoring. They also use pooling layers to reduce the spatial dimensions of the feature maps, which lowers the computational cost of training on a large dataset. Finally, a CNN pairs naturally with a softmax output layer, which produces one probability per class for the twelve-class problem. Methods that do not exploit spatial structure may struggle with high-dimensional image data.
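As a rough illustration of the pooling and softmax ideas above, the following NumPy sketch applies 2x2 max pooling to a stand-in image and converts twelve stand-in scores into probabilities. The helper names `max_pool_2x2` and `softmax` are hypothetical and the inputs are random placeholders; this is not the model used in the analysis.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Downsample an (H, W, C) array with non-overlapping 2x2 max pooling."""
    h, w, c = feature_map.shape
    # Trim odd edges so the array divides evenly into 2x2 blocks
    trimmed = feature_map[: h - h % 2, : w - w % 2, :]
    blocks = trimmed.reshape(h // 2, 2, w // 2, 2, c)
    return blocks.max(axis=(1, 3))

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
fake_image = rng.random((128, 128, 3))  # placeholder for one normalized RGB image
pooled = max_pool_2x2(fake_image)
print(pooled.shape)  # (64, 64, 3)

fake_logits = rng.normal(size=12)  # placeholder for the 12 raw class scores
probs = softmax(fake_logits)
print(probs.shape, probs.sum())
```

Each pooling step halves the height and width, so a 128x128 input shrinks to 64x64 while the channel count stays the same.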

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

# Load labels
labels = pd.read_csv("labels.csv")

# Print column names and first few rows to inspect
print("Column names:", labels.columns.tolist())
print("First 5 rows:\n", labels.head())

class_counts = labels.iloc[:, 0].value_counts()

In [None]:
# B1a: Visualization for class distribution
plt.figure(figsize=(10, 6))
class_counts.plot(kind='bar')
plt.title('Distribution of Seedling Species')
plt.xlabel('Species')
plt.ylabel('Number of Images')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('class_distribution.png')  # Save for screenshot
plt.show()

B1a.
The bar chart in Figure 1 shows the distribution of the twelve seedling species in the dataset. The plot uses the count of each unique value in the labels.csv file.

Figure 1: Distribution of Seedling Species

In [None]:
# B1b: Sample images with associated labels
images = np.load("images.npy")  # Shape: (4750, 128, 128, 3)

# Get unique classes and one sample per class
unique_classes = labels.iloc[:, 0].unique()
plt.figure(figsize=(15, 5))

for i, species in enumerate(unique_classes):
    idx = labels[labels.iloc[:, 0] == species].index[0]  # First occurrence
    plt.subplot(2, 6, i + 1)  # 2 rows, 6 cols for 12 classes
    plt.imshow(images[idx])
    plt.title(species, fontsize=10)
    plt.axis('off')

plt.tight_layout()
plt.savefig('sample_images.png')  # Save for screenshot
plt.show()

B1b.
The grid in Figure 2 shows one sample image per species with its corresponding label.

Figure 2: 2x6 Grid of 12 Seedling Species Images

In [None]:
# B2: Perform Data Augmentation
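datagen = ImageDataGenerator(
    rotation_range=30,            # random rotation of up to ±30 degrees
    horizontal_flip=True,         # random left-right mirroring
    brightness_range=[0.8, 1.2],  # brightness scaled by ±20%
    zoom_range=0.2,               # zoom in or out by up to 20%
    fill_mode='nearest'           # fill new pixels with the nearest edge value
)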

B2.
The steps taken to augment the images include random rotation of up to ±30°, random horizontal flips, brightness adjustment by ±20%, and zoom by ±20%. These steps increase the diversity of the data and mimic real-world variations in photography. This helps reduce overfitting, given that the dataset averages only about 396 images per class (4,750 / 12).
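To show what two of these transformations do to the pixel array, here is a simplified NumPy sketch of a random horizontal flip and a random brightness scale. The `augment` helper is hypothetical and is not how `ImageDataGenerator` is implemented internally; rotation and zoom are left out because they require interpolation.

```python
import numpy as np

def augment(image, rng):
    """Randomly flip and rescale the brightness of one (H, W, 3) image in [0, 1]."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]       # mirror the image left to right
    factor = rng.uniform(0.8, 1.2)  # brightness change of up to +/-20%
    # Clip so brightened pixels stay inside the valid [0, 1] range
    return np.clip(out * factor, 0.0, 1.0)

rng = np.random.default_rng(42)
sample = rng.random((128, 128, 3)).astype("float32")  # placeholder image
augmented = augment(sample, rng)
print(augmented.shape, augmented.min() >= 0.0, augmented.max() <= 1.0)
```

Calling `augment` repeatedly on the same image yields different variants, which is the source of the added diversity during training.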


In [None]:
# B3: Normalize the Images
images_normalized = images.astype('float32') / 255.0
print(images_normalized.min(), images_normalized.max())  # Should print 0.0, 1.0

# First stage of the B4 split: hold out 30% of the data (stratified by species)
X_train, X_temp, y_train, y_temp = train_test_split(
    images_normalized, labels.iloc[:, 0], test_size=0.3, stratify=labels.iloc[:, 0], random_state=42
)

B3. 
In this section, the pixel values are scaled from [0, 255] to [0, 1] to standardize the input for the CNN. Keeping inputs in a small, consistent range helps gradient-based optimization converge more reliably and follows a common input convention for image models.

In [None]:
# B4: Perform Training (70%), Validation (15%), and Test (15%) Split
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)
print(X_train.shape, X_val.shape, X_test.shape)  # (3325, 128, 128, 3), (712, 128, 128, 3), (713, 128, 128, 3)

B4. 
Keeping 70% of the data for training and placing 15% each in the validation and test sets ensures that sufficient data is available for both training and evaluation. Because both splits are stratified by species, each subset preserves the overall class proportions.
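The subset sizes can be checked with simple arithmetic. The sketch below assumes that `train_test_split` rounds the held-out count up, which is consistent with the shapes printed by the cell above.

```python
import math

n_total = 4750

# First split: 30% held out for validation and test
n_temp = math.ceil(0.3 * n_total)
n_train = n_total - n_temp

# Second split: half of the held-out set goes to the test set
n_test = math.ceil(0.5 * n_temp)
n_val = n_temp - n_test

print(n_train, n_val, n_test)  # 3325 712 713
for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {n / n_total:.2%}")
```

The held-out set has an odd number of images (1,425), which is why the validation and test sets differ by one image.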

In [None]:
# B5: Encode the target feature for all datasets
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)  # y_train from split
y_val_encoded = le.transform(y_val)
y_test_encoded = le.transform(y_test)

y_train_onehot = to_categorical(y_train_encoded, num_classes=12)
y_val_onehot = to_categorical(y_val_encoded, num_classes=12)
y_test_onehot = to_categorical(y_test_encoded, num_classes=12)
print(y_train_onehot.shape)  # (3325, 12)

B5. 
The labels are encoded as integers (0-11) and then one-hot encoded into 12-dimensional vectors for multi-class classification, matching the CNN's softmax output.
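The two encoding steps can be illustrated on a tiny example with plain NumPy. The species names below are placeholders, and `np.unique` with `np.eye` stands in for `LabelEncoder` and `to_categorical`; both approaches assign integer codes in sorted label order.

```python
import numpy as np

# Placeholder labels (not the real species names)
species = np.array(["species_c", "species_a", "species_c", "species_b"])

# Step 1: integer encoding; classes come back sorted alphabetically
classes, encoded = np.unique(species, return_inverse=True)
print(encoded)  # [2 0 2 1]

# Step 2: one-hot encoding by picking rows of an identity matrix
onehot = np.eye(len(classes), dtype="float32")[encoded]
print(onehot.shape)  # (4, 3)

# Decoding: argmax recovers the integer code, which indexes back into classes
decoded = classes[onehot.argmax(axis=1)]
print(np.array_equal(decoded, species))  # True
```

The same argmax step is what turns a softmax probability vector back into a predicted species name.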

In [None]:
# B6: Provide a copy of all datasets
np.save('task1_X_train.npy', X_train)
np.save('task1_X_val.npy', X_val)
np.save('task1_X_test.npy', X_test)
np.save('task1_y_train_onehot.npy', y_train_onehot)
np.save('task1_y_val_onehot.npy', y_val_onehot)
np.save('task1_y_test_onehot.npy', y_test_onehot)

B6. 
Copies of all datasets have been uploaded.
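A quick way to confirm that saved arrays survive a round trip is to reload them and compare shapes and values. The sketch below uses small placeholder arrays and hypothetical file names in a temporary directory, so it does not touch the real `task1_*.npy` files.

```python
import os
import tempfile

import numpy as np

# Small placeholder arrays with the same layout as the real datasets
X_demo = np.zeros((5, 128, 128, 3), dtype="float32")
y_demo = np.eye(12, dtype="float32")[:5]

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "demo_X.npy")         # hypothetical file name
    y_path = os.path.join(tmp, "demo_y_onehot.npy")  # hypothetical file name
    np.save(x_path, X_demo)
    np.save(y_path, y_demo)
    # Load fully into memory before the temporary directory is removed
    X_loaded = np.load(x_path)
    y_loaded = np.load(y_path)

print(X_loaded.shape, y_loaded.shape)  # (5, 128, 128, 3) (5, 12)
print(np.array_equal(X_demo, X_loaded), np.array_equal(y_demo, y_loaded))  # True True
```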