<a href="https://colab.research.google.com/github/Vibhu-Maurya/Machine-Learning-project/blob/main/KNN_on_image_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [10]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [11]:
import tarfile
import os

# Define the path to the tar.gz file in your Google Drive
tar_gz_path = '/content/drive/MyDrive/testSet.tar.gz'

# Define the directory where you want to extract the contents
extract_dir = '/content/extracted_testSet'

# Create the extraction directory if it doesn't exist
os.makedirs(extract_dir, exist_ok=True)

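# Open the tar.gz file
with tarfile.open(tar_gz_path, 'r:gz') as tar:
    # Extract all contents; the 'data' filter rejects unsafe members such as
    # absolute paths (needs a Python release that supports extraction filters)
    tar.extractall(path=extract_dir, filter='data')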

print(f"Successfully extracted {tar_gz_path} to {extract_dir}")


Successfully extracted /content/drive/MyDrive/testSet.tar.gz to /content/extracted_testSet
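
Before loading images, it can help to check how the archive is organised, because the loader below treats each top-level folder as a class. The following is a minimal sketch, not part of the original notebook: `top_level_counts` is a hypothetical helper, and the demo archive and its file names are synthetic stand-ins.

```python
import io
import os
import tarfile
import tempfile
from collections import Counter


def top_level_counts(tar_path):
    """Count regular files under each top-level folder of a tar archive."""
    counts = Counter()
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Drop a leading "./" so "./testSet/a.jpg" and "testSet/a.jpg" agree
            name = member.name[2:] if member.name.startswith("./") else member.name
            parts = name.split("/")
            # Files stored at the archive root are grouped under "."
            counts[parts[0] if len(parts) > 1 else "."] += 1
    return counts


# Demo on a tiny synthetic archive (file names are made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tar.gz")
    with tarfile.open(demo_path, "w:gz") as tar:
        for name in ("testSet/img_1.jpg", "testSet/img_2.jpg"):
            payload = b"not a real image"
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    demo_counts = top_level_counts(demo_path)

print(dict(demo_counts))  # {'testSet': 2}
```

A single key in the result means that a loader which uses folder names as labels will see only one class.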


# Task
Load and preprocess images from the "testSet.tar.gz" archive, split the data into training and testing sets, train a KNN model for image classification, evaluate the model, and make predictions.

## Load and preprocess images

### Subtask:
Load the images from the extracted directory, resize them to a consistent size, and convert them into numerical data (e.g., flatten pixel values).


**Reasoning**:
Load images from the extracted directory, resize them, flatten them, and store the data and labels in lists.



In [12]:
import os
import cv2
import numpy as np

# Define the path to the extracted images directory
extracted_dir = '/content/extracted_testSet'

# Initialize empty lists to store the image data and their corresponding labels
image_data = []
labels = []

# Define the target size for resizing images
target_size = (64, 64)

# Iterate through the subdirectories within the extracted directory
for class_name in os.listdir(extracted_dir):
    class_dir = os.path.join(extracted_dir, class_name)

    # Ensure it's a directory
    if os.path.isdir(class_dir):
        # Iterate through the image files within the subdirectory
        for image_name in os.listdir(class_dir):
            image_path = os.path.join(class_dir, image_name)

            # Ensure it's a file and a common image file extension
            if os.path.isfile(image_path) and image_name.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp')):
                # Read the image
                img = cv2.imread(image_path)

                # Check if the image was read successfully
                if img is not None:
                    # Resize the image
                    resized_img = cv2.resize(img, target_size)

                    # Flatten the resized image
                    flattened_img = resized_img.flatten()

                    # Append the flattened image data and label
                    image_data.append(flattened_img)
                    labels.append(class_name)

# Convert the image data and labels lists into NumPy arrays
image_data_np = np.array(image_data)
labels_np = np.array(labels)

print("Image loading and preprocessing complete.")
print(f"Shape of image data: {image_data_np.shape}")
print(f"Shape of labels: {labels_np.shape}")

Image loading and preprocessing complete.
Shape of image data: (28000, 12288)
Shape of labels: (28000,)
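
The shapes above confirm how many images were loaded, but not how many distinct classes the labels contain. A quick check along the following lines could catch a single-class dataset before training. This is a sketch under assumptions: `summarize_labels` is a hypothetical helper, and the demo labels are synthetic stand-ins for `labels_np`.

```python
import numpy as np


def summarize_labels(labels):
    """Return {class_name: count} and fail loudly if fewer than two classes exist."""
    classes, counts = np.unique(np.asarray(labels), return_counts=True)
    summary = {str(c): int(n) for c, n in zip(classes, counts)}
    if len(summary) < 2:
        raise ValueError(
            f"Only one class found: {summary}. "
            "A classifier cannot be evaluated meaningfully on a single class."
        )
    return summary


# Synthetic labels standing in for labels_np
print(summarize_labels(["cat", "dog", "cat"]))  # {'cat': 2, 'dog': 1}
```

In the notebook, the call would be `summarize_labels(labels_np)`.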


## Split data

### Subtask:
Split the preprocessed data into training and testing sets.


**Reasoning**:
Import the train_test_split function and split the data into training and testing sets, then print the shapes of the resulting arrays to verify the split.



In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(image_data_np, labels_np, test_size=0.2, random_state=42)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Shape of X_train: (22400, 12288)
Shape of X_test: (5600, 12288)
Shape of y_train: (22400,)
Shape of y_test: (5600,)
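
When the classes are imbalanced, a plain random split can leave the test set with a different class mix from the training set. Passing `stratify` to `train_test_split` keeps the proportions equal in both parts. The sketch below uses synthetic data (`X_demo`, `y_demo`) rather than the notebook's arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 80 samples of class "a", 20 of class "b"
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(100, 12), dtype=np.uint8)
y_demo = np.array(["a"] * 80 + ["b"] * 20)

# stratify=y_demo preserves the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

n_a_test = int((y_te == "a").sum())
n_b_test = int((y_te == "b").sum())
print(n_a_test, n_b_test)  # 16 4
```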


## Train KNN model

### Subtask:
Train a KNN model using the training data. You will need to choose an appropriate value for k (number of neighbors).


**Reasoning**:
Import the necessary class and train the KNN model with a chosen number of neighbors.



In [14]:
from sklearn.neighbors import KNeighborsClassifier

# Instantiate the KNeighborsClassifier with n_neighbors set to 5
knn_model = KNeighborsClassifier(n_neighbors=5)

# Train the model using the training data
knn_model.fit(X_train, y_train)

print("KNN model training complete.")

KNN model training complete.
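
To make the role of `n_neighbors` concrete, here is a from-scratch sketch of the KNN decision rule: find the `k` closest training points by Euclidean distance and take a majority vote. It illustrates the idea only and is not how scikit-learn implements it; `knn_predict` and the toy points are hypothetical.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k=5):
    """Predict a label for each query row by majority vote of its k nearest neighbours."""
    # Cast to float first: subtracting uint8 pixel arrays would wrap around
    X_train = np.asarray(X_train, dtype=np.float64)
    X_query = np.asarray(X_query, dtype=np.float64)
    y_train = np.asarray(y_train)
    predictions = []
    for query in X_query:
        distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
        nearest = np.argsort(distances, kind="stable")[:k]
        classes, votes = np.unique(y_train[nearest], return_counts=True)
        # On a tied vote, argmax picks the class that sorts first
        predictions.append(classes[np.argmax(votes)])
    return np.array(predictions)


# Two well-separated toy clusters
toy_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_y = ["a", "a", "a", "b", "b", "b"]
toy_pred = knn_predict(toy_X, toy_y, [[0.2, 0.2], [5.5, 5.5]], k=3)
print(toy_pred.tolist())  # ['a', 'b']
```

A small `k` follows individual training points closely, while a larger `k` smooths the decision at the cost of blurring class boundaries.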


## Evaluate model

### Subtask:
Evaluate the trained model on the testing set using appropriate metrics (e.g., accuracy, precision, recall).


**Reasoning**:
Import the necessary evaluation metrics and make predictions on the test set using the trained KNN model.



In [15]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Make predictions on the test data
y_pred = knn_model.predict(X_test)

**Reasoning**:
Calculate and print the evaluation metrics (accuracy, precision, recall, and F1-score) using the true and predicted labels.



In [16]:
# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

# Print the evaluation metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1-score: 1.0000
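
Scores are easier to interpret next to a trivial baseline that always predicts the most frequent training class (the same idea as scikit-learn's `DummyClassifier(strategy="most_frequent")`). The sketch below uses a hypothetical helper, `majority_baseline_accuracy`, on synthetic labels. If the baseline already reaches 1.0, a perfect model score carries no information.

```python
import numpy as np


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most frequent training label."""
    classes, counts = np.unique(np.asarray(y_train), return_counts=True)
    majority = classes[np.argmax(counts)]
    return float(np.mean(np.asarray(y_test) == majority))


# Single-class labels: the baseline is already perfect
single = majority_baseline_accuracy(["testSet"] * 8, ["testSet"] * 2)
print(single)  # 1.0

# Two classes: the baseline shows what a real model has to beat
mixed = majority_baseline_accuracy(["a"] * 7 + ["b"] * 3, ["a", "a", "b", "a"])
print(mixed)  # 0.75
```

In the notebook, the call would be `majority_baseline_accuracy(y_train, y_test)`.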


## Make predictions

### Subtask:
Use the trained model to make predictions on new, unseen images.


**Reasoning**:
Select a few images from the test set and use the trained KNN model to make predictions on these selected images.



In [17]:
# Select a few images from the test set for prediction
num_samples = 5
sample_indices = np.random.choice(len(X_test), num_samples, replace=False)
X_sample = X_test[sample_indices]
y_sample_true = y_test[sample_indices]

# Use the trained knn_model to predict the class labels for the selected images
y_sample_pred = knn_model.predict(X_sample)

# Compare the predicted labels with the actual labels
print("Comparing actual and predicted labels for selected samples:")
for i in range(num_samples):
    print(f"Sample {i+1}: Actual = {y_sample_true[i]}, Predicted = {y_sample_pred[i]}")

Comparing actual and predicted labels for selected samples:
Sample 1: Actual = testSet, Predicted = testSet
Sample 2: Actual = testSet, Predicted = testSet
Sample 3: Actual = testSet, Predicted = testSet
Sample 4: Actual = testSet, Predicted = testSet
Sample 5: Actual = testSet, Predicted = testSet
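
The cell above draws its samples with an unseeded `np.random.choice`, so each run picks different images. A seeded generator makes the spot check repeatable, and listing only the mismatches scales better than printing every pair. This is a sketch with hypothetical helper names (`draw_sample`, `mismatch_positions`) and synthetic data.

```python
import numpy as np


def draw_sample(X, y, n_samples=5, seed=0):
    """Pick n_samples distinct rows reproducibly; returns (indices, X rows, y rows)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_samples, replace=False)
    return idx, X[idx], y[idx]


def mismatch_positions(y_true, y_pred):
    """Positions where the prediction differs from the true label."""
    return np.flatnonzero(np.asarray(y_true) != np.asarray(y_pred)).tolist()


# Synthetic stand-ins for X_test and y_test
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array(["a", "b"] * 5)

idx_first, _, _ = draw_sample(X_demo, y_demo, n_samples=5, seed=0)
idx_again, _, _ = draw_sample(X_demo, y_demo, n_samples=5, seed=0)
print(np.array_equal(idx_first, idx_again))  # True

print(mismatch_positions(["a", "b", "a"], ["a", "a", "a"]))  # [1]
```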


## Summary:

### Data Analysis Key Findings

* Image data was successfully loaded, resized to 64x64 pixels, and flattened, resulting in an image data array of shape (28000, 12288) and a labels array of shape (28000,).
* The data was split into training and testing sets, with 22400 samples for training and 5600 for testing.
* A KNN model with `n_neighbors=5` was trained on the training data.
* The trained KNN model scored 1.0000 on every metric on the test set: accuracy, precision, recall, and F1-score.
* Predictions made on a small sample of the test set also perfectly matched the actual labels.

### Insights or Next Steps

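* The perfect scores are not evidence of a good model. Every label printed in the sample predictions is `testSet`, the name of the archive's top-level folder, which indicates that the loader treated that folder as the only class. With a single class, any classifier scores 1.0 on every metric. The next step is to inspect the extracted directory layout and obtain real class labels for the images.
* Once the data contains at least two classes, evaluate the model on an independent dataset or use cross-validation to get a more robust estimate of its performance.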
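
Cross-validation can be sketched as follows, here also comparing several values of `n_neighbors`. The data is synthetic (two Gaussian blobs), and the candidate values `(1, 3, 5, 7)` are arbitrary choices for illustration, not recommendations from the notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data: 60 points around 0 and 60 points around 3
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(60, 8)),
    rng.normal(3.0, 1.0, size=(60, 8)),
])
y_demo = np.array(["a"] * 60 + ["b"] * 60)

# Stratified folds keep the class ratio the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}
for k in (1, 3, 5, 7):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=cv)
    cv_results[k] = (float(scores.mean()), float(scores.std()))
    print(f"k={k}: mean accuracy {scores.mean():.4f} (std {scores.std():.4f})")
```

The spread across folds indicates how much a single train/test split could mislead.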
