#### **Wheat Seed Classification**

In this assignment, you will use the [Wheat Seed Dataset](https://archive.ics.uci.edu/ml/datasets/seeds) to classify wheat seeds by type from measurements of the seed. The dataset contains 210 instances, each described by 7 attributes. The attributes are:

1. Area
2. Perimeter
3. Compactness
4. Length of Kernel
5. Width of Kernel
6. Asymmetry Coefficient
7. Length of Kernel Groove

Each instance belongs to one of 3 classes:

1. Kama
2. Rosa
3. Canadian

The text file `seeds_dataset.txt` contains the dataset. The first 7 columns are the attributes and the last column is the class label. The class labels are encoded as 1, 2, and 3 for Kama, Rosa, and Canadian, respectively. The goal of this assignment is to build a classifier that predicts the type of wheat seed from the measurements of the seed. Follow the instructions below to complete the assignment.

#### **Instructions**

1. Download the dataset from [GitHub](https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/mirsazzathossain/CSE317-Lab-Numerical-Methods/blob/main/datasets/seeds_dataset.txt). It should be saved as `seeds_dataset.txt`.
2. Upload the dataset to your Google Drive and mount your Google Drive to Colab.
3. Read the dataset using numpy's built-in function `np.genfromtxt()`. Pass the following parameters to the function:
    - `fname`: The path to the dataset
    - `delimiter`: The delimiter used in the dataset to separate the attributes (Hint: Use `'\t'` as the delimiter)
    
4. Shuffle the dataset using `np.random.shuffle()`. Pass the following parameters to the function:
    - `x`: The dataset
5. Split the dataset into features and labels. The first 7 columns of the dataset are the features and the last column is the label. Use numpy's array slicing to split the dataset into features and labels. (Hint: Use `:` to select all the rows and `0:7` to select the first 7 columns for features and `7` to select the last column for labels)
6. Split the dataset into training and testing sets. Use NumPy's built-in function `np.split()` to split both the features and the labels at the same row index. Pass the following parameters to the function:
    - `ary`: The array to split (the features or the labels)
    - `indices_or_sections`: A one-element list holding the row index at which to split, which equals the number of instances in the training set (Hint: Use `[int(0.8 * len(dataset))]`. A bare integer would instead ask for that many equal-sized sections.)
    - `axis`: The axis to split the dataset (Hint: Use `0` to split the dataset along the rows)
7. Find the minimum and maximum values of each feature in the training set. Use NumPy's built-in functions `np.min()` and `np.max()`. Pass the following parameters to each function:
    - `a`: The training set
    - `axis`: The axis to find the minimum and maximum values (Hint: Use `0` to find the minimum and maximum values along the columns)
8. In this step, you must normalize the training and test sets. Normalization is an important preprocessing step in many machine learning projects, especially for distance-based methods such as k-NN. It brings all the features to the same scale. If the features are not normalized, features with larger values will dominate features with smaller values.

    For example, suppose we have a dataset with two features, the number of bedrooms in a house and the size of its garden in square feet, and we are trying to predict the rent of the house. If the features are not normalized, the feature with larger values will take precedence over the feature with smaller values. Here, the garden area has the larger values, so the model will base its rent prediction mostly on the size of the garden. Such a model will be faulty, since most people will not pay higher rent for more garden area. We need to normalize the features to prevent this. Consider the following example:
    
    - House 1: 2 bedrooms, 2500 sq. ft. garden
    - House 2: 3 bedrooms, 500 sq. ft. garden
    - House 3: 7 bedrooms, 2300 sq. ft. garden

    Since most people won't pay more for a larger garden, the rent for House 1 should be closer to that of House 2 than to that of House 3. However, suppose we treat House 1 as the test instance and Houses 2 and 3 as training instances, and give the data to a k-NN model without normalization. The model computes the Euclidean distance between the test instance and each training instance and bases its prediction on the closest training instance.

    The Euclidean distance between the test instance and the training instances will be:

    - Distance between house 1 and house 2: $\sqrt{(2-3)^2 + (2500-500)^2} \approx 2000$
    - Distance between house 1 and house 3: $\sqrt{(2-7)^2 + (2500-2300)^2} \approx 200.06$

    As you can see, the distance between houses 1 and 3 is much shorter than that between houses 1 and 2, so the model will predict that house 1 rents for about the same as house 3. This is not the desired result: the garden area dominates the distance simply because its values are larger. To prevent this, we normalize the features: subtract the minimum value of each feature from all the values of that feature and divide the result by the range of the feature, where the range is the difference between the feature's maximum and minimum values. The formula for normalization is given below:

    $$x_{\text{normalized}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

    where $x$ is the feature vector. The above formula will normalize the features to a scale of 0 to 1.

    Let's normalize the features in the above example. To do so, we need to find the minimum and maximum values of each feature. The minimum and maximum values of the number of bedrooms are 2 and 7, respectively. The minimum and maximum values of the garden area are 500 and 2500, respectively. The normalized values of the features are given below:

    - House 1: $(2 - 2) / 5 = 0$ bedrooms, $(2500 - 500) / 2000 = 1$ garden
    - House 2: $(3 - 2) / 5 = 0.2$ bedrooms, $(500 - 500) / 2000 = 0$ garden
    - House 3: $(7 - 2) / 5 = 1$ bedrooms, $(2300 - 500) / 2000 = 0.9$ garden

    The normalized values are unitless. Now, the Euclidean distance between the test instance and the training instances will be:

    - Distance between house 1 and house 2: $\sqrt{(0-0.2)^2 + (1-0)^2} \approx 1.02$
    - Distance between house 1 and house 3: $\sqrt{(0-1)^2 + (1-0.9)^2} \approx 1.005$

    As you can see, the two distances are now nearly equal, whereas before normalization house 3 appeared ten times closer to house 1 than house 2 did. The garden area no longer decides the result on its own: the difference of 5 bedrooms between houses 1 and 3 now carries as much weight as the difference of 2000 sq. ft. of garden between houses 1 and 2. This is what normalization does. It puts all features on the same scale, which prevents features with larger values from dominating features with smaller values.

    Use the minimum and maximum values you found in the previous step to normalize the training and test sets.
9. Now, you have to build a classifier to classify the type of wheat seed based on the measurements of the seed. Use the K-Nearest Neighbors algorithm to build the classifier. Use the Euclidean distance to find the nearest neighbors.

10. Output the number of data points in the testing set and the number of correct predictions made by the classifier for each class.
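
The house example from step 8 can be checked numerically. The following is a minimal sketch and not part of the required solution; the array name `houses` and the helper `euclidean` are illustrative assumptions, and house 1 is treated as the test instance.

```python
import numpy as np

# Rows: house 1, house 2, house 3; columns: bedrooms, garden area (sq. ft.)
houses = np.array([[2.0, 2500.0],
                   [3.0, 500.0],
                   [7.0, 2300.0]])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

# Distances on the raw features
raw_12 = euclidean(houses[0], houses[1])
raw_13 = euclidean(houses[0], houses[2])

# Min-max normalization, column by column
col_min = houses.min(axis=0)
col_max = houses.max(axis=0)
scaled = (houses - col_min) / (col_max - col_min)

# Distances on the normalized features
norm_12 = euclidean(scaled[0], scaled[1])
norm_13 = euclidean(scaled[0], scaled[2])

print("raw:       ", round(raw_12, 2), round(raw_13, 2))
print("normalized:", round(norm_12, 3), round(norm_13, 3))
```

Printing both pairs of distances shows how much the garden column contributes before and after scaling.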

In [20]:
import os
import numpy as np
from io import StringIO
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

In [None]:
from google.colab import drive
drive.mount('/content/drive')
fname='/content/drive/MyDrive/py assignment3 dataset/seeds_dataset.txt'
dataSet=np.genfromtxt(fname,delimiter='\t')
print(dataSet)

Mounted at /content/drive
[[15.26   14.84    0.871  ...  2.221   5.22    1.    ]
 [14.88   14.57    0.8811 ...  1.018   4.956   1.    ]
 [14.29   14.09    0.905  ...  2.699   4.825   1.    ]
 ...
 [13.2    13.66    0.8883 ...  8.315   5.056   3.    ]
 [11.84   13.21    0.8521 ...  3.598   5.044   3.    ]
 [12.3    13.34    0.8684 ...  5.637   5.063   3.    ]]
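
To see how `np.genfromtxt()` treats a tab-separated file without needing Google Drive, the sketch below reads a small in-memory sample through `StringIO`. The sample values are made up for illustration and are not rows of `seeds_dataset.txt`.

```python
import numpy as np
from io import StringIO

# Hypothetical tab-separated sample: three feature columns, then a class label
sample = "15.0\t14.5\t0.87\t1\n12.5\t13.4\t0.86\t3\n18.9\t16.6\t0.85\t2\n"

data = np.genfromtxt(StringIO(sample), delimiter="\t")
print(data.shape)   # (rows, columns)

# An empty field (for example, two tabs in a row) is read as nan,
# so a quick check after loading can catch malformed rows
print(np.isnan(data).any())
```

Running the same `np.isnan(...).any()` check on the loaded dataset is a cheap way to confirm that every row was parsed into 8 numeric columns.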


In [None]:
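# Work on a copy so that dataSet keeps its original row order;
# np.random.shuffle reorders the rows in place
shuffleData=dataSet.copy()
np.random.shuffle(shuffleData)
print(shuffleData)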

[[15.88   14.9     0.8988 ...  0.7651  5.091   1.    ]
 [12.46   13.41    0.8706 ...  4.987   5.147   3.    ]
 [14.59   14.28    0.8993 ...  4.185   4.781   1.    ]
 ...
 [19.14   16.61    0.8722 ...  6.682   6.053   2.    ]
 [17.98   15.85    0.8993 ...  2.257   5.919   2.    ]
 [18.98   16.66    0.859  ...  3.691   6.498   2.    ]]
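
Because the shuffle is random, every run of the notebook produces a different train/test split. If repeatable results are wanted, one option is a seeded generator. The sketch below assumes a made-up 5-row array and an arbitrary seed of 42; it also shows that shuffling moves whole rows, so each label stays with its features.

```python
import numpy as np

# Hypothetical 5-row dataset: two feature columns, label in the last column
data = np.array([[1.0, 10.0, 1.0],
                 [2.0, 20.0, 1.0],
                 [3.0, 30.0, 2.0],
                 [4.0, 40.0, 2.0],
                 [5.0, 50.0, 3.0]])
original = data.copy()

rng = np.random.default_rng(seed=42)  # fixed seed: same order on every run
rng.shuffle(data)                     # shuffles rows in place (axis 0)

# Every row is still intact; only the row order changed
print(data)
```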


In [None]:
arrSet=np.array(shuffleData[:,0:7])
print(arrSet)
arrLabel=np.array(shuffleData[:,7])
print(arrLabel)

[[15.88   14.9     0.8988 ...  3.507   0.7651  5.091 ]
 [12.46   13.41    0.8706 ...  3.017   4.987   5.147 ]
 [14.59   14.28    0.8993 ...  3.333   4.185   4.781 ]
 ...
 [19.14   16.61    0.8722 ...  3.737   6.682   6.053 ]
 [17.98   15.85    0.8993 ...  3.687   2.257   5.919 ]
 [18.98   16.66    0.859  ...  3.67    3.691   6.498 ]]
[1. 3. 1. 2. 1. 2. 2. 1. 1. 3. 2. 1. 3. 2. 3. 3. 1. 1. 1. 2. 1. 3. 3. 1.
 2. 3. 1. 3. 2. 2. 1. 1. 3. 2. 3. 2. 2. 3. 2. 3. 3. 1. 3. 1. 1. 1. 2. 1.
 3. 3. 1. 3. 3. 3. 1. 2. 2. 3. 1. 3. 2. 3. 1. 1. 1. 3. 1. 1. 3. 2. 3. 1.
 2. 2. 1. 3. 2. 3. 1. 2. 1. 2. 3. 1. 2. 1. 3. 3. 3. 1. 1. 1. 3. 1. 1. 3.
 1. 2. 3. 3. 3. 1. 2. 3. 2. 2. 1. 2. 1. 3. 1. 1. 1. 1. 3. 3. 3. 2. 1. 2.
 2. 1. 3. 1. 2. 2. 3. 3. 3. 1. 2. 3. 2. 2. 2. 1. 3. 2. 3. 3. 2. 2. 1. 3.
 1. 1. 2. 1. 2. 2. 3. 2. 1. 3. 1. 3. 1. 2. 1. 3. 3. 2. 2. 2. 2. 2. 3. 3.
 1. 2. 2. 2. 3. 1. 2. 2. 2. 1. 3. 1. 2. 2. 2. 3. 2. 1. 3. 3. 2. 1. 1. 3.
 2. 3. 3. 3. 2. 1. 1. 3. 2. 2. 1. 3. 1. 2. 3. 2. 2. 2.]


In [None]:
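# Split the features at the 80% mark: first part for training, rest for testing
trainf,testf=np.split(arrSet,[int(0.8 * len(arrSet))],axis=0)
print("Train Feature: ",trainf)
print("Test Feature: ",testf)
# Per-feature (column-wise) minimum and maximum of the training set
minTrainFeature=np.min(trainf,0)
print("Min Train Feature: ",minTrainFeature)
# Per-feature minimum of the test set
minTestFeature=np.min(testf,0)
print("Min Test Feature: ",minTestFeature)
maxTrainFeature=np.max(trainf,0)
print("Max Train Feature: ",maxTrainFeature)
# Per-feature maximum of the test set
maxTestFeature=np.max(testf,0)
print("Max Test Feature: ",maxTestFeature)
# Split the labels at the same index so they stay aligned with the features
trainl,testl=np.split(arrLabel,[int(0.8 * len(arrLabel))],axis=0)
print("Train Label: ",trainl)
print("Test Label: ",testl)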
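
The role of the list in `np.split()` is easy to miss. The sketch below uses stand-in arrays with the dataset's dimensions (210 rows, 7 features); the names `features`, `labels`, and `split_index` are illustrative assumptions.

```python
import numpy as np

# Stand-in arrays with the dataset's dimensions: 210 rows, 7 features
features = np.arange(210 * 7, dtype=float).reshape(210, 7)
labels = np.arange(210) % 3 + 1

split_index = int(0.8 * len(features))   # 168

# A one-element list means "cut at this row index"
train_x, test_x = np.split(features, [split_index], axis=0)
train_y, test_y = np.split(labels, [split_index], axis=0)

print(train_x.shape, test_x.shape)   # (168, 7) (42, 7)

# A bare integer asks for that many equal sections instead;
# 210 rows cannot be divided into 168 equal parts, so this raises ValueError
try:
    np.split(features, split_index, axis=0)
except ValueError as err:
    print("ValueError:", err)
```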

In [None]:
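# Min-max normalize with the training-set minimum and maximum
featureRange=maxTrainFeature-minTrainFeature
trainNorm=(trainf-minTrainFeature)/featureRange
print("Train Norm: ",trainNorm)
# Reuse the training statistics for the test set so both sets share one scale
testNorm=(testf-minTrainFeature)/featureRange
print("Test Norm: ",testNorm)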
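
Step 8 asks for the test set to be scaled with the minimum and maximum found on the training set. The sketch below illustrates that with made-up numbers; the helper name `min_max_scale` is hypothetical, and it assumes no feature is constant in the training set (otherwise the range would be zero).

```python
import numpy as np

def min_max_scale(train, test):
    """Scale both sets with the training set's per-column min and max."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    return (train - col_min) / col_range, (test - col_min) / col_range

# Hypothetical values, two features
train = np.array([[1.0, 10.0],
                  [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

train_scaled, test_scaled = min_max_scale(train, test)
print(train_scaled)
print(test_scaled)   # second value exceeds 1: 40 lies above the training maximum
```

Test values outside the range 0 to 1 are expected with this approach. They simply mean the test instance lies outside the range seen during training.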

In [22]:
# k-NN with k=3; scikit-learn's default metric is Euclidean distance
classifier = KNeighborsClassifier(n_neighbors=3)
# Note: the classifier is fitted and scored on the raw features (trainf/testf),
# not on trainNorm/testNorm, so the output below reflects unnormalized data
classifier.fit(trainf, trainl)
predictions = classifier.predict(testf)
score = classifier.score(testf, testl)
print(f"Accuracy: {score}")
print("Confusion Matrix: ",confusion_matrix(testl, predictions))
print("Actual: ", testl)
print("Predicted: ", predictions)

# Print the actual and predicted type for the first 10 test instances
classNames = {1: "Kama", 2: "Rosa", 3: "Canadian"}
for i in range(10):
    print("Actual:", classNames[int(testl[i])])
    print("Predicted:", classNames[int(predictions[i])])

Accuracy: 0.8809523809523809
Confusion Matrix:  [[11  0  0]
 [ 3 16  0]
 [ 2  0 10]]
Actual:  [1. 2. 2. 2. 3. 1. 2. 2. 2. 1. 3. 1. 2. 2. 2. 3. 2. 1. 3. 3. 2. 1. 1. 3.
 2. 3. 3. 3. 2. 1. 1. 3. 2. 2. 1. 3. 1. 2. 3. 2. 2. 2.]
Predicted:  [1. 2. 2. 2. 3. 1. 2. 2. 1. 1. 3. 1. 2. 1. 2. 3. 2. 1. 3. 3. 2. 1. 1. 3.
 2. 1. 3. 3. 2. 1. 1. 3. 2. 2. 1. 1. 1. 1. 3. 2. 2. 2.]
Actual: Kama
Predicted: Kama
Actual: Rosa
Predicted: Rosa
Actual: Rosa
Predicted: Rosa
Actual: Rosa
Predicted: Rosa
Actual: Canadian
Predicted: Canadian
Actual: Kama
Predicted: Kama
Actual: Rosa
Predicted: Rosa
Actual: Rosa
Predicted: Rosa
Actual: Rosa
Predicted: Kama
Actual: Kama
Predicted: Kama
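
The cell above relies on scikit-learn's `KNeighborsClassifier`. To make the algorithm from step 9 concrete, here is a minimal NumPy sketch of k-NN with Euclidean distance and majority voting. The function name `knn_predict` and the toy data are assumptions for illustration, and its tie-breaking rule (smallest label wins) may differ from scikit-learn's.

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, k=3):
    """Predict a label for each row of test_x by majority vote of its k nearest training rows."""
    # Pairwise Euclidean distances, shape (n_test, n_train)
    diff = test_x[:, np.newaxis, :] - train_x[np.newaxis, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=2))

    # Indices of the k closest training rows for every test row
    nearest = np.argsort(dist, axis=1)[:, :k]

    predictions = []
    for neighbor_labels in train_y[nearest]:
        values, counts = np.unique(neighbor_labels, return_counts=True)
        predictions.append(values[np.argmax(counts)])  # ties go to the smallest label
    return np.array(predictions)

# Hypothetical, already-normalized toy data with two well-separated classes
train_x = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.2],
                    [1.0, 1.0], [0.9, 1.0], [1.0, 0.8]])
train_y = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 2.0])
test_x = np.array([[0.05, 0.05], [0.95, 0.9]])

print(knn_predict(train_x, train_y, test_x, k=3))
```

With the notebook's arrays, a call such as `knn_predict(trainNorm, trainl, testNorm)` would be the analogous usage, assuming those arrays have been computed as in the earlier cells.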


In [23]:
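def accuracy(predictions, labels):
    """Return the accuracy in percent and the number of correct predictions."""
    correct_predictions = np.sum(predictions == labels)
    # Use a distinct name so the local variable does not shadow the function
    percent = round((correct_predictions / len(labels))*100, 3)
    return percent, correct_predictions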

score, correct = accuracy(predictions, testl)
print("Accuracy: {0}%".format(score))
print("Number of data points: {0}".format(len(testl)))
print("Correct predictions: {0}".format(correct))

Accuracy: 88.095%
Number of data points: 42
Correct predictions: 37
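
Step 10 asks for the number of correct predictions for each class, which the total above does not show. One way to get it, sketched below, is to read the diagonal of the confusion matrix. The matrix is copied from the output printed earlier; the names `cm` and `class_names` are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: actual class, columns: predicted class; order Kama, Rosa, Canadian)
cm = np.array([[11, 0, 0],
               [3, 16, 0],
               [2, 0, 10]])
class_names = ["Kama", "Rosa", "Canadian"]

correct_per_class = np.diag(cm)   # correctly predicted instances per class
total_per_class = cm.sum(axis=1)  # test instances per actual class

for name, correct, total in zip(class_names, correct_per_class, total_per_class):
    print(f"{name}: {correct} of {total} correct")
```

The diagonal sums to 37, matching the total number of correct predictions reported above.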
