# Programming Assignment-4

---

## **KNN Classifier**

**Objective:**

The objective of this assignment is to implement the KNN algorithm in Python to classify students into one of 21 final grades (0 to 20). You will use features such as the student's age and weekly study time. After completing this assignment, you should be familiar with the following:

* How to load the dataset in Python?

* How to standardize the dataset for better results?

* How to compute a similarity measure between two data samples? 

* How to compute K-similar neighbors for the given test data?

* How to implement the KNN algorithm to classify a student?

* How to evaluate the performance of a machine learning algorithm, i.e., your KNN algorithm?

* How does KNN perform with and without standardizing the dataset?

### **Student Performance Dataset:**

You can download the dataset from here: https://archive.ics.uci.edu/ml/datasets/Student+Performance


**Description:**

This dataset contains student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). We will be using the Mathematics portion for this assignment ("student-mat.csv").

There are a total of 33 features in this dataset. For the sake of simplicity, we will use only the following 13 numerical features, plus the target `G3`, for this assignment:

1. Age - student's age (numeric: from 15 to 22)
2. Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
3. Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
4. Traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
5. Studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
6. Failures - number of past class failures (numeric: n if 1<=n<3, else 4)
7. Famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
8. Freetime - free time after school (numeric: from 1 - very low to 5 - very high)
9. Goout - going out with friends (numeric: from 1 - very low to 5 - very high)
10. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
11. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
12. Health - current health status (numeric: from 1 - very bad to 5 - very good)
13. Absences - number of school absences (numeric: from 0 to 93)
14. G3 - final grade (numeric: from 0 to 20, output target)

It is recommended to go through the description of the dataset at the above link in order to understand how the data is organized. More specifically, the "student-mat.csv" file contains the actual dataset and the "student.txt" file contains the description of the dataset.

Note: Most of the code is already implemented for you. It is important to go through it and understand it before writing your own code sections. Q7 is not graded, but it is highly recommended that you go through it to understand the importance of feature scaling in machine learning.

**Deliverables:**

*   This Colab notebook with your Python code

Total Marks: 20 (3+3+3+3+5+3).

---


## **Q0. Download the dataset and upload it to Colab**


Before getting started, you need to follow the instructions below to download the dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip and upload it to the Colab environment. 

* Clicking the link above should make your browser download a zip file that contains the `student-mat.csv` file.

* Open the Colab file browser by pressing the small folder icon on the left of the Colab page.  

* You can then drag and drop the `student-mat.csv` file into the Colab environment.

![screenshot](https://raw.githubusercontent.com/acharkq/IT1244/main/figures/screenshot.png)

## **Q1. `load_student_data()`**
You need to implement this function to load the dataset from the `student-mat.csv` file and return `X` and `y` as numpy arrays. `X` holds the 13 attributes of each student and `y` holds the corresponding final grade, which can take integer values from 0 to 20. (3 marks)

*Hint:*

*1. Read carefully the dataset description at https://archive.ics.uci.edu/ml/datasets/Student+Performance*.

*2. You can use the csv reader to read the `student-mat.csv` file.*

*3. Note that the data is delimited with `;`*
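
The hints above can be illustrated on a tiny in-memory sample. This is only a sketch: the sample rows and the variable names (`sample`, `features`, `targets`) are made up for illustration, and `csv.DictReader` is one possible way to select columns by name, not necessarily the approach expected in the graded cell.

```python
import csv
import io

import numpy as np

# A tiny in-memory stand-in for a semicolon-delimited file (made-up values).
sample = io.StringIO(
    'name;age;absences;G3\n'
    '"a";18;6;6\n'
    '"b";17;4;10\n'
)

# delimiter=";" tells the reader that fields are separated by semicolons.
reader = csv.DictReader(sample, delimiter=";")
rows = list(reader)

# Pick columns by name and convert the strings to integers.
features = np.array([[int(r["age"]), int(r["absences"])] for r in rows])
targets = np.array([int(r["G3"]) for r in rows])

print(features.shape, targets.shape)  # (2, 2) (2,)
```

Selecting columns by header name rather than by position keeps the code readable when only a subset of the columns is needed.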

In [1]:
import csv
import numpy as np
import math

# display float numbers with 2 decimal places and suppress the use of
# scientific notation for small numbers
np.set_printoptions(precision=2, suppress=True)

# You can use X_COLUMN_NAMES and Y_COLUMN_NAME to extract the relevant information from the CSV files
X_COLUMN_NAMES = [
    "age",
    "Medu",
    "Fedu",
    "traveltime",
    "studytime",
    "failures",
    "famrel",
    "freetime",
    "goout",
    "Dalc",
    "Walc",
    "health",
    "absences",
]
Y_COLUMN_NAME = "G3"

# function to load the student dataset into X and y numpy arrays
def load_student_data(filename):
    """
  filename: string, the path of the student-mat.csv dataset
  RETURN
    X: numpy array: shape = [N, D]
    y: numpy array: shape = [N]
  """
    X, y = None, None

    ## start your code here
    with open(filename, newline="") as csvfile:
        rows = list(csv.reader(csvfile, delimiter=";"))

    # the first row is the header; locate the feature and target columns by name
    header = rows[0]
    indices = [header.index(name) for name in X_COLUMN_NAMES]
    class_index = header.index(Y_COLUMN_NAME)

    # build the arrays in one pass instead of growing them with np.append
    X = np.array([[int(row[j]) for j in indices] for row in rows[1:]])
    y = np.array([int(row[class_index]) for row in rows[1:]])
    ## end
    return X, y

When you run this code, you should get the expected output as shown below:


In [2]:
# driver program to test the load_student_data() function

filename = "/content/student-mat.csv"

X, y = load_student_data(filename)

print(X[1][10], X[100, 12], X[177][12])
print(y[1], y[100], y[177])
print(X.shape, y.shape)

1 14 4
6 5 6
(395, 13) (395,)


Expected output:
```
1 14 4
6 5 6
(395, 13) (395,)
```


## **Q2. `standardizeDataset()`**
You need to implement this function to standardize the input values, i.e., the `X` values. Each column of `X` is standardized separately: for each column, subtract the column's mean from every element and divide the result by the column's standard deviation.

This function takes in the numpy array `X` and returns the standardized numpy array `Xstd`. (3 marks)
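
As a rough illustration of column-wise standardization, the sketch below applies z = (x - mean) / std to a small made-up array. The names `X_toy`, `mu`, and `sigma` are hypothetical and are not part of the assignment code; the values stay as floats here.

```python
import numpy as np

# Toy data: 3 samples, 2 features on very different scales (made-up values).
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

mu = X_toy.mean(axis=0)    # per-column mean
sigma = X_toy.std(axis=0)  # per-column (population) standard deviation
X_toy_std = (X_toy - mu) / sigma

print(X_toy_std.mean(axis=0))  # approximately [0. 0.]
print(X_toy_std.std(axis=0))   # approximately [1. 1.]
```

Passing `axis=0` computes the statistics down each column, and broadcasting then applies them to every row.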

In [3]:
# function to standardize the dataset.


def standardizeDataset(X):
    """
  X: numpy array, shape = [N,D]
  RETURN
    Xstd: numpy array, shape = [N,D]
  """
    Xstd = np.zeros_like(X)
    ## start your code here
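    # z-score each column: subtract the column mean, divide by the column std
    Xstd = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
    # cast to int, which truncates toward zero and discards the fractional
    # part; the integer expected outputs in this notebook reflect this cast
    Xstd = Xstd.astype(int)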
    ## end
    return Xstd

When you run this code, you should get the expected output as shown below:

In [4]:
# driver code to test the standardizeDataset() function
Xstd = standardizeDataset(X)
print(Xstd.shape)
print(Xstd[10, 1], Xstd[1, 12], Xstd[177, 12])

(395, 13)
1 0 0


Expected output:
```
(395, 13)
1 0 0
```

## **Q3. `euclideanDist()`**
You need to implement this function to compute the Euclidean distance between two data samples, i.e., any two rows from the `Xstd` array. This function takes two numpy arrays and returns the distance as a float. (3 marks)
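
For reference, the Euclidean distance is the square root of the sum of squared coordinate differences. The sketch below checks this on two made-up vectors (`a` and `b` are illustrative names) and compares it with `np.linalg.norm`, which computes the same quantity.

```python
import numpy as np

a = np.array([0.0, 3.0, 1.0])
b = np.array([4.0, 0.0, 1.0])

# Explicit form: square root of the sum of squared differences.
d_manual = np.sqrt(np.sum((a - b) ** 2))

# Equivalent built-in: the 2-norm of the difference vector.
d_norm = np.linalg.norm(a - b)

print(d_manual, d_norm)  # 5.0 5.0
```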

In [5]:
# function to compute the Euclidean Distance between two samples in the dataset
def euclideanDist(x1, x2):
    """
  x1: numpy array, shape = [D]
  x2: numpy array, shape = [D]
  RETURN
    dist: float value
  """
    dist = 0
    ## start your code here
    dist = sum((x1 - x2) ** 2) ** 0.5
        
    ## end
    return dist

When you run this code, you should get the expected output as shown below:

In [6]:
# driver code to test the euclideanDist() function
indx = [1, 10, 20, 60, 80, 90, 110, 140, 160, 169]
for i in indx:
    print(euclideanDist(Xstd[1, :], Xstd[i, :]))

0.0
3.872983346207417
3.0
4.242640687119285
3.1622776601683795
2.23606797749979
3.3166247903554
4.47213595499958
3.4641016151377544
3.0


Expected output:
```
0.0
3.872983346207417
3.0
4.242640687119285
3.1622776601683795
2.23606797749979
3.3166247903554
4.47213595499958
3.4641016151377544
3.0
```

## **Q4. `kNearestNeighbors()`**
This function finds the `K` training samples that are most similar to the test sample `Xtest`. It takes the training data `X` and `y`, the test sample `Xtest`, and `K` as input, and returns `Xng` and `yng`, which contain the `K` nearest samples to `Xtest` and their corresponding label values. (3 marks)

*Hint: use np.argsort to find the indices of `Xtest`'s most similar data samples.*
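
To illustrate the hint, here is a minimal sketch of how `np.argsort` can pick out the smallest distances. The arrays `dists` and `labels` are made-up values, not taken from the dataset.

```python
import numpy as np

dists = np.array([2.5, 0.0, 3.1, 1.2, 0.7])
labels = np.array([10, 11, 12, 13, 14])

k = 3
order = np.argsort(dists)  # indices that would sort dists in ascending order
nearest = order[:k]        # indices of the k smallest distances

print(nearest)          # [1 4 3]
print(labels[nearest])  # [11 14 13]
```

Indexing both the samples and the labels with the same index array keeps each neighbor paired with its label.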



In [7]:
# function to get the K most similar neighbors and their classes
def kNearestNeighbors(X, y, Xtest, K):
    """
  X: numpy array, shape = [N, D]
  y: numpy array, shape = [N]
  Xtest: numpy array, shape = [D]
  K: int value
  RETURN
    Xng: numpy array, shape = [K, D]
    yng: numpy array, shape = [K]
  """
    Xng, yng = None, None
    ## start your code here
    # calculate the distance between Xtest and every sample in X
    dist = [euclideanDist(Xtest, X[i, :]) for i in range(len(X))]
    # sort once, then take the K nearest samples and their class values
    nearest = np.argsort(dist)[:K]
    Xng = X[nearest]
    yng = y[nearest]
    ## end
    return Xng, yng

When you run this code, you should get the expected output as shown below:

In [8]:
# driver code to test the kNearestNeighbors() function
K = 5
test = 100
Xtest = Xstd[test]
ytest = y[test]

Xng, yng = kNearestNeighbors(Xstd, y, Xtest, K)

# print the K neighbors X and y values
print(Xng)
print(yng)

[[ 0  1  1  0 -1  0  0  1  1  3  2  0  1]
 [ 0  1  1  0  0  0  0  0  1  3  2  1  1]
 [ 0  1  1  0  0  0  1  0  1  2  2  0  0]
 [ 0  0  0  0  0  0  0  0  0  3  2  0  0]
 [ 0  0  0  0  0  0  0  1  1  2  1  1  0]]
[ 5 11 13 13 12]


Expected output:

```
[[ 0  1  1  0 -1  0  0  1  1  3  2  0  1]
 [ 0  1  1  0  0  0  0  0  1  3  2  1  1]
 [ 0  1  1  0  0  0  1  0  1  2  2  0  0]
 [ 0  0  0  0  0  0  0  0  0  3  2  0  0]
 [ 0  0  0  0  0  0  0  1  1  2  1  1  0]]
[ 5 11 13 13 12]
```

## **Q5. `KNNClassifier()`**

In this question, you will use all the functions above to implement the KNN algorithm. This function takes in the training data `X` and `y`, the test sample `Xtest`, and the number of nearest neighbors `K`. It first finds the `K` most similar neighbors, and then returns the most frequent class value among them as the prediction. (5 marks)
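
The majority vote over the neighbors' labels could be sketched as follows. This is one possible approach using `collections.Counter`; the helper name `majority_vote` and the sample labels are hypothetical. It assumes the labels are ordered by increasing distance, so that a tie is resolved in favor of the label seen first, i.e., the closer neighbor.

```python
from collections import Counter

import numpy as np


def majority_vote(neighbor_labels):
    """Return the most frequent label; ties go to the label seen first."""
    counts = Counter(neighbor_labels.tolist())
    # most_common keeps first-encountered order among equal counts
    return counts.most_common(1)[0][0]


yng_toy = np.array([5, 11, 13, 13, 12])  # labels sorted by increasing distance
print(majority_vote(yng_toy))  # 13
```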


In [9]:
# function to implement the KNN classifier for a given test case, i.e., you will predict
# the grade of a student, given their 13 attributes.


def KNNClassifier(X, y, Xtest, K):
    """
  X: shape = [N, D]
  y: shape = [N]
  Xtest: shape = [D]
  K: int value
  RETURN
    output_class: the predicted final grade, an integer from 0 to 20
  """
    output_class = None
    ## start your code here
    Xng, yng = kNearestNeighbors(X, y, Xtest, K)
    value, counts = np.unique(yng, return_counts=True)
    freq = dict(zip(value, counts))
    # start from the nearest neighbor so that ties go to the closest sample
    output_class, max_count = yng[0], freq[yng[0]]
    for i in range(1, K):
      if freq[yng[i]] > max_count:
        output_class, max_count = yng[i], freq[yng[i]]
    ## end
    return output_class

When you run this code, you should get the output as shown below:

In [10]:
# load the original training data

X, y = load_student_data(filename)

# standardize the data
Xstd = standardizeDataset(X)

# We shall consider the last 10 data points from the dataset as our test data

# split the X and y from the test data
Xtest = Xstd[-10:, :]
ytest = y[-10:]

# compute final grade for the students in the test data using KNN
K = 3
predictions = np.empty(len(ytest))

for i in range(Xtest.shape[0]):
    output = KNNClassifier(Xstd, y, Xtest[i], K)
    predictions[i] = output

print("Predicted class for test data by KNN: ", predictions)
print("Actual class for test data from dataset: ", ytest)

Predicted class for test data by KNN:  [10.  6.  0.  8.  0.  9. 16.  7. 10.  9.]
Actual class for test data from dataset:  [10  6  0  8  0  9 16  7 10  9]


Expected output:
```
Predicted class for test data by KNN:  [10.  6.  0.  8.  0.  9. 16.  7. 10.  9.]
Actual class for test data from dataset:  [10  6  0  8  0  9 16  7 10  9]
```

## **Q6. `accuracy_percentage()`**
This function will compute the accuracy of your KNN algorithm. This will take the predicted class values (i.e., by your KNN algorithm in Q5) and the actual class values (i.e., from the test dataset in Q5) of the test data. This function should return the percentage of correctly predicted classes for the test data used in Q5. (3 marks)
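
As a small worked example of the accuracy computation, the sketch below compares two made-up arrays element by element. The names `actual` and `predicted` are illustrative, and the vectorized form is just one way to count matches.

```python
import numpy as np

actual = np.array([10, 6, 0, 8])
predicted = np.array([10.0, 6.0, 1.0, 8.0])

# Element-wise comparison gives booleans; their mean is the fraction correct.
accuracy = np.mean(actual == predicted) * 100
print(accuracy)  # 75.0
```

Three of the four predictions match, so the accuracy is 75%.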

In [11]:
# function to calculate the accuracy of prediction in percentage
def accuracy_percentage(actual_class, predicted_class):
    """
  actual_class: numpy array, shape = [N]
  predicted_class: numpy array, shape = [N]
  RETURN
    percent: float value
  """
    percent = 0
    ## start your code here
    correct = 0
    for i in range(len(actual_class)):
      if actual_class[i] == predicted_class[i]:
        correct += 1
    percent = correct*100/len(actual_class)
    ## end
    return percent

When you run this code, you should get the output as shown below:

In [12]:
print("Accuracy {}%".format(accuracy_percentage(ytest, predictions)))

Accuracy 100.0%


Expected output:
```
Accuracy 100.0%
```

## **Q7. KNN with and without standardizing the dataset**

In this section, you will see why numerical data is standardized in machine learning algorithms. You will run KNN on the given test data with and without standardization, and compare the performance. This section is not graded.
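
To see why scaling can matter for a distance-based method, consider the toy sketch below. The data values and the helper `nearest_other` are made up for illustration. On the raw data, the feature with the larger numeric range dominates the distance; after standardization, both features contribute on a comparable scale, and the nearest neighbor of the first row changes.

```python
import numpy as np

# Columns: [studytime (1-4), absences (0-93)]; the values are made up.
data = np.array([[1.0, 10.0],
                 [4.0, 12.0],
                 [1.0, 20.0],
                 [4.0, 40.0]])


def nearest_other(points, i):
    """Index of the row closest to row i, excluding row i itself."""
    d = np.sqrt(np.sum((points - points[i]) ** 2, axis=1))
    d[i] = np.inf  # never pick the query row itself
    return int(np.argmin(d))


data_std = (data - data.mean(axis=0)) / data.std(axis=0)

# Raw data: row 1 wins because its absences differ by only 2.
print(nearest_other(data, 0))      # 1
# Standardized data: row 2 wins because it has the same studytime.
print(nearest_other(data_std, 0))  # 2
```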

In [13]:
# load the original training data
X, y = load_student_data(filename)

# standardize the data
Xstd = standardizeDataset(X)

# randomly chosen data from the X and Xstd datasets
# X - dataset that is not standardized
# Xstd - standardized dataset
# In both cases, the class value y is unchanged
random_indx = np.asarray([9, 153, 91, 29, 20, 10, 138, 130, 1, 11, 25, 137, 120])
testX = X[random_indx]
testXstd = Xstd[random_indx]
testy = y[random_indx]


# compute final grade for the students in the test data using KNN
K = 3
predictedNoStd = np.empty(len(testy))
predictedStd = np.empty(len(testy))

# predict the classes for the test data with and without standardization
# predictedNoStd - has the classes predicted for test data without standardization
# predictedStd - has the classes predicted for test data with standardization

# call KNN with the non-standardized dataset X and test data testX. Record the
# predicted class in the predictedNoStd numpy array

# you need to write your code here
for i, test in enumerate(testX):
    predictedNoStd[i] = KNNClassifier(X, y, test, K)

# call KNN with the standardized dataset Xstd and test data testXstd. Record the
# predicted class in the predictedStd numpy array

# you need to write your code here
for i, test in enumerate(testXstd):
    predictedStd[i] = KNNClassifier(Xstd, y, test, K)

# print the predicted classes and the actual classes for the test data
print("Predicted class with standardization: ", predictedStd)
print("Predicted class without standardization: ", predictedNoStd)
print("Actual class for test data: ", testy)

# print the accuracy of KNN with and without standardizing dataset
# you need to write your code here
print(
    "Accuracy of KNN with standardization: ", accuracy_percentage(testy, predictedStd)
)
print(
    "Accuracy of KNN without standardization: ",
    accuracy_percentage(testy, predictedNoStd),
)

Predicted class with standardization:  [15.  0. 18. 11. 15.  9. 12.  0.  6. 12.  8.  0. 15.]
Predicted class without standardization:  [15.  0. 18. 11. 15.  9. 12.  0. 10. 12.  8.  0. 15.]
Actual class for test data:  [15  0 18 11 15  9 12  0  6 12  8  0 15]
Accuracy of KNN with standardization:  100.0
Accuracy of KNN without standardization:  92.3076923076923


Expected Output:

```
Predicted class with standardization:  [15.  0. 18. 11. 15.  9. 12.  0.  6. 12.  8.  0. 15.]
Predicted class without standardization:  [15.  0. 18. 11. 15.  9. 12.  0. 10. 12.  8.  0. 15.]
Actual class for test data:  [15  0 18 11 15  9 12  0  6 12  8  0 15]
Accuracy of KNN with standardization:  100.0
Accuracy of KNN without standardization:  92.3076923076923
```



---


# End of your assignment