## K-Nearest Neighbors on Stars Dataset

### Theory

Given a point $(x^i_1, x^i_2)$ whose label $y^i$ is unknown, the algorithm finds the $k$ closest labeled data points and predicts the label that occurs most often among them (the mode of that subset).
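In symbols, assuming Euclidean distance (the metric the `distance` method in this notebook uses) and writing $N_k(x)$ for the indices of the $k$ nearest labeled points (a notation introduced here only for illustration), the rule can be sketched as:

$$d(x, x^j) = \sqrt{\sum_{m} \left(x_m - x^j_m\right)^2}, \qquad \hat y = \operatorname{mode}\left\{\, y^j : j \in N_k(x) \,\right\}$$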

Advantages:

* Simple to understand and implement

* No optimization of parameters

    - There is no training phase, so the model never needs to be updated iteratively

* Works for both classification and regression

Disadvantages:

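* Slow at prediction time

    - Computationally intensive: every prediction computes the distance to every point in the dataset

* Sensitive to high-dimensional feature vectors

    - Can be combined with Principal Component Analysis (PCA) to reduce the number of dimensions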
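As a rough illustration of the PCA idea mentioned above, here is a minimal sketch that projects a feature matrix onto its first few principal components using NumPy's SVD. The helper name `pca_reduce` and the random stand-in data are assumptions made for this example; they are not part of the notebook.

```python
import numpy as np

def pca_reduce(features, n_components=2):
    """Project features onto their first n_components principal components."""
    X = np.asarray(features, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T

# Random stand-in data: 100 points with 5 features each
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_reduced = pca_reduce(X, n_components=2)
print(X_reduced.shape)  # (100, 2)
```

KNN would then measure distances between the rows of `X_reduced` instead of the original feature vectors.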




### Algorithm

1. Find the distance from the chosen point to every other point in the dataset.

1. Sort the distances in ascending order and store the $k$ smallest, together with their labels, in an array.

1. Take the array of closest points and find the mode of their labels $y$.

1. Return that mode as the predicted label $\hat y$ of the chosen point.
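The four steps above can be sketched compactly with NumPy. This is a minimal illustration under the assumption of Euclidean distance; the function name `knn_predict` and the toy data are hypothetical and separate from the class defined later in this notebook.

```python
import numpy as np
from collections import Counter

def knn_predict(point, features, labels, k=3):
    """Predict the label of `point` from its k nearest neighbors."""
    features = np.asarray(features, dtype=float)
    point = np.asarray(point, dtype=float)
    # Step 1: distance from the query point to every point in the dataset
    distances = np.sqrt(((features - point) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances, in ascending order
    nearest = np.argsort(distances, kind="stable")[:k]
    # Step 3: count the labels of those neighbors to find the mode
    votes = Counter(labels[i] for i in nearest)
    # Step 4: return the most common label as the prediction
    return votes.most_common(1)[0][0]

# Toy data: two small clusters labeled 0 and 1
toy_features = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
toy_labels = [0, 0, 0, 1, 1, 1]
print(knn_predict([0.5, 0.5], toy_features, toy_labels))  # 0
print(knn_predict([5.5, 5.5], toy_features, toy_labels))  # 1
```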




In [1]:
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt


In [5]:
class KNN_Classifier:
    def __init__(self, x_data, y_data):
        self.x_data = x_data
        self.y_data = y_data
    
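    def distance(self,p1,p2):
        # Euclidean distance between two points with the same number of features
        total = 0  # avoid shadowing the built-in sum
        for i in range(len(p1)):
            total += (p1[i] - p2[i])**2
        
        return total**.5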
    
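    def KNN(self, p,features, labels, k = 5):
        # Pair every distance with its label, then keep the k closest pairs
        dist_arr = []
        for i in range(len(features)):
            dist_arr.append((self.distance(p,features[i]),labels[i]))
        dist_arr.sort()  # sort once, after all distances are collected
        return dist_arr[0:k]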

    def prediction(self,p,features,labels, k = 5, return_str = True):
        neighbors = self.KNN(p,features, labels, k)
        names = ["Brown Dwarf", "Red Dwarf", "White Dwarf", "Main Sequence", "Supergiant", "Hypergiant"]

        # Count how many of the k neighbors carry each star type (labels 0 to 5)
        counts = [0] * len(names)
        for _, label in neighbors:
            counts[label] += 1

        # Most common label; ties go to the lowest label number
        best = counts.index(max(counts))
        return names[best] if return_str else best
    
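    def success_rate(self,k = 5):
        # Fraction of stored points whose predicted label matches the true label
        knnvec = []
        for i in range(len(self.y_data)):
            x = self.prediction(self.x_data[i],self.x_data,self.y_data, k, return_str= False)
            knnvec.append(x)
        return sum(np.asarray(knnvec) == self.y_data)/len(self.y_data)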

In [6]:
data = pd.read_csv("data/stars.csv")
data

| Unnamed: 0 | Temperature (K) | Luminosity(L/Lo) | Radius(R/Ro) | Absolute magnitude(Mv) | Star type | Star color | Spectral Class |
|---|---|---|---|---|---|---|---|
| 0 | 3068 | 0.002400 | 0.1700 | 16.12 | 0 | Red | M |
| 1 | 3042 | 0.000500 | 0.1542 | 16.60 | 0 | Red | M |
| 2 | 2600 | 0.000300 | 0.1020 | 18.70 | 0 | Red | M |
| 3 | 2800 | 0.000200 | 0.1600 | 16.65 | 0 | Red | M |
| 4 | 1939 | 0.000138 | 0.1030 | 20.06 | 0 | Red | M |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 235 | 38940 | 374830.000000 | 1356.0000 | -9.93 | 5 | Blue | O |
| 236 | 30839 | 834042.000000 | 1194.0000 | -10.63 | 5 | Blue | O |
| 237 | 8829 | 537493.000000 | 1423.0000 | -10.73 | 5 | White | A |
| 238 | 9235 | 404940.000000 | 1112.0000 | -11.23 | 5 | White | A |


In [7]:
# Map each spelling of a star color found in the dataset to an integer code
color_codes = {
    'Blue': 0, 'Blue ': 0,  # Blue
    'Blue white': 1, 'Blue white ': 1, 'Blue-White': 1, 'Blue-white': 1, 'Blue White': 1,  # Blue White
    'Orange': 2, 'Orange-Red': 2, 'Pale yellow orange': 2,  # Orange
    'White': 3, 'Whitish': 3, 'white': 3,  # White
    'White-Yellow': 4, 'Yellowish': 4, 'Yellowish White': 4, 'yellow-white': 4, 'yellowish': 4, 'Yellow': 4,  # Yellow
    'Red': 5,  # Red
}

x_data = np.asarray(data.iloc[:,[0,1,2,3,5]])
for i in range(len(data)):
    # Column 4 holds the star color; spellings not listed above are left unchanged
    x_data[i,4] = color_codes.get(x_data[i,4], x_data[i,4])

y_data = np.asarray(data['Star type'])
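The numeric columns in the table above sit on very different scales (luminosities reach the hundreds of thousands, while dwarf radii are below one), so an unscaled Euclidean distance is dominated by the largest columns. One common preprocessing step, which this notebook does not apply, is to standardize each column first. The sketch below assumes a hypothetical helper named `standardize` and uses four rows copied from the table for illustration.

```python
import numpy as np

def standardize(features):
    """Rescale each column to zero mean and unit standard deviation."""
    X = np.asarray(features, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (X - mean) / std

# Rows 0, 1, 235 and 236 from the table:
# temperature, luminosity, radius, absolute magnitude
sample = [
    [3068, 0.0024, 0.17, 16.12],
    [3042, 0.0005, 0.1542, 16.60],
    [38940, 374830.0, 1356.0, -9.93],
    [30839, 834042.0, 1194.0, -10.63],
]
scaled = standardize(sample)
print(scaled.round(2))
```

After this step every column contributes on a comparable scale to the distance. Whether it improves the success rate on this dataset would need to be checked by rerunning the notebook.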

In [14]:
knn = KNN_Classifier(x_data,y_data)
knn.prediction(x_data[87],x_data,y_data)

'White Dwarf'

In [13]:
knn.success_rate()

0.7875
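Note that `success_rate` scores each star against a dataset that still contains that star, so every query has a neighbor at distance zero carrying its own label. One hedged alternative is leave-one-out evaluation, where the query point is excluded from its own neighbor search. The sketch below is an assumption-laden illustration: the function name `loo_accuracy` and the toy data are hypothetical, and it reimplements the neighbor search with NumPy rather than reusing the class above.

```python
import numpy as np
from collections import Counter

def loo_accuracy(features, labels, k=5):
    """Leave-one-out accuracy of a Euclidean KNN classifier."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    correct = 0
    for i in range(len(X)):
        distances = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
        distances[i] = np.inf  # exclude the query point itself
        nearest = np.argsort(distances, kind="stable")[:k]
        predicted = Counter(y[nearest].tolist()).most_common(1)[0][0]
        correct += int(predicted == y[i])
    return correct / len(X)

# Toy data: two well-separated clusters
toy_X = [[0, 0], [0, 1], [1, 0], [1, 1],
         [9, 9], [9, 10], [10, 9], [10, 10]]
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]
print(loo_accuracy(toy_X, toy_y, k=3))  # 1.0
```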

Citations:

https://www.kaggle.com/deepu1109/star-dataset